Proposals for amending "High Performance Fortran" as described in draft 1.0 of 27 September 1991.

Guy Steele, Thinking Machines Corporation, 17 January 1992

================ INTRODUCTION ================

Thinking Machines Corporation very much favors development of a commonly agreed upon version of Fortran suitable for execution on a wide variety of platforms, including workstations, mainframes, multicomputers, vector supercomputers, and massively parallel computers. We believe that the draft "High Performance Fortran" is very close to being a workable solution. We suggest modifications in four areas:

[1] The HPF draft states that it is fully Fortran 90 conformant. We believe this is a laudable long-term goal but could be an impediment to short-term adoption of HPF. We suggest a layered definition consisting of a subset, labelled HPF, that can be extended compatibly to become a larger language (Fortran 90 HPF) that encompasses all of Fortran 90.

[2] The crucial improvement of HPF over the Fortran 90 array features is the ability to control the arrangement of data arrays within a memory or processor set that may be distributed. The original Fortran D proposal is very general, but the implementation techniques required to implement its full generality are still a subject of research (which was one of the original goals of Fortran D). We believe that it is necessary either to restrict the generality of the Fortran D proposal or to redesign it to make it easier to implement. On the other hand, we believe that the ability to realign and redistribute arrays at run time is very important, and would prefer to see this Fortran D facility retained in HPF. However, we propose a restricted version that we think is more consistent with Fortran 90 and also easier to implement, with fewer surprises for the user.
[3] The HPF proposal provides good support for data parallel programming, in which a program is organized more or less around a single thread of control that may occasionally be split and then re-joined in fairly controlled ways. (For example, a WHERE statement or a vectorized DO loop containing an IF may be compiled in such a way that different processors may follow different conditional execution paths, but they are effectively re-synchronized at the end of the WHERE or before dropping out of the loop.) However, the HPF proposal provides little support for a style of programming in which one writes a routine with the intention that a copy is run on each processor, with interprocessor communication managed explicitly by the programmer rather than implicitly by the compiler as a consequence of array subscripting operations or array intrinsics such as CSHIFT. We propose an interface that allows these two styles of programming to be integrated smoothly. Although one would not bother to use the interface in a program designed to run only on uniprocessors, it is designed to provide portability for programs that do use the SPMD style while nevertheless also resulting in efficient operation on uniprocessors.

[4] We also propose a number of miscellaneous small changes and additions that have proven useful, in our experience, in the writing of parallel or potentially parallel programs.

Thinking Machines Corporation intends to implement all four of the above categories of changes, or something like them, for its Connection Machine products. To the extent that many institutions can agree on their precise form, we will all be better off. Thinking Machines regards only points [1] and [2] as crucial to the success of HPF; points [3] and [4] are desirable but not critical.
We would be happy to see these placed in a category described as, "You don't have to implement these to be HPF, but if you do choose to implement them, you ought to consider using this particular specification for the sake of portability."

================ OVERVIEW ================

We propose a number of changes to High Performance Fortran:

Limitations on what must be included from Fortran 90:

(0) The main issue is array support
    (a) Array features (assignment, expressions, intrinsics)
    (b) Allocatable arrays; assumed-shape and assumed-size arrays
    (c) WHERE statement
    (d) Fortran 90 pointers (in support of dynamic allocation)
    (e) The following Fortran 90 features are *not* needed in HPF:
        (i) modules
        (ii) arithmetic precision specification
        (iii) derived types (i.e., record structures)
        (iv) internal procedures
        (v) INTENT statement and attribute
    Some of these may be quite useful (we are considering the implementation of derived types, for example), but should not be required features of HPF.

Changes to Fortran D:

(1) Redesign the ALIGN directive:
    (a) eliminate dummy variables
    (b) eliminate partial array alignment
    (c) clearly restrict its semantics
    (d) make it easy to align many arrays the same way
    (e) provide a version (REALIGN) that is dynamically executable, with suitable restrictions
(2) Allow dynamic redistribution of decompositions, with suitable restrictions
(3) Introduce a new abstraction to represent rectilinear arrangements of the physical processors; decompositions are then mapped onto these rectilinear arrangements

Support for SPMD programming:

(4) Introduce the concept of "local subroutines", which when invoked from global code are executed one independent copy per physical processor.
(5) Introduce a LOCALLY statement block that is consistent with the behavior of local subroutines; this solves some of the same problems that the Fortran D FORALL statement was intended to handle.
Other small improvements:

(6) Parallel prefix intrinsics
(7) Miscellaneous stuff we have already discussed before
(8) What can be said about EQUIVALENCE? Certain special cases may be particularly important for array handling.

================ RATIONALE ================

(1) Redesign the ALIGN directive

Experience with the CMF$ALIGN directive as it exists in CM Fortran has illuminated a few problems with its design. The principal flaw is that it is much too difficult to declare a number of arrays to have similar alignment; we have received a *large* number of complaints from CM Fortran users about this. It is absolutely necessary that at least one form of the ALIGN directive allow the user to specify an alignment once and then list only the names of many arrays, separated by commas.

Another problem is that the use of dummy index variables constantly misleads users into thinking that the directive is much more general than it really is. It would appear that such directives as

      CMF$ALIGN A(I,J) WITH B(J,I)
      CMF$ALIGN A(I,J) WITH B(I,MOD(J-I,SIZE(B,2)))
      CMF$ALIGN A(I,J) WITH B(INT(LOG(REAL(I))),INT(SIN(REAL(J))))

might in principle be allowed, whereas in fact they are not. This misconception arises because the syntax appears to use a very general syntactic item, the expression, in a place where in fact only a limited number of alternatives are allowed.

Each of the three examples shown above presents a particular difficulty. Taking them in reverse order:

      CMF$ALIGN A(I,J) WITH B(INT(LOG(REAL(I))),INT(SIN(REAL(J))))

presents an extreme difficulty to the implementor, because the index expressions are difficult to invert. The expressions indicate the decomposition element to which each array element should be mapped, but it is extremely difficult to determine, given a decomposition element (or, by extension, a physical processor), which array elements fall within that element (or processor).
Even if this could be determined, it is difficult to construct an efficient memory layout within the processor for these array elements in the general case.

The directive

      CMF$ALIGN A(I,J) WITH B(I,MOD(J-I,SIZE(B,2)))

is not that difficult to handle, consisting as it does of expressions linear in the dummy variables. Nevertheless, it requires array descriptors of much greater generality than might otherwise be required, which is likely to impose some overheads on program execution.

Even the apparently innocuous

      CMF$ALIGN A(I,J) WITH B(J,I)

conceals a pitfall: it may result in violation of the widespread convention that the first subscript varies fastest. This has implications both for the implementor (perhaps subscripts can no longer be calculated by Horner's rule) and for the programmer (intuitions about locality within the memory of a single processor may no longer be reliable).

For High Performance Fortran we propose a redesign of the ALIGN directive that relies on an analogy to array notation rather than on an analogy to the CM Fortran FORALL statement. In general, where one might use an index variable, one instead uses triplet notation. Rather than saying

      CHPF$ ALIGN A(I,J) WITH D(I,J+3)

one would write

      CHPF$ ALIGN A(:,:) WITH D(:,4:N+3)

Rather than saying

      CHPF$ ALIGN A(:,I,:) WITH D(I,:)

we revive the "*" notation of Fortran D with a slightly different meaning, and write

      CHPF$ ALIGN A(*,:,*) WITH D(:,*)

Briefly put, axes described by colon notation are eligible for matching between left-hand and right-hand sides; asterisks indicate axes that are not matched, and so indicate collapsed axes on the left-hand side and replication on the right-hand side. Complete details are discussed in a later section. We believe that this notation is not incompatible with the explicit-index notation; we could choose to adopt just one of them, knowing that the other can be integrated later.
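To make the triplet semantics concrete, here is a small sketch (ours, purely illustrative; Python is used only as executable pseudocode, since alignment directives are not themselves executable) of the element mapping implied by CHPF$ ALIGN A(:,:) WITH D(:,4:N+3): the triplet 4:N+3 pairs positions 1..N of A's second axis with positions 4..N+3 of D's second axis, a constant offset of 3.

```python
# Mapping implied by  CHPF$ ALIGN A(:,:) WITH D(:,4:N+3)
# (illustrative sketch, not part of the proposed specification).

def align_target(i, j):
    """Decomposition element of D with which A(i, j) is aligned."""
    # First triplet is ":", an identity match; second is "4:N+3",
    # whose lower bound 4 shifts A's 1-origin index by 3.
    return (i, j + 3)

assert align_target(1, 1) == (1, 4)   # A(1,1) lives with D(1,4)
assert align_target(2, 5) == (2, 8)   # A(2,5) lives with D(2,8)
```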
Finally, users tell us that there is a real need for calculating alignments at run time, especially for allocated arrays, and that having to call a subroutine to do this is quite awkward. In earlier versions of this proposal we had suggested allowing an ALIGN directive to appear anywhere in a program unit. It turns out that, for technical reasons, it is desirable in Fortran to be able to distinguish executable statements from declarations. Therefore we propose to introduce a REALIGN directive that is exactly like the ALIGN directive except that REALIGN is like an executable statement whereas ALIGN is a declaration. The distinction is particularly important in the case of parameters: an ALIGN directive on a parameter is descriptive, and the compiler should *assume* (or check) that the stated alignment is a property of the parameter; whereas REALIGN on a parameter indicates that the parameter might not have the stated alignment, and should be rearranged as necessary to cause it to have that alignment.

After a great deal of internal debate within Thinking Machines, we have concluded that the Fortran D specification that realignments are confined to subprograms is the best; that is, if a subprogram realigns anything, the realignment is effectively undone on exit from the subprogram (there are various ways of implementing this, having differing space and time characteristics). Therefore a compiler can count on array alignments remaining stable across subroutine calls. Our standard counterexample--wanting to write a subroutine whose very purpose is to realign some arrays--can be handled in some cases by writing a subroutine that calculates the realignment parameters, and then putting one or more REALIGN directives after the subroutine call.

(2) Allow dynamic redistribution of decompositions

Just as there is a need to realign arrays on the fly, there is also a need for redistributing decompositions on the fly.
Users point out to us that it would not be unusual for a program to decide which of several algorithms to use based on the size of a data set, and that the different algorithms might require different distributions. Therefore we propose to have a declarative DISTRIBUTE directive and an executable REDISTRIBUTE directive. We also propose the restriction that a decomposition may not be redistributed unless no arrays are currently aligned to that decomposition. (This can easily be enforced, if desired, by a simple run-time check based on reference counts.) The reason for this is that, after extensive conversations with users, we have found that each user easily grasps a model of alignment and distribution after a few minutes of explanation--but each user jumps to a slightly different model, and then becomes confused (in some cases indignant) when told that the model that is "obvious" to him or her is not quite the one in Fortran D. In particular, users divide just about evenly on the question of whether redistributing a decomposition should (a) affect just the decomposition, leaving all arrays as they already are, or (b) drag all arrays already aligned to that decomposition along with it. Our restriction sidesteps the question by forcing the user to do things in a particular order. In particular, you have to distribute a decomposition before aligning arrays to it. You can redistribute a decomposition only by first detaching all arrays currently aligned to it. (You can detach an array either by using REALIGN to attach it to some other decomposition, or by deallocating the array with a DEALLOCATE statement if that is applicable.) 
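The proposed run-time check can be sketched in a few lines (ours, purely illustrative; Python serves as executable pseudocode, and all names are invented): a decomposition carries a reference count of arrays currently aligned to it, REALIGN moves a count from the old decomposition to the new, DEALLOCATE decrements it, and REDISTRIBUTE is refused while the count is nonzero.

```python
# Sketch of the reference-count enforcement described above
# (an illustration of one possible implementation, not a mandated one).

class Decomposition:
    def __init__(self, dist):
        self.dist = dist
        self.refcount = 0          # arrays currently aligned to this decomposition

    def align_to(self, old=None):
        if old is not None:
            old.refcount -= 1      # REALIGN detaches from the former decomposition
        self.refcount += 1

    def deallocate_aligned(self):
        self.refcount -= 1         # DEALLOCATE of an array aligned here

    def redistribute(self, new_dist):
        # REDISTRIBUTE is legal only while no arrays are aligned here.
        if self.refcount != 0:
            raise RuntimeError("decomposition still has arrays aligned to it")
        self.dist = new_dist

d1, d2 = Decomposition("BLOCK"), Decomposition("CYCLIC")
d1.align_to()                      # align some array A with d1
d2.align_to(old=d1)                # REALIGN A with d2; d1 is now detached
d1.redistribute("CYCLIC")          # now permitted, since d1.refcount is 0
```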
(3) Introduce a new abstraction to represent rectilinear arrangements of the physical processors

The Fortran D ALIGN directive conflated several different notions:

[a] relative alignment of array indices
[b] wraparound subscripting (which we feel should be addressed, if at all, by a separate linguistic mechanism)
[c] possible assignment of multiple positions along an array axis to a single decomposition element in arbitrary patterns

Our proposed redesign of ALIGN retains [a], but eliminates [b] and [c]. The mapping described by the new version of ALIGN is always injective (one-to-one), with the sole exception of collapsing an entire axis. This greatly simplifies both the intuitions and the implementation. All other cases of reorganizing and grouping the data are expressed by the DISTRIBUTE statement.

The Fortran D DISTRIBUTE statement also conflated two different notions:

[x] how a decomposition is to be mapped onto a rectilinear arrangement of processors
[y] how that rectilinear arrangement of processors is to be realized within the actual machine structure

We propose to separate these, because the information for [x] is largely algorithm-dependent but machine-independent, whereas [y] is surely machine-dependent. We introduce the PROCESSORS statement, and modify the DISTRIBUTE statement to look more like the ALIGN statement:

      CHPF$ PROCESSORS P(4,8)
      CHPF$ DISTRIBUTE D(BLOCK,CYCLIC) ONTO P

Of course, this piece of code has assumed that there are exactly (or at least) 32 processors.
One might strive for slightly better machine-independence in this way:

      CHPF$ PROCESSORS P(N$PROC/8, 8)
      CHPF$ DISTRIBUTE D(BLOCK,CYCLIC) ONTO P

or perhaps

            PARAMETER (IX=2**INT(LOG(REAL(N$PROC))/LOG(4.0)+.2), IY=N$PROC/IX)
      CHPF$ PROCESSORS P(IX,IY)
      CHPF$ DISTRIBUTE D(BLOCK,CYCLIC) ONTO P

An additional machine-dependent MACHINE_LAYOUT directive can provide target-machine-specific parameters describing how a three-dimensional grid should be mapped to the hypercube or grid or whatever of the actual target machine. (Indeed, perhaps HPF should not specify an HPF$MACHINE_LAYOUT directive, but encourage individual implementations to have implementation-specific directives such as CMF$MACHINE_LAYOUT. The advantage is that a single source code could contain the implementation-specific directives for a number of machines without the need for conditional compilation.)

(4) Introduce the concept of "local subroutines"

Many algorithms for multiple processors work best, and are expressed best, as a two-level structure: the parts that communicate among processors, and the parts that operate independently within each processor. (It is even frequently the case that each part is making the same conceptual contribution to the entire algorithm, but different strategies work best for the two cases. For example, to sum a bunch of numbers, one might want to organize the processors into a tree structure, but using a tree organization on the data within each processor would only add needless overhead compared to a linear scan over the data.)

Often the independent, per-processor parts of an algorithm can usefully have independent control structure. However, independence of control structure does not suffice for efficient implementation; the compiler also needs information about independence of data. The problem with the original Fortran D FORALL statement was that it assumed a shared-memory model at the language level: any iteration of a FORALL loop could access any data item.
This model is difficult to support on many architectures. It is difficult or impossible for a compiler to determine whether any given subscripted array access will require interprocessor communication.

We propose to introduce a new kind of subroutine called a "local" subroutine. Calling a local subroutine causes it to be executed on each processor, one copy (one thread) per processor. Ordinary array-subscripting syntax may be used to access any array element that is mapped to the executing processor. Access to array elements in another processor requires a different and intentionally more clumsy mechanism, a call to a library routine. This provides a mechanism for coding per-processor portions of algorithms in such a way that the compiler can trivially determine which accesses are local and which may require communication.

(5) Introduce a LOCALLY statement block

Given the concept of a local subroutine, one can introduce a statement that has the effect of introducing a block of local code within an otherwise global routine. One must understand, however, that within such a block of local code the meaning of each array variable changes; it represents only the portion of the array local to that processor.

(6) Parallel prefix intrinsics

Parallel prefix operations are extremely important in many parallel codes. It's just silly not to include them in a Fortran dialect designed for high parallel performance. There are also good serial implementations of these operations.

================ DETAILS ================

[1] DECOMPOSITION Directive

The DECOMPOSITION directive declares one or more decompositions, specifying for each the name, the rank (number of dimensions), and the size of each dimension. It must appear in the declaration-part of a program unit. A decomposition is simply an abstract space of indexed positions. No storage is allocated for a decomposition.
Syntax:

      decomposition-directive is DECOMPOSITION decomposition-decl-list
      decomposition-decl      is decomposition-name ( explicit-shape-spec-list )
      decomposition-name      is object-name

Examples:

      DECOMPOSITION A(N)
      DECOMPOSITION B(N,N), C(N,2*N)

[2] PROCESSORS Directive

The PROCESSORS directive declares one or more processor arrangements, specifying for each the name, the rank (number of dimensions), and the size of each dimension. It must appear in the declaration-part of a program unit. The product of the dimensions must be greater than zero and not greater than the number of processors to be used to execute the program. No storage is allocated for a processor arrangement. The variable N$PROC may be used to refer to the total number of processors.

Syntax:

      processors-directive is PROCESSORS processors-decl-list
      processors-decl      is processors-name [ ( explicit-shape-spec-list ) ]
      processors-name      is object-name

Examples:

      PROCESSORS P(N)
      PROCESSORS Q(N$PROC), R(N,N$PROC/N)
      PROCESSORS SCALARPROC

If no shape is specified, then the processor arrangement is conceptually scalar and global. Depending on the implementation architecture, data distributed onto such a processor arrangement may reside in a single control processor (if the machine has one) or be replicated over all processors.

[3] ALIGN and REALIGN Directives

The ALIGN directive is used to map data objects onto elements of a decomposition. If two data objects are mapped to the same decomposition element, then the two data objects are said to be aligned. Operations between aligned data objects are likely to be more efficient than operations between data objects that are not known to be aligned. The ALIGN directive is designed to make it particularly easy to specify mappings for all the elements of an array at once. The ALIGN directive may appear only in the declaration-part of a program unit. The REALIGN directive is similar but may appear only in the execution-part of a program unit.
The only syntactic difference between ALIGN and REALIGN is that ALIGN must contain only a specification-expr as a subscript or in a subscript-triplet, whereas REALIGN may use any integer expression.

When applied to a parameter, an ALIGN directive in the declaration-part is assumed to be declarative in nature: the parameter must already be aligned as described, and the compiler may rely on that when compiling the subprogram. (By "must already be aligned", we mean that that is the programmer's responsibility, and a compiler with interprocedural analysis would be justified in issuing an error message on detecting violations of this requirement.) In contrast, a REALIGN directive always forces an array to have the specified alignment, moving data around at run time if necessary.

Syntax:

      align-directive    is ALIGN align-stuff
      realign-directive  is REALIGN align-stuff
      align-stuff        is single-align-stuff
                         or multiple-align-stuff
      single-align-stuff is alignee ( align-source-list ) WITH align-spec
      alignee            is object-name
      align-source       is :
                         or *

In a single-align-stuff directive, every axis of the alignee is specified as either : or *. If it is :, then positions along that axis will be spread out across the matching axis of the align-spec; if it is *, then that axis is collapsed: positions along that axis make no difference in determining the corresponding decomposition element. All array elements are aligned with corresponding positions of the align-spec. If the alignee is a scalar, then it has a replicated representation. If the align-spec is scalar, all elements of the alignees will reside in the same processor.

      multiple-align-stuff is [ ( align-source-list ) ] WITH align-spec :: alignee-list

A multiple-align-stuff directive is equivalent to a series of single-align-stuff directives. All alignees must have the same rank. If the align-source-list is omitted, it is assumed to consist of a list of ":" entries, equal in number to the rank of the alignees.
So the directive

CHPF$ ALIGN ( align-source-list ) WITH align-spec :: alignee-list

is equivalent to the sequence

CHPF$ ALIGN alignee-1( align-source-list ) WITH align-spec
CHPF$ ALIGN alignee-2( align-source-list ) WITH align-spec
      ...
CHPF$ ALIGN alignee-m( align-source-list ) WITH align-spec

      align-spec      is align-name [ ( align-subscript-list ) ]
      align-name      is decomposition-name
                      or object-name
      align-subscript is int-expr
                      or subscript-triplet
                      or *

Every combination of possible values for the index variables selects an element of the array to be aligned. The align-spec indicates a decomposition element, scalar, or array element with which that array element should be aligned. The align-spec must contain exactly as many subscript-triplets as the number of colons ":" appearing in the align-source-list. These are then matched up in corresponding left-to-right order. A "*" as an align-subscript indicates a replicated representation. Thus, for example,

CHPF$ ALIGN A(:) WITH D(:,*)

means that a copy of A is aligned with every column of D. This means that an optimizing compiler can arrange to read whichever copy is closest to hand. (Of course, when an element of A is written, all copies must be updated, not just one copy. Replicated representations are very useful for such tricks as small lookup tables, where it is much faster to have a copy in each physical processor but you don't want to be bothered giving it an extra dimension which is logically unnecessary to the algorithm.)

Note that an align-name may be an object-name as well as a decomposition-name. The main purpose of this is to allow alignment or realignment to a subroutine argument. Aligning to an array element has the same meaning as aligning to the decomposition element to which that array element was aligned.

      SUBROUTINE GLUTINOUS_RICE(P,Q,R,S)
      REAL P(:,:),Q(:,:),R(:,:),S(:,:)
CHPF$ ALIGN WITH P:: Q,R,S
      ...
      END

The directive allows the compiler to assume that the four subroutine arguments will all be aligned one with another, even though nothing is said about how they may be distributed onto processors.

      SUBROUTINE EIGHT_TREASURE_RICE(P,Q,R,S)
      REAL P(:,:),Q(:,:),R(:,:),S(:,:)
CHPF$ REALIGN WITH P:: Q,R,S
      ...
      END

In contrast, this directive says nothing about the alignment of the arguments as they come into the routine, but specifies that they should be forced into alignment before executing the remainder of the subroutine.

(We examined some alternatives to allowing objects as alignment targets, but they all seemed to require giving decompositions first-class status in the language, including allowing them as subroutine parameters and/or in COMMON blocks. This approach seemed less attractive to us, partly because it might require modifying the source code to change subroutine calls rather than merely adding directives.)

More examples of ALIGN directives:

CHPF$ DECOMPOSITION D1(N),D2(N,N)
      REAL,DIMENSION(N,N):: X,A,B,C,AR1,AR2A,P,Q,R,S
CHPF$ ALIGN X(:,*) WITH D1(:)
CHPF$ ALIGN (:,*) WITH D1:: A,B,C,AR1,AR2A
CHPF$ ALIGN WITH D2:: P,Q,R,S

Note that, in a multiple-align, the alignees must all have the same rank but need not all have the same shape; the sizes need match only for dimensions that correspond to colons in the align-source-list. This turns out to be *extremely* important to users of CM Fortran: one of the most common cases is aligning arrays that match in parallel dimensions but may differ in "serial" (collapsed) dimensions:

CHPF$ DECOMPOSITION D1(N)
      REAL A(3,N), B(4,N), C(43,N)
CHPF$ ALIGN (*,:) WITH D1:: A,B,C

The idea here is that you know there are processors (perhaps N of them), and you want arrays of different sizes (3, 4, 43) within each processor. It's okay for the numbers 3, 4, and 43 to be different, because those axes will be collapsed so that array elements with indices differing only along that axis will all reside in the same processor anyway.
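The collapse rule in this last example can be stated as a tiny function (ours, purely illustrative; Python is again used as executable pseudocode): under ALIGN (*,:) WITH D1, an element (i,j) of A, B, or C lives with decomposition element D1(j), and the collapsed first subscript has no effect on placement, which is why the extents 3, 4, and 43 need not agree.

```python
# Mapping implied by  CHPF$ ALIGN (*,:) WITH D1:: A,B,C
# (illustrative sketch, not part of the proposed specification).

def d1_element(i, j):
    """Decomposition element of D1 holding A(i,j), B(i,j), or C(i,j)."""
    return j          # the "*" axis is collapsed: i is ignored;
                      # the ":" axis matches D1's single axis directly

# Elements differing only in the collapsed subscript share a position,
# so arrays with different first extents (3, 4, 43) align without conflict.
assert d1_element(1, 7) == d1_element(43, 7) == 7
```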
[4] DISTRIBUTE and REDISTRIBUTE Directives

The DISTRIBUTE directive specifies a mapping of abstract decomposition elements to processors in a processor arrangement. It must appear in the declaration-part of a program unit. It must appear before any ALIGN directive that refers to the decomposition in question either directly or indirectly. If an array is aligned to a decomposition before the decomposition has been explicitly distributed, then the decomposition is distributed in an implementation-dependent manner onto an implementation-dependent processor arrangement.

The REDISTRIBUTE directive is similar but is considered executable. A decomposition may be redistributed at any time that there is no array aligned to it. (You can detach an array either by using REALIGN to attach it to some other decomposition, or by deallocating the array with a DEALLOCATE statement if that is applicable. It is easy to provide an error check at run time by using reference counts: every freshly created decomposition has a count that is initially zero; aligning an array to a decomposition decrements the count for the decomposition it was formerly aligned with (if any) and increments the count of the decomposition it is to be newly aligned with; and deallocating an array decrements the count of its decomposition. For this error check it is necessary that the dope information for an array include the equivalent of a pointer to the decomposition information.)

The DISTRIBUTE directive may appear only in the declaration-part of a program unit. The REDISTRIBUTE directive is similar but may appear only in the execution-part of a program unit. The only syntactic difference between DISTRIBUTE and REDISTRIBUTE is that DISTRIBUTE must contain only a specification-expr as the argument to BLOCK or CYCLIC, whereas REDISTRIBUTE may use any integer expression.
Syntax:

      distribute-directive   is DISTRIBUTE dist-stuff
      redistribute-directive is REDISTRIBUTE dist-stuff
      dist-stuff             is single-dist-stuff
                             or multiple-dist-stuff
      single-dist-stuff      is distributee ( dist-format-list ) [ ONTO dist-target ]
      multiple-dist-stuff    is [ ( dist-format-list ) ] [ ONTO dist-target ] :: distributee-list
      distributee            is decomposition-name
      dist-format            is BLOCK [ ( specification-expr ) ]
                             or CYCLIC [ ( specification-expr ) ]  ???
                             or *
      dist-target            is processors-name

Examples:

CHPF$ DISTRIBUTE D1(BLOCK)
CHPF$ DISTRIBUTE (BLOCK,*,BLOCK) ONTO P:: D2,D3,D4

Define the ceiling division function CD(J,K) = (J+K-1)/K. Let D be the size of the decomposition in a certain dimension and let N be the size of the processor arrangement in the same dimension. Then BLOCK means the same as BLOCK(CD(D,N)), and BLOCK(M) means that a decomposition element whose index along that dimension is J resides in the processor with index CD(J,M) (M*N must be no smaller than D). CYCLIC means the same as CYCLIC(1). CYCLIC(M) means that a decomposition element whose index along that dimension is J resides in the processor with index 1+MODULO(CD(J,M)-1,N). CYCLIC(M) and BLOCK(M) mean the same thing when M*N >= D. But CYCLIC and BLOCK do not mean the same thing.

If the ONTO clause is present, every distributee in the same DISTRIBUTE directive must have the same rank as the dist-target. If the ONTO clause is omitted, then an implementation-dependent processor arrangement is chosen arbitrarily for each distributee.

A multiple-dist-stuff directive is equivalent to a series of single-dist-stuff directives. All distributees must have the same rank. If the dist-format-list is omitted, it is assumed to consist of a list of "BLOCK" entries, equal in number to the rank of the distributees.
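The BLOCK(M) and CYCLIC(M) mappings just defined can be checked mechanically. The following sketch (ours, purely illustrative; Python as executable pseudocode, 1-origin indices throughout) encodes CD and both mappings, and confirms that the two coincide whenever one block per processor suffices, that is, whenever M*N >= D.

```python
# The distribution mappings of section [4], checked on a small case
# (illustrative sketch, not part of the proposed specification).

def CD(j, k):
    """Ceiling division CD(J,K) = (J+K-1)/K."""
    return (j + k - 1) // k

def block_proc(j, m):
    """BLOCK(M): element J resides in processor CD(J,M); requires M*N >= D."""
    return CD(j, m)

def cyclic_proc(j, m, n):
    """CYCLIC(M): element J resides in processor 1+MODULO(CD(J,M)-1,N)."""
    return 1 + (CD(j, m) - 1) % n

D, N = 8, 4                               # decomposition extent, processor extent

# BLOCK means BLOCK(CD(D,N)) = BLOCK(2): contiguous blocks of two.
assert [block_proc(j, CD(D, N)) for j in range(1, D + 1)] == [1, 1, 2, 2, 3, 3, 4, 4]

# CYCLIC means CYCLIC(1): elements dealt out round-robin.
assert [cyclic_proc(j, 1, N) for j in range(1, D + 1)] == [1, 2, 3, 4, 1, 2, 3, 4]

# With M = 2 we have M*N = 8 >= D, so the MODULO is vacuous and the
# two mappings agree element for element.
assert all(cyclic_proc(j, 2, N) == block_proc(j, 2) for j in range(1, D + 1))
```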
So the directive

CHPF$ DISTRIBUTE ( dist-format-list ) ONTO dist-target:: distributee-list

is equivalent to the sequence

CHPF$ DISTRIBUTE distributee-1( dist-format-list ) ONTO dist-target
CHPF$ DISTRIBUTE distributee-2( dist-format-list ) ONTO dist-target
      ...
CHPF$ DISTRIBUTE distributee-m( dist-format-list ) ONTO dist-target

The directive

CHPF$ DISTRIBUTE ONTO P:: D1,D2,D3

means

CHPF$ DISTRIBUTE D1(BLOCK,BLOCK) ONTO P
CHPF$ DISTRIBUTE D2(BLOCK,BLOCK) ONTO P
CHPF$ DISTRIBUTE D3(BLOCK,BLOCK) ONTO P

[5] LOCAL Subroutines

[Note to myself: Hm. BLOCK, BLOCK(m), and CYCLIC in DISTRIBUTE directives work out well for this, but not CYCLIC(m), which requires each dimension to become two dimensions. That is, CYCLIC(m) requires an index to be split into a greater number of discontiguous pieces when mapping it to processor number and memory address within processor, thereby incurring higher overheads. We should worry about this.]

The basic idea is to add a new kind of subroutine, called a LOCAL subroutine. Code in a LOCAL subroutine runs within a processor. It can access any data within the processor using Fortran syntax; it can also use array notation, but only to operate on data within the processor. To access data outside the processor requires either preparatory communication before running the LOCAL code, or the use of communications primitives such as a message-passing library. A LOCAL subroutine can call another LOCAL subroutine and everything behaves normally.

LOCAL execution is initiated by calling a LOCAL subroutine from ordinary ("global") HPF code. When this occurs, all global arrays accessible to the subroutine (notably arrays passed as arguments) are logically carved up into pieces by physical processor, and each instance of the subroutine sees just the subgrid in its processor.
Now for the details:

A function or subroutine in an HPF program may have the keyword LOCAL preceding its header:

      LOCAL SUBROUTINE PAUL_SIMON(A, B, C)
      LOCAL INTEGER FUNCTION BOY_GEORGE(X)

Such a subprogram unit is compiled in a special way. The main difference is that, in a machine (such as the CM-5) that has a central control processor directing the actions of the mass of individual parallel processors, *all* the code runs on the individual processors; operations that might normally be carried out in the control processor are all compiled for the individual processors. Local subprograms may use array notation. If the individual processors have vector hardware, for example, then one can expect that such per-processor array code would be compiled into per-processor vector operations, etc.

A local subprogram is executed on a processor and executes asynchronously and independently of all other processors. With the exception of returning from such a local subroutine to the global caller that initiated local execution (see below), there is no implicit synchronization of processors. So such a subroutine may use any control structure whatsoever.

One local subprogram can call another local subprogram. This behaves as an ordinary FORTRAN subprogram call.

One may also call a local subprogram from a global program unit; this is how local execution is initiated in the first place. When this occurs, one subprogram call occurs in each physical processor. In this case special rules apply:

(1) Scalar parameters are broadcast, so that each processor sees a copy of the scalar parameter. (Question: is it a run-time error to assign to this parameter? Or maybe it is okay as long as all processors assign the same value. This needs work.)

(2) Array parameters are broken up into subgrids, and the subprogram on a given processor receives as its argument an array representing the subgrid of elements residing on that processor.
So if you have a 50x50 global array on a 16-processor machine, using a 4x4 processor layout, then most of the processors will receive a 13x13 array as argument; six will see 13x11 or 11x13 arrays, and one will see an 11x11 subgrid.

(3) As a local subprogram on a processor returns to a global caller, the processor performs a synchronization function. The result is that the global caller proceeds only when *all* processors have completed execution of the subprogram.

(4) In the case of a local function (as opposed to a subroutine), the result of the call is a one-dimensional array of the results (assuming the result of the function is nominally scalar). The length of this array is equal to the number of physical processors. In the case of an array-valued local function, the ranks and shapes of the per-processor result arrays must be identical, and an extra dimension is appended. So if you have 128 processors and each returns a 4x2 array, the result is 4x2x128.

Within a local subprogram, all the usual array inquiry intrinsics may be used. When applied to subgrids resulting from the breakup of a global parameter, such intrinsics are regarded as applying to the local subgrid. For example:

      REAL A(50,50)
      CALL CINDI_LAUPER(A)
      ...

      LOCAL SUBROUTINE CINDI_LAUPER(X)
      REAL, ARRAY(:,:):: X
      N = SIZE(X,1)
      L = LBOUND(X,1)
      U = UBOUND(X,1)
      ...

Assuming 16 processors again, in a 4x4 layout, some processors would set N to 13, but those on the high edge would set N to 11. All processors set L to 1 and U to the same value as N (13 or 11).

[A piece of rationale: There is a question as to what L and U should be. Should they be equal to the global array indices, so that a processor might see indices from 1:13, or 14:26, or 27:39, or 40:50? Or should they be normalized to start at 1, so that each processor sees a range of 1:13 or 1:11? The former is in some ways more elegant, but the latter may be prone to fewer errors, because old Fortran programmers tend to assume that *all* arrays start at 1.
Taking a cue from the existing Fortran 90 specification for assumed-shape arrays, I conclude that L should be 1 and U should be 13 or 11, as the case may be. End of rationale.]

Another series of intrinsics, GLOBAL_SIZE, GLOBAL_LBOUND, and GLOBAL_UBOUND, may be applied to arrays to determine their global extents. So GLOBAL_SIZE(X,1) would return 50.

Another intrinsic, PROCESSOR_INDEX, could return the 1-origin physical (processor) self-address with respect to a given array and axis. So PROCESSOR_INDEX(X,1) would return 1, 2, 3, or 4, depending on the processor. PROCESSOR_SIZE(X,1) would be 4, the limit on the range of values returned by PROCESSOR_INDEX.

GLOBAL_SHAPE and PROCESSOR_SHAPE are defined in terms of GLOBAL_SIZE and PROCESSOR_SIZE in the same manner that SHAPE is defined in terms of SIZE. So

      GLOBAL_SHAPE(X) is (/ 50, 50 /) on every processor
      PROCESSOR_SHAPE(X) is (/ 4, 4 /) on every processor
      SHAPE(X) is (/ 13, 13 /) in 9 of the processors,
                  (/ 13, 11 /) in 3 of the processors,
                  (/ 11, 13 /) in 3 of the processors, and
                  (/ 11, 11 /) in 1 of the processors.

The GLOBAL_ and PROCESSOR_ intrinsics may also be applied to an array that was allocated locally. In that case the GLOBAL values are the same as the non-GLOBAL values, and all elements are regarded as residing on a single processor. The same is true of an array that is not distributed (for example, one allocated in the control processor).

      LOCAL SUBROUTINE KINKS()
      REAL, ARRAY(40,40):: YOU_REALLY_GOT_ME
      CALL CINDI_LAUPER(YOU_REALLY_GOT_ME)

Then GLOBAL_SIZE(YOU_REALLY_GOT_ME, 1) is 40, GLOBAL_LBOUND(YOU_REALLY_GOT_ME, 1) is 1, GLOBAL_UBOUND(YOU_REALLY_GOT_ME, 1) is 40, and PROCESSOR_INDEX(YOU_REALLY_GOT_ME, 1) is 1.

You can't access array elements outside the local subgrid using array notation, but you can do it with special communication routines:

      CALL ARRAY_FETCH(result, global_source, i1, ..., in)
      CALL ARRAY_STORE(source, global_dest, i1, ..., in)

These transfer a scalar (integer, logical, real, or complex) from source to dest.
The global operand must be the name of an array, and the following arguments are integers used as indexes into the global array of which the named array is presumably a subgrid. (If it's a local array, it still works, but it obviously doesn't have to go outside the executing processor! It's probably a silly thing to do, though, unless you're trying to use a general-purpose routine on a local array.) The intended CM-5 implementation is through the use of messages that generate interrupts at the destination processor; the interrupt routine services read and write requests.

Here is some code that implements bitwise-AND reduction on a one-dimensional array of integers:

      INTEGER FUNCTION BITAND(X)
      INTEGER, ARRAY(:):: X
      INTEGER, ARRAY(PROCESSOR_SIZE(X)):: PARTIAL_RESULTS
C GET EACH PROCESSOR TO COMBINE ITS OWN VALUES AND DELIVER ONE RESULT EACH.
      CALL BITAND_WITHIN_PROCESSOR(X, PARTIAL_RESULTS)
C NOW USE A BINARY TREE TO COMBINE ONE VALUE FROM EACH PROCESSOR.
C CAREFUL FOOTWORK MAKES SURE THAT ANY SIZE WORKS, NOT JUST POWERS OF 2.
      K = PROCESSOR_SIZE(X)
      DO WHILE (K .GT. 1)
         PARTIAL_RESULTS(1:K/2) = IAND(PARTIAL_RESULTS(1:K/2), &
              PARTIAL_RESULTS((K+1)/2+1:K))
         K = (K+1)/2
      END DO
      BITAND = PARTIAL_RESULTS(1)
      END

      LOCAL SUBROUTINE BITAND_WITHIN_PROCESSOR(X, PARTIAL_RESULTS)
      INTEGER, ARRAY(:):: X
      INTEGER, ARRAY(1):: PARTIAL_RESULTS
      INTEGER P
C WITHIN A PROCESSOR, JUST DO A SERIAL LOOP TO AND THINGS UP
C (THIS CODE ASSUMES NO VECTOR PROCESSING IS AVAILABLE).
      P = -1
      DO J = LBOUND(X,1),UBOUND(X,1)
         P = IAND(P, X(J))
      END DO
      PARTIAL_RESULTS(1) = P
      END

In a global program unit, COMMON means global common; the keyword COMMON may optionally be preceded by the word GLOBAL. In a LOCAL program unit, COMMON means local common; global COMMON may be specified by prefixing the word COMMON with the word GLOBAL, and the keyword LOCAL may also appear before the word COMMON.
      SUBROUTINE QUESTION_MARK
      COMMON /FOO/ TEARS(96)       !COULD ALSO HAVE SAID "GLOBAL COMMON"
      CALL THE_MYSTERIANS
      END SUBROUTINE QUESTION_MARK

      LOCAL SUBROUTINE THE_MYSTERIANS
      GLOBAL COMMON /FOO/ TEARS(96)   !EACH PROCESSOR HAS CEILING(96/N) ELEMENTS
      COMMON /BAR/ CRY(4)             !EACH PROCESSOR HAS 4 ELEMENTS
      LOCAL COMMON /BAZ/ TEARDROPS(10000000)  !EACH PROCESSOR HAS 1E7 ELEMENTS
                                              ! (TOO MANY TEARDROPS :-)
      ...
      !EACH PROCESSOR ASSIGNS ONLY TO ITS OWN SUBSET OF THE TEARS.
      FORALL (J=1:UBOUND(TEARS,1)) TEARS(J) = CRY(1+MOD(J,4))
      ...
      END SUBROUTINE THE_MYSTERIANS

[6] LOCALLY Statement

      locally-statement is LOCALLY
                             declarations ???
                             block
                           END LOCALLY

This statement behaves as if its contents were the body of a local subroutine invoked at that point, with all visible objects that it uses passed as parameters of the same name. You may not use LOCALLY within local code (there is no point), even though you may call a local subroutine from local code. So the example above could be written:

      INTEGER FUNCTION BITAND(X)
      INTEGER, ARRAY(:):: X
      INTEGER, ARRAY(PROCESSOR_SIZE(X)):: PARTIAL_RESULTS
C GET EACH PROCESSOR TO COMBINE ITS OWN VALUES AND DELIVER ONE RESULT EACH.
      LOCALLY
      INTEGER P
C WITHIN A PROCESSOR, JUST DO A SERIAL LOOP TO AND THINGS UP
C (THIS CODE ASSUMES NO VECTOR PROCESSING IS AVAILABLE).
      P = -1
      DO J = LBOUND(X,1),UBOUND(X,1)
         P = IAND(P, X(J))
      END DO
      PARTIAL_RESULTS(1) = P
      END LOCALLY
C NOW USE A BINARY TREE TO COMBINE ONE VALUE FROM EACH PROCESSOR.
C CAREFUL FOOTWORK MAKES SURE THAT ANY SIZE WORKS, NOT JUST POWERS OF 2.
      K = PROCESSOR_SIZE(X)
      DO WHILE (K .GT. 1)
         PARTIAL_RESULTS(1:K/2) = IAND(PARTIAL_RESULTS(1:K/2), &
              PARTIAL_RESULTS((K+1)/2+1:K))
         K = (K+1)/2
      END DO
      BITAND = PARTIAL_RESULTS(1)
      END

[7] Parallel prefix (and suffix) intrinsics

For every reduction operation XXX in the language, introduce two new intrinsics, XXX_PREFIX and XXX_SUFFIX.
They take the same arguments as the corresponding reduction intrinsic, plus one additional optional argument called SEGMENT, which is of type logical and conformable with the ARRAY argument (a true element indicates the start of a new segment of the first argument, that is, a place where the running accumulation is to be reset). If the DIM argument is omitted, then the arrays are processed in array element order ("column-major" order), as if temporarily regarded as one-dimensional. In all cases the result has the same shape as the first argument.

[Should there be another optional argument, a scalar logical named EXCLUSIVE, default .FALSE., which determines whether the prefix or suffix operation is inclusive (the default) or exclusive?

      SUM_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

Probably so.]

Examples:

      SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)

      SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE.)

      COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) ) yields (/1,1,1,2,3,4,4,5,5/)

      COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) yields
        (/0,1,1,1,2,3,4,4,5/)

      SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/), SEGMENT=(/T,F,F,F,T,F,T,T,F/) )
        yields (/1,3,6,10,5,11,7,8,17/)

In the last example, the SEGMENT argument divides the input into four segments (starting at elements 1, 5, 7, and 8), and the result contains four independent running sums, one per segment.

[8] Miscellaneous Stuff

(a) Intrinsics POPCNT, POPPAR, and LEADZ.

(b) Reduction versions of IAND, IOR, and IEOR. How about AND, OR, and EOR?

(c) An integer-length intrinsic:

      ILEN(X) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))

This is related to LEADZ but is often much more convenient for the calculation of array dimensions, etc. It is the number of bits required to store the signed integer x (not counting the sign bit). As examples of its use, 2**ILEN(N-1) rounds N up to a power of 2 (for N > 0), whereas 2**(ILEN(N)-1) rounds N down to a power of 2.

(d) Consistent with Fortran 90 and IEEE Std 754-1985, eliminate the operator "<>" and use only the Fortran 90 "/=".
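To pin down the segmented-scan semantics of [7] above, here is a minimal Python model. It is a hedged sketch, not part of the proposal: the function names are ours, only the one-dimensional unmasked case is modeled, and how EXCLUSIVE should interact with SEGMENT is left open in the proposal (here the accumulation simply resets at each segment boundary).

```python
def sum_prefix(array, segment=None, exclusive=False):
    """Running sum in array element order, restarting wherever SEGMENT
    is true; with EXCLUSIVE, each element sees only the sum of the
    elements before it in its segment."""
    result, acc = [], 0
    for i, x in enumerate(array):
        if segment is not None and segment[i]:
            acc = 0  # a true SEGMENT element starts a new segment
        if exclusive:
            result.append(acc)
            acc += x
        else:
            acc += x
            result.append(acc)
    return result

def sum_suffix(array):
    """SUM_SUFFIX is SUM_PREFIX run right-to-left."""
    return sum_prefix(array[::-1])[::-1]

T, F = True, False
print(sum_prefix([1, 3, 5, 7]))  # [1, 4, 9, 16]
print(sum_suffix([1, 3, 5, 7]))  # [16, 15, 12, 7]
print(sum_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9],
                 segment=[T, F, F, F, T, F, T, T, F]))
# [1, 3, 6, 10, 5, 11, 7, 8, 17]
```

COUNT_PREFIX corresponds to summing 1 for each true element, so running this model on 0/1 values with exclusive=True reproduces the COUNT_PREFIX example of [7].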
(e) The MAXLOC and MINLOC intrinsics should have an optional DIM argument. If such an argument is present, then the shape of the result equals the shape of the first argument with one dimension (that indicated by the DIM argument) deleted; it is as if a series of one-dimensional MAXLOC or MINLOC operations were performed. Example: if A has the value

      [ 0 -5  8 -3 ]
      [ 3  4 -1  2 ]
      [ 1  5  6 -4 ]

then MINLOC(A, DIM=1) has the value [ 1, 1, 2, 3 ],
MAXLOC(A, DIM=1) has the value [ 2, 3, 1, 2 ],
MINLOC(A, DIM=2) has the value [ 2, 3, 4 ], and
MAXLOC(A, DIM=2) has the value [ 3, 2, 3 ].

(f) Block FORALL statements:

      FORALL (... e1 ... e2 ...)
        s1
        s2
        ...
        sn
      END FORALL

means exactly the same as

      temp1 = e1
      temp2 = e2
      FORALL (... temp1 ... temp2 ...) s1
      FORALL (... temp1 ... temp2 ...) s2
      ...
      FORALL (... temp1 ... temp2 ...) sn

That is, a block FORALL means exactly the same as replicating the FORALL header in front of each array assignment statement in the block, except that any expressions in the FORALL header are evaluated only once.

(g) Nested WHERE statements. Here is the text of a proposal once sent to X3J3:

| Briefly put, the less WHERE is like IF, the more difficult it is to
| translate existing serial codes into array notation. Such codes tend to
| have the general structure of one or more DO loops iterating over array
| indices and surrounding a body of code to be applied to array elements.
| Conversion to array notation frequently involves simply deleting the DO
| loops and changing array element references to array sections or whole
| array references. If the loop body contains logical IF statements, these
| are easily converted to WHERE statements. The same is true for translating
| IF-THEN constructs to WHERE constructs, except in two cases.
| If the IF constructs are nested (or contain IF statements), or if ELSE IF
| is used, then conversion suddenly becomes disproportionately complex,
| requiring the user to create temporary variables or duplicate mask
| expressions and to use explicit .AND. operators to simulate the effects
| of nesting.
|
| Users also find it confusing that ELSEWHERE is syntactically and
| semantically analogous to ELSE rather than to ELSE IF.
|
| [We] ... propose that the syntax of WHERE constructs be extended and
| changed to have the form
|
|     where-construct is where-construct-stmt
|                          [ where-body-construct ]...
|                        [ elsewhere-stmt
|                          [ where-body-construct ]... ]...
|                        [ where-else-stmt
|                          [ where-body-construct ]... ]
|                        end-where-stmt
|
|     where-construct-stmt is WHERE ( mask-expr )
|
|     elsewhere-stmt is ELSE WHERE ( mask-expr )
|
|     where-else-stmt is ELSE
|
|     end-where-stmt is END WHERE
|
|     mask-expr is logical-expr
|
|     where-body-construct is assignment-stmt
|                          or where-stmt
|                          or where-construct
|
| Constraint: In each assignment-stmt, the mask-expr and the variable
|             being defined must be arrays of the same shape. If a
|             where-construct contains a where-stmt, an elsewhere-stmt,
|             or another where-construct, then the two mask-expr's must
|             be arrays of the same shape.
|
| The meaning of such statements may be understood by rewrite rules. First
| one may eliminate all occurrences of ELSE WHERE:
|
|     WHERE (m1)                          WHERE (m1)
|       xxx                                 xxx
|     ELSE WHERE (m2)      becomes        ELSE
|       yyy                                 WHERE (m2)
|     END WHERE                              yyy
|                                          END WHERE
|                                        END WHERE
|
| where xxx and yyy represent any sequences of statements, so long as the
| original WHERE, ELSE WHERE, and END WHERE match, and the ELSE WHERE is the
| first ELSE WHERE of the construct (that is, yyy may include additional ELSE
| WHERE or ELSE statements of the construct). Next one eliminates ELSE:
|
|     WHERE (m)                           temp = m
|       xxx                               WHERE (temp)
|     ELSE                 becomes          xxx
|       yyy                               END WHERE
|     END WHERE                           WHERE (.NOT. temp)
|                                           yyy
|                                         END WHERE
|
| Finally one eliminates nested WHERE constructs:
|
|     WHERE (m1)                          temp = m1
|       xxx                               WHERE (temp)
|       WHERE (m2)                          xxx
|         yyy              becomes        END WHERE
|       END WHERE                         WHERE (temp .AND. (m2))
|       zzz                                 yyy
|     END WHERE                           END WHERE
|                                         WHERE (temp)
|                                           zzz
|                                         END WHERE
|
| and similarly for nested WHERE statements.
|
| The effects of these rules will surely be a familiar or obvious possibility
| to all the members of the committee; I enumerate them explicitly here only
| so that there can be no doubt as to the meaning I intend to support.
|
| Such rewriting rules are simple for a compiler to apply, or the code may
| easily be compiled even more directly. But such transformations are
| tedious for our users to make by hand and result in code that is
| unnecessarily clumsy and difficult to maintain.
|
| One might propose to make WHERE and IF even more similar by making two
| other changes. First, require the noise word THERE to appear in a WHERE
| and ELSE WHERE statement after the parenthesized mask-expr, in exactly the
| same way that the noise word THEN must appear in IF and ELSE IF statements.
| (Read aloud, the results might sound a trifle old-fashioned--"Where knights
| dare not go, there be dragons!"--but technically would be as grammatically
| correct English as the results of reading an IF construct aloud.) Second,
| allow a WHERE construct to be named, and allow the name to appear in ELSE
| WHERE, ELSE, and END WHERE statements. I do not feel very strongly one way
| or the other about these no doubt obvious points, but offer them for your
| consideration lest the possibilities be overlooked.

[End of proposal as sent to X3J3.]

Now, for compatibility with Fortran 90, HPF should continue to use ELSEWHERE instead of ELSE, but this causes no ambiguity:

      WHERE(...)
        ...
      ELSE WHERE(...)
        ...
      ELSEWHERE
        ...
      END WHERE

is perfectly unambiguous, even when blanks are not significant. Since X3J3 declined to adopt the keyword THERE, it should not be used in HPF either (alas).
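The net effect of these rewrite rules on a full WHERE / ELSE WHERE / ELSE chain can be modeled with explicit masks. Here is a minimal Python sketch (the helper where_chain is ours, purely illustrative): each branch takes effect where its own mask is true and every earlier mask is false, just as the temporaries and .AND. operators in the rewritten form dictate.

```python
def where_chain(branches, default):
    """Elementwise result of a WHERE / ELSE WHERE ... / ELSE chain:
    each (mask, values) branch applies where its mask is true and all
    earlier masks are false; the ELSE values apply where no mask is true."""
    out = []
    for i in range(len(default)):
        for mask, values in branches:
            if mask[i]:
                out.append(values[i])
                break
        else:  # no branch mask was true at position i
            out.append(default[i])
    return out

# A sign function, written as
#   WHERE (A > 0) S = 1; ELSE WHERE (A < 0) S = -1; ELSE S = 0; END WHERE
a = [0, -5, 8, -3]
s = where_chain([([x > 0 for x in a], [1] * 4),
                 ([x < 0 for x in a], [-1] * 4)],
                [0] * 4)
print(s)  # [0, -1, 1, -1]
```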
[9] What about EQUIVALENCE?

What can be said about EQUIVALENCE in HPF? Sometimes it is very convenient to treat a multidimensional array as one-dimensional, or as having some smaller number of dimensions, with some dimensions combined to make a single axis.

[end]