Proposals for amending "High Performance Fortran" as described in draft 1.0 of 27 September 1991.

Guy Steele, Thinking Machines Corporation, 17 January 1992

================ INTRODUCTION ================

Thinking Machines Corporation very much favors development of a commonly agreed upon version of Fortran suitable for execution on a wide variety of platforms, including workstations, mainframes, multicomputers, vector supercomputers, and massively parallel computers. We believe that the draft "High Performance Fortran" is very close to being a workable solution. We suggest modifications in four areas:

[1] The HPF draft states that it is fully Fortran 90 conformant. We believe this is a laudable long-term goal but could be an impediment to short-term adoption of HPF. We suggest a layered definition consisting of a subset, labelled HPF, that can be extended compatibly to become a larger language (Fortran 90 HPF) that encompasses all of Fortran 90.

[2] The crucial improvement of HPF over the Fortran 90 array features is the ability to control the arrangement of data arrays within a memory or processor set that may be distributed. The original Fortran D proposal is very general, but the implementation techniques required to implement its full generality are still a subject of research (which was one of the original goals of Fortran D). We believe that it is necessary either to restrict the generality of the Fortran D proposal or to redesign it to make it easier to implement. On the other hand, we believe that the ability to realign and redistribute arrays at run time is very important, and would prefer to see this Fortran D facility retained in HPF. However, we propose a restricted version that we think is more consistent with Fortran 90 and also easier to implement, with fewer surprises for the user.
[3] The HPF proposal provides good support for data parallel programming, in which a program is organized more or less around a single thread of control that may occasionally be split and then re-joined in fairly controlled ways. (For example, a WHERE statement or a vectorized DO loop containing an IF may be compiled in such a way that different processors may follow different conditional execution paths, but they are effectively re-synchronized at the end of the WHERE or before dropping out of the loop.) However, the HPF proposal provides little support for a style of programming in which one writes a routine with the intention that a copy is run on each processor, with interprocessor communication managed explicitly by the programmer rather than implicitly by the compiler as a consequence of array subscripting operations or array intrinsics such as CSHIFT. We propose an interface that allows these two styles of programming to be integrated smoothly. Although one would not bother to use the interface in a program designed to run only on uniprocessors, it is designed to provide portability for programs that do use the SPMD style while nevertheless also resulting in efficient operation on uniprocessors.

[4] We also propose a number of miscellaneous small changes and additions that have proven useful, in our experience, in the writing of parallel or potentially parallel programs.

Thinking Machines Corporation intends to implement all four of the above categories of changes, or something like them, for its Connection Machine products. To the extent that many institutions can agree on their precise form, we will all be better off. Thinking Machines regards only points [1] and [2] as crucial to the success of HPF; points [3] and [4] are desirable but not critical.
We would be happy to see these placed in a category described as, "You don't have to implement these to be HPF, but if you do choose to implement them, you ought to consider using this particular specification for the sake of portability."

================ OVERVIEW ================

We propose a number of changes to High Performance Fortran:

Limitations on what must be included from Fortran 90:

(0) The main issue is array support
    (a) Array features (assignment, expressions, intrinsics)
    (b) Allocatable arrays; assumed-shape and assumed-size arrays
    (c) WHERE statement
    (d) Fortran 90 pointers (in support of dynamic allocation)
    (e) The following Fortran 90 features are *not* needed in HPF:
        (i) modules
        (ii) arithmetic precision specification
        (iii) derived types (i.e., record structures)
        (iv) internal procedures
        (v) INTENT statement and attribute
    Some of these may be quite useful (we are considering the implementation of derived types, for example), but should not be required features of HPF.

Changes to Fortran D:

(1) Redesign the ALIGN directive:
    (a) eliminate dummy variables
    (b) eliminate partial array alignment
    (c) clearly restrict its semantics
    (d) make it easy to align many arrays the same way
    (e) provide a version (REALIGN) that is dynamically executable, with suitable restrictions
(2) Allow dynamic redistribution of decompositions, with suitable restrictions
(3) Introduce a new abstraction to represent rectilinear arrangements of the physical processors; decompositions are then mapped onto these rectilinear arrangements

Support for SPMD programming:

(4) Introduce the concept of "local subroutines", which when invoked from global code are executed one independent copy per physical processor.
(5) Introduce a LOCALLY statement block that is consistent with the behavior of local subroutines; this solves some of the same problems that the Fortran D FORALL statement was intended to handle.
Other small improvements:

(6) Parallel prefix intrinsics
(7) Miscellaneous stuff we have already discussed before
(8) What can be said about EQUIVALENCE? Certain special cases may be particularly important for array handling.

================ RATIONALE ================

(1) Redesign the ALIGN directive

Experience with the CMF$ALIGN directive as it exists in CM Fortran has illuminated a few problems with its design. The principal flaw is that it is much too difficult to declare a number of arrays to have similar alignment; we have received a *large* number of complaints from CM Fortran users about this. It is absolutely necessary that at least one form of the ALIGN directive allow the user to specify an alignment once and then list only the names of many arrays, separated by commas.

Another problem is that the use of dummy index variables constantly misleads users into thinking that the directive is much more general than it really is. It would appear that such directives as

      CMF$ALIGN A(I,J) WITH B(J,I)
      CMF$ALIGN A(I,J) WITH B(I,MOD(J-I,SIZE(B,2)))
      CMF$ALIGN A(I,J) WITH B(INT(LOG(REAL(I))),INT(SIN(REAL(J))))

might in principle be allowed, whereas in fact they are not. This misconception arises because the syntax appears to use a very general syntactic item, the expression, in a place where in fact only a limited number of alternatives are allowed.

Each of the three examples shown above presents a particular difficulty. Taking them in reverse order:

      CMF$ALIGN A(I,J) WITH B(INT(LOG(REAL(I))),INT(SIN(REAL(J))))

presents an extreme difficulty to the implementor, because the index expressions are difficult to invert. The expressions indicate the decomposition element to which each array element should be mapped, but it is extremely difficult to determine, given a decomposition element (or, by extension, a physical processor), which array elements fall within that element (or processor).
Even if this could be determined, it is difficult to construct an efficient memory layout within the processor for these array elements in the general case.

The directive

      CMF$ALIGN A(I,J) WITH B(I,MOD(J-I,SIZE(B,2)))

is not that difficult to handle, consisting as it does of expressions linear in the dummy variables. Nevertheless, it requires array descriptors of much greater generality than might otherwise be required, which is likely to impose some overheads on program execution.

Even the apparently innocuous

      CMF$ALIGN A(I,J) WITH B(J,I)

conceals a pitfall: it may result in violation of the widespread convention that the first subscript varies fastest. This has implications both for the implementor (perhaps subscripts can no longer be calculated by Horner's rule) and for the programmer (intuitions about locality within the memory of a single processor may no longer be reliable).

For High Performance Fortran we propose a redesign of the ALIGN directive that relies on an analogy to array notation rather than on an analogy to the CM Fortran FORALL statement. In general, where one might use an index variable, one instead uses triplet notation. Rather than saying

      CHPF$ ALIGN A(I,J) WITH D(I,J+3)

one would write

      CHPF$ ALIGN A(:,:) WITH D(:,4:N+3)

Rather than saying

      CHPF$ ALIGN A(:,I,:) WITH D(I,:)

we revive the "*" notation of Fortran D with a slightly different meaning, and write

      CHPF$ ALIGN A(*,:,*) WITH D(:,*)

Briefly put, axes described by colon notation are eligible for matching between left-hand and right-hand sides; asterisks indicate axes that are not matched, and so indicate collapsed axes on the left-hand side and replication on the right-hand side. Complete details are discussed in a later section. We believe that this notation is not incompatible with the explicit-index notation; we could choose to adopt just one of them, knowing that the other can be integrated later.
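To make the triplet semantics concrete, here is a small sketch (ours, purely illustrative; Python is used only as executable pseudocode, since alignment directives are not themselves executable) of the element mapping implied by CHPF$ ALIGN A(:,:) WITH D(:,4:N+3): the triplet 4:N+3 pairs positions 1..N of A's second axis with positions 4..N+3 of D's second axis, a constant offset of 3.

```python
# Mapping implied by  CHPF$ ALIGN A(:,:) WITH D(:,4:N+3)
# (illustrative sketch, not part of the proposed specification).

def align_target(i, j):
    """Decomposition element of D with which A(i, j) is aligned."""
    # First triplet is ":", an identity match; second is "4:N+3",
    # whose lower bound 4 shifts A's 1-origin index by 3.
    return (i, j + 3)

assert align_target(1, 1) == (1, 4)   # A(1,1) lives with D(1,4)
assert align_target(2, 5) == (2, 8)   # A(2,5) lives with D(2,8)
```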
Finally, users tell us that there is a real need for calculating alignments at run time, especially for allocated arrays, and that having to call a subroutine to do this is quite awkward. In earlier versions of this proposal we had suggested allowing an ALIGN directive to appear anywhere in a program unit. It turns out that, for technical reasons, it is desirable in Fortran to be able to distinguish executable statements from declarations. Therefore we propose to introduce a REALIGN directive that is exactly like the ALIGN directive except that REALIGN is like an executable statement whereas ALIGN is a declaration. The distinction is particularly important in the case of parameters: an ALIGN directive on a parameter is descriptive, and the compiler should *assume* (or check) that the stated alignment is a property of the parameter; whereas REALIGN on a parameter indicates that the parameter might not have the stated alignment, and should be rearranged as necessary to cause it to have that alignment.

After a great deal of internal debate within Thinking Machines, we have concluded that the Fortran D specification that realignments are confined to subprograms is the best; that is, if a subprogram realigns anything, the realignment is effectively undone on exit from the subprogram (there are various ways of implementing this, having differing space and time characteristics). Therefore a compiler can count on array alignments remaining stable across subroutine calls. Our standard counterexample--wanting to write a subroutine whose very purpose is to realign some arrays--can be handled in some cases by writing a subroutine that calculates the realignment parameters, and then putting one or more REALIGN directives after the subroutine call.

(2) Allow dynamic redistribution of decompositions

Just as there is a need to realign arrays on the fly, there is also a need for redistributing decompositions on the fly.
Users point out to us that it would not be unusual for a program to decide which of several algorithms to use based on the size of a data set, and that the different algorithms might require different distributions. Therefore we propose to have a declarative DISTRIBUTE directive and an executable REDISTRIBUTE directive. We also propose the restriction that a decomposition may not be redistributed unless no arrays are currently aligned to that decomposition. (This can easily be enforced, if desired, by a simple run-time check based on reference counts.) The reason for this is that, after extensive conversations with users, we have found that each user easily grasps a model of alignment and distribution after a few minutes of explanation--but each user jumps to a slightly different model, and then becomes confused (in some cases indignant) when told that the model that is "obvious" to him or her is not quite the one in Fortran D. In particular, users divide just about evenly on the question of whether redistributing a decomposition should (a) affect just the decomposition, leaving all arrays as they already are, or (b) drag all arrays already aligned to that decomposition along with it. Our restriction sidesteps the question by forcing the user to do things in a particular order. In particular, you have to distribute a decomposition before aligning arrays to it. You can redistribute a decomposition only by first detaching all arrays currently aligned to it. (You can detach an array either by using REALIGN to attach it to some other decomposition, or by deallocating the array with a DEALLOCATE statement if that is applicable.) 
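The proposed run-time check can be sketched in a few lines (ours, purely illustrative; Python serves as executable pseudocode, and all names are invented): a decomposition carries a reference count of arrays currently aligned to it, REALIGN moves a count from the old decomposition to the new, DEALLOCATE decrements it, and REDISTRIBUTE is refused while the count is nonzero.

```python
# Sketch of the reference-count enforcement described above
# (an illustration of one possible implementation, not a mandated one).

class Decomposition:
    def __init__(self, dist):
        self.dist = dist
        self.refcount = 0          # arrays currently aligned to this decomposition

    def align_to(self, old=None):
        if old is not None:
            old.refcount -= 1      # REALIGN detaches from the former decomposition
        self.refcount += 1

    def deallocate_aligned(self):
        self.refcount -= 1         # DEALLOCATE of an array aligned here

    def redistribute(self, new_dist):
        # REDISTRIBUTE is legal only while no arrays are aligned here.
        if self.refcount != 0:
            raise RuntimeError("decomposition still has arrays aligned to it")
        self.dist = new_dist

d1, d2 = Decomposition("BLOCK"), Decomposition("CYCLIC")
d1.align_to()                      # align some array A with d1
d2.align_to(old=d1)                # REALIGN A with d2; d1 is now detached
d1.redistribute("CYCLIC")          # now permitted, since d1.refcount is 0
```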
(3) Introduce a new abstraction to represent rectilinear arrangements of the physical processors

The Fortran D ALIGN directive conflated several different notions:

[a] relative alignment of array indices
[b] wraparound subscripting (which we feel should be addressed, if at all, by a separate linguistic mechanism)
[c] possible assignment of multiple positions along an array axis to a single decomposition element in arbitrary patterns

Our proposed redesign of ALIGN retains [a], but eliminates [b] and [c]. The mapping described by the new version of ALIGN is always injective (one-to-one), with the sole exception of collapsing an entire axis. This greatly simplifies both the intuitions and the implementation. All other cases of reorganizing and grouping the data are expressed by the DISTRIBUTE statement.

The Fortran D DISTRIBUTE statement also conflated two different notions:

[x] how a decomposition is to be mapped onto a rectilinear arrangement of processors
[y] how that rectilinear arrangement of processors is to be realized within the actual machine structure

We propose to separate these, because the information for [x] is largely algorithm-dependent but machine-independent, whereas [y] is surely machine-dependent. We introduce the PROCESSORS statement, and modify the DISTRIBUTE statement to look more like the ALIGN statement:

      CHPF$ PROCESSORS P(4,8)
      CHPF$ DISTRIBUTE D(BLOCK,CYCLIC) ONTO P

Of course, this piece of code has assumed that there are exactly (or at least) 32 processors.
One might strive for slightly better machine-independence in this way:

      CHPF$ PROCESSORS P(N$PROC/8, 8)
      CHPF$ DISTRIBUTE D(BLOCK,CYCLIC) ONTO P

or perhaps

            PARAMETER (IX=2**INT(LOG(REAL(N$PROC))/LOG(4.0)+.2), IY=N$PROC/IX)
      CHPF$ PROCESSORS P(IX,IY)
      CHPF$ DISTRIBUTE D(BLOCK,CYCLIC) ONTO P

An additional machine-dependent MACHINE_LAYOUT directive can provide target-machine-specific parameters describing how a three-dimensional grid should be mapped to the hypercube or grid or whatever of the actual target machine. (Indeed, perhaps HPF should not specify an HPF$MACHINE_LAYOUT directive, but encourage individual implementations to have implementation-specific directives such as CMF$MACHINE_LAYOUT. The advantage is that a single source code could contain the implementation-specific directives for a number of machines without the need for conditional compilation.)

(4) Introduce the concept of "local subroutines"

Many algorithms for multiple processors work best, and are expressed best, as a two-level structure: the parts that communicate among processors, and the parts that operate independently within each processor. (It is even frequently the case that each part is making the same conceptual contribution to the entire algorithm, but different strategies work best for the two cases. For example, to sum a bunch of numbers, one might want to organize the processors into a tree structure, but using a tree organization on the data within each processor would only add needless overhead compared to a linear scan over the data.)

Often the independent, per-processor parts of an algorithm can usefully have independent control structure. However, independence of control structure does not suffice for efficient implementation; the compiler also needs information about independence of data. The problem with the original Fortran D FORALL statement was that it assumed a shared-memory model at the language level: any iteration of a FORALL loop could access any data item.
This model is difficult to support on many architectures. It is difficult or impossible for a compiler to determine whether any given subscripted array access will require interprocessor communication.

We propose to introduce a new kind of subroutine called a "local" subroutine. Calling a local subroutine causes it to be executed on each processor, one copy (one thread) per processor. Ordinary array-subscripting syntax may be used to access any array element that is mapped to the executing processor. Access to array elements in another processor requires a different and intentionally more clumsy mechanism, a call to a library routine. This provides a mechanism for coding per-processor portions of algorithms in such a way that the compiler can trivially determine which accesses are local and which may require communication.

(5) Introduce a LOCALLY statement block

Given the concept of a local subroutine, one can introduce a statement that has the effect of introducing a block of local code within an otherwise global routine. One must understand, however, that within such a block of local code the meaning of each array variable changes; it represents only the portion of the array local to that processor.

(6) Parallel prefix intrinsics

Parallel prefix operations are extremely important in many parallel codes. It's just silly not to include them in a Fortran dialect designed for high parallel performance. There are also good serial implementations of these operations.

================ DETAILS ================

[1] DECOMPOSITION Directive

The DECOMPOSITION directive declares one or more decompositions, specifying for each the name, the rank (number of dimensions), and the size of each dimension. It must appear in the declaration-part of a program unit. A decomposition is simply an abstract space of indexed positions. No storage is allocated for a decomposition.
Syntax:

      decomposition-directive is DECOMPOSITION decomposition-decl-list
      decomposition-decl      is decomposition-name ( explicit-shape-spec-list )
      decomposition-name      is object-name

Examples:

      DECOMPOSITION A(N)
      DECOMPOSITION B(N,N), C(N,2*N)

[2] PROCESSORS Directive

The PROCESSORS directive declares one or more processor arrangements, specifying for each the name, the rank (number of dimensions), and the size of each dimension. It must appear in the declaration-part of a program unit. The product of the dimensions must be greater than zero and not greater than the number of processors to be used to execute the program. No storage is allocated for a processor arrangement. The variable N$PROC may be used to refer to the total number of processors.

Syntax:

      processors-directive is PROCESSORS processors-decl-list
      processors-decl      is processors-name [ ( explicit-shape-spec-list ) ]
      processors-name      is object-name

Examples:

      PROCESSORS P(N)
      PROCESSORS Q(N$PROC), R(N,N$PROC/N)
      PROCESSORS SCALARPROC

If no shape is specified, then the processor arrangement is conceptually scalar and global. Depending on the implementation architecture, data distributed onto such a processor arrangement may reside in a single control processor (if the machine has one) or be replicated over all processors.

[3] ALIGN and REALIGN Directives

The ALIGN directive is used to map data objects onto elements of a decomposition. If two data objects are mapped to the same decomposition element, then the two data objects are said to be aligned. Operations between aligned data objects are likely to be more efficient than operations between data objects that are not known to be aligned. The ALIGN directive is designed to make it particularly easy to specify mappings for all the elements of an array at once. The ALIGN directive may appear only in the declaration-part of a program unit. The REALIGN directive is similar but may appear only in the execution-part of a program unit.
The only syntactic difference between ALIGN and REALIGN is that ALIGN must contain only a specification-expr as a subscript or in a subscript-triplet, whereas REALIGN may use any integer expression.

When applied to a parameter, an ALIGN directive in the declaration-part is assumed to be declarative in nature: the parameter must already be aligned as described, and the compiler may rely on that when compiling the subprogram. (By "must already be aligned", we mean that that is the programmer's responsibility, and a compiler with interprocedural analysis would be justified in issuing an error message on detecting violations of this requirement.) In contrast, a REALIGN directive always forces an array to have the specified alignment, moving data around at run time if necessary.

Syntax:

      align-directive    is ALIGN align-stuff
      realign-directive  is REALIGN align-stuff
      align-stuff        is single-align-stuff
                         or multiple-align-stuff
      single-align-stuff is alignee ( align-source-list ) WITH align-spec
      alignee            is object-name
      align-source       is :
                         or *

In a single-align-stuff directive, every axis of the alignee is specified as either : or *. If it is :, then positions along that axis will be spread out across the matching axis of the align-spec; if it is *, then that axis is collapsed: positions along that axis make no difference in determining the corresponding decomposition element. All array elements are aligned with corresponding positions of the align-spec. If the alignee is a scalar, then it has a replicated representation. If the align-spec is scalar, all elements of the alignees will reside in the same processor.

      multiple-align-stuff is [ ( align-source-list ) ] WITH align-spec :: alignee-list

A multiple-align-stuff directive is equivalent to a series of single-align-stuff directives. All alignees must have the same rank. If the align-source-list is omitted, it is assumed to consist of a list of ":" entries, equal in number to the rank of the alignees.
So the directive

CHPF$ ALIGN ( align-source-list ) WITH align-spec :: alignee-list

is equivalent to the sequence

CHPF$ ALIGN alignee-1( align-source-list ) WITH align-spec
CHPF$ ALIGN alignee-2( align-source-list ) WITH align-spec
      ...
CHPF$ ALIGN alignee-m( align-source-list ) WITH align-spec

      align-spec      is align-name [ ( align-subscript-list ) ]
      align-name      is decomposition-name
                      or object-name
      align-subscript is int-expr
                      or subscript-triplet
                      or *

Every combination of possible values for the index variables selects an element of the array to be aligned. The align-spec indicates a decomposition element, scalar, or array element with which that array element should be aligned. The align-spec must contain exactly as many subscript-triplets as the number of colons ":" appearing in the align-source-list. These are then matched up in corresponding left-to-right order. A "*" as an align-subscript indicates a replicated representation. Thus, for example,

CHPF$ ALIGN A(:) WITH D(:,*)

means that a copy of A is aligned with every column of D. This means that an optimizing compiler can arrange to read whichever copy is closest to hand. (Of course, when an element of A is written, all copies must be updated, not just one copy. Replicated representations are very useful for such tricks as small lookup tables, where it is much faster to have a copy in each physical processor but you don't want to be bothered giving it an extra dimension which is logically unnecessary to the algorithm.)

Note that an align-name may be an object-name as well as a decomposition-name. The main purpose of this is to allow alignment or realignment to a subroutine argument. Aligning to an array element has the same meaning as aligning to the decomposition element to which that array element was aligned.

      SUBROUTINE GLUTINOUS_RICE(P,Q,R,S)
      REAL P(:,:),Q(:,:),R(:,:),S(:,:)
CHPF$ ALIGN WITH P:: Q,R,S
      ...
      END

The directive allows the compiler to assume that the four subroutine arguments will all be aligned one with another, even though nothing is said about how they may be distributed onto processors.

      SUBROUTINE EIGHT_TREASURE_RICE(P,Q,R,S)
      REAL P(:,:),Q(:,:),R(:,:),S(:,:)
CHPF$ REALIGN WITH P:: Q,R,S
      ...
      END

In contrast, this directive says nothing about the alignment of the arguments as they come into the routine, but specifies that they should be forced into alignment before executing the remainder of the subroutine.

(We examined some alternatives to allowing objects as alignment targets, but they all seemed to require giving decompositions first-class status in the language, including allowing them as subroutine parameters and/or in COMMON blocks. This approach seemed less attractive to us, partly because it might require modifying the source code to change subroutine calls rather than merely adding directives.)

More examples of ALIGN directives:

CHPF$ DECOMPOSITION D1(N),D2(N,N)
      REAL,DIMENSION(N,N):: X,A,B,C,AR1,AR2A,P,Q,R,S
CHPF$ ALIGN X(:,*) WITH D1(:)
CHPF$ ALIGN (:,*) WITH D1:: A,B,C,AR1,AR2A
CHPF$ ALIGN WITH D2:: P,Q,R,S

Note that, in a multiple-align, the alignees must all have the same rank but need not all have the same shape; the sizes need match only for dimensions that correspond to colons in the align-source-list. This turns out to be *extremely* important to users of CM Fortran: one of the most common cases is aligning arrays that match in parallel dimensions but may differ in "serial" (collapsed) dimensions:

CHPF$ DECOMPOSITION D1(N)
      REAL A(3,N), B(4,N), C(43,N)
CHPF$ ALIGN (*,:) WITH D1:: A,B,C

The idea here is that you know there are processors (perhaps N of them), and you want arrays of different sizes (3, 4, 43) within each processor. It's okay for the numbers 3, 4, and 43 to be different, because those axes will be collapsed so that array elements with indices differing only along that axis will all reside in the same processor anyway.
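The collapse rule in this last example can be stated as a tiny function (ours, purely illustrative; Python is again used as executable pseudocode): under ALIGN (*,:) WITH D1, an element (i,j) of A, B, or C lives with decomposition element D1(j), and the collapsed first subscript has no effect on placement, which is why the extents 3, 4, and 43 need not agree.

```python
# Mapping implied by  CHPF$ ALIGN (*,:) WITH D1:: A,B,C
# (illustrative sketch, not part of the proposed specification).

def d1_element(i, j):
    """Decomposition element of D1 holding A(i,j), B(i,j), or C(i,j)."""
    return j          # the "*" axis is collapsed: i is ignored;
                      # the ":" axis matches D1's single axis directly

# Elements differing only in the collapsed subscript share a position,
# so arrays with different first extents (3, 4, 43) align without conflict.
assert d1_element(1, 7) == d1_element(43, 7) == 7
```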
[4] DISTRIBUTE and REDISTRIBUTE Directives

The DISTRIBUTE directive specifies a mapping of abstract decomposition elements to processors in a processor arrangement. It must appear in the declaration-part of a program unit. It must appear before any ALIGN directive that refers to the decomposition in question either directly or indirectly. If an array is aligned to a decomposition before the decomposition has been explicitly distributed, then the decomposition is distributed in an implementation-dependent manner onto an implementation-dependent processor arrangement.

The REDISTRIBUTE directive is similar but is considered executable. A decomposition may be redistributed at any time that there is no array aligned to it. (You can detach an array either by using REALIGN to attach it to some other decomposition, or by deallocating the array with a DEALLOCATE statement if that is applicable. It is easy to provide an error check at run time by using reference counts: every freshly created decomposition has a count that is initially zero; aligning an array to a decomposition decrements the count for the decomposition it was formerly aligned with (if any) and increments the count of the decomposition it is to be newly aligned with; and deallocating an array decrements the count of its decomposition. For this error check it is necessary that the dope information for an array include the equivalent of a pointer to the decomposition information.)

The DISTRIBUTE directive may appear only in the declaration-part of a program unit. The REDISTRIBUTE directive is similar but may appear only in the execution-part of a program unit. The only syntactic difference between DISTRIBUTE and REDISTRIBUTE is that DISTRIBUTE must contain only a specification-expr as the argument to BLOCK or CYCLIC, whereas REDISTRIBUTE may use any integer expression.
Syntax:

      distribute-directive   is DISTRIBUTE dist-stuff
      redistribute-directive is REDISTRIBUTE dist-stuff
      dist-stuff             is single-dist-stuff
                             or multiple-dist-stuff
      single-dist-stuff      is distributee ( dist-format-list ) [ ONTO dist-target ]
      multiple-dist-stuff    is [ ( dist-format-list ) ] [ ONTO dist-target ] :: distributee-list
      distributee            is decomposition-name
      dist-format            is BLOCK [ ( specification-expr ) ]
                             or CYCLIC [ ( specification-expr ) ]  ???
                             or *
      dist-target            is processors-name

Examples:

CHPF$ DISTRIBUTE D1(BLOCK)
CHPF$ DISTRIBUTE (BLOCK,*,BLOCK) ONTO P:: D2,D3,D4

Define the ceiling division function CD(J,K) = (J+K-1)/K. Let D be the size of the decomposition in a certain dimension and let N be the size of the processor arrangement in the same dimension. Then BLOCK means the same as BLOCK(CD(D,N)), and BLOCK(M) means that a decomposition element whose index along that dimension is J resides in the processor with index CD(J,M) (M*N must be no smaller than D). CYCLIC means the same as CYCLIC(1). CYCLIC(M) means that a decomposition element whose index along that dimension is J resides in the processor with index 1+MODULO(CD(J,M)-1,N). CYCLIC(M) and BLOCK(M) mean the same thing when M*N >= D. But CYCLIC and BLOCK do not mean the same thing.

If the ONTO clause is present, every distributee in the same DISTRIBUTE directive must have the same rank as the dist-target. If the ONTO clause is omitted, then an implementation-dependent processor arrangement is chosen arbitrarily for each distributee.

A multiple-dist-stuff directive is equivalent to a series of single-dist-stuff directives. All distributees must have the same rank. If the dist-format-list is omitted, it is assumed to consist of a list of "BLOCK" entries, equal in number to the rank of the distributees.
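The BLOCK(M) and CYCLIC(M) mappings just defined can be checked mechanically. The following sketch (ours, purely illustrative; Python as executable pseudocode, 1-origin indices throughout) encodes CD and both mappings, and confirms that the two coincide whenever one block per processor suffices, that is, whenever M*N >= D.

```python
# The distribution mappings of section [4], checked on a small case
# (illustrative sketch, not part of the proposed specification).

def CD(j, k):
    """Ceiling division CD(J,K) = (J+K-1)/K."""
    return (j + k - 1) // k

def block_proc(j, m):
    """BLOCK(M): element J resides in processor CD(J,M); requires M*N >= D."""
    return CD(j, m)

def cyclic_proc(j, m, n):
    """CYCLIC(M): element J resides in processor 1+MODULO(CD(J,M)-1,N)."""
    return 1 + (CD(j, m) - 1) % n

D, N = 8, 4                               # decomposition extent, processor extent

# BLOCK means BLOCK(CD(D,N)) = BLOCK(2): contiguous blocks of two.
assert [block_proc(j, CD(D, N)) for j in range(1, D + 1)] == [1, 1, 2, 2, 3, 3, 4, 4]

# CYCLIC means CYCLIC(1): elements dealt out round-robin.
assert [cyclic_proc(j, 1, N) for j in range(1, D + 1)] == [1, 2, 3, 4, 1, 2, 3, 4]

# With M = 2 we have M*N = 8 >= D, so the MODULO is vacuous and the
# two mappings agree element for element.
assert all(cyclic_proc(j, 2, N) == block_proc(j, 2) for j in range(1, D + 1))
```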
So the directive

CHPF$ DISTRIBUTE ( dist-format-list ) ONTO dist-target:: distributee-list

is equivalent to the sequence

CHPF$ DISTRIBUTE distributee-1( dist-format-list ) ONTO dist-target
CHPF$ DISTRIBUTE distributee-2( dist-format-list ) ONTO dist-target
      ...
CHPF$ DISTRIBUTE distributee-m( dist-format-list ) ONTO dist-target

The directive

CHPF$ DISTRIBUTE ONTO P:: D1,D2,D3

means

CHPF$ DISTRIBUTE D1(BLOCK,BLOCK) ONTO P
CHPF$ DISTRIBUTE D2(BLOCK,BLOCK) ONTO P
CHPF$ DISTRIBUTE D3(BLOCK,BLOCK) ONTO P

[5] LOCAL Subroutines

[Note to myself: Hm. BLOCK, BLOCK(m), and CYCLIC in DISTRIBUTE directives work out well for this, but not CYCLIC(m), which requires each dimension to become two dimensions. That is, CYCLIC(m) requires an index to be split into a greater number of discontiguous pieces when mapping it to processor number and memory address within processor, thereby incurring higher overheads. We should worry about this.]

The basic idea is to add a new kind of subroutine, called a LOCAL subroutine. Code in a LOCAL subroutine runs within a processor. It can access any data within the processor using Fortran syntax; it can also use array notation, but only to operate on data within the processor. To access data outside the processor requires either preparatory communication before running the LOCAL code, or the use of communications primitives such as a message-passing library. A LOCAL subroutine can call another LOCAL subroutine and everything behaves normally.

LOCAL execution is initiated by calling a LOCAL subroutine from ordinary ("global") HPF code. When this occurs, all global arrays accessible to the subroutine (notably arrays passed as arguments) are logically carved up into pieces by physical processor, and each instance of the subroutine sees just the subgrid in its processor.
Now for the details:

A function or subroutine in an HPF program may have the keyword LOCAL preceding its header:

      LOCAL SUBROUTINE PAUL_SIMON(A, B, C)
      LOCAL INTEGER FUNCTION BOY_GEORGE(X)

Such a subprogram unit is compiled in a special way. The main difference is that, in a machine (such as the CM-5) that has a central control processor directing the actions of the mass of individual parallel processors, *all* the code runs on the individual processors; operations that might normally be carried out in the control processor are all compiled for the individual processors. Local subprograms may use array notation. If the individual processors have vector hardware, for example, then one can expect that such per-processor array code would be compiled into per-processor vector operations, etc.

A local subprogram is executed on a processor and executes asynchronously and independently of all other processors. With the exception of returning from such a local subroutine to the global caller that initiated local execution (see below), there is no implicit synchronization of processors. So such a subroutine may use any control structure whatsoever.

One local subprogram can call another local subprogram. This behaves as an ordinary FORTRAN subprogram call.

One may also call a local subprogram from a global program unit; this is how local execution is initiated in the first place. When this occurs, one subprogram call occurs in each physical processor. In this case special rules apply:

(1) Scalar parameters are broadcast, so that each processor sees a copy of the scalar parameter. (Question: is it a run-time error to assign to this parameter? Or maybe it is okay as long as all processors assign the same value. This needs work.)

(2) Array parameters are broken up into subgrids, and the subprogram on a given processor receives as its argument an array representing the subgrid of elements residing on that processor.
So if you have a 50x50 global array on a 16-processor machine, using a 4x4 processor layout, then most of the processors will receive a 13x13 array as argument; six will see 13x11 or 11x13 arrays, and one will see an 11x11 subgrid.

(3) As a local subprogram on a processor returns to a global caller, the processor performs a synchronization function. The result is that the global caller proceeds only when *all* processors have completed execution of the subprogram.

(4) In the case of a local function (as opposed to a subroutine), the result of the call is a one-dimensional array of the results (assuming the result of the function is nominally scalar). The length of this array is equal to the number of physical processors. In the case of an array-valued local function, the ranks and shapes of the per-processor result arrays must be identical, and an extra dimension is appended. So if you have 128 processors and each returns a 4x2 array, the result is 4x2x128.

Within a local subprogram, all the usual array inquiry intrinsics may be used. When applied to subgrids resulting from the breakup of a global parameter, such intrinsics are regarded as applying to the local subgrid. For example:

      REAL A(50,50)
      CALL CINDI_LAUPER(A)
      ...

      LOCAL SUBROUTINE CINDI_LAUPER(X)
      REAL, ARRAY(:,:):: X
      N = SIZE(X,1)
      L = LBOUND(X,1)
      U = UBOUND(X,1)
      ...

Assuming 16 processors again, in a 4x4 layout, some processors would set N to 13, but those on the high edge would set N to 11. All processors set L to 1 and U to the same value as N (13 or 11).

[A piece of rationale: There is a question as to what L and U should be. Should they be equal to the global array indices, so that a processor might see indices from 1:13, or 14:26, or 27:39, or 40:50? Or should they be normalized to start at 1, so that each processor sees a range of 1:13 or 1:11? The former is in some ways more elegant, but the latter may be prone to fewer errors, because old Fortran programmers tend to assume that *all* arrays start at 1.
Taking a cue from the existing Fortran 90 specification for assumed-shape arrays, I conclude that L should be 1 and U should be 13 or 11, as the case may be. End of rationale.]

Another series of intrinsics, GLOBAL_SIZE, GLOBAL_LBOUND, and GLOBAL_UBOUND, may be applied to arrays to determine their global extents. So GLOBAL_SIZE(X,1) would return 50.

Another intrinsic, PROCESSOR_INDEX, could return the 1-origin physical (processor) self-address with respect to a given array and axis. So PROCESSOR_INDEX(X,1) would return 1, 2, 3, or 4, depending on the processor. PROCESSOR_SIZE(X,1) would be 4, the limit on the range of values returned by PROCESSOR_INDEX.

GLOBAL_SHAPE and PROCESSOR_SHAPE are defined in terms of GLOBAL_SIZE and PROCESSOR_SIZE in the same manner that SHAPE is defined in terms of SIZE. So

      GLOBAL_SHAPE(X) is (/ 50, 50 /) on every processor
      PROCESSOR_SHAPE(X) is (/ 4, 4 /) on every processor
      SHAPE(X) is (/ 13, 13 /) in 9 of the processors,
                  (/ 13, 11 /) in 3 of the processors,
                  (/ 11, 13 /) in 3 of the processors, and
                  (/ 11, 11 /) in 1 of the processors.

The GLOBAL_ and PROCESSOR_ intrinsics may also be applied to an array that was allocated locally. In that case the GLOBAL values are the same as the non-GLOBAL values, and all elements are regarded as residing on a single processor. The same is true of an array that is not distributed (for example, one allocated in the control processor).

      LOCAL SUBROUTINE KINKS()
      REAL, ARRAY(40,40):: YOU_REALLY_GOT_ME
      CALL CINDI_LAUPER(YOU_REALLY_GOT_ME)

Then GLOBAL_SIZE(YOU_REALLY_GOT_ME, 1) is 40, GLOBAL_LBOUND(YOU_REALLY_GOT_ME, 1) is 1, GLOBAL_UBOUND(YOU_REALLY_GOT_ME, 1) is 40, and PROCESSOR_INDEX(YOU_REALLY_GOT_ME, 1) is 1.

You can't access array elements outside the local subgrid using array notation, but you can do it with special communication routines:

      CALL ARRAY_FETCH(result, global_source, i1, ..., in)
      CALL ARRAY_STORE(source, global_dest, i1, ..., in)

These transfer a scalar (integer, logical, real, or complex) from source to dest.
The global operand must be the name of an array, and the following arguments are integers used as indexes into the global array of which the named array is presumably a subgrid. (If it's a local array, it still works, but it obviously doesn't have to go outside the executing processor! It's probably a silly thing to do, though, unless you're trying to use a general-purpose routine on a local array.) The intended CM-5 implementation is through the use of messages that generate interrupts at the destination processor; the interrupt routine services read and write requests.

Here is some code that implements bitwise-AND reduction on a one-dimensional array of integers:

      INTEGER FUNCTION BITAND(X)
      INTEGER, ARRAY(:):: X
      INTEGER, ARRAY(PROCESSOR_SIZE(X)):: PARTIAL_RESULTS
C GET EACH PROCESSOR TO COMBINE ITS OWN VALUES AND DELIVER ONE RESULT EACH.
      CALL BITAND_WITHIN_PROCESSOR(X, PARTIAL_RESULTS)
C NOW USE A BINARY TREE TO COMBINE ONE VALUE FROM EACH PROCESSOR.
C CAREFUL FOOTWORK MAKES SURE THAT ANY SIZE WORKS, NOT JUST POWERS OF 2.
      K = PROCESSOR_SIZE(X)
      DO WHILE (K .GT. 1)
         PARTIAL_RESULTS(1:K/2) = IAND(PARTIAL_RESULTS(1:K/2), &
              PARTIAL_RESULTS((K+1)/2+1:K))
         K = (K+1)/2
      END DO
      BITAND = PARTIAL_RESULTS(1)
      END

      LOCAL SUBROUTINE BITAND_WITHIN_PROCESSOR(X, PARTIAL_RESULTS)
      INTEGER, ARRAY(:):: X
      INTEGER, ARRAY(1):: PARTIAL_RESULTS
      INTEGER P
C WITHIN A PROCESSOR, JUST DO A SERIAL LOOP TO AND THINGS UP
C (THIS CODE ASSUMES NO VECTOR PROCESSING IS AVAILABLE).
      P = -1
      DO J = LBOUND(X,1),UBOUND(X,1)
         P = IAND(P, X(J))
      END DO
      PARTIAL_RESULTS(1) = P
      END

In a global program unit, COMMON means global common; the keyword COMMON may optionally be preceded by the word GLOBAL. In a LOCAL program unit, COMMON means local common; global COMMON may be specified by prefixing the word COMMON with the word GLOBAL, and the keyword LOCAL may also appear before the word COMMON.
      SUBROUTINE QUESTION_MARK
      COMMON /FOO/ TEARS(96)       !COULD ALSO HAVE SAID "GLOBAL COMMON"
      CALL THE_MYSTERIANS
      END SUBROUTINE QUESTION_MARK

      LOCAL SUBROUTINE THE_MYSTERIANS
      GLOBAL COMMON /FOO/ TEARS(96)   !EACH PROCESSOR HAS CEILING(96/N) ELEMENTS
      COMMON /BAR/ CRY(4)             !EACH PROCESSOR HAS 4 ELEMENTS
      LOCAL COMMON /BAZ/ TEARDROPS(10000000)  !EACH PROCESSOR HAS 1E7 ELEMENTS
                                              ! (TOO MANY TEARDROPS :-)
      ...
      !EACH PROCESSOR ASSIGNS ONLY TO ITS OWN SUBSET OF THE TEARS.
      FORALL (J=1:UBOUND(TEARS,1)) TEARS(J) = CRY(1+MOD(J,4))
      ...
      END SUBROUTINE THE_MYSTERIANS

[6] LOCALLY Statement

      locally-statement is LOCALLY
                             declarations ???
                             block
                           END LOCALLY

This statement behaves as if its contents were the body of a local subroutine invoked at that point, with all visible objects that it uses passed as parameters of the same name. You may not use LOCALLY within local code (there is no point), even though you may call a local subroutine from local code. So the example above could be written:

      INTEGER FUNCTION BITAND(X)
      INTEGER, ARRAY(:):: X
      INTEGER, ARRAY(PROCESSOR_SIZE(X)):: PARTIAL_RESULTS
C GET EACH PROCESSOR TO COMBINE ITS OWN VALUES AND DELIVER ONE RESULT EACH.
      LOCALLY
      INTEGER P
C WITHIN A PROCESSOR, JUST DO A SERIAL LOOP TO AND THINGS UP
C (THIS CODE ASSUMES NO VECTOR PROCESSING IS AVAILABLE).
      P = -1
      DO J = LBOUND(X,1),UBOUND(X,1)
         P = IAND(P, X(J))
      END DO
      PARTIAL_RESULTS(1) = P
      END LOCALLY
C NOW USE A BINARY TREE TO COMBINE ONE VALUE FROM EACH PROCESSOR.
C CAREFUL FOOTWORK MAKES SURE THAT ANY SIZE WORKS, NOT JUST POWERS OF 2.
      K = PROCESSOR_SIZE(X)
      DO WHILE (K .GT. 1)
         PARTIAL_RESULTS(1:K/2) = IAND(PARTIAL_RESULTS(1:K/2), &
              PARTIAL_RESULTS((K+1)/2+1:K))
         K = (K+1)/2
      END DO
      BITAND = PARTIAL_RESULTS(1)
      END

[7] Parallel prefix (and suffix) intrinsics

For every reduction operation XXX in the language, introduce two new intrinsics, XXX_PREFIX and XXX_SUFFIX.
They take the same arguments as the corresponding reduction intrinsic, plus one additional optional argument called SEGMENT, which is of type logical and conformable with the ARRAY argument (a true element indicates the start of a new segment of the first argument, that is, a place where the running accumulation is to be reset). If the DIM argument is omitted, then the arrays are processed in array element order ("column-major" order), as if temporarily regarded as one-dimensional. In all cases the result has the same shape as the first argument.

[Should there be another optional argument, a scalar logical named EXCLUSIVE, default .FALSE., which determines whether the prefix or suffix operation is inclusive (the default) or exclusive?

      SUM_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

Probably so.]

Examples:

      SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)

      SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE.)

      COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) ) yields (/1,1,1,2,3,4,4,5,5/)

      COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) yields
        (/0,1,1,1,2,3,4,4,5/)

      SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/), SEGMENT=(/T,F,F,F,T,F,T,T,F/) )
        yields (/1,3,6,10,5,11,7,8,17/)

In the last example, the SEGMENT argument divides the input into four segments (starting at elements 1, 5, 7, and 8), and the result contains four independent running sums, one per segment.

[8] Miscellaneous Stuff

(a) Intrinsics POPCNT, POPPAR, and LEADZ.

(b) Reduction versions of IAND, IOR, and IEOR. How about AND, OR, and EOR?

(c) An integer-length intrinsic:

      ILEN(X) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))

This is related to LEADZ but is often much more convenient for the calculation of array dimensions, etc. It is the number of bits required to store the signed integer x (not counting the sign bit). As examples of its use, 2**ILEN(N-1) rounds N up to a power of 2 (for N > 0), whereas 2**(ILEN(N)-1) rounds N down to a power of 2.

(d) Consistent with Fortran 90 and IEEE Std 754-1985, eliminate the operator "<>" and use only the Fortran 90 "/=".
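To pin down the segmented-scan semantics of [7] above, here is a minimal Python model. It is a hedged sketch, not part of the proposal: the function names are ours, only the one-dimensional unmasked case is modeled, and how EXCLUSIVE should interact with SEGMENT is left open in the proposal (here the accumulation simply resets at each segment boundary).

```python
def sum_prefix(array, segment=None, exclusive=False):
    """Running sum in array element order, restarting wherever SEGMENT
    is true; with EXCLUSIVE, each element sees only the sum of the
    elements before it in its segment."""
    result, acc = [], 0
    for i, x in enumerate(array):
        if segment is not None and segment[i]:
            acc = 0  # a true SEGMENT element starts a new segment
        if exclusive:
            result.append(acc)
            acc += x
        else:
            acc += x
            result.append(acc)
    return result

def sum_suffix(array):
    """SUM_SUFFIX is SUM_PREFIX run right-to-left."""
    return sum_prefix(array[::-1])[::-1]

T, F = True, False
print(sum_prefix([1, 3, 5, 7]))  # [1, 4, 9, 16]
print(sum_suffix([1, 3, 5, 7]))  # [16, 15, 12, 7]
print(sum_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9],
                 segment=[T, F, F, F, T, F, T, T, F]))
# [1, 3, 6, 10, 5, 11, 7, 8, 17]
```

COUNT_PREFIX corresponds to summing 1 for each true element, so running this model on 0/1 values with exclusive=True reproduces the COUNT_PREFIX example of [7].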
(e) The MAXLOC and MINLOC intrinsics should have an optional DIM argument. If such an argument is present, then the shape of the result equals the shape of the first argument with one dimension (that indicated by the DIM argument) deleted; it is as if a series of one-dimensional MAXLOC or MINLOC operations were performed. Example: if A has the value

      [ 0 -5  8 -3 ]
      [ 3  4 -1  2 ]
      [ 1  5  6 -4 ]

then MINLOC(A, DIM=1) has the value [ 1, 1, 2, 3 ],
MAXLOC(A, DIM=1) has the value [ 2, 3, 1, 2 ],
MINLOC(A, DIM=2) has the value [ 2, 3, 4 ], and
MAXLOC(A, DIM=2) has the value [ 3, 2, 3 ].

(f) Block FORALL statements:

      FORALL (... e1 ... e2 ...)
        s1
        s2
        ...
        sn
      END FORALL

means exactly the same as

      temp1 = e1
      temp2 = e2
      FORALL (... temp1 ... temp2 ...) s1
      FORALL (... temp1 ... temp2 ...) s2
      ...
      FORALL (... temp1 ... temp2 ...) sn

That is, a block FORALL means exactly the same as replicating the FORALL header in front of each array assignment statement in the block, except that any expressions in the FORALL header are evaluated only once.

(g) Nested WHERE statements. Here is the text of a proposal once sent to X3J3:

| Briefly put, the less WHERE is like IF, the more difficult it is to
| translate existing serial codes into array notation. Such codes tend to
| have the general structure of one or more DO loops iterating over array
| indices and surrounding a body of code to be applied to array elements.
| Conversion to array notation frequently involves simply deleting the DO
| loops and changing array element references to array sections or whole
| array references. If the loop body contains logical IF statements, these
| are easily converted to WHERE statements. The same is true for translating
| IF-THEN constructs to WHERE constructs, except in two cases.
| If the IF constructs are nested (or contain IF statements), or if ELSE IF
| is used, then conversion suddenly becomes disproportionately complex,
| requiring the user to create temporary variables or duplicate mask
| expressions and to use explicit .AND. operators to simulate the effects
| of nesting.
|
| Users also find it confusing that ELSEWHERE is syntactically and
| semantically analogous to ELSE rather than to ELSE IF.
|
| [We] ... propose that the syntax of WHERE constructs be extended and
| changed to have the form
|
|     where-construct is where-construct-stmt
|                          [ where-body-construct ]...
|                        [ elsewhere-stmt
|                          [ where-body-construct ]... ]...
|                        [ where-else-stmt
|                          [ where-body-construct ]... ]
|                        end-where-stmt
|
|     where-construct-stmt is WHERE ( mask-expr )
|
|     elsewhere-stmt is ELSE WHERE ( mask-expr )
|
|     where-else-stmt is ELSE
|
|     end-where-stmt is END WHERE
|
|     mask-expr is logical-expr
|
|     where-body-construct is assignment-stmt
|                          or where-stmt
|                          or where-construct
|
| Constraint: In each assignment-stmt, the mask-expr and the variable
|             being defined must be arrays of the same shape. If a
|             where-construct contains a where-stmt, an elsewhere-stmt,
|             or another where-construct, then the two mask-expr's must
|             be arrays of the same shape.
|
| The meaning of such statements may be understood by rewrite rules. First
| one may eliminate all occurrences of ELSE WHERE:
|
|     WHERE (m1)                          WHERE (m1)
|       xxx                                 xxx
|     ELSE WHERE (m2)      becomes        ELSE
|       yyy                                 WHERE (m2)
|     END WHERE                              yyy
|                                          END WHERE
|                                        END WHERE
|
| where xxx and yyy represent any sequences of statements, so long as the
| original WHERE, ELSE WHERE, and END WHERE match, and the ELSE WHERE is the
| first ELSE WHERE of the construct (that is, yyy may include additional ELSE
| WHERE or ELSE statements of the construct). Next one eliminates ELSE:
|
|     WHERE (m)                           temp = m
|       xxx                               WHERE (temp)
|     ELSE                 becomes          xxx
|       yyy                               END WHERE
|     END WHERE                           WHERE (.NOT. temp)
|                                           yyy
|                                         END WHERE
|
| Finally one eliminates nested WHERE constructs:
|
|     WHERE (m1)                          temp = m1
|       xxx                               WHERE (temp)
|       WHERE (m2)                          xxx
|         yyy              becomes        END WHERE
|       END WHERE                         WHERE (temp .AND. (m2))
|       zzz                                 yyy
|     END WHERE                           END WHERE
|                                         WHERE (temp)
|                                           zzz
|                                         END WHERE
|
| and similarly for nested WHERE statements.
|
| The effects of these rules will surely be a familiar or obvious possibility
| to all the members of the committee; I enumerate them explicitly here only
| so that there can be no doubt as to the meaning I intend to support.
|
| Such rewriting rules are simple for a compiler to apply, or the code may
| easily be compiled even more directly. But such transformations are
| tedious for our users to make by hand and result in code that is
| unnecessarily clumsy and difficult to maintain.
|
| One might propose to make WHERE and IF even more similar by making two
| other changes. First, require the noise word THERE to appear in a WHERE
| and ELSE WHERE statement after the parenthesized mask-expr, in exactly the
| same way that the noise word THEN must appear in IF and ELSE IF statements.
| (Read aloud, the results might sound a trifle old-fashioned--"Where knights
| dare not go, there be dragons!"--but technically would be as grammatically
| correct English as the results of reading an IF construct aloud.) Second,
| allow a WHERE construct to be named, and allow the name to appear in ELSE
| WHERE, ELSE, and END WHERE statements. I do not feel very strongly one way
| or the other about these no doubt obvious points, but offer them for your
| consideration lest the possibilities be overlooked.

[End of proposal as sent to X3J3.]

Now, for compatibility with Fortran 90, HPF should continue to use ELSEWHERE instead of ELSE, but this causes no ambiguity:

      WHERE(...)
        ...
      ELSE WHERE(...)
        ...
      ELSEWHERE
        ...
      END WHERE

is perfectly unambiguous, even when blanks are not significant. Since X3J3 declined to adopt the keyword THERE, it should not be used in HPF either (alas).
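The net effect of these rewrite rules on a full WHERE / ELSE WHERE / ELSE chain can be modeled with explicit masks. Here is a minimal Python sketch (the helper where_chain is ours, purely illustrative): each branch takes effect where its own mask is true and every earlier mask is false, just as the temporaries and .AND. operators in the rewritten form dictate.

```python
def where_chain(branches, default):
    """Elementwise result of a WHERE / ELSE WHERE ... / ELSE chain:
    each (mask, values) branch applies where its mask is true and all
    earlier masks are false; the ELSE values apply where no mask is true."""
    out = []
    for i in range(len(default)):
        for mask, values in branches:
            if mask[i]:
                out.append(values[i])
                break
        else:  # no branch mask was true at position i
            out.append(default[i])
    return out

# A sign function, written as
#   WHERE (A > 0) S = 1; ELSE WHERE (A < 0) S = -1; ELSE S = 0; END WHERE
a = [0, -5, 8, -3]
s = where_chain([([x > 0 for x in a], [1] * 4),
                 ([x < 0 for x in a], [-1] * 4)],
                [0] * 4)
print(s)  # [0, -1, 1, -1]
```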
[9] What about EQUIVALENCE?

What can be said about EQUIVALENCE in HPF? Sometimes it is very convenient to treat a multidimensional array as one-dimensional, or as having some smaller number of dimensions, with some dimensions combined to make a single axis.

[end]