The ON Clause

Charles Koelbel and Rob Schreiber

The purpose of the ON directive is to allow the programmer to control the distribution of computations among the processors of a parallel machine. In a sense, this is the dual of the DISTRIBUTE and ALIGN directives for data. Modern parallel machines achieve their best performance if the operations are divided among the processors (as specified by ON) with each processor accessing its local data (as specified by DISTRIBUTE). If either explicit computation partitioning or explicit data mapping is not present, the system must choose one. Another proposal explains how this might be done.

A secondary purpose of the ON directive is to identify data accessed by the computation as local to the executing processor. The compiler can use this information to avoid generating communication or to simplify array address calculations. Note that whether a given data element is local depends on two facts:
* Where the data is stored (i.e., its DISTRIBUTE and ALIGN attributes)
* Where the computation is executed (i.e., its ON directive)
For these reasons, the LOCAL clause is added to the ON directive, which is effectively the earliest point in the program where the needed facts might be available. Note that changing the ON directive may invalidate some LOCAL clauses, or may make some new LOCAL clauses true.

Syntax

There are two flavors of the ON directive: a single-statement form and a multi-statement form. The BNF for both is

  simple-on-directive IS ON HOME ( home-expr ) [, LOCAL ( local-var-list ) ]

  block-on-directive  IS simple-on-directive BLOCK

  end-directive       IS END [ BLOCK ]

  on-block            IS block-on-directive
                           block
                         end-directive

  home-expr           IS variable
                      OR template-elmt
                      OR processors-elmt

  template-elmt       IS template-name [ ( section-subscript-list ) ]

  processors-elmt     IS processors-name [ ( section-subscript-list ) ]

  local-var           IS variable
                      OR ALL = variable-name

Notes:

Constraints need to be added above...
Also, changes are needed in some higher-level syntax (for example, adding simple-on-directive as an option to executable-directive). home-expr, template-elmt, and processors-elmt are just auxiliary syntax categories.

Note that "variable" is an F90 syntax term meaning (roughly) "a reference, including an array element, array section, or derived type field"; "variable" does not include template or processor elements, since those are not first-class language constructs. Note also that "block" is an F90 syntax term for "a series of statements treated as a group" - for example, the body of a DO construct.

simple-on-directive should be an option under executable-directive (an HPF syntax term), as redistribute-directive is now. on-block most naturally fits as an F90 executable-stmt, since that makes constraints about "properly nesting" easier to state.

Any top-level variables in the LOCAL clause must be explicitly mapped. Otherwise, the assertion (see below) is a statement about how the compiler works, not about the program. (Alternately, somebody could suggest another definition of LOCAL, I suppose...)

Rationale: The BLOCK in on-block is necessary so that the parser can distinguish between the one-statement and multi-statement forms. BEGIN would be equally good as a keyword.

HOME is required, even when its argument is a processor. It seems more natural to eliminate the HOME keyword there, but this leads to the following ambiguity:

      INTEGER X(4)
!HPF$ PROCESSORS HOME(4)
!HPF$ DISTRIBUTE X(BLOCK)    ! X(I) will be on processor I
      X = (/ 4, 3, 2, 1 /)
!HPF$ ON HOME(X(2))
      X(2) = X(1)

If HOME were not required, where should the computation be done:
1. On processor HOME(2) (i.e., the owner of X(2))?
2. On processor HOME(3) (i.e., the value of X(2) before the assignment)?
3. On processor HOME(4) (i.e., the value of X(2) after the assignment)?
The definition of ON clearly indicates that interpretation 1 is correct.
One can get the effect of interpretation 2 by the directive

!HPF$ ON HOME(HOME(X(2)))

There is no way to get the effect of interpretation 3, short of building a clairvoyant computer. Introducing reserved keywords into Fortran was suggested as a better solution to this problem, but was seen as too large a change to the underlying language. End of rationale.

Semantics

The ON directive advises the compiler to perform the computation in the following statement(s) using the processor(s) named in the argument to HOME. Like the mapping directives ALIGN and DISTRIBUTE, this is advice rather than an absolute commandment; the compiler may override it if necessary. Also like ALIGN and DISTRIBUTE, the ON directive may affect the efficiency of the computation, but not the final results.

Advice to implementors: If the compiler may override the user's advice in an ON directive, then it should also offer the user an option to force all directives to be obeyed. End of advice to implementors.

The single-statement ON directive (i.e., simple-on-directive) precedes the statement for which it is advising behavior, and is said to apply to that statement. If the statement is a compound statement (e.g., a DO loop or an IF-THEN-ELSE construct), then the ON directive also applies to all statements nested therein. Similarly, the block ON directive (i.e., on-block) applies its ON clause to all statements up to the matching end directive.

It is legal to nest ON directives, provided that the processors named by the inner ON directive are also named by the outer directive. Similarly, an ON directive can apply to a CALL statement where the called subroutine contains an ON directive. In this case, any processor named inside the subroutine must be included by the ON directive in the caller. Note that the syntax of on-block automatically ensures that it is properly nested inside other compound statements, and that compound statements properly nest inside of it.
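The nesting rule above amounts to a set-inclusion check. The following Python sketch (the set representation and all names are illustrative, not part of HPF) tests whether an inner ON directive is conforming with respect to its enclosing one:

```python
# Hypothetical model: processor sets are Python sets of processor ids
# (here, 2-D arrangement coordinates). Not part of HPF itself.

def nested_on_conforming(outer_home, inner_home):
    """An inner ON directive is legal only if every processor it names
    is also named by the enclosing ON directive."""
    return set(inner_home) <= set(outer_home)

# Outer ON names a row of processors; inner ON names one of them: legal.
print(nested_on_conforming({(1, 1), (1, 2), (1, 3)}, {(1, 2)}))   # True
# Inner ON names a processor outside the outer set: not conforming.
print(nested_on_conforming({(1, 1), (1, 2)}, {(2, 1)}))           # False
```

The same check applies across call boundaries: the callee's HOME set must be contained in the caller's.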
As with other Fortran 90 compound statements, transfer of control to the interior of an on-block from outside the block is prohibited, while transfers within the block may occur. However, HPF also prohibits transfers of control from the interior of an on-block to outside the on-block. Note that this is stricter than Fortran 90.

Rationale: This allows a fork-join approach to nested parallelism, which seems natural to many applications and programmers. The restrictions on control flow into and out of an ON block essentially make it a single-entry, single-exit region, thus simplifying the semantics considerably. End of rationale.

The HOME clause can name a program variable, a template, or a processors arrangement. For each of these possibilities, it can further specify a single element or multiple elements. This is translated into the processors executing the computation as follows:
* If the HOME clause names a program variable (that is, an array element or section), then every processor owning any element of that variable will execute the following statement(s).
* If the HOME clause names a template element or section, then every processor owning any element of that template section will execute the following statement(s).
* If the HOME clause names a processors arrangement, then the processor(s) referenced there will execute the following statement(s).
In all cases, the value(s) of the HOME clause are determined as if the HOME argument were evaluated when control flow reached the ON directive. That is, if the value of a variable used in the HOME clause changes within the on-block, work is not migrated to a new processor. If evaluation of the argument of the HOME clause would change the value of any variable in the program, then the value of that variable becomes undefined after the ON clause is reached. Finally, note that the HOME clause only specifies how computation is partitioned among processors; it does not indicate processors that may be involved in data transfer.
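The owner-set rules above can be illustrated with a small model. The Python sketch below assumes a 1-D BLOCK distribution of an N-element array over P processors (the owner formula and all names are illustrative, not taken from the proposal); it computes the set of processors that would execute under ON HOME(A(lo:hi)):

```python
# Illustrative model, not part of HPF: owners of a BLOCK-distributed
# 1-D array, and the executing set for a HOME clause naming a section.

def block_owner(i, n, p):
    """0-based processor owning element i (1-based) under BLOCK."""
    bs = -(-n // p)              # block size = ceil(n/p), as in HPF BLOCK
    return (i - 1) // bs

def home_processors(lo, hi, n, p):
    """Every processor owning any element of A(lo:hi) executes."""
    return {block_owner(i, n, p) for i in range(lo, hi + 1)}

# N=100 over 4 processors: blocks are 1-25, 26-50, 51-75, 76-100.
print(home_processors(20, 30, 100, 4))   # {0, 1}: the section spans two blocks
```

A HOME clause naming a single element yields a single-processor set; a section straddling a block boundary yields several executing processors, as shown.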
Rationale: This is the heart of the ON clause. The "as if" wording and making side effects undefined avoid the following problem: ON is a directive, and therefore cannot have side effects; but implementing the ON clause may require evaluating some functions, thereby causing those side effects. An alternative would be to allow only PURE functions in the HOME and LOCAL clauses. End of rationale.

Advice to implementors: If the HPF program is compiled into Single-Program-Multiple-Data (SPMD) code, then the ON clause can always be implemented (albeit inefficiently) by having all processors compare their processor id to an id (or list of ids) generated from the HOME clause. (Similar naive implementations can be constructed in other paradigms as well.) If the ON clause will be executed repeatedly, for example in a DO loop, it is probably worthwhile to invert this process. That is, instead of all processors executing all the HOME clause tests, the compiler should determine the range of loop iterations that will give a true test. (See the "Advice to implementors" in the Examples section below for more details.) For example, consider the following complex case:

      DO I = 1, N
!HPF$   ON HOME( A(MY_FCN(I)) ) BLOCK
        ...
!HPF$   END BLOCK
      END DO

Here, the generated code can perform an "inspector" (i.e., a skeleton loop that only evaluates the HOME clause of each iteration) to produce a list of iterations assigned to each processor. This list can be produced in parallel, since MY_FCN must be side-effect free; however, distributing it to all processors may require unstructured communication patterns, possibly negating the advantage of parallelism. In general, more advanced compilers will be able to efficiently invert more complex HOME clauses. It is recommended that the abilities (and limitations) of a particular compiler be documented clearly for potential users. End of advice to implementors.

Advice to programmers: The argument to HOME in an ON clause can be arbitrarily complex.
This is a two-edged sword: it can express very complicated computation partitionings, but the implementation of those partitionings may not be efficient. More concretely, it may express a perfectly load-balanced computation, but force the compiler to serialize the evaluation of the HOME clauses. Although the amount of overhead for an ON clause will vary with the HPF code, the compiler, and the hardware, one can expect compilers to generate very good code based solely on array mappings, and progressively worse code as the complexity of the HOME clause increases. A rough measure of the complexity of an ON directive is the amount of run-time data used to compute it; for example, a constant offset is fairly simple, while a permutation array is very complex. See the Examples section below for more concrete examples of this phenomenon. End of advice to programmers.

If ON clauses are nested, then the innermost HOME clause effectively controls execution of the statement(s). A programmer can think of this as successively restricting the set of processors at each level of nesting; clearly, the last restriction must be the strongest. Similarly, if ON clauses appear in a subroutine call chain, then the deepest ON clause is the effective controller.

Rationale: This reinforces the idea of fork-join parallelism pointed out earlier. End of rationale.

Advice to implementors: Notice that nested ON clauses do not present additional problems for the naive implementation. If a processor failed the outer HOME comparison, then it would fail any tests inside the ON as well; thus, it need not even make those tests. End of advice to implementors.
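The naive SPMD implementation mentioned above extends directly to nested ON directives. The following Python sketch (processor ids, HOME sets, and all names are illustrative) has each simulated processor guard the block with a membership test; a processor that fails the outer test never evaluates the inner one:

```python
# Sketch of the naive SPMD scheme: every processor evaluates the HOME
# test against its own id. With nesting, failing the outer guard means
# the inner guard is never reached. All names are illustrative.

def run_on_block(my_id, outer_home, inner_home, trace):
    if my_id in outer_home:              # outer ON HOME(...) guard
        trace.append(("outer", my_id))
        if my_id in inner_home:          # inner guard: a subset of outer
            trace.append(("inner", my_id))

trace = []
for pid in range(4):                     # simulate 4 SPMD processors
    run_on_block(pid, outer_home={1, 2}, inner_home={2}, trace=trace)
print(trace)   # [('outer', 1), ('outer', 2), ('inner', 2)]
```

Processors 0 and 3 fall through both guards without work, matching the observation that a processor excluded by the outer HOME comparison need not make the inner tests at all.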
Operations controlled by an ON clause must follow certain restrictions:
* If an ON directive applies to a DISTRIBUTE directive, then the set of processors named in the HOME clause must include
  - All processors named in the DISTRIBUTE's ONTO clause, if one is present
  - All processors in the default processors arrangement, if there is no ONTO clause
* If an ON directive applies to a REDISTRIBUTE directive, then the set of processors named in the HOME clause must include
  - All processors that stored any element of the distributee before the REDISTRIBUTE was encountered
  - The processors required by an equivalent DISTRIBUTE directive
* If an ON directive applies to a REALIGN directive, then the set of processors named in the HOME clause must include
  - All processors that store any element of the align-with-clause
  - All processors that stored any element of the alignee before the REALIGN was encountered
  - The processors required by an equivalent ALIGN directive
* If an ON directive applies to an ALLOCATE statement that creates an explicitly mapped variable, then the set of processors named in the HOME clause must include the processors required by the associated DISTRIBUTE or ALIGN directive.
* If an ON directive applies to any other declaration of an explicitly mapped variable, then the set of processors named in the HOME clause must include the processors required by the associated DISTRIBUTE or ALIGN directive.
If operations within an ON block do not follow these constraints, then the program is not HPF-conforming.

Rationale: Operations that allocate memory require the cooperation of all processors that will own that memory. Therefore, ON clauses must schedule those operations to execute on the cooperating processors. End of rationale.
Advice to implementors: These restrictions ensure that HPF data distribution directives inside an ON block (either lexically or in a dynamic call chain) can be implemented without relying on one-way communication outside of the "current" processing group. End of advice to implementors.

The LOCAL clause, if present, is an assertion to the compiler that certain array references made within the ON block refer to data stored in local memory if the computation is performed by the processor(s) named in the HOME clause. Specifically, if an element in the local-var list is a variable, array element, or array section, then that element will be stored on at least one of the processors named in the HOME clause. If the local-var is of the form ALL=variable-name, then all references to that variable to which the ON applies will be stored on at least one of the named processors. As with the HOME clause, the value(s) of the LOCAL clause are determined as if the local-var-list were evaluated when control flow reached the ON directive, and possible side effects in the LOCAL clause cause the affected variables to become undefined.

Rationale: The ALL=variable-name form is syntactic sugar for listing every reference to an array in the ON block. This is similar in spirit to APR's grouping of IGNORE directives by variable. I'm open to better syntax for saying this. (Note that ALL(variable-name) doesn't work, since somebody might name a variable ALL. Grr...) The pseudo-evaluation of the LOCAL clause is the same as for the HOME clause, for the same reason. Treating the LOCAL clause as naming sets of variables, rather than as a list of lexical references, gives a reasonably firm logical basis to what the assertion means. It is unclear whether compilers would rather see this "symbolic" information or longer lists that need to be pattern-matched. It is also unclear which approach would appeal more to programmers. End of rationale.
Note that if the HOME clause specifies more than one processor, then LOCAL only asserts that the variables are stored on one of those processors. For example, if a statement is executed on a section of a processors arrangement, then communication within that section may be needed for some variables in the LOCAL clause. Communication with processors outside of the section will not be needed for those variables, however.

Rationale: The alternative to this interpretation would be that any variable named in the LOCAL clause is local to all processors, i.e., replicated. While that certainly allows more extensive optimizations, it is a less common case. In addition, it does not seem to capture the intent of ON directives applied to CALL statements or compound statements. For example,

!HPF$ PROCESSORS PROCS(MP,MP)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK) ONTO PROCS
!HPF$ ON HOME(PROCS(1,1:MP)), LOCAL( X(K,1:N) )
      CALL FOO( X(K,1:N) )

would presumably call FOO on a row of the processors arrangement, passing elements of X in place. This is what the current definition does; if LOCAL meant "local on every processor", the call would force X to be replicated. End of rationale.

The LOCAL assertion is similar to the INDEPENDENT directive, in that if it is correct it does not change the meaning of the program. If the LOCAL clause is incorrect, the program is not standard-conforming (and its behavior is thus undefined). Like the INDEPENDENT directive, the compiler may use the information in the LOCAL clause, or ignore it if it is insufficient for the compiler's purposes. If the compiler can detect that the LOCAL clause is incorrect (i.e., that a LOCAL variable is definitely nonlocal), it is justified in producing a warning.
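The "stored on at least one of the named processors" rule can be modeled directly. The Python sketch below (assuming a 1-D BLOCK distribution; the owner formula and all names are illustrative) checks whether a LOCAL assertion holds for a given HOME set:

```python
# Sketch: a LOCAL clause asserts only that each named element lives on
# *some* processor in the HOME set, not necessarily on the processor
# executing a given iteration. All names are illustrative.

def block_owner(i, n, p):
    bs = -(-n // p)                      # block size = ceil(n/p)
    return (i - 1) // bs

def local_assertion_holds(elements, home_procs, n, p):
    return all(block_owner(i, n, p) in home_procs for i in elements)

# Suppose ON HOME(A(26:50)) executes on processor 1 of 4 (N=100).
# LOCAL(A(30), A(40)) holds; LOCAL(A(60)) would not.
print(local_assertion_holds([30, 40], {1}, 100, 4))   # True
print(local_assertion_holds([60], {1}, 100, 4))       # False
```

With a multi-processor HOME set, the check passes as long as each element's owner is somewhere in that set, which is exactly why communication within the set may still be needed.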
Unlike the INDEPENDENT directive, however, the truth of the LOCAL clause depends on both the mapping of computations (specified with the ON clause) and the mapping of data (specified with DISTRIBUTE and ALIGN clauses); if the compiler overrides either of these, then it may not be able to use the information in the LOCAL clause.

Rationale: Knowing that a reference is local is valuable information for the optimizer. It is in keeping with the spirit of HPF to phrase this as an assertion of fact, which the compiler can use as it pleases. Expressing it as advice to the compiler seems to have disadvantages. Some possible ways this advice could be phrased, and the counter-arguments, are:
* "Don't generate communication for this reference" has great potential for changing the meaning of the program. Some programmers want this capability, but it violates the "correct directives should not change the meaning of a program" principle of HPF. Also, once communication is "turned off" for a reference, it is not clear how to turn it back on.
* "Generate communication for this reference" is not a useful directive, since the compiler has to do this anyway.
* "Generate communication for this reference, and place it here" is useful, since it can override the default placement by the compiler. It still has the potential to change the program's meaning. It also has the potential to create programs as complex as message-passing, as programmers try to move communication out of loops.
End of rationale.

Examples

The following are valid examples of ON directives. Most of them are "reasonable" in the sense that they illustrate idioms that programmers might want to use, rather than contrived situations. For simplicity, the first several examples assume the following array declarations:

      REAL A(N), B(N), C(N), D(N)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)

One of the most commonly requested capabilities for HPF was to control how loop iterations are assigned to processors.
(Historically, the ON clause first appeared in exactly this role in the Kali FORALL construct.) This can be done with the ON directive, as shown in the following examples:

!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$   ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
      END DO

!HPF$ INDEPENDENT
      DO J = 2, N-1
!HPF$   ON HOME(A(J+1)) BLOCK
        A(J) = B(J+1) + C(J+1) + D(J+1)
!HPF$   END BLOCK
      END DO

The ON directive in the I loop advises the compiler to have each processor run over its local section of the A array (and therefore of B as well). The references to B(I-1) and B(I+1) must be fetched from off-processor for the first and last iterations on each processor (except on the boundary processors); note that those processors are not mentioned in the HOME clause. The ON directive in the J loop advises the compiler to "shift" computations so that each processor does a vector sum of its local sections of B, C, and D, stores the first element of the result on the processor to its left, and stores the rest of the result (shifted by one) in A. It is worth noting that the directives would still be valid (and minimize nonlocal data accesses) if the arrays were distributed CYCLIC, although the number of nonlocal references would be much higher.

Advice to implementors: It is highly recommended that compilers concentrate on optimizing DO loops with a single ON clause including the entire loop body. Schematically, the code will be:

      DO i = lb, ub, stride
!HPF$   ON HOME(array(f(i))) BLOCK
        body
!HPF$   END BLOCK
      END DO

where array has some data mapping. Assume the mapping gives processor p the elements my_set(p). (In a BLOCK distribution, for example, my_set(p) is a contiguous range of integers.) Then the generated code on processor p should be

      DO i in [lb:ub:stride] intersect f^-1(my_set(p))
        body
      END DO

(This schematic does not show where communication or synchronization must be placed; that must be derived from analysis of the body.)
Moreover, f is most likely to be the identity function or a linear function with integer coefficients, both of which can be inverted easily. Given this, techniques for iterating through the resulting set can be found in several recent conference proceedings. End of advice to implementors.

Advice to users: One can expect the I loop above to generate efficient code for the computation partitioning. In effect, the compiler will arrange for each processor to iterate over its own section of array A. The J loop is slightly more complex, since the compiler must find the inverse of the HOME clause's subscripting function. That is, the compiler must solve K=J+1 for J, where K ranges over the local elements of A. Of course, in this case J=K-1; in general, linear functions can be inverted by the compiler. (It should be pointed out, however, that complex combinations of ALIGN and DISTRIBUTE may make the description of K unwieldy, and this may add overhead to the inversion process.) End of advice to users.

Sometimes it is advantageous to "split" an iteration between processors. The following case shows one example of this:

!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$   ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
!HPF$   ON HOME(C(I+1))
        C(I+1) = A(I) * D(I+1)
      END DO

Due to the first ON clause, the reference to A(I) is local in the first statement. The second ON clause makes A(I) nonlocal (for some values of I) there. This maximizes the data locality in both statements, but does require data movement between the two.

Advice to implementors: If there are several non-nested ON clauses in a loop, then the schematic above needs to be generalized. In essence, the iteration range for each individual ON clause must be generated; a processor will then iterate over the union of these ranges. Statements guarded by an ON directive must now be guarded by an explicit test.
In summary, the code for

      DO i = lb, ub, stride
!HPF$   ON HOME(array1(f1(i)))
        stmt1
!HPF$   ON HOME(array2(f2(i)))
        stmt2
      END DO

on processor p becomes

      set1 = [lb:ub:stride] intersect f1^-1(my_set1(p))
      set2 = [lb:ub:stride] intersect f2^-1(my_set2(p))
      DO i in set1 union set2
        IF (i in set1) THEN
          stmt1
        ENDIF
        IF (i in set2) THEN
          stmt2
        ENDIF
      END DO

where my_set1(p) is the local set for array1, and my_set2(p) is the local set for array2. (Again, synchronization and communication must be handled by other means.) Code transformations such as loop distribution and loop peeling can be used to eliminate the tests in many cases. They will be particularly profitable if there are data dependences between the ON blocks. End of advice to implementors.

Advice to users: Splitting an iteration like this is likely to require either additional tests at runtime or additional analysis by the compiler. Even if the compiler can generate low-overhead scheduling for the individual ON clauses, combining them is not necessarily low-overhead. The locality benefits must be rather substantial for this to pay off, but there are cases where multiple ON clauses are valuable. (All these statements are particularly true if one ON block uses data computed in another one.) End of advice to users.

Because ON clauses nest naturally, they can be useful for expressing parallelism along different dimensions. Consider the following example:

      REAL X(M,M)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK)

!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$   ON HOME(X(:,J)) BLOCK
        DO I = 2, M
!HPF$     ON HOME(X(I,J))
          X(I,J) = (X(I-1,J) + X(I,J)) / 2
        END DO
!HPF$   END BLOCK
      END DO

Each iteration of the J loop is executed by a column of the processors arrangement. The I loop further subdivides the computation, giving each processor responsibility for computing the elements it owns. Many compilers would have chosen this computation partitioning automatically for such a simple example.
However, the compiler might have attempted to fully parallelize the outer loop, executing each inner loop sequentially on one processor. (This might be attractive on a machine with very fast communication.) By inserting the ON clauses, the user has advised against this strategy, thus trading additional locality for restricted parallelism. Notice that the ON directive neither requires nor implies the INDEPENDENT assertion. In both nests, each iteration of the I loop depends on the preceding iteration, but the ON directive can still partition the computation among processors. The ON directive does not automatically make a loop parallel.

Advice to implementors: "Dimension-based" nesting, as above, will probably be a common case. The HOME clauses can be inverted at each level, treating indices from outer loops as run-time invariants. End of advice to implementors.

Advice to programmers: Nested ON directives will tend to have efficient implementations if their HOME clauses refer to different dimensions of the processors arrangements, as in the above example. This minimizes the interaction between the levels of the loops, simplifying the implementation. End of advice to programmers.

Consider the following variation on the above example:

      REAL Y(M,M)
!HPF$ DISTRIBUTE Y(BLOCK,*)

!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$   ON HOME(Y(:,J)) BLOCK
        DO I = 2, M
!HPF$     ON HOME(Y(I,J))
          Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
        END DO
!HPF$   END BLOCK
      END DO

Note that the ON clauses have not changed, except for the name of the array. The interpretation is similar to the above, except that the outer ON directive assigns each iteration of the J loop to all of the processors. The inner ON directive again implements a simple owner-computes rule. The programmer has directed the compiler to distribute a serial computation across all the processors. There are a few scenarios where this is more efficient than parallelizing the outer loop:
1. Parallelizing the outer loop will generate many non-local references, since only a part of each column is on any one processor. If nonlocal references are very expensive (or if M is relatively small), this overhead may outweigh any gain from parallel execution.
2. The compiler may take advantage of the INDEPENDENT directive to avoid inserting any synchronization. This allows a natural pipelined execution: a processor will execute its part of the I loop for one value of J, then immediately go on to the next J iteration. Thus, the first processor will start on J=2 while the second receives the data it needs (from processor one) for J=1. (A similar pipeline would develop in the X example above.)
Clearly, the suitability of these ON clauses will depend on the underlying parallel architecture.

Advice to programmers: This example points out how ON may improve software engineering. While the "value" of HOME(X(I)) will change if X's mapping changes, its intent will usually stay the same - run the loop "aligned with" the array X. Moreover, the form of the clauses is portable, and they simplify experimenting with alternative computation partitionings. Both qualities are similar to the advantages of DISTRIBUTE and ALIGN over low-level data layout mechanisms. End of advice to programmers.

ON directives are particularly useful when the compiler cannot accurately estimate data locality, for example when the computation uses indirection arrays. Consider three variations of the same loop:

      REAL X(N), Y(N)
      INTEGER IX(M), IY(M)
!HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
!HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)

!HPF$ INDEPENDENT
      DO I = 1, N
!HPF$   ON HOME( X(I) )
        X(I) = Y(IX(I)) - Y(IY(I))
      END DO

!HPF$ INDEPENDENT
      DO J = 1, N
!HPF$   ON HOME( IX(J) )
        X(J) = Y(IX(J)) - Y(IY(J))
      END DO

!HPF$ INDEPENDENT
      DO K = 1, N
!HPF$   ON HOME( X(IX(K)) )
        X(K) = Y(IX(K)) - Y(IY(K))
      END DO

In the I loop, each processor runs over its own section of the X array. Only the reference X(I) is guaranteed to be local.
(If M<>N, then IX and IY have a different block size than X, and thus a different mapping.) However, if it is _usually_ the case that X(I), Y(IX(I)), and Y(IY(I)) are located on the same processor, then this mapping may be the best one available. If X(I) and Y(IX(I)) are _always_ on the same processor, then the LOCAL clause should be added:

!HPF$ ON HOME( X(I) ), LOCAL( Y(IX(I)) )

This will avoid communication setup overhead on most systems, and there is little chance that the compiler could deduce this automatically. If both references to Y are _always_ on the same processor as X(I), then further improvement is possible and desirable:

!HPF$ ON HOME( X(I) ), LOCAL( ALL=Y )

In the J loop, the references IX(J) and IY(J) are always local. This is the most common array reference class in the loop, so this partitioning minimizes the number of nonlocal data references in the absence of any special properties of IX and IY. It may not evenly balance the load among processors; for example, if N=M/2 then half the processors will be idle. As before, if the values in IX or IY ensure that one of the Y references is always local, a LOCAL assertion should be added.

In the K loop, only the reference Y(IX(K)) is guaranteed to be local (because Y and X have the same distribution). However, the values stored in IX and IY may ensure that Y(IY(K)) and X(K) are always local, a fact that should be noted if true:

!HPF$ ON HOME( X(IX(K)) ), LOCAL( Y(IY(K)), X(K) )

Even if the three REAL values are not always, but merely "usually", on the same processor, this may be a good computation partitioning for both locality and parallelism. However, these advantages must be weighed against the cost of computing the partitioning. Since the HOME clause depends on a (presumably large) array of runtime values, substantial time may be required to determine which iterations are assigned to each processor.
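The inspector described earlier makes this cost concrete. The following Python sketch (with an assumed 1-D BLOCK distribution; the owner formula and all names are illustrative) evaluates the HOME clause X(IX(K)) once per iteration to build the per-processor iteration lists for the K loop:

```python
# Sketch of the "inspector" for the K loop: evaluate HOME(X(IX(K))) for
# every K to record which processor executes which iterations. The
# result can be reused while IX and X's mapping are unchanged.

def block_owner(i, n, p):
    return (i - 1) // (-(-n // p))       # block size = ceil(n/p)

def run_inspector(ix, n, p):
    iters = {q: [] for q in range(p)}
    for k, target in enumerate(ix, start=1):   # K = 1..M, IX is 1-based
        iters[block_owner(target, n, p)].append(k)
    return iters

ix = [5, 60, 12, 99, 30]                 # illustrative IX; N=100, 4 procs
print(run_inspector(ix, 100, 4))
# {0: [1, 3], 1: [5], 2: [2], 3: [4]}
```

The inspector's cost is one owner computation per iteration, linear in M, which is exactly the "substantial time" trade-off noted above.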
It should be clear from this discussion that there is no magic solution for handling complex computation partitionings; the best answer is usually a combination of application knowledge, careful data structure design (including ordering of the elements), and efficient compilation methodology and runtime support.

Advice to implementors: The K loop is the situation that the inspector strategy described above is designed for. If there is an outer loop around any of these examples, and that loop does not modify the distribution of X or the values of IX, then a record of each processor's iterations can be saved for reuse. The cost is at worst linear in the sizes of the arrays. End of advice to implementors.

Advice to users: It is unlikely that any production compiler will generate low-overhead code for the K loop above in the near term. The difference from the previous examples is that the HOME clause is not a function that can be easily inverted by the compiler. Some compilers may choose to execute every iteration on all processors, testing the HOME clause at run time; others may pre-compute a list of iterations for every processor. Of course, the cost of computing the list will be substantial. In practice, one would make all the arrays the same size to avoid some of the alignment problems above; the example was written this way for pedagogical reasons, not as an example of good data structure design. End of advice to users.

Explicit use of processors arrangements in ON directives is usually associated with task parallelism. Many examples can be found there (I assume...) The following example illustrates how processors can be used for a one-dimensional domain decomposition algorithm:

!HPF$ PROCESSORS PROCS(NP)
!HPF$ DISTRIBUTE X(BLOCK) ONTO PROCS

      ! Compute ILO(IP) = lower bound on PROCS(IP)
      ! Compute IHI(IP) = upper bound on PROCS(IP)
      DONE = .FALSE.
      DO WHILE (.NOT. DONE)
!HPF$   INDEPENDENT, NEW( ILO, IHI )
        DO IP = 1, NP
!HPF$     ON HOME(PROCS(IP)), LOCAL( X(ILO(IP):IHI(IP)) )
          CALL SOLVE_SUBDOMAIN( IP, X(ILO(IP):IHI(IP)) )
        END DO
!HPF$   ON HOME(X) BLOCK
        CALL SOLVE_BOUNDARIES( X, ILO(1:NP), IHI(1:NP) )
        DONE = CONVERGENCE_TEST( X, ILO(1:NP), IHI(1:NP) )
!HPF$   END BLOCK
      END DO

The algorithm divides the entire computational domain (array X) into NP subdomains, one for each processor. The INDEPENDENT IP loop performs a computation on each subdomain's interior. The processors then collaborate to update the boundaries of the subdomains and test for convergence. The subroutine SOLVE_SUBDOMAIN can use a transcriptive or descriptive mapping for its array argument, placing it on a single processor. An ON block encompassing the entire procedure could then ensure the computation proceeded on a single processor. Subroutines SOLVE_BOUNDARIES and CONVERGENCE_TEST may well have their own loops similar to the IP loop, with similar LOCAL clauses. Note that only the lower and upper bound of each subdomain is recorded; this allows different processors to process different-sized subdomains. However, each subdomain must "fit" into one processor's section of the X array.

Advice to implementors: The IP loop above is likely to be a common idiom among programmers writing block-structured codes. In general, it can be implemented by inverting the HOME clause as was done above. In the one-to-one case shown here (probably very popular with programmers), it can be implemented by assigning the processor id to the loop index variable and testing the range of the loop (once). End of advice to implementors.

Advice to programmers: Some compilers will propagate the ON information from the caller to the callee at compile time, and some at run time. Repeating the ON clause in both the caller and the callee will tend to give the compiler better information, resulting in better generated code. Again, note the usefulness of LOCAL clauses in giving the compiler information.
Few compilers would be able to unravel nontrivial assignments to ILO and IHI, and no current compiler would even attempt to understand the comments in the above code fragment. End of advice to programmers.

Because it is an assertion of fact, the compiler can draw many inferences from a single LOCAL clause. For example, consider the following case:

!HPF$ ALIGN Y(I) WITH X(I)
!HPF$ ALIGN Z(J) WITH X(J+1)
!HPF$ ON HOME( X(K) ), LOCAL( X(INDX(K)) )
      X(K) = X(INDX(K)) + Y(INDX(K)) + Z(INDX(K))

The compiler is justified in making the following assumptions in compiling the assignment statement (assuming it honors both the ALIGN directives and the ON directive):

* X(K) requires no communication (because of the HOME clause)
* X(INDX(K)) requires no communication (because of the LOCAL clause)
* Y(INDX(K)) requires no communication (because Y has the same mapping as X)

The compiler cannot make any assumption about INDX(K) or Z(INDX(K)) from the above code. There is no indication of how INDX is mapped relative to X, so the ON directive gives no guidance. Note that the fact that an expression (here, X(INDX(K))) is local does not imply that its subexpressions (here, INDX(K)) are also local. Similarly, Z's mapping does not determine whether Z(INDX(K)) would be local; it indicates that Z(INDX(K)-1) is local, but that is not a great help. If the compiler has additional information (for example, X is distributed BLOCK and INDX(K) is not near a block boundary), it might be able to make additional deductions.

Advice to implementors: One mark of a good compiler will be that it aggressively propagates LOCAL assertions. This is likely to significantly reduce communication costs. Note the cases under "Advice to users" below. End of advice to implementors.

Advice to users: One can expect compilers to differ in how aggressive they are in drawing these deductions. Higher-quality compilers will be able to identify more references as local, and use this information to eliminate data movement.
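The deductions above can be checked against a small model of the mappings (Python, illustrative only; the cell-mapping functions and the sample sizes n=8, p=4 are assumptions made for this sketch, not part of the example):

```python
def block_owner(cell, n, p):
    """Owner of template cell `cell` (1-based) under a BLOCK
    distribution of n cells over p processors."""
    bs = -(-n // p)               # ceil(n / p): the block size
    return (cell - 1) // bs + 1

# The alignments from the example, as maps onto X's cells:
def x_cell(i): return i           # X(I) occupies cell I
def y_cell(i): return i           # ALIGN Y(I) WITH X(I)
def z_cell(j): return j + 1       # ALIGN Z(J) WITH X(J+1)

n, p = 8, 4                       # assumed sizes: block size 2
# Y(i) is always co-located with X(i), so LOCAL(X(INDX(K)))
# implies Y(INDX(K)) is local too:
assert all(block_owner(y_cell(i), n, p) == block_owner(x_cell(i), n, p)
           for i in range(1, n + 1))
# But Z(4) lives with X(5), which sits on the next processor,
# so locality of X(INDX(K)) says nothing about Z(INDX(K)):
print(block_owner(x_cell(4), n, p), block_owner(z_cell(4), n, p))  # 2 3
```

The block boundary between X(4) and X(5) is exactly the "near a block boundary" case mentioned above: away from boundaries, Z's offset alignment happens to stay on-processor, but the compiler cannot assume it in general.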
All compilers should recognize that if an element of one array is local, then the same element of any other array with the same static mapping (i.e., arrays ALIGNed together, or with the same DISTRIBUTE pattern and array size) will also be local. That is, any compiler should recognize Y(INDX(K)) in the above example as local. Dynamically changing array mappings (i.e., REALIGN and REDISTRIBUTE) will tend to limit such information and its propagation. End of advice to users.

Another example of drawing further inferences from LOCAL clauses is the following:

!HPF$ ON HOME(PROCS(IP)), LOCAL( X(ILO(IP):IHI(IP)) )
      DO I = ILO(IP)+1, IHI(IP)-1
        X(I) = (X(I-1) + 2*X(I) + X(I+1)) / 4
      END DO

(Note that this is the construct form of the DO, so the ON clause applies to all three lines below it.) Here, I is always between ILO(IP) and IHI(IP). Therefore, the two references to X(I) require no communication, since they always refer to an element of the LOCAL array section. Slightly deeper reasoning reveals that the same is true of X(I-1) and X(I+1). In short, the entire DO loop can be executed without communication on PROCS(IP).

Advice to implementors: Analysis of ranges with symbolic bounds will pay big dividends for applications such as block-structured CFD codes. Of course, the expense of doing this analysis must be weighed against its benefits. See also the "Advice to programmers" below. End of advice to implementors.

Advice to programmers: As with the previous example, one can expect the capabilities of compilers to differ widely in making these inferences. Most compilers should recognize and use compile-time constants in LOCAL clauses and the executable code. Many compilers will extend this to variables that are invariant within the ON clause and will propagate the bounds of DO loops; others will use pattern recognition to achieve some of this effect. Such pattern recognition suffices to recognize X(I) in the example.
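The "slightly deeper reasoning" invoked above is ordinary interval arithmetic over the loop bounds. A sketch of the check a compiler would perform (Python, illustrative; the concrete bounds ilo=10, ihi=20 are assumed sample values standing in for the symbolic ILO(IP) and IHI(IP)):

```python
def shift(rng, offset):
    """Interval image of the subscript I+offset as I ranges over rng."""
    lo, hi = rng
    return (lo + offset, hi + offset)

def contained(rng, section):
    """True if every subscript in rng falls inside the LOCAL section."""
    return section[0] <= rng[0] and rng[1] <= section[1]

ilo, ihi = 10, 20                  # sample values for ILO(IP), IHI(IP)
local = (ilo, ihi)                 # LOCAL( X(ILO(IP):IHI(IP)) )
loop = (ilo + 1, ihi - 1)          # DO I = ILO(IP)+1, IHI(IP)-1
for off in (-1, 0, 1):             # X(I-1), X(I), X(I+1)
    assert contained(shift(loop, off), local)
print("all three references are provably local")
```

A compiler doing this symbolically rather than with concrete values needs exactly the invariance and bounds-propagation capabilities the advice describes; with concrete values the check is trivial, which is why compile-time constants are the easy case.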
Detailed reasoning about symbolic values is beyond the capabilities of most compilers. For example, determining the range of X(I-1) in the above example may require symbolic manipulation (or rather aggressive pattern matching). End of advice to programmers.

--------------------------------------------------------------------------------

[A related mail message from Marc Baber]

Hello, I've excerpted the following from APR's manuals for consideration as one possible form of an HPF 2.0 work-distribution directive. DO PAR can be interpreted as a dynamically balanced work-partitioning directive on shared-memory systems by ignoring the ON reference, or the ON reference can be used to align the DO with any array/dimension for static balancing and reduction of communication on distributed-memory systems. Though this syntax does not address some people's desire to specify processor ids in the ON clause, it does achieve a very natural extension of HPF to align DO loops with array distributions. Note that DO PAR is not interpreted to mean "run this loop parallel at any cost." DO PAR simply conveys the programmer's preference about which loop in a loop nest should be parallelized first. The IGNORE directive (also described below) is used to override the xHPF compiler's determinations about iteration dependence.

-Marc

APR Directive Descriptions

Select Loop Directive: CAPR$ DO PAR {ON reference }

Select the next loop for parallelization. Applies to both automatic and manual loop selection modes. The ON clause selects a distributed array as controlling the loop distribution. If there is no ON clause, xhpf will determine the best partitioned array reference within the loop to determine the distribution of the loop iterations.

* Applies only to the next upcoming DO loop.

The Controlling Array ( ON Clause )

Without the ON clause on a DO PAR directive, dpf will select a partitioned array reference within each loop to control the distribution of iterations over the processors.
A loop's iterations are distributed so that references to the control array occur on the processor that "owns" those elements of the partitioned array. dpf tries to choose an array that will result in the minimum communication when the loop is parallelized. (See the Chapter 2 discussion of loop distribution.)

The ON clause on a DO PAR directive indicates that the specified array reference is to control the distribution of the iterations of the loop. The ON clause reference syntax:

reference is array-name or array-name(subscripts) or array-name<subscripts>

Within subscripts, subscript expressions conforming to the array's subscripts appear:

: and * indicate an arbitrary subscript
initial-expr ~ incr-expr indicates an integer subscript with a constant increment ("CII")
initial-expr : final-expr {: incr-expr } indicates a complete range of values

(Each of the -exprs is an arithmetic expression that may include constants and parenthesized sub-expressions.) Because xhpf only considers a single dimension for array partitioning and loop distribution, only one subscript expression may appear in the ON clause array as a CII or constant; the other subscripts must be : or *.

Examples:

CAPR$ DO PAR ON QUOM<:,1~1>
CAPR$ DO PAR ON UR<:,:,j-1~1>
CAPR$ DO PAR ON WKW<:,:,1>

Don't Select Loop Directive: CAPR$ DO NOPAR

Indicates that the next loop should not be selected for parallelization.

Ignore Inhibitors or Communication

Directive: CAPR$ IGNORE ALL INHIBITORS

Indicates that the next loop, if marked for parallelization by a DO PAR directive, can be parallelized by ignoring any apparent inhibitors.

Directive: CAPR$ IGNORE type COM {ON reference }

Indicates that xhpf need not generate certain communication calls in the next loop marked by a DO PAR. type indicates the kind of communication being referenced. The ON clause further restricts the communication to specific arrays. If there is no ON clause, communication of type for all arrays in the loop is ignored for this loop.
IGNORE type:  referring to:
PRELOOP       pre-loop fetches of off-processor array elements
POSTLOOP      post-loop sends of computed off-processor elements
ALL           all and any of the above

* Any CAPR$ IGNORE directives must immediately follow a DO PAR directive.

IGNORE type COM ON Clause

Each array reference in a loop has the potential of generating data communication calls to fetch or store array elements. PRELOOP communications fetch to the processor all the array elements it uses from the processors that own them, while POSTLOOP communications distribute all the computed array elements back to the owning processors. The IGNORE type COM directive permits the user to optimize the performance of a parallelized loop by recognizing that certain communications are redundant and can be eliminated. The ON clause specifies which array reference's communication calls can be ignored. Unlike the DO PAR ON clause, the syntax of the ON clause on the IGNORE directive must indicate all the subscripting in order to match a particular array reference in the source code, or just the unsubscripted array name to select all references to that array.

From the point of view of the DO PAR directive, the array references in the following loops that can be used as the controlling array for distributing iterations of the loop are shown in comments. (X partitioned BLOCK, TEMP replicated.)

CAPR$ DO PAR
      DO 15 I = 1,N
      temp(I) = I       ! temp<1~1>
15    X(I) = temp(I)    ! X<1~1>

CAPR$ DO PAR
      DO 20 i = 1,N
      temp(I) = i*i
20    X(I) = x(i) + temp(I)

The array references that give rise to communication calls for loop 15 are:

postloop communication of TEMP<1~1>
postloop communication of X<1~1>

However, values set in temp by loop 15 are immediately replaced in loop 20, eliminating the need to bring all copies of temp up to date after loop 15.
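This redundancy can be demonstrated with a small simulation (Python, purely illustrative; two processors with owner-computes semantics and a per-processor replica of temp are assumptions of the sketch, and the function name run is invented):

```python
def run(n, p, sync_temp_after_15):
    """Simulate loops 15 and 20 with X BLOCK-partitioned over p
    processors and temp replicated, optionally performing the
    POSTLOOP broadcast of temp after loop 15."""
    bs = n // p                                 # assume p divides n
    X = [0] * (n + 1)                           # global view, 1-based
    temps = [[0] * (n + 1) for _ in range(p)]   # one replica per processor
    # Loop 15: each processor computes the iterations it owns.
    for q in range(p):
        for i in range(q * bs + 1, (q + 1) * bs + 1):
            temps[q][i] = i
            X[i] = temps[q][i]
    if sync_temp_after_15:                      # the candidate POSTLOOP comm
        full = [max(t[i] for t in temps) for i in range(n + 1)]
        temps = [list(full) for _ in range(p)]
    # Loop 20: every replica element read here was just written locally,
    # so the broadcast above cannot affect the result.
    for q in range(p):
        for i in range(q * bs + 1, (q + 1) * bs + 1):
            temps[q][i] = i * i
            X[i] = X[i] + temps[q][i]
    return X

print(run(8, 2, True) == run(8, 2, False))      # True: the sync is redundant
```

The equality holds precisely because loop 20 overwrites each temp element before reading it, which is the dataflow fact the IGNORE POSTLOOP directive lets the programmer assert.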
The postloop communication of temp can be ignored:

CAPR$ DO PAR
CAPR$ IGNORE POSTLOOP COM ON TEMP
      DO 15 I = 1,N
      temp(I) = I
15    X(I) = temp(I)

--------------------------------------------------------------------------------