The ON Clause

Charles Koelbel and Rob Schreiber

The purpose of the ON directive is to allow the programmer to control the distribution of computations among the processors of a parallel machine. In a sense, this is the dual of the DISTRIBUTE and ALIGN directives for data. Modern parallel machines achieve their best performance if the operations are divided among the processors (as specified by ON) with each processor accessing its local data (as specified by DISTRIBUTE). If either explicit computation partitioning or explicit data mapping is not present, the system must choose one. Another proposal explains how this might be done.

A secondary purpose of the ON directive is to identify data accessed by the computation as local to the executing processor. The compiler can use this information to avoid generating communication or to simplify array address calculations. Note that whether a given data element is local depends on two facts:
* Where the data is stored (i.e., its DISTRIBUTE and ALIGN attributes)
* Where the computation is executed (i.e., its ON directive)
For these reasons, the LOCAL clause is added to the ON directive, which is effectively the earliest point in the program where the needed facts might be available. Note that changing the ON directive may invalidate some LOCAL clauses, or may make some new LOCAL clauses true.

Syntax

There are two flavors of the ON directive: a single-statement form and a multi-statement form. The BNF for both is

  simple-on-directive IS ON HOME ( home-expr ) [, LOCAL ( local-var-list ) ]

  block-on-directive  IS simple-on-directive BLOCK

  end-directive       IS END [ BLOCK ]

  on-block            IS block-on-directive
                           block
                         end-directive

  home-expr           IS variable
                      OR template-elmt
                      OR processors-elmt

  template-elmt       IS template-name [ ( section-subscript-list ) ]

  processors-elmt     IS processors-name [ ( section-subscript-list ) ]

  local-var           IS variable
                      OR ALL = variable-name

Notes:

Constraints need to be added above...
Also, changes are needed in some higher-level syntax (for example, adding simple-on-directive as an option to executable-directive). home-expr, template-elmt, and processors-elmt are just auxiliary syntax categories.

Note that "variable" is an F90 syntax term meaning (roughly) "a reference, including an array element, array section, or derived type field"; "variable" does not include template or processor elements, since those are not first-class language constructs. Note also that "block" is an F90 syntax term for "a series of statements treated as a group" - for example, the body of a DO construct.

simple-on-directive should be an option under executable-directive (an HPF syntax term), as redistribute-directive is now. on-block most naturally fits as an F90 executable-stmt, since that makes constraints about "properly nesting" easier to state.

Any top-level variables in the LOCAL clause must be explicitly mapped. Otherwise, the assertion (see below) is a statement about how the compiler works, not about the program. (Alternately, somebody could suggest another definition of LOCAL, I suppose...)

Rationale: The BLOCK in on-block is necessary so that the parser can distinguish between the one-statement and multi-statement forms. BEGIN would be equally good as a keyword.

HOME is required, even when its argument is a processor. It seems more natural to eliminate the HOME keyword there, but this leads to the following ambiguity:

      INTEGER X(4)
!HPF$ PROCESSORS HOME(4)
!HPF$ DISTRIBUTE X(BLOCK)    ! X(I) will be on processor I
      X = (/ 4, 3, 2, 1 /)
!HPF$ ON HOME(X(2))
      X(2) = X(1)

If HOME were not required, where should the computation be done:
1. On processor HOME(2) (i.e., the owner of X(2))?
2. On processor HOME(3) (i.e., the value of X(2) before the assignment)?
3. On processor HOME(4) (i.e., the value of X(2) after the assignment)?
The definition of ON clearly indicates that interpretation 1 is correct.
One can get the effect of interpretation 2 by the directive

!HPF$ ON HOME(HOME(X(2)))

There is no way to get the effect of interpretation 3, short of building a clairvoyant computer. Introducing reserved keywords into Fortran was suggested as a better solution to this problem, but was seen as too large a change to the underlying language. End of rationale.

Semantics

The ON directive advises the compiler to perform the computation in the following statement(s) using the processor(s) named in the argument to HOME. Like the mapping directives ALIGN and DISTRIBUTE, this is advice rather than an absolute commandment; the compiler may override it if necessary. Also like ALIGN and DISTRIBUTE, the ON directive may affect the efficiency of the computation, but not the final results.

Advice to implementors: If the compiler may override the user's advice in an ON directive, then it should also offer the user an option to force all directives to be obeyed. End of advice to implementors.

The single-statement ON directive (i.e., simple-on-directive) precedes the statement for which it is advising behavior, and is said to apply to that statement. If the statement is a compound statement (e.g., a DO loop or an IF-THEN-ELSE construct), then the ON directive also applies to all statements nested therein. Similarly, the block ON directive (i.e., on-block) applies its ON clause to all statements up to the matching end directive.

It is legal to nest ON directives, provided that the processors named by the inner ON directive are also named by the outer directive. Similarly, an ON directive can apply to a CALL statement where the called subroutine contains an ON directive. In this case, any processor named inside the subroutine must be included by the ON directive in the caller. Note that the syntax of on-block automatically ensures that it is properly nested inside other compound statements, and that compound statements properly nest inside of it.
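The nesting rule above amounts to a set-inclusion check. The following Python sketch (the set representation and all names are illustrative, not part of HPF) tests whether an inner ON directive is conforming with respect to its enclosing one:

```python
# Hypothetical model: processor sets are Python sets of processor ids
# (here, 2-D arrangement coordinates). Not part of HPF itself.

def nested_on_conforming(outer_home, inner_home):
    """An inner ON directive is legal only if every processor it names
    is also named by the enclosing ON directive."""
    return set(inner_home) <= set(outer_home)

# Outer ON names a row of processors; inner ON names one of them: legal.
print(nested_on_conforming({(1, 1), (1, 2), (1, 3)}, {(1, 2)}))   # True
# Inner ON names a processor outside the outer set: not conforming.
print(nested_on_conforming({(1, 1), (1, 2)}, {(2, 1)}))           # False
```

The same check applies across call boundaries: the callee's HOME set must be contained in the caller's.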
As with other Fortran 90 compound statements, transfer of control to the interior of an on-block from outside the block is prohibited, while transfers within the block may occur. However, HPF also prohibits transfers of control from the interior of an on-block to outside the on-block. Note that this is stricter than Fortran 90.

Rationale: This allows a fork-join approach to nested parallelism, which seems natural to many applications and programmers. The restrictions on control flow into and out of an ON block essentially make it a single-entry, single-exit region, thus simplifying the semantics considerably. End of rationale.

The HOME clause can name a program variable, a template, or a processors arrangement. For each of these possibilities, it can further specify a single element or multiple elements. This is translated into the processors executing the computation as follows:
* If the HOME clause names a program variable (that is, an array element or section), then every processor owning any element of that variable will execute the following statement(s).
* If the HOME clause names a template element or section, then every processor owning any element of that template section will execute the following statement(s).
* If the HOME clause names a processors arrangement, then the processor(s) referenced there will execute the following statement(s).
In all cases, the value(s) of the HOME clause are determined as if the HOME argument were evaluated when control flow reached the ON directive. That is, if the value of a variable used in the HOME clause changes within the on-block, work is not migrated to a new processor. If evaluation of the argument of the HOME clause would change the value of any variable in the program, then the value of that variable becomes undefined after the ON clause is reached. Finally, note that the HOME clause only specifies how computation is partitioned among processors; it does not indicate processors that may be involved in data transfer.
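The owner-set rules above can be illustrated with a small model. The Python sketch below assumes a 1-D BLOCK distribution of an N-element array over P processors (the owner formula and all names are illustrative, not taken from the proposal); it computes the set of processors that would execute under ON HOME(A(lo:hi)):

```python
# Illustrative model, not part of HPF: owners of a BLOCK-distributed
# 1-D array, and the executing set for a HOME clause naming a section.

def block_owner(i, n, p):
    """0-based processor owning element i (1-based) under BLOCK."""
    bs = -(-n // p)              # block size = ceil(n/p), as in HPF BLOCK
    return (i - 1) // bs

def home_processors(lo, hi, n, p):
    """Every processor owning any element of A(lo:hi) executes."""
    return {block_owner(i, n, p) for i in range(lo, hi + 1)}

# N=100 over 4 processors: blocks are 1-25, 26-50, 51-75, 76-100.
print(home_processors(20, 30, 100, 4))   # {0, 1}: the section spans two blocks
```

A HOME clause naming a single element yields a single-processor set; a section straddling a block boundary yields several executing processors, as shown.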
Rationale: This is the heart of the ON clause. The "as if" wording and making side effects undefined avoid the following problem: ON is a directive, and therefore cannot have side effects; but implementing the ON clause may require evaluating some functions, thereby causing those side effects. An alternative would be to allow only PURE functions in the HOME and LOCAL clauses. End of rationale.

Advice to implementors: If the HPF program is compiled into Single-Program-Multiple-Data (SPMD) code, then the ON clause can always be implemented (albeit inefficiently) by having all processors compare their processor id to an id (or list of ids) generated from the HOME clause. (Similar naive implementations can be constructed in other paradigms as well.) If the ON clause will be executed repeatedly, for example in a DO loop, it is probably worthwhile to invert this process. That is, instead of all processors executing all the HOME clause tests, the compiler should determine the range of loop iterations that will give a true test. (See the "Advice to implementors" in the Examples section below for more details.) For example, consider the following complex case:

      DO I = 1, N
!HPF$   ON HOME( A(MY_FCN(I)) ) BLOCK
        ...
!HPF$   END BLOCK
      END DO

Here, the generated code can perform an "inspector" (i.e., a skeleton loop that only evaluates the HOME clause of each iteration) to produce a list of iterations assigned to each processor. This list can be produced in parallel, since MY_FCN must be side-effect free; however, distributing it to all processors may require unstructured communication patterns, possibly negating the advantage of parallelism. In general, more advanced compilers will be able to efficiently invert more complex HOME clauses. It is recommended that the abilities (and limitations) of a particular compiler be documented clearly for potential users. End of advice to implementors.

Advice to programmers: The argument to HOME in an ON clause can be arbitrarily complex.
This is a two-edged sword: it can express very complicated computation partitionings, but the implementation of those partitionings may not be efficient. More concretely, it may express a perfectly load-balanced computation, but force the compiler to serialize the evaluation of the HOME clauses. Although the amount of overhead for an ON clause will vary with the HPF code, the compiler, and the hardware, one can expect compilers to generate very good code based solely on array mappings, and progressively worse code as the complexity of the HOME clause increases. A rough measure of the complexity of an ON directive is the amount of run-time data used to compute it; for example, a constant offset is fairly simple, while a permutation array is very complex. See the Examples section below for more concrete examples of this phenomenon. End of advice to programmers.

If ON clauses are nested, then the innermost HOME clause effectively controls execution of the statement(s). A programmer can think of this as successively restricting the set of processors at each level of nesting; clearly, the last restriction must be the strongest. Similarly, if ON clauses appear in a subroutine call chain, then the deepest ON clause is the effective controller.

Rationale: This reinforces the idea of fork-join parallelism pointed out earlier. End of rationale.

Advice to implementors: Notice that nested ON clauses do not present additional problems for the naive implementation. If a processor failed the outer HOME comparison, then it would fail any tests inside the ON as well; thus, it need not even make those tests. End of advice to implementors.
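The naive SPMD implementation mentioned above extends directly to nested ON directives. The following Python sketch (processor ids, HOME sets, and all names are illustrative) has each simulated processor guard the block with a membership test; a processor that fails the outer test never evaluates the inner one:

```python
# Sketch of the naive SPMD scheme: every processor evaluates the HOME
# test against its own id. With nesting, failing the outer guard means
# the inner guard is never reached. All names are illustrative.

def run_on_block(my_id, outer_home, inner_home, trace):
    if my_id in outer_home:              # outer ON HOME(...) guard
        trace.append(("outer", my_id))
        if my_id in inner_home:          # inner guard: a subset of outer
            trace.append(("inner", my_id))

trace = []
for pid in range(4):                     # simulate 4 SPMD processors
    run_on_block(pid, outer_home={1, 2}, inner_home={2}, trace=trace)
print(trace)   # [('outer', 1), ('outer', 2), ('inner', 2)]
```

Processors 0 and 3 fall through both guards without work, matching the observation that a processor excluded by the outer HOME comparison need not make the inner tests at all.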
Operations controlled by an ON clause must follow certain restrictions:
* If an ON directive applies to a DISTRIBUTE directive, then the set of processors named in the HOME clause must include
  - All processors named in the DISTRIBUTE's ONTO clause, if one is present
  - All processors in the default processors arrangement, if there is no ONTO clause
* If an ON directive applies to a REDISTRIBUTE directive, then the set of processors named in the HOME clause must include
  - All processors that stored any element of the distributee before the REDISTRIBUTE was encountered
  - The processors required by an equivalent DISTRIBUTE directive
* If an ON directive applies to a REALIGN directive, then the set of processors named in the HOME clause must include
  - All processors that store any element of the align-with-clause
  - All processors that stored any element of the alignee before the REALIGN was encountered
  - The processors required by an equivalent ALIGN directive
* If an ON directive applies to an ALLOCATE statement that creates an explicitly mapped variable, then the set of processors named in the HOME clause must include the processors required by the associated DISTRIBUTE or ALIGN directive.
* If an ON directive applies to any other declaration of an explicitly mapped variable, then the set of processors named in the HOME clause must include the processors required by the associated DISTRIBUTE or ALIGN directive.
If operations within an ON block do not follow these constraints, then the program is not HPF-conforming.

Rationale: Operations that allocate memory require the cooperation of all processors that will own that memory. Therefore, ON clauses must schedule those operations to execute on the cooperating processors. End of rationale.
Advice to implementors: These restrictions ensure that HPF data distribution directives inside an ON block (either lexically or in a dynamic call chain) can be implemented without relying on one-way communication outside of the "current" processing group. End of advice to implementors.

The LOCAL clause, if present, is an assertion to the compiler that certain array references made within the ON block refer to data stored in local memory if the computation is performed by the processor(s) named in the HOME clause. Specifically, if an element in the local-var list is a variable, array element, or array section, then that element will be stored on at least one of the processors named in the HOME clause. If the local-var is of the form ALL=variable-name, then all references to that variable to which the ON applies will be stored on at least one of the named processors. As with the HOME clause, the value(s) of the LOCAL clause are determined as if the local-var-list were evaluated when control flow reached the ON directive, and possible side effects in the LOCAL clause cause the affected variables to become undefined.

Rationale: The ALL=variable-name form is syntactic sugar for listing every reference to an array in the ON block. This is similar in spirit to APR's grouping of IGNORE directives by variable. I'm open to better syntax for saying this. (Note that ALL(variable-name) doesn't work, since somebody might name a variable ALL. Grr...) The pseudo-evaluation of the LOCAL clause is the same as for the HOME clause, for the same reason. Treating the LOCAL clause as naming sets of variables, rather than as a list of lexical references, gives a reasonably firm logical basis to what the assertion means. It is unclear whether compilers would rather see this "symbolic" information or longer lists that need to be pattern-matched. It is also unclear which approach would appeal more to programmers. End of rationale.
Note that if the HOME clause specifies more than one processor, then LOCAL only asserts that the variables are stored on one of those processors. For example, if a statement is executed on a section of a processors arrangement, then communication within that section may be needed for some variables in the LOCAL clause. Communication with processors outside of the section will not be needed for those variables, however.

Rationale: The alternative to this interpretation would be that any variable named in the LOCAL clause is local to all processors, i.e., replicated. While that certainly allows more extensive optimizations, it is a less common case. In addition, it does not seem to capture the intent of ON directives applied to CALL statements or compound statements. For example,

!HPF$ PROCESSORS PROCS(MP,MP)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK) ONTO PROCS
!HPF$ ON HOME(PROCS(1,1:MP)), LOCAL( X(K,1:N) )
      CALL FOO( X(K,1:N) )

would presumably call FOO on a row of the processors arrangement, passing elements of X in place. This is what the current definition does; if LOCAL meant "local on every processor", the call would force X to be replicated. End of rationale.

The LOCAL assertion is similar to the INDEPENDENT directive, in that if it is correct it does not change the meaning of the program. If the LOCAL clause is incorrect, the program is not standard-conforming (and its behavior is thus undefined). Like the INDEPENDENT directive, the compiler may use the information in the LOCAL clause, or ignore it if it is insufficient for the compiler's purposes. If the compiler can detect that the LOCAL clause is incorrect (i.e., that a LOCAL variable is definitely nonlocal), it is justified in producing a warning.
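The "stored on at least one of the named processors" rule can be modeled directly. The Python sketch below (assuming a 1-D BLOCK distribution; the owner formula and all names are illustrative) checks whether a LOCAL assertion holds for a given HOME set:

```python
# Sketch: a LOCAL clause asserts only that each named element lives on
# *some* processor in the HOME set, not necessarily on the processor
# executing a given iteration. All names are illustrative.

def block_owner(i, n, p):
    bs = -(-n // p)                      # block size = ceil(n/p)
    return (i - 1) // bs

def local_assertion_holds(elements, home_procs, n, p):
    return all(block_owner(i, n, p) in home_procs for i in elements)

# Suppose ON HOME(A(26:50)) executes on processor 1 of 4 (N=100).
# LOCAL(A(30), A(40)) holds; LOCAL(A(60)) would not.
print(local_assertion_holds([30, 40], {1}, 100, 4))   # True
print(local_assertion_holds([60], {1}, 100, 4))       # False
```

With a multi-processor HOME set, the check passes as long as each element's owner is somewhere in that set, which is exactly why communication within the set may still be needed.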
Unlike the INDEPENDENT directive, however, the truth of the LOCAL clause depends on both the mapping of computations (specified with the ON clause) and the mapping of data (specified with DISTRIBUTE and ALIGN clauses); if the compiler overrides either of these, then it may not be able to use the information in the LOCAL clause.

Rationale: Knowing that a reference is local is valuable information for the optimizer. It is in keeping with the spirit of HPF to phrase this as an assertion of fact, which the compiler can use as it pleases. Expressing it as advice to the compiler seems to have disadvantages. Some possible ways this advice could be phrased, and the counter-arguments, are:
* "Don't generate communication for this reference" has great potential for changing the meaning of the program. Some programmers want this capability, but it violates the "correct directives should not change the meaning of a program" principle of HPF. Also, once communication is "turned off" for a reference, it is not clear how to turn it back on.
* "Generate communication for this reference" is not a useful directive, since the compiler has to do this anyway.
* "Generate communication for this reference, and place it here" is useful, since it can override the default placement by the compiler. It still has the potential to change the program's meaning. It also has the potential to create programs as complex as message-passing, as programmers try to move communication out of loops.
End of rationale.

Examples

The following are valid examples of ON directives. Most of them are "reasonable" in the sense that they illustrate idioms that programmers might want to use, rather than contrived situations. For simplicity, the first several examples assume the following array declarations:

      REAL A(N), B(N), C(N), D(N)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK), C(BLOCK), D(BLOCK)

One of the most commonly requested capabilities for HPF was to control how loop iterations are assigned to processors.
(Historically, the ON clause first appeared in exactly this role in the Kali FORALL construct.) This can be done with the ON directive, as shown in the following examples:

!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$   ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
      END DO

!HPF$ INDEPENDENT
      DO J = 2, N-1
!HPF$   ON HOME(A(J+1)) BLOCK
        A(J) = B(J+1) + C(J+1) + D(J+1)
!HPF$   END BLOCK
      END DO

The ON directive in the I loop advises the compiler to have each processor run over its local section of the A array (and therefore of B as well). The references to B(I-1) and B(I+1) must be fetched from off-processor for the first and last iterations on each processor (except on the boundary processors); note that those processors are not mentioned in the HOME clause. The ON directive in the J loop advises the compiler to "shift" computations so that each processor does a vector sum of its local sections of B, C, and D, stores the first element of the result on the processor to its left, and stores the rest of the result (shifted by one) in A. It is worth noting that the directives would still be valid (and minimize nonlocal data accesses) if the arrays were distributed CYCLIC, although the number of nonlocal references would be much higher.

Advice to implementors: It is highly recommended that compilers concentrate on optimizing DO loops with a single ON clause including the entire loop body. Schematically, the code will be:

      DO i = lb, ub, stride
!HPF$   ON HOME(array(f(i))) BLOCK
        body
!HPF$   END BLOCK
      END DO

where array has some data mapping. Assume the mapping gives processor p the elements my_set(p). (In a BLOCK distribution, for example, my_set(p) is a contiguous range of integers.) Then the generated code on processor p should be

      DO i in [lb:ub:stride] intersect f^-1(my_set(p))
        body
      END DO

(This schematic does not show where communication or synchronization must be placed; that must be derived from analysis of the body.)
Moreover, f is most likely to be the identity function or a linear function with integer coefficients, both of which can be inverted easily. Given this, techniques for iterating through the resulting set can be found in several recent conference proceedings. End of advice to implementors.

Advice to users: One can expect the I loop above to generate efficient code for the computation partitioning. In effect, the compiler will arrange for each processor to iterate over its own section of array A. The J loop is slightly more complex, since the compiler must find the inverse of the HOME clause's subscripting function. That is, the compiler must solve K=J+1 for J, where K ranges over the local elements of A. Of course, in this case J=K-1; in general, linear functions can be inverted by the compiler. (It should be pointed out, however, that complex combinations of ALIGN and DISTRIBUTE may make the description of K unwieldy, and this may add overhead to the inversion process.) End of advice to users.

Sometimes it is advantageous to "split" an iteration between processors. The following case shows one example of this:

!HPF$ INDEPENDENT
      DO I = 2, N-1
!HPF$   ON HOME(A(I))
        A(I) = (B(I) + B(I-1) + B(I+1))/3
!HPF$   ON HOME(C(I+1))
        C(I+1) = A(I) * D(I+1)
      END DO

Due to the first ON clause, the reference to A(I) is local in the first statement. The second ON clause makes A(I) nonlocal (for some values of I) there. This maximizes the data locality in both statements, but does require data movement between the two.

Advice to implementors: If there are several non-nested ON clauses in a loop, then the schematic above needs to be generalized. In essence, the iteration range for each individual ON clause must be generated; a processor will then iterate over the union of these ranges. Statements guarded by an ON directive must now be guarded by an explicit test.
In summary, the code for

      DO i = lb, ub, stride
!HPF$   ON HOME(array1(f1(i)))
        stmt1
!HPF$   ON HOME(array2(f2(i)))
        stmt2
      END DO

on processor p becomes

      set1 = [lb:ub:stride] intersect f1^-1(my_set1(p))
      set2 = [lb:ub:stride] intersect f2^-1(my_set2(p))
      DO i in set1 union set2
        IF (i in set1) THEN
          stmt1
        ENDIF
        IF (i in set2) THEN
          stmt2
        ENDIF
      END DO

where my_set1(p) is the local set for array1, and my_set2(p) is the local set for array2. (Again, synchronization and communication must be handled by other means.) Code transformations such as loop distribution and loop peeling can be used to eliminate the tests in many cases. They will be particularly profitable if there are data dependences between the ON blocks. End of advice to implementors.

Advice to users: Splitting an iteration like this is likely to require either additional tests at runtime or additional analysis by the compiler. Even if the compiler can generate low-overhead scheduling for the individual ON clauses, combining them is not necessarily low-overhead. The locality benefits must be rather substantial for this to pay off, but there are cases where multiple ON clauses are valuable. (All these statements are particularly true if one ON block uses data computed in another one.) End of advice to users.

Because ON clauses nest naturally, they can be useful for expressing parallelism along different dimensions. Consider the following example:

      REAL X(M,M)
!HPF$ DISTRIBUTE X(BLOCK,BLOCK)

!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$   ON HOME(X(:,J)) BLOCK
        DO I = 2, M
!HPF$     ON HOME(X(I,J))
          X(I,J) = (X(I-1,J) + X(I,J)) / 2
        END DO
!HPF$   END BLOCK
      END DO

Each iteration of the J loop is executed by a column of the processors arrangement. The I loop further subdivides the computation, giving each processor responsibility for computing the elements it owns. Many compilers would have chosen this computation partitioning automatically for such a simple example.
However, the compiler might have attempted to fully parallelize the outer loop, executing each inner loop sequentially on one processor. (This might be attractive on a machine with very fast communication.) By inserting the ON clauses, the user has advised against this strategy, thus trading additional locality for restricted parallelism. Notice that the ON directive neither requires nor implies the INDEPENDENT assertion. In both nests, each iteration of the I loop depends on the preceding iteration, but the ON directive can still partition the computation among processors. The ON directive does not automatically make a loop parallel.

Advice to implementors: "Dimension-based" nesting, as above, will probably be a common case. The HOME clauses can be inverted at each level, treating indices from outer loops as run-time invariants. End of advice to implementors.

Advice to programmers: Nested ON directives will tend to have efficient implementations if their HOME clauses refer to different dimensions of the processors arrangements, as in the above example. This minimizes the interaction between the levels of the loops, simplifying the implementation. End of advice to programmers.

Consider the following variation on the above example:

      REAL Y(M,M)
!HPF$ DISTRIBUTE Y(BLOCK,*)

!HPF$ INDEPENDENT, NEW(I)
      DO J = 1, M
!HPF$   ON HOME(Y(:,J)) BLOCK
        DO I = 2, M
!HPF$     ON HOME(Y(I,J))
          Y(I,J) = (Y(I-1,J) + Y(I,J)) / 2
        END DO
!HPF$   END BLOCK
      END DO

Note that the ON clauses have not changed, except for the name of the array. The interpretation is similar to the above, except that the outer ON directive assigns each iteration of the J loop to all of the processors. The inner ON directive again implements a simple owner-computes rule. The programmer has directed the compiler to distribute a serial computation across all the processors. There are a few scenarios where this is more efficient than parallelizing the outer loop:
1. Parallelizing the outer loop will generate many non-local references, since only a part of each column is on any one processor. If nonlocal references are very expensive (or if M is relatively small), this overhead may outweigh any gain from parallel execution.
2. The compiler may take advantage of the INDEPENDENT directive to avoid inserting any synchronization. This allows a natural pipelined execution: a processor will execute its part of the I loop for one value of J, then immediately go on to the next J iteration. Thus, the first processor will start on J=2 while the second receives the data it needs (from processor one) for J=1. (A similar pipeline would develop in the X example above.)
Clearly, the suitability of these ON clauses will depend on the underlying parallel architecture.

Advice to programmers: This example points out how ON may improve software engineering. While the "value" of HOME(X(I)) will change if X's mapping changes, its intent will usually stay the same - run the loop "aligned with" the array X. Moreover, the form of the clauses is portable, and they simplify experimenting with alternative computation partitionings. Both qualities are similar to the advantages of DISTRIBUTE and ALIGN over low-level data layout mechanisms. End of advice to programmers.

ON directives are particularly useful when the compiler cannot accurately estimate data locality, for example when the computation uses indirection arrays. Consider three variations of the same loop:

      REAL X(N), Y(N)
      INTEGER IX(M), IY(M)
!HPF$ DISTRIBUTE X(BLOCK), Y(BLOCK)
!HPF$ DISTRIBUTE IX(BLOCK), IY(BLOCK)

!HPF$ INDEPENDENT
      DO I = 1, N
!HPF$   ON HOME( X(I) )
        X(I) = Y(IX(I)) - Y(IY(I))
      END DO

!HPF$ INDEPENDENT
      DO J = 1, N
!HPF$   ON HOME( IX(J) )
        X(J) = Y(IX(J)) - Y(IY(J))
      END DO

!HPF$ INDEPENDENT
      DO K = 1, N
!HPF$   ON HOME( X(IX(K)) )
        X(K) = Y(IX(K)) - Y(IY(K))
      END DO

In the I loop, each processor runs over its own section of the X array. Only the reference X(I) is guaranteed to be local.
(If M<>N, then IX and IY have a different block size than X, and thus a different mapping.) However, if it is _usually_ the case that X(I), Y(IX(I)), and Y(IY(I)) are located on the same processor, then this mapping may be the best one available. If X(I) and Y(IX(I)) are _always_ on the same processor, then the LOCAL clause should be added:

!HPF$ ON HOME( X(I) ), LOCAL( Y(IX(I)) )

This will avoid communication setup overhead on most systems, and there is little chance that the compiler could deduce this automatically. If both references to Y are _always_ on the same processor as X(I), then further improvement is possible and desirable:

!HPF$ ON HOME( X(I) ), LOCAL( ALL=Y )

In the J loop, the references IX(J) and IY(J) are always local. This is the most common array reference class in the loop, so this partitioning minimizes the number of nonlocal data references in the absence of any special properties of IX and IY. It may not evenly balance the load among processors; for example, if N=M/2 then half the processors will be idle. As before, if the values in IX or IY ensure that one of the Y references is always local, a LOCAL assertion should be added.

In the K loop, only the reference Y(IX(K)) is guaranteed to be local (because Y and X have the same distribution). However, the values stored in IX and IY may ensure that Y(IY(K)) and X(K) are always local, a fact that should be noted if true:

!HPF$ ON HOME( X(IX(K)) ), LOCAL( Y(IY(K)), X(K) )

Even if the three REAL values are not always, but merely "usually", on the same processor, this may be a good computation partitioning for both locality and parallelism. However, these advantages must be weighed against the cost of computing the partitioning. Since the HOME clause depends on a (presumably large) array of runtime values, substantial time may be required to determine which iterations are assigned to each processor.
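The inspector described earlier makes this cost concrete. The following Python sketch (with an assumed 1-D BLOCK distribution; the owner formula and all names are illustrative) evaluates the HOME clause X(IX(K)) once per iteration to build the per-processor iteration lists for the K loop:

```python
# Sketch of the "inspector" for the K loop: evaluate HOME(X(IX(K))) for
# every K to record which processor executes which iterations. The
# result can be reused while IX and X's mapping are unchanged.

def block_owner(i, n, p):
    return (i - 1) // (-(-n // p))       # block size = ceil(n/p)

def run_inspector(ix, n, p):
    iters = {q: [] for q in range(p)}
    for k, target in enumerate(ix, start=1):   # K = 1..M, IX is 1-based
        iters[block_owner(target, n, p)].append(k)
    return iters

ix = [5, 60, 12, 99, 30]                 # illustrative IX; N=100, 4 procs
print(run_inspector(ix, 100, 4))
# {0: [1, 3], 1: [5], 2: [2], 3: [4]}
```

The inspector's cost is one owner computation per iteration, linear in M, which is exactly the "substantial time" trade-off noted above.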
It should be clear from this discussion that there is no magic solution for handling complex computation partitionings; the best answer is usually a combination of application knowledge, careful data structure design (including ordering of the elements), and efficient compilation methodology and runtime support.

Advice to implementors: The K loop is the situation that the inspector strategy described above is designed for. If there is an outer loop around any of these examples, and that loop does not modify the distribution of X or the values of IX, then a record of each processor's iterations can be saved for reuse. The cost is at worst linear in the sizes of the arrays. End of advice to implementors.

Advice to users: It is unlikely that any production compiler will generate low-overhead code for the K loop above in the near term. The difference from the previous examples is that the HOME clause is not a function that can be easily inverted by the compiler. Some compilers may choose to execute every iteration on all processors, testing the HOME clause at run time; others may pre-compute a list of iterations for every processor. Of course, the cost of computing the list will be substantial. In practice, one would make all the arrays the same size to avoid some of the alignment problems above; the example was written this way for pedagogical reasons, not as an example of good data structure design. End of advice to users.

Explicit use of processors arrangements in ON directives is usually associated with task parallelism. Many examples can be found there (I assume...) The following example illustrates how processors can be used for a one-dimensional domain decomposition algorithm:

!HPF$ PROCESSORS PROCS(NP)
!HPF$ DISTRIBUTE X(BLOCK) ONTO PROCS

      ! Compute ILO(IP) = lower bound on PROCS(IP)
      ! Compute IHI(IP) = upper bound on PROCS(IP)
      DONE = .FALSE.
      DO WHILE (.NOT. DONE)
!HPF$   INDEPENDENT, NEW( ILO, IHI )
        DO IP = 1, NP
!HPF$     ON HOME(PROCS(IP)), LOCAL( X(ILO(IP):IHI(IP)) )
          CALL SOLVE_SUBDOMAIN( IP, X(ILO(IP):IHI(IP)) )
        END DO
!HPF$   ON HOME(X) BLOCK
        CALL SOLVE_BOUNDARIES( X, ILO(1:NP), IHI(1:NP) )
        DONE = CONVERGENCE_TEST( X, ILO(1:NP), IHI(1:NP) )
!HPF$   END BLOCK
      END DO

The algorithm divides the entire computational domain (array X) into NP subdomains, one for each processor. The INDEPENDENT IP loop performs a computation on each subdomain's interior. The processors then collaborate to update the boundaries of the subdomains and test for convergence. The subroutine SOLVE_SUBDOMAIN can use a transcriptive or descriptive mapping for its array argument, placing it on a single processor. An ON block encompassing the entire procedure could then ensure the computation proceeded on a single processor. Subroutines SOLVE_BOUNDARIES and CONVERGENCE_TEST may well have their own loops similar to the IP loop, with similar LOCAL clauses. Note that only the lower and upper bound of each subdomain is recorded; this allows different processors to process different-sized subdomains. However, each subdomain must "fit" into one processor's section of the X array.

Advice to implementors: The IP loop above is likely to be a common idiom among programmers writing block-structured codes. In general, it can be implemented by inverting the HOME clause as was done above. In the one-to-one case shown here (probably very popular with programmers), it can be implemented by assigning the processor id to the loop index variable and testing the range of the loop (once). End of advice to implementors.

Advice to programmers: Some compilers will propagate the ON information from the caller to the callee at compile time, and some at run time. Repeating the ON clause in both the caller and the callee will tend to give the compiler better information, resulting in better generated code. Again, note the usefulness of LOCAL clauses in giving the compiler information.
Few compilers would be able to unravel nontrivial assignments to ILO and IHI, and no current compiler would even attempt to understand the comments in the above code fragment. End of advice to programmers.

Because it is an assertion of fact, the compiler can draw many inferences from a single LOCAL clause. For example, consider the following case:

!HPF$ ALIGN Y(I) WITH X(I)
!HPF$ ALIGN Z(J) WITH X(J+1)
!HPF$ ON HOME( X(K) ), LOCAL( X(INDX(K)) )
      X(K) = X(INDX(K)) + Y(INDX(K)) + Z(INDX(K))

The compiler is justified in making the following assumptions in compiling the assignment statement (assuming it honors both the ALIGN directives and the ON directive):

* X(K) requires no communication (because of the HOME clause)
* X(INDX(K)) requires no communication (because of the LOCAL clause)
* Y(INDX(K)) requires no communication (because Y has the same mapping as X)

The compiler cannot make any assumption about INDX(K) or Z(INDX(K)) from the above code. There is no indication of how INDX is mapped relative to X, so the ON directive gives no guidance. Note that the fact that an expression (here, X(INDX(K))) is local does not imply that its subexpressions (here, INDX(K)) are also local. Similarly, Z's mapping does not determine whether Z(INDX(K)) would be local; it indicates that Z(INDX(K)-1) is local, but that is not a great help. If the compiler has additional information (for example, X is distributed BLOCK and INDX(K) is not near a block boundary), it might be able to make additional deductions.

Advice to implementors: One mark of a good compiler will be that it aggressively propagates LOCAL assertions. This is likely to significantly reduce communication costs. Note the cases under "Advice to users" below. End of advice to implementors.

Advice to users: One can expect compilers to differ in how aggressive they are in drawing these deductions. Higher-quality compilers will be able to identify more references as local, and use this information to eliminate data movement.
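The deductions above can be checked against a small model of the mappings (Python, illustrative only; the cell-mapping functions and the sample sizes n=8, p=4 are assumptions made for this sketch, not part of the example):

```python
def block_owner(cell, n, p):
    """Owner of template cell `cell` (1-based) under a BLOCK
    distribution of n cells over p processors."""
    bs = -(-n // p)               # ceil(n / p): the block size
    return (cell - 1) // bs + 1

# The alignments from the example, as maps onto X's cells:
def x_cell(i): return i           # X(I) occupies cell I
def y_cell(i): return i           # ALIGN Y(I) WITH X(I)
def z_cell(j): return j + 1       # ALIGN Z(J) WITH X(J+1)

n, p = 8, 4                       # assumed sizes: block size 2
# Y(i) is always co-located with X(i), so LOCAL(X(INDX(K)))
# implies Y(INDX(K)) is local too:
assert all(block_owner(y_cell(i), n, p) == block_owner(x_cell(i), n, p)
           for i in range(1, n + 1))
# But Z(4) lives with X(5), which sits on the next processor,
# so locality of X(INDX(K)) says nothing about Z(INDX(K)):
print(block_owner(x_cell(4), n, p), block_owner(z_cell(4), n, p))  # 2 3
```

The block boundary between X(4) and X(5) is exactly the "near a block boundary" case mentioned above: away from boundaries, Z's offset alignment happens to stay on-processor, but the compiler cannot assume it in general.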
All compilers should recognize that if an element of one array is local, then the same element of any other array with the same static mapping (i.e., arrays ALIGNed together, or with the same DISTRIBUTE pattern and array size) will also be local. That is, any compiler should recognize Y(INDX(K)) in the above example as local. Dynamically changing array mappings (i.e., REALIGN and REDISTRIBUTE) will tend to limit such information and its propagation. End of advice to users.

Another example of drawing further inferences from LOCAL clauses is the following:

!HPF$ ON HOME(PROCS(IP)), LOCAL( X(ILO(IP):IHI(IP)) )
      DO I = ILO(IP)+1, IHI(IP)-1
        X(I) = (X(I-1) + 2*X(I) + X(I+1)) / 4
      END DO

(Note that this is the construct form of the DO, so the ON clause applies to all three lines below it.) Here, I is always between ILO(IP) and IHI(IP). Therefore, the two references to X(I) require no communication, since they always refer to an element of the LOCAL array section. Slightly deeper reasoning reveals that the same is true of X(I-1) and X(I+1). In short, the entire DO loop can be executed without communication on PROCS(IP).

Advice to implementors: Analysis of ranges with symbolic bounds will pay big dividends for applications such as block-structured CFD codes. Of course, the expense of doing this analysis must be weighed against its benefits. See also the "Advice to programmers" below. End of advice to implementors.

Advice to programmers: As with the previous example, one can expect the capabilities of compilers to differ widely in making these inferences. Most compilers should recognize and use compile-time constants in LOCAL clauses and the executable code. Many compilers will extend this to variables that are invariant within the ON clause and will propagate the bounds of DO loops; others will use pattern recognition to achieve some of this effect. Such pattern recognition suffices to recognize X(I) in the example.
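The "slightly deeper reasoning" invoked above is ordinary interval arithmetic over the loop bounds. A sketch of the check a compiler would perform (Python, illustrative; the concrete bounds ilo=10, ihi=20 are assumed sample values standing in for the symbolic ILO(IP) and IHI(IP)):

```python
def shift(rng, offset):
    """Interval image of the subscript I+offset as I ranges over rng."""
    lo, hi = rng
    return (lo + offset, hi + offset)

def contained(rng, section):
    """True if every subscript in rng falls inside the LOCAL section."""
    return section[0] <= rng[0] and rng[1] <= section[1]

ilo, ihi = 10, 20                  # sample values for ILO(IP), IHI(IP)
local = (ilo, ihi)                 # LOCAL( X(ILO(IP):IHI(IP)) )
loop = (ilo + 1, ihi - 1)          # DO I = ILO(IP)+1, IHI(IP)-1
for off in (-1, 0, 1):             # X(I-1), X(I), X(I+1)
    assert contained(shift(loop, off), local)
print("all three references are provably local")
```

A compiler doing this symbolically rather than with concrete values needs exactly the invariance and bounds-propagation capabilities the advice describes; with concrete values the check is trivial, which is why compile-time constants are the easy case.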
Detailed reasoning about symbolic values is beyond the capabilities of most compilers. For example, determining the range of X(I-1) in the above example may require symbolic manipulation (or rather aggressive pattern matching). End of advice to programmers.

--------------------------------------------------------------------------------

[A related mail message from Marc Baber]

Hello, I've excerpted the following from APR's manuals for consideration as one possible form of an HPF 2.0 work-distribution directive. DO PAR can be interpreted as a dynamically balanced work-partitioning directive on shared-memory systems by ignoring the ON reference, or the ON reference can be used to align the DO with any array/dimension for static balancing and reduction of communication on distributed-memory systems. Though this syntax does not address some people's desire to specify processor ids in the ON clause, it does achieve a very natural extension of HPF to align DO loops with array distributions. Note that DO PAR is not interpreted to mean "run this loop parallel at any cost." DO PAR simply conveys the programmer's preference about which loop in a loop nest should be parallelized first. The IGNORE directive (also described below) is used to override the xHPF compiler's determinations about iteration dependence.

-Marc

APR Directive Descriptions

Select Loop Directive: CAPR$ DO PAR {ON reference }

Select the next loop for parallelization. Applies to both automatic and manual loop selection modes. The ON clause selects a distributed array as controlling the loop distribution. If there is no ON clause, xhpf will determine the best partitioned array reference within the loop to determine the distribution of the loop iterations.

* Applies only to the next upcoming DO loop.

The Controlling Array ( ON Clause )

Without the ON clause on a DO PAR directive, dpf will select a partitioned array reference within each loop to control the distribution of iterations over the processors.
A loop's iterations are distributed so that references to the control array occur on the processor that "owns" those elements of the partitioned array. dpf tries to choose an array that will result in the minimum communication when the loop is parallelized. (See the Chapter 2 discussion of loop distribution.)

The ON clause on a DO PAR directive indicates that the specified array reference is to control the distribution of the iterations of the loop. The ON clause reference syntax:

reference is array-name or array-name(subscripts) or array-name<subscripts>

Within subscripts, subscript expressions conforming to the array's subscripts appear:

: and * indicate an arbitrary subscript
initial-expr ~ incr-expr indicates an integer subscript with a constant increment ("CII")
initial-expr : final-expr {: incr-expr } indicates a complete range of values

(Each of the -exprs is an arithmetic expression that may include constants and parenthesized sub-expressions.) Because xhpf only considers a single dimension for array partitioning and loop distribution, only one subscript expression may appear in the ON clause array as a CII or constant; the other subscripts must be : or *.

Examples:

CAPR$ DO PAR ON QUOM<:,1~1>
CAPR$ DO PAR ON UR<:,:,j-1~1>
CAPR$ DO PAR ON WKW<:,:,1>

Don't Select Loop Directive: CAPR$ DO NOPAR

Indicates that the next loop should not be selected for parallelization.

Ignore Inhibitors or Communication

Directive: CAPR$ IGNORE ALL INHIBITORS

Indicates that the next loop, if marked for parallelization by a DO PAR directive, can be parallelized by ignoring any apparent inhibitors.

Directive: CAPR$ IGNORE type COM {ON reference }

Indicates that xhpf need not generate certain communication calls in the next loop marked by a DO PAR. type indicates the kind of communication being referenced. The ON clause further restricts the communication to specific arrays. If there is no ON clause, communication of type for all arrays in the loop is ignored for this loop.
IGNORE type:  referring to:
PRELOOP       pre-loop fetches of off-processor array elements
POSTLOOP      post-loop sends of computed off-processor elements
ALL           all and any of the above

* Any CAPR$ IGNORE directives must immediately follow a DO PAR directive.

IGNORE type COM ON Clause

Each array reference in a loop has the potential of generating data communication calls to fetch or store array elements. PRELOOP communications fetch to the processor all the array elements it uses from the processors that own them, while POSTLOOP communications distribute all the computed array elements back to the owning processors. The IGNORE type COM directive permits the user to optimize the performance of a parallelized loop by recognizing that certain communications are redundant and can be eliminated. The ON clause specifies which array reference's communication calls can be ignored. Unlike the DO PAR ON clause, the syntax of the ON clause on the IGNORE directive must indicate all the subscripting in order to match a particular array reference in the source code, or just the unsubscripted array name to select all references to that array.

From the point of view of the DO PAR directive, the array references in the following loops that can be used as the controlling array for distributing iterations of the loop are shown in comments. (X partitioned BLOCK, TEMP replicated.)

CAPR$ DO PAR
      DO 15 I = 1,N
      temp(I) = I       ! temp<1~1>
15    X(I) = temp(I)    ! X<1~1>

CAPR$ DO PAR
      DO 20 i = 1,N
      temp(I) = i*i
20    X(I) = x(i) + temp(I)

The array references that give rise to communication calls for loop 15 are:

postloop communication of TEMP<1~1>
postloop communication of X<1~1>

However, values set in temp by loop 15 are immediately replaced in loop 20, eliminating the need to bring all copies of temp up to date after loop 15.
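This redundancy can be demonstrated with a small simulation (Python, purely illustrative; two processors with owner-computes semantics and a per-processor replica of temp are assumptions of the sketch, and the function name run is invented):

```python
def run(n, p, sync_temp_after_15):
    """Simulate loops 15 and 20 with X BLOCK-partitioned over p
    processors and temp replicated, optionally performing the
    POSTLOOP broadcast of temp after loop 15."""
    bs = n // p                                 # assume p divides n
    X = [0] * (n + 1)                           # global view, 1-based
    temps = [[0] * (n + 1) for _ in range(p)]   # one replica per processor
    # Loop 15: each processor computes the iterations it owns.
    for q in range(p):
        for i in range(q * bs + 1, (q + 1) * bs + 1):
            temps[q][i] = i
            X[i] = temps[q][i]
    if sync_temp_after_15:                      # the candidate POSTLOOP comm
        full = [max(t[i] for t in temps) for i in range(n + 1)]
        temps = [list(full) for _ in range(p)]
    # Loop 20: every replica element read here was just written locally,
    # so the broadcast above cannot affect the result.
    for q in range(p):
        for i in range(q * bs + 1, (q + 1) * bs + 1):
            temps[q][i] = i * i
            X[i] = X[i] + temps[q][i]
    return X

print(run(8, 2, True) == run(8, 2, False))      # True: the sync is redundant
```

The equality holds precisely because loop 20 overwrites each temp element before reading it, which is the dataflow fact the IGNORE POSTLOOP directive lets the programmer assert.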
The postloop communication of temp can be ignored:

CAPR$ DO PAR
CAPR$ IGNORE POSTLOOP COM ON TEMP
      DO 15 I = 1,N
      temp(I) = I
15    X(I) = temp(I)

--------------------------------------------------------------------------------