From chk@erato.cs.rice.edu  Tue May  5 15:50:09 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA08963); Tue, 5 May 92 15:50:09 CDT
Received: from localhost.cs.rice.edu by erato.cs.rice.edu (AA08322); Tue, 5 May 92 15:50:08 CDT
Message-Id: <9205052050.AA08322@erato.cs.rice.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: chk@erato.cs.rice.edu
Word-Of-The-Day: subaltern : (n) a person holding a subordinate position
Subject: Welcome to the intrinsics mailing group
Date: Tue, 05 May 92 15:50:06 -0500
From: chk@erato.cs.rice.edu


Just a note to let you know that hpff-intrinsics@rice.edu is now on
the air.  This is the HPFF subgroup on new intrinsic functions,
convened by Rob Schreiber.

						Chuck

From schreibr@riacs.edu  Tue May  5 16:09:07 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA10152); Tue, 5 May 92 16:09:07 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA08371); Tue, 5 May 92 16:09:05 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA05337; Tue, 5 May 92 14:09:04 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA11151; Tue, 5 May 92 14:08:21 PDT
Message-Id: <9205052108.AA11151@thor.riacs.edu>
Date: Tue, 5 May 92 14:08:21 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: chk@cs.rice.edu, hpff-intrinsics@erato.cs.rice.edu
Subject: Re:  Welcome to the intrinsics mailing group
Cc: chk@erato.cs.rice.edu

Please send me proposals for intrinsics.   I have in mind the following
for starters:

Number_of_processors

Scans

Sorting

Send with combining operators


Distribution queries will be handled by Group 2/3, I hope.


-- Rob


From gls@think.com  Wed May  6 13:27:35 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA09666); Wed, 6 May 92 13:27:35 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08577); Wed, 6 May 92 13:27:32 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:27:25 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05258; Wed, 6 May 92 14:27:24 EDT
Date: Wed, 6 May 92 14:27:24 EDT
Message-Id: <9205061827.AA05258@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  POPCNT, POPPAR, LEADZ, and ILEN


Proposal for HPF intrinsics POPCNT, POPPAR, LEADZ, and ILEN

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


(a) An elemental population count intrinsic.  Its action on a scalar is:

  POPCNT(x) = COUNT( (/ (BTEST(x,J), J=0, BIT_SIZE(x)-1) /) )

The number of 1-bits in the integer x, according to the
bit-manipulation model in section 13.5.7 of the Fortran 90 standard.


(b) An elemental population-parity intrinsic.  Its action on a scalar is:

  POPPAR(x) = MERGE(1,0,BTEST(POPCNT(x),0))

The result is 1 if the number of 1-bits in the integer x is odd,
or 0 if the number of 1-bits in the integer x is even.
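
For concreteness, here is a sketch of the intended semantics in Python rather than Fortran, so the behavior can be checked directly; the 32-bit default width is an illustrative assumption, not part of the proposal.

```python
# Reference semantics for the proposed POPCNT and POPPAR intrinsics.
def popcnt(x, bit_size=32):
    # number of 1-bits in the bit_size-bit two's-complement form of x
    return bin(x & ((1 << bit_size) - 1)).count("1")

def poppar(x, bit_size=32):
    # 1 if the number of 1-bits in x is odd, 0 if it is even
    return popcnt(x, bit_size) & 1
```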


(c) An elemental count-leading-zeros intrinsic.  Its action on a scalar is:

  LEADZ(x) = MINVAL( (/ (J, J=0,BIT_SIZE(x)) /),
		MASK=(/ (BTEST(x,J), J=BIT_SIZE(x)-1,0,-1), .TRUE. /) )

The result is a count of the number of leading 0-bits in the integer
x, according to the bit-manipulation model in section 13.5.7 of the
Fortran 90 standard.

Note that a given integer value may produce different results from
LEADZ, depending on the number of bits in the representation of the
integer.  That is because bits are counted from the left (the most
significant bit).
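
A sketch of the same semantics in Python (the 32-bit default is again an illustrative assumption); note how the result for the value 1 depends on the representation width, as remarked above.

```python
# Count leading 0-bits from the most significant end down to the
# first 1-bit of a bit_size-bit representation.
def leadz(x, bit_size=32):
    x &= (1 << bit_size) - 1
    for j in range(bit_size - 1, -1, -1):
        if (x >> j) & 1:
            return bit_size - 1 - j
    return bit_size     # x == 0: every bit is a leading zero
```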


(d) An elemental integer-length intrinsic.  Its action on a scalar is:

  ILEN(x) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))
  ILEN(x) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))

This is related to LEADZ but is often much more convenient for
the calculation of array dimensions, etc.  It is the number of bits
required to store a 2's-complement signed integer x.  As examples of
its use,  2**ILEN(N-1)  rounds N up to a power of 2 (for N > 0),
whereas  2**(ILEN(N)-1)  rounds N down to a power of 2.

Note that a given integer value will always produce the same result
from ILEN, independent of the number of bits in the representation of
the integer.  That is because bits are counted from the right (the
least significant bit).

The definition of ILEN is equivalent to that of the built-in function
integer-length in Common Lisp, which has proven to be quite useful.
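
A Python sketch of the definition, using int.bit_length(): for x >= 0 that is exactly ceiling(log2(x+1)), and (-x-1).bit_length() gives ceiling(log2(-x)) for x < 0, matching Common Lisp's INTEGER-LENGTH.

```python
# ILEN: bits required to store x as a two's-complement signed integer.
def ilen(x):
    return (x if x >= 0 else -x - 1).bit_length()
```

The rounding idioms from the text work as advertised: for N = 5, 2**ilen(N-1) is 8 and 2**(ilen(N)-1) is 4.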


Issues: I hope I have defined POPCNT, POPPAR, and LEADZ consistently
with their use in Cray Fortran.  I believe that Cray Fortran allows
these intrinsics to be applied to data types other than integers;
should HPF allow this also?

From gls@think.com  Wed May  6 13:29:01 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA09680); Wed, 6 May 92 13:29:01 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08580); Wed, 6 May 92 13:28:56 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:28:52 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05271; Wed, 6 May 92 14:28:51 EDT
Date: Wed, 6 May 92 14:28:51 EDT
Message-Id: <9205061828.AA05271@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  extension to MINLOC and MAXLOC


Proposal for extension to MINLOC and MAXLOC for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


The MAXLOC and MINLOC intrinsics should have an optional DIM
argument.  If such an argument is present, then the shape of the
result equals the shape of the first argument with one dimension (that
indicated by the DIM argument) deleted; it is as if a series of
one-dimensional MAXLOC or MINLOC operations were performed.

Example: If A has the value

[  0  -5   8  -3  ]
[  3   4  -1   2  ]
[  1   5   6  -4  ]

then	MINLOC(A, DIM=1) has the value [ 1, 1, 2, 3 ]
	MAXLOC(A, DIM=1) has the value [ 2, 3, 1, 2 ]
	MINLOC(A, DIM=2) has the value [ 2, 3, 4 ]
	MAXLOC(A, DIM=2) has the value [ 3, 2, 3 ].
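
A pure-Python sketch of the proposed behavior for a rank-2 array (represented as a list of rows), with 1-based results; ties resolve to the lowest index, as in the rank-1 Fortran 90 intrinsics.

```python
# MINLOC/MAXLOC with a DIM argument, rank-2 case only.
# dim=1 reduces down each column; dim=2 reduces along each row.
def minloc(a, dim):
    if dim == 1:
        return [min(range(len(col)), key=lambda i: col[i]) + 1
                for col in zip(*a)]
    return [min(range(len(row)), key=lambda i: row[i]) + 1 for row in a]

def maxloc(a, dim):
    if dim == 1:
        return [max(range(len(col)), key=lambda i: col[i]) + 1
                for col in zip(*a)]
    return [max(range(len(row)), key=lambda i: row[i]) + 1 for row in a]
```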

From gls@think.com  Wed May  6 13:29:36 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA09699); Wed, 6 May 92 13:29:36 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08584); Wed, 6 May 92 13:29:31 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:29:30 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05287; Wed, 6 May 92 14:29:30 EDT
Date: Wed, 6 May 92 14:29:30 EDT
Message-Id: <9205061829.AA05287@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  reduction intrinsics

Proposal for reduction intrinsics for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


Just as we have the correspondences:

	operator/intrinsic	reduction intrinsic

		+			SUM
		*			PRODUCT
		.AND.			ALL
		.OR.			ANY
		MAX			MAXVAL
		MIN			MINVAL

it would be useful to have reduction versions of certain
other operators and intrinsics in the language that happen
to be associative and commutative:

				    proposed
	operator/intrinsic	reduction intrinsic

		IAND			AND
		IOR			OR
		IEOR			EOR
		.NEQV.			PARITY

Thus

	AND( (/ 7,3,10 /) )  yields 2
	 OR( (/ 7,3,10 /) )  yields 15
	EOR( (/ 7,3,10 /) )  yields 14

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

      PARITY( (/ T,F,F,T,T,F,F,F,T,T /) )  yields .TRUE.
      PARITY( (/ T,F,F,T,T,F,F,F,T,F /) )  yields .FALSE.
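
To pin down the intended results, here is a sketch of the four proposed reductions in Python (not HPF); reduce applies the operator left to right, which is harmless since all four operations are associative and commutative.

```python
from functools import reduce
from operator import and_, or_, xor

def and_reduce(xs):  # proposed AND, reducing with IAND
    return reduce(and_, xs)

def or_reduce(xs):   # proposed OR, reducing with IOR
    return reduce(or_, xs)

def eor_reduce(xs):  # proposed EOR, reducing with IEOR
    return reduce(xor, xs)

def parity(xs):      # proposed PARITY, reducing with .NEQV.
    return reduce(lambda a, b: a != b, xs)
```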

Some of these are particularly valuable if corresponding
parallel-prefix intrinsics are also defined (see separate proposal).

From gls@think.com  Wed May  6 13:45:05 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA10301); Wed, 6 May 92 13:45:05 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08590); Wed, 6 May 92 13:45:01 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:44:59 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05642; Wed, 6 May 92 14:44:58 EDT
Date: Wed, 6 May 92 14:44:58 EDT
Message-Id: <9205061844.AA05642@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  parallel prefix intrinsics

Proposal for parallel prefix intrinsics for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


For every reduction operation XXX in the language, introduce two new
intrinsics XXX_PREFIX and XXX_SUFFIX.  They take the same arguments
as the corresponding reduction intrinsic, plus two additional
optional arguments:

	XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

The first additional optional argument is called SEGMENT, which is of
type logical and conformable with the ARRAY argument (a TRUE element
indicates the start of a new segment of the first argument, that is,
a place where the running accumulation is to be reset before
processing the corresponding array element).

The second additional optional argument, a scalar logical, is called
EXCLUSIVE, default .FALSE., which determines whether the prefix or
suffix operation is inclusive (the default) or exclusive.  (The
inclusive sum-prefix of (/ 1,2,3,4 /) is (/ 1,3,6,10 /) whereas the
exclusive sum-prefix is (/ 0,1,3,6 /).)

Array elements corresponding to positions where the MASK is false
do not contribute to the running accumulation.  However, the result
is still defined for corresponding positions in the result.
In actual practice, results may not be required in those positions;
in such cases the programmer may be able to use the WHERE statement
to give the compiler a strong hint:

      WHERE (FOO) A=SUM_PREFIX(B,MASK=FOO)

If the DIM argument is omitted, then the arrays are processed in
array element order ("column-major"), as if temporarily regarded as
one-dimensional.

In all cases the result has the same shape as the first argument.

In addition, the operation COPY_PREFIX replicates the first
(lowest-indexed) element of each segment throughout the segment, and
the operation COPY_SUFFIX replicates the last (highest-indexed)
element of each segment throughout the segment.

Examples:

SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)
SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) )              yields (/1,1,1,2,3,4,4,5,5/)
COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) yields (/0,1,1,1,2,3,4,4,5/)

SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/),
    SEGMENT=(/T,F,F,F,T,F,T,T,F/)) yields (/1,3,6,10,5,11,7,8,17/)
              ------- --- - ---             -------- ---  - ----
	     four input segments       four independent result segments

COPY_PREFIX( (/1,2,3,4,5,6,7,8,9/),
     SEGMENT=(/T,F,F,F,T,F,T,T,F/)) yields (/1,1,1,1,5,5,7,8,8/)
               ------- --- - ---             ------- --- - ---
	      four input segments       four independent result segments
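
A sketch of the proposed segmented-scan semantics in Python (rank-1 case only; DIM and MASK omitted for brevity). COUNT_PREFIX behaves like this function applied to 0/1 values.

```python
# SUM_PREFIX with optional SEGMENT and EXCLUSIVE arguments.
def sum_prefix(array, segment=None, exclusive=False):
    # A True element of SEGMENT resets the running accumulation
    # before the corresponding array element is processed.
    out, acc = [], 0
    for j, x in enumerate(array):
        if segment is not None and segment[j]:
            acc = 0
        out.append(acc if exclusive else acc + x)
        acc += x
    return out
```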


Outstanding issues: This proposal delimits the segments by indicating
the *start* of each segment.  Cray MPP Fortran delimits the segments
by indicating the *stop* of each segment.  Each method has its advantages.
There is also the question of whether this convention should change when
performing a suffix rather than a prefix.

Another way to delimit segments is to use a logical vector and say
that a new segment begins at every *transition* from false to true or
true to false; thus a segment is indicated by a maximal contiguous
subsequence of like logical values:

	(/T,T,T,F,T,F,F,F,T,F,F,T/)
          ----- - - ----- - --- -    seven segments

The main advantages of this representation are:

(a) It is symmetrical, in that the same segment specifier may
    be meaningfully used for parallel prefix and parallel suffix
    without changing its interpretation (start versus stop).

(b) It seems to be equally inconvenient for every existing
    architecture.  :-)  However, it is not that hard to accommodate.

(c) The start-bit or stop-bit representation is easily converted
    to this form by using a parallel XOR prefix or suffix.
    Of course, we would need to define one (see separate proposal
    for a PARITY reduction intrinsic).  Examples:

    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))

    These might be standard idioms for a compiler to recognize.
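
    As a concrete check of item (c), here is a small Python sketch of an
    inclusive XOR prefix applied to a start-bit segment specifier.

```python
# Running exclusive-or: converts a start-bit segment specifier into
# the transition-based form described above.
def parity_prefix(bits):
    out, acc = [], False
    for b in bits:
        acc = acc != b      # exclusive or
        out.append(acc)
    return out
```

    A new segment then begins wherever the value differs from its
    predecessor (or at the first element), which recovers exactly the
    original start bits.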

From gls@think.com  Wed May  6 14:44:16 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA12314); Wed, 6 May 92 14:44:16 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08601); Wed, 6 May 92 14:44:11 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 15:44:07 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA06540; Wed, 6 May 92 15:44:06 EDT
Date: Wed, 6 May 92 15:44:06 EDT
Message-Id: <9205061944.AA06540@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  sorting intrinsics

Proposal for HPF sorting intrinsics

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


The ideas and names here are inspired by APL.  I have used the
term "grade" rather than "rank" because the latter is already used
in the Fortran 90 standard to mean the size of the shape of an array
(that is, the number of dimensions).


GRADE_UP(ARRAY,DIM)

The array may be of type integer, real, or character.  [Alternate spec:
the array may be of any type for which the operator .LT. has been defined?]

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

	B(i1,i2,...,ik,...in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)

then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in)
is sorted in ascending order.

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape [SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY))]
and the property that if one computes the rank-1 array

	B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))

where n=SIZE(SHAPE(ARRAY)), then B is sorted in ascending order.

Question: should stability be guaranteed?


GRADE_DOWN(ARRAY,DIM)

Same as GRADE_UP, with "ascending" replaced by "descending".
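
A rough Python model of GRADE_UP for the rank-1, DIM-absent case: a 1-based permutation that sorts the vector ascending. Python's sort happens to be stable, which answers the stability question one way; the proposal leaves it open.

```python
# GRADE_UP, rank-1 sketch: 1-based sorting permutation.
def grade_up(vec):
    return [i + 1 for i in sorted(range(len(vec)), key=lambda i: vec[i])]
```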


----------------------------------------------------------------

An alternate approach:

First define the utility intrinsic

INDEX(ARRAY,SUBS)

where SIZE(SUBS,DIM=1) = SIZE(SHAPE(ARRAY)).  If S = SHAPE(ARRAY),
then S(2:) is the shape of the result R, which has the property that

	R(i1,i2,...,in) = ARRAY(SUBS(1,i1,i2,...,in),
                                SUBS(2,i1,i2,...,in),
                                ...
                                SUBS(k,i1,i2,...,in))

where k = SIZE(SHAPE(ARRAY)) and n = SIZE(SHAPE(SUBS))-1.


GRADE(ARRAY1,ARRAY2,...)

Arguments ARRAY2,... are optional.  All arrays must be conformable.
The arrays must be of type integer, real, or character.  [Alternate spec:
the array may be of any type for which the operator .LT. has been defined?]
The arrays need not all be of the same type.

The result S is an array of rank 2, with shape [SIZE(SHAPE(ARRAY1)),
PRODUCT(SHAPE(ARRAY1))], and the property that if j < k then

ARRAY1(S(1,j),...,S(n,j)) .LT. ARRAY1(S(1,k),...,S(n,k))  or
( ARRAY1(S(1,j),...,S(n,j)) .EQ. ARRAY1(S(1,k),...,S(n,k))  and
  ( ARRAY2(S(1,j),...,S(n,j)) .LT. ARRAY2(S(1,k),...,S(n,k))  or
    ( ARRAY2(S(1,j),...,S(n,j)) .EQ. ARRAY2(S(1,k),...,S(n,k))  and
      ( ...

          ARRAYn(S(1,j),...,S(n,j)) .LT.  ARRAYn(S(1,k),...,S(n,k))  or
          ( ARRAYn(S(1,j),...,S(n,j)) .EQ. ARRAYn(S(1,k),...,S(n,k))  )
        ...
      )
    )
  )
)

which can also be written

INDEX(ARRAY1,S(:,j)) .LT. INDEX(ARRAY1,S(:,k))  or
( INDEX(ARRAY1,S(:,j)) .EQ. INDEX(ARRAY1,S(:,k))  and
  ( INDEX(ARRAY2,S(:,j)) .LT. INDEX(ARRAY2,S(:,k))  or
    ( INDEX(ARRAY2,S(:,j)) .EQ. INDEX(ARRAY2,S(:,k))  and
      ( ...

          INDEX(ARRAYn,S(:,j)) .LT. INDEX(ARRAYn,S(:,k))  or
          ( INDEX(ARRAYn,S(:,j)) .EQ. INDEX(ARRAYn,S(:,k))  )
        ...
      )
    )
  )
)

That is, the array arguments are treated as sort fields, with the first
argument most significant (major) and the last argument least significant
(minor).  The result gives a set of indices that can be used to
permute the arrays into a collectively sorted (ascending) order.

For example, suppose one had the following derived type (example
taken from section 4.4.1 of the Fortran 90 standard):

      TYPE PERSON
        INTEGER AGE
        CHARACTER (LEN = 50) NAME
      END TYPE PERSON

now consider two arrays of persons:

      TYPE(PERSON), DIMENSION(100000) :: MEMBERS, ROSTER

then the statement

      ROSTER = INDEX(MEMBERS,GRADE(MEMBERS%NAME,MEMBERS%AGE))

causes ROSTER to be a rearrangement of MEMBERS that is sorted
primarily by name and secondarily by age (that is, members with
the same name are grouped together in order of ascending age).
To list members with the same name in descending order of age,
the following trick more or less works:

      ROSTER = INDEX(MEMBERS,GRADE(MEMBERS%NAME,-MEMBERS%AGE))

though this is not completely general.
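
A Python sketch of the multi-key GRADE for rank-1 arguments: the index permutation that sorts all the key arrays collectively, first key most significant. Python tuples compare lexicographically, which is exactly the nested .LT./.EQ. condition spelled out above.

```python
# GRADE, rank-1 sketch: 1-based permutation over multiple sort keys.
def grade(*arrays):
    n = len(arrays[0])
    return [i + 1 for i in
            sorted(range(n), key=lambda i: tuple(a[i] for a in arrays))]
```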

From gls@think.com  Wed May  6 17:03:02 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA19514); Wed, 6 May 92 17:03:02 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08652); Wed, 6 May 92 17:02:59 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 18:02:56 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA08781; Wed, 6 May 92 18:02:56 EDT
Date: Wed, 6 May 92 18:02:56 EDT
Message-Id: <9205062202.AA08781@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  combining-send intrinsics


Proposal for HPF combining-send intrinsics

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992

For every reduction operation XXX in the language, introduce a new
intrinsic subroutine XXX_SEND:

   XXX_SEND(SOURCE,DEST,IDX1,...)

Arguments IDX1,... are optional.  The number of IDX arguments
must equal the rank of DEST.  The SOURCE and all the IDX arguments
must be conformable.

For every element s in SOURCE, the corresponding elements ij of IDXj
are used to carry out the operation

	DEST(i1,i2,...,in) = XXX_operation(DEST(i1,i2,...,in), s)

and all such operations performed by a single call are done *as if
serially* in *some* (processor-dependent) order for each element s.
Thus the call

      CALL SUM_SEND(SOURCE,DEST,IDX1,IDX2,...,IDXn)

*could* be implemented as

      DO J1=LBOUND(SOURCE,1),UBOUND(SOURCE,1)
        DO J2=LBOUND(SOURCE,2),UBOUND(SOURCE,2)
          ...
            DO Jk=LBOUND(SOURCE,k),UBOUND(SOURCE,k)
              DEST(IDX1(J1,J2,...,Jk),
     &             IDX2(J1,J2,...,Jk),
     &             ...
     &             IDXn(J1,J2,...,Jk)) =
     &        DEST(IDX1(J1,J2,...,Jk),
     &             IDX2(J1,J2,...,Jk),
     &             ...
     &             IDXn(J1,J2,...,Jk)) + SOURCE(J1,J2,...,Jk)
            END DO
          ...
        END DO
      END DO

where k is the rank of SOURCE.  (However, this nest of DO loops
makes a greater commitment to the particular order in which the
combining operations are carried out than the order--namely, none!--
guaranteed by the XXX_SEND intrinsic.  This matters when the
combining operation is not both associative and commutative,
for example floating-point addition.)

Example:  The C* operation

        x[v] += a;

where x, v, and a are all parallel arrays, and a and v conform,
may be rendered

      CALL SUM_SEND(A,X,V)

If all elements of V were distinct, one could write this in
Fortran 90 as

      X(V) = X(V) + A

The proposed intrinsic SUM_SEND "works" even if V contains
duplicate values.  Note that the two-dimensional case

      X(V,W) = X(V,W) + A

must be rendered using SPREAD:

      CALL SUM_SEND(A,X,SPREAD(V,DIM=2,NCOPIES=SIZE(X,2)),
     &                  SPREAD(W,DIM=1,NCOPIES=SIZE(X,1)))

in order to duplicate the cross-product effect of ordinary array
subscripting.  (I chose to propose a definition of XXX_SEND that does
*not* perform such a cross product of indices because it is more
general and in practice more useful without the cross-product effect
built in.)
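
A sketch of SUM_SEND for rank-1 SOURCE and DEST (1-based IDX) in Python, showing why duplicate index values are well-defined: each contribution is combined in turn, in some serial order.

```python
# Combining send, rank-1 case: dest[idx[j]] += source[j] for each j.
def sum_send(source, dest, idx):
    for j, s in enumerate(source):
        dest[idx[j] - 1] += s
```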

From schreibr@riacs.edu  Thu May  7 11:45:16 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA12191); Thu, 7 May 92 11:45:16 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA08814); Thu, 7 May 92 11:45:13 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA10520; Thu, 7 May 92 09:45:12 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA00305; Thu, 7 May 92 09:44:27 PDT
Message-Id: <9205071644.AA00305@thor.riacs.edu>
Date: Thu, 7 May 92 09:44:27 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Number of Processors


Peter Highnam sent this to me some time ago:

	The subroutine/distribute group, meeting on the second
	day of the April committee meeting worried that the
	properties of N$PROC may not match those of any other
	F90 entity.  The matter was pushed to the intrinsics group
	for resolution.  I've appended a list of points for discussion.

	a. This entity provides the maximum number of processors
	   ("processor" is currently an implementation-dependent term) over
	   which a template can be distributed.  The value of the 
	   entity is not necessarily known at compilation.  It is,
	   naturally, positive-integer-valued. This is the definition from 
	   the Align and Distribute Proposal that GLS prepared, dated 4/22/92.

	b. Is it an intrinsic function or an "environment variable"?
	   The latter term doesn't seem to have any standing in F90,
	   which would make it an intrinsic function...

It seems quite like the inquiry functions PRECISION, RADIX, RANGE.  These
return values dependent on the machine that the code runs on and are not
defined at compile time.   They are restricted expressions.

	c. This entity can be used in places where a constant is generally
	   required, such as array declarations, parameter statements, .. 
	   even though it is not necessarily known until runtime.  Can F90
	   intrinsics be used this way ?  If not, what are our options ?

Yes, they can.   As an integer-valued scalar restricted expression, it
becomes a specification expression, so it can be used in bounds in 
array declarations, etc.

	d. The entity can optionally be defined at compilation (e.g., 
	   compile-line flag).  This would provide a compiler
	   with more info for optimization purposes.

But it should have no semantic effect, and should not change the places where
it is allowed to include places where only constants are valid.

	e. Naming conventions.  "$" is a legal character (even in
	   F77) but not in variable names.  The "N$PROC" label was
	   simply inherited from an early proposal (Fortran D?) and
	   has the singular advantage of not interfering with the
	   programmer's name space (she cannot use a "$" in a name).
	   Retain this form ?  Change ?

I don't see any reason to introduce a $ into this.   NUMBER_OF_PROCESSORS
is better.

I also propose that this be part of the "minimum subset".   That forces
31 character names into the MS as well!

---   Rob


From zrlp09@stoy.msc.edu  Fri May  8 16:45:18 1992
Received: from noc.msc.edu by cs.rice.edu (AA13443); Fri, 8 May 92 16:45:18 CDT
Received: from uc.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA04977; Fri, 8 May 92 16:45:17 -0500
Received: from [129.230.11.2] by uc.msc.edu (5.65/MSC/v3.0z(901212))
	id AA06830; Fri, 8 May 92 16:45:16 -0500
Received: from trc.amoco.com (apctrc.trc.amoco.com) by netserv2 (4.1/SMI-4.0)
	id AA16500; Fri, 8 May 92 16:45:11 CDT
Received: from stoy.trc.amoco.com by trc.amoco.com (4.1/SMI-4.1)
	id AA16134; Fri, 8 May 92 16:45:08 CDT
Received: from localhost by stoy.trc.amoco.com (4.1/SMI-4.1)
	id AA00260; Fri, 8 May 92 16:45:06 CDT
Message-Id: <9205082145.AA00260@stoy.trc.amoco.com>
To: hpff-intrinsics@cs.rice.edu
Subject: Criterion for intrinsics: F90 defined functions
Date: Fri, 08 May 92 16:45:05 -0500
From: "Rex Page" <zrlp09@stoy.msc.edu>

To me it seems imprudent to define extensions to Fortran 90 in HPF
because it circumvents the usual sources of new features of
Fortran (ISO and ANSI) and could lead to eventual conflicts between
HPF and standard Fortran.  I hope we'll be able to do everything
within the language (e.g., as comment-style directives).

The FORALL extension will, it appears, violate this principle, but
maybe that will be the only exception.

Intrinsic functions need not go beyond Fortran 90.  We just need
to take care to define them within the existing Fortran 90 framework,
potentially implementable as part of an HPF module.  Vendors might
not implement them in this way, but HPFF can define them so that
such an implementation is possible.

CRITERION: All HPF intrinsics should be implementable as defined
functions in Fortran 90.

All of the intrinsics proposed so far meet this criterion, I think,
but we need to be a little careful in how we describe them.

For example, MAXLOC with a DIM parameter (Steele) could be
implemented as an overload of the F90 intrinsic MAXLOC.  There
would be a definition in the HPF module of 7 MAXLOC functions,
one for each possible rank of the array argument.  The programmer
gets the same facility that an intrinsic F90 MAXLOC with an
optional parameter would provide, but HPFF would define MAXLOC
as a generic function (in F90 terms) with 7 incarnations.
Technically, the DIM argument would not be optional in these
incarnations, but since the F90 intrinsic MAXLOC could be invoked
by leaving off the DIM argument, it would behave, from the source
code standpoint, as an optional argument.

Similarly, the proposed bit manipulation functions (POPCNT etc.)
would not be "elemental" in a technical F90 sense, but would be
generic F90 functions with definitions for each possible rank of
the argument (rank=0 through rank=7).  This gives the effect of
elemental functions and avoids extending F90.

Rex Page

From loveman@ftn90.enet.dec.com  Mon May 11 08:44:38 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA19735); Mon, 11 May 92 08:44:38 CDT
Received: from enet-gw.pa.dec.com by erato.cs.rice.edu (AA09210); Mon, 11 May 92 08:44:36 CDT
Received: by enet-gw.pa.dec.com; id AA01030; Mon, 11 May 92 06:44:31 -0700
Message-Id: <9205111344.AA01030@enet-gw.pa.dec.com>
Received: from ftn90.enet; by decwrl.enet; Mon, 11 May 92 06:44:32 PDT
Date: Mon, 11 May 92 06:44:32 PDT
From: David Loveman <loveman@ftn90.enet.dec.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: loveman@ftn90.enet.dec.com
Apparently-To: hpff-intrinsics@erato.cs.rice.edu
Subject: reprint of Digital's NUMBER_OF_PROCESSORS proposal


Attached is a reprint from the HPFF January meeting of Digital's
proposal for the introduction of a new class of intrinsic functions
(system inquiry functions), the specific function NUMBER_OF_PROCESSORS,
and the appropriate modification to the Fortran 90 definition of
restricted expression.

-David
loveman@mpsg.enet.dec.com

------------------------------------------------------------------------

New Intrinsic Function

In addition to the intrinsic functions of Fortran 90, High Performance
Fortran has a new intrinsic function, NUMBER_OF_PROCESSORS, which takes
no arguments and returns an integer value giving the number of
processors in the system. This number is expected to remain constant
for (at least) the duration of one program execution. Accordingly,
NUMBER_OF_PROCESSORS() is a constant expression and can be used
wherever any other Fortran 90 constant expression can be used. In
particular, NUMBER_OF_PROCESSORS can be used in an initialization
expression or in a specification expression.  None of the categories of
intrinsic functions listed in Chapter 13 of the Fortran 90 standard
seem quite apt to describe the nature of this new intrinsic function,
so we add a new category of "system inquiry functions" and place
NUMBER_OF_PROCESSORS in that category.

Note that treating NUMBER_OF_PROCESSORS as a constant expression does
not force a compiler to bind the number of processors at compile time
(although that is one possible implementation) -- with the right linker
or code-generation technology the choice could be deferred until run
time, possibly at some performance cost.

A definition of NUMBER_OF_PROCESSORS, in the style of Chapter 13 of the
Fortran 90 standard is:

13.13.77a  NUMBER_OF_PROCESSORS(DIM)

Optional Argument.  DIM

Description.  Returns the total number of processors available to the
program, the dimensionality of the processor array, or the number of
processors available to the program along a specified dimension of the
processor array.

Class.  System inquiry function.

Arguments.
DIM (optional)  must be scalar and of type integer with a value in the
range 0<=DIM<=n where n is the rank of the processor array.

Result Type, Type Parameter, and Shape.  Default integer scalar.

Result Value.  The result has a value equal to the rank of the
processor array if the value of DIM is 0; the extent of dimension DIM
(1<=DIM<=n, where n is the rank of the processor array) of the
processor-dependent hardware processor array; or, if DIM is absent, the
total number of elements, equal to or greater than one, of the
processor-dependent hardware processor array.

The value of NUMBER_OF_PROCESSORS() need not be a constant, if the
processor allows for a variable number of processors to execute the
program.  However, the value must not change during the execution of the program.

Example. For a DECmpp 12000 Model 8B with 8192 processors, the value of
NUMBER_OF_PROCESSORS( ) is 8192, the value of
NUMBER_OF_PROCESSORS(DIM=0) is 2, the value of
NUMBER_OF_PROCESSORS(DIM=1) is 128, and the value of
NUMBER_OF_PROCESSORS(DIM=2) is 64.
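
A toy Python model of the inquiry function, purely illustrative and not part of the proposal; PROC_SHAPE stands in for the processor-dependent hardware processor array, using the (128, 64) grid of the DECmpp example.

```python
# Stand-in for the processor-dependent hardware processor array shape.
PROC_SHAPE = (128, 64)

def number_of_processors(dim=None):
    if dim is None:                 # total number of processors
        n = 1
        for extent in PROC_SHAPE:
            n *= extent
        return n
    if dim == 0:                    # rank of the processor array
        return len(PROC_SHAPE)
    return PROC_SHAPE[dim - 1]      # extent along dimension dim
```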

The list of alternatives in a Fortran 90 restricted expression is
expanded to include NUMBER_OF_PROCESSORS, as follows:

A  *restricted expression* is an expression in which each operation is
intrinsic and each primary is:

1.   A constant or subobject of a constant,

2.   A variable that is a dummy argument that has neither the OPTIONAL
nor the INTENT (OUT) attribute, or a variable that is a subobject of such a dummy
argument,

3.   A variable that is in a common block or a variable that is a
subobject of a variable in a common block,

4.   A variable that is made accessible by use association or host
association or a variable that is a subobject of such a variable,

5.   An array constructor where each element and the bounds and strides
of each implied-DO are expressions whose primaries are either restricted expressions
or implied-DO variables, 

6.   A structure constructor where each component is a restricted expression,

7.   An elemental intrinsic function reference of type integer or
character where each argument is a restricted expression of type
integer or character,

8.   One of the transformational functions REPEAT, RESHAPE,
SELECTED_INT_KIND, SELECTED_REAL_KIND, TRANSFER, and TRIM, where each
argument is a restricted expression of type integer or character,

9.   A reference to an array inquiry function (13.10.15) other than
ALLOCATED, the bit inquiry function BIT_SIZE, the character inquiry
function LEN, the kind inquiry function KIND, or a numeric inquiry
function (13.10.8), where each argument is either a restricted
expression or a variable whose type parameters or bounds inquired about
are not assumed or defined by an ALLOCATE statement or a pointer
assignment, or

10.   A restricted expression enclosed in parentheses, or

11.   The HPF system inquiry function NUMBER_OF_PROCESSORS.

From gls@think.com  Mon May 11 09:32:31 1992
Received: from mail.think.com by cs.rice.edu (AA20866); Mon, 11 May 92 09:32:31 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Mon, 11 May 92 10:32:27 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA07320; Mon, 11 May 92 10:32:26 EDT
Date: Mon, 11 May 92 10:32:26 EDT
Message-Id: <9205111432.AA07320@strident.think.com>
To: zrlp09@stoy.msc.edu
Cc: hpff-intrinsics@cs.rice.edu
In-Reply-To: "Rex Page"'s message of Fri, 08 May 92 16:45:05 -0500 <9205082145.AA00260@stoy.trc.amoco.com>
Subject: Criterion for intrinsics: F90 defined functions


I am in complete agreement with the following message
with one very important exception.

   Date: Fri, 08 May 92 16:45:05 -0500
   From: "Rex Page" <zrlp09@stoy.msc.edu>

   To me it seems imprudent to define extensions to Fortran 90 in HPF
   because it circumvents the usual sources of new features of
   Fortran (ISO and ANSI) and could lead to eventual conflicts between
   HPF and standard Fortran.  I hope we'll be able to do everything
   within the language (e.g., as comment-style directives).

I will take strong exception to the remark about "the usual sources".
While I agree that, as a matter of fact, ANSI and ISO committees have
been sources of features, as a matter of principle they are not
supposed to be--certainly they are not intended to be the *sole*
source of new design!  On the contrary, HPF is exactly the kind of
industry activity that ought to provide the first-level experiments
from which grist may be chosen in future for the X3J3 mill.

"Standardization" is not the same thing as "invention".  The
theoretical purpose of an ANSI committee is to standardize existing
practice, not to invent new practice.  Such committees do find
themselves forced into invention for various reasons (to resolve
inconsistencies, enable portability, achieve consensus, etc.), but
that should not be their primary purpose.

Moreover, I have had exchanges with long-time members of X3J3 of the
general form: "Why is this restriction in the Fortran 90 standard?"
"Just to avoid complexity for now, but we strongly encourage
companies like yours to explore extensions in this direction".

So I think HPF need not be afraid to explore extensions to the
language for fear of offending X3J3 or ANSI!  Of course, I would also
recommend that we be conservative; we should not add features
frivolously or gratuitously, but only to meet some perceived need,
consistent with our overall goals.

For example, I do not necessarily think we should accept all of the
intrinsics proposals I have recently generated.  I think some meet
real needs; others I created because other people had expressed an
interest and no other proposals had been forthcoming on the mailing
list yet (and I am fairly practiced at spewing forth first drafts of
this kind of text).  (On the other hand, I don't think any of these
proposals is gratuitous; Thinking Machines will be providing *some*
form of that functionality in each case--the question is whether HPF
wishes to require it of, or recommend it for, all HPF
implementations.)  So now we have some material to attract pot-shots
and, preferably, counterproposals.

   The FORALL extension will, it appears, violate this principle, but
   maybe that will be the only exception.

   Intrinsic functions need not go beyond Fortran 90.  We just need
   to take care to define them within the existing Fortran 90 framework,
   potentially implementable as part of an HPF module.  Vendors might
   not implement them in this way, but HPFF can define them so that
   such an implementation is possible.

   CRITERION: All HPF intrinsics should be implementable as defined
   functions in Fortran 90.

This is an excellent goal.

   All of the intrinsics proposed so far meet this criterion, I think,
   but we need to be a little careful in how we describe them.

   For example, MAXLOC with a DIM parameter (Steele) could be
   implemented as an overload of the F90 intrinsic MAXLOC.  There
   would be a definition in the HPF module of 7 MAXLOC functions,
   one for each possible rank of the array argument.  The programmer
   gets the same facility that an intrinsic F90 MAXLOC with an
   optional parameter would provide, but HPFF would define MAXLOC
   as a generic function (in F90 terms) with 7 incarnations.
   Technically, the DIM argument would not be optional in these
   incarnations, but since the F90 intrinsic MAXLOC could be invoked
   by leaving off the DIM argument, it would behave, from the source
   code standpoint, as an optional argument.

   Similarly, proposed bit manipulation functions (POPCNT etc.)
   would not be "elemental" in a technical F90 sense, but would be
   generic F90 functions with definitions for each possible rank of
   the argument (rank=0 through rank=7).  This gives the effect of
   elemental functions and avoids extending F90.

All these points are well taken.

--Guy

From gls@think.com  Mon May 11 11:13:41 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA24700); Mon, 11 May 92 11:13:41 CDT
Received: from mail.think.com by erato.cs.rice.edu (AA09397); Mon, 11 May 92 11:13:36 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Mon, 11 May 92 12:13:36 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA07884; Mon, 11 May 92 12:13:35 EDT
Date: Mon, 11 May 92 12:13:35 EDT
Message-Id: <9205111613.AA07884@strident.think.com>
To: loveman@ftn90.enet.dec.com
Cc: hpff-intrinsics@erato.cs.rice.edu
In-Reply-To: David Loveman's message of Mon, 11 May 92 06:44:32 PDT <9205111344.AA01030@enet-gw.pa.dec.com>
Subject: reprint of Digital's NUMBER_OF_PROCESSORS proposal


Thanks for retransmitting the NUMBER_OF_PROCESSORS text, David.
I have two suggestions:


[a] Using DIM=0 to convey rank information is one of those
clever-but-strange encoding tricks.  Inasmuch as there is no
precedent for it already in Fortran 90, I recommend considering a
separate intrinsic that would be analogous to something already in
Fortran 90, to wit, PROCESSORS_SHAPE(), returning a rank-1 vector.
Thus SIZE(PROCESSORS_SHAPE()) is the rank of the processor array.

Example.  For a DECmpp 12000 Model 8B with 8192 processors, the value of
PROCESSORS_SHAPE() is (/ 128, 64 /).

Example.  For a Connection Machine CM-2 with 8192 processors, the value of
PROCESSORS_SHAPE() might be (/ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 /).

Example.  For a Connection Machine CM-5 with 8192 processors, the value of
PROCESSORS_SHAPE() might be (/ 8192 /).
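
The invariants tying these three examples together can be sketched
executably (Python, purely as illustration; "number_of_processors"
is a stand-in for the proposed intrinsic, not its real spelling):

```python
from math import prod  # Python 3.8+

def number_of_processors(shape, dim=None):
    """Sketch: product of the processor shape, or one extent of it."""
    return shape[dim - 1] if dim is not None else prod(shape)

decmpp = [128, 64]   # DECmpp 12000 Model 8B
cm2    = [2] * 13    # Connection Machine CM-2
cm5    = [8192]      # Connection Machine CM-5

# All three shapes describe 8192 processors; the machine rank is
# SIZE(PROCESSORS_SHAPE()), i.e. the length of the shape vector.
assert all(number_of_processors(s) == 8192 for s in (decmpp, cm2, cm5))
assert (len(decmpp), len(cm2), len(cm5)) == (2, 13, 1)
```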


[b] Perhaps NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE should take
another optional argument, the name of a processors arrangement
as declared in a PROCESSORS directive:

	NUMBER_OF_PROCESSORS(procs, dim)
	PROCESSORS_SHAPE(procs)

If "procs" is omitted, then the information returned concerns the
"natural" processors arrangement of the hardware; if included,
then it serves as an inquiry intrinsic for the declared processors
arrangement.  Perhaps this latter form should be restricted to use
within HPF directives (inasmuch as the referenced name is defined
by an HPF directive).

Example.

!HPF$ PROCESSORS FOO(40,80)
!HPF$ ... PROCESSORS_SHAPE(FOO) ...           ! (/ 40, 80 /)
!HPF$ ... NUMBER_OF_PROCESSORS(FOO) ...       ! 3200

From zrlp09@stoy.msc.edu  Mon May 11 12:23:00 1992
Received: from noc.msc.edu by cs.rice.edu (AA26911); Mon, 11 May 92 12:23:00 CDT
Received: from uc.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA03167; Mon, 11 May 92 12:22:57 -0500
Received: from [129.230.11.2] by uc.msc.edu (5.65/MSC/v3.0z(901212))
	id AA10206; Mon, 11 May 92 12:22:55 -0500
Received: from trc.amoco.com (apctrc.trc.amoco.com) by netserv2 (4.1/SMI-4.0)
	id AA27545; Mon, 11 May 92 12:22:51 CDT
Received: from stoy.trc.amoco.com by trc.amoco.com (4.1/SMI-4.1)
	id AA29307; Mon, 11 May 92 12:22:48 CDT
Received: from localhost by stoy.trc.amoco.com (4.1/SMI-4.1)
	id AA10458; Mon, 11 May 92 12:22:48 CDT
Message-Id: <9205111722.AA10458@stoy.trc.amoco.com>
To: Guy Steele <gls@think.com>
Cc: hpff-intrinsics@cs.rice.edu
Subject: Re: Criterion for intrinsics: F90 defined functions 
In-Reply-To: Guy Steele's message of Mon, 11 May 92 10:32:26 -0400.
             <9205111432.AA07320@strident.think.com> 
Date: Mon, 11 May 92 12:22:47 -0500
From: "Rex Page" <zrlp09@stoy.msc.edu>



   [Steele]                        ... ANSI and ISO committees have
   been sources of features, as a matter of principle they are not
   supposed to be--certainly they are not intended to be the *sole*
   source of new design!  On the contrary, HPF is exactly the kind of
   industry activity that ought to provide the first-level experiments
   from which grist may be chosen in future for the X3J3 mill.

   "Standardization" is not the same thing as "invention".  The
   theoretical purpose of an ANSI committee is to standardize existing
   practice, not to invent new practice.

I've been thinking of HPF as something more like a standard than an
experiment.  I hope it will be possible for HPF programs to survive
in a "Fortran 2000" environment, if there is one.  This will be
easy to manage if HPF's experiments are compatible with Fortran 90.
It will be hard to manage if HPF includes extensions that conflict
with future changes.

   [Steele]
   So I think HPF need not be afraid to explore extensions to the
   language for fear of offending X3J3 or ANSI!  Of course, I would also
   recommend that we be conservative; we should not add features
   frivolously or gratuitously, but only to meet some perceived need,
   consistent with our overall goals.

The proposed data distribution directives and intrinsics do seem
to be on the conservative side.  They standardize existing practice
in a sense (CM Fortran, Fortran D, etc.), and people seem inclined
to encode them within Fortran 90 using comment-style directives
where necessary.  I hope this continues.

I agree that we should not concern ourselves about offending X3J3.
What we should worry about is the survivability of HPF code in the
face of possible extensions to Fortran 90 in the next decade.

Rex Page

From gls@think.com  Mon May 11 12:39:47 1992
Received: from mail.think.com by cs.rice.edu (AA27163); Mon, 11 May 92 12:39:47 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Mon, 11 May 92 13:39:40 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA08437; Mon, 11 May 92 13:39:39 EDT
Date: Mon, 11 May 92 13:39:39 EDT
Message-Id: <9205111739.AA08437@strident.think.com>
To: zrlp09@stoy.msc.edu
Cc: gls@think.com, hpff-intrinsics@cs.rice.edu
In-Reply-To: "Rex Page"'s message of Mon, 11 May 92 12:22:47 -0500 <9205111722.AA10458@stoy.trc.amoco.com>
Subject: Criterion for intrinsics: F90 defined functions 


Right.  I think we are now in "violent agreement"!

--Guy

From schreibr@riacs.edu  Tue May 12 16:42:18 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA11264); Tue, 12 May 92 16:42:18 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA10182); Tue, 12 May 92 16:42:15 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA01741; Tue, 12 May 92 14:42:13 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA03424; Tue, 12 May 92 14:41:24 PDT
Message-Id: <9205122141.AA03424@thor.riacs.edu>
Date: Tue, 12 May 92 14:41:24 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Questions


Dear members of the Intrinsics subcommittee:

We have a number of proposals on the table.   Now we need to begin a
debate on them, or we will be forced to have a meeting sometime.   I
hope we can avoid all but a short meeting on June 8.

First, here are the intrinsics that have been proposed.    VOTE up
or down (this is a straw poll) on whether to include them in
some form, and give Guy some help on his questions:

------------------------------------------------------
  POPCNT

  POPPAR

  LEADZ

  ILEN

Issues: I hope I have defined POPCNT, POPPAR, and LEADZ consistently
with their use in Cray Fortran.  I believe that Cray Fortran allows
these intrinsics to be applied to data types other than integers;
should HPF allow this also?
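
For concreteness, here is a sketch of plausible semantics for the four
bit intrinsics (Python, purely as illustration), assuming a 32-bit
integer word and nonnegative arguments; the authoritative definitions
are whatever the Cray Fortran manual says:

```python
BITS = 32  # assumed word size

def popcnt(i):
    """Number of set bits in i."""
    return bin(i).count("1")

def poppar(i):
    """Parity of the bit count: POPCNT mod 2."""
    return popcnt(i) & 1

def leadz(i):
    """Leading zero bits of i in a BITS-wide word."""
    return BITS - i.bit_length()

def ilen(i):
    """Bits needed to represent nonnegative i."""
    return i.bit_length()
```

For example, popcnt(13) is 3 (binary 1101) and leadz(1) is 31.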

------------------------------------------------------
  extended MAXLOC   (with DIM argument)

  extended MINLOC   (with DIM argument)
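
The DIM behavior being proposed can be sketched as follows (Python,
for illustration only; "maxloc" mimics the intrinsic on a rank-2 array
given as a list of rows, returning 1-based subscripts and the first
maximum on ties):

```python
def maxloc(a, dim=None):
    """Sketch of MAXLOC, with and without DIM, for a rank-2 array."""
    if dim is None:
        # Whole-array form: subscript pair locating the first maximum.
        best = (1, 1)
        for i, row in enumerate(a, start=1):
            for j, v in enumerate(row, start=1):
                if v > a[best[0] - 1][best[1] - 1]:
                    best = (i, j)
        return best
    if dim == 1:
        # Reduce along dim 1: for each column, the row index of its max.
        # The -i tiebreak picks the first (lowest-index) maximum.
        return [max(range(1, len(a) + 1), key=lambda i: (a[i - 1][j], -i))
                for j in range(len(a[0]))]
    # dim == 2: for each row, the column index of its max.
    return [max(range(1, len(row) + 1), key=lambda j: (row[j - 1], -j))
            for row in a]
```

For A = [[1, 5, 3], [4, 2, 6]], maxloc(A) is (2, 3), maxloc(A, dim=1)
is [2, 1, 2], and maxloc(A, dim=2) is [2, 3].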

------------------------------------------------------

New reductions:

	AND
	OR
	EOR
	PARITY

------------------------------------------------------

For each reduction intrinsic XXX, introduce the parallel prefix and suffix
intrinsics:

	XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

(The possible values of XXX are:

	 SUM, PRODUCT, ALL, ANY, MAXVAL, MINVAL, AND, OR, EOR, PARITY.)



On PREFIX and SUFFIX:

>>> Outstanding issues: This proposal delimits the segments by indicating
>>> the *start* of each segment.  Cray MPP Fortran delimits the segments
>>> by indicating the *stop* of each segment.  Each method has its advantages.
>>> There is also the question of whether this convention should change when
>>> performing a suffix rather than a prefix.
>>> 
>>> Another way to delimit segments is to use a logical vector and say
>>> that a new segment begins at every *transition* from false to true or
>>> true to false; thus a segment is indicated by a maximal contiguous
>>> subsequence of like logical values:
>>> 
>>>         (/T,T,T,F,T,F,F,F,T,F,F,T/)
>>>           ----- - - ----- - --- -    seven segments
>>> 
>>> The main advantages of this representation are:
>>> 
>>> (a) It is symmetrical, in that the same segment specifier may
>>>    be meaningfully used for parallel prefix and parallel suffix
>>>    without changing its interpretation (start versus stop).
>>> 
>>> (b) It seems to be equally inconvenient for every existing
>>>    architecture.  :-)  However, it is not that hard to accommodate.
>>> 
>>> (c) The start-bit or stop-bit representation is easily converted
>>>    to this form by using a parallel XOR prefix or suffix.
>>>    Of course, we would need to define one (see separate proposal
>>>    for a PARITY reduction intrinsic).  Examples:
>>> 
>>>    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
>>>    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
>>>    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
>>>    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))
>>> 
>>>    These might be standard idioms for a compiler to recognize.
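
The alternating-sequence convention and its parity-prefix conversion
can be made concrete with a small sketch (Python, for illustration;
these names are stand-ins, not the proposed HPF spellings):

```python
def parity_prefix(bits):
    """Running XOR: converts a start-bit vector to alternating form."""
    out, acc = [], False
    for b in bits:
        acc ^= b
        out.append(acc)
    return out

def sum_prefix(a, segment):
    """Segmented running sum; a segment is a maximal run of like values."""
    out, acc = [], 0
    for k, x in enumerate(a):
        if k > 0 and segment[k] != segment[k - 1]:
            acc = 0  # transition => start of a new segment
        acc += x
        out.append(acc)
    return out
```

With the twelve-element segment vector shown above, SUM_PREFIX of an
array of ones restarts the count seven times, once per segment.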




------------------------------------------------------

For each reduction intrinsic XXX, introduce the send with combination
intrinsic:

	XXX_SEND(SOURCE,DEST,IDX1,...)
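
A sketch of the combining-send semantics for XXX = SUM (Python, for
illustration; the name, argument order, and 1-based indexing are
assumptions, not the proposed interface):

```python
def sum_send(source, idx, n):
    """dest(idx(k)) accumulates source(k); collisions combine by +."""
    dest = [0] * n
    for k, i in enumerate(idx):
        dest[i - 1] += source[k]  # 1-based destination indices
    return dest
```

For example, sum_send([1, 2, 3, 4], [1, 2, 1, 3], 3) gives [4, 2, 4]:
the two elements routed to position 1 are combined by addition.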

------------------------------------------------------

Sorting:

	GRADE_UP(ARRAY,DIM)

	GRADE_DOWN(ARRAY,DIM)

>>>	Question: should stability be guaranteed?
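
A sketch of the rank-1 semantics, including the stability question
(Python, for illustration only; sorted() is stable, so equal elements
keep their original relative order, which is the guarantee at issue):

```python
def grade_up(a):
    """Permutation of 1..N that sorts a ascending, stably."""
    return sorted(range(1, len(a) + 1), key=lambda i: a[i - 1])

def grade_down(a):
    """Stable descending grade: equal elements keep original order,
    so this is NOT simply the reverse of grade_up."""
    return sorted(range(1, len(a) + 1), key=lambda i: -a[i - 1])
```

For a = [30, 10, 20, 10], grade_up(a) is [2, 4, 3, 1]: the two equal
10s appear in their original order, 2 before 4.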

The alternative:

	INDEX(ARRAY,SUBS)

	GRADE(ARRAY1,ARRAY2,...)

------------------------------------------------------

	NUMBER_OF_PROCESSORS()  with optional DIM argument

The alternative:

	NUMBER_OF_PROCESSORS()  with no optional DIM argument and
	PROCESSORS_SHAPE()

Question:  Should these be allowed for processor_arrangements in HPF 
directives?

------------------------------

From schreibr@riacs.edu  Tue May 12 16:54:34 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA11573); Tue, 12 May 92 16:54:34 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA10187); Tue, 12 May 92 16:54:31 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA01904; Tue, 12 May 92 14:54:30 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA03447; Tue, 12 May 92 14:53:43 PDT
Message-Id: <9205122153.AA03447@thor.riacs.edu>
Date: Tue, 12 May 92 14:53:43 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Answers

My opinions:

1.    The minimal subset should consist of the NUMBER_OF_PROCESSORS
and PROCESSORS_SHAPE intrinsics only.

2.    The idea of allowing a reduction intrinsic for every suitable
binary operator, and PREFIX, SUFFIX, and SEND intrinsics for every
reduction, is appealing, but maybe this places too great a burden on
the implementors?   Are all of these potentially useful?

3.   Answers to my own questions:
------------------------------------------------------
  POPCNT		

  POPPAR

  LEADZ

  ILEN

>>>>>>>>	Maybe.   I don't need them.

... Cray Fortran allows
these intrinsics to be applied to data types other than integers;
should HPF allow this also?

>>>>>>>>	No.

------------------------------------------------------
  extended MAXLOC   (with DIM argument)
>>>>>>>>	Yes.

  extended MINLOC   (with DIM argument)
>>>>>>>>	Yes.

------------------------------------------------------

New reductions:

	AND
	OR
	EOR
	PARITY
>>>>>>>>	Yes.

------------------------------------------------------

For each reduction intrinsic XXX, introduce the parallel prefix and suffix
intrinsics:

	XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

>>>>>>>>	Yes.

On PREFIX and SUFFIX:

>>> Outstanding issues: This proposal delimits the segments by indicating
>>> the *start* of each segment.  Cray MPP Fortran delimits the segments
>>> by indicating the *stop* of each segment.  Each method has its advantages.
>>> There is also the question of whether this convention should change when
>>> performing a suffix rather than a prefix.
>>> 
>>> Another way to delimit segments is to use a logical vector and say
>>> that a new segment begins at every *transition* from false to true or
>>> true to false; thus a segment is indicated by a maximal contiguous
>>> subsequence of like logical values:
>>> 
>>>         (/T,T,T,F,T,F,F,F,T,F,F,T/)
>>>           ----- - - ----- - --- -    seven segments
>>> 
>>> The main advantages of this representation are:
>>> 
>>> (a) It is symmetrical, in that the same segment specifier may
>>>    be meaningfully used for parallel prefix and parallel suffix
>>>    without changing its interpretation (start versus stop).
>>> 
>>> (b) It seems to be equally inconvenient for every existing
>>>    architecture.  :-)  However, it is not that hard to accommodate.
>>> 
>>> (c) The start-bit or stop-bit representation is easily converted
>>>    to this form by using a parallel XOR prefix or suffix.
>>>    Of course, we would need to define one (see separate proposal
>>>    for a PARITY reduction intrinsic).  Examples:
>>> 
>>>    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
>>>    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
>>>    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
>>>    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))
>>> 
>>>    These might be standard idioms for a compiler to recognize.

>>>>>>>>    I think that alternating sequences is the best way since
>>>>>>>>    it is the easiest to remember.


------------------------------------------------------

For each reduction intrinsic XXX, introduce the send with combination
intrinsic:

	XXX_SEND(SOURCE,DEST,IDX1,...)

>>>>>>>>    Yes.
------------------------------------------------------

Sorting:

	GRADE_UP(ARRAY,DIM)

	GRADE_DOWN(ARRAY,DIM)

>>>>>>>>    Yes.
>>>	Question: should stability be guaranteed?
>>>>>>>>    Yes.

The alternative:

	INDEX(ARRAY,SUBS)

	GRADE(ARRAY1,ARRAY2,...)

>>>>>>>>    No.

------------------------------------------------------

	NUMBER_OF_PROCESSORS()  with optional DIM argument

>>>>>>>>>>   No.
The alternative:

	NUMBER_OF_PROCESSORS()  with no optional DIM argument and
	PROCESSORS_SHAPE()

>>>>>>>>>>   Yes.
Question:  Should these be allowed for processor_arrangements in HPF 
directives?
>>>>>>>>>>   Yes.

------------------------------

From loveman@ftn90.enet.dec.com  Wed May 13 07:27:41 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA24788); Wed, 13 May 92 07:27:41 CDT
Received: from enet-gw.pa.dec.com by erato.cs.rice.edu (AA10449); Wed, 13 May 92 07:27:38 CDT
Received: by enet-gw.pa.dec.com; id AA16552; Wed, 13 May 92 05:27:36 -0700
Message-Id: <9205131227.AA16552@enet-gw.pa.dec.com>
Received: from ftn90.enet; by decwrl.enet; Wed, 13 May 92 05:27:37 PDT
Date: Wed, 13 May 92 05:27:37 PDT
From: David Loveman <loveman@ftn90.enet.dec.com>
To: gls@think.com
Cc: hpff-intrinsics@erato.cs.rice.edu
Apparently-To: gls@think.com, hpff-intrinsics@erato.cs.rice.edu
Subject: NUMBER_OF_PROCESSORS


Sorry for the delay in my reply to your comments on the
NUMBER_OF_PROCESSORS proposal.  You had two suggestions:


>>[a] Using DIM=0 to convey rank information is one of those
>>clever-but-strange encoding tricks.  . . . . .

Our original proposal had behind it a principle of parsimony;  we were
at that time attempting to make the *minimal* semantic change to
Fortran 90.  As a result we had *one* statement, the FORALL construct,
and *one* intrinsic, the one for NUMBER_OF_PROCESSORS.  Hence the (I
agree with the description) "clever-but-strange encoding trick."  An
alternative principle, which I am actually much more comfortable with,
is that the language should say what it actually means.  As a result I
agree with your suggestion to add the system inquiry intrinsic
PROCESSORS_SHAPE, as you described, and drop the DIM=0 form of the
NUMBER_OF_PROCESSORS intrinsic.


>>[b] Perhaps NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE should take
>>another optional argument, the name of a processors arrangement
>>as declared in a PROCESSORS directive . . . . .

On the other hand, I am against mixing the directives world with the
Fortran 90 world.  I believe we should keep directives as directives
and give them only pragmatic meanings and not give them semantic
meanings.  Being able to use the name of a processors arrangement in
Fortran 90 code would violate this, and would seem to be pushing hard
for "directive things" to be first class objects.

If the user provides a PROCESSORS directive, the user clearly knows at
program construction time the number of processors and processors shape
for that directive. The values are, presumably, constants, or
parameters, or other forms of restricted expressions.  Thus a user
could use the same constants, or parameters, or other forms of
restricted expressions in Fortran 90 text, without the need for intrinsic functions.

On the other hand, a user should, of course, be able to use the
intrinsics NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE in directives, such as 

 !HPF$ PROCESSORS FOO(NUMBER_OF_PROCESSORS())

From gls@think.com  Wed May 13 09:11:57 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA26552); Wed, 13 May 92 09:11:57 CDT
Received: from mail.think.com by erato.cs.rice.edu (AA10467); Wed, 13 May 92 09:11:55 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 13 May 92 10:11:32 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA03637; Wed, 13 May 92 10:11:32 EDT
Date: Wed, 13 May 92 10:11:32 EDT
Message-Id: <9205131411.AA03637@strident.think.com>
To: loveman@ftn90.enet.dec.com
Cc: gls@think.com, hpff-intrinsics@erato.cs.rice.edu
In-Reply-To: David Loveman's message of Wed, 13 May 92 05:27:37 PDT <9205131227.AA16552@enet-gw.pa.dec.com>
Subject: NUMBER_OF_PROCESSORS

   Date: Wed, 13 May 92 05:27:37 PDT
   From: David Loveman <loveman@ftn90.enet.dec.com>
   Apparently-To: gls@think.com, hpff-intrinsics@erato.cs.rice.edu
   ...

   >>[b] Perhaps NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE should take
   >>another optional argument, the name of a processors arrangement
   >>as declared in a PROCESSORS directive . . . . .

   On the other hand, I am against mixing the directives world with the
   Fortran 90 world.  I believe we should keep directives as directives
   and give them only pragmatic meanings and not give them semantic
   meanings.  Being able to use the name of a processors arrangement in
   Fortran 90 code would violate this, and would seem to be pushing hard
   for "directive things" to be first class objects.

   If the user provides a PROCESSORS directive, the user clearly knows at
   program construction time the number of processors and processors shape
   for that directive. The values are, presumably, constants, or
   parameters, or other forms of restricted expressions.  Thus a user
   could use the same constants, or parameters, or other forms of
   restricted expressions in Fortran 90 text, without the need for intrinsic functions.

   On the other hand, a user should, of course, be able to use the
   intrinsics NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE in directives, such as 

    !HPF$ PROCESSORS FOO(NUMBER_OF_PROCESSORS())

Okay.  I am quite sympathetic to this position.

-Guy

From zrlp09@stoy.msc.edu  Wed May 13 15:31:02 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA07237); Wed, 13 May 92 15:31:02 CDT
Received: from noc.msc.edu by erato.cs.rice.edu (AA10656); Wed, 13 May 92 15:30:57 CDT
Received: from uc.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA14713; Wed, 13 May 92 15:30:48 -0500
Received: from [129.230.11.2] by uc.msc.edu (5.65/MSC/v3.0z(901212))
	id AA25696; Wed, 13 May 92 15:30:49 -0500
Received: from trc.amoco.com (apctrc.trc.amoco.com) by netserv2 (4.1/SMI-4.0)
	id AA11112; Wed, 13 May 92 15:30:45 CDT
Received: from stoy.trc.amoco.com by trc.amoco.com (4.1/SMI-4.1)
	id AA09821; Wed, 13 May 92 15:30:42 CDT
Received: from localhost by stoy.trc.amoco.com (4.1/SMI-4.1)
	id AA25495; Wed, 13 May 92 15:30:41 CDT
Message-Id: <9205132030.AA25495@stoy.trc.amoco.com>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: strawpoll on intrinsics - rlp
Date: Wed, 13 May 92 15:30:40 -0500
From: "Rex Page" <zrlp09@stoy.msc.edu>

processor inquiries     
  PROCESSORS_SHAPE()         Yes, without procs argument
  NUMBER_OF_PROCESSORS(DIM)  Yes, without procs argument
  Allowed in HPF directives  Yes

bit inquiries
  POPCNT   Yes
  POPPAR   Yes
  LEADZ    Yes
  ILEN     Yes
  with non-integer arguments?   No
       The ISO standard gives no bit-model of other types
       (except REAL with b=2, and even then accuracy issues
       prohibit reasonable predictions of POPCNT etc. values
       based on the value of a REAL argument).

reductions
  MAXLOC, MINLOC with DIM argument    Yes
  AND     Yes
  OR      Yes
  EOR     Yes
  PARITY  Yes


prefix reductions
  XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)   Yes
  XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)   Yes
  Denote segments by contiguous blocks of like LOGICAL values.

combining send functions
  Abstain  (good testing ground for the notion of assignments
            to arrays with array-valued subscripts containing
            duplicates?)

sorting
  GRADE_UP(ARRAY,DIM)     Yes (marginal; concerned about need
  GRADE_DOWN(ARRAY,DIM)   Yes  for these in typical HPF applications)
  Guaranteed stability    Yes

  INDEX(ARRAY,SUBS)         No   (but a GRADE function with an
  GRADE(ARRAY1,ARRAY2,...)  No    array and a comparison operator
                                  as arguments and an order-index
                                  represented by an integer array
                                  as its result is attractive)


From shapiro@think.com  Thu May 14 11:23:00 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA24355); Thu, 14 May 92 11:23:00 CDT
Received: from mail.think.com by erato.cs.rice.edu (AA11012); Thu, 14 May 92 11:22:57 CDT
Return-Path: <shapiro@Think.COM>
Received: from Django.Think.COM by mail.think.com; Thu, 14 May 92 12:22:49 -0400
From: Richard Shapiro <shapiro@think.com>
Received: by django.think.com (4.1/Think-1.2)
	id AA15999; Thu, 14 May 92 12:22:49 EDT
Date: Thu, 14 May 92 12:22:49 EDT
Message-Id: <9205141622.AA15999@django.think.com>
To: schreibr@riacs.edu
Cc: hpff-intrinsics@erato.cs.rice.edu
In-Reply-To: Rob Schreiber's message of Tue, 12 May 92 14:41:24 PDT <9205122141.AA03424@thor.riacs.edu>
Subject: Questions

   Date: Tue, 12 May 92 14:41:24 PDT
   From: Rob Schreiber <schreibr@riacs.edu>


   Dear members of the Intrinsics subcommittee:

   We have a number of proposals on the table.   Now we need to begin a
   debate on them, or we will be forced to have a meeting sometime.   I
   hope we can avoid all but a short meeting on June 8.

   First, here are the intrinsics that have been proposed.    VOTE up
   or down (this is a straw poll) on whether to include them in
   some form, and give Guy some help on his questions:

   ------------------------------------------------------
     POPCNT
Yes, but can it be POPCOUNT?

     POPPAR
Maybe; it's the same as IAND(POPCNT(I),1)

     LEADZ
Yes

     ILEN
Yes


   Issues: I hope I have defined POPCNT, POPPAR, and LEADZ consistently
   with their use in Cray Fortran.  I believe that Cray Fortran allows
   these intrinsics to be applied to data types other than integers;
   should HPF allow this also?

No. I don't believe we should open a can of worms and get into
floating-point representation.

   ------------------------------------------------------
     extended MAXLOC   (with DIM argument)

     extended MINLOC   (with DIM argument)

Absolutely

   ------------------------------------------------------

   New reductions:

	   AND
	   OR
	   EOR
	   PARITY
Yes

   ------------------------------------------------------

   For each reduction intrinsic XXX, introduce the parallel prefix and suffix
   intrinsics:

	   XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	   XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

   (The possible values of XXX are:

	    SUM, PRODUCT, ALL, ANY, MAXVAL, MINVAL, AND, OR, EOR, PARITY.)

Yes

   On PREFIX and SUFFIX:

   >>> Outstanding issues: This proposal delimits the segments by indicating
   >>> the *start* of each segment.  Cray MPP Fortran delimits the segments
   >>> by indicating the *stop* of each segment.  Each method has its advantages.
   >>> There is also the question of whether this convention should change when
   >>> performing a suffix rather than a prefix.
   >>> 
   >>> Another way to delimit segments is to use a logical vector and say
   >>> that a new segment begins at every *transition* from false to true or
   >>> true to false; thus a segment is indicated by a maximal contiguous
   >>> subsequence of like logical values:
   >>> 
   >>>         (/T,T,T,F,T,F,F,F,T,F,F,T/)
   >>>           ----- - - ----- - --- -    seven segments
   >>> 
   >>> The main advantages of this representation are:
   >>> 
   >>> (a) It is symmetrical, in that the same segment specifier may
   >>>    be meaningfully used for parallel prefix and parallel suffix
   >>>    without changing its interpretation (start versus stop).
   >>> 
   >>> (b) It seems to be equally inconvenient for every existing
   >>>    architecture.  :-)  However, it is not that hard to accommodate.
   >>> 
   >>> (c) The start-bit or stop-bit representation is easily converted
   >>>    to this form by using a parallel XOR prefix or suffix.
   >>>    Of course, we would need to define one (see separate proposal
   >>>    for a PARITY reduction intrinsic).  Examples:
   >>> 
   >>>    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
   >>>    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
   >>>    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
   >>>    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))
   >>> 
   >>>    These might be standard idioms for a compiler to recognize.

I like the alternating sequence method as well. I have never been able to
get segmented scans right on the CM-2 without much confusion and many
minutes with a manual.

   ------------------------------------------------------

   For each reduction intrinsic XXX, introduce the send with combination
   intrinsic:

	   XXX_SEND(SOURCE,DEST,IDX1,...)

Yes. I'm not sure how useful product is, but it's better to have it than
to make an asymmetric restriction.

   ------------------------------------------------------

   Sorting:

	   GRADE_UP(ARRAY,DIM)

	   GRADE_DOWN(ARRAY,DIM)

How about just GRADE(array,dim,direction)?

   >>>	Question: should stability be guaranteed?

Stability is a very useful property. Yes.


   The alternative:

	   INDEX(ARRAY,SUBS)

	   GRADE(ARRAY1,ARRAY2,...)

Don't like this as much.

   ------------------------------------------------------

	   NUMBER_OF_PROCESSORS()  with optional DIM argument

I like this method...

   The alternative:

	   NUMBER_OF_PROCESSORS()  with no optional DIM argument and
	   PROCESSORS_SHAPE()

but could certainly live with this method.

   Question:  Should these be allowed for processor_arrangements in HPF 
   directives?

Yes, otherwise the usefulness is very limited.

   ------------------------------


From schreibr@riacs.edu  Mon May 18 16:14:58 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA17907); Mon, 18 May 92 16:14:58 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA13031); Mon, 18 May 92 16:14:32 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA25113; Mon, 18 May 92 14:14:26 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA02139; Mon, 18 May 92 14:13:36 PDT
Message-Id: <9205182113.AA02139@thor.riacs.edu>
Date: Mon, 18 May 92 14:13:36 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Plan

Dear committee,

The mail on intrinsics having died down, I suspect that we are
ready to agree on a few things.

1.  We can come up with a draft proposal on intrinsics via email.
2.  Guy's collection, with Dave's NPROCS as modified (with PROCESSORS_SHAPE)
    will be the basis of it.
3.  There seems to have been general agreement on the questions Guy raised.

 a.   Yes to all the new intrinsic groups proposed.   No floating-point versions
      of the POPCNT, POPPAR, ILEN, LEADZ intrinsics.

 b.   Sort:  First option (GRADE_UP, GRADE_DOWN) instead of the second
             (GRADE, INDEX)
             Stability is required.

 c.   The "alternating sequence" method of defining segments in the XXX_PREFIX 
      and XXX_SUFFIX intrinsics.

4.  These proposals are in accord with Rex's advisory message about extensions to the
    language.

Let me know if you agree.   If there is agreement, I will suggest to Guy and
Dave that they formulate new drafts along these lines, and that we try to
reach a consensus by end of June.  We can have a pro forma meeting in Dallas
before the Forum meeting, attendance optional, if we succeed.

--- Rob

**** Here is the text of Guy and Dave's proposals:

**** Next 6 sections are Guy's:

1.  Proposal for HPF intrinsics POPCNT, POPPAR, LEADZ, and ILEN

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


(a) An elemental population count intrinsic.  Its action on a scalar is:

  POPCNT(x) = COUNT( (/ (BTEST(x,J), J=0, BIT_SIZE(x)-1) /) )

The result is the number of 1-bits in the integer x, according to the
bit-manipulation model in section 13.5.7 of the Fortran 90 standard.


(b) An elemental population-parity intrinsic.  Its action on a scalar is:

  POPPAR(x) = MERGE(1,0,BTEST(POPCNT(x),0))

The result is 1 if the number of 1-bits in the integer x is odd,
or 0 if the number of 1-bits in the integer x is even.


(c) An elemental count-leading-zeros intrinsic.  Its action on a scalar is:

  LEADZ(x) = MINVAL( (/ (J, J=0,BIT_SIZE(x)) /),
		MASK=(/ (BTEST(x,J), J=BIT_SIZE(x)-1,0,-1), .TRUE. /) )

The result is a count of the number of leading 0-bits in the integer
x, according to the bit-manipulation model in section 13.5.7 of the
Fortran 90 standard.

Note that a given integer value may produce different results from
LEADZ, depending on the number of bits in the representation of the
integer.  That is because bits are counted from the left (the most
significant bit).


(d) An elemental integer-length intrinsic.  Its action on a scalar is:

  ILEN(x) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))

This is related to LEADZ but is often much more convenient for
the calculation of array dimensions, etc.  It is the number of bits
required to store a 2's-complement signed integer x.  As examples of
its use,  2**ILEN(N-1)  rounds N up to a power of 2 (for N > 0),
whereas  2**(ILEN(N)-1)  rounds N down to a power of 2.

Note that a given integer value will always produce the same result
from ILEN, independent of the number of bits in the representation of
the integer.  That is because bits are counted from the right (the
least significant bit).

The definition of ILEN is equivalent to that of the built-in function
integer-length in Common Lisp, which has proven to be quite useful.
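
As a sketch only (a modern Python model, not part of the proposal), the
four intrinsics might behave as follows, assuming a 32-bit integer model
(BIT_SIZE = 32 is an assumption, not something the proposal fixes):

```python
# Hedged Python models of the proposed intrinsics (illustration only,
# not proposal text).  BIT_SIZE = 32 is an assumed integer model.
BIT_SIZE = 32

def popcnt(x):
    # number of 1-bits in the BIT_SIZE-bit representation of x
    return bin(x & (2**BIT_SIZE - 1)).count("1")

def poppar(x):
    # 1 if POPCNT(x) is odd, 0 if it is even
    return popcnt(x) & 1

def leadz(x):
    # leading 0-bits, counted from the most significant bit
    return BIT_SIZE - (x & (2**BIT_SIZE - 1)).bit_length()

def ilen(x):
    # ceiling(log2(-x)) if x < 0, else ceiling(log2(x+1)); this is
    # Common Lisp's INTEGER-LENGTH, and is width-independent
    return (~x if x < 0 else x).bit_length()

print(popcnt(7), poppar(7), leadz(1), ilen(5))   # 3 1 31 3
```

With this model, 2**ilen(n-1) rounds n up to a power of 2 for n > 0 and
2**(ilen(n)-1) rounds it down, as described in the text.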


Issues: I hope I have defined POPCNT, POPPAR, and LEADZ consistently
with their use in Cray Fortran.  I believe that Cray Fortran allows
these intrinsics to be applied to data types other than integers;
should HPF allow this also?
Received: from icarus.riacs.edu by psd.riacs.edu (4.1/2.0N)
	   id AA15112; Wed, 6 May 92 11:31:09 PDT
Received: from cs.rice.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA21952; Wed, 6 May 92 11:31:05 PDT
Received: from erato.cs.rice.edu by cs.rice.edu (AA09680); Wed, 6 May 92 13:29:01 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08580); Wed, 6 May 92 13:28:56 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:28:52 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05271; Wed, 6 May 92 14:28:51 EDT
Date: Wed, 6 May 92 14:28:51 EDT
Message-Id: <9205061828.AA05271@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  extension to MINLOC and MAXLOC
Status: R


2.  Proposal for extension to MINLOC and MAXLOC for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


The MAXLOC and MINLOC intrinsics should have an optional DIM
argument.  If such an argument is present, then the shape of the
result equals the shape of the first argument with one dimension (that
indicated by the DIM argument) deleted; it is as if a series of
one-dimensional MAXLOC or MINLOC operations were performed.

Example: If A has the value

[  0  -5   8  -3  ]
[  3   4  -1   2  ]
[  1   5   6  -4  ]

then	MINLOC(A, DIM=1) has the value [ 1, 1, 2, 3 ]
	MAXLOC(A, DIM=1) has the value [ 2, 3, 1, 2 ]
	MINLOC(A, DIM=2) has the value [ 2, 3, 4 ]
	MAXLOC(A, DIM=2) has the value [ 3, 2, 3 ].
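
A hedged illustration (a Python model, not proposal text) of the DIM
behavior for a rank-2 array, with 1-based indices as in Fortran and
ties going to the first occurrence:

```python
# Illustrative Python model (not proposal text) of MINLOC/MAXLOC with
# a DIM argument, for a rank-2 array held as a list of rows; indices
# are 1-based as in Fortran, and ties go to the first occurrence.
def minloc(a, dim):
    vecs = list(zip(*a)) if dim == 1 else a   # dim=1 scans down columns
    return [min(range(len(v)), key=lambda i: v[i]) + 1 for v in vecs]

def maxloc(a, dim):
    vecs = list(zip(*a)) if dim == 1 else a
    return [max(range(len(v)), key=lambda i: v[i]) + 1 for v in vecs]

A = [[0, -5,  8, -3],
     [3,  4, -1,  2],
     [1,  5,  6, -4]]
print(minloc(A, 1))   # [1, 1, 2, 3]
print(maxloc(A, 2))   # [3, 2, 3]
```
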
Received: from icarus.riacs.edu by psd.riacs.edu (4.1/2.0N)
	   id AA15116; Wed, 6 May 92 11:31:42 PDT
Received: from cs.rice.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA21960; Wed, 6 May 92 11:31:38 PDT
Received: from erato.cs.rice.edu by cs.rice.edu (AA09699); Wed, 6 May 92 13:29:36 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08584); Wed, 6 May 92 13:29:31 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:29:30 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05287; Wed, 6 May 92 14:29:30 EDT
Date: Wed, 6 May 92 14:29:30 EDT
Message-Id: <9205061829.AA05287@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  reduction intrinsics
Status: R

3.  Proposal for reduction intrinsics for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


Just as we have the correspondences:

	operator/intrinsic	reduction intrinsic

		+			SUM
		*			PRODUCT
		.AND.			ALL
		.OR.			ANY
		MAX			MAXVAL
		MIN			MINVAL

it would be useful to have reduction versions of certain
other operators and intrinsics in the language that happen
to be associative and commutative:

				    proposed
	operator/intrinsic	reduction intrinsic

		IAND			AND
		IOR			OR
		IEOR			EOR
		.NEQV.			PARITY

Thus

	AND( (/ 7,3,10 /) )  yields 2
	 OR( (/ 7,3,10 /) )  yields 15
	EOR( (/ 7,3,10 /) )  yields 14

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

      PARITY( (/ T,F,F,T,T,F,F,F,T,T /) )  yields .TRUE.
      PARITY( (/ T,F,F,T,T,F,F,F,T,F /) )  yields .FALSE.

Some of these are particularly valuable if corresponding
parallel-prefix intrinsics are also defined (see separate proposal).
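
The examples above can be checked with a hedged rank-1 Python model
(illustration only, not proposal text):

```python
from functools import reduce
import operator

# Hedged rank-1 Python models of the proposed reductions
# (illustration only, not proposal text).
def and_(xs): return reduce(operator.and_, xs)   # IAND reduction
def or_(xs):  return reduce(operator.or_, xs)    # IOR reduction
def eor(xs):  return reduce(operator.xor, xs)    # IEOR reduction

def parity(xs):
    # .NEQV. reduction: true iff an odd number of elements are true
    return reduce(operator.xor, xs, False)

print(and_([7, 3, 10]), or_([7, 3, 10]), eor([7, 3, 10]))   # 2 15 14
```
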
Received: from icarus.riacs.edu by psd.riacs.edu (4.1/2.0N)
	   id AA15145; Wed, 6 May 92 11:47:24 PDT
Received: from cs.rice.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA22190; Wed, 6 May 92 11:47:21 PDT
Received: from erato.cs.rice.edu by cs.rice.edu (AA10301); Wed, 6 May 92 13:45:05 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08590); Wed, 6 May 92 13:45:01 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 14:44:59 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA05642; Wed, 6 May 92 14:44:58 EDT
Date: Wed, 6 May 92 14:44:58 EDT
Message-Id: <9205061844.AA05642@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  parallel prefix intrinsics
Status: R

4.  Proposal for parallel prefix intrinsics for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


For every reduction operation XXX in the language, introduce two new
intrinsics XXX_PREFIX and XXX_SUFFIX.  They take the same arguments
as the corresponding reduction intrinsic, plus two additional
optional arguments:

	XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

The first additional optional argument is called SEGMENT, which is of
type logical and conformable with the ARRAY argument (a TRUE element
indicates the start of a new segment of the first argument, that is,
a place where the running accumulation is to be reset before
processing the corresponding array element).

The second additional optional argument, a scalar logical, is called
EXCLUSIVE, default .FALSE., which determines whether the prefix or
suffix operation is inclusive (the default) or exclusive.  (The
inclusive sum-prefix of (/ 1,2,3,4 /) is (/ 1,3,6,10 /) whereas the
exclusive sum-prefix is (/ 0,1,3,6 /).)

Array elements corresponding to positions where the MASK is false
do not contribute to the running accumulation.  However, the result
is still defined for corresponding positions in the result.
In actual practice, results may not be required in those positions;
in such cases the programmer may be able to use the WHERE statement
to give the compiler a strong hint:

      WHERE (FOO) A=SUM_PREFIX(B,MASK=FOO)

If the DIM argument is omitted, then the arrays are processed in
array element order ("column-major"), as if temporarily regarded as
one-dimensional.

In all cases the result has the same shape as the first argument.

In addition, the operation COPY_PREFIX replicates the first
(lowest-indexed) element of each segment throughout the segment, and
the operation COPY_SUFFIX replicates the last (highest-indexed)
element of each segment throughout the segment.

Examples:

SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)
SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) )              yields (/1,1,1,2,3,4,4,5,5/)
COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) yields (/0,1,1,1,2,3,4,4,5/)

SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/),
    SEGMENT=(/T,F,F,F,T,F,T,T,F/)) yields (/1,3,6,10,5,11,7,8,17/)
              ------- --- - ---             -------- ---  - ----
	     four input segments       four independent result segments

COPY_PREFIX( (/1,2,3,4,5,6,7,8,9/),
     SEGMENT=(/T,F,F,F,T,F,T,T,F/)) yields (/1,1,1,1,5,5,7,8,8/)
               ------- --- - ---             ------- --- - ---
	      four input segments       four independent result segments
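
The examples above can be reproduced with a hedged rank-1 Python sketch
using the start-bit SEGMENT convention (no DIM or MASK handling;
illustration only, not proposal text):

```python
# Hedged rank-1 Python sketch of SUM_PREFIX and COPY_PREFIX with the
# start-bit SEGMENT convention used in the examples above (no DIM or
# MASK handling; illustration only).
def sum_prefix(a, segment=None, exclusive=False):
    out, acc = [], 0
    for x, start in zip(a, segment or [False] * len(a)):
        if start:
            acc = 0          # reset the running sum at a segment start
        out.append(acc if exclusive else acc + x)
        acc += x
    return out

def copy_prefix(a, segment):
    # replicate the first element of each segment throughout the segment
    out = []
    for x, start in zip(a, segment):
        out.append(x if start or not out else out[-1])
    return out

seg = [True, False, False, False, True, False, True, True, False]
print(sum_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9], seg))
# [1, 3, 6, 10, 5, 11, 7, 8, 17]
print(copy_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9], seg))
# [1, 1, 1, 1, 5, 5, 7, 8, 8]
```
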


Outstanding issues: This proposal delimits the segments by indicating
the *start* of each segment.  Cray MPP Fortran delimits the segments
by indicating the *stop* of each segment.  Each method has its advantages.
There is also the question of whether this convention should change when
performing a suffix rather than a prefix.

Another way to delimit segments is to use a logical vector and say
that a new segment begins at every *transition* from false to true or
true to false; thus a segment is indicated by a maximal contiguous
subsequence of like logical values:

	(/T,T,T,F,T,F,F,F,T,F,F,T/)
          ----- - - ----- - --- -    seven segments

The main advantages of this representation are:

(a) It is symmetrical, in that the same segment specifier may
    be meaningfully used for parallel prefix and parallel suffix
    without changing its interpretation (start versus stop).

(b) It seems to be equally inconvenient for every existing
    architecture.  :-)  However, it is not that hard to accommodate.

(c) The start-bit or stop-bit representation is easily converted
    to this form by using a parallel XOR prefix or suffix.
    Of course, we would need to define one (see separate proposal
    for a PARITY reduction intrinsic).  Examples:

    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))

    These might be standard idioms for a compiler to recognize.
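
The conversion idiom above can be sketched in Python (illustration
only, not proposal text): PARITY_PREFIX is a cumulative XOR, and the
resulting "alternating sequence" delimits segments by transitions
between unlike neighbors.

```python
# Hedged sketch: PARITY_PREFIX as a cumulative XOR, and a scan whose
# segments are delimited by *transitions* between unlike neighbors
# (the "alternating sequence" form).  Illustration only.
def parity_prefix(bits):
    out, acc = [], False
    for b in bits:
        acc ^= b
        out.append(acc)
    return out

def sum_prefix_alt(a, segment):
    # a new segment begins wherever segment[i] differs from segment[i-1]
    out, acc = [], 0
    for i, x in enumerate(a):
        if i == 0 or segment[i] != segment[i - 1]:
            acc = 0
        acc += x
        out.append(acc)
    return out

start_bits = [True, False, False, False, True, False, True, True, False]
alt = parity_prefix(start_bits)   # T,T,T,T,F,F,T,F,F
print(sum_prefix_alt([1, 2, 3, 4, 5, 6, 7, 8, 9], alt))
# [1, 3, 6, 10, 5, 11, 7, 8, 17]
```
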
Received: from icarus.riacs.edu by psd.riacs.edu (4.1/2.0N)
	   id AA15319; Wed, 6 May 92 12:46:27 PDT
Received: from cs.rice.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA23213; Wed, 6 May 92 12:46:23 PDT
Received: from erato.cs.rice.edu by cs.rice.edu (AA12314); Wed, 6 May 92 14:44:16 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08601); Wed, 6 May 92 14:44:11 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 15:44:07 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA06540; Wed, 6 May 92 15:44:06 EDT
Date: Wed, 6 May 92 15:44:06 EDT
Message-Id: <9205061944.AA06540@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  sorting intrinsics
Status: R

5.  Proposal for HPF sorting intrinsics

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992


The ideas and names here are inspired by APL.  I have used the
term "grade" rather than "rank" because the latter is already used
in the Fortran 90 standard to mean the size of the shape of an array
(that is, the number of dimensions).


GRADE_UP(ARRAY,DIM)

The array may be of type integer, real, or character.  [Alternate spec:
the array may be of any type for which the operator .LT. has been defined?]

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

	B(i1,i2,...,ik,...in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)

then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in)
is sorted in ascending order.

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape [SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY))]
and the property that if one computes the rank-1 array

	B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))

where n=SIZE(SHAPE(ARRAY)), then B is sorted in ascending order.

Question: should stability be guaranteed?


GRADE_DOWN(ARRAY,DIM)

Same as GRADE_UP, with "ascending" replaced by "descending".
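
A hedged rank-1 Python sketch (no DIM handling; illustration only, not
proposal text), using a stable sort so that equal keys keep their
original order, as the stability question above would require:

```python
# Hedged rank-1 Python sketch of GRADE_UP/GRADE_DOWN (no DIM handling;
# illustration only).  Python's sort is stable, so elements that
# compare equal keep their original order.  Indices are 1-based.
def grade_up(a):
    return sorted(range(1, len(a) + 1), key=lambda i: a[i - 1])

def grade_down(a):
    # reverse the comparison rather than the result, preserving stability
    return sorted(range(1, len(a) + 1), key=lambda i: a[i - 1],
                  reverse=True)

A = [30, 10, 20, 10]
g = grade_up(A)
print(g)                        # [2, 4, 3, 1]
print([A[i - 1] for i in g])    # [10, 10, 20, 30]
```
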


----------------------------------------------------------------

An alternate approach:

First define the utility intrinsic

INDEX(ARRAY,SUBS)

where SIZE(SUBS,DIM=1) = SIZE(SHAPE(ARRAY)).  If S = SHAPE(SUBS),
then S(2:) is the shape of the result R, which has the property that

	R(i1,i2,...,in) = ARRAY(SUBS(1,i1,i2,...,in),
                                SUBS(2,i1,i2,...,in),
                                ...
                                SUBS(k,i1,i2,...,in))

where k = SIZE(SHAPE(ARRAY)) and n = SIZE(SHAPE(SUBS))-1.


GRADE(ARRAY1,ARRAY2,...)

Arguments ARRAY2,... are optional.  All arrays must be conformable.
The arrays must be of type integer, real, or character.  [Alternate spec:
the array may be of any type for which the operator .LT. has been defined?]
The arrays need not all be of the same type.

The result S is an array of rank 2, with shape [SIZE(SHAPE(ARRAY1)),
PRODUCT(SHAPE(ARRAY1))], and the property that if j < k then

ARRAY1(S(1,j),...,S(n,j)) .LT. ARRAY1(S(1,k),...,S(n,k))  or
( ARRAY1(S(1,j),...,S(n,j)) .EQ. ARRAY1(S(1,k),...,S(n,k))  and
  ( ARRAY2(S(1,j),...,S(n,j)) .LT. ARRAY2(S(1,k),...,S(n,k))  or
    ( ARRAY2(S(1,j),...,S(n,j)) .EQ. ARRAY2(S(1,k),...,S(n,k))  and
      ( ...

          ARRAYn(S(1,j),...,S(n,j)) .LT.  ARRAYn(S(1,k),...,S(n,k))  or
          ( ARRAYn(S(1,j),...,S(n,j)) .EQ. ARRAYn(S(1,k),...,S(n,k))  )
        ...
      )
    )
  )
)

which can also be written

INDEX(ARRAY1,S(:,j)) .LT. INDEX(ARRAY1,S(:,k))  or
( INDEX(ARRAY1,S(:,j)) .EQ. INDEX(ARRAY1,S(:,k))  and
  ( INDEX(ARRAY2,S(:,j)) .LT. INDEX(ARRAY2,S(:,k))  or
    ( INDEX(ARRAY2,S(:,j)) .EQ. INDEX(ARRAY2,S(:,k))  and
      ( ...

          INDEX(ARRAYn,S(:,j)) .LT. INDEX(ARRAYn,S(:,k))  or
          ( INDEX(ARRAYn,S(:,j)) .EQ. INDEX(ARRAYn,S(:,k))  )
        ...
      )
    )
  )
)

That is, the array arguments are treated as sort fields, with the first
argument most significant (major) and the last argument least significant
(minor).  The result gives a set of indices that can be used to
permute the arrays into a collectively sorted (ascending) order.

For example, suppose one had the following derived type (example
taken from section 4.4.1 of the Fortran 90 standard):

      TYPE PERSON
        INTEGER AGE
        CHARACTER (LEN = 50) NAME
      END TYPE PERSON

now consider two arrays of persons:

      TYPE(PERSON), DIMENSION(100000) :: MEMBERS, ROSTER

then the statement

      ROSTER = INDEX(MEMBERS,GRADE(MEMBERS%NAME,MEMBERS%AGE))

causes ROSTER to be a rearrangement of MEMBERS that is sorted
primarily by name and secondarily by age (that is, members with
the same name are grouped together in order of ascending age).
To list members with the same name in descending order of age,
the following trick more or less works:

      ROSTER = INDEX(MEMBERS,GRADE(MEMBERS%NAME,-MEMBERS%AGE))

though this is not completely general.
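
A hedged rank-1 Python sketch of the multi-key GRADE (illustration
only, not proposal text; the array and field names below are made up):
the sort fields become a tuple key, and Python's tuple comparison
supplies exactly the lexicographic ordering written out above, major
field first.

```python
# Hedged rank-1 Python sketch of multi-key GRADE (illustration only):
# the sort fields become a tuple key, major field first, and a stable
# sort produces 1-based indices into the arrays.
def grade(*arrays):
    n = len(arrays[0])
    return sorted(range(1, n + 1),
                  key=lambda i: tuple(a[i - 1] for a in arrays))

names = ["smith", "jones", "smith", "adams"]
ages = [40, 25, 30, 60]
print(grade(names, ages))   # [4, 2, 3, 1]
```
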
Received: from icarus.riacs.edu by psd.riacs.edu (4.1/2.0N)
	   id AA15458; Wed, 6 May 92 15:05:04 PDT
Received: from cs.rice.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA25497; Wed, 6 May 92 15:05:01 PDT
Received: from erato.cs.rice.edu by cs.rice.edu (AA19514); Wed, 6 May 92 17:03:02 CDT
Received: from mail.think.com (Mail1.Think.COM) by erato.cs.rice.edu (AA08652); Wed, 6 May 92 17:02:59 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 6 May 92 18:02:56 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA08781; Wed, 6 May 92 18:02:56 EDT
Date: Wed, 6 May 92 18:02:56 EDT
Message-Id: <9205062202.AA08781@strident.think.com>
To: hpff-intrinsics@erato.cs.rice.edu
Cc: gls@think.com
Subject:  combining-send intrinsics
Status: R


6.  Proposal for HPF combining-send intrinsics

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 6, 1992

For every reduction operation XXX in the language, introduce a new
intrinsic subroutine XXX_SEND:

   XXX_SEND(SOURCE,DEST,IDX1,...)

Arguments IDX1,... are optional.  The number of IDX arguments
must equal the rank of DEST.  The SOURCE and all the IDX arguments
must be conformable.

For every element s in SOURCE, the corresponding elements ij of IDXj
are used to carry out the operation

	DEST(i1,i2,...,in) = XXX_operation(DEST(i1,i2,...,in), s)

and all such operations performed by a single call are done *as if
serially* in *some* (processor-dependent) order for each element s.
Thus the call

      CALL SUM_SEND(SOURCE,DEST,IDX1,IDX2,...,IDXn)

*could* be implemented as

      DO J1=LBOUND(SOURCE,1),UBOUND(SOURCE,1)
        DO J2=LBOUND(SOURCE,2),UBOUND(SOURCE,2)
          ...
            DO Jk=LBOUND(SOURCE,k),UBOUND(SOURCE,k)
              DEST(IDX1(J1,J2,...,Jk),
     &             IDX2(J1,J2,...,Jk),
     &             ...
     &             IDXn(J1,J2,...,Jk)) =
     &        DEST(IDX1(J1,J2,...,Jk),
     &             IDX2(J1,J2,...,Jk),
     &             ...
     &             IDXn(J1,J2,...,Jk)) + SOURCE(J1,J2,...,Jk)
            END DO
          ...
        END DO
      END DO

where k is the rank of SOURCE.  (However, this nest of DO loops
makes a greater commitment to the particular order in which the
combining operations are carried out than the order--namely, none!--
guaranteed by the XXX_SEND intrinsic.  This matters when the
combining operation is not both associative and commutative,
for example floating-point addition.)

Example:  The C* operation

        x[v] += a;

where x, v, and a are all parallel arrays, and a and v conform,
may be rendered

      CALL SUM_SEND(A,X,V)

If all elements of V were distinct, one could write this in
Fortran 90 as

      X(V) = X(V) + A

The proposed intrinsic SUM_SEND "works" even if V contains
duplicate values.  Note that the two-dimensional case

      X(V,W) = X(V,W) + A

must be rendered using SPREAD:

      CALL SUM_SEND(A,X,SPREAD(V,DIM=2,NCOPIES=SIZE(X,2)),
     &                  SPREAD(W,DIM=1,NCOPIES=SIZE(X,1)))

in order to duplicate the cross-product effect of ordinary array
subscripting.  (I chose to propose a definition of XXX_SEND that does
*not* perform such a cross product of indices because it is more
general and in practice more useful without the cross-product effect
built in.)
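
A hedged Python model of the rank-1 case (illustration only, not
proposal text): SUM_SEND is a scatter with combining, well defined even
when the index vector contains duplicates.

```python
# Hedged Python model of SUM_SEND for rank-1 DEST (illustration only):
# a scatter with combining, well defined even when the index vector
# contains duplicates.  Indices are 1-based as in the Fortran above.
def sum_send(source, dest, idx1):
    for s, i in zip(source, idx1):
        # the order in which duplicate targets combine is unspecified;
        # for integer + the order does not matter
        dest[i - 1] += s

X = [0, 0, 0]
sum_send([1, 2, 3, 4], X, [2, 1, 2, 3])
print(X)   # [2, 4, 4]
```
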

**** Here is the DEC NPROCS proposal:

New Intrinsic Function

In addition to the intrinsic functions of Fortran 90, High Performance
Fortran has a new intrinsic function, NUMBER_OF_PROCESSORS, which takes
no arguments and returns an integer value giving the number of
processors in the system. This number is expected to remain constant
for (at least) the duration of one program execution. Accordingly,
NUMBER_OF_PROCESSORS() is a constant expression and can be used
wherever any other Fortran 90 constant expression can be used. In
particular, NUMBER_OF_PROCESSORS can be used in an initialization
expression or in a specification expression.  None of the categories of
intrinsic functions listed in Chapter 13 of the Fortran 90 standard
seem quite apt to describe the nature of this new intrinsic function,
so we add a new category of "system inquiry functions" and place
NUMBER_OF_PROCESSORS in that category.

Note that treating NUMBER_OF_PROCESSORS as a constant expression does
not force a compiler to bind the number of processors at compile time
(although that is one possible implementation) -- with the right linker
or code-generation technology the choice could be deferred until run
time, possibly at some performance cost.

A definition of NUMBER_OF_PROCESSORS, in the style of Chapter 13 of the
Fortran 90 standard is:

13.13.77a  NUMBER_OF_PROCESSORS(DIM)

Optional Argument.  DIM

Description.  Returns the total number of processors available to the
program, the dimensionality of the processor array, or the number of
processors available to the program along a specified dimension of the
processor array.

Class.  System inquiry function.

Arguments.
DIM (optional)  must be scalar and of type integer with a value in the
range 0<=DIM<=n where n is the rank of the processor array.

Result Type, Type Parameter, and Shape.  Default integer scalar.

Result Value.  The result has a value equal to the rank of the
processor array if the value of DIM is 0, the extent of dimension DIM
(1<=DIM<=n, where n is the rank of the processor array) of the
processor-dependent hardware processor array or, if DIM is absent, the
total number of elements, equal to or greater than one, of the
processor-dependent hardware processor array.

The value of NUMBER_OF_PROCESSORS() need not be a constant, if the
processor allows for a variable number of processors to execute the
program.  However, the value must not change during the execution of the program.

Example. For a DECmpp 12000 Model 8B with 8192 processors, the value of
NUMBER_OF_PROCESSORS( ) is 8192, the value of
NUMBER_OF_PROCESSORS(DIM=0) is 2, the value of
NUMBER_OF_PROCESSORS(DIM=1) is 128, and the value of NUMBER_OF_PROCESSORS(DIM=2) is 64.

The list of alternatives in a Fortran 90 restricted expression is
expanded to include NUMBER_OF_PROCESSORS, as follows:

A  *restricted expression* is an expression in which each operation is
intrinsic and each primary is:

1.   A constant or subobject of a constant,

2.   A variable that is a dummy argument that has neither the OPTIONAL
nor the INTENT (OUT) attribute, or a variable that is a subobject of such a dummy
argument,

3.   A variable that is in a common block or a variable that is a
subobject of a variable in a common block,

4.   A variable that is made accessible by use association or host
association or a variable that is a subobject of such a variable,

5.   An array constructor where each element and the bounds and strides
of each implied-DO are expressions whose primaries are either restricted expressions
or implied-DO variables, 

6.   A structure constructor where each component is a restricted expression,

7.   An elemental intrinsic function reference of type integer or
character where each argument is a restricted expression of type
integer or character,

8.   One of the transformational functions REPEAT, RESHAPE,
SELECTED_INT_KIND, SELECTED_REAL_KIND, TRANSFER, and TRIM, where each
argument is a restricted expression of type integer or character,

9.   A reference to an array inquiry function (13.10.15) other than
ALLOCATED, the bit inquiry function BIT_SIZE, the character inquiry
function LEN, the kind inquiry function KIND, or a numeric inquiry
function (13.10.8), where each argument is either a restricted
expression or a variable whose type parameters or bounds inquired about
are not assumed or defined by an ALLOCATE statement or a pointer
assignment, or

10.   A restricted expression enclosed in parentheses, or

11.   The HPF system inquiry function NUMBER_OF_PROCESSORS.

**** Here is Guy's suggested change to it.

[a] Using DIM=0 to convey rank information is one of those
clever-but-strange encoding tricks.  Inasmuch as there is no
precedent for it already in Fortran 90, I recommend considering a
separate intrinsic that would be analogous to something already in
Fortran 90, to wit, PROCESSORS_SHAPE(), returning a rank-1 vector.
Thus SIZE(PROCESSORS_SHAPE()) is the rank of the processor array.

Example.  For a DECmpp 12000 Model 8B with 8192 processors, the value of
PROCESSORS_SHAPE() is (/ 128, 64 /).

Example.  For a Connection Machine CM-2 with 8192 processors, the value of
PROCESSORS_SHAPE() might be (/ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 /).

Example.  For a Connection Machine CM-5 with 8192 processors, the value of
PROCESSORS_SHAPE() might be (/ 8192 /).


[b] Perhaps NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE should take
another optional argument, the name of a processors arrangement
as declared in a PROCESSORS directive:

	NUMBER_OF_PROCESSORS(procs, dim)
	PROCESSORS_SHAPE(procs)

If "procs" is omitted, then the information returned concerns the
"natural" processors arrangement of the hardware; if included,
then it serves as an inquiry intrinsic for the declared processors
arrangement.  Perhaps this latter form should be restricted to use
within HPF directives (inasmuch as the referenced name is defined
by an HPF directive).

Example.

!HPF$ PROCESSORS FOO(40,80)
!HPF$ ... PROCESSORS_SHAPE(FOO) ...           ! (/ 40, 80 /)
!HPF$ ... NUMBER_OF_PROCESSORS(FOO) ...       ! 3200



From shapiro@think.com  Mon May 18 16:20:01 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA18050); Mon, 18 May 92 16:20:01 CDT
Received: from mail.think.com by erato.cs.rice.edu (AA13061); Mon, 18 May 92 16:20:00 CDT
Return-Path: <shapiro@Think.COM>
Received: from Django.Think.COM by mail.think.com; Mon, 18 May 92 17:19:34 -0400
From: Richard Shapiro <shapiro@think.com>
Received: by django.think.com (4.1/Think-1.2)
	id AA28766; Mon, 18 May 92 17:19:52 EDT
Date: Mon, 18 May 92 17:19:52 EDT
Message-Id: <9205182119.AA28766@django.think.com>
To: schreibr@riacs.edu
Cc: hpff-intrinsics@erato.cs.rice.edu
In-Reply-To: Rob Schreiber's message of Mon, 18 May 92 14:13:36 PDT <9205182113.AA02139@thor.riacs.edu>
Subject: Plan

   Date: Mon, 18 May 92 14:13:36 PDT
   From: Rob Schreiber <schreibr@riacs.edu>

   Dear committee,

   The mail on intrinsics having died down, I suspect that we are
   ready to agree on a few things.

   1.  We can come up with a draft proposal on intrinsics via email.
   2.  Guy's collection, with Dave's NPROCS as modified (with PROCESSOR_SHAPE)
       will be the basis of it.
   3.  There seems to have been general agreement on the questions Guy raised.

    a.   Yes to all the new intrinsic groups proposed.   No floating-point versions
	 of the POPCNT, POPPAR, ILEN, LEADZ intrinsics.

    b.   Sort:  First option (GRADE_UP, GRADE_DOWN) instead of the second
		(GRADE, INDEX)
		Stability is required.

    c.   The "alternating sequence" method of defining segments in the XXX_PREFIX 
	 and XXX_SUFFIX intrinsics.

   4.  These proposals are in accord with Rex's advisory message about extensions to the
       language.

   Let me know if you agree.   If there is agreement, I will suggest to Guy and
   Dave that they formulate new drafts along these lines, and that we try to
   reach a consensus by end of June.  We can have a pro forma meeting in Dallas
   before the Forum meeting, attendance optional, if we succeed.

I am in agreement.
	Rich Shapiro

From demmel@imafs.ima.umn.edu  Mon May 18 20:16:05 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA22343); Mon, 18 May 92 20:16:05 CDT
Received: from mail.unet.umn.edu by erato.cs.rice.edu (AA13145); Mon, 18 May 92 20:16:00 CDT
Received: from imafs.ima.umn.edu (a15.ima.umn.edu) by mail.unet.umn.edu (5.65c/)
	id AA10209; Mon, 18 May 1992 20:15:59 -0500
Date: Mon, 18 May 92 20:13:51 CDT
From: "James Demmel" <demmel@imafs.ima.umn.edu>
Message-Id: <9205190113.AA27953@imafs.ima.umn.edu>
Received: by imafs.ima.umn.edu; Mon, 18 May 92 20:13:51 CDT
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Parallel Prefix Spec

I am mailing this to you at the suggestion of Rob Schreiber.  
  Jim Demmel
\documentstyle[11pt]{article}
\title{DRAFT \\ A Specification for Floating Point Parallel Prefix}
\author{James Demmel\\Mathematics Department and Computer Science Division\\
University of California\\Berkeley, CA 94720}

\oddsidemargin=.25in
\evensidemargin=.25in
\textwidth=6.0in
\topmargin=0.in
\textheight=8.5in
\newcommand{\bmat}{\left[ \begin{array}}
\newcommand{\emat}{\end{array} \right]}

\begin{document}
\maketitle

\begin{abstract}
Parallel prefix is a useful operation for various linear algebra operations,
including solving bidiagonal systems of equations and finding the eigenvalues 
of a symmetric tridiagonal matrix. However, the simplest implementations of parallel prefix for the
operations of scalar floating point add and scalar floating point multiply
are inadequate to solve these important problems. This is because
they are too susceptible to over/underflow, and because they apparently
cannot solve the general two term recurrence needed to find eigenvalues. 
In this note we propose a 
specification for parallel prefix operations overcoming these drawbacks.
\end{abstract}

\section{Motivation}

\newcommand{\PP}{\mbox {\rm ParPrefix}}

Our notation for parallel prefix will be as follows.
Let $r = [r_1 , \ldots , r_n]$ and $s = [s_1 , \ldots , s_n]$ denote $n$ 
vectors of data objects, which could be scalars or more complicated objects. 
Let $\otimes$ be an associative operator defined on these object.
Then $s = \PP(r, \otimes )$ computes
\[
s_i = r_1 \otimes \cdots \otimes r_i \; \; .
\]
The most basic numerical parallel prefix operations one could support
are for vectors of scalars, with $\otimes$ being floating point
addition or floating point multiplication. Of course these floating point
operations are not truly associative, and the impact of this is a
question of numerical analysis we will not pursue here.
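For concreteness, here is a minimal sequential sketch of the $\PP$ semantics in Python (the name {\tt par\_prefix} and the list-based interface are mine, not part of any proposal; a real implementation would use a log-depth scan tree):

```python
def par_prefix(r, op):
    """s = ParPrefix(r, op): s_i = r_1 (op) r_2 (op) ... (op) r_i.
    Sequential reference semantics only; associativity of op is what
    permits a log-depth parallel implementation."""
    s, acc = [], None
    for x in r:
        acc = x if acc is None else op(acc, x)
        s.append(acc)
    return s

assert par_prefix([1, 3, 5, 7], lambda a, b: a + b) == [1, 4, 9, 16]
assert par_prefix([2, 3, 4], lambda a, b: a * b) == [2, 6, 24]
```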

Let $B$ be an $n$ by $n$ bidiagonal matrix with diagonal entries
$s_1 , \ldots , s_n$ and subdiagonal entries $e_1 , \ldots , e_{n-1}$.
To solve the linear system $Bx=y$, we need to solve the linear recurrence
\begin{equation}\label{eqn_1}
x_i = \frac{-e_{i-1}}{s_i} x_{i-1} + \frac{y_i}{s_i} \equiv
\eta_i x_{i-1} + \tau_i
\end{equation}
This may be done in two mathematically equivalent ways using parallel prefix.
From (\ref{eqn_1}) we have
\[
\bmat{c} x_i \\ 1 \emat = \bmat{cc} \eta_i & \tau_i \\ 0 & 1 \emat \cdot
\bmat{c} x_{i-1} \\ 1 \emat \equiv M_i \cdot \bmat{c} x_{i-1} \\ 1 \emat =
M_i \cdot M_{i-1} \cdots M_1 \cdot \bmat{c} 0 \\ 1 \emat
\equiv N_i \cdot \bmat{c} 0 \\ 1 \emat
\]
So we need to compute $N = \PP ( M, \cdot )$, where each $M_i$ is a 
2 by 2 matrix, and $\cdot$ is matrix multiply. Alternatively, we could use
the following, equivalent algorithm:

\begin{tabbing}
jnk \= jnk \= jnk \= jnk \= \kill
    \> $f = \PP( \eta , \cdot )$ \\
    \> $g = \tau / f \; \;$ (componentwise vector division) \\
    \> $h = \PP( g, + )$ \\
    \> $x = h \cdot f \; \;$ (componentwise vector multiplication)
\end{tabbing}

Unfortunately, this is very nonrobust because $f$ frequently overflows or
underflows. Even in IEEE double precision, with a range of $10^{\pm 308}$,
it does not take many consecutive floating point multiplies to get a number
out of range. This can be partly resolved 
by taking logarithms of the $\eta_i$ and doing a 
$\PP( \log \eta , +)$ operation, but this is not a satisfactory solution
because it is slower and less accurate (see the appendix).
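Setting robustness aside, the two-scan algorithm above is easy to check numerically. This Python sketch, with NumPy {\tt cumprod}/{\tt cumsum} standing in for the multiply- and add-scans and well-scaled random data so over/underflow does not arise, compares it against the sequential recurrence:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
eta = rng.uniform(0.5, 1.5, n)      # eta_i
tau = rng.uniform(-1.0, 1.0, n)     # tau_i

# Sequential reference: x_i = eta_i * x_{i-1} + tau_i, with x_0 = 0
x_ref = np.zeros(n)
prev = 0.0
for i in range(n):
    x_ref[i] = eta[i] * prev + tau[i]
    prev = x_ref[i]

# Scan formulation: f = multiply-scan(eta), g = tau/f, h = add-scan(g), x = h*f
f = np.cumprod(eta)
g = tau / f
h = np.cumsum(g)
x = h * f

assert np.allclose(x, x_ref)
```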

Let $T$ be an $n$ by $n$ symmetric tridiagonal matrix with diagonal
entries $a_1 , \ldots , a_n$ and offdiagonal entries $b_1 , \ldots , b_{n-1}$.
In order to find its eigenvalues, we need to solve the two term recurrence
\begin{equation}\label{eqn_2}
w_{i} = (a_i - \sigma ) w_{i-1} - b_{i-1}^2 w_{i-2}   \; \; .
\end{equation}
(The number of sign changes in the sequence of $w_i$'s equals the number
of eigenvalues of $T$ less than $\sigma$.) This may be written in terms of
parallel prefix as follows:
\begin{equation}\label{eqn_SS}
\bmat{c} w_{i} \\ w_{i-1} \emat = \bmat{cc} a_i - \sigma & -b_{i-1}^2 \\ 1 & 0
\emat \cdot \bmat{c} w_{i-1} \\ w_{i-2} \emat \equiv P_i \cdot
\bmat{c} w_{i-1} \\ w_{i-2} \emat = 
P_i \cdot P_{i-1} \cdots P_1 \cdot \bmat{c} 1 \\ 0 \emat 
\equiv Q_i \cdot \bmat{c} 1 \\ 0 \emat 
\end{equation}
So we need to compute $Q = \PP ( P, \cdot )$, again a 2 by 2 matrix
multiply parallel prefix.

This recurrence suffers from the same sensitivity to over/underflow as
the last one: $w_i$ is the determinant of the leading $i$ by $i$ submatrix
of $T - \sigma I$, and so is subject to over/underflow even for matrices of
modest norm and modest dimension. Furthermore, there is no known way to
express its solution using scalar multiply and scalar addition
parallel prefix as building blocks. In fact, we strongly suspect that
no such expression exists, although we lack as yet a formal proof.

Furthermore, there is a theorem by H. T. Kung which says that if
$x_i = f_i (x_{i-1})$ is a recurrence relation where $x_i$ is a scalar
and $f_i$ is a rational function, then $x=[x_1 , \ldots , x_n]$ can be
evaluated in $o(n)$ time if and only if it can be
evaluated using a 2 by 2 matrix multiply parallel prefix. It turns out that
the only parallelizable $f_i$ are of the form 
\[
f_i (x_{i-1}) = \frac{\alpha_i x_{i-1} + \beta_i}{\gamma_i x_{i-1} + \delta_i}
\]
which can be parallelized by computing $x_i = u_i / v_i$ where
\[
\bmat{c} u_i \\ v_i \emat = 
\bmat{cc} \alpha_i & \beta_i \\ \gamma_i & \delta_i \emat \cdot
\bmat{c} u_{i-1} \\ v_{i-1} \emat  \equiv 
S_i \cdot \bmat{c} u_{i-1} \\ v_{i-1} \emat  =
S_i \cdots S_1 \cdot
\bmat{c} x_{0} \\ 1 \emat  
\]

Thus, 2 by 2 matrix multiply parallel prefix is sufficient (and we
believe necessary) to parallelize all parallelizable scalar recurrence
relations. On the other hand, I do not believe it is adequate for 3 or more 
term recurrences (for the same reason I do not believe 1 term
recurrences are enough to do 2 term recurrences). At this time, however,
I have not seen any evidence that we frequently
need to solve 3 or more term recurrences.

This leads us to propose the following building blocks for parallel prefix:
\begin{enumerate}
\item Scalar multiply parallel prefix with scaling to avoid over/underflow.
\item 2 by 2 matrix multiply parallel prefix with scaling to avoid over/underflow.
\end{enumerate}

\section{Specifications for Parallel Prefix}

Basically, each floating point number $x_i$ will be replaced by a
pair $(f_i,n_i)$, where $f_i$ is a floating point number and $n_i$ an
integer, with the pair representing $f_i \cdot r^{n_i}$, $r$ an integer
power of 2. The first problem is to choose $r$ and the number of bits
to store $n_i$ so as to allow for easy implementation and the ability
to do very large parallel prefix operations without fear of over/underflow.
So given $r$, the number of bits $b$ in which to store $n_i$ as a signed
integer, and the largest and smallest positive possible values of a floating
point number, we will ask how high a power of the largest and smallest
floating point numbers can be stored without over/underflow in the form
$f_i \cdot r^{n_i}$.

We will only consider IEEE single and double precision formats. The largest
and smallest numbers are given in the following table:

\begin{table}[h]
\begin{center}
\begin{tabular}{|r|rr|}
\hline   & IEEE Single & IEEE Double \\ \hline
Approximate overflow threshold & $2^{128}$ & $2^{1024}$ \\ 
Underflow threshold            & $2^{-126}$& $2^{-1022}$\\
Smallest subnormal number      & $2^{-149}$& $2^{-1074}$\\ \hline
\end{tabular}
\end{center}
\end{table}

Reasonable values for $r$ are 2, $2^{192}$ for IEEE single precision, and
$2^{1536}$ for IEEE double precision. The source for these last two values
is the wrapped exponent feature of IEEE arithmetic: If overflow is
trapped, the floating point unit is supposed to return the true answer
times $2^{-192}$ in single precision and the true answer times $2^{-1536}$
in double precision. Similarly, if underflow is trapped the value returned
is supposed to be $2^{192}$ (single) or $2^{1536}$ (double) times the true value.

Reasonable values for $b$, the number of bits in which to store $n_i$, are
16 and 32. 

The following table enumerates the approximate highest powers to which
a floating point number $f$ can be safely raised using the scaled format 
as a function of $b$ and $r$:

\begin{center}
\begin{tabular}{|l|r|cc|}
\hline
\multicolumn{4}{|c|}{Safe Limits for Exponentiation} \\
\hline
IEEE Single &         &    $r=2$             &    $r=2^{192}$    \\ \hline
$f=2^{128}$ &  $b=16$ &    $255$             &    $49150$    \\
            &  $b=32$ &    $1.67 \cdot 10^7$ & $3.22 \cdot 10^9$ \\ \hline
$f=2^{-126}$&  $b=16$ &    $260$             &    $49930$    \\
            &  $b=32$ &    $1.70 \cdot 10^7$ & $3.27 \cdot 10^9$ \\ \hline
$f=2^{-149}$&  $b=16$ &    $219$             &    $42223$    \\
            &  $b=32$ &    $1.44 \cdot 10^7$ & $2.76 \cdot 10^9$ \\ \hline 
\hline
IEEE Double  &         &    $r=2$             &   $r=2^{1536}$    \\ \hline
$f=2^{1024}$ &  $b=16$ &    $31$              &    $49150$    \\
             &  $b=32$ &    $2.09 \cdot 10^6$ & $3.22 \cdot 10^9$ \\ \hline
$f=2^{-1022}$&  $b=16$ &    $32$              &    $49246$    \\
             &  $b=32$ &    $2.10 \cdot 10^6$ & $3.22 \cdot 10^9$ \\ \hline
$f=2^{-1074}$&  $b=16$ &    $30$              &    $46775$    \\
             &  $b=32$ &    $1.99 \cdot 10^6$ & $3.06 \cdot 10^9$ \\ \hline 
\end{tabular}
\end{center}

From this table, we see that the limiting case is taking powers of the smallest
subnormal number. When $b=16$ we must clearly take the larger of the two
$r$ values to get a reasonably large safe exponent, and even then it is
less than 50000. Choosing $b=32$ is clearly more reasonable. 
Should we choose $r=2$ or the larger $r$? I believe the larger $r$ is preferable,
both because of the larger parallel prefix operations it allows, and because
it appears to be easier to implement, since
the representation of a number is almost unique: there are at
most two ways to store a nonzero number in the form $f \cdot r^{n}$.

In order to implement our two operations
it suffices to explain how to do addition and multiplication of numbers
in this scaled format.

\vspace{.2in}
{\em Multiplication: compute 
$z \cdot r^k = (x \cdot  r^n ) \times (y \cdot r^m )$. Statements in
braces are unnecessary on machines that return wrapped results on
over/underflow. It assumes there are sticky overflow and underflow flags
as in IEEE arithmetic. Multiplications and divisions by $\sqrt{r}$ can
be done by modifying the exponent directly.}

\begin{tabbing}
jnk \= jnk \= jnk \= jnk \= \kill
    \> $z = x \cdot y$ \\
    \> $k = m + n$ \\
    \> if (overflow) then \\
    \>   \> \{$z=(x / \sqrt{r} ) \cdot ( y / \sqrt{r} )$\} \\
    \>   \> $k=k+1$ \\
    \> elseif (underflow) then \\
    \>   \> \{$z=(x \cdot \sqrt{r} ) \cdot ( y \cdot \sqrt{r} )$\} \\
    \>   \> $k=k-1$ \\
    \> endif
\end{tabbing}
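A Python sketch of scaled multiplication. Since Python exposes neither sticky exception flags nor trapped wrapped results, this version renormalizes unconditionally with {\tt math.frexp} instead; the choice $r = 2^{256}$, made so that a product of two normalized values cannot overflow a double, is mine:

```python
import math

L = 256                        # my choice: r = 2**256, so a product of two
R = 2.0 ** L                   # normalized values stays inside double range

def normalize(f, k):
    """Renormalize the pair so that |f| lies in [1, R) (or f == 0),
    folding whole multiples of L into the integer counter k."""
    if f == 0.0:
        return 0.0, 0
    mant, e = math.frexp(f)    # f = mant * 2**e with 0.5 <= |mant| < 1
    q, s = divmod(e - 1, L)    # e - 1 = q*L + s with 0 <= s < L
    return math.ldexp(mant, s + 1), k + q

def scaled_mul(x, n, y, m):
    """(z, k) such that z * R**k == (x * R**n) * (y * R**m)."""
    return normalize(x * y, n + m)   # x*y cannot overflow: |x*y| < 2**512

# Repeated squaring of 2**300: the true result 2**2400 is unrepresentable
z, j = normalize(2.0 ** 300, 0)
for _ in range(3):
    z, j = scaled_mul(z, j, z, j)
assert math.log2(z) + L * j == 2400
```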

\pagebreak
\vspace{.2in}
{\em Addition/Subtraction: compute 
$z \cdot r^k = (x \cdot  r^n ) \pm (y \cdot r^m )$. Statements in
braces are unnecessary on machines that return wrapped results on
over/underflow. It assumes there are sticky overflow and underflow flags
as in IEEE arithmetic. Multiplications and divisions by $r$ can
be done by modifying the exponent directly. It assumes round to nearest mode
and flush to zero underflow (i.e. {\em not} gradual underflow),
although the changes to account for other assumptions are simple.
Besides $r$ the machine constant
$t = $ underflow\_threshold/machine\_epsilon will be used. I have 
arranged the ``if'' tests in decreasing order of likelihood of their
being executed.}

\begin{tabbing}
jnk \= jnk \= jnk \= jnk \= \kill
    \> if ($m=n$) then \\
    \>    \> $z = x \pm y$ \\
    \>    \> $k = m$ \\
    \>    \> if (overflow) then \\
    \>    \>    \> \{$z=(x/r) \pm (y/r)$\} \\
    \>    \>    \> $k=k+1$ \\
    \>    \> elseif (underflow) then \\
    \>    \>    \> \{$z=(x \cdot r) \pm (y \cdot r)$\} \\
    \>    \>    \> $k=k-1$ \\
    \>    \> endif \\
    \> elseif $(m=n-1)$ then \\
    \>    \> $z= x \pm (y/r) \; \; $ /* no overflow possible if round to nearest */ \\
    \>    \> $k=n$ \\
    \>    \> if $0 < |z| < t$ then \\
    \>    \>    \> $z=(x \cdot r) \pm y$ \\
    \>    \>    \> $k=m$ \\
    \>    \> endif \\
    \> elseif $(m=n+1)$ then \\
    \>    \> $z= (x/r) \pm y \; \; $ /* no overflow possible if round to nearest */ \\
    \>    \> $k=m$ \\
    \>    \> if $0 < |z| < t$ then \\
    \>    \>    \> $z=x \pm (y \cdot r)$ \\
    \>    \>    \> $k=n$ \\
    \>    \> endif \\
    \> elseif $(m<n-1)$ then \\
    \>    \> $z=x$ \\
    \>    \> $k=n$ \\
    \> elseif $(n<m-1)$ then \\
    \>    \> $z= \pm y$ \\
    \>    \> $k=m$ \\
    \> endif
\end{tabbing}

It would be interesting to benchmark this parallel prefix
operation both with and without the protection against over/underflow
I propose here, to see how much this protection costs us.

\section{Exploiting extra range}

If we have extra exponent range available, we can greatly diminish the
amount of time spent testing and scaling. If the data is in IEEE single
precision, then products of 8 such (normalized!) numbers can be computed
without over/underflow in IEEE double precision, and products of 128 in
IEEE extended. If the data is IEEE double then products of 16 can be
computed in IEEE extended. Thus, any scaling tests would only have to be
done after 8, 16 or 128 products.


\section{Extensions}

Of course, it would be nice if the user could supply his or her own data type and
associative operator, and have the system perform parallel prefix. Given the
details of exception handling as described above, it would be nice to have
these basic floating point operations be done automatically, rather than expecting
the user to handle them.

\section*{Appendix: Comments on Inverse Iteration on the CM-2\footnote{These are
some early notes written for J.-P. Brunet.}}

The task at hand is to solve $Bx=y$ where $B$ is an $n$ by $n$ upper
bidiagonal matrix and $x$ and $y$ are $n$-vectors. Let $a_1 , \ldots , a_n$
be the diagonal entries of $B$ and $-b_1 , \ldots , -b_{n-1}$ be the superdiagonal entries.
The usual recurrence is $x_i = (y_i + b_i x_{i+1})/a_i \equiv \alpha_i x_{i+1} + \beta_i$,
for $i=n$ down to $i=1$, with $b_n \equiv 0$ and $x_{n+1} \equiv 0$. I can think of
three ways to solve this, with various kinds of immunity against overflow.
I have not analyzed the accuracy of all these (they are not all equivalent to 
evaluating the recurrence sequentially, which has nearly perfect backward error), but
I don't see any obvious dangers. I have tried to use the scan operation to the extent
possible, assuming only scans for floating-point add and floating-point multiply
(although the best solution would involve modifying the multiply scan).

The first two, and most parallel, methods involve the following factorization:
$E = D_1 \cdot B \cdot D_2$, where $D_1$ and $D_2$ are diagonal, and $E$
is bidiagonal with $1$ on the diagonal and $-1$ off. $D_2 = diag( d_0 , \ldots , d_{n-1} )$
is given by $d_0 = 1$, and $d_i = \prod_{j=1}^i (a_j / b_j )$.
$D_1 = diag( e_0 , \ldots , e_{n-1} )$ is given by $e_i = 1/(d_i a_{i+1})$.
One can also verify $E^{-1}$ is upper triangular with all ones on and above the diagonal.
Thus, computing $y=B^{-1}x = D_2 E^{-1} D_1 x$ involves the following:
\begin{enumerate}
\item Compute $f_0 = 1$, $f_i = a_i / b_i$ for all $i$ in one parallel step.
\item Compute $d_i = \prod_{j=0}^i f_j$ for all $i$ using a multiply-scan operation.
\item Compute $e_0 = 1$, $e_i = 1/(d_i a_{i+1})$ for all $i$ in two parallel steps.
\item Compute $y_1=D_1 x$ for all $i$ in one parallel operation.
\item Compute $y_2=E^{-1}y_1$ for all $i$ using an add-scan operation (which is
all that multiplying by $E^{-1}$ amounts to).
\item Compute $y=D_2 y_2$ for all $i$ in one parallel operation.
\end{enumerate}
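Under the conventions above (superdiagonal entries $-b_i$, so $E$ has $-1$ off the diagonal), the six steps can be checked numerically. In this Python sketch the multiply-scan is {\tt cumprod} and applying $E^{-1}$ is a suffix add-scan; the variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
a = rng.uniform(1.0, 2.0, n)       # diagonal entries a_1..a_n
b = rng.uniform(1.0, 2.0, n - 1)   # superdiagonal entries are -b_1..-b_{n-1}
B = np.diag(a) + np.diag(-b, 1)    # upper bidiagonal B
x = rng.uniform(-1.0, 1.0, n)

# Steps 1-2: f_0 = 1, f_i = a_i/b_i; d = multiply-scan of f
f = np.concatenate(([1.0], a[:-1] / b))
d = np.cumprod(f)
# Step 3: e_i = 1/(d_i * a_{i+1})
e = 1.0 / (d * a)
# Steps 4-6: y = D2 * (E^{-1} (D1 * x)); applying E^{-1} is a suffix add-scan
y = d * np.cumsum((e * x)[::-1])[::-1]

assert np.allclose(B @ y, x)       # y solves B y = x
```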

To protect against over/underflow, one can modify this scheme in one of two ways.
The easiest way is to compute $\log f_i$ (assume for the moment all $f_i>0$), then
$\log d_i = \sum_{j=1}^i \log f_j$ using an add-scan, and finally
$\log e_i = -\log d_i - \log a_{i+1}$. Let $\bar{d} = \max \log d_i$ and
$\bar{e} = \max \log e_i$. Replace $\log d_i$ by $\log d_i - \bar{d}$ and
replace $\log e_i$ by $\log e_i - \bar{e}$. Exponentiate the new $\log d_i$ and
$\log e_i$ to get scaled values of $e_i$ and $d_i$ the largest of each of which is 1.
Now perform steps (4) through (6) of the above algorithm.

This has the added cost of a logarithm and an exponential, and loses a little accuracy because
of them too. But it will not overflow and almost certainly not underflow dangerously.
(There are other choices of $\bar{e}$ and $\bar{d}$ that might be marginally safer
against underflow.) The signs of the $f_i$ can be accumulated and applied with a
multiply scan operation involving only $\pm 1$'s.

A better way to protect against overflow is to modify the multiply-scan operation
as follows. Instead of computing $d_i = \prod_{j=0}^i f_j$, one computes a $\tilde{d}_i$
and integer $m_i$ such that $\tilde{d}_i$ cannot over/underflow, and 
$d_i = \tilde{d}_i B^{m_i}$, where $B$ is a big power of the radix near the overflow
threshold. Here is a code, which can obviously be ``scanned'' for computing
$\tilde{d}_i$ and $m_i$:

\begin{tabbing}
jnk \= jnk \= jnk \= jnk \= \kill
     \> $\tilde{d}_0 = 1$; $m_0 = 0$  \\
     \> for $i=1,n$ \\
     \> \> if $\tilde{d}_{i-1} \cdot f_i$ neither overflows nor underflows then \\
     \> \>    \>  $\tilde{d}_i = \tilde{d}_{i-1} \cdot f_i$ \\
     \> \>    \>  $m_i = m_{i-1}$ \\
     \> \> elseif $\tilde{d}_{i-1} \cdot f_i$ would overflow then \\
     \> \>    \>  $\tilde{d}_i = \tilde{d}_{i-1} \cdot f_i / B$ (computed carefully!) \\
     \> \>    \>  $m_i = m_{i-1} + 1$ \\
     \> \> elseif $\tilde{d}_{i-1} \cdot f_i$ would underflow then \\
     \> \>    \>  $\tilde{d}_i = \tilde{d}_{i-1} \cdot f_i \cdot B$ (computed carefully!) \\
     \> \>    \>  $m_i = m_{i-1} - 1$ \\
     \> \> endif \\
     \> endfor
\end{tabbing}

If $B$ is close to overflow, at this point one should take all $\tilde{d}_i$ less than
1 in magnitude, multiply them by $B$ and subtract 1 from their corresponding $m_i$.
When the $m_i$ are available, the largest one can be subtracted from all of them
(so the largest is now 0), and then $d_i = \tilde{d}_i B^{m_i}$ computed without
fear of overflow. 
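A Python sketch of the scaled multiply-scan (sequential, for clarity). Lacking IEEE flags, it tests magnitudes against $B$ directly; the choice $B = 2^{500}$ and the function name are mine:

```python
B = 2.0 ** 500                      # a big power of the radix (my choice)

def scaled_multiply_scan(f):
    """Multiply-scan returning pairs (d_tilde[i], m[i]) with
    d_i = d_tilde[i] * B**m[i]; the running product is rescaled by B
    whenever it leaves [1/B, B), in place of testing IEEE flags."""
    d, m = [], []
    acc, cnt = 1.0, 0
    for fi in f:
        acc *= fi
        while acc != 0.0 and abs(acc) >= B:     # would-overflow branch
            acc /= B
            cnt += 1
        while acc != 0.0 and abs(acc) < 1.0 / B:  # would-underflow branch
            acc *= B
            cnt -= 1
        d.append(acc)
        m.append(cnt)
    return d, m

# 600 factors of 8: the plain product 8**600 = 2**1800 overflows a double
d, m = scaled_multiply_scan([8.0] * 600)
assert d[-1] == 2.0 ** 300 and m[-1] == 3   # 2**300 * (2**500)**3 == 2**1800
```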

The third approach is to run the recurrence 
$x_i = (y_i - b_i x_{i+1})/a_i \equiv \alpha_i x_{i+1} + \beta_i$ sequentially,
scaling if necessary as one goes. The code is similar in its use of a counter $m_i$ to
the last code above:

\begin{tabbing}
jnk \= jnk \= jnk \= jnk \= \kill
     \> $\tilde{x}_n = \beta_n$; $m_n = 0$  \\
     \> for $i=n-1$ downto $1$ \\
     \>  \> if $\alpha_i \cdot x_{i+1} + \beta_i \cdot B^{m_{i+1}}$ neither over/underflows, then\\
     \>  \>  \>  $x_i = \alpha_i \cdot x_{i+1} + \beta_i \cdot B^{m_{i+1}}$ \\
     \>  \>  \>  $m_i = m_{i+1}$ \\
     \>  \> elseif $\alpha_i \cdot x_{i+1} + \beta_i \cdot B^{m_{i+1}}$ overflows, then \\
     \>  \>  \>  $x_i = \alpha_i \cdot x_{i+1}/B + \beta_i \cdot B^{m_{i+1}-1}$ (carefully!)\\
     \>  \>  \>  $m_i = m_{i+1}-1$ \\
     \>  \> elseif $\alpha_i \cdot x_{i+1} + \beta_i \cdot B^{m_{i+1}}$ underflows, then \\
     \>  \>  \>  $x_i = \alpha_i \cdot x_{i+1} \cdot B + \beta_i \cdot B^{m_{i+1}+1}$ (carefully!)\\
     \>  \>  \>  $m_i = m_{i+1}+1$ \\
     \>  \> endif \\
     \> endfor
\end{tabbing}

The true values of $x_i$ are gotten from the computed $x_i$ and $m_i$ the same way
$d_i$ is gotten from $\tilde{d}_i$ and $m_i$ as described above: If some $x_i$ is
less than 1 in magnitude, multiply it by $B$ and {\em add} 1 to $m_i$, then subtract
the largest $m_i$ from all of them so the largest is now zero, and then change
$x_i$ to $x_i B^{-m_i}$.

I have not debugged this pseudocode, but I think it is basically correct.

Alan Edelman points out that these can all be blocked in straightforward ways.

\end{document} 



From highnam@slcs.slb.com  Tue May 19 09:58:12 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA01269); Tue, 19 May 92 09:58:12 CDT
Received: from SLCS.SLB.COM by erato.cs.rice.edu (AA13422); Tue, 19 May 92 09:58:09 CDT
From: highnam@slcs.slb.com
Received: from speedy.SLCS.SLB.COM
	by SLCS.SLB.COM (4.1/SLCS Mailhost 3.13)
	id AA13465; Tue, 19 May 92 09:57:50 CDT
Received: by speedy.SLCS.SLB.COM (4.1/SLCS Subsidiary 1.10)
	id AA03984; Tue, 19 May 92 09:57:49 CDT
Date: Tue, 19 May 92 09:57:49 CDT
Message-Id: <9205191457.AA03984.highnam@speedy.SLCS.SLB.COM>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Template and Distribution query intrinsics ?


At the moment we have no intrinsics for querying
template and distribution construction.  So, for example, if 
an HPF program does decide that it should REDISTRIBUTE itself,
it doesn't have any tools to figure out what to do.

Such intrinsics should report the actual distribution at
run time, and not the distribution that the HPF programmer
asked for (though it would be nice if they were the same.. ).

Guy has already proposed intrinsics of this kind in conjunction
with the LOCAL subroutine proposal.  (Group 2/3.)

With the agreement on PROCESSOR_SHAPE we have provided
some query functionality for the lowest HPF level, although
we still need a version of Guy's suggestion (11.[b] in Rob's
summary message) to incorporate named PROCESSOR definitions.
Example: A CM2-8K has either 8,192 or 256 ``processors''. 
In general, the number of HPF PROCESSORS may have little or 
nothing to do with the vendor-specific number of processors. 


(If I incrementally purchase processors for a system I'll have to
 be careful to avoid big prime processor counts because the only 
 way I could fully exploit the system is with a 1D PROCESSOR defn. :)


Peter


From gls@think.com  Wed May 20 17:03:48 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA13044); Wed, 20 May 92 17:03:48 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 20 May 92 18:03:47 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA10715; Wed, 20 May 92 18:03:47 EDT
Date: Wed, 20 May 92 18:03:47 EDT
Message-Id: <9205202203.AA10715@strident.think.com>
To: hpff-intrinsics@cs.rice.edu
Subject: intrinsics-maxloc-proposal, version 2

Proposal for extension to MINLOC and MAXLOC for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 20, 1992


The MAXLOC and MINLOC intrinsics should have an optional DIM
argument.  If such an argument is present, then the shape of the
result equals the shape of the first argument with one dimension
(the one indicated by the DIM argument) deleted; it is as if a
series of one-dimensional MAXLOC or MINLOC operations were performed.

Example: If A has the value

[  0  -5   8  -3  ]
[  3   4  -1   2  ]
[  1   5   6  -4  ]

then	MINLOC(A, DIM=1) has the value [ 1, 1, 2, 3 ]
	MAXLOC(A, DIM=1) has the value [ 2, 3, 1, 2 ]
	MINLOC(A, DIM=2) has the value [ 2, 3, 4 ]
	MAXLOC(A, DIM=2) has the value [ 3, 2, 3 ].
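The example can be reproduced with NumPy's argmin/argmax, shifting between Fortran's 1-based DIM and result indices and NumPy's 0-based axes (a sketch of the proposed semantics, not a statement about any HPF implementation):

```python
import numpy as np

A = np.array([[0, -5,  8, -3],
              [3,  4, -1,  2],
              [1,  5,  6, -4]])

# Fortran DIM=k maps to numpy axis k-1, and the returned locations are
# 1-based; argmin/argmax return the first extremal position, as MINLOC
# and MAXLOC do along a single dimension.
minloc_dim1 = A.argmin(axis=0) + 1
maxloc_dim1 = A.argmax(axis=0) + 1
minloc_dim2 = A.argmin(axis=1) + 1
maxloc_dim2 = A.argmax(axis=1) + 1

assert list(minloc_dim1) == [1, 1, 2, 3]
assert list(maxloc_dim1) == [2, 3, 1, 2]
assert list(minloc_dim2) == [2, 3, 4]
assert list(maxloc_dim2) == [3, 2, 3]
```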

From gls@think.com  Wed May 20 17:03:56 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA13058); Wed, 20 May 92 17:03:56 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 20 May 92 18:03:53 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA10718; Wed, 20 May 92 18:03:53 EDT
Date: Wed, 20 May 92 18:03:53 EDT
Message-Id: <9205202203.AA10718@strident.think.com>
To: hpff-intrinsics@cs.rice.edu

Proposal for reduction intrinsics for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 20, 1992


Just as we have the correspondences:

	operator/intrinsic	reduction intrinsic

		+			SUM, COUNT
		*			PRODUCT
		.AND.			ALL
		.OR.			ANY
		MAX			MAXVAL
		MIN			MINVAL

it would be useful to have reduction versions of certain
other operators and intrinsics in the language that happen
to be associative and commutative:

				    proposed
	operator/intrinsic	reduction intrinsic

		IAND			AND
		IOR			OR
		IEOR			EOR
		.NEQV.			PARITY

Thus

	AND( (/ 7,3,10 /) )  yields 2
	 OR( (/ 7,3,10 /) )  yields 15
	EOR( (/ 7,3,10 /) )  yields 14

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

      PARITY( (/ T,F,F,T,T,F,F,F,T,T /) )  yields .TRUE.
      PARITY( (/ T,F,F,T,T,F,F,F,T,F /) )  yields .FALSE.
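The proposed reductions are one-liners over Python's bitwise operators, which is enough to check the examples (the function names mirror the proposal; the implementations are mine):

```python
from functools import reduce
import operator

def AND(xs):    return reduce(operator.and_, xs)
def OR(xs):     return reduce(operator.or_, xs)
def EOR(xs):    return reduce(operator.xor, xs)
def PARITY(bs): return reduce(operator.xor, bs)   # .NEQV. over logicals

assert AND([7, 3, 10]) == 2
assert OR([7, 3, 10]) == 15
assert EOR([7, 3, 10]) == 14

T, F = True, False
assert PARITY([T, F, F, T, T, F, F, F, T, T])         # five trues: odd
assert not PARITY([T, F, F, T, T, F, F, F, T, F])     # four trues: even
```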

Some of these are particularly valuable if corresponding
parallel-prefix intrinsics are also defined (see separate proposal).

From gls@think.com  Wed May 20 17:04:04 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA13068); Wed, 20 May 92 17:04:04 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 20 May 92 18:03:59 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA10722; Wed, 20 May 92 18:03:57 EDT
Date: Wed, 20 May 92 18:03:57 EDT
Message-Id: <9205202203.AA10722@strident.think.com>
To: hpff-intrinsics@cs.rice.edu

Proposal for HPF combining-send intrinsics

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 20, 1992

[Note the addition of COPY_SEND, by analogy with COPY_PREFIX and
COPY_SUFFIX, to achieve what the Connection Machine calls
send_with_overwrite.]

For every reduction operation XXX in the language, introduce a new
intrinsic subroutine XXX_SEND:

   XXX_SEND(SOURCE,DEST,IDX1,...)

Arguments IDX1,... are optional.  The number of IDX arguments
must equal the rank of DEST.  The SOURCE and all the IDX arguments
must be conformable.

For every element s in SOURCE, the corresponding elements ij of IDXj
are used to carry out the operation

	DEST(i1,i2,...,in) = XXX_operation(DEST(i1,i2,...,in), s)

and all such operations performed by a single call are done *as if
serially* in *some* (processor-dependent) order for each element s.
So if multiple elements of SOURCE are sent to the same destination
element, they will all be properly combined with the destination element.
Thus the call

      CALL SUM_SEND(SOURCE,DEST,IDX1,IDX2,...,IDXn)

*could* be implemented as

      DO J1=LBOUND(SOURCE,1),UBOUND(SOURCE,1)
        DO J2=LBOUND(SOURCE,2),UBOUND(SOURCE,2)
          ...
            DO Jk=LBOUND(SOURCE,k),UBOUND(SOURCE,k)
              DEST(IDX1(J1,J2,...,Jk),
     &             IDX2(J1,J2,...,Jk),
     &             ...
     &             IDXn(J1,J2,...,Jk)) =
     &        DEST(IDX1(J1,J2,...,Jk),
     &             IDX2(J1,J2,...,Jk),
     &             ...
     &             IDXn(J1,J2,...,Jk)) + SOURCE(J1,J2,...,Jk)
            END DO
          ...
        END DO
      END DO

where k is the rank of SOURCE.  (However, this nest of DO loops
makes a greater commitment to the particular order in which the
combining operations are carried out than the order--namely, none!--
guaranteed by the XXX_SEND intrinsic.  This matters when the
combining operation is not both associative and commutative,
for example floating-point addition.)

In addition, the intrinsic COPY_SEND has the behavior that
for every element s in SOURCE, the corresponding elements ij of IDXj
are used to carry out the operation

	DEST(i1,i2,...,in) = s

and all such operations performed by a single call are done *as if
serially* in *some* (processor-dependent) order for each element s.
So if multiple elements of SOURCE are sent to the same destination
element, some one of them will be assigned and the rest effectively
discarded.


Example:  The C* operation

        x[v] += a;

where x, v, and a are all parallel arrays, and a and v conform,
may be rendered

      CALL SUM_SEND(A,X,V)

If all elements of V were distinct, one could write this in
Fortran 90 as

      X(V) = X(V) + A

The proposed intrinsic SUM_SEND "works" even if V contains
duplicate values.  Note that the two-dimensional case

      X(V,W) = X(V,W) + A

must be rendered using SPREAD:

      CALL SUM_SEND(A,X,SPREAD(V,DIM=2,NCOPIES=SIZE(X,2)),
     &                  SPREAD(W,DIM=1,NCOPIES=SIZE(X,1)))

in order to duplicate the cross-product effect of ordinary array
subscripting.  (I chose to propose a definition of XXX_SEND that does
*not* perform such a cross product of indices because it is more
general and in practice more useful without the cross-product effect
built in.)
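For a 1-D SOURCE and DEST, the proposed semantics match NumPy's unbuffered np.add.at, which makes the contrast with the Fortran 90 form easy to demonstrate (0-based indices here, where the proposal's are 1-based; the wrapper name is mine):

```python
import numpy as np

def sum_send(source, dest, idx):
    """1-D SUM_SEND sketch: dest[idx[j]] += source[j] for every j, with
    repeated destination indices properly combined."""
    np.add.at(dest, idx, source)   # unbuffered: duplicates accumulate

a = np.array([1.0, 10.0, 100.0, 1000.0])
v = np.array([0, 2, 2, 0])         # indices 0 and 2 each appear twice

y = np.zeros(4)
sum_send(a, y, v)
assert list(y) == [1001.0, 0.0, 110.0, 0.0]

# The Fortran 90 form X(V) = X(V) + A buffers, so duplicate sends are lost:
x = np.zeros(4)
x[v] += a
assert list(x) == [1000.0, 0.0, 100.0, 0.0]
```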

From gls@think.com  Wed May 20 17:04:08 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA13071); Wed, 20 May 92 17:04:08 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 20 May 92 18:03:59 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA10725; Wed, 20 May 92 18:03:58 EDT
Date: Wed, 20 May 92 18:03:58 EDT
Message-Id: <9205202203.AA10725@strident.think.com>
To: hpff-intrinsics@cs.rice.edu

Proposal for parallel prefix intrinsics for HPF

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 20, 1992


For every reduction operation XXX in the language, introduce two new
intrinsics XXX_PREFIX and XXX_SUFFIX.  They take the same arguments
as the corresponding reduction intrinsic, plus two additional
optional arguments:

	XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)

The first additional optional argument is called SEGMENT, which is of
type logical and conformable with the ARRAY argument.  If present,
the array is divided into pieces (segments) and the running
accumulation is reset at the start of each new segment before
processing the corresponding array element; the precise rules for
delimiting segments are given below.

The second additional optional argument, a scalar logical, is called
EXCLUSIVE, default .FALSE., which determines whether the prefix or
suffix operation is inclusive (the default) or exclusive.  (The
inclusive sum-prefix of (/ 1,2,3,4 /) is (/ 1,3,6,10 /) whereas the
exclusive sum-prefix is (/ 0,1,3,6 /).)

Array elements corresponding to positions where the MASK is false
do not contribute to the running accumulation.  However, the result
is still defined for corresponding positions in the result.
In actual practice, results may not be required in those positions;
in such cases the programmer may be able to use the WHERE statement
to give the compiler a strong hint:

      WHERE (FOO) A=SUM_PREFIX(B,MASK=FOO)

If the DIM argument is omitted, then the arrays are processed in
array element order ("column-major"), as if temporarily regarded as
one-dimensional.

In all cases the result has the same shape as the first argument.

In every case, every element of the result has a value equal to the
reduction of certain selected elements of ARRAY, or an "identity
value" (zero for SUM_PREFIX or SUM_SUFFIX, for example) if no
elements of ARRAY are selected for that result element.  The optional
arguments affect the selection of elements of ARRAY for each element
of the result; the selected elements of ARRAY are said to contribute
to the result element.

For any given element R of the result, let A be the corresponding
element of ARRAY.  Every element of ARRAY contributes to R unless
disqualified by one of the following rules.

For XXX_PREFIX, no element that follows A in the array element
ordering of ARRAY contributes to R.  For XXX_SUFFIX, no element that
precedes A in the array element ordering of ARRAY contributes to R.

If the DIM argument is provided, an element Z of ARRAY does not
contribute to R unless all its indices, excepting only the index for
dimension DIM, are the same as the corresponding indices of A.

If the MASK argument is provided, an element Z of ARRAY does
not contribute to R if the element of MASK corresponding to
Z is false.

If the SEGMENT argument is provided, an element Z of ARRAY does not
contribute unless the elements B and Y of SEGMENT corresponding to A
and Z (respectively), and the intervening elements of SEGMENT as
well, all have the same value.  If the DIM argument is not present,
then the "intervening" elements are all elements between them in
array element order; if the DIM argument is present, then the
"intervening" elements are those having indices the same as those of
both B and Y, except the index for dimension DIM, which must be
between (and possibly equalling) the indices of B and Y for dimension
DIM.  In other words, the prefix or suffix operation is performed
on groups of elements of ARRAY, where a group corresponds to a
maximal contiguous run of like-valued elements of SEGMENT.

If the EXCLUSIVE argument is provided and is true, then A itself
does not contribute to R.


In addition to all this, the operation COPY_PREFIX replicates the first
(lowest-indexed) element of each segment throughout the segment, and
the operation COPY_SUFFIX replicates the last (highest-indexed)
element of each segment throughout the segment.

Examples:

SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)
SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) )              yields (/1,1,1,2,3,4,4,5,5/)
COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) yields (/0,1,1,1,2,3,4,4,5/)

SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/),
    SEGMENT=(/T,T,T,T,F,F,T,F,F/)) yields (/1,3,6,10,5,11,7,8,17/)
              ------- --- - ---             -------- ---  - ----
	     four input segments       four independent result segments

COPY_PREFIX( (/1,2,3,4,5,6,7,8,9/),
     SEGMENT=(/T,T,T,T,F,F,T,F,F/)) yields (/1,1,1,1,5,5,7,8,8/)
               ------- --- - ---             ------- --- - ---
	      four input segments       four independent result segments
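To make the rules above concrete, here is a small executable model of the
*_PREFIX semantics (written in Python rather than Fortran, purely as an
illustration; the name seg_prefix and its argument conventions are my own
invention, not part of the proposal):

```python
from operator import add

def seg_prefix(op, a, segment=None, exclusive=False, identity=0):
    # Model of the *_PREFIX intrinsics: a new segment begins at every
    # transition (true-to-false or false-to-true) in SEGMENT, and
    # EXCLUSIVE=true drops each element's own contribution.
    out, acc = [], identity
    for i, x in enumerate(a):
        if segment is not None and i > 0 and segment[i] != segment[i - 1]:
            acc = identity  # transition in SEGMENT: restart the scan
        out.append(acc if exclusive else op(acc, x))
        acc = op(acc, x)
    return out

print(seg_prefix(add, [1, 3, 5, 7]))  # [1, 4, 9, 16], the SUM_PREFIX example
print(seg_prefix(add, [1, 2, 3, 4, 5, 6, 7, 8, 9],
                 segment=[True, True, True, True, False, False,
                          True, False, False]))
# [1, 3, 6, 10, 5, 11, 7, 8, 17], the segmented SUM_PREFIX example
```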


Note: Connection Machine software delimits the segments by indicating
the *start* of each segment.  Cray MPP Fortran delimits the segments
by indicating the *stop* of each segment.  Each method has its advantages.
There is also the question of whether this convention should change when
performing a suffix rather than a prefix.

HPF adopts yet a third representation: a new segment begins at every
*transition* from false to true or true to false; thus a segment is
indicated by a maximal contiguous subsequence of like logical values:

	(/T,T,T,F,T,F,F,F,T,F,F,T/)
          ----- - - ----- - --- -    seven segments

The main advantages of this representation are:

(a) It is symmetrical, in that the same segment specifier may
    be meaningfully used for parallel prefix and parallel suffix
    without changing its interpretation (start versus stop).

(b) It seems to be equally inconvenient for every existing
    architecture.  :-)  However, it is not that hard to accommodate.

(c) The start-bit or stop-bit representation is easily converted
    to this form by using a parallel XOR prefix or suffix.
    Of course, we would need to define one (see separate proposal
    for a PARITY reduction intrinsic).  Examples:

    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))

    These might be standard idioms for a compiler to recognize.
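A throwaway model of the XOR-prefix conversion in point (c) (Python, for
illustration only; PARITY_PREFIX itself is defined in the separate PARITY
proposal):

```python
def parity_prefix(bits):
    # Running XOR over a logical vector: element i is the XOR of
    # elements 0..i of the input.
    out, acc = [], False
    for b in bits:
        acc = acc != b   # boolean XOR
        out.append(acc)
    return out

# Start bits marking the beginnings of the segments 1:4, 5:6, 7, and 8:9,
# converted to the HPF transition representation used in the examples above:
start_bits = [True, False, False, False, True, False, True, True, False]
print(parity_prefix(start_bits))
# [True, True, True, True, False, False, True, False, False]
```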

From gls@think.com  Wed May 20 17:04:09 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA13073); Wed, 20 May 92 17:04:09 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 20 May 92 18:03:58 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA10721; Wed, 20 May 92 18:03:56 EDT
Date: Wed, 20 May 92 18:03:56 EDT
Message-Id: <9205202203.AA10721@strident.think.com>
To: hpff-intrinsics@cs.rice.edu

Proposal for HPF intrinsics POPCNT, POPPAR, LEADZ, and ILEN

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 20, 1992


(a) An elemental population count intrinsic.  Its action on a scalar is:

  POPCNT(x) = COUNT( (/ (BTEST(x,J), J=0, BIT_SIZE(x)-1) /) )

The result is the number of 1-bits in the integer x, according to the
bit-manipulation model in section 13.5.7 of the Fortran 90 standard.


(b) An elemental population-parity intrinsic.  Its action on a scalar is:

  POPPAR(x) = MERGE(1,0,BTEST(POPCNT(x),0))

The result is 1 if the number of 1-bits in the integer x is odd,
or 0 if the number of 1-bits in the integer x is even.


(c) An elemental count-leading-zeros intrinsic.  Its action on a scalar is:

  LEADZ(x) = MINVAL( (/ (J, J=0,BIT_SIZE(x)) /),
		MASK=(/ (BTEST(x,J), J=BIT_SIZE(x)-1,0,-1), .TRUE. /) )

The result is a count of the number of leading 0-bits in the integer
x, according to the bit-manipulation model in section 13.5.7 of the
Fortran 90 standard.

Note that a given integer value may produce different results from
LEADZ, depending on the number of bits in the representation of the
integer.  That is because bits are counted from the left (the most
significant bit).
----------------------------------------------------------------

The intent is to define POPCNT, POPPAR, and LEADZ consistent with
their use in Cray Fortran, but to limit them to integer arguments.

----------------------------------------------------------------

(d) An elemental integer-length intrinsic.  Its action on a scalar is:

  ILEN(x) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))

This is related to LEADZ but is often much more convenient for
the calculation of array dimensions, etc.  It is the number of bits
required to store a 2's-complement signed integer x.  As examples of
its use,  2**ILEN(N-1)  rounds N up to a power of 2 (for N > 0),
whereas  2**(ILEN(N)-1)  rounds N down to a power of 2.

Note that a given integer value will always produce the same result
from ILEN, independent of the number of bits in the representation of
the integer.  That is because bits are counted from the right (the
least significant bit).

The definition of ILEN is equivalent to that of the built-in function
integer-length in Common Lisp, which has proven to be quite useful.
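A one-line model of ILEN (Python; bit_length here plays the role of the
integer-length computation, and the names are mine, for illustration only):

```python
def ilen(x):
    # Bits needed for a two's-complement representation of x, sign bit
    # excluded: ceiling(log2(x+1)) for x >= 0, ceiling(log2(-x)) for x < 0.
    return (x if x >= 0 else ~x).bit_length()

n = 5
print(ilen(5), ilen(-5))   # 3 3
print(2**ilen(n - 1))      # 8: rounds N up to a power of 2 (for N > 0)
print(2**(ilen(n) - 1))    # 4: rounds N down to a power of 2
```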

From gls@think.com  Wed May 20 17:04:08 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA13072); Wed, 20 May 92 17:04:08 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Wed, 20 May 92 18:04:00 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.0C)
	id AA10726; Wed, 20 May 92 18:03:59 EDT
Date: Wed, 20 May 92 18:03:59 EDT
Message-Id: <9205202203.AA10726@strident.think.com>
To: hpff-intrinsics@cs.rice.edu

Proposal for HPF sorting intrinsics

Guy L. Steele Jr.
Thinking Machines Corporation
Version of May 20, 1992


The ideas and names here are inspired by APL.  I have used the
term "grade" rather than "rank" because the latter is already used
in the Fortran 90 standard to mean the size of the shape of an array
(that is, the number of dimensions).


GRADE_UP(ARRAY,DIM)

The array may be of type integer, real, or character.

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

	B(i1,i2,...,ik,...in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)

then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in) is
sorted in ascending order; moreover, R(i1,i2,...,:,...,in) is a permutation of
all the integers in the range LBOUND(ARRAY,k):UBOUND(ARRAY,k).  The sort is
stable; that is, if j < m and B(i1,i2,...,j,...,in) .EQ. B(i1,i2,...,m,...,in),
then R(i1,i2,...,j,...,in) < R(i1,i2,...,m,...,in).

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape [SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY))]
and the property that if one computes the rank-1 array

	B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))

where n=SIZE(SHAPE(ARRAY)), then B is sorted in ascending order;
moreover, all columns of S are distinct, that is, if j /= m then
ALL(S(:,j) .EQ. S(:,m)) will be false.  The sort is stable;
if j < m and B(j) .EQ. B(m), then ARRAY(S(1,j),S(2,j),...,S(n,j))
precedes ARRAY(S(1,m),S(2,m),...,S(n,m)) in the array element ordering
of ARRAY.
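For the rank-1, DIM-present case, the specification collapses to a stable
sort of indices.  A model in Python (0-based indices here for brevity, where
Fortran would use the array's lower bound; illustrative only):

```python
def grade_up(a):
    # Stable ascending grade: Python's sort is stable, so equal
    # elements keep their original relative order, as required.
    return sorted(range(len(a)), key=a.__getitem__)

def grade_down(a):
    # Stable descending grade: with reverse=True, Python still keeps
    # equal elements in original order, matching the stability rule.
    return sorted(range(len(a)), key=a.__getitem__, reverse=True)

a = [10, 30, 20, 30]
print(grade_up(a))    # [0, 2, 1, 3]: ties (the two 30s) in original order
print(grade_down(a))  # [1, 3, 2, 0]: stable, NOT a reversal of grade_up
```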


GRADE_DOWN(ARRAY,DIM)

The array may be of type integer, real, or character.

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

	B(i1,i2,...,ik,...in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)

then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in) is
sorted in descending order; moreover, R(i1,i2,...,:,...,in) is a permutation of
all the integers in the range LBOUND(ARRAY,k):UBOUND(ARRAY,k).  The sort is
stable; that is, if j < m and B(i1,i2,...,j,...,in) .EQ. B(i1,i2,...,m,...,in),
then R(i1,i2,...,j,...,in) < R(i1,i2,...,m,...,in).  (Yes, that last "<" sign
really should be a "<", not a ">".)

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape [SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY))]
and the property that if one computes the rank-1 array

	B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))

where n=SIZE(SHAPE(ARRAY)), then B is sorted in descending order; moreover, all
columns of S are distinct, that is, if j /= m then ALL(S(:,j) .EQ. S(:,m)) will
be false.  The sort is stable; if j < m and B(j) .EQ. B(m), then
ARRAY(S(1,j),S(2,j),...,S(n,j)) precedes (yes, "precedes", not "follows")
ARRAY(S(1,m),S(2,m),...,S(n,m)) in the array element ordering of ARRAY.

----------------------------------------------------------------

Because of the stability requirement, it is not true in general that
GRADE_DOWN(A(1:N)) equals the reversal of GRADE_UP(A(1:N)).  Indeed, these
results are equal if and only if A contains no duplicate values.

The stability requirement allows one to cascade grading operations in order to
sort on multiple fields.  For example, suppose one had the following derived
type (example taken from section 4.4.1 of the Fortran 90 standard):

      TYPE PERSON
        INTEGER AGE
        CHARACTER (LEN = 50) NAME
      END TYPE PERSON

Now consider two arrays of persons:

      TYPE(PERSON), DIMENSION(100000) :: MEMBERS, ROSTER

Also assume a work vector for indices:

      INTEGER, DIMENSION(100000) :: V

Then the statements

      V = GRADE_UP(MEMBERS%AGE)
      V = V(GRADE_UP(MEMBERS(V)%NAME))
      ROSTER = MEMBERS(V)

cause ROSTER to be a rearrangement of MEMBERS that is sorted
primarily by name and secondarily by age (that is, members with
the same name are grouped together in order of ascending age).
Note that the minor sort field is graded first, and that more
statements like the second one may be inserted to sort on additional
fields.

To list members with the same name in descending order of age,
change the first GRADE_UP to GRADE_DOWN:

      V = GRADE_DOWN(MEMBERS%AGE)
      V = V(GRADE_UP(MEMBERS(V)%NAME))
      ROSTER = MEMBERS(V)
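The cascade above can be checked with a small model (Python with tuples in
place of the derived type, 0-based indices, and a stable grade_up as before;
all data and names are illustrative):

```python
def grade_up(a):
    # Stable ascending grade (0-based); Python's sort is stable.
    return sorted(range(len(a)), key=a.__getitem__)

ages  = [45, 30, 45, 30]
names = ["SMITH", "JONES", "JONES", "SMITH"]

v = grade_up(ages)                                   # minor sort field first
v = [v[j] for j in grade_up([names[i] for i in v])]  # then the major field
roster = [(names[i], ages[i]) for i in v]

print(roster)
# [('JONES', 30), ('JONES', 45), ('SMITH', 30), ('SMITH', 45)]
```

Because each grade is stable, members with equal names stay in the age order
established by the first statement, exactly as described above.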

From loveman@ftn90.enet.dec.com  Fri May 29 08:41:02 1992
Received: from enet-gw.pa.dec.com by cs.rice.edu (AA29203); Fri, 29 May 92 08:41:02 CDT
Received: by enet-gw.pa.dec.com; id AA17207; Fri, 29 May 92 06:40:49 -0700
Message-Id: <9205291340.AA17207@enet-gw.pa.dec.com>
Received: from ftn90.enet; by decwrl.enet; Fri, 29 May 92 06:40:49 PDT
Date: Fri, 29 May 92 06:40:49 PDT
From: David Loveman <loveman@ftn90.enet.dec.com>
To: hpff-intrinsics@cs.rice.edu
Cc: loveman@ftn90.enet.dec.com
Apparently-To: hpff-intrinsics@cs.rice.edu
Subject: NUMBER_OF_PROCESSORS . . .


Proposal for HPF system inquiry intrinsic functions

David Loveman
Digital Equipment Corporation
Version of May 29, 1992


Introduction

In addition to the intrinsic functions of Fortran 90, High Performance
Fortran has two new intrinsic functions:  NUMBER_OF_PROCESSORS and
PROCESSORS_SHAPE.  Their values remain constant for (at least) the
duration of one program execution.  Accordingly, NUMBER_OF_PROCESSORS
and PROCESSORS_SHAPE have values that are restricted expressions and
may be used wherever any other Fortran 90 restricted expression may be
used.  If the system configuration is committed to at compile time,
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE have values that are constant
expressions and may be used wherever any other Fortran 90 constant
expression may be used.  In particular, NUMBER_OF_PROCESSORS may be
used in a specification expression and, if a constant expression, may
be used in an initialization expression.  None of the categories of
intrinsic functions listed in Chapter 13 of the Fortran 90 standard
seem quite apt to describe the nature of this new intrinsic function,
so we add a new category of "system inquiry functions" and place
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE in that category.

Note that treating these intrinsics as constant expressions does not
force a compiler to bind the number of processors at compile time
(although that is one possible implementation) -- with the right linker
or code-generation technology the choice could be deferred until run
time, possibly at some performance cost.



Formal Proposal

<this section contains "formal" proposal material, phrased as additions
or modifications to the Fortran 90 specification, with non-proposal
commentary in angle brackets>

<add descriptive section>

13.8a System inquiry functions

In a multi-processor implementation, the processors may be arranged in
an implementation-dependent n-dimensional processor array.  The system
inquiry functions return values related to this underlying machine and
processor configuration, including the size and shape of the underlying
processor array.  NUMBER_OF_PROCESSORS returns the total number of
processors available to the program or the number of processors
available to the program along a specified dimension of the processor
array.  PROCESSORS_SHAPE returns the shape of the processor array.


<add section listing the system inquiry functions>

13.10.21 System inquiry functions

   NUMBER_OF_PROCESSORS(DIM)      Total number of processors in
                                    the processor array.
   PROCESSORS_SHAPE()             Shape of the processor array


<add intrinsic function definitions in the style of Chapter 13 of the
Fortran 90 standard>

13.13.xx  NUMBER_OF_PROCESSORS(DIM)

Optional Argument.  DIM

Description.  Returns the total number of processors available to the
program or the number of processors available to the program along a
specified dimension of the processor array.

Class.  System inquiry function.

Arguments.
DIM (optional)  must be scalar and of type integer with a value in the
range 1<=DIM<=n where n is the rank of the processor array.

Result Type, Type Parameter, and Shape.  Default integer scalar.

Result Value.  The result has a value equal to the extent of dimension
DIM (1<=DIM<=n, where n is the rank of the processor array) of the
processor-dependent hardware processor array or, if DIM is absent, the
total number of elements, equal to or greater than one, of the
processor-dependent hardware processor array.

Examples. For a DECmpp 12000 Model 8B with 8192 processors, the value
of NUMBER_OF_PROCESSORS( ) is 8192, the value of
NUMBER_OF_PROCESSORS(DIM=1) is 128, and the value of
NUMBER_OF_PROCESSORS(DIM=2) is 64.  For a single processor DECalpha
workstation, the value of NUMBER_OF_PROCESSORS( ) is 1, and the value
of NUMBER_OF_PROCESSORS(DIM=1) is 1.


13.13.yy PROCESSORS_SHAPE()

Description.  Returns the shape of the implementation-dependent processor array.

Class.  System inquiry function.

Arguments.  None

Result Type, Type Parameter, and Shape.  The result is a default
integer array of rank one whose size is equal to the rank of the
implementation-dependent processor array.

Result Value.  The value of the result is the shape of the
implementation-dependent processor array.


Example. For a DECmpp 12000 Model 8B with 8192 processors, the value of
PROCESSORS_SHAPE() is (/ 128, 64 /).  For a Connection Machine CM-2
with 8192 processors, the value of PROCESSORS_SHAPE() might be (/ 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 /).  For a Connection Machine CM-5 with
8192 processors, the value of PROCESSORS_SHAPE() might be (/ 8192 /). 
For a single processor DECalpha workstation, the value of
PROCESSORS_SHAPE() is (/ 1 /).



<The list of alternatives for Fortran 90 constant expressions and
restricted expression are expanded to include NUMBER_OF_PROCESSORS and
PROCESSORS_SHAPE.>

7.1.6.1 Constant expression

A constant expression is . . . . .

(6a)  A system inquiry function reference where each argument is a
constant expression and the compiler has been informed of the
appropriate system configuration.  

. . . . .

7.1.6.2 Specification expression

A restricted expression is . . . . .

(9a)  A system inquiry function reference where each argument is a
restricted expression.  

. . . . .



Discussion and Pragmatic Usage - Consequences of the Formal Proposal

The shape of the processor array may be treated as a constant known at
compile time if the compiler is informed about the system
configuration.  This will allow the values of system inquiry functions
to be used in initialization expressions.  The values of system inquiry
functions are always restricted expressions;  thus they may be used in
specification expressions even if the system configuration is not
committed to at compile time.

Note that the system inquiry functions query the physical machine, and
have nothing to do with any PROCESSORS directive that may occur.

References to system inquiry functions may occur in HPF directives.
Example.
!HPF$ TEMPLATE T(100, 3*NUMBER_OF_PROCESSORS())

The definition of NUMBER_OF_PROCESSORS is modeled on the definition of
the SIZE intrinsic function.

The definition of PROCESSORS_SHAPE is modeled on the definition of the
SHAPE intrinsic function.

SIZE(PROCESSORS_SHAPE()) is the rank of the processor array.

As a result of being a constant expression, if the system configuration
is committed to at compile time, suitably constrained references to
system inquiry functions may occur in initialization expressions as,
for example, initialization values in type-declaration-statements or in
parameter-statements.
Examples.
PARAMETER (N_PROCS=NUMBER_OF_PROCESSORS(),        &
           NXPROCS=NUMBER_OF_PROCESSORS(DIM=1),   &
           NYPROCS=NUMBER_OF_PROCESSORS(DIM=2))

As a result of being a restricted expression, suitably constrained
references to system inquiry functions may occur in specification
expressions as, for example, lower or upper bounds of an
explicit-shape-spec of an array-spec in type-declaration-statements.
Examples.
INTEGER, DIMENSION(SIZE(PROCESSORS_SHAPE())) ::   &
           PS = PROCESSORS_SHAPE()
! PS(2) = NUMBER_OF_PROCESSORS(DIM=2)

Earlier proposals for this type of facility used a form of predefined
named constant, N$PROCS.  This form is rejected for several reasons:
1.  Nowhere else in Fortran is there such a thing as a predefined named constant.
2.  "$" is a legal Fortran character, but not as a character in a name.
3.  If one's implementation supports "$" as a legal character in names,
one could always define
PARAMETER (N$PROCS = NUMBER_OF_PROCESSORS())
4.  Use of a long name for an intrinsic is a strong encouragement for
subset implementations to support 31-character names in general.



From pm@icase.edu  Fri May 29 10:08:26 1992
Received: from bonito.icase.edu by cs.rice.edu (AA01086); Fri, 29 May 92 10:08:26 CDT
Received: by bonito.icase.edu (5.65.1/lanleaf2.4.9)
	id AA01101; Fri, 29 May 92 11:08:24 -0400
Message-Id: <9205291508.AA01101@bonito.icase.edu>
Date: Fri, 29 May 92 11:08:24 -0400
From: Piyush Mehrotra <pm@icase.edu>
To: hpff-intrinsics@cs.rice.edu
Subject: add


From gls@think.com  Mon Jun  1 11:21:54 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA16180); Mon, 1 Jun 92 11:21:54 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Mon, 1 Jun 92 12:21:51 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.2)
	id AA13341; Mon, 1 Jun 92 12:21:50 EDT
Date: Mon, 1 Jun 92 12:21:50 EDT
Message-Id: <9206011621.AA13341@strident.think.com>
To: loveman@ftn90.enet.dec.com
Cc: hpff-intrinsics@cs.rice.edu
In-Reply-To: David Loveman's message of Fri, 29 May 92 06:40:49 PDT <9205291340.AA17207@enet-gw.pa.dec.com>
Subject: NUMBER_OF_PROCESSORS . . .


This version of the proposal looks great to me, and seems to
fit in with the style of the Fortran 90 standard.

My one suggestion is that computers that are unreservedly committed
to being single processors might be regarded as having a scalar processor
arrangement, and we might want to have an example reflecting that:

 For a BananaKlone 2000 laptop, which has a single processor,
 the value of NUMBER_OF_PROCESSORS( ) is 1, and there is
 no valid value of DIM for which NUMBER_OF_PROCESSORS may be called.

 For a BananaKlone 2000 laptop, which has a single processor,
 the value of PROCESSORS_SHAPE() is [] (that is, an empty rank-one array).

--Guy

From loveman@ftn90.enet.dec.com  Mon Jun  1 12:38:37 1992
Received: from enet-gw.pa.dec.com by cs.rice.edu (AA18454); Mon, 1 Jun 92 12:38:37 CDT
Received: by enet-gw.pa.dec.com; id AA05604; Mon, 1 Jun 92 10:38:31 -0700
Message-Id: <9206011738.AA05604@enet-gw.pa.dec.com>
Received: from ftn90.enet; by decwrl.enet; Mon, 1 Jun 92 10:38:35 PDT
Date: Mon, 1 Jun 92 10:38:35 PDT
From: David Loveman <loveman@ftn90.enet.dec.com>
To: gls@think.com
Cc: hpff-intrinsics@cs.rice.edu, loveman@ftn90.enet.dec.com
Apparently-To: gls@think.com, hpff-intrinsics@cs.rice.edu
Subject: NUMBER_OF_PROCESSORS . . .

Guy-

The issue you raise when you say

>My one suggestion is that computers that are unreservedly committed
>to being single processors might be regarded as having a scalar processor
>arrangement, and we might want to have an example reflecting that:

> For a BananaKlone 2000 laptop, which has a single processor,
> the value of NUMBER_OF_PROCESSORS( ) is 1, and there is
> no valid value of DIM for which NUMBER_OF_PROCESSORS may be called.

> For a BananaKlone 2000 laptop, which has a single processor,
> the value of PROCESSORS_SHAPE() is [] (that is, an empty rank-one array).

is, of course, the theological one:  are scalars and one element arrays
"the same" or are they "different?"

I guess it depends on whether or not we support sequence and storage
association in HPF, or not (small joke).  I guess I do have a slight
bias in favor of accepting your amendment, but I am interested in
hearing other views also.

Also, I'm really not quite sure what a "processor" is, except to say
it's a concept that is implementation-defined.  For example, what do you
do with multiple function units?  Or with a MIMD computer, each of whose
nodes is a small number of CPUs (maybe each with its own local memory,
or cache) sharing a memory?

-David

From gls@think.com  Mon Jun  1 13:53:06 1992
Received: from mail.think.com (Mail1.Think.COM) by cs.rice.edu (AA21436); Mon, 1 Jun 92 13:53:06 CDT
Return-Path: <gls@Think.COM>
Received: from Strident.Think.COM by mail.think.com; Mon, 1 Jun 92 14:53:04 -0400
From: Guy Steele <gls@think.com>
Received: by strident.think.com (4.1/Think-1.2)
	id AA15606; Mon, 1 Jun 92 14:53:03 EDT
Date: Mon, 1 Jun 92 14:53:03 EDT
Message-Id: <9206011853.AA15606@strident.think.com>
To: loveman@ftn90.enet.dec.com
Cc: gls@think.com, hpff-intrinsics@cs.rice.edu, loveman@ftn90.enet.dec.com
In-Reply-To: David Loveman's message of Mon, 1 Jun 92 10:38:35 PDT <9206011738.AA05604@enet-gw.pa.dec.com>
Subject: NUMBER_OF_PROCESSORS . . .

   Date: Mon, 1 Jun 92 10:38:35 PDT
   From: David Loveman <loveman@ftn90.enet.dec.com>
   Apparently-To: gls@think.com, hpff-intrinsics@cs.rice.edu

   Guy-

   The issue you raise when you say

   >My one suggestion is that computers that are unreservedly committed
   >to being single processors might be regarded as having a scalar processor
   >arrangement, and we might want to have an example reflecting that:

   > For a BananaKlone 2000 laptop, which has a single processor,
   > the value of NUMBER_OF_PROCESSORS( ) is 1, and there is
   > no valid value of DIM for which NUMBER_OF_PROCESSORS may be called.

   > For a BananaKlone 2000 laptop, which has a single processor,
   > the value of PROCESSORS_SHAPE() is [] (that is, an empty rank-one array).

   is, of course, the theological one:  are scalars and one element arrays
   "the same" or are they "different?"

But in fact they are different in Fortran 90.  There are lots of places
where a 1-element array is acceptable and a scalar is not, or vice versa.

From kwarren@tazdevil.llnl.gov  Wed Jun  3 17:17:38 1992
Received: from tazdevil.llnl.gov by cs.rice.edu (AA18807); Wed, 3 Jun 92 17:17:38 CDT
Received: by tazdevil.llnl.gov (4.1/1.15)
	id AA28423; Wed, 3 Jun 92 15:17:37 PDT
Date: Wed, 3 Jun 92 15:17:37 PDT
From: kwarren@tazdevil.llnl.gov (Karen Warren)
Message-Id: <9206032217.AA28423@tazdevil.llnl.gov>
To: hpff-intrinsics@cs.rice.edu
Subject: proposal


   Is your proposal ready for distribution yet?

   Thanks.
   Karen Warren
   MPCI

From schreibr@riacs.edu  Wed Jun 10 19:21:49 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA13217); Wed, 10 Jun 92 19:21:49 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA00391); Wed, 10 Jun 92 19:21:45 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA00452; Wed, 10 Jun 92 17:21:44 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA09674; Wed, 10 Jun 92 17:20:41 PDT
Message-Id: <9206110020.AA09674@thor.riacs.edu>
Date: Wed, 10 Jun 92 17:20:41 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: ak@esids2.edvz.tuwien.ac.at
Subject: Re:  System intrinsic functions for HPFF
Cc: hpff-intrinsics@erato.cs.rice.edu

Dear Arnold Krommer,

Yes, I would like to see this material.   The best form
would be online, either postscript, tex, latex, or
ascii.   That way I can distribute it to the intrinsics
subcommittee of the HPFF.


Thanks for your interest, and I look forward to reading
your report.   We are certainly considering intrinsics
of the class you describe (system inquiry intrinsics)
and we would be happy to be the beneficiaries of your
experience and work on the problem.

Rob Schreiber

    RIACS
    MS T045-1, NASA Ames Research Center
    Moffett Field, CA 94035



	From ak@esids2.edvz.tuwien.ac.at Tue Jun  9 00:50:55 1992
	Subject: System intrinsic functions for HPFF
	To: schreibr@riacs.edu
	X-Mailer: ELM [version 2.3 PL11]

	Dear Rob Schreiber,

	recently I heard about the HPFF activities concerning
	intrinsic functions. I think our group at the Technical
	University Vienna could contribute some useful concepts
	in the field of system intrinsic functions for parallel
	programming.

	It is our belief that portable programs that run efficiently
	on different parallel systems are only possible if they can
	adjust themselves to fit the respective computer system they
	run on. The essence of such programs is algorithms which are
	able to adapt themselves to different computer architectures:
	ARCHITECTURE ADAPTIVE ALGORITHMS (see Technical Report ACPC/TR
	92-2 of the Austrian Center for Parallel Computation). 
	To write such programs, language elements (environment enquiries)
	are needed to provide information about the available processors,
	memory hierarchies, communication facilities, etc.

	We have made a detailed suggestion for the definition of such
	functions. If you are interested in getting written material about
	our concepts, please let me know (please add your regular mail
	address).

	                                 With kind regards
	                                 
	                                           Arnold Krommer
	                                           
	----------------------------------------------------------
	Arnold R. Krommer
	Institut for Applied and Numerical Mathematics
	Technical University Vienna
	Wiedner Hauptstr. 8-10/115
	A-1040 WIEN

	ak@esids2.edvz.tuwien.ac.at



From zrlp09@trc.amoco.com  Mon Jun 29 14:49:02 1992
Received: from noc.msc.edu by cs.rice.edu (AA05071); Mon, 29 Jun 92 14:49:02 CDT
Received: from uc.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA29051; Mon, 29 Jun 92 14:49:01 -0500
Received: from [129.230.11.2] by uc.msc.edu (5.65/MSC/v3.0z(901212))
	id AA04137; Mon, 29 Jun 92 14:48:57 -0500
Received: from trc.amoco.com (apctrc.trc.amoco.com) by netserv2 (4.1/SMI-4.0)
	id AA00243; Mon, 29 Jun 92 14:48:51 CDT
Received: from backus.trc.amoco.com by trc.amoco.com (4.1/SMI-4.1)
	id AA08477; Mon, 29 Jun 92 14:48:48 CDT
Received: from localhost by backus.trc.amoco.com (4.1/SMI-4.1)
	id AA19325; Mon, 29 Jun 92 14:48:48 CDT
Message-Id: <9206291948.AA19325@backus.trc.amoco.com>
To: hpff-intrinsics@cs.rice.edu
Subject: Julio Diaz comments on parallel prefix operators
Date: Mon, 29 Jun 92 14:48:47 -0500
From: "Rex Page" <zrlp09@trc.amoco.com>


Forwarding a note from Julio Diaz:

Date: Mon, 29 Jun 1992 14:21:15 -0500
From: "J. C. Diaz" <diaz@babieco.mcs.utulsa.edu>
Message-Id: <199206291921.AA06370@babieco.mcs.utulsa.edu>
To: rpage@trc.amoco.com
Subject: Re:  parallel prefix operators for HPF


Rex;
I had a look at the proposal by Demmel about parallel prefix operators for HPF.
I do question the need for it as a required intrinsic in the language.
As presented in the draft that you provided me, it does not appear to
have wide applicability.  On the one hand, the users who need to solve a
two-term recurrence are but a fraction of the community.  On the other hand,
I can see why it would be good for a SIMD architecture such as the CM-2.
But I do not think this would be a good operation for distributed processing.
I have talked to some other leading researchers in the same field of
numerical eigensystem crunching, and they agreed with me.
Perhaps it could be made into a vendor option.
Please let me know if you have any other questions.
Thanks.
JCD

From schreibr@riacs.edu  Fri Jul 17 17:13:49 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA24347); Fri, 17 Jul 92 17:13:49 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA17070); Fri, 17 Jul 92 17:13:35 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA03328; Fri, 17 Jul 92 15:13:16 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA04279; Fri, 17 Jul 92 15:11:56 PDT
Message-Id: <9207172211.AA04279@thor.riacs.edu>
Date: Fri, 17 Jul 92 15:11:56 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Paper on Inquiry Intrinsics


I have received (in latex form) a paper on a wide range of
machine-inquiry intrinsics (like number_of_processors()) from
Arnold Krommer and Christoph Ueberhuber
at the Austrian Center for Parallel Computing.   Here, FYI, it is:

%Dear Rob Schreiber,
%
%thank you for your interest. A latex version of our paper
%on system inquiry intrinsics (submitted to Parallel
%Computing) is included in this mail.
%
%
%                           Sincerely 
%
%                                Arnold Krommer
%                                
%----------------------------------------------------------
%Arnold R. Krommer
%Institut for Applied and Numerical Mathematics
%Technical University Vienna
%Wiedner Hauptstr. 8-10/115
%A-1040 WIEN
%
%ak@esids2.edvz.tuwien.ac.at
%
%---------------------------------------------------------
\documentstyle[12pt,fullpage]{article}

\pagestyle{headings}
\setlength{\parindent}{0pt}
\setlength{\parskip}{5pt plus 2pt minus 1pt}
\setcounter{tocdepth}{4}
\setcounter{secnumdepth}{4}
\frenchspacing
\sloppy

\begin{document}
\begin{titlepage}
\title{{\bf Architecture Adaptive Algorithms}}

\author{Arnold R. Krommer\\Christoph W. Ueberhuber\\\\Institute for Applied and
Numerical Mathematics\\Technical University Vienna}

\date{\today}
\end{titlepage}
\maketitle

\begin{abstract}
The architecture adaptive algorithm ({\sc aaa}) methodology introduced in this
paper is an attempt to provide concepts and tools for writing {\em portable\/}
parallel
programs that run {\em efficiently}\/ on a broad range of target machines.
Environment enquiries are defined to
provide
information about the available {\em processors}\/, {\em communication
channels\/},
{\em memory hierarchies}\/, etc. By taking this information into account, a
portable parallel algorithm can adapt its behavior to environment
characteristics and thus increase its {\em efficiency}\/ significantly.

\end{abstract}
\section{Introduction}
On traditional uniprocessor machines, portable programs usually have a good
overall efficiency (due to the power of modern optimizing compilers).
When using vector processors, the programmer must be much more careful to 
utilize fully
the available capacity of all eligible target machines. With parallel systems
the situation is even worse. Portability -- much more difficult to achieve
than on uniprocessor machines -- is far from enough to ensure a satisfactory
performance over a wide range of parallel computers. Portable programs that
run efficiently on different parallel systems are only possible if they
can adjust themselves to fit the respective computer system they run on. The
essence of such programs is algorithms which are able to 
adapt themselves to different computer architectures:\\ {\sc Architecture 
Adaptive Algorithms}.
 
\subsection{The Idea}
The fundamental idea of the {\sc aaa} methodology has been put into words in
1967 by Peter Naur~\cite{naur67}:
\begin{quote}
{\sl We have to admit a new kind of elements, Environment Enquiries, 
in our common programming languages. These should be designed to place a
carefully chosen set of information about the available equipment at the
disposal of the programmer, for use in directing the control of the process.}
\end{quote}
In the context of this quotation (written twenty-five years ago), 
{\em environment enquiries\/} were supposed to
yield information about machine numbers (radix, minimum and maximum
exponent, etc.) and computer arithmetic. By taking this information into
account,
an algorithm can adapt itself to machine characteristics and
thus increase its {\em portability}\/ significantly. After long discussions,
{\em arithmetic enquiry functions\/} were included in the
Fortran\,90~\cite{f90norm} language definition. 

Where multiprocessors are concerned, environment enquiries are to
provide
information about the available {\em processors}\/, {\em communication
channels\/},
{\em memory hierarchies}\/, etc. By taking this information into account, a
portable parallel algorithm can adapt its behavior to environment
characteristics and thus increase its {\em efficiency}\/ significantly.

\subsection{The Necessity}
Programs without explicit parallelism, i.e.\ sequential and implicitly parallel
(functional or logic) programs, cannot be mapped {\em automatically\/} onto
parallel hardware resources if optimum performance is to be achieved.
This is due to the
complexity of the mapping task, which is known to be $N\!P$-complete (Garey,
Johnson~\cite{garey79}).

Programs using explicit parallelism (expressed in shared memory or message
passing languages, {\sc Linda}, etc.) implicitly require certain properties in
the underlying system
to run efficiently (Krommer, Ueberhuber~\cite{krommer92a}, Chapters~3 and~4).
For instance, they require a certain granularity and
topology of the computer, the homogeneity of its processing nodes, the absence of
processor time-sharing, etc. If such a program is ported to a parallel machine
of another type, its efficiency may deteriorate dramatically.

The explicit expression of parallelism is not enough to ensure the efficiency
of portable parallel programs.
Portable programs must {\em adapt}\/
their performance critical sections and features (algorithmic granularity,
communication processes, 
load distribution, etc.) to the architecture of the current parallel computer.

\subsection{The Scope}
The {\sc aaa} approach is intended to provide a methodology for writing programs
that run efficiently on {\em homogeneous}\/ or {\em heterogeneous networks}\/ of
{\em sequential,
vector, SIMD, and MIMD computers}\/.

Computer systems like {\em systolic arrays}\/, {\em dataflow}\/
and {\em reduction machines}\/ -- while attracting considerable interest --
are not dealt with in this report. 
None of these systems is at the moment mature enough to compete as a
cost-effective {\em general purpose\/} computer.
For special applications, such as signal processing, some of
these systems are already applied successfully. However, 
special purpose hardware needs tailor-made software, which keeps 
interest in portable programs low.

\subsection{An Example}
In the following sections, various aspects of the architecture adaptive algorithm
methodology are discussed in some detail.  
Examples are taken from the authors' main area of interest, parallel
quadrature.

Various adaptive quadrature algorithms exploiting coarse grain parallelism
have been discussed in the literature
(Krommer, Ueberhuber~\cite{krommer91b}). A meta-algorithm comprising a
class of potentially promising algorithms is given by Krommer,
Ueberhuber~\cite{krommer92a}, in Table\,1. This
meta-algorithm has been formulated only for the one-dimensional case
(for functions of one independent variable). A modification for the
multidimensional case can easily be made
(de\,Doncker, Kapenga~\cite{doncker91c}).

The master and all the workers have their own local interval collection which is
processed according to a globally adaptive strategy. The workers
send information about their local integral and error estimates to the master.
Moreover, information concerning workload is exchanged between workers (and
between workers and the master), and
{\em load balancing}\/ is achieved by distributing intervals to be processed 
according to workload information. 

The master process of this meta-algorithm is responsible for
{\em control tasks}\/ like 
initializing,
keeping track of the overall integral and error estimate,
starting new workers, and
terminating the algorithm,
and {\em worker tasks}\/ like
applying a quadrature rule and
load balancing.
The subsumption of control and worker activities under the master process has
been chosen in order to enable
implementations on machines without multitasking facilities.

In the following section, it is shown that the efficiency of a particular
algorithm in this class crucially depends on the underlying computing
environment. Various pieces of information required by an algorithm to
adapt itself to the underlying system will be discussed.  

\section{Required Information} \label{reqinf}
Various kinds of information are required by an architecture
adaptive algorithm:

\begin{description}
\item [{\em Processing Speed}\/:]
Knowledge of the performance of processors is needed for load balancing
purposes. In heterogeneous network-based computing, for example, different
nodes may operate at different speeds. Such parallel systems require the
distribution of dissimilar chunks of work adjusted to the individual processor
characteristics. Heterogeneity also occurs in SIMD machines where
a fast control unit contrasts with slow processing elements.
Operations on short arrays should be executed by the control unit of such
machines rather than by the processing elements. Recent investigations (Andrews,
Polychronopoulos~\cite{andrews91}) suggest that heterogeneous environments
may have a better cost/performance ratio than homogeneous systems and therefore
should be included in the range of target machines for parallel software. 

{\footnotesize
{\bf Example:}
The exchange of workload information is part of the load balancing activities
in the
parallel adaptive meta-algorithm in Table\,1. The workload of a worker does not
depend solely
on problem parameters (like error estimates in its local
interval collection), but also on its individual processing speed. A faster
processor should have intervals with larger error estimates in its local
interval collection than a slower one has, as the faster processor is more
powerful
in reducing error estimates (as a result of subdividing intervals).

}
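The speed-dependent workload distribution sketched above can be illustrated with a small code fragment (a hypothetical scheme: the function name and the linear weighting are our assumptions, not part of the meta-algorithm in Table\,1):

```python
def target_shares(speeds):
    """Split the error-reduction work among workers in proportion to
    their relative processing speeds, so that a processor twice as
    fast holds twice the share of high-error intervals."""
    total = float(sum(speeds))
    return [s / total for s in speeds]
```

For three workers with relative speeds 1, 2, and 1, the middle worker would receive half of the total work.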

Obtaining an appropriate algorithmic granularity requires knowledge of the
hardware granularity. Thus, for all heterogeneous as well as homogeneous
parallel systems, processor performance information is needed.

{\footnotesize
{\bf Example:}
In the meta-algorithm in Table\,1, load balancing is achieved by transferring
intervals between workers. However, while an interval is being transferred,
neither
the sending nor the receiving worker can process it. Consequently, if
interval processing is fast (with respect to communication speed), it might be
more efficient to process an
interval locally, i.e. to keep a load
{\em imbalance}\/, than to try to balance the load by sending intervals to other
workers.

}
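A minimal break-even rule capturing this trade-off might read as follows (an illustrative sketch; the parameter names and the plain comparison are assumptions):

```python
def keep_local(interval_work_s, transfer_delay_s):
    """While an interval is in transit, neither the sending nor the
    receiving worker can process it.  Keeping a load imbalance is
    therefore preferable whenever processing the interval locally is
    cheaper than transferring it."""
    return interval_work_s <= transfer_delay_s
```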

\item [{\em Communication Speed}\/:]
Several performance numbers concerning communication must be known
in order to balance computational load against communication load.  The
{\em communication delay}\/,
i.e. the time required for performing a communication operation, can be divided
into four parts (Bertsekas, Tsitsiklis~\cite{bertsekas89}):
\begin{itemize}
\item {\em Communication Processing Time}\/: the time needed to prepare
information for transmission;
\item {\em Queueing Time}\/: the time spent waiting for the start of transmission;
\item {\em Transmission Time}\/: the time needed to transmit data;
\item {\em Propagation Time}\/: the time between the end of transmission and the end
of reception.
\end{itemize}
Communication delays depend not only on hardware characteristics, but also
on system software properties, like communication protocols.

Communication delay is crucial for determining the appropriate
granularity of a parallel algorithm.
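The four parts listed above add up to a simple additive delay model (a sketch; the default parameter values are merely illustrative):

```python
def comm_delay(nbytes, processing=2e-5, queueing=0.0,
               bandwidth=2.8e6, propagation=6e-5):
    """Additive four-part delay model: communication processing time
    and queueing time are independent of the message length, the
    transmission time grows linearly with it, and the propagation
    time is a fixed network latency (times in seconds, bandwidth in
    bytes per second)."""
    transmission = nbytes / bandwidth
    return processing + queueing + transmission + propagation
```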

{\footnotesize
{\bf Example:}
As can be seen from the previous discussion of the meta-algorithm, load balancing
should be done cautiously if communication delays are not to be
neglected in comparison with processing time.

}

For a specific hardware and system software environment, communication delays
depend primarily on the following factors:
\begin{description}
\item [{\em Message Length}\/:]
Transmission time is a monotonically (in fact linearly) increasing function of the amount
of information 
to be transmitted. The time needed for the other parts of a
communication operation is
usually independent of the message length.

{\footnotesize
{\bf Example:}
The time spent in the different stages of a communication operation can vary
significantly depending on the underlying architecture. If a considerable
part of the
communication delay does not depend on the message length, several intervals
should be transferred together. The {\em blocking}\/ of messages helps to
diminish the communication overhead significantly.

}
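The gain from blocking can be quantified with a small model (the parameter names are assumptions; the point is only that the length-independent overhead is paid per message, not per interval):

```python
import math

def total_delay(n_items, items_per_message, per_message_overhead_s,
                per_item_time_s):
    """Total delay for shipping n_items intervals when several of
    them are blocked into one message: the length-independent
    overhead is paid once per message instead of once per item."""
    messages = math.ceil(n_items / items_per_message)
    return messages * per_message_overhead_s + n_items * per_item_time_s
```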

\item [{\em Location of Sender and Receiver}\/:]
Communication delays are independent of the location of the sender and the
receiver
only in fully connected topologies, like crossbars or buses.
On other topologies, most
notably sparsely connected communication systems like arrays and rings, 
communication delays
may vary substantially depending on the location of the sender and the receiver.

Communication delays can be used for defining a measure of distance (a
{\em metric}\/) for every ordered pair of processors:
Two processors are close to each other, if the
communication delay for a message transfer is low; they are far away from
each other, if the communication delay is high. This measure of distance does
not necessarily correspond to the spatial (physical) distance between
processors.


{\footnotesize
{\bf Example:}
The load balancing strategy applied in a parallel adaptive quadrature 
algorithm should take
into account the communication distances between workers.
If two workers are very
close to each other, intervals should be exchanged even when there is only a
slight load imbalance. On the other hand, should two workers be far away from
each other, a load
imbalance has to be significant to trigger an interval transfer.

}
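A distance-dependent trigger of this kind can be sketched as follows (the linear scaling of the threshold with the delay-based distance is our assumption; the text only requires that distant workers need a larger imbalance):

```python
def needs_transfer(imbalance, distance, base_threshold=1.0):
    """Trigger an interval transfer only when the load imbalance
    exceeds a threshold that grows with the delay-based distance
    between the two workers."""
    return imbalance > base_threshold * distance
```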

\item [{\em Traffic and Throughput}\/:]
Queueing time (and consequently communication delay) may increase due to
{\em interference}, i.e. the interaction of several concurrent transmissions
on a communication resource. Communication delay is a non-linear,
monotonically increasing 
function of {\em traffic} -- the amount of information transmitted per time
unit -- and tends to infinity as the traffic approaches the {\em throughput} --
the maximum amount of information that can be transmitted on a given
communication network per time unit. 

{\footnotesize
{\bf Example:}
Communication traffic due to load balancing normally results in increased
communication overhead. A load balancing strategy for a parallel adaptive
algorithm
that neglects this effect may be responsible for a loss of efficiency.

}
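A standard way to model this behavior is a queueing-style delay curve (an M/M/1-type formula; the specific functional form is an assumption beyond the text):

```python
def queueing_delay(traffic, throughput):
    """Non-linear, monotonically increasing delay as a function of
    traffic; the delay tends to infinity as the traffic approaches
    the throughput of the communication network."""
    if traffic >= throughput:
        return float("inf")
    return 1.0 / (throughput - traffic)
```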

\item [{\em Communication Type}\/:]
There are a number of communication patterns frequently occurring in numerical
applications: point-to-point message passing, single-source and
multiple-source broadcasting and reduction, gather and scatter operations,
shifts along several axes of multidimensional arrays, and the emulation of
butterfly networks, etc. (Bertsekas, Tsitsiklis~\cite{bertsekas89}, 
Johnsson~\cite{johnsson91}). 
As special hardware and/or software support for some of these communication primitives
may exist (depending on the respective communication system), their
performance has to be enquired individually, i.e. it cannot be derived from
point-to-point communication delays.
The optimal algorithm for a {\em single node scatter}\/, for instance, has
a time complexity of $O(p)$ ($p$ is the number of processors) on a ring
topology as well as on a mesh topology (Bertsekas, Tsitsiklis~\cite{bertsekas89}).
\end{description}

\item [{\em Communication Topology}\/:]
Knowledge of the topology of a given communication network is an important
prerequisite to its effective usage.

There is a broad variety of topologies in use today:
linear arrays, rings, stars, trees, meshes, hypercubes, completely
interconnected systems, etc. (Bertsekas, Tsitsiklis~\cite{bertsekas89}, 
Almasi, Gottlieb~\cite{almasi89}). An interconnection network may even be
{\em reconfigurable}\/, i.e. its topology can be changed during the execution
of an algorithm (Lee, Smitley~\cite{lee88}). 
Additionally, heterogeneous communication systems
may occur in distributed computing environments. 
Thus, information about topology can be quite complex and its utilization can be
difficult.  However, for many applications, there is only a small number of
reasonable task allocations and corresponding communication structures. In these
cases the programmer is primarily interested in how well the {\em virtual}\/
topologies (which are inherent to algorithms) can be mapped onto the
{\em physical}\/ topology of a specific parallel system.

{\footnotesize
{\bf Example:}
All workers send information about their local integral and error estimates to
the master. This communication pattern corresponds to a virtual star
topology. However, the algorithm cannot presuppose the existence of a
physical star topology. Rather, the processes should be mapped onto physical
processors as well as possible. For instance, the processes could be mapped so
that some norm 
\begin{displaymath}
\Vert d \Vert_{q} = \left\{
      \begin{array}{ll}
            \left( \sum\limits_{i=1}^{w} |d_{i}|^{q} \right)^{\frac{1}{q}},
              & q \in [1,\infty ) \\
            \max\,\{|d_{1}|,\ldots ,|d_{w}|\}, & q = \infty 
      \end{array}
      \right.
\end{displaymath}
of the vector $d = (d_{1}, \ldots, d_{w})$ of the distances between the workers
and the master is minimal. $\Vert d\Vert_{q}$
can serve as a
measure for how well the virtual topology can be emulated on the physical one.

}
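The norm $\Vert d\Vert_{q}$ defined above translates directly into code (a direct transcription; the function name is our own):

```python
def mapping_norm(d, q):
    """q-norm of the vector d of worker-to-master distances, used as
    a measure of how well the virtual star topology is emulated on
    the physical one (q = inf yields the maximum distance)."""
    if q == float("inf"):
        return max(abs(x) for x in d)
    return sum(abs(x) ** q for x in d) ** (1.0 / q)
```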

\item [{\em Synchronization}\/:]
Inasmuch as synchronization requires communication, all the above considerations
apply. But synchronization causes additional overhead: if a process is delayed
by {\em busy-waiting}, it becomes a load on its environment
(processor, bus, etc.); if it is {\em blocked}, awakening it causes an
additional delay. Synchronization techniques and their
corresponding overhead have to be taken
into consideration when modeling performance.

{\footnotesize
{\bf Example:}
The work to be done by the master process can be split into a part identical
to a worker job (doing integration and load balancing) and a part responsible
for algorithm
control (updating the overall integral and error estimate, initializing
workers). If multitasking facilities are available, these two jobs can be
done by two different processes running on the same processor, which results
in an easier process logic. However, the
control process will spend most of its time waiting for messages from
worker processes. Consequently, the performance of any worker process on the
control algorithm's processor will deteriorate considerably when synchronization
is done by busy-waiting.

}

\item [{\em Memory Organization}\/:]
Performance is significantly influenced by utilizing available memory
hierarchies. 
The management of complex memory structures is usually
hidden from the programmer. If, however, details of the memory management
strategies are known, algorithms can take
advantage of this knowledge.

{\footnotesize
{\bf Example:} \label{cac}
Adaptive quadrature algorithms applied to integrands with oscillations,
peaks, and/or singularities often produce very large interval
collections with
thousands of intervals. In such cases, the whole interval collection may not
fit into a given cache memory. Consequently, frequent cache misses will occur
when inserting new intervals into the interval collection, which diminishes the
performance
significantly. Therefore, the data structure of the interval collection should
take into account currently available cache sizes. For instance,
the interval collection could be split into two parts: a sorted part (which 
comprises the
intervals with large error estimates) and an unsorted part. The size of the
sorted part chosen should be small enough to fit into the existing cache memory.
There is a penalty when intervals of the unsorted part are accessed. Such
overhead, however, rarely occurs if the unsorted part contains only intervals
with small error estimates.

}
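The two-part interval collection suggested above can be sketched as follows (an illustrative data structure; the class and method names are our own, and the capacity stands for the number of intervals that fit into the cache):

```python
import bisect

class IntervalCollection:
    """Interval collection split into a small sorted part (intervals
    with the largest error estimates, kept within the cache budget)
    and an unsorted overflow part."""

    def __init__(self, sorted_capacity):
        self.sorted_capacity = sorted_capacity
        self.sorted_part = []      # kept sorted by error estimate
        self.unsorted_part = []    # rarely accessed overflow

    def insert(self, error, interval):
        bisect.insort(self.sorted_part, (error, interval))
        if len(self.sorted_part) > self.sorted_capacity:
            # the smallest-error interval overflows; accessing it
            # later incurs the (rare) penalty mentioned in the text
            self.unsorted_part.append(self.sorted_part.pop(0))

    def pop_worst(self):
        """Remove the cache-resident interval with the largest
        error estimate for subdivision."""
        return self.sorted_part.pop()
```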

\item [{\em Process Initialization}\/:]
Starting a new process requires initialization work, which results
in a delay of the execution.

{\footnotesize
{\bf Example:}
If an integrand is smooth,
one single evaluation of a quadrature rule is often sufficient to attain the
given accuracy requirement. The time spent on the corresponding
computational effort may be much smaller than the start up time of a worker
process. Therefore, the master process should {\em not}\/
start worker processes at the very beginning of its execution.

} 
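A possible start-up rule implementing this advice (an illustrative heuristic; the text only states that workers should not be started immediately):

```python
def workers_to_start(estimated_work_s, startup_cost_s, max_workers):
    """Start worker processes only once the estimated remaining work
    exceeds the per-process start-up cost; a smooth integrand is
    then handled by the master alone."""
    if estimated_work_s <= startup_cost_s:
        return 0
    return min(max_workers, int(estimated_work_s // startup_cost_s))
```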

\item [{\em Execution Mode}\/:]
Different processors may run either in a synchronous (SIMD) or in an asynchronous
(MIMD) way. 

{\footnotesize
{\bf Example:}
The time required for one evaluation of the integrand function may depend on
the respective abscissa. On a SIMD machine, different workers apply the
quadrature rule to their respective intervals synchronously, which results in
synchronous function evaluations. The overall evaluation time is
determined by the most time consuming evaluation: All processors have to wait
until the most time consuming evaluation has been finished. In cases of
substantially varying function evaluation times, such algorithms cannot be
expected to utilize a SIMD computer
efficiently.

}
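The cost difference between the two execution modes can be made explicit (idealized models; a real MIMD run only approaches the average under perfect load balance):

```python
def simd_rule_time(eval_times):
    """On a SIMD machine all workers evaluate the integrand in
    lockstep, so each synchronous step costs as much as the slowest
    function evaluation."""
    return max(eval_times)

def mimd_rule_time(eval_times):
    """On an asynchronous (MIMD) machine a perfectly balanced run
    approaches the average evaluation cost instead."""
    return sum(eval_times) / len(eval_times)
```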

\item [{\em Side Effects}\/:]
When several requests are submitted simultaneously to a resource, the
performance of this resource degrades because of {\em interference\/}.
For example, communication delay usually increases when communication traffic
gets heavier.
However, while the programmer can
be expected to be aware of message passing interference, other sources of
interference are less obvious. 

{\footnotesize
{\bf Example:}
In the parallel adaptive quadrature algorithm in Table\,1, communication is
formulated in a
message passing style. This does {\em not}\/ imply that the algorithm
is meant only
for distributed memory machines. Message passing can be implemented easily on
shared memory machines. In this case messages not only interfere with each
other, but also with memory accesses. Thus, both the effective processing speed
and the effective communication speed may
be reduced, which is a fact that might easily be overlooked.  

}

The following side-effects have to be taken into account by an
architecture adaptive algorithm:
\begin{itemize}
\item interference between communication and storage operations on
shared memory computers;
\item interference between processing and communication operations on
distributed systems without special routing processors;
\item interference between different programs running in a time sharing 
environment.
\end{itemize}

\item [{\em Algorithm Adaption Overhead}\/:]
Acquiring information and making the corresponding decisions cause an additional
overhead in architecture adaptive algorithms.
The decision overhead should be quantifiable
in order to find a reasonable compromise between this {\em decision overhead}\/
and the potential gain in efficiency of a more carefully adapted algorithm.
\end{description}

\section{Abstract Machine Model}
Information about the underlying hardware and system software is
indispensable for architecture adaptive algorithms. Preferably, this information
should be given within the framework of a {\em parameterized abstract machine
model}\/ (Augustyn, Krommer, Ueberhuber~\cite{augustyn91}). Describing systems
without an abstract machine model would limit the
applicability of architecture adaptive algorithms to already existing
machines. 
The parameters of the abstract machine model have to represent
different performance relevant aspects of parallel computing systems
(the number of processors, cache sizes, etc.).
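Such a parameterized model can be represented as a simple record (the field selection is an illustrative subset of the performance-relevant aspects; the values are the iPSC/860 figures quoted in the examples of this section):

```python
from dataclasses import dataclass

@dataclass
class MachineModel:
    """Parameters of an abstract machine model: performance-relevant
    aspects of a parallel computing system."""
    num_processors: int
    cache_bytes: int           # per-node data cache
    main_memory_bytes: int     # per-node main memory
    latency_s: float           # node-to-node message latency
    bandwidth_bytes_s: float   # interconnect bandwidth

# 16-node iPSC/860: 8 KByte cache, 8 MByte main memory,
# 60 microseconds latency, 2.8 MByte/s bandwidth
ipsc860 = MachineModel(16, 8 * 1024, 8 * 2**20, 60e-6, 2.8e6)
```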

The tremendous diversity of today's computing environments makes
the development of an abstract machine model a most challenging task.
In the design process of machine models decisions have to be made concerning
their accuracy and complexity. Accurate models enable highly efficient
architecture adaptive algorithms. Due to their complexity, however, they are
unwieldy for the algorithm designer. An outline
of a possible machine model (based on the
more general exposition of Augustyn, Krommer and
Ueberhuber~\cite{augustyn91}) is given in this section.

{\footnotesize
{\bf Example:}
A specific computing environment at the Technical University Vienna will serve
for illustration. A local area network interconnects
\begin{itemize}
\item a shared memory computer, a Sequent Balance 21000 with $28$ processors;
\item a distributed memory computer, a 16-node iPSC/860 hypercube; and
\item a homogeneous workstation cluster consisting of nine RS/6000-320H
computers connected by a token ring network.
\end{itemize}

}

\subsection{Structure of Computing Environments} \label{agv}
The model to be developed should be used for computationally intensive
problems. I/O bound tasks are not considered in this paper.
The time needed for accessing files is assumed to be negligible.

The performance of computationally intensive tasks can degrade
significantly due to {\em memory hierarchy delays}\/. Properties of the memory
hierarchy's various levels (cache memory, main memory, swap
space on a hard disc, etc.) should be taken into account in a useful machine
model. 
Instruction caches can be neglected in an abstract machine model.
The knowledge of instruction cache sizes is practically useless to the 
programmer of scientific software.

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(156,115)(144,641)
\thicklines
\put(206,728){\line(-1,-6){  7.162}}
\put(199,685){\line( 1,-6){  7.324}}
\multiput(214,656)(8.19048,0.00000){11}{\line( 1, 0){  4.095}}
\multiput(214,699)(8.19048,0.00000){11}{\line( 1, 0){  4.095}}
\multiput(214,714)(8.19048,0.00000){11}{\line( 1, 0){  4.095}}
\put(214,641){\framebox(86,115){}}
\put(214,728){\line( 1, 0){ 86}}
\put(256,717){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninrm Level 1}}}
\put(256,702){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninrm Level 2}}}
\put(256,688){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninrm .}}}
\put(256,677){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninrm .}}}
\put(256,667){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninrm .}}}
\put(256,645){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninrm Level N}}}
\put(167,688){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninbf Memory}}}
\put(167,675){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninbf Hierarchy}}}
\put(256,738){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\ninbf Processor}}}
\end{picture}\\
\vspace{15 pt}
{\footnotesize {\bf Figure 1:} Processor-memory node}
\end{center} 
\end{figure}

{\em Processor-memory nodes} are the fundamental building blocks of a
general model for computing environments. A
processor-memory node contains one processor and/or a memory
hierarchy\footnote{The presence of both the processor and the memory
hierarchy is optional, i.e. a
processor-memory node may consist of only a processor or of only a memory
hierarchy.} (cf. Figure~1). If both the processor and the memory hierarchy
are present,
the processor is supposed to have {\em direct}\/ access to the memory hierarchy.

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(317,358)(83,438)
\thicklines
\put( 83,651){\framebox(83,84){}}
\put( 83,693){\line( 1, 0){ 83}}
\put(125,710){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(125,668){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(125,747){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Sequent Balance}}}
\put(242,747){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm i860}}}
\put(358,747){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm RS 6000/320H}}}
\put(317,693){\line( 1, 0){ 83}}
\put(317,618){\framebox(83,117){}}
\put(200,693){\line( 1, 0){ 83}}
\put(200,643){\framebox(83,92){}}
\multiput(200,668)(6.14815,0.00000){14}{\line( 1, 0){  3.074}}
\multiput(317,668)(6.14815,0.00000){14}{\line( 1, 0){  3.074}}
\multiput(317,643)(6.14815,0.00000){14}{\line( 1, 0){  3.074}}
\put(199,438){\framebox(83,84){}}
\multiput(199,480)(7.90476,0.00000){11}{\line( 1, 0){  3.952}}
\put(358,710){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(358,676){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(358,651){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(358,626){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Swap Space}}}
\put(242,710){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(242,676){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(242,651){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(241,497){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(241,455){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Swap Space}}}
\put(241,534){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Sequent Balance}}}
\put(242,787){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf Processor-memory nodes}}}
\put(241,570){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf Memory node}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 2:} Examples of processor-memory nodes}
\end{center} 
\end{figure}

{\footnotesize
{\bf Example:}
Each processor node of the Sequent Balance (NS\,32032 processor) has exclusive
access to an
8\,KByte data cache. The Sequent Balance includes a memory node (without
processor)
consisting of a 24\,MByte main memory, and a swap space on a disc.
On the iPSC/860 hypercube each processor-memory node (i860 processor) has an
8\,KByte data
cache and an 8\,MByte main memory. Each RS/6000-320H workstation
({\sc Power} processor) has a
32\,KByte data cache, a 16\,MByte main memory, and a 128\,MByte swap space on a private 
disc (cf. Figure~2).

}

Parallel or distributed computing environments are built 
by connecting processor-memory nodes via an {\em interconnection
network\/} (cf. Figure~3). 

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(318,151)(82,640)
\thicklines
\put( 82,690){\framebox(84,84){}}
\put( 82,732){\line( 1, 0){ 84}}
\put(124,749){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(124,707){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Memory}}}
\put(216,690){\framebox(84,84){}}
\put(216,732){\line( 1, 0){ 84}}
\put(258,749){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(258,707){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Memory}}}
\put(316,690){\framebox(84,84){}}
\put(316,732){\line( 1, 0){ 84}}
\put(358,749){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(358,707){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Memory}}}
\put(124,690){\line( 0,-1){ 16}}
\put(258,690){\line( 0,-1){ 16}}
\put(358,690){\line( 0,-1){ 16}}
\put( 82,640){\framebox(318,34){}}
\put(124,782){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf Node 1}}}
\put(258,782){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf Node p-1}}}
\put(358,782){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf Node p}}}
\put(241,653){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Communication network}}}
\put(191,732){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf .  .  .}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 3:} General structure of a parallel machine}
\end{center} 
\end{figure}

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(320,134)(80,660)
\thicklines
\put( 80,710){\framebox(84,84){}}
\put( 80,752){\line( 1, 0){ 84}}
\put(122,769){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(122,727){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(316,710){\framebox(84,84){}}
\multiput(316,752)(8.00000,0.00000){11}{\line( 1, 0){  4.000}}
\put(358,769){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(358,727){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Swap Space}}}
\put(215,710){\framebox(84,84){}}
\put(215,752){\line( 1, 0){ 84}}
\put(257,769){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(257,727){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put( 80,660){\framebox(320,33){}}
\put(240,673){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Bus}}}
\put(122,710){\line( 0,-1){ 17}}
\put(257,710){\line( 0,-1){ 17}}
\put(358,710){\line( 0,-1){ 17}}
\put(189,752){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf .  .  .}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 4:} Shared memory machine (Sequent Balance)}
\end{center} 
\end{figure}
\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(315,150)(85,640)
\thicklines
\put( 85,749){\line( 1, 0){ 83}}
\put( 85,698){\framebox(83,92){}}
\multiput( 85,723)(6.14815,0.00000){14}{\line( 1, 0){  3.074}}
\put(126,765){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(126,732){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(126,707){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(317,749){\line( 1, 0){ 83}}
\put(317,698){\framebox(83,92){}}
\multiput(317,723)(6.14815,0.00000){14}{\line( 1, 0){  3.074}}
\put(359,765){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(359,732){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(359,707){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(126,674){\line( 0, 1){ 24}}
\put(359,674){\line( 0, 1){ 24}}
\put( 85,640){\framebox(315,34){}}
\put(242,652){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Hypercube Network}}}
\put(242,749){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf .     .     .}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 5:} Distributed memory machine (iPSC/860)}
\end{center} 
\end{figure}
\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(315,173)(85,602)
\thicklines
\put( 85,733){\line( 1, 0){ 82}}
\multiput( 85,709)(6.07407,0.00000){14}{\line( 1, 0){  3.037}}
\multiput( 85,684)(6.07407,0.00000){14}{\line( 1, 0){  3.037}}
\put( 85,659){\framebox(82,116){}}
\put(127,750){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(127,718){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(127,693){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(127,667){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Swap Space}}}
\put(318,733){\line( 1, 0){ 82}}
\multiput(318,709)(6.07407,0.00000){14}{\line( 1, 0){  3.037}}
\multiput(318,684)(6.07407,0.00000){14}{\line( 1, 0){  3.037}}
\put(318,659){\framebox(82,116){}}
\put(358,750){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Processor}}}
\put(358,718){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cache}}}
\put(358,693){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Main Memory}}}
\put(358,667){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Swap Space}}}
\put(127,634){\line( 0, 1){ 25}}
\put(358,634){\line( 0, 1){ 25}}
\put( 85,602){\framebox(315,32){}}
\put(243,614){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Token Ring Network}}}
\put(240,730){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf .     .     .}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 6:} Workstation cluster (RS/6000-320H)}
\end{center} 
\end{figure}

{\footnotesize
{\bf Example:}
The Sequent Balance consists of $28$
processor-memory nodes and one (shared) memory node interconnected by
a 26.6\,MByte/s high-speed bus (cf. Figure~4).
The processor-memory nodes of the iPSC/860 are interconnected by a
communication system that forms a
four-dimensional hypercube (cf. Figure~5). The overall bandwidth of the
hypercube is 2.8\,MByte/s; the node-to-node latency is 60\,$\mu$s.
The RS/6000-320H workstations are
interconnected by a token ring network (cf. Figure~6) that provides a throughput
of 16\,MBit/s. The latency is about 10\,ms.

}

\begin{itemize}
\item If a memory node is present, it is assumed that {\em each\/} processor has
access to its memory hierarchy (via the interconnection network), i.e.\ such
a node is assumed to be a {\em shared memory\/}. Internode
communication takes place via the shared memory.
\item If {\em all\/} nodes include a processor, it is assumed that
processors communicate by passing messages over the interconnection network.
It is assumed that no {\em explicit\/} routing has to be carried
out to establish the communication between two nodes that are not directly
linked to
each other, i.e.\ a system software layer provides the user with a {\em
virtually fully
connected\/} interconnection network.
\end{itemize}

{\footnotesize
{\bf Example:}
The existence of a memory node in the diagram of the Sequent Balance indicates
that this is a
shared memory computer. The hypercube topology of the iPSC/860 is completely
hidden by the operating system's message passing routines. Any two
processor-memory nodes can directly communicate with each other. This is also
true for the ring topology
of the RS/6000-320H cluster when using programming tools like {\sc PVM}
(Sunderam~\cite{sunderam90}).

}
 
Several parallel or distributed computing environments, together with
additional processor-memory nodes, can be combined by a higher-level
interconnection network to form an even more complex distributed computing
environment.

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(316,100)(84,700)
\put(284,764){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Cluster}}}
\put(284,779){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm RS 6000}}}
\thicklines
\put( 84,750){\framebox(67,50){}}
\put(117,750){\line( 0,-1){ 17}}
\put(167,750){\framebox(67,50){}}
\put(200,733){\line( 0, 1){ 17}}
\put(250,750){\framebox(67,50){}}
\put(284,733){\line( 0, 1){ 17}}
\put(333,750){\framebox(67,50){}}
\put(367,733){\line( 0, 1){ 17}}
\put( 84,700){\framebox(316,33){}}
\put(117,771){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm iPSC/860}}}
\put(200,779){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Sequent}}}
\put(200,764){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Balance}}}
\put(367,779){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm DECstation}}}
\put(367,764){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm 3100}}}
\put(242,712){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Ethernet}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 7:} Distributed computing environment}
\end{center} 
\end{figure}

{\footnotesize
{\bf Example:}
The Sequent Balance, the iPSC/860 hypercube host, the RS/6000-320H
workstation cluster, and the author's workstation are connected by an
Ethernet network (cf. Figure~7) that provides a bandwidth of 10\,MBit/s.

}

An important question is whether or not this inherently hierarchical structure
of computing environments should be flattened. Instead of using different levels
of interconnection networks, it is possible -- in a virtual or
logical sense -- to model a complex distributed computing environment by using a
simple structure like that of Figure~3.

{\footnotesize
{\bf Example:}
By using programming environments like {\sc PVM}, each Sequent Balance node can
communicate directly (without explicit routing) with each RS/6000-320H node
via {\sc PVM} message passing routines.

}

Combining different communication networks (and network layers) into a single
network simplifies the model structure, but vastly increases the heterogeneity
of the network. Consequently, algorithm control (for instance, the choice
of an appropriate algorithmic granularity) becomes a formidable task.

{\footnotesize
{\bf Example:}
Communication delays between any two Sequent Balance nodes do not depend on
the location of the sender and the receiver.
This is also approximately true (when using {\sc PVM}) for the communication
delays between any two RS/6000-320H workstations in the cluster. In a virtual
network including both the Sequent Balance and the workstation cluster,
however, communication delays between two nodes differ by several orders of
magnitude.
Besides, it is not clear how the Sequent Balance's shared memory resources
(its memory node) can be included in the combined machine model.

The iPSC/860 can be controlled only via a host computer. 
It is not possible to allocate i860 processors from outside. Thus, direct
global algorithm control is not feasible.

}

In many cases it is better to adopt the hierarchical structure of a
computing environment in the abstract machine model and to achieve algorithmic
control in a hierarchical manner.

{\footnotesize
{\bf Example:}
If a parallel algorithm is supposed to use the Sequent Balance,
the workstation cluster, and the hypercube simultaneously, it is not reasonable
to have just
one centralized control process. If, for instance, control is exercised from 
a workstation node, communication between the controlling process and the
Sequent Balance processes will be slow. All Sequent Balance worker
processes should be controlled by a specific control process running on a
Sequent Balance node. 
In the same way the workstation cluster and the hypercube
should be managed by their own local control processes. These local control
processes in turn are subordinated to a global control process (that may be 
residing in any of the three subenvironments). 

}

The structure of a distributed environment may be composed of several
subhierarchies and hierarchical levels.

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(312,115)(88,680)
\put(203,768){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm University}}}
\put(203,753){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Vienna}}}
\put(121,774){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Technical}}}
\put(121,760){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm University}}}
\put(121,745){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Vienna}}}
\thicklines
\put(170,729){\framebox(66,66){}}
\put(203,729){\line( 0,-1){ 16}}
\put(252,729){\framebox(66,66){}}
\put(285,729){\line( 0,-1){ 16}}
\put( 88,729){\framebox(66,66){}}
\put(121,729){\line( 0,-1){ 16}}
\put( 88,680){\framebox(312,33){}}
\put(244,692){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Internet}}}
\put(285,774){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Kepler}}}
\put(285,760){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm University}}}
\put(285,745){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Linz}}}
\put(359,762){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenbf .     .     .}}}
\end{picture}\\
\vspace{15 pt}  
{\footnotesize {\bf Figure 8:} Distributed computing environment}
\end{center} 
\end{figure}

{\footnotesize
{\bf Example:}
The computing environment of the Technical University Vienna is connected
through Internet to other computing sites (cf. Figure~8). These computing sites
are all structured in a hierarchical manner. Computing
resources available in other sites can be combined into a distributed computing
environment. Such an environment only makes sense for applications with an
extremely high demand for computing power and an algorithmic granularity that
is sufficiently coarse.

}
 
\subsection{Characterization of Available Resources}
Before enquiring about the properties of the various components of a
distributed computing environment, it is necessary to know which computing
environments and which components are actually available.

A distributed computing environment can consist of several 
subenvironments. (A processor may have access to several subenvironments.)
If the computational demands of an algorithm are high, the computation might
be spread over several subenvironments. 

{\footnotesize
{\bf Example:}
A Sequent Balance node is not only part of one particular 
Sequent Balance shared memory computer, but also has access (via the Ethernet
and Internet networks) to processor-memory nodes of other machines.
A parallel quadrature algorithm starting its execution on a Sequent Balance node
might first exploit the resources of the Sequent Balance. If the integration
turns out to be computationally intensive, the RS/6000-320H cluster and the
iPSC/860 hypercube might also be exploited.
 
}

An enquiry function that returns the accessible (sub)environments (networks) is
required.

{\footnotesize
{\bf Example:}
The enquiry procedure {\tt environment} yields the number {\sl env\_nr} of
networks accessible to the calling process. A list of these
environments is contained in {\sl env\_list}. 
\begin{tabbing}
aaaa\=\kill
\> {\tt environment(} {\sl env\_nr} {\tt ,} {\sl env\_list} {\tt )}
\end{tabbing}
The array elements in {\sl env\_list} are data objects identifying the
respective environments.

}

As soon as all accessible networks are known, the next step is to enquire their
components (processor-memory nodes, distributed computing environments, etc.).

{\footnotesize
{\bf Example:} The enquiry procedure {\tt components} yields the number
{\sl comp\_nr} of
components connected to the environment {\sl env\_id}. A list of these
components is returned in {\sl comp\_list}.
\begin{tabbing}
aaaa\=\kill
\> {\tt components(} {\sl env\_id} {\tt ,} {\sl comp\_nr} {\tt ,}
{\sl comp\_list} {\tt )}
\end{tabbing}
To keep the environment description
as simple as possible, only different (not replicated) components are 
listed in {\sl comp\_list}. The replication factor of a component is included
in the component's description.

The Sequent Balance, for instance, has only two different components: a
processor-memory
node and a memory node (main memory and hard disc).
There are $28$ replicates of the
processor-memory node, but there is only one memory node.

Components are described by data objects containing the following information:
\begin{itemize}
\item replication factor,
\item type (processor-memory node, memory node, or distributed environment),
\item a component identifier ({\sl node\_id} or {\sl env\_id}).
\end{itemize}

}
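To make the component description concrete, the following Python sketch models the {\tt environment}/{\tt components} enquiry interface for the Sequent Balance configuration described above. It is purely illustrative: the class, field names, and identifier values are assumptions for this sketch, not part of any existing interface.

```python
# Hypothetical sketch of the component enquiry described above.
# All names and numeric identifiers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Component:
    replication: int   # replication factor
    kind: str          # "proc_mem", "memory", or "environment"
    ident: int         # node_id or env_id

# Mock description of the Sequent Balance: 28 identical
# processor-memory nodes plus one shared memory node.
BALANCE = [Component(28, "proc_mem", 100), Component(1, "memory", 200)]

def components(env_id):
    """Return the list of *different* (non-replicated) components."""
    return BALANCE            # single mock environment in this sketch

comp_list = components(1)
total_nodes = sum(c.replication for c in comp_list)
print(len(comp_list), total_nodes)   # → 2 29
```

Note that the replication factor keeps the description compact: the 28 identical nodes appear as a single list entry.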

\subsubsection{Fault-Tolerance}
Although the {\sc aaa} approach is primarily concerned with the performance
aspect of parallel and distributed computing, the {\em reliability\/}
({\em fault-tolerance\/}) issue is also important. This is particularly the case
for heterogeneous computing environments where nodes may become unavailable
to the executing program. For example, this can happen when one of the
participating workstations is physically switched off.

{\footnotesize
{\bf Example:} The procedure {\tt on\_failure} enables a failure handler
{\sl fail\_handler}.
\begin{tabbing}
aaaa\= {\tt near\_proc(} \=\kill
\> {\tt on\_failure(} {\sl fail\_handler} {\tt ,} {\sl num\_proc} {\tt ,}
{\sl proc\_vector} {\tt )}
\end{tabbing}
If one of the {\sl num\_proc} processor-memory nodes contained in
{\sl proc\_vector} fails, i.e.\ is no longer available to the process that
has called {\tt on\_failure},
the following happens:
\begin{itemize}
\item The process that called {\tt on\_failure} is interrupted.
\item The current process context is saved.
\item {\sl fail\_handler} is invoked. 
\item The process resumes execution in
the pre-failure context if {\sl fail\_handler} returns normally. 
\item Newly occurring failures are blocked during the execution of a failure
handler. They are serviced only after the currently executing failure handler
has returned.
\end{itemize}
If no failure handler has been set up for a specific processor-memory node, failures of
this node are ignored. Different failure handlers can be assigned to failures
of different nodes. If a node failure is covered by several handlers,
one handler is chosen randomly and invoked.

A failure handler routine has to be declared as follows:
\begin{tabbing}
aaaa\= {\tt near\_proc(} \=\kill
\> {\sl fail\_handler} {\tt (} {\sl node\_id} {\tt )}.
\end{tabbing}
The failed processor {\sl node\_id} is passed to {\sl fail\_handler}.

}
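The registration semantics of {\tt on\_failure} resemble those of a signal-handler registry. The following minimal Python sketch, under the assumption of a dictionary-based registry and a runtime callback {\tt node\_failed} (both hypothetical), illustrates the rules stated above: unhandled failures are ignored, and one of several applicable handlers is chosen randomly.

```python
# Hedged sketch of the on_failure registry; names are assumptions.
import random

fail_handlers = {}

def on_failure(handler, proc_vector):
    """Register a handler for failures of the given nodes."""
    for node_id in proc_vector:
        fail_handlers.setdefault(node_id, []).append(handler)

def node_failed(node_id):
    """Called by the (hypothetical) runtime when a node disappears."""
    handlers = fail_handlers.get(node_id)
    if not handlers:
        return None                    # no handler: failure is ignored
    handler = random.choice(handlers)  # several handlers: pick one randomly
    return handler(node_id)            # failed node_id is passed along

on_failure(lambda n: f"migrating work away from node {n}", [3, 7])
print(node_failed(7))    # → migrating work away from node 7
print(node_failed(99))   # → None
```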

\subsection{Characterization of Computing Environments} \label{cce}
The general model of distributed computing environments is made up of
{\em processors}\/, {\em memory hierarchies}\/, and
{\em interconnection networks\/} (cf. Figure~3).
In order to provide a {\em parameterized}\/ abstract machine model,
all components have to be parameterized.

\subsubsection{Parameterization of Processors}
{\sl Processor Speed}

The task of a processor is the manipulation of data. Processor 
performance should be characterized by the speed of the respective operations.
Since the instruction sets of modern processors vary significantly, only
operations provided in {\em portable}\/ high-level languages (like
Fortran or C) should be taken into account. The set of operations
will further be limited to {\em arithmetic} operations, as the abstract
machine model is to be used in the context of computationally intensive tasks
(scientific software, etc.).
Nevertheless, the following considerations can be easily transferred to other
kinds of operations (integer or character operations, etc.).

First of all, the arithmetic operators {\tt +}, {\tt -}, 
{\tt *}, {\tt /}, {\tt **} have to be considered. In all scientific 
programming languages, a broad range of mathematical {\em intrinsic functions}\/
is available. The (hardware or software) implementation of these functions may
vary substantially on different computers. 
As a result, their performance {\em cannot\/} be accurately derived
from the speed of the arithmetic operations. If the evaluation of intrinsic
functions takes a significant part of the overall execution time, accurate
performance data on the intrinsic functions should be available.

A certain degree of low-level parallelism is available in many of today's
microprocessors. For instance, arithmetic operation units can be
{\em pipelined}\/,
and/or {\em independent} (concurrently operating) arithmetic operation units
can be available. Unfortunately, the execution time of a {\em sequence}\/ of
operations can {\em not}\/ be inferred from the execution times of 
single operations. 

Hockney and Jesshope~\cite{hockney88} have suggested a model for the execution
time of vector instructions that depends on the length $n$ of the vectors
involved:
\begin{equation}\label{hoc}
t = \frac{n + n_{1/2}}{r_{\infty }}.
\end{equation}
$r_{\infty }$ denotes the {\em maximum (asymptotic)
performance}\/, i.e.\ the Mflop/s rate for infinite vectors.
$n_{1/2}$ -- the {\em half-performance length} -- is the vector length
for which half of the maximum performance is achieved.
This linear model is not just a first-order fit to experimental data, but can
be derived from hardware properties (Hennessy, Patterson~\cite{hennessy90}).
The model~(\ref{hoc}), though simple, is sufficiently accurate for
many applications. It can be used, for instance, to describe the performance
of {\em (pipelined) scalar processors}\/.
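The linear timing model above is easy to evaluate directly. The following sketch (with illustrative parameter values) also checks its defining property: at vector length $n = n_{1/2}$ exactly half of the asymptotic performance $r_{\infty}$ is reached.

```python
# The Hockney/Jesshope linear timing model; parameter values below
# are illustrative assumptions, not measurements.

def hockney_time(n, r_inf, n_half):
    """Time to execute a vector operation of length n."""
    return (n + n_half) / r_inf

def rate(n, r_inf, n_half):
    """Achieved performance (operations per unit time) at length n."""
    return n / hockney_time(n, r_inf, n_half)

# At n = n_half, half of the asymptotic rate r_inf is achieved:
print(rate(100, r_inf=50.0, n_half=100))   # → 25.0
```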

Due to the existence of concurrently operating, pipelined arithmetic
units in modern processors, execution times for compound vector operations
cannot be inferred from single vector operations without detailed knowledge
of the processor architecture. Therefore, $(r_{\infty },n_{1/2})$
models should also be provided for commonly used compound vector operations,
like the BLAS-1 subroutines and intrinsic vector functions.

In most applications only a small subset of the overall processor performance
characteristics is relevant to a performance prediction.
Linear algebra routines, for instance, predominantly use {\em rational}\/
operations.
Simple generic enquiry functions might provide the framework for
acquiring all necessary information. 

{\footnotesize
{\bf Example:}
Two generic enquiry functions providing processor performance parameters
might be defined as follows:
\begin{tabbing}
aaaa\= $n_{1/2}$ = {\tt n\_half(} \=\kill
\> $n_{1/2}$ = {\tt n\_half(} {\sl node\_id} {\tt ,}
   {\sl operation\_id} {\tt )}\\[1mm]
\> $r_{\infty }$ = {\tt r\_infty(} {\sl node\_id} {\tt ,}
   {\sl operation\_id} {\tt )}
\end{tabbing}
The variable {\sl node\_id} specifies the processor and the character
string {\sl operation\_id} the (possibly compound) operation
to be evaluated. The string {\tt '+'}, for instance, denotes a vector addition,
{\tt 'SIN'} the intrinsic sine function, and {\tt 'SAXPY'} the respective
BLAS-1 subroutine.

}
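Since the model is linear in $n$, the two parameters returned by such enquiry functions can in principle be obtained from measured timings by a least-squares line fit: $t(n) = an + c$ with $a = 1/r_{\infty}$ and $c = n_{1/2}/r_{\infty}$. The sketch below uses synthetic "measurements" generated from assumed values; it is an illustration of the fitting step, not a benchmark.

```python
# Recover (r_inf, n_half) from timing samples via a least-squares
# line fit; the timing data below is synthetic (assumed values).

def fit_r_inf_n_half(ns, ts):
    m = len(ns)
    mean_n = sum(ns) / m
    mean_t = sum(ts) / m
    a = sum((n - mean_n) * (t - mean_t) for n, t in zip(ns, ts)) \
        / sum((n - mean_n) ** 2 for n in ns)   # slope  a = 1/r_inf
    c = mean_t - a * mean_n                     # offset c = n_half/r_inf
    return 1.0 / a, c / a

ns = [10, 50, 100, 500, 1000]
ts = [(n + 80) / 40.0 for n in ns]   # assumed r_inf = 40, n_half = 80
r_inf, n_half = fit_r_inf_n_half(ns, ts)
print(round(r_inf, 6), round(n_half, 6))   # → 40.0 80.0
```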

Different hardware may be used for scalar and vector operations. Furthermore,
vectorizing loops incurs some overhead. For short vector lengths,
scalar mode is faster than vector mode. The parameter
$n_{v}$ (Hennessy, Patterson~\cite{hennessy90}) gives the vector length
needed to make vector mode faster than scalar mode. For scalar processors
without additional vector units, $n_{v} = 1$; while for scalar processors
with additional vector units or vector processors, $n_{v} > 1$ holds.
$n_{v}$ can also be used to derive scalar performance characteristics. 

{\footnotesize
{\bf Example:} The enquiry function {\tt n\_v} returns the parameter $n_{v}$.
\begin{tabbing} 
aaaa\= $n_{v}$ = {\tt n\_v(} \=\kill
\> $n_{v}$ = {\tt n\_v(} {\sl node\_id} {\tt ,}
   {\sl operation\_id} {\tt )}
\end{tabbing}
For example, given the values $r_{\infty }$, $n_{1/2}$, $n_{v}$ for
multiplication, scalar performance for multiplications is obtained by
\[ r_{\infty }^{scalar} = \frac{n_{v} \cdot r_{\infty}}{n_{v} + n_{1/2}}. \]

}

Performance parameters should characterize the processor performance
for situations where data is stored in the highest level of the memory
hierarchy. 
Performance degradation due to delays caused by accessing data in
lower memory levels can be modeled by using the memory hierarchy
parameters from the following section.
 
~\\
{\sl Available Processor Time}

In a time sharing environment, processors are shared between different
processes. The number of CPU cycles available to one of those processes
can be a highly dynamic entity. An enquiry function should
be provided to yield the percentage of CPU time currently available to the
process making the enquiry.
An additional enquiry function could
indicate cases where a processor is private to an executing process.

{\footnotesize
{\bf Example:} Two enquiry functions are as follows:
\begin{tabbing}
aaaa\= \kill
\> {\sl current\_percentage} = {\tt free\_cpu(} {\sl node\_id} {\tt )}\\[1mm]
\> {\sl current\_mode} = {\tt cpu\_sharing(} {\sl node\_id} {\tt )}
\end{tabbing}
The function {\tt free\_cpu} returns the {\em percentage}\/
of cycles of the processor {\sl node\_id}
currently available. The enquiry function {\tt cpu\_sharing} indicates the mode
in which the processor {\sl node\_id} runs (time sharing, etc.).

}

~\\
{\sl Initialization Delay}

The time required to initialize a process on a processor
depends on the processor and the workload of the whole system.

{\footnotesize
{\bf Example:} The enquiry function {\tt init\_delay} returns the time
{\sl current\_init\_delay} {\em currently}\/ required to start a new process
on the processor {\sl node\_id}.
\begin{tabbing}
aaaa\= \kill
\> {\sl current\_init\_delay} = {\tt init\_delay(} {\sl node\_id} {\tt )}
\end{tabbing}

}

~\\
{\sl Synchronization Mode}

On SIMD machines, processor execution is controlled and synchronized by the
instruction stream issued by a control unit. Thus, processors should be able
to enquire whether or not they can act as control units of SIMD machines.

{\footnotesize
{\bf Example:} The enquiry procedure {\tt sync\_control} returns the vector
{\sl proc\_vector} of length {\sl num\_proc} that contains the numbers of those
processors that
can be controlled in a synchronized manner by the processor that calls
{\tt sync\_control}. 
\begin{tabbing}
aaaa\= {\tt near\_proc(} \=\kill
\> {\tt call sync\_control(} {\sl num\_proc} {\tt ,} {\sl proc\_vector} {\tt )}
\end{tabbing}
If the calling processor is not a control unit of a SIMD machine, the length
{\sl num\_proc} of the return vector is 0.

}

\subsubsection{Parameterization of Memory Hierarchies}
A memory hierarchy is usually transparent to the
programmer. The question arises, what general assumptions can be made
about the mapping of data onto the memory hierarchy.

Memory hierarchies are organized into several levels -- each
smaller and faster than the level below.
The levels of the hierarchy usually subset one another; all the data in one level
is also
found in the level below. When accessed, data is stored in the highest level.
Data is kept in a level of the memory hierarchy
as long as the level's capacity is not exceeded\footnote{Caches that are not
fully associative are an exception to this rule.}. Other memory management
strategies,
concerning, for instance, block replacement and memory writes, vary
significantly (Hennessy, Patterson~\cite{hennessy90}).

Numerous parameters concerning memory hierarchy may influence performance.
Important properties of caches are characterized by physical parameters --
like
cache size, line size, set associativity (So, Zecca~\cite{so88}) -- as well as
cache management strategies -- like the write-through vs. write-back
(Vaughan-Nichols~\cite{vaughan91}) or write-allocate vs. no-write-allocate
(Hennessy, Patterson~\cite{hennessy90}) options. 

Parameters made available to the programmer must fulfill
the following conditions:
\begin{itemize}
\item The parameter must enable the programmer to write more efficient
programs.
\item The parameter must enable the programmer to estimate the
performance of a processor-memory node more realistically (by including memory
delays).
\end{itemize}

According to these principles, the levels of a memory hierarchy could
be characterized by the following parameters:
\begin{description}
\item [{\em Size}\/:]
An algorithm can often choose data reference locality according to known
storage capacities (see the example on page~\pageref{cac}), which results
in lower miss rates for the faster memory levels.

The memory requirements
of a program may sometimes exceed the capacity of the memory hierarchy.
In such a case the program either cannot be run in that environment or
has to be reorganized.

{\footnotesize
{\bf Example:} The enquiry function {\tt size} returns the size of the memory
level {\sl level} in the memory hierarchy {\sl node\_id} (in units of bytes). 
\begin{tabbing}
aaaa\= {\tt size(} \=\kill
\> {\sl memory\_size} = {\tt size(} {\sl node\_id} {\tt ,} {\sl level}
{\tt )}
\end{tabbing}
The variable {\sl level} is of type {\tt integer}. ${\sl level} = 1$ designates
the fastest level, ${\sl level} = 2$ designates the second fastest level and
so forth.

}

\item [{\em Number of Banks, Interleaving Factor}\/:]
To increase bandwidth, memory is often divided into a relatively small number of
independent banks, each of which may service simultaneously a different
memory request. Performance degradation occurs in the case of {\em memory-bank
conflicts\/}, which result from requests to a memory bank that is still
busy servicing a previous request (Calahan, Bailey~\cite{calahan88}).
Memory banks can be interleaved at different levels (byte, word, double-word,
etc.). By taking the number of banks and the interleaving level into account,
an algorithm may restructure its array reference patterns to avoid,
as far as possible, memory-bank conflicts.

{\footnotesize
{\bf Example:} The enquiry function {\tt units} returns the number of banks
in the memory level {\sl level} in the memory hierarchy {\sl node\_id}.
The corresponding interleaving level is returned by the enquiry function
{\tt interleave\_level}.
  
\begin{tabbing}
aaaa\= {\tt size(} \=\kill
\> {\sl bank\_number} = {\tt units(} {\sl node\_id} {\tt ,} {\sl level}
{\tt )}\\[1mm]
\> {\sl interleave\_factor} = {\tt interleave\_level(} {\sl node\_id} {\tt ,}
{\sl level} {\tt )}
\end{tabbing}
The interleaving level is given in units of bytes. For instance,
${\sl interleave\_factor} = 4$ means interleaving takes place at the word level.

The following example shows how interleaving information can be used to analyze
the memory access patterns of a program.  
Let us assume that
\begin{itemize}
\item real numbers are stored as words, and
\item the main memory is interleaved at the word level and is split into
$b$ banks.
\end{itemize}
Successive elements of any column of a real $n \times n$ matrix $A$ can be
accessed at maximum speed. In contrast, accessing successive elements of a
row
might cause memory conflicts if $n$ is {\em not}\/ relatively prime to
$b$. Therefore, if the matrix $A$ is frequently accessed by rows, it should
be stored as an $m \times m$ matrix, $m \geq n$, such that $m$ and $b$ are
relatively prime.
 
}

\item [{\em Access and Transfer Time}\/:]
Performance degradation due to {\em memory access delays\/} can be modeled by
assuming
that the access to a vector of length $n$ obeys the timing relation
\begin{equation}\label{coh}
t = \frac{n + n_{1/2}^{m}}{r_{\infty }^{m}}
\end{equation}
(Hockney, Jesshope~\cite{hockney88}). By taking into account
\begin{enumerate}
\item the processor parameters $r_{\infty }$, $n_{1/2}$;
\item the memory parameters $r_{\infty }^{m}$, $n_{1/2}^{m}$; as well as
\item the algorithm parameter $f$ (the {\em computational intensity}\/,
i.e.\ the number of floating-point operations per memory reference),
\end{enumerate}
the processor performance including memory delays can be
estimated (Hockney, Jesshope~\cite{hockney88}).
The parameters $r_{\infty }^{m}$, $n_{1/2}^{m}$ of the
linear model (\ref{coh})
reflect the {\em access time\/} as well as the {\em transfer rate}\/ of memory
accesses.
They vary tremendously
for different levels of the memory hierarchy. Consequently, 
the parameters $r_{\infty }^{m}$, $n_{1/2}^{m}$ have to be given
individually for every level of a memory hierarchy.

{\footnotesize
{\bf Example:} The enquiry functions {\tt n\_half\_m} and {\tt r\_infty\_m}
return the parameters $n_{1/2}^{m}$ and $r_{\infty }^{m}$ for the memory
level {\sl level} in the memory hierarchy {\sl node\_id}.
\begin{tabbing}
aaaa\= $n_{1/2}$ = {\tt n\_half(} \=\kill
\> $n_{1/2}^{m}$ = {\tt n\_half\_m(} {\sl node\_id} {\tt ,} {\sl level}
{\tt )}\\[1mm]
\> $r_{\infty }^{m}$ = {\tt r\_infty\_m(} {\sl node\_id} {\tt ,} {\sl level}
{\tt )}
\end{tabbing}

}

A memory level may consist of independent banks; memory hierarchies can 
either be directly linked to a processor or connected to an interconnection
network. Thus, the parameters 
$r_{\infty }^{m}$, $n_{1/2}^{m}$ have to be interpreted carefully:
\begin{itemize}
\item For an interleaved memory, the parameters $r_{\infty }^{m}$,
$n_{1/2}^{m}$
are given for a whole memory level (and {\em not}\/ for a single
memory bank). When accessing contiguous memory locations, no memory conflicts
occur. Thus, the parameters reflect {\em conflict free}\/ memory performance.
Degradation due to memory-bank conflicts cannot be modeled using these
parameters; it can, however, be avoided by using the interleaving information.
\item For memory hierarchies directly linked to a processor, the parameters
$r_{\infty }^{m}$,  $n_{1/2}^{m}$
describe how fast data can be transferred to the highest (fastest) memory level.
The parameters not only characterize memory performance, but also
transmission time and memory management overhead.
\item For memory hierarchies linked to processors via an interconnection
network (shared memories), memory access delays may occur due to the 
interference of 
several processors. In the case of interleaved memory, performance
models for memory access delays require the knowledge of the
{\em bank busy time\/}, i.e. the minimum time between two requests to a memory
bank. The bank busy time can also be used to estimate performance degradation
in cases where a processor does not access the memory contiguously.

{\footnotesize
{\bf Example:} Let us suppose that a memory bank is busy for four clock
periods. A processor accessing every fourth bank on successive clock cycles
does not suffer any performance degradation. In contrast, accessing every
eighth bank incurs a delay of two cycles. Thus, only half the maximum
access rate can be attained.

The enquiry function {\tt unit\_busy} returns the bank busy time
for the memory level {\sl level} in the memory hierarchy {\sl node\_id}.
\begin{tabbing}
aaaa\= $n_{1/2}$ = {\tt n\_half(} \=\kill
\> {\sl busy\_delay} = {\tt unit\_busy(} {\sl node\_id} {\tt ,} {\sl level}
{\tt )}
\end{tabbing}
If a level of the memory hierarchy is {\em not\/} interleaved, {\tt unit\_busy}
returns the busy time for a single bank.

}

\end{itemize}

\item [{\em Cache Associativity -- Sets}\/:]
For {\em not\/} fully associative caches, the {\em effective\/} cache size can be
significantly
reduced if memory references are {\em not evenly\/} spread over the existing
sets (Hennessy, Patterson~\cite{hennessy90}). The set associativity 
problem resembles the memory bank problem in the following ways:
\begin{itemize}
\item The respective memory level is subdivided into several equal units (banks
or sets).
\item The unit in which a memory address is stored is computed by
\[ unit~~=~~block~number~~{\rm mod}~~number~of~units\] 
where 
\[ block~number~~=~~\left\lfloor \frac{address}{interleaving~level} \right\rfloor \]
holds. (For cache memories, the interleaving level is equal to the cache line
size.) 
\item The memory level is efficiently used if the memory accesses are evenly
spread over the different units.
\end{itemize}

{\footnotesize
{\bf Example:} Due to the similarity between the set associativity and the memory
bank problem, the enquiry functions {\tt units} and {\tt interleave\_factor} can
be reused for set associativity. Whether or not the enquiry function {\tt units}
returns the number of banks or the number of sets is {\em irrelevant\/} to the
algorithm. 

}

\item [{\em Vector Registers}\/:]
Vector registers form the highest level in the memory hierarchy of a
register-to-register vector computer.
The number and the length of the vector registers are crucial for
performance considerations. This information can be
obtained from enquiry functions already defined.

{\footnotesize
{\bf Example:} For the vector register level, the enquiry function {\tt size}
returns the {\em aggregate size\/} of all vector registers. The enquiry function
{\tt units} yields the number of vector registers. Subsequently, the length of
the vector registers can easily be derived. {\tt interleave\_factor} returns
$0$, which
indicates that the addresses stored in the vector registers do not obey any
constant relation. In this way vector registers can be distinguished from
banked memory and from cache.

}
 
\item [{\em Shared/Private Indicator}\/:]
In a time sharing environment some memory levels (main memory, 
disc space) can be shared spatially between different processes\,\footnote{Caches are exclusively
available to the executing process.}.
Memory sharing
reduces the size of memory available to a process. The size of
available memory is then not a static but a {\em dynamic}\/ entity.

{\footnotesize
{\bf Example:} The previously defined enquiry function {\tt size} should
be implemented to return the {\em currently\/} available memory size on a
certain level of the memory hierarchy.

}

An additional enquiry function could indicate
whether this size depends on time or is a constant (which indicates that a
memory level is private to an executing process).

{\footnotesize
{\bf Example:} The enquiry function {\tt memory\_sharing} indicates the mode
(exclusive or shared) in which the memory level {\sl level} of the memory
hierarchy {\sl node\_id} is used.
\begin{tabbing}
aaaa\= \kill
\> {\sl memory\_mode} = {\tt memory\_sharing(} {\sl node\_id} {\tt ,}
{\sl level} {\tt )}
\end{tabbing}

}
\end{description}

\subsubsection{Parameterization of Interconnection Networks}
Several possible parameterizations of interconnection networks have been
discussed by Augustyn, Krommer and Ueberhuber~\cite{augustyn91}. The
conclusion of this study was that information about network topology is
unwieldy and difficult to exploit in a concise manner. Therefore, in
Subsection~\ref{agv} it has been assumed that the interconnection network
provides a virtually fully connected system. Due to the underlying physical
topology, communication delays between two nodes can vary significantly.
In many applications it is important to perform communication {\em locally}\/,
i.e. communication should primarily take place between processors with a fast
mutual message path. 

{\footnotesize
{\bf Example:}
As pointed out in Subsection~\ref{reqinf}, an efficient dynamic load balancing
algorithm is crucial to the performance of parallel quadrature programs.
A subclass of dynamic load balancing algorithms receiving considerable attention
is based on nearest neighbor task migration (Ahmad et\,al.~\cite{ahmad91},
Schmid, Krommer, Ueberhuber~\cite{schmid92}). To perform nearest neighbor
task migration, a dynamic load balancing algorithm must be able to enquire
about the nearest neighbors of a processor-memory node.

}

{\footnotesize
{\bf Example:} An enquiry procedure that enables local communication could be
defined as follows:
\begin{tabbing}
aaaa\= {\tt near\_proc(} \=\kill
\> {\tt call near\_proc(} {\sl message\_size} {\tt ,} {\sl delay\_limit} {\tt ,}
{\sl num\_proc} {\tt ,} {\sl proc\_vector} {\tt )}
\end{tabbing}
The input variables of {\tt near\_proc} are
{\sl message\_size} and {\sl delay\_limit}; the output variables are
{\sl num\_proc}
and {\sl proc\_vector}. {\sl proc\_vector} contains a list of those
{\sl num\_proc} processors to which a message of length {\sl message\_size} can
be transmitted with a communication delay of less than {\sl delay\_limit}.
The processors in {\sl proc\_vector} are ordered with respect to increasing
communication delay, i.e. the processor with the least communication delay
(the {\em nearest\/} neighbor) comes
first. The processor {\em calling\/} {\tt near\_proc} is {\em not}\/ included
in the list. 

As communication delays depend on the network's current traffic, 
the information provided by {\tt near\_proc} is the
system's {\em estimate\/} of the {\em current\/} situation.

}
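The semantics of {\tt near\_proc} can be sketched as follows. The delay table is a hypothetical stand-in for the system's run-time estimates, and the interface (returning the outputs rather than filling output arguments) is adapted to the sketch language:

```python
def near_proc(message_size, delay_limit, delays, caller):
    """Return (num_proc, proc_vector): the processors to which a message
    of length message_size can be sent with delay below delay_limit,
    ordered by increasing estimated delay; the caller is excluded.
    In a real system the delay estimates would depend on message_size
    and current network traffic; here they are given as a table."""
    candidates = [(d, p) for p, d in delays.items()
                  if p != caller and d < delay_limit]
    candidates.sort()                       # nearest neighbor first
    proc_vector = [p for _, p in candidates]
    return len(proc_vector), proc_vector

# Hypothetical per-processor delay estimates (e.g. in microseconds):
delays = {0: 0.0, 1: 5.0, 2: 3.0, 3: 9.0}
num_proc, proc_vector = near_proc(1024, 8.0, delays, caller=0)
print(num_proc, proc_vector)   # 2 [2, 1]
```

Processor 3 is excluded because its estimated delay exceeds the limit; processor 0 is excluded because it is the caller.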

Some interconnection topologies are especially suitable for certain
communication patterns. Bus systems, for instance, lend themselves to
broadcast operations; ring topologies are most appropriate for shift operations.
Al\-though the interconnection network is {\em virtually}\/ fully connected, the
program
designer should be able to take advantage of known properties of the
{\em physical}\/ network.

For the interaction between the algorithm and the interconnection
network, a {\em three layer model}\/ seems to be appropriate:
\begin{itemize}
\item {\em Level 1: Mapping Virtual Topologies}
\vspace{3mm}\\
Many algorithms exhibit regular communication patterns. A virtual communication
topology can be associated with a communication pattern in the following
sense: Any two communicating processors are linked directly; and, if two
processors do not communicate, there is no communication link connecting them.
In other words, the associated topology is the {\em minimal\/} topology (minimum
number of links) that can perform the communication pattern without routing. 
Whether or not 
a communication pattern is suitable for a physical topology depends on
how well the associated topology can be {\em embedded\/} into the physical 
topology.

{\footnotesize
{\bf Example:} \label{sor}
Saltz et\,al.~\cite{saltz87} have discussed the implementation of a red-black
SOR method on a hypercube system. The domain is decomposed into $p$ subregions
($p$ is the number of processors) 
and each region is assigned to a processor. Depending on whether the subregions
are chosen as stripes or as rectangles, the resulting communication patterns
are associated either with a linear array or with a two dimensional mesh
topology. 
Both the linear
array and the two dimensional mesh can be mapped onto a hypercube so that
communicating processors are directly connected (Ber\-tse\-kas,
Tsi\-tsi\-klis~\cite{bertsekas89}). Thus, both data distributions
are suitable for the hypercube topology.

}

The mapping of a virtual topology onto a physical topology should be performed
by the system. Information about the physical topology of the interconnection
network should
be hidden from the program designer.

{\footnotesize
{\bf Example:} The subroutine {\tt map\_topology} requests that the system
map the
virtual topology {\sl virtual\_topology} onto those physical processor-memory
nodes whose numbers are contained in the {\sl processor\_vector} of
length {\sl num\_proc}. {\tt map\_topology} returns the data object
{\sl map} that describes the mapping suggested by the system.
\begin{tabbing}
aaaa\= \kill
\> {\tt map\_topology} ({\sl virtual\_topology} {\tt ,} {\sl num\_proc} {\tt ,}
{\sl processor\_vector} {\tt ,} {\sl map} {\tt )}
\end{tabbing}
The data object {\sl virtual\_topology} contains the following items:
\begin{itemize}
\item a string determining the type of the virtual topology
({\tt 'RING'} denotes a ring;
 {\tt 'MESH'} denotes a mesh topology;
 {\tt 'MY\_TOPOLOGY'} may denote a user defined, algorithm specific virtual
 topology\,\footnote{For example, a graph can be used to define a virtual
topology (Ber\-tse\-kas, Tsi\-tsi\-klis~\cite{bertsekas89}).}, etc.),
\item an array of integer parameters determining topology dimensions 
(for a ring topology, ({\tt nr}) denotes a ring with {\tt nr} nodes;
for a mesh topology, ({\tt nx},{\tt ny}) denotes an {\tt nx} $\times$ {\tt ny}
mesh, etc.).
\end{itemize}
The enquiry function {\tt mapping} returns the {\sl physical\_node} onto which
a {\sl virtual\_node} is mapped according to {\sl map}.
\begin{tabbing}
aaaa\= \kill
\> {\sl physical\_node} = {\tt mapping(} {\sl map} {\tt ,} {\sl virtual\_node}
{\tt )}
\end{tabbing}
The data object {\sl virtual\_node} is an integer vector that specifies a node
of the virtual topology. For instance, ({\tt IX,IY}) designates a node in
a virtual mesh topology.

}
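A concrete instance of such a system-side mapping is the classical reflected Gray code embedding of a ring into a hypercube (cf. Bertsekas, Tsitsiklis~\cite{bertsekas89}): consecutive ring nodes are mapped onto hypercube nodes whose numbers differ in exactly one bit, i.e. onto directly connected processors. The function names below are illustrative, not part of the proposal:

```python
def map_ring_onto_hypercube(nr):
    """Return a map: virtual ring node -> physical hypercube node,
    using the reflected Gray code g(i) = i XOR (i >> 1)."""
    assert nr & (nr - 1) == 0 and nr > 0, "ring size must be a power of two"
    return {i: i ^ (i >> 1) for i in range(nr)}

def mapping(map_obj, virtual_node):
    # Counterpart of the enquiry function `mapping` in the text.
    return map_obj[virtual_node]

m = map_ring_onto_hypercube(8)
# Every pair of ring neighbors lands on hypercube neighbors: the
# physical node numbers differ in exactly one bit.
for i in range(8):
    diff = mapping(m, i) ^ mapping(m, (i + 1) % 8)
    assert diff != 0 and diff & (diff - 1) == 0
print([mapping(m, i) for i in range(8)])   # [0, 1, 3, 2, 6, 7, 5, 4]
```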

\item {\em Level 2: Defining Aggregate Communication Operations}
\vspace{3mm}\\
Some aggregate communication operations, like broadcasts,
have a topology independent meaning. Other aggregate communication operations
make sense only in connection with specific virtual topologies.

{\footnotesize
{\bf Example:}
Shift communication operations, for instance, are meaningful only in
$n$-dimensional mesh or torus topologies.

}

Aggregate communication operations with a topology independent meaning
can be specified by aggregate physical communication operations.

{\footnotesize
{\bf Example:} Aggregate physical communication operations are referenced by
data objects containing communication information, as described in Table~1.
Each aggregate communication operation consists of single node broadcast
operations, where a message is sent from $node_{j}$
to $node_{j1}$, \dots, $node_{jn_{j}}$. 
Note that $j \neq k$
does not imply $node_{j} \neq node_{k}$, i.e. a node may appear
several times as a sending node. As pointed out by Bertsekas and
Tsitsiklis~\cite{bertsekas89}, all other aggregate communication patterns can
be built up by
using single node broadcasts, or they are dual to such communication patterns 
(when sender and receivers are exchanged).
Point-to-point message passing is included in single node broadcast as a
special case
($n_{j} = 1$). Thus, all communication patterns can be represented in the
form shown in Table~1.

}
\begin{table}
\begin{center}
\begin{tabular}{|c|c|}
\hline
Sending Node & Receiving Nodes \\
\hline
$node_{1}$ & $node_{1i}, i=1(1)n_{1}$ \\
$node_{2}$ & $node_{2i}, i=1(1)n_{2}$ \\
. & . \\
. & . \\
. & . \\
$node_{k}$ & $node_{ki}, i=1(1)n_{k}$ \\
\hline
\end{tabular}
\end{center}
\begin{center}
\vspace{0 pt}
{\bf Table 1:} Aggregate physical communication operation
\end{center}
\end{table}
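The Table~1 representation can be sketched as a plain data structure: an aggregate physical communication operation is a list of single node broadcasts, each a pair of a sending node and its receivers. The node numbers below are illustrative:

```python
# An aggregate physical communication operation in the style of Table 1.
# A node may appear several times as a sending node; a point-to-point
# message is the special case of a single receiver (n_j = 1).
aggregate_op = [
    (0, [1, 2, 3]),   # single node broadcast from node 0
    (1, [2]),         # point-to-point message (n_j = 1)
    (0, [4]),         # node 0 appears again as a sending node
]

def messages(op):
    """Expand the operation into individual (sender, receiver) pairs."""
    return [(s, r) for s, receivers in op for r in receivers]

print(messages(aggregate_op))
# [(0, 1), (0, 2), (0, 3), (1, 2), (0, 4)]
```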

Aggregate communication operations with a topology dependent meaning
can be specified by aggregate virtual communication operations.

{\footnotesize
{\bf Example:} The function {\tt virtual\_aggregate} takes as input
a virtual topology {\sl virtual\_topology}, a mapping of the
virtual topology onto the physical topology {\sl map}, and an 
aggregate virtual communication operation
{\sl aggregate\_virtual\_communication}.
{\tt virtual\_aggregate} transforms the aggregate virtual communication
operation into an aggregate {\em physical\/} communication operation and returns
{\sl aggregate\_var}, an identifier
for the aggregate physical communication operation.
\begin{tabbing}
aaaa\= {\sl aggregate\_var} = {\tt virtual\_aggregate(} \=\kill
\> {\sl aggregate\_var} = {\tt virtual\_aggregate(} {\sl virtual\_topology}
{\tt ,} {\sl map} {\tt ,}\\
\>\> {\sl aggregate\_virtual\_communication} {\tt )}
\end{tabbing}
The mapping {\sl map} can be the result of a previous
call of {\tt map\_topology}. But {\sl map} may also be a user provided
mapping. Such mappings make the
specification of aggregate virtual communication operations possible when
virtual topologies are not mapped optimally. Virtual topologies that are not
optimally mapped often arise when data distributions are
determined by previous algorithmic phases associated with other virtual
topologies. 

The data object {\sl aggregate\_virtual\_communication} contains the following
items:
\begin{itemize}
\item a string determining the type of the aggregate virtual communication
operation ({\tt 'SHIFT'}, for instance, denotes a shift operation, etc.),
\item an array of integer parameters accurately specifying the aggregate
virtual communication operation (for a shift operation, for instance,
({\tt idimm1},{\tt ishift1},\dots,
{\tt idimmn},{\tt ishiftn}) denotes a shift in dimension {\tt idimm1} with
stepsize {\tt ishift1}, \dots, in dimension {\tt idimmn} with
stepsize {\tt ishiftn}, etc.).
\end{itemize}

}
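For the simplest case, a shift on a virtual ring, the transformation performed by {\tt virtual\_aggregate} can be sketched directly: each virtual node sends to the node {\tt ishift} positions ahead, and the map translates virtual into physical node numbers. The function name and the mapping values are illustrative:

```python
def virtual_aggregate_shift(nr, ishift, vmap):
    """Transform a 'SHIFT' on a virtual ring of nr nodes into an
    aggregate physical communication operation, represented as
    (sender, [receivers]) pairs in the style of Table 1.
    vmap: virtual node number -> physical node number."""
    return [(vmap[i], [vmap[(i + ishift) % nr]]) for i in range(nr)]

vmap = {0: 10, 1: 11, 2: 12, 3: 13}      # hypothetical mapping
op = virtual_aggregate_shift(4, 1, vmap)
print(op)   # [(10, [11]), (11, [12]), (12, [13]), (13, [10])]
```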

\item {\em Level 3: Delays of Aggregate Communication Operations}
\vspace{3mm}\\
{\footnotesize
{\bf Example:}
The enquiry function {\tt delay} returns the delay of an
aggregate communication operation.
\begin{tabbing}
aaaa\= {\sl delay\_vector} = {\tt delay(} \=\kill
\> {\sl delay\_vector} = {\tt delay(} {\sl aggregate\_var} {\tt ,}
{\sl size\_vector} {\tt )}
\end{tabbing}
The variable {\sl aggregate\_var} identifies a physical
aggregate communication operation, as is described above. {\sl size\_vector} is
a vector of length $k$: the component {\sl size\_vector(j)} gives the size
of the $j$-th message sent.
{\sl delay\_vector} is a vector of length
$k$: the component {\sl delay\_vector(j)} gives the delay of the $j$-th
constituent communication operation. 

}

The results of an enquiry function like {\tt delay} may serve various
purposes:
\begin{itemize}
\item If a
processor sends a message to itself, no actual transmission takes
place and the
corresponding delay reflects the overhead associated with the communication
routines involved. In this way the delays for asynchronous communication can be
estimated\,\footnote{It is tacitly assumed that the case where sender and
receiver of a message are identical is {\em not\/} recognized and treated
specially, i.e. none of the possible optimization measures are taken.}.
\item If a processor appears several times as a sending node, the overall
communication delay is equal to the maximum constituent communication delay.
\item If the whole aggregate communication pattern takes place in a synchronized
manner, i.e. all processors have to wait for the end of all communication
before resuming computation, the overall communication delay is
the maximum communication delay of a single processor. 
\end{itemize}
\end{itemize}

This three layer model can be applied in the following way to
decide which virtual topology is most appropriate for a given interconnection
network.
\begin{itemize}
\item {\tt map\_topology} is used for optimally mapping the possible virtual
topologies. 
\item For each virtual topology, the respective aggregate virtual communication
operations for the algorithm are specified  and
transformed into aggregate physical communication operations. 
\item The delays of the resulting aggregate physical communication operations
are enquired for the message sizes of interest. By
comparing the respective delays, the most suitable virtual topology can be
determined.
\end{itemize}
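The selection procedure above can be sketched in a few lines. The delay vectors are hypothetical values of the kind {\tt delay} would return; the overall delay assumes the synchronized case discussed above:

```python
def overall_delay(delay_vector):
    # Synchronized aggregate communication: all processors wait for the
    # end of all communication, so the overall delay is the maximum
    # constituent delay.
    return max(delay_vector)

# Hypothetical enquired delays for two candidate virtual topologies:
candidates = {
    'ARRAY': [4.0, 4.2, 3.9],
    'MESH':  [2.5, 2.8, 2.6, 2.7],
}
best = min(candidates, key=lambda t: overall_delay(candidates[t]))
print(best)   # MESH
```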

{\footnotesize
{\bf Example:} In the SOR example on page~\pageref{sor}, both a linear array
and a two dimensional mesh topology are mapped. Both virtual topologies require
shifts as aggregate virtual communication patterns. Whether the array or the
mesh topology is more suitable depends on the problem size
(Saltz et\,al.~\cite{saltz87}).

}

\subsection{Characterization of Compiler and Software}
The efficiency of a computing environment in performing a given task depends not
only on the underlying hardware, but also on many kinds of system software
(compiler, libraries, etc.).

{\footnotesize
{\bf Example:}
Highly optimized BLAS routines are available on many systems. Their performance
can deviate significantly from the performance
of standard Fortran BLAS implementations. To a linear algebra
algorithm based
on BLAS routines, the particular BLAS implementation is transparent. Thus, the
algorithm's performance estimates can be severely flawed.

}

Software factors are hardly amenable to the kind of systematic quantitative
analysis that hardware performance is. We therefore suggest capturing external
software and compiler effects by measuring the run-time behavior of the code
(in particular, the CPU time and the elapsed time).

{\footnotesize
{\bf Example:}
The enquiry function {\tt cpu\_clock} returns the
CPU time consumed by the
calling process since the last {\tt init\_clock} call.
{\tt elapsed\_clock} yields the corresponding elapsed time.
\begin{tabbing}
aaaa\=\kill
\> {\tt init\_clock()}\\
\> $\cdots $\\
\> {\sl cpu\_time} = {\tt cpu\_clock()}\\
\> {\sl elapsed\_time} = {\tt elapsed\_clock()}
\end{tabbing}

}
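The clock-function pattern above can be sketched with Python's standard timers standing in for {\tt init\_clock}, {\tt cpu\_clock} and {\tt elapsed\_clock}:

```python
import time

_cpu0 = _wall0 = None

def init_clock():
    global _cpu0, _wall0
    _cpu0 = time.process_time()    # CPU time analogue
    _wall0 = time.perf_counter()   # elapsed (wall-clock) time analogue

def cpu_clock():
    # CPU time consumed by the calling process since init_clock()
    return time.process_time() - _cpu0

def elapsed_clock():
    # Elapsed time since init_clock()
    return time.perf_counter() - _wall0

init_clock()
total = sum(i * i for i in range(100000))   # some work to be measured
cpu_time = cpu_clock()
elapsed_time = elapsed_clock()
print(cpu_time >= 0.0 and elapsed_time >= 0.0)   # True
```

On a time-sharing system the two values can differ substantially, which is exactly the external effect the text proposes to measure.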

The actual performance of critical software parts
can be measured by using clock functions. This can be done at different times:
\begin{description}
\item [{\em Before a program run}\/] if the time critical parts of the program
are independent of problem data and dynamic phenomena. The measurement can be
part of the software installation procedure. Another possibility is to measure
every program run and use the accumulated timing data for algorithmic
tuning.

{\footnotesize
{\bf Example:}
For a linear algebra package, the CPU time needed by various BLAS
routines for a chosen set of input data (size of vectors and matrices)
can be measured. A general performance model of the BLAS routines 
can be derived by fitting a mathematical model to the timing data.
This performance model can be used during run-time for algorithmic decision
making.

}
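The model-fitting step can be sketched with the common two-parameter performance model $t(n) = a + b\,n$ (startup cost plus per-element cost), fitted by ordinary least squares. The timing data below is synthetic; in practice it would come from measured CPU times of BLAS calls for various vector lengths:

```python
def fit_linear(ns, ts):
    """Least-squares fit of t = a + b*n; returns (a, b)."""
    k = len(ns)
    mean_n = sum(ns) / k
    mean_t = sum(ts) / k
    b = (sum((n - mean_n) * (t - mean_t) for n, t in zip(ns, ts))
         / sum((n - mean_n) ** 2 for n in ns))
    a = mean_t - b * mean_n
    return a, b

# Synthetic timings: startup cost 2.0, per-element cost 0.5
ns = [100, 200, 400, 800]
ts = [2.0 + 0.5 * n for n in ns]
a, b = fit_linear(ns, ts)
print(a, b)   # 2.0 0.5
```

At run time the fitted model predicts the cost of a BLAS call for any problem size, which is what the algorithmic decision making needs.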

\item [{\em At the program start}\/] if the time critical parts of the program depend 
on current problem data but do not change during program execution.

{\footnotesize
{\bf Example:}
A result found in Saltz et\,al.~\cite{saltz87} is that the optimal
decomposition for a red-black SOR method on a hypercube system depends on 
the problem size, i.e. the number of grid points. One way to determine the 
appropriate decomposition for a given problem size is to perform a
small number of timesteps for both decompositions at the beginning of a program
run. By comparing the respective performance, the optimal distribution can
be found.

}

\item [{\em During the program run}\/] if the time critical parts of the program 
change during its execution.

{\footnotesize
{\bf Example:}
The time required for one evaluation of the integrand function may depend on
the respective abscissa. As quadrature abscissas are chosen dynamically
during the course of computation, the evaluation times should also be
determined dynamically.

}

\end{description}

\section{Programming Architecture Adaptive Algorithms}
As shown in Figure~9, an architecture adaptive algorithm conceptually consists
of two major parts: the {\em problem solving part}\/ and the {\em algorithm
adaption part}\/. The problem solving part contains those code sections required
for carrying out the original task of the algorithm. The algorithm adaption
part's job is to manage the problem solving activities efficiently. The
algorithm adaption part itself can be subdivided into a {\em decision making
part}\/ and an {\em acting part}\/. The decision making part gathers information
-- for instance, by calling appropriate enquiry functions. It also decides
which measures are to be taken -- according to
a logic developed by the algorithm designer. These measures are subsequently
carried out by the acting part. Note,
however, that this algorithm decomposition is only a conceptual one. It does not
imply that the different parts have to be specified or coded separately. As
shown in a related paper by Bihari and Schwan~\cite{bihari91}, a separate
specification {\em is}\/ possible and results in a more structured approach.

\begin{figure}
\begin{center}
\setlength{\unitlength}{0.0125in}%
\begin{picture}(202,134)(147,637)
\thicklines
\multiput(198,738)(0.00000,-8.00000){9}{\line( 0,-1){  4.000}}
\multiput(198,738)(8.16216,0.00000){19}{\line( 1, 0){  4.081}}
\multiput(198,704)(8.16216,0.00000){19}{\line( 1, 0){  4.081}}
\put(147,637){\framebox(202,134){}}
\put(147,670){\line( 1, 0){202}}
\put(248,750){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Algorithm Adaption}}}
\put(274,717){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Decision Making }}}
\put(274,685){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Acting}}}
\put(248,649){\makebox(0,0)[b]{\raisebox{0pt}[0pt][0pt]{\tenrm Problem Solving}}}
\end{picture}\\
\vspace{15 pt}
{\footnotesize {\bf Figure 9:} Structure of architecture adaptive algorithms}
\end{center}
\end{figure}

\section{Assessment and Outlook} \label{six}
The authors are well aware of possible objections to the {\sc aaa}
concept.

\subsection{Low-Level Approach}
Most of current efforts made to create portable, efficient parallel software
rely on
a high-level approach (functional languages, global name space languages, etc.).
From this viewpoint
the {\sc aaa} approach appears to be an anachronism. However, the {\sc aaa} 
approach is {\em not}\/ meant to be used by the average programmer, who
legitimately prefers to deal with a parallel computer system at a very high
level.
Rather, it is meant for people developing scientific software systems which are
extensively reused, like scientific
software libraries. Because of this extensive reuse, porting costs very soon
exceed even high development costs. As a matter of
fact,
the {\sc aaa} approach {\em promotes}\/ the high level approach to parallel
systems by supporting the development of high quality parallel software
packages.
Software libraries which utilize the power of current and future parallel
machines by adapting themselves to different
architectures relieve the average programmer from dealing explicitly with
machine details.

\subsection{Complexity}
The complexity of an architecture adaptive algorithm increases
rapidly with the extent and intricacy of its problem solving part and with the
complexity of the
underlying machine model. Machine models will become even more complex
in the future (by including cache hierarchies, SIMD/MIMD switchable
systems, time-sharing and space-sharing systems, etc.). The
algorithm designer will probably not be able to write programs which are
{\em optimal}\/ for all current and future parallel systems. In many cases it
is not necessary to find optimal solutions. Often reasonable solutions are
sufficient.
Relaxing optimality demands can simplify algorithm
adaption significantly. Sophisticated software development tools, like
simulators for parallel software and hardware, could be a great
help in the development process of {\sc aaa} software.

\subsection{Feasibility}
No ``proof'' has been brought forward so far that the {\sc aaa} approach is
indeed
{\em feasible}\/. Specifically, no experimental evidence, such as an
architecture adaptive parallel quadrature algorithm, has been produced yet.
The authors are well aware that this is a severe objection to
the {\sc aaa}
approach. Without such evidence the {\sc aaa} approach is bound to remain
merely theoretical. The authors' future work will concentrate
on furnishing a feasibility study for the {\sc aaa} methodology by developing
prototype software
for parallel quadrature within the {\sc aaa} framework.

\section*{Acknowledgement}
We would like to thank Roman Augustyn and Josef Fritscher for many helpful
discussions.

\begin{thebibliography}{99}
{\small
\bibitem{ahmad91}
   I.\ Ahmad, A.\ Ghafoor, K.\ Mehrotra, {\sl Performance Prediction of
   Distributed Load Balancing on Multicomputer Systems}, 
   Proceedings Supercomputing '91, IEEE Press, Los Alamitos CA, 1991,
   pp.~830--839.
   
\bibitem{almasi89}
   G.\,S.\ Almasi, A.\ Gottlieb, {\sl Highly Parallel Computing},
   Benjamin-Cum\-mings, Redwood City, 1989.

\bibitem{f90norm}
   {\sl American National Standard: Fortran\,90},
   American National Standards Institute Inc. (ANSI X3J3-1990), New\,York, 1991.

\bibitem{andrews91}
    J.\,B.\ Andrews, C.\,D.\ Polychronopoulos, {\sl An Analytical Approach to
    Performance/Cost Modeling of Parallel Computers}, Journal of Parallel and
    Distributed Computing 12 (1991),
    pp.~343--356.

\bibitem{augustyn91}
   R.\ Augustyn, A.\,R.\ Krommer, C.\,W.\ Ueberhuber, {\sl Effizient-portable
   Programmierung von Parallelrechnern}, 
   Report No. 85/91, Institute for Applied and Numerical Mathematics,
   Technical University Vienna, 1991.
      
\bibitem{bertsekas89}
   D.\,P.\ Bertsekas, J.\,N.\ Tsitsiklis, {\sl Parallel and Distributed
   Computation -- Numerical Methods}, Prentice-Hall, Englewood Cliffs, 1989.

\bibitem{bihari91}
   T.\,E.\ Bihari, K.\ Schwan, {\sl Dynamic Adaption of Real-Time Software},
   ACM Transactions on Computer Systems 9 (1991), pp.~143--174.
   
\bibitem{calahan88}
   D.\,A.\ Calahan, D.\,H.\ Bailey, {\sl Measurement and Analysis of Memory
   Conflicts on Vector Multiprocessors}, in ``Performance Evaluation of
   Supercomputers'' (J.\,L.\ Martin, Ed.), North-Holland, Amsterdam, 
   1988, pp.~83--106.
      
\bibitem{ra75}
   P.\,J.\ Davis, P.\ Rabinowitz, {\sl Methods of Numerical Integration},
   Academic Press, New York, 1975.
 
\bibitem{doncker91c}
   E.\ de\,Doncker, J.\,A.\ Kapenga, {\sl Parallel Quadrature on Loosely Coupled
   Systems},
   in ``Numerical Integration --- Recent Developments,
   Software and Applications'' (T.\,O.\ Espelid, A.\,C.\ Genz, Eds.),
   Kluwer Academic Publishers, Dordrecht, 1992, to appear.
   
\bibitem{garey79}
   M.\,R.\ Garey, D.\,S.\ Johnson, {\sl Computers and Intractability: A Guide
   to the Theory of $N\!P$-Completeness}, Freeman, San\,Francisco, 1979.
      
\bibitem{hennessy90}
   J.\,L.\ Hennessy, D.\,A.\ Patterson, {\sl Computer Architecture --
   A Quantitative Approach}, Morgan Kaufmann, San Mateo CA, 1990.
         
\bibitem{hockney88}
   R.\,W.\ Hockney, C.\,R.\ Jesshope, {\sl Parallel Computers 2},
   Adam Hilger, Bristol Philadelphia, 1988.
   
\bibitem{houstis90}
    E.\,N.\ Houstis, J.\,R.\ Rice, N.\,P.\ Chrisochoides, H.\,C.\ Karathanasis,
    P.\,N.\ Papachiou, M.\,K.\ Samartzis, E.\,A.\ Vavalis, K.\,Y.\ Wang,
    S.\ Weerawarana, {\sl // {\sc Ellpack}: A Numerical Simulation Programming
    Environment for Parallel MIMD Machines}, 1990 International Conference on
    Supercomputing, ACM Press, New\,York, 1990, pp.~96--107.

\bibitem{johnsson91}
    S.\,L.\ Johnsson, {\sl Performance Modeling of Distributed Memory
    Architectures}, Journal of Parallel and Distributed Computing 12 (1991),
    pp.~300--312.

\bibitem{krommer91b}
   A.\,R.\ Krommer, C.\,W.\ Ueberhuber, {\sl A Survey of Parallel Quadrature
   Algorithms}, Technical Report, Austrian Center for Parallel Computation,
   Vienna, 1992,
   to appear.

\bibitem{krommer92a}
   A.\,R.\ Krommer, C.\,W.\ Ueberhuber, {\sl Architecture Adaptive Algorithms},
   Technical Report ACPC/TR 92-2,  Austrian Center
   for Parallel Computation, Vienna, 1992.

\bibitem{lee88}
    I.\ Lee, D.\ Smitley, {\sl A Synthesis Algorithm for Reconfigurable
    Interconnection Networks}, IEEE Transactions on Computers 37 (1988),
    pp.~691--699.

\bibitem{naur67}
   P.\ Naur, {\sl Machine Dependent Programming in Common Languages},
   BIT 7 (1967), pp.~123--131.

\bibitem{quadpack}
   R.\ Piessens, E.\ de\,Doncker-Kapenga, C.\,W.\ Ueberhuber, D.\,H.\ Kahaner,
   {\sc Quadpack}\,--- {\sl A Subroutine Package for Automatic Integration},
   Springer-Verlag, Berlin Heidelberg New\,York Tokyo, 1983.

\bibitem{saltz87}
    J.\,H.\ Saltz, V.\,K.\ Naik, D.\,M.\ Nicol, {\sl Reduction of the Effects of
    the Communication Delays in Scientific Algorithms on Message Passing MIMD
    Architectures}, SIAM Journal on Scientific and Statistical Computing 8
    (1987), pp.~118--134.

\bibitem{schmid92}
   C.\ Schmid, A.\,R.\ Krommer, C.\,W.\ Ueberhuber, {\sl Dynamic Load Balancing
   -- An Overview}, 
   Report No. 90/92, Institute for Applied and Numerical Mathematics,
   Technical University Vienna, 1992.

\bibitem{so88}
    K.\ So, V.\ Zecca, {\sl Program Locality of Vectorized Applications Running
    on the IBM 3090 with Vector Facility}, IBM Systems Journal 27 (1988),
    pp.~436--452.

\bibitem{sunderam90}
   V.\ Sunderam, {\sl PVM: A Framework for Parallel Distributed Computing},
   Concurrency: Practice and Experience 2 (1990), pp.~315--339.
      
\bibitem{vaughan91}
   S.\,J.\ Vaughan-Nichols, {\sl Catch as Cache Can}, Byte 16--6 (1991),
   pp.~209--215.
}
\end{thebibliography}
         
\end{document}
 

From schreibr@riacs.edu  Fri Sep 18 13:21:32 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA17753); Fri, 18 Sep 92 13:21:32 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA21008); Fri, 18 Sep 92 13:21:02 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA27626; Fri, 18 Sep 92 11:20:27 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA09430; Fri, 18 Sep 92 11:20:15 PDT
Message-Id: <9209181820.AA09430@thor.riacs.edu>
Date: Fri, 18 Sep 92 11:20:15 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: Draft



Here is the draft of the intrinsics section as modified to incorporate
the changes voted at the last meeting, and with other improvements in the
presentation.


It should latex using the hpf-freestanding-chapter-header.tex macros.


Rob

---------------      CUT HERE   --------------------------
%
%intrinsics.tex

%Version of May 29, 1992 --- Guy Steele, Thinking Machines Corporation
%and David Loveman, Digital Equipment Corporation --- Robert Schreiber, RIACS (Editor)

\chapter{Intrinsic and Library Functions
\protect\footnote{Version of September 14, 1992 ---
Guy Steele, Thinking Machines Corporation, David Loveman, Digital
Equipment Corporation, and
Robert Schreiber, Research Institute for Advanced Computer Science
}}
\label{intrinsics}

This section extends Section 13 of the Fortran 90 standard.

HPF retains Fortran 90's intrinsic functions.  It also adds a number of
new intrinsics in three categories: system inquiry intrinsics, distribution
inquiry intrinsics, and computational intrinsics.

The definitions of two Fortran 90 intrinsics, MAXLOC and MINLOC,
are extended by the addition of an optional DIM argument.

In addition to the new intrinsics, HPF defines an HPF library that must be
provided by vendors of any full HPF implementation.

\section{System Inquiry Intrinsic Functions\protect\footnote{Version of
May 29, 1992 --- David Loveman, Digital Equipment Corporation}
\protect\footnote{Approved second reading 11 September 1992.}}

In addition to the intrinsic functions of Fortran 90, High Performance
Fortran has two system inquiry  intrinsic functions:  NUMBER_OF_PROCESSORS and
PROCESSORS_SHAPE.  Their values remain constant for (at least) the
duration of one program execution.  Accordingly, NUMBER_OF_PROCESSORS
and PROCESSORS_SHAPE have values that are restricted expressions and
may be used wherever any other Fortran 90 restricted expression may be
used.  
%    If the system configuration is committed to at compile time,
%    NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE have values that are constant
%    expressions and may be used wherever any other Fortran 90 constant
%    expression may be used.  
In particular, NUMBER_OF_PROCESSORS may be
used in a specification expression.
%    and, if a constant expression, may
%    be used in an initialization expression.  
None of the categories of
intrinsic functions listed in Chapter 13 of the Fortran 90 standard
seem quite apt to describe the nature of these new intrinsic functions,
so we add a new category of ``system inquiry functions'' and place
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE in that category.

%    Note that treating these intrinsics as constant expressions does not
%    force a compiler to bind the number of processors at compile time
%    (although that is one possible implementation) -- with the right linker
%    or code-generation technology the choice could be deferred until run
%    time, possibly at some performance cost.


\subsection{Formal Definition}

[this subsection contains formal definition material, phrased as additions
or modifications to the Fortran 90 specification, with non-formal
commentary in square brackets]


\par\smallskip
13.8a System inquiry functions

In a multi-processor implementation, the processors may be arranged in
an im\-ple\-men\-ta\-tion-de\-pen\-dent n-dimensional processor array.  The system
inquiry functions return values related to this underlying machine and
processor configuration, including the size and shape of the underlying
processor array.  NUMBER_OF_PROCESSORS returns the total number of
processors available to the program or the number of processors
available to the program along a specified dimension of the processor
array.  PROCESSORS_SHAPE returns the shape of the processor array.


\par\smallskip
13.10.21 System inquiry functions

\begin{verbatim}
   NUMBER_OF_PROCESSORS(DIM)      Total number of processors in
                                    the processor array.
   PROCESSORS_SHAPE()             Shape of the processor array
\end{verbatim}


\par\smallskip
13.13.xx  NUMBER_OF_PROCESSORS(DIM)

Optional Argument.  DIM

Description.  Returns the total number of processors available to the
program or the number of processors available to the program along a
specified dimension of the processor array.

Class.  System inquiry function.

Arguments.
DIM (optional)  must be scalar and of type integer with a value in the
range \(1 \leq DIM \leq n\), where n is the rank of the processor array.

Result Type, Type Parameter, and Shape.  Default integer scalar.

Result Value.  The result has a value equal to the extent of dimension
DIM (\(1 \leq DIM \leq n\), where n is the rank of the processor array) of the
processor-dependent hardware processor array or, if DIM is absent, the
total number of elements, equal to or greater than one, of the
processor-dependent hardware processor array.

\par\smallskip\noindent
{\bf Examples:} For a DECmpp 12000 Model 8B with 8192 processors, the value
of NUMBER_OF_PROCESSORS( ) is 8192, the value of
NUMBER_OF_PROCESSORS(DIM=1) is 128, and the value of
NUMBER_OF_PROCESSORS(DIM=2) is 64.  For a single processor DECalpha
workstation, the value of NUMBER_OF_PROCESSORS( ) is 1, and the value
of NUMBER_OF_PROCESSORS(DIM=1) is 1.


\par\smallskip
13.13.yy PROCESSORS_SHAPE()

Description.  Returns the shape of the implementation-dependent processor array.

Class.  System inquiry function.

Arguments.  None.

Result Type, Type Parameter, and Shape.  The result is a default
integer array of rank one whose size is equal to the rank of the
implementation-dependent processor array.

Result Value.  The value of the result is the shape of the
implementation-dependent processor array.


\par\smallskip\noindent
{\bf Example:} For a DECmpp 12000 Model 8B with 8192 processors, the value of
PROCESSORS_SHAPE() is (/ 128, 64 /).  For a Connection Machine CM-2
with 8192 processors, the value of PROCESSORS_SHAPE() might be (/ 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 /).  For a Connection Machine CM-5 with
8192 processors, the value of PROCESSORS_SHAPE() might be (/ 8192 /). 
For a single processor DECalpha workstation, the value of
PROCESSORS_SHAPE() is (/ 1 /).



%[The list of alternatives for Fortran 90 constant expressions and
%restricted expression are expanded to include NUMBER_OF_PROCESSORS and
%PROCESSORS_SHAPE.]

%    7.1.6.1 Constant expression
%
%    A constant expression is . . . . .
%
%    (6a)  A system inquiry function reference where each argument is a
%    constant expression and the compiler has been informed of the
%    appropriate system configuration.  
%
%    . . . . .
%
7.1.6.2 Specification expression

A restricted expression is . . . . .

(9a)  A system inquiry function reference where each argument is a
restricted expression.  

. . . . .



\subsection{Discussion and Pragmatic Usage --- Consequences of the Formal Proposal}

%    The shape of the processor array may be treated as a constant known at
%    compile time if the compiler is informed about the system
%    configuration.  This will allow the values of system inquiry functions
%    to be used in initialization expressions
%    even if the system configuration is not
%    committed to at compile time.
The values of system inquiry
functions are always restricted expressions;  thus they may be used in
specification expressions.
They may not, however, occur in initialization expressions, because they
may not be assumed to be constants.   In particular, HPF programs may
be compiled to run on machines whose configurations are not known at
compile time.

Note that the system inquiry functions query the physical machine, and
have nothing to do with any PROCESSORS directive that may occur.

References to system inquiry functions may occur in HPF directives, as in:
                                                                  \CODE
!HPF$ TEMPLATE T(100, 3*NUMBER_OF_PROCESSORS())
                                                                  \EDOC

The definition of NUMBER_OF_PROCESSORS is modeled on the definition of
the SIZE intrinsic function.

The definition of PROCESSORS_SHAPE is modeled on the definition of the
SHAPE intrinsic function.

The rank of the processor array is returned by
								\CODE
SIZE(PROCESSORS_SHAPE())
								\EDOC
an expression that may occur in any specification expression.
%and that is constant whenever PROCESSORS_SHAPE() is.

%    As a result of being a constant expression, if the system configuration
%    is committed to at compile time, suitably constrained references to
%    system inquiry functions may occur in initialization expressions as,
%    for example, initialization values in type-declaration-statements or in
%    parameter-statements, as in:
%
%                                                                      \CODE
%    PARAMETER (N_PROCS=NUMBER_OF_PROCESSORS(),        &
%               NXPROCS=NUMBER_OF_PROCESSORS(DIM=1),   &
%               NYPROCS=NUMBER_OF_PROCESSORS(DIM=2))
%                                                                      \EDOC
%    
As a result of being a restricted expression, suitably constrained
references to system inquiry functions may occur in specification
expressions as, for example, lower or upper bounds of an
explicit-shape-spec of an array-spec in type-declaration-statements, as in:

                                                                 \CODE
INTEGER, DIMENSION(SIZE(PROCESSORS_SHAPE())) :: PS
PS = PROCESSORS_SHAPE()
! PS(2) = NUMBER_OF_PROCESSORS(DIM=2)
                                                                 \EDOC

\section{Computational Intrinsic Functions\protect\footnote{Version of
May 29, 1992 --- Guy Steele, Thinking Machines Corporation}
\protect\footnote{Approved second reading 11 September 1992.}}

\subsection{Extension to MINLOC and MAXLOC}

The MAXLOC and MINLOC intrinsics are redefined to have an optional DIM
argument that works exactly as does the DIM argument of MAXVAL.  
If such an argument is present, then the shape of the
result equals the shape of the first argument with one dimension
(the one indicated by the DIM argument) deleted; it is as if a
series of one-dimensional MAXLOC or MINLOC operations were performed.
The rank of the result is one less than the rank of the first argument.   
If the smallest (MINLOC) or largest
(MAXLOC) element along a given dimension is not unique, then the location of
the first one is returned.   The declared lower bounds of the input array play
no role in determining the output.   The optional MASK argument is retained
and may be used together with the DIM argument.

Note that the behavior of MAXLOC and MINLOC without the DIM argument is
quite different.   In this case, a one-dimensional integer array of size equal
to the rank of ARRAY is returned, giving the subscripts of the first element
in array element order with the smallest (MINLOC) or largest (MAXLOC) value.

Thus, if A has DIMENSION(4,3), then 
\begin{verbatim}
      SHAPE(MAXLOC(A))       has the value [ 2 ]
      SHAPE(MAXLOC(A,DIM=1)) has the value [ 3 ]
      SHAPE(MAXLOC(A,DIM=2)) has the value [ 4 ].
\end{verbatim}
\par\smallskip\noindent
{\bf Example:} If A has the value

\begin{verbatim}
        [  0  -5   8  -3  ]
        [  3   4  -1   2  ]
        [  0   4   6  -4  ]


then

        MINLOC(A)        has the value [ 1, 2 ]
        MAXLOC(A)        has the value [ 1, 3 ]
        MINLOC(A, DIM=1) has the value [ 1, 1, 2, 3 ]
        MAXLOC(A, DIM=1) has the value [ 2, 2, 1, 2 ]
        MINLOC(A, DIM=2) has the value [ 2, 3, 4 ]
        MAXLOC(A, DIM=2) has the value [ 3, 2, 3 ].
\end{verbatim}
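[The extended semantics can be cross-checked with a short executable
model.  The following Python sketch (the names minloc and minloc_dim1
are illustrative, not part of HPF) reproduces the MINLOC results of the
example above, using 1-based subscripts and Fortran column-major element
order.]

```python
def minloc(a):
    """Model of MINLOC(A) with no DIM argument: the 1-based subscripts
    of the first minimum of a list-of-rows matrix, scanned in Fortran
    array element (column-major) order."""
    nrows, ncols = len(a), len(a[0])
    best_val, best_sub = None, None
    for j in range(ncols):        # columns vary slowest...
        for i in range(nrows):    # ...rows fastest: column-major order
            if best_val is None or a[i][j] < best_val:
                best_val, best_sub = a[i][j], [i + 1, j + 1]
    return best_sub

def minloc_dim1(a):
    """Model of MINLOC(A, DIM=1): for each column, the 1-based row
    index of its first minimum."""
    return [col.index(min(col)) + 1 for col in zip(*a)]

# The 3x4 example array from the text:
A = [[0, -5,  8, -3],
     [3,  4, -1,  2],
     [0,  4,  6, -4]]
print(minloc(A))       # [1, 2]
print(minloc_dim1(A))  # [1, 1, 2, 3]
```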



\subsection{ILEN}

An elemental integer-length intrinsic.  Its action on a scalar is:

                                                                \CODE
  ILEN(X) = ceiling(log2( IF X < 0 THEN -X ELSE X+1 ))
                                                                \EDOC

ILEN(X) is one less than the minimum number of bits needed to represent
X as a 2's-complement signed integer; that is, it counts the magnitude
bits but not the sign bit.  As examples of
its use,  2**ILEN(N-1)  rounds N up to a power of 2 (for \(N > 0\)),
whereas  2**(ILEN(N)-1)  rounds N down to a power of 2.

Note that a given integer value will always produce the same result
from ILEN, independent of the number of bits in the representation of
the integer.  That is because bits are counted from the right (the
least significant bit).
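[A Python model of this definition (illustrative only; Python's
unbounded integers match the representation-independence just noted):]

```python
def ilen(x):
    """ILEN(X) = ceiling(log2(-X)) if X < 0, else ceiling(log2(X+1)).
    For a Python int this is the bit length of x (x >= 0) or of ~x
    (x < 0), since ~x == -x - 1."""
    return x.bit_length() if x >= 0 else (~x).bit_length()

def round_up_pow2(n):
    """2**ILEN(N-1): rounds N up to a power of 2, for N > 0."""
    return 2 ** ilen(n - 1)

def round_down_pow2(n):
    """2**(ILEN(N)-1): rounds N down to a power of 2, for N > 0."""
    return 2 ** (ilen(n) - 1)

print(ilen(4), ilen(-4))                     # 3 2
print(round_up_pow2(5), round_down_pow2(5))  # 8 4
```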

As an elemental, integer-valued intrinsic, ILEN may appear in a specification
expression.

\section{Computational Library Functions\protect\footnote{Version of
September 14, 1992 --- Guy Steele, Thinking Machines Corporation
Robert Schreiber, Research Institute for Advanced Computer Science,
and Rex Page, Amoco Corporation}
\protect\footnote{Approved second reading 11 September 1992.}}


This section consists of five groups of computational library functions, to
be available in the standard HPF module.   Use of these functions must be
accompanied by an appropriate USE statement in each scoping unit in which they
are used.   They are not intrinsic.   Thus, they are not allowed in specification
expressions.

\subsection{New Reduction Functions}

Just as we have the correspondences:

\begin{verbatim}
          operator/intrinsic        reduction intrinsic

                +                       SUM, COUNT
                *                       PRODUCT
                .AND.                   ALL
                .OR.                    ANY
                MAX                     MAXVAL
                MIN                     MINVAL
\end{verbatim}
\noindent
it is useful to have reduction versions of certain other operators and
intrinsics in the language that happen to be associative and
commutative.   Therefore the new functions AND, OR, EOR, and PARITY
are defined.

\begin{verbatim}
          operator/intrinsic        reduction function

                IAND                    AND
                IOR                     OR
                IEOR                    EOR
               .NEQV.                   PARITY
\end{verbatim}

Thus

\begin{verbatim}
        AND( (/ 7,3,10 /) )  yields 2
         OR( (/ 7,3,10 /) )  yields 15
        EOR( (/ 7,3,10 /) )  yields 14

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

      PARITY( (/ T,F,F,T,T,F,F,F,T,T /) )  yields .TRUE.
      PARITY( (/ T,F,F,T,T,F,F,F,T,F /) )  yields .FALSE.
\end{verbatim}

Some of these are particularly valuable when used with the corresponding
parallel-prefix functions, Section~\ref{parallel-prefix}.

The identity element for the reduction PARITY is .FALSE., for the 
reductions OR and EOR is zero, and for the 
reduction AND is -1.  COUNT does not have an identity, as it maps
logicals to integers and returns zero if there are no true values to be counted.
The identities for the other reductions are defined in the Fortran 90 standard.
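[An executable Python model of these reductions and their identity
elements (reduce-based; the function names are illustrative, not part of
HPF).  Python's unbounded 2's-complement-style integers make -1 the
all-ones AND identity here as well:]

```python
from functools import reduce
import operator

def and_reduce(xs):    # AND: bitwise-and reduction, identity -1 (all ones)
    return reduce(operator.and_, xs, -1)

def or_reduce(xs):     # OR: bitwise-or reduction, identity 0
    return reduce(operator.or_, xs, 0)

def eor_reduce(xs):    # EOR: bitwise-xor reduction, identity 0
    return reduce(operator.xor, xs, 0)

def parity(xs):        # PARITY: .NEQV. reduction, identity .FALSE.
    return reduce(operator.xor, xs, False)

print(and_reduce([7, 3, 10]), or_reduce([7, 3, 10]), eor_reduce([7, 3, 10]))
# 2 15 14
```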

\subsection{Combining-Scatter Functions}

%[Note the addition of COPY_SCATTER, by analogy with COPY_PREFIX and
%COPY_SUFFIX, to achieve what the Connection Machine calls
%send-with-overwrite.]

%Combining-Send section revised to make these functions
%rather than subroutines --- Rex Page 26Aug92

%Changed to SCATTER, removed restrictions on order of operations, and
%added optional final MASK argument
%RSS, Sept 14.

For every reduction operation XXX in the language, introduce a new
function:

                                                                \CODE
   XXX_SCATTER(SOURCE,BASE,IDX1,..., IDXn, MASK)
                                                                \EDOC

The IDX arguments are integer arrays.
The number of IDX arguments must equal the rank of BASE.  
The SOURCE and all the IDX arguments must be conformable.  
The result delivered by the function is conformable with BASE.   
The types of SOURCE and BASE must be the
same (exception: COUNT), and the result has the type of BASE.   
The allowed types are:

\begin{verbatim}
                XXX                     Allowed Types

                SUM                     Real, Complex, Integer
                COUNT                   BASE = Integer, SOURCE = Logical
                PRODUCT                 Real, Complex, Integer
                MAXVAL                  Real, Integer
                MINVAL                  Real, Integer
                AND                     Integer
                OR                      Integer
                EOR                     Integer
                ALL                     Logical
                ANY                     Logical
                PARITY                  Logical

\end{verbatim}

Since SOURCE and all the IDX arrays are conformable, for every
element s in SOURCE, there is a corresponding element in each of
the IDX arrays.
Let i1 be the value of the element of
IDX1 that is indexed by the same subscripts as element s of SOURCE.
More generally, for each j=1,2,...,n,
let ij be the value of the element of IDXj that corresponds to
element s in SOURCE, where n is the rank of BASE. 
The integers ij, j=1,...,n, form a subscript selecting an element
of BASE:  BASE(i1,i2,...,in).

Thus SOURCE and the IDX arrays establish a mapping from all the
elements of SOURCE onto selected elements of BASE.
Viewed in the other direction, this mapping associates
with each element b of BASE a set S of elements from SOURCE.

Since BASE and the result of XXX_SCATTER are conformable,
there is a corresponding element of the result for each 
element of BASE.
If S is empty, then 
the element of the result corresponding to the element  b  of BASE 
has the same value as  b.
If S is non-empty, 
then for {\em some} permutation
s1, s2, ..., sm of S,
the element of the result corresponding to the element  b  of BASE 
is the result of evaluating
                                                                \CODE
       s1 @ s2 @ ... @ sm @ b
                                                                \EDOC
\noindent
where @ denotes an infix form of operation XXX, 
with some valid parenthesization of this expression.   
Therefore, if multiple elements of SOURCE are associated with the same
base element, they will all be combined with the base element.  

Thus the order of operations is arbitrary, and may differ on two
otherwise identical runs of the same HPF program.  This matters when
the combining operation is not both associative and commutative, for
example floating-point addition.  In fact, because machine arithmetic
is not associative (not even  fixed-point, because of overflow) the
programmer must be sure that the nondeterministic order of evaluation
of the result will not produce undesirable effects.

If the optional argument MASK is present, then only the elements of
SOURCE in positions for which MASK is true participate in the operation.
All other elements of SOURCE and of the IDX arrays are ignored.

Thus the result of the expression
                                                                \CODE
      SUM_SCATTER(SOURCE,BASE,IDX1,IDX2,...,IDXn,MASK)
                                                                \EDOC
\noindent
{\em could} be computed as

                                                                \CODE
      result = BASE
      DO J1=LBOUND(SOURCE,1),UBOUND(SOURCE,1)
        DO J2=LBOUND(SOURCE,2),UBOUND(SOURCE,2)
          ...
            DO Jk=LBOUND(SOURCE,k),UBOUND(SOURCE,k)
              IF (MASK(J1,J2,...,Jk))
     &           result(IDX1(J1,J2,...,Jk),
     &                  IDX2(J1,J2,...,Jk),
     &                  ...
     &                  IDXn(J1,J2,...,Jk)) =
     &           result(IDX1(J1,J2,...,Jk),
     &                  IDX2(J1,J2,...,Jk),
     &                  ...
     &                  IDXn(J1,J2,...,Jk)) + SOURCE(J1,J2,...,Jk)
            END DO
          ...
        END DO
      END DO
                                                                \EDOC

\noindent
where k is the rank of SOURCE.  (However, this nest of DO loops makes a
greater commitment to the particular order in which the combining
operations are carried out than the order---namely, none!---guaranteed
by the XXX_SCATTER function.)

In addition, COPY_SCATTER is the combining-send
function generated by the (noncommutative) binary operator

                                                                \CODE
      COPY_operation(x,y) = x
                                                                \EDOC
\noindent
Thus an element of the result delivered by
COPY_SCATTER(SOURCE,BASE,IDX1,..., IDXn) corresponding with an element
of BASE that is associated with a non-empty set from SOURCE has
the same value as {\em some} SOURCE element from that set.
So if multiple elements of SOURCE are sent to the same result
element, some one of them will be assigned and the rest, as well as
the corresponding element of BASE, will be effectively discarded.
\par\smallskip\noindent
{\bf Example:} 

                                                                \CODE
      A = (/ 10., 20., 30., 40., -10./)
      X = (/ 1.,  2.,  3.,  4./)
      V = (/ 3,   2,   2,   1,   1/)
      X = SUM_SCATTER(A,X,V, MASK=(A > 0) )
                                                                \EDOC
\noindent
yields the result X = (/41., 52., 13., 4./).
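[The example can be checked against a rank-1 executable model.  This
Python sketch (sum_scatter is an illustrative name) implements the
definition directly; recall that HPF leaves the combining order
arbitrary, whereas this model commits to one particular left-to-right
order.]

```python
def sum_scatter(source, base, idx, mask=None):
    """Model of rank-1 SUM_SCATTER(SOURCE, BASE, IDX1, MASK): each
    unmasked SOURCE element is added into the result element selected
    by the corresponding (1-based) IDX1 value."""
    result = list(base)
    for k, s in enumerate(source):
        if mask is None or mask[k]:
            result[idx[k] - 1] += s   # IDX holds 1-based subscripts
    return result

A = [10., 20., 30., 40., -10.]
X = [1., 2., 3., 4.]
V = [3, 2, 2, 1, 1]
print(sum_scatter(A, X, V, mask=[a > 0 for a in A]))
# [41.0, 52.0, 13.0, 4.0], matching the example above
```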

If all elements of V were distinct, one could write this in
Fortran 90 as

                                                                \CODE
      X(V) = X(V) + MERGE(A, 0., A > 0.)
                                                                \EDOC

The proposed function SUM_SCATTER ``works'' even if V contains
duplicate values.  Note that the two-dimensional case

                                                                \CODE
      X(V,W) = X(V,W) + B
                                                                \EDOC

\noindent
must be rendered using SPREAD:

                                                                \CODE
      X = SUM_SCATTER(B,X,SPREAD(V,DIM=2,NCOPIES=SIZE(X,2)),
     &                      SPREAD(W,DIM=1,NCOPIES=SIZE(X,1)))
                                                                \EDOC

\noindent
in order to duplicate the cross-product effect of ordinary array
subscripting.  
(This definition of XXX_SCATTER does {\em not} perform such a cross
product of indices because it is more general and in practice more
useful without the cross-product effect built in.)

When scatter along one or more axes of a multidimensional array is required,
use a surrounding forall.  For example, the idiom used 
to SUM_SCATTER the (j,k) planes
of an (i,j,k)-indexed three-dimensional array, using the one-dimensional
index vector V is

								\CODE
       REAL, ARRAY(NI, NJ, NK) :: SRC, DEST
       LOGICAL MASK(NI, NJ, NK)
       INTEGER V(NI)
       FORALL (J = 1:NJ, K = 1:NK)
     &      DEST(:, J, K) = SUM_SCATTER( SRC(:,J,K), DEST(:,J,K), 
     &           V, MASK(:, J, K))
								\EDOC
which has the same effect as
								\CODE
      DO I = 1, NI
          WHERE (MASK(I, :, :))
     &       DEST(V(I), :, :) = DEST(V(I), :, :) + SRC(I, :, :)
      ENDDO
								\EDOC

\noindent
but may be more efficient, and 
makes no guarantees as to the order of evaluation.

\subsection{Parallel Prefix Functions}
\label{parallel-prefix}

For every reduction operation XXX in the language, introduce the two new
functions XXX_PREFIX and XXX_SUFFIX.  They take the same arguments
as the corresponding reduction intrinsic
(an array of appropriate type, an optional scalar integer DIM argument,
and an optional LOGICAL array argument MASK conformable with ARRAY),
plus two additional optional arguments:

                                                                \CODE
	XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
	XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
                                                                \EDOC
Each element of the result is the reduction under the operator XXX of a
(possibly empty) set of elements of ARRAY.

\par\smallskip\noindent
{\bf Example:}
                                                                \CODE
     MAXVAL_PREFIX((/ 3, 2, 4, 1, 6/))
is
                   (/ 3, 3, 4, 4, 6/).

     MAXVAL_SUFFIX((/ 3, 2, 4, 1, 6/))
is
                   (/ 6, 6, 6, 6, 6/).
                                                                \EDOC

The result of these functions has the same shape as ARRAY.

The result has the same type as ARRAY, except for
COUNT_PREFIX and COUNT_SUF\-FIX, which take a LOGICAL
array argument and return an integer array result.
The allowed operations and the corresponding allowed types for ARRAY are
given in the table below.
\begin{verbatim}
                XXX                     Allowed Types

                SUM                     Real, Complex, Integer
                COUNT                   Result = Integer, ARRAY = Logical
                PRODUCT                 Real, Complex, Integer
                MAXVAL                  Real, Integer
                MINVAL                  Real, Integer
                AND                     Integer
                OR                      Integer
                EOR                     Integer
                ALL                     Logical
                ANY                     Logical
                PARITY                  Logical
\end{verbatim}

If the DIM argument is omitted, then the arrays are processed in
array element order (``column-major''), as if temporarily regarded as
one-dimensional.  If it is present, then it must be an integer scalar between
one and the rank of ARRAY.   In this case, completely independent 
prefix or suffix operations occur along the selected dimension of ARRAY.
\par\smallskip\noindent
{\bf Example:} If A has the value

\begin{verbatim}

        [  0  -5   8  -3  ]
        [  3   4  -1   2  ]
        [  0   4   6  -4  ]

then SUM_PREFIX(A) has the value 

        [  0  -2  14  16  ]
        [  3   2  13  18  ]
        [  3   6  19  14  ]

SUM_PREFIX(A, DIM=1) has the value

        [  0  -5   8  -3  ]
        [  3  -1   7  -1  ]
        [  3   3  13  -5  ]

SUM_PREFIX(A, DIM=2) has the value

        [  0  -5   3   0  ]
        [  3   7   6   8  ]
        [  0   4  10   6  ]

\end{verbatim}



Array elements corresponding to positions where the MASK is false
do not contribute to the running accumulation.  However, the result
is still defined at those positions.
\par\smallskip\noindent
{\bf Example:}
                                                                \CODE
     MAXVAL_PREFIX(    (/ 3, 2, 4, 5, 6/), 
   &           MASK  = (/ T, F, F, T, F/))
is                                                  
                       (/ 3, 3, 3, 5, 5/).

                                                                \EDOC
In actual practice, results may not be required in those positions;
in such cases the programmer may be able to use the WHERE statement
to inform the compiler:

                                                                \CODE
      WHERE (FOO) A=SUM_PREFIX(B,MASK=FOO)
                                                                \EDOC



The first additional optional argument, SEGMENT, is of
type logical and conformable with the ARRAY argument.  If present,
the array is divided into pieces corresponding to maximal contiguous
runs of like-valued (all true or all false) elements of SEGMENT.
The beginning of a piece is
a place where the running accumulation is reset before
processing the corresponding array element.
\par\smallskip\noindent
{\bf Example:}
                                                                \CODE
     LOGICAL T,F
     PARAMETER (T = .TRUE., F = .FALSE. )

     MAXVAL_PREFIX((/ 3, 2, 4, 1, 6/), 
   &       SEGMENT=(/ T, T, T, F, F/))  yields  (/ 3, 3, 4, 1, 6/).
                      -------  ----                -------  ----
                   two input segments         two independent results
                                                                \EDOC

The second additional optional argument, EXCLUSIVE, is a scalar logical
with default .FALSE.; it determines whether the prefix or
suffix operation is inclusive (the default) or exclusive.  (The
inclusive sum-prefix of (/ 1,2,3,4 /) is (/ 1,3,6,10 /), whereas the
exclusive sum-prefix is (/ 0,1,3,6 /).)
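[A rank-1 executable model of SUM_PREFIX with the MASK, SEGMENT, and
EXCLUSIVE arguments (Python; the name sum_prefix is illustrative, not
part of HPF).  It reproduces the inclusive/exclusive sum-prefix example
just given:]

```python
def sum_prefix(array, mask=None, segment=None, exclusive=False):
    """Model of rank-1 SUM_PREFIX: a running sum that restarts at each
    SEGMENT transition; masked-out elements do not contribute but still
    receive a result; EXCLUSIVE drops each element's own contribution."""
    result = []
    acc = 0                      # identity element of SUM
    for i in range(len(array)):
        if segment is not None and i > 0 and segment[i] != segment[i - 1]:
            acc = 0              # new segment: reset the accumulation
        before = acc
        if mask is None or mask[i]:
            acc += array[i]
        result.append(before if exclusive else acc)
    return result

print(sum_prefix([1, 2, 3, 4]))                  # [1, 3, 6, 10]
print(sum_prefix([1, 2, 3, 4], exclusive=True))  # [0, 1, 3, 6]
```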


In every case, every element of the result has a value equal to the
reduction of certain selected elements of ARRAY, or an identity
value (zero for SUM_PREFIX or SUM_SUFFIX, for example) if no
elements of ARRAY are selected for that result element.  The optional
arguments affect the selection of elements of ARRAY for each element
of the result; the selected elements of ARRAY are said to contribute
to the result element.

The identity element for the reduction PARITY is .FALSE., for the 
reductions OR and EOR is zero, and for the 
reduction AND is -1.  COUNT does not have an identity, as it maps
logicals to integers and returns zero if there are no true values to be counted.
The identities for the other reductions are defined in the Fortran 90 standard.

For any given element R of the result, let A be the corresponding
element of ARRAY.  Every element of ARRAY contributes to R unless
disqualified by one of the following rules.

For XXX_PREFIX, no element that follows A in the array element
ordering of ARRAY contributes to R.  For XXX_SUFFIX, no element that
precedes A in the array element ordering of ARRAY contributes to R.
This rule applies even when the DIM argument is present, since
array element order increases with an increase in any component of an
array element index.

If the DIM argument is provided, an element Z of ARRAY does not
contribute to R unless all its indices, excepting only the index for
dimension DIM, are the same as the corresponding indices of A.

If the MASK argument is provided, an element Z of ARRAY does
not contribute to R if the element of MASK corresponding to
Z is false.

If the SEGMENT argument is provided, an element Z of ARRAY does not
contribute unless the elements B and Y of SEGMENT corresponding to A
and Z (respectively), and the intervening elements of SEGMENT as
well, all have the same value.  If the DIM argument is not present,
then the ``intervening'' elements are all elements between them in
array element order; if the DIM argument is present, then the
``intervening'' elements are those having indices the same as those of
both B and Y, except the index for dimension DIM, which must be
between (and possibly equalling) the indices of B and Y for dimension
DIM.  In other words, the prefix or suffix operation is performed
on groups of elements of ARRAY, where a group corresponds to a
maximal contiguous run of like-valued elements of SEGMENT.

If the SEGMENT argument is omitted, then the result is computed using
a default SEGMENT all elements of which are true.   Thus, without the
DIM argument, there is exactly one group, while if DIM is present, there
is one group for each valid set of indices of ARRAY other than the
index selected by DIM.

If the EXCLUSIVE argument is provided and is true, then A itself
does not contribute to R.

In addition to all this, the operation COPY_PREFIX replicates the first
(lowest-indexed) element of each segment throughout the segment, and
the operation COPY_SUF\-FIX replicates the last (highest-indexed)
element of each segment throughout the segment.

\par\smallskip\noindent
{\bf Examples:}

                                                                \CODE
SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)
SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )

COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) )              
                                 !yields (/1,1,1,2,3,4,4,5,5/)
COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) 
                                 !yields (/0,1,1,1,2,3,4,4,5/)

SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/),
    SEGMENT=(/T,T,T,T,F,F,T,F,F/)) yields (/1,3,6,10,5,11,7,8,17/)
              ------- --- - ---             -------- ---  - ----
	     four input segments       four independent result segments

COPY_PREFIX( (/1,2,3,4,5,6,7,8,9/),
     SEGMENT=(/T,T,T,T,F,F,T,F,F/)) yields (/1,1,1,1,5,5,7,8,8/)
               ------- --- - ---             ------- --- - ---
	      four input segments       four independent result segments
                                                                \EDOC



A new segment begins at every
{\em transition} from false to true or true to false; thus a segment is
indicated by a maximal contiguous subsequence of like logical values:

                                                                \CODE
        (/T,T,T,F,T,F,F,F,T,F,F,T/)
          ----- - - ----- - --- -    seven segments
                                                                \EDOC

Note: Connection Machine software delimits the segments by indicating
the {\em start} of each segment.  Cray MPP Fortran delimits the segments
by indicating the {\em stop} of each segment.  Each method has its advantages.
There is also the question of whether this convention should change when
performing a suffix rather than a prefix.
HPF adopts the symmetric representation above.
The main advantages of this representation are:

(a) It is symmetrical, in that the same segment specifier may
    be meaningfully used for parallel prefix and parallel suffix
    without changing its interpretation (start versus stop).

(b) It seems to be equally inconvenient for every existing
    architecture!  However, it is not that hard to accommodate.

(c) The start-bit or stop-bit representation is easily converted
    to this form by using PARITY_PREFIX or PARITY_SUFFIX.
\par\smallskip\noindent
{\bf Examples:}

                                                                \CODE
    SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
    SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
    SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))
                                                                \EDOC
\noindent
These might be standard idioms for a compiler to recognize.
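[Idiom (c) can be demonstrated with a small Python model (parity_prefix
below is a rank-1 stand-in for PARITY_PREFIX; the data are made up for
illustration):]

```python
def parity_prefix(bits):
    """Running .NEQV. (exclusive-or) reduction of a logical vector."""
    out, acc = [], False
    for b in bits:
        acc ^= b
        out.append(acc)
    return out

# Start-bit form: True marks the first element of each segment.
T, F = True, False
start_bits = [T, F, F, T, F, T, F, F]       # segments of lengths 3, 2, 3
segment = parity_prefix(start_bits)
print(segment)
# [True, True, True, False, False, True, True, True]: the value flips at
# each segment start, giving the symmetric like-valued-run representation.
```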



\subsection{Sorting Functions}

This section introduces two sorting functions, GRADE_UP and GRADE_DOWN.
                                                                \CODE
GRADE_UP(ARRAY,DIM)
                                                                \EDOC

The array may be of type integer, real, or character.

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

                                                                \CODE
	B(i1,i2,...,ik,...,in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)
                                                                \EDOC

\noindent
then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in) is
sorted in ascending order; moreover, R(i1,i2,...,:,...,in) is a permutation of
all the integers in the range 

                                                                \CODE
LBOUND(ARRAY,k):UBOUND(ARRAY,k). 
                                                                \EDOC

The sort is
stable; that is, if j \(\leq\) m and B(i1,i2,...,j,...,in) .EQ.
B(i1,i2,...,m,...,in),
then R(i1,i2,...,j,...,in) \(\leq\) R(i1,i2,...,m,...,in).

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape [SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY))]
and the property that if one computes the rank-1 array

                                                                \CODE
	B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))
                                                                \EDOC

\noindent
where n=SIZE(SHAPE(ARRAY)), then B is sorted in ascending order;
moreover, all of the columns of S are distinct, that is, if j \(\neq\) m then
ALL(S(:,j) .EQ. S(:,m)) will be false.  The sort is stable;
if j \(\leq\) m and B(j) .EQ. B(m), then ARRAY(S(1,j),S(2,j),...,S(n,j))
precedes ARRAY(S(1,m),S(2,m),...,S(n,m)) in the array element ordering
of ARRAY.
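[A rank-1 executable model of GRADE_UP with DIM present (Python; the
name grade_up is illustrative, not part of HPF).  Python's sorted is
stable, which supplies exactly the stability property described above:]

```python
def grade_up(vec, lbound=1):
    """Model of rank-1 GRADE_UP(V, DIM=1): a permutation R of
    lbound : lbound+n-1 such that V(R) is ascending; equal elements
    keep their original relative order (stable sort)."""
    order = sorted(range(len(vec)), key=lambda i: vec[i])
    return [i + lbound for i in order]

V = [30, 10, 20, 10]
R = grade_up(V)
print(R)                      # [2, 4, 3, 1]: the two 10s keep their order
print([V[i - 1] for i in R])  # [10, 10, 20, 30]
```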


                                                                \CODE
GRADE_DOWN(ARRAY,DIM)
                                                                \EDOC

The array may be of type integer, real, or character.

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

                                                                \CODE
	B(i1,i2,...,ik,...,in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)
                                                                \EDOC

\noindent
then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in) is
sorted in descending order; moreover, R(i1,i2,...,:,...,in) is a permutation of
all the integers in the range 

                                                                \CODE
LBOUND(ARRAY,k):UBOUND(ARRAY,k).  
                                                                \EDOC

The sort is
stable; that is, if j \(\leq\) m and B(i1,i2,...,j,...,in) .EQ.
B(i1,i2,...,m,...,in),
then R(i1,i2,...,j,...,in) \(\leq\) R(i1,i2,...,m,...,in).  (Yes, that
last sign really should be a ``\(\leq\)'', not a ``\(\geq\)'', even
though the sort is descending.)

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape [SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY))]
and the property that if one computes the rank-1 array

                                                                \CODE
	B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))
                                                                \EDOC

\noindent
where n=SIZE(SHAPE(ARRAY)), then B is sorted in descending order;
moreover, all of the 
columns of S are distinct, that is, if j \(\neq\) m then ALL(S(:,j) .EQ. S(:,m)) will
be false.  The sort is stable; if j \(\leq\) m and B(j) .EQ. B(m), then
ARRAY(S(1,j),S(2,j),...,S(n,j)) precedes (yes, ``precedes'', not ``follows'')
ARRAY(S(1,m),S(2,m),...,S(n,m)) in the array element ordering of ARRAY.
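The descending case can be sketched the same way, again in Python as executable pseudocode. Python's sort keeps equal elements in their original order even with reverse=True, which is exactly the ``precedes, not follows'' requirement.

```python
# Illustrative model (not HPF) of GRADE_DOWN on a rank-1 array.
# sorted(..., reverse=True) is documented to keep equal elements in their
# original order, so among equal values the earlier index still comes first.
def grade_down(vector):
    """Return 1-based indices that put `vector` in descending order."""
    order = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    return [i + 1 for i in order]

g = grade_down([10, 20, 10])   # [2, 1, 3]: the two 10s stay in index order
```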


Because of the stability requirement, GRADE_DOWN(A(1:N)) does not, in
general, equal GRADE_UP(A(N:1:-1)).  Indeed, these results are equal if
and only if A contains no duplicate values.

The stability requirement allows one to cascade grading operations in order to
sort on multiple fields.  For example, suppose one had the following derived
type (example taken from section 4.4.1 of the Fortran 90 standard):

                                                                \CODE
      TYPE PERSON
        INTEGER AGE
        CHARACTER (LEN = 50) NAME
      END TYPE PERSON
                                                                \EDOC

Now consider two arrays of persons:

                                                                \CODE
      TYPE(PERSON), DIMENSION(100000) :: MEMBERS, ROSTER
                                                                \EDOC

Also assume a work vector for indices:

                                                                \CODE
      INTEGER, DIMENSION(100000) :: V
                                                                \EDOC

Then the statements

                                                                \CODE
      V = GRADE_UP(MEMBERS%AGE)
      V = V(GRADE_UP(MEMBERS(V)%NAME))
      ROSTER = MEMBERS(V)
                                                                \EDOC

\noindent
cause ROSTER to be a rearrangement of MEMBERS that is sorted
primarily by name and secondarily by age (that is, members with
the same name are grouped together in order of ascending age).
Note that the minor sort field is graded first, and that more
statements like the second one may be inserted to sort on additional
fields.

To list members with the same name in descending order of age,
change the first GRADE_UP to GRADE_DOWN:

                                                                \CODE
      V = GRADE_DOWN(MEMBERS%AGE)
      V = V(GRADE_UP(MEMBERS(V)%NAME))
      ROSTER = MEMBERS(V)
                                                                \EDOC
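The cascading idiom can be sketched in Python as executable pseudocode. The membership list here is a small hypothetical example invented for illustration; grading the minor key (age) first and the major key (name) second relies on stability to keep the minor order within each group of equal names.

```python
# Illustrative model (not HPF) of cascaded stable grading on a
# hypothetical (name, age) membership list.
def grade_up(vector):
    """1-based indices that put `vector` in ascending order (stable)."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    return [i + 1 for i in order]

members = [("SMITH", 40), ("JONES", 30), ("SMITH", 25), ("JONES", 35)]

ages = [m[1] for m in members]
v = grade_up(ages)                          # minor sort field first
names_v = [members[i - 1][0] for i in v]
v = [v[j - 1] for j in grade_up(names_v)]   # then the major field
roster = [members[i - 1] for i in v]
# roster is sorted by name; equal names appear in ascending age order
```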

The ideas and names here are inspired by APL.  The term ``grade''
rather than ``rank'' is used because the latter is already used in the
Fortran 90 standard to mean the size of the shape of an array (that is,
the number of dimensions).

\subsection{POPCNT, POPPAR, and LEADZ Functions}

\subsubsection{POPCNT}

An elemental population count function.  Its action on a scalar is:

                                                                \CODE
  POPCNT(x) = COUNT( (/ (BTEST(x,J), J=0, BIT_SIZE(x)-1) /) )
                                                                \EDOC

The result is the number of 1-bits in the integer x, according to the
bit-manipulation model in section 13.5.7 of the Fortran 90 standard.

\subsubsection{POPPAR}

An elemental population-parity function.  Its action on a scalar is:

                                                                \CODE
  POPPAR(x) = MERGE(1,0,BTEST(POPCNT(x),0))
                                                                \EDOC

The result is 1 if the number of 1-bits in the integer x is odd,
or 0 if the number of 1-bits in the integer x is even.

\subsubsection{LEADZ}

An elemental count-leading-zeros function.  Its action on a scalar is:

                                                                \CODE
  LEADZ(x) = MINVAL( (/ (J, J=0,BIT_SIZE(x)) /),
		MASK=(/ (BTEST(x,J), J=BIT_SIZE(x)-1,0,-1), .TRUE. /) )
                                                                \EDOC

The result is a count of the number of leading 0-bits in the integer
x, according to the bit-manipulation model in section 13.5.7 of the
Fortran 90 standard.

Note that a given integer value may produce different results from
LEADZ, depending on the number of bits in the representation of the
integer.  That is because bits are counted from the left (the most
significant bit).
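The three functions can be modeled in Python as follows. The fixed width BITS standing in for BIT_SIZE(x) is an assumption of this sketch, not part of the definition; as the text notes, LEADZ (unlike POPCNT and POPPAR for nonnegative values) depends on it.

```python
# Illustrative Python models (not HPF) of POPCNT, POPPAR, and LEADZ,
# assuming a fixed 32-bit word stands in for BIT_SIZE(x).
BITS = 32

def popcnt(x):
    """Number of 1-bits in the BITS-wide representation of x."""
    return bin(x & (2**BITS - 1)).count("1")

def poppar(x):
    """1 if popcnt(x) is odd, 0 if it is even."""
    return popcnt(x) & 1

def leadz(x):
    """Number of leading 0-bits in the BITS-wide representation of x."""
    x &= 2**BITS - 1
    return BITS - x.bit_length()

# leadz(1) is 31 here but would be 63 for a 64-bit word.
```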

%(The intent is to define POPCNT, POPPAR, and LEADZ consistent with their
%use in Cray Fortran, but to limit them to integer arguments.  
%The
%definition of ILEN is equivalent to that of the built-in function
%integer-length in Common Lisp, which has proven to be quite useful.)



From schreibr@riacs.edu  Fri Sep 18 13:25:13 1992
Received: from erato.cs.rice.edu by cs.rice.edu (AA17965); Fri, 18 Sep 92 13:25:13 CDT
Received: from icarus.riacs.edu by erato.cs.rice.edu (AA21013); Fri, 18 Sep 92 13:25:06 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA27645; Fri, 18 Sep 92 11:25:03 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA09442; Fri, 18 Sep 92 11:24:50 PDT
Message-Id: <9209181824.AA09442@thor.riacs.edu>
Date: Fri, 18 Sep 92 11:24:50 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@erato.cs.rice.edu
Subject: What's new



In the draft I just sent, the possibility to use

number_of_processors, or
processors_shape

in constant expressions has been removed, as this effectively
forces the compiler to compile for one machine size only.
Constant expressions are required by Fortran 90 in initializations of
variables and parameters and, for example, here:

    equivalence (A( <constant-expr> ),    stuff )

It does not seem reasonable to allow an expression that may not be
known explicitly to the compiler as the array subscript.


The explanations of SCATTER, PREFIX, and SUFFIX have all been reworked, too.


Rob


From chk@cs.rice.edu  Mon Sep 21 11:26:43 1992
Received: from charon.rice.edu by cs.rice.edu (AB28258); Mon, 21 Sep 92 11:26:43 CDT
Message-Id: <9209211626.AB28258@cs.rice.edu>
Date: Mon, 21 Sep 1992 11:38:21 -0600
To: Rob Schreiber <schreibr@riacs.edu>
From: chk@cs.rice.edu
Subject: Re: What's new
Cc: hpff-intrinsics@erato.cs.rice.edu

>In the draft I just sent, the possibility to use
>
>number_of_processors, or
>processors_shape
>
>in constant expressions has been removed, as this effectively
>forces the compiler to compile for one machine size only.
>Constant expressions are required by Fortran 90 in initializations of
>variables and parameters and, for example, here:
>
>    equivalence (A( <constnt-expr> ),    stuff )
>
>It does not seem reasonable to allow an expression that may not be
>known explicitly to the compiler as the array subscript.
>
>
>The explanations of SCATTER, PREFIX, and SUFFIX have all been reworked, too.
>
>
>Rob

While I can't argue with leaving NUMBER_OF_PROCESSORS out of EQUIVALENCE,
I'd like to note that this also disallows NUMBER_OF_PROCESSORS in the
initializing expression of DATA statements.  This seems harmless; anybody
want to suggest a way to allow it?  I don't have a cross-reference for the
F90 standard, so I don't know if there are other nice uses of constants.

                                                Chuck


From dwatson@uxb.liv.ac.uk  Fri Oct  9 03:37:51 1992
Received: from sun2.nsfnet-relay.ac.uk by cs.rice.edu (AA10981); Fri, 9 Oct 92 03:37:51 CDT
Via: uk.ac.liverpool.uxb; Fri, 9 Oct 1992 09:37:31 +0100
From: "Mr. D.C.B. Watson" <dwatson@uxb.liv.ac.uk>
Date: Fri, 9 Oct 92 09:39:54 BST
Message-Id: <6797.9210090839@uxb.liv.ac.uk>
To: hpff-intrinsics@cs.rice.edu

Subject: System Inquiry Intrinsics

I am unclear as to the expression class into which HPF system inquiry
intrinsics fall. Are they constant or restricted expressions? The current
(v0.2) draft seems to waver on the issue. If the inquiry functions may be
used in constant expressions then they may be used to specify the bounds
of arrays declared in common. This makes equivalence a very interesting
statement!



From schreibr@riacs.edu  Mon Oct 19 16:32:23 1992
Received: from icarus.riacs.edu by titan.cs.rice.edu (AA17265); Mon, 19 Oct 92 16:32:23 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA15917; Mon, 19 Oct 92 14:32:16 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA01805; Mon, 19 Oct 92 14:32:11 PDT
Message-Id: <9210192132.AA01805@thor.riacs.edu>
Date: Mon, 19 Oct 92 14:32:11 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@cs.rice.edu
Subject: Inquiry Intrinsics


We are going to have a meeting on Wednesday to discuss Richard Shapiro's
proposal for inquiry intrinsics.    It may occur as early as 1:30 PM,
depending on the other committees.

Please read and review his proposal, as modified by me,
before then.   

Questions:  Should character arrays be used for outputs?  What should the
char-len be?  What should HPF_INQUIRE_DISTRIBUTION return as AXIS_INFO in
the case of an axis distributed with the * specification?  (*Is* there a
* specification in a distribute?)

Here is the current draft:


\section{Distribution Inquiry Intrinsics}

\subsection{Motivation}
HPF provides a rich set of data distribution directives. These directives
are advisory in nature. At some point, users will want to know to what
extent the compiler took their advice. This is especially important when a
user calls a non-HPF subroutine, since he may need to know the exact
distribution. For these reasons, HPF includes inquiry intrinsics
which describe how an array is actually mapped onto a machine.
To keep the number of intrinsics small, the inquiry intrinsics are
structured as intrinsic subroutines with optional arguments.


\subsection{Alignment Inquiry Subroutine}
\def\varray{\verb+ARRAY+}

\CODE
	CALL HPF_INQUIRE_ALIGNMENT(ARRAY,LB,UB,STRIDE,AXIS_MAP,
		IDENTITY_MAP,REALIGNABLE,NUMBER_OF_COPIES)
	INTEGER,DIMENSION(7) :: LB,UB,STRIDE,AXIS_MAP
	INTEGER NUMBER_OF_COPIES
	LOGICAL IDENTITY_MAP,REALIGNABLE
\EDOC	
\begin{description}
\item[Required] ARRAY 
\item[Optional] LB,UB,STRIDE,AXIS\_MAP,
		IDENTITY\_MAP,REALIGNABLE,\\ NUMBER\_OF\_COPIES
\end{description}

The \verb+HPF_INQUIRE_ALIGNMENT+ subroutine returns information regarding
the alignment of an array to its associated template.  \verb+ARRAY+ is the
only input argument; all the remaining arguments are optional output arguments.
\begin{description}
\item[ARRAY] The array about which alignment information is requested.
\item[LB] An array containing the template coordinate
	of the first element of \varray\ along an axis.
\item[UB] An array containing the template coordinate
	of the last element of \varray\ along an axis.
\item[STRIDE] An array containing the stride used in aligning the elements
	of \varray\ along an axis.
\item[AXIS\_MAP] An array containing the template axis associated with an
	array axis. If \verb+AXIS_MAP+ is 0, the axis is a collapsed axis.
\item[IDENTITY\_MAP] A logical which will be true if the template
	associated with the array has a shape identical to \varray, the
	axes are mapped using the identity permutation, and the strides are all
        positive.
\item[REALIGNABLE] A logical which will be true if \varray\ has the 
	REALIGNABLE attribute.
\item[NUMBER\_OF\_COPIES] The product of the extents of all template axes over
        which \varray\ has been replicated.
	In particular, this will be 1 for a non-replicated array.
\end{description}

\subsection{Template Inquiry Subroutine}

\CODE
	CALL HPF_INQUIRE_TEMPLATE(ARRAY,TEMPLATE_RANK,LB,UB,AXIS_TYPE,
		AXIS_INFO,NUMBER_ALIGNED,REDISTRIBUTABLE)	
	INTEGER,DIMENSION(MAX_TEMPLATE_RANK) :: LB,UB,AXIS_INFO
	CHARACTER*(*) AXIS_TYPE(MAX_TEMPLATE_RANK)
	INTEGER NUMBER_ALIGNED,TEMPLATE_RANK
	LOGICAL REDISTRIBUTABLE
\EDOC	

\begin{description}
\item[Required] ARRAY 
\item[Optional] LB,UB,AXIS\_TYPE,AXIS\_INFO,
		NUMBER\_ALIGNED,\\ TEMPLATE\_RANK,REDISTRIBUTABLE
\end{description}
The \verb+HPF_INQUIRE_TEMPLATE+ subroutine returns information regarding
the template associated with an array. The main difference between
\verb+HPF_INQUIRE_TEMPLATE+ and \verb+HPF_INQUIRE_ALIGNMENT+  is that the
former returns information concerning the array from the template's point
of view, while the latter returns information from the array's point of
view. \varray\ is the
only input argument; all the remaining arguments are optional output
arguments.

\begin{description}
\item[ARRAY] The array about which template information is requested.
\item[LB] An array containing the declared template lower bound 
	for each axis.
\item[UB] An array containing the declared template upper bound 
	for each axis.
\item[AXIS\_TYPE] A character array which returns information about each
	axis of the template.  The following values are defined by HPF
	(implementations may define other values):
	
	\begin{description}
	\item['NORMAL'] The axis has an axis of \varray\ aligned to it.
		\verb+AXIS_INFO+ contains the axis of the array aligned
		with the axis of the template.
	\item['SINGLE'] The array is aligned with a single coordinate of
		the template axis. \verb+AXIS_INFO+ contains the coordinate
		to which \varray\ is aligned.
	\item['REPLICATED'] The array is replicated along this template axis.
		\verb+AXIS_INFO+ contains the number of copies of
		\varray\ along the axis. This is an implementation-specific
		quantity.
	\end{description}	

\item[AXIS\_INFO] See the description of \verb+AXIS_TYPE+ above.

         Example:
                               \CODE
                REAL A(4, 20, 10)
         CHPF$  TEMPLATE T(30, 150, 8, 200) 
         CHPF$  ALIGN A(I,J,*) WITH T(J+5, 100, 10-2*I, *)
							\EDOC

         then 

							\CODE
         AXIS_TYPE = ['NORMAL', 'SINGLE', 'NORMAL', 'REPLICATED'] and
         AXIS_INFO = [2, 100, 1, 200]
							\EDOC

\item[NUMBER\_ALIGNED] The total number of arrays aligned to the template. 
	This is the number of arrays which will be moved when the template
	is redistributed.
\item[REDISTRIBUTABLE] A logical variable which will be true if the
	template is redistributable.
\item[TEMPLATE\_RANK] The number of axes in the template. This can be
	different than the number of array axes due to collapsing and
	replicating.
\end{description}


\subsection{Distribution Inquiry Subroutine}
\CODE
	CALL HPF_INQUIRE_DISTRIBUTION (ARRAY,AXIS_TYPE,AXIS_INFO,
		PROCESSORS_SHAPE,PROCESSORS_RANK)
	INTEGER AXIS_INFO(MAX_TEMPLATE_RANK),PROCESSORS_RANK
	CHARACTER*(*) AXIS_TYPE(MAX_TEMPLATE_RANK)
	INTEGER PROCESSORS_SHAPE(MAX_PROCESSORS_RANK)
\EDOC

\begin{description}
\item[Required] ARRAY 
\item[Optional] AXIS\_TYPE,AXIS\_INFO,
		PROCESSORS\_SHAPE,PROCESSORS\_RANK
\end{description}
The \verb+HPF_INQUIRE_DISTRIBUTION+ subroutine returns information regarding
the distribution of the template associated with an array.
\varray\ is the only input argument; all the remaining arguments 
are optional output arguments.

\begin{description}
\item[ARRAY] The array about whose template distribution 
	information is requested.
\item[AXIS\_TYPE] A character array which returns information about the 
	distribution of each axis of the template. 
	 The following values are defined by HPF
	(implementations may define other values):
	
	\begin{description}
	\item['BLOCK'] The axis is distributed BLOCK.
		\verb+AXIS_INFO+ contains the block size.
	\item['CYCLIC'] The axis is distributed CYCLIC.
		\verb+AXIS_INFO+ contains the block size.
	\item['COLLAPSED'] The axis is collapsed (distributed with the *
		specification).
	\end{description}	
\item[AXIS\_INFO] See the description of \verb+AXIS_TYPE+ above.
\item[PROCESSORS\_RANK] The rank of the processor arrangement associated
	with the array's template.
\item[PROCESSORS\_SHAPE] The shape of the processor arrangement associated
	 with the array's template.
\end{description}

\subsection{Examples}
Consider the declarations below:

\CODE
	DIMENSION A(10,10),B(20,30),C(20,40,10),D(40)
	CHPF$ TEMPLATE T(40,20)
	CHPF$ REALIGNABLE A
	CHPF$ ALIGN A(I,:) WITH T(1+3*I,2:20:2)
	CHPF$ ALIGN C(I,*,J) WITH T(J,21-I)
	CHPF$ ALIGN D(I) WITH T(I,4)
	CHPF$ PROCESSORS P(4,2)
	CHPF$ DISTRIBUTE T(BLOCK,BLOCK) ONTO P
	CHPF$ DISTRIBUTE B(CYCLIC,BLOCK) ONTO P
\EDOC	

The results of \verb+HPF_INQUIRE_ALIGNMENT+ will be:
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& A & B & C \\ \hline \hline
LB & 4,2,... & 1,1,... & 1,N/A,1,... \\ \hline
UB & 31,20,... & 20,30,... & 20,N/A,10,... \\ \hline
STRIDE & 3,2,... & 1,1,... & -1,N/A,1,... \\ \hline
AXIS\_MAP & 1,2,... & 1,2,... & 2,0,1,... \\ \hline
IDENTITY\_MAP & .FALSE. & .TRUE. & .FALSE. \\ \hline
REALIGNABLE & .TRUE. & .FALSE. & .FALSE. \\ \hline
NUMBER\_OF\_COPIES & 1 & 1 & 1 \\ \hline
\end{tabular}\end{center}
and the result of  \verb+HPF_INQUIRE_TEMPLATE+ will be

\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& A & C & D \\ \hline \hline
LB & 1,1,... & 1,1,... & 1,1,... \\ \hline
UB & 40,20,... & 40,20,... & 40,20,...  \\ \hline
AXIS\_TYPE & 'NORMAL','NORMAL',... &
	 'NORMAL','NORMAL',... & 'NORMAL','SINGLE',... \\ \hline
AXIS\_INFO & 1,2,... & 3,1,... & 1,4,... \\ \hline
NUM.AL. & 3 & 3 & 3 \\ \hline
TEMP. RANK& 2 & 2 & 2 \\ \hline
REDIST. & .FALSE. & .FALSE. & .FALSE. \\ \hline
\end{tabular}\end{center}

Finally  \verb+HPF_INQUIRE_DISTRIBUTION+ will produce
\begin{center}
\begin{tabular}{|l|c|c|}
\hline
& A & B \\ \hline \hline
AXIS\_TYPE & 'BLOCK','BLOCK',... &
	 'CYCLIC','BLOCK',... \\ \hline
AXIS\_INFO & 10,10,... & 1,15,... \\ \hline
PROCESSORS\_SHAPE & 4,2,... & 4,2,... \\ \hline
PROCESSORS\_RANK & 2 & 2 \\ \hline
\end{tabular}\end{center}
Note that the values of the block sizes (in \verb+AXIS_INFO+) are not
specified by HPF, but may be implementation-dependent.


From schreibr@riacs.edu  Mon Nov 30 16:57:56 1992
Received: from icarus.riacs.edu by cs.rice.edu (AA28300); Mon, 30 Nov 92 16:57:56 CST
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA10574; Mon, 30 Nov 92 14:57:34 PST
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA01165; Mon, 30 Nov 92 14:57:33 PST
Message-Id: <9211302257.AA01165@thor.riacs.edu>
Date: Mon, 30 Nov 92 14:57:33 PST
From: Rob Schreiber <schreibr@riacs.edu>
To: loveman@mpsg.enet.dec.com
Subject: Intrinsics
Cc: hpff-intrinsics@cs.rice.edu



Dave,

I don't object to any of your edits.
I cannot make the last table fit the page.   So to hell with the 
"Overfull hbox".

I have added text to deal with inquiry about an unallocated array
(copied from the F90 SIZE intrinsic).  I also clarified the type and shape
of the results returned by the mapping inquiries HPF_ALIGNMENT, etc.



%
%intrinsics.tex

%I have made a number of small edits in the intrinsics chapter,
%primarily to eliminate many messages of the form:
%
%Overfull \hbox (13.32715pt too wide) in paragraph at lines 1438--1441 . . . .
%
%Could you please use this edited version as your base for future edits.
% If you disagree with any of the edits I made, please fix them, while
%attempting to not introduce new LaTeX messages.  I can't think of a
%good way to make the next to the last table fit.  Any ideas?
%
%-David
%
%====================================================================
%Version of May 29, 1992 --- Guy Steele, Thinking Machines Corporation
%and David Loveman, Digital Equipment Corporation --- Robert Schreiber,
%RIACS (Editor)

\chapter{Intrinsic and Library Functions}
\label{intrinsics}

\footnote{Version of October 27, 1992 ---
Guy Steele, Thinking Machines Corporation, David Loveman, Digital
Equipment Corporation, and
Robert Schreiber, Research Institute for Advanced Computer Science
}
This section extends Section 13 of the Fortran 90 standard.

HPF retains Fortran 90's intrinsic functions.  It also adds a number of
new intrinsics in three categories: system inquiry intrinsics, mapping
inquiry intrinsics, and computational intrinsics.

The definitions of two Fortran 90 intrinsics, MAXLOC and MINLOC,
are extended by the addition of an optional DIM argument.

In addition to the new intrinsics, HPF defines an HPF library that must be
provided by vendors of any full HPF implementation.

\section{System Inquiry Intrinsic Functions}

\footnote{Version of
October 27, 1992 --- David Loveman, Digital Equipment Corporation.}
In addition to the Fortran 90 intrinsic functions, High Performance
Fortran has two system inquiry  intrinsic functions:
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE.  Their values remain
constant for (at least) the duration of one program execution.
Accordingly, NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE have values that
are restricted expressions and may be used wherever any other Fortran
90 restricted expression may be used.
%    If the system configuration is committed to at compile time,
%    NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE have values that are constant
%    expressions and may be used wherever any other Fortran 90 constant
%    expression may be used.  
In particular, NUMBER_OF_PROCESSORS may be
used in a specification expression.
%    and, if a constant expression, may
%    be used in an initialization expression.  
None of the categories of
intrinsic functions listed in Chapter 13 of the Fortran 90 standard
seem quite apt to describe the nature of these new intrinsic functions,
so HPF adds a new category of ``system inquiry functions'' and places
NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE in that category.

%    Note that treating these intrinsics as constant expressions does not
%    force a compiler to bind the number of processors at compile time
%    (although that is one possible implementation) -- with the right linker
%    or code-generation technology the choice could be deferred until run
%    time, possibly at some performance cost.


\subsection{Formal Definition}

%[this subsection contains formal definition material, phrased as additions
%or modifications to the Fortran 90 specification, with non-formal
%commentary in square brackets]


\par\smallskip
13.8a System inquiry functions

In a multi-processor implementation, the processors may be arranged in
an im\-ple\-men\-ta\-tion-de\-pen\-dent n-dimensional processor array.
The system inquiry functions return values related to this underlying
machine and processor configuration, including the size and shape of
the underlying processor array.  NUMBER_OF_PROCESSORS returns the total
number of processors available to the program or the number of
processors available to the program along a specified dimension of the
processor array.  PROCESSORS_SHAPE returns the shape of the processor
array.


\par\smallskip
13.10.21 System inquiry functions

\begin{verbatim}
   NUMBER_OF_PROCESSORS(DIM)      Total number of processors in
                                    the processor array.
   PROCESSORS_SHAPE()             Shape of the processor array
\end{verbatim}


\par\smallskip
13.13.xx  NUMBER_OF_PROCESSORS(DIM)

Optional Argument.  DIM

Description.  Returns the total number of processors available to the
program or the number of processors available to the program along a
specified dimension of the processor array.

Class.  System inquiry function.

Arguments.
DIM (optional)  must be scalar and of type integer with a value in the
range \(1 \leq DIM \leq n\), where n is the rank of the processor array.

Result Type, Type Parameter, and Shape.  Default integer scalar.

Result Value.  The result has a value equal to the extent of dimension
DIM (\(1 \leq DIM \leq n\), where n is the rank of the processor array) of the
processor-dependent hardware processor array or, if DIM is absent, the
total number of elements, equal to or greater than one, of the
processor-dependent hardware processor array.

\par\smallskip\noindent
{\bf Examples:} For a DECmpp 12000/Sx Model 200 with 8192 processors,
the value of NUMBER_OF_PROCESSORS( ) is 8192, the value of
NUMBER_OF_PROCESSORS(DIM=1) is 128, and the value of
NUMBER_OF_PROCESSORS(DIM=2) is 64.  For a single processor DEC 3000 AXP
workstation, the value of NUMBER_OF_PROCESSORS( ) is 1, and the value
of NUMBER_OF_PROCESSORS(DIM=1) is 1.


\par\smallskip
13.13.yy PROCESSORS_SHAPE()

Description.  Returns the shape of the implementation-dependent processor array.

Class.  System inquiry function.

Arguments.  None

Result Type, Type Parameter, and Shape.  The result is a default
integer array of rank one whose size is equal to the rank of the
implementation-dependent processor array.

Result Value.  The value of the result is the shape of the
implementation-dependent processor array.


\par\smallskip\noindent
{\bf Example:} For a DECmpp 12000/Sx Model 200 with 8192 processors,
the value of PROCESSORS_SHAPE() is (/ 128, 64 /).  For a Connection
Machine CM-2 with 8192 processors, the value of PROCESSORS_SHAPE()
might be (/ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 /).  For a Connection
Machine CM-5 with 8192 processors, the value of PROCESSORS_SHAPE()
might be (/ 8192 /).  For a single processor DEC 3000 AXP workstation,
the value of PROCESSORS_SHAPE() is (/ 1 /).



%[The list of alternatives for Fortran 90 constant expressions and
%restricted expression are expanded to include NUMBER_OF_PROCESSORS and
%PROCESSORS_SHAPE.]

%    7.1.6.1 Constant expression
%
%    A constant expression is . . . . .
%
%    (6a)  A system inquiry function reference where each argument is a
%    constant expression and the compiler has been informed of the
%    appropriate system configuration.  
%
%    . . . . .
%
7.1.6.2 Specification expression

A restricted expression is . . . . .

(9a)  A system inquiry function reference where each argument is a
restricted expression.  

. . . . .



\subsection{Discussion and Pragmatic Usage --- Consequences of the Formal Proposal}

%    The shape of the processor array may be treated as a constant known at
%    compile time if the compiler is informed about the system
%    configuration.  This will allow the values of system inquiry functions
%    to be used in initialization expressions
%    even if the system configuration is not
%    committed to at compile time.
The values of system inquiry
functions are always restricted expressions;  thus they may be used in
specification expressions.
They may not, however, occur in initialization expressions, because they
may not be assumed to be constants.   In particular, HPF programs may
be compiled to run on machines whose configurations are not known at
compile time.

Note that the system inquiry functions query the physical machine, and
have nothing to do with any PROCESSORS directive that may occur.

References to system inquiry functions may occur in HPF directives, as in:
                                                                  \CODE
!HPF$ TEMPLATE T(100, 3*NUMBER_OF_PROCESSORS())
                                                                  \EDOC

The definition of NUMBER_OF_PROCESSORS is modeled on the definition of
the SIZE intrinsic function.

The definition of PROCESSORS_SHAPE is modeled on the definition of the
SHAPE intrinsic function.

The rank of the processor array is returned by
								\CODE
SIZE(PROCESSORS_SHAPE())
								\EDOC
an expression that may occur in any specification expression.
%and that is constant whenever PROCESSORS_SHAPE() is.

%  PART DELETED AS INCORRECT
%    As a result of being a constant expression, if the system configuration
%    is committed to at compile time, suitably constrained references to
%    system inquiry functions may occur in initialization expressions as,
%    for example, initialization values in type-declaration-statements or in
%    parameter-statements, as in:
%
%                                                                      \CODE
%    PARAMETER (N_PROCS=NUMBER_OF_PROCESSORS(),        &
%               NXPROCS=NUMBER_OF_PROCESSORS(DIM=1),   &
%               NYPROCS=NUMBER_OF_PROCESSORS(DIM=2))
%                                                                      \EDOC
%    
As a result of being a restricted expression, suitably constrained
references to system inquiry functions may occur in specification
expressions as, for example, lower or upper bounds of an
explicit-shape-spec of an array-spec in type-declaration-statements, as in:

                                                                 \CODE
INTEGER, DIMENSION(SIZE(PROCESSORS_SHAPE())) :: PS
PS = PROCESSORS_SHAPE()
! PS(2) = NUMBER_OF_PROCESSORS(DIM=2)
                                                                 \EDOC

\section{Computational Intrinsic Functions}

\footnote{Version of
October 27, 1992 --- Guy Steele, Thinking Machines Corporation.}
This section extends the set of Fortran intrinsic functions.

\subsection{Extension to MINLOC and MAXLOC}

The MAXLOC and MINLOC intrinsics are redefined to have an optional DIM
argument that works exactly as does the DIM argument of MAXVAL.  If
such an argument is present, then the shape of the result equals the
shape of the first argument with one dimension (the one indicated by
the DIM argument) deleted; it is as if a series of one-dimensional
MAXLOC or MINLOC operations were performed.  The rank of the result is
one less than the rank of the first argument.  If the smallest (MINLOC)
or largest (MAXLOC) element along a given dimension is not unique, then
the location of the first one is returned.   The declared lower bounds
of the input array play no role in determining the output.   The
optional MASK argument is retained and may be used together with the
DIM argument.

Note that the behavior of MAXLOC and MINLOC without the DIM argument is
quite different.   In this case, a one-dimensional integer array of
size equal to the rank of ARRAY is returned, giving the subscripts of
the first element in array element order with the smallest (MINLOC) or
largest (MAXLOC) value.

Thus, if A has DIMENSION(4,3), then 
\begin{verbatim}
      SHAPE(MAXLOC(A))       has the value (/ 2 /)
      SHAPE(MAXLOC(A,DIM=1)) has the value (/ 3 /)
      SHAPE(MAXLOC(A,DIM=2)) has the value (/ 4 /).
\end{verbatim}
\par\smallskip\noindent
{\bf Example:} If A has the value

\begin{verbatim}
        [  0  -5   8  -3  ]
        [  3   4  -1   2  ]
        [  0   4   6  -4  ]


then	

        MINLOC(A)        has the value (/ 1, 2 /)
        MAXLOC(A)        has the value (/ 1, 3 /)
        MINLOC(A, DIM=1) has the value (/ 1, 1, 2, 3 /)
        MAXLOC(A, DIM=1) has the value (/ 2, 2, 1, 2 /)
        MINLOC(A, DIM=2) has the value (/ 2, 3, 4 /)
        MAXLOC(A, DIM=2) has the value (/ 3, 2, 3 /).
\end{verbatim}
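A sketch in Python (as executable pseudocode, not Fortran) of the full and DIM=1 forms of MINLOC, reproducing the values above. Note the column-major scan used for the full form: Fortran array element ordering makes the row index vary fastest.

```python
# Illustrative Python model (not Fortran) of MINLOC.  Indices are 1-based,
# and ties take the first location.
def minloc_full(a):
    """1-based [row, col] of the first minimum in Fortran array element
    order (column-major: the row index varies fastest)."""
    rows, cols = len(a), len(a[0])
    best = None
    for j in range(cols):          # columns outer ...
        for i in range(rows):      # ... rows inner = column-major order
            if best is None or a[i][j] < a[best[0] - 1][best[1] - 1]:
                best = [i + 1, j + 1]
    return best

def minloc_dim1(a):
    """For DIM=1: the 1-based row of the first minimum in each column."""
    rows, cols = len(a), len(a[0])
    # min() returns the first minimal element, matching the tie rule
    return [min(range(rows), key=lambda i: a[i][j]) + 1 for j in range(cols)]

A = [[0, -5,  8, -3],
     [3,  4, -1,  2],
     [0,  4,  6, -4]]
# minloc_full(A) -> [1, 2];  minloc_dim1(A) -> [1, 1, 2, 3]
```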



\subsection{ILEN}

An elemental integer-length intrinsic.  Its action on a scalar is:

                                                                \CODE
  ILEN(x) = ceiling(log2( IF x < 0 THEN -x ELSE x+1 ))
                                                                \EDOC

ILEN(x) is the number of bits needed to
represent the integer x in two's-complement form, not counting the
sign bit.  As examples of
its use,  2**ILEN(N-1)  rounds N up to a power of 2 (for \(N > 0\)),
whereas  2**(ILEN(N)-1)  rounds N down to a power of 2.

Note that a given integer value will always produce the same result
from ILEN, independent of the number of bits in the representation of
the integer.  That is because bits are counted from the right (the
least significant bit).

{\bf Argument:}  X must be integer.  It may be scalar or array valued.

{\bf Result shape and type:}  Same as X.

As an elemental, integer-valued intrinsic, ILEN may appear in a
specification expression.
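A Python sketch of ILEN following the formula above; for nonnegative x it coincides with the bit length of x, and it matches Common Lisp's integer-length.

```python
# Illustrative Python model (not Fortran) of ILEN.
def ilen(x):
    """Bits needed for x in two's complement, excluding the sign bit.
    Equals ceiling(log2(-x)) for x < 0 and ceiling(log2(x+1)) for x >= 0."""
    return x.bit_length() if x >= 0 else (-x - 1).bit_length()

n = 5
up   = 2 ** ilen(n - 1)       # 8: rounds n up to a power of 2
down = 2 ** (ilen(n) - 1)     # 4: rounds n down to a power of 2
```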


\section{Computational Library Functions}

\footnote{Version of
September 14, 1992 --- Guy Steele, Thinking Machines Corporation
Robert Schreiber, Research Institute for Advanced Computer Science,
and Rex Page, Amoco Corporation.}
This section consists of five groups of computational library
functions, to be available in the standard HPF module.   Use of these
functions must be accompanied by an appropriate USE statement in each
scoping unit in which they are used.   They are not intrinsic.   Thus,
they are not allowed in specification expressions.

\subsection{New Reduction Functions}

Just as Fortran 90 has the correspondences:

\begin{verbatim}
          operator/intrinsic        reduction intrinsic

                +                       SUM, COUNT
                *                       PRODUCT
                .AND.                   ALL
                .OR.                    ANY
                MAX                     MAXVAL
                MIN                     MINVAL
\end{verbatim}
\noindent
it is useful to have reduction versions of certain other operators and
intrinsics in the language that happen to be associative and
commutative.   Therefore the new functions AND, OR, EOR, and PARITY
are defined.

\begin{verbatim}
          operator/intrinsic        reduction function

                IAND                    AND
                IOR                     OR
                IEOR                    EOR
               .NEQV.                   PARITY

Thus

        AND( (/ 7,3,10 /) )  yields 2
         OR( (/ 7,3,10 /) )  yields 15
        EOR( (/ 7,3,10 /) )  yields 14

      LOGICAL T,F
      PARAMETER (T = .TRUE., F = .FALSE. )      !just for conciseness

      PARITY( (/ T,F,F,T,T,F,F,F,T,T /) )  yields .TRUE.
      PARITY( (/ T,F,F,T,T,F,F,F,T,F /) )  yields .FALSE.
\end{verbatim}

Some of these are particularly valuable when used with the corresponding
parallel-prefix functions, Section~\ref{parallel-prefix}.

The identity element for the reduction PARITY is .FALSE., for the 
reductions OR and EOR is zero, and for the 
reduction AND is -1.  COUNT does not have an identity, as it maps
logicals to integers and returns zero if there are no true values to be counted.
The identities for the other reductions are defined in the Fortran 90 standard.
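These reductions and their identity elements can be modeled in Python (an illustrative sketch; `reduce` with an initial value plays the role of the identity, so the empty-array cases come out as specified):

```python
from functools import reduce
import operator

def and_reduce(xs):   # identity -1, the all-ones two's-complement integer
    return reduce(operator.and_, xs, -1)

def or_reduce(xs):    # identity 0
    return reduce(operator.or_, xs, 0)

def eor_reduce(xs):   # identity 0
    return reduce(operator.xor, xs, 0)

def parity(xs):       # identity .FALSE.; .NEQV. is boolean exclusive or
    return reduce(operator.xor, xs, False)
```

On the examples above: `and_reduce([7, 3, 10])` is 2, `or_reduce([7, 3, 10])` is 15, and `eor_reduce([7, 3, 10])` is 14.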

\subsection{Combining-Scatter Functions}

%[Note the addition of COPY_SCATTER, by analogy with COPY_PREFIX and
%COPY_SUFFIX, to achieve what the Connection Machine calls
%send-with-overwrite.]

%Combining-Send section revised to make these functions
%rather than subroutines --- Rex Page 26Aug92

%Changed to SCATTER, removed restrictions on order of operations, and
%added optional final MASK argument
%RSS, Sept 14.

For every reduction operation XXX in the language, introduce a new
function:

                                                                \CODE
   XXX_SCATTER(SOURCE,BASE,IDX1,..., IDXn, MASK)
                                                                \EDOC

The IDX arguments are integer arrays.
The number of IDX arguments must equal the rank of BASE.  
The SOURCE and all the IDX arguments must be conformable.  
The result delivered by the function is conformable with BASE.   
The types of SOURCE and BASE must be the
same (exception: COUNT), and the result has the type of BASE.   
The allowed types are:

\begin{verbatim}
                XXX                     Allowed Types

                SUM                     Real, Complex, Integer
                COUNT                   BASE = Integer, SOURCE = Logical
                PRODUCT                 Real, Complex, Integer
                MAXVAL                  Real, Integer
                MINVAL                  Real, Integer
                AND                     Integer
                OR                      Integer
                EOR                     Integer
                ALL                     Logical
                ANY                     Logical
                PARITY                  Logical

\end{verbatim}

Since SOURCE and all the IDX arrays are conformable, for every
element s in SOURCE there is a corresponding element in each of
the IDX arrays.
Let i1 be the value of the element of
IDX1 that is indexed by the same subscripts as element s of SOURCE.
More generally, for each j=1,2,...,n,
let ij be the value of the element of IDXj that corresponds to
element s in SOURCE, where n is the rank of BASE. 
The integers ij, j=1,...,n, form a subscript selecting an element
of BASE:  BASE(i1,i2,...,in).

Thus SOURCE and the IDX arrays establish a mapping from all the
elements of SOURCE onto selected elements of BASE.
Viewed in the other direction, this mapping associates
with each element b of BASE a set S of elements from SOURCE.

Since BASE and the result of XXX_SCATTER are conformable,
there is a corresponding element of the result for each 
element of BASE.

If S is empty, then the element of the result corresponding to the
element  b  of BASE has the same value as  b.

If S is non-empty, then the elements of  S  will be combined with
element  b to produce an element of the result.
Let the elements of  S  be  s1, ..., sm.
Let @ denote an infix form of operation XXX.
The element of the result corresponding to the
element  b  of BASE is the result of evaluating
                                                                \CODE
       s1 @ s2 @ ... @ sm @ b
                                                                \EDOC
\noindent
or any mathematically equivalent expression (as defined in Section
7.1.7.3 of the Fortran 90 standard [ISO/IEC 1539:1991(E)]).

Thus the order of operations is arbitrary, and may differ on two
otherwise identical runs of the same HPF program.  This matters when
the combining operation is not both associative and commutative, for
example floating-point addition.  In fact, because machine arithmetic
is not associative (not even  fixed-point, because of overflow) the
programmer must be sure that the nondeterministic order of evaluation
of the result will not produce undesirable effects.

If the optional argument MASK is present, then only the elements of
SOURCE in positions for which MASK is true participate in the operation.
All other elements of SOURCE and of the IDX arrays are ignored.

Thus the result of the expression
                                                                \CODE
      SUM_SCATTER(SOURCE,BASE,IDX1,IDX2,...,IDXn,MASK)
                                                                \EDOC
\noindent
{\em could} be computed as

                                                                \CODE
      result = BASE
      DO J1=LBOUND(SOURCE,1),UBOUND(SOURCE,1)
        DO J2=LBOUND(SOURCE,2),UBOUND(SOURCE,2)
          ...
            DO Jk=LBOUND(SOURCE,k),UBOUND(SOURCE,k)
              IF (MASK(J1,J2,...,Jk))
     &           result(IDX1(J1,J2,...,Jk),
     &                  IDX2(J1,J2,...,Jk),
     &                  ...
     &                  IDXn(J1,J2,...,Jk)) =
     &           result(IDX1(J1,J2,...,Jk),
     &                  IDX2(J1,J2,...,Jk),
     &                  ...
     &                  IDXn(J1,J2,...,Jk)) + SOURCE(J1,J2,...,Jk)
            END DO
          ...
        END DO
      END DO
                                                                \EDOC

\noindent
where k is the rank of SOURCE.  (However, this nest of DO loops makes a
greater commitment to the particular order in which the combining
operations are carried out than the order---namely, none!--- guaranteed
by the XXX_SCATTER function.)
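For the rank-1 case, the loop nest above reduces to a short Python sketch (illustration only: IDX values are 1-based as in Fortran, and the fixed loop order here is just one of the evaluation orders the function permits):

```python
def sum_scatter(source, base, idx, mask=None):
    """Rank-1 model of SUM_SCATTER: source[i] is added into
    result[idx[i]] (1-based) whenever mask[i] is true."""
    if mask is None:
        mask = [True] * len(source)
    result = list(base)
    for s, k, m in zip(source, idx, mask):
        if m:
            result[k - 1] += s
    return result
```

Applied to the example below (SOURCE = A, BASE = X, IDX1 = V, MASK = A > 0) it yields (/41., 52., 13., 4./).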

In addition, COPY_SCATTER is the combining-send
function generated by the (noncommutative) binary operator

                                                                \CODE
      COPY_operation(x,y) = x
                                                                \EDOC
\noindent
Thus an element of the result delivered by
COPY_SCATTER(SOURCE,BASE,IDX1,..., IDXn) corresponding with an element
of BASE that is associated with a non-empty set from SOURCE has
the same value as {\em some} SOURCE element from that set.
So if multiple elements of SOURCE are sent to the same result
element, some one of them will be assigned and the rest, as well as
the corresponding element of BASE, will be effectively discarded.
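A rank-1 Python model of COPY_SCATTER follows (illustration only; where several SOURCE elements target one position, HPF allows any of them to win, and this sketch happens to keep the last one seen in loop order, which is merely one permitted outcome):

```python
def copy_scatter(source, base, idx, mask=None):
    """Rank-1 model of COPY_SCATTER (send-with-overwrite):
    result[idx[i]] is overwritten by source[i] (1-based indices)."""
    if mask is None:
        mask = [True] * len(source)
    result = list(base)
    for s, k, m in zip(source, idx, mask):
        if m:
            result[k - 1] = s   # last writer wins in this sketch
    return result
```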
\par\smallskip\noindent
{\bf Example:} 

                                                                \CODE
      A = (/ 10., 20., 30., 40., -10./)
      X = (/ 1.,  2.,  3.,  4./)
      V = (/ 3,   2,   2,   1,   1/)
      X = SUM_SCATTER(A,X,V, MASK=(A > 0) )
                                                                \EDOC
\noindent
yields the result X = (/41., 52., 13., 4./).

If all elements of V were distinct, one could write this in
Fortran 90 as

                                                                \CODE
      X(V) = X(V) + MERGE(A, 0., A > 0.)
                                                                \EDOC

The proposed function SUM_SCATTER ``works'' even if V contains
duplicate values.  Note that the two-dimensional case

                                                                \CODE
      X(V,W) = X(V,W) + B
                                                                \EDOC

\noindent
must be rendered using SPREAD:

                                                                \CODE
      X = SUM_SCATTER(B,X,SPREAD(V,DIM=2,NCOPIES=SIZE(X,2)),
     &                      SPREAD(W,DIM=1,NCOPIES=SIZE(X,1)))
                                                                \EDOC

\noindent
in order to duplicate the cross-product effect of ordinary array
subscripting.  
(This definition of XXX_SCATTER does {\em not} perform such a cross
product of indices because it is more general and in practice more
useful without the cross-product effect built in.)

When scatter along one or more axes of a multidimensional array is required,
use a surrounding FORALL.  For example, the idiom used 
to SUM_SCATTER the (j,k) planes
of an (i,j,k)-indexed three-dimensional array, using the one-dimensional
index vector V is

								\CODE
  REAL, ARRAY(NI, NJ, NK) :: SRC, DEST
  LOGICAL MASK(NI, NJ, NK)
  INTEGER V(NI)
  FORALL (J = 1:NJ, K = 1:NK)
&      DEST(:, J, K) = SUM_SCATTER( SRC(:,J,K), DEST(:,J,K), 
&           V, MASK(:, J, K))
								\EDOC
which has the same effect as
								\CODE
  DO I = 1, NI
     WHERE (MASK(I, :, :))
&       DEST(V(I), :, :) = DEST(V(I), :, :) + SRC(I, :, :)
  ENDDO
								\EDOC

\noindent
but may be more efficient, and 
makes no guarantees as to the order of evaluation.

\subsection{Parallel Prefix Functions}
\label{parallel-prefix}

For every reduction operation XXX in the language, introduce the two new
functions XXX_PREFIX and XXX_SUFFIX.  They take the same arguments
as the corresponding reduction intrinsic
(an array of appropriate type, an optional scalar integer DIM argument,
and an optional LOGICAL array argument MASK conformable with ARRAY),
plus two additional optional arguments:

                                                                \CODE
XXX_PREFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
XXX_SUFFIX(ARRAY, DIM, MASK, SEGMENT, EXCLUSIVE)
                                                                \EDOC
Each element of the result is the reduction under the operator XXX of a
(possibly empty) set of elements of ARRAY.

\par\smallskip\noindent
{\bf Example:}
                                                                \CODE
     MAXVAL_PREFIX((/ 3, 2, 4, 1, 6/))
is
                   (/ 3, 3, 4, 4, 6/).

     MAXVAL_SUFFIX((/ 3, 2, 4, 1, 6/))
is
                   (/ 6, 6, 6, 6, 6/).
                                                                \EDOC

The value of these functions is conformable with ARRAY.

The result has the same type as ARRAY, except for
COUNT_PREFIX and COUNT_SUF\-FIX, which take a LOGICAL
array argument and return an integer array result.
The allowed operations and the corresponding allowed types for ARRAY are
given in the table below.
\begin{verbatim}
                XXX                     Allowed Types

                SUM                     Real, Complex, Integer
                COUNT                   Result = Integer, ARRAY = Logical
                PRODUCT                 Real, Complex, Integer
                MAXVAL                  Real, Integer
                MINVAL                  Real, Integer
                AND                     Integer
                OR                      Integer
                EOR                     Integer
                ALL                     Logical
                ANY                     Logical
                PARITY                  Logical
\end{verbatim}

If the DIM argument is omitted, then the arrays are processed in array
element order (``column-major''), as if temporarily regarded as
one-dimensional.  If it is present, then it must be an integer scalar
between one and the rank of ARRAY.   In this case, completely
independent prefix or suffix operations occur along the selected
dimension of ARRAY.

\par\smallskip\noindent
{\bf Example:} If A has the value

\begin{verbatim}

        [  0  -5   8  -3  ]
        [  3   4  -1   2  ]
        [  0   4   6  -4  ]

then SUM_PREFIX(A) has the value 

        [  0  -2  14  16  ]
        [  3   2  13  18  ]
        [  3   6  19  14  ]

SUM_PREFIX(A, DIM=1) has the value

        [  0  -5   8  -3  ]
        [  3  -1   7  -1  ]
        [  3   3  13  -5  ]

SUM_PREFIX(A, DIM=2) has the value

        [  0  -5   3   0  ]
        [  3   7   6   8  ]
        [  0   4  10   6  ]

\end{verbatim}


Array elements at positions where MASK is false
do not contribute to the running accumulation.  However, the result
is still defined at those positions.
\par\smallskip\noindent
{\bf Example:}
                                                                \CODE
     MAXVAL_PREFIX(    (/ 3, 2, 4, 5, 6/), 
   &           MASK  = (/ T, F, F, T, F/))
is                                                  
                       (/ 3, 3, 3, 5, 5/).

                                                                \EDOC
In actual practice, results may not be required in those positions;
in such cases the programmer may be able to use the WHERE statement
to inform the compiler:

                                                                \CODE
      WHERE (FOO) A=SUM_PREFIX(B,MASK=FOO)
                                                                \EDOC



The first additional optional argument, SEGMENT, is of
type logical and conformable with the ARRAY argument.  If present, the
array is divided into pieces corresponding to maximal contiguous sequences
of like-valued (all true or all false) elements of SEGMENT.  The beginning
of a piece is a place where the running accumulation is reset before
processing the corresponding array element.

\par\smallskip\noindent
{\bf Example:}
                                                                \CODE
     LOGICAL T,F
     PARAMETER (T = .TRUE., F = .FALSE. )

     MAXVAL_PREFIX((/ 3, 2, 4, 1, 6/), 
   &       SEGMENT=(/ T, T, T, F, F/))  yields  (/ 3, 3, 4, 1, 6/).
                      -------  ----                -------  ----
                   two input segments         two independent results
                                                                \EDOC

The second additional optional argument, EXCLUSIVE, is a scalar logical
with default .FALSE.; it determines whether the prefix or
suffix operation is inclusive (the default) or exclusive.  (The
inclusive sum-prefix of (/ 1,2,3,4 /) is (/ 1,3,6,10 /), whereas the
exclusive sum-prefix is (/ 0,1,3,6 /).)


In every case, every element of the result has a value equal to the
reduction of certain selected elements of ARRAY, or an identity value
(zero for SUM_PREFIX or SUM_SUFFIX, for example) if no elements of
ARRAY are selected for that result element.  The optional arguments
affect the selection of elements of ARRAY for each element of the
result; the selected elements of ARRAY are said to contribute to the
result element.

The identity element for the reduction PARITY is .FALSE., for the
reductions OR and EOR is zero, and for the reduction AND is -1.  COUNT
does not have an identity, as it maps logicals to integers and returns
zero if there are no true values to be counted.  The identities for the
other reductions are defined in the Fortran 90 standard.

For any given element R of the result, let A be the corresponding
element of ARRAY.  Every element of ARRAY contributes to R unless
disqualified by one of the following rules.

For xxx_PREFIX, no element that follows A in the array element ordering
of ARRAY contributes to R.  For xxx_SUFFIX, no element that precedes A
in the array element ordering of ARRAY contributes to R.  This rule
applies even when the DIM argument is present, since array element
order increases with an increase in any component of an array element
index.

If the DIM argument is provided, an element Z of ARRAY does not
contribute to R unless all its indices, excepting only the index for
dimension DIM, are the same as the corresponding indices of A.

If the MASK argument is provided, an element Z of ARRAY does not
contribute to R if the element of MASK corresponding to Z is false.

If the SEGMENT argument is provided, an element Z of ARRAY does not
contribute unless the elements B and Y of SEGMENT corresponding to A
and Z (respectively), and the intervening elements of SEGMENT as well,
all have the same value.  If the DIM argument is not present, then the
``intervening'' elements are all elements between them in array element
order; if the DIM argument is present, then the ``intervening''
elements are those having indices the same as those of both B and Y,
except the index for dimension DIM, which must be between (and possibly
equalling) the indices of B and Y for dimension DIM.  In other words,
the prefix or suffix operation is performed on groups of elements of
ARRAY, where a group corresponds to a maximal contiguous run of
like-valued elements of SEGMENT.

If the SEGMENT argument is omitted, then the result is computed using a
default SEGMENT all elements of which are true.   Thus, without the DIM
argument, there is exactly one group, while if DIM is present, there is
one group for each valid set of indices of ARRAY other than the index
selected by DIM.

If the EXCLUSIVE argument is provided and is true, then A itself does
not contribute to R.
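The rules above can be collected into one rank-1 Python model (an illustrative sketch only: `op` and `identity` stand in for the reduction XXX and its identity element; the accumulator resets at each SEGMENT transition, and masked-out elements do not contribute but still receive the running value):

```python
def prefix_scan(op, identity, array, mask=None, segment=None,
                exclusive=False):
    """Rank-1 model of the XXX_PREFIX functions."""
    n = len(array)
    mask = [True] * n if mask is None else mask
    segment = [True] * n if segment is None else segment
    result, acc, prev = [], identity, None
    for i in range(n):
        if segment[i] != prev:            # new segment: reset accumulator
            acc, prev = identity, segment[i]
        if exclusive:
            result.append(acc)            # element i itself excluded
        if mask[i]:
            acc = op(acc, array[i])
        if not exclusive:
            result.append(acc)
    return result
```

This reproduces the worked examples: `prefix_scan(max, float('-inf'), [3, 2, 4, 1, 6])` gives [3, 3, 4, 4, 6], and adding `segment=[True, True, True, False, False]` gives [3, 3, 4, 1, 6].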

In addition to all this, the operation COPY_PREFIX replicates the first
(lowest-indexed) element of each segment throughout the segment, and
the operation COPY_SUF\-FIX replicates the last (highest-indexed)
element of each segment throughout the segment.

\par\smallskip\noindent
{\bf Examples:}

                                                                \CODE
SUM_PREFIX( (/1,3,5,7/) ) yields (/1,4,9,16/)
SUM_SUFFIX( (/1,3,5,7/) ) yields (/16,15,12,7/)

LOGICAL T,F
PARAMETER (T = .TRUE., F = .FALSE. )

COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/) )              
                                 !yields (/1,1,1,2,3,4,4,5,5/)
COUNT_PREFIX( (/T,F,F,T,T,T,F,T,F/), EXCLUSIVE=T ) 
                                 !yields (/0,1,1,1,2,3,4,4,5/)

  SUM_PREFIX( (/1,2,3,4,5,6,7,8,9/),
&   SEGMENT=(/T,T,T,T,F,F,T,F,F/)) yields (/1,3,6,10,5,11,7,8,17/)
              ------- --- - ---             -------- ---  - ----
	     four input segments       four independent result segments

  COPY_PREFIX( (/1,2,3,4,5,6,7,8,9/),
&    SEGMENT=(/T,T,T,T,F,F,T,F,F/)) yields (/1,1,1,1,5,5,7,8,8/)
               ------- --- - ---             ------- --- - ---
	      four input segments       four independent result segments
                                                                \EDOC



A new segment begins at every
{\em transition} from false to true or true to false; thus a segment is
indicated by a maximal contiguous subsequence of like logical values:

                                                                \CODE
(/T,T,T,F,T,F,F,F,T,F,F,T/)
  ----- - - ----- - --- -    seven segments
                                                                \EDOC

Note: Connection Machine software delimits the segments by indicating
the {\em start} of each segment.  Cray MPP Fortran delimits the
segments by indicating the {\em stop} of each segment.  Each method has
its advantages.  There is also the question of whether this convention
should change when performing a suffix rather than a prefix.  HPF
adopts the symmetric representation above.  The main advantages of this
representation are:

(a) It is symmetrical, in that the same segment specifier may
    be meaningfully used for parallel prefix and parallel suffix
    without changing its interpretation (start versus stop).

(b) It seems to be equally inconvenient for every existing
    architecture!  However, it is not that hard to accommodate.

(c) The start-bit or stop-bit representation is easily converted
    to this form by using PARITY_PREFIX or PARITY_SUFFIX.

\par\smallskip\noindent
{\bf Examples:}

                                                                \CODE
SUM_PREFIX(FOO,SEGMENT=PARITY_PREFIX(START_BITS))
SUM_PREFIX(FOO,SEGMENT=PARITY_SUFFIX(STOP_BITS))
SUM_SUFFIX(FOO,SEGMENT=PARITY_SUFFIX(START_BITS))
SUM_SUFFIX(FOO,SEGMENT=PARITY_PREFIX(STOP_BITS))
                                                                \EDOC
\noindent
These might be standard idioms for a compiler to recognize.
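The start-bit conversion in the first idiom can be checked with a small Python sketch (illustration only; `parity_prefix` here is a rank-1 running exclusive or, and `start_bits` is a hypothetical descriptor marking the first element of each of four segments):

```python
def parity_prefix(bits):
    """Running .NEQV. (exclusive or): inclusive PARITY_PREFIX, rank 1."""
    out, acc = [], False
    for b in bits:
        acc ^= b
        out.append(acc)
    return out

# Start-bit form of a descriptor for segments of lengths 4, 2, 1, 2:
start_bits = [True, False, False, False, True, False, True, True, False]
```

`parity_prefix(start_bits)` yields [T,T,T,T,F,F,T,F,F], exactly the symmetric SEGMENT vector used in the SUM_PREFIX examples above.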


\subsection{Sorting Functions}

This section introduces two sorting functions, GRADE_UP and GRADE_DOWN.
                                                                \CODE
GRADE_UP(ARRAY,DIM)
                                                                \EDOC

The array may be of type integer, real, or character.

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

                                                                \CODE
B(i1,i2,...,ik,...,in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)
                                                                \EDOC

\noindent
then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in) is
sorted in ascending order; moreover, R(i1,i2,...,:,...,in) is a permutation of
all the integers in the range 

                                                                \CODE
LBOUND(ARRAY,k):UBOUND(ARRAY,k). 
                                                                \EDOC

The sort is
stable; that is, if j \(\leq\) m and B(i1,i2,...,j,...,in) .EQ.
B(i1,i2,...,m,...,in),
then R(i1,i2,...,j,...,in) \(\leq\) R(i1,i2,...,m,...,in).

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape (/ SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY)) /)
and the property that if one computes the rank-1 array

                                                                \CODE
B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))
                                                                \EDOC

\noindent
where n=SIZE(SHAPE(ARRAY)), then B is sorted in ascending order;
moreover, all of the columns of S are distinct, that is, if j \(\neq\) m then
ALL(S(:,j) .EQ. S(:,m)) will be false.  The sort is stable;
if j \(\leq\) m and B(j) .EQ. B(m), then ARRAY(S(1,j),S(2,j),...,S(n,j))
precedes ARRAY(S(1,m),S(2,m),...,S(n,m)) in the array element ordering
of ARRAY.
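The DIM-absent behavior can be modeled in Python for a rank-2 array (an illustrative sketch: the columns of S are returned as 1-based index pairs, enumeration is in Fortran array element order, and Python's stable sort supplies the required stability):

```python
def grade_up_full(a):
    """DIM-absent GRADE_UP for a rank-2 array given as a list of rows:
    the columns of S, as 1-based (row, column) pairs, in ascending
    order of the array values."""
    nrows, ncols = len(a), len(a[0])
    # subscripts in Fortran array element order (column-major)
    subs = [(i + 1, j + 1) for j in range(ncols) for i in range(nrows)]
    subs.sort(key=lambda t: a[t[0] - 1][t[1] - 1])   # stable sort
    return subs
```

For `[[3, 1], [2, 2]]` this returns [(1,2), (2,1), (2,2), (1,1)]: the two equal values keep array element order, as the stability rule demands.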


                                                                \CODE
GRADE_DOWN(ARRAY,DIM)
                                                                \EDOC

The array may be of type integer, real, or character.

If the optional DIM argument is present, then the result has the same
shape as the ARRAY.  Suppose DIM has the value k; then the result R
has the property that if one computes the array

                                                                \CODE
B(i1,i2,...,ik,...,in)=ARRAY(i1,i2,...,R(i1,i2,...,ik,...,in),...,in)
                                                                \EDOC

\noindent
then for all i1,i2,...,(omit ik),...,in, the vector B(i1,i2,...,:,...,in) is
sorted in descending order; moreover, R(i1,i2,...,:,...,in) is a permutation of
all the integers in the range 

                                                                \CODE
LBOUND(ARRAY,k):UBOUND(ARRAY,k).  
                                                                \EDOC

The sort is
stable; that is, if j \(\leq\) m and B(i1,i2,...,j,...,in) .EQ.
B(i1,i2,...,m,...,in),
then R(i1,i2,...,j,...,in) \(\leq\) R(i1,i2,...,m,...,in).  (Yes, that
last ``\(\leq\)'' sign
really should be a ``\(\leq\)'', not a ``\(\geq\)''.)

If the optional DIM argument is absent, then the result S is an
array of rank 2, with shape (/ SIZE(SHAPE(ARRAY)), PRODUCT(SHAPE(ARRAY)) /)
and the property that if one computes the rank-1 array

                                                                \CODE
B(k)=ARRAY(S(1,k),S(2,k),...,S(n,k))
                                                                \EDOC

\noindent
where n=SIZE(SHAPE(ARRAY)), then B is sorted in descending order;
moreover, all of the 
columns of S are distinct, that is, if j \(\neq\) m then ALL(S(:,j)
.EQ. S(:,m)) will
be false.  The sort is stable; if j \(\leq\) m and B(j) .EQ. B(m), then
ARRAY(S(1,j),S(2,j),...,S(n,j)) precedes (yes, ``precedes'', not ``follows'')
ARRAY(S(1,m),S(2,m),...,S(n,m)) in the array element ordering of ARRAY.


Because of the stability requirement, GRADE_DOWN(A(1:N)) does not, in
general, equal the reversal V(N:1:-1) of V = GRADE_UP(A(1:N)): both
order the values descending, but GRADE_DOWN lists equal values in
ascending index order, whereas the reversed ascending grade lists them
in descending index order.  Indeed, these results are equal if and only
if A contains no duplicate values.

The stability requirement allows one to cascade grading operations in order to
sort on multiple fields.  For example, suppose one had the following derived
type (example taken from section 4.4.1 of the Fortran 90 standard):

                                                                \CODE
TYPE PERSON
  INTEGER AGE
  CHARACTER (LEN = 50) NAME
END TYPE PERSON
                                                                \EDOC

Now consider two arrays of persons:

                                                                \CODE
TYPE(PERSON), DIMENSION(100000) :: MEMBERS, ROSTER
                                                                \EDOC

Also assume a work vector for indices:

                                                                \CODE
INTEGER, DIMENSION(100000) :: V
                                                                \EDOC

Then the statements

                                                                \CODE
V = GRADE_UP(MEMBERS%AGE)
V = V(GRADE_UP(MEMBERS(V)%NAME))
ROSTER = MEMBERS(V)
                                                                \EDOC

\noindent
cause ROSTER to be a rearrangement of MEMBERS that is sorted
primarily by name and secondarily by age (that is, members with
the same name are grouped together in order of ascending age).
Note that the minor sort field is graded first, and that more
statements like the second one may be inserted to sort on additional
fields.

To list members with the same name in descending order of age,
change the first GRADE_UP to GRADE_DOWN:

                                                                \CODE
V = GRADE_DOWN(MEMBERS%AGE)
V = V(GRADE_UP(MEMBERS(V)%NAME))
ROSTER = MEMBERS(V)
                                                                \EDOC
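The cascaded grading technique can be modeled in Python (illustration only: 0-based indices instead of Fortran's 1-based, hypothetical member data standing in for MEMBERS%AGE and MEMBERS%NAME, and Python's guaranteed-stable sort supplying the stability the proposal requires):

```python
def grade_up(values):
    """0-based analogue of rank-1 GRADE_UP; sorted() is stable."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical member data:
ages  = [40, 35, 40, 35]
names = ["Lee", "Kim", "Kim", "Lee"]

v = grade_up(ages)                                   # minor key first
v = [v[j] for j in grade_up([names[i] for i in v])]  # then major key
roster = [(names[i], ages[i]) for i in v]
```

The result is sorted primarily by name and secondarily by ascending age: [("Kim", 35), ("Kim", 40), ("Lee", 35), ("Lee", 40)].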

The ideas and names here are inspired by APL.  The term ``grade''
rather than ``rank'' is used because the latter is already used in the
Fortran 90 standard to mean the size of the shape of an array (that is,
the number of dimensions).

\subsection{POPCNT, POPPAR, and LEADZ Functions}

\subsubsection{POPCNT}

An elemental population count function.  Its action on a scalar is:

                                                                \CODE
POPCNT(x) = COUNT( (/ (BTEST(x,J), J=0, BIT_SIZE(x)-1) /) )
                                                                \EDOC

The result is the number of 1-bits in the integer x, according to the
bit-manipulation model in section 13.5.7 of the Fortran 90 standard.

\subsubsection{POPPAR}

An elemental population-parity function.  Its action on a scalar is:

                                                                \CODE
POPPAR(x) = MERGE(1,0,BTEST(POPCNT(x),0))
                                                                \EDOC

The result is 1 if the number of 1-bits in the integer x is odd,
or 0 if the number of 1-bits in the integer x is even.

\subsubsection{LEADZ}

An elemental count-leading-zeros function.  Its action on a scalar is:

                                                                \CODE
LEADZ(x) = MINVAL( (/ (J, J=0,BIT_SIZE(x)) /),
     &     MASK=(/ (BTEST(x,J), J=BIT_SIZE(x)-1,0,-1), .TRUE. /) )
                                                                \EDOC

The result is a count of the number of leading 0-bits in the integer
x, according to the bit-manipulation model in section 13.5.7 of the
Fortran 90 standard.

Note that a given integer value may produce different results from
LEADZ, depending on the number of bits in the representation of the
integer.  That is because bits are counted from the left (the most
significant bit).
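All three functions can be modeled in Python for a fixed word width (an illustrative sketch: WIDTH is an assumed BIT_SIZE of 32, and masking to WIDTH bits models the two's-complement representation; as noted above, LEADZ results depend on this width):

```python
WIDTH = 32   # assumed BIT_SIZE(x)

def popcnt(x):
    """Number of 1-bits of x viewed as a WIDTH-bit word."""
    return bin(x & ((1 << WIDTH) - 1)).count("1")

def poppar(x):
    """1 if popcnt(x) is odd, else 0."""
    return popcnt(x) & 1

def leadz(x):
    """Number of leading 0-bits of the WIDTH-bit representation of x."""
    return WIDTH - (x & ((1 << WIDTH) - 1)).bit_length()
```

For example, `leadz(0)` is 32 and `leadz(1)` is 31 under this width, while `popcnt(-1)` is 32 because a negative integer is viewed through its two's-complement bits.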

%(The intent is to define POPCNT, POPPAR, and LEADZ consistent with their
%use in Cray Fortran, but to limit them to integer arguments.  
%The
%definition of ILEN is equivalent to that of the built-in function
%integer-length in Common Lisp, which has proven to be quite useful.)

%
%    Draft of October 7, 1992, by Richard Shapiro;  edited by Rob Schreiber
%
\section{Mapping Inquiry Intrinsic Functions}

\footnote{Version of
October 14, 1992 --- Richard Shapiro.}
This section extends the set of Fortran intrinsic functions to add
inquiries regarding data distribution.

\subsection{Motivation}
HPF provides a rich set of data mapping directives. These directives
are advisory in nature. At some point, users will want to know to what
extent the compiler took their advice. This is especially important
when a user calls a non-HPF subroutine, since he may need to know the
exact mapping. For these reasons, HPF includes inquiry intrinsics which
describe how an array is actually mapped onto a machine.  To keep the
number of intrinsics small, the inquiry intrisics are structured as
intrinsic subroutines with optional arguments.

\def\varray{\verb+ARRAY+}
{\bf Result type and shape:}  Results are returned in optional arguments,
all of which have {\tt INTENT OUT}.   All logical results are
default logical scalars.   All integer results are default integer
scalars or rank one default integer arrays. 
Where a result is an integer array, it is a rank one,
default integer array of size equal to the rank of \varray\ except as
noted below.


\subsection{Alignment Inquiry Subroutine}

\CODE
CALL HPF_ALIGNMENT(ARRAY,LB,UB,STRIDE,AXIS_MAP,
	IDENTITY_MAP,DYNAMIC,NUMBER_OF_COPIES)
INTEGER,DIMENSION(7) :: LB,UB,STRIDE,AXIS_MAP
INTEGER NUMBER_OF_COPIES
LOGICAL IDENTITY_MAP,DYNAMIC
\EDOC	
\begin{description}
\item[Required] ARRAY 
\item[Optional] LB,UB,STRIDE,AXIS\_MAP,
		IDENTITY\_MAP,DYNAMIC,\\ NUMBER\_OF\_COPIES
\end{description}

The \verb+HPF_ALIGNMENT+ subroutine returns information regarding
the alignment of an array to its associated template.  \verb+ARRAY+ is the
only input argument; all the remaining arguments are optional output arguments.
\begin{description}
\item[ARRAY] The array about which alignment information is requested.
{\bf New In This Draft:}
ARRAY may not be a pointer that is disassociated or an allocatable array
that is not allocated.

\item[LB] An integer array containing the template coordinate
	of the first element of \varray\ along an axis.
\item[UB] An integer array containing the template coordinate
	of the last element of \varray\ along an axis.
\item[STRIDE] An integer array containing the stride used in 
        aligning the elements of \varray\ along an axis.
\item[AXIS\_MAP] An integer array containing the template axis associated 
        with each array axis.  If an element of \verb+AXIS_MAP+ is 0, the
        corresponding array axis is a collapsed axis.
\item[IDENTITY\_MAP] A logical scalar which will be true if the template
	associated with \varray\  has a shape identical to \varray, the
	axes are mapped using the identity permutation, and the strides are all
        positive.
\item[DYNAMIC] A logical scalar which will be true if \varray\ has the 
	DYNAMIC attribute.
\item[NUMBER\_OF\_COPIES] An integer scalar equal to 
        the product of the extents of all template axes over
        which \varray\ has been replicated.
	For example, for a non-replicated array this value is 1.
\end{description}
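As an illustration of the LB, UB, and STRIDE outputs, here is a small Python model (hypothetical, not HPF code; the function name is invented for illustration) of how one array axis aligned by a linear subscript $m*I + n$ maps to template coordinates:

```python
# Hypothetical Python model (not part of HPF) of the LB, UB, and STRIDE
# values HPF_ALIGNMENT would report for one array axis whose index I is
# aligned to template coordinate m*I + n.

def alignment_axis(array_lb, array_ub, m, n):
    """Return (lb, ub, stride) in template coordinates for an array
    axis with bounds array_lb:array_ub aligned via m*I + n."""
    lb = m * array_lb + n   # template coordinate of the first element
    ub = m * array_ub + n   # template coordinate of the last element
    return lb, ub, m        # the stride along the template axis is m

# ALIGN A(I,...) WITH T(1+3*I,...) for an axis with bounds 1:10 would
# give LB = 4, UB = 31, STRIDE = 3 (cf. the Examples subsection).
```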

\subsection{Template Inquiry Subroutine}

\CODE
CALL HPF_TEMPLATE(ARRAY,TEMPLATE_RANK,LB,UB,AXIS_TYPE,
	AXIS_INFO,NUMBER_ALIGNED,DYNAMIC)
INTEGER,DIMENSION(MAX_TEMPLATE_RANK) :: LB,UB,AXIS_INFO
CHARACTER*(*) AXIS_TYPE(MAX_TEMPLATE_RANK)
INTEGER NUMBER_ALIGNED,TEMPLATE_RANK
LOGICAL DYNAMIC
\EDOC	

\begin{description}
\item[Required] ARRAY 
\item[Optional] LB,UB,AXIS\_TYPE,AXIS\_INFO,
		NUMBER\_ALIGNED,\\ TEMPLATE\_RANK,DYNAMIC
\end{description}
The \verb+HPF_TEMPLATE+ subroutine returns information regarding
the template associated with an array. The main difference between
\verb+HPF_TEMPLATE+ and \verb+HPF_ALIGNMENT+  is that the
former returns information concerning the array from the template's point
view, while the latter returns information from the array's point of
view. \varray\ is the
only input argument; all the remaining arguments are optional output
arguments.   Array outputs are rank one and of size equal to the rank of
the template, which is returned in TEMPLATE\_RANK.

\begin{description}
\item[ARRAY] The array about which template information is requested.
{\bf New In This Draft:}
ARRAY may not be a pointer that is disassociated or an allocatable array
that is not allocated.
\item[LB] An integer array containing the declared template lower bound 
	for each axis.
\item[UB] An integer array containing the declared template upper bound 
	for each axis.
\item[AXIS\_TYPE] A rank one array of type default character, of length
        at least 10,
        that returns information about each
	axis of the template.  The following values are defined by HPF
	(implementations may define other values):
	
	\begin{description}
	\item['NORMAL'] The axis has an axis of \varray\ aligned to it.
		\verb+AXIS_INFO+ contains the axis of \varray\ aligned
		with the axis of the template.
	\item['SINGLE'] \varray\  is aligned with one coordinate of
		the template axis. \verb+AXIS_INFO+ contains the coordinate
		to which \varray\ is aligned.
	\item['REPLICATED'] \varray\  is replicated along this template axis.
		\verb+AXIS_INFO+ contains the number of copies of
		\varray\ along the axis. This is an implementation-specific
		quantity.
	\end{description}

\item[AXIS\_INFO] See the description of \verb+AXIS_TYPE+ above.

         Example:
                               \CODE
       REAL A(4, 20)
CHPF$  TEMPLATE T(30, 150, 8, 200) 
CHPF$  ALIGN A(I,J) WITH T(J+5, 100, 10-2*I, *)
							\EDOC

         then 

							\CODE
AXIS_TYPE = (/'NORMAL', 'SINGLE', 'NORMAL', 'REPLICATED'/) 
							\EDOC
         and
							\CODE
AXIS_INFO = (/2, 100, 1, 200/)
							\EDOC

\item[NUMBER\_ALIGNED] An integer scalar giving the 
        total number of arrays aligned to the template. 
	This is the number of arrays which will be moved when the template
	is redistributed.
\item[DYNAMIC] A logical scalar which will be true if the
	template is redistributable.
\item[TEMPLATE\_RANK] An integer scalar giving the 
        number of axes in the template. This can be
	different than the number of \varray\  axes, due to collapsing and
	replicating.
\end{description}
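The AXIS\_TYPE and AXIS\_INFO classification above can be sketched in Python (a hypothetical model, not HPF code; the function and dictionary names are invented): each template axis is either matched by an array axis, pinned to a single coordinate, or replicated, and the sketch follows the example above in reporting the full axis extent as the (implementation-specific) copy count for replicated axes.

```python
# Hypothetical Python sketch (not HPF code) of the AXIS_TYPE/AXIS_INFO
# classification.  `align_map` maps a template axis (0-based) to either
# ('NORMAL', array_axis) or ('SINGLE', coordinate); template axes not
# mentioned are replicated, reported here with the axis extent as the
# copy count (this quantity is implementation-specific).

def classify_template_axes(template_shape, align_map):
    axis_type, axis_info = [], []
    for t_axis, extent in enumerate(template_shape):
        kind, info = align_map.get(t_axis, ('REPLICATED', extent))
        axis_type.append(kind)
        axis_info.append(info)
    return axis_type, axis_info

# The example above: A(4,20) aligned to TEMPLATE T(30, 150, 8, 200).
types, info = classify_template_axes(
    (30, 150, 8, 200),
    {0: ('NORMAL', 2),    # template axis 1 carries array axis 2 (J)
     1: ('SINGLE', 100),  # template axis 2 pinned at coordinate 100
     2: ('NORMAL', 1)})   # template axis 3 carries array axis 1 (I)
# types == ['NORMAL', 'SINGLE', 'NORMAL', 'REPLICATED']
# info  == [2, 100, 1, 200]
```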


\subsection{Distribution Inquiry Subroutine}
\CODE
CALL HPF_DISTRIBUTION (ARRAY,AXIS_TYPE,AXIS_INFO,
	PROCESSORS_SHAPE,PROCESSORS_RANK)
INTEGER AXIS_INFO(MAX_TEMPLATE_RANK),PROCESSORS_RANK
CHARACTER*(*) AXIS_TYPE(MAX_TEMPLATE_RANK)
INTEGER PROCESSORS_SHAPE(MAX_PROCESSORS_RANK)
\EDOC

\begin{description}
\item[Required] ARRAY 
\item[Optional] AXIS\_TYPE,AXIS\_INFO,
		PROCESSORS\_SHAPE,PROCESSORS\_RANK
\end{description}
The \verb+HPF_DISTRIBUTION+ subroutine returns information regarding
the distribution of the template associated with an array.
\varray\ is the only input argument; all the remaining arguments 
are optional output arguments.

\begin{description}
\item[ARRAY] The array about whose template distribution 
	information is requested.
{\bf New In This Draft:}
ARRAY may not be a pointer that is disassociated or an allocatable array
that is not allocated.
\item[AXIS\_TYPE] A rank one array of type default character and of
        length at least 9, and size equal to the rank of the template
        of \varray\ (which is returned by \verb+HPF_TEMPLATE+ in
        TEMPLATE\_RANK),
        that returns information about the 
	distribution of each axis of the template. 
	 The following values are defined by HPF
	(implementations may define other values):
	
	\begin{description}
	\item['BLOCK'] The axis is distributed BLOCK.
		\verb+AXIS_INFO+ contains the block size.
	\item['CYCLIC'] The axis is distributed CYCLIC.
		\verb+AXIS_INFO+ contains the block size.
	\item['COLLAPSED'] The axis is collapsed (distributed with the *
		specification).
	\end{description}	
\item[AXIS\_INFO] A rank one integer array.  
        See the description of \verb+AXIS_TYPE+ above.
\item[PROCESSORS\_RANK] An integer scalar giving 
        the rank of the processor arrangement onto which
        the template associated with \varray\ is distributed.
\item[PROCESSORS\_SHAPE] An array of type default integer and of size
	equal to the value returned in PROCESSORS\_RANK, giving the
	shape of the processor arrangement onto which \varray\ is
	distributed.

\end{description}
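For the BLOCK case, the block size a compiler would typically report in AXIS\_INFO can be sketched in Python (a hypothetical model; as noted in the Examples subsection, actual block sizes are implementation-dependent, and the function name is invented):

```python
import math

# Hypothetical Python sketch of the AXIS_INFO block size for a BLOCK
# distribution: the conventional choice is ceiling(extent / nprocs).
# Actual block sizes are implementation-dependent.

def block_size(template_extent, nprocs):
    return math.ceil(template_extent / nprocs)

# DISTRIBUTE T(BLOCK,BLOCK) ONTO PROCS(4,2) with TEMPLATE T(40,20)
# would typically report block sizes 10 and 10; a CYCLIC axis is the
# special case of block size 1.
```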

\subsection{Examples}
Consider the declarations below:

\CODE
      DIMENSION A(10,10),B(20,30),C(20,40,10),D(40)
CHPF$ TEMPLATE T(40,20)
CHPF$ DYNAMIC A
CHPF$ ALIGN A(I,:) WITH T(1+3*I,2:20:2)
CHPF$ ALIGN C(I,*,J) WITH T(J,21-I)
CHPF$ ALIGN D(I) WITH T(I,4)
CHPF$ PROCESSORS PROCS(4,2)
CHPF$ DISTRIBUTE T(BLOCK,BLOCK) ONTO PROCS
CHPF$ DISTRIBUTE B(CYCLIC,BLOCK) ONTO PROCS
\EDOC	

The results of \verb+HPF_ALIGNMENT+ will be:
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
                   & A         & B         & C      \\ \hline \hline
LB                 & 4,2,...   & 1,1,...   & 1,N/A,1,...   \\ \hline
UB                 & 31,20,... & 20,30,... & 20,N/A,10,... \\ \hline
STRIDE             & 3,2,...   & 1,1,...   & -1,N/A,1,...  \\ \hline
AXIS\_MAP          & 1,2,...   & 1,2,...   & 2,0,1,...     \\ \hline
IDENTITY\_MAP      & .FALSE.   & .TRUE.    & .FALSE.       \\ \hline
DYNAMIC            & .TRUE.    & .FALSE.   & .FALSE.       \\ \hline
NUMBER\_OF\_COPIES & 1         & 1         & 1             \\ \hline
\end{tabular}\end{center}

and the result of  \verb+HPF_TEMPLATE+ will be

\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
           & A         & C         & D          \\ \hline \hline
LB         & 1,1,...   & 1,1,...   & 1,1,...    \\ \hline
UB         & 40,20,... & 40,20,... & 40,20,...  \\ \hline
AXIS\_TYPE & 'NORMAL','NORMAL',... &
	 'NORMAL','NORMAL',... & 'NORMAL','SINGLE',... \\ \hline
AXIS\_INFO & 1,2,...   & 3,1,...   & 1,4,...    \\ \hline
NUM.AL.    & 3         & 3         & 3          \\ \hline
TEMP. RANK & 2         & 2         & 2          \\ \hline
DYNAMIC    & .FALSE.   & .FALSE.   & .FALSE.    \\ \hline
\end{tabular}\end{center}

Finally  \verb+HPF_DISTRIBUTION+ will produce
\begin{center}
\begin{tabular}{|l|c|c|}
\hline
                  & A         & B  \\ \hline \hline
AXIS\_TYPE        & 'BLOCK','BLOCK',... &
	 'CYCLIC','BLOCK',...            \\ \hline
AXIS\_INFO        & 10,10,... & 1,15,...  \\ \hline
PROCESSORS\_SHAPE & 4,2,...   & 4,2,...   \\ \hline
PROCESSORS\_RANK  & 2         & 2         \\ \hline
\end{tabular}\end{center}

Note that the values of the block sizes (in \verb+AXIS_INFO+) are not
specified by HPF, but may be implementation-dependent.


From shapiro@think.com  Tue Dec  1 09:13:15 1992
Received: from mail.think.com by cs.rice.edu (AA08238); Tue, 1 Dec 92 09:13:15 CST
Received: from Django.Think.COM by mail.think.com; Tue, 1 Dec 92 10:13:06 -0500
From: Richard Shapiro <shapiro@think.com>
Received: by django.think.com (4.1/Think-1.2)
	id AA05018; Tue, 1 Dec 92 10:13:04 EST
Date: Tue, 1 Dec 92 10:13:04 EST
Message-Id: <9212011513.AA05018@django.think.com>
To: schreibr@riacs.edu
Cc: loveman@mpsg.enet.dec.com, hpff-intrinsics@cs.rice.edu
In-Reply-To: Rob Schreiber's message of Mon, 30 Nov 92 14:57:33 PST <9211302257.AA01165@thor.riacs.edu>
Subject: Intrinsics

   Date: Mon, 30 Nov 92 14:57:33 PST
   From: Rob Schreiber <schreibr@riacs.edu>



   Dave,

   I don't object to any of your edits.
   I cannot make the last table fit the page.   So to hell with the 
   "Overfull hbox".

   I have added text to deal with inquiry about an unallocated array
   (copied from the F90 SIZE intrinsic).  I also clarified the type and shape
   of the results returned by the mapping inquiries HPF_ALIGNMENT, etc..

In order to make the table fit try this:

   The results of \verb+HPF_ALIGNMENT+ will be:
   \begin{center}
   \small      % added to make table fit
   \begin{tabular}{|l|c|c|c|}
   \hline
		      & A         & B         & C      \\ \hline \hline
   LB                 & 4,2,...   & 1,1,...   & 1,N/A,1,...   \\ \hline
   UB                 & 31,20,... & 20,30,... & 20,N/A,10,... \\ \hline
   STRIDE             & 3,2,...   & 1,1,...   & -1,N/A,1,...  \\ \hline
   AXIS\_MAP          & 1,2,...   & 1,2,...   & 2,0,1,...     \\ \hline
   IDENTITY\_MAP      & .FALSE.   & .TRUE.    & .FALSE.       \\ \hline
   DYNAMIC            & .TRUE.    & .FALSE.   & .FALSE.       \\ \hline
   NUMBER\_OF\_COPIES & 1         & 1         & 1             \\ \hline
   \end{tabular}\end{center}

   and the result of  \verb+HPF_TEMPLATE+ will be

   \begin{center}
   \small      % added to make table fit
   \begin{tabular}{|l|c|c|c|}
   \hline
	      & A         & C         & D          \\ \hline \hline
   LB         & 1,1,...   & 1,1,...   & 1,1,...    \\ \hline
   UB         & 40,20,... & 40,20,... & 40,20,...  \\ \hline
   AXIS\_TYPE & 'NORMAL','NORMAL',... &
	    'NORMAL','NORMAL',... & 'NORMAL','SINGLE',... \\ \hline
   AXIS\_INFO & 1,2,...   & 3,1,...   & 1,4,...    \\ \hline
   NUM.AL.    & 3         & 3         & 3          \\ \hline
   TEMP. RANK & 2         & 2         & 2          \\ \hline
   DYNAMIC    & .FALSE.   & .FALSE.   & .FALSE.    \\ \hline
   \end{tabular}\end{center}

   Finally  \verb+HPF_DISTRIBUTION+ will produce
   \begin{center}
   \small      % added to make table fit
   \begin{tabular}{|l|c|c|}
   \hline
		     & A         & B  \\ \hline \hline
   AXIS\_TYPE        & 'BLOCK','BLOCK',... &
	    'CYCLIC','BLOCK',...            \\ \hline
   AXIS\_INFO        & 10,10,... & 1,15,...  \\ \hline
   PROCESSORS\_SHAPE & 4,2,...   & 4,2,...   \\ \hline
   PROCESSORS\_RANK  & 2         & 2         \\ \hline
   \end{tabular}\end{center}

   Note that the values of the block sizes (in \verb+AXIS_INFO+) are not
   specified by HPF, but may be implementation-dependent.



From zrlp09@trc.amoco.com  Thu Dec 17 11:20:40 1992
Received: from noc.msc.edu by cs.rice.edu (AA29432); Thu, 17 Dec 92 11:20:40 CST
Received: from uc.msc.edu by noc.msc.edu (5.65/MSC/v3.0.1(920324))
	id AA01017; Thu, 17 Dec 92 11:20:38 -0600
Received: from [149.180.11.2] by uc.msc.edu (5.65/MSC/v3.0z(901212))
	id AA07634; Thu, 17 Dec 92 11:20:37 -0600
Received: from trc.amoco.com (apctrc.trc.amoco.com) by netserv2 (4.1/SMI-4.0)
	id AA11787; Thu, 17 Dec 92 11:20:36 CST
Received: from backus.trc.amoco.com by trc.amoco.com (4.1/SMI-4.1)
	id AA00510; Thu, 17 Dec 92 11:20:35 CST
Received: from mahler.trc.amoco.com by backus.trc.amoco.com (4.1/SMI-4.1)
	id AA29020; Thu, 17 Dec 92 11:20:31 CST
Received: from localhost by mahler.trc.amoco.com (4.1/SMI-4.1)
	id AA20958; Thu, 17 Dec 92 11:20:32 CST
Message-Id: <9212171720.AA20958@mahler.trc.amoco.com>
To: Rob Schreiber <schreibr@riacs.edu>
Cc: hpff-distribute@cs.rice.edu, hpff-intrinsics@cs.rice.edu
Subject: Re: Last chance 
In-Reply-To: Your message of Wed, 16 Dec 92 17:53:03 -0800.
             <9212170153.AA06349@thor.riacs.edu> 
Date: Thu, 17 Dec 92 11:20:31 -0600
From: "Rex Page" <zrlp09@trc.amoco.com>

Rob,

COPY_SCATTER:

There is a problem with the COPY_SCATTER definition that slipped
through all of our careful specification based on "mathematical
equivalence" and the like.  The problem is that the definition of
COPY_SCATTER relies on the expression   s1 @ s2 @ ... @ sm @ b
and the definition COPY_operation(x,y)=x.  Because COPY_operation is
not commutative, the expression, without parentheses to specify the
order of application of the operations, is not well defined.

Here is my suggestion for fixing the problem:
Just after the definition of COPY_operation (i.e., just before the
paragraph beginning "Thus an element of the result delivered by
COPY_SCATTER ..."), insert the following sentence:
  When COPY_SCATTER combines source elements s1, s2, ..., sm with
  base element b, the processor may apply the operations in the
  expression s1 @ s2 @ ... @ sm @ b in any order it chooses (where
  x @ y denotes COPY_operation(x,y)).  

Since b occurs on the right in the expression, the processor cannot
select b as the result (because COPY_operation selects its first
argument as its result).  This is the intended effect of COPY_SCATTER
when, as in the case this part of the definition covers, there is a
non-empty set of source elements to be combined with a base element.
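This argument can be checked with a small Python sketch (illustrative only, not HPF; the function names are invented): with COPY_operation selecting its first argument, any processor-chosen ordering of the sources yields a source element, never the base b, as long as b stays on the right of the expression.

```python
import itertools

# Illustrative Python sketch (not HPF) of the proposed rule: the
# processor may combine the sources in any order, but the base element
# b stays on the right of the expression s1 @ s2 @ ... @ sm @ b.

def copy_op(x, y):
    return x  # COPY_operation selects its first argument

def scatter_combine(sources, base):
    # Evaluate s1 @ s2 @ ... @ sm @ b left to right, where `sources`
    # is already in whatever order the processor chose.
    acc = sources[0]
    for v in sources[1:] + [base]:
        acc = copy_op(acc, v)
    return acc

# Over every permutation of the sources, the result is always one of
# the sources; b is never selected, because it only ever appears as a
# second argument.
results = {scatter_combine(list(p), 'b')
           for p in itertools.permutations(['s1', 's2', 's3'])}
```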


Mapping Inquiries:

The second paragraph of Section 3.1 (Motivation), which begins with
boldface "Result type, type parameters, and shape", is not really
needed because all of the information is repeated in the individual
definitions (as it should be).

If you decide to retain the paragraph, which I advise against,
I recommend using the terms "logical" and "integer" instead of
"default logical" and "default integer".  The paragraph will be
easier to read without the three occurrences of  "default",
and it will mean the same thing.  Also, the term "rank one"
need not be mentioned in both the second and third sentences; I 
recommend taking it out of the second sentence, which should then read
something like "Integer results may be scalars or arrays."  The third
sentence ("Where a result is an integer array...") refers to the
dummy argument ARRAY, which hasn't been defined at that point; change
"ARRAY" to "the array whose distribution status is being requested".

The prototype headers for the inquiry subroutines specify all
the attributes of the INTENT(OUT) arguments except their intent.
The INTENT(OUT) attribute may as well be included in the list along
with type, OPTIONAL, etc.:
   e.g., INTEGER, INTENT(OUT), DIMENSION(:), OPTIONAL:: ...

All of the array INTENT(OUT) arguments except those in HPF_ALIGNMENT
are assumed shape.  I think it would be better to be consistent and
have them all be assumed shape.  Specific-shape arrays have some
restrictions on argument-passing mechanisms, so they should generally
be avoided anyway.  The text implies that LB, UB, and STRIDE have
at least as many elements as the rank of ARRAY, so there is no real
need for a specific shape declaration.

The definitions of LB and UB, as arguments of HPF_ALIGNMENT, use the
term "template coordinate".  I think I know what this means, but I
think it would be better to stick with more universal terminology.
How about something like this for LB (and something similar for UB):
  LB  An integer array.  The first element of the ith axis of ARRAY
    is mapped to the LB(i)th template element along the axis of the
    template associated with the ith axis of ARRAY.

In the list of optional arguments of HPF_TEMPLATE, and in the list
of definitions of those arguments, the argument TEMPLATE_RANK occurs
next to last and last, respectively.  Yet, TEMPLATE_RANK is the first
optional argument in the example header.  Why not list it in the same
relative position in the other references, following the lead of the
other definitions of the inquiry functions?

In the definition of HPF_DISTRIBUTION, there is another anomaly in the
order of argument definitions:  PROCESSORS_RANK and PROCESSORS_SHAPE
are interchanged.


Rex

From chk@erato.cs.rice.edu  Tue Jan 26 22:41:40 1993
Received: from erato.cs.rice.edu by titan.cs.rice.edu (AA01848); Tue, 26 Jan 93 22:41:40 CST
Received: from localhost.cs.rice.edu by erato.cs.rice.edu (AA08481); Tue, 26 Jan 93 22:41:23 CST
Message-Id: <9301270441.AA08481@erato.cs.rice.edu>
To: hpff@erato.cs.rice.edu
Cc: hpff-core@erato.cs.rice.edu, hpff-distribute@erato.cs.rice.edu,
        hpff-forall@erato.cs.rice.edu, hpff-io@erato.cs.rice.edu,
        hpff-f90@erato.cs.rice.edu, hpff-intrinsics@erato.cs.rice.edu
Word-Of-The-Day: salariat : (n) the class of salaried workers
Subject: HPF Language Specification, version 1.0
Date: Tue, 26 Jan 93 22:41:22 -0600
From: chk@erato.cs.rice.edu


It's available!  (For sure from titan.cs.rice.edu; availability from
other sites will depend on how fast e-mail travels and how dedicated
administrators at other sites are.)  Below are the "standard"
announcement and call for comments.

Many thanks to everyone involved in producing this document, including
(but not limited to!):
	The HPFF working group.
	People who commented on version 0.4 of the spec.
	People who attended (and asked questions at) many
		presentations, including the Supercomputing '92 workshop.
	Our friendly funding agencies: DARPA, NSF, ESPRIT, and the
		employers who bankrolled most of the HPFF committee
		members.
Special thanks to David Loveman, who edited the document.

						Chuck Koelbel
						Executive Director, NSF

----------------------------------------------------------------

The most recent draft of the High Performance Fortran Language
Specification is version 1.0 Draft, dated January 25, 1993.  See
"Version History" below for a description of the changes.

How to Get the High Performance Fortran Language Specification
==============================================================

There are three ways to get a copy of the draft:

	1. Anonymous FTP: The most recent draft is available on 
	   titan.cs.rice.edu in the directory public/HPFF/draft.
	   Several files are kept there, including compressed
	   Postscript files of previous versions of the draft.  The
	   most current version of this draft is 1.0, which can be
	   retrieved as a tar file containing LaTeX source
	   (hpf-v10.tar) or in Postscript format (hpf-v10.ps);
	   both of these are also available as compressed files.
	   Several other sites also have the draft available in one or
           more formats, including think.com, ftp.gmd.de,
	   theory.tc.cornell.edu, and minerva.npac.syr.edu.

	2. Electronic mail: The most recent draft is available from
	   the Softlib server at Rice University.  This can be
	   accessed in two ways:
	     A. Send electronic mail to softlib@cs.rice.edu with "send 
		hpf-v10.ps" in the message body. The report is sent as a 
		Postscript file.
	     B. Send electronic mail to softlib@cs.rice.edu with "send 
		hpf-v10.tar.Z" in the message body. The report is
		sent as a uuencoded compressed tar file containing
		LaTeX source.
             C. Send electronic mail to netlib@ornl.gov with "send
                hpf-v10.ps from hpf" in the message body.  The report
                is sent as a Postscript file.  This site also has the
                LaTeX source of the draft; use "send index from hpf"
                to see the file names.
             D. Send electronic mail to netlib@research.att.com with
	        "send hpf-v10.ps from hpff" in the message body.  The
		report is sent as a Postscript file.
	   (In all cases, the reply is sent as several messages to
	   avoid mailer restrictions; edit the message bodies together
	   to obtain the whole file.)  The same files can be obtained
	   from David Loveman (loveman@mpsg.enet.dec.com) and Chuck
	   Koelbel (chk@cs.rice.edu), but replies will take longer
	   because real people have to answer the mail.

	3. Hardcopy: The most recent draft is available as technical report 
	   CRPC-TR 92225 from the Center for Research on Parallel
	   Computation at Rice University.  Send requests to
		Theresa Chatman
		CITI/CRPC, Box 1892
		Rice University
		Houston, TX 77251
	   There is a charge of $50.00 for this report to cover copying and 
	   mailing costs.

Disclaimers
===========

A few caveats about the HPF draft:

	A. The current version contains some material that is still
	   under active discussion.  Changes will be fairly frequent
	   until at least March 1993.  New versions will be
           announced on the HPFF mailing list and in the newsgroups
	   comp.parallel, comp.lang.misc, and comp.lang.fortran.

	B. The HPF Language Specification does not necessarily
	   represent the official view of any individual, company,
	   university, government, or other agency.

	C. Please address any questions, comments, or possible
	   inconsistencies in the draft to hpff-comments@cs.rice.edu.
	   Include the chapter number you are commenting on in the
	   "Subject:" line of the message.


Version History
===============

Version 0.1:
August 14, 1992
EXTREMELY preliminary version.  

First collection of the proposals active in the High Performance Fortran 
Forum.  Established much of the outline for later documents, and 
represented most decisions made through the July HPFF meeting.


Version 0.2:
September 9, 1992
Version discussed at the September 10-11 HPFF meeting

Changes:
General cleaning up of version 0.1.
Inclusion of most new proposals at that time.


Version 0.3:
October 12, 1992
Version discussed at the October 22-23 HPFF meeting

Changes:
Numerous minor and major changes due to discussions at the September meeting.
Added a section on "Model of Computation".
Presented alternate chapters for data distribution with and without
templates.
Added two proposals for ON clauses specifying where computation is to
be executed.
Added distribution inquiry intrinsics.
Total rewrite of I/O material, sending most previous material to the
Journal of Development.


Version 0.4:
November 6, 1992
Version to be presented at Supercomputing '92

Changes:
Numerous minor and major changes due to discussions at the October
meeting.
"Acknowledgements" section now much more accurate.
"The HPF Model" (replacing "Model of Computation") substantially
simplified and improved.
"Distribution without Templates" chapter removed.
Many proposals not adopted moved to "Journal of Development".


Version 1.0:
January 25, 1993
Draft final version

Changes:
Many changes for clarity or pedagogical reasons.
The examples in several sections have been significantly enlarged.
INHERIT (for dummy arguments) added to distribution chapter.
Pure procedures may now have dummy arguments with explicit
distributions, if those distributions are inherited from the caller.
Changed the names of the new reductions AND, OR, and EOR to IALL,
IANY, and IPARITY.
Clarified the status of the character array language to be not in the
subset, and as a result, removed the character array intrinsics.
Only very restricted forms of alignment subscript expressions (of the
form \(m*i + n\) where \(m\) and \(n\) are integer expressions) are
part of the subset.
[Bibliography] Correctly spelled ``Mehrotra'' and ``Gerndt''.



----------------------------------------------------------------

REQUEST FOR PUBLIC COMMENT ON HIGH PERFORMANCE FORTRAN

To: The High Performance Computing Community

Invitation:

The High Performance Fortran Forum (HPFF), with participation from over 40 
organizations, has been meeting since January 1992 to discuss and 
define a set of extensions to Fortran called High Performance Fortran 
(HPF). Our goal is to address the problems of writing data parallel 
programs for architectures where the distribution of data impacts 
performance. While we hope that the HPF extensions become widely available, 
HPFF is not sanctioned or supported by any official standards organization. 
At this time, HPFF invites general public review comments on the initial 
version of the language draft. 

The HPF language specification, version 1.0 draft, is now available. This 
document contains all the technical features proposed for the language. 
We plan to make minor revisions to correct errors or clarify
ambiguities in March 1993, at which time we will issue a final draft;
however, we expect that there will be few (if any) major technical
changes from this draft.

HPFF invites comments on the technical content of HPF, as well as on the 
editorial presentation in the document.  To facilitate incorporation
of comments into the final document, we ask that comments be sent
before March 1, 1993.


How to Get the Documents:

Electronic copies of the HPF language specification are available from 
numerous sources. 

    Anonymous FTP sources:      Directory:
    titan.cs.rice.edu           public/HPFF/draft
    think.com                   public/HPFF
    ftp.gmd.de                  hpf-europe
    theory.tc.cornell.edu       pub
    minerva.npac.syr.edu        public

    Email sources:              First line of message:
    netlib@research.att.com     send hpf-v10.ps from hpff
    netlib@ornl.gov             send hpf-v10.ps from hpf
    softlib@cs.rice.edu         send hpf-v10.ps    

The following formats are available (xx will be 04 or 10, depending on 
version). Note that not all formats are available from all sources. 
    hpf-v10.dvi                 DVI file
    hpf-v10.ps                  Postscript
    hpf-v10.ps.Z                Compressed Postscript
    hpf-v10.tar                 Tar file of LaTeX version
    hpf-v10.tar.Z               Compressed tar file

For more detailed instructions, send email to hpff-info@cs.rice.edu. This 
will return a message with expanded detail about accessing the above 
document sources, as well as other information about HPFF. 

We strongly encourage reviewers to obtain an electronic copy of the 
document. However, if electronic access is impossible the draft is also 
available in hard copy form as CRPC Technical Report #92225. This report is 
available for $50 (copying/handling fee) from: 

    Theresa Chatman
    CITI/CRPC, Box 1892
    Rice University
    Houston, TX 77251

Make checks payable to Rice University. This document will be sent surface 
mail unless additional airmail postage is included in the payment. 


How to Submit Comments:

HPFF encourages reviewers to submit comments as soon as possible, with a 
deadline of February 15 for consideration. Please do not submit comments 
for any version of the draft earlier than the 0.4 version. 

Please send comments by email to hpff-comments@cs.rice.edu. To facilitate 
the processing of comments we request that separate comment messages be 
submitted for each chapter of the document and that the chapter be clearly 
identified in the "Subject:" line of the message. Comments about general 
overall impressions of the HPF document should be labeled as Chapter 1. All 
comments on the language specification become the property of Rice 
University. 

If email access is impossible for comment responses, hard copy may be sent 
to 

    HPF Comments
    c/o Theresa Chatman
    CITI/CRPC, Box 1892
    Rice University
    Houston, TX 77251

HPFF plans to process the feedback received at a meeting in March. Best 
efforts will be made to reply to comments submitted. 


Sincerely,


Charles Koelbel
Rice University
HPFF Executive Director




From schreibr@riacs.edu  Thu Mar 25 16:47:03 1993
Received: from icarus.riacs.edu by titan.cs.rice.edu (AA29800); Thu, 25 Mar 93 16:47:03 CST
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA00540; Thu, 25 Mar 93 14:47:01 PST
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA23043; Thu, 25 Mar 93 14:46:29 PST
Message-Id: <9303252246.AA23043@thor.riacs.edu>
Date: Thu, 25 Mar 93 14:46:29 PST
From: Rob Schreiber <schreibr@riacs.edu>
To: hpff-intrinsics@cs.rice.edu
Subject: Important question



My notes from the last meeting do not indicate that 
the HPF_ALIGNMENT, HPF_TEMPLATE and HPF_DISTRIBUTION
(mapping inquiry) subroutines should be moved from
the intrinsics into the library, but Rick Swift recalls this.
And it seems correct and consistent with our
decision that DYNAMIC, INHERIT, and EXTRINSIC are not
in the HPF subset.   Since they are subroutines with
INTENT(OUT) arguments, they can't occur in specification
(or any other kind of) expressions.


Shall I move them to the library?   In the absence of
argument I will.

Rob

From chk@cs.rice.edu  Thu Mar 25 18:16:32 1993
Received: from  by titan.cs.rice.edu (AB02000); Thu, 25 Mar 93 18:16:32 CST
Message-Id: <9303260016.AB02000@titan.cs.rice.edu>
Date: Thu, 25 Mar 1993 18:16:59 -0600
To: hpff-intrinsics@cs.rice.edu, Rob Schreiber <schreibr@riacs.edu>
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: Re: Important question

My notes don't have any mention of removing the intrinsics from the subset.
 There was a discussion of moving the intrinsics/functions defined in the
HPF_LOCAL stuff out of the subset (along with HPF_LOCAL itself).

However, I agree with the reasoning that the inquiry subroutines are not
very useful without dynamic distributions or inheritance.  Moving them out
of the subset sounds good.

                                                Chuck


From @ecs.soton.ac.uk,@brewery.ecs.soton.ac.uk:jhm@ecs.southampton.ac.uk  Thu Apr 29 12:54:11 1993
Received: from sun2.nsfnet-relay.ac.uk by cs.rice.edu (AA12994); Thu, 29 Apr 93 12:54:11 CDT
Via: uk.ac.southampton.ecs; Thu, 29 Apr 1993 18:01:15 +0100
Via: brewery.ecs.soton.ac.uk; Thu, 29 Apr 93 17:53:59 BST
From: John Merlin <jhm@ecs.soton.ac.uk>
Received: from bacchus.ecs.soton.ac.uk by brewery.ecs.soton.ac.uk;
          Thu, 29 Apr 93 18:02:44 BST
Date: Thu, 29 Apr 93 18:02:48 BST
Message-Id: <4540.9304291702@bacchus.ecs.soton.ac.uk>
To: hpff-core@cs.rice.edu, hpff-distribute@cs.rice.edu,
        hpff-intrinsics@cs.rice.edu
Subject: Shouldn't mapping inquiry subroutines be functions?
Cc: jhm@ecs.soton.ac.uk, schreibr@riacs.edu

It's occurred to me that the 'mapping inquiry subroutines'
really need to be functions.

The problem is that subroutines can't be used in declarations,
which rules out nearly everything one would want to do with them.
E.g. one might want to reconstruct the template to which a
dummy with transcriptive mapping was aligned, align locals 
to this template, perhaps using expressions involving the 
bounds and strides of the dummy's alignment, or distribute 
a local variable in a way corresponding somehow to the
distribution of a transcriptively-mapped dummy.

The idea of an 'inquiry *subroutine*' seems quite odd
-- e.g. the F90 array inquiry intrinsics (SIZE, LBOUND, etc)
wouldn't be so useful if they were subroutines, would they?

Perhaps people with more experience of applications might like 
to comment?

                 Best regards,
                        John Merlin.


P.S. 
(i) A problem I remember being voiced about lots of 'small'
inquiry intrinsics, which I guess is what I'm proposing here,
is that of 'namespace pollution'.  However, I suggest this 
isn't really such a problem, as intrinsic names can be used
for other objects provided one doesn't need to use the intrinsic,
e.g. one can call a variable 'SIN' provided one doesn't need
the SIN intrinsic.

(ii) Also, the names of the mapping inquiry functions should be
*short*, so they can be used in expressions without too much pain, 
e.g.:

    SUBROUTINE s (x)
      REAL x (10)
!HPF$ DISTRIBUTE x *

      REAL a (10,10)
!HPF$ DISTRIBUTE a (DISTRIB (x), *)

-----------------------------------------------------------------------
John Merlin                                  email: jhm@ecs.soton.ac.uk
Dept. of Electronics and Computer Science,   tel:   +44 703 593368
University of Southampton,                   fax:   +44 703 593045
Southampton S09 5NH,  U.K.

From schreibr@riacs.edu  Thu Apr 29 13:25:53 1993
Received: from icarus.riacs.edu by cs.rice.edu (AA13717); Thu, 29 Apr 93 13:25:53 CDT
Received: from thor.riacs.edu by icarus.riacs.edu (4.1/2.7G)
	   id AA06868; Thu, 29 Apr 93 11:25:45 PDT
Received: by thor.riacs.edu (4.1/2.0N)
	   id AA06792; Thu, 29 Apr 93 11:25:00 PDT
Message-Id: <9304291825.AA06792@thor.riacs.edu>
Date: Thu, 29 Apr 93 11:25:00 PDT
From: Rob Schreiber <schreibr@riacs.edu>
To: jhm@ecs.soton.ac.uk
Subject: Re:  Shouldn't mapping inquiry subroutines be functions?
Cc: hpff-intrinsics@cs.rice.edu

Good points, but too late.   Anyway, one can make a user-defined
function DISTRIB and call HPF_DISTRIBUTE within it if this is essential.
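
Rob's workaround might look roughly like the sketch below.  The wrapper
name DISTRIB, and the simplified argument list shown for the mapping
inquiry subroutine (here written as HPF_DISTRIBUTION with a single
AXIS_INFO-style result argument), are assumptions for illustration, not
the definitive library interface:

      INTEGER FUNCTION DISTRIB (a, dim)
        ! Wrap the mapping inquiry subroutine in a user-defined
        ! function so its result is usable in ordinary expressions.
        REAL a (:)
        INTEGER dim
        INTEGER axis_info (SIZE (SHAPE (a)))
        ! Simplified call for illustration; the actual inquiry
        ! subroutine takes several optional arguments.
        CALL HPF_DISTRIBUTION (a, AXIS_INFO = axis_info)
        DISTRIB = axis_info (dim)
      END FUNCTION DISTRIB

As Guy's follow-up observes, though, a non-intrinsic function like this
still could not appear in a specification expression under the
Fortran 90 rules.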


From gls@think.com  Thu Apr 29 14:59:24 1993
Received: from mail.think.com by cs.rice.edu (AA15943); Thu, 29 Apr 93 14:59:24 CDT
Received: from Ukko.Think.COM by mail.think.com; Thu, 29 Apr 93 15:59:14 -0400
From: Guy Steele <gls@think.com>
Received: by ukko.think.com (4.1/Think-1.2)
	id AA01026; Thu, 29 Apr 93 15:59:13 EDT
Date: Thu, 29 Apr 93 15:59:13 EDT
Message-Id: <9304291959.AA01026@ukko.think.com>
To: schreibr@riacs.edu
Cc: jhm@ecs.soton.ac.uk, hpff-intrinsics@cs.rice.edu
In-Reply-To: Rob Schreiber's message of Thu, 29 Apr 93 11:25:00 PDT <9304291825.AA06792@thor.riacs.edu>
Subject:  Shouldn't mapping inquiry subroutines be functions?

   Date: Thu, 29 Apr 93 11:25:00 PDT
   From: Rob Schreiber <schreibr@riacs.edu>

   Good points, but too late.   Anyway, one can make a user-defined
   function DISTRIB and call HPF_DISTRIBUTE within it if this is essential.

But then it would not be usable in a specification-expr?

From chk@cs.rice.edu  Thu Apr 29 15:38:40 1993
Received: from [128.42.1.227] by cs.rice.edu (AA16830); Thu, 29 Apr 93 15:38:40 CDT
Message-Id: <9304292038.AA16830@cs.rice.edu>
Date: Thu, 29 Apr 1993 15:43:43 -0600
To: hpff-core@cs.rice.edu, hpff-distribute@cs.rice.edu,
        hpff-intrinsics@cs.rice.edu, John Merlin <jhm@ecs.soton.ac.uk>
From: chk@cs.rice.edu (Chuck Koelbel)
Subject: Re: Shouldn't mapping inquiry subroutines be functions?
Cc: jhm@ecs.soton.ac.uk, schreibr@riacs.edu

At 18:02 4/29/93 -0800, John Merlin wrote:
>It's occurred to me that the 'mapping inquiry subroutines'
>really need to be functions.

I agree with Rob - regardless of technical merits, it's too late to make a
change this big (exercise for readers: estimate number of pages that need
to change).

                                                Chuck


From gls@think.com  Fri Apr 30 10:45:09 1993
Received: from mail.think.com by cs.rice.edu (AA28884); Fri, 30 Apr 93 10:45:09 CDT
Received: from Ukko.Think.COM by mail.think.com; Fri, 30 Apr 93 11:45:05 -0400
From: Guy Steele <gls@think.com>
Received: by ukko.think.com (4.1/Think-1.2)
	id AA16434; Fri, 30 Apr 93 11:45:07 EDT
Date: Fri, 30 Apr 93 11:45:07 EDT
Message-Id: <9304301545.AA16434@ukko.think.com>
To: hpff-intrinsics@cs.rice.edu
Subject: [MAILER-DAEMON: Returned mail: Host unknown]

Date: Fri, 30 Apr 93 11:38:25 -0400
From: Mail Delivery Subsystem <MAILER-DAEMON>
Subject: Returned mail: Host unknown
To: <gls>

   ----- Transcript of session follows -----
550 <hpff-intrinsics@e.rice.cs>... Host unknown

   ----- Unsent message follows -----
Return-Path: <gls@Think.COM>
Received: from Ukko.Think.COM by mail.think.com; Fri, 30 Apr 93 11:38:25 -0400
From: Guy Steele <gls@Think.COM>
Received: by ukko.think.com (4.1/Think-1.2)
	id AA16366; Fri, 30 Apr 93 11:38:27 EDT
Date: Fri, 30 Apr 93 11:38:27 EDT
Message-Id: <9304301538.AA16366@ukko.think.com>
To: glossa@cix.compulink.co.uk
Cc: chk@cs.rice.edu, hpff-core@cs.rice.edu, hpff-intrinsics@e.rice.cs,
        jhm@ecs.soton.ac.uk
In-Reply-To: Glossa's message of Fri, 30 Apr 93 16:15 GMT0BST-1 <memo.178877@cix.compulink.co.uk>
Subject: Shouldn't mapping inquiry subroutines be functions?

   Date: Fri, 30 Apr 93 16:15 GMT0BST-1
   From: Glossa <glossa@cix.compulink.co.uk>


   In-Reply-To: <9304292038.AA16830@cs.rice.edu>
   It's a great thing to have a fast timetable for deciding on
   the main issues in a language design BUT it is never too late
   to get things right for usability. John's pointing this out at this 
   stage shows just how superficial the review has been 
   because of the haste.  This cannot be a major change for implementors 
    - hence the choice should be made on whether 
   the facilities for declaration are adequate or not. 

   Personally I thought that Chap 3 was appalling in a public document
   but as I didn't have time to do a detailed review I did nothing.
   I wonder how many other people did the same?  

But last-minute pot-shots are not helpful either.

It is not too late.  *Please* tell me what you find appalling.
If it is a matter of presentational style rather than technical
content, it is not too late to consider changes.

   A lot can be learned by reading Robin Milner's remarks on the 
   design of ML in this year's ACM Turing lecture. 

Alas, I missed the lecture itself, and it will not be published
in time.  However, I can state unequivocally that Milner's
"A Proposal for Standard ML" (1984) was a model of clarity
and concision (I was responsible for having it published
in the proceedings of the 1984 ACM Lisp Conference) and I look
forward to reading his lecture.


