---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF. Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

I am attaching a revised proposal for task parallelism. There are some
minor corrections and clarifications, a few words about SMPs, and I have
written up the scheme Rob Schreiber proposed as a separate section.

I will not be able to attend the next meeting because of an ARPA site
visit, but I will be in email contact throughout. If you send me any
comments relatively soon, I can prepare a revised proposal if needed.

jaspal

------------------------------------------------------------------------
                 PROPOSAL FOR TASK PARALLELISM IN HPF
            [Contact: Jaspal Subhlok (jass@cs.cmu.edu)]
------------------------------------------------------------------------

Functionality assumed to be available from related features:

1) A way to group processors into subgroups P1, P2, P3, and a way to
   attach and distribute variables onto subgroups, i.e. a1, a2, a3 can be
   mapped onto subgroups P1, P2, P3. (For SMPs, explicit distribution of
   variables to subgroups is not necessary and may not mean anything;
   however, a way to attach variable names to subgroups is still needed.)

2) An ON construct that directs execution of a block of code on groups of
   processors P1, P2, P3, etc.

-------------------------------------------------------------------------

1. GENERAL IDEA AND MODEL:

Task parallelism is expressed by mapping different data objects onto
different subgroups of processors and specifying, with an ON directive,
that blocks of code be executed on named subgroups of processors. Code
executing on a processor subgroup inside a designated ``task region''
normally reads and writes only the variables that are mapped to that
subgroup.
Code inside a task region that is not directed to execute ON a subgroup
(at least conceptually) executes on ALL processors and has unrestricted
access to all variables. Data is exchanged between subgroups by copying
the variables of one subgroup to the variables of another subgroup in the
ALL code.

A subgroup is allowed access to variables not mapped to it if that would
not cause a data dependence. A sufficient condition for ``no dependences''
is that such accesses are only to variables that are ``read only'' in the
task region, or ``read and written'' only by code ON a single subgroup.
The proposal essentially offers a way for the programmer to tell the
compiler that these rules are followed in designated code regions.

2. PROPOSAL:

A ``task region'' is a single-entry, single-exit region delimited by (say)
TASK REGION .... END TASK REGION. A task region can have blocks of code
that are directed to execute ON processor subgroups. All other code
executes on all available processors, referred to as ALL.

The following restrictions must hold for the code inside a task region:
[This is the core of the proposal]

A code block executing on ALL processors has unrestricted access to all
variables. A code block directed to execute ON a subgroup P has
unrestricted access to any variable mapped to P. A code block directed to
execute ON a subgroup P can access a variable not mapped to P only if one
of the following constraints holds for the entire code in the task region:

  a) The variable is ``read only''; OR
  b) The variable is accessed only in code directed to execute ON P.

(Variable in this context means any addressable location.)

An I/O operation in a code section directed to execute ON a subgroup may
not ``interfere'' with an I/O operation in a code section not explicitly
directed to execute on that subgroup. The interference of I/O operations
is detailed in Section 4.4 (INDEPENDENT).
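[Not part of the proposal: a minimal Python sketch of how a tool could
check the access restrictions above against a recorded trace of variable
accesses. The function name, the trace format, and the owner map are all
invented for illustration; a real compiler would derive this information
from the program, not from a trace.]

```python
def check_task_region(owner, accesses):
    """Check the task-region access rules over a trace of accesses.

    owner    : dict mapping variable name -> subgroup it is mapped to.
    accesses : list of (group, variable, mode) tuples, where group is
               'ALL' or a subgroup name and mode is 'r' or 'w'.
    Returns the list of (group, variable) pairs that violate the rules.
    """
    violations = []
    for g, v, m in accesses:
        # ALL code, and code ON the owning subgroup, have unrestricted access.
        if g == 'ALL' or owner.get(v) == g:
            continue
        # Otherwise the access is legal only if, over the whole region:
        #   a) the variable is read-only, OR
        #   b) the variable is accessed only in code ON this subgroup.
        read_only = all(md == 'r' for _, vr, md in accesses if vr == v)
        only_on_g = all(gr == g for gr, vr, _ in accesses if vr == v)
        if not (read_only or only_on_g):
            violations.append((g, v))
    return violations
```

For example, a subgroup P2 reading a variable that is mapped to P1 and
written ON P1 violates the rules, while two subgroups both reading a
variable that is read-only in the region does not.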
For a subroutine call inside an ON block, ``all available processors''
means the processors in the corresponding subgroup. This is the number
used for mapping the parameters of the subroutine. [This part will become
more specific after the syntax etc. of creating subgroups is decided.
There should probably be a system inquiry function for the number of
processors in the current subgroup, if NUMBER_OF_PROCESSORS() is supposed
to return the total number of processors for the program.]

3. COMPILATION/EXECUTION MODEL:

3.1 Basic Execution:

The execution model for a subgroup is to unconditionally execute code ON
it, unconditionally skip code ON others, and participate in the execution
of common code (on ALL processors) as normal data parallel code. An
operation in ALL involving a set of variables starts only when all
processors of the subgroups owning those variables reach that point of
execution. This is the basic execution model for shared and distributed
memory machines. [The access restrictions guarantee that the results will
be consistent with pure data parallel execution. A processor group cannot
be ``invisibly'' writing to a location being accessed by ALL or another
processor group, and vice versa.]

3.2 Variable Access:

We state ``one'' model for accessing variables in a task region for a
distributed memory machine. (This is important for building an efficient
compilation scheme, although not really a part of the execution model.)
Accesses to variables owned by other processors are cooperative, i.e. the
owner sends the value and the user receives it, with one exception: when
code ON a subgroup has to access a variable not mapped to it, it uses a
remote fetch/deposit. (It can also cache remote locations locally in the
subgroup for the duration of the execution of the task region, since
computation not ON that subgroup cannot access them.)

3.3 EXAMPLE: 2DFFT

Sequential:

      real, dimension(n,n) :: a1, a2
      do while (.true.)
         read (unit=iu, end=100) a1
         call rowfft(a1)
         a2 = a1
         call colfft(a2)
         write (unit=ou) a2
         cycle
 100     continue
         exit
      enddo

Pipelined Data/Task Parallel HPF:

      real, dimension(n,n) :: a1, a2
      logical done1
!hpf$ disjoint processor groups P1, P2  (Syntax TBA)
!hpf$ distribute a1(block,*) onto P1
!hpf$ distribute a2(*,block) onto P2
!hpf$ distribute done1 onto P1

!hpf$ TASK REGION
      done1 = .false.
      do while (.true.)
!hpf$ ON HOME(P1) BLOCK
         read (unit=iu, end=100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
!hpf$ END BLOCK
         if (done1) exit
         a2 = a1
!hpf$ ON HOME(P2) BLOCK
         call colfft(a2)
         write (unit=ou) a2
!hpf$ END BLOCK
      enddo
!hpf$ END TASK REGION

After the task region is compiled, the data parallel code on the two
processor groups might look something like this.

Processor group P1:

      real, dimension(n,n) :: a1
!hpf$ distribute a1(block,*)
      logical done1

      done1 = .false.
      do while (.true.)
         read (unit=iu, end=100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
         _send(done1, P2)
         if (done1) exit
         _send(a1, P2)
      enddo

Processor group P2:

      real, dimension(n,n) :: a2
!hpf$ distribute a2(*,block)
      logical local_done1

      do while (.true.)
         _receive(local_done1, P1)
         if (local_done1) exit
         _receive(a2, P1)
         call colfft(a2)
         write (unit=ou) a2
      enddo

4. AN ALTERNATE MODEL:

(Rob Schreiber proposed this at the last meeting.)

A related but different model for task parallelism is as follows:

1) All code in a task region is directed to execute ON some (arbitrary)
   subgroup of processors. If no ON directive is present, ALL is assumed.

2) If a variable is ``read only'' in the region, there are no other
   restrictions. Otherwise, if two subgroups P1 and P2 access a variable
   x, P1 and P2 must have at least one common processor.

The execution model is that all processors execute the code that is mapped
to a subgroup they belong to, and skip other code.

4.1 TRADEOFFS:

This model is very clean and simple to state. It separates the control
aspect of task parallelism from the data aspect.
Other mechanisms are used for mapping data to the tasks on a distributed
memory machine for performance.

The cons are that even though the model is simple to state, it is a subtle
construct for task parallelism, and there is no clear user programming
model. The compilation model is also less clear, and it is extremely hard
for the compiler to check for any violations of the user assertions. There
is no experience in using something like this.

5. GENERAL COMMENTS:

1) The main difference from a simple parallel section/region (or from
   using an INDEPENDENT do loop to achieve parallel sections) is that the
   task regions presented here can also have code that executes on ALL
   processors. If a task region has no such code, it is similar to a
   parallel section/region. Allowing other code makes this construct more
   general and, in particular, implicitly allows pipelining. At the same
   time, the existence of ALL code can constrain parallelism due to data
   dependences, and in the worst case no task parallelism may exist.

2) No explicit control dependence constraints are required. Inside an ON
   block, any variable being read (or used for control flow) cannot be
   written by any other processor group - it can only be written by ALL
   processors, in which case the control flow from the subgroup must also
   reach that point. Outside an ON block, all processor groups execute all
   control flow (and other) statements. If a subgroup skips a control
   construct because it is not involved (i.e. its variables are not
   involved and there is no code inside the scope of the control construct
   that is directed to execute ON it) and continues to execute its next ON
   block, the constraints ensure that it cannot write to a location that
   is used for managing control flow.

3) There may be some issues with respect to extrinsic subroutine calls, to
   ensure that the basic model works in their presence. It is probably
   best to address them after the subroutine call execution model is more
   clearly defined for ON regions in general.
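[Not part of the proposal: a minimal Python sketch simulating the compiled
P1/P2 pipeline from the 2DFFT example. The _send/_receive pair is modeled
with a FIFO queue, the two processor groups with threads, and rowfft and
colfft with stand-in scalar transforms (+1 and *2) invented purely for
illustration; the real routines operate on distributed arrays.]

```python
import threading
import queue

def run_pipeline(inputs):
    """Simulate the compiled pipeline: P1 reads and transforms, then
    sends a done flag and the data to P2, which transforms and writes."""
    chan = queue.Queue()   # models the P1 -> P2 _send/_receive channel
    output = []            # models write(unit=ou) on P2

    def p1():              # processor group P1
        it = iter(inputs)
        while True:
            try:
                a1 = next(it)      # read(unit=iu, end=100) a1
                a1 = a1 + 1        # call rowfft(a1)   [stand-in: +1]
                done1 = False
            except StopIteration:  # end-of-file branch (label 100)
                done1 = True
            chan.put(done1)        # _send(done1, P2)
            if done1:
                break
            chan.put(a1)           # _send(a1, P2) -- the a2 = a1 copy

    def p2():              # processor group P2
        while True:
            if chan.get():         # _receive(local_done1, P1)
                break
            a2 = chan.get()        # _receive(a2, P1)
            output.append(a2 * 2)  # call colfft(a2)  [stand-in: *2]

    t1 = threading.Thread(target=p1)
    t2 = threading.Thread(target=p2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return output
```

As in the compiled code, the done flag is sent every iteration so that P2
learns about termination through the same channel as the data; P1 can
begin reading its next input while P2 is still transforming the previous
one, which is the pipelining the task region implicitly allows.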
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
    (un)subscribe
---------------------------------------------------------------------------