Proposal for Task Parallelism in HPF

Here is a modified task parallelism proposal. This proposal is not quite
formal, since the final wording will depend somewhat on the SUBGROUP and ON
features, which are under development. On the bright side, I think it is
relatively easy to read and better debugged than earlier proposals.

Jaspal and Bwolen

Assumptions of functionality available from related features:

1) A way to group processors into subgroups P1, P2, P3 and distribute
   variables onto them, i.e. a1, a2, a3.

2) An ON construct that directs execution on groups of processors P1, P2,
   P3, etc. for a block of code.

PROPOSAL:

A "task region" is a single-entry, single-exit region delimited by (say)
TASK REGION ... END TASK REGION. A task region can have blocks of code that
are directed to execute ON a processor subgroup. All other code executes on
all available processors, referred to as ALL.

The following restrictions must hold for the code inside a task region.

A code block directed to execute ON a subgroup must be a single-entry,
single-exit region.

A code block directed to execute ON a subgroup P may access a variable
location not mapped to P only if that variable location is:

a) accessed exclusively in code directed to execute ON P,
OR
b) not written to in the task region.

[There are no other access constraints: code executing on ALL processors
has unrestricted access to all variable locations. A code block directed to
execute ON a subgroup P has unrestricted access to all variable locations
mapped to P.]

An I/O operation in a code section directed to execute ON a subgroup may
not "interfere" with an I/O operation in a code section not explicitly
directed to execute on that subgroup. The interference of I/O operations is
detailed in Section 4.4 (INDEPENDENT).

For a subroutine call inside an ON block, "all available processors" means
the processors in the corresponding subgroup. This is the number that is
used for mapping the parameters of the subroutine.
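The access restrictions (a)/(b) above can be modeled as a simple check over
the accesses in a region. The following Python sketch is only an
illustration of the rule, not part of the proposal; the tuple encoding, the
function name, and the variable names are all assumptions made for the
example.

```python
def region_is_legal(accesses, home):
    """Check the proposal's access rule over a summarized task region.

    accesses: list of (context, variable, mode) tuples, where context is a
              subgroup name or "ALL" and mode is "read" or "write".
    home:     maps each variable to the subgroup it is distributed onto.
    """
    for ctx, var, _ in accesses:
        if ctx == "ALL" or home.get(var) == ctx:
            continue  # ALL, and the owning subgroup, have unrestricted access
        # Code ON ctx touches a variable not mapped to ctx: the variable must
        # be (a) accessed exclusively ON ctx, or (b) never written in the region.
        var_accesses = [(c, m) for (c, v, m) in accesses if v == var]
        exclusive = all(c == ctx for c, _ in var_accesses)
        never_written = all(m != "write" for _, m in var_accesses)
        if not (exclusive or never_written):
            return False
    return True

home = {"a1": "P1", "table": "P1"}

# Legal: P2 reads "table" (mapped to P1), but nothing ever writes it (case b).
ok = [("P1", "a1", "write"), ("P2", "table", "read"), ("ALL", "table", "read")]

# Illegal: P2 reads "a1" while code ON P1 writes it in the same region.
bad = [("P1", "a1", "write"), ("P2", "a1", "read")]

print(region_is_legal(ok, home))   # True
print(region_is_legal(bad, home))  # False
```

Note that in this model an access from ALL never triggers the check, which
matches the bracketed comment above: only accesses ON a subgroup to
variables not mapped to it are constrained.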
[This part will become more specific after the syntax etc. of creating
subgroups is decided. There should probably be a system inquiry function
for the number of processors in the current subgroup, if
NUMBER_OF_PROCESSORS() is supposed to return the total number of processors
for the program.]

COMPILATION/EXECUTION MODEL

The execution model for a subgroup is to unconditionally execute code ON
it, unconditionally skip code ON others, and participate in the execution
of common code (on ALL processors) as normal data parallel code.

[The access restrictions guarantee that the results will be consistent with
pure data parallel execution. A processor group cannot be "invisibly"
writing to a location being accessed by ALL or another processor group, and
vice versa.]

Following is one model for accessing variables in a task region: accesses
to variables owned by other processors are cooperative, i.e. the owner
sends the value and the user receives it, with one exception - when code ON
a subgroup has to access a variable not mapped to it, it uses a remote
fetch/deposit. (It can also cache remote locations locally in the subgroup
for the duration of the execution of the task region, since computation not
ON that subgroup cannot access them.)

EXAMPLE: 2DFFT

Sequential:

      real, dimension(n,n) :: a1, a2

      do while (.true.)
         read (unit = iu, end = 100) a1
         call rowfft(a1)
         a2 = a1
         call colfft(a2)
         write (unit = ou) a2
         cycle
 100     continue
         exit
      enddo

Pipelined Data/Task Parallel HPF:

      real, dimension(n,n) :: a1, a2
      logical done1
!hpf$ disjoint processor groups P1, P2    (Syntax TBA)
!hpf$ distribute a1(block,*) onto P1
!hpf$ distribute a2(*,block) onto P2
!hpf$ distribute done1 onto P1

!hpf$ TASK REGION
      done1 = .false.
      do while (.true.)
!hpf$    ON HOME(P1) BLOCK
         read (unit = iu, end = 100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
!hpf$    END BLOCK
         if (done1) exit
         a2 = a1
!hpf$    ON HOME(P2) BLOCK
         call colfft(a2)
         write (unit = ou) a2
!hpf$    END BLOCK
      enddo
!hpf$ END TASK REGION

The data parallel code on the two processor groups will look something like
this, after the task region is compiled.

Processor group P1:

      real, dimension(n,n) :: a1
!hpf$ distribute a1(block,*)
      logical done1

      done1 = .false.
      do while (.true.)
         read (unit = iu, end = 100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
         _send(done1, P2)
         if (done1) exit
         _send(a1, P2)
      enddo

Processor group P2:

      real, dimension(n,n) :: a2
!hpf$ distribute a2(*,block)
      logical local_done1

      do while (.true.)
         _receive(local_done1, P1)
         if (local_done1) exit
         _receive(a2, P1)
         call colfft(a2)
         write (unit = ou) a2
      enddo

COMMENTS:

1) The main difference from a simple parallel section/region (or using an
INDEPENDENT do loop to achieve parallel sections) is that the task regions
presented here can also have code that executes on ALL processors. If a
task region has no such code, it is similar to a parallel section/region.
However, allowing other code makes this construct more general, and in
particular it implicitly allows pipelining. At the same time, the existence
of code in ALL can constrain parallelism due to data dependence, and in the
worst case no task parallelism may exist.

2) No explicit control dependence constraints are required. Inside an ON
block, any variable being read (or used for control flow) cannot be written
by any other processor group - it can only be written by ALL processors, in
which case the control flow from the subgroup must also reach that point.
Outside an ON block, all processor groups execute all control flow (and
other) statements. If a subgroup skips a control construct because it is
not involved (i.e.
its variables are not involved and there is no code inside the scope of the
control construct that is directed to execute ON it) and continues to
execute its next ON block, the constraints ensure that it cannot write to a
location that is used for managing control flow.

--------------------------------------------------------------------------------

[Reply from Chuck Koelbel]

>Here is a modified task parallelism proposal. This proposal is not quite
>formal since the final wording will be somewhat dependent on SUBGROUP and
>ON features which are under development. On the bright side, I think it is
>relatively easy to read and better debugged than earlier proposals.

I tend to agree, and will make sure that it gets printed for the next
meeting. A couple of details follow...

>The following restrictions must hold for the code inside a task region.
>
>A code block directed to execute ON a subgroup must be a single entry
>single exit region.

Good news, this is part of the ON proposal already. (Albeit phrased as "no
jumps into or out of the region".)

>A code block directed to execute ON a subgroup P may access a variable
>location not mapped to P only if that variable location is:
>
>a) accessed exclusively in code directed to execute ON P.
>OR
>b) not written to in the task region.

Shouldn't the second constraint be

b) not written to by code in another ON block in the task region.

? Another way of asking this is, "Aren't the sequential regions also in the
task region?"

>COMPILATION/EXECUTION MODEL
>
>The execution model for a subgroup is to unconditionally execute code ON
>it, unconditionally skip code ON others, and participate in the execution
>of common code (on ALL processors) as normal data parallel code.
>
>[The access restrictions guarantee that the results will be consistent
>with pure data parallel execution. A processor group cannot be
>"invisibly" writing to a location being accessed by ALL or another
>processor group, and vice versa]

Right motivation.
But if processor groups can't write to locations read by ALL, how is data
going to flow from group to group?

>The data parallel code on the two processor groups will look something
>like this, after the task region is compiled.

Definitely "might" look something like this. I expect many compilers,
especially those on scalable machines, to produce SPMD code with IF
statements (checking local data, like myproc()) where the ON directives
are.

>2) No explicit control dependence constraints are required. Inside an ON
>block, any variable being read (or used for control flow) cannot be
>written by any other processor group - it can be only written by ALL
>processors, in which case the control flow from the subgroup must also
>reach that point.
>Outside an ON block, all processor groups execute all control flow (and
>other) statements. If a subgroup skips a control construct because it is
>not involved (i.e. its variables are not involved and there is no code
>inside the scope of the control construct that is directed to execute ON
>it) and continues to execute its next ON block, the constraints ensure
>that it cannot write to a location that is used for managing control flow.

In the following sequence, does P1 execute the GOTO? It doesn't involve any
data mapped to P1, nor is it directed ON P1.

!HPF$ DISTRIBUTE A1(BLOCK) ONTO P1
!HPF$ DISTRIBUTE A2(BLOCK) ONTO P2
!HPF$ TASK REGION
      IF (A2(2) > 0) THEN
!HPF$    ON HOME(P2) BLOCK
         ... do something with A2 ...
!HPF$    END BLOCK
         GOTO 100
      END IF
!HPF$ ON HOME(P1) BLOCK
      ... do something with A1 ...
!HPF$ END BLOCK
 100  CONTINUE
!HPF$ END TASK REGION

I think I know what you mean. We probably just need to be a little more
careful about the execution model; in particular, what does "normal
data-parallel model" mean?

Chuck

--------------------------------------------------------------------------------

[Reply^2 from Jaspal Subhlok]

You point out some text on access restrictions that seems to be in error.
I think the restrictions may be more subtle than they appear, but the text
is what is intended. I will add a general motivation section along the
lines of the next paragraph in the next revision - please take a look and
mail again if you think something needs to be changed.

TASKING MODEL:

Subgroups "normally" read and write only to/from the variables that are
mapped to them. The code in ALL has unrestricted access to all variables,
and data is exchanged between subgroups by copying the variables of one
subgroup to the variables of another subgroup in the ALL code.

In some circumstances, a subgroup may want to access a variable NOT mapped
to it (e.g. a common block). Such access is allowed, but it must not cause
a data dependence during the entire duration of the execution of the task
region. (Even dependences between a subgroup and ALL are not allowed, since
that would imply that when ALL is executing no other subgroup can execute,
as there is no easy way for a compiler to figure out what variables a
subgroup may access.) A sufficient condition for "no dependences" is that
such accesses should only be to variables that are "read only" in the task
region, or "read and written" only by code ON only one subgroup.

Other points are well taken, and I will add corrections/clarifications in
the next revision. Yes, the compiled code is just one way to do it, and is
meant only to illustrate the task constructs.

Every processor executes control constructs in ALL unless the compiler can
determine that this is not necessary. In general, a GOTO is certainly
executed by everybody.

The "normal data-parallel model" is supposed to mean that execution follows
the normal HPF semantics (some say no such thing exists :) but I am not
going to fix that) without any notion of task parallelism constraints.

Let me know if there are unresolved things or other comments.
jaspal

--------------------------------------------------------------------------------

[Reply^3 from Chuck Koelbel]

Sounds to me like you're addressing my concerns. Maybe it would also help
to explicitly mention that a processor group always has write access to
data mapped to it.

Also, note that "normal data-parallel mode" on some machines may mean
GET/PUT access; copying data in this way in ALL may not be sufficient for
the synchronization that you need.

Chuck

--------------------------------------------------------------------------------

[Reply^4 from Jaspal Subhlok]

>Sounds to me like you're addressing my concerns. Maybe it would also help
>to explicitly mention that a processor group always has write access to
>data mapped to it.

Actually it is mentioned that processor groups have unrestricted access to
their variables, but only as a comment. Excerpt:

[There are no other access constraints: Code executing on ALL processors
has unrestricted access to all variable locations. A code block directed to
execute ON a subgroup P has unrestricted access to all variable locations
mapped to P]

Perhaps it should be in the main text.

>Also, note that "normal data-parallel mode" on some machines may mean
>GET/PUT access; copying data in this way in ALL may not be sufficient for
>the synchronization that you need.

If GET/PUT access is used, the compiler has to include explicit
synchronization, like barriers, for normal data parallel processing. The
compiler has to use "subset barriers" to be able to exploit task
parallelism. One concern is that if a full barrier is used around ALL
sections, then the program may be oversynchronized, so subgroup barriers
should be used in ALL as needed. For example, if a variable in group1 is
copied to a variable in group2, correct use of subgroup barriers will allow
group3 processors to continue if there is no other code in ALL that needs
them.
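The group1/group2/group3 scenario above can be sketched with ordinary
threads. This Python model is only an illustration of the idea of subset
barriers (it is not HPF compiled code); the group sizes, names, and the use
of a timer to represent computation are assumptions made for the example.

```python
import threading
import time

# g1 and g2 cooperate in a copy and share a subset barrier;
# g3 is not involved and never waits on any barrier.
copy_barrier = threading.Barrier(4)   # 2 workers in g1 + 2 workers in g2
g3_done = threading.Event()
ran_ahead = {}
lock = threading.Lock()

def worker(group, pid):
    if group == "g3":
        g3_done.set()              # g3 finishes its own work immediately
        return
    time.sleep(0.2)                # g1/g2 compute, then synchronize the copy
    copy_barrier.wait()            # subset barrier: only g1 and g2 wait here
    with lock:
        ran_ahead[pid] = g3_done.is_set()

threads = [threading.Thread(target=worker, args=(g, p))
           for p, g in enumerate(["g1", "g1", "g2", "g2", "g3", "g3"])]
for t in threads:
    t.start()
for t in threads:
    t.join()

# g3 completed its work while g1/g2 were still synchronizing the copy.
print(all(ran_ahead.values()), len(ran_ahead))  # True 4
```

With a full barrier over all six workers instead, the g3 threads would have
been forced to wait for a copy they take no part in, which is exactly the
oversynchronization concern raised above.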
Anyway, I think this is about efficient compilation in the GET/PUT model,
and I don't think it changes the specification or the basic execution
model.

jaspal

--------------------------------------------------------------------------------

[Reply^5 by Chuck Koelbel]

>>Sounds to me like you're addressing my concerns. Maybe it would also
>>help to explicitly mention that a processor group always has write access
>>to data mapped to it.
>
>Actually it is mentioned that processor groups have unrestricted access
>to their variables, but only as a comment. Excerpt:
>
>[There are no other access constraints: Code executing on ALL processors
>has unrestricted access to all variable locations. A code block directed
>to execute ON a subgroup P has unrestricted access to all variable
>locations mapped to P]
>
>Perhaps it should be in the main text.

I think so. It's an important point.

>>Also, note that "normal data-parallel mode" on some machines may mean
>>GET/PUT access; copying data in this way in ALL may not be sufficient for
>>the synchronization that you need.
>
>...
>Anyway, I think this is about efficient compilation in the GET/PUT model,
>and I don't think it changes the specification or the basic execution
>model.
>
>jaspal

Agreed. Is it true that your execution model assumes that the owners of A
and B both participate in A=B (at least in the sense that they do some
synchronization)? If so, I'd feel better if this were explicitly stated
somewhere. Yes, in some sense the synchronization is also implied in
straight HPF - but here the model is explicitly multi-threaded, as opposed
to the single-threaded model in HPF-default mode.

Chuck

--------------------------------------------------------------------------------

[Jaspal Subhlok gets the last word]

OK, it will probably make things clearer to state that the owners of A and
B both participate in an A=B.
But the main point is that the execution model in ALL is the same as in
regular HPF (with the extra information that, from ALL's perspective,
subgroup computations can be assumed to only read and modify subgroup
variables), and it should be stated in those terms; otherwise we have to
detail the behavior in all scenarios. Perhaps some illustration will make
things easier.

jaspal
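One possible illustration of the cooperative A=B point discussed above: the
owner of B sends, the owner of A receives, and the blocking receive is the
synchronization between the two groups. The following Python sketch is a
toy model only (each "group" is one thread, and a queue stands in for the
communication channel); none of it is HPF.

```python
import queue
import threading

channel = queue.Queue()
a = [0] * 4          # imagine A mapped to group P1
b = [1, 2, 3, 4]     # imagine B mapped to group P2

def owner_of_b():
    channel.put(list(b))       # P2, the owner of B, sends its value

def owner_of_a():
    a[:] = channel.get()       # P1, the owner of A, receives; the blocking
                               # get is the synchronization point

t1 = threading.Thread(target=owner_of_a)
t2 = threading.Thread(target=owner_of_b)
t1.start(); t2.start()
t1.join(); t2.join()
print(a)  # [1, 2, 3, 4]
```

Neither side can fall through the assignment early: P1 cannot proceed until
P2 has produced B, which is the implied synchronization Chuck asks about.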