Proposal for Task Parallelism in HPF

Here is a modified task parallelism proposal. This proposal is not quite
formal, since the final wording will depend somewhat on the SUBGROUP and ON
features, which are under development. On the bright side, I think it is
relatively easy to read and better debugged than earlier proposals.

Jaspal and Bwolen

Assumptions of functionality available from related features:

1) A way to group processors into subgroups P1, P2, P3 and distribute
   variables onto them, i.e. a1, a2, a3.

2) An ON construct that directs execution on groups of processors P1, P2,
   P3, etc. for a block of code.

PROPOSAL:

A "task region" is a single-entry, single-exit region delimited by (say)
TASK REGION ... END TASK REGION. A task region can have blocks of code that
are directed to execute ON a processor subgroup. All other code executes on
all available processors, referred to as ALL.

The following restrictions must hold for the code inside a task region.

A code block directed to execute ON a subgroup must be a single-entry,
single-exit region.

A code block directed to execute ON a subgroup P may access a variable
location not mapped to P only if that variable location is:

a) accessed exclusively in code directed to execute ON P,
OR
b) not written to in the task region.

[There are no other access constraints: code executing on ALL processors
has unrestricted access to all variable locations. A code block directed to
execute ON a subgroup P has unrestricted access to all variable locations
mapped to P.]

An I/O operation in a code section directed to execute ON a subgroup may
not "interfere" with an I/O operation in a code section not explicitly
directed to execute on that subgroup. The interference of I/O operations is
detailed in Section 4.4 (INDEPENDENT).

For a subroutine call inside an ON block, "all available processors" means
the processors in the corresponding subgroup. This is the number that is
used for mapping the parameters of the subroutine.
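The access restrictions (a)/(b) above can be modeled as a simple check over
the accesses in a region. The following Python sketch is only an
illustration of the rule, not part of the proposal; the tuple encoding, the
function name, and the variable names are all assumptions made for the
example.

```python
def region_is_legal(accesses, home):
    """Check the proposal's access rule over a summarized task region.

    accesses: list of (context, variable, mode) tuples, where context is a
              subgroup name or "ALL" and mode is "read" or "write".
    home:     maps each variable to the subgroup it is distributed onto.
    """
    for ctx, var, _ in accesses:
        if ctx == "ALL" or home.get(var) == ctx:
            continue  # ALL, and the owning subgroup, have unrestricted access
        # Code ON ctx touches a variable not mapped to ctx: the variable must
        # be (a) accessed exclusively ON ctx, or (b) never written in the region.
        var_accesses = [(c, m) for (c, v, m) in accesses if v == var]
        exclusive = all(c == ctx for c, _ in var_accesses)
        never_written = all(m != "write" for _, m in var_accesses)
        if not (exclusive or never_written):
            return False
    return True

home = {"a1": "P1", "table": "P1"}

# Legal: P2 reads "table" (mapped to P1), but nothing ever writes it (case b).
ok = [("P1", "a1", "write"), ("P2", "table", "read"), ("ALL", "table", "read")]

# Illegal: P2 reads "a1" while code ON P1 writes it in the same region.
bad = [("P1", "a1", "write"), ("P2", "a1", "read")]

print(region_is_legal(ok, home))   # True
print(region_is_legal(bad, home))  # False
```

Note that in this model an access from ALL never triggers the check, which
matches the bracketed comment above: only accesses ON a subgroup to
variables not mapped to it are constrained.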
[This part will become more specific after the syntax etc. of creating
subgroups is decided. There should probably be a system inquiry function
for the number of processors in the current subgroup, if
NUMBER_OF_PROCESSORS() is supposed to return the total number of processors
for the program.]

COMPILATION/EXECUTION MODEL

The execution model for a subgroup is to unconditionally execute code ON
it, unconditionally skip code ON others, and participate in the execution
of common code (on ALL processors) as normal data parallel code.

[The access restrictions guarantee that the results will be consistent with
pure data parallel execution. A processor group cannot be "invisibly"
writing to a location being accessed by ALL or another processor group, and
vice versa.]

Following is one model for accessing variables in a task region: accesses
to variables owned by other processors are cooperative, i.e. the owner
sends the value and the user receives it, with one exception - when code ON
a subgroup has to access a variable not mapped to it, it uses a remote
fetch/deposit. (It can also cache remote locations locally in the subgroup
for the duration of the execution of the task region, since computation not
ON that subgroup cannot access them.)

EXAMPLE: 2DFFT

Sequential:

      real, dimension(n,n) :: a1, a2

      do while (.true.)
         read (unit = iu, end = 100) a1
         call rowfft(a1)
         a2 = a1
         call colfft(a2)
         write (unit = ou) a2
         cycle
 100     continue
         exit
      enddo

Pipelined Data/Task Parallel HPF:

      real, dimension(n,n) :: a1, a2
      logical done1
!hpf$ disjoint processor groups P1, P2    (Syntax TBA)
!hpf$ distribute a1(block,*) onto P1
!hpf$ distribute a2(*,block) onto P2
!hpf$ distribute done1 onto P1

!hpf$ TASK REGION
      done1 = .false.
      do while (.true.)
!hpf$    ON HOME(P1) BLOCK
         read (unit = iu, end = 100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
!hpf$    END BLOCK
         if (done1) exit
         a2 = a1
!hpf$    ON HOME(P2) BLOCK
         call colfft(a2)
         write (unit = ou) a2
!hpf$    END BLOCK
      enddo
!hpf$ END TASK REGION

The data parallel code on the two processor groups will look something like
this, after the task region is compiled.

Processor group P1:

      real, dimension(n,n) :: a1
!hpf$ distribute a1(block,*)
      logical done1

      done1 = .false.
      do while (.true.)
         read (unit = iu, end = 100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
         _send(done1, P2)
         if (done1) exit
         _send(a1, P2)
      enddo

Processor group P2:

      real, dimension(n,n) :: a2
!hpf$ distribute a2(*,block)
      logical local_done1

      do while (.true.)
         _receive(local_done1, P1)
         if (local_done1) exit
         _receive(a2, P1)
         call colfft(a2)
         write (unit = ou) a2
      enddo

COMMENTS:

1) The main difference from a simple parallel section/region (or using an
INDEPENDENT do loop to achieve parallel sections) is that the task regions
presented here can also have code that executes on ALL processors. If a
task region has no such code, it is similar to a parallel section/region.
However, allowing other code makes this construct more general, and in
particular it implicitly allows pipelining. At the same time, the existence
of code in ALL can constrain parallelism due to data dependence, and in the
worst case no task parallelism may exist.

2) No explicit control dependence constraints are required. Inside an ON
block, any variable being read (or used for control flow) cannot be written
by any other processor group - it can only be written by ALL processors, in
which case the control flow from the subgroup must also reach that point.
Outside an ON block, all processor groups execute all control flow (and
other) statements. If a subgroup skips a control construct because it is
not involved (i.e.
its variables are not involved and there is no code inside the scope of the
control construct that is directed to execute ON it) and continues to
execute its next ON block, the constraints ensure that it cannot write to a
location that is used for managing control flow.

--------------------------------------------------------------------------------

[Reply from Chuck Koelbel]

>Here is a modified task parallelism proposal. This proposal is not quite
>formal since the final wording will be somewhat dependent on SUBGROUP and
>ON features which are under development. On the bright side, I think it is
>relatively easy to read and better debugged than earlier proposals.

I tend to agree, and will make sure that it gets printed for the next
meeting. A couple of details follow...

>The following restrictions must hold for the code inside a task region.
>
>A code block directed to execute ON a subgroup must be a single entry
>single exit region.

Good news, this is part of the ON proposal already. (Albeit phrased as "no
jumps into or out of the region".)

>A code block directed to execute ON a subgroup P may access a variable
>location not mapped to P only if that variable location is:
>
>a) accessed exclusively in code directed to execute ON P.
>OR
>b) not written to in the task region.

Shouldn't the second constraint be

b) not written to by code in another ON block in the task region.

? Another way of asking this is, "Aren't the sequential regions also in the
task region?"

>COMPILATION/EXECUTION MODEL
>
>The execution model for a subgroup is to unconditionally execute code ON
>it, unconditionally skip code ON others, and participate in the execution
>of common code (on ALL processors) as normal data parallel code.
>
>[The access restrictions guarantee that the results will be consistent
>with pure data parallel execution. A processor group cannot be
>"invisibly" writing to a location being accessed by ALL or another
>processor group, and vice versa]

Right motivation.
But if processor groups can't write to locations read by ALL, how is data
going to flow from group to group?

>The data parallel code on the two processor groups will look something
>like this, after the task region is compiled.

Definitely "might" look something like this. I expect many compilers,
especially those on scalable machines, to produce SPMD code with IF
statements (checking local data, like myproc()) where the ON directives
are.

>2) No explicit control dependence constraints are required. Inside an ON
>block, any variable being read (or used for control flow) cannot be
>written by any other processor group - it can be only written by ALL
>processors, in which case the control flow from the subgroup must also
>reach that point.
>Outside an ON block, all processor groups execute all control flow (and
>other) statements. If a subgroup skips a control construct because it is
>not involved (i.e. its variables are not involved and there is no code
>inside the scope of the control construct that is directed to execute ON
>it) and continues to execute its next ON block, the constraints ensure
>that it cannot write to a location that is used for managing control flow.

In the following sequence, does P1 execute the GOTO? It doesn't involve any
data mapped to P1, nor is it directed ON P1.

!HPF$ DISTRIBUTE A1(BLOCK) ONTO P1
!HPF$ DISTRIBUTE A2(BLOCK) ONTO P2
!HPF$ TASK REGION
      IF (A2(2) > 0) THEN
!HPF$    ON HOME(P2) BLOCK
         ... do something with A2 ...
!HPF$    END BLOCK
         GOTO 100
      END IF
!HPF$ ON HOME(P1) BLOCK
      ... do something with A1 ...
!HPF$ END BLOCK
 100  CONTINUE
!HPF$ END TASK REGION

I think I know what you mean. We probably just need to be a little more
careful about the execution model; in particular, what does "normal
data-parallel model" mean?

Chuck

--------------------------------------------------------------------------------

[Reply^2 from Jaspal Subhlok]

You point out some text on access restrictions that seems to be in error.
I think the restrictions may be more subtle than they appear, but the text
is what is intended. I will add a general motivation section along the
lines of the next paragraph in the next revision - please take a look and
mail again if you think something needs to be changed.

TASKING MODEL:

Subgroups "normally" read and write only to/from the variables that are
mapped to them. The code in ALL has unrestricted access to all variables,
and data is exchanged between subgroups by copying the variables of one
subgroup to the variables of another subgroup in the ALL code.

In some circumstances, a subgroup may want to access a variable NOT mapped
to it (e.g. a common block). Such access is allowed, but it must not cause
a data dependence during the entire duration of the execution of the task
region. (Even dependences between a subgroup and ALL are not allowed, since
that would imply that when ALL is executing no other subgroup can execute,
as there is no easy way for a compiler to figure out what variables a
subgroup may access.) A sufficient condition for "no dependences" is that
such accesses should only be to variables that are "read only" in the task
region, or "read and written" only by code ON only one subgroup.

Other points are well taken, and I will add corrections/clarifications in
the next revision. Yes, the compiled code is just one way to do it, and is
meant only to illustrate the task constructs.

Every processor executes control constructs in ALL unless the compiler can
determine that this is not necessary. In general, a GOTO is certainly
executed by everybody.

The "normal data-parallel model" is supposed to mean that execution follows
the normal HPF semantics (some say no such thing exists :) but I am not
going to fix that) without any notion of task parallelism constraints.

Let me know if there are unresolved things or other comments.
jaspal

--------------------------------------------------------------------------------

[Reply^3 from Chuck Koelbel]

Sounds to me like you're addressing my concerns. Maybe it would also help
to explicitly mention that a processor group always has write access to
data mapped to it.

Also, note that "normal data-parallel mode" on some machines may mean
GET/PUT access; copying data in this way in ALL may not be sufficient for
the synchronization that you need.

Chuck

--------------------------------------------------------------------------------

[Reply^4 from Jaspal Subhlok]

>Sounds to me like you're addressing my concerns. Maybe it would also help
>to explicitly mention that a processor group always has write access to
>data mapped to it.

Actually it is mentioned that processor groups have unrestricted access to
their variables, but only as a comment. Excerpt:

[There are no other access constraints: Code executing on ALL processors
has unrestricted access to all variable locations. A code block directed to
execute ON a subgroup P has unrestricted access to all variable locations
mapped to P]

Perhaps it should be in the main text.

>Also, note that "normal data-parallel mode" on some machines may mean
>GET/PUT access; copying data in this way in ALL may not be sufficient for
>the synchronization that you need.

If GET/PUT access is used, the compiler has to include explicit
synchronization, like barriers, for normal data parallel processing. The
compiler has to use "subset barriers" to be able to exploit task
parallelism. One concern is that if a full barrier is used around ALL
sections, then the program may be oversynchronized, so subgroup barriers
should be used in ALL as needed. For example, if a variable in group1 is
copied to a variable in group2, correct use of subgroup barriers will allow
group3 processors to continue if there is no other code in ALL that needs
them.
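The group1/group2/group3 scenario above can be sketched with ordinary
threads. This Python model is only an illustration of the idea of subset
barriers (it is not HPF compiled code); the group sizes, names, and the use
of a timer to represent computation are assumptions made for the example.

```python
import threading
import time

# g1 and g2 cooperate in a copy and share a subset barrier;
# g3 is not involved and never waits on any barrier.
copy_barrier = threading.Barrier(4)   # 2 workers in g1 + 2 workers in g2
g3_done = threading.Event()
ran_ahead = {}
lock = threading.Lock()

def worker(group, pid):
    if group == "g3":
        g3_done.set()              # g3 finishes its own work immediately
        return
    time.sleep(0.2)                # g1/g2 compute, then synchronize the copy
    copy_barrier.wait()            # subset barrier: only g1 and g2 wait here
    with lock:
        ran_ahead[pid] = g3_done.is_set()

threads = [threading.Thread(target=worker, args=(g, p))
           for p, g in enumerate(["g1", "g1", "g2", "g2", "g3", "g3"])]
for t in threads:
    t.start()
for t in threads:
    t.join()

# g3 completed its work while g1/g2 were still synchronizing the copy.
print(all(ran_ahead.values()), len(ran_ahead))  # True 4
```

With a full barrier over all six workers instead, the g3 threads would have
been forced to wait for a copy they take no part in, which is exactly the
oversynchronization concern raised above.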
Anyway, I think this is about efficient compilation in the GET/PUT model,
and I don't think it changes the specification or the basic execution
model.

jaspal

--------------------------------------------------------------------------------

[Reply^5 by Chuck Koelbel]

>>Sounds to me like you're addressing my concerns. Maybe it would also
>>help to explicitly mention that a processor group always has write access
>>to data mapped to it.
>
>Actually it is mentioned that processor groups have unrestricted access
>to their variables, but only as a comment. Excerpt:
>
>[There are no other access constraints: Code executing on ALL processors
>has unrestricted access to all variable locations. A code block directed
>to execute ON a subgroup P has unrestricted access to all variable
>locations mapped to P]
>
>Perhaps it should be in the main text.

I think so. It's an important point.

>>Also, note that "normal data-parallel mode" on some machines may mean
>>GET/PUT access; copying data in this way in ALL may not be sufficient for
>>the synchronization that you need.
>
>...
>Anyway, I think this is about efficient compilation in the GET/PUT model,
>and I don't think it changes the specification or the basic execution
>model.
>
>jaspal

Agreed. Is it true that your execution model assumes that the owners of A
and B both participate in A=B (at least in the sense that they do some
synchronization)? If so, I'd feel better if this were explicitly stated
somewhere. Yes, in some sense the synchronization is also implied in
straight HPF - but here the model is explicitly multi-threaded, as opposed
to the single-threaded model in HPF-default mode.

Chuck

--------------------------------------------------------------------------------

[Jaspal Subhlok gets the last word]

OK, it will probably make things clearer to state that the owners of A and
B both participate in an A=B.
But the main point is that the execution model in ALL is the same as in
regular HPF (with the extra information that, from ALL's perspective,
subgroup computations can be assumed to only read and modify subgroup
variables), and it should be stated in those terms; otherwise we have to
detail the behavior in all scenarios. Perhaps some illustration will make
things easier.

jaspal
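One possible illustration of the cooperative A=B point discussed above: the
owner of B sends, the owner of A receives, and the blocking receive is the
synchronization between the two groups. The following Python sketch is a
toy model only (each "group" is one thread, and a queue stands in for the
communication channel); none of it is HPF.

```python
import queue
import threading

channel = queue.Queue()
a = [0] * 4          # imagine A mapped to group P1
b = [1, 2, 3, 4]     # imagine B mapped to group P2

def owner_of_b():
    channel.put(list(b))       # P2, the owner of B, sends its value

def owner_of_a():
    a[:] = channel.get()       # P1, the owner of A, receives; the blocking
                               # get is the synchronization point

t1 = threading.Thread(target=owner_of_a)
t2 = threading.Thread(target=owner_of_b)
t1.start(); t2.start()
t1.join(); t2.join()
print(a)  # [1, 2, 3, 4]
```

Neither side can fall through the assignment early: P1 cannot proceed until
P2 has produced B, which is the implied synchronization Chuck asks about.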