---------------------------------------------------------------------------
hpff-task@cs.rice.edu is a mailing list for discussion of control-parallel
features in HPF. Instructions for adding or deleting yourself from this
list appear at the bottom of this message.
---------------------------------------------------------------------------

I am attaching a revised proposal for task parallelism. There are some
minor corrections and clarifications, a few words about SMPs, and I have
written up the scheme Rob Schreiber proposed as a separate section.

I will not be able to attend the next meeting because of an ARPA site
visit, but I will be in email contact throughout. If you send me any
comments relatively soon, I can prepare a revised proposal if needed.

jaspal

------------------------------------------------------------------------
                 PROPOSAL FOR TASK PARALLELISM IN HPF
            [Contact: Jaspal Subhlok (jass@cs.cmu.edu)]
------------------------------------------------------------------------

Functionality assumed to be available from related features:

1) A way to group processors into subgroups P1, P2, P3, and a way to
   attach and distribute variables onto subgroups, i.e. a1, a2, a3 can be
   mapped onto subgroups P1, P2, P3. (For SMPs, explicit distribution of
   variables to subgroups is not necessary and may not mean anything;
   however, a way to attach variable names to subgroups is still needed.)

2) An ON construct that directs execution of a block of code on groups of
   processors P1, P2, P3, etc.

-------------------------------------------------------------------------

1. GENERAL IDEA AND MODEL:

Task parallelism is expressed by mapping different data objects onto
different subgroups of processors and specifying, with an ON directive,
that blocks of code be executed on named subgroups of processors. Code
executing on a processor subgroup inside a designated ``task region''
normally reads and writes only the variables that are mapped to that
subgroup.
Code inside a task region that is not directed to execute ON a subgroup
(at least conceptually) executes on ALL processors and has unrestricted
access to all variables. Data is exchanged between subgroups by copying
the variables of one subgroup to the variables of another subgroup in the
ALL code.

A subgroup is allowed access to variables not mapped to it if that would
not cause a data dependence. A sufficient condition for ``no dependences''
is that such accesses are only to variables that are ``read only'' in the
task region, or ``read and written'' only by code ON a single subgroup.
The proposal essentially offers a way for the programmer to tell the
compiler that these rules are followed in designated code regions.

2. PROPOSAL:

A ``task region'' is a single-entry, single-exit region delimited by (say)
TASK REGION .... END TASK REGION. A task region can have blocks of code
that are directed to execute ON processor subgroups. All other code
executes on all available processors, referred to as ALL.

The following restrictions must hold for the code inside a task region:
[This is the core of the proposal]

A code block executing on ALL processors has unrestricted access to all
variables. A code block directed to execute ON a subgroup P has
unrestricted access to any variable mapped to P. A code block directed to
execute ON a subgroup P can access a variable not mapped to P only if one
of the following constraints holds for the entire code in the task region:

  a) The variable is ``read only''; OR
  b) The variable is accessed only in code directed to execute ON P.

(Variable in this context means any addressable location.)

An I/O operation in a code section directed to execute ON a subgroup may
not ``interfere'' with an I/O operation in a code section not explicitly
directed to execute on that subgroup. The interference of I/O operations
is detailed in Section 4.4 (INDEPENDENT).
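[Not part of the proposal: a minimal Python sketch of how a tool could
check the access restrictions above against a recorded trace of variable
accesses. The function name, the trace format, and the owner map are all
invented for illustration; a real compiler would derive this information
from the program, not from a trace.]

```python
def check_task_region(owner, accesses):
    """Check the task-region access rules over a trace of accesses.

    owner    : dict mapping variable name -> subgroup it is mapped to.
    accesses : list of (group, variable, mode) tuples, where group is
               'ALL' or a subgroup name and mode is 'r' or 'w'.
    Returns the list of (group, variable) pairs that violate the rules.
    """
    violations = []
    for g, v, m in accesses:
        # ALL code, and code ON the owning subgroup, have unrestricted access.
        if g == 'ALL' or owner.get(v) == g:
            continue
        # Otherwise the access is legal only if, over the whole region:
        #   a) the variable is read-only, OR
        #   b) the variable is accessed only in code ON this subgroup.
        read_only = all(md == 'r' for _, vr, md in accesses if vr == v)
        only_on_g = all(gr == g for gr, vr, _ in accesses if vr == v)
        if not (read_only or only_on_g):
            violations.append((g, v))
    return violations
```

For example, a subgroup P2 reading a variable that is mapped to P1 and
written ON P1 violates the rules, while two subgroups both reading a
variable that is read-only in the region does not.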
For a subroutine call inside an ON block, ``all available processors''
means the processors in the corresponding subgroup. This is the number
used for mapping the parameters of the subroutine. [This part will become
more specific after the syntax etc. of creating subgroups is decided.
There should probably be a system inquiry function for the number of
processors in the current subgroup, if NUMBER_OF_PROCESSORS() is supposed
to return the total number of processors for the program.]

3. COMPILATION/EXECUTION MODEL:

3.1 Basic Execution:

The execution model for a subgroup is to unconditionally execute code ON
it, unconditionally skip code ON others, and participate in the execution
of common code (on ALL processors) as normal data parallel code. An
operation in ALL involving a set of variables starts only when all
processors of the subgroups owning those variables reach that point of
execution. This is the basic execution model for shared and distributed
memory machines. [The access restrictions guarantee that the results will
be consistent with pure data parallel execution. A processor group cannot
be ``invisibly'' writing to a location being accessed by ALL or another
processor group, and vice versa.]

3.2 Variable Access:

We state ``one'' model for accessing variables in a task region for a
distributed memory machine. (This is important for building an efficient
compilation scheme, although not really a part of the execution model.)
Accesses to variables owned by other processors are cooperative, i.e. the
owner sends the value and the user receives it, with one exception: when
code ON a subgroup has to access a variable not mapped to it, it uses a
remote fetch/deposit. (It can also cache remote locations locally in the
subgroup for the duration of the execution of the task region, since
computation not ON that subgroup cannot access them.)

3.3 EXAMPLE: 2DFFT

Sequential:

      real, dimension(n,n) :: a1, a2
      do while (.true.)
         read (unit=iu, end=100) a1
         call rowfft(a1)
         a2 = a1
         call colfft(a2)
         write (unit=ou) a2
         cycle
 100     continue
         exit
      enddo

Pipelined Data/Task Parallel HPF:

      real, dimension(n,n) :: a1, a2
      logical done1
!hpf$ disjoint processor groups P1, P2  (Syntax TBA)
!hpf$ distribute a1(block,*) onto P1
!hpf$ distribute a2(*,block) onto P2
!hpf$ distribute done1 onto P1

!hpf$ TASK REGION
      done1 = .false.
      do while (.true.)
!hpf$ ON HOME(P1) BLOCK
         read (unit=iu, end=100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
!hpf$ END BLOCK
         if (done1) exit
         a2 = a1
!hpf$ ON HOME(P2) BLOCK
         call colfft(a2)
         write (unit=ou) a2
!hpf$ END BLOCK
      enddo
!hpf$ END TASK REGION

After the task region is compiled, the data parallel code on the two
processor groups might look something like this.

Processor group P1:

      real, dimension(n,n) :: a1
!hpf$ distribute a1(block,*)
      logical done1

      done1 = .false.
      do while (.true.)
         read (unit=iu, end=100) a1
         call rowfft(a1)
         goto 101
 100     done1 = .true.
 101     continue
         _send(done1, P2)
         if (done1) exit
         _send(a1, P2)
      enddo

Processor group P2:

      real, dimension(n,n) :: a2
!hpf$ distribute a2(*,block)
      logical local_done1

      do while (.true.)
         _receive(local_done1, P1)
         if (local_done1) exit
         _receive(a2, P1)
         call colfft(a2)
         write (unit=ou) a2
      enddo

4. AN ALTERNATE MODEL:

(Rob Schreiber proposed this at the last meeting.)

A related but different model for task parallelism is as follows:

1) All code in a task region is directed to execute ON some (arbitrary)
   subgroup of processors. If no ON directive is present, ALL is assumed.

2) If a variable is ``read only'' in the region, there are no other
   restrictions. Otherwise, if two subgroups P1 and P2 access a variable
   x, P1 and P2 must have at least one common processor.

The execution model is that all processors execute the code that is mapped
to a subgroup they belong to, and skip other code.

4.1 TRADEOFFS:

This model is very clean and simple to state. It separates the control
aspect of task parallelism from the data aspect.
Other mechanisms are used for mapping data to the tasks on a distributed
memory machine for performance.

The cons are that even though the model is simple to state, it is a subtle
construct for task parallelism, and there is no clear user programming
model. The compilation model is also less clear, and it is extremely hard
for the compiler to check for any violations of the user assertions. There
is no experience in using something like this.

5. GENERAL COMMENTS:

1) The main difference from a simple parallel section/region (or from
   using an INDEPENDENT do loop to achieve parallel sections) is that the
   task regions presented here can also have code that executes on ALL
   processors. If a task region has no such code, it is similar to a
   parallel section/region. Allowing other code makes this construct more
   general and, in particular, implicitly allows pipelining. At the same
   time, the existence of ALL code can constrain parallelism due to data
   dependences, and in the worst case no task parallelism may exist.

2) No explicit control dependence constraints are required. Inside an ON
   block, any variable being read (or used for control flow) cannot be
   written by any other processor group - it can only be written by ALL
   processors, in which case the control flow from the subgroup must also
   reach that point. Outside an ON block, all processor groups execute all
   control flow (and other) statements. If a subgroup skips a control
   construct because it is not involved (i.e. its variables are not
   involved and there is no code inside the scope of the control construct
   that is directed to execute ON it) and continues to execute its next ON
   block, the constraints ensure that it cannot write to a location that
   is used for managing control flow.

3) There may be some issues with respect to extrinsic subroutine calls, to
   ensure that the basic model works in their presence. It is probably
   best to address them after the subroutine call execution model is more
   clearly defined for ON regions in general.
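[Not part of the proposal: a minimal Python sketch simulating the compiled
P1/P2 pipeline from the 2DFFT example. The _send/_receive pair is modeled
with a FIFO queue, the two processor groups with threads, and rowfft and
colfft with stand-in scalar transforms (+1 and *2) invented purely for
illustration; the real routines operate on distributed arrays.]

```python
import threading
import queue

def run_pipeline(inputs):
    """Simulate the compiled pipeline: P1 reads and transforms, then
    sends a done flag and the data to P2, which transforms and writes."""
    chan = queue.Queue()   # models the P1 -> P2 _send/_receive channel
    output = []            # models write(unit=ou) on P2

    def p1():              # processor group P1
        it = iter(inputs)
        while True:
            try:
                a1 = next(it)      # read(unit=iu, end=100) a1
                a1 = a1 + 1        # call rowfft(a1)   [stand-in: +1]
                done1 = False
            except StopIteration:  # end-of-file branch (label 100)
                done1 = True
            chan.put(done1)        # _send(done1, P2)
            if done1:
                break
            chan.put(a1)           # _send(a1, P2) -- the a2 = a1 copy

    def p2():              # processor group P2
        while True:
            if chan.get():         # _receive(local_done1, P1)
                break
            a2 = chan.get()        # _receive(a2, P1)
            output.append(a2 * 2)  # call colfft(a2)  [stand-in: *2]

    t1 = threading.Thread(target=p1)
    t2 = threading.Thread(target=p2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return output
```

As in the compiled code, the done flag is sent every iteration so that P2
learns about termination through the same channel as the data; P1 can
begin reading its next input while P2 is still transforming the previous
one, which is the pipelining the task region implicitly allows.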
---------------------------------------------------------------------------
To (un)subscribe to this list, send mail to hpff-task-request@cs.rice.edu.
Leave the subject line blank, and in the body put the line
    (un)subscribe
---------------------------------------------------------------------------