The Transportation Primitive

         Ravi V. Shankar(S)    Khaled A. Alsabti(S)    Sanjay Ranka
               School of Computer and Information Science
              Syracuse University, Syracuse, NY 13244-4100
           e-mail: rshankar, kaalsabt, ranka@top.cis.syr.edu
                            August 1994


This paper presents algorithms for implementing the transportation
primitive on a distributed memory parallel architecture.  The
transportation primitive performs many-to-many personalized communication
with bounded incoming and outgoing traffic.  We present a two-stage
deterministic algorithm that decomposes the communication with possibly
high variance in message size into two communication stages with low
message size variance.  If the maximum outgoing or incoming traffic at any
processor is $t$, transportation can be done in $2t\mu$ time ($+$ lower
order terms) when $t \ge O(p^2+p\tau/\mu)$ ($\mu$ is the inverse of the
data transfer rate, $\tau$ is the startup overhead). If the maximum
outgoing and incoming traffic are $r$ and $c$ respectively, transportation
can be done in $(r+c)\mu$ time when either $r \ge O(p^2)$ or $c \ge
O(p^2)$. Optimality and scalability are thus achieved when the traffic is
large, a condition that is usually satisfied in practice.  The algorithm
was implemented on the Connection Machine CM-5. The implementation used the
low latency communication primitives (active messages) available on the
CM-5, but the algorithm as such is architecture-independent.  An alternate
single-stage algorithm using distributed random scheduling was implemented
on the CM-5 and the performance of the two algorithms were compared.