Scalable Content-aware Request Distribution in Cluster-based Network Servers
Department of Computer Science
Rice University
- Author: Mohit Aron
Email: aron@cs.rice.edu
- Code release version: 1.0
- License: This code release is NOT under the GPL or its
variants. We are licensing this code free of charge to individuals for
non-commercial personal use and to non-profit entities for
non-commercial use. Permission is also granted to commercial enterprises
to use this software for evaluation purposes only. Any other use
requires obtaining a license from the Rice University Office of
Technology Transfer (
techtran@rice.edu). This code is copyrighted software and can NOT
be redistributed.
- Downloading: Please fill out the registration form
to download this software; the registration page also includes more details
regarding the license terms and copyright. To unpack the code, the following
command can be used on Unix machines:
gunzip -c scalableRD-1.0.tgz | tar xvf -
- Supported platforms:
This code only supports the FreeBSD-2.2.6 operating system running on
x86 hardware. The gcc-2.7.2.1 compiler was used for compiling the
code on this platform. The source code for the freely available
FreeBSD-2.2.6 operating system can be obtained from the FreeBSD CVS
repository. For more information, refer to the FreeBSD Handbook available
from http://www.freebsd.org/.
- Contents:
Introduction
Design
Terminology
Loadable Module Implementation
Code layout
Compiling and Installing
Running
Caveats
Bibliography
This code implements an architecture for scalable content-aware request
distribution in cluster-based servers. It was developed as part of the ScalaServer project
under the direction of Prof. Peter
Druschel and Prof. Willy
Zwaenepoel at the Department of Computer Science, Rice University. Several
papers describe the various design stages that led to this code; however,
this version of the code most closely corresponds to that described in [1].
This document briefly describes the overall design and layout of the code as
well as how to compile, install, and run it. It does not comprehensively
describe all the code details; it is only intended as a guide to give a
knowledgeable systems person a quick start. Implementation details have to be
gathered by reading the code. This document further assumes that the reader is
familiar with the papers given in the bibliography.
As described in [1], the architecture consists of three basic components: the
Dispatcher, the Distributor, and the Server. Any component can be placed on
any of the cluster nodes, providing a wide
variety of possible cluster designs. For example, by placing the Dispatcher and
the Distributor components in one cluster node that acts as the front-end and
the Server component on all other cluster nodes that are back-ends, the
architecture described in [3] can be realized. On the
other hand, by placing the Dispatcher by itself on one node and the Distributor
and the Server components on the remaining nodes, the scalable architecture
described in [1] can be realized. The Dispatcher and
the Distributor components are completely implemented inside the kernel. The
Server component spans both the user level and the kernel. The
user-level part is the unmodified web server application (e.g., Apache) while
the kernel part consists of an enhanced network stack. Persistent control TCP
connections are established between cluster nodes for the purpose of internal
communication. Any component can communicate either directly with any other
component in the cluster or through a TCP Handoff protocol that is implemented
as part of the network stack.
The cluster operates as follows. Clients send web requests to one of the
Distributor components in the cluster using DNS round-robin (see Caveats).
The Distributor establishes a TCP connection with the client, reads the HTTP
request, sends it to the Dispatcher component (possibly located on a remote
node) to make the request distribution decision, and then performs a TCP
Handoff to hand the connection to the cluster node chosen by the Dispatcher.
The kernel part of the Server component on the chosen node receives the TCP
Handoff, and the application web server then does the rest of the HTTP
processing.
The code itself follows a slightly different terminology than that described in
Design above. First, the Distributor is referred to as
the Front-End (FE) throughout the code. This is because for
this code release all nodes having the Distributor component are front-ends to
the cluster (see Caveats). Second, the Server component
is considered to consist of two parts - the web server application (e.g.,
Apache) operating at the user-level and the kernel code that implements the
network stack and supports the TCP Handoff mechanism for the Server
component. This kernel part of the Server component is referred to as the
Back-End (BE) in the code. This terminology reflects the fact that
any back-end node (a node that runs the web server application) must have the
Back-End kernel part.
In order to facilitate the development and debugging of the code, the kernel
parts of the code are implemented as a loadable module. Very few code changes
are made to the actual FreeBSD-2.2.6 kernel - the only changes made are
those necessary to hook the loadable module into the kernel. Since this
project required changes to the network stack to support the TCP Handoff
mechanism, the loadable module installs an additional network stack in the
operating system rather than modifying the existing one. In other words,
once the module is loaded, the operating system runs with two network stacks.
The original stack is used for regular networking services like telnet,
rlogin, etc. Bugs in the new stack (unless they cause memory violations)
thus leave regular system services running. All it takes to fix such bugs
is to unload the module, recompile the fixed code, and load the module back.
The network stack implemented by the loadable module was derived from the
original FreeBSD-2.2.6 stack itself. The trick to installing additional network
stacks is to install them as a different protocol family. Application programs
(such as web servers) choose the network stack they want to use by specifying
the protocol family as the first argument to the socket() system call. In
FreeBSD, the protocol family for the regular TCP/IP stack is 2 (specified
with macro PF_INET or AF_INET). For the alternate stack implemented in the
loadable module, this protocol family is chosen to be 21. Therefore, in order
to use a conventional web server (e.g., Apache) with this loadable module,
the first argument of any socket() calls made in the server code should be
changed to 21. (In Apache-1.3.3 source this required a change in the file
src/main/http_main.c).
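A minimal sketch of what the change looks like. The PF_NETSYS name and the
PF_INET fallback are ours for illustration only; the release simply hard-codes
the value 21 in the server's socket() calls and does not fall back:

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#define PF_NETSYS 21   /* protocol family of the stack in the loadable module */

/* Try the alternate stack first; fall back to the regular TCP/IP stack
 * (PF_INET) on kernels where the module is not loaded. */
int open_server_socket(void)
{
    int fd = socket(PF_NETSYS, SOCK_STREAM, 0);
    if (fd < 0)
        fd = socket(PF_INET, SOCK_STREAM, 0);
    return fd;
}
```

On a stock kernel the first socket() call fails and the fallback is used, so
the program still runs; on a kernel with the module loaded, the socket is
created on the alternate stack.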
Once the kernel operates with both the stacks loaded, the next question that
arises is which stack should service an IP packet that is received at a network
interface. The solution to this is of necessity ad hoc, as both network
stacks are equally capable of handling all IP packets. The packet is first
presented to the stack in the loadable module. This stack quickly makes a
decision on whether it wants the packet or not - if not, then the packet is
sent up the original stack. Under this implementation, all packets with a
source IP address of the form 192.x.x.x that arrive on the fxp0 or fxp1
interface are accepted by the stack in the loadable module. The one exception
is source 192.168.2.75, whose packets, like those from all other sources, are
sent up the original stack (192.168.2.75 was the IP address of our development
machine, and we wanted to use the regular stack for this address for ftp'ing
the compiled code over to the cluster nodes). This part of the code is
implemented in FreeBSD-stack/netinet/netsys.c and should be modified as
necessary.
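The accept/reject test amounts to a small predicate on a packet's source
address and arrival interface. The sketch below uses our own names and a plain
user-level interface; the actual decision is made in sched_softintr() in
netsys.c and its interfaces differ:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEV_IP 0xC0A8024BU  /* 192.168.2.75, the development machine */

/* Return 1 if the packet should be handed to the stack in the loadable
 * module, 0 if it should go up the original stack. src_ip is the packet's
 * source address in host byte order; ifname is the arrival interface. */
int module_stack_wants(uint32_t src_ip, const char *ifname)
{
    if (src_ip == DEV_IP)                      /* development machine */
        return 0;
    if ((src_ip >> 24) != 192)                 /* only 192.x.x.x sources */
        return 0;
    if (strcmp(ifname, "fxp0") != 0 && strcmp(ifname, "fxp1") != 0)
        return 0;
    return 1;
}
```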
This section provides a brief description of most of the files and
directories included in this release.
- distdir/: After compilation, this
directory contains the files that should be copied over to each node
in the cluster. These are a Makefile, the loadable module
object file netsys_mod.o and an application program
cluster_startup for initializing the loadable module once
loaded. Notice that although different cluster nodes might be
configured with different components, the same loadable module is
installed on all of them.
- FreeBSD-stack/: Directory that contains code for compiling
the loadable kernel module. This code implements the alternate
network stack containing the TCP Handoff protocol along with the
Distributor, Dispatcher and the kernel part of the Server component.
- FreeBSD-stack/compile/: The loadable kernel module is
compiled by executing a 'make' in this directory.
- FreeBSD-stack/usr.bin/symorder: This directory contains the
modified source for the symorder program found in FreeBSD.
This program is used for localizing symbol definitions in the
loadable module so that they do not conflict with similar definitions
in the original network stack when the module is loaded. This enables
an alternate network stack to be derived from the original stack
without changing all the function names. The original FreeBSD
symorder does not localize global BSS symbols; the code in this
directory provides a fix to work around that. Executing a 'make'
in this directory compiles the symorder program. This should be
done before trying a 'make' in the FreeBSD-stack/compile directory.
- FreeBSD-stack/kpatch/sys/: This directory contains the set of
modified files in the FreeBSD-2.2.6 kernel code. These files are
necessary for the loadable kernel module to interact with the
kernel. If the kernel source is present in /sys, then file
FreeBSD-stack/kpatch/sys/X/Y should replace file /sys/X/Y in the
kernel code. The file FreeBSD-stack/kpatch/sys/i386/conf/PD0X
is the kernel configuration file with which we compiled our
kernel. To quickly update the kernel source in /sys, the command
'cp -r FreeBSD-stack/kpatch/sys /' can be used.
- FreeBSD-stack/netinet/: This directory contains the
source for the loadable kernel module. Most of this source was
derived from the source for the network stack found in a regular
FreeBSD-2.2.6 kernel in the directory /sys/netinet. The files
in this directory are briefly described next.
- netsys.c, netsys.h : Contain the boilerplate code for
a loadable kernel module. Also contain code to initialize
a new protocol family in the kernel when the module is loaded.
Finally, the function sched_softintr() in netsys.c inspects
every IP packet received on any network interface to decide
whether it is to be sent up the original network stack or
the stack in the loadable module.
- back_end.c: Implements some of the kernel functionality
of the Server component. For example, this file intercepts
client requests on a P-HTTP connection in order to send them
to the Dispatcher for inspection; a modified HTTP request is
put in the web server application's socket buffer to make the
web server fetch the requested content from a remote node using
NFS (see [2]).
- demux.c, demux.h : Common code used for manipulating
cluster communication messages received on connections within
the kernel. Internal communication between any two cluster nodes
takes place using a persistently established control TCP
connection. Messages destined for the Dispatcher, Distributor
(or FE), the Server component (or BE) or the Handoff protocol
are multiplexed over this single control connection. The code
in these files determines the component that is to handle the
message and passes the extracted message to that component.
- dispatcher.c, dispatcher.h : Implements the
communication part of the Dispatcher component. The actual
request distribution strategy is implemented in splitter.c.
- splitter.c, wrr.c, lard.c : splitter.c implements
the request distribution strategy; it is a symbolic
link to either lard.c or wrr.c, which implement the LARD and
the WRR request distribution policies respectively.
- front_end.c : Implements the Distributor component.
- handoff.c, handoff.h : Implement the TCP Handoff
protocol.
- hash.c, hash.h : Implement a generic hash function
from the Compiler book by Aho, Ullman, Sethi.
- ifc_disp.c : Contains stub code used by the
Distributor and the Server components to communicate with the
Dispatcher component (possibly on a remote node).
- tcp_tw.c, tcp_tw.h : Implement a timing wheel
for efficient maintenance of TCP timer events. Default
FreeBSD-2.2.6 periodically makes linear scans over all the
timer events, which has prohibitive overhead.
- Most other files are copies of the corresponding file from the
kernel source in /sys/netinet. Most are unchanged or slightly
changed to incorporate them in the alternate network stack. Some
have larger changes to support the TCP Handoff protocol.
- interface/: Directory containing source for the
application-level program cluster_startup used for
initializing various components in the loadable module. This
application program goes to sleep once it initializes the kernel
components and is used by the kernel to provide a process context
for the various network connections that are handled solely within
the kernel.
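For readers unfamiliar with the LARD policy implemented by lard.c, here is a
toy user-level sketch of the basic algorithm from [3]. The data layout,
threshold values, and function names are ours for illustration, not the
release's:

```c
#include <assert.h>
#include <string.h>

#define NODES   4
#define TARGETS 64          /* toy table size; real code hashes request URLs */
#define T_LOW   25          /* load thresholds; values here are illustrative */
#define T_HIGH  65

static int load[NODES];        /* active connections per back-end node */
static int assigned[TARGETS];  /* target -> node mapping, -1 if unmapped */

static int least_loaded(void)
{
    int best = 0;
    for (int i = 1; i < NODES; i++)
        if (load[i] < load[best]) best = i;
    return best;
}

void lard_init(void)
{
    memset(load, 0, sizeof load);
    memset(assigned, -1, sizeof assigned);
}

/* Basic LARD: keep each target on one node for cache locality, but move it
 * to the least-loaded node when its current node is overloaded. */
int lard_pick(int target)
{
    int n = assigned[target];
    if (n < 0 ||
        (load[n] > T_HIGH && load[least_loaded()] < T_LOW) ||
        load[n] >= 2 * T_HIGH)
        n = assigned[target] = least_loaded();
    load[n]++;
    return n;
}
```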
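The generic hash in hash.c is credited to the Aho, Sethi, and Ullman compilers
book; presumably this is the well-known hashpjw function from that book,
reproduced standalone here for reference (whether hash.c matches it exactly
should be checked against the code):

```c
#include <assert.h>

/* hashpjw, as given in Aho, Sethi, and Ullman's "Compilers: Principles,
 * Techniques, and Tools" (the book cited by hash.c). */
unsigned hashpjw(const char *s)
{
    unsigned h = 0, g;
    for (; *s != '\0'; s++) {
        h = (h << 4) + (unsigned char)*s;
        if ((g = h & 0xf0000000u) != 0) {
            h ^= g >> 24;
            h ^= g;    /* clear the top four bits */
        }
    }
    return h;
}
```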
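A toy user-level sketch of the timing-wheel idea behind tcp_tw.c: insertion is
O(1), and each clock tick scans only one slot instead of the whole timer list
as in stock FreeBSD-2.2.6. The slot count, field names, and the single-level
design are ours; the kernel code differs:

```c
#include <stddef.h>

#define WHEEL_SLOTS 128    /* toy size; a real wheel sizes this to the
                              maximum timeout divided by the tick length */

struct timer {
    struct timer *next;
    int rounds;            /* full wheel revolutions remaining */
    int fired;
};

static struct timer *wheel[WHEEL_SLOTS];
static int now;            /* current slot index */

/* Insert in O(1): hang the timer off the slot where it expires. */
void timer_add(struct timer *t, int ticks)
{
    int slot = (now + ticks) % WHEEL_SLOTS;
    t->rounds = ticks / WHEEL_SLOTS;
    t->fired = 0;
    t->next = wheel[slot];
    wheel[slot] = t;
}

/* One clock tick: advance the wheel and fire every timer in the new
 * current slot whose revolution count has reached zero. */
void timer_tick(void)
{
    now = (now + 1) % WHEEL_SLOTS;
    struct timer **pp = &wheel[now];
    while (*pp != NULL) {
        struct timer *t = *pp;
        if (t->rounds > 0) {
            t->rounds--;
            pp = &t->next;
        } else {
            *pp = t->next;    /* unlink and fire */
            t->fired = 1;
        }
    }
}
```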
Download the source from the URL given at the beginning of this document.
Unzip and untar the tarball.
- Kernel compilation and installation:
Let the kernel source be present in /sys. Replace files in
the kernel source with the corresponding files from
FreeBSD-stack/kpatch/sys. The file /sys/i386/conf/PD0X will
be the kernel configuration file. This file might need to be modified
to adapt to a hardware configuration different from the one on which
this code was tested.
In addition, the following modifications to the kernel source are
optionally recommended for better performance:
- Increase IFQ_MAXLEN from 50 to 1000 in /sys/net/if.h
- Increase SOMAXCONN from 128 to 1024 in /sys/sys/socket.h
- Increase SB_MAX from 256K to 512K in /sys/sys/socketvar.h
Execute the following series of commands to compile the kernel:
- cd /sys/i386/conf
- config PD0X
- cd /sys/compile/PD0X
- make depend
- make
This should produce the kernel binary /sys/compile/PD0X/kernel.
Install it as /kernel on all the cluster nodes. Upon rebooting,
the cluster machines will be running the newly compiled kernel.
- Loadable module compilation and installation:
Change directory to the root of the directory where the source is
installed after downloading. Execute the following series of
commands to compile the module and related programs:
- cd FreeBSD-stack/usr.bin/symorder
- make depend
- make
- cd ../../compile
- make depend
- make
- cd ../../interface
- make depend
- make
- cd ../distdir
The directory FreeBSD-stack/distdir now contains the three files
that are to be copied over to each of the cluster nodes. These
are: (1) Makefile, (2) netsys_mod.o, and (3) cluster_startup.
- Web server compilation and installation :
The Web server source is unmodified with one exception. Since we want
the web server to use the network stack in the loadable module, the
first argument to any socket() call in the source code for the web
server should be changed to 21. The compilation and installation on
the cluster nodes should be done as per the instruction manual for
the web server.
There is an issue involved in the choice of the port at which the
web server application receives requests. Since the Distributor
component and the Server component both listen independently for HTTP
requests, their port numbers should not conflict. Therefore, if any
cluster node contains both the Distributor and the Server component,
the port on which the Server listens should be chosen to be different
from the port on which the Distributor listens. It is usually wise
to enforce this even when no cluster node runs both the Server and the
Distributor components together. For testing this code, the Distributor
received requests on port 8080 while the Server received requests on
port 7080 on any cluster node. Notice that since the TCP Handoff is
transparent to clients, all packets sent from the Server component to
the client after the Handoff appear to have a source port of 8080
(the same as the Distributor).
Follow these steps to test the system.
- Reboot cluster nodes : Reboot all the cluster nodes so
that they run the modified kernel from the
Compiling and Installing section.
- Load kernel module : On each cluster node, go to the directory
where the 3 files from FreeBSD-stack/distdir were copied. Execute
'make load' in this directory with super-user privileges. This
loads the kernel module and installs the second stack in the
kernel.
- Start web server application : On each cluster node that
is to run the Server component, start the web server application.
- Initialize various components in the loadable module and
set up control connections between cluster nodes:
This is done using the cluster_startup program. On each cluster
node, one instance of this program is run and is never terminated
as long as the cluster remains in operation. cluster_startup
is started with one or more of the arguments "-D", "-FE", or "-BE".
These arguments specify whether the corresponding cluster node
is to contain one or more of the Dispatcher, Distributor or Server
components respectively. Additionally, it takes arguments "-fp"
and/or "-bp" that are used to specify the port at which the
Distributor and/or Server component at that node listen.
The node that runs the Dispatcher component should be initialized
first. On the remaining nodes, the cluster_startup program takes
the "-da" argument that specifies the name of the cluster node
that runs the Dispatcher.
While the cluster will work well even if each node has only one
network interface, it is desirable to have at least two interfaces -
one for internal cluster communication and one for communication
with clients. In principle, each cluster node can have many network
interfaces. However, this code only supports internal cluster
communication on one of each node's interfaces. The clients can,
however, send requests to any of the network interfaces on a cluster
node.
An example demonstrates the initialization done by the
cluster_startup program. Let's assume we have 3 nodes; node 1 is
to run the Dispatcher and the other two are to run both the
Distributor as well as the Server components. Assume further that we
want the Distributor to listen at port 8080 (this is the port to
which the clients will send requests) while the web server
application listens at port 7080. Each node has two network
interfaces - let host2-ifc1.cs.rice.edu be the address of the 1st
interface on node 2, and so on. The example below sets up internal
cluster communication between the 2nd interfaces of the nodes (all
these interfaces should have the same subnet address):
- Initializing Dispatcher on node 1:
cluster_startup -D &
- Initializing Distributor and Server on node 2:
cluster_startup -FE -fp 8080 -BE -bp 7080 -da host1-ifc2.cs.rice.edu &
- Initializing Distributor and Server on node 3:
cluster_startup -FE -fp 8080 -BE -bp 7080 -da host1-ifc2.cs.rice.edu &
- Additional web server setup for P-HTTP :
This step is only required if the clients send requests using
HTTP/1.1 P-HTTP connections. The sending of multiple client
requests on P-HTTP connections requires some additional
configuration setup for web servers. This is because additional
requests on P-HTTP connections are fetched using a technique called
back-end forwarding (see [2]) where
content is fetched from remote nodes through NFS mounted
directories.
Let docroot be the root directory from which the web server serves
its documents. On each cluster node running the Server component,
go to the docroot directory and create one additional directory for
each of the other cluster nodes that run Server components. The name
of each such directory corresponds to the IP address of the interface
used for internal cluster communication by that remote cluster node.
The docroot directory of each of the other cluster nodes running the
Server component is then NFS mounted at its corresponding directory
under the docroot of the current cluster node.
In the example above, in the docroot directory of node 2,
a directory should be created corresponding to the IP address
of host3-ifc2.cs.rice.edu. The docroot of node 3 should then
be NFS mounted under this directory. Similarly, under the docroot
of node 3, a directory should be created corresponding to the IP
address of host2-ifc2.cs.rice.edu and the docroot of node 2 be
NFS mounted under it.
- Sending requests from clients :
After initializing the cluster, the clients can start sending HTTP
requests to any of the nodes that run the Distributor component.
Any of the network interfaces can be used, but for better network
utilization, an interface other than the one used for internal
cluster communication should be used. In the example above, the
clients should send requests to host2-ifc1.cs.rice.edu or
host3-ifc1.cs.rice.edu.
To terminate the cluster setup, execute the following steps:
- Kill the cluster_startup program running on each node.
- Stop the web server application running on the cluster nodes
with the Server component.
- Unload the module by executing the command 'make unload'.
- This release does not include the code corresponding to the
layer-4 switch used as the front-end in one of the architectures
described in [1]. Therefore, the request
distribution between all the distributors has to be done using the
DNS round-robin strategy. An experimental lab environment
can make specific clients send requests to specific distributors
in the cluster, thus roughly emulating DNS round-robin.
- [1] describes two mechanisms for fetching
request content from a cluster node other than the one that
establishes a connection first with the client. These mechanisms are
Splicing and TCP Handoff. This code release
only contains the TCP Handoff mechanism.
- The Dispatcher component should only be installed on one of the
cluster machines; a distributed Dispatcher is not supported.
- In a normal HTTP transaction, the connection is closed first by the
server, i.e., the server does the active close. This code is known to
not cleanly handle situations where the client closes the connection
first (e.g., if the user presses a Control-C to shut down the client
program). Therefore, it is imperative that care be taken while
experimenting with this code. If by mistake the client connections
are closed first, then it is best to restart the cluster by rebooting
the machines.
- The implementation of the LARD algorithm distributed in this
release is the one described in [2].
- P-HTTP is supported in this implementation. However, this requires
the Server components to communicate their disk queue lengths to the
Dispatcher component (see [2]). Currently the
code only supports reading the disk queue length from the SCSI disk
driver for the first SCSI disk (sd0). Therefore, for any other type
of disk, P-HTTP is not supported by this code. Support for P-HTTP
also hasn't been stress tested with this implementation.
- The LARD and the WRR implementations distributed with this code
assume that all cluster nodes are equal in speed.
- All internal communication between cluster nodes happens over TCP,
and Nagle's algorithm is turned off by default for every internal
TCP connection. However, for a configuration where the Dispatcher and
Distributor are located on the front-end cluster node and only the
Server component is located on the rest of the cluster nodes (the
configuration used in [3]), it is
recommended that Nagle's algorithm be turned back on for TCP packets
sent to the front-end node in order to improve its
scalability. This can be achieved by using the "-doNagleDisp"
argument to the cluster_startup program on all cluster nodes
other than the front-end.
- This is an alpha release. The code might be unstable under various
experimental conditions and might result in system crashes.
- [1] "Scalable Content-aware Request Distribution in Cluster-based
Network Servers", Mohit Aron, Darren Sanders, Peter Druschel and
Willy Zwaenepoel. In Proceedings of the USENIX 2000 Annual
Technical Conference, San Diego, CA, June 2000.
- [2] "Efficient Support for P-HTTP in Cluster-Based Web Servers",
Mohit Aron, Peter Druschel and Willy Zwaenepoel. In Proceedings
of the USENIX 1999 Annual Technical Conference, Monterey, CA,
June 1999.
- [3] "Locality-Aware Request Distribution in Cluster-based Network
Servers", Vivek S. Pai, Mohit Aron, Gaurav Banga,
Michael Svendsen, Peter Druschel, Willy Zwaenepoel and Erich Nahum.
In Proceedings of the 8th International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS-VIII), San Jose, CA, October 1998.