Scalable Content-aware Request Distribution in Cluster-based Network Servers
Department of Computer Science
Rice University
- Author: Mohit Aron
Email: aron@cs.rice.edu
- Code release version: 1.0
- License: This code release is NOT under the GPL or its
variants. We are licensing this code free of charge to individuals for
non-commercial personal use and to non-profit entities for
non-commercial use. Permission is also granted to commercial enterprises
to use this software for evaluation purposes only. Any other use
requires obtaining a license from the Rice University Office of
Technology Transfer (
techtran@rice.edu). This code is copyrighted software and can NOT
be redistributed.
- Downloading: Please fill out the registration form
to download this software; the registration page also includes more details
regarding the license terms and copyright. To unpack the code, the following
command can be used on Unix machines:
gunzip -c scalableRD-1.0.tgz | tar xvf -
- Supported platforms:
This code only supports the FreeBSD-2.2.6 operating system running on
x86 hardware. The gcc-2.7.2.1 compiler was used for compiling the
code on this platform. The source code for the freely available
FreeBSD-2.2.6 operating system can be obtained from the FreeBSD CVS
repository. For more information, refer to the FreeBSD Handbook available
from http://www.freebsd.org/.
- Contents:
Introduction
Design
Terminology
Loadable Module Implementation
Code layout
Compiling and Installing
Running
Caveats
Bibliography
This code implements an architecture for scalable content-aware request
distribution in cluster-based servers. It was developed as part of the ScalaServer project
under the direction of Prof. Peter
Druschel and Prof. Willy
Zwaenepoel at the Department of Computer Science, Rice University. Several
papers describe the various design stages that led to this code; however,
this version of the code most closely corresponds to that described in [1].
This document briefly describes the overall design and layout of the code as
well as how to compile, install, and run it. It does not comprehensively
describe all the code details; it is only intended as a guide to give a
knowledgeable systems person a quick start. Implementation details have to be
gathered by reading the code. This document further assumes that the reader is
familiar with the papers given in the bibliography.
As described in [1], the architecture consists of three basic components: the
Dispatcher, the Distributor, and the Server. Any component can be placed on
any of the cluster nodes, providing a wide
variety of possible cluster designs. For example, by placing the Dispatcher and
the Distributor components in one cluster node that acts as the front-end and
the Server component on all other cluster nodes that are back-ends, the
architecture described in [3] can be realized. On the
other hand, by placing the Dispatcher by itself on one node and the Distributor
and the Server components on the remaining nodes, the scalable architecture
described in [1] can be realized. The Dispatcher and
the Distributor components are completely implemented inside the kernel. The
Server component spans both the user level and the kernel. The
user-level part is the unmodified web server application (e.g., Apache) while
the kernel part consists of an enhanced network stack. Persistent control TCP
connections are established between cluster nodes for the purpose of internal
communication. Any component can communicate either directly with any other
component in the cluster or through a TCP Handoff protocol that is implemented
as part of the network stack.
The cluster operates as follows. Clients send web requests to one of the
Distributor components in the cluster using DNS round-robin (see Caveats).
The Distributor establishes a TCP connection with the client, reads the HTTP
request, sends it to the Dispatcher component (possibly located on a remote
node) to make the request distribution decision, and then performs a TCP
Handoff to hand the connection to the cluster node chosen by the Dispatcher.
The kernel part of the Server component on the chosen node receives the TCP
Handoff, and the application web server then does the rest of the HTTP
processing.
The code itself follows a slightly different terminology than that described in
Design above. First, the Distributor is referred to as
the Front-End (FE) throughout the code. This is because for
this code release all nodes having the Distributor component are front-ends to
the cluster (see Caveats). Second, the Server component
is considered to consist of two parts - the web server application (e.g.,
Apache) operating at the user-level and the kernel code that implements the
network stack and supports the TCP Handoff mechanism for the Server
component. This kernel part of the Server component is referred to as the
Back-End (BE) in the code. This terminology reflects the fact that
any back-end node (a node that runs the web server application) must have the
Back-End kernel part.
In order to facilitate the development and debugging of the code, the kernel
parts of the code are implemented as a loadable module. Very few code changes
are made to the actual FreeBSD-2.2.6 kernel - the only changes made are
those necessary to hook the loadable module into the kernel. Since this
project required changes to the network stack to support the TCP Handoff
mechanism, the loadable module installs an additional network stack in the
operating system rather than modifying the existing one. In other words,
once the module is loaded, the operating system runs with two network stacks.
The original stack is used for regular networking services like telnet,
rlogin, etc. Bugs in the new stack (unless they cause memory violations)
thus leave regular system services running. All it takes to fix such bugs
is to unload the module, recompile the fixed code, and load the module back.
The network stack implemented by the loadable module was derived from the
original FreeBSD-2.2.6 stack itself. The trick to installing additional network
stacks is to install them as a different protocol family. Application programs
(such as web servers) choose the network stack they want to use by specifying
the protocol family as the first argument to the socket() system call. In
FreeBSD, the protocol family for the regular TCP/IP stack is 2 (specified
with macro PF_INET or AF_INET). For the alternate stack implemented in the
loadable module, this protocol family is chosen to be 21. Therefore, in order
to use a conventional web server (e.g., Apache) with this loadable module,
the first argument of any socket() calls made in the server code should be
changed to 21. (In Apache-1.3.3 source this required a change in the file
src/main/http_main.c).
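A minimal sketch of what the change looks like. The PF_NETSYS name and the
PF_INET fallback are ours for illustration only; the release simply hard-codes
the value 21 in the server's socket() calls and does not fall back:

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#define PF_NETSYS 21   /* protocol family of the stack in the loadable module */

/* Try the alternate stack first; fall back to the regular TCP/IP stack
 * (PF_INET) on kernels where the module is not loaded. */
int open_server_socket(void)
{
    int fd = socket(PF_NETSYS, SOCK_STREAM, 0);
    if (fd < 0)
        fd = socket(PF_INET, SOCK_STREAM, 0);
    return fd;
}
```

On a stock kernel the first socket() call fails and the fallback is used, so
the program still runs; on a kernel with the module loaded, the socket is
created on the alternate stack.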
Once the kernel operates with both the stacks loaded, the next question that
arises is which stack should service an IP packet that is received at a network
interface. The solution to this is of necessity ad hoc, as both network
stacks are equally capable of handling all IP packets. The packet is first
presented to the stack in the loadable module. This stack quickly makes a
decision on whether it wants the packet or not - if not, then the packet is
sent up the original stack. Under this implementation, all packets with a
source IP address of the form 192.x.x.x that arrive on the fxp0 or fxp1
interface are accepted by the stack in the loadable module. The one exception
is source 192.168.2.75, whose packets, like those from all other sources, are
sent up the original stack (192.168.2.75 was the IP address of our development
machine, and we wanted to use the regular stack for this address for ftp'ing
the compiled code over to the cluster nodes). This part of the code is
implemented in FreeBSD-stack/netinet/netsys.c and should be modified as
necessary.
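The accept/reject test amounts to a small predicate on a packet's source
address and arrival interface. The sketch below uses our own names and a plain
user-level interface; the actual decision is made in sched_softintr() in
netsys.c and its interfaces differ:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEV_IP 0xC0A8024BU  /* 192.168.2.75, the development machine */

/* Return 1 if the packet should be handed to the stack in the loadable
 * module, 0 if it should go up the original stack. src_ip is the packet's
 * source address in host byte order; ifname is the arrival interface. */
int module_stack_wants(uint32_t src_ip, const char *ifname)
{
    if (src_ip == DEV_IP)                      /* development machine */
        return 0;
    if ((src_ip >> 24) != 192)                 /* only 192.x.x.x sources */
        return 0;
    if (strcmp(ifname, "fxp0") != 0 && strcmp(ifname, "fxp1") != 0)
        return 0;
    return 1;
}
```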
This section provides a brief description of most of the files and
directories included in this release.
- distdir/: After compilation, this
directory contains the files that should be copied over to each node
in the cluster. These are a Makefile, the loadable module
object file netsys_mod.o and an application program
cluster_startup for initializing the loadable module once
loaded. Notice that although different cluster nodes might be
configured with different components, the same loadable module is
installed on all of them.
- FreeBSD-stack/: Directory that contains code for compiling
the loadable kernel module. This code implements the alternate
network stack containing the TCP Handoff protocol along with the
Distributor, Dispatcher and the kernel part of the Server component.
- FreeBSD-stack/compile/: The loadable kernel module is
compiled by executing a 'make' in this directory.
- FreeBSD-stack/usr.bin/symorder: This directory contains the
modified source for the symorder program found in FreeBSD.
This program is used for localizing symbol definitions in the
loadable module so that they do not conflict with similar definitions
in the original network stack when the module is loaded. This enables
an alternate network stack to be derived from the original stack
without changing all the function names. The original FreeBSD
symorder does not localize global BSS symbols; the code in this
directory provides a fix to work around that. Executing a 'make'
in this directory compiles the symorder program. This should be
done before trying a 'make' in the FreeBSD-stack/compile directory.
- FreeBSD-stack/kpatch/sys/: This directory contains the set of
modified files in the FreeBSD-2.2.6 kernel code. These files are
necessary for the loadable kernel module to interact with the
kernel. If the kernel source is present in /sys, then file
FreeBSD-stack/kpatch/sys/X/Y should replace file /sys/X/Y in the
kernel code. The file FreeBSD-stack/kpatch/sys/i386/conf/PD0X
is the kernel configuration file with which we compiled our
kernel. To quickly update the kernel source in /sys, the command
'cp -r FreeBSD-stack/kpatch/sys /' can be used.
- FreeBSD-stack/netinet/: This directory contains the
source for the loadable kernel module. Most of this source was
derived from the source for the network stack found in a regular
FreeBSD-2.2.6 kernel in the directory /sys/netinet. The files
in this directory are briefly described next.
- netsys.c, netsys.h : Contain the boilerplate code for
a loadable kernel module. Also contain code to initialize
a new protocol family in the kernel when the module is loaded.
Finally, the function sched_softintr() in netsys.c inspects
every IP packet received on any network interface to decide
whether it is to be sent up the original network stack or
the stack in the loadable module.
- back_end.c: Implements some of the kernel functionality
of the Server component. For example, this file intercepts
client requests on a P-HTTP connection in order to send them
to the Dispatcher for inspection; a modified HTTP request is
put in the web server application's socket buffer to make the
web server fetch the requested content from a remote node using
NFS (see [2]).
- demux.c, demux.h : Common code used for manipulating
cluster communication messages received on connections within
the kernel. Internal communication between any two cluster nodes
takes place using a persistently established control TCP
connection. Messages destined for the Dispatcher, Distributor
(or FE), the Server component (or BE) or the Handoff protocol
are multiplexed over this single control connection. The code
in these files determines the component that is to handle the
message and passes the extracted message to that component.
- dispatcher.c, dispatcher.h : Implements the
communication part of the Dispatcher component. The actual
request distribution strategy is implemented in splitter.c.
- splitter.c, wrr.c, lard.c : splitter.c implements
the request distribution strategy; it is a symbolic
link to either lard.c or wrr.c, which implement the LARD and
the WRR request distribution policies respectively.
- front_end.c : Implements the Distributor component.
- handoff.c, handoff.h : Implement the TCP Handoff
protocol.
- hash.c, hash.h : Implement a generic hash function
from the Compiler book by Aho, Ullman, Sethi.
- ifc_disp.c : Contains stub code used by the
Distributor and the Server components to communicate with the
Dispatcher component (possibly on a remote node).
- tcp_tw.c, tcp_tw.h : Implement a timing wheel
for efficient maintenance of TCP timer events. Default
FreeBSD-2.2.6 periodically makes linear scans over all the
timer events, which has prohibitive overhead.
- Most other files are copies of the corresponding file from the
kernel source in /sys/netinet. Most are unchanged or slightly
changed to incorporate them in the alternate network stack. Some
have larger changes to support the TCP Handoff protocol.
- interface/: Directory containing source for the
application-level program cluster_startup used for
initializing various components in the loadable module. This
application program goes to sleep once it initializes the kernel
components and is used by the kernel to provide a process context
for the various network connections that are handled solely within
the kernel.
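For readers unfamiliar with the LARD policy implemented by lard.c, here is a
toy user-level sketch of the basic algorithm from [3]. The data layout,
threshold values, and function names are ours for illustration, not the
release's:

```c
#include <assert.h>
#include <string.h>

#define NODES   4
#define TARGETS 64          /* toy table size; real code hashes request URLs */
#define T_LOW   25          /* load thresholds; values here are illustrative */
#define T_HIGH  65

static int load[NODES];        /* active connections per back-end node */
static int assigned[TARGETS];  /* target -> node mapping, -1 if unmapped */

static int least_loaded(void)
{
    int best = 0;
    for (int i = 1; i < NODES; i++)
        if (load[i] < load[best]) best = i;
    return best;
}

void lard_init(void)
{
    memset(load, 0, sizeof load);
    memset(assigned, -1, sizeof assigned);
}

/* Basic LARD: keep each target on one node for cache locality, but move it
 * to the least-loaded node when its current node is overloaded. */
int lard_pick(int target)
{
    int n = assigned[target];
    if (n < 0 ||
        (load[n] > T_HIGH && load[least_loaded()] < T_LOW) ||
        load[n] >= 2 * T_HIGH)
        n = assigned[target] = least_loaded();
    load[n]++;
    return n;
}
```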
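The generic hash in hash.c is credited to the Aho, Sethi, and Ullman compilers
book; presumably this is the well-known hashpjw function from that book,
reproduced standalone here for reference (whether hash.c matches it exactly
should be checked against the code):

```c
#include <assert.h>

/* hashpjw, as given in Aho, Sethi, and Ullman's "Compilers: Principles,
 * Techniques, and Tools" (the book cited by hash.c). */
unsigned hashpjw(const char *s)
{
    unsigned h = 0, g;
    for (; *s != '\0'; s++) {
        h = (h << 4) + (unsigned char)*s;
        if ((g = h & 0xf0000000u) != 0) {
            h ^= g >> 24;
            h ^= g;    /* clear the top four bits */
        }
    }
    return h;
}
```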
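A toy user-level sketch of the timing-wheel idea behind tcp_tw.c: insertion is
O(1), and each clock tick scans only one slot instead of the whole timer list
as in stock FreeBSD-2.2.6. The slot count, field names, and the single-level
design are ours; the kernel code differs:

```c
#include <stddef.h>

#define WHEEL_SLOTS 128    /* toy size; a real wheel sizes this to the
                              maximum timeout divided by the tick length */

struct timer {
    struct timer *next;
    int rounds;            /* full wheel revolutions remaining */
    int fired;
};

static struct timer *wheel[WHEEL_SLOTS];
static int now;            /* current slot index */

/* Insert in O(1): hang the timer off the slot where it expires. */
void timer_add(struct timer *t, int ticks)
{
    int slot = (now + ticks) % WHEEL_SLOTS;
    t->rounds = ticks / WHEEL_SLOTS;
    t->fired = 0;
    t->next = wheel[slot];
    wheel[slot] = t;
}

/* One clock tick: advance the wheel and fire every timer in the new
 * current slot whose revolution count has reached zero. */
void timer_tick(void)
{
    now = (now + 1) % WHEEL_SLOTS;
    struct timer **pp = &wheel[now];
    while (*pp != NULL) {
        struct timer *t = *pp;
        if (t->rounds > 0) {
            t->rounds--;
            pp = &t->next;
        } else {
            *pp = t->next;    /* unlink and fire */
            t->fired = 1;
        }
    }
}
```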
Download the source from the URL given at the beginning of this document.
Unzip and untar the tarball.
- Kernel compilation and installation:
Let the kernel source be present in /sys. Replace files in
the kernel source with the corresponding files from
FreeBSD-stack/kpatch/sys. The file /sys/i386/conf/PD0X will
be the kernel configuration file. This file might need to be modified
to adapt to a hardware configuration different from the one on which
this code was tested.
In addition, the following modifications to the kernel source are
optionally recommended for better performance:
- Increase IFQ_MAXLEN from 50 to 1000 in /sys/net/if.h
- Increase SOMAXCONN from 128 to 1024 in /sys/sys/socket.h
- Increase SB_MAX from 256K to 512K in /sys/sys/socketvar.h
Execute the following series of commands to compile the kernel:
- cd /sys/i386/conf
- config PD0X
- cd /sys/compile/PD0X
- make depend
- make
This should produce the kernel binary /sys/compile/PD0X/kernel.
Install it as /kernel on all the cluster nodes. Upon rebooting,
the cluster machines will be running the newly compiled kernel.
- Loadable module compilation and installation:
Change directory to the root of the directory where the source is
installed after downloading. Execute the following series of
commands to compile the module and related programs:
- cd FreeBSD-stack/usr.bin/symorder
- make depend
- make
- cd ../../compile
- make depend
- make
- cd ../../interface
- make depend
- make
- cd ../distdir
The directory FreeBSD-stack/distdir now contains the three files
that are to be copied over to each of the cluster nodes. These
are: (1) Makefile, (2) netsys_mod.o, and (3) cluster_startup.
- Web server compilation and installation :
The Web server source is unmodified with one exception. Since we want
the web server to use the network stack in the loadable module, the
first argument to any socket() call in the source code for the web
server should be changed to 21. The compilation and installation on
the cluster nodes should be done as per the instruction manual for
the web server.
There is an issue involved in the choice of the port at which the
web server application receives requests. Since the Distributor
component and the Server component both listen independently for HTTP
requests, their port numbers should not conflict. Therefore, if any
cluster node contains both the Distributor and the Server component,
the port on which the Server listens should be chosen to be different
from the port on which the Distributor listens. It is usually wise
to enforce this even when no cluster node runs both the Server and the
Distributor components together. For testing this code, the Distributor
received requests on port 8080 while the Server received requests on
port 7080 on any cluster node. Notice that since the TCP Handoff is
transparent to clients, all packets sent from the Server component to
the client after the Handoff appear to have a source port of 8080
(the same as the Distributor).
Follow these steps to test the system.
- Reboot cluster nodes : Reboot all the cluster nodes so
that they run the modified kernel from the
Compiling and Installing section.
- Load kernel module : On each cluster node, go to the directory
where the 3 files from FreeBSD-stack/distdir were copied. Execute
'make load' in this directory with super-user privileges. This
loads the kernel module and installs the second stack in the
kernel.
- Start web server application : On each cluster node that
is to run the Server component, start the web server application.
- Initialize various components in the loadable module and
set up control connections between cluster nodes:
This is done using the cluster_startup program. On each cluster
node, one instance of this program is run and is never terminated
as long as the cluster remains in operation. cluster_startup
is started with one or more of the arguments "-D", "-FE", or "-BE".
These arguments specify whether the corresponding cluster node
is to contain one or more of the Dispatcher, Distributor or Server
components respectively. Additionally, it takes arguments "-fp"
and/or "-bp" that are used to specify the port at which the
Distributor and/or Server component at that node listen.
The node that runs the Dispatcher component should be initialized
first. On the remaining nodes, the cluster_startup program takes
the "-da" argument that specifies the name of the cluster node
that runs the Dispatcher.
While the cluster will work well even if each node has only one
network interface, it is desirable to have at least two interfaces -
one for internal cluster communication and one for communication
with clients. In principle, each cluster node can have many network
interfaces. However, this code only supports internal cluster
communication on one of each node's interfaces. The clients can,
however, send requests to any of the network interfaces on a cluster
node.
An example demonstrates the initialization done by the
cluster_startup program. Let's assume we have 3 nodes; node 1 is
to run the Dispatcher and the other two are to run both the
Distributor as well as the Server components. Assume further that we
want the Distributor to listen at port 8080 (this is the port to
which the clients will send requests) while the web server
application listens at port 7080. Each node has two network
interfaces - let host2-ifc1.cs.rice.edu be the address of the 1st
interface on node 2, and so on. The example below sets up internal
cluster communication between the 2nd interfaces of the nodes (all
these interfaces should have the same subnet address):
- Initializing Dispatcher on node 1:
cluster_startup -D &
- Initializing Distributor and Server on node 2:
cluster_startup -FE -fp 8080 -BE -bp 7080 -da host1-ifc2.cs.rice.edu &
- Initializing Distributor and Server on node 3:
cluster_startup -FE -fp 8080 -BE -bp 7080 -da host1-ifc2.cs.rice.edu &
- Additional web server setup for P-HTTP :
This step is only required if the clients send requests using
HTTP/1.1 P-HTTP connections. The sending of multiple client
requests on P-HTTP connections requires some additional
configuration setup for web servers. This is because additional
requests on P-HTTP connections are fetched using a technique called
back-end forwarding (see [2]) where
content is fetched from remote nodes through NFS mounted
directories.
Let docroot be the root directory from which the web server serves
its documents. On each cluster node running the Server component,
go to the docroot directory and create one additional directory for
each of the other cluster nodes that run Server components. The name
of each such directory corresponds to the IP address of the interface
used for internal cluster communication by that remote cluster node.
The docroot directory of each of the other cluster nodes running the
Server component is then NFS mounted at its corresponding directory
under the docroot of the current cluster node.
In the example above, in the docroot directory of node 2,
a directory should be created corresponding to the IP address
of host3-ifc2.cs.rice.edu. The docroot of node 3 should then
be NFS mounted under this directory. Similarly, under the docroot
of node 3, a directory should be created corresponding to the IP
address of host2-ifc2.cs.rice.edu and the docroot of node 2 be
NFS mounted under it.
- Sending requests from clients :
After initializing the cluster, the clients can start sending HTTP
requests to any of the nodes that run the Distributor component.
Any of the network interfaces can be used, but for better network
utilization, an interface other than the one used for internal
cluster communication should be used. In the example above, the
clients should send requests to host2-ifc1.cs.rice.edu or
host3-ifc1.cs.rice.edu.
To terminate the cluster setup, execute the following steps:
- Kill the cluster_startup program running on each node.
- Stop the web server application running on the cluster nodes
with the Server component.
- Unload the module by executing the command 'make unload'.
- This release does not include the code corresponding to the
layer-4 switch used as the front-end in one of the architectures
described in [1]. Therefore, the request
distribution between all the distributors has to be done using the
DNS round-robin strategy. An experimental lab environment
can make specific clients send requests to specific distributors
in the cluster, thus roughly emulating DNS round-robin.
- [1] describes two mechanisms for fetching
request content from a cluster node other than the one that
establishes a connection first with the client. These mechanisms are
Splicing and TCP Handoff. This code release
only contains the TCP Handoff mechanism.
- The Dispatcher component should only be installed on one of the
cluster machines; a distributed Dispatcher is not supported.
- In a normal HTTP transaction, the connection is closed first by the
server, i.e., the server does the active close. This code is known to
not cleanly handle situations where the client closes the connection
first (e.g., if the user presses a Control-C to shut down the client
program). Therefore, it is imperative that care be taken while
experimenting with this code. If by mistake the client connections
are closed first, then it is best to restart the cluster by rebooting
the machines.
- The implementation of the LARD algorithm distributed in this
release is the one described in [2].
- P-HTTP is supported in this implementation. However, this requires
the Server components to communicate their disk queue lengths to the
Dispatcher component (see [2]). Currently the
code only supports reading the disk queue length from the SCSI disk
driver for the first SCSI disk (sd0). Therefore, for any other type
of disk, P-HTTP is not supported by this code. Support for P-HTTP
also hasn't been stress tested with this implementation.
- The LARD and the WRR implementations distributed with this code
assume that all cluster nodes are equal in speed.
- All internal communication between cluster nodes happens over TCP,
and Nagle's algorithm is turned off by default for every internal
TCP connection. However, for a configuration where the Dispatcher and
Distributor are located on the front-end cluster node and only the
Server component is located on the rest of the cluster nodes (the
configuration used in [3]), it is
recommended that Nagle's algorithm be turned back on for TCP packets
sent to the front-end node in order to improve its
scalability. This can be achieved by using the "-doNagleDisp"
argument to the cluster_startup program on all cluster nodes
other than the front-end.
- This is an alpha release. The code might be unstable under various
experimental conditions and might result in system crashes.
- [1] "Scalable Content-aware Request Distribution in Cluster-based
Network Servers", Mohit Aron, Darren Sanders, Peter Druschel and
Willy Zwaenepoel. In Proceedings of the USENIX 2000 Annual
Technical Conference, San Diego, CA, June 2000.
- [2] "Efficient Support for P-HTTP in Cluster-Based Web Servers",
Mohit Aron, Peter Druschel and Willy Zwaenepoel. In Proceedings
of the USENIX 1999 Annual Technical Conference, Monterey, CA,
June 1999.
- [3] "Locality-Aware Request Distribution in Cluster-based Network
Servers", Vivek S. Pai, Mohit Aron, Gaurav Banga,
Michael Svendsen, Peter Druschel, Willy Zwaenepoel and Erich Nahum.
In Proceedings of the 8th International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS-VIII), San Jose, CA, October 1998.