Scalable Content-aware Request Distribution in Cluster-based Network Servers

Department of Computer Science
Rice University

Introduction

This code implements an architecture for scalable content-aware request distribution in cluster-based servers. It was developed as part of the ScalaServer project under the direction of Prof. Peter Druschel and Prof. Willy Zwaenepoel at the Department of Computer Science, Rice University. Several papers describe the design stages that led to this code; this version corresponds most closely to the design described in [1].

This document briefly describes the overall design and layout of the code, as well as how to compile, install and run it. It does not describe all the code details comprehensively; it is intended only as a guide to give a knowledgeable systems person a quick start. Implementation details have to be gathered by reading the code. This document also assumes that the reader is familiar with the papers listed in the bibliography.

Design

As described in [1], the architecture consists of three basic components: the Dispatcher, the Distributor and the Server. Any component can be placed on any of the cluster nodes, allowing a wide variety of possible cluster designs. For example, by placing the Dispatcher and the Distributor components on one cluster node that acts as the front-end and the Server component on all other cluster nodes, which act as back-ends, the architecture described in [3] can be realized. On the other hand, by placing the Dispatcher by itself on one node and the Distributor and the Server components on the remaining nodes, the scalable architecture described in [1] can be realized. The Dispatcher and the Distributor components are implemented entirely inside the kernel. The Server component spans both user level and the kernel: the user-level part is the unmodified web server application (e.g., Apache), while the kernel part consists of an enhanced network stack. Persistent control TCP connections are established between cluster nodes for internal communication. Any component can communicate either directly with any other component in the cluster or through a TCP Handoff protocol that is implemented as part of the network stack.

The cluster operates as follows. Clients send web requests to one of the Distributor components in the cluster using DNS Round-Robin (see Caveats). The Distributor establishes a TCP connection with the client, reads the HTTP request, sends it to the Dispatcher component (possibly located on a remote node), which makes the request distribution decision, and then performs a TCP Handoff to hand the connection to the cluster node chosen by the Dispatcher. The kernel part of the Server component on the chosen node receives the TCP Handoff, and the web server application then does the rest of the HTTP processing.

Terminology

The code itself follows a slightly different terminology than that used in Design above. First, the Distributor is referred to as the Front-End (FE) throughout the code, because in this code release all nodes having the Distributor component are front-ends to the cluster (see Caveats). Second, the Server component is considered to consist of two parts: the web server application (e.g., Apache) operating at user level, and the kernel code that implements the network stack and supports the TCP Handoff mechanism for the Server component. This kernel part of the Server component is referred to as the Back-End (BE) in the code. This terminology reflects the fact that any back-end node (a node that runs the web server application) must have the Back-End kernel part.

Loadable Module Implementation

In order to facilitate development and debugging, the kernel parts of the code are implemented as a loadable module. Very few changes are made to the actual FreeBSD-2.2.6 kernel; the only changes are those necessary to hook the loadable module into the kernel. Since this project required changes to the network stack to support the TCP Handoff mechanism, the loadable module installs an additional network stack in the operating system rather than modifying the existing one. In other words, once the module is loaded, the operating system runs with two network stacks. The original stack is used for regular networking services like telnet, rlogin, etc. Bugs in the new stack (unless they cause memory violations) therefore leave regular system services running. All it takes to fix such bugs is to unload the module, recompile the fixed code and reload the module.

The network stack implemented by the loadable module was derived from the original FreeBSD-2.2.6 stack. The trick to installing additional network stacks is to install them as a different protocol family. Application programs (such as web servers) choose the network stack they want to use by specifying the protocol family as the first argument to the socket() system call. In FreeBSD, the protocol family for the regular TCP/IP stack is 2 (specified with the macro PF_INET or AF_INET). For the alternate stack implemented in the loadable module, the protocol family is chosen to be 21. Therefore, in order to use a conventional web server (e.g., Apache) with this loadable module, the first argument of any socket() calls made in the server code should be changed to 21. (In the Apache-1.3.3 source this required a change in the file src/main/http_main.c.)
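
For illustration, the following is a minimal sketch of this kind of change; it is not the actual Apache patch, and the function name open_listener and the macro PF_HANDOFF are invented here for readability. The release simply uses the constant 21 in the server's existing socket() calls.

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netinet/in.h>

  #define PF_HANDOFF 21   /* protocol family of the stack installed by the loadable module */

  int
  open_listener(void)
  {
      /* A stock server would create its listen socket with PF_INET (2): */
      /* int s = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP); */

      /* Changing the first argument to 21 selects the alternate stack instead: */
      int s = socket(PF_HANDOFF, SOCK_STREAM, IPPROTO_TCP);
      return s;
  }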

Once the kernel operates with both stacks loaded, the next question is which stack should service an IP packet received at a network interface. The solution is of necessity ad hoc, as both network stacks are equally capable of handling all IP packets. The packet is first presented to the stack in the loadable module, which quickly decides whether it wants the packet; if not, the packet is sent up the original stack. In this implementation, all packets with a source IP address of the form 192.x.x.x arriving on the fxp0 or fxp1 interface are accepted by the stack in the loadable module, with one exception: packets from source 192.168.2.75, like packets from all other sources, are sent up the original stack. (192.168.2.75 was the IP address of our development machine, and we wanted to use the regular stack for that address to ftp the compiled code over to the cluster nodes.) This part of the code is implemented in FreeBSD-stack/netinet/netsys.c and should be modified as necessary.
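
The sketch below illustrates the classification test described above as plain user-level C. It is not the code in netsys.c, which operates on mbufs and ifnet structures inside the kernel; the names accept_in_new_stack and DEV_ADDR are invented here for illustration only.

  #include <sys/types.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <string.h>

  #define DEV_ADDR 0xC0A8024BUL   /* 192.168.2.75 (development machine), host byte order */

  /* Return nonzero if a packet with source address 'src' (network byte order)
   * arriving on interface 'ifname' should be taken by the new stack. */
  static int
  accept_in_new_stack(in_addr_t src, const char *ifname)
  {
      /* Only packets arriving on fxp0 or fxp1 are candidates. */
      if (strcmp(ifname, "fxp0") != 0 && strcmp(ifname, "fxp1") != 0)
          return 0;

      /* Traffic from the development machine stays on the original stack. */
      if (ntohl(src) == DEV_ADDR)
          return 0;

      /* Accept sources of the form 192.x.x.x; everything else goes up the original stack. */
      return (ntohl(src) >> 24) == 192;
  }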

Code layout

This section provides a brief description of most of the files and directories included in this release.

Compiling and Installing

Download the source from the URL given at the beginning of this document. Unzip and untar the tarball.

Running

Follow these steps to test the system.
  1. Reboot cluster nodes: Reboot all the cluster nodes so that they run the modified kernel from the Compiling and Installing section.

  2. Load kernel module: On each cluster node, go to the directory where the three files from FreeBSD-stack/distdir were copied. Execute 'make load' in this directory with super-user privileges. This loads the kernel module and installs the second stack in the kernel.

  3. Start web server application: On each cluster node that is to run the Server component, start the web server application.

  4. Initialize the various components in the loadable module and set up control connections between cluster nodes:
    This is done using the cluster_startup program. On each cluster node, one instance of this program is run and must remain running as long as the cluster is in operation. cluster_startup is started with one or more of the arguments "-D", "-FE", or "-BE", which specify whether the corresponding cluster node is to contain the Dispatcher, Distributor or Server component, respectively. Additionally, it takes the arguments "-fp" and/or "-bp", which specify the ports on which the Distributor and/or Server component on that node listen. The node that runs the Dispatcher component should be initialized first. On the remaining nodes, the cluster_startup program takes the "-da" argument, which specifies the name of the cluster node that runs the Dispatcher.

    While the cluster will work even if each node has only one network interface, it is desirable to have at least two: one for internal cluster communication and one for communication with clients. In principle, each cluster node can have many network interfaces. However, this code supports internal cluster communication on only one of each node's interfaces. Clients can, however, send requests to any of the network interfaces on a cluster node.

    An example illustrates the initialization done by the cluster_startup program. Assume we have 3 nodes: node 1 is to run the Dispatcher, and the other two are to run both the Distributor and the Server components. Assume further that we want the Distributor to listen on port 8080 (the port to which clients send requests) while the web server application listens on port 7080. Each node has two network interfaces; let host2-ifc1.cs.rice.edu be the address of the first interface on node 2, and so on. The commands below set up internal cluster communication over the second interface of each node (all of these interfaces should be on the same subnet):

    1. Initializing the Dispatcher on node 1:
      cluster_startup -D &
    2. Initializing the Distributor and Server on node 2:
      cluster_startup -FE -fp 8080 -BE -bp 7080 -da host1-ifc2.cs.rice.edu &
    3. Initializing the Distributor and Server on node 3:
      cluster_startup -FE -fp 8080 -BE -bp 7080 -da host1-ifc2.cs.rice.edu &

  5. Additional web server setup for P-HTTP:
    This step is required only if clients send requests over HTTP/1.1 persistent (P-HTTP) connections. Serving multiple client requests on a P-HTTP connection requires some additional web server configuration, because additional requests on P-HTTP connections are served using a technique called back-end forwarding (see [2]), in which content is fetched from remote nodes through NFS-mounted directories.

    Let docroot be the root directory from which the web server serves its documents. On each cluster node running the Server component, go to the docroot directory and create one additional directory for every other cluster node that runs the Server component. The name of each such directory corresponds to the IP address of the interface used for internal cluster communication by that remote cluster node. The docroot directory of each remote cluster node running the Server component is then NFS-mounted on its corresponding directory under the docroot of the current cluster node.

    In the example above, in the docroot directory of node 2, a directory should be created corresponding to the IP address of host3-ifc2.cs.rice.edu, and the docroot of node 3 should be NFS-mounted under it. Similarly, under the docroot of node 3, a directory should be created corresponding to the IP address of host2-ifc2.cs.rice.edu, and the docroot of node 2 should be NFS-mounted under it.

  6. Sending requests from clients:
    After initializing the cluster, clients can start sending HTTP requests to any of the nodes that run the Distributor component. Any of the network interfaces can be used, but for better network utilization, an interface other than the one used for internal cluster communication should be used. In the example above, clients should send requests to host2-ifc1.cs.rice.edu or host3-ifc1.cs.rice.edu.

To shut down the cluster, execute the following steps:
  1. Kill the cluster_startup program running on each node.
  2. Stop the web server application running on the cluster nodes with the Server component.
  3. Unload the module on each node by executing the command 'make unload'.

Caveats

Bibliography

  1. "Scalable Content-aware Request Distribution in Cluster-based Network Servers", Mohit Aron, Darren Sanders, Peter Druschel and Willy Zwaenepoel. In Proceedings of the USENIX 2000 Annual Technical Conference , San Diego, CA, June 2000.

  2. "Efficient Support for P-HTTP in Cluster-Based Web Servers" , Mohit Aron, Peter Druschel and Willy Zwaenepoel. In Proceedings of the USENIX 1999 Annual Technical Conference, Monterey, CA, June 1999.

  3. "Locality-Aware Request Distribution in Cluster-based Network Servers" , Vivek S. Pai, Mohit Aron, Gaurav Banga, Michael Svendsen, Peter Druschel, Willy Zwaenepoel and Erich Nahum. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII) , San Jose, CA, October 1998.