Mach Microkernel Support
For Distributed Memory Multiprocessors
Joseph S. Barrera III
One Microsoft Way
Redmond, WA 98052-6399
Microkernel technology can be used to support applications and Unix-like operating systems on distributed memory multiprocessors. To demonstrate this point, we have implemented a distributed Mach kernel that runs on the Intel iPSC/860 hypercube and on collections of networked workstations. The complete Mach interface, including tasks, threads, shared memory, and external pagers, is supported, allowing existing applications and servers, including a 4.3 BSD Unix server, to run without modification. The resulting system is semantically indistinguishable from a shared-memory multiprocessor Mach system; in particular, when a Unix server is run on a distributed kernel, processes on the network see a single view of Unix services, as they would on a shared memory multiprocessor. Good distributed kernel performance has been achieved through high level optimizations such as copy-on-reference and copy-on-write, as well as lower level optimizations such as removing scheduling operations from IPC fast paths. Good server performance follows from good distributed kernel performance and from server optimizations already implemented for nondistributed kernels; these optimizations include local processing of some system calls and the use of mapped files in place of remote procedure calls to a file server.
Message-based interprocess communication performance is critical in any Mach system since it is the primary mechanism used by tasks (applications or servers) to communicate with each other. IPC performance is even more critical in a distributed Mach system since it is used internally to implement other Mach features such as virtual memory support. We have achieved good IPC performance by optimizing the two most important cases: remote procedure calls with small amounts of data, and messages carrying a page of data.
Most remote procedure calls carry small amounts of data in both the request and the reply, and are answered relatively quickly. The dominant cost of such RPCs is therefore kernel overhead; in particular, scheduling operations can add considerably to this overhead. In our original implementation, when a user thread performed an RPC, it would block after sending the request; a kernel thread dedicated to processing incoming messages would then wake the user thread when its reply arrived. Unfortunately, this design introduces two sleep/wakeup pairs on each side of the RPC, one for the user thread and one for the kernel thread. Eliminating the kernel thread and moving message processing functions into the network interrupt handler eliminated one sleep/wakeup pair. We eliminated the other pair in most cases by allowing the user thread to spin until the reply message arrives or until another thread becomes runnable. This avoidance of the scheduler when performing message operations parallels handoff scheduling in a nondistributed kernel, where the client avoids the scheduler by handing off the processor directly to the server. The importance of avoiding general purpose scheduling operations has been observed in other work, such as the Firefly[FireflyRPC] and Amoeba[AmoebaRPC] RPC systems.
Beyond small messages that compose remote procedure calls, the most common type of messages in a Mach system are messages carrying a page of data. Such messages are generated during page-in and page-out and as replies to device and file requests. The distributed Mach kernel uses virtual memory mapping to avoid copying large amounts of data. When a page of data is sent, it is made temporarily nonpageable and the network interface is pointed to the corresponding physical page; when a page is received, it is received into an anonymous physical page which is then mapped into the receiver's address space. Avoiding data copies is particularly important when using networks that approach memory speeds, where two extra data copies can triple the time required to sent a message.
Mach supports a rich set of operations for managing virtual memory that provides efficient support for copying, caching, and sharing data. Server and application performance depends critically on efficient implementation of these operations.
Mach supports efficient address space copies through the use of lazy evaluation. In nondistributed Mach, when an address space is copied, the corresponding physical pages are marked read-only and are then shared by parent and child. Page copies are then performed only when the parent or child attempts to modify a page.
Distributed Mach combines copy-on-write with copy-on-reference to provide lazy evaluation of address space copies across a network. In pure copy-on-reference, as implemented in Accent[Zayas], pages are copied to the child's node only when referenced by the child. While pure copy-on-reference does reduce the number of page copies, it can result in the same page being copied multiple times when a process, such as a shell, forks repeatedly onto another node. To handle this case, Mach combines copy-on-reference with copy-on-write by only copying those pages that are not already resident and unmodified on the copied-to node.
Mach's external pager facility allows user processes to handle page-in and page-out requests, thus allowing virtual memory to be used as a cache for user (or server) provided data, such as file data. An external pager allocates a port, distributes it to clients, and then listens on the port for page-in and page-out requests; clients then use the vm_map call to tell the kernel to allocate a region of memory in their address spaces backed by the external pager. Once an external pager supplies a page, all clients see a consistent view of the data in the page, which can be accessed without subsequent messages to the external pager. The page is removed from physical memory only upon normal pageout or upon explicit request by the external pager. The external pager can also ask that pages be written while remaining resident; this could be used by a file server to flush data to disk (e.g., to implement the sync Unix system call).
In nondistributed Mach, when multiple kernels map a memory object provided by an external pager, the external pager is responsible for handling read/write consistency among the different kernels. However, since managing such consistency is difficult, most pagers (particularly most file system pagers) simply disallow more than one kernel from using an external pager at a time. Normally this is not a problem, since a file server typically only provides service to its local kernel.
The distributed Mach kernel thus handles read/write consistency among different nodes, using a single writer/multiple reader protocol as used by shared virtual memory systems. The external pager does not have to manage consistency and in fact believes that it is mapped by a single kernel.
Finally, shared memory is provided by the Mach interface as a by-product of task creation; if the parent task has marked a region of its memory as inherit-shared, then its child task will share that region with its parent instead of receiving a copy. Distributed Mach actually uses the distributed external pager support to implement inherited shared memory, since the external pager support already provides read/write consistency among nodes. Thus at task creation time, an external pager is created for each inherit-shared region, which is then mapped into both parent and child.
The two main techniques used to achieve good operating system service performance in nondistributed Mach, emulation libraries and mapped files, are directly applicable to distributed Mach. Emulation libraries give a process a chance to satisfy a system call request locally instead of having to performing an RPC to the operating system server. Mapped files, built on the Mach external pager facility, allow processes to directly use the virtual memory system as a cache, avoiding a message if the data is local.
When the process makes a system call, control is transferred to the emulation library, which lives in the process's address space and is executed in user mode; the emulation library then does one of three things. In the best case, the emulation library performs the system call by itself, avoiding any RPC overhead. In the second case, the emulation library translates the system call directly into an equivalent Mach operation; for example, sbrk may be translated into vm_allocate. This results in a RPC to the kernel, which is less expensive than an RPC to the server, and does not increase server or network load. In the final case, the emulation library is forced to make an RPC to the appropriate server to satisfy the system call.
The use of mapped files allows file operations to be performed locally by the emulation library instead of by an RPC to the file server. When a file is opened, the file server provides the emulation library with a port for an external pager for the file; this port is then mapped into the process's address space. File reads and writes are then translated into reads and writes to the mapped area. These operations only translate into messages when a page of the file is not resident in memory, in which case a page-in request is delivered to the external pager for the mapped file.
Mapped files work identically in the distributed kernel case, except that the distributed kernel will only send a page-in request to the file server when the page is not resident in any node's physical memory. The distributed kernel thus allows the entire distributed memory of a distributed memory multiprocessor to be managed as a single cache.
One of the great benefits of this approach is the relatively small amount of effort required to provide distributed operating system functionality. Since the distributed kernel support consists of relatively small and machine independent extensions to the standard Mach kernel (and no changes to the Unix server), we have been able to easily track ongoing performance improvements and bug fixes made to the Mach kernel.
While unmodified operating system servers work with reasonable performance, there are some simple modifications that could be made to increase performance. For example, process creation requires six or seven Mach operations that require remote RPCs to the node upon which the process is to be created; one could instead install a small server on each node, which would in response to a single RPC perform these six or seven operations locally.
[UnixEmulation] D. Golub, R. Dean, A. Forin, and R. Rashid, Unix as an Application Program, Proceedings of the Summer 1990 USENIX Conference, June 1990, pp. 87-95.
[FireflyRPC] M. D. Schroeder and M. Burrows, Performance of Firefly RPC, ACM Transactions on Computer Systems, February 1990, pp. 1-17.
[AmoebaRPC] R. van Renesse, H. van Staveren, and A. S. Tannenbaum, Performance of the World's Fastest Distributed Operating System, Operating Systems Review, vol. 24, no. 4, October 1988, pp. 25-34.