Manipulating Lightweight Objects across Heavyweight Boundaries
Joseph S. Barrera III
One Microsoft Way
Redmond, WA 98052-6399
Invocation batching combines multiple object invocations into a single message; result chaining makes results from one batched invocation available to the other invocations batched with it. Invocation chaining, or the combination of invocation batching with result chaining, is the key to allowing lightweight objects to be manipulated efficiently across heavyweight boundaries, whether between machines, between address spaces, or between user and kernel. By reducing the number of boundary crossings, invocation chaining reduces the total cost of invocation, making it more effective than previous solutions such as asynchronous messaging. This paper describes an initial implementation of invocation chaining.
The primary motivation for invocation chaining is the intrinsically high cost of cross-boundary method invocation. Despite heroic efforts to reduce this cost (e.g. [Bershad 90, Draves et al. 91, Barrera 91]), it remains many tens of times more expensive than local invocation, even between address spaces on the same machine. This high cost adversely affects interface design by requiring coarse-grained interfaces when finer-grained interfaces would be more appropriate, often explicitly combining several independent functions in one method. Even the not-terribly-fine-grained interfaces provided by microkernels such as Mach cause performance problems when emulating other operating systems, due to the need to invoke several microkernel methods to implement a single operating system primitive. For example, Unix fork implemented on top of Mach requires calls to create a task, create a thread, set the thread's registers, start the thread running, and so on; for many of these calls, the cost of the boundary crossing dominates the cost of the method actually invoked. As a result, some features, such as system call emulation, are provided directly even when they could be implemented via a combination of other primitives.
We hope that with a sufficiently usable invocation chaining facility, interface granularity will no longer be an issue, removing a major practical barrier to the decomposition of services attempted by the microkernel approach. In particular, we would expect that in the fork example above, only a single batched call to the kernel would be required, which in turn would reduce the cost of the emulated fork by more than a factor of two.
3. Comparison with previous methods
3.1. Invocation Chaining vs. Asynchronous Messaging
Invocation chaining can be compared to asynchronous messaging, in which method invocation results in an immediate send but not an immediate wait for the method's return value. The two are superficially quite similar. Both can be provided via the same language constructs -- coroutines, promises [Liskov 88], and void-valued methods -- and both address the high cost of communication relative to local computation. However, invocation chaining emphasizes reduced boundary crossings, while asynchronous messaging emphasizes reduced latency and increased caller-callee parallelism. We believe that asynchronous messaging is the wrong approach because it fails to take into account the high cost of simply sending a message relative to both the cost of the methods being invoked and the time required to collect several invocations. In particular, whenever the sending cost dominates the method execution and collection costs, the total time required to execute a sequence of invocations will be less with chaining, even when caller and callee can run in parallel on different processors. When caller and callee run on the same processor, the total time will always be less, by the number of boundary crossings avoided multiplied by the cost of a boundary crossing.
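The cost argument above can be made concrete with a toy model. The constants and function names below are illustrative assumptions for exposition, not measurements or interfaces from this paper: asynchronous messaging pays one boundary crossing per invocation, while batching pays one crossing per batch plus a small per-invocation collection cost.

```c
/* Toy cost model contrasting asynchronous messaging with invocation
 * batching. All costs are in arbitrary time units; the specific values
 * used by callers are assumptions for illustration only. */

/* Asynchronous messaging: every one of the n invocations pays a full
 * boundary crossing in addition to the method's own execution time. */
long async_cost(long n, long crossing, long method)
{
    return n * (crossing + method);
}

/* Batching: a single crossing for the whole batch, plus a small
 * per-invocation cost of collecting arguments into the batch buffer. */
long batched_cost(long n, long crossing, long method, long collect)
{
    return crossing + n * (method + collect);
}
```

Whenever `crossing` dominates `method + collect`, the batched total is smaller regardless of any caller-callee parallelism, which is the serial case described above.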
3.2. Invocation Chaining vs. Invocation Batching
In contrast to asynchronous messaging, invocation batching addresses the high cost of messaging operations. Accordingly, invocation batching is often used by systems or applications that generate many small messages, such as the X Window System. However, the drawback of invocation batching (without result chaining) is that an invocation cannot be batched if subsequent invocations depend on its results. In practice, this limits invocation batching's usefulness to domains where return values are absent or are limited to success/failure indications, which usually means output tasks such as displaying text or graphics or adding entries to an event log.
4. An Implementation
We have implemented invocation batching and result chaining on top of Mach IPC and its stub generator, MiG. Invocation batching was implemented by adapting existing asynchronous messaging constructs and layering them on top of a message batching scheme. Result chaining was implemented by taking one of the asynchronous messaging constructs -- promises -- and using it to represent dependencies between batched invocations, for interpretation by the callee.
4.1. Invocation batching
For invocation batching to be effective, a program must be able to generate invocations without blocking for them immediately so that there are multiple invocations to batch. Our implementation provides three constructs for doing so: void-valued methods, promises [Liskov 88], and coroutines. As mentioned previously, these constructs are traditionally used for asynchronous messaging, always resulting in a message send operation when invoked. In contrast, these constructs in our implementation only add a message to an outgoing buffer, with the actual send delayed until the program blocks waiting for results.
Void-valued methods, or methods with neither a return value nor out parameters, are the easiest non-blocking invocation construct to implement. In asynchronous messaging implementations, they result directly in a message send, which generally involves a trap into the kernel and a possible transfer of control to another address space. In our implementation, a void-valued method simply results in the copying of the method's parameters into a batching buffer in the same address space, with no traps or control transfers. A more sophisticated implementation could marshal the method's parameters directly into the batching buffer, eliminating the copy as well.
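A minimal sketch of such a batching buffer follows. The record layout and function names are hypothetical, not the paper's actual interface; the point is that a void-valued invocation is just a copy into a local buffer, with no trap or control transfer.

```c
/* Sketch of a per-address-space batching buffer for void-valued
 * methods, assuming each record is a method id followed by its raw
 * parameter bytes. Layout and names are hypothetical. */
#include <string.h>

#define BATCH_CAPACITY 4096

struct batch {
    char   buf[BATCH_CAPACITY];
    size_t used;
};

/* A void-valued invocation only appends to the buffer; the actual
 * message send is deferred until the program blocks for results. */
int batch_invoke(struct batch *b, int method_id,
                 const void *params, size_t len)
{
    if (b->used + sizeof method_id + len > BATCH_CAPACITY)
        return -1;              /* caller must flush (send) first */
    memcpy(b->buf + b->used, &method_id, sizeof method_id);
    b->used += sizeof method_id;
    memcpy(b->buf + b->used, params, len);
    b->used += len;
    return 0;
}
```

A stub generated for a void-valued method would call `batch_invoke` with its marshalled parameters and return immediately to the caller.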
Coroutines provide multiple threads of control, allowing one thread to block waiting for return values from a method without preventing other threads from generating other invocations. Our implementation provides a very lightweight coroutine implementation solely to allow a program to use batching for methods that return values. For example, if a program wants to perform some operation on a collection of remote objects, it spawns a separate coroutine for each object, and each coroutine then invokes the appropriate remote method on its associated object. Only when all coroutines are blocked waiting for replies does our implementation send the batched requests. When the replies return (usually also batched in a single message), each coroutine is resumed to process the return value and either participates in a second round of calls or simply exits.
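The flush rule implied above (send only when every coroutine is blocked on a reply) can be sketched as follows. Actual stack switching is omitted; coroutine states are tracked explicitly, and the names are ours, not the implementation's.

```c
/* Sketch of the batch-flush rule: the batched message is sent only
 * once no coroutine can still add invocations and at least one is
 * awaiting a reply. Coroutine mechanics themselves are omitted. */
enum costate { RUNNABLE, BLOCKED_ON_REPLY, DONE };

int should_flush(const enum costate *s, int n)
{
    int awaiting = 0;
    for (int i = 0; i < n; i++) {
        if (s[i] == RUNNABLE)
            return 0;           /* someone may still batch more calls */
        if (s[i] == BLOCKED_ON_REPLY)
            awaiting = 1;
    }
    return awaiting;            /* flush only if a reply is awaited */
}
```

Under this rule a program that spawns one coroutine per remote object naturally accumulates all of their invocations into one message before any send occurs.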
There are several reasons that we provided our own coroutine implementation instead of using an underlying thread mechanism (in our case, Mach's cthreads). First, we wanted to absolutely minimize the cost of coroutine creation to make it practical to create a coroutine for every invocation. Second, we wanted coroutine semantics (e.g., no preemption), again to simplify locking and reduce invocation cost. Finally, coroutines make it easier to determine when to stop collecting batched invocations; if we waited for all underlying threads to block waiting for replies, we might never send the batched messages, since some threads may be performing unrelated tasks and thus may never block waiting for a reply.
A final way to keep the caller of a method from blocking is to represent return values as promises. When an actual value is required, the caller can evaluate the promise; if the value still is not available (e.g., because the reply message carrying the value has not yet arrived), then the caller blocks until it is. In any implementation of promises, the promise must remember which reply message it is waiting for and where in that message to retrieve its value; once the reply message has arrived, the promise can simply point to or contain the actual value. In our implementation, evaluating a not-ready promise blocks the evaluating coroutine, and only when all other coroutines are blocked is the associated message sent. Note that promises provide batching even when coroutines are not used, since a single caller can accumulate promises from several invocations before evaluating one of them. The real benefit of promises in our implementation, however, is as a way of specifying result chaining.
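The representation just described can be sketched as a small struct: a pending promise records which reply message carries its value and where in that message the value lives, and a fulfilled promise holds the value itself. The field and function names are ours, chosen for illustration.

```c
/* Hypothetical promise representation, following the description in
 * the text: either a pending (message id, offset) pair or a resolved
 * value. Names and layout are illustrative. */
#include <stdbool.h>

struct promise {
    bool ready;
    int  msg_id;    /* which reply message carries the value */
    int  offset;    /* where in that message the value lives */
    long value;     /* valid once ready is true               */
};

/* Called when the awaited reply arrives: record the value. */
void promise_fulfill(struct promise *p, long value)
{
    p->ready = true;
    p->value = value;
}

/* Evaluating a ready promise is just a load. A real implementation
 * would instead block the evaluating coroutine (flushing the batch)
 * whenever !p->ready. */
long promise_value(const struct promise *p)
{
    return p->value;            /* assumes p->ready */
}
```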
4.2. Result chaining
We implemented result chaining by expanding the nonblocking behavior of promises: in contrast to previous implementations of promises, our promises do not block when they are used as parameters in subsequent invocations. Thus to chain the results from one invocation into another, one passes promises from the first as parameters to the second. When buffering the parameters for the second invocation, the implementation recognizes the parameters as promises, and instead of naively evaluating them (which would force a wait for the first method and thereby prevent any chaining from occurring), it copies a description of the promise into the buffer. When the callee eventually receives the batched invocations and evaluates the first invocation, it notices the promise description and therefore copies the return parameters not only into the reply buffer for the first invocation but also into the request buffer for the second, overwriting the promise description.
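The callee-side patching step can be sketched as follows, assuming each buffered parameter slot is tagged as either an immediate value or a promise reference. The tagged-slot layout and names are hypothetical; the paper does not specify the wire format.

```c
/* Sketch of server-side result chaining: after executing an
 * invocation, overwrite every later parameter slot that references
 * the produced promise with the actual result. Layout hypothetical. */
#include <stddef.h>

enum slot_kind { SLOT_VALUE, SLOT_PROMISE };

struct slot {
    enum slot_kind kind;
    long           value;   /* immediate value, or promise id */
};

/* `result` was just produced for promise `promise_id`; patch the
 * remaining request slots, replacing promise descriptions in place. */
void chain_result(struct slot *slots, size_t n,
                  long promise_id, long result)
{
    for (size_t i = 0; i < n; i++) {
        if (slots[i].kind == SLOT_PROMISE &&
            slots[i].value == promise_id) {
            slots[i].kind  = SLOT_VALUE;
            slots[i].value = result;
        }
    }
}
```

The same result is also copied into the reply buffer for the first invocation, so the caller's promise is fulfilled when the batched replies return.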
5. Conclusions and Future Work
Our initial implementation verified the feasibility of invocation chaining. Simple benchmarks show good results: chained invocations, at 12 microseconds on a 25 MHz 486DX2 PC running Mach 3.0, are ten times cheaper than unchained invocations and over 300 times cheaper than invocations over Ethernet between two such PCs. Future work will consist primarily of verifying the usability of the provided constructs in real systems, such as in the fork-emulation example described above, and in porting the invocation chaining implementation to the Microsoft Cairo environment, to see how invocation chaining can be combined with the Cairo object model.