The Shared Memory Server

 

Alessandro Forin, Joseph S. Barrera III, Richard Sanzi

 

Department of Computer Science

Carnegie Mellon University

Pittsburgh, Pennsylvania 15213

 

Abstract

This paper describes the design and performance evaluation of a virtual shared memory server for the Mach operating system, which extends Unix with support for distributed shared memory. In this respect, it subsumes standard facilities such as Unix System V shared memory. The server runs in user mode and provides sharing of read/write memory between processes, regardless of their machine location. A number of memory coherency algorithms have been implemented and evaluated, including a new distributed algorithm that is shown to outperform centralized ones. Among the novel features of the server are support for machines with multiple page sizes and for heterogeneous processors. Performance measurements of the server and of some applications are presented, and the intrinsic costs evaluated.

Introduction

Shared memory multiprocessors are becoming increasingly available, and with them a faster way to program applications and system services via the use of shared memory. Currently, the major limitation of shared memory is that it does not extend across the network, and therefore it is not suited for building distributed applications and services. Example uses of a distributed shared memory facility include file systems, process migration, databases, parallel languages like Ada or Multilisp, and systems for parallel and distributed programming @cite[camelot, agora, texas]. More motivation for a distributed shared memory facility comes from the increasing interest that hardware designers show in non-shared memory multiprocessors: the Nectar project @cite(nectar) at CMU, for instance, uses fast fiber optic links. This will reduce the end-to-end time to send a 1 kilobyte message from the tens of milliseconds range of the current ethernet to the tens of microseconds range of the fiber.

The Mach virtual memory system allows the user to create memory objects that are managed by user-defined processes, called external pagers in @cite(mwyoung). An external pager is a process responsible for providing data in response to page faults (page-in) and backing storage for page cleaning (page-out) requests. This is precisely the function of the in-kernel disk pager. The only difference is that the user-specified pager process can manage the data in more creative ways than the designer of the in-kernel pager may have envisioned. This paper describes the design and performance evaluation of one such memory server, which provides shared memory semantics for the objects it manages. The server provides unrestricted sharing of read/write memory between processes running either on the same machine or on different machines. In the first case, all processors have direct access to common physical memory (architectures with Uniform Memory Access time (UMA) or Non-Uniform Memory Access time (NUMA)) and the server provides flexible management of shared memory. In the second case, processors do not have any way to access common physical memory (architectures with No Remote Memory Access (NORMA)) and the server provides it as virtual shared memory, migrating and replicating virtual memory pages between processors as needed.

To understand the properties of a distributed memory facility the performance characteristics of the server itself and of various application programs have been evaluated. To measure the effects of different page management policies, a number of memory scheduling algorithms have been implemented and evaluated, including a new distributed algorithm that outperforms centralized ones. Some of the features of the algorithms described include support for machines with different page sizes and for heterogeneous processors. The algorithms service page faults on multiple machines by migrating read-write pages, replicating read-only pages, scheduling conflicting or overlapping requests appropriately and tagging and translating memory pages across incompatible processors. The experiments with application programs were designed under the assumption that the amount of information that is exchanged in each synchronization step is the key factor. Applications at the extreme of the spectrum have been analyzed in detail.

Shared Memory Within a Machine

The first goal of the server is to provide sharing of read/write memory between processes running on the same machine. This overcomes the constraint of the standard Mach memory inheritance mechanism that the shared memory must have been allocated by some common ancestor, as well as a security check in the implementation of the Unix exec(2) system call that deallocates all of the process's address space. The server provides the user with a call to create memory objects, logical pieces of memory that are identified by ports. A memory object can be used by a process in a call to the vm_map kernel primitive, which maps some portion of the object into the process's address space at some virtual address. Note that since a port can only be transmitted in a message, memory objects are entities protected by the kernel. Note also that access to ports can be transmitted over the network, and therefore the vm_map primitive allows for networked shared memory.

The process can access the memory normally, and the kernel delegates the paging duties to the user-level memory manager (external pager) that is responsible for the memory object. This is done via an asynchronous message interface between the pager and the kernel which is described in more detail in @cite(mwyoung). The external pager interface allows pagers to control the managing of main memory by the kernel, so that main memory effectively acts as a common cache for memory objects. The various operations therefore have the flavor of cache control functions: when a process first accesses a page it takes a page fault, and the kernel sends to the pager a memory_object_data_request message to request the missing page, just like in a cache miss. The server provides the page in a memory_object_data_provided message. Other messages allow a pager to request a page flush or specify the caching and copy policies for the object. The following tables informally list the messages and procedures defined by the external pager interface and by the shared memory server.

 

From user to pager

create_memory_object( initial size )                RPC
    Creates a new memory object and returns the associated port.

memory_object_replicate( object )                   RPC
    When using the distributed pager, creates a local copy of the memory object.

memory_object_tag( tag, page range )                RPC
    When using heterogeneous processors, assigns a type tag to a portion of the memory object.

 

From user to kernel

vm_map( task, memory_object, address range, attributes )    RPC
    Maps an object into a task's address space.

vm_deallocate( task, address range )                         RPC
    Removes all mappings for the given address range.

 

From kernel to server

memory_object_init( pager, control port )           MSG
    Contacts the pager of an object that is mapped for the first time, for the initial handshake.

memory_object_data_request( page range, protection ) MSG
    Request for a page (or range of pages) that the kernel does not have in its cache.

memory_object_data_unlock( page range, protection )  MSG
    Request to upgrade the protection (e.g. from read-only to read/write) of pages the kernel already holds in its cache.

memory_object_data_write( page range, pages )        MSG
    Pageout of dirty pages from main memory.

memory_object_lock_completed( page range )           MSG
    Notification that a previously requested paging (lock) operation has completed.

memory_object_terminate( )                           MSG
    Notification that the object has been removed from the cache.

 

From server to kernel

memory_object_set_attributes( attributes )           MSG
    Confirms availability, completing the initial handshake, and specifies the object's initial attributes.

memory_object_data_provided( page range, pages )     MSG
    Provides data to the cache for the pages in range.

memory_object_data_unavailable( page range )         MSG
    Informs the kernel that the pager has no data for the pages in range, which are to be zero-filled.

memory_object_lock_request( object, request, reply_port )    MSG
    Cache control request, e.g., a page flush or the granting of write access.
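
To give a concrete picture of how these messages fit together, the sketch below shows a skeletal dispatch loop for a user-level pager. It is only a sketch: the request_t structure, the receive_request() helper, and the comments naming the replies are illustrative assumptions, not the actual MIG-generated Mach interface.

@begin(programexample)
/* Schematic pager main loop (not the real Mach/MIG interface). */
typedef enum { INIT, DATA_REQUEST, DATA_UNLOCK, DATA_WRITE,
               LOCK_COMPLETED, TERMINATE } request_kind_t;

typedef struct {
    request_kind_t kind;
    /* page range, protection, data, reply port, ... */
} request_t;

extern void receive_request(request_t *req);    /* assumed IPC helper */

void pager_loop(void)
{
    request_t req;

    for (;;) {
        receive_request(&req);
        switch (req.kind) {
        case INIT:           /* reply with memory_object_set_attributes     */ break;
        case DATA_REQUEST:   /* cache miss: reply with data_provided        */ break;
        case DATA_UNLOCK:    /* protection upgrade: reply with lock_request */ break;
        case DATA_WRITE:     /* pageout: keep the incoming dirty pages      */ break;
        case LOCK_COMPLETED: /* a requested flush or clean has finished     */ break;
        case TERMINATE:      /* the object was dropped from the cache       */ break;
        }
    }
}
@end(programexample)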

A Simple Algorithm

The shared memory server has been structured in an object-oriented fashion, so that it is possible to have memory objects with different behaviors. When a memory object is mapped by processes on multiple machines, the pager needs to manage multiple copies of memory pages in some coherent way. The various management policies for objects are provided by different implementations of a set of operations. An implementation is called a fault scheduler in the following, because the goal of the module is to schedule read and write faults on different kernels in the best way, just as ordinary schedulers schedule the execution order of various processes. One of the many reasons for this choice is to allow experimentation with various algorithms and heuristics. At object creation time, a user can choose which specific scheduling policy will be applied to the new object, or rely on the default one. All the algorithms we describe maintain strict memory coherence on the objects they manage, i.e. there is no stale data because at any given time there is only one version of a page.

This section describes a very simple scheduler that provides centralized, single page-size objects. There is only one pager process for each memory object, but different objects might be allocated to different pager processes to reduce service contention. Since Mach IPC is location transparent, the location of the pager process is also transparent to the client kernels. A later section will describe how this algorithm is modified to allow distributed, coordinated management of a single object between separate pagers on different machines. Ownership of a page is transferred among kernels as needed: the owner of the page is the kernel that currently has write access to the page. When no kernel has write access to a page the scheduler itself is the owner, and multiple kernels are allowed to have read-only copies of the page. The simple scheduler's algorithm is an automaton with four per-page states, which correspond to the four conditions in which a page can be:

Read: no kernel has write access; the scheduler owns the page and any number of kernels may hold read-only copies.

Write: one kernel owns the page and has write access to it.

ReadWait: the current owner has been asked to flush the page so that pending read faults can be satisfied.

WriteWait: one or more write faults are queued, waiting for the current owner or the current readers to give up the page.

Transitions between states are driven by the requests made by client kernels. In practice, not all requests make sense in all states; for example, a kernel will not page out a page that has not been modified. The server accepts four input message types (requests): memory_object_data_request, memory_object_data_unlock, memory_object_data_write and memory_object_lock_completed. The scheduler handles them in three procedures: read_fault(), write_fault() and pageout().

These three functions do all the necessary work. Pseudo-code descriptions of how they operate on a page appear below. It can be assumed that all procedures keep the page locked and that messages are processed in the order of arrival. This pseudo code will be used again later to describe the distributed algorithm. The remaining procedures are either for initialization, termination, or recovery from kernel crashes. The pseudo code indicates that writers are queued in FIFO order, while readers do not need to be ordered. Writers take precedence over readers. Other, possibly more complicated policies might be needed, for instance to deal with multiple page sizes, or to avoid reader starvation.
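
The pseudo code manipulates a small amount of per-page bookkeeping. A minimal C rendition of that record, with field names taken from the pseudo code and container types assumed only for illustration, is:

@begin(programexample)
/* Sketch of the per-page record used by the pseudo code below;
 * the container types and MAX_CLIENTS are illustrative assumptions. */
#define MAX_CLIENTS 32

typedef int client_t;           /* a port naming a kernel or a fellow pager */

typedef struct { client_t member[MAX_CLIENTS]; int count; }     set_t;
typedef struct { client_t item[MAX_CLIENTS]; int head, tail; }  queue_t;

typedef enum { Read, Write, ReadWait, WriteWait } page_state_t;

typedef struct page {
    page_state_t state;         /* one of the four states above */
    client_t     owner;         /* holder of write access, or owner_self */
    set_t        readers;       /* clients holding or awaiting read-only copies */
    queue_t      writers;       /* clients waiting for write access, FIFO */
} page_t;
@end(programexample)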

@begin(figure)
@begin(programexample)
read_fault(page, kernel)
{
    switch ( page->state ) {
    case Read:
        memory_object_data_provided(kernel)
        break
    case Write:
        page->state = ReadWait
        memory_object_lock_request(page->owner, FLUSH(page), owner_self)
        break
    default:
        /* just enqueue */
    }
    set_add(page->readers, kernel)
}
@end(programexample)
@caption(The Simple Scheduler: read_fault())
@tag(spsfunc1)
@end(figure)

@begin(figure)
@begin(programexample)
write_fault(page, kernel)
{
    switch ( page->state ) {
    case Read:
        set_remove(page->readers, kernel)
        forall( readers )
(1)         memory_object_lock_request( reader, FLUSH(page), owner_self )
        page->readers = empty_set
(2)
        page->state = Write
        page->owner = kernel
        if (needs_data)
            memory_object_data_provided( page->owner )
        else
            memory_object_data_unlock( page->owner )
        break
    case Write:
        memory_object_lock_request(page->owner, FLUSH(page), owner_self )
        /* fall through */
    case WriteWait:
    case ReadWait:
        page->state = WriteWait
        enqueue( kernel, page->writers )
    }
}
@end(programexample)
@caption(The Simple Scheduler: write_fault())
@tag(spsfunc2)
@end(figure)

@begin(figure)
@begin(programexample)
pageout(page, kernel, data)
{
(3) switch( page->state ) {
    case Read:
        return                      /* never happens */
    case Write:
        save(data)                  /* true pageout */
        page->state = Read
        page->owner = owner_self
        break
    case WriteWait:
(4)
        save(data)
        page->owner = dequeue( page->writers )
        memory_object_data_provided( page->owner )
        if (!page->writers)
            if (page->readers)
                page->state = ReadWait
            else
                page->state = Write
        if (page->readers || page->writers) {
            deschedule_myself()
            memory_object_lock_request( page->owner, FLUSH(page), owner_self)
(5)
        }
        break
    case ReadWait:
        save(data)
        forall(readers)
            memory_object_data_provided(reader)
        page->state = Read
(6)     page->owner = owner_self
    }
}
@end(programexample)
@caption(The Simple Scheduler: pageout())
@tag(spsfunc3)
@end(figure)

An example will help clarify the following discussion. Since all the processes on one machine use the same copy of the memory object's pages (cache copy, possibly mapped into the various address spaces with different protections), we can pretend there is a single process per machine. Let us assume that the process makes a read access to a page. The page is not in the cache, hence the kernel sends a memory_object_data_request message to the pager. If the page is in Read state (the initial state), the server immediately sends the page in a memory_object_data_provided message, with read-only protection. If the process makes a subsequent write access, the kernel sends a memory_object_data_unlock message to request a protection upgrade which will be granted in a memory_object_lock_request message, unless the page has changed state in the meantime. If the page is not in Read state, the kernel's request is enqueued and possibly the current writer is asked to page out the page via a memory_object_lock_request message. When the page is actually paged out, the pageout procedure dequeues the next write access request and satisfies it, or satisfies all read requests at once.

Multiple Page Sizes

The simple scheduler described above can only be used by machines with the same page size, an unpleasant restriction. Moreover, in Mach the size of a virtual page can be changed and set even on a per-machine basis. Transforming a single page size scheduler into a multiple page size scheduler is not trivial. Our multiple page size scheduler solves the problem by two means: it manages the object internally in pages of a single, small size, so that a fault on a larger kernel page becomes a request for the corresponding range of server pages, and it locks the entire range for the duration of the operation so that conflicting or overlapping requests are serialized.

Locking is accomplished via a queuing mechanism. Some complications arise from the requirements of avoiding false contention, of not descheduling kernels until absolutely necessary, and of satisfying requests as quickly as possible while maintaining fairness. A description of this mechanism can be found in @cite(npsreport).
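
As a sketch of the range arithmetic this implies, assuming (purely for illustration) that the server's internal page size evenly divides each kernel's page size:

@begin(programexample)
/* Cover a kernel page fault with a range of smaller server pages.
 * The function and its types are an illustrative sketch, not the
 * server's actual internal interface. */
typedef struct { unsigned long first; unsigned long count; } page_range_t;

page_range_t
covering_range(unsigned long fault_offset,      /* byte offset into the object */
               unsigned long kernel_page_size,  /* faulting kernel's page size */
               unsigned long server_page_size)  /* assumed to divide the above */
{
    page_range_t r;
    unsigned long start = fault_offset - (fault_offset % kernel_page_size);

    r.first = start / server_page_size;
    r.count = kernel_page_size / server_page_size;
    return r;
}
@end(programexample)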

Heterogeneous Processors

Parallel programs that use a distributed shared memory facility should not be constrained to run on a uniform set of processors. Such a constraint is undesirable because, as the number of machines available at a given site increases, one typically observes an increased variation in their types as well. Unfortunately, interfacing heterogeneous processors not only creates the problem of potentially different page sizes, but also raises the issue of different machine representations of data objects. This problem goes beyond the byte order problem, since different processors are free to assign any given meaning to any given sequence of bits. A clear example is the case of floating point numbers.

The problem can be separated into two sub-problems: hardware data types (e.g. integers) and software data types (e.g. C records). A general-purpose server solves the problem for the first class of types, and can be extended to cope with the second class. Quite simply, our pager assigns a type tag to each segment of a paging object and makes the appropriate translation (if necessary) when sending data from that segment to a kernel. The interface with the application program is defined by the memory_object_tag RPC from the client to the pager, which assigns a type tag to a segment. This operation is typically used by a dynamic memory allocator to fragment shared memory into typed segments, each segment containing only data of the given type. The standard Unix BSD malloc(3) memory allocator for C was modified to allocate typed data, as exemplified in Figure @ref(hetmalloc). Although different types cannot be mixed in a structure, one can always resort to a level of indirection, building records that only contain pointers to data.

@begin(figure)
@begin(programexample)
extern char *tmalloc( type_tag, num_elements )
    enum { t_int8, t_int16, t_int32, t_float32, ... } type_tag;
    unsigned long int num_elements;

#define malloc_short(n) (short*)tmalloc( t_int16, n)
...
@end(programexample)
@caption{A Typed @i[malloc()]}
@tag(hetmalloc)
@end(figure)

 

A Distributed Algorithm

The motivations for a distributed algorithm are manifold. A centralized server is a solution that does not scale up. When many kernels share many memory objects serviced by the same pager, the availability of each object decreases, because the pager becomes the bottleneck where all requests pile up. Even when few kernels are involved, the location of the server is important because local and remote messages might have very different costs. A distributed solution that can allocate any number of servers on any number of machines is more usable. In this way the sharing of memory between processes located on the same (multi)processor is decoupled from unrelated events on other machines.

The approach is simple: @i[treat each remote server just like another kernel, and apply the algorithm of the centralized case]. The reader may wish to go back to Figures @ref(spsfunc1), @ref(spsfunc2) and @ref(spsfunc3) and review the algorithm substituting the word "kernel" with "client", which now means either a kernel or (more likely) a fellow server. A pager will now accept a @i[memory_object_lock_request()] message just like a Mach kernel does and treat it as a fault notification, invoking read_fault() or write_fault() as appropriate. A @i[memory_object_data_provided()] message is handled by the pageout() procedure.

Note now that the notion of the "owner" that each pager has does not need to be exact at all times. It is quite possible, actually highly desirable, that a pager be able to ask a second pager to transfer a page directly to a third one who needs it, without handling the page directly. We call this optimization @b(forwarding), to catch both the positive effect of avoiding one message hop, and the (minor) negative effect of producing a new type of activity: the act of forwarding a mis-directed page fault message to the correct destination. Implementing forwarding requires relatively simple changes to the centralized algorithm.


@comment( <<<< Description of Forwarding >>>> )

@begin(figure)
@begin(programexample, size -2)

(1) memory_object_lock_request( reader, FLUSH(page),
        is_server(page->owner) ? kernel : owner_self)

(2) if (page->owner != owner_self) {
        memory_object_lock_request(page->owner, WRITE_FAULT(page), owner_self)
        enqueue(page->writers, kernel)
        page->state = WriteWait
        return
    }

(3) if (kernel != page->owner && !hinted(page))
        page->owner = kernel
    hinted(page) = FALSE

(4) if (!page->writers) {
        page->owner = owner_self
        goto ReadWait
    }

(5) if (is_server(page->owner))
        page->state = WriteWait     /* pretend */

(6) if (!is_server(kernel))
        page->owner = owner_self

@end(programexample)
@caption(Modifications to the Distributed Scheduler to Implement Forwarding of Page Faults)
@tag(forward1)
@end(figure)

 

Figures @ref(forward1) and @ref(forward2) illustrate the changes and additions to the pseudo code to implement forwarding. A pager creates a local copy of a memory object when a user asks for it. The initial state of all pages in this case is the @i(Write) state, and the owner is the pager from which the object has been copied. Of course, no real copy is actually done. Note that it is possible to copy from another copy, and that the pager need not have complete knowledge of all the kernels involved. The handling of read faults does not change. While handling write faults, at line (1) all readers are informed of who the new owner is, if it is a different pager. At line (2), a check is added to see whether the true owner actually is another pager, in which case the fault is queued and the state of the page modified accordingly. In the pageout() procedure, at line (3) it is necessary to handle the case where the pager has incorrect information about the true owner; note that the pager might have received a @i(hint) about who will eventually become the owner because it forwarded a write fault. At line (5) it is necessary to handle specially the case in which the page is given to a server that was queued for writing while other readers are still waiting: the immediate request to have the page back pretends that there are writers queued anyway, to prevent the race that would otherwise arise. Line (4) jumps to the correct code in case the last writer has actually been serviced. Line (6) handles the fact that if the pager only receives read-only access to the page it does not become the owner of the page.

Two new procedures, described in Figure @ref(forward2), are used to check whether a page fault must be forwarded and to handle invalidations of read-only pages. A @i[memory_object_lock_request()] message is handled first by the page_fault() procedure, which forwards it if necessary. The fault is definitely not forwarded if the pager has ownership of the page, or the pager has already asked the current owner for write access to the page (state WriteWait), or if the pager has (state Read) or is about to have (state ReadWait) a read-only copy of the page and the fault is a read fault. In other words, a fault is only forwarded to another server when the pager has no current interest in the page whatsoever. An invalidation of a read-only page is generated at lines (1) and (7) if the reader is a server, and is handled in the invalidate_page() procedure. This is the only new message type needed.

 

 

@begin(figure)
@begin(programexample, size -2)

invalidate_page(page, owner)
{
    if (page->state != Read)
        return                  /* sanity check */
    forall (readers)
(7)     memory_object_lock_request(reader, FLUSH(page), owner)
    page->state = Write
    page->owner = owner
}

page_fault(page, who, fault_type)
{
    if ((page->owner == owner_self) ||
        !is_server(page->owner) ||
        (page->state == WriteWait) ||
        ((fault_type == READ) && (page->state != Write))) {
        if (fault_type == READ)
            read_fault(page, who)
        else
            write_fault(page, who)
        return
    }
    /* Forward */
    send_page_fault(page->owner, who, page)
    if (fault_type == WRITE) {
        page->owner = who
        hinted(page) = TRUE
    }
}

@end(programexample)
@caption(Additions to the Distributed Scheduler to Implement Forwarding of Page Faults)
@tag(forward2)
@end(figure)

 

 

Forwarding creates problems for a closed form analysis, since the effects of forwarding both page faults and invalidations (page flushes) are difficult to model. Our claim is that in actual use one will typically see only the two extreme cases: pages that are frequently accessed in write mode by many parties, and pages that are accessed infrequently, most likely in read mode. Even if a page is accessed infrequently, it is hard to generate a faulting sequence that produces many forwarding messages. This claim is supported by the experience with actual application programs. Infrequently accessed pages do not affect performance; the bottlenecks derive from the opposite case. Our analysis, confirmed by the measurements of Section @ref(performance), shows that servicing an N-party page fault with the distributed pager requires on the order of two remote messages per faulting party: one to request the page and one to transfer it, plus an occasional forwarding message.

To get the total number of messages in the distributed scheduler one must add a total of 2N-2 local messages between pagers and the kernels they service. For comparison, any centralized algorithm that maintains strict memory coherence must use at least 4N remote messages and no local messages; in the case of the simple scheduler this figure is 5N remote messages. Since the cost of local messages is often much less than the cost of remote messages, the distributed pager clearly outperforms the centralized one. The performance evaluation results reported in Section @ref(performance) confirm this analysis.
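
To get a rough sense of what these counts mean in practice, the toy calculation below plugs the per-fault message counts reported in Table @ref(CostSummary) into a simple cost model; the ten-to-one remote-to-local cost ratio is an illustrative assumption, not a measured value.

@begin(programexample)
/* Toy comparison of total message cost for an N-party hotspot fault,
 * using the per-fault message counts measured for the two schedulers.
 * The remote/local cost ratio is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
    int    n      = 8;       /* number of faulting parties */
    double local  = 1.0;     /* cost of a local message, arbitrary unit */
    double remote = 10.0;    /* assumed cost of a remote message */

    /* distributed scheduler: about 2.0 remote and 2.1 local messages per fault */
    double distributed = n * (2.0 * remote + 2.1 * local);
    /* simple centralized scheduler: about 5 remote messages per fault */
    double centralized = n * (5.0 * remote);

    printf("distributed %.1f   centralized %.1f\n", distributed, centralized);
    return 0;
}
@end(programexample)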

Example

When a process first maps a memory object in its address space the kernel contacts the server but does not require it to send any data yet. It is only when the process touches a memory location within the address range where the object is mapped that a fault is generated. The faulting process is stopped, and a message is sent to the pager to request data to service the fault. When the scheduling algorithm in the server has the necessary data available, the page is sent to the kernel, which maps it for the faulting process, which can then continue execution. In case the page is not immediately available at the server, a message is sent to the kernel that currently owns the page, asking it to page it out to the server. In the case of the distributed algorithm, this may imply some more processing, since the "kernel" is actually another server.

It is interesting to consider one example that shows the effects of forwarding page faults among distributed servers. Let us assume that N servers (each one serving one or more kernels) all take repeated page faults on the same page, which is the @i(hotspot) case that makes distributed shared memory perform worst. Initially, all servers have obtained their copy of the memory object from the same server (say server 1). Therefore N-1 requests are sent to server 1. The server first services its local fault(s), then ships the page to server 2 (say), which becomes (in server 1's opinion) the new owner. The next fault request is then forwarded by server 1 to server 2, the next to server 3 and so on, up to server N-1. When all faults have been forwarded and served, the situation is such that servers 1, N-1 and N all know that the page is located at server N, while every other server @i(i) believes the page is at server @i(i+1). When all servers take the next page fault only 2 requests are sent to the owner, and any other request @i(i) is queued at server @i(i+1), waiting for @i(i+1) itself to be served first.

@begin(figure)

@begin(verbatim)

 

S1    S2 -> S3 -> S4 -> ...  Sn-1 -> Sn
|                                    ^
+------------------------------------+

@end(verbatim)

@caption(Steady State Behavior for an N-Party Write Hotspot)

@tag(forwarding)

@end(figure)

This situation is depicted in Figure @ref(forwarding) and can repeat itself. Our experiments show that in a write-hotspot the system indeed oscillates between two configurations of this type, never entering the initial state again. There is a worst case that could surface: an isolated page fault triggers a number of forwarding messages. This number is at most N-2, since at least two servers always know exactly where the page is: the owner and the server that sent the page to it. In the example, this would happen if server 2 alone takes a fault after the first N faults are served. After a worst case fault all servers know exactly where the page is, and therefore the system goes back to the initial state.

Related Work

Forwarding creates a new need: locating the current owner of a page. Li @cite(kaili) looked at the problem of locating a page, provided various algorithms to solve it, and analyzed their costs. Our distributed algorithm should be compared against his "Distributed Manager Algorithm 2.6", with the optimization indicated at pages 61-63 whereby page invalidations are sent in a divide-and-conquer fashion. Note however that in Li's algorithms all operations are RPCs, hence requiring twice as many messages and unnecessary serialization. Li also evaluates the use of broadcast messages and proves that they could benefit some of his algorithms, under the assumption that their cost is the same as that of a direct message. Note that in our algorithm the use of broadcasts would be detrimental to performance, since it brings the system back to the initial state and away from the most favorable situation. The idea of propagating invalidations in a divide-and-conquer fashion is, in our system, much more effective than broadcasts. In this paper it was only assumed that the underlying architecture provides efficient point-to-point communication, with quasi-uniform cost: the cost of sending a message to N recipients is therefore greater than or equal to N times the cost of a message to a single recipient.

Cheriton @cite(Cheri88) has recently extended the V kernel to support user-level data and caching servers, which can be used to provide distributed shared memory. His facility has many similarities with Mach's external pager facility, although it is described in terms of file abstractions rather than memory object abstractions. The implementation uses a scheme analogous to the simple scheduler presented above, but might add considerable extra message traffic by polling and forcing page flushes every T milliseconds to provide @i{T-consistent} files for transaction support.

Fleisch @cite(Fleisch) has extended the Locus kernel to provide distributed shared memory, with a System V interface. The scheme he describes seems geared to maintaining consistency at the segment rather than the page level. A report on the implementation work will be necessary to better evaluate his approach. Other approaches to user-level shared memory are possible; @cite[agora] contains some references.

Performance Evaluation

The performance of the server was evaluated along a number of dimensions. Fundamental are the average times to service a fault, both for single-machine and for multi-machine applications; these are affected by the various special features of the server. The centralized and distributed cases were compared, using programs that exercise the hotspot behavior. Our measurements show two overall results: the distributed algorithm is more efficient than the centralized one, and none of the special features we introduced has an unacceptable impact on performance. The major bottleneck in the test configuration (token ring workstations) is the network latency, which accounts for about 98% of the elapsed times. More details on the measurements can be found in @cite(npsreport). The server was instrumented in two ways: keeping track of the number and type of faults it services (per object and per page), and collecting extensive traces of the message activity. These data can be obtained by other processes via a remote procedure call, with minimal perturbation.

Basic Costs

The most common use of the server is in sharing memory within a single machine. In this case, a fault on a missing page (cache-fill) requires two local messages, for a total cost of 1.5ms on an IBM RT-APC. A protection fault also requires two messages but no memory mapping, for a cost of 1.1ms. A pageout operation requires two receive messages and the deallocation of data, which is not a system call but an RPC to the kernel and involves two messages; the total cost is then 2.5ms. Since system time is by far the dominant factor (93%) in all cases, schedulers do not show significant differences in the handling of local faults. Table @ref(CostSummary) summarizes the most important costs.

Memory use is an important factor in characterizing the performance of a program, although our primary concern was speed rather than space. The server allocates memory in a sparse fashion only when a kernel demands it, and then replaces each page as it is paged out by a kernel. This not only reduces the memory usage for a large and sparse object, but also removes from the critical path the copying of data (just switch a pointer) and the deallocation of memory (two messages), which can be done in batches. To quantify these improvements, the hotspot cycle time for the distributed case of the simple scheduler was reduced by this strategy from 7.8ms/fault to 5.5ms/fault, including memory deallocations. Memory deallocation can be delegated to a separate thread, which reduces the fault time to approximately 4.2ms/fault. Memory saving depends on the actual use, and is very effective for some applications.
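
A minimal sketch of this strategy is shown below; plain malloc/free and the fixed-size array stand in for the server's real memory primitives and data structures.

@begin(programexample)
/* Sketch: pages are allocated only on first use, and a pageout installs
 * the incoming buffer by switching a pointer instead of copying.
 * PAGE_SIZE and MAX_PAGES are illustrative constants. */
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 1024

static char *pages[MAX_PAGES];          /* NULL until a kernel touches the page */

char *page_lookup(int index)            /* used when servicing a data request */
{
    if (pages[index] == NULL) {
        pages[index] = malloc(PAGE_SIZE);
        memset(pages[index], 0, PAGE_SIZE);     /* zero-fill on first touch */
    }
    return pages[index];
}

void page_install(int index, char *incoming)    /* used when servicing a pageout */
{
    char *old = pages[index];

    pages[index] = incoming;            /* just switch a pointer, no copy */
    if (old != NULL)
        free(old);                      /* deallocation could be batched or
                                           delegated to a separate thread */
}
@end(programexample)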

@Enter(MyTable, LeftMargin +.0in, RightMargin +.0in)
@TabClear
@Tabdivide(2)
@Enter(MyFormat)

@u[Parameter@\Measured Cost]

Zero-Fill Fault Time@\1.5 msec/fault

Protection Fault Time@\1.1 msec/fault

Hot-Spot Cycle Time@\4.2 msec/cycle

Multiple Page Size Overhead Time@\0.2 msec/fault (max)

Centralized Hotspot Case Message Cost@\5.0 remote msg/fault (average)

Distributed Hotspot Case Message Cost@\2.0 remote msg/fault, 2.1 local msg/fault (average)

Percentage of Faults Forwarded, Hot-Spot Case@\10%

System Time@\93%

@Leave(MyFormat)
@Caption[Server Costs]
@Tag(CostSummary)
@Leave(MyTable)

Costs of the Algorithms

The multiple page size scheduler adds some overhead to the fault times, primarily because more server pages might be needed to cover a kernel's page fault. In most cases a small range of page sizes will be used, but even with an unlikely maximum-to-minimum page size ratio of eight, the overhead over the basic fault times is only 0.2ms. If necessary, however, the algorithm can be tuned further for larger page size ranges.

Various experiments were performed on the distributed scheduler, the most interesting one being the case of a hotspot page. This is demonstrated by a simple program that repeatedly increments the same memory location, replicated across various machines. The measurements show that on average each server received requests for read, write, and protection faults in roughly equal proportions, as expected. The average number of messages per fault is the single most important figure: on average, each server handled 4.1 messages per fault, half of them received and half sent. On average, 2.1 messages are local (interactions with the local kernel) and 2.0 are remote (interactions with other servers). This nicely confirms the estimates presented in Section @ref(distributed). Remote messages are far more expensive than local ones: an average 98% overhead was observed in the test system, equally divided among the local Mach network server, the remote one, and the TCP/IP transfer protocol.
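
The hotspot test program is essentially a loop of the following form (a sketch: shared_counter is assumed to point into the mapped shared object, and the iteration count is arbitrary):

@begin(programexample)
/* Write-hotspot test: every participating machine maps the same memory
 * object and repeatedly increments one shared location. */
#define ITERATIONS 10000

volatile int *shared_counter;           /* set up to point into the mapped object */

void hotspot_loop(void)
{
    int i;

    for (i = 0; i < ITERATIONS; i++)
        (*shared_counter)++;            /* each increment may fault the page over */
}
@end(programexample)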

 

@Enter(MyTable, LeftMargin +.0in, RightMargin +.0in)

@TabClear

@Tabdivide(4)

@CommandString(TableBar="@ux[@\@\@\@>@\]@*@BlankSpace(-0.01in)")

@CommandString(LastBar="@ux[@\@\@\@>@\]")

@Enter(MyFormat)

@TableBar

 

@u[Machine@\Int16/32@\Float32@\Float64]

 

Sun 4/260 (*)@\0.8@\1.0@\1.1

 

Vax 8800@\1.5@\2.3@\3.7

 

IBM RT@\1.9@\2.4@\2.5

 

Sun 3/280@\1.9@\2.5@\2.9

 

@g(m)Vax-III@\2.8@\4.6@\6.8

 

Sun 3/160@\3.0@\4.8@\4.6

 

Vax 785@\4.4@\7.6@\10.9

 

Encore (*)@\4.9@\12.5@\14.3

 

@g(m)VaxII@\6.1@\10.4@\14.5

 

Vax 8200@\9.1@\15.3@\27.9

 

@TableBar

@Leave(MyFormat)

@Caption[Overhead of Data Translations (in milliseconds per 4kbytes).]

@Tag(hettable)

@Leave(MyTable)

For the heterogeneity problem, only those hardware data types that are more or less implied by the definition of the C language were chosen for implementation, e.g. integers of various sizes and floating point numbers. Many other data types map obviously onto these types. For floating point numbers, the two formats most often used on our machines (Vax-D and IEEE-754) were selected, in both short (32 bit) and long (64 bit) forms. Table @ref(hettable) shows the overheads measured on the server on a wide variety of machines. The times reported are those necessary to convert 4 kbytes of data, but note that some machines use larger page sizes. There is no other basic overhead beyond a simple test of whether conversion is necessary or not. Starred entries in the table indicate machines for which a Mach external pager kernel is not yet available; in these cases, a synthetic test was run to time the critical code.

Assuming that the server's (multi)processor has spare cycles, it is possible to eliminate the type conversion overhead at the expense of increased memory usage. The server keeps multiple copies of each segment, one per machine type, and pre-translates it when a page is received. Translation is done in parallel by a separate thread, which works in a pipelined fashion with the main thread that services faults. We have not yet implemented this optimization.
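
The translation step itself can be sketched as follows for the integer types; the sketch assumes byte swapping is the only conversion needed, whereas the floating point formats (Vax-D versus IEEE-754) require a true format conversion, omitted here.

@begin(programexample)
/* Sketch of a tag-driven translation of one page of data between
 * processors of opposite byte order; 32-bit "unsigned int" is assumed. */
#include <stddef.h>

enum tag { t_int8, t_int16, t_int32, t_float32 };

static unsigned short swap16(unsigned short x)
{
    return (unsigned short)((x >> 8) | (x << 8));
}

static unsigned int swap32(unsigned int x)
{
    return ((x >> 24) & 0x000000ff) | ((x >>  8) & 0x0000ff00) |
           ((x <<  8) & 0x00ff0000) | ((x << 24) & 0xff000000);
}

void translate_page(void *page, size_t bytes, enum tag tag)
{
    size_t i;

    switch (tag) {
    case t_int8:                        /* nothing to do */
        break;
    case t_int16:
        for (i = 0; i < bytes / 2; i++)
            ((unsigned short *)page)[i] = swap16(((unsigned short *)page)[i]);
        break;
    case t_int32:
        for (i = 0; i < bytes / 4; i++)
            ((unsigned int *)page)[i] = swap32(((unsigned int *)page)[i]);
        break;
    case t_float32:                     /* format conversion needed; omitted */
        break;
    }
}
@end(programexample)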

Application Programs

Intuitively, the performance gain from the use of memory sharing techniques comes from the large amount of information that can be transferred at no extra cost between parallel activities at each synchronization operation. Below a certain threshold, on a uniprocessor, the integration of scheduling and data transfer provided by a kernel optimized for message passing wins over the simple busy-waiting scheme of spin-locks. The effect should be even more visible in the networked case, where spin-locks are more expensive. This was the idea that guided the choice of applications for testing the server. The hypothesis partially contradicts the suggestion that the @i(locality) of operations completely dominates the performance of a distributed shared memory program.

In the networked shared memory case, all the processes running on the same machine produce a single load on the pager, and the advantage of one process obtaining a page that will then be used by the other processes cannot be measured. To eliminate this unmeasurable gain from the experiments, only one process was allocated per machine, even though this is clearly unfair to the pager.

@Enter(MyTable, LeftMargin +.0in, RightMargin +.0in)

@TabClear

@Tabdivide(4)

@CommandString(TableBar="@ux[@\@\@\@>@\]@*@BlankSpace(-0.01in)")

@CommandString(LastBar="@ux[@\@\@\@>@\]")

@Enter(MyFormat)

@TableBar

 

@u[Program@\1 Machine@\2 Machines@\3 Machines]

 

Matrix 128x128@\29@\15@\10

 

Matrix 256x256@\241@\122@\80

 

ShortestPath@\60@\60@\40

 

LocusRoute@\277@\333@\397

 

Mp3d@\8.6@\16.1@\23.0

 

@TableBar

@Leave(MyFormat)

@Caption[Execution Times For Some Tightly Coupled

Shared Memory Programs, Unmodified]

@Tag(matrix)

@Leave(MyTable)

 

All programs were developed for a uniform shared memory multiprocessor and were not modified in any way to obtain better distributed performance. In the matrix multiplication case, the problem is decomposed so that each machine computes all the elements of some rows of the output matrix. In this way it is easy to compute large matrices with few processors. The Shortest Path program is a parallel version of a sequential algorithm with N log(N) complexity for planar graphs @cite(johnson). The program evaluates in parallel the possible extensions to the most promising paths; each activity only looks in the neighborhood of a point and queues new extensions to other activities. The other two programs have been used in architectural simulations, on the assumption that they are representative of a large class of parallel programs. Mp3d is a particle simulator @cite(mp3d) and LocusRoute is a parallel VLSI router @cite(Locusroute).
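
For concreteness, the row-wise decomposition used in the matrix multiplication amounts to each worker running a loop like the following over its own band of rows (a sketch: N and the static arrays stand in for matrices placed in the shared memory object):

@begin(programexample)
/* Row-wise decomposition of matrix multiplication: each machine computes
 * a contiguous band of rows of the result matrix c = a * b. */
#define N 128

double a[N][N], b[N][N], c[N][N];

void multiply_rows(int first_row, int last_row)     /* this worker's share */
{
    int i, j, k;

    for (i = first_row; i < last_row; i++)
        for (j = 0; j < N; j++) {
            double sum = 0.0;

            for (k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;              /* each row is written by one worker only */
        }
}
@end(programexample)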

The experiments were performed on machines under standard multi-user operating conditions, including a normal level of disk paging. Measurements were taken of elapsed and per-thread CPU times. Table @ref(matrix) shows the results of executing the programs on a small group of IBM RTs on a token ring. The network latency dominates performance, and only the matrix multiplication case shows linear speedup. All programs are known to demonstrate linear speedups on a bus-based shared memory multiprocessor with a small number of processors.

One important factor affecting the performance of an application that uses dynamically managed shared memory is the memory allocation algorithm. Li described a scheme for memory allocation derived from Knuth's FirstFit scheme. A quick comparison was made with a different one, a descendant of Knuth's FreeList algorithm. Such an allocator is currently used, in a sequential version, by the standard Berkeley BSD Unix distribution. A parallel version was easily created by associating a semaphore with each free list, so that requests for memory blocks of different sizes proceed completely in parallel. It is much more difficult to parallelize the FirstFit scheme.
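
A minimal sketch of such a per-size free-list allocator is shown below, with POSIX mutexes standing in for the semaphores mentioned above; for brevity the block size is passed explicitly to the free routine, whereas the real allocator records it in a block header, and fresh memory comes from plain malloc() rather than from the shared memory object.

@begin(programexample)
/* Sketch of a parallel free-list allocator: one list per power-of-two
 * block size, each protected by its own lock, so that requests for
 * different sizes proceed in parallel. */
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 16                     /* block sizes 8 bytes .. 256 kbytes */

struct block { struct block *next; };

static struct block    *freelist[NBUCKETS];
static pthread_mutex_t  freelock[NBUCKETS];

static int bucket(size_t size)          /* smallest class that fits the request;
                                           larger sizes are not handled in this sketch */
{
    int b = 0;
    size_t s = sizeof(struct block);

    while (s < size && b < NBUCKETS - 1) { s <<= 1; b++; }
    return b;
}

void pmalloc_init(void)
{
    int i;
    for (i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&freelock[i], NULL);
}

void *pmalloc(size_t size)
{
    int b = bucket(size);
    struct block *blk;

    pthread_mutex_lock(&freelock[b]);
    blk = freelist[b];
    if (blk != NULL)
        freelist[b] = blk->next;        /* reuse a block of this size class */
    pthread_mutex_unlock(&freelock[b]);

    if (blk == NULL)                    /* list empty: get fresh memory */
        blk = malloc(sizeof(struct block) << b);
    return (void *)blk;
}

void pfree(void *ptr, size_t size)
{
    int b = bucket(size);
    struct block *blk = ptr;

    pthread_mutex_lock(&freelock[b]);
    blk->next = freelist[b];            /* push back on the per-size list */
    freelist[b] = blk;
    pthread_mutex_unlock(&freelock[b]);
}
@end(programexample)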

The measurements show that not only does the FreeList algorithm use less memory (1/4 on average) than the FirstFit one, but that it is about 20-30% faster even in the sequential case. Other measurements indicate that a two level memory allocation strategy is very effective in reducing shared memory contention. The simple solution of allocating and deallocating memory in batches for blocks of the most frequently used size often suffices to eliminate the most obvious bottlenecks.

Conclusions

This paper described a user-level memory server for Mach and the algorithms it uses to deal with issues like heterogeneity, multiple page sizes, and distributed service. The server itself shows very good performance, and the distributed algorithm is effective in reducing communication over the (potentially slow) communication medium. Results with application programs are dominated by the network latency, but some still show linear speedup. It is conjectured that the amount of data exchanged between synchronization points is the main indicator to consider when deciding between the use of distributed shared memory and message passing in a parallel application. There is definitely room for more research: a number of extensions and optimizations can be attempted, using more sophisticated caching strategies and heuristics in servicing fault requests.

Besides final user applications (e.g. scientific applications, window managers, etc.) there are a number of operating system services that can be built using shared memory, knowing that it is now a resource available across the network. I/O between processes can be modeled as the transfer of ownership of some shared memory buffer. In this way, a process (the producer) can allocate a buffer, fill it with data, and then notify the other process (the consumer) that the buffer is available by enqueuing it in, for example, a circular queue. A good case in point is the implementation of the Streams abstraction at the user level. Supporting distributed databases with distributed shared memory also becomes simpler. An example of how to structure a file system using the external pager facility was illustrated in @cite[mwyoung], and the Camelot system @cite(camelot) uses the facility to provide distributed atomic transactions. Finally, all parallel languages that assume a shared memory model will port easily to a distributed shared memory system, although they will require some tuning to obtain the best performance.
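
As a sketch of this buffer-ownership scheme, a circular queue of buffer offsets kept in the shared object might look as follows; synchronization of the head and tail indices and the blocking behavior are omitted, and the constants are illustrative.

@begin(programexample)
/* Single-producer/single-consumer circular queue kept in shared memory:
 * ownership of a buffer is transferred by enqueuing its offset. */
#define QUEUE_SLOTS 64

struct ring {
    unsigned long buffer_offset[QUEUE_SLOTS];   /* offsets of buffers in the object */
    unsigned int  head;                         /* advanced by the consumer */
    unsigned int  tail;                         /* advanced by the producer */
};

int ring_put(struct ring *r, unsigned long offset)      /* producer side */
{
    unsigned int next = (r->tail + 1) % QUEUE_SLOTS;

    if (next == r->head)
        return 0;                               /* full: caller must retry or block */
    r->buffer_offset[r->tail] = offset;
    r->tail = next;                             /* buffer is now owned by the consumer */
    return 1;
}

int ring_get(struct ring *r, unsigned long *offset)     /* consumer side */
{
    if (r->head == r->tail)
        return 0;                               /* empty */
    *offset = r->buffer_offset[r->head];
    r->head = (r->head + 1) % QUEUE_SLOTS;
    return 1;
}
@end(programexample)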

Acknowledgements

We would like to thank Roberto Bisiani and David Black for their invaluable help in reviewing earlier drafts of this paper.

References