Odin:

A Virtual Memory System for Massively Parallel Processors

Joseph S. Barrera III

Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052

 

joebar@microsoft.com

+1 (206) 936-3837

Introduction

We have designed and implemented a distributed virtual memory system which allows practical dynamic load balancing on non-shared memory massively parallel processors such as the Intel Paragon XP/S. We have done this by combining techniques such as copy-on-write, external pagers, and shared virtual memory, and by exploiting knowledge of the underlying structure of virtual memory. This system, called Odin, treats distributed memory as a single cache. It supports copying, caching, and management of data between nodes in a non-shared memory massively parallel processor. It also allows the contents, accessibility, and consistency of memory to be managed by user-provided external pagers and makes it possible for sophisticated memory-management services such as mapped files, garbage collection, and checkpointing to be written without special attention to the needs of inter-node memory consistency.

Odin is the kernel in use today as a key component of the Open Software Foundation's OSF/1 AD. It runs on the Intel Paragon XP/S, a massively parallel processing platform capable of supporting two thousand i860XP processors. Measurements of Odin in operation on the XP/S show the impact of its design approach. For example, the fork of a shell process with sixty resident virtual pages requires the internode copying of only three pages. This paper describes the implementation and functionality of Odin, and the way in which Odin enables many of the services provided by OSF/1 AD, such as internode process creation and migration, mapped files, and shared memory.

Key Ideas

Odin's architecture focuses not on address spaces but on the objects mapped by those address spaces. This emphasis on objects is based on observations about the actual use of memory by applications. Our decision to use objects as the basic building block of a distributed shared memory system led us to develop techniques for recognizing, compressing, caching, and managing distributed objects in ways that reduce the amount of paging traffic and communication between nodes.

Observations about application memory references

Application processes do not normally refer to random sequences of bytes. Three key observations led us to use an object-based scheme for structuring distributed memory:

· Processes use objects, not unstructured address spaces. Address spaces are typically composed of a collection of well-defined abstract data types. Even in systems that treat virtual memory as an unstructured array of bytes, the address space of an application actually contains a collection of objects from different sources and with different purposes. For example, a process's address space might contain program text, initialized and uninitialized data areas, and a stack, as well as mapped files, shared memory segments, or dynamically linked text and data. The abstract data objects found in an application address space can have a variety of usage patterns; for example, the stack will be read and written with a high degree of locality, whereas program text will be executed from, not written to, and will be accessed with much less locality. These usage patterns are characteristic of the objects themselves, not the virtual addresses used to refer to them.

· Objects transcend processes. Abstract data types within an application can be shared concurrently with other processes and/or persist beyond the lifetime of a single application. Objects can therefore often transcend the lifetime of the application processes that map them. This transcendence can apply to a wide variety of object types. While some concurrently mappable objects, such as files, may have external names and permanent storage, others, such as shared memory sections, may exist only as long as they are mapped by a specific application. Even objects that seem intrinsically bound to a process, such as stacks, can survive a process by being inherited by descendants of that process through primitives such as Unix's fork or Mach's task_create.

· Objects have versions, which control modification and propagation. As an object is shared or passed on from one application to the next, it can be changed; new versions are effectively created, which may coexist over time with older versions of the same object. A version restricts the visibility of modifications made to the object: a modification made to one version could, for example, be visible only to processes using that version. Creating a new version of an object protects the users of the old and new versions from each other. For example, Unix fork creates new versions of all objects for the child process, protecting parent and child from each other. Similarly, Mach's vm_map, when used to map program text into a process, protects the process from subsequent changes in the executable file while also protecting other users of the file from changes made by the process to the mapped text. Conversely, processes using the same version of an object will see each other's changes, even if they live on different nodes.

The way in which objects are used and the patterns of use associated with object versions can, in turn, be exploited to gain efficiency. Specifically:

· Object versions overlap.

Different versions of an object often overlap, that is, have identical sections. A newly created version completely resembles the old, and identical sections often remain long after version creation. Identical sections can arise either from limited modification of both old and new versions, or from limited modification of several new versions despite heavy modification of the old version.

Several scenarios produce overlapping versions via limited modification of both old and new versions. Using version creation to protect mapped program text will create completely identical versions, assuming that the executable file isn't being changed and programs don't modify their text sections; limited modification of program text, which can result from text and data sections sharing a page, still yields largely overlapping versions. Overlapping versions can also result from data which is initialized by a parent and then mostly read by its children. Finally, overlapping versions result when a portion of an object is rarely used by any process, such as the far end of a large, statically allocated stack.

Similar scenarios produce overlapping new versions despite heavy modification of the old version, as long as the new versions were created before the bulk of the old version's modification. Thus even if an executable file is periodically completely overwritten, all processes using the executable in between overwritings will use identical versions for their mapped text. Similarly, if a process alternates between writing new data into an object and spawning a set of children to compute on the data, then all children in each set will use largely overlapping versions.

· Objects are reused on individual nodes.

When an object is used on a node, some version of that object is likely to have been used on that node before. For some objects, such as program data, reuse arises from locality in process scheduling; for other objects, such as commonly used files, reuse on a given node results from the object's wide use.

Locality in object placement arises in simple process scheduling schemes, such as static partitioning, as well as more sophisticated schemes. Static partitioning, in which a fixed set of nodes is reserved for a particular job, produces object locality by mapping similar processes (those associated with the job) onto a subset (or partition) of the full set of nodes in the system. In particular, if a parent process (such as a shell or a parallel application) repeatedly creates children which inherit objects from the parent, then these objects will be repeatedly reused on the other nodes in the partition. More sophisticated process scheduling schemes still produce object locality, for two reasons. First, for scalability, such schemes distribute scheduling decisions among multiple agents, yielding a dynamic partitioning with behavior similar to the static partitioning already discussed, in which a slowly changing subset of processors is used for related processes. Second, such schemes include locality in their node selection criteria, since locality can be exploited for performance, due to better caching and reduced communication cost to related processes.

· Object use has temporal locality.

An object in use on one node is likely to be in use on other nodes as well. Inherited objects such as stacks, data, and program text will have been in use on the parent's node, and additionally may be in use on siblings' nodes. Other objects such as files will exhibit the same temporal locality that makes conventional file caching worthwhile.

· Objects are incompletely referenced.

Processes do not completely reference the objects that they map. Private objects, such as stacks, are often allocated much larger than will actually be used, to accommodate the maximum growth case. Shared objects often have regions only of interest to a subset of the processes sharing the object, such as a queue of work items in a shared memory region for a parallel program, or data allocated by a parent for its own use but not that of its children. Finally, regions of program text containing rarely used routines may never be referenced by a given process.

A final observation is that many object-based services are best handled not by a general-purpose operating system kernel with generic object page management, but by servers which understand their semantics and behavior. These external memory managers (external because they are object memory managers which reside outside the bounds of a pre-defined operating system kernel) can provide services such as object data paging, garbage collection, and checkpointing. They can be made directly aware of page modifications through fault behavior, so manual comparison of pages for differences can be avoided. They can read- or write-protect sections of memory for synchronization, provide data on demand, and avoid double paging. Because they operate on objects, they need not concern themselves with dynamically changing processes and process state but can concentrate on the needs and demands of that object data type. External memory managers can be used to implement services such as generational garbage collection, concurrent checkpointing, and persistent stores [Appel & Li 91].

Although they can be extremely useful, the implementation of external memory managers in a distributed shared memory environment can be quite complex. Without some form of lower level system assistance, each external object handler must be aware of the nuances of page caching and consistency. This puts a significant burden on the implementors of such services.

Techniques for distributed object management in Odin

For the reasons outlined above, we chose to use objects as the basic building block of the Odin distributed shared memory system. The most important problem we had to solve was how to handle distributed object migration and object sharing efficiently. We developed techniques for recognizing, compressing, caching, and managing distributed objects in ways that reduce the amount of paging traffic and communication between nodes. Odin also makes the implementation of external memory managers simpler by providing the low-level support for distributed object management upon which such managers can build their own semantics.

Object recognition

Odin uses object recognition to avoid internode copying: it recognizes when an object is already resident at a node and exploits the reuse of objects on individual nodes. Object recognition can be aided by caching objects after they have been used at a node, since otherwise only objects actively in use at that node will be recognized. Object recognition is particularly helpful for remote process creation performance, since it allows newly created processes to take advantage of data retrieved by previously created processes, such as files, shared memory sections, and program text.

Version compression

Version compression takes advantage of overlap between object versions to reduce the amount of space required to store multiple versions, to reduce the time required to create new versions, and to aid object recognition. Version compression allows an object version to share a section of an object with other versions as long as it agrees about the section's contents. As well as saving space, version compression speeds version creation, since initially old and new versions can share the entire object, and therefore no copying of object contents is required. Version compression also aids object recognition by allowing a version of an object to be recognized even when only other versions of that object are present; all overlapping sections from the other versions can then be used to construct the desired version.

Distributed caching

Odin takes advantage of the temporal locality of objects by using distributed memory as a shared object cache. If a node needs object data, it first asks other nodes which have recently used the object, fetching the data from backing store (file server or paging disk) only when no nodes have it cached. By representing all data as objects, Odin can use distributed memory to cache temporary objects such as stacks and program data as well as more traditionally cached objects such as files, aiding remote process creation by allowing sibling processes to contribute data no longer cached by the parent. By using version recognition, nodes can use objects cached at other nodes even when those objects are a different version, as is common for writable objects such as stacks.

Distributed caching is well suited to massively parallel processors with uniform nodes. In local area networks of workstation clients and dedicated server machines, the servers have more processing power to handle large numbers of requests and more memory to serve as an effective second-level cache (where the client's own memory is the first-level cache). A machine with uniform nodes has no such privileged server; distributed caching compensates by reducing the number of requests that reach the server's node, lowering processor demand there, and by using distributed memory itself as the second-level cache, eliminating the need for a high hit rate at the server and thus for a large server memory.

Copy-on-reference

Odin exploits the incomplete referencing of mapped objects through copy-on-reference, which only attempts to find data in a mapped object after an application has tried to access it. Copy-on-reference can greatly reduce the amount of data required to be copied from other nodes, particularly since unreferenced data is less likely to have been referenced before and therefore less likely to be locally cached.

External memory management

Odin provides an interface for servers to externally manage memory which allows them to control the contents and accessibility of objects mapped by clients. Unlike other systems which only provide external memory management, Odin also provides coherency management and distributed caching for mapped objects, freeing servers from such concerns. By allowing external object servers to ignore the distributed nature of the memory in which their objects are mapped, Odin drastically simplifies the task of writing distributed object services such as mapped databases or files.

Architecture and Implementation

With Odin, applications running on a massively parallel processor such as the Paragon XP/S derive their memory and other system services from operating system servers. These servers in turn rely on Odin's local cache managers to provide address spaces and objects mapped within them. Distributed cache managers are used by these local cache managers to provide coherency of information. All server-provided data ultimately comes from external memory managers.

Address space and physical memory management

Odin simplifies the distributed management of objects by assigning to each node the responsibilities of managing local address spaces, of managing local physical memory as a cache of object data, and of mediating between address spaces and the distributed object cache. This assignment of responsibilities reduces internode message traffic by resolving many requests locally. It also simplifies the implementation of distributed caching by allowing it to ignore both address spaces and physical memory. A final per-node responsibility, that of maintaining object versioning structure, will be discussed later when describing the implementation of object recognition.

Address space management

Each node provides methods for creating and destroying address spaces, for mapping and unmapping objects in an address space, and for listing what objects are currently mapped in an address space. Address space inheritance is implemented by combining these methods. While Odin only directly supports single-node address spaces, multinode address spaces could be implemented by combining several single-node address spaces.

The address space creation and object mapping methods provided by each node are used to implement services such as memory allocation, file mapping, process creation, and process migration. When a server creates a process, it first asks the node to create an empty address space, which it then populates by mapping objects into it. Objects such as mapped files, program text, and shared memory segments are obtained by mapping external pagers provided by file and other servers. Other objects, such as stacks, uninitialized data, and other zero-filled data, are obtained by mapping internal pagers provided by Odin; these internal pagers are created automatically by the mapping primitive upon being given a null pager name.

Address space inheritance is implemented by obtaining a list of the objects mapped in the source address space and mapping these objects into a newly created destination address space. If the inheritance has copy semantics, as in Unix fork, then new versions of the objects are mapped; if the inheritance has sharing or duplication semantics, as in process migration, then the same versions are used. For convenience, Odin provides methods to create an address space that inherits from another with either copying or sharing semantics, but these methods are built upon the address space creation and mapping primitives.
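To make the inheritance path concrete, the following sketch (in C, with purely illustrative types and names such as addr_space and new_version; none of these are Odin's actual interfaces) builds a child address space by listing the parent's mappings and remapping each object, creating new versions when copy semantics are requested and reusing the same versions when sharing semantics are requested.

    /* Hypothetical sketch of address space inheritance: list the parent's
     * mappings and remap each object into a fresh child address space.
     * Names (mapping, addr_space, new_version) are illustrative only. */
    #include <stdio.h>

    #define MAX_MAPPINGS 8

    typedef struct {
        unsigned long start;   /* virtual address of the mapping          */
        unsigned long length;  /* length in bytes                         */
        int object;            /* object identifier                       */
        int version;           /* which version of the object is mapped   */
    } mapping;

    typedef struct {
        int nmappings;
        mapping map[MAX_MAPPINGS];
    } addr_space;

    static int next_version = 100;

    /* Create a new version of an object; with version compression this is
     * just adding a vertex, so no object data is copied here. */
    static int new_version(int object, int parent_version)
    {
        (void)object; (void)parent_version;
        return next_version++;
    }

    /* Inherit: copy semantics (as in fork) maps new versions; sharing
     * semantics (as in migration) maps the very same versions. */
    static void inherit(const addr_space *parent, addr_space *child, int copy)
    {
        int i;
        child->nmappings = parent->nmappings;
        for (i = 0; i < parent->nmappings; i++) {
            child->map[i] = parent->map[i];
            if (copy)
                child->map[i].version =
                    new_version(parent->map[i].object, parent->map[i].version);
        }
    }

    int main(void)
    {
        addr_space parent = { 2, { { 0x1000, 0x4000, /*text*/ 1, 1 },
                                   { 0x8000, 0x2000, /*data*/ 2, 1 } } };
        addr_space forked, migrated;

        inherit(&parent, &forked, 1);    /* fork: new versions       */
        inherit(&parent, &migrated, 0);  /* migration: same versions */

        printf("forked data version: %d\n", forked.map[1].version);
        printf("migrated data version: %d\n", migrated.map[1].version);
        return 0;
    }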

While Odin address spaces live only on a single node, multinode address spaces could be implemented on top of a collection of Odin address spaces by translating all mapping and unmapping primitives into the same calls performed on each Odin address space. Consistency among the mapped objects would be provided automatically as the result of mapping the same versions in each address space. This scheme could be used to allow a single multithreaded process to run in parallel on a massively parallel machine.

Local cache management

Each node has a local cache manager which caches object data in the node's physical memory and which mediates between address spaces and the distributed cache manager. When a process faults on a page of a mapped object, the local cache manager first attempts to satisfy the fault with a page in the local cache. If the page is present and the node has the desired access (e.g., write access) to the page, then the local cache manager maps the physical page into the process and resumes the process. Otherwise, the local cache manager forwards the request to the distributed cache manager for the given object.

Since the local cache manager resolves all conflicts and overlaps among address space requests generated on its node, the distributed cache manager is presented with node requests, not address space requests. By processing only node requests, the distributed cache manager processes fewer requests and is able to process each more easily. For example, when enforcing a consistent view of object data, the distributed cache manager can use a simple single-writer, multiple-reader protocol, since no nodes share physical memory; if it had to arbitrate among address spaces instead, it would have to keep track of which address spaces shared physical memory.
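A minimal sketch of this fault path, assuming hypothetical structures for the local cache (the names local_cache, handle_fault, and ask_distributed_cache are illustrative, not Odin's interfaces): a fault is resolved locally when the page is present with sufficient access, and otherwise becomes a node request to the distributed cache manager.

    /* Hypothetical sketch of the local fault path: a page fault is satisfied
     * from the node's local cache when the page is present with sufficient
     * access; otherwise the request is forwarded to the object's distributed
     * cache manager.  All names and structures are illustrative. */
    #include <stdio.h>

    #define NPAGES 4

    enum access { NONE = 0, READ = 1, WRITE = 2 };

    typedef struct {
        int present[NPAGES];        /* is the page in local physical memory? */
        enum access access[NPAGES]; /* access the node holds for the page    */
    } local_cache;

    /* Stand-in for a request to the distributed cache manager. */
    static void ask_distributed_cache(int page, enum access want)
    {
        printf("forwarding page %d (want %s) to distributed cache manager\n",
               page, want == WRITE ? "write" : "read");
    }

    static void handle_fault(local_cache *lc, int page, enum access want)
    {
        if (lc->present[page] && lc->access[page] >= want) {
            /* Local hit: map the physical page and resume the process. */
            printf("page %d satisfied locally\n", page);
            return;
        }
        /* Local miss, or insufficient access (e.g. a write fault on a
         * read-only page): forward a *node* request upward. */
        ask_distributed_cache(page, want);
    }

    int main(void)
    {
        local_cache lc = { { 1, 1, 0, 0 }, { WRITE, READ, NONE, NONE } };

        handle_fault(&lc, 0, WRITE);  /* present with write access: local */
        handle_fault(&lc, 1, WRITE);  /* present read-only: must forward  */
        handle_fault(&lc, 2, READ);   /* not present: must forward        */
        return 0;
    }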

Version Compression

Odin implements version compression by using a tree to represent all versions of an object, with vertices representing individual versions and shared subpaths representing overlap between versions. By using vertices, not entire objects, as the basis of caching, data cached for one version can be used for another. Furthermore, since version creation is performed simply by adding a new vertex, no copying of object data is required to create a new version. Techniques such as existence maps and deferred distribution are used to avoid a separate remote access for each vertex upon suffering a local cache miss.

Version representation

Odin uses a version tree to represent the versions of an object, allowing data to be shared among versions without modifications to one version affecting another. Each vertex in the tree represents a different version and holds data unique to that version. The vertices between a version's vertex and the root hold the rest of the data associated with the version, shared with other versions that share these vertices as ancestors. Privacy of modifications is ensured by exploiting the priority of modifications in a vertex over those in vertices closer to the root.

Each vertex in the tree represents a distinct object version. The root represents the original version of the object; children of a vertex represent versions derived from the parent vertex's version. Each vertex holds data unique to its version. When a version is newly created, it resembles its parent completely and thus its vertex holds no data; a vertex acquires data incrementally as its version diverges from its parent.

Data in interior vertices is shared by descendent vertices, except when obscured by other data. The closest vertex with data for a given offset provides the data for that offset; thus modifications in more recently created versions override data in older versions. In the absence of any modifications, the data provided will reflect the original version of the object, which will either be zero-filled or provided by an external pager.
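The lookup rule can be illustrated with a small sketch, assuming a vertex is represented by a parent pointer plus per-page data (the names vertex and lookup are hypothetical): the walk from a version's vertex toward the root returns the first data found, so modifications in newer vertices obscure older ones, and a complete miss falls through to the original zero-filled or pager-provided contents.

    /* Hypothetical sketch of a version chain lookup: the closest vertex
     * (starting at the mapped version's vertex and walking toward the root)
     * that holds data for an offset provides that data, so newer
     * modifications obscure older ones.  Names are illustrative. */
    #include <stdio.h>

    #define NPAGES 4

    typedef struct vertex {
        struct vertex *parent;   /* NULL at the root                  */
        int has_data[NPAGES];    /* does this vertex hold the page?   */
        char data[NPAGES];       /* the page contents (one byte each) */
    } vertex;

    /* Find the page for a version by walking toward the root; if no vertex
     * holds it, the original contents (zero fill or external pager) apply. */
    static char lookup(const vertex *v, int page)
    {
        for (; v != NULL; v = v->parent)
            if (v->has_data[page])
                return v->data[page];
        return 0;   /* zero-filled original version */
    }

    int main(void)
    {
        vertex root  = { NULL,  { 1, 1, 0, 0 }, { 'A', 'B', 0, 0 } };
        vertex child = { &root, { 0, 1, 0, 0 }, {  0,  'X', 0, 0 } };

        printf("page 0: %c (inherited from root)\n", lookup(&child, 0));
        printf("page 1: %c (child's modification obscures root)\n",
               lookup(&child, 1));
        printf("page 2: %d (zero fill)\n", lookup(&child, 2));
        return 0;
    }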

Modifications made to one version must not be visible to another; thus modifications must either not be made to any interior vertex, or must be obscured by data in all children of the interior vertex. The latter condition can be attained by copying data about to be modified to all children vertices before modification is allowed to begin. In contrast, modification of leaf vertices is always allowed, since leaves have no descendants that can see the modifications. Odin's two version creation methods allow both strategies for interior vertex modification to be used.

Version creation and modification

Odin uses two methods, symmetric and asymmetric, for creating new versions. By selecting the most appropriate method for each vertex, Odin reduces modification and lookup costs while preserving the semantics required by services such as mapped files. Both methods create a new version by adding a new child vertex to the parent version's vertex, protecting the parent from modifications made by the new version. The difference between the methods lies in how the new version is protected from modifications made by the parent. The symmetric method avoids modifications to interior vertices, whereas the asymmetric method ensures that such modifications are obscured by other data.

The symmetric method prevents interior vertex modification by moving parent versions to newly created leaves. When a new version is created, two vertices are added to the old parent vertex, one for the new version and one as a replacement parent vertex. Both versions are protected from each other by not having each other as an ancestor, sharing instead an unchanging snapshot of the parent at the time of version creation.

Instead of preventing interior vertex modification, the asymmetric method obscures such modifications with unmodified data. When a new version is created, a single vertex is added to the parent vertex, representing the new version. The parent continues to use the old parent vertex, but whenever it is about to modify data for the first time after the version creation, it copies the data into the new version's vertex.
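The two creation methods can be sketched on the same vertex representation (again with illustrative names, not Odin's interfaces). In the symmetric case the old vertex becomes a shared, unchanging interior snapshot beneath two fresh leaves; in the asymmetric case a single child is added and the parent copies a page into the child just before modifying it for the first time.

    /* Hypothetical sketch of the two version creation methods.  Names and
     * structures are illustrative only. */
    #include <stdio.h>
    #include <string.h>

    #define NPAGES 2

    typedef struct vertex {
        struct vertex *parent;
        int has_data[NPAGES];
        char data[NPAGES];
    } vertex;

    /* Symmetric creation: the old vertex becomes a shared interior snapshot;
     * both the parent's new vertex and the child's vertex start empty. */
    static void create_symmetric(vertex *old, vertex *new_parent, vertex *child)
    {
        memset(new_parent, 0, sizeof *new_parent);
        memset(child, 0, sizeof *child);
        new_parent->parent = old;
        child->parent = old;
    }

    /* Asymmetric creation: the parent keeps its vertex; the child hangs off
     * it and starts empty. */
    static void create_asymmetric(vertex *old, vertex *child)
    {
        memset(child, 0, sizeof *child);
        child->parent = old;
    }

    /* Asymmetric write rule: before the parent modifies a page for the first
     * time after version creation, it copies the old contents to the child. */
    static void parent_write_asymmetric(vertex *parent, vertex *child,
                                        int page, char value)
    {
        if (!child->has_data[page] && parent->has_data[page]) {
            child->has_data[page] = 1;
            child->data[page] = parent->data[page];
        }
        parent->has_data[page] = 1;
        parent->data[page] = value;
    }

    int main(void)
    {
        vertex old1 = { NULL, { 1, 1 }, { 'a', 'b' } };
        vertex old2 = { NULL, { 1, 1 }, { 'a', 'b' } };
        vertex snap_parent, snap_child, child;

        /* Symmetric: old1 becomes a stable interior snapshot shared by both
         * new leaves; neither leaf is an ancestor of the other. */
        create_symmetric(&old1, &snap_parent, &snap_child);

        /* Asymmetric: the parent keeps writing into old2, copying a page to
         * the child before the first post-creation modification. */
        create_asymmetric(&old2, &child);
        parent_write_asymmetric(&old2, &child, 0, 'Z');

        printf("parent page 0: %c, child page 0: %c (old value preserved)\n",
               old2.data[0], child.data[0]);
        printf("both symmetric leaves share the snapshot: %s\n",
               (snap_parent.parent == &old1 && snap_child.parent == &old1)
                   ? "yes" : "no");
        return 0;
    }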

While the symmetric method allows more optimizations, the asymmetric method must sometimes be used to preserve external pager semantics. Most external pagers, such as those providing mapped files, must be able to see all modifications made to the original version mapping the pager. Since pagers only see modifications made to the vertex to which they are attached, symmetric version creation prevents a pager from seeing any subsequent modifications, and therefore the asymmetric version is required in this case.

Version compression and distributed caching

Version compression allows greater reuse of local and distributed cached data, but the more complex object representation required to implement version compression can increase the number of remote accesses required to find data. In particular, finding data in a vertex chain could require as many remote accesses as there are vertices in the chain, even when the data is cached locally. Odin uses two methods to avoid these unnecessary remote accesses. First, it uses existence maps to record the offsets for which a vertex has data. Second, it defers distribution of vertices that are only used on a single node. Both techniques allow distributed cache lookups to be skipped; when combined, they usually eliminate all unnecessary distributed cache lookups.

The need for many distributed cache lookups even when data is cached locally arises because data in a vertex must always obscure that in ancestor vertices. If an object version is represented by a long version chain, and only the root vertex has locally cached data, it is not correct to use the locally cached data until it has been determined, by checking the distributed cache for every other vertex in the chain, that there is no data obscuring the locally cached data. This particular case illustrates why batching several distributed cache requests in one remote access is not sufficient: a remote access would still be required to access local data. Besides, the distributed cache managers for the vertices are not guaranteed to be on the same node, and thus several remote accesses may still be required.

If a vertex is stable, or protected from modification, then Odin constructs for the vertex an existence map, a bitmap recording which pages the vertex has data for. Existence maps allow most vertices to be processed entirely locally, with a remote access generated only if the required data must be obtained from the distributed cache. When a vertex is stable, its existence map will not change, and thus may be distributed to all nodes using the vertex, without provisions for updating the existence map. Since the symmetric version creation method always creates stable interior vertices, and since Odin uses the symmetric method by default, most vertices will have existence maps. (Odin also uses techniques, beyond the scope of this paper, to force asymmetric version creation to create stable vertices.)

A complementary technique, applicable to unstable vertices, is deferred distribution. Typically, the only unstable vertices in a version tree are the leaves. Fortunately, leaf vertices usually represent a private version of an object, used only on a single node. Accordingly, the creation of a distributed cache manager for the vertex can be deferred until the vertex is exported to another node or until pageout from the vertex is necessary. If a vertex has no distributed cache manager, then only its local cache need be examined before moving to the next vertex. Even if a distributed cache manager must be created to handle pageout, the vertex can use a pager map to remember which pages have been paged out. This pager map can be used just like an existence map until the distributed cache manager notifies the node that another node has started using the vertex, at which point the pager map must be discarded because it will not reflect pages added by other nodes. Furthermore, if a private leaf vertex is transformed into a shared interior vertex, which is typical when migrating or copying data to another node, then the pager map can be transformed immediately into an existence map.
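A sketch of how these two techniques combine during lookup, under the simplifying assumption that an existence map is a small bitmap and that a vertex without a distributed cache manager holds only local data (all names are illustrative): the distributed cache is consulted only when neither the existence map nor deferred distribution can rule the page out.

    /* Hypothetical sketch of how an existence map avoids futile remote
     * lookups: a stable vertex carries a bitmap of the pages it holds, so a
     * node can skip the distributed cache manager for that vertex whenever
     * the bit is clear.  Names and structures are illustrative. */
    #include <stdio.h>

    #define NPAGES 8

    typedef struct vertex {
        struct vertex *parent;
        int stable;                     /* protected from modification?       */
        unsigned char existence_map;    /* bit i set => vertex has page i     */
        int has_manager;                /* distributed cache manager created? */
        int local[NPAGES];              /* page cached in this node's memory? */
    } vertex;

    static int remote_lookups;

    /* Does this vertex hold the page?  Consult the local cache first; only
     * fall back to a remote lookup when neither the existence map nor the
     * absence of a distributed cache manager (deferred distribution) can
     * rule the page out. */
    static int vertex_has_page(const vertex *v, int page)
    {
        if (v->local[page])
            return 1;
        if (v->stable && !(v->existence_map & (1u << page)))
            return 0;               /* existence map rules it out          */
        if (!v->has_manager)
            return 0;               /* deferred distribution: local only   */
        remote_lookups++;           /* must ask the distributed cache      */
        return 0;                   /* (assume a miss for this sketch)     */
    }

    int main(void)
    {
        /* A leaf whose only ancestor is a stable root with pages 0 and 1. */
        vertex root = { NULL,  1, 0x03, 1, { 0 } };
        vertex leaf = { &root, 0, 0,    0, { 0 } };
        const vertex *v;
        int found = 0;

        /* Look up page 5 by walking the chain toward the root. */
        for (v = &leaf; v != NULL && !found; v = v->parent)
            found = vertex_has_page(v, 5);

        printf("found: %d, remote lookups needed: %d\n", found, remote_lookups);
        return 0;
    }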

Object Recognition

Odin implements cross-version object recognition, the recognition of an object version on any node that has any version of the object. Cross-version object recognition allows effective object recognition even when new versions of an object are constantly being created, such as when a process repeatedly forks new children. Cross-version object recognition is implemented by combining vertex recognition and incremental version tree construction, whereby the destination node incrementally constructs a vertex path until the path joins an existing version tree at a recognized vertex.

Object recognition is invoked whenever a version of an object is mapped on a node. The simple case is when the mapped version is the original version of an object, in which case the corresponding vertex is either recognized or not. If it is, then it can be mapped immediately; if not, then a local representation of the vertex must first be created and connected with its distributed cache manager. The more complicated case is the mapping of a derived, or non-original, version of an object.

Incremental version tree construction is used to recognize derived versions of objects. If a destination node does not immediately recognize the derived version, it constructs a vertex path by starting with the mapped version's vertex and adding ancestors until either the root node has been added or the path joins a preexisting version tree. Since a reference to a derived version can only be obtained by listing the objects mapped in an address space, there is always a source address space and node from which to obtain the vertex path information. Note that each node keeps only a relevant subtree of each version tree, constructed incrementally as a result of object recognition and local version creation; the full version tree for an object may not exist on any node.
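A sketch of incremental construction, with the source node's version tree stood in for by a simple parent array and recognition reduced to a lookup in a "known" table (both hypothetical simplifications): the destination adds ancestors one at a time until the path joins its existing subtree or reaches the root.

    /* Hypothetical sketch of incremental version tree construction: when a
     * derived version is mapped, the destination node adds ancestor vertices
     * one at a time (obtained from the source node) until the path reaches a
     * vertex it already recognizes.  Names are illustrative. */
    #include <stdio.h>

    #define NVERTICES 8

    /* The source node's view: parent_of[v] is v's parent, -1 at the root. */
    static const int parent_of[NVERTICES] = { -1, 0, 1, 2, 3, -1, -1, -1 };

    /* The destination node's view: which vertices it already recognizes. */
    static int known[NVERTICES] = { 1, 1 };   /* knows vertices 0 and 1 only */

    /* Map vertex v on the destination: walk toward the root, creating local
     * representations until the path joins the preexisting (known) subtree. */
    static void map_version(int v)
    {
        int added = 0;
        while (v != -1 && !known[v]) {
            known[v] = 1;          /* create local representation of vertex v */
            added++;
            v = parent_of[v];      /* one query to the source node per step   */
        }
        if (v != -1)
            printf("joined existing tree at vertex %d after adding %d vertices\n",
                   v, added);
        else
            printf("built the whole path from the root (%d vertices)\n", added);
    }

    int main(void)
    {
        map_version(4);   /* version 4 derives from 3 <- 2 <- 1 <- 0 (root) */
        return 0;
    }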

Distributed Caching and Object Consistency

Odin uses a distributed cache manager associated with each object version vertex to provide distributed caching and object consistency. Each vertex's distributed cache manager mediates among nodes and with the vertex's pager. Distributed caching is implemented by fetching data from one node on behalf of another, or from the pager if necessary. Object consistency is implemented by managing a single-writer, multiple-reader protocol among nodes and with the pager. The distributed caching and object consistency implementations cooperate to save bookkeeping costs and to reduce network and disk traffic.

Distributed caching is implemented by keeping track of which nodes have which vertex data; when a node requests vertex data, the distributed cache manager knows which nodes to fetch the data from. The cache manager records a node as having vertex data upon supplying the node with the data. It records the node as no longer having data when told by the node that it has discarded the data during page replacement; when it tells the node to discard the page (to maintain consistency or satisfy an external pager request); and when the node responds negatively to a fetch request. When the cache manager receives a data request, it attempts to fetch data from all nodes it believes to have the data until one responds positively. The cache manager uses its record of who has what pages to reduce the number of futile fetches, but it does not attempt to eliminate them entirely, since doing so would create more work than is spent on useless fetches. The cache manager will never fetch from the pager when data is available on a node; this not only minimizes pager traffic, but also ensures that a node will not be supplied with stale data, since pager data may be out-of-date.
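The fetch loop can be sketched as follows, with the manager's bookkeeping reduced to two arrays (believed_to_have and actually_has are illustrative names; the second exists only to simulate a node that silently discarded a page): nodes believed to hold the data are tried in turn, futile fetches update the record, and the pager is consulted only when no node responds positively.

    /* Hypothetical sketch of the distributed cache manager's fetch loop: it
     * records which nodes it believes hold a page, asks those nodes in turn,
     * and goes to the pager only when no node answers positively.  Names and
     * data layout are illustrative. */
    #include <stdio.h>

    #define NNODES 4

    typedef struct {
        int believed_to_have[NNODES];   /* bookkeeping: may be stale         */
        int actually_has[NNODES];       /* ground truth, for this simulation */
    } page_record;

    static int fetch_page(page_record *pr)
    {
        int n;
        for (n = 0; n < NNODES; n++) {
            if (!pr->believed_to_have[n])
                continue;
            if (pr->actually_has[n]) {
                printf("fetched from node %d\n", n);
                return 1;
            }
            /* Futile fetch: the node had discarded the page silently; update
             * the record and keep trying other nodes. */
            printf("node %d no longer has the page\n", n);
            pr->believed_to_have[n] = 0;
        }
        /* Only when no node has the data do we go to backing store; this also
         * ensures a node is never handed stale pager data. */
        printf("fetched from pager\n");
        return 0;
    }

    int main(void)
    {
        page_record pr = { { 1, 1, 0, 0 }, { 0, 1, 0, 0 } };
        fetch_page(&pr);
        return 0;
    }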

Object consistency is implemented by combining vertex consistency with tree structure consistency. Tree structure consistency is automatically maintained during version tree construction and version creation. Vertex consistency is maintained by the vertex's distributed cache manager with a single-writer, multiple-reader protocol among nodes: each page of vertex data may either be read by multiple nodes, or written by a single node. Note that "single-writer" refers to nodes, not address spaces, and thus several address spaces may share write access to a page as long as they all live on the same node. Each node's local cache manager is responsible for transparently upgrading and downgrading access in response to distributed cache manager commands, and for translating read and write faults into consistency requests sent to the distributed cache manager.

The distributed caching and vertex consistency algorithms cooperate to reduce bookkeeping costs, futile fetches, and unnecessary pageouts. Bookkeeping costs are reduced by sharing one database for page possession and protection. The consistency algorithm keeps track of which nodes have what access to each page; the distributed caching algorithm uses any access as an indication that the node has that data. Futile fetches and unnecessary pageouts are reduced through the use of precious pages. Normally, when a node has read access to a page, it is allowed to discard the page without notifying the distributed cache manager, thereby reducing network traffic. However, when the consistency algorithm switches from writer to reader mode, it must ensure that the newly modified data is not discarded. It could send the page to the pager as well as the new readers, but doing so would increase pager load and disk traffic. Instead, it tells some of the readers that the page is precious, requiring them to return the page to the distributed cache manager instead of discarding it. Besides reducing pager load, precious pages allow the distributed caching algorithm to avoid futile fetches: a fetch from a precious page reader will always result in data, either as a result of the fetch or because the data was already being returned.
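A sketch of the writer-to-reader transition, under the assumption (made only for this sketch, as one reasonable policy) that the former writer's read-only copy is the one marked precious: the modified page then cannot be silently discarded, so it need not be pushed to the pager, and a fetch directed at that node will always yield data.

    /* Hypothetical sketch of the writer-to-readers transition with precious
     * pages.  Names are illustrative; which reader is marked precious is an
     * assumption of this sketch. */
    #include <stdio.h>

    #define NNODES 4

    enum access { NONE, READER, WRITER };

    typedef struct {
        enum access access[NNODES];
        int precious[NNODES];       /* must return the page, never discard it */
    } page_state;

    static void downgrade_writer(page_state *ps, int writer, int new_reader)
    {
        /* The former writer keeps the page read-only and is marked precious,
         * since only it holds the newly modified data. */
        ps->access[writer] = READER;
        ps->precious[writer] = 1;

        /* The new reader gets an ordinary copy it may discard silently. */
        ps->access[new_reader] = READER;
        ps->precious[new_reader] = 0;
    }

    int main(void)
    {
        page_state ps = { { WRITER, NONE, NONE, NONE }, { 0, 0, 0, 0 } };
        int n;

        downgrade_writer(&ps, 0, 2);
        for (n = 0; n < NNODES; n++)
            printf("node %d: access=%d precious=%d\n",
                   n, ps.access[n], ps.precious[n]);
        return 0;
    }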

External Memory Management

Odin implements its external memory management interface by extending its distributed cache management to treat the external memory manager as another local cache manager. All aspects of the interface, including changes in access and demands to clean or flush data, can be mapped into equivalent data or access requests as generated by the local cache managers.

Evaluation

Performance

Note to reviewers: the final paper will use measurements from a Paragon XP/S running OSF/1 AD.

The following table provides measurements which demonstrate the effectiveness of Odin's object recognition and version compression techniques for remote process creation. All measurements were performed using two Gateway 2000 486DX2/50s connected by an Ethernet. Both machines were running a modified NORMA_MK14 version of Odin and an unmodified UX38 STDAFS+WS release of Carnegie Mellon's UX Unix server. The parent process in all tests was a Unix C-shell, with approximately 203 virtual 4 KB pages and 60 resident pages.

 

Operation                                          Time (msec)  Pages copied  Percent virtual  Percent resident  Odin RPCs  Other RPCs
remote task create/destroy, no inheritance                  10             0               --                --          2          --
remote task create/destroy, inheritance                     70             0               --                --         18          --
remote task create/destroy, inheritance, cached             30             0               --                --          8          --
remote fork/exit                                           220             9               4%               15%         31           9
remote fork/exit, cached                                   100             3               1%                5%         10           9
remote fork/exit, cascaded                                 150             4               2%                7%         26           4

The first three measurements examine the overhead of address space creation and object recognition in the absence of operating system operations or page references. The first two measurements compare the cost of remotely creating an empty address space with the cost of inheriting an address space from the shell, with a net overhead of 60 milliseconds. The third measurement is the cost of inheriting an address space on a node that has already inherited a previous version of the same address space. The net overhead of 20 milliseconds is significantly lower, since much of the required version tree creation had already been performed by the first inheritance, as can be seen by the count of Odin-related RPCs.

The next two measurements examine the costs of remote process creation with inheritance. These costs represent upper bounds on migration costs, since migration can be approximated by creating a child process with inheritance and then destroying the parent. The first measurement in this set is the cost of remotely forking (creating with inheritance) a child process and then waiting for it to exit. Simply returning from the fork system call and calling exit causes the child to reference nine pages. The second measurement is the cost of forking and waiting for a second child on the same node. The resulting time is less than half that for the first child, since roughly two-thirds of both the required pages and the required version tree information had been cached by the creation of the first child. (The resulting time for the second process creation is not one-third of the first because a constant number of remote messages is required to initialize the non-address-space related aspects of each process.)

The final measurement is the cost for a remotely created child process to create a grandchild on the original parent's node. This sequence can be interpreted as an upper bound on the cost of an eviction migration for this process, where a process is forced to migrate to its original site because its current site has been claimed for another use. In this measurement, more than half of the referenced pages were found locally. More messages were required to build the version tree than when creating a second child on the same node, however, since not one but two generations of versions had to be added to the tree.

An obvious optimization which we have not yet performed is to batch the RPCs used to acquire version tree information. If a single RPC were performed for each generation instead of separately for each vertex, the number of Odin-related RPCs and resulting total time would be greatly reduced. For example, the number of RPCs required for the last measurement would drop from 26 to 9, with the resulting total time dropping to approximately 80 milliseconds.

Comparison with Related Work

Most of Odin's techniques, such as version compression and coherent object management, have appeared in previous systems. However, Odin is the only system to have combined all of these techniques; furthermore, Odin gains significant performance and functionality advantages by having done so. One area that Odin does not currently address is alternative scalable memory architectures; a possible implementation of Odin on one such alternative architecture is discussed below.

Mach virtual memory

Mach [Accetta et al. 86, Tevanian 87, Young 89] on a single node uses many of the same techniques that Odin does, such as external memory management and version compression, but Mach provides little help for managing multiple nodes. Mach provides external pagers, but does not provide coherency or distributed caching among multiple Mach nodes mapping an external pager. While in principle an external pager could itself implement coherency and distributed caching, doing so would add a great deal of complexity to most pagers. To the author's knowledge, no Mach external pagers implement distributed caching, and the few pagers that implement multinode coherency provide only zero-filled distributed shared memory. Mach also uses version compression on a single node to implement copy-on-write, but provides no interface to export versioning information, preventing its use among multiple nodes, for example to implement efficient address space copy-on-reference.

V process migration

V process migration [Theimer et al. 85, Theimer 86] uses precopying to hide the cost of copying an address space, but uses no techniques to reduce the amount of data to be copied. In contrast, by combining object recognition and copy-on-reference, Odin greatly reduces the amount of data to be copied, leaving little cost to hide. Furthermore, precopying is not applicable to remote forking, where the child must see a copy of the parent's address space at the time of the fork, and thus both parent and child must wait until the copy is complete.

Accent process migration

Like Odin, Accent process migration [Zayas 87a, Zayas 87b] uses copy-on-reference to reduce the amount of data to be copied and to allow the migrated process to start running immediately. However, since Accent lacks object recognition, it must remotely copy every page referenced by the migrated process, even when local copies of program text and initialized data are available. Furthermore, migrated processes in Accent maintain a permanent dependency on their source node, while Odin's dependency on the source node disappears once normal page replacement on the source node flushes all of the process's pages. Finally, if a process migrates several times, a page-in may result in a retrace through every previous site to find the page, even if the process has migrated back to its original site.

IVY

IVY [Li 86] replicates a single address space across multiple nodes, using shared virtual memory to maintain coherency among nodes and to provide distributed caching of data. IVY thus provides many of the same advantages that Odin does, but only for threads executing in the same replicated address space. This is because IVY's granularity of recognition is address spaces, not objects, and because IVY does not use version compression and thus cannot use one version of an address space to recognize another. Conversely, Odin uses IVY-like techniques to manage vertices, and thus can be viewed as an embedding of IVY in an object recognition and version compression framework.

Sprite process migration

By viewing files in Sprite [Douglis & Ousterhout 87, Ousterhout et al. 88] as Odin objects, many similarities between Sprite and Odin become apparent. Sprite, like Odin, constructs address spaces out of mapped objects, performs both local and nonlocal caching of these objects, implements process migration through copy-on-reference from the nonlocal cache and not from the source node, and provides a limited form of object recognition. However, by lacking version compression, coherent mapped objects, and distributed caching, Sprite is unable to provide some services, and provides others less efficiently.

Lack of version compression in Sprite slows process migration and prohibits efficient remote fork. Sprite process migration works by flushing all dirty pages from the source's address space into its paging file, destroying the source address space, and mapping its paging file into the new destination address space. Lack of version compression slows process migration because only read-only objects such as program text can be recognized; writable objects, including initialized data, cannot, and must instead always be paged (on demand) from the server. Lack of version compression prohibits efficient remote fork in Sprite, since there is no safe way for parent and child to share a single paging file, and no efficient way to create a new version of the paging file.

Sprite process migration is also slowed by the lack of distributed caching. In Sprite, object cache misses on a node can only be satisfied by data from the server and not from another node. Thus a migrated Sprite process must wait for all dirty pages to be flushed from the source node to the server before it can begin running on the destination node. In contrast, migrated processes under Odin start running immediately, with the distributed cache retrieving pages from the source's memory at first, then increasingly from the pager as the source node writes out pages as part of normal page replacement. Even if Odin aggressively wrote out pages from the source node as Sprite does to more quickly eliminate residual dependencies on the host, it could do so concurrently with the execution of the migrated process. Sprite's lack of distributed caching would be particularly troublesome on massively parallel processors with uniform nodes, where no node has a large enough memory to serve as an effective secondary cache for more than a few nodes.

Finally, by not providing coherent mapped objects, Sprite is unable to provide either user mappable files or shared memory between arbitrary processes, desirable services provided by most modern operating systems, including OSF/1. Conversely, processes that do share memory (through use of a Sprite variant of fork that leaves the data section shared between parent and child) cannot be migrated away from each other.

NUMA memory management

Although Odin was developed for non-shared-memory multiprocessors, many of its techniques can be applied to Non-Uniform Memory Access (NUMA) multiprocessors. NUMAs provide addressability of remote memory, but with slower access than private local memory, particularly since typically only private local memory can be cached. The primary advantage of NUMAs over non-shared-memory multiprocessors is the reduced performance impact of write-sharing. Work such as [Cox & Fowler 89] and [LaRowe et al. 91] has explored how to most efficiently implement a coherent memory abstraction on a NUMA. Such work parallels IVY, which provided the same abstraction in a non-shared-memory environment. Accordingly, it should be possible to host Odin on a NUMA by replacing its IVY-style vertex coherency management with NUMA coherency management. The result would provide the same object recognition and versioning advantages of the Mach port to the ACE NUMA [Bolosky et al. 89], but would additionally allow kernel data structures to be kept in local memory, significantly improving performance.

Conclusions

Odin combines techniques such as object recognition, version compression, distributed object data caching, and coherent object management into a single system framework. It supports copying, caching, and management of application data between nodes in a non-shared memory massively parallel processor. It allows object management servers to be written for a distributed shared memory environment without regard to concerns of distributed object data consistency. Odin is the key distribution component of the Open Software Foundation's OSF/1 AD and runs on networks of PCs and workstations as well as the Intel Paragon XP/S, a parallel processing platform capable of supporting two thousand i860XP processors. Performance measurements to date indicate the effectiveness of Odin's approach to distributed memory management. Internode page traffic is drastically reduced for system operations such as fork and migrate. We believe Odin has made process migration and remote process creation practical system tools for dynamic load balancing on multicomputer platforms.

References

[Accetta et al. 86] Accetta, M. J. et al. Mach: A new kernel foundation for UNIX development. In Proceedings of the USENIX 1986 Summer Conference, July 1986.

[Appel & Li 91] Appel, A. W., and Li, K. Virtual Memory Primitives for User Programs. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Systems (Santa Clara, Ca, Apr. 1991), pp. 96 - 107.

[Bolosky et al. 89] Bolosky, W., Scott, M., and Fitzgerald, R. Simple but effective techniques for NUMA memory management. In Proceedings of the 12th Symposium on Operating System Principles (Litchfield Park, AZ, Dec. 1989), pp. 19 - 31.

[Cox & Fowler 89] Cox, A. L., and Fowler, R. J. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with Platinum. In Proceedings of the 12th Symposium on Operating System Principles (Litchfield Park, AZ, Dec. 1989), pp. 32 - 43.

[Douglis & Ousterhout 87] Douglis, F., and Ousterhout, J. Process migration in the Sprite operating system. In Proceedings of the 7th International Conference on Distributed Computing Systems (Berlin, West Germany, Sept. 1987), pp. 18 - 25.

[Li 86] Li, K. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis (Yale 1986).

[LaRowe et al. 91] LaRowe, R. P., Ellis, C. S., and Kaplan, L. S. The robustness of NUMA memory management. In Proceedings of the 13th Symposium on Operating System Principles (Pacific Grove, CA, Oct. 1991), pp. 137 - 151.

[Ousterhout et al. 88] Ousterhout, J. et al. The Sprite network operating system. In IEEE Computer 21, 2 (1988), pp. 23-36.

[Popek & Walker 85] Popek, G. J., and Walker, B. J., ed. The Locus Distributed System Architecture (MIT Press, 1985).

[Tevanian 87] Tevanian, A. Architecture-independent virtual memory management for parallel and distributed environments: the Mach approach. PhD thesis (Carnegie Mellon 1987).

[Theimer et al. 85] Theimer, M., Lantz, K., and Cheriton, D. Preemptable remote execution facilities for the V system. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, WA, Dec 1985), pp. 13 - 22.

[Theimer 86] Theimer, M. Preemptable remote execution facilities for loosely-coupled distributed systems. PhD thesis (Stanford 1986).

[Young 89] Young, M. W. Exporting a user interface to memory management from a communication-oriented operating system. PhD thesis (Carnegie Mellon 1989).

[Zayas 87a] Zayas, E. Attacking the process migration bottleneck. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Austin, TX, Nov. 1987), pp. 13 - 22.

[Zayas 87b] Zayas, E. The use of copy-on-reference in a process migration system. PhD thesis (Carnegie Mellon 1987).