...

This page describes improvements that can be made to the Wave in a Box (WIAB) wave store to scale it to large numbers of waves and simultaneous users and to build out the WIAB wave server to support scalable and reliable indexing, search, federation, and APIs.

1. BACKGROUND AND PROBLEMS

Presently (July 2011), the WIAB wave server stores wavelet deltas in (file based) persistent storage immediately after it applies them. All wavelets are kept in memory. On startup, the wave server reads all wavelet deltas from disk, keeps them all in memory, and constructs the state of every wavelet from its deltas. The wave server maintains no real index, only an in-memory cache of lists of wave views for local users and groups. Searches are performed by traversing these lists for all wavelets accessible to the given user.

...

The current implementation is very simplistic and inefficient in most ways, preventing WIAB from scaling beyond thousands of waves and dozens of users. The biggest problems are:
P1. Wavelets are reconstructed from scratch from their entire delta history when loaded from disk. This is expensive and causes excessive latency when loading a wave with a long edit history from disk.
P2. The wave server keeps the entire wavelet delta history in memory for all the wavelets it has loaded into memory. This makes the memory footprint for waves with long edit histories excessive.
P3. All wavelets are kept in memory in the wave server, and the wave server loads all wavelets from disk on startup. Obviously, the memory footprint and startup latency of this approach don't scale to large numbers of waves.
P4. The search implementation is tightly coupled with the in-memory wave store, in particular, user searches contend for the wavelet lock in the wave store. This makes searches slow and expensive and doesn't scale beyond dozens of simultaneous users.

2. GOALS

The goal is to build out WIAB's wave store and wave server to satisfy the following properties:
G1. The cost (in terms of storage operations and memory footprint) and latency to access a wave is proportional to its current size.
G2. The total memory footprint on the wave server is a modest constant factor larger than the "working set", measured as the aggregate current size of the waves which were recently modified.
G3. A scalable and modular wave server architecture, interconnected by a wave bus API that supports the construction of reliable indexing, search, federation, and API subsystems, suitably decoupled from the wave store.
The remainder of this page describes solutions to problems P1-P4 with an aim towards goals G1-G3.

3. WAVELETS WITH LONG EDIT HISTORIES

Problems P1 and P2 make wavelets with long edit histories unnecessarily expensive: they are slow and expensive to load into memory and their memory footprint is excessive. These can be solved in isolation in the wave store without repercussions for the wave bus and other subsystems.

3.1 DELTA HISTORY EVICTION

The excessive footprint can be avoided if the wave server evicts the oldest deltas from memory, that is, it keeps only the most recent deltas in memory and loads older deltas from disk only upon request. Since we don't want to perform any disk operations synchronously under the wavelet lock in the wave store, we need to make all the methods that access delta history asynchronous. Specifically, the following methods need to be made asynchronous:
box.server.waveserver.WaveletContainer#requestHistory()
box.server.waveserver.WaveletContainer#requestTransformedHistory()
box.server.waveserver.WaveletProvider#getHistory()
which, of course, requires changes to all the places they are called.
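As an illustration, an asynchronous variant of getHistory() could deliver its result through a future so that the disk read never happens on the caller's thread while the wavelet lock is held. The Delta type and method signatures below are assumptions for the sketch, not the actual WIAB types; only the shape of the API change is the point.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncHistorySketch {
    // Stand-in for a wavelet delta; the real type lives in the WIAB codebase.
    public record Delta(long version, String payload) {}

    // Synchronous disk read (illustrative stub; a real implementation would
    // read the requested range from the wavelet's delta file).
    private static List<Delta> readDeltasFromDisk(long startVersion, long endVersion) {
        return List.of(new Delta(startVersion, "..."));
    }

    // Asynchronous counterpart of WaveletProvider#getHistory(): the disk read
    // runs on a background thread and the result is delivered via a future,
    // so the caller never blocks while holding the wavelet lock.
    public static CompletableFuture<List<Delta>> getHistory(long startVersion, long endVersion) {
        return CompletableFuture.supplyAsync(() -> readDeltasFromDisk(startVersion, endVersion));
    }
}
```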

3.2 PERSISTENT WAVELET SNAPSHOTS

The cost and latency of loading wavelets into memory can be reduced by storing wavelet snapshots in persistent storage. Appropriate care is needed to ensure consistency with stored deltas. The frequency of writing the current snapshot to disk will need to be calibrated for best performance and resource economy.

Here is a simple design for adding persistent, consistent snapshots to the file based wave store implementation:
- Give the snapshot file a fixed name, "snapshot", so the file can be opened immediately without a prior directory lookup to determine the file name.
- In the snapshot file, include the snapshot version number and the index into the deltas file of the delta that ends at that version. This makes it easy to load only the deltas after the snapshot when a wavelet is loaded from disk.
- A wavelet is loaded from disk with the following steps:
  - Try to read the wavelet snapshot from the "snapshot" file.
  - If the file doesn't exist (or the read fails for any other reason), construct the wavelet in the old fashion from its entire delta history.
  - Otherwise, read only the deltas from the snapshot version onwards and construct the wavelet from the snapshot and those later deltas.
- The wave server only writes out a snapshot for a version of the wavelet after all the deltas up to that version have been written successfully to disk.
- Some heuristic determines when to write a new snapshot. It should strike a good balance between the cost of writing frequent snapshots and the cost of applying many deltas to an old snapshot if snapshots are written infrequently. For example, the heuristic could write a new snapshot whenever the aggregate size of the deltas since the last snapshot exceeds the size of a new snapshot.
- The snapshot is first written into a temporary file. Then, when it has been written successfully in its entirety, the temporary file is renamed to "snapshot", atomically replacing any pre-existing snapshot of an earlier version of the wavelet.
- On startup, the run-server.sh script can delete any dangling temporary files (not fully written and renamed when the last run of the wave server terminated).
The snapshot file is always optional. It will eventually be created if the delta history grows long enough to make a snapshot file worthwhile. Because it is optional, the wave server will automatically start creating snapshot files when this functionality is implemented, so there's no need to run any special migration logic to populate the wave store with snapshot files. Moreover, if a snapshot file is corrupted, the operator of the wave server can overcome the corruption by deleting the corrupt snapshot and restarting the wave server.
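The write-then-rename and optional-read steps of this design can be sketched as follows. The file names match the design above; the byte-array snapshot encoding and the directory layout are placeholder assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SnapshotStoreSketch {
    // Write the snapshot into a temporary file, then atomically rename it to
    // "snapshot" so readers never observe a partially written snapshot and any
    // earlier snapshot is replaced in one step.
    public static void writeSnapshot(Path waveletDir, byte[] snapshotBytes) {
        try {
            Path temp = waveletDir.resolve("snapshot.tmp");
            Files.write(temp, snapshotBytes);
            Files.move(temp, waveletDir.resolve("snapshot"),
                StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Load path: return the snapshot bytes if present, or null if the file is
    // missing or unreadable, in which case the caller falls back to
    // reconstructing the wavelet from its entire delta history.
    public static byte[] loadSnapshotOrNull(Path waveletDir) {
        try {
            return Files.readAllBytes(waveletDir.resolve("snapshot"));
        } catch (IOException e) {
            return null;
        }
    }
}
```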

4. PERSISTENT INDEX

Similarly to the way the delta history of a wavelet can be paged out and paged in, it is useful to load wavelets into memory and unload them from memory in accordance with how and when they are accessed. This solves problem P3, namely the lack of scalability of keeping all wavelets resident in memory and loading them all at wave server startup.
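A minimal way to keep only recently accessed wavelets resident is an LRU cache. The sketch below uses LinkedHashMap's access-order mode; the residency budget is an assumed tuning parameter, and an evicted wavelet would be reloaded on demand from its snapshot and deltas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache of resident wavelets, keyed by wavelet name.
public class WaveletCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxResident;

    public WaveletCacheSketch(int maxResident) {
        // accessOrder=true makes iteration order least-recently-accessed first.
        super(16, 0.75f, true);
        this.maxResident = maxResident;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used wavelet once the cache exceeds its
        // budget; the wavelet stays on disk and can be paged back in later.
        return size() > maxResident;
    }
}
```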

...

When it comes to indexing stored wavelets, it's a real drawback that we only have a file based implementation of the wave store. If the persistent wave store were backed by a good database, we could make use of the database's indexing facilities. Alternatively, we could integrate an off-the-shelf indexing system such as Apache Lucene, independent of the underlying wave store implementation. Regardless, in keeping with the existing file based wave store, in section 4.4 we consider how a primitive file based index can be implemented, as an illustration.

4.1 GOOGLE WAVE'S SEARCH AND INDEXING

First, let us consider Google Wave's search and indexing solution:
- Users can only search waves that they share: either waves that they participate in or follow, or waves shared with a group that they are members of.
- Google Wave exploits this structure to organize its persistent index as an index per user and per group. Within each user/group index, there is a posting list per token, sorted by last-modified time (LMT).
- This solution makes searches involving long posting lists relatively efficient: long posting lists are already intersected with the relevant user/group, and only a prefix of the posting list needs to be accessed to deliver the first page of results.
- All search results are ordered by LMT, a useful and conventional sort order in communication tools.
- The indexer maintains a cache of index changes since it last modified a user/group persistent index and occasionally rewrites the persistent index, updating the sort order according to LMT changes.
- The indexer is a separate process (running on a separate machine) from the wave server. It is connected to the wave server by a "remote wave bus", that is, the indexer maintains a network connection to the wave server and receives a message for every update to a wave in the wave store.
- To recover any messages lost due to network or server outages, the wave server maintains a commit log of all persisted wave updates and broadcasts the commit log on the wave bus along with the actual wave updates. The indexer keeps a pointer into the commit log in persistent storage, tracking its progress in processing the logged updates and recording them all in the persistent user/group indexes. If the indexer restarts, it contacts the wave server to read the commit log contents from the pointer onwards, catching up on any missed, persisted wave updates while indexing new wave updates from the wave bus.
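The catch-up protocol can be sketched as a small replay loop. The CommitLog interface and method names below are assumptions invented for the sketch; the real WIAB commit log (section 5) is still a TODO in this document.

```java
import java.util.List;
import java.util.function.Consumer;

public class IndexerCatchUpSketch {
    // Hypothetical view of the wave server's commit log, as seen by the indexer.
    public interface CommitLog {
        // Returns the logged, persisted wave updates at positions >= pointer.
        List<String> readFrom(long pointer);
    }

    // Replay every missed update through the indexer, then return the new
    // pointer, which the indexer persists before resuming live wave-bus traffic.
    public static long catchUp(CommitLog log, long persistedPointer, Consumer<String> indexUpdate) {
        List<String> missed = log.readFrom(persistedPointer);
        for (String update : missed) {
            indexUpdate.accept(update); // re-index each missed, persisted update
        }
        return persistedPointer + missed.size();
    }
}
```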

4.2 PERSISTENT INDEX FOR WIAB

The following is a simple persistent user/group index design for WIAB.

...

The index maintains a record for each wave in the index with its wave id and LMT and possibly some metadata about the wave. (If the record includes blip count, unread count, first three participants, and a snippet, then a search result digest can be generated from this record without having to load the wave when this wave is found during search.)
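For illustration, such a per-wave record could look like the following in Java. The field names are assumptions; the point is that the record carries enough metadata to render a search-result digest without loading the wave.

```java
import java.util.List;

// Hypothetical per-wave index record: wave id, LMT, and the metadata needed
// to build a search-result digest directly from the index.
public record WaveIndexRecord(
        String waveId,
        long lastModifiedTime,
        int blipCount,
        int unreadCount,
        List<String> firstParticipants, // first three participants
        String snippet) {

    // A digest assembled purely from index data, with no wave load.
    public String digest() {
        return waveId + " | " + String.join(", ", firstParticipants) + " | " + snippet;
    }
}
```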

4.3 MONGODB IMPLEMENTATION

The index for a user or group is a MongoDB collection. The documents in the index are wave records with wave id, LMT, and any other metadata we want, plus an array of tokens. Each collection is indexed by tokens, using MongoDB's multikeys feature.

$ mongo
MongoDB shell version: 1.8.2
connecting to: test
> db.INDEXsoren.find()
> db.INDEXsoren.ensureIndex({tokens:1})
> db.INDEXsoren.save({
    waveid:"example.com/wave1",
    lastmodifiedtime:1311170776,
    participants:["soren@example.com","john.smith@example.com"],
    snippet:"This is wave one",
    tokens:[
      "id:example.com/wave1",
      "creator:soren@example.com",
      "creator:soren",
      "is:read",
      "with:soren@example.com",
      "with:soren",
      "with:john.smith@example.com",
      "with:john.smith",
      "with:john",
      "with:smith",
      "this",
      "is",
      "wave",
      "one",
    ]})
> db.INDEXsoren.save({
    waveid:"example.com/wave2",
    lastmodifiedtime:1311170777,
    participants:["bob@example.com","soren@example.com"],
    snippet:"Wave two",
    tokens:[
      "id:example.com/wave2",
      "creator:bob@example.com",
      "creator:bob",
      "is:unread",
      "with:bob@example.com",
      "with:bob",
      "with:soren@example.com",
      "with:soren",
      "wave",
      "two",
    ]})
> db.INDEXsoren.find()
{ "_id" : ObjectId("4e26e2005dcfccaa5f94b81b"),
  "waveid" : "example.com/wave1",
  "lastmodifiedtime" : 1311170776,
  "participants" : [ "soren@example.com", "john.smith@example.com" ],
  "snippet" : "This is wave one",
  "tokens" : [ "id:example.com/wave1", ... ]
}
{ "_id" : ObjectId("4e26e4d15dcfccaa5f94b81c"),
  "waveid" : "example.com/wave2",
  "lastmodifiedtime" : 1311170777,
  "participants" : [ "bob@example.com", "soren@example.com" ],
  "snippet" : "Wave two",
  "tokens" : [ "id:example.com/wave2", ... ]
}
> db.INDEXsoren.find({tokens:"wave"})
{ "_id" : ObjectId("4e26e2005dcfccaa5f94b81b"), ... }
{ "_id" : ObjectId("4e26e4d15dcfccaa5f94b81c"), ... }
> db.INDEXsoren.find({tokens:"is:unread"})
{ "_id" : ObjectId("4e26e4d15dcfccaa5f94b81c"), ... }
> // a conjunctive query is constructed with {$all:[...]}
> db.INDEXsoren.find({tokens:{$all:["with:soren","wave"]}})
{ "_id" : ObjectId("4e26e2005dcfccaa5f94b81b"), ... }
{ "_id" : ObjectId("4e26e4d15dcfccaa5f94b81c"), ... }
> db.INDEXsoren.find({tokens:{$all:["with:soren","wave","one"]}})
{ "_id" : ObjectId("4e26e2005dcfccaa5f94b81b"), ... }

4.4 FILE BASED IMPLEMENTATION

The index for a user or group is a file. It consists of a short header with pointers to the sections of the index, followed by three sections:
- Wave list section: Lists the waves in the index, sorted by LMT. Contains a record for each wave with its wave id and possibly some metadata about the wave. (If the record includes LMT, unread count, and a snippet, then a search result digest can be generated from this record without having to load the wave when this wave is found during search.)
- Posting lists section: For each token there is a posting list with pointers to all the waves in the index that contain this token, sorted by LMT. The posting list begins with the token string itself. The wave pointers point into the wave list section. (Since that section is also ordered by LMT, the pointers in a posting list are in increasing numerical order.) The posting lists themselves are sorted by a hash of the token.
- Posting lists index section: A list of (token hash, posting list pointer) pairs, sorted by token hash. The posting list pointers point into the posting lists section.
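The header-plus-sections layout can be sketched as a simple serializer: three absolute offsets up front, then the section bodies back to back. The fixed-width encoding here is an assumption for illustration; the real section encodings are not specified in this document.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class IndexFileLayoutSketch {
    // Serialize the three index sections behind a header of three absolute
    // file offsets (wave list, posting lists, posting lists index).
    public static byte[] serialize(byte[] waveList, byte[] postingLists, byte[] postingIndex) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            long headerSize = 3L * Long.BYTES; // three 8-byte offsets
            out.writeLong(headerSize);                                         // wave list offset
            out.writeLong(headerSize + waveList.length);                       // posting lists offset
            out.writeLong(headerSize + waveList.length + postingLists.length); // posting index offset
            out.write(waveList);
            out.write(postingLists);
            out.write(postingIndex);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader opens the file, reads the 24-byte header, and can then seek directly to any section, e.g. binary-searching the posting lists index section by token hash.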

...

TODO: describe how to rewrite the on-disk index periodically and use the commit log to deal with data loss when the server is terminated abruptly and doesn't write recent changes to the on-disk indices before it terminates ...

5. WAVE STORE COMMIT LOG

TODO: ...