LogCabin.appendEntry(4, "SegmentedLog")

| raft logcabin

This is the fourth in a series of blog posts detailing the ongoing development of LogCabin. This entry describes LogCabin's new storage module and several other recent improvements.

Storage Module

Storage modules are how LogCabin servers persist their logs to disk. Most of the time, entries are just appended to the end of the log. Two other operations come up less frequently (a rough interface sketch follows the list):

  1. If leaders change, a few entries from the end of the log might need to be truncated away.

  2. Periodically, snapshots are written that make some prefix of the log redundant. Those entries should be discarded.
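
The sketch below shows that shape in code; the class and method names are illustrative, not LogCabin's actual internal interface.

    // Rough sketch of what a storage module has to support. The names here
    // are illustrative; they do not mirror LogCabin's internal interface.
    #include <cstdint>
    #include <string>

    class StorageModule {
      public:
        virtual ~StorageModule() {}
        // The common case: durably append an entry at the end of the log.
        virtual void append(const std::string& entry) = 0;
        // Rare: after a leader change, discard entries with index > lastIndex.
        virtual void truncateSuffix(uint64_t lastIndex) = 0;
        // Rare: after a snapshot, discard entries with index < firstIndex.
        virtual void truncatePrefix(uint64_t firstIndex) = 0;
        // Read an entry back, e.g. when replicating to a follower.
        virtual std::string getEntry(uint64_t index) const = 0;
    };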

Up until now, LogCabin had a pretty naive storage module named SimpleFileLog. On each log append, SimpleFileLog would write the new log entry to disk as a separate file, then fsync that file, the directory, and a separate metadata file. This was slow and was also never tested. It was never meant to be more than just a placeholder, and replacing it has been on the to-do list since 2012.
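
To make the cost concrete, each append amounted to something like the following. This is an illustration of the pattern, not SimpleFileLog's actual code; the function name and arguments are made up.

    // Illustration of the per-append pattern described above, not the actual
    // SimpleFileLog code: every entry requires flushing the entry file, the
    // directory, and a separate metadata file.
    #include <fcntl.h>
    #include <unistd.h>
    #include <string>

    void naiveAppend(const std::string& dir, const std::string& name,
                     const std::string& entry) {
        std::string path = dir + "/" + name;
        int fd = ::open(path.c_str(), O_CREAT | O_WRONLY, 0644);
        ::write(fd, entry.data(), entry.size());
        ::fsync(fd);                             // flush the new entry file
        ::close(fd);

        int dirFd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
        ::fsync(dirFd);                          // flush the directory entry
        ::close(dirFd);

        // ...and then another write + fsync for a separate metadata file.
    }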

Last year when I was running performance benchmarks for my dissertation, I finally had a need for a faster storage module. That's how SegmentedLog was born. It was written to use disks efficiently, while still sitting on top of any filesystem for easy deployment. SegmentedLog worked well enough for performance benchmarks, but my dissertation got in the way, and SegmentedLog stayed in a not-quite-usable state.

Over the last couple of weeks, I dug up SegmentedLog, cleaned it up, tested it, and merged it into master. It should behave much as it did before, but I fixed a few bugs in the process and touched nearly every line of code.

A storage module that wrote all entries to a single file in a sequential manner would be really simple and efficient. However, that wouldn't handle truncating the start of the log well (after snapshotting). The POSIX interface makes truncating the end of files easy but provides no support for truncating the start of a file. On the other end of the spectrum, writing each entry to its own file as in SimpleFileLog is inefficient, as it wastes precious disk writes on directory updates.
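
For reference, trimming a file's tail is a one-liner with POSIX ftruncate, but there is no portable counterpart for trimming its head. The helper below is just an illustration:

    // POSIX can drop bytes from the end of a file cheaply...
    #include <fcntl.h>
    #include <unistd.h>

    void dropTail(const char* path, off_t newLength) {
        int fd = ::open(path, O_WRONLY);
        ::ftruncate(fd, newLength);  // everything past newLength is discarded
        ::fsync(fd);
        ::close(fd);
    }
    // ...but there is no dropHead() equivalent; the log's start index has to
    // be tracked some other way (SegmentedLog records it in a metadata file).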

Finding some middle ground, SegmentedLog appends new entries to files it calls segments that are about 8 megabytes in size. Once a segment fills up with entries, it starts writing entries into a new file. To truncate entries at the end of the log, it uses the filesystem's truncate operation. To truncate entries at the start of the log, it first writes the new start index to a metadata file, then it removes any complete segments that are no longer needed. This can leave up to one segment's worth of redundant entries (a few megabytes) in place at the start of the log, which shouldn't pose a problem.
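
Here's a sketch of that start-of-log truncation. The types and structure are hypothetical, and the real code differs in the details (for instance, a careful implementation would write a new metadata file and rename it into place):

    // Hypothetical sketch of discarding a log prefix the SegmentedLog way:
    // first make the new start index durable, then unlink whole segments
    // that fall entirely before it. A partially covered segment stays.
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    struct Segment {
        std::string filename;
        uint64_t firstIndex;   // index of the first entry in this segment
        uint64_t lastIndex;    // index of the last entry in this segment
    };

    void truncatePrefix(uint64_t newStartIndex, std::vector<Segment>& segments) {
        // 1. Durably record the new start index in a metadata file.
        FILE* meta = fopen("metadata", "w");
        fprintf(meta, "%llu\n", (unsigned long long) newStartIndex);
        fflush(meta);
        ::fsync(fileno(meta));
        fclose(meta);

        // 2. Remove segments whose entries all precede the new start index.
        std::vector<Segment> keep;
        for (const Segment& segment : segments) {
            if (segment.lastIndex < newStartIndex)
                ::unlink(segment.filename.c_str());
            else
                keep.push_back(segment);
        }
        segments.swap(keep);
    }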

To further avoid metadata updates, SegmentedLog avoids changing a segment's file size during appends. A separate thread preallocates segment files at their full size and calls fsync to write each file's metadata to disk. After that, normal log appends only require fdatasync calls, which should be cheaper than full fsync calls. When a segment fills up (it can't fit another entry), the few extra zero bytes at the end are truncated, just to tidy things up.
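
The pattern looks roughly like this; again a sketch with made-up names and constants, not the actual SegmentedLog code:

    // Sketch of the preallocate-then-fdatasync pattern. A background thread
    // creates each segment at its full size and fsyncs it once; after that,
    // appends only need fdatasync, because the file's size and other
    // metadata never change.
    #include <fcntl.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    static const size_t SEGMENT_BYTES = 8 * 1024 * 1024;  // ~8 MB segments

    // Done ahead of time by the allocator thread.
    int preallocateSegment(const std::string& path) {
        int fd = ::open(path.c_str(), O_CREAT | O_RDWR, 0644);
        std::vector<char> zeros(SEGMENT_BYTES, 0);
        ::write(fd, zeros.data(), zeros.size());  // reserve the full size now
        ::fsync(fd);   // flush data and metadata (size, allocation) once
        return fd;
    }

    // Done on every append to an open segment.
    void appendEntry(int fd, off_t offset, const std::string& entry) {
        ::pwrite(fd, entry.data(), entry.size(), offset);
        ::fdatasync(fd);  // data only; the metadata is already on disk
    }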

SegmentedLog will become the default storage module once we gain more experience with it, and SimpleFileLog will be deprecated soon.

Configurable Timeouts

Two classes of timeouts were hard-coded in LogCabin and are now configurable:

  1. The Raft-level timeouts such as the election timeout and the heartbeat interval.
  2. The lower-level heartbeats sent on TCP connections that have slow outstanding RPCs, used to make sure those connections are still alive.

The interesting thing about the lower-level heartbeats is that the code is shared with the client library, and the client library doesn't consume a configuration file. Thus, the Client::Cluster constructor can now take a map from string to string of options, which applications can configure as they see fit. The only option so far is this timeout setting, but I'm sure more will follow.
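
Usage looks something like the sketch below. The option key is a placeholder, and the include path and constructor signature are assumptions, so check Client.h for the real names:

    #include <map>
    #include <string>
    #include <LogCabin/Client.h>

    int main() {
        std::map<std::string, std::string> options;
        // Placeholder key: the real name of the TCP heartbeat timeout option
        // is defined in the client library's headers/documentation.
        options["tcpHeartbeatTimeoutMilliseconds"] = "2000";
        LogCabin::Client::Cluster cluster("logcabin-host:5254", options);
        // ... use cluster as usual ...
        return 0;
    }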

Application-Level Testing

LogCabin's client library includes a mode in which all operations execute against an in-memory tree data structure. This is meant to aid in testing applications, so that they don't need to set up a full LogCabin cluster for every test. This testing mode was limited, however, in that it didn't give application-level tests any way to simulate failures such as timeouts, or to inject state changes or results when specific operations were called.

Now the application can register a pair of callbacks with the client library which interpose on requests to the LogCabin Tree. They can inspect the contents of requests, modify them, and/or return custom results.

These callbacks operate at the level of the protocol buffers used for communication between clients and servers (Protocol::Client::ReadOnlyTree and Protocol::Client::ReadWriteTree). These protocols aren't exactly part of the public LogCabin API, but working at this low layer lets applications get at all the information they need in a few lines of code, without being burdened by a bunch of C++ types.
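
To illustrate the shape of the feature without pinning down the exact API, here is a sketch built from stand-in types that I made up; they are not the client library's real classes, which hook into its own protobuf request and response types:

    // Hypothetical stand-in types showing the interposition idea only.
    #include <functional>
    #include <string>

    struct TreeRequest  { std::string path; std::string contents; };
    struct TreeResponse { bool ok; std::string error; std::string contents; };

    // Returning true means "the callback produced the response itself";
    // returning false lets the request fall through to the in-memory tree.
    using TreeHook = std::function<bool(TreeRequest&, TreeResponse&)>;

    struct TestHooks {
        TreeHook onReadOnly;   // sees read-only Tree requests
        TreeHook onReadWrite;  // sees read-write Tree requests
    };

    // Example: make every write under /locks/ fail, as if the request timed out.
    TestHooks hooks = {
        /* onReadOnly  */ [](TreeRequest&, TreeResponse&) { return false; },
        /* onReadWrite */ [](TreeRequest& req, TreeResponse& resp) {
            if (req.path.compare(0, 7, "/locks/") == 0) {
                resp.ok = false;
                resp.error = "timeout";
                return true;       // short-circuit: don't touch the tree
            }
            return false;
        },
    };

However the pair is actually registered (the testing-mode Cluster constructor accepts the callbacks), the key point is that read-only and read-write requests get separate hooks, each free to inspect the request, rewrite it, or answer it outright.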

Misc

A Note on LogCabin's Performance

Several people have asked me about LogCabin's performance. The top questions are:

Unfortunately, performance in LogCabin has never been the top priority, and it hasn't gotten the dedicated attention it needs.

I made several changes while running benchmarks for my dissertation that still haven't landed in master (these are in the nasty-thesis-wip tag, which I will not be supporting). Some of these changes may be improvements while others are probably bad ideas. They need further evaluation and care before they're ready to merge.

The good news:

And the bad news:

Next

I'll be working through more of the issue backlog next. First up is a problem that Scale's regression tests found, where drastically changing the time on the leader of a LogCabin cluster will needlessly kill all of the clients. Thanks to Scale Computing for supporting this work.