This is the third in a series of blog posts detailing the ongoing development of LogCabin. This entry describes a new tool to extract information from LogCabin servers and a tricky bug that occurred when unregistering handlers from the event loop.
Server Stats
Nate Hardt has been integrating and testing LogCabin into Scale's software stack, and he's run into a few issues lately that were needlessly time-consuming to diagnose. Some of the problems were silly, like when one test stopped the LogCabin daemon and it was never restarted for the next test. Still, even this was one hard to spot, since LogCabin had no tooling for diagnostics.
To make our lives easier, we wanted a tool that would go collect information that might be useful about each server's internals, then report it back in a nice way.
The first step was to collect some statistics into a Protocol Buffer, and have
servers dump these out to their debug logs. They now do this periodically
(defaulting to once per minute) and also upon the SIGUSR1
signal. Protocol
Buffers are nice for this sort of thing, since we anticipate the number of
stats to grow over time, and not every tool needs to understand each stat.
Here's an example of the raw output (using the Protocol Buffers text format):
server_id: 1
address: "192.168.2.1:61023"
start_at: 1421726306722327036
end_at: 1421726306722407779
raft {
current_term: 39
state: LEADER
commit_index: 135
last_log_index: 135
leader_id: 1
voted_for: 1
start_election_at: 9223372036854775807
withhold_votes_until: 9223372036854775807
last_snapshot_index: 123
last_snapshot_bytes: 1207
log_start_index: 124
log_bytes: 11019
peer {
server_id: 1
old_member: true
new_member: false
staging_member: false
last_synced_index: 135
}
peer {
server_id: 2
old_member: true
new_member: false
staging_member: false
request_vote_done: true
have_vote: true
force_heartbeat: false
next_index: 136
last_agree_index: 135
is_caught_up: true
next_heartbeat_at: 1421726306777304527
backoff_until: 1421726273743534090
}
peer {
server_id: 3
old_member: true
new_member: false
staging_member: false
request_vote_done: false
have_vote: false
force_heartbeat: false
next_index: 136
last_agree_index: 135
is_caught_up: true
next_heartbeat_at: 1421726306779778550
backoff_until: 1421726273643086864
}
peer {
server_id: 4
old_member: true
new_member: false
staging_member: false
request_vote_done: false
have_vote: false
force_heartbeat: false
next_index: 136
last_agree_index: 135
is_caught_up: true
next_heartbeat_at: 1421726306781546795
backoff_until: 1421726270475566462
}
peer {
server_id: 5
old_member: true
new_member: false
staging_member: false
request_vote_done: true
have_vote: true
force_heartbeat: false
next_index: 136
last_agree_index: 135
is_caught_up: true
next_heartbeat_at: 1421726306783424837
backoff_until: 1421726273743630864
}
}
The second step was to go collect all this information with a new RPC, named
getServerStats(). This RPC is different from existing ones in that it's not
destined for the cluster leader. That's because we want statistics from every
individual server, and they need to be collected even if the cluster has no
leader. The client library had to be refactored a little bit as a result, so
that the event loop and TCP connection rate-limiting mechanism are available
outside of Client::LeaderRPC
. A new client binary called ServerStats
invokes the RPC and prints out the results.
Finally, the scripts/serverstats.py
script understands the meaning of the
ServerStats fields, and it tries to present something a little nicer for human
consumption. Some of the fields in the raw output (like nanoseconds since the
Unix epoch) are not easy for humans, so the script translates those to more
meaningful formats. It also uses colors to highlight interesting things, and
it'll output a warning if the cluster has no leader (more warnings to come in
the future).
Here's an example of the serverstats.py
output for the same server as above:
Server 1 at 192.168.2.1:61023:
Leader at term 39
Snapshot covers entries 1 through 123 (1207 bytes)
Log covers entries 124 through 135 (11019 bytes)
All log entries committed
All log entries flushed to disk
Voted for server 1
Withholding votes for +infinity
Configuration: [1, 2, 3, 4, 5]
Peer 2 (old/curr):
Next index: 136
Match index: 135
Next heartbeat in +54.897 ms
Peer 3 (old/curr):
Next index: 136
Match index: 135
Next heartbeat in +57.371 ms
Peer 4 (old/curr):
Next index: 136
Match index: 135
Next heartbeat in +59.139 ms
Peer 5 (old/curr):
Next index: 136
Match index: 135
Next heartbeat in +61.017 ms
Stats collected at 2015-01-19 19:58:26.722
Took 0.081 ms to generate stats
The ServerStats structure will need to include much more information to be comprehensive, but it's probably easier to do that over time than all at once. It currently includes just the state that was easily available in the Raft module. There's also room for improvement on formatting the information nicely and highlighting problems. Still, it's a useful tool already, and it's a good starting point for further iteration.
DumpTree
It's also important to be able to see what a LogCabin state machine is storing
in its Tree structure. I created a simple client called DumpTree
to
recursively list out the values for each key. This will only work for small
state machines with human-readable values, but it's so much better than not
being able to see inside at all or writing a new C++ client every time.
Event Loop Crashes
Nate kept running into crashes like this one, which came with pretty mysterious error messages:
pure virtual method called
terminate called without an active exception
Program received signal SIGABRT, Aborted.
After seeing a few of these with similar stack traces, we took a closer look and found the problem. It's subtle but worth discussing.
LogCabin has a set of classes that wrap epoll
and provide an event loop.
Event::File
used to be an abstract base class that would register a file
descriptor with the event loop and arrange for the event loop thread to call a
virtual handleFileEvent()
method appropriately. The destructor on File would
first interrupt the event loop thread, then unregister the file handler.
RPC::ReceiveSocket
is one example of a class that derives from Event::File
,
representing the receiving end of a TCP connection. When ReceiveSocket is
destroyed, C++ calls its destructor, starts to destroy the ReceiveSocket
members, probably scribbles on the vtable, and only then does it call the
destructor on the base class, Event::File
. If the event loop thread called
handleFileEvent at this point (just before the Event::File
destructor), the
process would crash.
The desired behavior is to run through the shutdown procedure for Event::File
first, interrupting the event loop thread and unregistering the file handler.
Only then would it be safe to start destroying the ReceiveSocket.
One way to achieve this behavior is to require all classes that derive from
Event::File
to call a shutdown method on Event::File
as the first step in
their destructors. However, this approach would be error-prone: if someone
forgot to call this method in any one class, they might get crashes, either now
or anytime in the future.
Instead, the approach I implemented split Event::File
into two classes: one
to provide a handler, and a second one, called a Monitor
, to register the
handler with the event loop. The lifetime of the handler must extend beyond
that of the monitor; thus, the monitor's destructor will interrupt the event
loop and unregister the handler before the handler begins to be destroyed. This
is relatively easy for users of these classes to get right: they generally
declare a monitor following a handler in the same object or scope. The
monitor's constructor needs a reference to the handler, so it's hard to declare
these variables backwards. And if someone forgets to create a monitor
altogether, their handler will never fire, so they will probably notice in
testing.
Next
Scale is in the middle of an office move that's making some of the development and test clusters unavailable, so it's probably a good time to work through some of the issue backlog next. Thanks to Scale Computing for supporting this work. And congrats to Nate on his first two pull requests (#79 and #81).