7 Log compaction
Raft’s log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.
Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.
Incremental approaches to compaction, such as log cleaning [36] and log-structured merge trees [30, 5], are also possible. These operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.
Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot: the last included index is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of this entry. These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (Section 6), the snapshot also includes the latest configuration in the log as of last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.
Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (Section 6) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.
The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind; see Figure 13. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient’s log. In this case, the follower discards its entire log; it is all superseded by the snapshot and may possibly have uncommitted entries that conflict with the snapshot. If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.
This snapshotting approach departs from Raft’s strong leader principle, since followers can take snapshots without the knowledge of the leader. However, we think this departure is justified. While having a leader helps avoid conflicting decisions in reaching consensus, consensus has already been reached when snapshotting, so no decisions conflict. Data still only flows from leaders to followers, just followers can now reorganize their data.
We considered an alternative leader-based approach in which only the leader would create a snapshot, then it would send this snapshot to each of its followers. However, this has two disadvantages. First, sending the snapshot to each follower would waste network bandwidth and slow the snapshotting process. Each follower already has the information needed to produce its own snapshots, and it is typically much cheaper for a server to produce a snapshot from its local state than it is to send and receive one over the network. Second, the leader’s implementation would be more complex. For example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.
There are two more issues that impact snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.
The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system’s copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (our implementation uses this approach).
7 日志压缩
在正常操作期间,Raft日志的不断增长来合并更多的客户端请求,但是在实际系统中,它不可能无限的增长。当日志增长时,它会占用更多空间并耗费更多时间来重现。如果没有机制来丢弃日志中累积的过时的信息就会最终导致可用性问题。
快照是压缩最普遍的方式。快照中写入了整个当前系统的状态,然后就丢弃整个日志。Chubby和Zookeeper都使用了快照,而本章的剩余部分则描述了Raft的快照。
用增量的方法来进行压缩,比如日志清理和日志结构合并树[30,5]都可以实现。这些操作每次操作一部分数据,所以他们分离压缩的负载用时更平均。他们首先选择已经积累了很多被删除和覆盖的对象的区域,然后改写那些存放活对象的区域使之更紧凑,并释放该区域。这相比快照明显需要额外的机制和复杂性,但简化了总是对整个数据集操作的问题。当日志清理需要修改Raft时,状态机可以通过使用和快照相同的接口来实现LSM树。
图12显示了Raft的快照基本思路。每个服务器独立进行快照,只覆盖提交了的日志条目。大部分的工作就是由状态机写入它的状态到快照中。Raft的快照还包含了少量的元数据:最新的索引就是快照替换的最新的日志条目(状态机申请的最新条目),最新的term就是这个条目的term。这些被保留下来以支持AppendEntries对第一个日志条目的一致性检查,因为条目需要之前日志的索引和term。为了集群成员关系能够改变(第 6章),快照还包含了作为日志最新索引的最新配置。一旦服务器完成写快照,就会删除最新索引之前的所有日志条目,以及之前的所有快照。
虽然服务器一般都要独立的进行快照,leader偶尔需要发送快照给那些落后的follower。这会在leader已经准备抛弃下一个需要发送给follower的日志条目时发生。幸运的是,这不太可能在正常运行时发生:跟上leader脚步的follower就已经有这个条目了。然而,一个格外慢的follower或是一个新加入集群的服务器(第6章)就没有。让一个这样的follow跟上来的方法就是leader通过网络发送快照给它。
leader使用一个叫做InstallSnapshot的新RPC发送快照给那些落后很多的follower;见图13。当一个follower接收到这个带有快照的RPC时,它必须决定如何处理现有的日志。通常情况下,快照将包含接收者日志没有的信息。这种情况下,follower丢弃整个日志条目;它将用快照取代所有,并可能有未提交的条目与快照冲突。相反,follower收到一个描述日志前缀(由于转发或错误)的快照,然后用快照替换的日志条目被删除,但是之后快照的条目仍然有效且必须保留。
这种策略背离了Raft的强leader原则,因为follower可以在leader不知情的情况下创建快照。不过,我们认为这种背离是合理的。虽然有一个leader有助于避免达成共识决定时的冲突,但是共识已在快照时就达成了,所以没有决定是冲突的。数据还是从leader流向follower的,只是follower能决定重组自己的数据。
我们考虑过一个基于leader的,只有leader可以创建快照的替代方案,然后发送快照给每一个follower。但这有两个缺点。首先,发送快照给每个follower会浪费很多的网络带宽并减缓快照进程。每个follower需要有信息创建到自己的快照中,服务器在本地创建快照比起通过网络来发送和接收更实惠。第二,leader的实现会更复杂。例如,leader需要并行发送快照给follower来复制日志条目给他们,从而不阻止客户端的请求。
同样,也有两个问题影响快照性能。首先,服务器需要决定什么时间创建快照。如果服务器创建快照太过频繁,它就会浪费磁盘带宽和性能;如果很少创建快照,就会加剧磁盘容量的风险,在重启期间重播日志的时间也会增加。一个简单的存储策略是当日志达到固定字节大小的时候创建快照。如果这个大小比预期的快照大小明显要大,那么磁盘带宽的开销就会很小了。
第二个性能问题是写快照需要消耗大量的时间,而我们不希望这个耽误正常操作。解决方案就是使用copy-on-write(在副本上写)的技术以便在不影响快照写入的情况下接收新的更新。例如,基于功能性数据结构构建的状态机自然就支持这个。另外,操作系统的copy-on-write(如linux分支)可以用来创建整个状态机的内存快照(我们的实现就采用的这个策略)。