From: Johannes Berg <johannes@sipsolutions.net>
Date: Fri, 1 May 2020 22:52:53 +0000 (+0200)
Subject: DESIGN: document the actual hashsplit algorithm
X-Git-Tag: 0.31~46
X-Git-Url: https://arthur.barton.de/cgi-bin/gitweb.cgi?p=bup.git;a=commitdiff_plain;h=7b140d5b027db0a82e059e545ab47a0ec03fbf32;ds=sidebyside

DESIGN: document the actual hashsplit algorithm

The hashsplit algorithm, when used for the fanout, has a quirk
that appears to be due to an implementation bug.

In order for splitting to occur, the lowest 13 (BUP_BLOBBITS)
bits of the csum need to be 1. Then, per DESIGN, the next bits
that are 1 are used for the fanout. However, the implementation
doesn't actually work this way. What actually happens is that
the lower 13 bits need to be ones:

 ........1'1111'1111'1111

Then, the DESIGN document states that the next bits that are 1
should be used for the fanout:

 ....'111_'____'____'____

However, the implementation actually ignores the next bit ('x')

 ....'11x_'____'____'____

and it doesn't matter whether that's set to 0 or 1, the fanout
will be based on the next higher bits (marked '.' and '1').

Fix this in the DESIGN documentation rather than changing the
algorithm as the latter would cause a save of an identical file
to completely rewrite the tree objects that make up the file,
due to different fanout behaviour.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
---

diff --git a/DESIGN b/DESIGN
index 8c2732d..4d15203 100644
--- a/DESIGN
+++ b/DESIGN
@@ -179,7 +179,7 @@ the blocks in the new backup would be different from the previous ones, and
 you'd have to store the same data all over again.  But with hashsplitting,
 no matter how much data you add, modify, or remove in the middle of the
 file, all the chunks *before* and *after* the affected chunk are absolutely
-the same.  All that matters to the hashsplitting algorithm is the 32-byte
+the same.  All that matters to the hashsplitting algorithm is the
 "separator" sequence, and a single change can only affect, at most, one
 separator sequence or the bytes between two separator sequences.  And
 because of rollsum, about one in 8192 possible 64-byte sequences is a
@@ -212,12 +212,17 @@ special tools.
 
 What we do instead is we extend the hashsplit algorithm a little further
 using what we call "fanout." Instead of checking just the last 13 bits of
-the checksum, we use additional checksum bits to produce additional splits. 
-For example, let's say we use a 4-bit fanout.  That means we'll break a
-series of chunks into its own tree object whenever the last 13+4 = 17 bits
-of the rolling checksum are 1.  Naturally, whenever the lowest 17 bits are
-1, the lowest 13 bits are *also* 1, so the boundary of a chunk group is
-always also the boundary of a particular chunk.
+the checksum, we use additional checksum bits to produce additional splits.
+Note that (most likely due to an implementation bug), the next higher bit
+after the 13 bits (marked 'x'):
+
+  ...... '..x1'1111'1111'1111
+
+is actually ignored next. Now, let's say we use a 4-bit fanout. That means
+we'll break a series of chunks into its own tree object whenever the next
+4 bits of the rolling checksum are 1, in addition to the 13 lowest ones.
+Since the 13 lowest bits already have to be 1, the boundary of a group of
+chunks is necessarily also always the boundary of a particular chunk.
 
 And so on.  Eventually you'll have too many chunk groups, but you can group
 them into supergroups by using another 4 bits, and continue from there.