X-Git-Url: https://arthur.barton.de/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=DESIGN;h=c7eca302bdad3047b3922703976777c1fbbd8612;hb=763b2d09f993b637a81257d7efdc6a4b5448cb6e;hp=7d2b64d6307f2fbc509c116ccf160696f056cc61;hpb=8585613c1f45f3e20feec00b24fc7e3a948fa23e;p=bup.git

diff --git a/DESIGN b/DESIGN
index 7d2b64d..c7eca30 100644
--- a/DESIGN
+++ b/DESIGN
@@ -116,11 +116,11 @@ giant tarball each day, then send that tarball to bup, bup will be able to
 efficiently store only the changes to that tarball from one day to the
 next.  For small files, bup's compression won't be as good as xdelta's, but
 for anything over a few megabytes in size, bup's compression will actually
-*work*, which is a bit advantage over xdelta.
+*work*, which is a big advantage over xdelta.
 
 How does hashsplitting work?  It's deceptively simple.  We read through the
-file one byte at a time, calculating a rolling checksum of the last 128
-bytes.  (Why 128?  No reason.  Literally.  We picked it out of the air.
+file one byte at a time, calculating a rolling checksum of the last 64
+bytes.  (Why 64?  No reason.  Literally.  We picked it out of the air.
 Probably some other number is better.  Feel free to join the mailing list
 and tell us which one and why.)  (The rolling checksum idea is actually
 stolen from rsync and xdelta, although we use it differently.  And they use
@@ -145,7 +145,7 @@ we need your help!  Avery's out of control!  Join our mailing list!  Please!
 Save us! ... oh boy, I sure hope he doesn't read this)
 
 In any case, rollsum seems to do pretty well at its job.
-You can find it in bupsplit.c.  Basically, it converts the last 128 bytes
+You can find it in bupsplit.c.  Basically, it converts the last 64 bytes
 read into a 32-bit integer.  What we then do is take the lowest 13 bits of
 the rollsum, and if they're all 1's, we consider that to be the end of a
 chunk.  This happens on average once every 2^13 = 8192 bytes, so the
@@ -182,7 +182,7 @@ file, all the chunks *before* and *after* the affected chunk are absolutely
 the same.  All that matters to the hashsplitting algorithm is the 32-byte
 "separator" sequence, and a single change can only affect, at most, one
 separator sequence or the bytes between two separator sequences.  And
-because of rollsum, about one in 8192 possible 128-byte sequences is a
+because of rollsum, about one in 8192 possible 64-byte sequences is a
 separator sequence.  Like magic, the hashsplit chunking algorithm will
 chunk your file the same way every time, even without knowing how it had
 chunked it previously.
@@ -340,7 +340,7 @@ they index multiple packs at a time.
 
 Imagine you had a midx file for your 200 packs.  midx files are a lot like
 idx files; they have a lookup table at the beginning that narrows down the
-initial search, followed by a binary search.  The unlike idx files (which
+initial search, followed by a binary search.  Then unlike idx files (which
 have a fixed-size 256-entry lookup table) midx tables have a variably-sized
 table that makes sure the entire binary search can be contained to a single
 page of the midx file.  Basically, the lookup table tells you which page to
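The hunks above describe the whole chunking scheme: a rolling checksum over a 64-byte window, with a chunk boundary declared wherever the lowest 13 bits of the 32-bit value are all 1s (so boundaries land, on average, every 2^13 = 8192 bytes), plus the "self-healing" property that a one-byte edit only moves boundaries near the edit. Here is a minimal sketch of that idea. Note the assumptions: it uses a buzhash-style table-driven rolling hash as a stand-in, NOT bupsplit.c's actual rsync-derived rollsum, and the table, seed, and window handling are illustrative, not bup's code.

```python
import random

WINDOW = 64        # rolling-window size in bytes (the "last 64 bytes" above)
MASK = 0x1FFF      # lowest 13 bits; all 1s marks a boundary (~8192-byte avg)

def rotl32(v):
    """Rotate a 32-bit value left by one bit."""
    return ((v << 1) | (v >> 31)) & 0xFFFFFFFF

def boundaries(data, table):
    """Return offsets where the low 13 bits of the rolling hash are all 1s.

    Buzhash-style update: since the window is 64 bytes and the hash is
    32 bits, rotl(h, 64) == h, so the byte leaving the window is removed
    by XORing its table entry back in directly.
    """
    h = 0
    win = bytearray(WINDOW)
    cuts = []
    for i, b in enumerate(data):
        out = win[i % WINDOW]
        win[i % WINDOW] = b
        h = rotl32(h) ^ table[out] ^ table[b]
        if h & MASK == MASK:
            cuts.append(i)
    return cuts

rng = random.Random(1)
table = [rng.getrandbits(32) for _ in range(256)]  # random per-byte entries
data = rng.randbytes(1 << 20)                      # 1 MiB of test input

cuts = boundaries(data, table)
print(len(cuts), "boundaries")   # roughly 1 MiB / 8192 = ~128 expected

# The self-healing property from the hunks above: flip one byte in the
# middle, and every boundary more than one window away from the edit is
# exactly where it was before.
edited = bytearray(data)
edited[500_000] ^= 0xFF
cuts2 = boundaries(bytes(edited), table)
assert [c for c in cuts if c < 500_000] == [c for c in cuts2 if c < 500_000]
assert ([c for c in cuts if c >= 500_000 + WINDOW]
        == [c for c in cuts2 if c >= 500_000 + WINDOW])
```

Because the hash depends only on the last 64 bytes read, any two streams that agree over a 64-byte stretch produce identical boundaries from there on, which is exactly why the chunking "heals" after an edit without knowing how the file was chunked previously.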