The Crazy Hacker's Crazy Guide to Bup Craziness
===============================================

Despite what you might have heard, bup is not that crazy, and neither are
you if you're trying to figure out how it works.  But it's also (as of this
writing) rather new and the source code doesn't have a lot of comments, so
it can be a little confusing at first glance.  This document is designed to
make it easier for you to get started if you want to add a new feature, fix
a bug, or just understand how it all works.


Bup Source Code Layout
----------------------

As you're reading this, you might want to look at different parts of the bup
source code to follow along and see what we're talking about.  bup's code is
written primarily in Python with a bit of C code in speed-sensitive places.
Here are the most important things to know:

 - bup (symlinked to main.py) is the main program that runs when you type
   'bup'.

 - cmd/bup-* (mostly symlinked to cmd/*-cmd.py) are the individual
   subcommands, in a way similar to how git breaks all its subcommands into
   separate programs.  Not all the programs have to be written in Python;
   they could be in any language, as long as they end up named cmd/bup-*.
   We might end up re-coding large parts of bup in C eventually so that it
   can be even faster and (perhaps) more portable.

 - lib/bup/*.py are Python library files used by the cmd/*.py commands.
   That directory name seems a little silly (and worse, redundant) but there
   seemed to be no better way to let programs write "from bup import
   index" and have it work.  Putting bup in the top level conflicted with
   the 'bup' command; calling it anything other than 'bup' was fundamentally
   wrong, and doesn't work when you install bup on your system in /usr/lib
   somewhere.  So we get the annoyingly long paths.


Repository Structure
====================

Before we can talk about how bup works, we need to first address what it
does.  The purpose of bup is essentially to let you "replicate" data between
two main data structures:

1. Your computer's filesystem;

2. A bup repository. (Yes, we know, that part also resides in your
   filesystem.  Stop trying to confuse yourself.  Don't worry, we'll be
   plenty confusing enough as it is.)

Essentially, copying data from the filesystem to your repository is called
"backing stuff up," which is what bup specializes in.  Normally you initiate
a backup using the 'bup save' command, but that's getting ahead of
ourselves.

As most backup experts know, backing stuff up is normally about 100x more
common than restoring stuff, i.e. copying from the repository to your
filesystem.  For that reason, and also because bup is so new, there is no
actual 'bup restore' command that does the obvious inverse operation to 'bup
save'.  There are 'bup ftp' and 'bup fuse', which let you access your
backed-up data, but they aren't as efficient as a fully optimized restore
tool intended for high-volume restores.  There's nothing stopping us from
writing one; we just haven't written it yet.  Feel free to pester us about
it on the bup mailing list (see the README to find out about the list).

Now, those are the basics of backups.  In other words, we just spent about
half a page telling you that bup backs up and restores data.  Are we having
fun yet?

The next thing you'll want to know is the format of the bup repository,
because hacking on bup is rather impossible unless you understand that part.
In short, a bup repository is a git repository.  If you don't know about
git, you'll want to read about it now.  A really good article to read is
"Git for Computer Scientists" - you can find it in Google.  Go read it now.
We'll wait.

Got it?  Okay, so now you're an expert in blobs, trees, commits, and refs,
the four building blocks of a git repository.  bup uses these four things,
and they're formatted in exactly the same way as git does it, so you can use
git to manipulate the bup repository if you want, and you probably won't
break anything.  It's also a comfort to know you can squeeze data out using
git, just in case bup fails you, and as a developer, git offers some nice
tools (like 'git rev-list' and 'git log' and 'git diff' and 'git show' and
so on) that allow you to explore your repository and help debug when things
go wrong.

Now, bup does use these tools a little bit differently than plain git.  We
need to do this in order to address three deficiencies in git when used for
large backups, namely a) git bogs down and crashes if you give it really
large files; b) git is too slow when you give it too many files; and c) git
doesn't store detailed filesystem metadata.

Let's talk about each of those problems in turn.


Handling large files (cmd/split, hashsplit.split_to_blob_or_tree)
--------------------

The primary reason git can't handle huge files is that it runs them through
xdelta, which generally means it tries to load the entire contents of a file
into memory at once.  If it didn't do this, it would have to store the
entire contents of every single revision of every single file, even if you
only changed a few bytes of that file.  That would be a terribly inefficient
use of disk space, and git is well known for its amazingly efficient
repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow
and memory-hungry for large files.  For git's main purpose, i.e. managing
your source code, this isn't a problem.  But when backing up your
filesystem, you're going to have at least a few large files, and so it's a
non-starter.  bup has to do something totally different.

What bup does instead of xdelta is what we call "hashsplitting."  We wanted
a general-purpose way to efficiently back up *any* large file that might
change in small ways, without storing the entire file every time.  In fact,
the original versions of bup could only store a single file at a time;
surprisingly enough, this was enough to give us a large part of bup's
functionality.  If you just take your entire filesystem and put it in a
giant tarball each day, then send that tarball to bup, bup will be able to
efficiently store only the changes to that tarball from one day to the next.
For small files, bup's compression won't be as good as xdelta's, but for
anything over a few megabytes in size, bup's compression will actually
*work*, which is a big advantage over xdelta.

How does hashsplitting work?  It's deceptively simple.  We read through the
file one byte at a time, calculating a rolling checksum of the last 32
bytes.  (Why 32?  No reason.  Literally.  We picked it out of the air.
Probably some other number is better.  Feel free to join the mailing list
and tell us which one and why.)  (The rolling checksum idea is actually
stolen from rsync and xdelta, although we use it differently.)

The particular rolling checksum algorithm we use is called "stupidsum,"
because it's based on the only checksum Avery remembered how to calculate at
the time.  He also remembered that it was the introductory checksum
algorithm in a whole article about how to make good checksums that he read
about 15 years ago, and it was thoroughly discredited in that article for
being very stupid.  But, as so often happens, Avery couldn't remember any
better algorithms from the article.  So what we get is stupidsum.  (If
you're a computer scientist and can demonstrate that some other rolling
checksum would be faster and/or better and/or have fewer screwy edge cases,
we need your help!  Avery's out of control!  Join our mailing list!  Please!
Save us! ...  oh boy, I sure hope he doesn't read this)

In any case, stupidsum, although stupid, seems to do pretty well at its job.
You can find it in _hashsplit.c.  Basically, it converts the last 32 bytes
of the file into a 32-bit integer.  What we then do is take the lowest 13
bits of the checksum, and if they're all 1's, we consider that to be the end
of a chunk.  This happens on average once every 2^13 = 8192 bytes, so the
average chunk size is 8192 bytes.

(Why 13 bits?  Well, we picked the number at random and... eugh.  You're
getting the idea, right?  Join the mailing list and tell us why we're
wrong.)

(Incidentally, even though the average chunk size is 8192 bytes, the actual
probability distribution of block sizes ends up being non-uniform; if we
remember our stats classes correctly, which we probably don't, it's probably
an "exponential distribution."  The idea is that for each byte in the block,
the probability that it's the last byte of the block is one in 8192.  Thus, the
block sizes end up being skewed toward the smaller end.  That's not
necessarily for the best, but maybe it is.  Computer science to the rescue?
You know the drill.)
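
For the stats-minded, here is that idea written out.  This assumes (and it
is only an assumption) that every byte ends a chunk independently with
probability p = 2^-13, in which case the chunk length L follows a geometric
distribution, the discrete cousin of the exponential:

```latex
P(L = k) = (1 - p)^{k-1}\, p, \qquad
E[L] = \frac{1}{p} = 8192, \qquad
\operatorname{median}(L) \approx \frac{\ln 2}{p} \approx 5678
```

Under that assumption half of all chunks come out shorter than about 5678
bytes, which is the skew toward the smaller end.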

Anyway, so we're dividing up those files into chunks based on the rolling
checksum.  Then we store each chunk separately (indexed by its sha1sum) as a
git blob.  Why do we split this way?  Well, because the results are actually
really nice.  Let's imagine you have a big mysql database dump (produced by
mysqldump) and it's basically 100 megs of SQL text.  Tomorrow's database
dump adds 100 rows to the middle of the file somewhere, so it's 100.01 megs
of text.

A naive block splitting algorithm - for example, just dividing the file into
8192-byte blocks - would be a disaster.  After the first bit of text has
changed, every block after that would have a different boundary, so most of
the blocks in the new backup would be different from the previous ones, and
you'd have to store the same data all over again.  But with hashsplitting,
no matter how much data you add, modify, or remove in the middle of the
file, all the chunks *before* and *after* the affected chunk are absolutely
the same.  All that matters to the hashsplitting algorithm is the 32-byte
"separator" sequence, and a single change can only affect, at most, one
separator sequence or the bytes between two separator sequences.  And
because of stupidsum, about one in 8192 possible 32-byte sequences is a
separator sequence.  Like magic, the hashsplit chunking algorithm will chunk
your file the same way every time, even without knowing how it had chunked
it previously.
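
Don't take our word for it.  The sketch below compares the two strategies on
random data with a few bytes inserted in the middle.  The checksum is again
an assumed rsync-style stand-in rather than the real stupidsum, and every
name in it is invented for the example.

```python
import hashlib
import random

WINDOW, MASK = 32, (1 << 13) - 1


def hashsplit(data):
    """Content-defined chunks (stand-in checksum, not bup's stupidsum)."""
    chunks, start, s1, s2 = [], 0, 0, 0
    for i, byte in enumerate(data):
        old = data[i - WINDOW] if i >= WINDOW else 0
        s1 = (s1 + byte - old) & 0xFFFF
        s2 = (s2 + s1 - WINDOW * old) & 0xFFFF
        if s2 & MASK == MASK:            # low 13 bits of the checksum
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_split(data, size=8192):
    """The naive approach: cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def ids(chunks):
    """The set of sha1 ids that would end up in the repository."""
    return {hashlib.sha1(c).digest() for c in chunks}


rng = random.Random(1)
old = bytes(rng.getrandbits(8) for _ in range(300_000))
new = old[:150_000] + b"100 new rows" + old[150_000:]

fixed_new = ids(fixed_split(new)) - ids(fixed_split(old))
hash_new = ids(hashsplit(new)) - ids(hashsplit(old))
print(len(fixed_new), "new fixed-size blocks vs", len(hash_new), "new hashsplit chunks")
```

With fixed-size blocks, everything after the insertion point has to be
stored again; with hashsplitting only the chunk (or occasionally two) around
the insertion is new.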

The next problem is less obvious: after you store your series of chunks as
git blobs, how do you store their sequence?  Each blob has a 20-byte sha1
identifier, which means the simple list of blobs is going to be 20/8192 =
0.25% of the file length.  For a 200GB file, that's 488 megs of just
sequence data.
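
If you want to check the arithmetic, here it is spelled out (taking 200GB to
mean 200 * 10^9 bytes, which is the reading that gives 488 megs; the variable
names are ours):

```python
SHA1_BYTES = 20               # size of one blob id
AVG_CHUNK = 8192              # average chunk size from the 13-bit split
FILE_BYTES = 200 * 10**9      # a 200GB file, in decimal gigabytes

overhead = SHA1_BYTES / AVG_CHUNK          # fraction of the file length
chunks = FILE_BYTES // AVG_CHUNK           # number of blobs
list_bytes = chunks * SHA1_BYTES           # size of the plain list of ids

print(f"{overhead:.2%} overhead, {chunks / 1e6:.1f} million chunks, "
      f"{list_bytes / 1e6:.0f} MB of ids")
```

The ratio works out to a shade under the rounded 0.25%, and the chunk count
is the "about 24 million" figure that shows up again later.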

As an overhead percentage, 0.25% basically doesn't matter.  488 megs sounds
like a lot, but compared to the 200GB you have to store anyway, it's
irrelevant.  What *is* relevant is that 488 megs is a lot of memory you have
to use in order to keep track of the list.  Worse, if you back up an
almost-identical file tomorrow, you'll have *another* 488 meg blob to keep
track of, and it'll be almost but not quite the same as last time.

Hmm, big files, each one almost the same as the last... you know where this
is going, right?

Actually no!  Ha!  We didn't split this list in the same way.  We could
have, in fact, but it wouldn't have been very "git-like", since we'd like to
store the list as a git 'tree' object in order to make sure git's
refcounting and reachability analysis doesn't get confused.  Never mind the
fact that we want you to be able to 'git checkout' your data without any
special tools.

What we do instead is we extend the hashsplit algorithm a little further
using what we call "fanout." Instead of checking just the last 13 bits of
the checksum, we use additional checksum bits to produce additional splits.
For example, let's say we use a 4-bit fanout.  That means we'll break a
series of chunks into its own tree object whenever the last 13+4 = 17 bits
of the rolling checksum are 1.  Naturally, whenever the lowest 17 bits are
1, the lowest 13 bits are *also* 1, so the boundary of a chunk group is
always also the boundary of a particular chunk.

And so on.  Eventually you'll have too many chunk groups, but you can group
them into supergroups by using another 4 bits, and continue from there.
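
Here's a small sketch of how a checksum value maps to a level in that
hierarchy, using the 4-bit fanout from the example.  The function name and
the convention of returning -1 for "not a boundary" are ours, purely for
illustration.

```python
BLOB_BITS = 13      # low bits that must all be 1 to end a chunk
FANOUT_BITS = 4     # extra bits per tree level (the example value above)


def split_level(checksum: int) -> int:
    """Return -1 if checksum does not end a chunk; otherwise the number of
    grouping levels this boundary also closes (0 = ends a chunk only,
    1 = also ends a chunk group, 2 = also ends a supergroup, ...)."""
    ones = 0
    while checksum & 1:          # count the trailing 1 bits
        ones += 1
        checksum >>= 1
    if ones < BLOB_BITS:
        return -1
    return (ones - BLOB_BITS) // FANOUT_BITS
```

So a checksum ending in 13 to 16 one-bits ends a chunk, 17 to 20 one-bits
also ends a group, 21 or more also ends a supergroup, and so on.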

What you end up with is an actual tree of blobs - which git 'tree' objects
are ideal to represent.  And if you think about it, just like the original
list of chunks, the tree itself is pretty stable across file modifications.
Any one modification will only affect the chunks actually containing the
modifications, thus only the groups containing those chunks, and so on up
the tree.  Essentially, the number of changed git objects is O(log n)
where n is the number of chunks.  Since log 200 GB, using a base of 16 or
so, is not a very big number, this is pretty awesome.  Remember, any git
object we *don't* change in a new backup is one we can reuse from last time,
so the deduplication effect is pretty awesome.

Better still, the hashsplit-tree format is good for a) random instead of
sequential access to data (which you can see in action with 'bup fuse'); and
b) quickly showing the differences between huge files (which we haven't
really implemented because we don't need it, but you can try 'git diff -M -C
-C backup1 backup2 -- filename' for a good start).

So now we've split our 200 GB file into about 24 million pieces.  That
brings us to git limitation number 2.


Handling huge numbers of files (git.PackWriter)
------------------------------

git is designed for handling reasonably-sized repositories that change
relatively infrequently.  (You might think you change your source code
"frequently" and that git handles much more frequent changes than, say, svn
can handle.  But that's not the same kind of "frequently" we're talking
about.  Imagine you're backing up all the files on your disk, and one of
those files is a 100 GB database file with hundreds of daily users.  Your
disk changes so frequently you can't even back up all the revisions even if
you were backing stuff up 24 hours a day.  That's "frequently.")

git's way of doing things works really nicely for the way software
developers write software, but it doesn't really work so well for everything
else.  The #1 killer is the way it adds new objects to the repository: it
creates one file per blob.  Then you later run 'git gc' and combine those
files into a single file (using highly efficient xdelta compression, and
ignoring any files that are no longer relevant).

'git gc' is slow, but for source code repositories, the resulting
super-efficient storage (and associated really fast access to the stored
files) is worth it.  For backups, it's not; you almost never access your
backed-up data, so storage time is paramount, and retrieval time is mostly
unimportant.

To back up that 200 GB file with git and hashsplitting, you'd have to create
24 million little 8k files, then copy them into a 200 GB packfile, then
delete the 24 million files again.  That would take about 400 GB of disk
space to run, require lots of random disk seeks, and require you to go
through your data twice.

So bup doesn't do that.  It just writes packfiles directly.  Luckily, these
packfiles are still git-formatted, so git can happily access them once
they're written.

But that leads us to our next problem.


Huge numbers of huge packfiles (git.PackMidx, cmd/midx)
------------------------------

Git isn't actually designed to handle super-huge repositories.  Most git
repositories are small enough that it's reasonable to merge them all into a
single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git
is designed to expect the total size of all packs to be larger than
available memory, and once it can handle that, it can handle virtually any
amount of data about equally efficiently.  The problem is the packfile
index (.idx) files.  In bup we call these idx (pronounced "idix") files
instead of using the word "index," because the word index is already used
for something totally different in git (and thus bup) and we'll become
hopelessly confused otherwise.

Anyway, each packfile (*.pack) in git has an associated idx (*.idx) that's a
sorted list of git object hashes and file offsets.  If you're looking for a
particular object based on its sha1, you open the idx, binary search it to
find the right hash, then take the associated file offset, seek to that
offset in the packfile, and read the object contents.

The performance of the binary search is about O(log n) with the number of
hashes in the pack, with an optimized first step (you can read about it
elsewhere) that somewhat improves it to O(log(n)-7).
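
As a rough picture of that lookup, here is an in-memory sketch: a 256-entry
table keyed on the first byte of the hash narrows the range, and a binary
search finishes the job.  This models the idea only; it does not read or
write git's on-disk idx format, and build_idx and find_offset are names we
made up.

```python
import bisect
import hashlib


def build_idx(objects):
    """objects maps sha1 (bytes) -> offset in the packfile.
    Returns (fanout, shas, offsets) with shas sorted."""
    shas = sorted(objects)
    offsets = [objects[s] for s in shas]
    # After the loop below, fanout[b] = number of hashes whose first
    # byte is <= b, so each first byte owns one contiguous slice of shas.
    fanout = [0] * 256
    for s in shas:
        fanout[s[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return fanout, shas, offsets


def find_offset(idx, sha):
    """Return the pack offset for sha, or None if this pack lacks it."""
    fanout, shas, offsets = idx
    lo = fanout[sha[0] - 1] if sha[0] else 0   # optimized first step
    hi = fanout[sha[0]]
    i = bisect.bisect_left(shas, sha, lo, hi)  # binary search in the slice
    if i < hi and shas[i] == sha:
        return offsets[i]
    return None
```

Note that a miss costs just as much as a hit: you only learn the object is
absent after the binary search has run to completion.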

Unfortunately, this breaks down a bit when you have *lots* of packs.  Say
you have 24 million objects (containing around 200 GB of data) spread across
200 packfiles of 1GB each.  To look for an object requires you search
through about 122000 objects per pack; ceil(log2(122000)-7) = 10, so you'll
have to search 10 times.  About 7 of those searches will be confined to a
single 4k memory page, so you'll probably have to page in about 3-4 pages
per file, times 200 files, which makes 600-800 4k pages (2.4-3.2 megs)...
every single time you want to look for an object.
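
The same back-of-the-envelope numbers, as a few lines you can fiddle with.
The "-7" is taken straight from the estimate above and the variable names
are ours:

```python
import math

objects = 200 * 10**9 // 8192          # about 24 million chunks
packs = 200
per_pack = objects // packs            # about 122000 hashes per idx
steps = math.ceil(math.log2(per_pack) - 7)
print(per_pack, "hashes per idx,", steps, "search steps")

for pages_per_idx in (3, 4):
    pages = pages_per_idx * packs
    print(pages, "pages of 4k =", round(pages * 4096 / 1e6, 1), "MB paged in")
```

Three pages per idx comes to about 2.5 MB and four pages to about 3.3 MB,
for every single lookup.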

This brings us to another difference between git's and bup's normal use
case.  With git, there's a simple optimization possible here: when looking
for an object, always search the packfiles in MRU (most recently used)
order.  Related objects are usually clustered together in a single pack, so
you'll usually end up searching around 3 pages instead of 600, which is a
tremendous improvement.  (And since you'll quickly end up swapping in all
the pages in a particular idx file this way, it isn't long before searching
for a nearby object doesn't involve any swapping at all.)

bup isn't so lucky.  git users spend most of their time examining existing
objects (looking at logs, generating diffs, checking out branches), which
lends itself to the above optimization.  bup, on the other hand, spends most
of its time looking for *nonexistent* objects in the repository so that it
can back them up.  When you're looking for objects that aren't in the
repository, there's no good way to optimize; you have to exhaustively check
all the packs, one by one, to ensure that none of them contain the data you
want.

To improve performance of this sort of operation, bup introduces midx
(pronounced "midix" and short for "multi-idx") files.  As the name implies,
they index multiple packs at a time.

Imagine you had a midx file for your 200 packs.  midx files are a lot like
idx files; they have a lookup table at the beginning that narrows down the
initial search, followed by a binary search.  But unlike idx files (which
have a fixed-size 256-entry lookup table) midx tables have a variably-sized
table that makes sure the entire binary search can be contained in a single
page of the midx file.  Basically, the lookup table tells you which page to
load, and then you binary search inside that page.  A typical search thus
only requires the kernel to swap in two pages, which is better than results
with even a single large idx file.  And if you have lots of RAM, eventually
the midx lookup table (at least) will end up cached in memory, so only a
single page should be needed for each lookup.
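
To make the "variably-sized table" idea concrete, here is a hedged sketch.
It keeps doubling the lookup table until no bucket holds more hashes than
fit in one 4k page, then answers "do we have this object?" with one table
lookup plus one small binary search.  The layout, the 24-bit cap and the
names build_midx and exists are assumptions for the illustration, not the
real midx file format.

```python
import bisect
import hashlib

PAGE_BYTES = 4096
ENTRIES_PER_PAGE = PAGE_BYTES // 20      # 20-byte sha1s that fit in one page
MAX_BITS = 24                            # cap on table size for this sketch


def prefix(sha, bits):
    """Top `bits` bits of a sha1, as an integer."""
    return int.from_bytes(sha[:4], "big") >> (32 - bits)


def build_midx(shas):
    """Merge the hashes from many idx files into (bits, table, shas)."""
    shas = sorted(set(shas))
    bits = 1
    while True:
        table = [0] * (1 << bits)        # hashes per prefix bucket
        for s in shas:
            table[prefix(s, bits)] += 1
        if max(table) <= ENTRIES_PER_PAGE or bits >= MAX_BITS:
            break
        bits += 1                        # buckets too big: double the table
    for p in range(1, len(table)):       # make the counts cumulative
        table[p] += table[p - 1]
    return bits, table, shas


def exists(midx, sha):
    """True if any of the indexed packs contains sha."""
    bits, table, shas = midx
    p = prefix(sha, bits)
    lo = table[p - 1] if p else 0        # the table picks the bucket...
    hi = table[p]
    i = bisect.bisect_left(shas, sha, lo, hi)   # ...then search inside it
    return i < hi and shas[i] == sha
```

The point is that a miss now costs one table lookup and one page-sized
search, instead of one full search per pack.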

You generate midx files with 'bup midx'.  The downside of midx files is that
generating one takes a while, and you have to regenerate it every time you
add a few packs.

(Computer Sciency observers will note that there are some interesting data
structures out there that could help make things even better.  A very
promising sounding one is called a "bloom filter." Look it up in Wikipedia.)

midx files are a bup-specific optimization and git doesn't know what to do
with them.  However, since they're stored as separate files, they don't
interfere with git's ability to read the repository.


Detailed Metadata
-----------------

So that's the basic structure of a bup repository, which is also a git
repository.  There's one more thing we have to deal with in bup: filesystem
metadata.  git repositories are really only intended to store file contents
with a small bit of extra information, like symlink support and
differentiating between executable and non-executable files.  For the rest,
we'll have to store it some other way.

As of this writing, bup's support for metadata is... pretty much
nonexistent.  People are working on it.  But the plan goes like this:

 - Each git tree will contain a file called .bupmeta.

 - .bupmeta contains an entry for every entry in the tree object, sorted in
   the same order as in the tree.

 - the .bupmeta entry lists information like modification times, attributes,
   file ownership, and so on for each file in the tree.

 - for backward compatibility with pre-metadata versions of bup (and git,
   for that matter) the .bupmeta file for each tree is optional, and if it's
   missing, files will be assumed to have default permissions.

The nice thing about this design is that you can walk through each file in
a tree just by opening the tree and the .bupmeta contents, and iterating
through both at the same time.

Trust us, it'll be awesome.
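
Since none of this exists yet, the following is only a guess at what the
parallel walk could look like.  The entry layout, the default values and the
function name are all hypothetical; the one thing it takes from the plan
above is that .bupmeta has one entry per tree entry, in the same order, and
may be missing altogether.

```python
# Hypothetical defaults for trees saved without a .bupmeta file.
DEFAULT_META = {"mode": 0o644, "mtime": 0, "owner": None}


def walk_tree(names, bupmeta=None):
    """Yield (name, metadata) pairs for one tree.

    names   -- entry names in tree order
    bupmeta -- one metadata dict per name, same order, or None when the
               tree was saved by a pre-metadata version of bup
    """
    if bupmeta is None:
        bupmeta = [DEFAULT_META] * len(names)
    if len(bupmeta) != len(names):
        raise ValueError(".bupmeta must have one entry per tree entry")
    # Same order on both sides, so a plain zip walks them in lockstep.
    yield from zip(names, bupmeta)
```

No lookups by name, no sorting: just two lists read side by side.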


Filesystem Interaction
======================

Storing data is just half of the problem of making a backup; figuring out
what to store is the other half.

At the most basic level, piping the output of 'tar' into 'bup split' is an
easy way to offload that decision; just let tar do all the hard stuff.  And
if you like tar files, that's a perfectly acceptable way to do it.  But we
can do better.

Backing up with tarballs would totally be the way to go, except for two
serious problems:

1. The result isn't easily "seekable."  Tar files have no index, so if (as
   commonly happens) you only want to restore one file in a 200 GB backup,
   you'll have to read up to 200 GB before you can get to the beginning of
   that file.  tar is short for "tape archive"; on a tape, there was no
   better way to do it anyway, so they didn't try.  But on a disk, random
   file access is much, much better when you can figure out how.

2. tar doesn't remember which files it backed up last time, so it has to
   read through the entire file contents again in order to generate the
   tarball, large parts of which will then be skipped by bup since they've
   already been stored.  This is much slower than necessary.

(The second point isn't entirely true for all versions of tar. For example,
GNU tar has an "incremental" mode that can somewhat mitigate this problem,
if you're smart enough to know how to use it without hurting yourself.  But
you still have to decide which backups are "incremental" and which ones will
be "full" and so on, so even when it works, it's more error-prone than bup.)

bup divides the backup process into two major steps: a) indexing the
filesystem, and b) saving file contents into the repository.  Let's look at
those steps in detail.


Indexing the filesystem (cmd/drecurse, cmd/index, index.py)
-----------------------

Splitting the filesystem indexing phase into its own program is
nontraditional, but it gives us several advantages.

The first advantage is trivial, but might be the most important: you can
index files a lot faster than you can back them up.  That means we can
generate the index (.bup/bupindex) first, then have a nice, reliable,
non-lying completion bar that tells you how much of your filesystem remains
to be backed up.  The alternative would be annoying failures like counting
the number of *files* remaining (as rsync does), even though one of the
files is a virtual machine image of 80 GB, and the 1000 other files are each
under 10k.  With bup, the percentage complete is the *real* percentage
complete, which is very pleasant.
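
To see how badly a file-counting progress bar can lie, take the example
above literally.  This is just arithmetic on made-up sizes, and it assumes
the small files happen to get saved first:

```python
small = [10 * 1000] * 1000          # 1000 files of 10k each
image = 80 * 10**9                  # one 80 GB virtual machine image
sizes = small + [image]             # the index knows every size up front


def percent_by_files(done):
    """Progress if you only count how many files are finished."""
    return 100 * done / len(sizes)


def percent_by_bytes(done):
    """Progress measured against the total bytes listed in the index."""
    return 100 * sum(sizes[:done]) / sum(sizes)


done = len(small)                   # all small files saved, image not yet
print(f"by files: {percent_by_files(done):.1f}%   "
      f"by bytes: {percent_by_bytes(done):.3f}%")
```

Counting files says you're practically finished; counting bytes says you've
barely started, which is the truth.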

Secondly, it makes it easier to debug and test; you can play with the index
without actually backing up any files.

Thirdly, you can replace the 'bup index' command with something else and not
have to change anything about the 'bup save' command.  The current 'bup
index' implementation just blindly walks the whole filesystem looking for
files that have changed since the last time it was indexed; this works fine,
but something using inotify instead would be orders of magnitude faster.
Windows and MacOS both have inotify-like services too, but they're totally
different; if we want to support them, we can simply write new bup commands
that do the job, and they'll never interfere with each other.

And fourthly, git does it that way, and git is awesome, so who are we to
argue?

So let's look at how the index file works.

First of all, note that the ".bup/bupindex" file is not the same as git's
".git/index" file.  The latter isn't used in bup; as far as git is
concerned, your bup repository is a "bare" git repository and doesn't have a
working tree, and thus it doesn't have an index either.

However, the bupindex file actually serves exactly the same purpose as git's
index file, which is why we still call it "the index." We just had to
redesign it for the usual bup-vs-git reasons, mostly that git just isn't
designed to handle millions of files in a single repository.  (The only way
to find a file in git's index is to search it linearly; that's very fast in
git-sized repositories, but very slow in bup-sized ones.)

Let's not worry about the exact format of the bupindex file; it's still not
optimal, and will probably change again.  The most important things to know
about bupindex are:

 - You can iterate through it much faster than you can iterate through the
   "real" filesystem (using something like the 'find' command).

 - If you delete it, you can get it back just by reindexing your filesystem
   (although that can be annoying to wait for); it's not critical to the
   repository itself.

 - You can iterate through only particular subtrees if you want.

 - There is no need to have more than one index for a particular filesystem,
   since it doesn't store anything about backups; it just stores file
   metadata.  It's really just a cache (or 'index') of your filesystem's
   existing metadata.  You could share the bupindex between repositories, or
   between multiple users on the same computer.  If you back up your
   filesystem to multiple remote repositories to be extra safe, you can
   still use the same bupindex file across all of them, because it's the
   same filesystem every time.

 - Filenames in the bupindex are absolute paths, because that's the best way
   to ensure that you only need one bupindex file and that they're
   interchangeable.


A note on file "dirtiness"
--------------------------

The concept on which 'bup save' operates is simple enough; it reads through
the index and backs up any file that is "dirty," that is, doesn't already
exist in the repository.

Determination of dirtiness is a little more complicated than it sounds.  The
most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
this flag is reset, the file *definitely* is dirty and needs to be backed
up.  But a file may be dirty even if IX_HASHVALID is set, and that's the
confusing part.

The index stores a listing of files, their attributes, and
their git object ids (sha1 hashes), if known.  The "if known" is what
IX_HASHVALID is about.  When 'bup save' backs up a file, it sets
the sha1 and sets IX_HASHVALID; when 'bup index' sees that a file has
changed, it leaves the sha1 alone and resets IX_HASHVALID.

Remember that the index can be shared between users, repositories, and
backups.  So IX_HASHVALID doesn't mean your repository *has* that sha1 in
it; it only means that if you *do* have it, that you don't need to back up
the file.  Thus, 'bup save' needs to check every file in the index to make
sure its hash exists, not just that it's valid.

There's an optimization possible, however: if you know a particular tree's
hash is valid and exists (say /usr), then you don't need to check the
validity of all its children; because of the way git trees and blobs work,
if your repository is valid and you have a tree object, then you have all
the blobs it points to.  You won't back up a tree object without backing up
its blobs first, so you don't need to double check it next time.  (If you
really want to double check this, it belongs in a tool like 'bup fsck' or
'git fsck'.) So in short, 'bup save' on a "clean" index (all files are
marked IX_HASHVALID) can be very fast; we just check our repository and see
if the top level IX_HASHVALID sha1 exists.  If it does, then we're done.

Similarly, if not the entire index is valid, you can still avoid recursing
into subtrees if those particular subtrees are IX_HASHVALID and their sha1s
are in the repository.  The net result is that, as long as you never lose
your index, 'bup save' can always run very fast.
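
In code, the rule "clean means IX_HASHVALID *and* the sha1 is in this
repository" plus the subtree shortcut might look something like the sketch
below.  The Entry class, the flag value and the function name are invented
for the example and are not bup's actual index API; the repository is
modelled as a plain set of sha1s.

```python
from dataclasses import dataclass, field

IX_HASHVALID = 0x1      # flag value chosen for this sketch only


@dataclass
class Entry:
    name: str
    flags: int = 0
    sha: bytes = b""
    children: list = field(default_factory=list)   # empty for plain files


def entries_to_save(entry, repo_shas):
    """Yield the file entries that would still have to be stored."""
    clean = bool(entry.flags & IX_HASHVALID) and entry.sha in repo_shas
    if clean:
        return          # this file, or this whole subtree, is already stored
    if entry.children:
        for child in entry.children:
            yield from entries_to_save(child, repo_shas)
    else:
        yield entry     # flag reset, or hash valid but missing from the repo
```

A clean directory is skipped without looking at a single child, which is
where the speed comes from.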

Another interesting trick is that you can skip backing up files even if
IX_HASHVALID *isn't* set, as long as you have that file's sha1 in the
repository.  What that means is you've chosen not to back up the latest
version of that file; instead, your new backup set just contains the
most-recently-known valid version of that file.  This is a good trick if you
want to do frequent backups of smallish files and infrequent backups of
large ones (as in 'bup save --smaller').  Each of your backups will be
"complete," in that they contain all the small files and the large ones, but
intermediate ones will just contain out-of-date copies of the large files.

A final game we can play with the bupindex involves restoring: when you
restore a directory from a previous backup, you can update the bupindex
right away.  Then, if you want to restore a different backup on top, you can
compare the files in the index against the ones in the backup set, and
update only the ones that have changed.  (Even more interesting things
happen if people are using the files on the restored system and you haven't
updated the index yet; the net result would be an automated merge of all
non-conflicting files.)  This would be a poor man's distributed filesystem.
The only catch is that nobody has written 'bup restore' yet.  Someday!


How 'bup save' works (cmd/save)
--------------------

This section is too boring and has been omitted.  Once you understand the
index, there's nothing special about bup save.


Retrieving backups: the bup vfs layer (vfs.py, cmd/ls, cmd/ftp, cmd/fuse)
=====================================

One of the neat things about bup's storage format, at least compared to most
backup tools, is it's easy to read a particular file, or even part of a
file.  That means a read-only virtual filesystem is easy to generate and
it'll have good performance characteristics.  Because of git's commit
structure, you could even use branching and merging to make a transactional
read-write filesystem... but that's probably getting a little out of bup's
scope.  Who knows what the future might bring, though?

Read-only filesystems are well within our reach today, however.  The 'bup
ls', 'bup ftp', and 'bup fuse' commands all use a "VFS" (virtual filesystem)
layer to let you access your repositories.  Feel free to explore the source
code for these tools and vfs.py - they're pretty straightforward.  Some
things to note:

 - None of these use the bupindex for anything.

 - For user-friendliness, they present your refs/commits/trees as a single
   hierarchy (i.e. a filesystem), which isn't really how git repositories
   are formatted.  So don't get confused!


We hope you'll enjoy bup.  Looking forward to your patches!

-- apenwarr and the rest of the bup team