Update base_version to 0.34~ for 0.34 development

[bup.git] / DESIGN
diff --git a/DESIGN b/DESIGN

index dfe30f36062ffbf8cc2fb623792ef1b87271cf8a..d6e8c1b17403411d063181d8055d2c08206aa503 100644 (file)
--- a/DESIGN
+++ b/DESIGN
@@ -18,17 +18,41 @@ source code to follow along and see what we're talking about.  bup's code is
  written primarily in python with a bit of C code in speed-sensitive places. 
  Here are the most important things to know:
  
- - bup (symlinked to main.py) is the main program that runs when you type
-   'bup'.
- 
- - cmd/bup-* (mostly symlinked to cmd/*-cmd.py) are the individual
-   subcommands, in a way similar to how git breaks all its subcommands into
-   separate programs.  Not all the programs have to be written in python;
-   they could be in any language, as long as they end up named cmd/bup-*. 
-   We might end up re-coding large parts of bup in C eventually so that it
-   can be even faster and (perhaps) more portable.
-
- - lib/bup/*.py are python library files used by the cmd/*.py commands. 
+ - The main program is a fairly small C program that mostly just
+   initializes the correct Python interpreter and then runs
+   bup.main.main().  This arrangement was chosen in order to give us
+   more flexibility.  For example:
+
+     - It allows us to avoid
+       [crashing on some Unicode-unfriendly command line arguments](https://bugs.python.org/issue35883)
+       which is critical, given that paths can be arbitrary byte
+       sequences.
+
+     - It allows more flexibility in dealing with upstream changes
+       like the breakage of our ability to manipulate the
+       processes arguement list on platforms that support it during
+       the Python 3.9 series.
+
+     - It means that we'll no longer be affected by any changes to the
+       `#!/...` path, i.e. if `/usr/bin/python`, or
+       `/usr/bin/python3`, or whatever we'd previously selected during
+       `./configure` were to change from 2 to 3, or 3.5 to 3.20.
+
+   The version of python bup uses is determined by the `python-config`
+   program selected by `./configure`.  It tries to find a suitable
+   default unless `BUP_PYTHON_CONFIG` is set in the environment.
+
+ - bup supports both internal and external subcommands.  The former
+   are the most common, and are all located in lib/bup/cmd/.  They
+   must be python modules named lib/bup/cmd/COMMAND.py, and must
+   contain a `main(argv)` function that will be passed the *binary*
+   command line arguments (bytes, not strings).  The filename must
+   have underscores for any dashes in the subcommand name.  The
+   external subcommands are in lib/cmd/.
+
+ - The python code is all in lib/bup.
+
+ - lib/bup/\*.py contains the python code (modules) that bup depends on.
     That directory name seems a little silly (and worse, redundant) but there
     seemed to be no better way to let programs write "from bup import
     index" and have it work.  Putting bup in the top level conflicted with
@@ -116,11 +140,11 @@ giant tarball each day, then send that tarball to bup, bup will be able to
  efficiently store only the changes to that tarball from one day to the next. 
  For small files, bup's compression won't be as good as xdelta's, but for
  anything over a few megabytes in size, bup's compression will actually
-*work*, which is a bit advantage over xdelta.
+*work*, which is a big advantage over xdelta.
  
  How does hashsplitting work?  It's deceptively simple.  We read through the
-file one byte at a time, calculating a rolling checksum of the last 128
-bytes.  (Why 128?  No reason.  Literally.  We picked it out of the air. 
+file one byte at a time, calculating a rolling checksum of the last 64
+bytes.  (Why 64?  No reason.  Literally.  We picked it out of the air.
  Probably some other number is better.  Feel free to join the mailing list
  and tell us which one and why.)  (The rolling checksum idea is actually
  stolen from rsync and xdelta, although we use it differently.  And they use
@@ -145,7 +169,7 @@ we need your help!  Avery's out of control!  Join our mailing list!  Please!
  Save us! ...  oh boy, I sure hope he doesn't read this)
  
  In any case, rollsum seems to do pretty well at its job. 
-You can find it in bupsplit.c.  Basically, it converts the last 128 bytes
+You can find it in bupsplit.c.  Basically, it converts the last 64 bytes
  read into a 32-bit integer.  What we then do is take the lowest 13
  bits of the rollsum, and if they're all 1's, we consider that to be the end
  of a chunk.  This happens on average once every 2^13 = 8192 bytes, so the
@@ -179,10 +203,10 @@ the blocks in the new backup would be different from the previous ones, and
  you'd have to store the same data all over again.  But with hashsplitting,
  no matter how much data you add, modify, or remove in the middle of the
  file, all the chunks *before* and *after* the affected chunk are absolutely
-the same.  All that matters to the hashsplitting algorithm is the 32-byte
+the same.  All that matters to the hashsplitting algorithm is the
  "separator" sequence, and a single change can only affect, at most, one
  separator sequence or the bytes between two separator sequences.  And
-because of rollsum, about one in 8192 possible 128-byte sequences is a
+because of rollsum, about one in 8192 possible 64-byte sequences is a
  separator sequence.  Like magic, the hashsplit chunking algorithm will chunk
  your file the same way every time, even without knowing how it had chunked
  it previously.
@@ -196,7 +220,7 @@ sequence data.
  As an overhead percentage, 0.25% basically doesn't matter.  488 megs sounds
  like a lot, but compared to the 200GB you have to store anyway, it's
  irrelevant.  What *is* relevant is that 488 megs is a lot of memory you have
-to use in order to to keep track of the list.  Worse, if you back up an
+to use in order to keep track of the list.  Worse, if you back up an
  almost-identical file tomorrow, you'll have *another* 488 meg blob to keep
  track of, and it'll be almost but not quite the same as last time.
  
@@ -212,12 +236,17 @@ special tools.
  
  What we do instead is we extend the hashsplit algorithm a little further
  using what we call "fanout." Instead of checking just the last 13 bits of
-the checksum, we use additional checksum bits to produce additional splits. 
-For example, let's say we use a 4-bit fanout.  That means we'll break a
-series of chunks into its own tree object whenever the last 13+4 = 17 bits
-of the rolling checksum are 1.  Naturally, whenever the lowest 17 bits are
-1, the lowest 13 bits are *also* 1, so the boundary of a chunk group is
-always also the boundary of a particular chunk.
+the checksum, we use additional checksum bits to produce additional splits.
+Note that (most likely due to an implementation bug), the next higher bit
+after the 13 bits (marked 'x'):
+
+  ...... '..x1'1111'1111'1111
+
+is actually ignored next. Now, let's say we use a 4-bit fanout. That means
+we'll break a series of chunks into its own tree object whenever the next
+4 bits of the rolling checksum are 1, in addition to the 13 lowest ones.
+Since the 13 lowest bits already have to be 1, the boundary of a group of
+chunks is necessarily also always the boundary of a particular chunk.
  
  And so on.  Eventually you'll have too many chunk groups, but you can group
  them into supergroups by using another 4 bits, and continue from there.
@@ -251,7 +280,7 @@ relatively infrequently.  (You might think you change your source code
  "frequently" and that git handles much more frequent changes than, say, svn
  can handle.  But that's not the same kind of "frequently" we're talking
  about.  Imagine you're backing up all the files on your disk, and one of
-those files is a 100 GB database file with hundreds of daily users.  You
+those files is a 100 GB database file with hundreds of daily users.  Your
  disk changes so frequently you can't even back up all the revisions even if
  you were backing stuff up 24 hours a day.  That's "frequently.")
  
@@ -340,7 +369,7 @@ they index multiple packs at a time.
  
  Imagine you had a midx file for your 200 packs.  midx files are a lot like
  idx files; they have a lookup table at the beginning that narrows down the
-initial search, followed by a binary search.  The unlike idx files (which
+initial search, followed by a binary search.  Then unlike idx files (which
  have a fixed-size 256-entry lookup table) midx tables have a variably-sized
  table that makes sure the entire binary search can be contained to a single
  page of the midx file.  Basically, the lookup table tells you which page to
@@ -371,32 +400,64 @@ Detailed Metadata
  -----------------
  
  So that's the basic structure of a bup repository, which is also a git
-repository.  There's one more thing we have to deal with in bup: filesystem
-metadata.  git repositories are really only intended to store file contents
-with a small bit of extra information, like symlink support and
-differentiating between executable and non-executable files.  For the rest,
-we'll have to store it some other way.
-
-As of this writing, bup's support for metadata is... pretty much
-nonexistent.  People are working on it.  But the plan goes like this:
-
- - Each git tree will contain a file called .bupmeta.
- 
- - .bupmeta contains an entry for every entry in the tree object, sorted in
-   the same order as in the tree.
- 
- - the .bupmeta entry lists information like modification times, attributes,
-   file ownership, and so on for each file in the tree.
-   
- - for backward compatibility with pre-metadata versions of bup (and git,
-   for that matter) the .bupmeta file for each tree is optional, and if it's
-   missing, files will be assumed to have default permissions.
-   
- The nice thing about this design is that you can walk through each file in
- a tree just by opening the tree and the .bupmeta contents, and iterating
- through both at the same time.
- 
- Trust us, it'll be awesome.  
+repository.  There's just one more thing we have to deal with:
+filesystem metadata.  Git repositories are really only intended to
+store file contents with a small bit of extra information, like
+symlink targets and executable bits, so we have to store the rest
+some other way.
+
+Bup stores more complete metadata in the VFS in a file named .bupm in
+each tree.  This file contains one entry for each file in the tree
+object, sorted in the same order as the tree.  The first .bupm entry
+is for the directory itself, i.e. ".", and its name is the empty
+string, "".
+
+Each .bupm entry contains a variable length sequence of records
+containing the metadata for the corresponding path.  Each record
+records one type of metadata.  Current types include a common record
+type (containing the normal stat information), a symlink target type,
+a hardlink target type, a POSIX1e ACL type, etc.  See metadata.py for
+the complete list.
+
+The .bupm file is optional, and when it's missing, bup will behave as
+it did before the addition of metadata, and restore files using the
+tree information.
+
+The nice thing about this design is that you can walk through each
+file in a tree just by opening the tree and the .bupm contents, and
+iterating through both at the same time.
+
+Since the contents of any .bupm file should match the state of the
+filesystem when it was *indexed*, bup must record the detailed
+metadata in the index.  To do this, bup records four values in the
+index, the atime, mtime, and ctime (as timespecs), and an integer
+offset into a secondary "metadata store" which has the same name as
+the index, but with ".meta" appended.  This secondary store contains
+the encoded Metadata object corresponding to each path in the index.
+
+Currently, in order to decrease the storage required for the metadata
+store, bup only writes unique values there, reusing offsets when
+appropriate across the index.  The effectiveness of this approach
+relies on the expectation that there will be many duplicate metadata
+records.  Storing the full timestamps in the index is intended to make
+that more likely, because it makes it unnecessary to record those
+values in the secondary store.  So bup clears them before encoding the
+Metadata objects destined for the index, and timestamp differences
+don't contribute to the uniqueness of the metadata.
+
+Bup supports recording and restoring hardlinks, and it does so by
+tracking sets of paths that correspond to the same dev/inode pair when
+indexing.  This information is stored in an optional file with the
+same name as the index, but ending with ".hlink".
+
+If there are multiple index runs, and the hardlinks change, bup will
+notice this (within whatever subtree it is asked to reindex) and
+update the .hlink information accordingly.
+
+The current hardlink implementation will refuse to link to any file
+that resides outside the restore tree, and if the restore tree spans a
+different set of filesystems than the save tree, complete sets of
+hardlinks may not be restored.
  
  
  Filesystem Interaction
@@ -516,7 +577,7 @@ the index and backs up any file that is "dirty," that is, doesn't already
  exist in the repository.
  
  Determination of dirtiness is a little more complicated than it sounds.  The
-most dirtiness-relevant relevant flag in the bupindex is IX_HASHVALID; if
+most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
  this flag is reset, the file *definitely* is dirty and needs to be backed
  up.  But a file may be dirty even if IX_HASHVALID is set, and that's the
  confusing part.
@@ -555,9 +616,10 @@ repository.  What that means is you've chosen not to backup the latest
  version of that file; instead, your new backup set just contains the
  most-recently-known valid version of that file.  This is a good trick if you
  want to do frequent backups of smallish files and infrequent backups of
-large ones (as in 'bup save --smaller').  Each of your backups will be
-"complete," in that they contain all the small files and the large ones, but
-intermediate ones will just contain out-of-date copies of the large files.
+large ones.  Each of your backups will be "complete," in that they contain
+all the small files and the large ones, but intermediate ones will just
+contain out-of-date copies of the large files. Note that this isn't done
+right now, and 'bup save --smaller' doesn't store bigger files _at all_.
  
  A final game we can play with the bupindex involves restoring: when you
  restore a directory from a previous backup, you can update the bupindex
@@ -602,6 +664,74 @@ things to note:
     are formatted.  So don't get confused!
  
  
+Handling Python 3's insistence on strings
+=========================================
+
+In Python 2 strings were bytes, and bup used them for all kinds of
+data.  Python 3 made a pervasive backward-incompatible change to treat
+all strings as Unicode, i.e. in Python 2 'foo' and b'foo' were the
+same thing, while u'foo' was a Unicode string.  In Python 3 'foo'
+became synonymous with u'foo', completely changing the type and
+potential content, depending on the locale.
+
+In addition, and particularly bad for bup, Python 3 also (initially)
+insisted that all kinds of things were strings that just aren't (at
+least not on many platforms), i.e. user names, groups, filesystem
+paths, etc.  There's no guarantee that any of those are always
+representable in Unicode.
+
+Over the years, Python 3 has gradually backed away from that initial
+position, adding alternate interfaces like os.environb or allowing
+bytes arguments to many functions like open(b'foo'...), so that in
+those cases it's at least possible to accurately retrieve the system
+data.
+
+After a while, they devised the concept of
+[bytesmuggling](https://www.python.org/dev/peps/pep-0383/) as a more
+comprehensive solution.  In theory, this might be sufficient, but our
+initial randomized testing discovered that some binary arguments would
+crash Python during startup[1].  Eventually Johannes Berg tracked down
+the [cause](https://sourceware.org/bugzilla/show_bug.cgi?id=26034),
+and we hope that the problem will be fixed eventually in glibc or
+worked around by Python, but in either case, it will be a long time
+before any fix is widely available.
+
+Before we tracked down that bug we were pursuing an approach that
+would let us side step the issue entirely by manipulating the
+LC_CTYPE, but that approach was somewhat complicated, and once we
+understood what was causing the crashes, we decided to just let Python
+3 operate "normally", and work around the issues.
+
+Consequently, we've had to wrap a number of things ourselves that
+incorrectly return Unicode strings (libacl, libreadline, hostname,
+etc.)  and we've had to come up with a way to avoid the fatal crashes
+caused by some command line arguments (sys.argv) described above.  To
+fix the latter, for the time being, we just use a trivial sh wrapper
+to redirect all of the command line arguments through the environment
+in BUP_ARGV_{0,1,2,...} variables, since the variables are unaffected,
+and we can access them directly in Python 3 via environb.
+
+[1] Our randomized argv testing found that the byte smuggling approach
+    was not working correctly for some values (initially discovered in
+    Python 3.7, and observed in other versions).  The interpreter
+    would just crash while starting up like this:
+
+    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
+    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]
+
+    Current thread 0x00007f2f0e1d8740 (most recent call first):
+    Traceback (most recent call last):
+      File "t/test-argv", line 28, in <module>
+        out = check_output(cmd)
+      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
+        **kwargs).stdout
+      File "/usr/lib/python3.7/subprocess.py", line 487, in run
+        output=stdout, stderr=stderr)
+
  We hope you'll enjoy bup.  Looking forward to your patches!
  
  -- apenwarr and the rest of the bup team
+
+Local Variables:
+mode: markdown
+End: