Adjust rm-cmd and bup.rm for python 3 and enable test-rm

[bup.git] / DESIGN
diff --git a/DESIGN b/DESIGN

index 8a00073e454cbfe42191dc2a79154a2122845f0d..e2419186ef057a4b398ce38c969f841445525cf1 100644 (file)
--- a/DESIGN
+++ b/DESIGN
@@ -116,11 +116,11 @@ giant tarball each day, then send that tarball to bup, bup will be able to
  efficiently store only the changes to that tarball from one day to the next. 
  For small files, bup's compression won't be as good as xdelta's, but for
  anything over a few megabytes in size, bup's compression will actually
-*work*, which is a bit advantage over xdelta.
+*work*, which is a big advantage over xdelta.
  
  How does hashsplitting work?  It's deceptively simple.  We read through the
-file one byte at a time, calculating a rolling checksum of the last 128
-bytes.  (Why 128?  No reason.  Literally.  We picked it out of the air. 
+file one byte at a time, calculating a rolling checksum of the last 64
+bytes.  (Why 64?  No reason.  Literally.  We picked it out of the air.
  Probably some other number is better.  Feel free to join the mailing list
  and tell us which one and why.)  (The rolling checksum idea is actually
  stolen from rsync and xdelta, although we use it differently.  And they use
@@ -145,7 +145,7 @@ we need your help!  Avery's out of control!  Join our mailing list!  Please!
  Save us! ...  oh boy, I sure hope he doesn't read this)
  
  In any case, rollsum seems to do pretty well at its job. 
-You can find it in bupsplit.c.  Basically, it converts the last 128 bytes
+You can find it in bupsplit.c.  Basically, it converts the last 64 bytes
  read into a 32-bit integer.  What we then do is take the lowest 13
  bits of the rollsum, and if they're all 1's, we consider that to be the end
  of a chunk.  This happens on average once every 2^13 = 8192 bytes, so the
@@ -182,7 +182,7 @@ file, all the chunks *before* and *after* the affected chunk are absolutely
  the same.  All that matters to the hashsplitting algorithm is the 32-byte
  "separator" sequence, and a single change can only affect, at most, one
  separator sequence or the bytes between two separator sequences.  And
-because of rollsum, about one in 8192 possible 128-byte sequences is a
+because of rollsum, about one in 8192 possible 64-byte sequences is a
  separator sequence.  Like magic, the hashsplit chunking algorithm will chunk
  your file the same way every time, even without knowing how it had chunked
  it previously.
@@ -196,7 +196,7 @@ sequence data.
  As an overhead percentage, 0.25% basically doesn't matter.  488 megs sounds
  like a lot, but compared to the 200GB you have to store anyway, it's
  irrelevant.  What *is* relevant is that 488 megs is a lot of memory you have
-to use in order to to keep track of the list.  Worse, if you back up an
+to use in order to keep track of the list.  Worse, if you back up an
  almost-identical file tomorrow, you'll have *another* 488 meg blob to keep
  track of, and it'll be almost but not quite the same as last time.
  
@@ -251,7 +251,7 @@ relatively infrequently.  (You might think you change your source code
  "frequently" and that git handles much more frequent changes than, say, svn
  can handle.  But that's not the same kind of "frequently" we're talking
  about.  Imagine you're backing up all the files on your disk, and one of
-those files is a 100 GB database file with hundreds of daily users.  You
+those files is a 100 GB database file with hundreds of daily users.  Your
  disk changes so frequently you can't even back up all the revisions even if
  you were backing stuff up 24 hours a day.  That's "frequently.")
  
@@ -340,7 +340,7 @@ they index multiple packs at a time.
  
  Imagine you had a midx file for your 200 packs.  midx files are a lot like
  idx files; they have a lookup table at the beginning that narrows down the
-initial search, followed by a binary search.  The unlike idx files (which
+initial search, followed by a binary search.  Then unlike idx files (which
  have a fixed-size 256-entry lookup table) midx tables have a variably-sized
  table that makes sure the entire binary search can be contained to a single
  page of the midx file.  Basically, the lookup table tells you which page to
@@ -374,7 +374,7 @@ So that's the basic structure of a bup repository, which is also a git
  repository.  There's just one more thing we have to deal with:
  filesystem metadata.  Git repositories are really only intended to
  store file contents with a small bit of extra information, like
-symlink targets and and executable bits, so we have to store the rest
+symlink targets and executable bits, so we have to store the rest
  some other way.
  
  Bup stores more complete metadata in the VFS in a file named .bupm in
@@ -395,9 +395,27 @@ it did before the addition of metadata, and restore files using the
  tree information.
  
  The nice thing about this design is that you can walk through each
-file in a tree just by opening the tree and the .bupmeta contents, and
+file in a tree just by opening the tree and the .bupm contents, and
  iterating through both at the same time.
  
+Since the contents of any .bupm file should match the state of the
+filesystem when it was *indexed*, bup must record the detailed
+metadata in the index.  To do this, bup records four values in the
+index, the atime, mtime, and ctime (as timespecs), and an integer
+offset into a secondary "metadata store" which has the same name as
+the index, but with ".meta" appended.  This secondary store contains
+the encoded Metadata object corresponding to each path in the index.
+
+Currently, in order to decrease the storage required for the metadata
+store, bup only writes unique values there, reusing offsets when
+appropriate across the index.  The effectiveness of this approach
+relies on the expectation that there will be many duplicate metadata
+records.  Storing the full timestamps in the index is intended to make
+that more likely, because it makes it unnecessary to record those
+values in the secondary store.  So bup clears them before encoding the
+Metadata objects destined for the index, and timestamp differences
+don't contribute to the uniqueness of the metadata.
+
  Bup supports recording and restoring hardlinks, and it does so by
  tracking sets of paths that correspond to the same dev/inode pair when
  indexing.  This information is stored in an optional file with the
@@ -530,7 +548,7 @@ the index and backs up any file that is "dirty," that is, doesn't already
  exist in the repository.
  
  Determination of dirtiness is a little more complicated than it sounds.  The
-most dirtiness-relevant relevant flag in the bupindex is IX_HASHVALID; if
+most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
  this flag is reset, the file *definitely* is dirty and needs to be backed
  up.  But a file may be dirty even if IX_HASHVALID is set, and that's the
  confusing part.
@@ -616,10 +634,87 @@ things to note:
     are formatted.  So don't get confused!
  
  
+Handling Python 3's insistence on strings
+=========================================
+
+In Python 2 strings were bytes, and bup used them for all kinds of
+data.  Python 3 made a pervasive backward-incompatible change to make
+all strings Unicode, i.e. in Python 2 'foo' and b'foo' were the same
+thing, while u'foo' was a Unicode string.  In Python 3 'foo' became
+synonymous with u'foo', completely changing the type and potential
+content, depending on the locale.
+
+In addition, and particularly bad for bup, Python 3 also (initially)
+insisted that all kinds of things were strings that just aren't (at
+least not on many platforms), i.e. user names, groups, filesystem
+paths, etc.  There's no guarantee that any of those are always
+representable in Unicode.
+
+Over the years, Python 3 has gradually backed off from that initial
+aggressive stance, adding alternate interfaces like os.environb or
+allowing bytes arguments to many functions like open(b'foo'...), so
+that in those cases it's at least possible to accurately
+retrieve the system data.
+
+After a while, they devised the concept of [byte smuggling](https://www.python.org/dev/peps/pep-0383/)
+as a more comprehensive solution, though at least currently, we've
+found that it doesn't always work (see below), and at least for bulk
+data, it's more expensive, converting the data back and forth when you
+just wanted the original bytes, exactly as provided by the system
+APIs.
+
+At least one case where we've found that the byte smuggling approach
+it doesn't work is with respect to sys.argv (initially discovered in
+Python 3.7).  The claim is that we should be able to retrieve the
+original bytes via fsdecode(sys.argv[n]), but after adding some
+randomized argument testing, we quickly discovered that this isn't
+true with (at least) the default UTF-8 environment.  The interpreter
+just crashes while starting up with some random binary arguments:
+
+    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
+    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]
+
+    Current thread 0x00007f2f0e1d8740 (most recent call first):
+    Traceback (most recent call last):
+      File "t/test-argv", line 28, in <module>
+        out = check_output(cmd)
+      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
+        **kwargs).stdout
+      File "/usr/lib/python3.7/subprocess.py", line 487, in run
+        output=stdout, stderr=stderr)
+
+To fix that, at least for now, the plan is to *always* force the
+LC_CTYPE to ISO-8859-1 before launching Python, which does "fix" the
+problem.
+
+The reason we want to require ISO-8859-1 is that it's a (common)
+8-byte encoding, which means that there are no invalid byte sequences
+with respect to encoding/decoding, and so the mapping between it and
+Unicode is one-to-one.  i.e. any sequence of bytes is a valid
+ISO-8859-1 string and has a valid representation in Unicode.  Whether
+or not the end result in Unicode represents what was originally
+intended is another question entirely, but the key thing is that the
+round-trips between ISO-8859-1 bytes and Unicode should be completely
+safe.
+
+We're requiring this encoding so that *hopefully* Python 3 will then
+allow us to get the unmangled bytes from os interfaces where it
+doesn't provide an explicit or implicit binary version like environb
+or open(b'foo', ...).
+
+In the longer run we might consider wrapping these APIs ourselves in
+C, and have them just return Py_bytes objects to begin with, which
+would be more efficient and make the process completely independent of
+the system encoding, and/or less potentially fragile with respect to
+whatever the Python upstream might decide to try next.
+
+But for now, this approach will hopefully save us some work.
+
+
  We hope you'll enjoy bup.  Looking forward to your patches!
  
  -- apenwarr and the rest of the bup team
  
  Local Variables:
-mode: text
+mode: markdown
  End: