X-Git-Url: https://arthur.barton.de/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=DESIGN;h=e2419186ef057a4b398ce38c969f841445525cf1;hb=ad8c188a619b16e9a937bf857af7360068c7efa7;hp=97dc96d5cb9cfbb6ff95b96ed4a2b22f25576b9b;hpb=a90d677d550deac13b2baaeade5d73f9320f574c;p=bup.git

diff --git a/DESIGN b/DESIGN
index 97dc96d..e241918 100644
--- a/DESIGN
+++ b/DESIGN
@@ -196,7 +196,7 @@ sequence data.
 As an overhead percentage, 0.25% basically doesn't matter.  488 megs sounds
 like a lot, but compared to the 200GB you have to store anyway, it's
 irrelevant.  What *is* relevant is that 488 megs is a lot of memory you have
-to use in order to to keep track of the list.  Worse, if you back up an
+to use in order to keep track of the list.  Worse, if you back up an
 almost-identical file tomorrow, you'll have *another* 488 meg blob to keep
 track of, and it'll be almost but not quite the same as last time.
 
@@ -374,7 +374,7 @@ So that's the basic structure of a bup repository, which is also a git
 repository.  There's just one more thing we have to deal with:
 filesystem metadata.  Git repositories are really only intended to
 store file contents with a small bit of extra information, like
-symlink targets and and executable bits, so we have to store the rest
+symlink targets and executable bits, so we have to store the rest
 some other way.
 
 Bup stores more complete metadata in the VFS in a file named .bupm in
@@ -548,7 +548,7 @@ the index and backs up any file that is "dirty," that is, doesn't already
 exist in the repository.
 
 Determination of dirtiness is a little more complicated than it sounds.  The
-most dirtiness-relevant relevant flag in the bupindex is IX_HASHVALID; if
+most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
 this flag is reset, the file *definitely* is dirty and needs to be backed
 up.  But a file may be dirty even if IX_HASHVALID is set, and that's the
 confusing part.
@@ -634,10 +634,87 @@ things to note:
    are formatted.  So don't get confused!
 
 
+Handling Python 3's insistence on strings
+=========================================
+
+In Python 2 strings were bytes, and bup used them for all kinds of
+data.  Python 3 made a pervasive backward-incompatible change to make
+all strings Unicode, i.e. in Python 2 'foo' and b'foo' were the same
+thing, while u'foo' was a Unicode string.  In Python 3 'foo' became
+synonymous with u'foo', completely changing the type and potential
+content, depending on the locale.
+
+In addition, and particularly bad for bup, Python 3 also (initially)
+insisted that all kinds of things were strings that just aren't (at
+least not on many platforms), e.g. user names, groups, filesystem
+paths, etc.  There's no guarantee that any of those are always
+representable in Unicode.
+
+Over the years, Python 3 has gradually backed off from that initial
+aggressive stance, adding alternate interfaces like os.environb or
+allowing bytes arguments to many functions like open(b'foo'...), so
+that in those cases it's at least possible to accurately
+retrieve the system data.
+
+After a while, they devised the concept of [byte smuggling](https://www.python.org/dev/peps/pep-0383/)
+as a more comprehensive solution, though at least currently, we've
+found that it doesn't always work (see below), and at least for bulk
+data, it's more expensive, converting the data back and forth when you
+just wanted the original bytes, exactly as provided by the system
+APIs.
+
+At least one case where we've found that the byte smuggling approach
+doesn't work is with respect to sys.argv (initially discovered in
+Python 3.7).  The claim is that we should be able to retrieve the
+original bytes via os.fsencode(sys.argv[n]), but after adding some
+randomized argument testing, we quickly discovered that this isn't
+true with (at least) the default UTF-8 environment.  The interpreter
+just crashes while starting up with some random binary arguments:
+
+    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
+    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]
+
+    Current thread 0x00007f2f0e1d8740 (most recent call first):
+    Traceback (most recent call last):
+      File "t/test-argv", line 28, in <module>
+        out = check_output(cmd)
+      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
+        **kwargs).stdout
+      File "/usr/lib/python3.7/subprocess.py", line 487, in run
+        output=stdout, stderr=stderr)
+
+To fix that, at least for now, the plan is to *always* force the
+LC_CTYPE to ISO-8859-1 before launching Python, which does "fix" the
+problem.
+
+The reason we want to require ISO-8859-1 is that it's a (common)
+8-bit encoding, which means that there are no invalid byte sequences
+with respect to encoding/decoding, and so the mapping between it and
+Unicode is one-to-one, i.e. any sequence of bytes is a valid
+ISO-8859-1 string and has a valid representation in Unicode.  Whether
+or not the end result in Unicode represents what was originally
+intended is another question entirely, but the key thing is that the
+round-trips between ISO-8859-1 bytes and Unicode should be completely
+safe.
+
+We're requiring this encoding so that *hopefully* Python 3 will then
+allow us to get the unmangled bytes from os interfaces where it
+doesn't provide an explicit or implicit binary version like environb
+or open(b'foo', ...).
+
+In the longer run we might consider wrapping these APIs ourselves in
+C, and have them just return PyBytes objects to begin with, which
+would be more efficient and make the process completely independent of
+the system encoding, and/or less potentially fragile with respect to
+whatever the Python upstream might decide to try next.
+
+But for now, this approach will hopefully save us some work.
+
+
 We hope you'll enjoy bup.  Looking forward to your patches!
 
 -- apenwarr and the rest of the bup team
 
 Local Variables:
-mode: text
+mode: markdown
 End:
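
Note on the ISO-8859-1 paragraph added by this patch: the round-trip
property it relies on can be checked in a few lines of Python 3.  This
is only a sketch of the property itself, not code from bup; the helper
name `roundtrips` and the test data are assumptions made for
illustration.

```python
import os


def roundtrips(data: bytes, encoding: str, errors: str = "strict") -> bool:
    """Return True if data survives a decode/encode cycle unchanged."""
    try:
        return data.decode(encoding, errors).encode(encoding, errors) == data
    except UnicodeError:
        # Some byte sequence was not valid in this encoding.
        return False


# Every possible byte value, plus a block of random binary data.
all_bytes = bytes(range(256))
random_bytes = os.urandom(4096)

# ISO-8859-1 maps each byte to exactly one code point, so nothing can fail.
latin1_ok = (roundtrips(all_bytes, "iso-8859-1")
             and roundtrips(random_bytes, "iso-8859-1"))

# Strict UTF-8 rejects bytes such as 0x80 that do not form a valid sequence.
utf8_strict_ok = roundtrips(all_bytes, "utf-8")

# PEP 383 "byte smuggling": undecodable bytes become lone surrogates.
utf8_smuggled_ok = roundtrips(all_bytes, "utf-8", "surrogateescape")

print("iso-8859-1:", latin1_ok)
print("utf-8 (strict):", utf8_strict_ok)
print("utf-8 (surrogateescape):", utf8_smuggled_ok)
```

The sketch only exercises the codecs on in-memory data.  It does not
reproduce the sys.argv startup crash described in the patch, which
happens inside the interpreter before any Python code runs.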