DESIGN: describe plan to handle Python 3's insistence on strings

author Rob Browning <rlb@defaultvalue.org>

Sun, 3 Nov 2019 21:48:53 +0000 (15:48 -0600)

committer Rob Browning <rlb@defaultvalue.org>

Sun, 1 Dec 2019 18:20:36 +0000 (12:20 -0600)
author Rob Browning <rlb@defaultvalue.org>
Sun, 3 Nov 2019 21:48:53 +0000 (15:48 -0600)
committer Rob Browning <rlb@defaultvalue.org>
Sun, 1 Dec 2019 18:20:36 +0000 (12:20 -0600)
diff --git a/DESIGN b/DESIGN

index e8aa8d08eb9b2602c595140f0c58f16f100983ea..e2419186ef057a4b398ce38c969f841445525cf1 100644 (file)
--- a/DESIGN
+++ b/DESIGN
@@ -634,10 +634,87 @@ things to note:
     are formatted.  So don't get confused!
  
  
+Handling Python 3's insistence on strings
+=========================================
+
+In Python 2 strings were bytes, and bup used them for all kinds of
+data.  Python 3 made a pervasive backward-incompatible change to make
+all strings Unicode, i.e. in Python 2 'foo' and b'foo' were the same
+thing, while u'foo' was a Unicode string.  In Python 3 'foo' became
+synonymous with u'foo', completely changing the type and potential
+content, depending on the locale.
+
+In addition, and particularly bad for bup, Python 3 also (initially)
+insisted that all kinds of things were strings that just aren't (at
+least not on many platforms), i.e. user names, groups, filesystem
+paths, etc.  There's no guarantee that any of those are always
+representable in Unicode.
+
+Over the years, Python 3 has gradually backed off from that initial
+aggressive stance, adding alternate interfaces like os.environb or
+allowing bytes arguments to many functions like open(b'foo'...), so
+that in those cases it's at least possible to accurately
+retrieve the system data.
+
+After a while, they devised the concept of [byte smuggling](https://www.python.org/dev/peps/pep-0383/)
+as a more comprehensive solution, though at least currently, we've
+found that it doesn't always work (see below), and at least for bulk
+data, it's more expensive, converting the data back and forth when you
+just wanted the original bytes, exactly as provided by the system
+APIs.
+
+At least one case where we've found that the byte smuggling approach
+it doesn't work is with respect to sys.argv (initially discovered in
+Python 3.7).  The claim is that we should be able to retrieve the
+original bytes via fsdecode(sys.argv[n]), but after adding some
+randomized argument testing, we quickly discovered that this isn't
+true with (at least) the default UTF-8 environment.  The interpreter
+just crashes while starting up with some random binary arguments:
+
+    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
+    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]
+
+    Current thread 0x00007f2f0e1d8740 (most recent call first):
+    Traceback (most recent call last):
+      File "t/test-argv", line 28, in <module>
+        out = check_output(cmd)
+      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
+        **kwargs).stdout
+      File "/usr/lib/python3.7/subprocess.py", line 487, in run
+        output=stdout, stderr=stderr)
+
+To fix that, at least for now, the plan is to *always* force the
+LC_CTYPE to ISO-8859-1 before launching Python, which does "fix" the
+problem.
+
+The reason we want to require ISO-8859-1 is that it's a (common)
+8-byte encoding, which means that there are no invalid byte sequences
+with respect to encoding/decoding, and so the mapping between it and
+Unicode is one-to-one.  i.e. any sequence of bytes is a valid
+ISO-8859-1 string and has a valid representation in Unicode.  Whether
+or not the end result in Unicode represents what was originally
+intended is another question entirely, but the key thing is that the
+round-trips between ISO-8859-1 bytes and Unicode should be completely
+safe.
+
+We're requiring this encoding so that *hopefully* Python 3 will then
+allow us to get the unmangled bytes from os interfaces where it
+doesn't provide an explicit or implicit binary version like environb
+or open(b'foo', ...).
+
+In the longer run we might consider wrapping these APIs ourselves in
+C, and have them just return Py_bytes objects to begin with, which
+would be more efficient and make the process completely independent of
+the system encoding, and/or less potentially fragile with respect to
+whatever the Python upstream might decide to try next.
+
+But for now, this approach will hopefully save us some work.
+
+
  We hope you'll enjoy bup.  Looking forward to your patches!
  
  -- apenwarr and the rest of the bup team
  
  Local Variables:
-mode: text
+mode: markdown
  End:
author	Rob Browning <rlb@defaultvalue.org>
	Sun, 3 Nov 2019 21:48:53 +0000 (15:48 -0600)
committer	Rob Browning <rlb@defaultvalue.org>
	Sun, 1 Dec 2019 18:20:36 +0000 (12:20 -0600)