X-Git-Url: https://arthur.barton.de/cgi-bin/gitweb.cgi?p=bup.git;a=blobdiff_plain;f=DESIGN;h=e2419186ef057a4b398ce38c969f841445525cf1;hp=e8aa8d08eb9b2602c595140f0c58f16f100983ea;hb=6b6559e405d264d4127211b935b21a3dda93ad93;hpb=ba329fdac189ca5c29dc5262d081e56ae70c3201

diff --git a/DESIGN b/DESIGN
index e8aa8d0..e241918 100644
--- a/DESIGN
+++ b/DESIGN
@@ -634,10 +634,87 @@ things to note:
 are formatted. So don't get confused!
 
 
+Handling Python 3's insistence on strings
+=========================================
+
+In Python 2 strings were bytes, and bup used them for all kinds of
+data. Python 3 made a pervasive backward-incompatible change to make
+all strings Unicode, i.e. in Python 2 'foo' and b'foo' were the same
+thing, while u'foo' was a Unicode string. In Python 3 'foo' became
+synonymous with u'foo', completely changing the type and potential
+content, depending on the locale.
+
+In addition, and particularly bad for bup, Python 3 also (initially)
+insisted that all kinds of things were strings that just aren't (at
+least not on many platforms), e.g. user names, groups, and filesystem
+paths. There's no guarantee that any of those are always
+representable in Unicode.
+
+Over the years, Python 3 has gradually backed off from that initial
+aggressive stance, adding alternate interfaces like os.environb or
+allowing bytes arguments to many functions like open(b'foo', ...), so
+that in those cases it's at least possible to accurately
+retrieve the system data.
+
+After a while, the Python developers devised the concept of [byte smuggling](https://www.python.org/dev/peps/pep-0383/)
+as a more comprehensive solution, though at least currently, we've
+found that it doesn't always work (see below), and at least for bulk
+data, it's more expensive, converting the data back and forth when you
+just wanted the original bytes, exactly as provided by the system
+APIs.
+
+At least one case where we've found that the byte smuggling approach
+doesn't work is with respect to sys.argv (initially discovered in
+Python 3.7). The claim is that we should be able to retrieve the
+original bytes via os.fsencode(sys.argv[n]), but after adding some
+randomized argument testing, we quickly discovered that this isn't
+true with (at least) the default UTF-8 environment. The interpreter
+just crashes while starting up with some random binary arguments:
+
+    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
+    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]
+
+    Current thread 0x00007f2f0e1d8740 (most recent call first):
+    Traceback (most recent call last):
+      File "t/test-argv", line 28, in <module>
+        out = check_output(cmd)
+      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
+        **kwargs).stdout
+      File "/usr/lib/python3.7/subprocess.py", line 487, in run
+        output=stdout, stderr=stderr)
+
+To fix that, at least for now, the plan is to *always* force the
+LC_CTYPE to ISO-8859-1 before launching Python, which does "fix" the
+problem.
+
+The reason we want to require ISO-8859-1 is that it's a (common)
+8-bit encoding, which means that there are no invalid byte sequences
+with respect to encoding/decoding, and so the mapping between it and
+Unicode is one-to-one, i.e. any sequence of bytes is a valid
+ISO-8859-1 string and has a valid representation in Unicode. Whether
+or not the end result in Unicode represents what was originally
+intended is another question entirely, but the key thing is that the
+round-trips between ISO-8859-1 bytes and Unicode should be completely
+safe.
+
+We're requiring this encoding so that *hopefully* Python 3 will then
+allow us to get the unmangled bytes from os interfaces where it
+doesn't provide an explicit or implicit binary version like environb
+or open(b'foo', ...).
+
+In the longer run we might consider wrapping these APIs ourselves in
+C, and having them just return Python bytes objects to begin with, which
+would be more efficient and make the process completely independent of
+the system encoding, and/or potentially less fragile with respect to
+whatever the Python upstream might decide to try next.
+
+But for now, this approach will hopefully save us some work.
+
+
 We hope you'll enjoy bup. Looking forward to your patches!
 
 -- apenwarr and the rest of the bup team
 
 Local Variables:
-mode: text
+mode: markdown
 End:
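
The encoding argument made in the patch above can be checked with a few lines of Python 3. The sketch below is only an illustration and is not part of the commit or of bup itself; the variable names are made up for this example. It shows that every byte value survives a round trip through ISO-8859-1, that strict UTF-8 decoding rejects the same data, and that the PEP 383 `surrogateescape` error handler ("byte smuggling") also round-trips at the codec level. It does not attempt to reproduce the sys.argv startup crash described in the patch.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps byte N to code point U+00NN, so decoding never fails
# and encoding the result gives back the identical bytes.
as_text = all_bytes.decode("iso-8859-1")
latin1_round_trip = as_text.encode("iso-8859-1")

# Strict UTF-8 rejects arbitrary binary data outright.
try:
    all_bytes.decode("utf-8")
    utf8_strict_ok = True
except UnicodeDecodeError:
    utf8_strict_ok = False

# PEP 383 byte smuggling: undecodable bytes become lone surrogates
# (U+DC80..U+DCFF) and are restored when encoding with the same handler.
smuggled = all_bytes.decode("utf-8", "surrogateescape")
smuggled_round_trip = smuggled.encode("utf-8", "surrogateescape")

print(latin1_round_trip == all_bytes)
print(utf8_strict_ok)
print(smuggled_round_trip == all_bytes)
```

Both round trips preserve the bytes here, so the difference the patch cares about lies elsewhere: with ISO-8859-1 the intermediate string consists only of ordinary code points, while the smuggled string contains lone surrogates that other parts of the interpreter may handle badly.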