X-Git-Url: https://arthur.barton.de/cgi-bin/gitweb.cgi?a=blobdiff_plain;f=DESIGN;h=e2419186ef057a4b398ce38c969f841445525cf1;hb=ad8c188a619b16e9a937bf857af7360068c7efa7;hp=97dc96d5cb9cfbb6ff95b96ed4a2b22f25576b9b;hpb=a90d677d550deac13b2baaeade5d73f9320f574c;p=bup.git

diff --git a/DESIGN b/DESIGN
index 97dc96d..e241918 100644
--- a/DESIGN
+++ b/DESIGN
@@ -196,7 +196,7 @@ sequence data. As an overhead percentage, 0.25% basically doesn't matter. 488 megs sounds like a lot, but compared to the 200GB you have to store anyway, it's irrelevant. What *is* relevant is that 488 megs is a lot of memory you have
-to use in order to to keep track of the list. Worse, if you back up an
+to use in order to keep track of the list. Worse, if you back up an
 almost-identical file tomorrow, you'll have *another* 488 meg blob to keep track of, and it'll be almost but not quite the same as last time.
@@ -374,7 +374,7 @@ So that's the basic structure of a bup repository, which is also a git repository. There's just one more thing we have to deal with: filesystem metadata. Git repositories are really only intended to store file contents with a small bit of extra information, like
-symlink targets and and executable bits, so we have to store the rest
+symlink targets and executable bits, so we have to store the rest
 some other way. Bup stores more complete metadata in the VFS in a file named .bupm in
@@ -548,7 +548,7 @@ the index and backs up any file that is "dirty," that is, doesn't already exist in the repository. Determination of dirtiness is a little more complicated than it sounds. The
-most dirtiness-relevant relevant flag in the bupindex is IX_HASHVALID; if
+most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
 this flag is reset, the file *definitely* is dirty and needs to be backed up. But a file may be dirty even if IX_HASHVALID is set, and that's the confusing part.
@@ -634,10 +634,87 @@ things to note: are formatted. So don't get confused!
+Handling Python 3's insistence on strings
+=========================================
+
+In Python 2 strings were bytes, and bup used them for all kinds of
+data. Python 3 made a pervasive backward-incompatible change to make
+all strings Unicode, i.e. in Python 2 'foo' and b'foo' were the same
+thing, while u'foo' was a Unicode string. In Python 3 'foo' became
+synonymous with u'foo', completely changing the type and potential
+content, depending on the locale.
+
+In addition, and particularly bad for bup, Python 3 also (initially)
+insisted that all kinds of things were strings that just aren't (at
+least not on many platforms), e.g. user names, groups, filesystem
+paths, etc. There's no guarantee that any of those are always
+representable in Unicode.
+
+Over the years, Python 3 has gradually backed off from that initial
+aggressive stance, adding alternate interfaces like os.environb or
+allowing bytes arguments to many functions like open(b'foo'...), so
+that in those cases it's at least possible to accurately
+retrieve the system data.
+
+After a while, they devised the concept of [byte smuggling](https://www.python.org/dev/peps/pep-0383/)
+as a more comprehensive solution, though at least currently, we've
+found that it doesn't always work (see below), and at least for bulk
+data, it's more expensive, converting the data back and forth when you
+just wanted the original bytes, exactly as provided by the system
+APIs.
+
+At least one case where we've found that the byte smuggling approach
+doesn't work is with respect to sys.argv (initially discovered in
+Python 3.7). The claim is that we should be able to retrieve the
+original bytes via os.fsencode(sys.argv[n]), but after adding some
+randomized argument testing, we quickly discovered that this isn't
+true with (at least) the default UTF-8 environment. The interpreter
+just crashes while starting up with some random binary arguments:
+
+    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
+    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]
+
+    Current thread 0x00007f2f0e1d8740 (most recent call first):
+    Traceback (most recent call last):
+      File "t/test-argv", line 28, in <module>
+        out = check_output(cmd)
+      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
+        **kwargs).stdout
+      File "/usr/lib/python3.7/subprocess.py", line 487, in run
+        output=stdout, stderr=stderr)
+
+To fix that, at least for now, the plan is to *always* force the
+LC_CTYPE to ISO-8859-1 before launching Python, which does "fix" the
+problem.
+
+The reason we want to require ISO-8859-1 is that it's a (common)
+8-bit encoding, which means that there are no invalid byte sequences
+with respect to encoding/decoding, and so the mapping between it and
+Unicode is one-to-one, i.e. any sequence of bytes is a valid
+ISO-8859-1 string and has a valid representation in Unicode. Whether
+or not the end result in Unicode represents what was originally
+intended is another question entirely, but the key thing is that the
+round-trips between ISO-8859-1 bytes and Unicode should be completely
+safe.
+
+We're requiring this encoding so that *hopefully* Python 3 will then
+allow us to get the unmangled bytes from os interfaces where it
+doesn't provide an explicit or implicit binary version like environb
+or open(b'foo', ...).
+
+In the longer run we might consider wrapping these APIs ourselves in
+C, and have them just return Py_bytes objects to begin with, which
+would be more efficient and make the process completely independent of
+the system encoding, and/or less potentially fragile with respect to
+whatever the Python upstream might decide to try next.
+
+But for now, this approach will hopefully save us some work.
+
+
 We hope you'll enjoy bup. Looking forward to your patches!
 -- apenwarr and the rest of the bup team
 Local Variables:
-mode: text
+mode: markdown
 End:
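
The round-trip property that the new DESIGN section relies on can be checked directly. The following is a minimal sketch, not code from bup itself; the variable names are made up for illustration. It assumes only the standard codecs: it decodes every possible byte value as ISO-8859-1 and encodes it back, shows that strict UTF-8 rejects the same data, and shows the PEP 383 "byte smuggling" round trip via the `surrogateescape` error handler.

```python
# Every possible byte value, including ones that are invalid as UTF-8.
raw = bytes(range(256))

# ISO-8859-1 maps byte N to code point U+00NN, so decoding cannot fail
# and encoding the result gives back exactly the original bytes.
text = raw.decode("iso-8859-1")
latin1_round_trip = text.encode("iso-8859-1")

# Strict UTF-8 rejects arbitrary binary data.
try:
    raw.decode("utf-8")
    utf8_strict_ok = True
except UnicodeDecodeError:
    utf8_strict_ok = False

# PEP 383 "byte smuggling": undecodable bytes become lone surrogates
# (U+DC80..U+DCFF) and are turned back into bytes when encoding.
smuggled = raw.decode("utf-8", "surrogateescape")
smuggled_round_trip = smuggled.encode("utf-8", "surrogateescape")
```

This only exercises the codecs in isolation. It says nothing about what the interpreter does to sys.argv during startup, which is where the crash quoted in the patch occurred.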