Avery Pennarun [Thu, 4 Mar 2010 03:34:40 +0000 (22:34 -0500)]
main: fix problem when redirecting to newliner on MacOS X.
It's probably just a bug in python 2.4.2, which is the version on my old
MacOS machine. But it seems that if you use subprocess.Popen with stdout=1
and/or stderr=2, it ends up closing the file descriptors instead of passing
them along. Since those are the defaults anyway, just use None instead.
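The fix can be sketched like this (a minimal illustration, not bup's exact
code; run_inherited is a hypothetical helper):

```python
import subprocess

# Passing stdout=1/stderr=2 made the buggy Python close those
# descriptors; passing None (the default) lets the child simply
# inherit the parent's stdout and stderr.
def run_inherited(argv):
    p = subprocess.Popen(argv, stdout=None, stderr=None)
    return p.wait()
```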
Avery Pennarun [Thu, 4 Mar 2010 01:17:04 +0000 (20:17 -0500)]
save-cmd: Fix --smaller and other behaviour when files are skipped.
The --smaller option now uses parse_num() so it can be something other than
a raw number of bytes (eg. "1.5G").
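A rough sketch of what a parse_num()-style helper accepts (illustrative
constants and behaviour, not bup's exact implementation):

```python
# Accept either a raw byte count ("1048576") or a suffixed size
# ("1.5G"); suffixes here are assumed to be powers of 1024.
def parse_num(s):
    units = {'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3, 't': 1024 ** 4}
    s = s.strip().lower()
    if s and s[-1] in units:
        return int(float(s[:-1]) * units[s[-1]])
    return int(s)
```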
We were incorrectly marking a tree as valid when we skipped any of its
contents for any reason; that's no good. We can still save a tree to the
backup, but it'll be missing some stuff, so we have to avoid marking it as
valid. That way it won't be skipped next time around.
Avery Pennarun [Thu, 4 Mar 2010 00:21:20 +0000 (19:21 -0500)]
save-cmd: progress meter wouldn't count identical files correctly.
This one was really tricky. If a file was IX_HASHVALID but its object
wasn't available on the target server (eg. if you backed up to one server
and now are backing up to a different one), we would correctly count it
toward the total bytes we expected to back up.
Now imagine there are two *identical* files (ie. with the same sha1sum) in
this situation. When that happens, we'd back up the first one, after which
the objects for the second one *are* available. So we'd skip it, even
though the initial scan had counted its bytes as needing backup. The result
would be that our backup count showed a final byte percentage less than 100%.
The workaround isn't very pretty, but should be correct: we add a new
IX_SHAMISSING flag, setting or clearing it during the initial index scan,
and then we use *that* as the indicator of whether to add bytes to the count
or not.
We also have to decide whether to recurse into subdirectories using this
algorithm. If /etc/rc3.d and /etc/rc4.d are identical, and one of the files
in them had this problem, then we wouldn't even *recurse* into /etc/rc3.d
after backing up /etc/rc4.d. That means we wouldn't check the IX_SHAMISSING
flag on the file inside. So we had to fix that up too.
On the other hand, this is an awful lot of complexity just to make the
progress messages more exact...
Avery Pennarun [Wed, 3 Mar 2010 22:36:06 +0000 (17:36 -0500)]
save-cmd: don't fail an assertion when doing a backup from the root level.
This wasn't caught by unit tests because "virtual" nodes added by
index.py:_golevel() weren't being marked as IX_EXISTS; in the unit tests
those virtual nodes included the root, so save-cmd was never actually
trying to back up that node.
That made the base directories incorrectly marked as status=D (deleted) if
you printed out the index during the tests. So add a test for that to make
it fail if "/" is deleted (which obviously makes no sense), then add another
test for saving from the root level, then fix both bugs.
Avery Pennarun [Wed, 3 Mar 2010 04:59:08 +0000 (23:59 -0500)]
'make stupid' stopped working when I moved subcommands into their own dir.
Remote server mode tries to add the directory of argv[0] (the
currently-running program) to the PATH on the remote server, just in case
bup isn't installed in the PATH there, so that it can then run 'bup server'.
However, now that bup-save is in a different place than bup, argv[0] is the
wrong place to look. Instead, have the bup executable export an environment
variable containing its location, and client.py can use that instead of
argv[0]. Slightly gross, but it works.
Avery Pennarun [Wed, 3 Mar 2010 04:18:49 +0000 (23:18 -0500)]
log(): handle situations where stderr gets set to nonblocking.
It's probably ssh doing this, and in obscure situations, it means log() ends
up throwing an exception and aborting the program.
Fix it so that we handle EAGAIN correctly if we get it when writing to
stderr, even though this is only really necessary due to stupidity on
(I think/hope) someone else's part.
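The retry logic might look roughly like this (a sketch of the idea, not
bup's actual log() implementation):

```python
import errno
import os
import sys
import time

# An EAGAIN-tolerant log(): if stderr has been switched to
# nonblocking (by ssh, say), retry the write instead of letting
# the exception abort the program.
def log(msg):
    while msg:
        try:
            n = os.write(sys.stderr.fileno(), msg)
            msg = msg[n:]
        except OSError as e:
            if e.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise
            time.sleep(0.01)  # give the consumer a moment to drain
```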
Avery Pennarun [Tue, 2 Mar 2010 21:20:41 +0000 (16:20 -0500)]
bup random: fix progress output and don't print to a tty.
We were printing output using a series of dots, which interacted badly with
bup newliner (and for good reason). Change it to actually display the
number of megabytes done so far.
Also, don't print random binary data to a tty unless -f is given. It's
just more polite that way.
Avery Pennarun [Mon, 1 Mar 2010 00:07:00 +0000 (19:07 -0500)]
Rename PackIndex->PackIdx and MultiPackIndex->PackIdxList.
This corresponds to the PackMidx renaming I did earlier, and helps avoid
confusion between index.py (which talks to the 'bupindex' file and has
nothing to do with packs) and git.py (which talks to packs and has nothing
to do with the bupindex). Now pack indexes are always called Idx, and the
bupindex is always Index.
Furthermore, MultiPackIndex could easily be assumed to be the same thing as
a Midx, which it isn't. PackIdxList is a more accurate description of what
it is: a list of pack indexes. A Midx is an index of a list of packs.
Avery Pennarun [Sun, 28 Feb 2010 22:05:41 +0000 (17:05 -0500)]
Move cmd-*.py to cmd/*-cmd.py.
The bup-* programs shouldn't need to be installed into /usr/bin; we should
search for them in /usr/lib somewhere.
I could have left the names as cmd/cmd-*.py, but the cmd-* was annoying me
because of tab completion. Now I can type cmd/ran<tab> to get
random-cmd.py.
Avery Pennarun [Sun, 28 Feb 2010 21:17:35 +0000 (16:17 -0500)]
Move python library files to lib/bup/
...and update other programs so that they import them correctly from their
new location.
This is necessary so that the bup library files can eventually be installed
somewhere other than wherever the 'bup' executable ends up. Plus it's
clearer and safer to say 'from bup import options' instead of just 'import
options', in case someone else writes an 'options' module.
I wish I could have named the directory just 'bup', but I can't; there's
already a program with that name.
Also, in the name of sanity, rename memtest.py to 'bup memtest' so that it
can get the new paths automatically.
Avery Pennarun [Sun, 28 Feb 2010 20:51:16 +0000 (15:51 -0500)]
cmd-index: auto-invalidate entries without a valid sha1 or gitmode.
Not exactly sure where these entries came from; possibly a failed save or an
earlier buggy version of bup. But previously, they weren't auto-fixable
without deleting your bupindex.
Avery Pennarun [Sun, 28 Feb 2010 20:00:50 +0000 (15:00 -0500)]
Add a new 'bup newliner' that fixes progress message whitespace.
If we have multiple processes producing status messages to stderr and/or
stdout, and some of the lines ended in \r (ie. a progress message that was
supposed to be overwritten later) they would sometimes stomp on each other
and leave ugly bits lying around.
Now bup.py automatically pipes stdout/stderr to the new 'bup newliner'
command to fix this, but only if they were previously pointing at a tty.
Thus, if you redirect stdout to a file, nothing weird will happen, but if
you don't, stdout and stderr won't conflict with each other.
Anyway, the output is prettier now. Trust me on this.
Avery Pennarun [Sun, 28 Feb 2010 18:07:48 +0000 (13:07 -0500)]
Add an options.fatal() function and use it.
Every existing call to o.usage() was preceded by an error message that
printed the exename, then the error message. So let's add a fatal()
function that does it all in one step. This reduces the net number of lines
plus improves consistency.
Avery Pennarun [Sun, 14 Feb 2010 08:35:45 +0000 (03:35 -0500)]
Another suspicious fix for CatPipe parallelism.
This really shouldn't be necessary: it's clear to me that the 'it' object
should be going out of scope right away, and thus getting cleaned up by the
garbage collector.
But on one of my Linux PCs (with python 2.4.4) it fails the unit tests
unless I add this patch. Oh well, let's do it then.
Avery Pennarun [Sun, 14 Feb 2010 06:16:43 +0000 (01:16 -0500)]
hashsplit: smallish files (less than BLOB_MAX) weren't getting split.
This buglet was introduced when doing my new fanout cleanups. It's
relatively unimportant, but it would cause a bit of space wastage for
smallish files that changed by a bit, since we couldn't take advantage of
deduplication for their blocks.
This also explains why the --fanout argument test broke earlier. I thought
I was going crazy (since the whole fanout implementation had changed and the
number now means something slightly different), so I just removed it. But
now we can bring it back and it passes again.
Avery Pennarun [Sat, 13 Feb 2010 23:21:09 +0000 (18:21 -0500)]
Make CatPipe objects more resilient when interrupted.
If we stopped iterating halfway through a particular object, the iterator
wouldn't finish reading all the data, which would mess up the state of
the git-cat-file pipe. Now we read all the data even if we're going to just
throw it away.
Avery Pennarun [Fri, 12 Feb 2010 19:53:19 +0000 (14:53 -0500)]
_hashsplit.c: right shifting 32 bits doesn't work.
In C, if you do

    uint32_t i = 0xffffffff;
    i >>= 32;

then the answer is 0xffffffff, not 0 as you might expect. (Shifting by the
full width of the type is undefined behaviour in C; on x86 the shift count
is taken mod 32, so the value never changes.) Let's shift it by less than
32 at a time, which will give the right results. This fixes a rare
infinite loop when counting the bits in the hashsplit.
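A stepwise bit-counting loop along these lines sidesteps the problem,
since no single shift ever reaches the full word width (a sketch, not the
actual _hashsplit.c code):

```python
# Count trailing one-bits by shifting a single bit per iteration.
# In C, a loop that shifts a uint32_t by 32 in one step and waits
# for the value to reach zero may never terminate on x86, where the
# shift count is taken mod 32; shifting by 1 per step is safe.
def count_trailing_ones(v):
    n = 0
    while v & 1:
        v >>= 1
        n += 1
    return n
```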
Avery Pennarun [Fri, 12 Feb 2010 04:50:39 +0000 (23:50 -0500)]
hashsplit: totally change the way the fanout stuff works.
Useless code churn or genius innovation? You decide.
The previous system for naming chunks of a split file was kind of lame. We
tried to name the files something that was "almost" their offset, so that
filenames wouldn't shuffle around too much if a few bytes were added/deleted
here and there. But that totally failed to work if a *lot* of bytes were
added, and it also lost the useful feature that you could seek to a specific
point in a file (like a VM image) without restoring the whole thing.
"Approximate" offsets aren't much good for seeking to.
The new system is even more crazy than the original hashsplit: we now use
the "extra bits" of the rolling checksum to define progressively larger
chunks. For example, we might define a normal chunk if the checksum ends in
0xFFF (12 bits). Now we can group multiple chunks together when the
checksum ends in 0xFFFF (16 bits). Because of the way the checksum works,
this happens about every 2^4 = 16 chunks. Similarly, 0xFFFFF (20 bits) will
happen 16 times less often than that, and so on. We can use this effect to
define a tree.
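The level rule described above can be sketched as follows (the 12-bit and
4-bit constants are illustrative; bup's actual values may differ):

```python
BLOB_BITS = 12    # checksum ends in this many 1-bits -> chunk boundary
FANOUT_BITS = 4   # each extra group of 4 bits -> one level up the tree

def split_level(csum):
    # Count the trailing one-bits of the rolling checksum.
    bits = 0
    while csum & 1:
        csum >>= 1
        bits += 1
    if bits < BLOB_BITS:
        return None  # not a chunk boundary at all
    return (bits - BLOB_BITS) // FANOUT_BITS
```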
Then, in each branch of the tree, we name files based on their (exact, not
approximate) offset *from the start of that tree*.
Essentially, inserting/deleting/changing bytes will affect more "levels" of
the rolling checksum, mangling bigger and bigger branches of the overall
tree and causing those branches to change. However, only the content of
that sub-branch (and the *names*, ie offsets, of the following branches at
that and further-up levels) end up getting changed, so the effect can be
mostly localized. The subtrees of those renamed trees are *not* affected,
because all their offsets are relative to the start of their own tree. This
means *most* of the sha1sums in the resulting hierarchy don't need to
change, no matter how much data you add/insert/delete.
Anyway, the net result is that "git diff -M" now actually does something
halfway sensible when comparing the trees corresponding to huge split files.
Only halfway (because the chunk boundaries can move around a bit, and such
large files are usually binary anyway) but it opens the way for much cooler
algorithms in the future.
Also, it'll now be possible to make 'bup fuse' open files without restoring
the entire thing to a temp file first. That means restoring (or even
*using*) snapshotted VMs ought to become possible.
Andrew Schleifer [Wed, 10 Feb 2010 20:40:46 +0000 (15:40 -0500)]
Fix building on MacOS X on PowerPC.
bup failed to build on one of my machines, an older iMac; make
died ~40 lines in with "gcc-4.0: Invalid arch name : Power".
On PPC machines, uname -m returns the helpfully descriptive
"Power Macintosh", which gcc doesn't recognize. Some googling
revealed e.g.
http://www.opensource.apple.com/source/ld64/ld64-95.2.12/unit-tests/include/common.makefile
where they use $(shell arch) to get the necessary info.
With that little change, bup built on ppc and i386 machines for
me, and passed all tests.
Avery Pennarun [Tue, 9 Feb 2010 05:51:25 +0000 (00:51 -0500)]
cmd-save: don't recurse into already-valid subdirs.
When iterating through the index, if we find out that a particular dir (like
/usr) has a known-valid sha1sum and isn't marked as changed, there's no need
to recurse into it at all. This saves some pointless grinding through the
index when entire swaths of the tree are known to be already valid.
Avery Pennarun [Tue, 9 Feb 2010 01:28:51 +0000 (20:28 -0500)]
cmd-index/cmd-save: correctly mark directories as dirty/clean.
Previously, we just ignored the IX_HASHVALID on directories, and regenerated
their hashes on every backup regardless. Now we correctly save directory
hashes and mark them IX_HASHVALID after doing a backup, as well as removing
IX_HASHVALID all the way up the tree whenever a file is marked as invalid.
Avery Pennarun [Tue, 9 Feb 2010 00:26:38 +0000 (19:26 -0500)]
Fix some list comprehensions that I thought were generator comprehensions.
Apparently [x for x in whatever] yields a list, not an iterator, which means
two things:
- it might use more memory than I thought
- you definitely don't need to write list([...]) since it's already a
list.
Clean up a few of these. You learn something new every day.
Avery Pennarun [Mon, 8 Feb 2010 18:49:17 +0000 (13:49 -0500)]
test.sh: don't try non-quick fsck on damaged repositories.
It turns out that older versions of git (1.5.x or so) have a git-verify-pack
that goes into an endless loop when it hits certain kinds of corruption, and
our test would trigger it almost every time. Using --quick avoids calling
git-verify-pack, so it won't exhibit the problem.
Unfortunately this means a slightly less thorough test of non-quick
bup-fsck, but it'll have to do. Better than failing tests nonstop, anyway.
Avery Pennarun [Sun, 24 Jan 2010 03:09:15 +0000 (22:09 -0500)]
Infrastructure for generating a markdown-based man page using pandoc.
The man page (bup.1) is total drivel for the moment, though. And arguably
we could split up the manpages per subcommand like git does, but maybe
that's overkill at this stage.
Avery Pennarun [Fri, 5 Feb 2010 01:12:41 +0000 (20:12 -0500)]
bup save: try to estimate the time remaining.
Naturally, estimating the time remaining is one of those things that sounds
super easy, but isn't. So the numbers wobble around a bit more than I'd
like, especially at first. But apply a few scary heuristics, and boom!
Stuff happens.
Avery Pennarun [Fri, 5 Feb 2010 00:26:17 +0000 (19:26 -0500)]
bup-server: revert to non-midx indexes when suggesting a pack.
Currently midx files can't tell you *which* index contains a particular
hash, just that *one* of them does. So bup-server was barfing when it
expected MultiPackIndex.exists() to return a pack name, and was getting a
.midx file instead.
We could have loosened the assertion and allowed the server to suggest a
.midx file... but those can be huge, and it defeats the purpose of only
suggesting the minimal set of packs so that lightweight clients aren't
overwhelmed.
Avery Pennarun [Fri, 5 Feb 2010 00:12:30 +0000 (19:12 -0500)]
Narrow the exception handling in cmd-save.
If we encountered an error *writing* the pack, we were counting it as a
non-fatal error, which was not the intention. Only *reading* files we want
to back up should be considered non-fatal.
Avery Pennarun [Thu, 4 Feb 2010 23:56:01 +0000 (18:56 -0500)]
On python 2.4 on MacOS X, __len__() must return an int.
We were already returning integers, which seem to be "long ints" in this
case, even though they're relatively small. Whatever, we'll typecast them
to int first, and now unit tests pass.
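The fix amounts to a cast (a sketch; PackIdxSketch is a hypothetical
stand-in, not a real bup class):

```python
# On Python 2.4 on MacOS X, __len__() returning a long raised an
# error, so cast the computed value to int explicitly.
class PackIdxSketch:
    def __init__(self, nobjects):
        self.nobjects = nobjects

    def __len__(self):
        return int(self.nobjects)
```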
Avery Pennarun [Thu, 4 Feb 2010 06:21:51 +0000 (01:21 -0500)]
Merge branch 'indexrewrite'
* indexrewrite:
Greatly improved progress reporting during index/save.
Fix bugs in new indexing code.
Speed up cmd-drecurse by 40%.
Split directory recursion stuff from cmd-index.py into drecurse.py.
Massive speedups to bupindex code.
Avery Pennarun [Thu, 4 Feb 2010 06:12:06 +0000 (01:12 -0500)]
Greatly improved progress reporting during index/save.
Now that the index reading stuff is much faster, we can afford to waste time
reading through it just to count how many bytes we're planning to back up.
And that lets us print really friendly progress messages during bup save, in
which we can tell you exactly what fraction of your bytes have been backed
up so far.
Avery Pennarun [Wed, 3 Feb 2010 23:56:43 +0000 (18:56 -0500)]
Fix bugs in new indexing code.
The logic was way too screwy, so I've simplified it a lot. Also extended
the unit tests quite a bit to replicate the weird problems I was having. It
seems pretty stable - and pretty fast - now.
Iterating through an index of my whole home directory (bup index -p ~) now
takes about 5.1 seconds, vs. 3.5 seconds before the rewrite. However,
iterating through just a *fraction* of the index can now bypass all the
parts we don't care about, so it's much much faster than before.
Could probably still stand some more optimization eventually, but at least
the file format allows for speed. The rest is just code :)
Avery Pennarun [Wed, 3 Feb 2010 21:42:48 +0000 (16:42 -0500)]
Split directory recursion stuff from cmd-index.py into drecurse.py.
Also add a new command, 'bup drecurse', which just recurses through a
directory tree and prints all the filenames. This is useful for timing
performance vs. the native 'find' command.
The result is a bit embarrassing; for my home directory of about 188000
files, drecurse is about 10x slower:
$ time bup drecurse -q ~
real 0m2.935s
user 0m2.312s
sys 0m0.580s
$ time find ~ -printf ''
real 0m0.385s
user 0m0.096s
sys 0m0.284s
$ time find ~ -printf '%s\n' >/dev/null
real 0m0.662s
user 0m0.208s
sys 0m0.456s
Avery Pennarun [Sun, 31 Jan 2010 22:59:33 +0000 (17:59 -0500)]
Massive speedups to bupindex code.
The old file format was modeled after the git one, but it was kind of dumb;
you couldn't search through the file except linearly, which is pretty slow
when you have hundreds of thousands, or millions, of files. It also stored
the entire pathname of each file, which got very wasteful as filenames got
longer.
The new format is much quicker; each directory has a pointer to its list of
children, so you can jump around rather than reading linearly through the
file. Thus you can now 'bup index -p' any subdirectory pretty much
instantly. The code is still not completely optimized, but the remaining
algorithmic silliness doesn't seem to matter.
And it even still passes unit tests! Which is too bad, actually, because I
still get oddly crashy behaviour when I repeatedly update a large index. So
there are still some screwy bugs hanging around. I guess that means we need
better unit tests...
Avery Pennarun [Tue, 2 Feb 2010 05:54:10 +0000 (00:54 -0500)]
cmd-save: add --smaller option.
This makes it only back up files smaller than the given size. bup can
handle big files, but you might want to do quicker incremental backups and
skip bigger files except once a day, or something.
Avery Pennarun [Tue, 2 Feb 2010 02:34:56 +0000 (21:34 -0500)]
midx: the fanout table entries can be 4 bytes, not 8.
I was trying to be future-proof, but it was kind of overkill, since a 32-bit
fanout entry could handle a total of 4 billion *hashes* per midx. That
would be 20*4bil = 80 gigs in a single midx. This corresponds to about 10
terabytes of packs, which isn't inconceivable... but if it happens, you
could just use more than one midx. Plus you'd likely run into other weird
bup problems before your midx files get anywhere near 80 gigs.
Avery Pennarun [Tue, 2 Feb 2010 02:30:59 +0000 (21:30 -0500)]
cmd-midx: correctly handle a tiny nonzero number of objects.
If all the sha1sums would have fit in a single page, the number of bits in
the table would be negative, with odd results. Now we just refuse to create
the midx if there are too few objects *and* too few files, since it would be
useless anyway.
We're still willing to create a very small midx if it allows us to merge
several indexes into one, however small, since that will still speed up
searching.
Avery Pennarun [Tue, 2 Feb 2010 01:40:30 +0000 (20:40 -0500)]
cmd-margin: a command to find out the max bits of overlap between hashes.
Run 'bup margin' to go through the list of all the objects in your bup
directory and count the number of overlapping prefix bits between each two
consecutive objects. That is, find the longest hash length (in bits) that
*would* have caused an overlap, if sha1 hashes had been that length.
On my system with 111 gigs of packs, I get 44 bits. Out of a total of 160.
That means I'm still safe from collisions for about 2^116 times over. Or is
it only the square root of that? Anyway, it's such a large number that my
brain explodes just thinking about it.
Mark my words: 2^160 ought to be enough for anyone.
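What 'bup margin' computes can be sketched like this (small ints stand in
for 160-bit sha1s; not bup's actual code):

```python
def common_prefix_bits(a, b, width=160):
    # Number of leading bits shared by two width-bit hashes.
    x = a ^ b
    return width if x == 0 else width - x.bit_length()

def margin(sorted_hashes, width=160):
    # Max overlap between any two consecutive sorted hashes.
    return max(common_prefix_bits(a, b, width)
               for a, b in zip(sorted_hashes, sorted_hashes[1:]))
```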
Avery Pennarun [Sun, 31 Jan 2010 21:54:00 +0000 (16:54 -0500)]
Update README.md to reflect recent developments.
- Remove the version number since I never remember to update it
- We now work with earlier versions of python and MacOS
- There's now a mailing list
- 'bup fsck' allows us to remove one of the things from the "stupid" list.
Avery Pennarun [Sun, 31 Jan 2010 01:29:22 +0000 (20:29 -0500)]
fsck: add a -j# (run multiple threads) option.
Sort of like make -j. par2 can be pretty slow, so this lets us verify
multiple files in parallel. Since the files are so big, though, this might
actually make performance *worse* if you don't have a lot of RAM. I haven't
benchmarked this too much, but on my quad-core with 6 gigs of RAM, -j4 does
definitely make it go "noticeably" faster.
Avery Pennarun [Sat, 30 Jan 2010 21:31:27 +0000 (16:31 -0500)]
Use mkstemp() when creating temporary packfiles.
Using getpid() was an okay hack, but there's no good excuse for doing it
that way when there are perfectly good tempfile-naming functions around
already.
Avery Pennarun [Sat, 30 Jan 2010 21:09:38 +0000 (16:09 -0500)]
client: fix a race condition when the server suggests an index.
If we finished our current pack too quickly after getting the suggestion,
the client would get confused, resulting in 'expected "ok", got %r' type
errors.
Avery Pennarun [Wed, 27 Jan 2010 00:30:30 +0000 (19:30 -0500)]
cmd-ls and cmd-fuse: toys for browsing your available backups.
'bup ls' lets you browse the set of backups on your current system. It's a
bit useless, so it might go away or be rewritten eventually.
'bup fuse' is a simple read-only FUSE filesystem that lets you mount your
backup sets as a filesystem (on Linux only). You can then export this
filesystem over samba or NFS or whatever, and people will be able to restore
their own files from backups.
Warning: we still don't support file metadata in 'bup save', so all the file
permissions will be wrong (and users will probably be able to see things
they shouldn't!). Also, anything that has been split into chunks will show
you the chunks instead of the full file, which is a bit silly. There are
also tons of places where performance could be improved.
But it's a pretty neat toy nevertheless. To try it out:
Avery Pennarun [Mon, 25 Jan 2010 07:22:23 +0000 (02:22 -0500)]
cmd-midx: add --auto and --force options.
Rather than having to list the indexes you want to merge, now it can do it
for you automatically. The output filename is now also optional; it'll
generate it in the right place in the git repo automatically.
Avery Pennarun [Mon, 25 Jan 2010 06:41:44 +0000 (01:41 -0500)]
When there are multiple overlapping .midx files, discard redundant ones.
That way if someone generates a .midx for a subset of .idx files, then
another for the *entire* set of .idx files, we'll automatically ignore the
former one, thus increasing search speed and improving memory thrashing
behaviour even further.
Avery Pennarun [Mon, 25 Jan 2010 06:24:16 +0000 (01:24 -0500)]
MultiPackIndex: use .midx files if they exist.
Wow, using a single .midx file that merges my 435 megs of packfile indexes
(across 169 files) reduces memory churn in memtest.py by at least two orders
of magnitude (ie. we need to map 100x fewer memory pages in order to
search for each nonexistent object when creating a new backup). memtest.py
We can also remove the PackBitmap code now, since it's not nearly as good as
the PackMidx stuff and is now an unnecessary layer of indirection.
Avery Pennarun [Mon, 25 Jan 2010 05:52:14 +0000 (00:52 -0500)]
cmd-midx: a command for merging multiple .idx files into one.
This introduces a new "multi-index" index format, as suggested by Lukasz
Kosewski.
.midx files have a variable-bit-width fanout table that's supposedly
optimized to be able to find any sha1 while dirtying only two pages (one for
the fanout table lookup, and one for the final binary search). Each entry
in the fanout table should correspond to approximately one page's worth of
sha1sums.
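The sizing rule might look roughly like this, assuming 4k pages and
20-byte sha1s (a sketch, not the real cmd-midx code):

```python
PAGE_SIZE = 4096
SHAS_PER_PAGE = PAGE_SIZE // 20  # ~204 sha1 entries per 4k page

def fanout_bits(total_objects):
    # Grow the fanout until each bucket covers about one page of sha1s.
    bits = 0
    while (total_objects >> bits) > SHAS_PER_PAGE:
        bits += 1
    return bits
```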
Also adds a PackMidx class, which acts just like PackIndex, but for .midx
files. Not using it for anything yet, though. The idea is to greatly
reduce memory burn when searching through lots of pack files.
Avery Pennarun [Sun, 24 Jan 2010 22:46:51 +0000 (17:46 -0500)]
In some versions of python, comparing buffers with < gives a warning.
It seems to be a buggy warning. But we only really do it in one place, and
buffers in question are only 20 bytes long, so forcing them into strings
seems harmless enough.
Avery Pennarun [Sun, 24 Jan 2010 22:18:25 +0000 (17:18 -0500)]
Wrap mmap calls to help with portability.
python2.4 in 'fink' on MacOS X seems to not like it when you pass a file
length of 0, even though that's supposed to mean "determine map size
automatically."
Avery Pennarun [Sun, 24 Jan 2010 21:37:46 +0000 (16:37 -0500)]
executable files: don't assume python2.5.
The forcing of version 2.5 was left over from before, when it was
accidentally selecting python 2.4 on some distros when both
versions are installed. But actually that's fine; bup works in python 2.4
without problems.
So let's not cause potentially *more* portability problems by forcing python
2.5 when it might not exist.
Dave Coombs [Thu, 14 Jan 2010 01:13:38 +0000 (20:13 -0500)]
Change t/tindex.py to pass on Mac OS.
It turns out /etc is a symlink (to /private/etc) on Mac OS, so checking
that the realpath of t/sampledata/etc is /etc fails. Instead we now check
against the realpath of /etc.
Avery Pennarun [Tue, 12 Jan 2010 05:52:21 +0000 (00:52 -0500)]
Use a PackBitmap file as a quicker way to check .idx files.
When we receive a new .idx file, we auto-generate a .map file from it. It's
essentially an allocation bitmap: for each 20-bit prefix, we assign one bit
to tell us if that particular prefix is in that particular packfile. If it
isn't, there's no point searching the .idx file at all, so we can avoid
mapping in a lot of pages. If it is, though, we then have to search the
.idx *too*, so we suffer a bit.
On the whole this reduces memory thrashing quite a bit for me, though.
Probably the number of bits needs to be variable in order to work over a
wider range of packfile sizes/numbers.
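The bitmap idea can be sketched like this (illustrative layout, not bup's
actual .map format):

```python
PREFIX_BITS = 20

class PackBitmap:
    # One bit per 20-bit sha1 prefix: set means "some object with this
    # prefix is in the pack", clear means the .idx needn't be searched.
    def __init__(self):
        self.bits = bytearray((1 << PREFIX_BITS) // 8)  # 128 KB

    def _prefix(self, sha):
        return int.from_bytes(sha[:3], 'big') >> (24 - PREFIX_BITS)

    def add(self, sha):
        p = self._prefix(sha)
        self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, sha):
        p = self._prefix(sha)
        return bool(self.bits[p // 8] & (1 << (p % 8)))
```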
Avery Pennarun [Tue, 12 Jan 2010 04:02:56 +0000 (23:02 -0500)]
memtest.py: a standalone program for testing memory usage in PackIndex.
The majority of the memory usage in bup split/save is now caused by
searching pack indexes for sha1 hashes. The problem with this is that, in
the common case for a first full backup, *none* of the object hashes will be
found, so we'll *always* have to search *all* the packfiles. With just 45
packfiles of 200k objects each, that makes about (18-8)*45 = 450 binary
search steps, or 100+ 4k pages that need to be loaded from disk, to check
*each* object hash. memtest.py lets us see how fast RSS creeps up under
various conditions, and how different optimizations affect the result.
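The arithmetic above as a sketch (log2(200k) rounds up to 18 steps, minus
the 8 bits assumed to be resolved for free by the .idx's 256-entry fanout
table):

```python
import math

def pages_per_miss(num_idx, objs_per_idx, fanout_bits=8):
    # Binary-search steps per .idx for a missing hash; each step can
    # touch a distinct 4k page.
    steps = math.ceil(math.log2(objs_per_idx)) - fanout_bits
    return num_idx * steps
```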
Avery Pennarun [Tue, 12 Jan 2010 03:59:46 +0000 (22:59 -0500)]
options parser: automatically convert strings to ints when appropriate.
If the given parameter is exactly an int (ie. str(int(v)) == v) then convert
it to an int automatically. This helps avoid weird bugs in apps using the
option parser.
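The conversion rule can be sketched as (intify is a hypothetical helper,
not the parser's exact code):

```python
def intify(v):
    # Convert only when the round-trip is exact, so values like
    # '007' or '1.5' stay strings.
    try:
        if str(int(v)) == v:
            return int(v)
    except (TypeError, ValueError):
        pass
    return v
```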