Avery Pennarun [Sun, 31 Jan 2010 21:54:00 +0000 (16:54 -0500)]
Update README.md to reflect recent developments.
- Remove the version number since I never remember to update it
- We now work with earlier versions of python and MacOS
- There's now a mailing list
- 'bup fsck' allows us to remove one of the things from the "stupid" list.
Avery Pennarun [Sun, 31 Jan 2010 01:29:22 +0000 (20:29 -0500)]
fsck: add a -j# (run multiple threads) option.
Sort of like make -j. par2 can be pretty slow, so this lets us verify
multiple files in parallel. Since the files are so big, though, this might
actually make performance *worse* if you don't have a lot of RAM. I haven't
benchmarked this too much, but on my quad-core with 6 gigs of RAM, -j4 does
definitely make it go "noticeably" faster.
Avery Pennarun [Sat, 30 Jan 2010 21:31:27 +0000 (16:31 -0500)]
Use mkstemp() when creating temporary packfiles.
Using getpid() was an okay hack, but there's no good excuse for doing it
that way when there are perfectly good tempfile-naming functions around
already.
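
As a rough illustration of the idea (not the actual bup code; the function name and the prefix/suffix are made up for this example), Python's tempfile.mkstemp() creates and opens a uniquely named file in one step, so there's no need to build a name out of getpid():

```python
import os
import tempfile


def open_temp_pack(directory):
    """Create a uniquely named temp file in `directory`.

    Returns (file object, path). The name pattern here is hypothetical.
    """
    # mkstemp() picks an unused name and opens it exclusively, returning
    # an OS-level file descriptor plus the path it chose.
    fd, path = tempfile.mkstemp(suffix='.pack', prefix='tmp-', dir=directory)
    return os.fdopen(fd, 'w+b'), path
```

Calling it twice in the same directory gives two different paths, even from the same process.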
Avery Pennarun [Sat, 30 Jan 2010 21:09:38 +0000 (16:09 -0500)]
client: fix a race condition when the server suggests an index.
If we finished our current pack too quickly after getting the suggestion,
the client would get confused, resulting in 'expected "ok", got %r' type
errors.
Avery Pennarun [Wed, 27 Jan 2010 00:30:30 +0000 (19:30 -0500)]
cmd-ls and cmd-fuse: toys for browsing your available backups.
'bup ls' lets you browse the set of backups on your current system. It's a
bit useless, so it might go away or be rewritten eventually.
'bup fuse' is a simple read-only FUSE filesystem that lets you mount your
backup sets as a filesystem (on Linux only). You can then export this
filesystem over samba or NFS or whatever, and people will be able to restore
their own files from backups.
Warning: we still don't support file metadata in 'bup save', so all the file
permissions will be wrong (and users will probably be able to see things
they shouldn't!). Also, anything that has been split into chunks will show
you the chunks instead of the full file, which is a bit silly. There are
also tons of places where performance could be improved.
But it's a pretty neat toy nevertheless. To try it out:
Avery Pennarun [Mon, 25 Jan 2010 07:22:23 +0000 (02:22 -0500)]
cmd-midx: add --auto and --force options.
Rather than having to list the indexes you want to merge, now it can do it
for you automatically. The output filename is now also optional; it'll
generate it in the right place in the git repo automatically.
Avery Pennarun [Mon, 25 Jan 2010 06:41:44 +0000 (01:41 -0500)]
When there are multiple overlapping .midx files, discard redundant ones.
That way if someone generates a .midx for a subset of .idx files, then
another for the *entire* set of .idx files, we'll automatically ignore the
former one, thus increasing search speed and improving memory thrashing
behaviour even further.
Avery Pennarun [Mon, 25 Jan 2010 06:24:16 +0000 (01:24 -0500)]
MultiPackIndex: use .midx files if they exist.
Wow, using a single .midx file that merges my 435 megs of packfile indexes
(across 169 files) reduces memory churn in memtest.py by at least two orders
of magnitude (ie. we need to map 100x fewer memory pages in order to
search for each nonexistent object when creating a new backup). memtest.py
runs *visibly* faster.
We can also remove the PackBitmap code now, since it's not nearly as good as
the PackMidx stuff and is now an unnecessary layer of indirection.
Avery Pennarun [Mon, 25 Jan 2010 05:52:14 +0000 (00:52 -0500)]
cmd-midx: a command for merging multiple .idx files into one.
This introduces a new "multi-index" index format, as suggested by Lukasz
Kosewski.
.midx files have a variable-bit-width fanout table that's supposedly
optimized to be able to find any sha1 while dirtying only two pages (one for
the fanout table lookup, and one for the final binary search). Each entry
in the fanout table should correspond to approximately one page's worth of
sha1sums.
Also adds a PackMidx class, which acts just like PackIndex, but for .midx
files. Not using it for anything yet, though. The idea is to greatly
reduce memory burn when searching through lots of pack files.
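
The lookup described above can be sketched roughly as follows. This is only an in-memory model, not the real .midx on-disk format: a sorted Python list stands in for the mmap'd hash table, the 4096-byte page size is an assumption, and all function names are hypothetical.

```python
import bisect
import hashlib

PAGE_SIZE = 4096                       # assumed page size
SHA_SIZE = 20                          # bytes per sha1
SHA_PER_PAGE = PAGE_SIZE // SHA_SIZE   # about 204 hashes fit in one page


def fanout_bits(count):
    """Pick a prefix width so each bucket holds roughly one page of hashes."""
    pages = max(1, count // SHA_PER_PAGE)
    # (pages - 1).bit_length() is ceil(log2(pages)) for pages >= 1
    return max(1, (pages - 1).bit_length())


def prefix(sha, bits):
    """Top `bits` bits (at most 32) of a hash, as an integer."""
    return int.from_bytes(sha[:4], 'big') >> (32 - bits)


def build_fanout(sorted_shas, bits):
    """table[p] = number of hashes whose prefix is <= p."""
    table = [0] * (1 << bits)
    for sha in sorted_shas:
        table[prefix(sha, bits)] += 1
    total = 0
    for i, n in enumerate(table):
        total += n
        table[i] = total
    return table


def exists(sorted_shas, table, bits, sha):
    """One fanout lookup, then a binary search inside a single bucket."""
    p = prefix(sha, bits)
    start = table[p - 1] if p else 0
    end = table[p]
    i = bisect.bisect_left(sorted_shas, sha, start, end)
    return i < end and sorted_shas[i] == sha
```

Because the prefix width grows with the number of hashes, the bucket that the binary search has to touch stays at about one page no matter how large the merged index gets.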
Avery Pennarun [Sun, 24 Jan 2010 22:46:51 +0000 (17:46 -0500)]
In some versions of python, comparing buffers with < gives a warning.
It seems to be a buggy warning. But we only really do it in one place, and
the buffers in question are only 20 bytes long, so forcing them into strings
seems harmless enough.
Avery Pennarun [Sun, 24 Jan 2010 22:18:25 +0000 (17:18 -0500)]
Wrap mmap calls to help with portability.
python2.4 in 'fink' on MacOS X seems to not like it when you pass a file
length of 0, even though that's supposed to mean "determine map size
automatically."
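
A wrapper along these lines would sidestep the problem by always passing an explicit size. This is a sketch, not the actual bup helper: the name mmap_read and the choice to return an empty bytes object for empty files are assumptions.

```python
import mmap
import os


def mmap_read(f, length=0):
    """Map an open file read-only without ever passing a length of 0."""
    if length == 0:
        # Ask the OS for the real size instead of relying on the
        # "0 means map the whole file" convention.
        length = os.fstat(f.fileno()).st_size
    if length == 0:
        # An empty file can't be mapped; hand back an empty buffer
        # (assumed behaviour for this sketch).
        return b''
    return mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ)
```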
Avery Pennarun [Sun, 24 Jan 2010 21:37:46 +0000 (16:37 -0500)]
executable files: don't assume python2.5.
The forcing of version 2.5 was left over from before, when it was
accidentally selecting python 2.4 on some distros where both
versions are installed. But actually that's fine; bup works in python 2.4
without problems.
So let's not cause potentially *more* portability problems by forcing python
2.5 when it might not exist.
Dave Coombs [Thu, 14 Jan 2010 01:13:38 +0000 (20:13 -0500)]
Change t/tindex.py to pass on Mac OS.
It turns out /etc is a symlink (to /private/etc) on Mac OS, so checking
that the realpath of t/sampledata/etc is /etc fails. Instead we now check
against the realpath of /etc.
Avery Pennarun [Tue, 12 Jan 2010 05:52:21 +0000 (00:52 -0500)]
Use a PackBitmap file as a quicker way to check .idx files.
When we receive a new .idx file, we auto-generate a .map file from it. It's
essentially an allocation bitmap: for each 20-bit prefix, we assign one bit
to tell us if that particular prefix is in that particular packfile. If it
isn't, there's no point searching the .idx file at all, so we can avoid
mapping in a lot of pages. If it is, though, we then have to search the
.idx *too*, so we suffer a bit.
On the whole this reduces memory thrashing quite a bit for me, though.
Probably the number of bits needs to be variable in order to work over a
wider range of packfile sizes/numbers.
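
To make the idea concrete, here is a minimal in-memory sketch of such a bitmap. It is not the real .map file format; the function names are hypothetical and the bit ordering within each byte is an arbitrary choice.

```python
PREFIX_BITS = 20


def prefix20(sha):
    """First 20 bits of a hash: three bytes, minus the low 4 bits."""
    return int.from_bytes(sha[:3], 'big') >> 4


def build_bitmap(shas):
    """One bit per possible 20-bit prefix: 2**20 bits = 128 KiB."""
    bitmap = bytearray((1 << PREFIX_BITS) // 8)
    for sha in shas:
        p = prefix20(sha)
        bitmap[p >> 3] |= 1 << (p & 7)
    return bitmap


def maybe_present(bitmap, sha):
    """False means "definitely not in this pack"; True means "go search the .idx"."""
    p = prefix20(sha)
    return bool(bitmap[p >> 3] & (1 << (p & 7)))
```

A False answer lets the caller skip the .idx entirely; a True answer can still be a false positive, which is the extra cost mentioned above.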
Avery Pennarun [Tue, 12 Jan 2010 04:02:56 +0000 (23:02 -0500)]
memtest.py: a standalone program for testing memory usage in PackIndex.
The majority of the memory usage in bup split/save is now caused by
searching pack indexes for sha1 hashes. The problem with this is that, in
the common case for a first full backup, *none* of the object hashes will be
found, so we'll *always* have to search *all* the packfiles. With just 45
packfiles of 200k objects each, that makes about (18-8)*45 = 450 binary
search steps, or 100+ 4k pages that need to be loaded from disk, to check
*each* object hash. memtest.py lets us see how fast RSS creeps up under
various conditions, and how different optimizations affect the result.
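
The arithmetic above can be reproduced with a few lines. Reading the "8" as the bits already resolved by the 256-entry first-byte fanout table in a git .idx file is my interpretation of the message; the function name is made up.

```python
def search_steps(objects_per_pack, packs, fanout_bits=8):
    """Binary search steps needed to prove a hash is in *none* of the packs."""
    # ceil(log2(n)) bits are needed to pick one of n sorted objects...
    bits = (objects_per_pack - 1).bit_length()
    # ...but the fanout table resolves the first few bits for free.
    return (bits - fanout_bits) * packs


print(search_steps(200000, 45))  # (18 - 8) * 45
```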
Avery Pennarun [Tue, 12 Jan 2010 03:59:46 +0000 (22:59 -0500)]
options parser: automatically convert strings to ints when appropriate.
If the given parameter is exactly an int (ie. str(int(v)) == v) then convert
it to an int automatically. This helps avoid weird bugs in apps using the
option parser.
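
The rule is small enough to sketch directly. The helper name below is hypothetical, but the test is the one given above: only convert when turning the int back into a string reproduces the input exactly.

```python
def intify(v):
    """Return int(v) only if v is *exactly* the string form of an int."""
    try:
        if str(int(v)) == v:
            return int(v)
    except (ValueError, TypeError):
        # Not a number at all (or not a string): leave it untouched.
        pass
    return v
```

So '42' and '-3' become ints, while '007', ' 5' and '4.5' stay strings, since converting them would silently change what the user typed.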
Avery Pennarun [Mon, 11 Jan 2010 23:19:29 +0000 (18:19 -0500)]
client-server: only retrieve index files when actually needed.
A busy server could end up with a *large* number of index files, mostly
referring to objects from other clients. Downloading all the indexes not only
wastes bandwidth, but causes a more insidious problem: small servers end up
having to mmap a huge number of large index files, which sucks lots of RAM.
In general, the RAM on a server is roughly proportional to the disk space on
that server. So it's okay for larger clients to need more RAM in order
to complete a backup. However, it's not okay for the existence of larger
clients to make smaller clients suffer. Hopefully this change will settle
it a bit.
Avery Pennarun [Tue, 12 Jan 2010 01:06:08 +0000 (20:06 -0500)]
Reduce default max objects per pack to 200,000 to save memory.
After some testing, it seems each object sha1 we need to cache while writing
a pack costs us about 83 bytes of memory. (This isn't so great, so
optimizing it in C later could cut this down a lot.) The new limit of 200k
objects takes about 16.6 megs of RAM, which nowadays is pretty acceptable.
It also corresponds to roughly 1GB of packfile for my random selection of
sample data, so (since the default packfile limit is about 1GB anyway), this
*mostly* won't matter.
It will have an effect if your data is highly compressible, however; an
8192-byte object could compress down to a very small size and you'd end up
with a large number of objects. The previous default limit of 10 million
objects was ridiculous, since that would take 830 megs of RAM.
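
For reference, the numbers above work out like this. The 83 bytes per object is the measured figure from the message; the function name is made up and "megs" is taken to mean 10**6 bytes.

```python
BYTES_PER_OBJECT = 83  # measured cost per cached sha1, from the text above


def cache_megs(max_objects):
    """Approximate RAM (in millions of bytes) needed to cache that many sha1s."""
    return max_objects * BYTES_PER_OBJECT / 1e6


print(cache_megs(200000))      # new limit
print(cache_megs(10000000))    # previous limit
```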
Avery Pennarun [Mon, 11 Jan 2010 20:18:35 +0000 (15:18 -0500)]
Merge branch 'cygwin'
* cygwin:
Assorted cleanups to Luke's cygwin fixes.
Makefile: work with cygwin on different windows versions.
.gitignore sanity.
Makefile: On Windows, executable files must end with .exe.
client.py: Windows files don't support ':', so rename cachedir.
index.py: os.rename() fails on Windows if dstfile already exists.
Don't try to rename tmpfiles into existing open files.
helpers.py: Cygwin doesn't support `hostname -f`, use `hostname`.
cmd-index.py: Retry os.open without O_LARGEFILE if not supported.
Makefile: Build on Windows under Cygwin.
Avery Pennarun [Mon, 11 Jan 2010 20:06:03 +0000 (15:06 -0500)]
Assorted cleanups to Luke's cygwin fixes.
There were a few things that weren't quite done how I would have done them,
so I changed the implementation. Should still work in cygwin, though.
The only actual functional changes are:
- index.Reader.close() now actually sets m=None rather than just closing it
- removed the "if rename fails, then unlink first" logic, which is
seemingly not needed after all.
- rather than special-casing cygwin to use "hostname" instead of "hostname
-f", it turns out python has a socket.getfqdn() that does what we want.
Avery Pennarun [Mon, 11 Jan 2010 19:57:23 +0000 (14:57 -0500)]
Makefile: work with cygwin on different windows versions.
Just check the CYGWIN part; don't depend on the fact that it's NT 5.1. (Of
course, uname isn't supposed to report such things by default anyway... but
that's cygwin for you.)
Lukasz Kosewski [Sun, 10 Jan 2010 09:04:17 +0000 (04:04 -0500)]
Don't try to rename tmpfiles into existing open files.
Linux and friends have no problem with this, but Windows doesn't allow
this without some effort, which we can avoid by... not needing to write
to an already-open file.
Give index.Reader a 'close' method which identifies and closes any open
mmaped files, and make cmd-index.py use this before trying to close an
index.Writer instance (which renames a tmpfile into the same file the
Reader has mmaped).
Lukasz Kosewski [Sun, 10 Jan 2010 08:57:42 +0000 (03:57 -0500)]
cmd-index.py: Retry os.open without O_LARGEFILE if not supported.
Python under Cygwin doesn't have os.O_LARGEFILE, so if we receive an
'AttributeError' exception trying to open something, just remove
O_LARGEFILE and try again.
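
The retry pattern looks roughly like this. It is a sketch only: the function name is hypothetical and the real cmd-index.py code may combine other flags.

```python
import os


def open_readonly(path):
    """Open with O_LARGEFILE where the platform has it, without it otherwise."""
    try:
        flags = os.O_RDONLY | os.O_LARGEFILE
    except AttributeError:
        # This Python build has no os.O_LARGEFILE: drop the flag and carry on.
        flags = os.O_RDONLY
    return os.open(path, flags)
```

An equivalent one-liner would be getattr(os, 'O_LARGEFILE', 0), which yields a harmless zero flag on platforms that lack the constant.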
Lukasz Kosewski [Sun, 10 Jan 2010 08:52:52 +0000 (03:52 -0500)]
Makefile: Build on Windows under Cygwin.
- Python modules have to end with .dll instead of .so to load into Python
via 'import'.
- GCC under Windows builds all programs with -fPIC, and doesn't accept
this command-line option.
- libpython2.5.dll is found in /usr/bin under Cygwin (wtf?), so we need
to add this to the LDFLAGS path.
- 'make clean' should remove .dll files too.
Avery Pennarun [Sun, 10 Jan 2010 06:13:10 +0000 (01:13 -0500)]
This adds the long-awaited indexfile feature, so you no longer have to feed
your backups through tar.
Okay, 'bup save' is still a bit weak... but it could be much worse.
Merge branch 'indexfile'
* indexfile:
Minor fix for python 2.4.4 compatibility.
cmd-save: completely reimplement using the indexfile.
Moved some reusable index-handling code from cmd-index.py to index.py.
A bunch of wvtests for the 'bup index' command.
Start using wvtest.sh for shell-based tests in test-sh.
cmd-index: default indexfile path is ~/.bup/bupindex, not $PWD/index
cmd-index: skip merging the index if nothing was written to the new one.
cmd-index: only update if -u is given; print only given file/dirnames.
cmd-index: correct reporting of deleted vs. added vs. modified status.
Generalize the multi-index-walking code.
cmd-index: indexfiles should start with a well-known header.
cmd-index: eliminate redundant paths from index update command.
cmd-index: some handy options.
index: add --xdev (--one-file-system) option.
Fix some bugs with indexing '/'
cmd-index: basic index reader/writer/merger.
Avery Pennarun [Sun, 10 Jan 2010 03:43:48 +0000 (22:43 -0500)]
cmd-save: completely reimplement using the indexfile.
'bup save' no longer walks the filesystem: instead it walks the indexfile
(which is much faster) and doesn't bother opening any files that haven't had
an attribute change, since it can just reuse their sha1 from before. That
makes it *much* faster in the common case.
Avery Pennarun [Sun, 10 Jan 2010 00:27:26 +0000 (19:27 -0500)]
cmd-index: only update if -u is given; print only given file/dirnames.
cmd-index now does two things:
- it updates the index with the given names if -u is given
- it prints the index if -p, -s, or -m are given.
In both cases, if filenames are given, it operates (recursively) on the
given filenames or directories. If no filenames are given, -u fails (we
don't want to default to /; it's too slow) but -p/s/m just prints the whole
index.
Avery Pennarun [Sun, 10 Jan 2010 00:07:05 +0000 (19:07 -0500)]
cmd-index: correct reporting of deleted vs. added vs. modified status.
A file with an all-zero sha1 is considered Added instead of Modified, since
it has obviously *never* had a valid sha1. (A modified file has an old
sha1, but IX_HASHVALID isn't set.)
We also now don't remove old files from the index - for now - so that we can
report old files with a D status. This might perhaps be useful eventually.
Furthermore, we had a bug where reindexing a particular filename would
"sometimes" cause siblings of that file to be marked as deleted. The
sibling entries should never be updated, because we didn't check them and
thus have no idea of their new status. This bug was mostly caused by the
silly way we currently pass dirnames and filenames around...
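
The A/M/D rules described in this message can be sketched as a small function. Only IX_HASHVALID is named above; the IX_EXISTS flag, both numeric values and the function itself are assumptions made for the example.

```python
IX_EXISTS = 0x8000     # assumed flag: file was present at last index update
IX_HASHVALID = 0x4000  # assumed value: stored sha1 matches the file contents
EMPTY_SHA = b'\0' * 20


def status(flags, sha):
    """Return 'D', 'A', 'M' or ' ' for one index entry."""
    if not flags & IX_EXISTS:
        return 'D'   # kept in the index so it can be reported as deleted
    if not flags & IX_HASHVALID:
        # An all-zero sha1 has never been valid, so the file is new;
        # otherwise there is an old sha1 that is now out of date.
        return 'A' if sha == EMPTY_SHA else 'M'
    return ' '       # up to date
```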
Avery Pennarun [Thu, 7 Jan 2010 23:54:40 +0000 (18:54 -0500)]
cmd-index: eliminate redundant paths from index update command.
If someone asks to update "/etc" and "/etc/passwd", the latter is redundant
because it's included in the first. Don't bother updating the file twice
(and thus causing two index merges, etc).
Ideally we would only do one merge for *any* number of updates (eg. /etc and
/var). This should be possible as long as we sort the entries correctly
(/var/ and then /etc/), since a single sequential indexfile could just have
one appended to the other. But we don't do that yet.
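
One simple way to drop the redundant paths is shown below. This is a sketch under the assumption of POSIX-style paths, with a made-up function name; the real cmd-index code may do it differently.

```python
import posixpath


def reduce_paths(paths):
    """Drop every path that is contained in another path in the list."""
    # Normalize ('/etc/' -> '/etc'), dedupe, and visit shorter paths first
    # so that a parent is always seen before its children.
    unique = sorted(set(posixpath.normpath(p) for p in paths), key=len)
    kept = []
    for p in unique:
        # Compare against 'parent/' so that '/etc-old' is not mistaken
        # for something inside '/etc'.
        if not any(p.startswith(k.rstrip('/') + '/') for k in kept):
            kept.append(p)
    return sorted(kept)


print(reduce_paths(['/etc', '/etc/passwd', '/var']))
```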
Avery Pennarun [Thu, 7 Jan 2010 23:43:02 +0000 (18:43 -0500)]
cmd-index: some handy options.
New options:
--modified: print only files that aren't up to date
--status: prefix printouts with status chars
--fake-valid: mark all entries as up to date
--indexfile: override the default index filename
Avery Pennarun [Wed, 6 Jan 2010 21:42:54 +0000 (16:42 -0500)]
splitting to a remote server would cause "already busy" errors.
Specifically:
client.ClientError: already busy with command 'receive-objects'
That's because recent changes removed the call to onclose() from
PackWriter_Remote. Now it's back, plus I added an extra unit test to reveal
the problem.
Avery Pennarun [Wed, 6 Jan 2010 18:03:23 +0000 (13:03 -0500)]
client: enhance the PATH when searching for the 'bup' binary.
Automatically adds the *local* $PWD to the *remote* $PATH before trying to
run 'bup server'. That way, if you build the source in exactly the same
folder on two machines - or if those two machines are actually the same
machine and you're just doing a test against localhost - it'll work.
I hereby curse both "sh -c <command>" and "ssh hostname -- <command>" for
not allowing a sensible way to just set argv[] without doing any stupid
quoting. Nasty.
Avery Pennarun [Wed, 6 Jan 2010 17:07:59 +0000 (12:07 -0500)]
Much more user-friendly error messages when bup can't exec the server.
...which happens unfortunately often, including in 'make test' when PATH
doesn't include bup. I'll fix that next. But it makes sense to fix the
error messages first :)
Avery Pennarun [Wed, 6 Jan 2010 05:19:11 +0000 (00:19 -0500)]
split: Prevent memory drain from excessively long shalists.
This avoids huge RAM usage when you're splitting a really huge object, plus
git probably doesn't work too well with single trees that contain millions
of objects anyway.
Avery Pennarun [Wed, 6 Jan 2010 04:42:15 +0000 (23:42 -0500)]
Split packs around 10M objects or 1G bytes.
This will make pruning much easier later, plus avoids any problems with
packs >= 2GB (not that we've had any of those yet, but...), plus avoids
wasting RAM with an overly full MultiPackIndex.also{} dictionary.
Avery Pennarun [Wed, 6 Jan 2010 04:50:41 +0000 (23:50 -0500)]
OOPS! Was writing one byte at a time to the server.
_raw_write() expects a list, not a string, so it was iterating over it
character by character. Magically it worked anyway. Which is sort of cool,
and yet not.
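
The "magic" is easy to reproduce. The function below is a hypothetical stand-in for _raw_write(), and it uses text strings because in Python 3 iterating over bytes yields ints rather than one-character strings:

```python
import io


def raw_write(out, chunks):
    """Write each item of `chunks` separately; return the number of writes."""
    calls = 0
    for chunk in chunks:
        out.write(chunk)
        calls += 1
    return calls


buf = io.StringIO()
print(raw_write(buf, 'hello'))    # a bare string: one write per character
print(raw_write(buf, ['hello']))  # a list: one write for the whole string
```

Both calls produce the same output, which is why the bug went unnoticed; only the number of writes differs.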
Avery Pennarun [Wed, 6 Jan 2010 03:21:18 +0000 (22:21 -0500)]
Fix compatibility with git 1.5.4.3 (Ubuntu Hardy).
Thanks to Andy Chong for reporting the problem.
Basically it comes down to two things that are missing in that version but
exist in git 1.5.6:
- git init --bare doesn't work, but git --bare init does.
- git cat-file --batch doesn't exist in that version.
Unfortunately, the latter problem is pretty serious; bup join is really slow
without it. I guess it might be time to implement an internal version of
cat-file.
Avery Pennarun [Mon, 4 Jan 2010 16:48:38 +0000 (11:48 -0500)]
Fix two bugs reported by dcoombs.
test-sh was assuming 'bup' was on the PATH. (It wasn't *supposed* to be
assuming that, but the "alias bup=whatever" line wasn't working,
apparently.)
randomgen.c triggered a warning in some versions of gcc about the return
value of write() being ignored. It really doesn't bother me if some of my
random bytes don't get written, but whatever; I'll assert instead, which
should shut it up.
Avery Pennarun [Sun, 3 Jan 2010 11:17:30 +0000 (06:17 -0500)]
Support incremental backups to a remote server.
We now cache the server's packfile indexes locally, so we know which objects
he does and doesn't have. That way we can send him a packfile with only the
ones he's missing.
cmd-split supports this now, but cmd-save still doesn't support remote
servers.
The -n option (set a ref correctly) doesn't work yet either.
Avery Pennarun [Sun, 3 Jan 2010 10:00:38 +0000 (05:00 -0500)]
Extremely basic 'bup server' support.
It's enough to send a pack to the remote end with 'bup split', though 'bup
save' doesn't support it yet, and we're not smart enough to do incremental
backups, which means we generate the gigantic pack every single time.