Avery Pennarun [Wed, 6 Jan 2010 17:07:59 +0000 (12:07 -0500)]
Much more user-friendly error messages when bup can't exec the server.
...which happens unfortunately often, including in 'make test' when PATH
doesn't include bup. I'll fix that next. But it makes sense to fix the
error messages first :)
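As a rough illustration of the idea (not the code from this commit), a launcher can catch the failed exec and report which command could not be run. The function name `start_server` and the message wording are assumptions made for this sketch.

```python
import subprocess

def start_server(argv):
    """Try to launch the server command; return (process, error_message)."""
    try:
        proc = subprocess.Popen(argv,
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        return proc, None
    except OSError as e:
        # OSError covers "command not found" and "permission denied" alike.
        msg = "error: can't run %r: %s (is it on your PATH?)" % (
            argv[0], e.strerror)
        return None, msg

# A command name that should not exist on any PATH (hypothetical).
proc, err = start_server(["no-such-bup-server-command"])
if proc is None:
    print(err)
```

Naming the command and hinting at PATH in the message points the user at the most likely cause instead of leaving them with a bare traceback.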
Avery Pennarun [Wed, 6 Jan 2010 05:19:11 +0000 (00:19 -0500)]
split: Prevent memory drain from excessively long shalists.
This avoids huge RAM usage when you're splitting a really huge object, plus
git probably doesn't work too well with single trees that contain millions
of objects anyway.
Avery Pennarun [Wed, 6 Jan 2010 04:42:15 +0000 (23:42 -0500)]
Split packs around 100M objects or 1G bytes.
This will make pruning much easier later, plus avoids any problems with
packs >= 2GB (not that we've had any of those yet, but...), plus avoids
wasting RAM with an overly full MultiPackIndex.also{} dictionary.
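A minimal sketch of splitting on whichever limit is hit first, assuming a hypothetical `PackSplitter` class that only tracks counts and sizes. The real pack writer and its exact thresholds are not shown here.

```python
class PackSplitter:
    """Start a new pack once an object-count or byte-size limit is reached."""

    def __init__(self, max_objects, max_bytes):
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.packs = []   # (object_count, byte_count) of each finished pack
        self.count = 0    # objects in the pack currently being written
        self.size = 0     # bytes in the pack currently being written

    def add(self, obj_bytes):
        self.count += 1
        self.size += len(obj_bytes)
        # Whichever limit is reached first closes the current pack.
        if self.count >= self.max_objects or self.size >= self.max_bytes:
            self.finish()

    def finish(self):
        if self.count:
            self.packs.append((self.count, self.size))
        self.count = 0
        self.size = 0

splitter = PackSplitter(max_objects=3, max_bytes=10)
for _ in range(7):
    splitter.add(b"ab")
splitter.finish()
print(splitter.packs)
```

The limits above are tiny so the behaviour is visible; a real caller would pass values on the scale the commit message describes.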
Avery Pennarun [Wed, 6 Jan 2010 04:50:41 +0000 (23:50 -0500)]
OOPS! Was writing one byte at a time to the server.
_raw_write() expects a list, not a string, so it was iterating over it
character by character. Magically it worked anyway. Which is sort of cool,
and yet not.
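The effect is easy to reproduce. In this sketch, `raw_write` is a hypothetical stand-in for `_raw_write()`: it loops over whatever it is given, so a bare string turns into one write per character while still producing the same output.

```python
import io

def raw_write(out, datalist):
    """Write each item of datalist to out; return the number of write() calls."""
    calls = 0
    for chunk in datalist:
        out.write(chunk)
        calls += 1
    return calls

data = "hello world"

buggy_out = io.StringIO()
buggy_calls = raw_write(buggy_out, data)     # string: iterated char by char

fixed_out = io.StringIO()
fixed_calls = raw_write(fixed_out, [data])   # list: one write for the string

print(buggy_calls, fixed_calls)                      # 11 1
print(buggy_out.getvalue() == fixed_out.getvalue())  # True
```

Both calls produce identical output, which is why the bug went unnoticed: only the number of writes differs.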
Avery Pennarun [Wed, 6 Jan 2010 03:21:18 +0000 (22:21 -0500)]
Fix compatibility with git 1.5.4.3 (Ubuntu Hardy).
Thanks to Andy Chong for reporting the problem.
Basically it comes down to two things that are missing in that version but
exist in git 1.5.6:
- git init --bare doesn't work, but git --bare init does.
- git cat-file --batch doesn't exist in that version.
Unfortunately, the latter problem is pretty serious; bup join is really slow
without it. I guess it might be time to implement an internal version of
cat-file.
Avery Pennarun [Mon, 4 Jan 2010 16:48:38 +0000 (11:48 -0500)]
Fix two bugs reported by dcoombs.
test-sh was assuming 'bup' was on the PATH. (It wasn't *supposed* to be
assuming that, but the "alias bup=whatever" line wasn't working,
apparently.)
randomgen.c triggered a warning in some versions of gcc about the return
value of write() being ignored. It really doesn't bother me if some of my
random bytes don't get written, but whatever; I'll assert instead, which
should shut it up.
Avery Pennarun [Sun, 3 Jan 2010 11:17:30 +0000 (06:17 -0500)]
Support incremental backups to a remote server.
We now cache the server's packfile indexes locally, so we know which objects
he does and doesn't have. That way we can send him a packfile with only the
ones he's missing.
cmd-split supports this now, but cmd-save still doesn't support remote
servers.
The -n option (set a ref correctly) doesn't work yet either.
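The core of the incremental scheme can be sketched as a set lookup: hash each local object and skip the ones whose id already appears in the cached server indexes. The function names are hypothetical, and hashing the raw bytes is a simplification (git hashes a typed header plus the content).

```python
import hashlib

def object_id(data):
    """Simplified object id: SHA-1 of the raw bytes (git also hashes a header)."""
    return hashlib.sha1(data).digest()

def objects_to_send(local_objects, cached_server_ids):
    """Return only the objects whose ids are absent from the cached indexes."""
    return [obj for obj in local_objects
            if object_id(obj) not in cached_server_ids]

# Pretend the locally cached index says the server already has b"old".
server_ids = {object_id(b"old")}
print(objects_to_send([b"old", b"new"], server_ids))  # [b'new']
```

Because the lookup runs against the local cache, no round trip to the server is needed to decide what goes into the outgoing packfile.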
Avery Pennarun [Sun, 3 Jan 2010 10:00:38 +0000 (05:00 -0500)]
Extremely basic 'bup server' support.
It's enough to send a pack to the remote end with 'bup split', though 'bup
save' doesn't support it yet, and we're not smart enough to do incremental
backups, which means we generate the gigantic pack every single time.
Avery Pennarun [Sat, 2 Jan 2010 09:16:25 +0000 (04:16 -0500)]
Write git pack files instead of loose object files.
This causes much, much less disk grinding than creating zillions of files,
plus it's even more disk space efficient.
We could theoretically make it go even faster by generating the .idx file
ourselves, but for now, we just call "git index-pack" to do it. That
helpfully also confirms that the data was written in a git-compatible way.
Avery Pennarun [Sat, 2 Jan 2010 06:46:06 +0000 (01:46 -0500)]
'bup split': speed optimization for never-ending blocks.
For blocks which never got split (eg. huge endless streams of zeroes) we
would constantly scan and re-scan the same sub-blocks, making things go
really slowly. In such a bad situation, there's no point in being so careful;
just dump the *entire* input buffer to a chunk and move on. This vastly
speeds up splitting of files with lots of blank space in them, eg.
VirtualBox images.
Also add a cache for git.hash_raw() so it doesn't have to stat() the same
blob files over and over if the same blocks (especially zeroes) occur more
than once.
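A minimal sketch of the "give up and dump the buffer" fallback, assuming a toy split rule (a window of bytes whose sum has its low four bits set) in place of the real rolling checksum. `WINDOW`, `MASK`, `find_split` and `next_chunk` are all names invented for this example.

```python
WINDOW = 4     # toy window size, not the real one
MASK = 0x0F    # split when the low four bits of the window sum are all ones

def find_split(buf):
    """Return the offset just past the first matching window, or 0 if none."""
    for end in range(WINDOW, len(buf) + 1):
        if sum(buf[end - WINDOW:end]) & MASK == MASK:
            return end
    return 0

def next_chunk(buf):
    """Return (chunk, rest). With no split point, emit the whole buffer."""
    ofs = find_split(buf)
    if ofs:
        return buf[:ofs], buf[ofs:]
    # No split point anywhere: don't keep the buffer around to be scanned
    # again on the next call, just dump all of it as one chunk.
    return buf, b""

zeroes = bytes(1000)
chunk, rest = next_chunk(zeroes)
print(len(chunk), len(rest))  # 1000 0
```

A run of zeroes never satisfies the toy rule, so without the fallback every call would scan the same bytes again as more input arrives.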
Avery Pennarun [Wed, 30 Dec 2009 23:18:35 +0000 (18:18 -0500)]
Add a '-t' option to 'bup split' and make 'bup join' support it.
This lets you generate a git "tree" object with the list of hashes in a
particular file, so you can treat the file as a directory as far as git is
concerned. And 'bup join' knows how to take a tree and concatenate it
together to reverse the operation.
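The round trip can be sketched with a dictionary standing in for the git object store and fixed-size chunks standing in for hashsplit. Both are assumptions for this example, as are the function names.

```python
import hashlib

def split_to_tree(data, chunk_size, store):
    """Store each chunk under its SHA-1 and return the ordered list of hashes."""
    tree = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        sha = hashlib.sha1(chunk).hexdigest()
        store[sha] = chunk
        tree.append(sha)
    return tree

def join_tree(tree, store):
    """Concatenate the chunks named by the tree, in order."""
    return b"".join(store[sha] for sha in tree)

store = {}
data = b"abcdefghij"
tree = split_to_tree(data, 4, store)
print(len(tree))                       # 3
print(join_tree(tree, store) == data)  # True
```

The tree is just an ordered list of hashes, so joining is a matter of walking it and concatenating the blobs it names.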
Avery Pennarun [Wed, 30 Dec 2009 08:28:36 +0000 (03:28 -0500)]
Use t# instead of et# for hashsplit parameter type.
This lets us work with any kind of buffer object, which means there's no
unnecessary data copying when coming into our module. Causes a bit of
speedup.
Also refactored the checksum code for easier experimentation.
Avery Pennarun [Wed, 30 Dec 2009 07:33:35 +0000 (02:33 -0500)]
Add a C module to do the rolling checksum.
This is about 80x faster than the old speed (27megs/sec instead of 330k/sec)
but still quite a lot slower than the 60+megs/sec I get *without* the
checksum stuff. There are a few inefficiencies remaining, but not such easy
ones as before...
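For readers unfamiliar with the technique, here is a minimal rolling checksum in Python. It uses a plain byte sum over a sliding window, which is an assumption made for clarity and not the checksum the C module implements; the class name `RollingSum` is likewise hypothetical.

```python
class RollingSum:
    """Sum of the last `window` bytes, updated in O(1) per byte."""

    def __init__(self, window):
        self.window = window
        self.buf = bytearray(window)  # ring buffer, starts as all zeroes
        self.pos = 0
        self.total = 0

    def roll(self, byte):
        # Add the incoming byte and drop the one leaving the window.
        self.total += byte - self.buf[self.pos]
        self.buf[self.pos] = byte
        self.pos = (self.pos + 1) % self.window
        return self.total

data = bytes(range(50))
r = RollingSum(8)
for i, b in enumerate(data):
    rolled = r.roll(b)
    naive = sum(data[max(0, i - 7):i + 1])  # recompute the window from scratch
    assert rolled == naive
print("rolling sum matches naive recomputation")
```

Each byte costs one addition and one subtraction regardless of window size, and this per-byte loop is the part that benefits from being moved into C.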
Avery Pennarun [Wed, 30 Dec 2009 06:08:27 +0000 (01:08 -0500)]
datagen.c: a quick program to generate a repeatable series of bytes.
Useful for testing. Note that we *don't* seed the random number generator,
so every time you generate the bytes, you get the same sequence.
This is also vastly faster than /dev/urandom, since it doesn't try to be
cryptographically secure. It generates about 200 megs/sec on my computer,
which is much faster than a disk and thus useful for testing the speed of
hashsplit.
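The same property can be sketched in Python: a generator that always starts from the same state yields the same bytes on every run. Unlike a C generator left at its default state, Python's `random.Random()` picks a fresh seed when none is given, so this sketch passes a fixed seed explicitly. The function name `generate` and the seed value are assumptions.

```python
import random

def generate(nbytes, seed=1):
    """Return nbytes of pseudo-random data that is identical on every call."""
    rng = random.Random(seed)  # fixed seed, so the sequence is repeatable
    return bytes(rng.getrandbits(8) for _ in range(nbytes))

first = generate(64)
second = generate(64)
print(first == second)  # True
```

Repeatable input means two test runs can be compared byte for byte, which random data from /dev/urandom would not allow.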