Avery Pennarun [Sat, 2 Jan 2010 09:16:25 +0000 (04:16 -0500)]
Write git pack files instead of loose object files.
This causes much, much less disk grinding than creating zillions of files,
plus it's even more disk space efficient.
We could theoretically make it go even faster by generating the .idx file
ourselves, but for now, we just call "git index-pack" to do it. That
helpfully also confirms that the data was written in a git-compatible way.
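A minimal sketch of the pack format being written here, assuming a single blob object (function names like write_pack are illustrative, not bup's actual API; the real code streams many objects and then runs "git index-pack" on the result to build the .idx and verify it):

```python
import hashlib, struct, zlib

OBJ_BLOB = 3  # git's object type number for blobs

def encode_obj_header(obj_type, size):
    """Encode git's variable-length pack object header."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)   # high bit set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)

def write_pack(blobs):
    """Return the bytes of a version-2 pack containing the given blobs."""
    body = struct.pack('>4sLL', b'PACK', 2, len(blobs))
    for data in blobs:
        body += encode_obj_header(OBJ_BLOB, len(data))
        body += zlib.compress(data)
    # a pack ends with a SHA-1 over everything that precedes it
    return body + hashlib.sha1(body).digest()

pack = write_pack([b'hello world\n'])
```

Writing one sequential pack file instead of one loose file per object is what avoids the disk grinding: a single stream of appends rather than zillions of creates and fsyncs.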
Avery Pennarun [Sat, 2 Jan 2010 06:46:06 +0000 (01:46 -0500)]
'bup split': speed optimization for never-ending blocks.
For blocks that never hit a split point (e.g. huge endless streams of zeroes), we
would constantly scan and re-scan the same sub-blocks, making things go
really slowly. In such a bad situation, there's no point in being so careful;
just dump the *entire* input buffer to a chunk and move on. This vastly
speeds up splitting of files with lots of blank space in them, e.g.
VirtualBox images.
Also add a cache for git.hash_raw() so it doesn't have to stat() the same
blob files over and over if the same blocks (especially zeroes) occur more
than once.
Avery Pennarun [Wed, 30 Dec 2009 23:18:35 +0000 (18:18 -0500)]
Add a '-t' option to 'bup split' and make 'bup join' support it.
This lets you generate a git "tree" object with the list of hashes in a
particular file, so you can treat the file as a directory as far as git is
concerned. And 'bup join' knows how to take a tree and concatenate it
together to reverse the operation.
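A toy sketch of that split/join round trip, using an in-memory object store instead of git (the store layout and function names are illustrative only):

```python
import hashlib

store = {}

def put(data):
    """Store data under its hash, like writing a git object."""
    sha = hashlib.sha1(data).hexdigest()
    store[sha] = data
    return sha

def split_to_tree(data, chunk=4):
    """'bup split -t' in miniature: store chunks, return a 'tree' of them."""
    hashes = [put(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    return put('\n'.join(hashes).encode())  # the tree: a list of chunk hashes

def join(tree_sha):
    """'bup join' in miniature: walk the tree, concatenate its chunks."""
    return b''.join(store[h] for h in store[tree_sha].decode().split('\n'))

tree = split_to_tree(b'hello, tree world')
original = join(tree)
```

The tree hash alone is enough to reconstruct the file, which is what lets git treat the split file as a directory of chunks.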
Avery Pennarun [Wed, 30 Dec 2009 08:28:36 +0000 (03:28 -0500)]
Use t# instead of et# for hashsplit parameter type.
This lets us work with any kind of buffer object, which means there's no
unnecessary data copying when coming into our module. Causes a bit of
speedup.
Also refactored the checksum code for easier experimentation.
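The point of accepting generic buffer objects can be seen from the Python side (a hedged illustration, not bup's code): a memoryview slice of a large bytearray shares the underlying memory, so a consumer that speaks the buffer protocol, like the hashsplit C module or hashlib here, reads it without any copy being made:

```python
import hashlib

buf = bytearray(range(256)) * 4096      # ~1 MB of data
view = memoryview(buf)[1000:2000]       # no bytes are copied here
digest = hashlib.sha1(view).hexdigest() # hashlib accepts any buffer object
```

By contrast, a format code that insists on a string forces the caller to materialize a copy of the slice first.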
Avery Pennarun [Wed, 30 Dec 2009 07:33:35 +0000 (02:33 -0500)]
Add a C module to do the rolling checksum.
This is about 80x faster than the old speed (27megs/sec instead of 330k/sec)
but still quite a lot slower than the 60+megs/sec I get *without* the
checksum stuff. There are a few inefficiencies remaining, but not such easy
ones as before...
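A sketch of an rsync-style rolling checksum like the one the C module implements (the window size and bit layout here are illustrative, not bup's exact constants): s1 is the sum of the window's bytes, s2 the sum of the running s1 values, and both update in O(1) as the window slides one byte, which is what makes the C version fast:

```python
WINDOW = 64  # assumed window size for this sketch

def roll(data):
    """Yield the checksum of each WINDOW-sized window as it slides."""
    s1 = sum(data[:WINDOW])
    s2 = sum((WINDOW - i) * b for i, b in enumerate(data[:WINDOW]))
    yield ((s2 & 0xFFFF) << 16) | (s1 & 0xFFFF)
    for i in range(WINDOW, len(data)):
        drop, add = data[i - WINDOW], data[i]
        s1 += add - drop               # remove the old byte, add the new
        s2 += s1 - WINDOW * drop       # same update for the weighted sum
        yield ((s2 & 0xFFFF) << 16) | (s1 & 0xFFFF)

data = bytes((i * 37) % 256 for i in range(300))
sums = list(roll(data))
```

Each step is two additions and a multiply, independent of the window size, versus recomputing the whole window per byte.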
Avery Pennarun [Wed, 30 Dec 2009 06:08:27 +0000 (01:08 -0500)]
datagen.c: a quick program to generate a repeatable series of bytes.
Useful for testing. Note that we *don't* seed the random number generator,
so every time you generate the bytes, you get the same sequence.
This is also vastly faster than /dev/urandom, since it doesn't try to be
cryptographically secure. It generates about 200 megs/sec on my computer,
which is much faster than a disk and thus useful for testing the speed of
hashsplit.
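The same idea sketched in Python (the original datagen.c is C): a PRNG with a fixed seed makes the stream repeatable, and skipping /dev/urandom's cryptographic machinery is what makes it fast enough for benchmarking:

```python
import random

def datagen(nbytes, seed=0):
    """Return nbytes of pseudo-random data; same seed, same bytes."""
    rng = random.Random(seed)  # fixed seed => repeatable sequence
    return bytes(rng.getrandbits(8) for _ in range(nbytes))

a = datagen(4096)
b = datagen(4096)  # identical to a, by construction
```

Repeatability matters here: two runs over "random" data can only be compared if both runs saw the same bytes.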