Avery Pennarun [Sat, 2 Jan 2010 09:16:25 +0000 (04:16 -0500)]
Write git pack files instead of loose object files.
This causes much, much less disk grinding than creating zillions of files,
plus it's even more disk space efficient.
We could theoretically make it go even faster by generating the .idx file
ourselves, but for now, we just call "git index-pack" to do it. That
helpfully also confirms that the data was written in a git-compatible way.
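A minimal sketch of the pack format being written here, assuming a single blob object (function names like write_pack are illustrative, not bup's actual API; the real code streams many objects and then runs "git index-pack" on the result to build the .idx and verify it):

```python
import hashlib, struct, zlib

OBJ_BLOB = 3  # git's object type number for blobs

def encode_obj_header(obj_type, size):
    """Encode git's variable-length pack object header."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)   # high bit set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)

def write_pack(blobs):
    """Return the bytes of a version-2 pack containing the given blobs."""
    body = struct.pack('>4sLL', b'PACK', 2, len(blobs))
    for data in blobs:
        body += encode_obj_header(OBJ_BLOB, len(data))
        body += zlib.compress(data)
    # a pack ends with a SHA-1 over everything that precedes it
    return body + hashlib.sha1(body).digest()

pack = write_pack([b'hello world\n'])
```

Writing one sequential pack file instead of one loose file per object is what avoids the disk grinding: a single stream of appends rather than zillions of creates and fsyncs.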
Avery Pennarun [Sat, 2 Jan 2010 06:46:06 +0000 (01:46 -0500)]
'bup split': speed optimization for never-ending blocks.
For blocks that never hit a split point (e.g. huge endless streams of zeroes), we
would constantly scan and re-scan the same sub-blocks, making things go
really slowly. In such a bad situation, there's no point in being so careful;
just dump the *entire* input buffer to a chunk and move on. This vastly
speeds up splitting of files with lots of blank space in them, e.g.
VirtualBox images.
Also add a cache for git.hash_raw() so it doesn't have to stat() the same
blob files over and over if the same blocks (especially zeroes) occur more
than once.
Avery Pennarun [Wed, 30 Dec 2009 23:18:35 +0000 (18:18 -0500)]
Add a '-t' option to 'bup split' and make 'bup join' support it.
This lets you generate a git "tree" object with the list of hashes in a
particular file, so you can treat the file as a directory as far as git is
concerned. And 'bup join' knows how to take a tree and concatenate it
together to reverse the operation.
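A toy sketch of that split/join round trip, using an in-memory object store instead of git (the store layout and function names are illustrative only):

```python
import hashlib

store = {}

def put(data):
    """Store data under its hash, like writing a git object."""
    sha = hashlib.sha1(data).hexdigest()
    store[sha] = data
    return sha

def split_to_tree(data, chunk=4):
    """'bup split -t' in miniature: store chunks, return a 'tree' of them."""
    hashes = [put(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    return put('\n'.join(hashes).encode())  # the tree: a list of chunk hashes

def join(tree_sha):
    """'bup join' in miniature: walk the tree, concatenate its chunks."""
    return b''.join(store[h] for h in store[tree_sha].decode().split('\n'))

tree = split_to_tree(b'hello, tree world')
original = join(tree)
```

The tree hash alone is enough to reconstruct the file, which is what lets git treat the split file as a directory of chunks.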
Avery Pennarun [Wed, 30 Dec 2009 08:28:36 +0000 (03:28 -0500)]
Use t# instead of et# for hashsplit parameter type.
This lets us work with any kind of buffer object, which means there's no
unnecessary data copying when coming into our module. Causes a bit of
speedup.
Also refactored the checksum code for easier experimentation.
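The point of accepting generic buffer objects can be seen from the Python side (a hedged illustration, not bup's code): a memoryview slice of a large bytearray shares the underlying memory, so a consumer that speaks the buffer protocol, like the hashsplit C module or hashlib here, reads it without any copy being made:

```python
import hashlib

buf = bytearray(range(256)) * 4096      # ~1 MB of data
view = memoryview(buf)[1000:2000]       # no bytes are copied here
digest = hashlib.sha1(view).hexdigest() # hashlib accepts any buffer object
```

By contrast, a format code that insists on a string forces the caller to materialize a copy of the slice first.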
Avery Pennarun [Wed, 30 Dec 2009 07:33:35 +0000 (02:33 -0500)]
Add a C module to do the rolling checksum.
This is about 80x faster than the old speed (27megs/sec instead of 330k/sec)
but still quite a lot slower than the 60+megs/sec I get *without* the
checksum stuff. There are a few inefficiencies remaining, but not such easy
ones as before...
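A sketch of an rsync-style rolling checksum like the one the C module implements (the window size and bit layout here are illustrative, not bup's exact constants): s1 is the sum of the window's bytes, s2 the sum of the running s1 values, and both update in O(1) as the window slides one byte, which is what makes the C version fast:

```python
WINDOW = 64  # assumed window size for this sketch

def roll(data):
    """Yield the checksum of each WINDOW-sized window as it slides."""
    s1 = sum(data[:WINDOW])
    s2 = sum((WINDOW - i) * b for i, b in enumerate(data[:WINDOW]))
    yield ((s2 & 0xFFFF) << 16) | (s1 & 0xFFFF)
    for i in range(WINDOW, len(data)):
        drop, add = data[i - WINDOW], data[i]
        s1 += add - drop               # remove the old byte, add the new
        s2 += s1 - WINDOW * drop       # same update for the weighted sum
        yield ((s2 & 0xFFFF) << 16) | (s1 & 0xFFFF)

data = bytes((i * 37) % 256 for i in range(300))
sums = list(roll(data))
```

Each step is two additions and a multiply, independent of the window size, versus recomputing the whole window per byte.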
Avery Pennarun [Wed, 30 Dec 2009 06:08:27 +0000 (01:08 -0500)]
datagen.c: a quick program to generate a repeatable series of bytes.
Useful for testing. Note that we *don't* seed the random number generator,
so every time you generate the bytes, you get the same sequence.
This is also vastly faster than /dev/urandom, since it doesn't try to be
cryptographically secure. It generates about 200 megs/sec on my computer,
which is much faster than a disk and thus useful for testing the speed of
hashsplit.
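The same idea sketched in Python (the original datagen.c is C): a PRNG with a fixed seed makes the stream repeatable, and skipping /dev/urandom's cryptographic machinery is what makes it fast enough for benchmarking:

```python
import random

def datagen(nbytes, seed=0):
    """Return nbytes of pseudo-random data; same seed, same bytes."""
    rng = random.Random(seed)  # fixed seed => repeatable sequence
    return bytes(rng.getrandbits(8) for _ in range(nbytes))

a = datagen(4096)
b = datagen(4096)  # identical to a, by construction
```

Repeatability matters here: two runs over "random" data can only be compared if both runs saw the same bytes.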