Also change main.py to search around in appropriate places for the installed
library files. By default, if your bup is in /usr/bin/bup, it'll look in
/usr/lib/bup. (It drops the last two path components from the filename and
adds /lib/bup to the end.)
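The path derivation described above can be sketched roughly as follows; `guess_libdir` is a hypothetical helper name, not the actual function in main.py:

```python
import os

def guess_libdir(exe_path):
    # Drop the last two path components (e.g. /usr/bin/bup -> /usr),
    # then append /lib/bup to get the library directory.
    prefix = os.path.dirname(os.path.dirname(os.path.abspath(exe_path)))
    return os.path.join(prefix, 'lib', 'bup')

print(guess_libdir('/usr/bin/bup'))  # /usr/lib/bup
```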
This also makes the Debian packager at
http://git.debian.org/collab-maint/bup
actually produce a usable package.
It's annoying when your log messages come out before stdout messages do.
But it's equally annoying (and inefficient) to have to flush every time you
print something. This seems like a nice compromise.
Get rid of a sha-related DeprecationWarning in python 2.6.
hashlib is only available in python 2.5 or higher, but the 'sha' module
produces a DeprecationWarning in python 2.6 or higher. We want to support
python 2.4 and above without any stupid warnings, so let's try using
hashlib. If it fails, switch to the old sha module.
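The try-hashlib-first approach amounts to something like this sketch (the alias name `Sha1` is illustrative; on python 3 the fallback branch never runs):

```python
# Prefer hashlib (python >= 2.5); fall back to the deprecated sha module
# on older interpreters, avoiding the DeprecationWarning on 2.6+.
try:
    from hashlib import sha1 as Sha1
except ImportError:
    from sha import sha as Sha1

print(Sha1(b'hello').hexdigest())
```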
Avery Pennarun [Sun, 21 Mar 2010 20:48:23 +0000 (16:48 -0400)]
server: only suggest a max of one pack per receive-objects cycle.
Since the client only handles one at a time and forgets the others anyway,
suggesting others is a bit of a waste of time... and because of the cheating
way we figure out which index to suggest when using a midx, suggesting packs
is more expensive than it should be anyway.
The "correct" fix in the long term will be to make the client accept
multiple suggestions at once, plus make midx files a little smarter about
figuring out which pack is the one that needs to be suggested. But in the
meantime, this makes things a little nicer: there are fewer confusing log
messages from the server, and a lot less disk grinding related to looking
into which pack to suggest, followed by finding out that we've already
suggested that pack anyway.
Avery Pennarun [Sun, 21 Mar 2010 04:41:52 +0000 (00:41 -0400)]
rbackup-cmd: we can now back up a *remote* machine to a *local* server.
The -r option to split and save allowed you to back up from a local machine
to a remote server, but that doesn't always work; sometimes the machine you
want to back up is out on the Internet, and the backup repo is safe behind a
firewall. In that case, you can ssh *out* from the secure backup machine to
the public server, but not vice versa, and you were out of luck. Some
people have apparently been doing this:
ssh publicserver tar -c / | bup split -n publicserver
(ie. running tar remotely, piped to a local bup split) but that isn't
efficient, because it sends *all* the data from the remote server over the
network before deduplicating it locally. Now you can do instead:
bup rbackup publicserver index -vux /
bup rbackup publicserver save -n publicserver /
And get all the usual advantages of 'bup save -r', except the server runs
locally and the client runs remotely.
Avery Pennarun [Sun, 21 Mar 2010 03:03:53 +0000 (23:03 -0400)]
client: Extract 'bup server' connection code into its own module.
The screwball function we use to let us run 'bup xxx' on a remote server
after correctly setting the PATH variable is about to become useful for more
than just 'bup server'.
Avery Pennarun [Sun, 21 Mar 2010 04:36:31 +0000 (00:36 -0400)]
options: allow user to specify an alternative to getopt.gnu_getopt.
The most likely alternative is getopt.getopt, which doesn't rearrange
arguments. That would mean "-a foo -p" is considered as the option "-a"
followed by the non-option arguments ['foo', '-p'].
The non-gnu behaviour is annoying most of the time, but can be useful when
you're receiving command lines that you want to pass verbatim to someone
else.
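The difference between the two getopt flavours is easy to demonstrate with the example from above:

```python
import getopt

argv = ['-a', 'foo', '-p']

# gnu_getopt rearranges arguments: both options are parsed and 'foo'
# is left over as the only non-option argument.
print(getopt.gnu_getopt(argv, 'ap'))

# Plain getopt stops at the first non-option argument, so -p is
# passed through verbatim.
print(getopt.getopt(argv, 'ap'))
```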
Avery Pennarun [Sun, 21 Mar 2010 05:47:24 +0000 (01:47 -0400)]
save/index/drecurse: correct handling for fifos and nonexistent paths.
When indexing a fifo, you can try to open it (for security reasons) but it
has to be O_NDELAY just in case the fifo doesn't have anyone on the other
end; otherwise indexing can freeze.
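A minimal sketch of the non-blocking fifo open, assuming POSIX semantics where O_RDONLY|O_NDELAY on a fifo returns immediately even with no writer attached:

```python
import os, stat, tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, 'fifo')
os.mkfifo(path)

opened = False
st = os.lstat(path)
if stat.S_ISFIFO(st.st_mode):
    # O_NDELAY (a.k.a. O_NONBLOCK) makes open() return right away
    # instead of waiting for someone to open the write end.
    fd = os.open(path, os.O_RDONLY | os.O_NDELAY)
    os.close(fd)
    opened = True

os.unlink(path)
os.rmdir(d)
```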
In index.reduce_paths(), we weren't reporting ENOENT for reasons I can no
longer remember, but I think they must have been wrong. Obviously if
someone specifies a nonexistent path on the command line, we should barf
rather than silently not back it up.
Avery Pennarun [Sun, 21 Mar 2010 04:34:21 +0000 (00:34 -0400)]
main.py: don't leak a file descriptor.
subprocess.Popen() is a little weird about when it closes the file
descriptors you give it. In this case, we have to dup() it because if
stderr=2 (the default) and stdout=2 (because fix_stderr), it'll close fd 2.
But if we dup it first, it *won't* close the dup, because stdout!=stderr.
So we have to dup it, but then we have to close it ourselves.
This was apparently harmless (it just resulted in an extra fd#3 getting
passed around to subprocesses as a clone of fd#2) but it was still wrong.
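The dup-then-close dance looks roughly like this sketch (using a temp file to stand in for stderr; the original quirk described above concerns python 2's Popen, so this is illustrative rather than a faithful reproduction):

```python
import os, subprocess, tempfile

tmp = tempfile.TemporaryFile()
fd = tmp.fileno()

# dup() the descriptor so Popen sees distinct fds for stdout and
# stderr; then we're responsible for closing the dup ourselves.
dup_fd = os.dup(fd)
p = subprocess.Popen(['echo', 'hello'], stdout=dup_fd, stderr=fd)
p.wait()
os.close(dup_fd)  # Popen leaves our dup open; clean it up explicitly

tmp.seek(0)
data = tmp.read()
```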
Lukasz Kosewski [Mon, 15 Mar 2010 03:20:08 +0000 (23:20 -0400)]
cmd/index-cmd.py: How it pains me to have to explicitly close() stuff
If we don't explicitly close() the wr reader object while running
update-index, the corresponding writer object won't be able to unlink
its temporary file under Cygwin.
Lukasz Kosewski [Mon, 15 Mar 2010 03:17:42 +0000 (23:17 -0400)]
lib/bup/index.py: mmap.mmap() objects need to be closed() for Win32.
Not *entirely* sure why this is the case, but it appears that, through some
refcounting weirdness, just setting the mmap variables to None in
index.Reader objects doesn't cause the mmap to be freed under Cygwin, though
I can't find any reason why this would be the case.
Naturally, this caused all sorts of pain when we attempt to unlink
an mmapped file created while running bup index --check -u.
Fix the issue by explicitly .close()ing the mmap in Reader.close().
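The explicit-close pattern, as a self-contained sketch:

```python
import mmap, os, tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'x' * 4096)
tmp.flush()

m = mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ)
first = m[:4]
# On Cygwin/Win32 the mapping must be closed explicitly before the
# file can be unlinked; merely dropping the reference isn't enough.
m.close()
tmp.close()
os.unlink(tmp.name)
```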
Avery Pennarun [Sun, 14 Mar 2010 06:59:45 +0000 (01:59 -0500)]
PackIdxList.refresh(): remember to exclude old midx files.
Previously, if you called refresh(), it would fail to consider
the contents of already-loaded .midx files as already-loaded. That means
it would load all the constituent .idx files, so you'd actually lose all the
advantages of the .midx after the first refresh().
Thus, the midx optimization mainly worked before you filled up your first
pack (about 1GB of data saved) or until you got an index suggestion. This
explains why backups would slow down significantly after running for a
while.
Also, get rid of the stupid forget_packs option; just automatically prune
the packs that aren't relevant after the refresh. This avoids the
possibility of weird behaviour if you set forget_packs incorrectly (which we
did).
Avery Pennarun [Sun, 14 Mar 2010 07:50:05 +0000 (03:50 -0400)]
bup.client: fix freeze when suggest-index after finishing a full pack.
It was just rare enough to be hard to find: if you write an entire pack full
of stuff (1GB or more) and *then* trigger a suggest-index, the client would
freeze because it would send a send-index command without actually
suspending the receive-pack first.
The whole Client/PackWriter separation is pretty gross, so it's not terribly
surprising this would happen.
Add a unit test to detect this case if it ever happens in the future, for
what it's worth.
Avery Pennarun [Sun, 14 Mar 2010 05:55:08 +0000 (00:55 -0500)]
main: even more fixes for signal handling.
If the child doesn't die after the first SIGINT and the user presses ctrl-c
one more time, the main bup process would die instead of forwarding it on to
the child. That's no good; we actually have to loop forwarding signals
until the child is really good and dead.
And if the child refuses to die, well, he's the one with the bug, not
main.py. So main.py should stay alive too in the name of not losing track
of things.
Avery Pennarun [Sat, 13 Mar 2010 01:40:46 +0000 (20:40 -0500)]
cmd/{index,save}: handle ctrl-c without printing a big exception trace.
It's not very exciting to look at a whole stack trace just because someone
hit ctrl-c, especially since that's designed to work fine. Trim it down in
that case.
Avery Pennarun [Fri, 12 Mar 2010 23:46:40 +0000 (18:46 -0500)]
git.PackWriter: avoid pack corruption if interrupted by a signal.
PackWriter tries to "finish" a half-written pack in its destructor if
interrupted. To do this, it flushes the stream, seeks back to the beginning
to update the sha1sum and object count, then runs git-index-pack on it to
create the .idx file.
However, sometimes if you were unlucky, you'd interrupt PackWriter partway
through writing an object to the pack. If only half an object exists at the
end, it would have the wrong header and thus come out as corrupt when
index-pack would run.
Since our objects are meant to be small anyway, just make sure we write
everything all in one file.write() operation. The files themselves are
buffered, so this wouldn't survive a surprise termination of the whole
unix process, but we wouldn't run index-pack in that case anyway, so it
doesn't matter.
Now when I press ctrl-c in 'bup save', it consistently writes the half-saved
objects as it should.
Avery Pennarun [Fri, 12 Mar 2010 23:05:54 +0000 (18:05 -0500)]
Correctly pass along SIGINT to child processes.
Ever since we introduced bup newliner, signal handling has been a little
screwy. The problem is that ctrl-c is passed to *all* processes in the
process group, not just the parent, so everybody would start terminating at
the same time, with very messy results.
Two results were particularly annoying: git.PackWriter()'s destructor
wouldn't always get called (so half-finished packs would be lost instead of
kept so we don't need to back up the same stuff next time) and bup-newliner
would exit, so the stdout/stderr of a process that *did* try to clean up
would be lost, usually resulting in EPIPE, which killed the process while
attempting to clean up.
The fix is simple: when starting a long-running subprocess, give it its own
session by calling os.setsid(). That way ctrl-c is only sent to the
toplevel 'bup' process, who can forward it as it should.
Next, fix bup's signal forwarding to actually forward the same signal as it
received, instead of always using SIGTERM.
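The setsid trick can be sketched like so; on python 3.2+ the `start_new_session=True` keyword does the same thing as the `preexec_fn` shown here:

```python
import os, subprocess

# Start the child in its own session so a terminal ctrl-c (which is
# delivered to the whole foreground process group) reaches only the
# toplevel process, which can then forward the exact signal it got.
p = subprocess.Popen(['echo', 'child running'],
                     stdout=subprocess.PIPE,
                     preexec_fn=os.setsid)
out, _ = p.communicate()
```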
Avery Pennarun [Fri, 12 Mar 2010 22:18:20 +0000 (17:18 -0500)]
hashsplit: use posix_fadvise(DONTNEED) when available.
When reading through large disk images to back them up, we'll only end up
reading the data once, but it still takes up space in the kernel's disk
cache. If you're backing up a whole disk full of stuff, that's bad news for
anything else running on your system, which will rapidly have its stuff
dumped out of cache to store a bunch of stuff bup will never look at again.
The posix_fadvise() call actually lets us tell the kernel we won't be using
this data anymore, thus greatly reducing our hit on the disk cache.
Theoretically it improves things, anyway. I haven't been able to come up
with a really scientific way to test it, since of course *bup's* performance
is expected to be the same either way (we're only throwing away stuff we're
done using). It really does throw things out of cache, though, so the rest
follows logically at least.
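A sketch of the advise-after-read pattern, using `os.posix_fadvise` (available in python 3.3+; bup itself called the C function directly at the time):

```python
import os, tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'\0' * 65536)
tmp.flush()

fd = os.open(tmp.name, os.O_RDONLY)
data = os.read(fd, 65536)
if hasattr(os, 'posix_fadvise'):
    # Tell the kernel we won't read these pages again, so they can be
    # evicted from the page cache immediately.
    os.posix_fadvise(fd, 0, len(data), os.POSIX_FADV_DONTNEED)
os.close(fd)
os.unlink(tmp.name)
```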
Avery Pennarun [Fri, 12 Mar 2010 21:49:32 +0000 (16:49 -0500)]
save-cmd: open files with O_NOATIME on OSes that support it.
Backing up files normally changes their atime, which is bad for two reasons.
First, the files haven't really been "accessed" in a useful sense; the fact
that we backed them up isn't an indication that, say, they're any more
frequently used than they were before.
Secondly, when reading a file updates its atime, the kernel has to enqueue
an atime update (disk write) for every file we back up. For programs that
read the same files repeatedly, this is no big deal, since the atime just
gets flushed out occasionally (after a lot of updates). But since bup
accesses *every* file only once, you end up with a huge atime backlog, and
this can wastefully bog down your disks during a big backup.
Of course, mounting your filesystem with noatime would work too, but not
everybody does that. So let's help them out.
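The O_NOATIME open can be sketched like this; note that the flag is Linux-specific and fails with EPERM unless you own the file, so a fallback is needed:

```python
import os, tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'data')
tmp.flush()

flags = os.O_RDONLY
if hasattr(os, 'O_NOATIME'):
    flags |= os.O_NOATIME
try:
    fd = os.open(tmp.name, flags)
except OSError:
    # O_NOATIME is refused (EPERM) for files you don't own; fall back
    # to a plain open rather than failing the backup.
    fd = os.open(tmp.name, os.O_RDONLY)
content = os.read(fd, 16)
os.close(fd)
os.unlink(tmp.name)
```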
Avery Pennarun [Thu, 4 Mar 2010 03:34:40 +0000 (22:34 -0500)]
main: fix problem when redirecting to newliner on MacOS X.
It's probably just a bug in python 2.4.2, which is the version on my old
MacOS machine. But it seems that if you use subprocess.Popen with stdout=1
and/or stderr=2, it ends up closing the file descriptors instead of passing
them along. Since those are the defaults anyway, just use None instead.
Avery Pennarun [Thu, 4 Mar 2010 01:17:04 +0000 (20:17 -0500)]
save-cmd: Fix --smaller and other behaviour when files are skipped.
The --smaller option now uses parse_num() so it can be something other than
a raw number of bytes (eg. "1.5G").
We were incorrectly marking a tree as valid when we skipped any of its
contents for any reason; that's no good. We can still save a tree to the
backup, but it'll be missing some stuff, so we have to avoid marking it as
valid. That way it won't be skipped next time around.
Avery Pennarun [Thu, 4 Mar 2010 00:21:20 +0000 (19:21 -0500)]
save-cmd: progress meter wouldn't count identical files correctly.
This one was really tricky. If a file was IX_HASHVALID but its object
wasn't available on the target server (eg. if you backed up to one server
and now are backing up to a different one), we would correctly count
it toward the total bytes we expected to back up.
Now imagine there are two *identical* files (ie. with the same sha1sum) in
this situation. When that happens, we'd back up the first one, after which
the objects for the second one *are* available. So we'd skip it, thinking
that we had skipped it in the first place. The result would be that our
backup count showed a final byte percentage less than 100%.
The workaround isn't very pretty, but should be correct: we add a new
IX_SHAMISSING flag, setting or clearing it during the initial index scan,
and then we use *that* as the indicator of whether to add bytes to the count
or not.
We also have to decide whether to recurse into subdirectories using this
algorithm. If /etc/rc3.d and /etc/rc4.d are identical, and one of the files
in them had this problem, then we wouldn't even *recurse* into /etc/rc3.d
after backing up /etc/rc4.d. That means we wouldn't check the IX_SHAMISSING
flag on the file inside. So we had to fix that up too.
On the other hand, this is an awful lot of complexity just to make the
progress messages more exact...
Avery Pennarun [Wed, 3 Mar 2010 22:36:06 +0000 (17:36 -0500)]
save-cmd: don't fail an assertion when doing a backup from the root level.
This wasn't caught by unit tests because "virtual" nodes added by
index.py:_golevel() weren't being marked as IX_EXISTS, which in the unit
tests included the root, so save-cmd was never actually trying to back up
that node.
That made the base directories incorrectly marked as status=D (deleted) if
you printed out the index during the tests. So add a test for that to make
it fail if "/" is deleted (which obviously makes no sense), then add another
test for saving from the root level, then fix both bugs.
Avery Pennarun [Wed, 3 Mar 2010 04:59:08 +0000 (23:59 -0500)]
'make stupid' stopped working when I moved subcommands into their own dir.
Remote server mode tries to add the directory of argv[0] (the
currently-running program) to the PATH on the remote server, just in case
bup isn't installed in the PATH there, so that it can then run 'bup server'.
However, now that bup-save is in a different place than bup, argv[0] is the
wrong place to look. Instead, have the bup executable export an environment
variable containing its location, and client.py can use that instead of
argv[0]. Slightly gross, but it works.
Avery Pennarun [Wed, 3 Mar 2010 04:18:49 +0000 (23:18 -0500)]
log(): handle situations where stderr gets set to nonblocking.
It's probably ssh doing this, and in obscure situations, it means log() ends
up throwing an exception and aborting the program.
Fix it so that we handle EAGAIN correctly if we get it when writing to
stderr, even though this is only really necessary due to stupidity on
(I think/hope) someone else's part.
Avery Pennarun [Tue, 2 Mar 2010 21:20:41 +0000 (16:20 -0500)]
bup random: fix progress output and don't print to a tty.
We were printing output using a series of dots, which interacted badly with
bup newliner (and for good reason). Change it to actually display the
number of megabytes done so far.
Also, don't print random binary data to a tty unless -f is given. It's
just more polite that way.
Avery Pennarun [Mon, 1 Mar 2010 00:07:00 +0000 (19:07 -0500)]
Rename PackIndex->PackIdx and MultiPackIndex->PackIdxList.
This corresponds to the PackMidx renaming I did earlier, and helps avoid
confusion between index.py (which talks to the 'bupindex' file and has
nothing to do with packs) and git.py (which talks to packs and has nothing
to do with the bupindex). Now pack indexes are always called Idx, and the
bupindex is always Index.
Furthermore, MultiPackIndex could easily be assumed to be the same thing as
a Midx, which it isn't. PackIdxList is a more accurate description of what
it is: a list of pack indexes. A Midx is an index of a list of packs.
Avery Pennarun [Sun, 28 Feb 2010 22:05:41 +0000 (17:05 -0500)]
Move cmd-*.py to cmd/*-cmd.py.
The bup-* programs shouldn't need to be installed into /usr/bin; we should
search for them in /usr/lib somewhere.
I could have left the names as cmd/cmd-*.py, but the cmd-* was annoying me
because of tab completion. Now I can type cmd/ran<tab> to get
random-cmd.py.
Avery Pennarun [Sun, 28 Feb 2010 21:17:35 +0000 (16:17 -0500)]
Move python library files to lib/bup/
...and update other programs so that they import them correctly from their
new location.
This is necessary so that the bup library files can eventually be installed
somewhere other than wherever the 'bup' executable ends up. Plus it's
clearer and safer to say 'from bup import options' instead of just 'import
options', in case someone else writes an 'options' module.
I wish I could have named the directory just 'bup', but I can't; there's
already a program with that name.
Also, in the name of sanity, rename memtest.py to 'bup memtest' so that it
can get the new paths automatically.
Avery Pennarun [Sun, 28 Feb 2010 20:51:16 +0000 (15:51 -0500)]
cmd-index: auto-invalidate entries without a valid sha1 or gitmode.
Not exactly sure where these entries came from; possibly a failed save or an
earlier buggy version of bup. But previously, they weren't auto-fixable
without deleting your bupindex.
Avery Pennarun [Sun, 28 Feb 2010 20:00:50 +0000 (15:00 -0500)]
Add a new 'bup newliner' that fixes progress message whitespace.
If we have multiple processes producing status messages to stderr and/or
stdout, and some of the lines ended in \r (ie. a progress message that was
supposed to be overwritten later) they would sometimes stomp on each other
and leave ugly bits lying around.
Now bup.py automatically pipes stdout/stderr to the new 'bup newliner'
command to fix this, but only if they were previously pointing at a tty.
Thus, if you redirect stdout to a file, nothing weird will happen, but if
you don't, stdout and stderr won't conflict with each other.
Anyway, the output is prettier now. Trust me on this.
Avery Pennarun [Sun, 28 Feb 2010 18:07:48 +0000 (13:07 -0500)]
Add an options.fatal() function and use it.
Every existing call to o.usage() was preceded by an error message that
printed the exename, then the error message. So let's add a fatal()
function that does it all in one step. This reduces the net number of lines
plus improves consistency.
Avery Pennarun [Sun, 14 Feb 2010 08:35:45 +0000 (03:35 -0500)]
Another suspicious fix for CatPipe parallelism.
This really shouldn't be necessary: it's clear to me that the 'it' object
should be going out of scope right away, and thus getting cleaned up by the
garbage collector.
But on one of my Linux PCs (with python 2.4.4) it fails the unit tests
unless I add this patch. Oh well, let's do it then.
Avery Pennarun [Sun, 14 Feb 2010 06:16:43 +0000 (01:16 -0500)]
hashsplit: smallish files (less than BLOB_MAX) weren't getting split.
This buglet was introduced when doing my new fanout cleanups. It's
relatively unimportant, but it would cause a bit of space wastage for
smallish files that changed by a bit, since we couldn't take advantage of
deduplication for their blocks.
This also explains why the --fanout argument test broke earlier. I thought
I was going crazy (since the whole fanout implementation had changed and the
number now means something slightly different), so I just removed it. But
now we can bring it back and it passes again.
Avery Pennarun [Sat, 13 Feb 2010 23:21:09 +0000 (18:21 -0500)]
Make CatPipe objects more resilient when interrupted.
If we stopped iterating halfway through a particular object, the iterator
wouldn't finishing reading all the data, which would mess up the state of
the git-cat-file pipe. Now we read all the data even if we're going to just
throw it away.
Avery Pennarun [Fri, 12 Feb 2010 19:53:19 +0000 (14:53 -0500)]
_hashsplit.c: right shifting 32 bits doesn't work.
In C, if you do
uint32_t i = 0xffffffff;
i >>= 32;
then the result is undefined behaviour; on x86 the shift count is taken
mod 32, so you typically get 0xffffffff rather than the 0 you might
expect. Let's shift it by less than 32 at a time, which gives the right
results. This fixes a rare infinite loop when counting the bits in the
hashsplit.
Avery Pennarun [Fri, 12 Feb 2010 04:50:39 +0000 (23:50 -0500)]
hashsplit: totally change the way the fanout stuff works.
Useless code churn or genius innovation? You decide.
The previous system for naming chunks of a split file was kind of lame. We
tried to name the files something that was "almost" their offset, so that
filenames wouldn't shuffle around too much if a few bytes were added/deleted
here and there. But that totally failed to work if a *lot* of bytes were
added, and it also lost the useful feature that you could seek to a specific
point in a file (like a VM image) without restoring the whole thing.
"Approximate" offsets aren't much good for seeking to.
The new system is even more crazy than the original hashsplit: we now use
the "extra bits" of the rolling checksum to define progressively larger
chunks. For example, we might define a normal chunk if the checksum ends in
0xFFF (12 bits). Now we can group multiple chunks together when the
checksum ends in 0xFFFF (16 bits). Because of the way the checksum works,
this happens about every 2^4 = 16 chunks. Similarly, 0xFFFFF (20 bits) will
happen 16 times less often than that, and so on. We can use this effect to
define a tree.
Then, in each branch of the tree, we name files based on their (exact, not
approximate) offset *from the start of that tree*.
Essentially, inserting/deleting/changing bytes will affect more "levels" of
the rolling checksum, mangling bigger and bigger branches of the overall
tree and causing those branches to change. However, only the content of
that sub-branch (and the *names*, ie offsets, of the following branches at
that and further-up levels) end up getting changed, so the effect can be
mostly localized. The subtrees of those renamed trees are *not* affected,
because all their offsets are relative to the start of their own tree. This
means *most* of the sha1sums in the resulting hierarchy don't need to
change, no matter how much data you add/insert/delete.
Anyway, the net result is that "git diff -M" now actually does something
halfway sensible when comparing the trees corresponding to huge split files.
Only halfway (because the chunk boundaries can move around a bit, and such
large files are usually binary anyway) but it opens the way for much cooler
algorithms in the future.
Also, it'll now be possible to make 'bup fuse' open files without restoring
the entire thing to a temp file first. That means restoring (or even
*using*) snapshotted VMs ought to become possible.
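The "extra bits" idea can be illustrated with a small sketch. This uses the 12-bit threshold from the example above and a hypothetical `split_level` helper; the real hashsplit parameters and names differ:

```python
def split_level(csum, bits=12):
    """Return how many extra 4-bit groups of one-bits follow the normal
    split threshold, i.e. how high up the tree this boundary reaches.
    Illustrative only; not the actual bup implementation."""
    mask = (1 << bits) - 1
    if csum & mask != mask:
        return None            # not a chunk boundary at all
    csum >>= bits
    level = 0
    while csum & 0xf == 0xf:   # each extra 4 bits -> one level higher,
        level += 1             # occurring ~16x less often per level
        csum >>= 4
    return level

print(split_level(0x00000fff))  # 0: ordinary chunk boundary
print(split_level(0x0000ffff))  # 1: groups roughly 16 chunks
print(split_level(0x000fffff))  # 2: roughly 16 of those groups
```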
Andrew Schleifer [Wed, 10 Feb 2010 20:40:46 +0000 (15:40 -0500)]
Fix building on MacOS X on PowerPC.
bup failed to build on one of my machines, an older iMac; make
died ~40 lines in with "gcc-4.0: Invalid arch name : Power".
On PPC machines, uname -m returns the helpfully descriptive
"Power Macintosh", which gcc doesn't recognize. Some googling
revealed e.g.
http://www.opensource.apple.com/source/ld64/ld64-95.2.12/unit-tests/include/common.makefile
where they use $(shell arch) to get the necessary info.
With that little change, bup built on ppc and i386 machines for
me, and passed all tests.
Avery Pennarun [Tue, 9 Feb 2010 05:51:25 +0000 (00:51 -0500)]
cmd-save: don't recurse into already-valid subdirs.
When iterating through the index, if we find out that a particular dir (like
/usr) has a known-valid sha1sum and isn't marked as changed, there's no need
to recurse into it at all. This saves some pointless grinding through the
index when entire swaths of the tree are known to be already valid.
Avery Pennarun [Tue, 9 Feb 2010 01:28:51 +0000 (20:28 -0500)]
cmd-index/cmd-save: correctly mark directories as dirty/clean.
Previously, we just ignored the IX_HASHVALID on directories, and regenerated
their hashes on every backup regardless. Now we correctly save directory
hashes and mark them IX_HASHVALID after doing a backup, as well as removing
IX_HASHVALID all the way up the tree whenever a file is marked as invalid.
Avery Pennarun [Tue, 9 Feb 2010 00:26:38 +0000 (19:26 -0500)]
Fix some list comprehensions that I thought were generator comprehensions.
Apparently [x for x in whatever] yields a list, not an iterator, which means
two things:
- it might use more memory than I thought
- you definitely don't need to write list([...]) since it's already a
list.
Clean up a few of these. You learn something new every day.
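The distinction in question:

```python
squares_list = [x * x for x in range(4)]   # a real list, built eagerly
squares_gen = (x * x for x in range(4))    # a generator, evaluated lazily

print(squares_list)                        # already a list...
# ...so wrapping it as list([x * x for x in range(4)]) is redundant.
print(list(squares_gen))
```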
Avery Pennarun [Mon, 8 Feb 2010 18:49:17 +0000 (13:49 -0500)]
test.sh: don't try non-quick fsck on damaged repositories.
It turns out that older versions of git (1.5.x or so) have a git-verify-pack
that goes into an endless loop when it hits certain kinds of corruption, and
our test would trigger it almost every time. Using --quick avoids calling
git-verify-pack, so it won't exhibit the problem.
Unfortunately this means a slightly less thorough test of non-quick
bup-fsck, but it'll have to do. Better than failing tests nonstop, anyway.
Avery Pennarun [Sun, 24 Jan 2010 03:09:15 +0000 (22:09 -0500)]
Infrastructure for generating a markdown-based man page using pandoc.
The man page (bup.1) is total drivel for the moment, though. And arguably
we could split up the manpages per subcommand like git does, but maybe
that's overkill at this stage.
Avery Pennarun [Fri, 5 Feb 2010 01:12:41 +0000 (20:12 -0500)]
bup save: try to estimate the time remaining.
Naturally, estimating the time remaining is one of those things that sounds
super easy, but isn't. So the numbers wobble around a bit more than I'd
like, especially at first. But apply a few scary heuristics, and boom!
Stuff happens.
Avery Pennarun [Fri, 5 Feb 2010 00:26:17 +0000 (19:26 -0500)]
bup-server: revert to non-midx indexes when suggesting a pack.
Currently midx files can't tell you *which* index contains a particular
hash, just that *one* of them does. So bup-server was barfing when it
expected MultiPackIndex.exists() to return a pack name, and was getting a
.midx file instead.
We could have loosened the assertion and allowed the server to suggest a
.midx file... but those can be huge, and it defeats the purpose of only
suggesting the minimal set of packs so that lightweight clients aren't
overwhelmed.
Avery Pennarun [Fri, 5 Feb 2010 00:12:30 +0000 (19:12 -0500)]
Narrow the exception handling in cmd-save.
If we encountered an error *writing* the pack, we were counting it as a
non-fatal error, which was not the intention. Only *reading* files we want
to back up should be considered non-fatal.
Avery Pennarun [Thu, 4 Feb 2010 23:56:01 +0000 (18:56 -0500)]
On python 2.4 on MacOS X, __len__() must return an int.
We were already returning integers, which seem to be "long ints" in this
case, even though they're relatively small. Whatever, we'll typecast them
to int first, and now unit tests pass.
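The typecast amounts to this sketch; `PackList` is a hypothetical stand-in for the affected class:

```python
class PackList:
    def __init__(self, count):
        self.count = count

    def __len__(self):
        # python 2.4 on MacOS X required __len__ to return an int,
        # not a long, so cast explicitly.
        return int(self.count)

print(len(PackList(3)))
```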
Avery Pennarun [Thu, 4 Feb 2010 06:21:51 +0000 (01:21 -0500)]
Merge branch 'indexrewrite'
* indexrewrite:
Greatly improved progress reporting during index/save.
Fix bugs in new indexing code.
Speed up cmd-drecurse by 40%.
Split directory recursion stuff from cmd-index.py into drecurse.py.
Massive speedups to bupindex code.
Avery Pennarun [Thu, 4 Feb 2010 06:12:06 +0000 (01:12 -0500)]
Greatly improved progress reporting during index/save.
Now that the index reading stuff is much faster, we can afford to waste time
reading through it just to count how many bytes we're planning to back up.
And that lets us print really friendly progress messages during bup save, in
which we can tell you exactly what fraction of your bytes have been backed
up so far.
Avery Pennarun [Wed, 3 Feb 2010 23:56:43 +0000 (18:56 -0500)]
Fix bugs in new indexing code.
The logic was way too screwy, so I've simplified it a lot. Also extended
the unit tests quite a bit to replicate the weird problems I was having. It
seems pretty stable - and pretty fast - now.
Iterating through an index of my whole home directory (bup index -p ~) now
takes about 5.1 seconds, vs. 3.5 seconds before the rewrite. However,
iterating through just a *fraction* of the index can now bypass all the
parts we don't care about, so it's much much faster than before.
Could probably still stand some more optimization eventually, but at least
the file format allows for speed. The rest is just code :)
Avery Pennarun [Wed, 3 Feb 2010 21:42:48 +0000 (16:42 -0500)]
Split directory recursion stuff from cmd-index.py into drecurse.py.
Also add a new command, 'bup drecurse', which just recurses through a
directory tree and prints all the filenames. This is useful for timing
performance vs. the native 'find' command.
The result is a bit embarrassing; for my home directory of about 188000
files, drecurse is about 10x slower:
$ time bup drecurse -q ~
real 0m2.935s
user 0m2.312s
sys 0m0.580s
$ time find ~ -printf ''
real 0m0.385s
user 0m0.096s
sys 0m0.284s
$ time find ~ -printf '%s\n' >/dev/null
real 0m0.662s
user 0m0.208s
sys 0m0.456s
Avery Pennarun [Sun, 31 Jan 2010 22:59:33 +0000 (17:59 -0500)]
Massive speedups to bupindex code.
The old file format was modeled after the git one, but it was kind of dumb;
you couldn't search through the file except linearly, which is pretty slow
when you have hundreds of thousands, or millions, of files. It also stored
the entire pathname of each file, which got very wasteful as filenames got
longer.
The new format is much quicker; each directory has a pointer to its list of
children, so you can jump around rather than reading linearly through the
file. Thus you can now 'bup index -p' any subdirectory pretty much
instantly. The code is still not completely optimized, but the remaining
algorithmic silliness doesn't seem to matter.
And it even still passes unit tests! Which is too bad, actually, because I
still get oddly crashy behaviour when I repeatedly update a large index. So
there are still some screwy bugs hanging around. I guess that means we need
better unit tests...
Avery Pennarun [Tue, 2 Feb 2010 05:54:10 +0000 (00:54 -0500)]
cmd-save: add --smaller option.
This makes it only back up files smaller than the given size. bup can
handle big files, but you might want to do quicker incremental backups and
skip bigger files except once a day, or something.
Avery Pennarun [Tue, 2 Feb 2010 02:34:56 +0000 (21:34 -0500)]
midx: the fanout table entries can be 4 bytes, not 8.
I was trying to be future-proof, but it was kind of overkill, since a 32-bit
fanout entry could handle a total of 4 billion *hashes* per midx. That
would be 20*4bil = 80 gigs in a single midx. This corresponds to about 10
terabytes of packs, which isn't inconceivable... but if it happens, you
could just use more than one midx. Plus you'd likely run into other weird
bup problems before your midx files get anywhere near 80 gigs.
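The size difference per entry is easy to see with `struct`; the big-endian formats here mirror git's index conventions, though the exact midx layout is bup-specific:

```python
import struct

count = 4000000000                  # fits comfortably in 32 bits
entry = struct.pack('!I', count)    # 4-byte fanout entry (new format)
wide = struct.pack('!Q', count)     # 8-byte entry (old, half wasted)

print(len(entry), len(wide))        # 4 8
```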
Avery Pennarun [Tue, 2 Feb 2010 02:30:59 +0000 (21:30 -0500)]
cmd-midx: correctly handle a tiny nonzero number of objects.
If all the sha1sums would have fit in a single page, the number of bits in
the table would be negative, with odd results. Now we just refuse to create
the midx if there are too few objects *and* too few files, since it would be
useless anyway.
We're still willing to create a very small midx if it allows us to merge
several indexes into one, however small, since that will still speed up
searching.