_hashsplit.c: replace the stupidsum algorithm with rsync's adler32-based one.
I've been meaning to do this for a while, but a particular test dataset that
really caused problems with stupidsum() (ie. it split things into way more
chunks than it should have) finally screwed me over. Let's change over to a
"real" checksum algorithm.
Non-annoying datasets shouldn't be noticeably affected, but bad ones (such
as my test case from EQL Data) can be 10x more sensible. Typical backup
sets now have about 20% fewer chunks, although this has little effect on the
overall repository size.
WARNING: After this patch, all your chunk boundaries will be different from
before! That means your incremental backups won't be terribly incremental
and your backup repositories will jump in size. This should only happen
once.
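
For readers unfamiliar with the technique, here is a minimal Python sketch of
an rsync-style rolling checksum (two running sums, in the spirit of adler32).
The window size, the per-byte offset and the function names are assumptions
made for this example; they are not taken from _hashsplit.c.

```python
WINDOW = 64          # assumed window size, for illustration only
CHAR_OFFSET = 31     # assumed per-byte offset, as used by rsync-style sums


def full_sum(window):
    """Compute (s1, s2) from scratch over a whole window of bytes."""
    s1 = s2 = 0
    n = len(window)
    for i, b in enumerate(window):
        s1 += b + CHAR_OFFSET
        s2 += (n - i) * (b + CHAR_OFFSET)
    return s1 & 0xFFFF, s2 & 0xFFFF


def roll(s1, s2, drop, add, n=WINDOW):
    """Slide the window one byte: remove `drop` at the front, append `add`."""
    s1 = (s1 + add - drop) & 0xFFFF
    s2 = (s2 + s1 - n * (drop + CHAR_OFFSET)) & 0xFFFF
    return s1, s2
```

The point of roll() is that moving the window costs a constant amount of work
per byte, instead of re-summing the whole window each time.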
_hashsplit.c: switch rollsum_roll() to a macro instead of an inline function.
gcc 4.3's optimizer manages to fail at optimizing the inline, but works okay
with the macro.
Mysteriously, if find_ofs() is *not* static (and therefore presumably
*harder* to optimize), the optimizer works either way. But removing the
static is just wrong, so use the macro instead.
The difference in speed is about 53 megs/sec vs 80 megs/sec on my machine
for this command:
bup random 100M 2>/dev/null | bup split -N --bench
Gabriel Filion [Tue, 27 Jul 2010 03:52:34 +0000 (23:52 -0400)]
cmd/ftp: Hide .dotfiles by default (-a shows them)
Normally in FTP sites, files beginning with a dot are hidden from a list
(ls) command by default. Also, using the argument '-a' makes the list
show hidden files.
The current 'bup ftp' implementation does not behave this way. Make it hide
hidden files by default, as expected, and show hidden files when '-a' or
'--all' is specified to the 'ls' command.
All unknown switches will make bup ftp show the ls command usage.
Users can also give 'ls --help' to obtain the usage string.
Gabriel Filion [Tue, 27 Jul 2010 03:52:33 +0000 (23:52 -0400)]
lib/options: Add an onabort argument to Options()
Sometimes, we may want to parse a list of arguments and not have the
call to Options.parse() exit the program when it finds an unknown
argument.
Add an argument to the class' __init__ method that can be either a
function or a class (must be an exception class). If calling the
function or class constructor returns an object, this object will be
raised on abort.
Also add a convenience exception class named Fatal that can be
passed to Options() to exclusively catch situations in which
Options.parse() would have caused the program to exit.
Finally, set the default value of the onabort argument to call
sys.exit(97) as was previously the case.
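
A rough sketch of the pattern described above. MiniParser is a hypothetical
stand-in written for this example, not bup's Options class; only the onabort
idea, the Fatal name and the exit status 97 come from the message.

```python
import sys


class Fatal(Exception):
    """Raised instead of exiting when the caller asks for it."""


def _default_onabort(msg):
    # Mirrors the default described above: exit with status 97.
    sys.exit(97)


class MiniParser:
    """Hypothetical option parser, used only to illustrate onabort."""

    def __init__(self, known, onabort=_default_onabort):
        self.known = set(known)
        self.onabort = onabort

    def fatal(self, msg):
        result = self.onabort(msg)
        if result is not None:
            # A function or exception class that returns an object
            # has that object raised.
            raise result

    def parse(self, args):
        for arg in args:
            if arg not in self.known:
                self.fatal("unknown option: %s" % arg)
        return list(args)
```

With onabort=Fatal, a caller such as an interactive shell can catch Fatal and
print a usage message rather than having the whole program exit.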
Gabriel Filion [Tue, 27 Jul 2010 07:24:23 +0000 (03:24 -0400)]
cmd/ftp: if completion fails due to FileNotFound, just eat it.
Just as bash would do, if you're trying to complete a filename that doesn't
exist, just don't offer any completions. In this case, it only happens if
you try to complete through a broken symlink.
Now that we've fixed this case, enable the printing of exception tracebacks
in case of *other* kinds of completion errors, since we don't expect there
to be any.
[Committed by apenwarr based on an unofficial patch from Gabriel]
vfs: resolve absolute symlinks inside their particular backup set.
Let's say you back up a file "/etc/motd" that's a symlink to
"/var/run/motd". The file inside the backup repo is actually
/whatever/latest/etc/motd, so the symlink should *actually* point to
/whatever/latest/var/run/motd. Let's resolve it that way automatically in
Symlink.dereference().
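
The path arithmetic can be sketched as follows. The function name and its
arguments are made up for this example, and the handling of relative targets
is an assumption; the real logic lives in Symlink.dereference().

```python
import posixpath


def resolve_link_target(backup_root, link_path, target):
    """Map a symlink target to a path inside the same backup set."""
    if posixpath.isabs(target):
        # Re-root absolute targets under the backup set.
        return posixpath.normpath(backup_root + target)
    # Relative targets resolve against the link's own directory.
    link_dir = posixpath.dirname(link_path)
    return posixpath.normpath(posixpath.join(link_dir, target))
```

For the motd example, resolve_link_target('/whatever/latest',
'/whatever/latest/etc/motd', '/var/run/motd') gives
'/whatever/latest/var/run/motd'.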
vfs: try_lresolve() was a bad idea. Create try_resolve() instead.
Also add some comments to describe the actual differences between resolve()
and lresolve(), and clean things up a bit so that they actually work as
they're supposed to.
Basically, all of lresolve(), resolve(), and try_resolve() depend on
*intermediate* paths being resolvable; all of them will throw an exception
if not. They only differ in the very last node in the path, when that node
is a symlink:
resolve() will dereference it or throw an exception if it can't;
try_resolve() will try to dereference it, but return self if it can't;
lresolve() will not dereference it at all, like lstat() doesn't.
With that in mind, we can fix up cmd/ftp and cmd/web to use the right calls,
thus fixing an unexpected error in ftp's tab completion reported by Gabriel
Filion, which would happen if you tried to tab complete inside a directory
that contained a broken symlink. We only care what the symlink points to so
we can decide whether or not to append '/' to the tab completion, so we want
it to fail silently if it's going to fail.
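
As an analogy only, the same three behaviours can be sketched against the
real filesystem with os.stat() and os.lstat(). These functions are
illustrative and are not the vfs methods themselves.

```python
import os


def resolve(path):
    """Follow a final symlink; raises OSError if the target is missing."""
    os.stat(path)
    return os.path.realpath(path)


def lresolve(path):
    """Never follow the final symlink, like lstat()."""
    os.lstat(path)
    return path


def try_resolve(path):
    """Follow the final symlink if possible, else return the link itself."""
    try:
        return resolve(path)
    except OSError:
        return lresolve(path)
```

Tab completion wants the try_resolve() behaviour: a broken symlink simply
yields the link itself, so no error reaches the user.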
Gabriel Filion [Sun, 25 Jul 2010 17:34:13 +0000 (13:34 -0400)]
fix helpers.columnate bug when list is empty
When the list given to the columnate function is empty, the function
raises an exception when determining the max(len of all elements), since
the list given to max is empty.
One indirect example of when this bug is apparent is in the 'bup ftp'
command when listing an empty directory:
bup> ls backupname/latest/etc/keys
error: max() arg is an empty sequence
Add a special condition at the beginning of the columnate function that
returns an empty string if the list of elements is empty.
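
A simplified sketch of the guard. The signature and the layout logic here are
assumptions for illustration and are not copied from helpers.py.

```python
def columnate(items, prefix="", width=70):
    """Lay out strings in columns (simplified, row-major layout)."""
    if not items:
        # Guard: max() below would fail on an empty sequence.
        return ""
    cell = max(len(s) for s in items) + 2
    ncols = max(1, (width - len(prefix)) // cell)
    lines = []
    for start in range(0, len(items), ncols):
        row = items[start:start + ncols]
        lines.append(prefix + "".join(s.ljust(cell) for s in row).rstrip())
    return "\n".join(lines) + "\n"
```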
Joe Beda [Fri, 23 Jul 2010 07:10:36 +0000 (00:10 -0700)]
Convert 'bup web' directory listing to use tornado templates.
This includes creating a new idea of a "resource path" that currently sits
under the lib dir. Getting resources is supported with a new helper
(resource_path).
I just took the tornado/tornado directory, along with the README.
I'm using tornado's git commit 7a30f9f6eac9aa0cf295b078695156776fd050ce,
since recent versions of Tornado have support for specifying which
address you want to listen to.
Signed-off-by: Peter McCurdy <petermccurdy@alumni.uwaterloo.ca>
git.py: use close_fds=True when starting git cat-file.
Otherwise git could inherit some other file descriptors we're using. This
is particularly relevant in cmd/web, and particularly when applying
pmccurdy's patches to use Tornado.
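
A minimal sketch of the idea with a generic child process; the helper name is
hypothetical. Note that in Python 3 on POSIX, close_fds=True is already the
default, so the explicit flag matters mainly on older Python 2 versions.

```python
import subprocess
import sys


def run_child(argv):
    """Run a child that inherits only stdin, stdout and stderr."""
    p = subprocess.Popen(argv, stdout=subprocess.PIPE, close_fds=True)
    out, _ = p.communicate()
    return p.returncode, out
```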
Because of changes to wvtest.py's chdir() handling, had to make some slight
changes to filenames used by the bup tests themselves - all changes for the
better.
options.py: differentiate unset and set-to-negative options.
Unset options will still be None, but options explicitly set to a negative
will now be 0. This doesn't change semantics for anything currently in bup,
but it could be useful later when applying defaults.
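
A small sketch of the three states. The parse_flags() helper and the "--no-"
prefix convention are assumptions made for this example, not the options.py
code.

```python
def parse_flags(args, names):
    """Return a dict mapping each name to None (unset), 1 (set) or 0 (negated)."""
    opts = {name: None for name in names}    # unset stays None
    for arg in args:
        if arg.startswith("--no-") and arg[5:] in opts:
            opts[arg[5:]] = 0                # explicitly negated
        elif arg.startswith("--") and arg[2:] in opts:
            opts[arg[2:]] = 1                # explicitly set
    return opts
```

A caller applying defaults can then test "is None" to fill in a default
without overriding an explicit negative.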
While we're here, clean up the option parsing code to make it
very slightly more efficient.
Apparently on some systems (Mandriva and Slackware at least), importing
the readline library can print some escape sequences to stdout, which screws
things up with the unit tests that run 'bup ftp "cat filename"' and expect
it to be the right data.
Thanks to Eduardo Kienetz for noticing and helping to track down the problem
since I couldn't reproduce it.
vfs: File.open() needs to do a seek(0) on the cached FileReader.
Otherwise if you open a file, read through it, and close it, then do it
again, you'll get zero bytes the second time.
To make this efficient, change seek() to not discard its _chunkiter every
single time; instead, keep the _chunkiter around until trying to read() from
a location that *isn't* the current offset. Now seeking around in the file
is cheap.
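
The idea can be sketched like this. LazyReader is a hypothetical class over an
in-memory list of chunks, written only to show the "keep the iterator until
the offset actually changes" behaviour; it is not the FileReader code.

```python
class LazyReader:
    """Hypothetical reader over a list of byte chunks."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.ofs = 0          # position requested by seek()
        self._iter = None     # active chunk iterator, if any
        self._iter_ofs = 0    # offset the iterator will yield next
        self.restarts = 0     # how many times the iterator was rebuilt

    def seek(self, ofs):
        # Cheap: only record the position; keep the iterator around.
        self.ofs = ofs

    def _chunks_from(self, ofs):
        pos = 0
        for c in self.chunks:
            if pos + len(c) > ofs:
                yield c[max(0, ofs - pos):]
            pos += len(c)

    def read(self):
        """Return the next piece at the current offset (b'' at EOF)."""
        if self._iter is None or self._iter_ofs != self.ofs:
            # Rebuild only when reading from a different location.
            self._iter = self._chunks_from(self.ofs)
            self.restarts += 1
        piece = next(self._iter, b"")
        self.ofs += len(piece)
        self._iter_ofs = self.ofs
        return piece
```

A seek() to the current offset costs nothing, and a seek(0) on reopen makes
the second read start from the beginning again.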
Gabriel Filion [Fri, 25 Jun 2010 08:07:03 +0000 (04:07 -0400)]
Inline git.cat() inside server-cmd.py
Since the cat() function in git.py is used only inside the server-cmd.py
script, and since it is a discouraged use of CatPipe, inline the code
inside the server-cmd.py script.
At the same time, make the CatPipe object persistent between calls to
the "cat" command to remove unnecessary deletion/creation of resources.
Avery Pennarun [Fri, 25 Jun 2010 17:13:49 +0000 (13:13 -0400)]
vfs: correctly handle reading small files.
After the recent change to let vfs seek around in files, we broke support
for files that were only one chunk. Fix it up, then add some unit tests to
detect such mistakes in the future.
Also, 'bup ftp' now returns nonzero if it catches any exceptions during
execution, making it more suitable for use in scripts... such as the unit
tests :)
Gabriel Filion [Fri, 25 Jun 2010 01:36:36 +0000 (21:36 -0400)]
Makefile: allow PYTHON variable to override python version.
Currently, the Makefile assumes the python command that should be used
is the default python version -- the "python" executable that is found
in PATH. Compiling and testing with a different python version is not
possible without either having a system with another default version, or
by manually changing the link found in PATH.
Correct this situation by using a variable for the python command name,
that can be overridden on the command line like the following:
Gabriel Filion [Tue, 8 Jun 2010 05:03:41 +0000 (01:03 -0400)]
Docstrings for the git.py library
Add docstrings to the module and the public classes and functions of the
git library (ie. the ones that do not start with _ ).
Also rename the AbortableIter class to _AbortableIter since it is used
only inside the git.py library and is not intended to be used elsewhere
for now.
Avery Pennarun [Mon, 7 Jun 2010 23:02:23 +0000 (19:02 -0400)]
cmd/{save,split}: add a --bwlimit option.
This allows you to limit how much upstream bandwidth 'bup save' and 'bup
split' will use. Specify it as a number of bytes/second, or with the 'k' or
'M' or (lucky you!) 'G' suffixes for larger values.
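
A sketch of how such a value might be parsed. The parse_rate() name is
hypothetical, and treating the suffixes as binary multiples (1k = 1024) is an
assumption; the message does not say which multiples are used.

```python
def parse_rate(text):
    """Parse '500', '64k', '2M' or '1G' into bytes/second.

    Assumes binary multiples (1k = 1024); the real option may differ.
    """
    multipliers = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)
```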
Peter McCurdy [Fri, 16 Apr 2010 07:11:13 +0000 (03:11 -0400)]
Work around extra space added by some readline versions.
Apparently some versions of readline (6.0, for me) in some versions of
Python (Ubuntu's python2.6.4-0ubuntu1, for me) have an irritating bug
where they add an extra space to the end of all completions. This is
particularly annoying for directory completions, as you can't
tab-complete your way into the contents of the directory. See
http://bugs.python.org/issue5833
This patch, borrowed mostly from Trac, goes in and twiddles the
appropriate variable inside the readline library to make it stop doing
that. See http://trac.edgewall.org/ticket/8711 for the discussion.
Signed-off-by: Peter McCurdy <petermccurdy@alumni.uwaterloo.ca>
Gabriel Filion [Fri, 30 Apr 2010 05:53:13 +0000 (01:53 -0400)]
code clarity: one-letter var carried for too long
In split-cmd.py, the "w" variable is first seen on line 55 and is kept
around until line 96. A variable that is used sparsely over a medium
stretch of code should have a name that makes sense when read on its
own.
Change "w" into "pack_writer" to better identify the purpose of the
variable.
Jon Dowland [Wed, 28 Apr 2010 14:41:05 +0000 (15:41 +0100)]
adjust .md files to make lexgrog happy
The whatis(1) tool cannot parse the bup manpages, because there
are two words before the '-' separator. This patch joins the words
using another '-', in the same fashion as git, to overcome this
limitation.
Before:
$ whatis bup fuse
bup (1) - Backup program using rolling checksums and git file fo...
fuse: nothing appropriate.
$ whatis bup-fuse
bup-fuse: nothing appropriate.
After:
$ whatis bup-fuse
bup-fuse (1) - mount a bup repository as a filesystem
Gabriel Filion [Wed, 28 Apr 2010 17:12:05 +0000 (13:12 -0400)]
Documentation: some placeholders are lost
Some pieces of text in the documentation files use the <...> syntax to
mark named placeholders. However, the conversion done by pandoc from
Markdown to man pages makes some of these placeholders disappear.
The affected elements are those that contain only characters that could
be valid for an e-mail address or a URL, but are not supposed to be
either. Also, elements inside `...`-style code blocks are unaffected.
Fix this situation by escaping the < and > characters where the tags
disappear.
Jon Dowland [Wed, 28 Apr 2010 13:50:30 +0000 (14:50 +0100)]
add -o/--allow-other to bup-fuse
Setting the fuse option allow_other will fail if user_allow_other
is not set in fuse.conf. Add toggle -o/--allow-other to bup-fuse
(disabled by default).
cmd/version, etc: fix version number detection stuff.
Gabriel Filion pointed out that bup's version number (which we added to the
man pages automatically) was not detected when you used a bup tarball
generated from 'git archive' (or more likely, you let GitHub call 'git
archive' for you). That makes sense, since our version detection was based
on having a .git directory around, which the tarball doesn't.
Instead, let's create a .gitattributes file and have it auto-substitute some
version information during 'git archive'. Of course, if we actually *do*
have a .git directory, continue to use that.
While we're here, add a new 'bup version' command and alias "bup --version"
and "bup -V" to call it, since those options are pretty standard.
vfs: take advantage of bup chunking to make file seeking faster.
If you have a huge file, you can now seek around inside it (eg. in 'bup
fuse') without having to read its entire contents. Calculating the file
size is also really fast now.
This makes a bup fuse-mounted filesystem much more useful for real-time
access. For example, I was able to connect to an sqlite3 database and have
it work at a reasonable speed. (Obviously, since 'bup fuse' is written in
python and doesn't currently support threading, the speed could still be
improved, but at least it's no longer godawful.)
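
One way to picture the speedup: if the size of every chunk is known, finding
the chunk that holds a given byte offset is a binary search rather than a
full read. The sketch below assumes a flat list of chunk sizes, which is a
simplification; how bup actually stores chunk offsets is not described here.

```python
import bisect


def build_offsets(chunk_sizes):
    """Return (start offset of each chunk, total file size)."""
    offsets, total = [], 0
    for size in chunk_sizes:
        offsets.append(total)
        total += size
    return offsets, total


def locate(offsets, ofs):
    """Return (chunk_index, offset_within_chunk) for a byte offset."""
    i = bisect.bisect_right(offsets, ofs) - 1
    return i, ofs - offsets[i]
```

The total returned by build_offsets() is also the file size, so no chunk
contents need to be read to compute it.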
git.CatPipe: more resilience against weird errors.
Notably, MemoryErrors thrown because the file we're trying to load into
memory is too big to load all at once. Now the MemoryError gets thrown, but
the main program is potentially able to recover from it because CatPipe at
least doesn't get into an inconsistent state.
Also we can recover nicely if some lamer kills our git-cat-file subprocess.
The AutoFlushIter we were using for this purpose turns out to not have been
good enough, and it's never been used anywhere but in CatPipe, so I've
revised it further and renamed it to git.AbortableIter.
cmd/save: when a file is chunked, mangle its name from * to *.bup
Files that are already named *.bup are renamed to *.bup.bupl, so that we can
just always drop either .bup or .bupl from a filename if it's there, and the
result will be the original filename.
Also updated lib/bup/vfs.py to demangle the names appropriately, and treat
git trees named *.bup as real chunked files (ie. by joining them back
together).
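
The naming rule can be sketched as a pair of functions. The names and
signatures are chosen for this example, and the treatment of names that
already end in .bupl is an assumption added to keep the round trip safe.

```python
def mangle(name, chunked):
    """Return the name to store in the tree for a file."""
    if chunked:
        return name + ".bup"
    if name.endswith(".bup") or name.endswith(".bupl"):
        # Protect names that would otherwise look mangled. Handling of
        # names already ending in .bupl is an assumption of this sketch.
        return name + ".bupl"
    return name


def demangle(name):
    """Return (original_name, was_chunked)."""
    if name.endswith(".bupl"):
        return name[:-len(".bupl")], False
    if name.endswith(".bup"):
        return name[:-len(".bup")], True
    return name, False
```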
Also change main.py to search around in appropriate places for the installed
library files. By default, if your bup is in /usr/bin/bup, it'll look in
/usr/lib/bup. (It drops two words off the end of the filename and adds
/lib/bup to the end.)
This also makes the Debian packager at
http://git.debian.org/collab-maint/bup
actually produce a usable package.
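
The default lookup rule amounts to the following path manipulation. The
guess_libdir() name is hypothetical, and posixpath is used so the example
behaves the same on any platform.

```python
import posixpath


def guess_libdir(exe_path):
    """Drop the last two path components, then append lib/bup."""
    prefix = posixpath.dirname(posixpath.dirname(exe_path))
    return posixpath.join(prefix, "lib", "bup")
```

So guess_libdir('/usr/bin/bup') gives '/usr/lib/bup'.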
It's annoying when your log messages come out before stdout messages do.
But it's equally annoying (and inefficient) to have to flush every time you
print something. This seems like a nice compromise.
Get rid of a sha-related DeprecationWarning in python 2.6.
hashlib is only available in python 2.5 or higher, but the 'sha' module
produces a DeprecationWarning in python 2.6 or higher. We want to support
python 2.4 and above without any stupid warnings, so let's try using
hashlib. If it fails, switch to the old sha module.
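
The fallback looks roughly like this; the Sha1 wrapper name is an assumption
of this sketch. On any Python that has hashlib, only the first branch runs.

```python
try:
    import hashlib

    def Sha1(data=b""):
        return hashlib.sha1(data)
except ImportError:
    # Python 2.4 only: hashlib is missing, so use the old module.
    import sha

    def Sha1(data=""):
        return sha.new(data)
```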
Avery Pennarun [Sun, 21 Mar 2010 20:48:23 +0000 (16:48 -0400)]
server: only suggest a max of one pack per receive-objects cycle.
Since the client only handles one at a time and forgets the others anyway,
suggesting others is a bit of a waste of time... and because of the cheating
way we figure out which index to suggest when using a midx, suggesting packs
is more expensive than it should be anyway.
The "correct" fix in the long term will be to make the client accept
multiple suggestions at once, plus make midx files a little smarter about
figuring out which pack is the one that needs to be suggested. But in the
meantime, this makes things a little nicer: there are fewer confusing log
messages from the server, and a lot less disk grinding related to looking
into which pack to suggest, followed by finding out that we've already
suggested that pack anyway.
Avery Pennarun [Sun, 21 Mar 2010 04:41:52 +0000 (00:41 -0400)]
rbackup-cmd: we can now back up a *remote* machine to a *local* server.
The -r option to split and save allowed you to back up from a local machine
to a remote server, but that doesn't always work; sometimes the machine you
want to back up is out on the Internet, and the backup repo is safe behind a
firewall. In that case, you can ssh *out* from the secure backup machine to
the public server, but not vice versa, and you were out of luck. Some
people have apparently been doing this:
ssh publicserver tar -c / | bup split -n publicserver
(ie. running tar remotely, piped to a local bup split) but that isn't
efficient, because it sends *all* the data from the remote server over the
network before deduplicating it locally. Now you can do instead:
bup rbackup publicserver index -vux /
bup rbackup publicserver save -n publicserver /
And get all the usual advantages of 'bup save -r', except the server runs
locally and the client runs remotely.
Avery Pennarun [Sun, 21 Mar 2010 03:03:53 +0000 (23:03 -0400)]
client: Extract 'bup server' connection code into its own module.
The screwball function we use to let us run 'bup xxx' on a remote server
after correctly setting the PATH variable is about to become useful for more
than just 'bup server'.
Avery Pennarun [Sun, 21 Mar 2010 04:36:31 +0000 (00:36 -0400)]
options: allow user to specify an alternative to getopt.gnu_getopt.
The most likely alternative is getopt.getopt, which doesn't rearrange
arguments. That would mean "-a foo -p" is considered as the option "-a"
followed by the non-option arguments ['foo', '-p'].
The non-gnu behaviour is annoying most of the time, but can be useful when
you're receiving command lines that you want to pass verbatim to someone
else.
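
The difference is easy to see with the standard getopt module directly. The
option string "ap" is made up for this example, and the gnu_getopt result
assumes POSIXLY_CORRECT is not set in the environment.

```python
import getopt

args = ["-a", "foo", "-p"]

# GNU style: options may appear after non-option arguments.
gnu_opts, gnu_rest = getopt.gnu_getopt(args, "ap")

# Plain getopt: parsing stops at the first non-option argument.
posix_opts, posix_rest = getopt.getopt(args, "ap")
```

Here gnu_rest is ['foo'] with both -a and -p parsed as options, while
posix_rest is ['foo', '-p'] with only -a parsed.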
Avery Pennarun [Sun, 21 Mar 2010 05:47:24 +0000 (01:47 -0400)]
save/index/drecurse: correct handling for fifos and nonexistent paths.
When indexing a fifo, you can try to open it (for security reasons) but it
has to be O_NDELAY just in case the fifo doesn't have anyone on the other
end; otherwise indexing can freeze.
In index.reduce_paths(), we weren't reporting ENOENT for reasons I can no
longer remember, but I think they must have been wrong. Obviously if
someone specifies a nonexistent path on the command line, we should barf
rather than silently not back it up.
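
A minimal sketch of the non-blocking open on a POSIX system. The helper name
is hypothetical; os.O_NONBLOCK is the modern spelling of O_NDELAY.

```python
import os


def open_fifo_nonblocking(path):
    """Open a fifo for reading without waiting for a writer."""
    # O_NONBLOCK (historically O_NDELAY) makes open() return at once
    # even when nobody has the other end of the fifo open.
    return os.open(path, os.O_RDONLY | os.O_NONBLOCK)
```

A nonexistent path still raises OSError with errno set to ENOENT, which is
the error that should now be reported instead of swallowed.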
Avery Pennarun [Sun, 21 Mar 2010 04:34:21 +0000 (00:34 -0400)]
main.py: don't leak a file descriptor.
subprocess.Popen() is a little weird about when it closes the file
descriptors you give it. In this case, we have to dup() it because if
stderr=2 (the default) and stdout=2 (because fix_stderr), it'll close fd 2.
But if we dup it first, it *won't* close the dup, because stdout!=stderr.
So we have to dup it, but then we have to close it ourselves.
This was apparently harmless (it just resulted in an extra fd#3 getting
passed around to subprocesses as a clone of fd#2) but it was still wrong.
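
The general shape of the fix, as a sketch: duplicate the descriptor, hand the
duplicate to the child, then close the duplicate ourselves. The function name
is hypothetical and the code is not taken from main.py.

```python
import os
import subprocess
import sys


def run_with_stdout_to(fd, argv):
    """Run argv with its stdout sent to a duplicate of fd."""
    dup_fd = os.dup(fd)
    try:
        return subprocess.call(argv, stdout=dup_fd)
    finally:
        # The subprocess module will not close the duplicate for us.
        os.close(dup_fd)
```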
Lukasz Kosewski [Mon, 15 Mar 2010 03:20:08 +0000 (23:20 -0400)]
cmd/index-cmd.py: How it pains me to have to explicitly close() stuff
If we don't explicitly close() the wr reader object while running
update-index, the corresponding writer object won't be able to unlink
its temporary file under Cygwin.