   1 % bup-split(1) Bup %BUP_VERSION%
   2 % Avery Pennarun <apenwarr@gmail.com>
   3 % %BUP_DATE%
   4
   5 # NAME
   6
   7 bup-split - save individual files to bup backup sets
   8
   9 # SYNOPSIS
  10
  11 bup split [-r *host*:*path*] <-b|-t|-c|-n *name*> [-v] [-q]
  12   [--bench] [--max-pack-size=*bytes*]
  13   [--max-pack-objects=*n*] [--fanout=*count*]
  14   [--git-ids] [--keep-boundaries] [filenames...]
  15
  16 # DESCRIPTION
  17
  18 `bup split` concatenates the contents of the given files
  19 (or, if no filenames are given, reads from stdin), splits
  20 the content into chunks of around 8 KiB using a rolling
  21 checksum algorithm, and saves the chunks into a bup
  22 repository.  Chunks that have previously been stored are
  23 not stored again (i.e., they are "deduplicated").
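
The deduplication described above can be pictured as a content-addressed store: each chunk is filed under an id derived from its contents, so a chunk seen before maps to an id that is already present. The sketch below is an illustration only. The names `git_blob_id` and `ChunkStore` are hypothetical, and an in-memory dictionary stands in for the bup repository; the id calculation follows the standard git blob hashing scheme (SHA-1 over a `blob <length>\0` header plus the content).

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ChunkStore:
    """Hypothetical in-memory stand-in for a repository of chunks."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> bool:
        """Store a chunk; return False if it was already present."""
        oid = git_blob_id(chunk)
        if oid in self.objects:
            return False  # deduplicated: nothing new is written
        self.objects[oid] = chunk
        return True


store = ChunkStore()
first_run = [store.put(c) for c in (b"alpha", b"beta", b"gamma")]
second_run = [store.put(c) for c in (b"alpha", b"beta", b"delta")]
print(first_run)           # every chunk is new on the first run
print(second_run)          # only the changed chunk is stored again
print(len(store.objects))  # number of distinct chunks held
```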
  24
  25 Because of the way the rolling checksum works, chunks
  26 tend to be very stable across changes to a given file,
  27 including adding, deleting, and changing bytes.
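
The stability claim can be illustrated with a small content-defined chunker. This is a sketch under stated assumptions, not the actual bupsplit code: it uses an rsync-style rolling sum over a 64-byte window and ends a chunk whenever the low 13 bits of the sum are all ones (about once per 8192 bytes on random input). The names `split_chunks`, `WINDOW`, and `SPLIT_MASK` are made up for this example. Because each boundary decision depends only on the most recent window of bytes, boundaries fall back into the same places shortly after an edit.

```python
import hashlib
import random

WINDOW = 64                 # bytes covered by the rolling sum (assumed value)
SPLIT_MASK = (1 << 13) - 1  # 13 bits: a boundary roughly every 8192 bytes


def split_chunks(data: bytes) -> list[bytes]:
    """Split data where a rolling sum of the last WINDOW bytes matches SPLIT_MASK."""
    chunks = []
    start = 0
    s1 = 0  # plain sum of the bytes in the window
    s2 = 0  # position-weighted sum of the same bytes
    for i, new in enumerate(data):
        old = data[i - WINDOW] if i >= WINDOW else 0
        s1 += new - old
        s2 += s1 - WINDOW * old
        if (s2 & SPLIT_MASK) == SPLIT_MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(data: bytes) -> set[str]:
    """Identify each chunk by the SHA-1 of its content."""
    return {hashlib.sha1(c).hexdigest() for c in split_chunks(data)}


rng = random.Random(0)
original = rng.randbytes(200_000)
edited = original[:100_000] + b"a few inserted bytes" + original[100_000:]

old_ids = chunk_ids(original)
new_ids = chunk_ids(edited)
print("chunks before edit:", len(old_ids))
print("chunks after edit: ", len(new_ids))
print("chunks in common:  ", len(old_ids & new_ids))
```

A fixed-size splitter would behave very differently here: inserting bytes in the middle would shift every later block and change all of their ids.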
  28
  29 For example, if you use `bup split` to back up an XML dump
  30 of a database, and the XML file changes slightly from one
  31 run to the next, nearly all the data will still be
  32 deduplicated and the size of each backup after the first
  33 will typically be quite small.
  34
  35 Another technique is to pipe the output of the `tar`(1) or
  36 `cpio`(1) programs to `bup split`.  When individual files
  37 in the tarball change slightly or are added or removed, bup
  38 still processes the remainder of the tarball efficiently.
  39 (Note that `bup save` is usually a more efficient way to
  40 accomplish this, however.)
  41
  42 To get the data back, use `bup-join`(1).
  43
  44 # OPTIONS
  45
  46 -r, --remote=*host*:*path*
  47 :   save the backup set to the given remote server.  If
  48     *path* is omitted, uses the default path on the remote
  49     server (you still need to include the ':').
  50
  51 -b, --blobs
  52 :   output a series of git blob ids that correspond to the
  53     chunks in the dataset.
  54
  55 -t, --tree
  56 :   output the git tree id of the resulting dataset.
  57
  58 -c, --commit
  59 :   output the git commit id of the resulting dataset.
  60
  61 -n, --name=*name*
  62 :   after creating the dataset, create a git branch
  63     named *name* so that it can be accessed using
  64     that name.  If *name* already exists, the new dataset
  65     will be considered a descendant of the old *name*.
  66     (Thus, you can continually create new datasets with
  67     the same name, and later view the history of that
  68     dataset to see how it has changed over time.)
  69
  70 -q, --quiet
  71 :   disable progress messages.
  72
  73 -v, --verbose
  74 :   increase verbosity (can be used more than once).
  75
  76 --git-ids
  77 :   stdin is a list of git object ids instead of raw data.
  78     `bup split` will read the contents of each named git
  79     object (if it exists in the bup repository) and split
  80     it.  This might be useful for converting a git
  81     repository with large binary files to use bup-style
  82     hashsplitting instead.  This option is probably most
  83     useful when combined with `--keep-boundaries`.
  84
  85 --keep-boundaries
  86 :   if multiple filenames are given on the command line,
  87     they are normally concatenated together as if the
  88     content all came from a single file.  That is, the
  89     set of blobs/trees produced is identical to what it
  90     would have been if there had been a single input file.
  91     However, if you use `--keep-boundaries`, each file is
  92     split separately.  You still only get a single tree or
  93     commit or series of blobs, but each blob comes from
  94     only one of the files; the end of one of the input
  95     files always ends a blob.
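
The difference between the two modes can be shown with a toy model. The splitter here is a deliberate stand-in that ends a chunk after every `;` byte, not a rolling checksum, and the names `toy_split` and `split_files` are hypothetical. Only the treatment of file boundaries is meant to mirror the description above.

```python
from typing import Iterable


def toy_split(data: bytes) -> list[bytes]:
    """Stand-in splitter: end a chunk after every ';' byte."""
    chunks = []
    start = 0
    for i, byte in enumerate(data):
        if byte == ord(";"):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def split_files(files: Iterable[bytes], keep_boundaries: bool = False) -> list[bytes]:
    """Split several inputs, either as one stream or file by file."""
    files = list(files)
    if keep_boundaries:
        # Each file is split on its own, so no chunk spans two files.
        return [chunk for content in files for chunk in toy_split(content)]
    # Default: behave as if all content came from a single file.
    return toy_split(b"".join(files))


files = [b"aa;bb", b"cc;dd"]
print(split_files(files))                        # [b'aa;', b'bbcc;', b'dd']
print(split_files(files, keep_boundaries=True))  # [b'aa;', b'bb', b'cc;', b'dd']
```

In the default mode the chunk `bbcc;` contains data from both inputs; with boundaries kept, the end of the first input closes a chunk.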
  96
  97 --noop
  98 :   read the data and split it into blocks based on the "bupsplit"
  99     rolling checksum algorithm, but don't do anything with
 100     the blocks.  This is mostly useful for benchmarking.
 101
 102 --copy
 103 :   like `--noop`, but also write the data to stdout.  This
 104     can be useful for benchmarking the speed of read+bupsplit+write
 105     for large amounts of data.
 106
 107 --bench
 108 :   print benchmark timings to stderr.
 109
 110 --max-pack-size=*bytes*
 111 :   never create git packfiles larger than the given number
 112     of bytes.  Default is 1 billion bytes.  Usually there
 113     is no reason to change this.
 114
 115 --max-pack-objects=*numobjs*
 116 :   never create git packfiles with more than the given
 117     number of objects.  Default is 200 thousand objects.
 118     Usually there is no reason to change this.
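
The two pack limits act together: a new packfile is started as soon as either one would be exceeded. The following sketch models that rule on a list of object sizes. The function name `assign_to_packs` is hypothetical, the defaults are the ones quoted above, and the exact moment at which bup itself rolls over to a new pack is not specified here, so treat the check as an assumption.

```python
MAX_PACK_SIZE = 1_000_000_000  # default from the text: 1 billion bytes
MAX_PACK_OBJECTS = 200_000     # default from the text: 200 thousand objects


def assign_to_packs(object_sizes, max_size=MAX_PACK_SIZE, max_objects=MAX_PACK_OBJECTS):
    """Group object sizes into packs that respect both limits."""
    packs = []
    pack, pack_bytes = [], 0
    for size in object_sizes:
        # Start a new pack if adding this object would break either limit.
        if pack and (pack_bytes + size > max_size or len(pack) >= max_objects):
            packs.append(pack)
            pack, pack_bytes = [], 0
        pack.append(size)
        pack_bytes += size
    if pack:
        packs.append(pack)
    return packs


# Tiny limits make the behaviour visible.
packs = assign_to_packs([400, 400, 400, 100, 100, 100], max_size=1000, max_objects=3)
print(packs)  # [[400, 400], [400, 100, 100], [100]]
```

The first pack closes because of the byte limit and the second because of the object limit.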
 119
 120 --fanout=*numobjs*
 121 :   when splitting very large files, never put more than
 122     this number of git blobs in a single git tree.  Instead,
 123     generate a new tree and link to that.  Default is
 124     4096 objects per tree.
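
One way to picture the fanout limit is as a grouping step applied repeatedly until the top-level tree is small enough. The sketch below uses nested Python lists as stand-ins for git trees. The helper names are hypothetical, and the way bup actually decides where one tree ends and the next begins may differ; only the "at most *fanout* entries per tree" constraint is taken from the description above.

```python
def build_tree(blob_ids, fanout=4096):
    """Nest ids into lists so that no list has more than `fanout` entries."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(blob_ids)
    while len(level) > fanout:
        # Too many entries for one tree: wrap runs of entries in subtrees.
        level = [level[i:i + fanout] for i in range(0, len(level), fanout)]
    return level


def depth(node):
    """Number of tree levels above the blobs."""
    return 1 + max(depth(child) for child in node) if isinstance(node, list) else 0


def leaves(node):
    """Blob ids in their original order."""
    if not isinstance(node, list):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


tree = build_tree(range(10), fanout=3)
print(tree)
print(depth(tree))   # 3 levels of trees for 10 blobs at fanout 3
print(leaves(tree))  # the original order is preserved
```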
 125
 126 --bwlimit=*bytes/sec*
 127 :   don't transmit more than *bytes/sec* bytes per second
 128     to the server.  This is good for making your backups
 129     not suck up all your network bandwidth.  Use a suffix
 130     like k, M, or G to specify multiples of 1024,
 131     1024\*1024, 1024\*1024\*1024 respectively.
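
The suffix arithmetic can be sketched as follows. The helper name `parse_rate` is hypothetical and only the three suffixes named above are handled; bup's own option parser may accept other spellings.

```python
SUFFIXES = {
    "k": 1024,
    "M": 1024 * 1024,
    "G": 1024 * 1024 * 1024,
}


def parse_rate(text: str) -> int:
    """Convert a value such as '512k' or '2M' into bytes per second."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)  # no suffix: plain bytes per second


for value in ("1000", "512k", "2M", "1G"):
    print(value, "->", parse_rate(value))
```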
 132
 133
 134 # EXAMPLE
 135
 136     $ tar -cf - /etc | bup split -r myserver: -n mybackup-tar
 137     tar: Removing leading `/' from member names
 138     Indexing objects: 100% (196/196), done.
 139
 140     $ bup join -r myserver: mybackup-tar | tar -tf - | wc -l
 141     1961
 142
 143
 144 # SEE ALSO
 145
 146 `bup-join`(1), `bup-index`(1), `bup-save`(1), `bup-on`(1)
 147
 148 # BUP
 149
 150 Part of the `bup`(1) suite.