Documentation/bup-split.md

   1 % bup-split(1) Bup %BUP_VERSION%
   2 % Avery Pennarun <apenwarr@gmail.com>
   3 % %BUP_DATE%
   4
   5 # NAME
   6
   7 bup-split - save individual files to bup backup sets
   8
   9 # SYNOPSIS
  10
  11 bup split [-r *host*:*path*] \<-b|-t|-c|-n *name*\> [-v] [-q]
  12   [\--bench] [\--max-pack-size=*bytes*] [-#]
   13   [\--max-pack-objects=*numobjs*] [\--fanout=*numobjs*]
  14   [\--git-ids] [\--keep-boundaries] [filenames...]
  15
  16 # DESCRIPTION
  17
  18 `bup split` concatenates the contents of the given files
  19 (or if no filenames are given, reads from stdin), splits
  20 the content into chunks of around 8k using a rolling
  21 checksum algorithm, and saves the chunks into a bup
  22 repository.  Chunks which have previously been stored are
   23 not stored again (i.e., they are 'deduplicated').
  24
  25 Because of the way the rolling checksum works, chunks
  26 tend to be very stable across changes to a given file,
  27 including adding, deleting, and changing bytes.
  28
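The stability described above can be illustrated with a minimal sketch. It assumes a toy rolling sum over a 16-byte window and cuts a chunk whenever the low bits of that sum are all ones; this is a simplified stand-in, not bup's actual "bupsplit" algorithm. The names `WINDOW`, `MASK`, and `split_chunks` are hypothetical, and the chunk size is far smaller than bup's roughly 8k. Because each cut depends only on the last few bytes, inserting data near the start of the input changes only the chunks around the insertion, and the rest are found again unchanged.

```python
import random

WINDOW = 16          # bytes in the rolling window (illustrative value, not bup's)
MASK = (1 << 6) - 1  # cut when the low 6 bits of the sum are all ones (~64-byte chunks)


def split_chunks(data: bytes) -> list[bytes]:
    """Cut data wherever the sum of the last WINDOW bytes matches MASK."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        if (rolling & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks


random.seed(0)
original = bytes(random.randrange(256) for _ in range(20000))
modified = original[:5000] + b"a few inserted bytes" + original[5000:]

old_chunks = set(split_chunks(original))
new_chunks = split_chunks(modified)
reused = sum(1 for chunk in new_chunks if chunk in old_chunks)
print(f"{reused} of {len(new_chunks)} chunks were already stored")
```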
  29 For example, if you use `bup split` to back up an XML dump
  30 of a database, and the XML file changes slightly from one
  31 run to the next, nearly all the data will still be
  32 deduplicated and the size of each backup after the first
  33 will typically be quite small.
  34
  35 Another technique is to pipe the output of the `tar`(1) or
  36 `cpio`(1) programs to `bup split`.  When individual files
  37 in the tarball change slightly or are added or removed, bup
  38 still processes the remainder of the tarball efficiently.
  39 (Note that `bup save` is usually a more efficient way to
  40 accomplish this, however.)
  41
  42 To get the data back, use `bup-join`(1).
  43
  44 # OPTIONS
  45
  46 -r, \--remote=*host*:*path*
  47 :   save the backup set to the given remote server.  If
  48     *path* is omitted, uses the default path on the remote
  49     server (you still need to include the ':').  The connection to the
  50     remote server is made with SSH.  If you'd like to specify which port, user
  51     or private key to use for the SSH connection, we recommend you use the
  52     `~/.ssh/config` file.
  53
  54 -b, \--blobs
  55 :   output a series of git blob ids that correspond to the
  56     chunks in the dataset.
  57
  58 -t, \--tree
  59 :   output the git tree id of the resulting dataset.
  60
  61 -c, \--commit
  62 :   output the git commit id of the resulting dataset.
  63
  64 -n, \--name=*name*
  65 :   after creating the dataset, create a git branch
  66     named *name* so that it can be accessed using
  67     that name.  If *name* already exists, the new dataset
  68     will be considered a descendant of the old *name*.
  69     (Thus, you can continually create new datasets with
  70     the same name, and later view the history of that
  71     dataset to see how it has changed over time.)
  72
  73 -q, \--quiet
  74 :   disable progress messages.
  75
  76 -v, \--verbose
  77 :   increase verbosity (can be used more than once).
  78
  79 \--git-ids
  80 :   stdin is a list of git object ids instead of raw data.
  81     `bup split` will read the contents of each named git
  82     object (if it exists in the bup repository) and split
  83     it.  This might be useful for converting a git
  84     repository with large binary files to use bup-style
  85     hashsplitting instead.  This option is probably most
  86     useful when combined with `--keep-boundaries`.
  87
  88 \--keep-boundaries
  89 :   if multiple filenames are given on the command line,
  90     they are normally concatenated together as if the
  91     content all came from a single file.  That is, the
  92     set of blobs/trees produced is identical to what it
  93     would have been if there had been a single input file.
  94     However, if you use `--keep-boundaries`, each file is
  95     split separately.  You still only get a single tree or
  96     commit or series of blobs, but each blob comes from
  97     only one of the files; the end of one of the input
  98     files always ends a blob.
  99
 100 \--noop
 101 :   read the data and split it into blocks based on the "bupsplit"
 102     rolling checksum algorithm, but don't do anything with
 103     the blocks.  This is mostly useful for benchmarking.
 104
 105 \--copy
 106 :   like `--noop`, but also write the data to stdout.  This
 107     can be useful for benchmarking the speed of read+bupsplit+write
 108     for large amounts of data.
 109
 110 \--bench
 111 :   print benchmark timings to stderr.
 112
 113 \--max-pack-size=*bytes*
 114 :   never create git packfiles larger than the given number
 115     of bytes.  Default is 1 billion bytes.  Usually there
 116     is no reason to change this.
 117
 118 \--max-pack-objects=*numobjs*
 119 :   never create git packfiles with more than the given
 120     number of objects.  Default is 200 thousand objects.
 121     Usually there is no reason to change this.
 122
 123 \--fanout=*numobjs*
  124 :   when splitting very large files, try to keep the number
 125     of elements in trees to an average of *numobjs*.
 126
 127 \--bwlimit=*bytes/sec*
 128 :   don't transmit more than *bytes/sec* bytes per second
 129     to the server.  This is good for making your backups
 130     not suck up all your network bandwidth.  Use a suffix
  131     like k, M, or G to specify multiples of 1024,
  132     1024\*1024, or 1024\*1024\*1024, respectively.
 133
 134 -*#*, \--compress=*#*
 135 :   set the compression level to # (a value from 0-9, where
 136     9 is the highest and 0 is no compression).  The default
  137     is 1 (fast, loose compression).
 138
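As a rough illustration of the `--keep-boundaries` option described above, the following sketch contrasts the two modes. It assumes a hypothetical fixed-size splitter, `toy_split`, in place of bup's rolling checksum, and `split_files` is likewise an invented helper. Only the placement of file boundaries is meant to be representative: without the option the inputs behave like one concatenated stream, and with it the end of each file always ends a blob.

```python
def toy_split(data: bytes, size: int = 4) -> list[bytes]:
    """Hypothetical stand-in splitter: fixed-size pieces, NOT bup's rolling checksum."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def split_files(files: list[bytes], keep_boundaries: bool) -> list[bytes]:
    if keep_boundaries:
        # Each file is split on its own, so end-of-file always ends a blob.
        return [blob for content in files for blob in toy_split(content)]
    # Default: the inputs are treated as one concatenated stream.
    return toy_split(b"".join(files))


files = [b"AAAAAA", b"BBBBBB"]
print(split_files(files, keep_boundaries=False))  # [b'AAAA', b'AABB', b'BBBB']
print(split_files(files, keep_boundaries=True))   # [b'AAAA', b'AA', b'BBBB', b'BB']
```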
 139
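The suffix arithmetic for `--bwlimit` described above can be sketched as follows. `parse_rate` and `SUFFIXES` are hypothetical names used only for illustration, and the sketch assumes exactly the k, M, and G suffixes listed in the option text; bup's own parser may accept other forms.

```python
SUFFIXES = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_rate(text: str) -> int:
    """Convert a value such as '500k' or '2M' to bytes per second."""
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)  # no suffix: plain bytes per second


print(parse_rate("500k"))  # 512000
print(parse_rate("2M"))    # 2097152
```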
 140 # EXAMPLE
 141
 142     $ tar -cf - /etc | bup split -r myserver: -n mybackup-tar
  143     tar: Removing leading `/' from member names
 144     Indexing objects: 100% (196/196), done.
 145
 146     $ bup join -r myserver: mybackup-tar | tar -tf - | wc -l
 147     1961
 148
 149
 150 # SEE ALSO
 151
 152 `bup-join`(1), `bup-index`(1), `bup-save`(1), `bup-on`(1), `ssh_config`(5)
 153
 154 # BUP
 155
 156 Part of the `bup`(1) suite.