1 % bup-split(1) Bup %BUP_VERSION%
2 % Avery Pennarun <apenwarr@gmail.com>
# NAME

bup-split - save individual files to bup backup sets

# SYNOPSIS
11 bup split \[-t\] \[-c\] \[-n *name*\] COMMON\_OPTIONS
13 bup split -b COMMON\_OPTIONS
15 bup split --copy COMMON\_OPTIONS
17 bup split --noop \[-t|-b\] COMMON\_OPTIONS
COMMON\_OPTIONS
  ~ \[-r *host*:*path*\] \[-v\] \[-q\] \[-d *seconds-since-epoch*\] \[\--bench\]
    \[\--max-pack-size=*bytes*\] \[-#\] \[\--bwlimit=*bytes*\]
    \[\--max-pack-objects=*n*\] \[\--fanout=*count*\]
    \[\--keep-boundaries\] \[\--git-ids | filenames...\]
# DESCRIPTION

`bup split` concatenates the contents of the given files
28 (or if no filenames are given, reads from stdin), splits
29 the content into chunks of around 8k using a rolling
30 checksum algorithm, and saves the chunks into a bup
31 repository. Chunks which have previously been stored are
not stored again (i.e., they are 'deduplicated').
34 Because of the way the rolling checksum works, chunks
35 tend to be very stable across changes to a given file,
36 including adding, deleting, and changing bytes.
38 For example, if you use `bup split` to back up an XML dump
39 of a database, and the XML file changes slightly from one
40 run to the next, nearly all the data will still be
41 deduplicated and the size of each backup after the first
42 will typically be quite small.
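
A sketch of such a run (hypothetical file and branch names; output
omitted):

    $ bup split -n mydb dump.xml    # first run: stores all the chunks
    $ bup split -n mydb dump.xml    # after small edits: stores only new chunks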
44 Another technique is to pipe the output of the `tar`(1) or
45 `cpio`(1) programs to `bup split`. When individual files
46 in the tarball change slightly or are added or removed, bup
47 still processes the remainder of the tarball efficiently.
(Note that `bup save` is usually a more efficient way to
accomplish this.)
51 To get the data back, use `bup-join`(1).
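
Continuing the hypothetical example above, the saved stream can be
reassembled to stdout:

    $ bup join mydb > restored.xml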
# MODES

These options select the primary behavior of the command, with -n
being the most likely choice. (A brief sketch of how the modes
differ follows the list.)
-n, \--name=*name*
: after creating the dataset, create a git branch
60 named *name* so that it can be accessed using
61 that name. If *name* already exists, the new dataset
62 will be considered a descendant of the old *name*.
63 (Thus, you can continually create new datasets with
64 the same name, and later view the history of that
65 dataset to see how it has changed over time.) The original data
66 will also be available as a top-level file named "data" in the VFS,
67 accessible via `bup fuse`, `bup ftp`, etc.
-t, \--tree
: output the git tree id of the resulting dataset.
-c, \--commit
: output the git commit id of the resulting dataset.
-b, \--blobs
: output a series of git blob ids that correspond to the chunks in
77 the dataset. Incompatible with -n, -t, and -c.
\--noop
: read the data and split it into blocks based on the "bupsplit"
81 rolling checksum algorithm, but don't store anything in the repo.
82 Can be combined with -b or -t to compute (but not store) the git
83 blobs or tree ids for the dataset. This is mostly useful for
benchmarking and validating the bupsplit algorithm. Incompatible
with -n and -c.

\--copy
: like `--noop`, but also write the data to stdout. This can be
89 useful for benchmarking the speed of read+bupsplit+write for large
90 amounts of data. Incompatible with -n, -t, -c, and -b.
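
As a rough sketch of how the modes differ (hypothetical invocations;
`somefile` is a placeholder, and the ids go to stdout):

    $ bup split -n mydata < somefile   # store chunks, update branch "mydata"
    $ bup split -t < somefile          # store chunks, print the tree id
    $ bup split -b < somefile          # store chunks, print one blob id per chunk
    $ bup split --noop -t < somefile   # compute the tree id, store nothing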
# OPTIONS

-r, \--remote=*host*:*path*
95 : save the backup set to the given remote server. If *path* is
96 omitted, uses the default path on the remote server (you still
97 need to include the ':'). The connection to the remote server is
98 made with SSH. If you'd like to specify which port, user or
99 private key to use for the SSH connection, we recommend you use
100 the `~/.ssh/config` file. Even though the destination is remote,
101 a local bup repository is still required.
103 -d, \--date=*seconds-since-epoch*
104 : specify the date inscribed in the commit (seconds since 1970-01-01).
-q, \--quiet
: disable progress messages.
-v, \--verbose
: increase verbosity (can be used more than once).
\--git-ids
: stdin is a list of git object ids instead of raw data.
114 `bup split` will read the contents of each named git
115 object (if it exists in the bup repository) and split
116 it. This might be useful for converting a git
117 repository with large binary files to use bup-style
118 hashsplitting instead. This option is probably most
119 useful when combined with `--keep-boundaries`.
\--keep-boundaries
: if multiple filenames are given on the command line,
123 they are normally concatenated together as if the
124 content all came from a single file. That is, the
125 set of blobs/trees produced is identical to what it
126 would have been if there had been a single input file.
127 However, if you use `--keep-boundaries`, each file is
128 split separately. You still only get a single tree or
129 commit or series of blobs, but each blob comes from
130 only one of the files; the end of one of the input
131 files always ends a blob.
\--bench
: print benchmark timings to stderr.
136 \--max-pack-size=*bytes*
137 : never create git packfiles larger than the given number
138 of bytes. Default is 1 billion bytes. Usually there
139 is no reason to change this.
141 \--max-pack-objects=*numobjs*
142 : never create git packfiles with more than the given
143 number of objects. Default is 200 thousand objects.
144 Usually there is no reason to change this.
\--fanout=*count*
: when splitting very large files, try to keep the number
of elements in trees to an average of *count*.
150 \--bwlimit=*bytes/sec*
151 : don't transmit more than *bytes/sec* bytes per second
152 to the server. This is good for making your backups
153 not suck up all your network bandwidth. Use a suffix
154 like k, M, or G to specify multiples of 1024,
155 1024*1024, 1024*1024*1024 respectively.
157 -*#*, \--compress=*#*
158 : set the compression level to # (a value from 0-9, where
159 9 is the highest and 0 is no compression). The default
is 1 (fast, loose compression).
# EXAMPLES

    $ tar -cf - /etc | bup split -r myserver: -n mybackup-tar
    tar: Removing leading `/' from member names
    Indexing objects: 100% (196/196), done.
    $ bup join -r myserver: mybackup-tar | tar -tf - | wc -l
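
A round trip with some of the common options (hypothetical names;
`mydata.bin` is a placeholder):

    $ bup split -r myserver: -n mydata --compress=3 --bwlimit=1M mydata.bin
    $ bup join -r myserver: mydata | cmp - mydata.bin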
# SEE ALSO

`bup-join`(1), `bup-index`(1), `bup-save`(1), `bup-on`(1), `ssh_config`(5)
# BUP

Part of the `bup`(1) suite.