   1 % bup-split(1) Bup %BUP_VERSION%
   2 % Avery Pennarun <apenwarr@gmail.com>
   3 % %BUP_DATE%
   4
   5 # NAME
   6
   7 bup-split - save individual files to bup backup sets
   8
   9 # SYNOPSIS
  10
  11 bup split \[-t\] \[-c\] \[-n *name*\] COMMON\_OPTIONS
  12
  13 bup split -b COMMON\_OPTIONS
  14
  15 bup split \<\--noop \[\--copy\]|\--copy\> COMMON\_OPTIONS
  16
  17 COMMON\_OPTIONS
  18   ~ \[-r *host*:*path*\] \[-v\] \[-q\] \[-d *seconds-since-epoch*\] \[\--bench\]
  19     \[\--max-pack-size=*bytes*\] \[-#\] \[\--bwlimit=*bytes/sec*\]
  20     \[\--max-pack-objects=*numobjs*\] \[\--fanout=*numobjs*\]
  21     \[\--keep-boundaries\] \[\--git-ids | filenames...\]
  22
  23 # DESCRIPTION
  24
  25 `bup split` concatenates the contents of the given files
  26 (or if no filenames are given, reads from stdin), splits
  27 the content into chunks of around 8k using a rolling
  28 checksum algorithm, and saves the chunks into a bup
  29 repository.  Chunks which have previously been stored are
  30 not stored again (i.e., they are 'deduplicated').
  31
  32 Because of the way the rolling checksum works, chunks
  33 tend to be very stable across changes to a given file,
  34 including adding, deleting, and changing bytes.
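
The stability described above can be illustrated with a minimal Python sketch. It is not bup's actual "bupsplit" implementation: the polynomial rolling hash, the constants `WINDOW`, `BASE` and `MOD`, and the helper names `split_chunks` and `chunk_ids` are assumptions made for illustration. Only the idea comes from the text: a boundary is declared wherever a checksum over the most recent bytes matches a fixed bit pattern, which happens roughly once every 8k bytes.

```python
import hashlib
import random

WINDOW = 64                    # bytes the rolling hash looks back over (assumed)
BASE = 257                     # polynomial base (assumed)
MOD = (1 << 31) - 1            # prime modulus (assumed)
MASK = (1 << 13) - 1           # 13 low bits: a boundary about every 8192 bytes
OUT = pow(BASE, WINDOW, MOD)   # weight of the byte that leaves the window


def split_chunks(data):
    """Return content-defined chunks; joined together they equal data."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                  # newest byte enters
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT) % MOD   # oldest byte leaves
        if (h & MASK) == MASK:                       # depends only on recent bytes
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks


def chunk_ids(data):
    """Identify each chunk by a hash of its content, as a deduplicating store might."""
    return {hashlib.sha1(c).hexdigest() for c in split_chunks(data)}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(256 * 1024))
edited = original[:100_000] + b"a few inserted bytes" + original[100_000:]

old_ids, new_ids = chunk_ids(original), chunk_ids(edited)
print(len(old_ids), "chunks;", len(new_ids - old_ids), "new after the edit")
```

Because each boundary depends only on the last `WINDOW` bytes, an insertion should disturb only the chunks next to it; chunks further along reappear unchanged and would not need to be stored again.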
  35
  36 For example, if you use `bup split` to back up an XML dump
  37 of a database, and the XML file changes slightly from one
  38 run to the next, nearly all the data will still be
  39 deduplicated and the size of each backup after the first
  40 will typically be quite small.
  41
  42 Another technique is to pipe the output of the `tar`(1) or
  43 `cpio`(1) programs to `bup split`.  When individual files
  44 in the tarball change slightly or are added or removed, bup
  45 still processes the remainder of the tarball efficiently.
  46 (Note that `bup save` is usually a more efficient way to
  47 accomplish this, however.)
  48
  49 To get the data back, use `bup-join`(1).
  50
  51 # MODES
  52
  53 These options select the primary behavior of the command, with -n
  54 being the most likely choice.
  55
  56 -n, \--name=*name*
  57 :   after creating the dataset, create a git branch
  58     named *name* so that it can be accessed using
  59     that name.  If *name* already exists, the new dataset
  60     will be considered a descendant of the old *name*.
  61     (Thus, you can continually create new datasets with
  62     the same name, and later view the history of that
  63     dataset to see how it has changed over time.)  The original data
  64     will also be available as a top-level file named "data" in the VFS,
  65     accessible via `bup fuse`, `bup ftp`, etc.
  66
  67 -t, \--tree
  68 :   output the git tree id of the resulting dataset.
  69
  70 -c, \--commit
  71 :   output the git commit id of the resulting dataset.
  72
  73 -b, \--blobs
  74 :   output a series of git blob ids that correspond to the chunks in
  75     the dataset.  Incompatible with -n, -t, and -c.
  76
  77 \--noop
  78 :   read the data and split it into blocks based on the "bupsplit"
  79     rolling checksum algorithm, but don't do anything with the blocks.
  80     This is mostly useful for benchmarking.  Incompatible with -n, -t,
  81     -c, and -b.
  82
  83 \--copy
  84 :   like `--noop`, but also write the data to stdout.  This can be
  85     useful for benchmarking the speed of read+bupsplit+write for large
  86     amounts of data.  Incompatible with -n, -t, -c, and -b.
  87
  88 # OPTIONS
  89
  90 -r, \--remote=*host*:*path*
  91 :   save the backup set to the given remote server.  If *path* is
  92     omitted, uses the default path on the remote server (you still
  93     need to include the ':').  The connection to the remote server is
  94     made with SSH.  If you'd like to specify which port, user or
  95     private key to use for the SSH connection, we recommend you use
  96     the `~/.ssh/config` file.  Even though the destination is remote,
  97     a local bup repository is still required.
  98
  99 -d, \--date=*seconds-since-epoch*
 100 :   specify the date inscribed in the commit (seconds since 1970-01-01 00:00:00 UTC).
 101
 102 -q, \--quiet
 103 :   disable progress messages.
 104
 105 -v, \--verbose
 106 :   increase verbosity (can be used more than once).
 107
 108 \--git-ids
 109 :   stdin is a list of git object ids instead of raw data.
 110     `bup split` will read the contents of each named git
 111     object (if it exists in the bup repository) and split
 112     it.  This might be useful for converting a git
 113     repository with large binary files to use bup-style
 114     hashsplitting instead.  This option is probably most
 115     useful when combined with `--keep-boundaries`.
 116
 117 \--keep-boundaries
 118 :   if multiple filenames are given on the command line,
 119     they are normally concatenated together as if the
 120     content all came from a single file.  That is, the
 121     set of blobs/trees produced is identical to what it
 122     would have been if there had been a single input file.
 123     However, if you use `--keep-boundaries`, each file is
 124     split separately.  You still only get a single tree or
 125     commit or series of blobs, but each blob comes from
 126     only one of the files; the end of one of the input
 127     files always ends a blob.
 128
 129 \--bench
 130 :   print benchmark timings to stderr.
 131
 132 \--max-pack-size=*bytes*
 133 :   never create git packfiles larger than the given number
 134     of bytes.  Default is 1 billion bytes.  Usually there
 135     is no reason to change this.
 136
 137 \--max-pack-objects=*numobjs*
 138 :   never create git packfiles with more than the given
 139     number of objects.  Default is 200 thousand objects.
 140     Usually there is no reason to change this.
 141
 142 \--fanout=*numobjs*
 143 :   when splitting very large files, try to keep the number
 144     of elements in trees to an average of *numobjs*.
 145
 146 \--bwlimit=*bytes/sec*
 147 :   don't transmit more than *bytes/sec* bytes per second
 148     to the server.  This is good for making your backups
 149     not suck up all your network bandwidth.  Use a suffix
 150     like k, M, or G to specify multiples of 1024,
 151     1024\*1024, or 1024\*1024\*1024, respectively.
 152
 153 -*#*, \--compress=*#*
 154 :   set the compression level to # (a value from 0-9, where
 155     9 is the highest and 0 is no compression).  The default
 156     is 1 (fast, loose compression).
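
The effect of `--keep-boundaries`, described earlier in this section, can be sketched in Python. The splitter here is a toy stand-in (a CRC over a short window, with made-up `WINDOW` and `MASK` values), and `split_stream` and `split_files` are hypothetical names, not part of bup. The sketch only shows the difference in where blobs are allowed to end.

```python
import zlib

WINDOW, MASK = 16, 0x3F   # toy values: a boundary roughly every 64 bytes


def split_stream(data):
    """Toy content-defined splitter (a stand-in, not bup's real algorithm)."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1):i + 1]
        if (zlib.crc32(window) & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def split_files(files, keep_boundaries=False):
    """Split several inputs either as one stream or one file at a time."""
    if keep_boundaries:
        # Each file is split on its own, so the end of a file ends a blob.
        return [chunk for f in files for chunk in split_stream(f)]
    # Default: the inputs behave like a single concatenated file.
    return split_stream(b"".join(files))


files = [bytes(range(256)) * 3, b"second input file " * 40]
merged = split_files(files)
separate = split_files(files, keep_boundaries=True)
print(len(merged), "blobs without the option,", len(separate), "with it")
```

In both cases the blobs concatenate back to the same data; with `keep_boundaries=True` no blob spans two input files.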
 157
 158
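The suffix arithmetic for `--bwlimit` can be checked with a short Python sketch. `parse_rate` is a hypothetical helper, not bup's own parser, and treating the suffix case-insensitively is an assumption.

```python
SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_rate(text):
    """Convert a string such as '512k' or '2M' to bytes per second."""
    text = text.strip()
    factor = SUFFIXES.get(text[-1:].lower())
    if factor is None:
        return int(text)              # no suffix: plain bytes per second
    return int(text[:-1]) * factor


print(parse_rate("512k"))  # 524288
```

Under these assumptions, `--bwlimit=512k` would correspond to 524288 bytes per second.
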
 159 # EXAMPLES
 160
 161     $ tar -cf - /etc | bup split -r myserver: -n mybackup-tar
 162     tar: Removing leading `/' from member names
 163     Indexing objects: 100% (196/196), done.
 164
 165     $ bup join -r myserver: mybackup-tar | tar -tf - | wc -l
 166     1961
 167
 168
 169 # SEE ALSO
 170
 171 `bup-join`(1), `bup-index`(1), `bup-save`(1), `bup-on`(1), `ssh_config`(5)
 172
 173 # BUP
 174
 175 Part of the `bup`(1) suite.