(One)-minute geek news

Group talks, Laboratoire de Chimie, ENS de Lyon

rsync


rsync is a very powerful file-copying tool that can be seen as a smarter combination of cp and scp, as it copies files efficiently both locally and remotely.

The basics

Usually, when you want to copy a file from one directory to another, you use the command cp. Similarly, if the source and the destination are not on the same machine (typically, you want to retrieve the results of your simulation from the cluster and save them on your local machine), you would use the command scp (which is basically cp over SSH, hence the name).

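rsync is a command that can be used in both of the previously mentioned situations! To copy a file locally: rsync TARGET DESTINATION (TARGET being the source of the copy). And remotely: rsync LOGIN@HOST:TARGET DESTINATION to retrieve a file, or rsync TARGET LOGIN@HOST:DESTINATION to send one. Note that the target and the destination cannot both be remote.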
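As a minimal sketch of the local form, the commands below can be tried safely in a throwaway directory. The directory layout and the file name results.txt are made up for illustration, and LOGIN@HOST stays a placeholder, so the remote forms are only shown as comments.

```shell
# Work in a throwaway directory so that nothing real is touched.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
echo "energy = -42.0" > "$demo/src/results.txt"

# Local copy, same argument order as cp: source first, destination last.
rsync "$demo/src/results.txt" "$demo/dst/"
cat "$demo/dst/results.txt"

# Remote copies follow the scp syntax (not run here, LOGIN@HOST is a placeholder):
#   rsync LOGIN@HOST:path/to/results.txt .       # cluster -> local machine
#   rsync results.txt LOGIN@HOST:path/to/dir/    # local machine -> cluster

# Clean up afterwards with: rm -rf "$demo"
```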

Preserving metadata

Using the -a (archive) option, rsync will preserve almost all metadata during the copy: modification date, symlinks, directories (the copy is recursive), device/special files, owner, group and permissions. (But extended attributes, ACLs and hard links are not preserved.)
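To see the effect on one piece of metadata, here is a small sketch that compares a plain copy with an archive copy of a file whose modification date has been set in the past. The file names are hypothetical, and only the modification date is checked.

```shell
demo=$(mktemp -d)
mkdir "$demo/src"
echo "old data" > "$demo/src/old.txt"
# Pretend the file was last modified on 1 January 2020.
touch -t 202001010000 "$demo/src/old.txt"

rsync "$demo/src/old.txt" "$demo/plain.txt"        # no -a: the copy is dated "now"
rsync -a "$demo/src/old.txt" "$demo/archive.txt"   # -a: the modification date is kept

# -nt means "newer than": only the plain copy is newer than the original.
[ "$demo/plain.txt" -nt "$demo/src/old.txt" ] && echo "plain copy: new date"
[ "$demo/archive.txt" -nt "$demo/src/old.txt" ] || echo "archive copy: original date"
```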

Synchronising folders

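rsync can be applied recursively to a directory with the -r (recursive) option. In that case, only the new or modified files and directories of the target are copied into the destination (i.e. the files of the target that already exist unchanged in the destination are left alone). For unchanged files to be recognised from one run to the next, modification dates must be preserved, so combine -r with -t, or simply use -a, which implies both.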

To delete the files of the destination that are not in the target (basically, to mirror a folder instead of synchronising folders), use the --delete option.
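The difference between synchronising and mirroring can be sketched on two small directories. The file names are invented for the example, and -a is used here because it implies -r.

```shell
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
echo "a" > "$demo/src/a.txt"
echo "b" > "$demo/src/b.txt"
echo "stale" > "$demo/dst/old.txt"   # exists only in the destination

# Synchronise: a.txt and b.txt are copied, old.txt is left alone.
rsync -a "$demo/src/" "$demo/dst/"
ls "$demo/dst"    # lists a.txt, b.txt and old.txt

# Mirror: old.txt is deleted because it is not in the source.
rsync -a --delete "$demo/src/" "$demo/dst/"
ls "$demo/dst"    # lists a.txt and b.txt only
```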

Where is the magic?

For each file of the target, rsync checks whether it has a match in the destination (i.e. a file with the same relative path). If not, the file is transferred. If there is a match, the file is transferred only if its size or its time of last modification differs from that of the match.

Therefore, for remote copies, this method can help to reduce the amount of transferred data! Quite smart, isn't it?
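This check can be observed with the -i (--itemize-changes) option, which prints one line per item that rsync updates. The sketch below, on a hypothetical log.txt file, runs the same command three times: once on a new file, once with nothing changed, and once after the file has grown.

```shell
demo=$(mktemp -d)
mkdir "$demo/src"
echo "step 1" > "$demo/src/log.txt"

first=$(rsync -ai "$demo/src/" "$demo/dst/")    # new file: transferred
second=$(rsync -ai "$demo/src/" "$demo/dst/")   # same size and date: skipped

echo "step 2" >> "$demo/src/log.txt"            # the size changes
third=$(rsync -ai "$demo/src/" "$demo/dst/")    # transferred again

printf 'first run:\n%s\nsecond run:\n%s\nthird run:\n%s\n' "$first" "$second" "$third"
```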

However, it is possible to go further and compare files by checksum (option -c) rather than by size and date. It is also possible to compress the data sent over the network with the option -z or -zz. (With a single -z, the compression additionally takes advantage of the data that already matches the pre-existing file in the destination; this mode is named --old-compress in more recent versions.)
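A case where the checksum comparison matters can be sketched as follows: two files with the same size and the same modification date but different contents. The file name data.txt is made up, and the -z example is only a comment since it needs a real remote host.

```shell
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
printf 'AAAA' > "$demo/src/data.txt"
printf 'BBBB' > "$demo/dst/data.txt"   # same size, different content
# Give both files the same modification date.
touch -t 202001010000 "$demo/src/data.txt" "$demo/dst/data.txt"

rsync -a "$demo/src/" "$demo/dst/"
after_quick=$(cat "$demo/dst/data.txt")      # still BBBB: size and date matched

rsync -ac "$demo/src/" "$demo/dst/"
after_checksum=$(cat "$demo/dst/data.txt")   # now AAAA: the checksums differed

echo "default check: $after_quick, checksum check: $after_checksum"

# For a remote copy, -z compresses the data in transit (not run here):
#   rsync -az LOGIN@HOST:path/to/results/ results/
```

Checksums require reading every file on both sides, so this mode is slower than the default check.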

Remarks

When dealing with directories, should you put a trailing slash? The table below will help you figure it out:

Target ↓ / Destination →   destination_dir                                  destination_dir/
target_dir                 destination_dir will contain target_dir          same
                           (with its elements inside)
target_dir/                destination_dir will contain the elements of     same
                           target_dir (but not target_dir itself)
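The two rows of the table can be checked with a quick sketch. The names dest_a and dest_b are hypothetical, one per row.

```shell
demo=$(mktemp -d)
mkdir "$demo/target_dir"
echo "hello" > "$demo/target_dir/file.txt"

# No trailing slash on the target: the directory itself is copied.
rsync -a "$demo/target_dir" "$demo/dest_a"
# Trailing slash on the target: only its contents are copied.
rsync -a "$demo/target_dir/" "$demo/dest_b"

# Shows dest_a/target_dir/file.txt and dest_b/file.txt
find "$demo/dest_a" "$demo/dest_b" -type f
```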

Advanced usage

rsync is a versatile and powerful file-copying tool with numerous features. However, I must admit that I am not an expert user of rsync, so I will focus here on some cool and powerful features that I use. There might be cooler features that I haven't discovered yet (so contributions are welcome, or go and see the manual: man rsync)...

Delete copied files

It is possible to mimic (to some extent) the mv command by removing the transferred files from the source with the --remove-source-files option (beware: emptied directories will not be removed).
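A short sketch of this behaviour, on a made-up src/sub/a.txt tree. The final find command is one possible way to remove the emptied directories, assuming that is wanted; it is not something rsync does by itself.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/src/sub"
echo "data" > "$demo/src/sub/a.txt"

# Copy, then delete each source file once it has been transferred.
rsync -a --remove-source-files "$demo/src/" "$demo/dst/"

# The files are gone from the source, but the emptied directory remains.
left_behind=$(find "$demo/src" -mindepth 1)
echo "left in source: $left_behind"

# Assumption: the empty directories should be removed as well.
find "$demo/src" -mindepth 1 -type d -empty -delete
```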

Display transfer progress

For long synchronisations, it is possible to display the transfer progress (and keep partially transferred files in case of a network incident) with the -P option. Note that the verbose option -v available for cp is also available for rsync, and gives you some additional statistics.
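To try it without a network, the sketch below copies a dummy 20 MB file locally. The file names and the size are arbitrary choices for the example.

```shell
demo=$(mktemp -d)
# A 20 MB dummy file, so that there is some progress to display.
dd if=/dev/zero of="$demo/big.bin" bs=1048576 count=20 2>/dev/null

# -P is shorthand for --partial --progress.
rsync -avP "$demo/big.bin" "$demo/copy.bin"
```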

Pattern-matching transfer

This feature is the main reason why I use rsync.

With rsync, it is possible to copy only some kinds of files (i.e. those matching a given pattern), while preserving the tree-like structure of the folder! Basically, rsync gives you fine control over the patterns used.

Example: let's say that I have a directory simulations containing multiple simulation directories (C2H2_bri, C2H4_pi, H2O_top, ...), each containing a coordinates file of the form coords.xyz that I don't want to retrieve, and an input file of the form cp2k.inp that I want to retrieve. The following rsync command will produce a copy local_simulations of my simulations directory containing all its subdirectories and only the cp2k.inp files (conserving the tree-like structure of simulations): rsync -arP --include="cp2k.inp" --include="*/" --exclude="*" allo-psmn:path/to/simulations/ local_simulations
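The same filter can be tested locally before using it on the cluster. The sketch below rebuilds a small fake simulations tree in a temporary directory (the file contents are dummies) and applies the three rules to it.

```shell
demo=$(mktemp -d)
for sim in C2H2_bri C2H4_pi H2O_top; do
    mkdir -p "$demo/simulations/$sim"
    echo "input"  > "$demo/simulations/$sim/cp2k.inp"
    echo "coords" > "$demo/simulations/$sim/coords.xyz"
done

# Rules are tried in order and the first match wins:
#   1. keep files named cp2k.inp
#   2. keep every directory, so that rsync can descend into it
#   3. drop everything else
rsync -a --include="cp2k.inp" --include="*/" --exclude="*" \
    "$demo/simulations/" "$demo/local_simulations"

# Only the three cp2k.inp files are listed.
find "$demo/local_simulations" -type f | sort
```

The order matters: if --exclude="*" came first, it would match everything and nothing would be copied.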

Dry run

With pattern-matching transfer, rsync commands can become non-trivial, and mistakes can occur. That's why it is possible to perform a dry run (i.e. only simulate the operations that the command would perform, without actually performing them) with the -n option.
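A dry run is especially useful before a destructive option such as --delete. In this sketch (with invented file names), rsync reports what it would do, and the destination is left untouched.

```shell
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
echo "new" > "$demo/src/new.txt"
echo "precious" > "$demo/dst/precious.txt"

# -n: report what the command would do without touching anything.
rsync -avn --delete "$demo/src/" "$demo/dst/"

# The destination is unchanged: precious.txt is still there, new.txt is not.
ls "$demo/dst"
```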

Comparing and hard-linking duplicates

One of the most powerful features of rsync: during synchronisations, rsync can compare the received files against another directory instead of the destination (option --compare-dest). And even better, it can create hard links to the files that are unchanged in that directory instead of copying them again (option --link-dest)!
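The hard-link behaviour can be sketched with two snapshots of a small directory. The names snap1 and snap2 are hypothetical. An absolute path is given to --link-dest here, because a relative one is interpreted relative to the destination directory.

```shell
demo=$(mktemp -d)
mkdir "$demo/src"
echo "unchanged" > "$demo/src/a.txt"
echo "version 1" > "$demo/src/b.txt"

# First snapshot: a plain copy.
rsync -a "$demo/src/" "$demo/snap1/"

# b.txt changes, a.txt does not.
echo "version 2, longer" > "$demo/src/b.txt"

# Second snapshot: files identical to those in snap1 become hard links.
rsync -a --link-dest="$demo/snap1" "$demo/src/" "$demo/snap2/"

# -ef is true when two names point to the same file on disk.
[ "$demo/snap1/a.txt" -ef "$demo/snap2/a.txt" ] && echo "a.txt: shared"
[ "$demo/snap1/b.txt" -ef "$demo/snap2/b.txt" ] || echo "b.txt: separate copy"
```

Each snapshot looks like a full copy, but an unchanged file takes disk space only once.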

A little gift

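Ultimate backup script to add to your .bashrc. Called as backup SOURCE... NAME from the directory that holds your backups, it creates a dated snapshot .NAME-DATE, hard-links the unchanged files to the previous snapshot, and points the symlink NAME-current to the new snapshot: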

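backup() {
    local CUR_DATE DIR_NAME
    CUR_DATE=$(date "+%Y-%m-%d_%H-%M-%S")
    DIR_NAME="${@: -1}"   # last argument: name of the backup
    # Copy all other arguments; unchanged files are hard-linked to the previous snapshot
    rsync -avhPH --link-dest="../$DIR_NAME-current" "${@: 1: ${#@}-1}" ".$DIR_NAME-$CUR_DATE"
    # Make NAME-current point to the new snapshot
    [[ -h "$DIR_NAME-current" ]] && rm "$DIR_NAME-current"
    [[ -d ".$DIR_NAME-$CUR_DATE" ]] && ln -s ".$DIR_NAME-$CUR_DATE" "$DIR_NAME-current"
}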