rsync
rsync is a very powerful file-copying tool that can be seen as a smarter version of cp and scp combined, as it allows for highly performant file-copy both locally and remotely.
The basics
Usually, when you want to copy a file from one directory to another, you use the command cp. Similarly, if the source and the destination are not on the same machine (for example, you want to retrieve the results of a simulation from the cluster and save them on your local machine), you would use the command scp (a cp-like copy over SSH, hence the name).
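rsync can be used in both of the previously mentioned situations! To copy a file locally: rsync TARGET DESTINATION. To copy from a remote machine: rsync LOGIN@HOST:TARGET DESTINATION. To copy to a remote machine: rsync TARGET LOGIN@HOST:DESTINATION. Only one of the two sides can be remote: rsync refuses a command where both the source and the destination are remote. (Throughout this page, "target" means the source being copied.)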
Preserving metadata
With the -a (archive) option, rsync copies directories recursively and preserves almost all metadata during the copy: modification times, symlinks, device and special files, owner, group and permissions. Extended attributes, ACLs and hard links are not preserved by -a alone (they require -X, -A and -H respectively).
Synchronising folders
rsync can be applied recursively to a directory with the -r (recursive) option. In that case, only the new or modified files and directories of the target are copied into the destination: files that are already present and unchanged in the destination are left alone, and nothing is deleted from it.
To delete the files of the destination that are not in the target (in effect mirroring a folder rather than merging into it), use the --delete option.
Where is the magic?
For each file of the target, rsync checks whether this file has a match in the destination (that is, a file with the same relative path). If not, the file is transferred. If there is a match, the file is transferred only if its size or its time of last modification differs between the two sides.
Therefore, for remote copies, this method can help to reduce the amount of transferred data! Quite smart, isn't it?
However, it is possible to go further and compare files by checksum rather than by size and modification time (option -c). You can also compress the data sent over the network with option -z or -zz. Even better, for a file that already exists in the destination, rsync can send only the compressed differences between the two versions, with option -z (or --old-compress). The exact behaviour of -zz and --old-compress depends on the rsync version, so check man rsync on your machine.
Remarks
When dealing with directories, should you put a trailing slash? The table below will help you figure it out:
| Target ↓ / Destination → | destination_dir | destination_dir/ |
|---|---|---|
| target_dir | destination_dir will contain target_dir (with its elements inside) | idem |
| target_dir/ | destination_dir will contain the elements of target_dir (but not target_dir itself) | idem |
Advanced usage
rsync is a versatile and powerful file-copying tool with numerous features. However, I must admit that I am not an expert user of rsync, so I will focus here on some cool and powerful features that I use. There might be cooler features that I haven't discovered yet (so contributions are welcome, or go see the manual: man rsync)...
Delete copied files
It is possible to mimic (to some extent) the mv command by removing the transferred files from the source with the --remove-source-files option (beware: emptied directories will not be removed).
Display transfer progress
For long synchronisations, it is possible to display the transfer progress (and keep partially transferred files in case of a network incident) with the -P option, which is shorthand for --partial --progress. Note that the verbose option -v available on cp is also available for rsync, and gives you some additional statistics.
Pattern-matching transfer
This feature is the main reason why I use rsync.
With rsync, it is possible to copy only certain kinds of files (i.e. those matching a given pattern), while preserving the tree structure of the folder! Basically, rsync gives you fine control over the patterns used.
Example: let's say that I have a directory simulations containing multiple simulation directories (C2H2_bri, C2H4_pi, H2O_top, ...), each containing a coordinates file named coords.xyz that I don't want to retrieve, and an input file named cp2k.inp that I do want to retrieve. The following rsync command will produce a copy local_simulations of my simulations directory containing all its subdirectories and only the cp2k.inp files (preserving the tree structure of simulations): rsync -arP --include="cp2k.inp" --include="*/" --exclude="*" allo-psmn:path/to/simulations/ local_simulations
Dry run
With pattern-matching transfers, rsync commands can become non-trivial, and mistakes can occur. That's why it is possible to perform a dry run (i.e. only report the operations that the command would perform, without actually performing them) with the -n option.
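A dry run is especially useful in front of a destructive option such as --delete. In this sketch (scratch directory, made-up file names), -v is added so that rsync lists the files it would copy or delete.

```shell
work=$(mktemp -d)
mkdir "$work/src" "$work/dst"
echo "keep" > "$work/src/new.txt"
echo "stale" > "$work/dst/old.txt"

# -n reports what would happen without touching anything.
rsync -rvn --delete "$work/src/" "$work/dst/"

# The destination is unchanged: old.txt is still there, new.txt is not.
ls "$work/dst"
```

Once the report looks right, the same command without -n performs the transfer for real.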
Hard link duplicates compare
One of the most powerful features of rsync: during synchronisations, it is possible for rsync to compare received files against another directory instead of the destination (option --compare-dest=DIR). And even better, it can create hard links to the unchanged files found in that directory (option --link-dest=DIR)!
A little gift
Ultimate backup script to add to your .bashrc
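    # Usage: backup SOURCE... NAME
    # Creates a dated snapshot .NAME-<date> in the current directory, hard-linking
    # the files unchanged since the previous snapshot, then points NAME-current at it.
    backup() {
        local CUR_DATE DIR_NAME
        CUR_DATE=$(date "+%Y-%m-%d_%H-%M-%S")
        DIR_NAME="${@: -1}"    # last argument: name of the backup
        # All arguments except the last one are the sources to back up.
        rsync -avhPH --link-dest="../$DIR_NAME-current" "${@:1:$#-1}" ".$DIR_NAME-$CUR_DATE"
        # Repoint the NAME-current symlink to the new snapshot.
        [[ -h "$DIR_NAME-current" ]] && rm "$DIR_NAME-current"
        [[ -d ".$DIR_NAME-$CUR_DATE" ]] && ln -s ".$DIR_NAME-$CUR_DATE" "$DIR_NAME-current"
    }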