===================================================
Cautions (or *why it's hard to write a dupefinder*)
===================================================

This section covers good practice for safe duplicate removal.  It is not intended to
be specifically related to ``rmlint``.  It includes general discussion on duplicate
detection and shows, by example, some of the traps that duplicate finders can fall into.
This section might not only be useful for developers of dupe finders, but also
educational for users that strive for best practices regarding deduplication.

Good Practice when Deleting Duplicates
--------------------------------------

Backup your data
~~~~~~~~~~~~~~~~

There is a wise adage, *"if it's not backed up, it's not important"*.  It's just good
practice to keep your important data backed up.  In particular, any time you are
contemplating doing major file reorganisations or deletions, that's a good time to
make sure that your backups are up to date.

What about when you want to clean up your backups by deleting duplicate files from your
backup drive?  Well as long as your duplicate file finder is reliable, you shouldn't have
any problems.  Consider replacing the duplicate with a link (hardlink, symlink or reflink)
to the original data.  This still frees up the space, but makes it easier to find the file
if and when it comes time to restore files from backup.

Measure twice, cut once
~~~~~~~~~~~~~~~~~~~~~~~

This is a popular saying amongst builders; the same goes for your files.  Do at least some
sort of sanity check on which files are going to be deleted.  All duplicate file finders
(including ``rmlint``) are capable of identifying false positives or more serious bugs.

Beware of unusual filename characters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even a space in a filename is capable of `causing grief`_.  Make sure your file deletion command
(or the one used by your duplicate finder) has the filename properly escaped.

.. _`causing grief`: https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/issues/123

Consider safe removal options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rather than deleting duplicates, consider moving them to a holding area or trash
folder.  The `trash-cli utility`_ is one option for this.  Alternatively if
using the ``rmlint`` shell script you can replace the ``remove_cmd`` section as
follows to move the files to */tmp*:

.. _`trash-cli utility`: http://github.com/andreafrancia/trash-cli

.. code-block:: bash

   remove_cmd() {
       echo 'Deleting:' "$1"
       if original_check "$1" "$2"; then
           if [ -z "$DO_DRY_RUN" ]; then
               # was: rm -rf "$1"
               mv -p "$1" "/tmp$1"
           fi
       fi
   }

Another safe alternative, if your files are on a copy-on-write filesystem such
as ``btrfs``, and you have linux kernel 4.2 or higher, is to use a deduplication
utility such as ``duperemove`` or ``rmlint --dedupe``:

.. code-block:: bash

   $ duperemove -dh original duplicate
   $ rmlint --dedupe original duplicate

Both of the above first verify (via the kernel)  that ``original`` and
``duplicate`` are identical, then modifies ``duplicate`` to reference
``original``'s data extents.  Note they do not change the mtime or other
metadata of the duplicate (unlike hardlinks).

If you pass ``-c sh:link`` to ``rmlint``, it will even check for you if your
filesystem is capable of reflinks and emit the correct command conveniently.

You might think hardlinking as a safe alternative to deletion, but in fact hardlinking
first deletes the duplicate and then creates a hardlink to the original in its place.
If your duplicate finder has found a false positive, it is possible that you may lose
your data.


Attributes of a Robust Duplicate Finder
---------------------------------------

(also known as *"Traps for young dupe finders"*)

Traversal Robustness
~~~~~~~~~~~~~~~~~~~~

**Path Doubles**


One simple trap for a dupe finder is to not realise that it has reached the same file
via two different paths.  This can happen due to user inputting overlapping paths to
traverse, or due to symlinks or other filesystem loops such as bind mounts.
Here's a simple "path double" example trying to trick a duplicate file finder
by "accidentally" feeding it the same path twice.  We'll use
fdupes_ for this example:

.. _fdupes: https://github.com/adrianlopezroche/fdupes

.. code-block:: bash

   $ mkdir dir
   $ echo "important" > dir/file
   $ fdupes -r --delete --noprompt dir dir
   [no output]
   $ ls dir
   file

So far so good, ``fdupes`` didn't fall for the trick.  It has a check built-in which looks at
the files' device and inode numbers, which automatically filters out path doubles.

Let's try again using the -H option to find hardlinked duplicates:

.. code-block:: bash

   $ fdupes -r -H --delete --noprompt dir dir
      [+] dir/file
      [-] dir/file
   $ ls -l dir/
   total 0

Oh dear, our file is gone!  The problem is that hardlinks share the same device and inode numbers,
so the inode check is turned off for this option.

Dupe finders ``rdfind`` and ``dupd`` can also be tricked with the right combination of settings:

.. code-block:: bash

   $ rdfind -removeidentinode false -deleteduplicates true a a
   [snip]
   Now deleting duplicates:
   Deleted 1 files.
   $ ls -l dir/
   total 0

   $ dupd scan --path /home/foo/a --path /home/foo/a
   Files scanned: 2
   Total duplicates: 2
   Run 'dupd report' to list duplicates.
   $ dupd report
   Duplicate report from database /home/foo/.dupd_sqlite:
   20 total bytes used by duplicates:
     /home/foo/a/data
     /home/foo/a/data

*Solution:*

For a duplicate finder to be able to find hardlinked duplicates, without also inadvertently
identifying a file as a duplicate or itself, a more sophisticated test is required.  Path
doubles will always have:

- matching device and inode.
- matching basename.
- parent directories also have matching device and inode.

That **seems** pretty fool-proof (see ``rmlint`` example below) but please file an issue
on our `Issue Tracker`_ if you find an exception.

.. _`Issue Tracker`: https://github.com/sahib/rmlint/issues

.. code-block:: bash

   $ echo "data" > dir/file
   $ # rmlint with default settings:
   $  rmlint dir dir
   ==> In total 2 files, whereof 0 are duplicates in 0 groups.
   ==> This equals 0 B of duplicates which could be removed.
   $
   $ # rmlint with hardlink duplicate detection enabled:
   $  rmlint --hardlinked dir dir
   ==> In total 2 files, whereof 0 are duplicates in 0 groups.
   ==> This equals 0 B of duplicates which could be removed.
   $ ls dir
   file

**Symlinks:**

*"Ah but I'm not silly enough to enter the same path twice"* you say.  Well maybe so, but
there are other ways that folder traversal can reach the same path twice, for example
via symbolic links:

.. code-block:: bash

   $ mkdir dir
   $ echo "important" > dir/file
   $ ln -s dir link
   $ fdupes -r --delete --noprompt .
   $ ls -l dir/
   total 0

Symlinks can make a real mess out of filesystem traversal:

.. code-block:: bash

   $ mkdir dir
   $ cd dir
   $ ln -s . link
   $ cd ..
   $ echo "data" > dir/file
   $ fdupes -rHs dir
   dir/file
   dir/link/file
   dir/link/link/file
   [snip]
   dir/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/file

   Set 1 of 1, preserve files [1 - 41, all]:

*Solution:*

During traversal, the duplicate finder should keep track of all folders visited (by device and inode number).
Don't re-traverse folders that were already traversed.

**Hardlinks:**

Also as noted above, replacing duplicates with hardlinks can still end badly if there are
false positives.  For example, using ``rdfind``'s  the ``-makehardlinks`` option:

.. code-block:: bash

   $ echo "data" > dir/file
   $ rdfind -removeidentinode false -makehardlinks true dir dir
   [snip]
   It seems like you have 2 files that are not unique
   Totally, 5 b can be reduced.
   Now making results file results.txt
   Now making hard links.
   failed to make hardlink dir/file to dir/file
   $ ls -l dir
   total 0

*Solution:*

Don't find false positives. Check files are on same filesystem before trying to create hardlink.
Temporarily rename the duplicate before creating the hardlink and then deleting the renamed file.

Collision Robustness
~~~~~~~~~~~~~~~~~~~~

**Duplicate detection by file hash**

If a duplicate finder uses file hashes to identify duplicates, there is a very
small risk that two different files have the same hash value.  This is called a
*hash collision* and can result in the two files being falsely flagged as
duplicates.

Several duplicate finders use the popular MD5 Hash, which is 128 bits
long.  With a 128-bit hash, if you have a million sets of same-size files, each set containing
a million different files, the chance of a hash collision is about
``0.000 000 000 000 000 000 147%``. To get a ``0.1%`` chance of a hash collision you would
need nine hundred thousand million (:math:`9\times10^{11}`) groups of (:math:`9\times10^{11}`) files each, or one group
of eight hundred thousand million million (:math:`8\times10^{17}`) files.

If someone had access to your files, and *wanted* to create a malicious duplicate, they
could potentially do something like this (based on http://web.archive.org/web/20071226014140/http://www.cits.rub.de/MD5Collisions/):

.. code-block:: bash

   $ mkdir test && cd test
   $ # get two different files with same md5 hash:
   $ wget http://web.archive.org/web/20071226014140/http://www.cits.rub.de/imperia/md/content/magnus/order.ps
   $ wget http://web.archive.org/web/20071226014140/http://www.cits.rub.de/imperia/md/content/magnus/letter_of_rec.ps
   $ md5sum *  # verify that they have the same md5sum
   a25f7f0b29ee0b3968c860738533a4b9  letter_of_rec.ps
   a25f7f0b29ee0b3968c860738533a4b9  order.ps
   $ sha1sum * # verify that they are not actually the same
   07835fdd04c9afd283046bd30a362a6516b7e216  letter_of_rec.ps
   3548db4d0af8fd2f1dbe02288575e8f9f539bfa6  order.ps
   $ rmlint -a md5 . -o pretty  # run rmlint using md5 hash for duplicate file detection
   # Duplicate(s):
       ls '/home/foo/test/order.ps'
       rm '/home/foo/test/letter_of_rec.ps'
   $ rmlint test -a sha1 -o summary   # run using sha1 hash
   ==> In total 2 files, whereof 0 are duplicates in 0 groups.

If your intention was to free up space by hardlinking the duplicate to the original, you would end up with two
hardlinked files, one called ``order.ps`` and the other called
``letter_of_rec.ps``, both containing the contents of ``order.ps``.

*Solution:*

``fdupes`` detects duplicates using MD5 Hashes, but eliminates the collision
risk by doing a byte-wise comparison of the duplicates detected.  This means
each file is read twice, which can tend to slow things down.

``dupd`` uses direct file comparison, unless there are more than 3 files in a set of duplicates, in which
case it uses MD5 only.

If you use ``rmlint``'s ``sha1`` hash features, which features 160 bit output,
you need at least :math:`5.4\times10^{22}` files before you get a :math:`0.1\%`
probability of collision.  ``rmlint``'s ``-p`` option uses ``SHA512``
(:math:`5.2\times10^{75}` files for :math:`0.1\%` risk), while ``rmlint``'s
``-pp`` option uses direct file comparison to eliminate the risk altogether.
Refer to the :ref:`benchmark_ref` chapter for speed and memory overhead
implications.


Unusual Characters Robustness
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spaces, commas, nonprinting characters etc can all potentially trip up a duplicate finder or the subsequent file
deletion command.  For example:

.. code-block:: bash

   $ mkdir test
   $ echo "data" > 'test/\t\r\"\b\f\\,.'
   $ cp test/\\t\\r\\\"\\b\\f\\\\\,. test/copy  # even just copying filenames like this is ugly!
   $ ls -1 test/
   copy
   \t\r\"\b\f\\,.
   $ md5sum test/*  # md5's output gets a little bit mangled by the odd characters
   6137cde4893c59f76f005a8123d8e8e6  test/copy
   \6137cde4893c59f76f005a8123d8e8e6  test/\\t\\r\\"\\b\\f\\\\,.
   $ dupd scan --path /home/foo/test
   SKIP (comma) [/home/foo/test/\t\r\"\b\f\\,.]
   Files scanned: 1
   Total duplicates: 0

*Solution:* Be careful!

*"Seek Thrash"* Robustness
~~~~~~~~~~~~~~~~~~~~~~~~~~

Duplicate finders use a range of strategies to find duplicates.  It is common to reading and compare small increments
of potential duplicates.  This avoids the need to read the whole file if the files differ in the first few megabytes,
so this can give a major speedup in some cases.  However, in the case of hard disk drives, constantly reading small
increments from several files at the same time causes the hard drive head to have to jump around ("seek thrash").

Here are some speed test results showing relative speed for scanning my ``/usr`` folder (on SSD) and an HDD copy of same.
The speed ratio gives an indication of how effectively the search algorithm manages disk seek overheads:

+----------------+----------------+---------------------+---------+
| Program        | ``/usr`` (SSD) |  ``/mnt/usr`` (HDD) | *Ratio* |
+================+================+=====================+=========+
| ``dupd``       |   48s          |  1769s              | 36.9    |
+----------------+----------------+---------------------+---------+
| ``fdupes``     |   65s          |  486s               |  7.5    |
+----------------+----------------+---------------------+---------+
| ``rmlint``     |   38s          |  106s               |  2.8    |
+----------------+----------------+---------------------+---------+
| ``rmlint -pp`` |   40s          |  139s               |  3.5    |
+----------------+----------------+---------------------+---------+

.. note::

    Before each run, disk caches were cleared:

    .. code-block:: bash

        $ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

*Solution:*

Achieving good speeds on HDD's requires a balance between small file increments early on, then switching to
bigger file increments.  Fiemap information (physical location of files on the disk) can be used to sort the
files into an order that reduces disk seek times.


Memory Usage Robustness
~~~~~~~~~~~~~~~~~~~~~~~

When scanning very large filesystems, duplicate finders may have to hold a large amount of information in
memory at the same time.  Once this information exceeds the computers' RAM, performance will suffer
significantly.  ``dupd`` handles this quite nicely by storing a lot of the data in a sqlite database file,
although this may have a slight performance penalty due to disk read/write time to the database file.
``rmlint`` uses a path tree structure to reduce the memory required to store all traversed paths.
