Verifying copies (jrs-s.net)
41 points by snaky on Aug 14, 2016 | 18 comments


tar doesn't sort entries when it creates an archive. I've tried to use it to compare trees in the past, only to run into this caveat. It's a detail most people don't seem to realize about their file system, but files aren't (usually, on most FSes) stored in any particular order. By creation time order, MAYBE (no guarantee, often not), but as directories get modified and files are deleted and added, this gets clobbered anyhow. Use "ls -f" to see the raw, unsorted list of a directory's contents. Since it is faster for tar to just use this order than to pre-sort entries, that's the order they'll get added in.

What you want is a tar that is deterministically identical regardless of the underlying file system's order. Sadly, no such option exists in older GNU tar releases (1.28 and later have --sort=name), but that doesn't quite mean the end of the quest. Behold:

find bin -print0 | sort -Vz | tar -cf bin.tar --no-recursion --null -T -

The first two shouldn't be hard to parse. "find" lists the files, separated by NUL characters (so any whitespace or other special characters won't be an issue), and passes them on to "sort", which uses a version sort (identical rules to dpkg version sorting, and it disregards the locale's sort order) and, with -z, both reads and emits NUL-separated entries. The tar command should be pretty self-explanatory: "-T -" tells it to read the file list from stdin, --null specifies that the input list is separated by NULs, and --no-recursion prevents it from recursing into directories, so it just adds an entry in the tar for each directory. The find|sort has already taken care of the directory members, sorted. Optionally add --numeric-owner to tar if you want to compare trees on different computers, but results might vary (rsync and tar normally send/store both the id and the name, and if the "name" exists on the local machine, they use that instead of the id).
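The same idea can be sketched in Python, for cases where the shell pipeline isn't convenient. This is only a rough equivalent under stated assumptions: the function names are made up for illustration, it uses a plain byte-wise sort instead of the version sort above, and it only fixes the order of the archive members (timestamps and ownership are still stored as found on disk).

```python
import os
import tarfile


def sorted_paths(root):
    """Return root and everything below it in a stable, sorted order."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    # Plain string sort (an assumption; the shell version uses sort -V).
    return sorted(paths)


def write_sorted_tar(root, archive_path):
    """Write a tar whose member order does not depend on directory order."""
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted_paths(root):
            # recursive=False mirrors --no-recursion: directories get their
            # own entry, and their members are added by this loop instead.
            tar.add(path, recursive=False)
```

Calling write_sorted_tar("bin", "bin.tar") from the parent directory of bin corresponds to the pipeline above.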


You can probably replace that --no-recursion option on tar, with '\! -type d' on find.


If you don't care about comparing directory metadata, sure. Otherwise, it's still useful.


Since I develop backup software, this is a problem I needed to solve (e.g., for verifying that restore works). So I wrote a tool: http://liw.fi/summain/

It produces output that is meant to be usefully diffable.


That's a neat looking application - bonus points for a good manpage, too.

That said, how does it handle firstpath/somefile.bin and secondpath/somefile.bin being identical? This breaks your "diffable" output because the paths are different.


You run it in such a way that the paths are identical. For example, by using the -r option.

    $ mkdir foo
    $ echo foo > foo/bar
    $ cp -a foo foo2
    $ summain -r foo > foo.summain
    $ summain -r foo2 > foo2.summain
    $ diff foo*.summain


The easiest way to verify file checksums is to use "rsync -c". By default, rsync skips a file when both its modification time and size match on the source and destination; the "-c" option tells it to always compute and compare a checksum instead.
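To see why the size-and-mtime shortcut can miss a difference, here is a small Python sketch. It is not rsync's actual implementation, just an illustration of the two kinds of comparison; the function names are hypothetical and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import os


def quick_check_same(a, b):
    """True if size and modification time (whole seconds) both match."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def checksum_same(a, b):
    """True if both files hash to the same SHA-256 digest."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files don't fill memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()
    return digest(a) == digest(b)
```

Two files with the same length and the same timestamp but different bytes pass the first check and fail the second.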

Sometimes I also use the following "one" liner, if it's local.

    diff <(sort <(cd /path/to/source; \
                  find . -type f -print0 | \
                  xargs -0 sha1sum)) \
         <(sort <(cd /path/to/destination; \
                  find . -type f -print0 | \
                  xargs -0 sha1sum))
Another nice rsync tip is that if you aren't using -a, you should at least use -t (preserve modification time). This will make a second rsync faster as it can skip files if the modification time and size match.
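The checksum one-liner above can also be sketched in Python, which makes it easier to report files that are missing on one side. The function names here are made up for illustration, and SHA-1 is used only to match the sha1sum call above.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file's path, relative to root, to its SHA-1 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Like find -type f: skip symlinks and anything not a plain file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            h = hashlib.sha1()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def compare_trees(source, destination):
    """Return sorted relative paths that differ or exist on one side only."""
    src, dst = tree_digests(source), tree_digests(destination)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

An empty list from compare_trees("/path/to/source", "/path/to/destination") plays the same role as an empty diff.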


Is it just me, or is anyone else getting a 403 when trying to access this page?


ditto (i'm using chrome). weirdly, after looking at the version on archive.org, it started working, although that's probably coincidence.


> For example, if we rsync -a /source /target, we trust that the contents of /target will exactly match the contents of /source

You might trust that, but you'd be wrong:

    $ cd /tmp
    $ mkdir foo
    $ cd foo
    $ mkdir bar
    $ mkdir baz
    $ touch bar/quux
    $ rsync -a ./bar ./baz
    $ ls baz
    bar
    $ ls -R baz
    baz:
    bar
    
    baz/bar:
    quux
You see, without a final '/' rsync will put the source into the target, rather than synchronising the source and the target. Also, if you really want a sync as opposed to just a full copy, you probably want to add --delete (which will delete files in the target which don't exist in the source).

So you probably want rsync -a --delete source/ target/.
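If you call rsync from a script, one way to avoid forgetting the trailing slash is to add it when building the command. A minimal sketch, assuming a hypothetical helper name; it only builds the argument list and does not run anything.

```python
import os


def mirror_command(source, target):
    """Build an rsync argument list that mirrors source's contents into target."""
    # os.path.join(path, "") appends a trailing separator only if one is
    # missing, so "bar" and "bar/" both end up as "bar/".
    return ["rsync", "-a", "--delete",
            os.path.join(source, ""), os.path.join(target, "")]
```

The result can be handed to subprocess.run(cmd, check=True). Since --delete removes files from the target, it may be worth adding -n (dry run) the first time.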


Incidentally, given the directory structure above, here's a good way to calculate sha256 checksums over the files. It shells out rather than use the IRONCLAD package to calculate the checksums.

    (let ((src #P"/tmp/foo/bar/")
          (dst #P"/tmp/foo/baz/"))
      (flet ((sha256sum (path)
               "Return the SHA256 checksum of PATH as a string (which is
    good enough for our purposes here)."
               (first (split-sequence:split-sequence
                       #\Space
                       (uiop:run-program `("sha256sum" ,(namestring path))
                                         :output :string)))))
        (loop for src-path in (directory (uiop:merge-pathnames* "**/*.*" src))
           for dst-path = (uiop:merge-pathnames* (uiop:subpathp src-path src) dst)
           for file = (uiop:truename* dst-path)
           if file
           when (uiop:file-pathname-p file)
           do (unless (string= (sha256sum src-path) (sha256sum dst-path))
                (warn "~a does not match ~a" dst-path src-path))
           end
           else
           do (warn "~a not found" dst-path))))


That would make for a great property-based test.

It would be interesting to see the minimal counter-examples to widely-believed properties that don't hold (not just for rsync but for different CLI tools).

(See John Hughes, testing Dropbox with QuickCheck: https://vimeo.com/158002499)

EDIT: s/Jon/John/



Just use diffoscope:

http://diffoscope.org/


Nice read on how to check your rsyncs.



None of the methods described in the article is correct. md5sum does not compare files, it compares checksums. One mostly-correct solution is the 'd' (diff) option to GNU tar.
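For a local copy, a byte-for-byte comparison (as opposed to comparing checksums) could be sketched in Python with the standard filecmp module. The function name is hypothetical, and this sketch only walks the source side, so files that exist only in the destination are not reported; symlinks are not treated specially.

```python
import filecmp
import os


def differing_files(source, destination):
    """Return relative paths under source whose bytes differ from, or that
    are missing in, destination."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(source):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source)
            dst = os.path.join(destination, rel)
            # shallow=False makes filecmp compare file contents instead of
            # trusting matching size and modification time.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                bad.append(rel)
    return sorted(bad)
```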


> md5sum does not compare files, it compares checksums.

.. which is, almost all of the time, good enough and requires much less bandwidth between the copies. These days it might be better to pick SHA256.



