Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New option: Consider mtime (modification time) #197

Closed
Awerick opened this issue Nov 11, 2016 · 38 comments
Closed

New option: Consider mtime (modification time) #197

Awerick opened this issue Nov 11, 2016 · 38 comments

Comments

@Awerick
Copy link

Awerick commented Nov 11, 2016

I would like to have an option to not consider files with the same content but different mtimes ("modification time") as duplicates.
Reason: Sometimes the mtime carries semantic/pragmatic information which makes it desirable to keep it!

@sahib
Copy link
Owner

sahib commented Nov 12, 2016

Can you elaborate a bit on your usecase (i.e. what semantic information would that be in your case)?
I would be interested to see if it's seems to be a common usecase. For very specialized needs, many users might be better off with parsing the produced .json (or using the .py formatter) and implementing the needed logic themselves.

@Awerick
Copy link
Author

Awerick commented Nov 16, 2016

Clarification

My issue (edit: when having no option to ensure that only files with the same mtime are duplicates), is: When using rmlint with -c sh:hardlink (replacing duplicates with hardlinks), the mtime of all duplicates are equalized (edit: because hardlinks always share the same mtime).

I realized that when using rmlint differently (removing duplicates or replacing them with symbolic links), this issue might be different not exist: The mtimes are not equalized (edit: because each duplicate can have its own mtime or is removed anyway)! (edit: However even then the option to consider the mtime for duplicates might still be desirable due to other reasons!)

Details on my use case:

Using rmlint in conjunction with rsync for space-saving, incremental backups:

  1. First, rsync creates a new, incremental backup: Files that (a) have not been modified since the last backup and (b) are still at the same path (not moved) are hardlinked to the previous backup (given via the rsync option --link-dest PREVIOUS).
  2. Since rsync can not detect unmodified files that have been moved, rmlint can be run subsequently on the new and the previous backup: It detects these files (and maybe more) as duplicates and replaces them with hardlinks.

When rsync (with option a/--archive!) compares two files, they are not hardlinked if the mtime differs. That means, rsync does not change mtimes; this is why I want rmlint to also not change mtimes.
(Tolerable exception for rmlint: When removing duplicates, the mtimes of the parent directories change. Possible extension: rmlint could set the mtimes of the directories back after removing duplicates.)

In general I would like to have backups that keep the disk usage to a minimum, but also (more important:) mirror the original data as close as possible (incl. metadata such as mtimes).


Meanwhile I have taken a look at some other "duplicate finder" tools:
It seems that only hardlink considers mtimes; but it also provides the option --ignore-time and the manual even states that "you almost always want this." I thought, when dealing with duplicates it would be the normal use case to not change mtimes (although I understand that sometimes/often this is not important). But now I am interested / really wondering why this not at least an option in most tools?

Some examples (not very common but still important)

  • A file is changed but that change gets reverted lateron (=> same content as before but updated mtime). This mtime update might be an important information for me/a user, indicating a recent modification (especially interesting for config files, source code – finding bugs).

However, if there exists a copy of that file (with the original mtime) and rmlint is run, the updated mtime might get "lost"!
This scenario might appear a bit unlikely because it involves another copy of the file – but it becomes quite reasonable when using rmlint on backups (as in my use case): Previous backups contain copies of all files. Therefore when running rmlint on two backups, such mtime updates (as described above) will probably get removed from the new backup.

  • Downloaded files (e.g. configs, photos, mp3, …) sometimes have their mtime set to the current time (same may also apply to extracted/unzipped files or files created by a program). The mtime might be used to indicate the download/creation date which might be important/interesting to me/a user.

However, I might accidentally download/create/extract an existing file a second time but delete this copy when I realize my mistake. – If rmlint was run in between, the mtime of the existing file could get updated to the mtime of the newer file and by this the information about the creation date is lost.

  • (Hypothetical: Maybe there are also programs that somehow rely on the mtime of the files they create or the mtime of their config files and don't operate as expected if these mtimes get reset by rmlint? Although, I guess, this would be a bad program behavior.)

"What information would be lost if mtimes get equalized?"
My original formulation that the "mtime carries semantic/pragmatic information" was indeed very vague, I just meant that the mtime is sometimes important or at least interesting for the user, even though the computer does not rely on it. But picturing examples with actual consequences is not easy.
Basically: If mtimes of duplicates are equalized, this changes the meaning of the mtime date from "(this file was) last modified at " to "a file with this content was last modified at (but the current file was maybe modified before or after that)". To me it seems, this makes the mtime entry unnecessary for (hardlinking) duplicates.
=> Possible information the mtime might carry:

  • Was a file recently changed? (see first example above)
  • When was the most recent change to this file?
  • When was a file created/downloaded/extracted? (see second example above)
  • If a file (e.g. a config file) was copied to several different locations/projects: At what date was each file created/updated to this status?

@Awerick
Copy link
Author

Awerick commented Nov 16, 2016

(Extended option: "Time window"?)

In case this feature request gets implement, the rsync option --modify-window might be interesting for rmlint as well: This option allows the mtimes of two files to differ by a certain time span and still be considered equal. (However it might be too specialized to be included in rmlint.)

(rsync, in contrast to rmlint, only compares two files. So it would be more difficult to implement this option deterministically in rmlint.)

The rsync manual elaborates on --modify-window:

"This is normally 0 (for an exact match), but you may find it useful to set this to a larger value in some situations. In particular, when transferring to or from an MS Windows FAT filesystem […] --modify-window=1 is useful (allowing times to differ by up to 1 second)."
(Maybe also helpful for other filesystems to compensate for small shifts in the mtimes? I think, I encountered this situation once on an NTFS formatted drive.)

@sahib
Copy link
Owner

sahib commented Nov 16, 2016

So... it's late and I might had some misunderstanding during the read.
From the first post of your's I thought you want the following:

  • Introduce an option that considers files to be the same only if the mtime is the same too.

But from your second post it seems you want this:

  • Hardlinks created by rmlint should be created with the mtime of the file they replaced.

If you want the second option, we might have a little problem. Consider this:

$ echo x > a
$ ln a b
$ stat --printf '%y\n' a b  # Same mtimes, weird.
2016-11-16 23:30:59.467200299 +0100
2016-11-16 23:30:59.467200299 +0100
$ sleep 1 && touch a
$ stat --printf '%y\n' a b   # Still the same.
2016-11-16 23:31:00.467200299 +0100
2016-11-16 23:31:00.467200299 +0100

It seems that every hardlink mirrors the mtime of each other. Touching one changes the mtime of others.
We actually do what you suggested for symlinks and reflinks (because @SeeSpotRun cared much for it).

Regarding the --ignore-time option of hardlink: This seems to just check if the mtime differs before attempting to hardlink files. It doesn't seem to do anything about preserving them and is the same as my first interpretation. This would be doable, although despite your long text (or because?) I don't know yet how it would help in your usecase. Only considering files with the same mtime wouldn't hit all duplicates (especially when you search for moved files), would it?

The time window option feels (yes, feels) a bit too specialized to rmlint indeed. An implementation might be tricky too since a file of a duplicate group might belong to more than one time frame. Therefore it would not only be hard to implement, but also hard to understand for the user.

@Awerick
Copy link
Author

Awerick commented Nov 17, 2016

Sorry, then my "clarification" was rather a "confusion" statement... (I added a short addendum to it.)

  • Your conclusion from my first post was correct! (I would like an option to consider files as duplicates, only if they also have the same mtime.)
  • In my second post I only wanted to explain one of the effects that happens without this option and why this is a problem for me (i.e., the effect of hardlinking two files with different mtimes and without the requested option is: The mtimes are equalized).

I agree, that hardlinks for one file can probably not be created with different mtimes, as it is saved in the inode. (For other handlers, e.g. symlinking, different mtimes are possible (good to know that rmlint also implements this!), so the problem of equalized mtimes only exists when hardlinking.)

This would be doable, although despite your long text (or because?) I don't know yet how it would help in your usecase. Only considering files with the same mtime wouldn't hit all duplicates (especially when you search for moved files), would it?

Probably "because", sorry about that 😄
My second post was so lengthy because (if I understood you correctly) you were interested whether this is a common use case and what information is lost/what the problem is with equalized mtimes.
Also I was really wondering why most duplicate finders (except for hardlink) don't consider this is a problem, so I wanted to describe it elaborately.

Back to your question: I think, such an option would solve the problem for my use case!
Moved files normally don't change their mtime, so they would still be considered duplicates (same if a file is copied via a file manager or cp -p/cp -a).

Regarding the linked code: I don't understand how this implements the correct behavior in case several groups with same mtimes exist (I thought, each mtime group should be also one duplicate group?), but it seems to work!
Here is a possible test (involving 2 groups with same mtimes and 4 files with the same content):

$ mkdir consider_time && cd consider_time
$ echo "foo" > fooA1
$ cp -p fooA1 fooA2
$ echo "foo" > fooB1
$ cp -p fooB1 fooB2
$ ls -il --full-time
58800  […]  1  […]  2016-11-17 12:49:00.324444575 +0100 fooA1
58801  […]  1  […]  2016-11-17 12:49:00.324444575 +0100 fooA2
58802  […]  1  […]  2016-11-17 12:50:00.460617626 +0100 fooB1
58803  […]  1  […]  2016-11-17 12:50:00.460617626 +0100 fooB2
$ hardlink .
# or if implemented:  rmlint --consider-time -c sh:hardlink  &&  ./rmlint.sh
$ ls -il --full-time
58800  […]  2  […]  2016-11-17 12:49:00.324444575 +0100 fooA1
58800  […]  2  […]  2016-11-17 12:49:00.324444575 +0100 fooA2
58802  […]  2  […]  2016-11-17 12:50:00.460617626 +0100 fooB1
58802  […]  2  […]  2016-11-17 12:50:00.460617626 +0100 fooB2

(I used --consider-mtime as the corresponding name for the rmlint option. hardlink does consider the mtime per default and therefore uses --ignore-time to opt-out of this.)

@sahib
Copy link
Owner

sahib commented Nov 17, 2016

Back to your question: I think, such an option would solve the problem for my use case!

Gotcha.

Moved files normally don't change their mtime, so they would still be considered duplicates (same if a file is copied via a file manager or cp -p/cp -a).

Okay, I expected the mtime to change when using mv, but I see that this was an false assumption.

Go try out 8f38c6e. I took your --consider-mtime name as working name. Tests and docs still missing.

@sahib
Copy link
Owner

sahib commented Nov 17, 2016

Okay, after looking at the code I noticed that it's almost less effort to implement the more flexible
--mtime-window option. This is done by 23e3efd. Usage example:

$ echo x > a; echo x > b
$ rmlint --mtime-window 0
# finds both.
$ sleep 1 && touch b
$ rmlint --mtime-window 0
# finds none.
$ rmlint --mtime-window 100 # less may suffice, just to be sure.
# finds both again.
$ rmlint --mtime-window -1  # negative values disable the check.
# finds both again.

@Awerick
Copy link
Author

Awerick commented Nov 19, 2016

The first implementation with --consider-mtime worked as expected: Thanks! 😄


The implementation of --mtime-window seems more difficult to me:

  • For simple scenarios: If all files (of a duplicate group) are mutually within the mtime window of each other, or there are only two duplicates, rmlint seems to work correctly.
  • For complex scenarios (see below): It is actually yet not well defined for me what the desired outcome should be; also the result of rmlint was partly random for these scenarios.

Complex scenarios: Example and possible behaviors

$ echo "foo" > fooA;  echo "foo" > fooMiddle;  echo "foo" > fooZ
$ touch -d "2000-01-01 00:00:01" fooA
$ touch -d "2000-01-01 00:00:02" fooMiddle
$ touch -d "2000-01-01 00:00:03" fooZ
$ rmlint  --mtime-window 1  -S (some criteria)
# What is actually the desired outcome?

If three files (or more) are involved, as in the example above, I see difficulties:
The file fooMiddle includes the other two files, fooA and fooZ, within its mtime window. But those two files are not in each other's mtime window.
I think the question is: "Which of these files should belong to one duplicate group?"

Probably we have to clarify first, what the intended behavior of rmlint in such scenarios actually is!
I see three (different) reasonable behaviors (2. and 3. are good, rmlint seems to randomly mix both):

  1. The mtime difference between fooA and fooZ is bigger than mtime window, so they should never be considered duplicates of each other (i.e. they can not belong to the same duplicate group).

    => This behavior would lead to random results as both files are within fooMiddle's mtime window:
    Either fooA+fooMiddle or fooMiddle+fooZ will form a duplicate group (the remaining file would then be another, independent original). I don't like this behavior due to its randomness.

  2. Whether fooA and fooZ are considered duplicates depends on the original file:

    1. If fooMiddle is original, fooA and fooZ are both within its mtime window and thus duplicates.
    2. If fooA is original, fooMiddle is a duplicate (of course). fooZ would not belong to this duplicate group as it is not in fooA's mtime window (even though it is within fooMiddle's mtime window).
    3. If fooZ is the original, ... (opposite behavior of fooA, see ii.).

    In other words: The mtime window is the allowed mtime difference between an original and its duplicates (this implies that the mtime between two duplicates might be up to 2 seconds).

    => Result depends on which file is considered the original. (If ranking criteria -S results in a tie, the original (and also the result of this behavior) is chosen by chance.)

    (Might require sophisticated implementation: Originals have to be already defined; if there are files outside the mtime window of an original: create a new duplicate group (with a new original).)

  3. There is a "chain" of mtime windows from fooA (via fooMiddle) to fooZ, so they should in any case be considered duplicates of the same group.

    => Most deterministic behavior: If "mtime window" and the files' mtimes are known, it is unambiguous which files form duplicate groups. (For behaviors 1 and 2 this depends also on the given parameters.)

    Caution! Long "chains" of mtime windows are possible: Files with very different mtimes may be part of the same duplicate group, if there are several "intermediate files" with small mtime differences in-between, connecting them.
    Depending on the use case, such long chains might be a problem or are actually desired!
    (Possible solution: Implement an information/warning for the user, if the mtime between the oldest and the newest duplicate of a group differs by more than 5(?) times the mtime window.)

    (Rough implementation idea (not sure if it is applicable): Before processing the files, sort them not only by size but also (second sorting criteria) by mtime (i.e. first sort by mtime, then a stable sort by size). During processing, always consider the file with the most recent mtime.)

Current behavior of rmlint, for the three example files (mix of behavior 2 and 3):

  • With -S L: fooMiddle is the original, the others are duplicates (matches behavior 2 and 3).
  • With -S a: fooA is the original, fooMiddle a duplicate. Results for fooZ are randomly different:
  • Sometimes it is a standalone original (matches behavior 2).
  • Sometimes it is considered a duplicate of fooA (matches behavior 3).
  • (With -S A: Same as with -S a, sometimes behavior 2, sometime 3. But independent of -S a!)
    (With a fourth file with mtime 00:00:04, it seemed to always match behavior 2.)

The result for -S a/-S A even changes when I rename one of the files and name it back (although the mtime is not affected by this)! (This "randomness" is probably not intended but a bug, right?)

To sum up:

  • There are three (or even more) reasonable behaviors for --mtime-window in complex scenarios that rmlint could implement.
  • Currently rmlint seems to randomly give the result for behavior 2 or 3 in complex scenarios.
  • rmlint should probably be deterministic and implement either behavior 2 or 3 (3 might be easier).

@Awerick
Copy link
Author

Awerick commented Nov 19, 2016

It seems, milliseconds of the mtime are not taken into account (on filesystems with millisecond resolution).
Is this a bug or intended?

$ echo "bar" > a;  echo "bar" > b
$ touch -d "2010-01-01 00:00:00.00" a
$ touch -d "2010-01-01 00:00:01.99" b  # 1.99 seconds later
$ rmlint --mtime-window 1
  # considers a and b as duplicates...

@sahib
Copy link
Owner

sahib commented Nov 19, 2016

Those are valid points and it might need some work. I somehow hoped it would be well defined already,
but you did (of course) your homework. Sorry for the sloppy work. I think behaviour 3 sounds the most sanest and is coincidentally the easiest to implement.

It seems, milliseconds of the mtime are not taken into account (on filesystems with millisecond resolution). Is this a bug or intended?

Depends on what you see as bug. We currently only save the mtime as integer (i.e. seconds since epoch).
Linux offers millisecond resolution (see man 2 stat; st_mtim), but that is not guaranteed by POSIX.
See the st_mtime member, that's no timespec (secs + usecs), it's a simple time_t (secs).
Although st_mtim is found on many platforms, it might introduce portability problems.

@Awerick
Copy link
Author

Awerick commented Nov 19, 2016

Well, I don't think it was sloppy work as it works fine for usual scenarios!
Only if edge cases occur it gets complex/problematic – therefore I actually did not expect you to implement it ;-) I only came up with the possible behaviors and problems when you implemented it, so I am also not sure whether I considered everything correctly.
If you are still going to implement it, I am looking forward to it, but the more simple --consider-mtime already worked fine for me.

Regarding the milliseconds resolution: Thanks for the info. Having a finer resolution would probably be good, but I understand that portability is a concern.


Three other things I stumbled upon:

  • Just a remark: If the generated script is run in paranoia mode (-p), it only rechecks that the content is still identical, it does not recheck the mtime. I think this is ok as an mtime change after the script was generated is extremely rare. But I wanted to mention it in case you think it should be addressed or explained in the manual/help output of this option.
  • The short option -z (that is now used as a short option for considering the mtime) seems to be already used for --perms in rmlint.
  • Speaking of permissions and because it is a related scenario, I am curious (although currently not affected by it): Files with same content but different permissions are considered duplicates – and currently there is no option to consider this, right? (Same for the other two metadata entries: owner and group? Sorry for bringing up/speaking about another of these meta-data scenarios... 😊)

@sahib
Copy link
Owner

sahib commented Nov 20, 2016

Approch 3 should be implemented by 78bbe18. The file list is now sorted by mtime and now only split by the mtime criteria (before, it was also sorted by the mtime window criteria, which lead to the random behaviour).

Only if edge cases occur it gets complex/problematic

Well, I thought that's my job. 😛

Regarding the milliseconds resolution: Thanks for the info. Having a finer resolution would probably be > good, but I understand that portability is a concern.

Probably, but fixing stuff on systems you have no access to makes you kinda conservative regarding those changes. But it's on my todo list, since it's no harm to have second precision on platforms that
don't support it. My concern was just that it needs some buildsystem magic, which can break easily.

Just a remark: If the generated script is run in paranoia mode (-p), it only rechecks that the content is still identical, it does not recheck the mtime. I think this is ok as an mtime change after the script was generated is extremely rare. But I wanted to mention it in case you think it should be addressed or explained in the manual/help output of this option.

The short option -z (that is now used as a short option for considering the mtime) seems to be already used for --perms in rmlint.

Oops, thanks. I had a comment in the source code that showed --perms as free, my bad (fixed by 787b024, now it's a big 'Z' - I'm running low on letters).

Speaking of permissions and because it is a related scenario, I am curious (although currently not affected by it): Files with same content but different permissions are considered duplicates – and currently there is no option to consider this, right? (Same for the other two metadata entries: owner and group? Sorry for bringing up/speaking about another of these meta-data scenarios... )

No, there's no such option. It would be possible to have such options, but if there's no usecase (yet) I won't implement those just for reasons of consistency.

Anything left on this issue?

@sahib
Copy link
Owner

sahib commented Nov 20, 2016

It seems st_mtim is a standard member of struct stat since POSIX.1-2008.
I switched the type of all mtime members to be a floating point type in e9934c5.
The above testcase with the 1.99 time difference now yields the expected result.

@Awerick
Copy link
Author

Awerick commented Nov 21, 2016

  • mtime resolution: Nice!
    For 1.9 difference it works now. However for 0.9 with --mtime-window 0 it does not:

    $ touch -d "2000-01-01 00:00:00.0" a
    $ touch -d "2000-01-01 00:00:00.9" b
    $ rmlint --mtime-window 0
      # still gives two duplicates
  • The improved mtime resolution seems to also affect the complex scenarios:

    echo "foo" > fooA;  echo "foo" > fooB;  echo "foo" > fooC;  echo "foo" > fooD
      # Test with regular mtimes:
    touch -d "2000-01-01 00:00:01.0" fooA
    touch -d "2000-01-01 00:00:02.0" fooB
    touch -d "2000-01-01 00:00:03.0" fooC
    touch -d "2000-01-01 00:00:04.0" fooD
    rmlint  --mtime-window 1  # => 1 duplicate group (OK)
      # Test with irregular mtimes:
    touch -d "2000-01-01 00:00:01.1" fooA
    rmlint  --mtime-window 1  # => 2 duplicate groups (should still be 1)

    (When multiplying the (milli-)seconds by 10 (i.e. :11, :20, :30, :40) and using --mtime-window 10, everything works correctly. This is why I think it is also related to the milliseconds mtime.)

  • I have been thinking, how to describe the mtime-window behavior for the docs. Just a suggestion:

    "--mtime-window=T (default: -1)
    [Current explanation for usual use cases ...]"

    However, with three (or more) files, the mtime difference between two duplicates can be bigger than the mtime window T! (The behavior is "transitive".)
    Example: With --mtime-window 1, the four files "fooA (mtime: 00:00:00), fooB (00:00:01), fooC (00:00:02), fooD (00:00:03)" would all belong to the same duplicate group, although the mtime of fooA and fooD differs by 3 seconds.

    (The example include 4 files, not only 3 to be unambiguous: Behavior 2 can sometimes also treat 3 files as a duplicate group. With 4 files this is only possible with behavior 3.)

    (Or more elaborate in the description:)

    ...than the mtime window T, if there are other duplicates with intermediate mtimes ("chain of mtime windows"). (The behavior is "transitive".)

  • Another manual suggestion: Include short info that the default "mtime window" value (-1) means: "ignore the mtime (only content is compared)"?
    (Maybe INF/all/ignore would be more descriptive than -1?)

@sahib
Copy link
Owner

sahib commented Nov 21, 2016

Hehe, are you a mathematician by chance? You would be a good software tester too.

For 1.9 difference it works now. However for 0.9 with --mtime-window 0 it does not:

Yes, this was a slightly stupid mistake. If the diff was smaller than 1.0 it got casted to an integer. In that case 0 was returned falsely. Fixed by 3221b9b.

The improved mtime resolution seems to also affect the complex scenarios:

I couldn't reproduce your testcase here, but I believe it was due to the same bug.

I have been thinking, how to describe the mtime-window behavior for the docs. Just a suggestion:

I used most of your text, thanks (see 042dfc5). I spared the term transitive though. It's very accurate, but at least I had too look it up (although I guess I'm supposed to know that). Since I'm just a human, I expect other people to be as stupid as me. 😄

Another manual suggestion: Include short info that the default "mtime window" value (-1) means: "ignore the mtime (only content is compared)"?
(Maybe INF/all/ignore would be more descriptive than -1?)

I think that info is in there:

If the window size is negative, the mtime of  duplicates will not be considered.

Regarding the -1: The reasons are rather pragmatic:

  1. The option parser only handles integers or string, but not both at the same time.
  2. I'm a lazy being that tries to reduce the amount of code he writes. Parsing the argument myself
    is possible but involves more code.
  3. While -1 does not make much sense from a math point of view, it feels rather intuitive/easy to remember to me. If the user has to guess what option is there to disable the window (he usually won't bother) he will rather guess a negative number than What was the string again? inf? INF? all? AlL?".

I can see that inf would be more well defined, but at least my computer did not learn yet what "infinity" is regarding integers. He just keeps counting.

@Awerick
Copy link
Author

Awerick commented Nov 22, 2016

Hehe, are you a mathematician by chance? You would be a good software tester too.

Haha – no, but I do have some love for mathematics :-)

I couldn't reproduce your testcase here, but I believe it was due to the same bug.

Yep, seems to be fixed by this!

I think that info is in there: [...]

Oh, yes it is... I must have overlooked it then!

The change from -z to -Z may also be updated in the manual...
This does actually not belong to this issue but it's just another minor manual thing: The explanation of the -S --rank-by default value still speaks about H ("OH ensures..." instead of "O ensures...") but for the moment H is not used in the default value (pOma).

I just realized: Now --mtime-window also allows milliseconds – very cool 😄

@sahib
Copy link
Owner

sahib commented Nov 22, 2016

The change from -z to -Z may also be updated in the manual...
This does actually not belong to this issue but it's just another minor manual thing: The explanation of the -S --rank-by default value still speaks about H ("OH ensures..." instead of "O ensures...") but for the moment H is not used in the default value (pOma).

Oops, fixed in 24afd58. Maybe one day I'll remove all those short versions for the rather obscure options. I'm running really low on letters...

I just realized: Now --mtime-window also allows milliseconds – very cool 

I added a note in ae3a877. The manpage might need a bit polish here and there.
Something like https://asciinema.org for some features and as introduction would be cool too.

Anything left on this issue now?

@Awerick
Copy link
Author

Awerick commented Nov 22, 2016

(The 'msec' was just me realizing that now this is possible. I have no opinion on putting it in the manual.)

I think that's it.
Many thanks for implementing this, as well as taking the time for the discussions!!

@Awerick Awerick closed this as completed Nov 22, 2016
@SeeSpotRun
Copy link
Collaborator

I don't have a good solution for this but the mtime window algorithm is not particularly robust; see SeeSpotRun@d694656

@sahib
Copy link
Owner

sahib commented Apr 9, 2017

I think we already discussed the issue you are testing for here above.
In short:

  • The chaining of the mtime windows is supposed to happen (see here)
  • It's the easiest to understand (and also implement...) option to support,
    because we do not have originals yet at the point (and I'm not a fan of moving more stuff into shredder).

@SeeSpotRun
Copy link
Collaborator

Not quite the same - in the above example the chained group are all duplicates; in the testcase I referenced the middle file is a random file that just happens to have the same size. It's probably reasonable to expect rmlint to reject the duplicates in this case. But there's already quite a bit of code for an arguably not very critical feature so I'm happy to leave as-is (and certainly not keen to move it into shredder).
My thought was to add a testcase that is known to fail as a way of flagging this issue, in case someone comes along later with a better solution... but flag it in such a way the Travis doesn't complain (see 97b7198).

@SeeSpotRun
Copy link
Collaborator

My gut feeling here is this is a post-processing issue and should come after duplicate detection. In the use cases above the user is interested in the forensic information that a files mtime has changed but not its contents. They don't want rmlint to clobber that indiscriminately by hardlinking the two copies; the current implementation successfully avoids that. But the current implementation doesn't bring these duplicates to the user's attention either.
A different solution could be an interactive rmlint.sh (yes I know you don't like that idea on principle, but this is a special case) which asks what to do with duplicates outside of mtime window. That could be logically extended to other metadata (owner, permissions). Maybe a boolean json field metadata_conflict to flag this for other handlers such as shredder.
Perhaps @Awerick wants to weigh in but I'm certainly happy to let this one lie where it is for now; there are bigger fish to fry.

@sahib
Copy link
Owner

sahib commented Apr 10, 2017

Ah, Sorry, I didn't notice the files were different in the testcase. Well observed.
It should be fixable though by adding a post-filter that filters groups by mtime gaps in rm_fmt_flush (and implying cache_file_structs for --mtime-window).

A different solution could be an interactive rmlint.sh (yes I know you don't like that idea on principle, but this is a special case) which asks what to do with duplicates outside of mtime window. That could be logically extended to other metadata (owner, permissions). Maybe a boolean json field metadata_conflict to flag this for other handlers such as shredder.

True, I don't like rmlint.sh being interactive. 😛
metadata_conflict as json field sounds more reasonable, but we don't have any usecase for that yet. Also, we would still need to scan all duplicates if I understood your proposal correctly.

@SeeSpotRun
Copy link
Collaborator

Yes the proposal would be to ignore mtime during
preprocessing and duplicate finding, then consider it during post processing. This would mean scanning more files but in return we get the useful information that two files have same data but different metadata.

@sahib
Copy link
Owner

sahib commented Apr 10, 2017

Since we don't have any usecase for that yet, I would not go there.
Also, if a user is interested in the files that got filtered due to mtime, he can always re-run (and I think even replay) without the filter. From my point of view the user actually added the filter to get rid of those and possibly to speedup the scan.

@sahib sahib reopened this Apr 10, 2017
@Awerick
Copy link
Author

Awerick commented Apr 10, 2017

I tried to understand the reason for the issue but I am not so familiar with the code. Did I understand the following correct?

Reason for the Problem

During Preprocessing: Actually the split method (lib/preprocess.c:rm_file_cmp_split) is supposed to split a and c into different groups (because their mtime is bigger than mtime_window). However this takes place during preprocessing; this is a problem because during preprocessing the files are only roughly grouped by their file size – so b ends up in the same group as a and c.
Now the split method does not separate a and c anymore because b forms a chain of mtime_windows between them.

During the extensive check (by file content/hash): Lateron b is extracted in a new group because it has different content. However there is (currently) no check that the remaining files in the group (a and c) are still connected by a chain of mtime_windows.
=> So a and c are considered duplicates despite their "big" mtime difference.

Solutions

If this is correct, I can think of two possible solutions (do you mean these by post-filter?).
(Assuming that the ordering by mtime is still done during preprocessing…)

  1. After the extensive check, run a single check for each final group to check, whether all files in the group are connected by a chain of mtime_windows; otherwise split the group at the disconnected files (in rmlint/lib/shredder.c, correct?).
    Since the files are already ordered by their mtime, this should not be problematic, right? (Either a single loop, or something like checking the first and the last file and if they are within the time window, everything is good, otherwise divide the group and recursively check the sub-groups for disconnected files.)
    (Side effect: One could then remove the mtime_window specific part from the split method to make the code more clean. (Or leave it because it may sometimes speed up the process…))
  2. (During the extensive check?) For each file that is put in in another group, perform a check, to ensure that the mtime_window chain is still intact (and another check for the other group?). – But I am not sure whether this is possible with the current implementation.

I am sorry if this is a repetition on @SeeSpotRun's suggestion but I did not fully understand the usecase and the details you were discussing.

@SeeSpotRun
Copy link
Collaborator

@Awerick your understanding of the situation is correct.

For now my preference is:
3. Leave everything as-is and add a note to the documentation to say the mtime-window may give false positives in rare cases. There are some other areas of code I wanted to work on before coming back to this.

If/when we do decide this needs fixing I would favour an algorithm as follows:

  1. No mtime consideration during pre-processing or duplicate matching
  2. In post-processing, consider mtime window of duplicates vs original file(s); anything within the window gets promoted to original
  3. During output (depending on output formatter), consider whether to add any additional output to bring the mtime discrepancy to user's attention (eg maybe ls -t /dir/foo # original with different mtime)

Disadvantage vs current:

  • Doesn't give the speed increase that the current method does (current method kicks out mtime-rejects prior to duplicate matching).

Advantage vs current:

  • Gives user useful information about duplicates with different mtimes. For example in the usecase about photographs where one copy has original mtime (desired) and another copy has the mtime when the copy was made (not desired), the user has the chance to correct the situation by deleting or hardlinking or re-copying the offending files.

@sahib
Copy link
Owner

sahib commented Apr 11, 2017

Just to add to @SeeSpotRun:

My preference is to fix it directly by adding the kind of code @Awerick described in 1 to eiher shredder.c:rm_shred_forward_to_output or formats.c:rm_fmt_flush. This would directly fix the issue that the brain deprived whoever (:stuck_out_tongue:) who wrote that code did not foresee.

I personally don't see any point at @SeeSpotRun's approach to keep them until output.
In my opinion this would just further complicate the code and the user's experience. If the rare case occurs that someone wants to view the filtered files he can just re-run rmlint and find them out himself:

$ rmlint -T df /tmp -o json:rmlint.json.with-mtime --with-mtime 1
$ rmlint -T df /tmp -o json:rmlint.json.without-mtime
# Show the symmetric difference between both runs:
$ sort <(grep -h '"path":' rmlint.json.with*) | uniq -u   

(Just a sketch, you get the idea).

Also it would kill the usecase where the user has more knowledge than rmlint and wants to speed up things, by filtering files out of the window. Option 3 by @SeeSpotRun is not my preference, documenting bugs is evil. 😏

@Awerick
Copy link
Author

Awerick commented Apr 11, 2017

@SeeSpotRun: With "during duplicate matching" (in 1.) do you mean, what I called "extensive check"?
I can see that it's cleaner to remove all mtime specific code (splitting as well as initial ordering by mtime) from preprocessing and instead do it after the extensive check. Did you mean this when you said "in post-processing"? (I am not sure what "post-processing" refers to, there is no method called so…)

But I don't quite understand why (in 2.) you would consider all files within the mtime window as originals. Did you mean all files within the mtime window are a duplicate set and one of them is the original? That's what I would expect, because all files within the mtime window (or a chain of mtime windows) should be considered duplicates.

In general I'd prefer to not have additional output filtering/interaction for the user if rmlint is able to handle it. (Addendum: See below, probably I did not understand your intention correctly.)


Oh, after re-reading your suggestion I think you are focusing on another (important) aspect of the use case. Scenario:
(a) Two copies with very similar mtimes.
(b) Another copy with a very different mtime.
I was only thinking, how rmlint should treat the two copies (a) as a duplicate group and thought, you want rmlint to give additional output whenever this happens.
But it seems, you want the additional output only if rmlint detects another copy (b) because the existence of such a copy is often something, the user might want to investigate. – I like that! (And as you said: I like the perspective that it should be easily extendable to cover differences in other file meta data…) That said, I am not the main developer which has to maintain all the code 😜

Having said that, the initial solution would still be helpful to fix the current unstable situation and would imo be needed anyway even if later this "output notification" is added.

@SeeSpotRun
Copy link
Collaborator

@Awerick, you're right I had that back-to-front. And by post-processing I mean everything that happens after a duplicate group has been finalised (starting with shredder.c:rm_shred_result_factory)

@sahib I have no objections to option 1 except to point out that this possibly clashes with options -m and -B if someone wanted to devise some particularly devious testcases. That just means the post-processing would have to iteratively re-test -Z, -B and -m criteria until there are no further rejections.

@sahib sahib added the bug label Apr 14, 2017
@sahib
Copy link
Owner

sahib commented Apr 15, 2017

I pushed a fix (which implements option 1 from above) as of f17a3f1 and 44a9ca6.
All tests pass, but I'll see tomorrow if I can break it somehow with the points you noted (although I think the new implementation should handle -Z, -B and -m okayish...)

@sahib
Copy link
Owner

sahib commented Apr 28, 2017

Anyone already tried to break it? 😏

@Awerick
Copy link
Author

Awerick commented Apr 28, 2017

I tried some file constellations that I could think of and for these rmlint worked correctly!
(However I did not play with different options!)


I just have a question of detail regarding the implementation:
I thought that the files don't need another sorting by mtime during the post-processing/final splitting because the files are already sorted by mtime in preprocess.c#L98.
However there is another sorting by mtime during post-processing (shredder.c#L1348) – and without it, the code also does not work correctly.
So, is the ordering by mtime not stable/preserved from pre-processing until post-processing?

@sahib
Copy link
Owner

sahib commented Apr 28, 2017

I tried some file constellations that I could think of and for these rmlint worked correctly!

So far, so good. Thanks for having a look.

So, is the ordering by mtime not stable/preserved from pre-processing until post-processing?

No. The ordering that arrives at the output sink kind on depends on how fast the group of files finished and the inner sorting of that group is not really defined but also kind of depends on the finishing order.
Think of shredder as a complex beast that may re-order files at will (see also shredder.c:L939).
The pre-process sort is still "needed" to avoid unnecessary I/O.

@Awerick
Copy link
Author

Awerick commented Apr 29, 2017

Ah ok, thanks for the insights!

@sahib
Copy link
Owner

sahib commented Apr 30, 2017

I will close this again then, until new bugs will be discovered.

@Awerick
Copy link
Author

Awerick commented Oct 7, 2017

I stumbled upon a failed assertion, if --mtime-window is used on a hardlinked group and some more factors exist. I could reproduce the bug with the following, minimal constellation (took me some time to pinpoint it):

$ echo text > file1
$ echo BLUB > file2
$ echo text > file3
$ ln   file3  file4
$ touch -d "2001-01-01 00:00:01" file1
$ touch -d "2001-01-01 00:00:02" file2
$ touch -d "2001-01-01 00:00:03" file3
$ ll --full-time -i *   # Overview of the file setup:
550 -rw-rw-r--. 1 user group 5 2001-01-01 00:00:01.000000000 +0100 file1
551 -rw-rw-r--. 1 user group 5 2001-01-01 00:00:02.000000000 +0100 file2
552 -rw-rw-r--. 2 user group 5 2001-01-01 00:00:03.000000000 +0100 file3
552 -rw-rw-r--. 2 user group 5 2001-01-01 00:00:03.000000000 +0100 file4

$ rmlint  --mtime-window 1
FEHLER: BUG: Assertion failed at `lib/shredder.c@903`: file->hash_offset == shred_group->hash_offset
FEHLER:      Will try to continue in 2 seconds. Expect crashes.
# Duplikate:
    ls '/tmp/rmlint_test/file3'
    rm '/tmp/rmlint_test/file4'

I identified the following keypoints for the failure to happen:

  • file3 and file4 are hardlinked.
  • file1 has the same content as file3/file4 but has a 2 seconds earlier mtime. (If its mtime is 2s after file3/file4, the failure does not occur.)
  • file2 has a different content than the others, BUT has the same number of bytes.
  • Furthermore: file2 has a time difference of 1 seconds to file1 and file3/file4.
    (Speculation: It seems as if file2 is used for "chaining" the mtimes from file1 to file3/file4 – but in the end, an assertion is violated, due to the different content of file2.)

@SeeSpotRun
Copy link
Collaborator

Wow nice catch.

It seems daisy-chaining has plenty of potential to confuse pre-processing. Thanks for creating the testcase @Awerick - it will make debugging a lot easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants