New option: Consider mtime (modification time) #197
Can you elaborate a bit on your use case (i.e. what semantic information would that be in your case)?
Clarification

My issue (edit: when having no option to ensure that only files with the same mtime are duplicates) is: When using I realized that when using

Details on my use case

Using
When

In general I would like to have backups that keep disk usage to a minimum, but also (more importantly) mirror the original data as closely as possible (incl. metadata such as mtimes). Meanwhile I have taken a look at some other "duplicate finder" tools:

Some examples (not very common but still important)
However, if there exists a copy of that file (with the original mtime) and
However, I might accidentally download/create/extract an existing file a second time but delete this copy when I realize my mistake. – If
"What information would be lost if mtimes get equalized?"
(Extended option: "Time window"?)

In case this feature request gets implemented, the ( The
So... it's late and I might have had some misunderstandings while reading.
But from your second post it seems you want this:
If you want the second option, we might have a little problem. Consider this:

```
$ echo x > a
$ ln a b
$ stat --printf '%y\n' a b # Same mtimes, weird.
2016-11-16 23:30:59.467200299 +0100
2016-11-16 23:30:59.467200299 +0100
$ sleep 1 && touch a
$ stat --printf '%y\n' a b # Still the same.
2016-11-16 23:31:00.467200299 +0100
2016-11-16 23:31:00.467200299 +0100
```

It seems that every hardlink mirrors the mtime of the others: touching one changes the mtime of all of them.

Regarding the

The time window option feels (yes, feels) a bit too specialized to
Sorry, then my "clarification" was rather a "confusion" statement... (I added a short addendum to it.)
I agree that hardlinks to one file can probably not be created with different mtimes, as the mtime is saved in the inode. (For other handlers, e.g. symlinking, different mtimes are possible (good to know that
Probably "because", sorry about that 😄 Back to your question: I think, such an option would solve the problem for my use case! Regarding the linked code: I don't understand how this implements the correct behavior in case several groups with same mtimes exist (I thought, each mtime group should be also one duplicate group?), but it seems to work! $ mkdir consider_time && cd consider_time
$ echo "foo" > fooA1
$ cp -p fooA1 fooA2
$ echo "foo" > fooB1
$ cp -p fooB1 fooB2
$ ls -il --full-time
58800 […] 1 […] 2016-11-17 12:49:00.324444575 +0100 fooA1
58801 […] 1 […] 2016-11-17 12:49:00.324444575 +0100 fooA2
58802 […] 1 […] 2016-11-17 12:50:00.460617626 +0100 fooB1
58803 […] 1 […] 2016-11-17 12:50:00.460617626 +0100 fooB2
$ hardlink .
# or if implemented: rmlint --consider-time -c sh:hardlink && ./rmlint.sh
$ ls -il --full-time
58800 […] 2 […] 2016-11-17 12:49:00.324444575 +0100 fooA1
58800 […] 2 […] 2016-11-17 12:49:00.324444575 +0100 fooA2
58802 […] 2 […] 2016-11-17 12:50:00.460617626 +0100 fooB1
58802 […] 2 […] 2016-11-17 12:50:00.460617626 +0100 fooB2
```

(I used
Gotcha.
Okay, I expected the mtime to change when using

Go try out 8f38c6e. I took your
Okay, after looking at the code I noticed that it's almost less effort to implement the more flexible

```
$ echo x > a; echo x > b
$ rmlint --mtime-window 0
# finds both.
$ sleep 1 && touch b
$ rmlint --mtime-window 0
# finds none.
$ rmlint --mtime-window 100 # less may suffice, just to be sure.
# finds both again.
$ rmlint --mtime-window -1 # negative values disable the check.
# finds both again.
```
The first implementation with

The implementation of
Complex scenarios: Example and possible behaviors

```
$ echo "foo" > fooA; echo "foo" > fooMiddle; echo "foo" > fooZ
$ touch -d "2000-01-01 00:00:01" fooA
$ touch -d "2000-01-01 00:00:02" fooMiddle
$ touch -d "2000-01-01 00:00:03" fooZ
$ rmlint --mtime-window 1 -S (some criteria)
# What is actually the desired outcome?
```

If three files (or more) are involved, as in the example above, I see difficulties. Probably we have to clarify first what the intended behavior of
Current behavior of
The result for

To sum up:
It seems milliseconds of the mtime are not taken into account (on filesystems with millisecond resolution).

```
$ echo "bar" > a; echo "bar" > b
$ touch -d "2010-01-01 00:00:00.00" a
$ touch -d "2010-01-01 00:00:01.99" b # 1.99 seconds later
$ rmlint --mtime-window 1
# considers a and b as duplicates...
```
Those are valid points and it might need some work. I somehow hoped it would be well defined already,
Depends on what you see as a bug. We currently only save the mtime as an integer (i.e. seconds since the epoch).
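For reference on the resolution point (a minimal sketch, not rmlint's actual code): POSIX.1-2008 exposes sub-second mtimes via the st_mtim field of struct stat, so a finer comparison would be possible where the filesystem supports it.

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* st_mtime holds whole seconds; st_mtim adds nanoseconds (POSIX.1-2008).
     * Some platforms name the field differently, which is presumably part of
     * the portability concern mentioned below. */
    printf("mtime = %lld.%09ld\n",
           (long long)st.st_mtim.tv_sec, (long)st.st_mtim.tv_nsec);
    return 0;
}
```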
Well, I don't think it was sloppy work, as it works fine for the usual scenarios!

Regarding the millisecond resolution: thanks for the info. Having a finer resolution would probably be good, but I understand that portability is a concern.

Three other things I stumbled upon:
Approach 3 should be implemented by 78bbe18. The file list is now sorted by mtime and only split by the mtime criteria (before, it was also sorted by the mtime window criteria, which led to the random behaviour).
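A minimal sketch of that sort-then-split idea (an assumed structure, not rmlint's actual code; all names here are made up): sort the candidate files by mtime, then start a new group whenever the gap to the previous file exceeds the window, so files chained by small gaps stay together.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *path;
    double mtime;        /* seconds since the epoch */
} File;

static int cmp_mtime(const void *a, const void *b)
{
    double d = ((const File *)a)->mtime - ((const File *)b)->mtime;
    return (d > 0) - (d < 0);
}

/* Split one candidate group into mtime groups: a new group starts whenever
 * the gap to the previous (sorted) file exceeds the window.
 * A negative window disables the check (everything stays in one group). */
static void split_by_mtime_window(File *files, size_t n, double window)
{
    qsort(files, n, sizeof(File), cmp_mtime);

    int group = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && window >= 0 && files[i].mtime - files[i - 1].mtime > window)
            group++;
        printf("group %d: %s (mtime %.2f)\n", group, files[i].path, files[i].mtime);
    }
}

int main(void)
{
    /* fooA/fooMiddle/fooZ from the example above, 1 second apart each. */
    File files[] = { { "fooA", 1.0 }, { "fooMiddle", 2.0 }, { "fooZ", 3.0 } };

    /* With a window of 1, all three daisy-chain into a single group,
     * even though fooA and fooZ are 2 seconds apart. */
    split_by_mtime_window(files, 3, 1.0);
    return 0;
}
```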
Well, I thought that's my job. 😛
Probably, but fixing stuff on systems you have no access to makes you kinda conservative regarding those changes. But it's on my todo list, since it's no harm to have second precision on platforms that
Oops, thanks. I had a comment in the source code that showed
No, there's no such option. It would be possible to have such options, but if there's no use case (yet) I won't implement them just for reasons of consistency. Anything left on this issue?
It seems
Hehe, are you a mathematician by chance? You would be a good software tester too.
Yes, this was a slightly stupid mistake. If the diff was smaller than 1.0 it got cast to an integer, and in that case 0 was returned falsely. Fixed by 3221b9b.
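A simplified illustration of that kind of cast bug (not the actual rmlint code): truncating the floating-point difference to an integer makes any gap below 1.0 seconds compare as equal.

```c
#include <stdio.h>

/* Buggy: casting the double difference to int returns 0 ("equal")
 * for any gap smaller than 1.0 seconds. */
static int cmp_mtime_buggy(double a, double b)
{
    return (int)(a - b);
}

/* Fixed: keep the comparison in floating point and only report the sign. */
static int cmp_mtime_fixed(double a, double b)
{
    return (a > b) - (a < b);
}

int main(void)
{
    double a = 100.00, b = 100.99;
    printf("buggy: %d, fixed: %d\n", cmp_mtime_buggy(b, a), cmp_mtime_fixed(b, a));
    /* prints "buggy: 0, fixed: 1" */
    return 0;
}
```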
I couldn't reproduce your testcase here, but I believe it was due to the same bug.
I used most of your text, thanks (see 042dfc5). I spared the term "transitive" though. It's very accurate, but at least I had to look it up (although I guess I'm supposed to know that). Since I'm just a human, I expect other people to be as stupid as me. 😄
I think that info is in there:
Regarding the
I can see that
Haha – no, but I do have some love for mathematics :-)
Yep, seems to be fixed by this!
Oh, yes it is... I must have overlooked it then! The change from

I just realized: Now
Oops, fixed in 24afd58. Maybe one day I'll remove all those short versions for the rather obscure options. I'm running really low on letters...
I added a note in ae3a877. The manpage might need a bit of polish here and there. Anything left on this issue now?
(The 'msec' was just me realizing that this is now possible. I have no opinion on putting it in the manual.) I think that's it.
I don't have a good solution for this but the mtime window algorithm is not particularly robust; see SeeSpotRun@d694656 |
I think we already discussed the issue you are testing for above.
Not quite the same - in the above example the files in the chained group are all duplicates; in the testcase I referenced, the middle file is a random file that just happens to have the same size. It's probably reasonable to expect
My gut feeling here is that this is a post-processing issue and should come after duplicate detection. In the use cases above the user is interested in the forensic information that a file's mtime has changed but not its contents. They don't want rmlint to clobber that indiscriminately by hardlinking the two copies; the current implementation successfully avoids that. But the current implementation doesn't bring these duplicates to the user's attention either.
Ah, sorry, I didn't notice the files were different in the testcase. Well observed.
True, I don't like rmlint.sh being interactive. 😛
Yes, the proposal would be to ignore mtime during
Since we don't have any use case for that yet, I would not go there.
I tried to understand the reason for the issue, but I am not so familiar with the code. Did I understand the following correctly?

Reason for the Problem

During Preprocessing: Actually the split method (

During the extensive check (by file content/hash): Later on

Solutions

If this is correct, I can think of two possible solutions (do you mean these by post-filter?).
I am sorry if this is a repetition of @SeeSpotRun's suggestion, but I did not fully understand the use case and the details you were discussing.
@Awerick your understanding of the situation is correct. For now my preference is:

If/when we do decide this needs fixing I would favour an algorithm as follows:
Disadvantage vs current:
Advantage vs current:
Just to add to @SeeSpotRun: My preference is to fix it directly by adding the kind of code @Awerick described in

I personally don't see any point in @SeeSpotRun's approach of keeping them until output.

```
$ rmlint -T df /tmp -o json:rmlint.json.with-mtime --with-mtime 1
$ rmlint -T df /tmp -o json:rmlint.json.without-mtime
# Show the symmetric difference between both runs:
$ sort <(grep -h '"path":' rmlint.json.with*) | uniq -u
```

(Just a sketch, you get the idea.) Also it would kill the use case where the user has more knowledge than
@SeeSpotRun: With "during duplicate matching" (in 1.) do you mean what I called the "extensive check"? But I don't quite understand why (in 2.) you would consider all files within the mtime window as originals. Did you mean all files within the mtime window are a duplicate set and one of them is the original? That's what I would expect, because all files within the mtime window (or a chain of mtime windows) should be considered duplicates. In general I'd prefer not to have additional output filtering/interaction for the user if

Oh, after re-reading your suggestion I think you are focusing on another (important) aspect of the use case. Scenario:

Having said that, the initial solution would still be helpful to fix the current unstable situation and would imo be needed anyway even if this "output notification" is added later.
@Awerick, you're right, I had that back-to-front. And by post-processing I mean everything that happens after a duplicate group has been finalised (starting with

@sahib I have no objections to option 1 except to point out that this possibly clashes with options
Has anyone already tried to break it? 😏
I tried some file constellations I could think of, and for these I just have a question of detail regarding the implementation:
So far, so good. Thanks for having a look.
No. The ordering that arrives at the output sink kind of depends on how fast the group of files finished, and the inner sorting of that group is not really defined but also kind of depends on the finishing order.
Ah ok, thanks for the insights!
I will close this again then, until new bugs are discovered.
I stumbled upon a failed assertion, if

```
$ echo text > file1
$ echo BLUB > file2
$ echo text > file3
$ ln file3 file4
$ touch -d "2001-01-01 00:00:01" file1
$ touch -d "2001-01-01 00:00:02" file2
$ touch -d "2001-01-01 00:00:03" file3
$ ll --full-time -i * # Overview of the file setup:
550 -rw-rw-r--. 1 user group 5 2001-01-01 00:00:01.000000000 +0100 file1
551 -rw-rw-r--. 1 user group 5 2001-01-01 00:00:02.000000000 +0100 file2
552 -rw-rw-r--. 2 user group 5 2001-01-01 00:00:03.000000000 +0100 file3
552 -rw-rw-r--. 2 user group 5 2001-01-01 00:00:03.000000000 +0100 file4
$ rmlint --mtime-window 1
ERROR: BUG: Assertion failed at `lib/shredder.c@903`: file->hash_offset == shred_group->hash_offset
ERROR: Will try to continue in 2 seconds. Expect crashes.
# Duplicates:
ls '/tmp/rmlint_test/file3'
rm '/tmp/rmlint_test/file4'
```

I identified the following key points for the failure to happen:
Wow nice catch. It seems daisy-chaining has plenty of potential to confuse pre-processing. Thanks for creating the testcase @Awerick - it will make debugging a lot easier. |
I would like to have an option to not consider files with the same content but different mtimes ("modification time") as duplicates.
Reason: Sometimes the mtime carries semantic/pragmatic information which makes it desirable to keep it!