listAllDataObjectsUnderPath returns duplicate files for big folders with replica's #437

Closed
ramonvansparrentak opened this issue Jan 18, 2024 · 7 comments

@ramonvansparrentak
Contributor

We discovered that listAllDataObjectsUnderPath sometimes returns a "file" twice when there are more than 5000 data objects in the path.

The cause seems to be in listDataObjectsUnderPathViaGenQuery. This method filters out duplicate entries by checking whether the previous path equals the current path:

if (currentPath.equals(lastPath)) {
  continue;
}

When listAllDataObjectsUnderPath calls listDataObjectsUnderPathViaGenQuery multiple times in its while loop, the first result of each call is never filtered, although it could be the same as the result set of the previous call.

https://github.com/DICE-UNC/jargon/blob/master/jargon-core/src/main/java/org/irods/jargon/core/pub/CollectionListingUtils.java#L834
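
The behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the Jargon code: the class name, the method names, and the `/zone/...` paths are hypothetical, and it assumes the last path is tracked in a local variable that starts empty on every call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the paging logic described above; not Jargon code.
class BoundaryDuplicateSketch {

    // Mimics one per-page call: consecutive duplicates are dropped, but
    // lastPath starts empty on every call, so the first row is always kept.
    static List<String> dedupWithinPage(List<String> page) {
        List<String> out = new ArrayList<>();
        String lastPath = "";
        for (String currentPath : page) {
            if (currentPath.equals(lastPath)) {
                continue;
            }
            out.add(currentPath);
            lastPath = currentPath;
        }
        return out;
    }

    // Mimics the outer while loop: one call per page, results concatenated.
    static List<String> listAll(List<List<String>> pages) {
        List<String> all = new ArrayList<>();
        for (List<String> page : pages) {
            all.addAll(dedupWithinPage(page));
        }
        return all;
    }

    public static void main(String[] args) {
        // Two replicas per object; /zone/b straddles the page boundary.
        List<List<String>> pages = List.of(
                List.of("/zone/a", "/zone/a", "/zone/b"),
                List.of("/zone/b", "/zone/c", "/zone/c"));
        // Prints [/zone/a, /zone/b, /zone/b, /zone/c]
        System.out.println(listAll(pages));
    }
}
```

Under these assumptions, a duplicate survives only when the replicas of one object are split across two pages.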

@ramonvansparrentak
Contributor Author

#438

Note: I haven't tested it yet.

@korydraughn
Collaborator

> Although it could be the same as the result set of the previous call.

I found this statement a little confusing because of the term "result set". Are you talking about a specific row or the entire set of rows?

Are you finding duplicates only at page boundaries, or do you see duplicates in the middle of a page too?

@trel
Member

trel commented Jan 18, 2024

I interpret the description and the fix as only affecting boundaries. The fix gives the current call 'visibility' into the last element of the 'previous' call/page, so that element can be ignored if it is a duplicate.

@ramonvansparrentak
Contributor Author

ramonvansparrentak commented Jan 18, 2024

The bug and suggested fix are only for the boundaries.

For example, take files a, b, c, d with 2 replicas each. The first result set could look like ['a', 'a', 'b', 'b', 'c'] and the second like ['c', 'd', 'd']. The current code removes the duplicates within each set and thus returns ['a', 'b', 'c'] and ['c', 'd']. But the combined result contains c twice, at the boundary between the result sets.

The suggested fix keeps track of c so that it can be filtered out of the next set.
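
A minimal sketch of that idea, using the a, b, c, d example above. It is only an illustration and not necessarily how the linked PR is written: the class and method names are hypothetical, and it assumes rows arrive ordered by path so that replicas of the same object are adjacent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of carrying the last path across pages; not Jargon code.
class CarryOverDedupSketch {

    // Drops consecutive duplicates, starting the comparison from the last
    // path emitted by the previous page (null for the first page).
    static List<String> dedupPage(List<String> page, String lastPathOfPreviousPage) {
        List<String> out = new ArrayList<>();
        String lastPath = lastPathOfPreviousPage;
        for (String currentPath : page) {
            if (currentPath.equals(lastPath)) {
                continue;
            }
            out.add(currentPath);
            lastPath = currentPath;
        }
        return out;
    }

    // Concatenates all pages, remembering the last emitted path between calls.
    static List<String> listAll(List<List<String>> pages) {
        List<String> all = new ArrayList<>();
        String lastPath = null;
        for (List<String> page : pages) {
            List<String> unique = dedupPage(page, lastPath);
            all.addAll(unique);
            // Keep the old value if the whole page was filtered out.
            if (!unique.isEmpty()) {
                lastPath = unique.get(unique.size() - 1);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        List<List<String>> pages = List.of(
                List.of("a", "a", "b", "b", "c"),
                List.of("c", "d", "d"));
        // Prints [a, b, c, d]
        System.out.println(listAll(pages));
    }
}
```

The last path is left unchanged when a page yields no new entries, so a page consisting only of further replicas of c would still be filtered correctly.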

Hope that makes it clear.

@trel added this to the 4.3.4.0 milestone May 21, 2024

@korydraughn
Collaborator

I have not been able to reproduce this issue.

What version of Jargon are you using?
What version of iRODS did you see this happen against?

Do you have Jargon code which captures the issue?

@korydraughn
Collaborator

It has been pointed out to me that the issue appears when there are multiple replicas.

Will adjust my test and give it another try.

@korydraughn self-assigned this Aug 13, 2024

@korydraughn
Collaborator

I now have a test which reproduces the issue.

All that's left is to confirm PR #438 resolves it.
