Closes #4076: bug in reading and writing to/from parquet when the locales change #4077
I think the changes here are good. However, I am worried about removing files based on name matching. In other words, the user could have a file whose name matches the pattern purely by coincidence. I think there are two things that can mitigate that:
I don't think these are necessary for the first step, but could be further action items to increase the robustness. (Could be added as a comment on the client where we call |
The files are deleted according to the glob pattern in this function:

    proc getMatchingFilenames(prefix : string, extension : string) throws {
      return glob("%s_LOCALE*%s".format(prefix, extension));
    }

So the user would only delete files whose names start with the prefix followed by _LOCALE and end with the extension. I think the probability of unrelated files matching this pattern and being deleted is low. I added extra file reads to the unit tests to show that a file with a similarly named prefix would not be deleted.
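To make the matching behaviour concrete, here is a minimal Python sketch of the same `prefix_LOCALE*extension` pattern. It is not the Chapel server code: the helper name `matches_locale_pattern` and the example filenames (including the four-digit locale suffix) are assumptions made for illustration, and the prefix is assumed to contain no glob metacharacters.

```python
from fnmatch import fnmatchcase


def matches_locale_pattern(filename: str, prefix: str, extension: str) -> bool:
    """Hypothetical helper mirroring the "%s_LOCALE*%s" glob from the PR."""
    # fnmatchcase is case-sensitive on every platform, like a POSIX glob.
    return fnmatchcase(filename, f"{prefix}_LOCALE*{extension}")


# Illustrative filenames only; the real naming scheme may differ.
candidates = [
    "data_LOCALE0000.parquet",    # per-locale file for prefix "data"
    "data_LOCALE0039.parquet",    # another per-locale file for "data"
    "data2_LOCALE0000.parquet",   # similar prefix, different dataset
    "data_LOCALE_notes.parquet",  # unrelated file that happens to match
    "data.parquet",               # no _LOCALE part at all
]

matched = [
    name for name in candidates
    if matches_locale_pattern(name, "data", ".parquet")
]
print(matched)
```

In this sketch the similarly prefixed `data2_LOCALE0000.parquet` is left alone, which is the case the added unit tests cover, while `data_LOCALE_notes.parquet` shows the kind of coincidental match the review comment is concerned about.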
I think you make a valid point, though, that using prefixes to identify data has some weaknesses that would be worth addressing. For example, I noticed this issue with reading files with similar prefixes: #4083
This looks correct to me, although I'm not at all sure I have the resources to test it the way I'd like to. But walking through the code, I believe this is a correct fix to the issue.
Thanks! I commented under that issue about what I outlined above.
Fixes a user-reported bug: when an array is saved to Parquet from a 40-node run and later saved again to the same prefix from a 20-node run, not all of the files from the original save are deleted. As a result, when the array is read back in, it contains incorrect data.
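The failure mode can be reproduced outside Arkouda with a small Python sketch. Everything here is an assumption for illustration: the helper names, the `array` prefix, the four-digit locale suffix, and the use of plain text placeholders instead of real Parquet files. It only models one file per locale and a read that globs for every matching file.

```python
import glob
import os
import tempfile


def matching_filenames(directory, prefix, extension):
    """Hypothetical stand-in for the server's glob over per-locale files."""
    pattern = os.path.join(directory, f"{prefix}_LOCALE*{extension}")
    return sorted(glob.glob(pattern))


def save(directory, prefix, extension, num_locales, remove_stale):
    """Write one placeholder file per locale, optionally clearing old ones."""
    if remove_stale:
        # The behaviour this PR describes: delete earlier per-locale files
        # for the same prefix before writing the new set.
        for path in matching_filenames(directory, prefix, extension):
            os.remove(path)
    for i in range(num_locales):
        path = os.path.join(directory, f"{prefix}_LOCALE{i:04d}{extension}")
        with open(path, "w") as f:
            f.write(f"written by a {num_locales}-locale run\n")


def files_after_resave(remove_stale):
    """Save with 40 locales, then 20, and count the files a read would see."""
    with tempfile.TemporaryDirectory() as d:
        save(d, "array", ".parquet", 40, remove_stale)
        save(d, "array", ".parquet", 20, remove_stale)
        return len(matching_filenames(d, "array", ".parquet"))


stale_count = files_after_resave(remove_stale=False)
clean_count = files_after_resave(remove_stale=True)
print(stale_count, clean_count)
```

Without the cleanup step, the second save overwrites only the first 20 files, so a read that globs on the prefix still picks up 20 leftover files from the 40-locale run. With the cleanup step, only the 20 current files remain.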