Closes #4076: bug in reading and writing to/from parquet when the locales change #4077

@ajpotts ajpotts commented Feb 3, 2025

Fixes a user-reported bug: when an array is saved to Parquet on 40 nodes and later saved again on 20 nodes, not all of the files from the original save are deleted. As a result, when the array is read back in, it contains incorrect data.
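
For readers unfamiliar with the failure mode, here is a minimal sketch of how stale per-locale files can arise. It is not Arkouda code: the file naming (`<prefix>_LOCALE0000.parquet`), the helper names, and the use of plain text placeholders instead of real Parquet data are all assumptions made for illustration.

```python
import glob
import os
import tempfile


def write_chunks(directory, prefix, num_locales, tag):
    """Write one placeholder file per locale; `tag` records which save wrote it."""
    for i in range(num_locales):
        # Assumed naming scheme, for illustration only.
        path = os.path.join(directory, f"{prefix}_LOCALE{i:04d}.parquet")
        with open(path, "w") as f:
            f.write(tag)


def read_tags(directory, prefix):
    """Read every file matching the per-locale pattern, in sorted order."""
    pattern = os.path.join(directory, f"{prefix}_LOCALE*.parquet")
    tags = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            tags.append(f.read())
    return tags


def demo_stale_chunks():
    with tempfile.TemporaryDirectory() as d:
        write_chunks(d, "arr", 40, "old")  # first save on 40 locales
        write_chunks(d, "arr", 20, "new")  # second save on 20 locales
        return read_tags(d, "arr")


tags = demo_stale_chunks()
print(len(tags), tags.count("new"), tags.count("old"))
```

In this sketch the second save overwrites only the first 20 files, so a pattern-based read still picks up 20 leftover files from the first save. Deleting all matching files before the second write avoids that.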

@ajpotts ajpotts marked this pull request as ready for review February 3, 2025 21:28
@ajpotts ajpotts requested a review from e-kayrakli February 3, 2025 21:28
@e-kayrakli

I think the changes here are good. However, I am worried about removing files based on name-matching. IOW, the user could just have a file with a given name that matches the pattern purely by coincidence. I think there are two things that can mitigate that:

  • Save per-locale files in a directory. This doesn't eliminate the problem as the user may just create a problematic file in the given directory, which would, again, be removed.
  • As above, but also add metadata. The directory could contain a metadata file that lists all the files that represent chunks of an array. Instead of matching files by name, we can read that metadata and delete files based on the names stored there.

I don't think these are necessary for the first step, but they could be further action items to increase robustness. (This could be added as a comment on the client where we call _delete_arkouda_files.)
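
The metadata idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the manifest name `manifest.json`, its `{"files": [...]}` layout, and both helper functions are hypothetical and are not part of Arkouda.

```python
import json
import os

MANIFEST_NAME = "manifest.json"  # hypothetical metadata file name


def write_with_manifest(directory, prefix, num_locales):
    """Write placeholder chunk files plus a manifest listing exactly those files."""
    os.makedirs(directory, exist_ok=True)
    names = [f"{prefix}_LOCALE{i:04d}.parquet" for i in range(num_locales)]
    for name in names:
        with open(os.path.join(directory, name), "w") as f:
            f.write("chunk")
    with open(os.path.join(directory, MANIFEST_NAME), "w") as f:
        json.dump({"files": names}, f)
    return names


def delete_listed_files(directory):
    """Delete only the files named in the manifest, then the manifest itself."""
    manifest_path = os.path.join(directory, MANIFEST_NAME)
    if not os.path.exists(manifest_path):
        return []  # nothing recorded, so nothing is deleted
    with open(manifest_path) as f:
        names = json.load(f)["files"]
    removed = []
    for name in names:
        # Use only the base name so a manifest entry cannot point outside `directory`.
        base = os.path.basename(name)
        path = os.path.join(directory, base)
        if os.path.exists(path):
            os.remove(path)
            removed.append(base)
    os.remove(manifest_path)
    return removed
```

With this approach, a user file such as `arr_LOCALE_notes.parquet` in the same directory would survive the cleanup, because it is never listed in the manifest.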

@ajpotts ajpotts force-pushed the 4076_bug_in_reading_and_writing_to_from_parquet_when_the_locales_change branch 2 times, most recently from 87b509c to a654f28 Compare February 4, 2025 21:36

ajpotts commented Feb 5, 2025

The files are deleted according to the glob pattern in this function:

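    // Returns every file whose name matches <prefix>_LOCALE*<extension>,
    // i.e. the per-locale files written for a given prefix.
    proc getMatchingFilenames(prefix : string, extension : string) throws {
        return glob("%s_LOCALE*%s".format(prefix, extension));
    }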

So the user would only delete files matching the pattern <prefix>_LOCALE*<extension>. I think the probability of an unrelated file matching this pattern and being deleted is low. I added extra file reads to the unit tests to show that files with a similarly named prefix are not deleted.
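
To make the matching behavior concrete, here is a small Python approximation of the pattern. It uses `fnmatch.fnmatchcase` from the standard library as a stand-in for the Chapel `glob` call, which is an assumption: the two should agree for a simple `*` wildcard on bare file names, but this is not the server code. The function name and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_locale_pattern(filename, prefix, extension):
    """Approximate the <prefix>_LOCALE*<extension> pattern on a bare file name."""
    return fnmatchcase(filename, f"{prefix}_LOCALE*{extension}")


candidates = [
    "arr_LOCALE0000.parquet",     # a per-locale chunk for prefix "arr"
    "arr2_LOCALE0000.parquet",    # similar prefix, different dataset
    "arr_LOCALE_backup.parquet",  # unrelated user file that happens to fit
    "arr.parquet",                # no _LOCALE part at all
]

for name in candidates:
    print(name, matches_locale_pattern(name, "arr", ".parquet"))
```

In this sketch the similarly prefixed `arr2_...` file is not matched, which lines up with the unit-test claim above, while `arr_LOCALE_backup.parquet` is matched. The latter is the kind of coincidental match the name-based approach cannot rule out.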


ajpotts commented Feb 5, 2025

I think you make a valid point, though, that using prefixes to identify data has some weaknesses that would be worth addressing. For example, I noticed this issue with reading files with similar prefixes: #4083

@ajpotts ajpotts marked this pull request as draft February 5, 2025 18:54
@ajpotts ajpotts force-pushed the 4076_bug_in_reading_and_writing_to_from_parquet_when_the_locales_change branch from a6ede3c to fa7f7fa Compare February 5, 2025 18:58
@ajpotts ajpotts marked this pull request as ready for review February 5, 2025 21:08

@drculhane drculhane left a comment

This looks correct to me, although I'm not at all sure I have the resources to test it the way I'd like to. But walking through the code, I believe this is a correct fix to the issue.

@e-kayrakli

I think you make a valid point, though, that using prefixes to identify data has some weaknesses that would be worth addressing. For example, I noticed this issue with reading files with similar prefixes: #4083

Thanks! I commented under that issue about what I outlined above.

@ajpotts ajpotts force-pushed the 4076_bug_in_reading_and_writing_to_from_parquet_when_the_locales_change branch 2 times, most recently from 841b073 to 9a76f45 Compare February 7, 2025 17:40
@ajpotts ajpotts force-pushed the 4076_bug_in_reading_and_writing_to_from_parquet_when_the_locales_change branch from 9a76f45 to 940472f Compare February 7, 2025 17:41
@ajpotts ajpotts enabled auto-merge February 7, 2025 17:44
@ajpotts ajpotts added this pull request to the merge queue Feb 7, 2025
Merged via the queue into Bears-R-Us:master with commit f215ecd Feb 7, 2025
19 checks passed