Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch bad data prep #1644

Merged
merged 26 commits into from
Nov 15, 2024
Merged

Catch bad data prep #1644

merged 26 commits into from
Nov 15, 2024

Conversation

milocress
Copy link
Contributor

When privatelink blocks a delta table from being downloaded, this fails silently in the dataset preparation script.

This PR adds a validation step to check that the downloaded file indeed exists and contains data to prevent this kind of silent error.

@milocress milocress requested a review from a team as a code owner November 6, 2024 17:22
Copy link
Contributor

@nancyhung nancyhung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why fetch is writing empty rows. Do we why this fails silently? @milocress

@milocress
Copy link
Contributor Author

I wonder why fetch is writing empty rows. Do we why this fails silently? @milocress

https://github.com/mosaicml/llm-foundry/blob/9b22b6de4f67c3e4129cf2dd6b1435f926bbbc7b/llmfoundry/command_utils/data_prep/convert_delta_to_json.py#L485C1-L499C18

I believe there is a situation where cursor is None and no exception is thrown. This PR adds a more general check to make sure the file is written, which catches a wider variety of silent errors that could happen in the stack.

Copy link
Contributor

@irenedea irenedea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments. Thanks Milo!

@milocress milocress enabled auto-merge (squash) November 13, 2024 20:46
@milocress milocress merged commit db7a5c0 into main Nov 15, 2024
10 checks passed
@milocress milocress deleted the milo/catch-storage-issues branch November 15, 2024 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants