Shard size is automatically determined to produce ~100MB tfrecords files #258

Open
ohinds opened this issue Aug 25, 2023 · 3 comments

Comments

@ohinds
Contributor

ohinds commented Aug 25, 2023

According to the tensorflow user guide, tfrecords files should be ~100MB (https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/performance/overview.md). When tfrecords datasets are constructed from files, the shard size could be automatically computed to follow this guidance.

@ohinds ohinds self-assigned this Aug 25, 2023
@ohinds ohinds changed the title from "Shard size is automatically determined to produce ~100MB tfrecrods files" to "Shard size is automatically determined to produce ~100MB tfrecords files" Aug 25, 2023
@satra
Contributor

satra commented Aug 25, 2023

100MB doesn't make sense on fast disk systems like we have on openmind or for brain imaging data. i believe we have played with TB sized shards as well. i would make this a user controllable parameter.

@ohinds
Contributor Author

ohinds commented Aug 25, 2023

Well, the default currently produces tfrecord file sizes of about 20MB, so that makes even less sense. I'm suggesting an automatically-determined default, with the facility for people to override if they want something else.

Also, specifying a shard size in bytes makes way more sense than number of examples, as it currently is.
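To make the bytes-versus-examples point concrete, here is a minimal sketch of how a byte target could be translated into the examples-per-shard count that is used today. The function names (`examples_per_shard`, `num_shards`) and the `TARGET_SHARD_BYTES` constant are hypothetical and not part of the existing code; the sketch also assumes the per-example sizes passed in are a fair estimate of the serialized record size, which may not hold for compressed inputs.

```python
import math

# Hypothetical default: ~100MB, following the guidance linked in the issue.
TARGET_SHARD_BYTES = 100 * 1024**2


def examples_per_shard(example_sizes_bytes, target_shard_bytes=TARGET_SHARD_BYTES):
    """Estimate how many examples fit in a shard of roughly the target size.

    example_sizes_bytes: per-example sizes in bytes (assumed to approximate
    the serialized size of each record).
    """
    if not example_sizes_bytes:
        raise ValueError("need at least one example size")
    mean_size = sum(example_sizes_bytes) / len(example_sizes_bytes)
    # Always put at least one example in a shard, even if it exceeds the target.
    return max(1, int(target_shard_bytes // max(mean_size, 1)))


def num_shards(n_examples, per_shard):
    """Number of shards needed to hold n_examples at per_shard examples each."""
    return math.ceil(n_examples / per_shard)
```

A user-supplied `target_shard_bytes` would override the default, which keeps the parameter controllable while still giving a size-based starting point.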

@hvgazula hvgazula assigned hvgazula and unassigned ohinds Mar 11, 2024
@hvgazula hvgazula added this to Harsha Mar 11, 2024
@hvgazula hvgazula moved this to Todo in Harsha Mar 11, 2024
@hvgazula
Contributor

Probably a combination of du -hL /path/to/data and this might do?
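A rough Python stand-in for the `du -hL` part, as a sketch only: it sums apparent file sizes under a directory while following symlinks. The helper name `total_size_bytes` is hypothetical, and note that `du` reports disk usage in blocks by default, so the numbers may not match exactly.

```python
import os


def total_size_bytes(root):
    """Sum apparent file sizes under root, following symlinks (similar to du -L)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # os.stat follows symlinks, so linked files count at full size.
                total += os.stat(path).st_size
            except OSError:
                # Skip broken links and unreadable files.
                continue
    return total
```

Dividing this total by a target shard size in bytes would give an approximate shard count for the dataset.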

@hvgazula hvgazula moved this from Todo to Done in Harsha Apr 2, 2024

3 participants