Make the splitter use less memory #85

Closed
rom1504 opened this issue Jan 3, 2022 · 4 comments

@rom1504
Owner

rom1504 commented Jan 3, 2022

If doing that then the memory usage can be capped to very low values

See https://github.com/criteo/autofaiss/blob/master/autofaiss/readers/embeddings_iterators.py#L139 for an example.

@rom1504
Owner Author

rom1504 commented Jan 7, 2022

Use Arrow IPC (https://arrow.apache.org/docs/python/ipc.html) and read the parquet files in batches.

@rom1504
Owner Author

rom1504 commented Feb 4, 2022

Read the files in batches; there is no need to first prepare a whole file of feather batches.
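
One way this could look, as a rough sketch: a generator that consumes batches as they are read and emits fixed-size shards, so no intermediate file is needed. `split_into_shards` and its parameters are hypothetical names, and plain Python lists stand in for whatever batch type the reader yields.

```python
from typing import Iterable, Iterator, List, Sequence


def split_into_shards(batches: Iterable[Sequence], shard_size: int) -> Iterator[List]:
    """Regroup a stream of variable-size batches into shards of shard_size items.

    Hypothetical helper: at most one shard plus one input batch is buffered.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    buffer: List = []
    for batch in batches:
        buffer.extend(batch)
        # Emit every complete shard currently sitting in the buffer.
        while len(buffer) >= shard_size:
            yield buffer[:shard_size]
            buffer = buffer[shard_size:]
    if buffer:
        # The last shard may be smaller than shard_size.
        yield buffer
```

Because it is a generator, the input batches can themselves come from a lazy reader, and each shard can be written out before the next one is built.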

@rom1504
Owner Author

rom1504 commented Feb 5, 2022

At least write all the shards in parallel so that writing to high-latency filesystems (S3, HDFS) is faster, e.g. by using https://filesystem-spec.readthedocs.io/en/latest/async.html
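
A simple sketch of parallel shard writes. It uses a thread pool from the standard library rather than the fsspec async API linked above, and it writes to local files. The function names, the `.txt` naming scheme and `max_workers` are assumptions for illustration only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_shard(path, lines):
    """Write one shard as newline-separated text and return its path."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return path


def write_shards_in_parallel(shards, output_dir, max_workers=8):
    """Write every shard concurrently; hypothetical helper, not the project's code."""
    os.makedirs(output_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_shard, os.path.join(output_dir, f"{i:05d}.txt"), shard)
            for i, shard in enumerate(shards)
        ]
        # result() re-raises any exception from a worker thread.
        return [future.result() for future in futures]
```

Threads help here because the time is spent waiting on I/O. For a remote filesystem, replacing `open` with `fsspec.open` should give the same pattern. Note that this version submits all shards up front, so it trades some memory for speed; limiting the number of in-flight shards would keep both bounded.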

@rom1504
Owner Author

rom1504 commented May 18, 2022

done now

@rom1504 rom1504 closed this as completed May 18, 2022