Prune Parquet Dependencies #2668
We created a test class for ParquetDenseVectorCollection that extends DocumentCollectionTest. Instead of creating new test files, we utilized an existing Parquet test file containing BGE embeddings.

Replaces Hadoop dependencies with parquet-floor for:
- reduced dependency footprint
- simplified Parquet file handling
- removal of complex Hadoop configuration

Tests verify:
- basic Parquet file reading functionality
- document iteration and content validation
- integration with existing BGE embedding test data
Nice!
What's the fatjar size, before and after?
src/test/java/io/anserini/collection/ParquetDenseVectorCollectionTest.java
…1.36)) - Remove unnecessary strategicblue repository - Fix test documentation formatting in ParquetDenseVectorCollectionTest (reordered) to be above the class - Add Apache 2.0 license boilerplate to test file's header
🎉 Yay for smaller size!
Do we still need this though?
I kept the jitpack.io repository because it's required for the com.github.TREMA-UNH:trec-car-tools-java dependency.
Ah, okay! I'll queue this up for regression testing (machines occupied right now) before I merge. What tests have you run, BTW? Did the MS MARCO v1 tests pass?
Need to run more regressions, but initial signs are 👍 - the following passed:
Any updates on the regressions?
@vincent-4 why are we on v1.36 and not a later version?
Without modifying the existing codebase, I tested I have two possible options:
Let me know which approach you prefer.
I ran all Parquet regressions:
cat src/main/python/regressions-batch0* | grep parquet > src/main/python/regressions-parquet.txt
nohup python src/main/python/run_regressions_with_load.py --file src/main/python/regressions-parquet.txt --load 64 --sleep 60 >& regressions.parquet.log &
Everything checks out. Ready to merge.
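For readers less familiar with the shell pipeline, the filtering step (`cat ... | grep parquet > ...`) can be sketched in Python roughly as follows. This is only an illustration: the function name `select_regressions` is hypothetical and not part of the repository, and it assumes the batch files are plain UTF-8 text with one regression command per line.

```python
import glob
from pathlib import Path


def select_regressions(pattern: str, keyword: str, output: str) -> int:
    """Hypothetical helper: keep lines containing `keyword` from all files
    matching `pattern` and write them to `output`. Returns the line count."""
    selected = []
    # Sort the matches so files are read in a stable, name-based order.
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            # Plain substring match, like `grep parquet` without regex features.
            selected.extend(line for line in handle if keyword in line)
    Path(output).write_text("".join(selected), encoding="utf-8")
    return len(selected)


if __name__ == "__main__":
    count = select_regressions(
        "src/main/python/regressions-batch0*",
        "parquet",
        "src/main/python/regressions-parquet.txt",
    )
    print(f"selected {count} regression commands")
```

The resulting file would then be passed to `run_regressions_with_load.py` via `--file`, as in the command above.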
Pruned dependencies with https://github.com/strategicblue/parquet-floor and added tests.