Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune Parquet Dependencies #2668

Merged
merged 3 commits into from
Jan 3, 2025

Conversation

vincent-4
Copy link
Contributor

Pruned with https://github.com/strategicblue/parquet-floor, as well as added tests.

We created a test class for ParquetDenseVectorCollection that extends DocumentCollectionTest. Instead of creating new test files, we utilized an existing Parquet test file containing BGE embeddings.

Replaces Hadoop dependencies with parquet-floor for: reduced dependency footprint, simplified Parquet file handling, removal of complex Hadoop configuration.

Tests verify: basic Parquet file reading functionality, document iteration and content validation, integration with existing BGE embedding test data.
Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

What's the fatjar size, before and after?

Vincent Zhong added 2 commits December 24, 2024 11:25
…1.36))

- Remove unnecessary strategicblue repository
- Fix test documentation formatting in ParquetDenseVectorCollectionTest (reordered) to be above the class
- Add Apache 2.0 license boilerplate to test file's header
@vincent-4
Copy link
Contributor Author

vincent-4 commented Dec 24, 2024

@lintool
See 3e85803 for the changes that address your feedback. I updated the Parquet format to version 1.49 in the same commit, but it caused issues. d17afc8 resolves this problem.

I get:
140MB using mvn package -DskipTests
234MB using mvn package -DskipTests on the previous version.
Yay!

@lintool
Copy link
Member

lintool commented Dec 24, 2024

🎉 Yay for smaller size!

    <repository>
      <id>jitpack.io</id>
      <url>https://jitpack.io</url>
    </repository>

Do we still need this though?

@vincent-4
Copy link
Contributor Author

I keep the jitpack.io repository as it's required for com.github.TREMA-UNH:trec-car-tools-java dependency.
See: https://github.com/TREMA-UNH/trec-car-tools-java/blob/master/pom.xml#L14-L16
If I'm not wrong? I don't find it on Maven central.

@lintool
Copy link
Member

lintool commented Dec 24, 2024

Ah, okay! I'll queue this up for regression testing (machines occupied right now) before I merge.

What tests have you run, BTW? Did the MS MARCO v1 tests pass?

@lintool
Copy link
Member

lintool commented Dec 26, 2024

Need to run more regressions, but initial signs are 👍 - the following passed:

python src/main/python/run_regression.py --download --index --verify --search --regression msmarco-v1-passage.bge-base-en-v1.5.parquet.flat.onnx

@vincent-4
Copy link
Contributor Author

Any updates on the regressions?

@lintool
Copy link
Member

lintool commented Jan 2, 2025

@vincent-4 why are we on v1.36 and not a later version?

@vincent-4
Copy link
Contributor Author

@vincent-4 why are we on v1.36 and not a later version?

Without modifying the existing codebase, I tested parquet-floor versions between 1.36 and 1.49. The latest version that doesn't produce upgrade-related errors using mvn clean verify is 1.36.

I have two possible options:

  1. Continue working on fixing the (several) errors related to switching to the latest version of parquet-floor (which would require more changes).
  2. Implement all the changes now if your priority is just to use parquet-floor, which should be fine given the successful regressions and MSMARCO tests. If we choose this option, I could try upgrading to the latest version in another PR.

Let me know which approach you prefer?

@lintool
Copy link
Member

lintool commented Jan 3, 2025

I ran all Parquet regressions:

cat src/main/python/regressions-batch0* | grep parquet > src/main/python/regressions-parquet.txt
nohup python src/main/python/run_regressions_with_load.py --file src/main/python/regressions-parquet.txt --load 64 --sleep 60 >& regressions.parquet.log &

Everything checks out. Ready to merge.

@lintool lintool self-requested a review January 3, 2025 16:01
@lintool lintool merged commit ea48ec8 into castorini:master Jan 3, 2025
1 check passed
@vincent-4 vincent-4 deleted the parquet-floor-dependency branch January 12, 2025 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants