[libbeat][reader] - Adding support for parquet reader #35183
Conversation
This pull request does not have a backport label.
To fix this, you need to add the backport labels for the needed branches.
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
I am keen to see some tests and some benchmarks. The areas I'm interested in are how much memory this requires for a given file size and what the processing costs are of converting arrow -> JSON -> beat.Event. We know this reader will be used with VPC flow data from Security Lake, and VPC flows are usually high-volume, so we want to ensure that this is up to that task (and ideally faster / more efficient than its text-based equivalent, which requires parsing).
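A minimal sketch of the kind of benchmark being asked for, reporting allocations and throughput per file. The helper `decodeParquetToEvents` and the testdata path are hypothetical placeholders, not names from this PR:

```go
package parquet_test

import (
	"os"
	"testing"
)

// BenchmarkParquetToEvents reads a sample parquet file once, then measures
// the decode path repeatedly so memory cost per file size is visible.
// decodeParquetToEvents and testdata/vpc_flow_sample.parquet are assumed
// placeholders for whatever the final package exposes.
func BenchmarkParquetToEvents(b *testing.B) {
	data, err := os.ReadFile("testdata/vpc_flow_sample.parquet")
	if err != nil {
		b.Fatal(err)
	}
	b.ReportAllocs()
	b.SetBytes(int64(len(data)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		events, err := decodeParquetToEvents(data) // hypothetical decode helper under test
		if err != nil {
			b.Fatal(err)
		}
		_ = events
	}
}
```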
I didn't leave any code level comments at this time. I kind of think that this could have its own package somewhere because the aws-s3 input is already complex. Then that package would be used from this input to apply parquet decoding to a stream.
All of the current parsers are stream based. The input can read a chunk of the stream, extract an event, and continue. This has modest memory requirements because it only needs to buffer enough data to uncompress and parse a single chunk.
In contrast, because the Parquet reader needs an io.ReadSeeker, this implementation allocates memory (via io.ReadAll) to hold the entire uncompressed object in a buffer. Users are pretty sensitive to high memory usage. Imagine the input downloads a 100 MiB gzip compressed parquet file that has an 80% compression ratio. That file will require at least 800 MiB to be allocated. Now imagine the user has configured max_number_of_messages to 5, so they are trying to process five files in parallel. If all their Parquet files are uniform then they need close to 5 GiB of memory just for Filebeat.
If we cannot process the parquet file in chunks directly from S3 then we should consider some means of implementing a reader that has less demands on memory. One method would be to switch over to using disk instead of memory if the content is larger than a few MiB.
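One way to obtain the io.ReadSeeker the parquet reader needs without buffering the whole uncompressed object in memory is to spill the content to a temporary file. A minimal sketch of that idea (not part of this PR; a real version would likely buffer small objects in memory first and only spill past a size threshold):

```go
package parquet

import (
	"io"
	"os"
)

// spillToDisk copies r into a temporary file and returns it as a seekable
// reader, so seeks do not require the whole uncompressed object to live in
// memory. The caller is responsible for closing and removing the file.
func spillToDisk(r io.Reader) (*os.File, error) {
	f, err := os.CreateTemp("", "parquet-*.tmp")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, err
	}
	// Rewind so the parquet reader sees the file from the beginning.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		f.Close()
		os.Remove(f.Name())
		return nil, err
	}
	return f, nil
}
```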
Please update the PR title and description to reflect the latest contents.
@andrewkroh, I have resolved the suggestions and updated the PR.
This pull request now has conflicts. Could you fix it? 🙏
@efd6, @andrewkroh, if all looks good atm could you approve the PR?
Co-authored-by: subham sarkar <[email protected]>
Do we need to add this to any external documentation, or is it just for internal implementation only?
@rdner For now it's an internal implementation. We will add documentation to the inputs that will leverage this reader for reading parquet files.
* initial commit for s3 parquet support
* updated changelog
* added license updates
* updated notice and go mod/sum
* removed libgering panic
* added parquet benchmark tests
* updated osquery package due to update in dependent thrift package
* added parquet reader with benchmark tests and implemented that reader in awss3 package
* addressed linting errors
* refactored parquet reader, added tests and benchmarks and addressed pr comments
* addressed pr comments
* resolved merge conflicts
* updated notice
* added more parquet file tests with json comparisons, addressed pr comments
* removed commented code
* removed bad imports & cleaned up tests
* updated notice
* added graceful closures with err checks in test
* added graceful closures with err checks in test
* removed s3 parquet implementation from this PR
* removed s3 parquet implementation from this PR
* Update filebeat.yml
* Update filebeat.yml
* updated notice
* addressed PR suggestions
* addressed PR comments
* updated godoc comment
* addressed PR comments, switched path with filepath
* updated CODEOWNERS and addressed PR comments
* addressed PR comments, added a rand seeding process
* fixed test seed value to 1
* updated comments
* removed defers in loops
* updated notice
* updated godoc comments as suggested
* updated changelog
* Update x-pack/libbeat/reader/parquet/parquet.go

Co-authored-by: subham sarkar <[email protected]>
Manual backport of this change from #35183
Manual backport of the same upgrade in #35183 to ensure we are compatible with the updated thrift dependency.
* Update go grpc version to 1.58.3 (#36904) (cherry picked from commit 09823f3) — conflicts in NOTICE.txt, go.mod, go.sum
* Resolve conflicts.
* Add CC0-1.0 License to notice rules. Manual backport of this change from #35183
* Update notice.
* Update gRPC to the expected 1.58.3 version
* Upgrade osquery-go to fix broken build. Manual backport of the same upgrade in #35183 to ensure we are compatible with the updated thrift dependency.

Co-authored-by: Michal Pristas <[email protected]>
Co-authored-by: Craig MacKenzie <[email protected]>
Type of change
What does this PR do?
This PR adds support for reading and parsing Apache Parquet files.
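A rough sketch of how an input might consume the new reader package. The names used here (`parquet.NewBufferedReader`, `Config`, `Next`, `Record`) and the import path are assumptions about the final API, not a confirmed reference:

```go
package main

import (
	"fmt"
	"log"
	"os"

	// Assumed import path for the reader package added in this PR.
	"github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet"
)

func main() {
	f, err := os.Open("flows.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Assumed API: a buffered reader that decodes parquet row batches and
	// exposes each record so it can later be mapped to a beat.Event.
	r, err := parquet.NewBufferedReader(f, &parquet.Config{
		ProcessParallel: true, // assumed option: decode columns in parallel
		BatchSize:       1000, // assumed option: rows per batch
	})
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	for r.Next() {
		rec, err := r.Record()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s\n", rec)
	}
}
```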
Why is it important?
This change enables us to support future Amazon Security Lake integrations/solutions.
Checklist
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added an entry in CHANGELOG.next.asciidoc
Author's Checklist
Related issues
Benchmark
Command: go test -v -cpu 1,2,4,8,10 -benchmem -run=^$ -bench . (run on a personal MacBook with a 10-core CPU)
Output:
More benchmarks: benchmarkResults.txt
Sample Log