[libbeat][reader] - Adding support for parquet reader #35183

ShourieG · 2023-04-24T08:09:33Z

Type of change

Enhancement

What does this PR do?

This PR adds support for reading and parsing Apache Parquet files.

Why is it important?

This change enables us to support future Amazon Security Lake integrations/solutions.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
~~- [ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc.

Author's Checklist

Code style
Code syntax

Related issues

Relates [Filebeat S3 Input] Add support for Apache Parquet files #34662

Benchmark

Command: go test -v -cpu 1,2,4,8,10 -benchmem -run=^$ -bench . (run on a personal MacBook with a 10-core CPU)
Output:

Benchmark Format : BenchmarkName-(ConcurrencyLevel)    Iterations    TimePerIteration    MemoryAllocatedPerIteration    AllocsPerIteration

Descriptions: 
BenchmarkName: The name of the function being benchmarked.
(ConcurrencyLevel): The GOMAXPROCS value used for the run, as set by the -cpu flag. No suffix is printed when the value is 1.
Iterations: The number of times the function was executed during the benchmark.
TimePerIteration: The average execution time for the function, measured in nanoseconds (ns) per operation.
MemoryAllocatedPerIteration: The average amount of memory allocated per operation, measured in bytes (B) per operation.
AllocsPerIteration: The average number of memory allocations per operation.


Some Benchmark Results : 

Serial Benchmarks: 
File: testdata/taxi_2023_1.parquet, records: 1533383 
Memory consumption:  762919806488 Bytes approx (x_x)
File Size : 47.7 MB
BatchSize : 1
Batches: 1533383

BenchmarkReadParquetSerial-10                  1        498926780292 ns/op      762919806488 B/op       3337097252 allocs/op
PASS
ok      github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet       499.246s

Serial Benchmarks: 
File: testdata/taxi_2023_1.parquet, records: 1533383 
Memory consumption:  7.1 GB approx
File Size : 47.7 MB
BatchSize : 10000
Batches: 154 

BenchmarkReadParquetSingleSerialBatch_10000
BenchmarkReadParquetSingleSerialBatch_10000                    1        7136202750 ns/op        7161997624 B/op 40874004 allocs/op
BenchmarkReadParquetSingleSerialBatch_10000-2                  1        7249478666 ns/op        7161029488 B/op 40870153 allocs/op
BenchmarkReadParquetSingleSerialBatch_10000-4                  1        7015922458 ns/op        7161377352 B/op 40871251 allocs/op
BenchmarkReadParquetSingleSerialBatch_10000-8                  1        7101368125 ns/op        7161955840 B/op 40872282 allocs/op
BenchmarkReadParquetSingleSerialBatch_10000-10                 1        7113368875 ns/op        7162300232 B/op 40872797 allocs/op
PASS
ok      github.com/elastic/beats/v7/x-pack/libbeat/reader/parquet       36.133s

Explanation:
For example, the first benchmark result shows that when the function was executed with a concurrency level of 1, 
it took an average of 7.136202750 seconds to complete one iteration of the function (TimePerIteration), allocated 
an average of 7.161997624 GB of memory (MemoryAllocatedPerIteration), and made an average of 40,874,004 memory allocations (AllocsPerIteration). 
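
To make the reading above concrete, here is a minimal Go sketch that splits one result line into the fields described earlier and converts the units. The `result` type and the `parseLine` helper are hypothetical names introduced for illustration; they are not part of this PR, and the sketch assumes integer values like the ones in the output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// result holds the fields of one `go test -bench -benchmem` output line.
// The type and its field names are illustrative, not taken from the PR.
type result struct {
	Name        string
	Procs       int // GOMAXPROCS suffix; 1 when the name carries no suffix
	Iterations  int64
	NsPerOp     int64
	BytesPerOp  int64
	AllocsPerOp int64
}

// parseLine expects: Name[-Procs] Iterations N ns/op N B/op N allocs/op.
func parseLine(line string) (result, error) {
	f := strings.Fields(line)
	if len(f) != 8 || f[3] != "ns/op" || f[5] != "B/op" || f[7] != "allocs/op" {
		return result{}, fmt.Errorf("unexpected benchmark line: %q", line)
	}
	r := result{Name: f[0], Procs: 1}
	// A trailing "-N" on the name is the GOMAXPROCS value for that run.
	if i := strings.LastIndex(f[0], "-"); i >= 0 {
		if p, err := strconv.Atoi(f[0][i+1:]); err == nil {
			r.Name, r.Procs = f[0][:i], p
		}
	}
	targets := []*int64{&r.Iterations, &r.NsPerOp, &r.BytesPerOp, &r.AllocsPerOp}
	for j, idx := range []int{1, 2, 4, 6} {
		v, err := strconv.ParseInt(f[idx], 10, 64)
		if err != nil {
			return result{}, err
		}
		*targets[j] = v
	}
	return r, nil
}

func main() {
	line := "BenchmarkReadParquetSingleSerialBatch_10000-2  1  7249478666 ns/op  7161029488 B/op 40870153 allocs/op"
	r, err := parseLine(line)
	if err != nil {
		panic(err)
	}
	// Convert nanoseconds to seconds and bytes to (decimal) gigabytes.
	fmt.Printf("%s (GOMAXPROCS=%d): %.2f s/op, %.2f GB/op, %d allocs/op\n",
		r.Name, r.Procs, float64(r.NsPerOp)/1e9, float64(r.BytesPerOp)/1e9, r.AllocsPerOp)
}
```

Note that B/op is the total memory allocated during one iteration, not the peak resident memory of the process.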

Serial Benchmarks: 
File: testdata/vpc_flow.parquet, records: 652
Memory consumption:  15 MB (approx)
File Size : 33 KB
BatchSize : 1000
Batches: 1
BenchmarkReadParquetSingleVPCSerialBatch_1000

BenchmarkReadParquetSingleVPCSerialBatch_1000       	     100	  11252157 ns/op	15247132 B/op	   55042 allocs/op
BenchmarkReadParquetSingleVPCSerialBatch_1000-2     	     148	   8014055 ns/op	15260143 B/op	   55069 allocs/op
BenchmarkReadParquetSingleVPCSerialBatch_1000-4     	     169	   7139663 ns/op	15266732 B/op	   55042 allocs/op
BenchmarkReadParquetSingleVPCSerialBatch_1000-8     	     132	   8927298 ns/op	15280284 B/op	   55071 allocs/op
BenchmarkReadParquetSingleVPCSerialBatch_1000-10    	     124	   9600573 ns/op	15282426 B/op	   55067 allocs/op

More Benchmarks : benchmarkResults.txt

Sample Log

{
    "@timestamp": "2023-04-24T08:04:11.717Z",
    "@metadata": {
        "beat": "filebeat",
        "type": "_doc",
        "version": "8.8.0",
        "_id": "2fd1e2fbf5-000000000000"
    },
    "message": "{\"activity_id\":3,\"activity_name\":\"Operational\",\"api\":{\"operation\":\"GetBucketAcl\",\"request\":{\"uid\":\"5CQ7E6RQPH8MX989\"},\"response\":{\"error\":null,\"message\":null},\"service\":{\"name\":\"s3.amazonaws.com\"},\"version\":null},\"category_name\":\"Cloud Activity\",\"category_uid\":5,\"class_name\":\"Cloud API\",\"class_uid\":5001,\"cloud\":{\"provider\":\"AWS\",\"region\":\"us-east-1\"},\"http_request\":{\"user_agent\":\"cloudtrail.amazonaws.com\"},\"identity\":{\"idp\":{\"name\":null},\"invoked_by\":\"cloudtrail.amazonaws.com\",\"session\":{\"created_time\":null,\"issuer\":null,\"mfa\":null},\"user\":{\"account_uid\":null,\"credential_uid\":null,\"name\":null,\"type\":\"AWSService\",\"uid\":null,\"uuid\":null}},\"metadata\":{\"product\":{\"feature\":{\"name\":\"Management, Data, and Insights\"},\"name\":\"CloudTrail\",\"vendor_name\":\"AWS\",\"version\":\"1.08\"},\"profiles\":[\"cloud\"],\"version\":\"0.26.1\"},\"ref_event_uid\":\"f7a441e2-c283-4ce3-88e8-811abe3020a4\",\"resources\":[\"arn:aws:s3:::cloudtrail-awslogs-422354213072-innkkddg-isengard-do-not-delete\"],\"severity\":\"Unknown\",\"severity_id\":0,\"src_endpoint\":{\"domain\":\"cloudtrail.amazonaws.com\",\"ip\":null,\"uid\":null},\"time\":1680138206000,\"type_name\":\"Cloud API: 
Operational\",\"type_uid\":500103,\"unmapped\":[{\"key\":\"eventCategory\",\"value\":\"Management\"},{\"key\":\"sharedEventID\",\"value\":\"db2c11c9-14ce-4a6c-8c68-61484391855b\"},{\"key\":\"requestParameters\",\"value\":\"{\\\"bucketName\\\":\\\"cloudtrail-awslogs-422354213072-innkkddg-isengard-do-not-delete\\\",\\\"Host\\\":\\\"cloudtrail-awslogs-422354213072-innkkddg-isengard-do-not-delete.s3.us-east-1.amazonaws.com\\\",\\\"acl\\\":\\\"\\\"}\"},{\"key\":\"recipientAccountId\",\"value\":\"422354213072\"},{\"key\":\"readOnly\",\"value\":\"true\"},{\"key\":\"eventType\",\"value\":\"AwsApiCall\"},{\"key\":\"managementEvent\",\"value\":\"true\"},{\"key\":\"additionalEventData\",\"value\":\"{\\\"SignatureVersion\\\":\\\"SigV4\\\",\\\"CipherSuite\\\":\\\"ECDHE-RSA-AES128-GCM-SHA256\\\",\\\"bytesTransferredIn\\\":0,\\\"AuthenticationMethod\\\":\\\"AuthHeader\\\",\\\"x-amz-id-2\\\":\\\"YxIU+BXMAq9DAg4pVtxXCWa+42UKP18/B9zrp+LBdFYzjEGmIel619E44tjCig45WJTQhyT1EIE=\\\",\\\"bytesTransferredOut\\\":546}\"}]}",
    "log": {
        "offset": 0,
        "file": {
            "path": "https://securitylakeaws.s3.us-east-1.amazonaws.com/s3_elastic_security_lake/aws/CLOUD_TRAIL/region%3Dus-east-1/accountId%3D422354213072/eventHour%3D2023033001/5ec39807962c1fd7804b70acb94b4abe.gz.parquet"
        }
    },
    "aws": {
        "s3": {
            "bucket": {
                "name": "securitylakeaws",
                "arn": "arn:aws:s3:::securitylakeaws"
            },
            "object": {
                "key": "s3_elastic_security_lake/aws/CLOUD_TRAIL/region=us-east-1/accountId=422354213072/eventHour=2023033001/5ec39807962c1fd7804b70acb94b4abe.gz.parquet"
            }
        }
    },
    "cloud": {
        "provider": "aws",
        "region": "us-east-1"
    },
    "input": {
        "type": "aws-s3"
    },
    "agent": {
        "type": "filebeat",
        "version": "8.8.0",
        "ephemeral_id": "53f26411-5e07-4de5-963c-c7e7de9c3d27",
        "id": "fc0c94b0-ad0e-486e-9f55-c643cf7ef542",
        "name": "Shouries-MacBook-Pro.local"
    },
    "ecs": {
        "version": "8.0.0"
    }
}
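
In the sample event, the decoded Parquet row is carried as a JSON string inside the `message` field, so a consumer has to decode twice. The sketch below illustrates that with a trimmed-down copy of the event; the `event` and `record` types and the `decodeMessage` helper are hypothetical and only cover a few of the fields shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event models only the part of the published event used here.
type event struct {
	Message string `json:"message"`
}

// record picks a few fields out of the embedded document (illustrative type).
type record struct {
	ActivityName string `json:"activity_name"`
	ClassUID     int    `json:"class_uid"`
	Cloud        struct {
		Provider string `json:"provider"`
		Region   string `json:"region"`
	} `json:"cloud"`
}

// sample is a heavily trimmed version of the sample log above.
const sample = `{"message":"{\"activity_name\":\"Operational\",\"class_uid\":5001,\"cloud\":{\"provider\":\"AWS\",\"region\":\"us-east-1\"}}"}`

// decodeMessage unwraps the outer event, then decodes the JSON string in "message".
func decodeMessage(raw []byte) (record, error) {
	var ev event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return record{}, err
	}
	var rec record
	if err := json.Unmarshal([]byte(ev.Message), &rec); err != nil {
		return record{}, err
	}
	return rec, nil
}

func main() {
	rec, err := decodeMessage([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.ActivityName, rec.ClassUID, rec.Cloud.Provider, rec.Cloud.Region)
}
```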

mergify · 2023-04-24T08:10:09Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2023-04-24T08:10:19Z

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

elasticmachine · 2023-04-24T09:22:25Z

💚 Build Succeeded

Build stats

Start Time: 2023-05-23T06:29:18.693+0000
Duration: 98 min 23 sec

Test stats 🧪

Test	Results
Failed	0
Passed	26399
Skipped	1975
Total	28374

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

andrewkroh

I am keen to see some tests and some benchmarks. The areas I'm interested in are how much memory this requires for a given file size and what the processing costs are of converting arrow -> JSON -> beat.Event. We know this reader will be used with VPC flow data from Security Lake, and VPC flows are usually high volume, so we want to ensure that this is up to the task (and ideally faster and more efficient than its text-based equivalent, which requires parsing).

I didn't leave any code-level comments at this time. I think this could have its own package somewhere, because the aws-s3 input is already complex. That package would then be used from this input to apply parquet decoding to a stream.

All of the current parsers are stream-based. The input can read a chunk of the stream, extract an event, and continue. This has modest memory requirements because it only needs to buffer enough data to uncompress and parse a single chunk.

In contrast, because the Parquet reader needs an io.ReadSeeker, this implementation allocates memory (via io.ReadAll) to hold the entire uncompressed object in a buffer. Users are pretty sensitive to high memory usage. Imagine the input downloads a 100 MiB gzip compressed parquet file that has an 80% compression ratio. That file will require at least 800 MiB to be allocated. Now imagine the user has configured max_number_of_messages to 5, so they are trying to process five files in parallel. If all their Parquet files are uniform, then they need close to 5 GiB of memory just for Filebeat.

If we cannot process the parquet file in chunks directly from S3, then we should consider some means of implementing a reader that places fewer demands on memory. One method would be to switch over to using disk instead of memory if the content is larger than a few MiB.

dev-tools/notice/rules.json

andrewkroh · 2023-05-18T10:37:57Z

Please update the PR title and description to reflect the latest contents.

x-pack/libbeat/reader/parquet/parquet_test.go

ShourieG · 2023-05-18T11:34:19Z

@andrewkroh, I have resolved the suggestions and updated the PR.

mergify · 2023-05-19T00:38:08Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b awss3/parquet upstream/awss3/parquet
git merge upstream/main
git push upstream awss3/parquet

ShourieG · 2023-05-22T07:14:49Z

@efd6, @andrewkroh if all looks good at the moment, could you approve the PR?

x-pack/libbeat/reader/parquet/parquet.go

already approved

x-pack/libbeat/reader/parquet/parquet.go

Co-authored-by: subham sarkar <[email protected]>

rdner

Do we need to add this to any external documentation, or is it for internal implementation only?

ShourieG · 2023-05-23T09:28:50Z

Do we need to add this to any external documentation, or is it for internal implementation only?

@rdner For now it's an internal implementation. We will add documentation to the inputs which will leverage this reader for reading parquet files.

* initial commit for s3 parquet support
* updated changelog
* added license updates
* updated notice and go mod/sum
* removed libgering panic
* added parquet benchmark tests
* updated osquery package due to update in dependant thrift package
* added parquet reader with benchmark tests and implemented that reader in awss3 package
* addressed linting errors
* refactored parquet reader, added tests and benchmarks and addressed pr comments
* addressed pr comments
* resolved merged conflicts
* updated notice
* added more parquet file tests with json comparisons, addressed pr comments
* removed commented codeS
* removed bad imports & cleaned up tests
* updated notice
* added graceful closures with err checks in test
* added graceful closures with err checks in test
* removed s3 parquet implementation from this PR
* removed s3 parquet implementation from this PR
* Update filebeat.yml
* Update filebeat.yml
* updated notice
* addressed PR suggestions
* addressed PR comments
* updated godoc comment
* addressed PR comments, switched path with filebath
* updated CODEOWNERS and addressed PR comments
* addressed PR comments, added a rand seeding process
* fixed test seed value to 1
* updated comments
* removed defers in loops
* updated notice
* updated godoc comments as suggested
* updated changelog
* Update x-pack/libbeat/reader/parquet/parquet.go Co-authored-by: subham sarkar <[email protected]>
---------
Co-authored-by: subham sarkar <[email protected]>

Manual backport of this change from #35183

Manual backport of the same upgrade in #35183 to ensure we are compatible with the updated thrift dependency.

* Update go grpc version to 1.58.3 (#36904) (cherry picked from commit 09823f3) # Conflicts: # NOTICE.txt # go.mod # go.sum
* Resolve conflicts.
* Add CC0-1.0 License to notice rules. Manual backport of this change from #35183
* Update notice.
* Update gRPC to the expected 1.58.3 version
* Upgrade osquery-go to fix broken build. Manual backport of the same upgrade in #35183 to ensure we are compatible with the updated thrift dependency.
---------
Co-authored-by: Michal Pristas <[email protected]>
Co-authored-by: Craig MacKenzie <[email protected]>

initial commit for s3 parquet support

ffe109d

ShourieG requested review from a team as code owners April 24, 2023 08:09

ShourieG requested review from rdner and cmacknz and removed request for a team April 24, 2023 08:09

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 24, 2023

ShourieG requested review from a team and andrewkroh April 24, 2023 08:09

mergify bot assigned ShourieG Apr 24, 2023

ShourieG added the Team:Security-External Integrations label Apr 24, 2023

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 24, 2023

ShourieG added needs_team Indicates that the issue/PR needs a Team:* label 8.9-candidate labels Apr 24, 2023

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 24, 2023

ShourieG added 3 commits April 24, 2023 13:44

updated changelog

5295efd

Merge remote-tracking branch 'upstream/main' into awss3/parquet

0f5b475

added license updates

b41aa40

ShourieG requested a review from efd6 April 24, 2023 08:35

ShourieG added 4 commits April 24, 2023 17:32

updated notice and go mod/sum

83598fa

Merge branch 'main' into awss3/parquet

1ad3fe9

Merge remote-tracking branch 'upstream/main' into awss3/parquet

f7c5498

removed libgering panic

ec642f5

andrewkroh reviewed Apr 24, 2023

dev-tools/notice/rules.json (outdated)

ShourieG added 2 commits April 25, 2023 15:36

added parquet benchmark tests

1664648

Merge remote-tracking branch 'upstream/main' into awss3/parquet

8f56a5e

updated comments

d32f412

andrewkroh added enhancement libbeat labels May 18, 2023

andrewkroh reviewed May 18, 2023

x-pack/libbeat/reader/parquet/parquet_test.go (outdated)

ShourieG changed the title ~~[filebeat][aws-s3] - Adding support for parquet files~~ [libbeat][reader] - Adding support for parquet reader May 18, 2023

removed defers in loops

b3e69b5

Merge remote-tracking branch 'upstream/main' into awss3/parquet

39ce083

ShourieG added 2 commits May 19, 2023 06:13

merged with upstream and resolved conflicts

f4d6019

updated notice

339f57d

andrewkroh approved these changes May 22, 2023

x-pack/libbeat/reader/parquet/parquet.go (outdated)

ShourieG added 3 commits May 23, 2023 11:37

updated godoc comments as suggested

8c6e7e0

Merge remote-tracking branch 'upstream/main' into awss3/parquet

6c05241

updated changelog

766c9da

shmsr reviewed May 23, 2023

x-pack/libbeat/reader/parquet/parquet.go (outdated)

Update x-pack/libbeat/reader/parquet/parquet.go

1243902

Co-authored-by: subham sarkar <[email protected]>

rdner reviewed May 23, 2023

rdner approved these changes May 23, 2023

ShourieG merged commit 90e370b into elastic:main May 23, 2023

ShourieG deleted the awss3/parquet branch May 23, 2023 09:35

jamiehynds mentioned this pull request Oct 26, 2023

[Amazon Security Lake] Parquet File Support elastic/elastic-serverless-forwarder#506

Open

cmacknz added a commit that referenced this pull request Nov 8, 2023

Add CC0-1.0 License to notice rules.

5e2ee55

Manual backport of this change from #35183

cmacknz added a commit that referenced this pull request Nov 9, 2023

Upgrade osquery-go to fix broken build.

08a9789

Manual backport of the same upgrade in #35183 to ensure we are compatible with the updated thrift dependency.
