Memory usage increased by more than 4x until OOM(140G) when upgrading from v1.5.4 to the latest commit #581

Open
h27771420 opened this issue Mar 18, 2024 · 2 comments


h27771420 commented Mar 18, 2024

Hello there. The problem is as described in the title.

For various reasons, I need to switch from s3 to the gocloud/blob implementation in order to support other storage providers.
I initially tested gocloud/blob with the github.com/xitongsys/parquet-go v1.5.4 release, but found that gocloud/blob sometimes failed to write/read data in that version. I therefore moved to the latest release (github.com/xitongsys/parquet-go v1.6.2) and updated the parquet tags to match the new release
(e.g., changing the tag below from type=UTF8 to type=BYTE_ARRAY, convertedtype=UTF8):

PlayerID            string    `parquet:"name=player_id, type=UTF8"` 

to

PlayerID            string    `parquet:"name=player_id, type=BYTE_ARRAY, convertedtype=UTF8"`
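
For reference, the two tag styles can be placed side by side and inspected with the standard reflect package. This is a minimal sketch: the struct names oldRow and newRow and the helper parquetTag are hypothetical, and only the PlayerID field and its tags come from the post. It does not call parquet-go at all; it only shows what string each version of the library would receive.

```go
package main

import (
	"fmt"
	"reflect"
)

// oldRow uses the v1.5.4-style tag (hypothetical struct name).
type oldRow struct {
	PlayerID string `parquet:"name=player_id, type=UTF8"`
}

// newRow uses the tag style adopted for v1.6.x (hypothetical struct name).
type newRow struct {
	PlayerID string `parquet:"name=player_id, type=BYTE_ARRAY, convertedtype=UTF8"`
}

// parquetTag returns the raw `parquet` struct tag of the named field,
// or an empty string if the field does not exist.
func parquetTag(v interface{}, field string) string {
	f, ok := reflect.TypeOf(v).FieldByName(field)
	if !ok {
		return ""
	}
	return f.Tag.Get("parquet")
}

func main() {
	fmt.Println("old:", parquetTag(oldRow{}, "PlayerID"))
	fmt.Println("new:", parquetTag(newRow{}, "PlayerID"))
}
```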

The read/write tests all passed at first, but when I needed to write dozens of gigabytes (over 100G) of data, I found that the updated version caused OOM problems.
At first I suspected an issue in the gocloud/blob implementation, so I switched back to s3 for testing, but the problem persisted.

I saw that after the latest release, there were several fixes that seemed to be related to memory, so I upgraded my version to the latest commit (github.com/xitongsys/parquet-go v1.6.3-0.20231102094431-8ca067b2bd32), but the OOM problem is still not solved. 😭 😔

==============================
So I'm here to ask for help.
Does anyone know what could cause memory usage to grow 4x (or more) after upgrading from v1.5.4 to v1.6.2/v1.6.3-0.20231102094431-8ca067b2bd32?

When writing dozens of gigabytes of data, memory usage grows as follows:

  1. v1.5.4 with S3 - 35G (works nicely)
  2. (UPDATE): v1.6.0 with S3 - 35G (works nicely)
  3. (UPDATE - 2024/04/10): v1.6.0 with gocloud/blob - 35G (works nicely)
  4. v1.6.2 with S3 - 140G (OOM, so it may need more than 140G)
  5. v1.6.2 with gocloud/blob - 140G (OOM, so it may need more than 140G)
  6. v1.6.3-0.20231102094431-8ca067b2bd32 with S3 - 140G (OOM, so it may need more than 140G)
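
Figures like the ones above can also be sampled from inside the process while it writes, which makes runs against different versions easier to compare. The following is a minimal sketch using runtime.ReadMemStats from the standard library; the helper name heapInUseMiB and the 64 MiB test allocation are illustrative assumptions, not part of the original program.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapInUseMiB reports the bytes held in in-use heap spans, in MiB.
// (Hypothetical helper; call it periodically from the writing loop.)
func heapInUseMiB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapInuse) / (1024 * 1024)
}

func main() {
	before := heapInUseMiB()

	// Stand-in for the real write path: hold on to 64 buffers of 1 MiB each.
	bufs := make([][]byte, 0, 64)
	for i := 0; i < 64; i++ {
		bufs = append(bufs, make([]byte, 1<<20))
	}

	after := heapInUseMiB()
	fmt.Printf("heap in use: %.1f MiB before, %.1f MiB after\n", before, after)

	// Keep the buffers reachable until after the second sample.
	runtime.KeepAlive(bufs)
}
```

Note that HeapInuse covers only the Go heap, so it will generally be lower than the resident memory reported by the operating system.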

My writer parameters (this code has not changed since v1.5.4):

	fw, err := s3.NewS3FileWriter(ctx, bucket, s3key,
		[]func(*s3manager.Uploader){
			func(u *s3manager.Uploader) {
				u.PartSize = 64 * 1024 * 1024 // 64MB per part
			}},
		awsConfig,
	)
	if err != nil {
		// do something
	}

	parquetWriter, err := writer.NewParquetWriter(fw, new(ParquetStruct), 1)
	if err != nil {
		// do something
	}
	parquetWriter.CompressionType = parquet.CompressionCodec_GZIP // gzip to reduce the file size
	parquetWriter.PageSize = 1 * 1024 * 1024                      // larger pages for better compression
	parquetWriter.RowGroupSize = 128 * 1024 * 1024                // default

h27771420 changed the title from "Memory usage increased by more than 4x until OOM when upgrading from v1.5.4 to the latest commit" to "Memory usage increased by more than 4x until OOM(14G) when upgrading from v1.5.4 to the latest commit" on Mar 18, 2024

h27771420 commented

(UPDATE): S3 + v1.6.0 looks good so far; memory usage is almost the same as with v1.5.4. I have not yet tested it with gocloud/blob.


h27771420 commented Apr 10, 2024

(Update): Adding some information from earlier debugging: the .parquet files generated with S3 + v1.5.4, v1.6.0, and v1.6.2 can all be read without any issue.
However, v1.6.2 and v1.6.3-0.20231102094431-8ca067b2bd32 (the latest commit) both run into the huge memory usage issue.

The two logs below show the memory usage (inuse_space) reported by pprof.

v1.6.2: encoding.WritePlainBYTE_ARRAY appears to consume most of the memory. (Exported just before reaching OOM.) (screenshot for the caller chains)

MacBook-Pro-CH ~ % go tool pprof -inuse_space dtcd-ab-240319-after-v2.prof 
File: playerid_cli
Type: inuse_space
Time: Mar 19, 2024 at 1:01pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 60854.17MB, 97.88% of 62170.46MB total
Dropped 301 nodes (cum <= 310.85MB)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
49100.89MB 78.98% 78.98% 49100.89MB 78.98%  github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY
...<REDACTED>

v1.5.4: encoding.WritePlainBYTE_ARRAY ranks fourth. (Exported during stable writing.)

MacBook-Pro-CH ~ % go tool pprof -inuse_space dtcd-240326.prof
File: playerid_cli
Type: inuse_space
Time: Mar 26, 2024 at 10:44am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 12675.05MB, 97.93% of 12942.45MB total
Dropped 253 nodes (cum <= 64.71MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
<REDACTED>...
  858.03MB  6.63% 92.99%   858.03MB  6.63%  github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY
...<REDACTED>
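
To put the two outputs side by side, the share of in-use heap attributed to encoding.WritePlainBYTE_ARRAY can be recomputed from the flat and total figures shown above. This is only a rough comparison, since the first profile was taken just before OOM and the second during stable writing; the helper name share is an assumption made for this sketch.

```go
package main

import "fmt"

// share returns part as a percentage of total.
func share(part, total float64) float64 {
	return part / total * 100
}

func main() {
	// Figures copied from the two pprof outputs above, in MB.
	const (
		newFlat, newTotal = 49100.89, 62170.46 // v1.6.2
		oldFlat, oldTotal = 858.03, 12942.45   // v1.5.4
	)

	fmt.Printf("v1.6.2: %.2f%% of in-use heap\n", share(newFlat, newTotal))
	fmt.Printf("v1.5.4: %.2f%% of in-use heap\n", share(oldFlat, oldTotal))

	// Absolute growth of the function's flat in-use space between the two profiles.
	fmt.Printf("growth: %.1fx\n", newFlat/oldFlat)
}
```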

I am currently using v1.6.0, which works fine with gocloud/blob and uses almost the same amount of memory as v1.5.4.
So I think I will pin the version there for a while.
I think the change that causes the memory increase was introduced in v1.6.1 or v1.6.2.
I hope someone can give me some tips.

h27771420 changed the title from "Memory usage increased by more than 4x until OOM(14G) when upgrading from v1.5.4 to the latest commit" to "Memory usage increased by more than 4x until OOM(140G) when upgrading from v1.5.4 to the latest commit" on Apr 25, 2024