
[COST-1192] Use chunks in pandas to process csv->parquet #2740

Merged 17 commits into master on Mar 23, 2021

Conversation

adberglund (Contributor) commented Mar 18, 2021

Summary

Use pandas chunking to reduce memory use while converting CSV to parquet.
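
A minimal sketch of the chunked approach, assuming a hypothetical chunk size and per-chunk output files; it is not the PR's actual implementation:

import pandas as pd

CHUNK_SIZE = 200_000  # hypothetical rows per chunk, tuned to the memory budget

def csv_to_parquet_chunks(csv_path, parquet_prefix, chunk_size=CHUNK_SIZE):
    # read_csv with chunksize yields DataFrames lazily, so only one chunk
    # is held in memory at a time; each chunk is written to its own parquet file.
    reader = pd.read_csv(csv_path, chunksize=chunk_size)
    for i, chunk in enumerate(reader):
        chunk.to_parquet(f"{parquet_prefix}_{i}.parquet", index=False)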

Testing

  1. Set the following environment variables:
ENABLE_PARQUET_PROCESSING=True
S3_BUCKET_NAME=koku-bucket
S3_ENDPOINT=http://kokuminio:9000
S3_ACCESS_KEY=kokuminioaccess
S3_SECRET=kokuminiosecret
INITIAL_INGEST_OVERRIDE='True'
INITIAL_INGEST_NUM_MONTHS=1
  2. Loaded a large AWS source
  3. Let it finish
  4. Checked AWS report APIs

@adberglund adberglund self-assigned this Mar 18, 2021
@adberglund adberglund changed the title Use chunks in pandas to process csv->parquet [COST-1192] Use chunks in pandas to process csv->parquet Mar 18, 2021
codecov bot commented Mar 18, 2021

Codecov Report

Merging #2740 (9b5670d) into master (c50bfd8) will decrease coverage by 0.0%.
The diff coverage is 92.1%.

@@           Coverage Diff            @@
##           master   #2740     +/-   ##
========================================
- Coverage    94.7%   94.7%   -0.0%     
========================================
  Files         289     289             
  Lines       22403   22433     +30     
  Branches     2554    2559      +5     
========================================
+ Hits        21212   21239     +27     
- Misses        722     726      +4     
+ Partials      469     468      -1     

@adberglund adberglund marked this pull request as ready for review March 23, 2021 14:20
@adberglund adberglund requested a review from a team March 23, 2021 14:20
elif "resourceTags/" in column:
data_frame = data_frame.drop(columns=[column])

new_col_name = column.replace("-", "_").replace("/", "_").replace(":", "_").lower()
Red-HAP (Contributor) commented:

This column name scrubber appears in multiple places. Would it make more sense to make a common or generic scrubber for this to prevent potential issues with replicated code?

Maybe something similar to:

import re

# Map of replacement -> regex of characters to collapse into that replacement.
AWS_SCRUB_MAP = {"_": r"[-/:]+"}
AZURE_SCRUB_MAP = AWS_SCRUB_MAP
GCP_SCRUB_MAP = {"_": r"[-/:.]+"}

def standardize_column_name(raw_col_name, translate_map):
    # Lowercase the column name, then apply each regex substitution in the map.
    scrubbed_name = raw_col_name.strip().lower()
    for repl, srch in translate_map.items():
        scrubbed_name = re.sub(srch, repl, scrubbed_name)
    return scrubbed_name
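
For illustration, a hypothetical call against an AWS-style tag column (not taken from the PR):

standardize_column_name("resourceTags/user:Environment", AWS_SCRUB_MAP)
# -> "resourcetags_user_environment"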

adberglund (Contributor, Author) replied:

@Red-HAP Totally, yes. I think there are a few places where this could use some refactoring/consolidation. My ask would be that we move that idea into a new issue we can tackle in this epic, since we are facing a crunch and getting this out could let us enable AWS Trino.


@adberglund adberglund merged commit 075a515 into master Mar 23, 2021
@adberglund adberglund deleted the cost-1192-parquet-processing-by-chunk branch March 23, 2021 18:01