
[COST-1192] Use chunks in pandas to process csv->parquet #2740

Merged 17 commits into master on Mar 23, 2021

Conversation

adberglund (Contributor) commented Mar 18, 2021

Summary

Use pandas chunking to reduce memory use while converting CSV to parquet.
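
A minimal sketch of the chunked approach, assuming a hypothetical chunk size and per-chunk output files; it is not the PR's actual implementation:

import pandas as pd

CHUNK_SIZE = 200_000  # hypothetical rows per chunk, tuned to the memory budget

def csv_to_parquet_chunks(csv_path, parquet_prefix, chunk_size=CHUNK_SIZE):
    # read_csv with chunksize yields DataFrames lazily, so only one chunk
    # is held in memory at a time; each chunk is written to its own parquet file.
    reader = pd.read_csv(csv_path, chunksize=chunk_size)
    for i, chunk in enumerate(reader):
        chunk.to_parquet(f"{parquet_prefix}_{i}.parquet", index=False)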

Testing

  1. Set the following environment variables:
ENABLE_PARQUET_PROCESSING=True
S3_BUCKET_NAME=koku-bucket
S3_ENDPOINT=http://kokuminio:9000
S3_ACCESS_KEY=kokuminioaccess
S3_SECRET=kokuminiosecret
INITIAL_INGEST_OVERRIDE='True'
INITIAL_INGEST_NUM_MONTHS=1
  2. Loaded a large AWS source
  3. Let it finish
  4. Checked AWS report APIs

@adberglund adberglund self-assigned this Mar 18, 2021
@adberglund adberglund changed the title Use chunks in pandas to process csv->parquet [COST-1192] Use chunks in pandas to process csv->parquet Mar 18, 2021
codecov bot commented Mar 18, 2021

Codecov Report

Merging #2740 (9b5670d) into master (c50bfd8) will decrease coverage by 0.0%.
The diff coverage is 92.1%.

@@           Coverage Diff            @@
##           master   #2740     +/-   ##
========================================
- Coverage    94.7%   94.7%   -0.0%     
========================================
  Files         289     289             
  Lines       22403   22433     +30     
  Branches     2554    2559      +5     
========================================
+ Hits        21212   21239     +27     
- Misses        722     726      +4     
+ Partials      469     468      -1     

@adberglund adberglund marked this pull request as ready for review March 23, 2021 14:20
@adberglund adberglund requested a review from a team March 23, 2021 14:20
elif "resourceTags/" in column:
data_frame = data_frame.drop(columns=[column])

new_col_name = column.replace("-", "_").replace("/", "_").replace(":", "_").lower()
Red-HAP (Contributor) commented:

This column name scrubber appears in multiple places. Would it make more sense to make a common or generic scrubber for this to prevent potential issues with replicated code?

Maybe something similar to:

import re

# Map of replacement -> regex of characters to collapse into that replacement.
AWS_SCRUB_MAP = {"_": r"[-/:]+"}
AZURE_SCRUB_MAP = AWS_SCRUB_MAP
GCP_SCRUB_MAP = {"_": r"[-/:.]+"}

def standardize_column_name(raw_col_name, translate_map):
    # Lowercase the column name, then apply each regex substitution in the map.
    scrubbed_name = raw_col_name.strip().lower()
    for repl, srch in translate_map.items():
        scrubbed_name = re.sub(srch, repl, scrubbed_name)
    return scrubbed_name
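
For illustration, a hypothetical call against an AWS-style tag column (not taken from the PR):

standardize_column_name("resourceTags/user:Environment", AWS_SCRUB_MAP)
# -> "resourcetags_user_environment"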

adberglund (Contributor, Author) replied:

@Red-HAP Totally, yes. I think there are a few places where this could use some refactoring/consolidation. My ask would be that we move that idea into a new issue we can tackle in this epic, since we are facing a crunch and getting this out could let us enable AWS Trino.


@adberglund adberglund merged commit 075a515 into master Mar 23, 2021
@adberglund adberglund deleted the cost-1192-parquet-processing-by-chunk branch March 23, 2021 18:01