[COST-1192] Use chunks in pandas to process csv->parquet #2740
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2740 +/- ##
========================================
- Coverage 94.7% 94.7% -0.0%
========================================
Files 289 289
Lines 22403 22433 +30
Branches 2554 2559 +5
========================================
+ Hits 21212 21239 +27
- Misses 722 726 +4
+ Partials 469 468 -1
elif "resourceTags/" in column: | ||
data_frame = data_frame.drop(columns=[column]) | ||
|
||
new_col_name = column.replace("-", "_").replace("/", "_").replace(":", "_").lower() |
This column name scrubber appears in multiple places. Would it make more sense to create a common or generic scrubber to avoid the issues that come with duplicated code?
Maybe something similar to:
import re

# Map of replacement -> regex of characters to collapse, per provider.
AWS_SCRUB_MAP = {"_": r"[-/:]+"}
AZURE_SCRUB_MAP = AWS_SCRUB_MAP
GCP_SCRUB_MAP = {"_": r"[-/:.]+"}

def standardize_column_name(raw_col_name, translate_map):
    # Trim, lowercase, then collapse provider-specific separators to underscores.
    scrubbed_name = raw_col_name.strip().lower()
    for repl, srch in translate_map.items():
        scrubbed_name = re.sub(srch, repl, scrubbed_name)
    return scrubbed_name
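For illustration, the helper could then replace the chained replace() calls above (these sample column names are just examples, not taken from the report schema):

standardize_column_name("resourceTags/user:app", AWS_SCRUB_MAP)    # -> "resourcetags_user_app"
standardize_column_name("lineItem/UsageAmount", AWS_SCRUB_MAP)     # -> "lineitem_usageamount"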
@Red-HAP Totally, yes. I think there are probably a few places where this could use some refactoring/consolidation. My ask would be that we move that idea into a new issue we can tackle in this epic, since we're facing a crunch and getting this out could let us enable AWS Trino.
Summary
Use pandas chunks to reduce memory use while converting CSV to Parquet.
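A minimal sketch of the chunked conversion idea, assuming a hypothetical csv_to_parquet helper; the chunk size, file naming, and column scrubbing here are illustrative, not the exact implementation in this PR:

import pandas as pd

CHUNK_SIZE = 200_000  # illustrative; the real chunk size is a tuning decision

def csv_to_parquet(csv_path, parquet_prefix):
    """Convert a CSV report to Parquet one chunk at a time to bound memory use."""
    # chunksize makes read_csv return an iterator of DataFrames instead of one large frame.
    reader = pd.read_csv(csv_path, chunksize=CHUNK_SIZE)
    for i, chunk in enumerate(reader):
        # Scrub column names so they are Parquet/Trino friendly.
        chunk.columns = [
            col.replace("-", "_").replace("/", "_").replace(":", "_").lower()
            for col in chunk.columns
        ]
        # to_parquet requires a Parquet engine such as pyarrow.
        chunk.to_parquet(f"{parquet_prefix}.{i}.parquet", index=False)

Writing one Parquet file per chunk keeps peak memory proportional to CHUNK_SIZE rather than the full CSV size.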
Testing