-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing parquet
loading in load_profiles
function
#262
Conversation
Is there a reason not to simply use Pathlib.Path.suffix? https://docs.python.org/3/library/pathlib.html?highlight=suffix#pathlib.PurePath.suffix This would easily support csv as well |
Agreeing with @gwaybio here, it may be simpler to use the file extension via For filetype inference in specific: Leveraging existing work in this area might be a good idea due to the complexities involved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work - excited to see parquet compatibility additions! I made a few comments throughout. Please don't hesitate to let me know if you have any questions.
I feel the inference discussion and also the failing test (see here) should be resolved before we move forward.
My initial thought is that using extension based file validation can result in errors. For example, users can submit extension-less files that may "confuse" I do agree that the implementation is much more involved and using only extensions is much easier to implement. I was looking at Here are some examples below: import pathlib
import magic
# magic object
magic_reader = magic.Magic(uncompress=True, extension=False)
print(magic_reader.from_file("./test_SQ00014613.parquet"))
print()
print(magic_reader.from_file("./test_SQ00014613.sqlite"))
print()
print(magic_reader.from_file("./test_SQ00014613.csv.gz")) Here is the output:
In cases with symlink files, we need to extract the soruce path in order for it to work. Here's an example: import pathlib
import magic
# magic object
magic_reader = magic.Magic(uncompress=True, extension=False)
# now lets try with filepaths with symbolic links
sym_link_data = "./SQ00014617.parquet"
path = pathlib.Path(sym_link_data)
# check if the path is a sym_link
# -- if so, get true path
if path.is_symlink():
path = path.resolve()
# with true path, identify file
print(magic_reader.from_file(path)) output:
If this is too much at this moment, then resorting to only file extensions is an ideal plan! Let me know what ya'll think :) |
My opinion is that it is too much - I think keeping things simple is the way to go for readability and maintainability. I generally get a bit queazy when adding a new dependency as well. I think all we need is the following:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments that need to be addressed prior to merging - looking good!
Co-authored-by: Gregory Way <[email protected]>
Co-authored-by: Gregory Way <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job! I left a few comments and suggestions with this review. Please don't hesitate to let me know if you have any questions.
I'd request your feedback on partitioned parquet-based directories (those containing multiple files) with is_path_a_parquet_file
before approval.
Co-authored-by: Dave Bunten <[email protected]>
Co-authored-by: Dave Bunten <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #262 +/- ##
=======================================
Coverage 95.71% 95.72%
=======================================
Files 53 53
Lines 2826 2852 +26
=======================================
+ Hits 2705 2730 +25
- Misses 121 122 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nearly there! 🎉
I made some comments which probably should be addressed prior to merging. I will let @d33bs give the final 👍
Co-authored-by: Gregory Way <[email protected]>
Co-authored-by: Gregory Way <[email protected]>
Co-authored-by: Gregory Way <[email protected]>
Co-authored-by: Gregory Way <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Erik, great job on the new additions! I'm requesting some changes and your thoughts on a few items, especially on pathlib.Path.absolute()
usage. Please don't hesitate to let me know if you have any questions.
Co-authored-by: Dave Bunten <[email protected]>
Helllo @d33bs I have attended your comments. Hopefully they were answered! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @axiomcura - congrats on this contribution! 🎉
Thank you! |
…g#262) * added new function `infer_profile_file_type` * Fixed Unicode Bug * fixed csv error * improved variable names * removed unwanted comments * added extension based inference for parquet * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * Update pycytominer/tests/test_cyto_utils/test_load.py Co-authored-by: Gregory Way <[email protected]> * edited pathlib imports, documentation fixed * applied black formatting * added typing * updated tests * update tests * testing update * Update pycytominer/cyto_utils/load.py Co-authored-by: Dave Bunten <[email protected]> * Update pycytominer/cyto_utils/load.py Co-authored-by: Dave Bunten <[email protected]> * added black formatting * update pathing * fixed docs * black formatting * tests update * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * test update * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * fixed typo * added comments * Update pycytominer/cyto_utils/load.py Co-authored-by: Dave Bunten <[email protected]> * replaced `.absolute()` with `.resolve()` * applied black formatting * removed try and accept block --------- Co-authored-by: Gregory Way <[email protected]> Co-authored-by: Dave Bunten <[email protected]>
…g#262) * added new function `infer_profile_file_type` * Fixed Unicode Bug * fixed csv error * improved variable names * removed unwanted comments * added extension based inference for parquet * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * Update pycytominer/tests/test_cyto_utils/test_load.py Co-authored-by: Gregory Way <[email protected]> * edited pathlib imports, documentation fixed * applied black formatting * added typing * updated tests * update tests * testing update * Update pycytominer/cyto_utils/load.py Co-authored-by: Dave Bunten <[email protected]> * Update pycytominer/cyto_utils/load.py Co-authored-by: Dave Bunten <[email protected]> * added black formatting * update pathing * fixed docs * black formatting * tests update * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * test update * Update pycytominer/cyto_utils/load.py Co-authored-by: Gregory Way <[email protected]> * fixed typo * added comments * Update pycytominer/cyto_utils/load.py Co-authored-by: Dave Bunten <[email protected]> * replaced `.absolute()` with `.resolve()` * applied black formatting * removed try and accept block --------- Co-authored-by: Gregory Way <[email protected]> Co-authored-by: Dave Bunten <[email protected]>
Edit by @gwaybio
If you happen to stumble upon this PR, note the we elected to perform a simpler implementation than what is described immediately below. See files changed for our implemented solution.
Below is @axiomcura original post
Implementing parquet loading
This PR focuses on solving #211
This introduces a new function known as
is_path_a_parquet_file
that allows theload_profile
function to identify whether the input file is either aparquet
file.By identifying what file format the
profiles
are, it will use the appropriate instructions to load it into memory viapandas
’s API.Implementation approach
File types contain a unique signature within the header of each file. For example,
sqlite
files containsSQlite format
signature within the first 100 bytes of contents.Below is a code example:
Here is the output:
Similarly,
parquet
files also contain a unique format known asPAR
within the first 100 bytes of data of the files.Here is the
This PR introduces a new function known as
infer_profile_file_type
that leverages these unique signatures fromparquet
andsqlite
types.Because of this, the
load_profile
function is able to identify which files to loadExample below:
loading
csv
file:loading
parquet
files:modified from EmbeddedArtistry
Description
Thank you for your contribution to pycytominer!
Please succinctly summarize your proposed change.
What motivated you to make this change?
Please also link to any relevant issues that your code is associated with.
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.