Implementing parquet loading in load_profiles function #262

Merged: 32 commits merged into cytomining:master on Mar 23, 2023

axiomcura (Member) commented Mar 16, 2023

Edit by @gwaybio

If you happen to stumble upon this PR, note that we elected to use a simpler implementation than the one described immediately below. See the files changed for our implemented solution.

Below is @axiomcura's original post


Implementing parquet loading

This PR focuses on solving #211

This introduces a new function, is_path_a_parquet_file, that allows the load_profiles function to identify whether the input file is a parquet file.

Once the file format of the profiles is identified, the appropriate pandas reader is used to load them into memory.

Implementation approach

Many file types carry a unique signature in the file header. For example, SQLite files contain the "SQLite format 3" signature within the first 100 bytes of their contents.

Below is a code example:

buffer_size = 100 
sqlite_file = "pycytominer/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite"
with open(sqlite_file, "rb") as stream:
    header_conts = stream.read(buffer_size)

print(header_conts)

Here is the output:

b'SQLite format 3\x00\x04\x00\x01\x01\x00@  \x00\x005\xfd\x01T\x0b\xd7\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x005\xfd\x00-\xe6\x02'

Similarly, parquet files begin with a unique signature, PAR1, which also falls within the first 100 bytes of the file.

buffer_size = 100
parquet_file = "pycytominer/tests/test_data/cytominer_database_example_data/test_SQ00014613.parquet"
# read the first 100 bytes in binary mode to inspect the header
with open(parquet_file, "rb") as stream:
    header_conts = stream.read(buffer_size)

print(header_conts)

Here is the output:

b'PAR1\x15\x04\x15\x90\xaf\x03\x15\xf0\xd7\x01L\x15\xf25\x15\x04\x12\x00\x00\xc8\xd7\x01\x04\x01\x00\t\x01\x00\x02\t\x07\x04\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08\x00\x08\r\x08\x00\t\r\x08\x00\n\r\x08\x00\x0b\r\x08\x00\x0c\r\x08\x00\r\r\x08\x00\x0e\r\x08\x00\x0f\r\x08\x00\x10\r\x08\x00\x11\r\x08\x00\x12\r\x08'

This PR introduces a new function, infer_profile_file_type, that leverages these unique signatures for the parquet and sqlite file types.

With it, the load_profiles function can identify the format of a file and load it accordingly.

Example below:

loading csv file:

[screenshot: csv_loading]

loading parquet files:

[screenshot: parquet_loading]
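
For readers who want the signature idea in one place, here is a minimal sketch of what a header-based check could look like. It is not the code merged in this PR (the merged solution relies on file extensions); the function body, the SIGNATURES table, and the return values are assumptions made for illustration only.

```python
from pathlib import Path

# Leading bytes that identify each format. Both values are visible in the
# header dumps shown earlier in this post.
SIGNATURES = {
    "sqlite": b"SQLite format 3\x00",
    "parquet": b"PAR1",
}


def infer_profile_file_type(path, buffer_size=100):
    """Return "sqlite", "parquet", or None based on the file's leading bytes.

    Hypothetical sketch: the name comes from this PR, the body does not.
    """
    with open(Path(path), "rb") as stream:
        header = stream.read(buffer_size)

    for file_type, signature in SIGNATURES.items():
        if header.startswith(signature):
            return file_type

    # Unknown signature, for example plain CSV text
    return None
```

A caller would treat None as "no recognized signature" and fall back to another strategy, such as assuming delimited text.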


modified from EmbeddedArtistry

Description

Thank you for your contribution to pycytominer!
Please succinctly summarize your proposed change.
What motivated you to make this change?

Please also link to any relevant issues that your code is associated with.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@axiomcura axiomcura requested review from gwaybio and d33bs March 16, 2023 22:04
gwaybio (Member) commented Mar 17, 2023

Is there a reason not to simply use pathlib.Path.suffix? https://docs.python.org/3/library/pathlib.html?highlight=suffix#pathlib.PurePath.suffix

This would easily support csv as well

d33bs (Member) commented Mar 17, 2023

> Is there a reason not to simply use pathlib.Path.suffix? https://docs.python.org/3/library/pathlib.html?highlight=suffix#pathlib.PurePath.suffix
>
> This would easily support csv as well

Agreeing with @gwaybio here, it may be simpler to use the file extension via pathlib.Path.suffix if available. If extensions are not available for specific datasets, inference might still be useful. Is there a specific dataset you've seen which does not include the extensions? Referencing this may help with the discussion. Maybe a combined approach could also make sense (first pathlib.Path.suffix, then inference)?

For filetype inference in specific: Leveraging existing work in this area might be a good idea due to the complexities involved. python-magic in particular could be useful (possibly using python-magic-bin for dependencies, though this too is not without possible challenges across various OS environments). Especially if/when other formats are considered, providing a scaffold which is widely accepted can help with maintenance over time.
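
The combined approach mentioned above (extension first, then inference) could be sketched roughly as follows. This is only an illustration of the idea, not code from the PR: guess_file_type, SUFFIX_TO_TYPE, and SIGNATURE_TO_TYPE are hypothetical names, and the signature fallback uses a hand-written table rather than python-magic.

```python
from pathlib import Path

# Hypothetical lookup tables for this sketch
SUFFIX_TO_TYPE = {".parquet": "parquet", ".sqlite": "sqlite", ".csv": "csv"}
SIGNATURE_TO_TYPE = {b"PAR1": "parquet", b"SQLite format 3\x00": "sqlite"}


def guess_file_type(path):
    """Guess a file type from the extension, falling back to the header bytes."""
    path = Path(path)

    # Step 1: trust the extension when it is one we recognize
    by_suffix = SUFFIX_TO_TYPE.get(path.suffix.lower())
    if by_suffix is not None:
        return by_suffix

    # Step 2: no usable extension, so inspect the first bytes of the file
    with open(path, "rb") as stream:
        header = stream.read(16)
    for signature, file_type in SIGNATURE_TO_TYPE.items():
        if header.startswith(signature):
            return file_type

    return None
```

Note that Path.suffix only returns the last extension, so a name like test.csv.gz yields ".gz" and would go through the fallback in this sketch.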

d33bs (Member) left a comment

Nice work - excited to see parquet compatibility additions! I made a few comments throughout. Please don't hesitate to let me know if you have any questions.

I feel the inference discussion and also the failing test (see here) should be resolved before we move forward.

axiomcura (Member, Author) commented

@gwaybio @d33bs

My initial thought is that extension-based file validation can result in errors. For example, users can submit extension-less files that may "confuse" pycytominer's loading functions. Relying on file signatures instead helps avoid this issue.

I do agree that the implementation is much more involved and using only extensions is much easier to implement. I was looking at python-magic that @d33bs has recommended and it's pretty handy.

Here are some examples below:

import pathlib
import magic

# magic object
magic_reader = magic.Magic(uncompress=True, extension=False)
print(magic_reader.from_file("./test_SQ00014613.parquet"))
print()
print(magic_reader.from_file("./test_SQ00014613.sqlite"))
print()
print(magic_reader.from_file("./test_SQ00014613.csv.gz"))

Here is the output:

Apache Parquet

SQLite 3.x database, last written using SQLite version 3037002, page size 1024, file counter 13832, database pages 192, cookie 0xa, schema 4, UTF-8, version-valid-for 13832

CSV text (gzip compressed data, was "test_SQ00014613.csv", last modified: Sat Sep  3 15:32:39 2022, max compression)

For symlinked files, we need to resolve the source path first in order for this to work. Here's an example:

import pathlib
import magic

# magic object
magic_reader = magic.Magic(uncompress=True, extension=False)

# now lets try with filepaths with symbolic links
sym_link_data = "./SQ00014617.parquet"
path = pathlib.Path(sym_link_data)

# check if the path is a sym_link
# -- if so, get true path
if path.is_symlink():
    path = path.resolve()

# with true path, identify file
print(magic_reader.from_file(path))

output:

Apache Parquet

If this is too much at this moment, then resorting to only file extensions is an ideal plan! Let me know what y'all think :)

gwaybio (Member) commented Mar 19, 2023

> If this is too much at this moment, then resorting to only file extensions is an ideal plan! Let me know what y'all think :)

My opinion is that it is too much - I think keeping things simple is the way to go for readability and maintainability. I generally get a bit queasy when adding a new dependency as well.

I think all we need is the following:

  1. A function that checks whether the input string or path has a parquet extension (maybe called is_path_a_parquet_file()) 🤷
  2. An update to load_profiles(profiles) that calls is_path_a_parquet_file() and uses pandas.read_parquet() if true
  3. corresponding tests
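
The first two items in the list above might look something like the sketch below. It is written from the description in this thread, not copied from the merged code, so treat the details as assumptions: load_profiles_sketch is a hypothetical stand-in for load_profiles, and the fallback to pandas.read_csv is a simplification.

```python
from pathlib import Path


def is_path_a_parquet_file(file):
    """Return True if `file` (a str or Path) has a .parquet extension."""
    return Path(file).suffix.lower() == ".parquet"


def load_profiles_sketch(profiles):
    """Load profiles with pandas, picking the reader from the file extension."""
    # Imported lazily so the extension check alone does not require pandas
    import pandas as pd

    if is_path_a_parquet_file(profiles):
        return pd.read_parquet(profiles)

    # Assumption: anything else is delimited text that read_csv can parse
    return pd.read_csv(profiles)
```

Keeping the check in its own small function makes the third item, the tests, straightforward: the predicate can be exercised with plain strings and Path objects without touching any data files.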

axiomcura (Member, Author) commented

@gwaybio and @d33bs

I have replaced my previous implementation with extension-based inference of parquet files using pathlib.Path.suffix.

gwaybio (Member) left a comment

A couple comments that need to be addressed prior to merging - looking good!

@d33bs d33bs self-requested a review March 20, 2023 18:48
d33bs (Member) left a comment

Nice job! I left a few comments and suggestions with this review. Please don't hesitate to let me know if you have any questions.

I'd request your feedback on partitioned parquet-based directories (those containing multiple files) with is_path_a_parquet_file before approval.
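
To make the partitioned-directory question concrete, one possible way to extend an extension check to directories is sketched below. This is purely illustrative and not what the PR implements: is_parquet_path is a hypothetical helper, and the rule "a directory counts if it holds at least one .parquet file" is an assumption.

```python
from pathlib import Path


def is_parquet_path(path):
    """Return True for a .parquet file or a directory containing .parquet files."""
    path = Path(path)

    if path.is_dir():
        # Partitioned datasets are directories, often with nested
        # key=value subfolders, so search recursively.
        return any(child.is_file() for child in path.rglob("*.parquet"))

    # Single file (or a path that does not exist yet): check the extension only
    return path.suffix.lower() == ".parquet"
```

A plain suffix check already returns True for a directory that is itself named with a .parquet extension; the recursive search only matters for directories without one.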

codecov-commenter commented Mar 21, 2023

Codecov Report

Merging #262 (5f7ecae) into master (b2c6cc4) will increase coverage by 0.00%.
The diff coverage is 97.36%.

@@           Coverage Diff           @@
##           master     #262   +/-   ##
=======================================
  Coverage   95.71%   95.72%           
=======================================
  Files          53       53           
  Lines        2826     2852   +26     
=======================================
+ Hits         2705     2730   +25     
- Misses        121      122    +1     
Flag Coverage Δ
unittests 95.72% <97.36%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
pycytominer/cyto_utils/load.py 87.34% <93.75%> (+0.77%) ⬆️
pycytominer/tests/test_cyto_utils/test_load.py 100.00% <100.00%> (ø)


@axiomcura axiomcura requested review from gwaybio and d33bs March 21, 2023 16:30
axiomcura (Member, Author) commented

@d33bs @gwaybio
I have addressed all comments!

gwaybio (Member) left a comment

Nearly there! 🎉

I made some comments which probably should be addressed prior to merging. I will let @d33bs give the final 👍

d33bs (Member) left a comment

Thanks Erik, great job on the new additions! I'm requesting some changes and your thoughts on a few items, especially on pathlib.Path.absolute() usage. Please don't hesitate to let me know if you have any questions.
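
For context on the pathlib.Path.absolute() question, the short example below shows how it differs from pathlib.Path.resolve(). The path used here is made up for illustration and does not need to exist.

```python
from pathlib import Path

# Hypothetical relative path containing a ".." component
relative = Path("data") / ".." / "profiles.parquet"

# absolute() only prepends the current working directory;
# ".." components and symlinks are left untouched.
absolute_path = relative.absolute()

# resolve() also makes the path absolute, but additionally
# collapses ".." components and follows symlinks.
resolved_path = relative.resolve()

print(absolute_path)
print(resolved_path)
```

The practical consequence is that two different spellings of the same location compare equal after resolve() but may not after absolute().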

axiomcura (Member, Author) commented

Hello @d33bs

I have addressed your comments. Hopefully they were answered!

@axiomcura axiomcura requested a review from d33bs March 22, 2023 20:57
d33bs (Member) left a comment

Nice work!

gwaybio (Member) left a comment

This looks great @axiomcura - congrats on this contribution! 🎉

axiomcura (Member, Author) commented Mar 23, 2023

Thank you!

@gwaybio gwaybio merged commit 9340ff3 into cytomining:master Mar 23, 2023
kenibrewer pushed a commit to kenibrewer/pycytominer that referenced this pull request Mar 25, 2023
…g#262)

* added new function `infer_profile_file_type`

* Fixed Unicode Bug

* fixed csv error

* improved variable names

* removed unwanted comments

* added extension based inference for parquet

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Gregory Way <[email protected]>

* Update pycytominer/tests/test_cyto_utils/test_load.py

Co-authored-by: Gregory Way <[email protected]>

* edited pathlib imports, documentation fixed

* applied black formatting

* added typing

* updated tests

* update tests

* testing update

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Dave Bunten <[email protected]>

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Dave Bunten <[email protected]>

* added black formatting

* update pathing

* fixed docs

* black formatting

* tests update

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Gregory Way <[email protected]>

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Gregory Way <[email protected]>

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Gregory Way <[email protected]>

* test update

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Gregory Way <[email protected]>

* fixed typo

* added comments

* Update pycytominer/cyto_utils/load.py

Co-authored-by: Dave Bunten <[email protected]>

* replaced `.absolute()` with `.resolve()`

* applied black formatting

* removed try and accept block

---------

Co-authored-by: Gregory Way <[email protected]>
Co-authored-by: Dave Bunten <[email protected]>
alxndrkalinin pushed a commit to alxndrkalinin/pycytominer that referenced this pull request Mar 27, 2023