feat: Add code for validating fragment file #1095
base: main
Conversation
TODO:
- add tests
- optimize validation steps
This PR has not seen any activity in the past 2 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.
This PR was closed because it has been inactive for 17 days, 3 days since being marked as stale. Please re-open if you still need this to be addressed.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1095      +/-   ##
==========================================
- Coverage   89.88%   89.78%    -0.10%
==========================================
  Files          19       20        +1
  Lines        2194     2369      +175
==========================================
+ Hits         1972     2127      +155
- Misses        222      242       +20
```
…eature_reference, validate_anndata_is_primary_data
```python
def validate_fragment_barcode_in_adata_index(parquet_file: str, anndata_file: str) -> Optional[str]:
    df = ddf.read_parquet(parquet_file, columns=["barcode"])
    obs = ad.read_h5ad(anndata_file, backed="r").obs
```
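The diff excerpt ends here. Purely as illustration, a hedged sketch of how the barcode check might conclude, assuming the goal is to flag fragment barcodes missing from the anndata obs index (the PR's actual logic and error message are not shown):

```python
# Hypothetical continuation -- not the PR's actual code.
missing = (~df["barcode"].isin(obs.index)).compute()
if missing.any():
    return "fragment barcodes not found in anndata obs index"  # hypothetical message
return None
```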
Suggested change:

```diff
- obs = ad.read_h5ad(anndata_file, backed="r").obs
+ with h5py.File(anndata_file) as f:
+     obs = ad.io.read_elem(f["obs"])
```
^ This will read less, and could be significantly faster depending on how much other data is in the file.
If you are on anndata < 0.11, read_elem is under anndata.experimental.
Similar below.
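A small sketch of the version-tolerant import this implies; the fallback module is the one named in the comment, but treat the exact layout as an assumption (`anndata_file` comes from the snippet above):

```python
import h5py

try:
    from anndata.io import read_elem  # anndata >= 0.11
except ImportError:
    from anndata.experimental import read_elem  # anndata < 0.11

with h5py.File(anndata_file, "r") as f:
    obs = read_elem(f["obs"])  # reads only obs, not the whole h5ad
```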
I think it looks good! I can help with the dask TODOs if you'd like.
I would like to see at least one test that reads and validates the output bgz/tbi files. I would like this test to be parameterized over both index implementations (notably, write_bgzip_cli is not covered by tests). I would probably make the indexing method explicitly selectable to make this easier to test.
Broadly, I think more tests checking the processed outputs would be good (I think there's just test_process_fragment right now?). Some ideas, with a sketch after the list:
- Check that the output file is actually sorted as we expect
- Access the file via the index and make sure the results are as expected
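A hedged sketch of what such a parameterized test could look like; `process_fragment`, `index_method`, and `SAMPLE_FRAGMENT` are hypothetical stand-ins for the real entry point, option name, and fixture:

```python
import pysam
import pytest

@pytest.mark.parametrize("index_method", ["pysam", "cli"])  # cover both implementations
def test_fragment_output_is_sorted_and_indexed(tmp_path, index_method):
    # process_fragment / index_method / SAMPLE_FRAGMENT are hypothetical names.
    bgz = process_fragment(SAMPLE_FRAGMENT, output_dir=tmp_path, index_method=index_method)
    # The tabix index should be written next to the bgzip output.
    assert (bgz.parent / (bgz.name + ".tbi")).exists()
    # Fetch through the index and check positions are sorted within each contig.
    with pysam.TabixFile(str(bgz)) as tbx:
        for contig in tbx.contigs:
            starts = [row.start for row in tbx.fetch(contig, parser=pysam.asBed())]
            assert starts == sorted(starts)
```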
```python
logger.info(f"Fragment sorted and compressed: {bgzip_output_file}")
# …
pysam.tabix_index(bgzip_output_file, preset="bed", force=True)
```
The docs for this say:

> The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.

The input you are providing here should be contiguous by contig, but not necessarily sorted, since we don't know the order in which the tasks will complete. Right?
I assume this is fine, but maybe add a comment noting the behavior is different from the docs?
added a comment in prepare_fragment
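For reference, a sketch of what that note might look like (wording hypothetical):

```python
# Note: pysam's docs require the input to be sorted by contig and position, and
# tabix_index does not check this. Our output is position-sorted within each
# contig and contiguous per contig, but contig blocks can land in task
# completion order rather than sorted order; this has worked in practice.
pysam.tabix_index(bgzip_output_file, preset="bed", force=True)
```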
- reading anndata with h5py to improve read efficiency
```python
# limit calls to dask.compute to improve performance. The number of jobs to run
# at once is determined by the step variable. If we run all the jobs in the same
# call to dask.compute, the local cluster hangs.
# TODO: investigate why
step = 4
```
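A minimal sketch of the chunked submission pattern this describes, assuming `jobs` is the list of delayed tasks:

```python
import dask

step = 4  # number of delayed jobs per dask.compute call
results = []
for i in range(0, len(jobs), step):
    # Submitting every job in a single dask.compute call hangs the local
    # cluster (see the TODO above), so compute `step` jobs at a time.
    results.extend(dask.compute(*jobs[i : i + step]))
```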
How was this number determined? Are we planning on running this many cores for the atac-seq validation container?
This is the number of jobs we send to the dask scheduler at a time. The dask configuration determines the compute used. The value of 4 was found by running the sample fragment a few times with a decreasing number of jobs until the cluster didn't freeze.
- change name of cli command validate-fragment to process-fragment
- install tabix and bgzip in test GHA
Co-authored-by: Isaac Virshup <[email protected]>
```python
@delayed
def sort_fragment(parquet_file: str, write_path: str, chromosome: str) -> Path:
    temp_data = Path(write_path) / f"temp_{chromosome}.tsv"
```
Suggested change:

```diff
- temp_data = Path(write_path) / f"temp_{chromosome}.tsv"
+ temp_data = Path(write_path) / f"temp_{chromosome}.tsv.gz"
```
I think this file should be compressed to keep storage requirements down during processing.
Ideally I would use a better compressor than GZIP (like zstd), but I'm not sure what you'll have installed here.
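A small sketch of the compressed write under that suggestion, assuming the per-chromosome data is in a pandas DataFrame `df`; gzip comes for free via pandas, while zstd would need an extra dependency:

```python
from pathlib import Path

temp_data = Path(write_path) / f"temp_{chromosome}.tsv.gz"
# pandas infers gzip compression from the .gz suffix (compression="infer").
df.to_csv(temp_data, sep="\t", header=False, index=False)
```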
- add output_file cli parameter
- move all cli commands under schema-cli
Reason for Change
Changes
Testing
Remaining Work
chanzuckerberg/single-cell#724