feat: Add code for validating fragment file #1095

Open
wants to merge 32 commits into
base: main

Contributor

@Bento007 Bento007 commented Nov 14, 2024

Reason for Change

Changes

  • add cli command
$ cellxgene-schema process-fragment -h                                                                                                    
Usage: cellxgene-schema process-fragment [OPTIONS] H5AD_FILE FRAGMENT_FILE

  Check that an ATAC SEQ fragment follows the cellxgene data integration schema. If
  validation fails this command will return an exit status of 1 otherwise 0. When
  the '--generate-index' tag is present, the command will generate a tabix
  compatible version of the fragment and tabix index. The generated fragment will
  have the file suffix .bgz and the index will have the file suffix .bgz.tbi.

Options:
  -i, --generate-index  Generate index for fragment
  -h, --help            Show this message and exit.
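
Because the command reports failure through its exit status, a wrapper script can branch on the return code. A minimal sketch, assuming the `cellxgene-schema` executable is on the PATH and accepts arguments exactly as shown in the help text above; `run_process_fragment` and its `base_cmd` parameter are hypothetical names used only for illustration:

```python
import subprocess
from typing import Optional, Sequence


def run_process_fragment(
    h5ad_file: str,
    fragment_file: str,
    generate_index: bool = False,
    base_cmd: Optional[Sequence[str]] = None,
) -> bool:
    """Return True when validation passed (exit status 0), False otherwise."""
    # base_cmd is a hypothetical hook so the executable can be swapped out in tests.
    cmd = list(base_cmd) if base_cmd is not None else ["cellxgene-schema"]
    cmd.append("process-fragment")
    if generate_index:
        cmd.append("--generate-index")
    cmd += [h5ad_file, fragment_file]
    # check=False: a non-zero exit status is an expected outcome, not an exception.
    result = subprocess.run(cmd, check=False)
    return result.returncode == 0
```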

Testing

  • added unit tests for all validation steps.
pip install git+https://github.com/chanzuckerberg/single-cell-curation/@tsmith/10x-ATAC#subdirectory=cellxgene_schema_cli

Remaining Work

chanzuckerberg/single-cell#724

Contributor

github-actions bot commented Dec 7, 2024

This PR has not seen any activity in the past 2 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.

@github-actions github-actions bot added the Stale label Dec 7, 2024
Contributor

This PR was closed because it has been inactive for 17 days, 3 days since being marked as stale. Please re-open if you still need this to be addressed.

Contributor

github-actions bot commented Jan 5, 2025

This PR has not seen any activity in the past 2 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.

@github-actions github-actions bot added the Stale label Jan 5, 2025
@Bento007 Bento007 removed the Stale label Jan 7, 2025
Contributor

This PR has not seen any activity in the past 2 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.

@github-actions github-actions bot added the Stale label Jan 22, 2025
@github-actions github-actions bot removed the Stale label Jan 23, 2025

codecov bot commented Jan 25, 2025

Codecov Report

Attention: Patch coverage is 89.14286% with 19 lines in your changes missing coverage. Please review.

Project coverage is 89.78%. Comparing base (b7e96bf) to head (873a155).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1095      +/-   ##
==========================================
- Coverage   89.88%   89.78%   -0.10%     
==========================================
  Files          19       20       +1     
  Lines        2194     2369     +175     
==========================================
+ Hits         1972     2127     +155     
- Misses        222      242      +20     
Components Coverage Δ
cellxgene_schema_cli 90.60% <89.14%> (-0.23%) ⬇️
migration_assistant 91.26% <ø> (ø)
schema_bump_dry_run_genes 79.74% <ø> (ø)
schema_bump_dry_run_ontologies 99.53% <ø> (ø)

@Bento007 Bento007 requested review from ivirshup and joyceyan and removed request for joyceyan and ivirshup February 12, 2025 22:31

def validate_fragment_barcode_in_adata_index(parquet_file: str, anndata_file: str) -> Optional[str]:
    df = ddf.read_parquet(parquet_file, columns=["barcode"])
    obs = ad.read_h5ad(anndata_file, backed="r").obs

@ivirshup ivirshup Feb 12, 2025

Suggested change
-    obs = ad.read_h5ad(anndata_file, backed="r").obs
+    with h5py.File(anndata_file) as f:
+        obs = ad.io.read_elem(f["obs"])

^ This will read less, and could be significantly faster depending on how much other data is in the file.

If you are on anndata < 0.11, read_elem is under anndata.experimental.

Similar below


@ivirshup ivirshup left a comment

I think it looks good! I can help with the dask TODOs if you'd like.

I would like to see at least one test that reads and validates the output bgz/tbi files. I would like this test to be parameterized over both index implementations (notably, write_bgzip_cli is not covered by tests). I would probably make the indexing method explicitly selectable to make this easier to test.

Broadly, I think more tests checking the processed outputs would be good (I think there's just test_process_fragment right now?).

Some ideas:

  • Check that the output file is actually sorted as we expect
  • Access the file via the index and make sure the results are as expected
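
The first idea can be checked without an indexing library, because a bgzip file is also a valid multi-member gzip stream that Python's `gzip` module can read. A rough sketch, assuming the fragment file is tab-separated with the chromosome and start coordinate in the first two columns; `check_fragment_order` is a hypothetical helper, not part of this PR. It requires each chromosome to be contiguous with non-decreasing start coordinates and deliberately does not constrain the order of the chromosomes themselves:

```python
import gzip
from typing import Optional


def check_fragment_order(path: str) -> Optional[str]:
    """Return None if the file is ordered as expected, else a description of the first problem."""
    finished = set()   # chromosomes whose block has already ended
    current = None     # chromosome of the block currently being read
    last_start = -1
    with gzip.open(path, "rt") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip() or line.startswith("#"):
                continue  # skip blank and comment lines
            fields = line.rstrip("\n").split("\t")
            chrom, start = fields[0], int(fields[1])
            if chrom != current:
                if chrom in finished:
                    return f"line {line_no}: {chrom} is not contiguous"
                if current is not None:
                    finished.add(current)
                current, last_start = chrom, -1
            if start < last_start:
                return f"line {line_no}: start {start} precedes {last_start} on {chrom}"
            last_start = start
    return None
```

A test could call this on the generated .bgz file and assert that the result is `None`.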

cellxgene_schema_cli/cellxgene_schema/atac_seq.py (outdated, resolved)

logger.info(f"Fragment sorted and compressed: {bgzip_output_file}")

pysam.tabix_index(bgzip_output_file, preset="bed", force=True)

The docs for this say:

The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.

The input you are providing here should be contiguous by contig, but not necessarily sorted since we don't know the order in which the tasks will complete. Right?

I assume this is fine, but maybe a comment noting the behavior is different from the docs?

Contributor Author

added a comment in prepare_fragment

cellxgene_schema_cli/cellxgene_schema/atac_seq.py (outdated, resolved)
cellxgene_schema_cli/cellxgene_schema/atac_seq.py (outdated, resolved)
cellxgene_schema_cli/cellxgene_schema/cli.py (outdated, resolved)
- reading anndata with h5py to improve read efficiency
# limit calls to dask.compute to improve performance. The number of jobs to run at once is determined by the
# step variable. If we run all the jobs in the same call to dask.compute, the local cluster hangs.
# TODO: investigate why
step = 4
Contributor

how was this number determined? are we planning on running this many cores for the atac-seq validation container?

Contributor Author

This is the number of jobs we send to the dask scheduler at a time. The dask configuration will determine the compute used. The value of 4 was determined by running the sample fragment a few times with a decreasing number of jobs until it didn't freeze.
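
The batching described here amounts to slicing the job list into groups of `step` and computing one group at a time. A minimal sketch of that pattern; `compute_in_batches` is a hypothetical helper, and the `compute` argument stands in for `dask.compute`, which this example does not import so that it runs with the standard library alone:

```python
from typing import Any, Callable, List, Sequence


def compute_in_batches(
    jobs: Sequence[Any],
    compute: Callable[..., Sequence[Any]],
    step: int = 4,
) -> List[Any]:
    """Call compute(*batch) on at most `step` jobs at a time and collect the results in order."""
    if step < 1:
        raise ValueError("step must be at least 1")
    results: List[Any] = []
    for i in range(0, len(jobs), step):
        batch = jobs[i : i + step]
        # With dask this would be dask.compute(*batch), which returns a tuple of results.
        results.extend(compute(*batch))
    return results
```

With a list of delayed jobs, the call would look like `compute_in_batches(jobs, dask.compute, step=4)`, where `jobs` is an assumed variable name.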

- change name of cli command validate-fragment to process-fragment
- install tabix and bgzip in test GHA

@delayed
def sort_fragment(parquet_file: str, write_path: str, chromosome: str) -> Path:
    temp_data = Path(write_path) / f"temp_{chromosome}.tsv"

Suggested change
-    temp_data = Path(write_path) / f"temp_{chromosome}.tsv"
+    temp_data = Path(write_path) / f"temp_{chromosome}.tsv.gz"

I think this file should be compressed to keep storage requirements during processing down.

Ideally, I would use a better compressor than GZIP (like zstd) but I'm not sure what you'll have installed here.
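
One way to follow this suggestion with only the standard library is to write through `gzip.open`. A sketch under assumptions: `write_sorted_chunk` and its `rows` argument are hypothetical, and the PR's `sort_fragment` may produce its output differently (for example directly from a dataframe):

```python
import gzip
from pathlib import Path
from typing import Iterable, Sequence


def write_sorted_chunk(rows: Iterable[Sequence[object]], write_path: str, chromosome: str) -> Path:
    """Sort rows by start coordinate (second column) and write them as a gzip-compressed TSV."""
    temp_data = Path(write_path) / f"temp_{chromosome}.tsv.gz"
    # "wt" opens the gzip stream in text mode; newline="" avoids platform newline translation.
    with gzip.open(temp_data, "wt", newline="") as handle:
        for row in sorted(rows, key=lambda r: r[1]):
            handle.write("\t".join(str(v) for v in row) + "\n")
    return temp_data
```

Any downstream step that reads the temporary files as plain text would then need to decompress them first.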

- add output_file cli parameter
- move all cli commands under schema-cli
3 participants