feat: Add code for validating fragment file #1095
base: main
Conversation
TODO:
- add tests
- optimize validation steps
This PR has not seen any activity in the past 2 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.
This PR was closed because it has been inactive for 17 days, 3 days since being marked as stale. Please re-open if you still need this to be addressed.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1095      +/-   ##
==========================================
- Coverage   89.88%   89.78%    -0.10%
==========================================
  Files          19       20        +1
  Lines        2194     2369      +175
==========================================
+ Hits         1972     2127      +155
- Misses        222      242       +20
```
…eature_reference, validate_anndata_is_primary_data
```python
def validate_fragment_barcode_in_adata_index(parquet_file: str, anndata_file: str) -> Optional[str]:
    df = ddf.read_parquet(parquet_file, columns=["barcode"])
    obs = ad.read_h5ad(anndata_file, backed="r").obs
```
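The diff excerpt ends here. Purely as illustration, a hedged sketch of how the barcode check might conclude, assuming the goal is to flag fragment barcodes missing from the anndata obs index (the PR's actual logic and error message are not shown):

```python
# Hypothetical continuation -- not the PR's actual code.
missing = (~df["barcode"].isin(obs.index)).compute()
if missing.any():
    return "fragment barcodes not found in anndata obs index"  # hypothetical message
return None
```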
Suggested change:

```diff
- obs = ad.read_h5ad(anndata_file, backed="r").obs
+ with h5py.File(anndata_file) as f:
+     obs = ad.io.read_elem(f["obs"])
```
^ This will read less, and could be significantly faster depending on how much other data is in the file.
If you are on anndata < 0.11, read_elem is under anndata.experimental.
Similar below.
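A small sketch of the version-tolerant import this implies; the fallback module is the one named in the comment, but treat the exact layout as an assumption (`anndata_file` comes from the snippet above):

```python
import h5py

try:
    from anndata.io import read_elem  # anndata >= 0.11
except ImportError:
    from anndata.experimental import read_elem  # anndata < 0.11

with h5py.File(anndata_file, "r") as f:
    obs = read_elem(f["obs"])  # reads only obs, not the whole h5ad
```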
I think it looks good! I can help with the dask TODOs if you'd like.
I would like to see at least one test that reads and validates the output bgz/tbi files. I would like this test to be parameterized over both index implementations (notably, write_bgzip_cli is not covered by tests). I would probably make the indexing method explicitly selectable to make this easier to test.
Broadly, I think more tests checking the processed outputs would be good (I think there's just test_process_fragment right now?). Some ideas, with a sketch after the list:
- Check that the output file is actually sorted as we expect
- Access the file via the index and make sure the results are as expected
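A hedged sketch of what such a parameterized test could look like; `process_fragment`, `index_method`, and `SAMPLE_FRAGMENT` are hypothetical stand-ins for the real entry point, option name, and fixture:

```python
import pysam
import pytest

@pytest.mark.parametrize("index_method", ["pysam", "cli"])  # cover both implementations
def test_fragment_output_is_sorted_and_indexed(tmp_path, index_method):
    # process_fragment / index_method / SAMPLE_FRAGMENT are hypothetical names.
    bgz = process_fragment(SAMPLE_FRAGMENT, output_dir=tmp_path, index_method=index_method)
    # The tabix index should be written next to the bgzip output.
    assert (bgz.parent / (bgz.name + ".tbi")).exists()
    # Fetch through the index and check positions are sorted within each contig.
    with pysam.TabixFile(str(bgz)) as tbx:
        for contig in tbx.contigs:
            starts = [row.start for row in tbx.fetch(contig, parser=pysam.asBed())]
            assert starts == sorted(starts)
```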
```python
logger.info(f"Fragment sorted and compressed: {bgzip_output_file}")
# …
pysam.tabix_index(bgzip_output_file, preset="bed", force=True)
```
The docs for this say:

> The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.

The input you are providing here should be contiguous by contig, but not necessarily sorted, since we don't know the order in which the tasks will complete. Right?
I assume this is fine, but maybe add a comment noting the behavior is different from the docs?
added a comment in prepare_fragment
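For reference, a sketch of what that note might look like (wording hypothetical):

```python
# Note: pysam's docs require the input to be sorted by contig and position, and
# tabix_index does not check this. Our output is position-sorted within each
# contig and contiguous per contig, but contig blocks can land in task
# completion order rather than sorted order; this has worked in practice.
pysam.tabix_index(bgzip_output_file, preset="bed", force=True)
```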
- reading anndata with h5py to improve read efficiency
```python
# limit calls to dask.compute to improve performance. The number of jobs to run
# at once is determined by the step variable. If we run all the jobs in the same
# call to dask.compute, the local cluster hangs.
# TODO: investigate why
step = 4
```
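A minimal sketch of the chunked submission pattern this describes, assuming `jobs` is the list of delayed tasks:

```python
import dask

step = 4  # number of delayed jobs per dask.compute call
results = []
for i in range(0, len(jobs), step):
    # Submitting every job in a single dask.compute call hangs the local
    # cluster (see the TODO above), so compute `step` jobs at a time.
    results.extend(dask.compute(*jobs[i : i + step]))
```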
How was this number determined? Are we planning on running this many cores for the atac-seq validation container?
This is the number of jobs we send to the dask scheduler at a time. The dask configuration determines the compute used. The value of 4 was found by running the sample fragment a few times with a decreasing number of jobs until the cluster didn't freeze.
- change name of cli command validate-fragment to process-fragment
- install tabix and bgzip in test GHA
Co-authored-by: Isaac Virshup <[email protected]>
```python
@delayed
def sort_fragment(parquet_file: str, write_path: str, chromosome: str) -> Path:
    temp_data = Path(write_path) / f"temp_{chromosome}.tsv"
```
Suggested change:

```diff
- temp_data = Path(write_path) / f"temp_{chromosome}.tsv"
+ temp_data = Path(write_path) / f"temp_{chromosome}.tsv.gz"
```
I think this file should be compressed to keep storage requirements down during processing.
Ideally I would use a better compressor than GZIP (like zstd), but I'm not sure what you'll have installed here.
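A small sketch of the compressed write under that suggestion, assuming the per-chromosome data is in a pandas DataFrame `df`; gzip comes for free via pandas, while zstd would need an extra dependency:

```python
from pathlib import Path

temp_data = Path(write_path) / f"temp_{chromosome}.tsv.gz"
# pandas infers gzip compression from the .gz suffix (compression="infer").
df.to_csv(temp_data, sep="\t", header=False, index=False)
```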
- add output_file cli parameter
- move all cli commands under schema-cli
Reason for Change
Changes
Testing
Remaining Work
chanzuckerberg/single-cell#724