
Add scATAC-seq UMAP DR + clustering -> Arrow container #40

Merged
mruffalo merged 4 commits into master from mruffalo/add-scatac-csv-to-arrow on Jun 18, 2020

Conversation

mruffalo
Contributor

This duplicates a lot of the h5ad-to-arrow code in a way that I'm not very happy about; I didn't realize how much more containers/h5ad-to-arrow/context/main.py does now compared to the last time I looked at it. This works, though, and should allow visualization of scATAC-seq dimensionality reduction and clustering in the same way as scRNA-seq, more or less for "free" in the UI code. (At least, that's how I interpreted @mccalluc's comments in a standup today.)

At the moment I don't have many good ideas about how to de-duplicate this code, given how the repository and containers are structured. That seems worth doing ASAP, but not necessarily right now.
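For context, the conversion itself is small. A minimal sketch of the CSV-to-Arrow step follows; this is illustrative only (the container's main.py is authoritative, and the column layout here is an assumption):

```python
# Illustrative sketch only -- the container's main.py is the source of truth.
# Reads a per-cell CSV (index = cell barcode, remaining columns = UMAP
# coordinates and cluster assignments) and writes it as an Arrow/Feather file.
from pathlib import Path

import pandas as pd
import pyarrow as pa
from pyarrow.feather import write_feather


def csv_to_arrow(csv_file: Path, arrow_file: Path) -> None:
    df = pd.read_csv(csv_file, index_col=0)
    table = pa.Table.from_pandas(df)
    write_feather(table, str(arrow_file))
```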

@mruffalo
Contributor Author

Hmm, this succeeded for me locally through ./test.sh:

scatac-csv-to-arrow
Sending build context to Docker daemon  8.427kB
Step 1/8 : FROM continuumio/miniconda3:4.7.12
 ---> 406f2b43ea59
Step 2/8 : RUN apt-get update &&      apt-get install -y gcc python3-dev
 ---> Using cache
 ---> 0ea0bd200b0a
Step 3/8 : COPY requirements-freeze.txt .
 ---> Using cache
 ---> fd42df0abe9b
Step 4/8 : RUN pip install  -r ./requirements-freeze.txt
 ---> Using cache
 ---> eddcbdcd183b
Step 5/8 : COPY requirements.txt .
 ---> Using cache
 ---> 89e0e76255a1
Step 6/8 : RUN pip install  -r ./requirements.txt
 ---> Using cache
 ---> c9c324e4b0c7
Step 7/8 : COPY . .
 ---> Using cache
 ---> c47ad736acc7
Step 8/8 : CMD [ "python", "main.py",       "--input_dir", "/input",       "--output_dir", "/output" ]
 ---> Using cache
 ---> 83e97873969b
Successfully built 83e97873969b
Successfully tagged hubmap/portal-container-scatac-csv-to-arrow:0.0.1
scatac-csv-to-arrow
hubmap/portal-container-scatac-csv-to-arrow:0.0.1 is good!
travis_fold:end:scatac-csv-to-arrow

I'm not familiar enough with the test setup -- what does

Binary files ./workflows/scatac-csv-to-arrow/test-output-expected/output/umap_coords_clusters.arrow 

mean in the test output? I might have expected something like Binary files <X> and <Y> differ in the case of a hash mismatch.

I had thought these would be binary identical after verifying that the DataFrame read from the CSV was equal to what was constructed from the h5ad file in h5ad-to-arrow, but apparently "equal" isn't "identical" for these purposes.
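To illustrate the "equal" vs. "identical" distinction: DataFrame equality compares values, while file identity compares bytes, and the bytes can vary with the pyarrow/pandas versions doing the writing. A hedged sketch (the DataFrames and .arrow paths below are hypothetical):

```python
# Sketch of why "equal" isn't "identical": the same values serialized by
# different pyarrow/pandas versions need not produce identical bytes.
import hashlib

import pandas as pd
from pandas.testing import assert_frame_equal


def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def frames_equal(df_a: pd.DataFrame, df_b: pd.DataFrame) -> bool:
    try:
        assert_frame_equal(df_a, df_b)
        return True
    except AssertionError:
        return False


# frames_equal(df_from_csv, df_from_h5ad) can be True even though
# sha256("from_csv.arrow") != sha256("from_h5ad.arrow").
```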
@mccalluc mccalluc requested a review from ilan-gold June 18, 2020 18:07
@mccalluc
Contributor

Adding @ilan-gold as a reviewer. I'll take a look at it, but I really haven't kept up with this code base. (I have this browser tab open, so I'm assuming there was a request for me to look at this in Slack or something, but I've lost track now of how it came to me.)

@mruffalo : Our convention has been that after review, and if the tests pass, the author of the PR merges. Fine with me if you'll do that, when those conditions are met, or if you think it should be in our court, that's fine, too.

@ilan-gold
Member

ilan-gold commented Jun 18, 2020

@mruffalo I agree about the code duplication. That's unfortunate: it seems like simply changing the input file would have solved it (once it is converted to Arrow), but we're past that point now. I'll take a closer look as well.

@ilan-gold
Member

@mccalluc Do you remember why you were excluding .arrow files in the original tests for the docker container? f75f212

Perhaps it's the same issue creeping up again and the solution is to just ignore the files.
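For reference, one way such an exclusion could look in a test harness. This is a hypothetical sketch, not the actual test.sh logic:

```python
# Hypothetical sketch of comparing expected vs. actual output directories
# while ignoring .arrow files; not the actual test.sh logic.
import filecmp
from pathlib import Path


def outputs_match(expected: Path, actual: Path) -> bool:
    # Compare only the non-.arrow files present in the expected directory.
    names = [p.name for p in expected.iterdir() if p.suffix != ".arrow"]
    _, mismatch, errors = filecmp.cmpfiles(expected, actual, names, shallow=False)
    return not mismatch and not errors
```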

Not binary identical, and apparently some package version differences on my host machine cause this to differ.
@mruffalo
Contributor Author

mruffalo commented Jun 18, 2020

Thanks for the feedback -- I fixed this by updating the .arrow files to versions that are actually generated by the code in the container, not by my host system. Apparently some version difference (Python, pyarrow, etc.) causes files to be written that aren't binary identical.

I don't like the code duplication here, but I would advocate for fixing this after the first data release.

@mccalluc I'll be happy to merge and publish a tagged Docker container once @ilan-gold gives a 👍

@ilan-gold
Member

I agree with @mruffalo about the timing of de-duplicating code. I don't think this is the last we will hear of needing to generate this sort of information. @mccalluc It may be useful for us to soon begin discussing some sort of py-vitessce package with utility functions that take in pandas DataFrames and output Vitessce schemas. We repeat this sort of functionality in a lot more places than just here.
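A rough sketch of what such a utility might look like. The schema shape and column names ("umap.1", "umap.2", "cluster") are illustrative guesses, not the real Vitessce spec:

```python
# Rough sketch of a hypothetical py-vitessce utility: pandas DataFrame in,
# Vitessce cells.json-style dict out. The schema shape and the column names
# used here are guesses for illustration only.
import pandas as pd


def dataframe_to_cells(df: pd.DataFrame) -> dict:
    # One entry per cell: UMAP coordinates as a mapping, cluster as a factor.
    return {
        str(cell_id): {
            "mappings": {"UMAP": [row["umap.1"], row["umap.2"]]},
            "factors": {"cluster": str(row["cluster"])},
        }
        for cell_id, row in df.iterrows()
    }
```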

@mruffalo
Contributor Author

Thanks!

@mruffalo mruffalo merged commit 0b1da06 into master Jun 18, 2020
@mruffalo mruffalo deleted the mruffalo/add-scatac-csv-to-arrow branch June 18, 2020 18:59
@mccalluc
Contributor

mccalluc commented Jun 19, 2020

@mruffalo , @ilan-gold : Sorry to be late to the party:

  • Yeah, I excluded .arrow files from tests because of some binary inconsistency... I hadn't gotten as far as tying it to the platform. Since we also generate the JSON downstream from that, I thought that would be sufficient to catch any problems.
  • Merging and running now is absolutely the right call. Thanks!
  • I filed Handle code duplication #43 to help us not forget about the code duplication.
