
Add scATAC-seq UMAP DR + clustering -> Arrow container #40

Merged
mruffalo merged 4 commits into master from mruffalo/add-scatac-csv-to-arrow on Jun 18, 2020

Conversation

mruffalo
Contributor

This duplicates a lot of the h5ad-to-arrow code in a way that I'm not very happy about; I didn't realize how much more containers/h5ad-to-arrow/context/main.py does now compared to the last time I looked at it. This works, though, and should allow visualization of scATAC-seq dimensionality reduction and clustering in the same way as scRNA-seq, more or less for "free" in the UI code. (At least, that's how I interpreted @mccalluc's comments in a standup today.)

At the moment I don't have many good ideas about how to de-duplicate this code, given how the repository and containers are structured. That seems worth doing ASAP, but not necessarily right now.
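For context, the conversion itself is small. A minimal sketch of the CSV-to-Arrow step follows; this is illustrative only (the container's main.py is authoritative, and the column layout here is an assumption):

```python
# Illustrative sketch only -- the container's main.py is the source of truth.
# Reads a per-cell CSV (index = cell barcode, remaining columns = UMAP
# coordinates and cluster assignments) and writes it as an Arrow/Feather file.
from pathlib import Path

import pandas as pd
import pyarrow as pa
from pyarrow.feather import write_feather


def csv_to_arrow(csv_file: Path, arrow_file: Path) -> None:
    df = pd.read_csv(csv_file, index_col=0)
    table = pa.Table.from_pandas(df)
    write_feather(table, str(arrow_file))
```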

@mruffalo
Contributor Author

Hmm, this succeeded for me locally through ./test.sh:

scatac-csv-to-arrow
Sending build context to Docker daemon  8.427kB
Step 1/8 : FROM continuumio/miniconda3:4.7.12
 ---> 406f2b43ea59
Step 2/8 : RUN apt-get update &&      apt-get install -y gcc python3-dev
 ---> Using cache
 ---> 0ea0bd200b0a
Step 3/8 : COPY requirements-freeze.txt .
 ---> Using cache
 ---> fd42df0abe9b
Step 4/8 : RUN pip install  -r ./requirements-freeze.txt
 ---> Using cache
 ---> eddcbdcd183b
Step 5/8 : COPY requirements.txt .
 ---> Using cache
 ---> 89e0e76255a1
Step 6/8 : RUN pip install  -r ./requirements.txt
 ---> Using cache
 ---> c9c324e4b0c7
Step 7/8 : COPY . .
 ---> Using cache
 ---> c47ad736acc7
Step 8/8 : CMD [ "python", "main.py",       "--input_dir", "/input",       "--output_dir", "/output" ]
 ---> Using cache
 ---> 83e97873969b
Successfully built 83e97873969b
Successfully tagged hubmap/portal-container-scatac-csv-to-arrow:0.0.1
scatac-csv-to-arrow
hubmap/portal-container-scatac-csv-to-arrow:0.0.1 is good!
travis_fold:end:scatac-csv-to-arrow

I'm not familiar enough with the test setup -- what does

Binary files ./workflows/scatac-csv-to-arrow/test-output-expected/output/umap_coords_clusters.arrow 

mean in the test output? I might have expected something like Binary files <X> and <Y> differ in the case of a hash mismatch.

I had thought these would be binary identical after verifying that the DataFrame read from the CSV was equal to what was constructed from the h5ad file in h5ad-to-arrow, but apparently "equal" isn't "identical" for these purposes.
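To illustrate the "equal" vs. "identical" distinction: DataFrame equality compares values, while file identity compares bytes, and the bytes can vary with the pyarrow/pandas versions doing the writing. A hedged sketch (the DataFrames and .arrow paths below are hypothetical):

```python
# Sketch of why "equal" isn't "identical": the same values serialized by
# different pyarrow/pandas versions need not produce identical bytes.
import hashlib

import pandas as pd
from pandas.testing import assert_frame_equal


def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def frames_equal(df_a: pd.DataFrame, df_b: pd.DataFrame) -> bool:
    try:
        assert_frame_equal(df_a, df_b)
        return True
    except AssertionError:
        return False


# frames_equal(df_from_csv, df_from_h5ad) can be True even though
# sha256("from_csv.arrow") != sha256("from_h5ad.arrow").
```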
@mccalluc mccalluc requested a review from ilan-gold June 18, 2020 18:07
@mccalluc
Contributor

Adding @ilan-gold as a reviewer. I'll take a look at it, but I really haven't kept up with this code base. (I have this browser tab open, so I'm assuming there was a request for me to look at this in Slack or something, but I've lost track now of how it came to me.)

@mruffalo : Our convention has been that after review, and if the tests pass, the author of the PR merges. Fine with me if you'll do that, when those conditions are met, or if you think it should be in our court, that's fine, too.

@ilan-gold
Member

ilan-gold commented Jun 18, 2020

@mruffalo I agree about the code duplication. That's unfortunate: it seems like simply changing the input file would have solved it (once it is converted to Arrow), but we're past that point now. I'll take a closer look as well.

@ilan-gold
Member

@mccalluc Do you remember why you were excluding .arrow files in the original tests for the docker container? f75f212

Perhaps it's the same issue creeping up again and the solution is to just ignore the files.
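For reference, one way such an exclusion could look in a test harness. This is a hypothetical sketch, not the actual test.sh logic:

```python
# Hypothetical sketch of comparing expected vs. actual output directories
# while ignoring .arrow files; not the actual test.sh logic.
import filecmp
from pathlib import Path


def outputs_match(expected: Path, actual: Path) -> bool:
    # Compare only the non-.arrow files present in the expected directory.
    names = [p.name for p in expected.iterdir() if p.suffix != ".arrow"]
    _, mismatch, errors = filecmp.cmpfiles(expected, actual, names, shallow=False)
    return not mismatch and not errors
```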

Not binary identical, and apparently some package version differences on my host machine cause this to differ.
@mruffalo
Contributor Author

mruffalo commented Jun 18, 2020

Thanks for the feedback -- I fixed this by updating the .arrow files to versions that are actually generated by the code in the container, not by my host system. Apparently some version difference (Python, pyarrow, etc.) causes files to be written that aren't binary identical.

I don't like the code duplication here, but I would advocate for fixing this after the first data release.

@mccalluc I'll be happy to merge and publish a tagged Docker container once @ilan-gold gives a 👍

@ilan-gold
Member

I agree with @mruffalo about the timing of de-duplicating code. I don't think this is the last we will hear of needing to generate this sort of information. @mccalluc It may be useful for us to soon begin discussing some sort of py-vitessce package with utility functions that take in pandas DataFrames and output Vitessce schemas. We repeat this sort of functionality in a lot more places than just here.
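A rough sketch of what such a utility might look like. The schema shape and column names ("umap.1", "umap.2", "cluster") are illustrative guesses, not the real Vitessce spec:

```python
# Rough sketch of a hypothetical py-vitessce utility: pandas DataFrame in,
# Vitessce cells.json-style dict out. The schema shape and the column names
# used here are guesses for illustration only.
import pandas as pd


def dataframe_to_cells(df: pd.DataFrame) -> dict:
    # One entry per cell: UMAP coordinates as a mapping, cluster as a factor.
    return {
        str(cell_id): {
            "mappings": {"UMAP": [row["umap.1"], row["umap.2"]]},
            "factors": {"cluster": str(row["cluster"])},
        }
        for cell_id, row in df.iterrows()
    }
```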

@mruffalo
Contributor Author

Thanks!

@mruffalo mruffalo merged commit 0b1da06 into master Jun 18, 2020
@mruffalo mruffalo deleted the mruffalo/add-scatac-csv-to-arrow branch June 18, 2020 18:59
@mccalluc
Contributor

mccalluc commented Jun 19, 2020

@mruffalo , @ilan-gold : Sorry to be late to the party:

  • Yeah, I excluded .arrow files from tests because of some binary inconsistency... I hadn't gotten as far as tying it to the platform. Since we also generate the JSON downstream from that, I thought that would be sufficient to catch any problems.
  • Merging and running now is absolutely the right call. Thanks!
  • I filed Handle code duplication #43 to help us not forget about the code duplication.
