Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ALS-4461] Allow incremental vcf loading #73

Merged
merged 40 commits into from
Aug 23, 2023
Merged

Conversation

ramari16
Copy link
Contributor

@ramari16 ramari16 commented Aug 8, 2023

Changes to allow incremental vcf loading, as well as some changes to how we store genomic data since we're breaking backwards compatibility anyway. Mainly:

  • Allow FileBackedByteIndexedStorage to support serialization other than java object serialization. Specifically to add JSON
  • Updates to store all references to variants by their index in the main variant list, which is now formalized and explicitly stored with the genomic data
  • Introducing GenomicDatasetMerger which will take two genomic datasets and merge them appropriately

Still todo:

  • Unit testing for GenomicDatasetMerger
  • Documenting limitations of GenomicDatasetMerger, specifically in dealing with duplicate patients. Which it does not handle currently.

ramari16 added 30 commits May 25, 2023 11:35
Copy link
Member

@Luke-Sikina Luke-Sikina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few for now.

Long[] recordIndex;
try (ByteArrayOutputStream out = writeObject(value)) {
recordIndex = new Long[2];
synchronized (storage) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get some really difficult to debug concurrency problems here if a thread calls updateStorageDirectory while you're inside this synchronized block

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really, shouldn't the storage file name be immutable within the lifetime of this object? That would address my locking concerns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This code was just moved from somewhere else. I did not introduce it and am very hesitant to actually change it. I will think about this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did misunderstand your second comment originally -- the reason updateStorageDirectory exists is because the directory where this is built during the ETL process having to match the directory where the data was stored in HPDS was really annoying.

The creating, saving, loading, and actual usage of this class by HPDS is somewhat jumbled and unsafe right now, I agree.

import org.apache.commons.io.output.ByteArrayOutputStream;

public class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
public abstract class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't be a ton of work to make this implement Map<K, V>; As is, you're approximating a lot of methods from that interface while missing small details that make this code hard to reuse. You could just crib from java's UnmodifiableMap

import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public abstract class FileBackedJsonIndexStorage <K, V extends Serializable> extends FileBackedByteIndexedStorage<K, V> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're starting to create a pretty involved inheritance hierarchy. In my experience, these get difficult to read. We aren't in Java 17 yet, so you don't have sealed classes, which would help a lot. That said, you could approximate the concept of contained (bounded?) inheritance by putting your two implementing classes in this file.

Example: https://gist.github.com/Luke-Sikina/70d3fc83f34610623ea052d0ef04b5d8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. The implementing classes are in another package? Oof

Copy link
Contributor Author

@ramari16 ramari16 Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My most changes to this have actually completely decoupled reading/writing from the rest of the logic in this class so I think it would be pretty easy to get rid of the inheritance and introduce a dependency on an "objectMapper". Maybe...

@srpiatt
Copy link
Contributor

srpiatt commented Aug 10, 2023

Any particular reason why we're pushing without unit tests in this PR? Typically it's harder to do it later as a tech-debt item, especially if the code is formalized enough to be pushed.

Comment on lines +83 to +84
Long offsetInStorage = index.get(key)[0];
int offsetLength = index.get(key)[1].intValue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Long offsetInStorage = index.get(key)[0];
int offsetLength = index.get(key)[1].intValue();
Long offsetInStorage = offsetsInStorage[0];
int offsetLength = offsetsInStorage[1].intValue();

for (int y = 1; y < variantsSortedByOffsetList.size(); y++) {
if (variantsSortedByOffsetList.get(y).metadata.offset.equals(variantsSortedByOffsetList.get(y - 1).metadata.offset)) {
try {
System.out.println("Matching offsets : " + variantsSortedByOffsetList.get(y - 1).specNotation() + " : " + variantsSortedByOffsetList.get(y).specNotation() + ":" + maskMap.get(variantsSortedByOffsetList.get(y - 1).specNotation()).heterozygousMask.toString(2) + ":" + ":" + maskMap.get(variantsSortedByOffsetList.get(y).specNotation()).heterozygousMask.toString(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're not using logger in this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I thought about just deleting this class, I'm not even sure what it is for. I'll figure out if we are actually still using it anywhere

Copy link
Member

@Luke-Sikina Luke-Sikina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit tests? Idk. I ate too much and now I'm tired and bloated.

}

private void validate() {
if (!variantStore1.getVariantMaskStorage().keySet().equals(variantStore2.getVariantMaskStorage().keySet())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Shouldn't you just assume that a missing chromosome in one variantStore just means there are no muts in that chromosome for that store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically yes, although I don't think that is ever likely to happen. A missing chromosome for now strongly suggests invalid data, which actually did happen once. We don't have a great way to re-merge the data if we accidentally succeed without a chromosome that should be there.

Comment on lines +103 to +107
Map<String, Integer> variantSpecToIndexMap = new HashMap<>();
LinkedList<String> variantSpecList = new LinkedList<>(Arrays.asList(variantIndex1));
for (int i = 0; i < variantIndex1.length; i++) {
variantSpecToIndexMap.put(variantIndex1[i], i);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do it like this. Probably not cleaner, more of a thought exercise on my part:

Suggested change
Map<String, Integer> variantSpecToIndexMap = new HashMap<>();
LinkedList<String> variantSpecList = new LinkedList<>(Arrays.asList(variantIndex1));
for (int i = 0; i < variantIndex1.length; i++) {
variantSpecToIndexMap.put(variantIndex1[i], i);
}
Map<String, Integer> variantSpecToIndexMap = IntStream.range(0, variantIndex1.length).boxed()
.collect(Collectors.toMap((i) -> variantIndex1[i], (i) -> i));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, I see you use variantSpecList elsewhere

// data sets, the merged data set will use the index from dataset 1 to reference it, and any references in data
// set 2 for this variant needs to be re-mapped. Likewise, if a variant in set 2 is new, it will be appended to
// the list and also need to be re-mapped
Integer[] remappedIndexes = new Integer[variantIndex2.length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this boxed?

@Luke-Sikina Luke-Sikina added the enhancement New feature or request label Aug 18, 2023
@ramari16
Copy link
Contributor Author

ramari16 commented Aug 21, 2023

I did some refactoring last week to try to make unit testing possible/easier but it's still overly difficult because of the tight coupling with the file system, specifically FileBackedByteIndexedInfoStore. This is definitely something I'll address in v3 which is coming soon. FWIW, I do have some integration tests set up for this that I cannot check in because of the data they use.

Comment on lines 76 to 78
if(this.storage==null) {
synchronized(this) {
this.open();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you didn't write this, but I'm also worried that this is not as safe as it seems.

  1. Thread A evaluates this.storage==null as true
  2. Thread B evaluates this.storage==null as true
  3. Thread A enters the synchronized block
  4. Thread B blocks due to the monitor
  5. Thread A continues, starts executing the rest of the get
  6. Thread B unblocks, opens the file again, changing the reference
  7. ???

99% sure the synchronized block has to include the null check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. I made the change and added a todo comment to make this class immutable and remove the need for this check

@@ -51,8 +52,7 @@ public void complete() {
}

public FileBackedByteIndexedInfoStore(File storageFolder, InfoStore infoStore) throws IOException {
this.allValues = new FileBackedByteIndexedStorage(String.class, String[].class,
new File(storageFolder, infoStore.column_key + "_infoStoreStorage.javabin"));
this.allValues = new FileBackedStorageVariantIndexImpl(new File(storageFolder, infoStore.column_key + "_infoStoreStorage.javabin"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some System.out.println below here

@ramari16 ramari16 merged commit a67d8fe into genomic-v2-release/0001 Aug 23, 2023
@ramari16 ramari16 deleted the ALS-4461 branch April 23, 2024 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants