[ALS-4461] Allow incremental vcf loading #73
Conversation
…toring to support incremental vcf loading
…c) to Integer (variant id)
Just a few for now.
Long[] recordIndex;
try (ByteArrayOutputStream out = writeObject(value)) {
    recordIndex = new Long[2];
    synchronized (storage) {
You can get some really difficult-to-debug concurrency problems here if a thread calls updateStorageDirectory while you're inside this synchronized block.
Really, shouldn't the storage file name be immutable within the lifetime of this object? That would address my locking concerns.
I agree. This code was just moved from somewhere else; I did not introduce it and am very hesitant to actually change it. I will think about this.
I did misunderstand your second comment originally -- updateStorageDirectory exists because it was really annoying that the directory where this is built during the ETL process had to match the directory where the data is stored in HPDS. The creating, saving, loading, and actual usage of this class by HPDS is somewhat jumbled and unsafe right now, I agree.
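A minimal sketch of the "immutable storage path" idea discussed above, assuming the backing file can be chosen once at construction time; the class and field names here are placeholders, not the actual members of FileBackedByteIndexedStorage:

```java
import java.io.File;
import java.io.Serializable;

// Hypothetical sketch: the backing file is fixed in the constructor and the field is final,
// so no thread can swap the directory out from under a writer mid-operation.
public abstract class ImmutablePathStorage<K, V extends Serializable> implements Serializable {

    private final File storageFile;   // fixed for the lifetime of the object

    protected ImmutablePathStorage(File storageFile) {
        this.storageFile = storageFile;
    }

    public File getStorageFile() {
        return storageFile;
    }

    // No updateStorageDirectory(...) mutator: relocating the data means building a new instance.
}
```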
import org.apache.commons.io.output.ByteArrayOutputStream;

- public class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
+ public abstract class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
It wouldn't be a ton of work to make this implement Map<K, V>; as is, you're approximating a lot of methods from that interface while missing small details that make this code hard to reuse. You could just crib from Java's UnmodifiableMap.
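A rough sketch of what implementing the standard interface could look like, assuming the store keeps an in-memory key-to-offset index; the member names below (index, readValue) are illustrative stand-ins, not the real fields of this class:

```java
import java.io.Serializable;
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Sketch only: extending AbstractMap gives containsKey, size, isEmpty, etc. for free,
// so callers can treat the store as a plain read-only Map.
public abstract class MapLikeByteIndexedStorage<K, V extends Serializable>
        extends AbstractMap<K, V> implements Serializable {

    protected final Map<K, Long[]> index = new ConcurrentHashMap<>();

    // The existing byte-offset read would back this method.
    protected abstract V readValue(K key);

    @Override
    @SuppressWarnings("unchecked")
    public V get(Object key) {
        return index.containsKey(key) ? readValue((K) key) : null;
    }

    @Override
    public Set<Entry<K, V>> entrySet() {
        // Eager for clarity; a production version would return a lazy, unmodifiable view.
        return index.keySet().stream()
                .<Entry<K, V>>map(k -> new SimpleImmutableEntry<>(k, readValue(k)))
                .collect(Collectors.toSet());
    }
}
```

Wrapping instances in Collections.unmodifiableMap would then make the read-only contract explicit to callers.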
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public abstract class FileBackedJsonIndexStorage <K, V extends Serializable> extends FileBackedByteIndexedStorage<K, V> {
You're starting to create a pretty involved inheritance hierarchy. In my experience, these get difficult to read. We aren't in Java 17 yet, so you don't have sealed classes, which would help a lot. That said, you could approximate the concept of contained (bounded?) inheritance by putting your two implementing classes in this file.
Example: https://gist.github.com/Luke-Sikina/70d3fc83f34610623ea052d0ef04b5d8
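For reference (this is an approximation of the pattern, not the gist's contents), one pre-Java-17 way to bound the hierarchy is a private constructor plus nested subclasses in the same file; the class names here are illustrative:

```java
import java.io.Serializable;

// Sketch of "bounded" inheritance without sealed classes: the abstract parent has a
// private constructor, so the only possible subclasses are the ones declared in this file.
public abstract class BoundedStorage<K, V extends Serializable> implements Serializable {

    private BoundedStorage() {
        // private: nothing outside this file can extend BoundedStorage
    }

    public abstract byte[] serialize(V value);

    public static final class JavabinStorage<K, V extends Serializable> extends BoundedStorage<K, V> {
        @Override
        public byte[] serialize(V value) {
            return new byte[0]; // Java object serialization would go here
        }
    }

    public static final class JsonStorage<K, V extends Serializable> extends BoundedStorage<K, V> {
        @Override
        public byte[] serialize(V value) {
            return new byte[0]; // JSON serialization would go here
        }
    }
}
```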
Oh I see. The implementing classes are in another package? Oof
My most recent changes to this have actually completely decoupled reading/writing from the rest of the logic in this class, so I think it would be pretty easy to get rid of the inheritance and introduce a dependency on an "objectMapper". Maybe...
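A sketch of that composition idea, assuming a small serializer strategy is injected into a single concrete storage class; the interface and class names are hypothetical, not part of this PR:

```java
import java.io.IOException;
import java.io.Serializable;

// Hypothetical composition approach: instead of subclassing the storage to change the wire
// format, inject a small "objectMapper"-like strategy that does the (de)serialization.
// One concrete storage class, two interchangeable codecs.
interface ValueCodec<V extends Serializable> {
    byte[] write(V value) throws IOException;
    V read(byte[] bytes) throws IOException;
}

class ComposedStorage<K, V extends Serializable> implements Serializable {

    private final ValueCodec<V> codec;

    ComposedStorage(ValueCodec<V> codec) {
        this.codec = codec;
    }

    // The existing offset bookkeeping stays here; only the byte conversion is delegated.
    byte[] toBytes(V value) throws IOException {
        return codec.write(value);
    }

    V fromBytes(byte[] bytes) throws IOException {
        return codec.read(bytes);
    }
}
```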
Any particular reason why we're pushing without unit tests in this PR? Typically it's harder to do it later as a tech-debt item, especially if the code is formalized enough to be pushed.
Long offsetInStorage = index.get(key)[0];
int offsetLength = index.get(key)[1].intValue();
- Long offsetInStorage = index.get(key)[0];
- int offsetLength = index.get(key)[1].intValue();
+ Long offsetInStorage = offsetsInStorage[0];
+ int offsetLength = offsetsInStorage[1].intValue();
for (int y = 1; y < variantsSortedByOffsetList.size(); y++) {
    if (variantsSortedByOffsetList.get(y).metadata.offset.equals(variantsSortedByOffsetList.get(y - 1).metadata.offset)) {
        try {
            System.out.println("Matching offsets : " + variantsSortedByOffsetList.get(y - 1).specNotation() + " : " + variantsSortedByOffsetList.get(y).specNotation() + ":" + maskMap.get(variantsSortedByOffsetList.get(y - 1).specNotation()).heterozygousMask.toString(2) + ":" + ":" + maskMap.get(variantsSortedByOffsetList.get(y).specNotation()).heterozygousMask.toString(2));
Is there a reason we're not using logger in this class?
No. I thought about just deleting this class; I'm not even sure what it is for. I'll figure out whether we are actually still using it anywhere.
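If the class does stay, a minimal logger setup would replace the System.out.println above. This assumes SLF4J is available on the classpath (not verified here), and the class name is just illustrative:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class VariantOffsetChecker { // illustrative name only

    private static final Logger log = LoggerFactory.getLogger(VariantOffsetChecker.class);

    void reportMatch(String previousSpec, String currentSpec) {
        // Parameterized logging avoids building the concatenated string unless the level is enabled.
        log.info("Matching offsets : {} : {}", previousSpec, currentSpec);
    }
}
```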
Add unit tests? Idk. I ate too much and now I'm tired and bloated.
}

private void validate() {
    if (!variantStore1.getVariantMaskStorage().keySet().equals(variantStore2.getVariantMaskStorage().keySet())) {
Why? Shouldn't a missing chromosome in one variantStore just mean there are no mutations in that chromosome for that store?
Technically yes, although I don't think that is ever likely to happen. For now, a missing chromosome strongly suggests invalid data, which actually did happen once. We don't have a great way to re-merge the data if we accidentally succeed without a chromosome that should be there.
Map<String, Integer> variantSpecToIndexMap = new HashMap<>();
LinkedList<String> variantSpecList = new LinkedList<>(Arrays.asList(variantIndex1));
for (int i = 0; i < variantIndex1.length; i++) {
    variantSpecToIndexMap.put(variantIndex1[i], i);
}
You could do it like this. Probably not cleaner, more of a thought exercise on my part:
- Map<String, Integer> variantSpecToIndexMap = new HashMap<>();
- LinkedList<String> variantSpecList = new LinkedList<>(Arrays.asList(variantIndex1));
- for (int i = 0; i < variantIndex1.length; i++) {
-     variantSpecToIndexMap.put(variantIndex1[i], i);
- }
+ Map<String, Integer> variantSpecToIndexMap = IntStream.range(0, variantIndex1.length).boxed()
+     .collect(Collectors.toMap((i) -> variantIndex1[i], (i) -> i));
nvm, I see you use variantSpecList elsewhere
// data sets, the merged data set will use the index from dataset 1 to reference it, and any references in data
// set 2 for this variant needs to be re-mapped. Likewise, if a variant in set 2 is new, it will be appended to
// the list and also need to be re-mapped
Integer[] remappedIndexes = new Integer[variantIndex2.length];
Why is this boxed?
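For reference, a primitive array avoids the per-element boxing. This assumes a sentinel value such as -1 is safe to mark "not yet remapped"; the helper below is purely illustrative:

```java
// Sketch: primitive indexes instead of Integer[]. variantIndex2 here is just an
// illustrative parameter; -1 marks slots that have not been remapped yet.
static int[] newRemappedIndexes(String[] variantIndex2) {
    int[] remappedIndexes = new int[variantIndex2.length];
    java.util.Arrays.fill(remappedIndexes, -1);
    return remappedIndexes;
}
```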
I did some refactoring last week to try to make unit testing possible/easier, but it's still overly difficult because of the tight coupling with the file system, specifically …
if(this.storage==null) {
    synchronized(this) {
        this.open();
I know you didn't write this, but I'm also worried that this is not as safe as it seems.
- Thread A evaluates this.storage==null as true
- Thread B evaluates this.storage==null as true
- Thread A enters the synchronized block
- Thread B blocks due to the monitor
- Thread A continues, starts executing the rest of the get
- Thread B unblocks, opens the file again, changing the reference
- ???
99% sure the synchronized block has to include the null check
I agree with this. I made the change and added a TODO comment to make this class immutable and remove the need for this check.
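For reference, a minimal version of the fix described above, with the null check inside the synchronized block so only one thread can open the file; the field and method names are placeholders, not the actual members of this class:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.RandomAccessFile;

// Sketch of the safe lazy open: check-and-assign happen under one monitor, so a thread
// that lost the race sees the already-opened file instead of reopening it.
class LazyOpenStorage {

    private final File storageFile;
    private RandomAccessFile storage;

    LazyOpenStorage(File storageFile) {
        this.storageFile = storageFile;
    }

    synchronized RandomAccessFile storage() throws FileNotFoundException {
        if (storage == null) {
            storage = new RandomAccessFile(storageFile, "rw");
        }
        return storage;
    }
}
```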
@@ -51,8 +52,7 @@ public void complete() {
 }

 public FileBackedByteIndexedInfoStore(File storageFolder, InfoStore infoStore) throws IOException {
-    this.allValues = new FileBackedByteIndexedStorage(String.class, String[].class,
-        new File(storageFolder, infoStore.column_key + "_infoStoreStorage.javabin"));
+    this.allValues = new FileBackedStorageVariantIndexImpl(new File(storageFolder, infoStore.column_key + "_infoStoreStorage.javabin"));
There are some System.out.println calls below here.
Changes to allow incremental vcf loading, as well as some changes to how we store genomic data since we're breaking backwards compatibility anyway. Mainly:
- Changes to FileBackedByteIndexedStorage to support serialization other than Java object serialization, specifically to add JSON
- GenomicDatasetMerger, which will take two genomic datasets and merge them appropriately

Still todo:
- GenomicDatasetMerger
- Documenting limitations of GenomicDatasetMerger, specifically in dealing with duplicate patients, which it does not handle currently.
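As a rough illustration of the JSON direction (not the PR's actual implementation), a subclass in the spirit of FileBackedJsonIndexStorage could route the byte conversion through Jackson and GZIP, matching the GZIPInputStream/GZIPOutputStream imports visible in the diff; the class and method names below are placeholders:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative sketch only: values are stored as gzipped JSON instead of Java object
// serialization. The real class in this PR may be shaped differently.
public abstract class JsonGzipCodec<V extends Serializable> {

    private final ObjectMapper mapper = new ObjectMapper();

    // Concrete subclasses supply the value type so Jackson can deserialize it.
    protected abstract Class<V> valueType();

    protected byte[] writeBytes(V value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            mapper.writeValue(gzip, value);
        }
        return bytes.toByteArray();
    }

    protected V readBytes(byte[] stored) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return mapper.readValue(gzip, valueType());
        }
    }
}
```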