[ALS-4461] Allow incremental vcf loading #73

ramari16 · 2023-08-08T19:55:28Z

Changes to allow incremental vcf loading, as well as some changes to how we store genomic data since we're breaking backwards compatibility anyway. Mainly:

Allow FileBackedByteIndexedStorage to support serialization formats other than Java object serialization, specifically JSON
Updates to store all references to variants by their index in the main variant list, which is now formalized and explicitly stored with the genomic data
Introducing GenomicDatasetMerger which will take two genomic datasets and merge them appropriately

Still todo:

Unit testing for GenomicDatasetMerger
~~Documenting limitations of GenomicDatasetMerger, specifically that it does not currently handle duplicate patients.~~

…toring to support incremental vcf loading

…toring

…c) to Integer (variant id)

Luke-Sikina

Just a few for now.

Luke-Sikina · 2023-08-10T15:31:05Z

...n/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/storage/FileBackedByteIndexedStorage.java

+		Long[] recordIndex;
+		try (ByteArrayOutputStream out = writeObject(value)) {
+			recordIndex = new Long[2];
+			synchronized (storage) {


You can get some really difficult-to-debug concurrency problems here if a thread calls updateStorageDirectory while you're inside this synchronized block.

Really, shouldn't the storage file name be immutable within the lifetime of this object? That would address my locking concerns.

I agree. This code was just moved from somewhere else; I did not introduce it and am very hesitant to actually change it. I will think about this.

I did misunderstand your second comment originally. updateStorageDirectory exists because it was really annoying that the directory where this is built during the ETL process had to match the directory where the data was stored in HPDS.

The creating, saving, loading, and actual usage of this class by HPDS is somewhat jumbled and unsafe right now, I agree.
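To make the immutability suggestion above concrete, here is a minimal sketch, not the HPDS implementation. `FixedLocationStore`, `append`, `read`, and `relocatedTo` are hypothetical names. The idea is that the file location is final, all I/O goes through one private lock, and "moving" the store returns a new object instead of mutating a shared one.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: the storage file is fixed for the lifetime of the object,
// so no thread can swap it out while another thread is inside a synchronized block.
final class FixedLocationStore {
    private final File storageFile;           // never reassigned
    private final Object lock = new Object(); // private lock, never exposed or replaced

    FixedLocationStore(File storageFile) {
        this.storageFile = storageFile;
    }

    // Appends the bytes to the end of the file and returns {offset, length}.
    long[] append(byte[] bytes) throws IOException {
        synchronized (lock) {
            try (RandomAccessFile raf = new RandomAccessFile(storageFile, "rw")) {
                long offset = raf.length();
                raf.seek(offset);
                raf.write(bytes);
                return new long[] {offset, bytes.length};
            }
        }
    }

    // Reads `length` bytes starting at `offset`.
    byte[] read(long offset, int length) throws IOException {
        synchronized (lock) {
            try (RandomAccessFile raf = new RandomAccessFile(storageFile, "r")) {
                raf.seek(offset);
                byte[] buffer = new byte[length];
                raf.readFully(buffer);
                return buffer;
            }
        }
    }

    // Pointing at a different directory yields a new instance; this one is unchanged.
    FixedLocationStore relocatedTo(File newDirectory) {
        return new FixedLocationStore(new File(newDirectory, storageFile.getName()));
    }

    File storageFile() {
        return storageFile;
    }
}
```

With this shape, the ETL-versus-HPDS directory mismatch would be handled once at load time by calling `relocatedTo`, before the object is shared between threads.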

Luke-Sikina · 2023-08-10T15:50:50Z

...n/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/storage/FileBackedByteIndexedStorage.java

-import org.apache.commons.io.output.ByteArrayOutputStream;
-
-public class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
+public abstract class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {


It wouldn't be a ton of work to make this implement Map<K, V>. As is, you're approximating a lot of methods from that interface while missing small details that make this code hard to reuse. You could just crib from Java's UnmodifiableMap.
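One low-effort way to get a read-only Map<K, V> is to extend `java.util.AbstractMap`, which only requires `entrySet()` and already throws `UnsupportedOperationException` from `put`. The sketch below is an illustration under assumptions, not the storage class itself: `LazyReadOnlyMap` is a hypothetical name, and the `loader` function stands in for whatever reads and deserializes a value from the backing file.

```java
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical read-only Map view: keys come from an index, values are loaded on demand.
final class LazyReadOnlyMap<K, V> extends AbstractMap<K, V> {
    private final Set<K> keys;
    private final Function<K, V> loader; // stand-in for "read the value from storage"

    LazyReadOnlyMap(Set<K> keys, Function<K, V> loader) {
        this.keys = keys;
        this.loader = loader;
    }

    // Overridden so lookups do not iterate (and load) every entry.
    @Override
    public boolean containsKey(Object key) {
        return keys.contains(key);
    }

    @Override
    @SuppressWarnings("unchecked")
    public V get(Object key) {
        return keys.contains(key) ? loader.apply((K) key) : null;
    }

    @Override
    public Set<Map.Entry<K, V>> entrySet() {
        return new AbstractSet<Map.Entry<K, V>>() {
            @Override
            public Iterator<Map.Entry<K, V>> iterator() {
                final Iterator<K> keyIterator = keys.iterator();
                return new Iterator<Map.Entry<K, V>>() {
                    @Override
                    public boolean hasNext() {
                        return keyIterator.hasNext();
                    }

                    @Override
                    public Map.Entry<K, V> next() {
                        K key = keyIterator.next();
                        return new SimpleImmutableEntry<>(key, loader.apply(key));
                    }
                    // remove() is not overridden, so it throws UnsupportedOperationException.
                };
            }

            @Override
            public int size() {
                return keys.size();
            }
        };
    }
}
```

Callers then get `size()`, `isEmpty()`, `keySet()`, `values()`, `equals()`, and `hashCode()` with the usual Map semantics for free.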

Luke-Sikina · 2023-08-10T15:57:31Z

common/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/storage/FileBackedJsonIndexStorage.java

+import java.util.zip.GZIPInputStream;
+import java.util.zip.GZIPOutputStream;
+
+public abstract class FileBackedJsonIndexStorage <K, V extends Serializable> extends FileBackedByteIndexedStorage<K, V> {


You're starting to create a pretty involved inheritance hierarchy. In my experience, these get difficult to read. We aren't in Java 17 yet, so you don't have sealed classes, which would help a lot. That said, you could approximate the concept of contained (bounded?) inheritance by putting your two implementing classes in this file.

Example: https://gist.github.com/Luke-Sikina/70d3fc83f34610623ea052d0ef04b5d8

Oh I see. The implementing classes are in another package? Oof

My most recent changes to this have actually completely decoupled reading/writing from the rest of the logic in this class, so I think it would be pretty easy to get rid of the inheritance and introduce a dependency on an "objectMapper". Maybe...
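The composition idea floated here could look roughly like the sketch below. Everything in it is hypothetical: `ValueCodec`, `JavaSerializationCodec`, and `CodecBackedStore` are illustrative names, and an in-memory map stands in for the backing file. The store depends on a small codec interface instead of being subclassed once per serialization format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical strategy interface: the only thing that varies between formats.
interface ValueCodec<V> {
    byte[] encode(V value) throws IOException;

    V decode(byte[] bytes) throws IOException;
}

// One implementation: plain Java object serialization.
final class JavaSerializationCodec<V extends Serializable> implements ValueCodec<V> {
    @Override
    public byte[] encode(V value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(value);
        }
        return out.toByteArray();
    }

    @Override
    @SuppressWarnings("unchecked")
    public V decode(byte[] bytes) throws IOException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (V) ois.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}

// The store is a single concrete class; the format is injected, not inherited.
final class CodecBackedStore<K, V> {
    private final Map<K, byte[]> records = new HashMap<>(); // in-memory stand-in for the file
    private final ValueCodec<V> codec;

    CodecBackedStore(ValueCodec<V> codec) {
        this.codec = codec;
    }

    void put(K key, V value) throws IOException {
        records.put(key, codec.encode(value));
    }

    V get(K key) throws IOException {
        byte[] bytes = records.get(key);
        return bytes == null ? null : codec.decode(bytes);
    }
}
```

A JSON variant would be a second `ValueCodec` implementation that delegates to whichever JSON library the project uses, so adding a format would not add a level to the class hierarchy.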

srpiatt · 2023-08-10T18:07:05Z

Any particular reason why we're pushing without unit tests in this PR? Typically it's harder to do it later as a tech-debt item, especially if the code is formalized enough to be pushed.

srpiatt · 2023-08-10T18:14:11Z

...n/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/storage/FileBackedByteIndexedStorage.java

+				Long offsetInStorage = index.get(key)[0];
+				int offsetLength = index.get(key)[1].intValue();


Suggested change

-				Long offsetInStorage = index.get(key)[0];
-				int offsetLength = index.get(key)[1].intValue();
+				Long offsetInStorage = offsetsInStorage[0];
+				int offsetLength = offsetsInStorage[1].intValue();

srpiatt · 2023-08-10T18:49:48Z

etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/genotype/MultialleleCounter.java

+                for (int y = 1; y < variantsSortedByOffsetList.size(); y++) {
+                    if (variantsSortedByOffsetList.get(y).metadata.offset.equals(variantsSortedByOffsetList.get(y - 1).metadata.offset)) {
+                        try {
+                            System.out.println("Matching offsets : " + variantsSortedByOffsetList.get(y - 1).specNotation() + " : " + variantsSortedByOffsetList.get(y).specNotation() + ":" + maskMap.get(variantsSortedByOffsetList.get(y - 1).specNotation()).heterozygousMask.toString(2) + ":" + ":" + maskMap.get(variantsSortedByOffsetList.get(y).specNotation()).heterozygousMask.toString(2));


Is there a reason we're not using logger in this class?

No. I thought about just deleting this class; I'm not even sure what it is for. I'll figure out if we are actually still using it anywhere.

Luke-Sikina

Add unit tests? Idk. I ate too much and now I'm tired and bloated.

Luke-Sikina · 2023-08-10T17:20:54Z