
[ALS-4461] Allow incremental vcf loading #73

Merged 40 commits on Aug 23, 2023
6287251
ALS-4461: Deserialize variant index from disk
ramari16 May 25, 2023
3537737
ALS-4461: Add variant index builder for VCF loading
ramari16 May 25, 2023
620a5af
ALS-4461: Upgrade major version
ramari16 May 31, 2023
c0ad4a4
ALS-4461: Upgrade major version
ramari16 May 31, 2023
5afa1fe
ALS-4461: Store variants by index instead of full variant spec. Refac…
ramari16 Jun 15, 2023
dbf3a2f
ALS-4461: Initial commit for genomic dataset merger
ramari16 Jun 15, 2023
01efbf0
ALS-4461: Add jar with dependencies build instructions
ramari16 Jun 16, 2023
68e849e
ALS-4461: Fix issue with hardcoded directory
ramari16 Jun 20, 2023
a141f92
ALS-4461: Fix more issues with non-relative file paths, various refac…
ramari16 Jun 21, 2023
160bc87
ALS-4461: Fix more issues with non-relative file paths, various refac…
ramari16 Jun 21, 2023
ee6ee2f
ALS-4461: Parallelize chromosome mask merging
ramari16 Jun 21, 2023
df366d0
ALS-4461: Updated hpds version in dockerfile
ramari16 Jun 22, 2023
5e3a93e
ALS-4461: Update genomic directory on loading for variant index stores
ramari16 Jun 26, 2023
7433fde
ALS-4461: Change type of variant index store from String (variant spe…
ramari16 Jun 29, 2023
f508062
ALS-4461: Refactor duplicated variant store read/write code
ramari16 Jul 5, 2023
8da85d0
ALS-4461: Refactor duplicated variant store read/write code
ramari16 Jul 5, 2023
295f52e
ALS-4461: Fixing thread issues at startup
ramari16 Jul 7, 2023
36904a8
ALS-4461: Fixing thread issues at startup
ramari16 Jul 7, 2023
9e22c10
Testing GET/POST bug
ramari16 Jul 24, 2023
747b0a1
Revert
ramari16 Jul 26, 2023
c7afde5
Revert
ramari16 Jul 26, 2023
5e56f09
ALS-4461: Fix error handling
ramari16 Aug 8, 2023
a2749e7
Merge master into ALS-4461
ramari16 Aug 8, 2023
7de09ff
ALS-4461: Rollback testing change
ramari16 Aug 9, 2023
15fe3bf
ALS-4461: Clean up error handling in file backed storages
ramari16 Aug 9, 2023
52cc4a0
ALS-4461: Clean up error handling in file backed storages
ramari16 Aug 9, 2023
88ea1c2
ALS-4461: Remove IOExceptions thrown from FBBIS
ramari16 Aug 9, 2023
df03002
ALS-4461: Remove IOExceptions thrown from FBBIS
ramari16 Aug 9, 2023
b0d13ea
ALS-4461: Fix deserialization issue
ramari16 Aug 9, 2023
28b2672
ALS-4461: Add comment explaining chromosome index merging
ramari16 Aug 10, 2023
9bc9e21
ALS-4461: Add comments
ramari16 Aug 10, 2023
b22cc1b
ALS-4461: Changes per PR
ramari16 Aug 11, 2023
15c9f00
ALS-4461: Refactor variant spec index to make testing easier
ramari16 Aug 17, 2023
0a5fde5
ALS-4461: Refactor genomic dataset merger to support testing
ramari16 Aug 17, 2023
abb2659
ALS-4461: Add validation to prevent patient id duplicates
ramari16 Aug 21, 2023
93c5a17
ALS-4461: Fix GenomicDatasetMerger name in jar with dependencies config
ramari16 Aug 21, 2023
3bb4fe0
ALS-4461: Add main args validation
ramari16 Aug 21, 2023
c2a0f38
ALS-4461: Remove unused method
ramari16 Aug 21, 2023
ed9a5d4
ALS-4461: Remove unused classes
ramari16 Aug 23, 2023
f6ea975
ALS-4461: Remove potential race condition
ramari16 Aug 23, 2023
4 changes: 2 additions & 2 deletions client-api/pom.xml
@@ -4,12 +4,12 @@
<parent>
<artifactId>pic-sure-hpds</artifactId>
<groupId>edu.harvard.hms.dbmi.avillach.hpds</groupId>
-		<version>1.0-SNAPSHOT</version>
+		<version>2.0.0-SNAPSHOT</version>
</parent>

<groupId>edu.harvard.hms.dbmi.avillach.hpds</groupId>
<artifactId>client-api</artifactId>
-	<version>1.0-SNAPSHOT</version>
+	<version>2.0.0-SNAPSHOT</version>

<name>client-api</name>
<!-- FIXME change it to the project's website -->
10 changes: 9 additions & 1 deletion common/pom.xml
@@ -6,7 +6,7 @@
<parent>
<artifactId>pic-sure-hpds</artifactId>
<groupId>edu.harvard.hms.dbmi.avillach.hpds</groupId>
-		<version>1.0-SNAPSHOT</version>
+		<version>2.0.0-SNAPSHOT</version>
</parent>

<artifactId>common</artifactId>
@@ -21,5 +21,13 @@
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
+		<dependency>
+			<groupId>org.codehaus.jackson</groupId>
+			<artifactId>jackson-core-asl</artifactId>
+		</dependency>
+		<dependency>
+			<groupId>org.codehaus.jackson</groupId>
+			<artifactId>jackson-mapper-asl</artifactId>
+		</dependency>
</dependencies>
</project>
@@ -1,44 +1,52 @@
package edu.harvard.hms.dbmi.avillach.hpds.storage;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.io.*;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.commons.io.output.ByteArrayOutputStream;

-public class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
+public abstract class FileBackedByteIndexedStorage <K, V extends Serializable> implements Serializable {
Review comment (Member):
It wouldn't be a ton of work to make this implement Map<K, V>; As is, you're approximating a lot of methods from that interface while missing small details that make this code hard to reuse. You could just crib from java's UnmodifiableMap
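The suggestion above can be sketched roughly as follows. `IndexedStore`, `store`, and the `HashMap` backing are hypothetical stand-ins for the file-backed index, not code from this PR; the point is that `AbstractMap` only requires `entrySet()`, and a read-only view can crib from `Collections.unmodifiableMap` as the reviewer notes.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: expose an index-backed store through the Map interface
// by extending AbstractMap and overriding entrySet() plus the hot read path.
class IndexedStore<K, V> extends AbstractMap<K, V> {
    // Stands in for the ConcurrentHashMap<K, Long[]> index + RandomAccessFile pair.
    private final Map<K, V> backing = new HashMap<>();

    // Write path stays a named method so the Map view can remain unmodifiable.
    void store(K key, V value) {
        backing.put(key, value);
    }

    @Override
    public V get(Object key) {
        return backing.get(key); // direct lookup instead of AbstractMap's entrySet scan
    }

    @Override
    public Set<Map.Entry<K, V>> entrySet() {
        // Read-only view, cribbing from Collections.unmodifiableMap as suggested.
        return Collections.unmodifiableMap(backing).entrySet();
    }
}
```

Callers would then get `size()`, `containsKey()`, iteration, and so on for free from `AbstractMap`, which is the reuse the comment is after.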

private static final long serialVersionUID = -7297090745384302635L;
private transient RandomAccessFile storage;
private ConcurrentHashMap<K, Long[]> index;
private File storageFile;
private boolean completed = false;
private Long maxStorageSize; //leave this in to not break serialization
protected transient RandomAccessFile storage;
protected ConcurrentHashMap<K, Long[]> index;
protected File storageFile;
protected boolean completed = false;


public FileBackedByteIndexedStorage(Class<K> keyClass, Class<V> valueClass, File storageFile) throws FileNotFoundException {
this.index = new ConcurrentHashMap<K, Long[]>();
this.storageFile = storageFile;
this.storage = new RandomAccessFile(this.storageFile, "rw");
}

public void updateStorageDirectory(File storageDirectory) {
if (!storageDirectory.isDirectory()) {
throw new IllegalArgumentException("storageDirectory is not a directory");
}
String currentStoreageFilename = storageFile.getName();
storageFile = new File(storageDirectory, currentStoreageFilename);
}

public Set<K> keys(){
return index.keySet();
}

-	public void put(K key, V value) throws IOException {
+	public void put(K key, V value) {
if(completed) {
throw new RuntimeException("A completed FileBackedByteIndexedStorage cannot be modified.");
}
-		Long[] recordIndex = store(value);
+		Long[] recordIndex;
try (ByteArrayOutputStream out = writeObject(value)) {
recordIndex = new Long[2];
synchronized (storage) {
Review comment (Member):

You can get some really difficult to debug concurrency problems here if a thread calls updateStorageDirectory while you're inside this synchronized block

Review comment (Member):
Really, shouldn't the storage file name be immutable within the lifetime of this object? That would address my locking concerns.

Reply (Contributor, author):
I agree. This code was just moved from somewhere else. I did not introduce it and am very hesitant to actually change it. I will think about this

Reply (Contributor, author):
I did misunderstand your second comment originally -- the reason updateStorageDirectory exists is that requiring the directory where this is built during the ETL process to match the directory where the data is stored in HPDS was really annoying.

The creating, saving, loading, and actual usage of this class by HPDS is somewhat jumbled and unsafe right now, I agree.
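The reviewer's immutability suggestion could look roughly like the sketch below; the class and method names are hypothetical, not from this PR. Resolving the path once in the constructor makes the file name fixed for the object's lifetime, so `updateStorageDirectory()` and its interaction with the `synchronized (storage)` block disappear.

```java
import java.io.File;

// Hypothetical sketch: resolve the storage path once, at construction,
// so the file name is immutable and no mutator can race the storage lock.
class ImmutableStorageConfig {
    private final File storageFile; // never reassigned after construction

    ImmutableStorageConfig(File storageDirectory, String fileName) {
        if (!storageDirectory.isDirectory()) {
            throw new IllegalArgumentException("storageDirectory is not a directory");
        }
        this.storageFile = new File(storageDirectory, fileName);
    }

    File storageFile() {
        return storageFile;
    }
}
```

The ETL-vs-runtime directory mismatch the author describes would then be handled by passing the right directory at load time rather than mutating the object afterward.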

storage.seek(storage.length());
recordIndex[0] = storage.getFilePointer();
storage.write(out.toByteArray());
recordIndex[1] = storage.getFilePointer() - recordIndex[0];
}
} catch (IOException e) {
throw new UncheckedIOException(e);
}
index.put(key, recordIndex);
}

@@ -63,60 +71,43 @@ public void complete() {
this.completed = true;
}

public boolean isComplete() {
return this.completed;
}

private Long[] store(V value) throws IOException {

ByteArrayOutputStream out = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out));
oos.writeObject(value);
oos.flush();
oos.close();

Long[] recordIndex = new Long[2];
synchronized(storage) {
storage.seek(storage.length());
recordIndex[0] = storage.getFilePointer();
storage.write(out.toByteArray());
recordIndex[1] = storage.getFilePointer() - recordIndex[0];
// maxStorageSize = storage.getFilePointer();
}
return recordIndex;
}

public V get(K key) throws IOException {
if(this.storage==null) {
synchronized(this) {
this.open();
}
}
Long[] offsetsInStorage = index.get(key);
if(offsetsInStorage != null) {
Long offsetInStorage = index.get(key)[0];
int offsetLength = index.get(key)[1].intValue();
if(offsetInStorage != null && offsetLength>0) {
byte[] buffer = new byte[offsetLength];
synchronized(storage) {
storage.seek(offsetInStorage);
storage.readFully(buffer);
public V get(K key) {
try {
if(this.storage==null) {
synchronized(this) {
this.open();
Review comment (Member):
I know you didn't write this, but I'm also worried that this is not as safe as it seems.

  1. Thread A evaluates this.storage==null as true
  2. Thread B evaluates this.storage==null as true
  3. Thread A enters the synchronized block
  4. Thread B blocks due to the monitor
  5. Thread A continues, starts executing the rest of the get
  6. Thread B unblocks, opens the file again, changing the reference
  7. ???

99% sure the synchronized block has to include the null check

Reply (Contributor, author):
I agree with this. I made the change and added a todo comment to make this class immutable and remove the need for this check
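The fix the thread converges on can be sketched like this. The file handle is replaced by a plain `Object` plus an `open()` counter so the sketch is self-contained; this is not the actual HPDS class. The essential change is that the null check sits inside the monitor, so the check-then-open sequence is atomic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of safe lazy initialization: two threads calling
// getStorage() concurrently can no longer both see null and both call open().
class LazyResource {
    static final AtomicInteger opens = new AtomicInteger(); // counts open() calls

    private Object storage; // guarded by "this"

    private Object open() {
        opens.incrementAndGet();
        return new Object();
    }

    Object getStorage() {
        synchronized (this) { // null check now inside the synchronized block
            if (storage == null) {
                storage = open();
            }
            return storage;
        }
    }
}
```

An alternative is double-checked locking with a `volatile` field to skip the monitor on the fast path, but the plain synchronized form above is the simpler correct version of what the reviewer describes.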

}
ObjectInputStream in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(buffer)));

try {
V readObject = (V) in.readObject();
}
Long[] offsetsInStorage = index.get(key);
if(offsetsInStorage != null) {
Long offsetInStorage = index.get(key)[0];
int offsetLength = index.get(key)[1].intValue();
Comment on lines +84 to +85
Review comment (Contributor):
Suggested change:
-			Long offsetInStorage = index.get(key)[0];
-			int offsetLength = index.get(key)[1].intValue();
+			Long offsetInStorage = offsetsInStorage[0];
+			int offsetLength = offsetsInStorage[1].intValue();

if(offsetInStorage != null && offsetLength>0) {
byte[] buffer = new byte[offsetLength];
synchronized(storage) {
storage.seek(offsetInStorage);
storage.readFully(buffer);
}
V readObject = readObject(buffer);
return readObject;
} catch (ClassNotFoundException e) {
throw new RuntimeException("This should never happen.");
} finally {
in.close();
}else {
return null;
}
}else {
} else {
return null;
}
} else {
return null;
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}

protected abstract V readObject(byte[] buffer);

protected abstract ByteArrayOutputStream writeObject(V value) throws IOException;

public V getOrELse(K key, V defaultValue) {
V result = get(key);
return result == null ? defaultValue : result;
}

}
@@ -0,0 +1,32 @@
package edu.harvard.hms.dbmi.avillach.hpds.storage;

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FileBackedJavaIndexedStorage <K, V extends Serializable> extends FileBackedByteIndexedStorage<K, V> {
public FileBackedJavaIndexedStorage(Class<K> keyClass, Class<V> valueClass, File storageFile) throws FileNotFoundException {
super(keyClass, valueClass, storageFile);
}

protected ByteArrayOutputStream writeObject(V value) throws IOException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out));
oos.writeObject(value);
oos.flush();
oos.close();
return out;
}

@Override
protected V readObject(byte[] buffer) {
try (ObjectInputStream in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(buffer)));) {
V readObject = (V) in.readObject();
return readObject;
} catch (IOException e) {
throw new UncheckedIOException(e);
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
}
}
}
@@ -0,0 +1,40 @@
package edu.harvard.hms.dbmi.avillach.hpds.storage;

import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public abstract class FileBackedJsonIndexStorage <K, V extends Serializable> extends FileBackedByteIndexedStorage<K, V> {
Review comment (Member):
You're starting to create a pretty involved inheritance hierarchy. In my experience, these get difficult to read. We aren't in Java 17 yet, so you don't have sealed classes, which would help a lot. That said, you could approximate the concept of contained (bounded?) inheritance by putting your two implementing classes in this file.

Example: https://gist.github.com/Luke-Sikina/70d3fc83f34610623ea052d0ef04b5d8

Review comment (Member):
Oh I see. The implementing classes are in another package? Oof

Reply (ramari16, author, Aug 10, 2023):

My most recent changes to this have actually completely decoupled reading/writing from the rest of the logic in this class, so I think it would be pretty easy to get rid of the inheritance and introduce a dependency on an "objectMapper". Maybe...
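The author's composition idea could be sketched as an injected codec, so a single storage class serves both the Java-serialization and Jackson/JSON cases instead of one subclass per format. The names (`Codec`, `CodecDemo`, the `UTF8` codec) are hypothetical; a real version would pair an `ObjectOutputStream`-based codec with an `ObjectMapper`-based one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the serialization strategy becomes a small injected
// interface rather than an abstract readObject/writeObject method pair.
interface Codec<V> {
    byte[] write(V value);

    V read(byte[] buffer);
}

class CodecDemo {
    // A trivial String codec standing in for ObjectOutputStream or ObjectMapper.
    static final Codec<String> UTF8 = new Codec<String>() {
        public byte[] write(String v) {
            return v.getBytes(StandardCharsets.UTF_8);
        }

        public String read(byte[] b) {
            return new String(b, StandardCharsets.UTF_8);
        }
    };
}
```

The storage class would then take a `Codec<V>` in its constructor, which also sidesteps the cross-package inheritance the reviewer objects to.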

private static final long serialVersionUID = -1086729119489479152L;

protected transient ObjectMapper objectMapper = new ObjectMapper();

public FileBackedJsonIndexStorage(File storageFile) throws FileNotFoundException {
super(null, null, storageFile);
}

protected ByteArrayOutputStream writeObject(V value) throws IOException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
objectMapper.writeValue(new GZIPOutputStream(out), value);
return out;
}

protected V readObject(byte[] buffer) {
try {
return objectMapper.readValue(new GZIPInputStream(new ByteArrayInputStream(buffer)), getTypeReference());
} catch (IOException e) {
throw new RuntimeException(e);
}
}

// Required to populate the objectMapper on deserialization
private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
in.defaultReadObject();
objectMapper = new ObjectMapper();
}

public abstract TypeReference<V> getTypeReference();
}
2 changes: 1 addition & 1 deletion data/pom.xml
@@ -5,7 +5,7 @@
<parent>
<artifactId>pic-sure-hpds</artifactId>
<groupId>edu.harvard.hms.dbmi.avillach.hpds</groupId>
-		<version>1.0-SNAPSHOT</version>
+		<version>2.0.0-SNAPSHOT</version>
</parent>

<artifactId>data</artifactId>