Merge pull request #1194 from phac-nml/object-store
Object store
ericenns authored Feb 3, 2023
2 parents 3c22445 + 4e23959 commit 36f67dd
Showing 136 changed files with 4,155 additions and 1,320 deletions.
1 change: 1 addition & 0 deletions .github/workflows/ci-test.yml
@@ -39,6 +39,7 @@ jobs:
"galaxy_testing",
"galaxy_pipeline_testing",
"open_api_testing",
"file_system_testing",
]

steps:
2 changes: 1 addition & 1 deletion .github/workflows/linting.yml
@@ -61,6 +61,6 @@ jobs:
with:
github_token: ${{ secrets.github_token }}
reporter: github-pr-review
fail_error: true
fail_on_error: true
level: error
checkstyle_config: './checkstyle.xml'
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@
* [Developer]: Replaced Apache OLTU with Nimbusds for performing OAuth2 authentication flow during syncing and Galaxy exporting. See [PR 1432](https://github.com/phac-nml/irida/pull/1432)
* [Developer/UI]: Performance enhancements to the metadata uploader. See [PR 1445](https://github.com/phac-nml/irida/pull/1445).
* [Developer/UI]: Fix for updating sample modified date when metadata is deleted. See [PR 1457](https://github.com/phac-nml/irida/pull/1457).
* [Developer]: Added support for cloud based storage. Currently, Microsoft Azure Blob and Amazon AWS S3 are supported. [See PR 1194](https://github.com/phac-nml/irida/pull/1194)

## [22.09.7] - 2022/01/24
* [UI]: Fixed bugs on NCBI Export page preventing the NCBI `submission.xml` file from being properly written. See [PR 1451](https://github.com/phac-nml/irida/pull/1451)
15 changes: 15 additions & 0 deletions build.gradle.kts
@@ -174,6 +174,17 @@ dependencies {
exclude(group = "jakarta.validation", module = "jakarta.validation-api")
}

// Microsoft Azure
implementation("com.azure:azure-storage-blob:12.18.0") {
exclude(group = "jakarta.xml.bind", module = "jakarta.xml.bind-api")
exclude(group = "jakarta.activation", module = "jakarta.activation-api")
}

// Amazon AWS
implementation("com.amazonaws:aws-java-sdk-s3:1.12.326") {
exclude(group = "commons-logging", module = "commons-logging")
}

// Customized fastqc
implementation(files("${projectDir}/lib/jbzip2-0.9.jar"))
implementation(files("${projectDir}/lib/sam-1.103.jar"))
@@ -401,6 +412,10 @@ val integrationTestsMap = mapOf(
"tags" to "IntegrationTest & Galaxy & Pipeline",
"excludeListeners" to "ca.corefacility.bioinformatics.irida.junit5.listeners.*"
),
"fileSystem" to mapOf(
"tags" to "IntegrationTest & FileSystem",
"excludeListeners" to "ca.corefacility.bioinformatics.irida.junit5.listeners.*"
),
)

integrationTestsMap.forEach {
22 changes: 22 additions & 0 deletions doc/administrator/galaxy/cleanup/index.md
@@ -66,3 +66,25 @@ Once this script is installed, it can be scheduled to run periodically by adding
This will clean up any **deleted** files every day at 2:00 am. Log files will be stored in `galaxy/galaxy_cleanup.log` and `galaxy/cleanup_datasets/*.log`.

For more information please see the [Purging Histories and Datasets](https://galaxyproject.org/admin/config/performance/purge-histories-and-datasets/) document. ***Note: the metadata about each analysis will still be stored and available in Galaxy, but the data file contents will be permanently removed.***

# Cleaning up temporary files

When using Galaxy with an IRIDA instance that uses cloud based storage (Azure, AWS, etc.), files are uploaded from IRIDA rather than linked, since they are stored in the cloud and not on a shared filesystem. Because these uploaded files accumulate in Galaxy, it is a good idea to clean them up periodically. An example script that can be used to do so is provided below:

```bash
#!/bin/bash

GALAXY_ROOT_DIR=/path/to/galaxy-dist
CLEANUP_LOG=$GALAXY_ROOT_DIR/irida_galaxy_tmp_files_cleanup.log
TMP_FILES_DIR=$GALAXY_ROOT_DIR/databases/tmp/
NUMBER_OF_DAYS_OLD=30

# Set CONDA_ROOT to the conda installation used by Galaxy
CONDA_ROOT=/path/to/miniconda3
source $CONDA_ROOT/bin/activate galaxy

echo -e "\nBegin temporary file cleanup at `date`" >> $CLEANUP_LOG
find $TMP_FILES_DIR -mindepth 1 -mtime +$NUMBER_OF_DAYS_OLD -delete

echo -e "\nEnd temporary file cleanup at `date`" >> $CLEANUP_LOG
```

This can be added as a cleanup script and scheduled to run using cron.
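For example, assuming the script above is saved as `/path/to/galaxy/irida_cleanup_tmp_files.sh` (a hypothetical path), a crontab entry like the following would run it daily at 2:00 am, matching the schedule used for the dataset cleanup above:

```bash
# Hypothetical crontab entry: run the temporary-file cleanup daily at 2:00 am
0 2 * * * /bin/bash /path/to/galaxy/irida_cleanup_tmp_files.sh
```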
4 changes: 2 additions & 2 deletions doc/administrator/galaxy/index.md
@@ -59,9 +59,9 @@ The overall architecture of IRIDA and Galaxy is as follows:

![irida-galaxy.jpg][]

1. IRIDA manages all input files for a workflow. This includes sequencing reads, reference files, and the Galaxy workflow definition file. On execution of a workflow, references to these files are sent to a Galaxy instance using the [Galaxy API][]. It is assumed that these files exist on a file system shared between IRIDA and Galaxy.
1. IRIDA manages all input files for a workflow. This includes sequencing reads, reference files, and the Galaxy workflow definition file. On execution of a workflow, the input files are made available to a Galaxy instance using the [Galaxy API][]. If IRIDA is using cloud based storage (Azure, AWS, etc.), the files are downloaded to the IRIDA server and then uploaded to Galaxy; otherwise only references to the files are sent, and it is assumed that the files exist on a filesystem shared between IRIDA and Galaxy.
2. All tools used by a workflow are assumed to have been installed in Galaxy during the setup of IRIDA. The Galaxy workflow is uploaded to Galaxy and the necessary tools are executed by Galaxy. Galaxy can be setup to either execute tools on a local machine, or submit jobs to a cluster.
3. Once the workflow execution is complete, a copy of the results are downloaded into IRIDA and stored in the shared filesystem.
3. Once the workflow execution is complete, a copy of the results is downloaded into IRIDA and either stored on the shared filesystem or uploaded to the cloud based storage used by IRIDA.

[Docker]: https://www.docker.com/
[irida-galaxy.jpg]: images/irida-galaxy.jpg
2 changes: 1 addition & 1 deletion doc/administrator/galaxy/setup/index.md
@@ -13,7 +13,7 @@ This document describes the necessary steps for installing and integrating [Gala
The following must be set up before proceeding with the installation.

1. A machine that has been set up to install Galaxy. This could be the same machine as the IRIDA web interface, or (recommended) a separate machine.
2. A shared filesystem has been set up between IRIDA and Galaxy. If Galaxy will be submitting to a compute cluster this filesystem must also be shared with the cluster.
2. If using a local filesystem rather than cloud based storage, a shared filesystem has been set up between IRIDA and Galaxy. If Galaxy will be submitting to a compute cluster, this filesystem must also be shared with the cluster.

* this comment becomes the table of contents.
{:toc}
4 changes: 3 additions & 1 deletion doc/developer/getting-started/index.md
@@ -199,9 +199,10 @@ Gradle will download all required dependencies and run the full suite of unit te
##### Integration tests
{:.no_toc}

IRIDA has 5 integration test tasks which splits the integration test suite into functional groups. This allows GitHub Actions to run the tests in parallel, and local test executions to only run the required portion of the test suite. The 5 tasks are the following:
IRIDA has 6 integration test tasks which split the integration test suite into functional groups. This allows GitHub Actions to run the tests in parallel, and local test executions to only run the required portion of the test suite. The 6 tasks are the following:

* `serviceITest` - Runs the service layer and repository testing.
* `fileSystemITest` - Runs the file system tests.
* `uiITest` - Integration tests for IRIDA's web interface.
* `restITest` - Tests IRIDA's REST API.
* `galaxyITest` - Runs tests for IRIDA communicating with Galaxy. This profile will automatically start a test galaxy instance to test with.
@@ -217,6 +218,7 @@ As the integration tests simulate a running IRIDA installation, in order to run

Where <TEST PROFILE> is one of the following:
* `service_testing` - Runs the `serviceITest` task
* `file_system_testing` - Runs the `fileSystemITest` task
* `ui_testing` - Runs the `uiITest` task
* `rest_testing` - Runs the `restITest` task
* `galaxy_testing` - Runs the `galaxyITest` task
47 changes: 45 additions & 2 deletions doc/developer/setup/index.md
@@ -122,13 +122,56 @@ docker run hello-world

### Configure Filesystem Locations

IRIDA stores much of its metadata in the relational database, but all sequencing and analysis files are stored on the filesystem. Directory configuration is:
IRIDA stores much of its metadata in the relational database. As of IRIDA 23.01, you can use cloud based storage (BETA) as well as a local filesystem to store sequencing, reference, and analysis output files. Currently, Azure Blob and AWS S3 storage are supported.

Directory configuration is:

* **Sequencing Data**: `sequence.file.base.directory`
* **Reference Files**: `reference.file.base.directory`
* **Analysis Output**: `output.file.base.directory`

If the directories that are configured do not exist (they don't likely exist if you don't configure them), IRIDA will default to automatically creating a temporary directory using Java's [`Files.createTempDirectory`](http://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#createTempDirectory-java.lang.String-java.nio.file.attribute.FileAttribute...-).
If using a local filesystem and the configured directories do not exist (they likely don't exist if you haven't configured them), IRIDA will default to automatically creating a temporary directory using Java's [`Files.createTempDirectory`](http://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#createTempDirectory-java.lang.String-java.nio.file.attribute.FileAttribute...-).
However, if you are using cloud based storage you will still need to set these directories in the configuration, as they make up the virtual path to each file; no local directories will be created.
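For example, with cloud based storage configured, the same three properties act only as virtual key prefixes (the paths below are illustrative placeholders, not defaults):

```
sequence.file.base.directory=/opt/irida/data/sequence_files
reference.file.base.directory=/opt/irida/data/reference_files
output.file.base.directory=/opt/irida/data/output_files
```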

To setup IRIDA to use cloud based file storage, follow the instructions below for the storage type.

### Setup using Azure Storage Blob

In the configuration file (such as irida.conf) you will need to add these configuration values:

* `irida.storage.type=azure`
* `azure.container.name=CONTAINER_NAME`, where CONTAINER_NAME is a container previously set up on Azure
* `azure.container.url=CONTAINER_ENDPOINT_URL`
* `azure.sas.token=SAS_TOKEN`, where the SAS_TOKEN has both read and write permissions

See [Azure Storage Setup](https://learn.microsoft.com/en-us/azure/storage/blobs/) for instructions on how to set up Blob storage.

Microsoft has also made available a storage emulator, `Azurite`, which can be used to develop and test Azure storage functionality on a local machine instead of requiring an Azure Storage account. See [Microsoft Azurite](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite).

To use `Azurite` you will need the `SAS` token for the blob container. It can be retrieved by setting up [Azure Storage Explorer](https://learn.microsoft.com/en-us/azure/vs-azure-tools-storage-manage-with-storage-explorer) and adding a new resource (`Local Storage Emulator`). Once that is set up, click the `Blob Containers` menu item in the explorer, right click on the container created by default (`test`), and click `Get Shared Access Signature`. In the popup window, select the date range of validity and the permissions (Read, Write, Delete, List, Add, and Create) for the token, then click `Create`.

Once you have Azurite set up and the container created, update these configuration values (in irida.conf, etc.):
* `irida.storage.type=azure`
* `azure.container.name=test`
* `azure.container.url=http://127.0.0.1:10000/devstoreaccount1/test?SAS_TOKEN_RETRIEVED_ABOVE`
* `azure.sas.token=SAS_TOKEN_RETRIEVED_ABOVE` where the SAS_TOKEN has both read/write permissions

If using `Azurite`, make sure it is running before starting up IRIDA.
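As a sketch, IRIDA's own integration test script starts Azurite as a Docker container; a similar invocation (the container name here is arbitrary) works for local development:

```bash
# Start the Azurite emulator in Docker; ports 10000/10001/10002 serve the
# blob/queue/table endpoints respectively
docker run -d -p 10000:10000 -p 10001:10001 -p 10002:10002 \
    --name irida-azurite mcr.microsoft.com/azure-storage/azurite
```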

### Setup using Amazon AWS S3 Bucket Storage

In the configuration file (such as irida.conf) you will need to add these configuration values:

* `irida.storage.type=aws`
* `aws.bucket.name=BUCKET_NAME`, where BUCKET_NAME is an S3 bucket previously set up with read/write permissions
* `aws.bucket.region=BUCKET_REGION`
* `aws.access.key=ACCESS_KEY`
* `aws.secret.key=SECRET_KEY`

See [AWS S3 Bucket Storage Setup](https://docs.aws.amazon.com/AmazonS3/latest/userguide/GetStartedWithS3.html) for instructions on how to set up S3 storage.

No other configuration is necessary in IRIDA to use cloud based storage. After adding these values to the configuration file you should be able to start up IRIDA, and it will use the cloud based storage that is configured.
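For local development and testing against S3, an S3-compatible mock can stand in for a real bucket; IRIDA's integration test script uses the `adobe/s3mock` container, started roughly like this (the container name is arbitrary):

```bash
# Start S3Mock in Docker; port 9090 serves HTTP and 9191 serves HTTPS
docker run -d -p 9090:9090 -p 9191:9191 --name irida-s3mock -t adobe/s3mock
```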


### Testing IRIDA

27 changes: 25 additions & 2 deletions run-tests.sh
@@ -11,6 +11,8 @@ JDBC_URL=jdbc:mysql://$DATABASE_HOST:$DATABASE_PORT/$DATABASE_NAME
TMP_DIRECTORY=`mktemp -d /tmp/irida-test-XXXXXXXX`
chmod 777 $TMP_DIRECTORY # Needs to be world-accessible so that Docker/Galaxy can access

S3MOCK_DOCKER_NAME=irida-docker-s3Mock
AZURITE_DOCKER_NAME=irida-docker-azurite
GALAXY_DOCKER=phacnml/galaxy-irida-20.09:21.05.2-it
GALAXY_DOCKER_NAME=irida-galaxy-test
GALAXY_PORT=48889
@@ -104,6 +106,18 @@ test_service() {
return $exit_code
}

test_file_system() {
docker run -d -p 9090:9090 -p 9191:9191 --name $S3MOCK_DOCKER_NAME -t adobe/s3mock

docker run -d -p 10000:10000 -p 10001:10001 -p 10002:10002 --name $AZURITE_DOCKER_NAME mcr.microsoft.com/azure-storage/azurite
./gradlew clean check fileSystemITest -Dspring.datasource.url=$JDBC_URL -Dfile.processing.decompress=true -Dirida.it.rootdirectory=$TMP_DIRECTORY -Dspring.datasource.dbcp2.max-wait=$DB_MAX_WAIT_MILLIS $@
exit_code=$?

docker rm -f -v $S3MOCK_DOCKER_NAME
docker rm -f -v $AZURITE_DOCKER_NAME
return $exit_code
}

test_rest() {
./gradlew clean check restITest -Dspring.datasource.url=$JDBC_URL -Dfile.processing.decompress=true -Dirida.it.rootdirectory=$TMP_DIRECTORY -Dspring.datasource.dbcp2.max-wait=$DB_MAX_WAIT_MILLIS $@
exit_code=$?
@@ -169,7 +183,7 @@ test_open_api() {
}

test_all() {
for test_profile in test_rest test_service test_ui test_galaxy test_galaxy_pipelines test_open_api;
for test_profile in test_rest test_service test_ui test_galaxy test_galaxy_pipelines test_open_api test_file_system;
do
tmp_dir_cleanup
eval $test_profile
@@ -198,9 +212,11 @@
then
echo -e "\t--no-kill-docker: Do not kill Galaxy Docker after Galaxy tests have run."
echo -e "\t--no-headless: Do not run chrome in headless mode (for viewing results of UI tests)."
echo -e "\t--selenium-docker: Use selenium/standalone-chrome docker container for executing UI tests."
echo -e "\ttest_type: One of the IRIDA test types {service_testing, ui_testing, rest_testing, galaxy_testing, galaxy_pipeline_testing, open_api_testing, all}."
echo -e "\ttest_type: One of the IRIDA test types {service_testing, ui_testing, rest_testing, galaxy_testing, galaxy_pipeline_testing, open_api_testing, file_system_testing, all}."
echo -e "\t[gradle options]: Additional options to pass to 'gradle'. In particular, can pass '--test ca.corefacility.bioinformatics.irida.fully.qualified.name' to run tests from a particular class.\n"
echo -e "Examples:\n"
echo -e "$0 file_system_testing\n"
echo -e "\tThis will run the file system tests for IRIDA, cleaning up the test database/docker containers first.\n"
echo -e "$0 service_testing\n"
echo -e "\tThis will test the Service layer of IRIDA, cleaning up the test database/docker containers first.\n"
echo -e "$0 -d irida_integration_test2 galaxy_testing\n"
@@ -301,6 +317,13 @@ case "$1" in
exit_code=$?
posttest_cleanup
;;
file_system_testing)
shift
pretest_cleanup
test_file_system $@
exit_code=$?
posttest_cleanup
;;
all)
shift
pretest_cleanup
@@ -25,6 +25,7 @@
import ca.corefacility.bioinformatics.irida.pipeline.upload.galaxy.GalaxyWorkflowService;
import ca.corefacility.bioinformatics.irida.plugins.IridaPlugin;
import ca.corefacility.bioinformatics.irida.plugins.IridaPluginException;
import ca.corefacility.bioinformatics.irida.repositories.filesystem.IridaFileStorageUtility;
import ca.corefacility.bioinformatics.irida.repositories.sample.SampleRepository;
import ca.corefacility.bioinformatics.irida.service.AnalysisService;
import ca.corefacility.bioinformatics.irida.service.AnalysisSubmissionService;
@@ -105,6 +106,10 @@ public class AnalysisExecutionServiceConfig {
@Autowired
private List<AnalysisSampleUpdater> defaultAnalysisSampleUpdaters;

@Autowired
private IridaFileStorageUtility iridaFileStorageUtility;

private List<AnalysisSampleUpdater> loadPluginAnalysisSampleUpdaters() {
List<AnalysisSampleUpdater> pluginUpdaters = Lists.newLinkedList();

@@ -159,7 +164,7 @@ public AnalysisWorkspaceServiceGalaxy analysisWorkspaceService() {
return new AnalysisWorkspaceServiceGalaxy(galaxyHistoriesService, galaxyWorkflowService,
galaxyLibrariesService, iridaWorkflowsService, analysisCollectionServiceGalaxy(),
analysisProvenanceService(), analysisParameterServiceGalaxy,
sequencingObjectService);
sequencingObjectService, iridaFileStorageUtility);
}

@Lazy
@@ -171,6 +176,6 @@ public AnalysisProvenanceServiceGalaxy analysisProvenanceService() {
@Lazy
@Bean
public AnalysisCollectionServiceGalaxy analysisCollectionServiceGalaxy() {
return new AnalysisCollectionServiceGalaxy(galaxyHistoriesService);
return new AnalysisCollectionServiceGalaxy(galaxyHistoriesService, iridaFileStorageUtility);
}
}
@@ -76,7 +76,7 @@ public class ExecutionManagerConfig {

/**
* Builds a new ExecutionManagerGalaxy from the given properties.
*
*
* @return An ExecutionManagerGalaxy.
* @throws ExecutionManagerConfigurationException If no execution manager is configured.
*/
@@ -89,7 +89,7 @@ public ExecutionManagerGalaxy executionManager() throws ExecutionManagerConfigur

/**
* Builds a new ExecutionManagerGalaxy given the following environment properties.
*
*
* @param urlProperty The property defining the URL to Galaxy.
* @param apiKeyProperty The property defining the API key to Galaxy.
* @param emailProperty The property defining the account email in Galaxy.
@@ -111,7 +111,7 @@ private ExecutionManagerGalaxy buildExecutionManager(String urlProperty, String

/**
* Gets and validates a GalaxyAccountEmail from the given property.
*
*
* @param emailProperty The property to find the email address.
* @return A valid GalaxyAccountEmail.
* @throws ExecutionManagerConfigurationException If the properties value was invalid.
@@ -132,7 +132,7 @@ private GalaxyAccountEmail getGalaxyEmail(String emailProperty) throws Execution

/**
* Gets and validates a Galaxy API key from the given property.
*
*
* @param apiKeyProperty The API key property to get.
* @return A API key for Galaxy.
* @throws ExecutionManagerConfigurationException If the given properties value was invalid.
@@ -149,7 +149,7 @@ private String getAPIKey(String apiKeyProperty) throws ExecutionManagerConfigura

/**
* Gets and validates the given property for a Galaxy url.
*
*
* @param urlProperty The property with the Galaxy URL.
* @return A valid Galaxy URL.
* @throws ExecutionManagerConfigurationException If the properties value was invalid.
@@ -170,7 +170,7 @@ private URL getGalaxyURL(String urlProperty) throws ExecutionManagerConfiguratio

/**
* Gets and validates a property with the storage strategy for Galaxy.
*
*
* @param dataStorageProperty The property with the storage strategy for Galaxy.
* @return The corresponding storage strategy object, defaults to DEFAULT_DATA_STORAGE if invalid.
*/