Deprecate and remove export to parquet feature #3156 (#3228)
* Deprecate and remove export to parquet feature #3156

Signed-off-by: Paul Bastide <[email protected]>

* Update Parquet to Deprecated

Signed-off-by: Paul Bastide <[email protected]>
prb112 authored Jan 26, 2022
1 parent 07a077b commit 8d5cc0f
Showing 11 changed files with 22 additions and 458 deletions.
4 changes: 2 additions & 2 deletions docs/src/pages/guides/FHIRBulkOperations.md
@@ -2,7 +2,7 @@
layout: post
title: IBM FHIR Server Bulk Data Guide
description: IBM FHIR Server Bulk Data Guide
- date: 2021-03-10
+ date: 2022-01-20
permalink: /FHIRBulkOperations/
---

@@ -23,7 +23,7 @@ The `$export` operation uses three OperationDefinitions:
- [Patient](http://hl7.org/fhir/uv/bulkdata/STU1/OperationDefinition-patient-export.html) - Obtain a set of resources pertaining to all patients. Exports to an S3-compatible data store.
- [Group](http://hl7.org/fhir/uv/bulkdata/STU1/OperationDefinition-group-export.html) - Obtain a set of resources pertaining to patients in a specific Group. Only supports static membership; does not resolve inclusion/exclusion criteria.

- The export may be to the ndjson or parquet format.
+ The export is in the ndjson format.

### **$export: Create a Bulk Data Request**
To create an export request, the IBM FHIR Server requires the body fields of the request object to be a FHIR Resource `Parameters` JSON Object. The request must be posted to the server using `POST`. Each request is limited to a single resource type in each imported or referenced file.
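
A minimal sketch of such a request body, assuming the standard bulk data parameters `_type` and `_since` (values are placeholders):

```json
{
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "_type",
            "valueString": "Patient"
        },
        {
            "name": "_since",
            "valueInstant": "2022-01-01T00:00:00Z"
        }
    ]
}
```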
20 changes: 2 additions & 18 deletions docs/src/pages/guides/FHIRServerUsersGuide.md
@@ -2,8 +2,8 @@
layout: post
title: IBM FHIR Server User's Guide
description: IBM FHIR Server User's Guide
- Copyright: years 2017, 2021
- lastupdated: "2021-12-03"
+ Copyright: years 2017, 2022
+ lastupdated: "2022-01-20"
permalink: /FHIRServerUsersGuide/
---

@@ -1337,7 +1337,6 @@ The Bulk Data web application writes the exported FHIR resources to an IBM Cloud
"accessKeyId": "example",
"secretAccessKey": "example-password"
},
"enableParquet": false,
"disableBaseUrlValidation": true,
"disableOperationOutcomes": true,
"duplicationCheck": false,
@@ -1512,7 +1511,6 @@ Example of `path` based access:
"accessKeyId": "example",
"secretAccessKey": "example-password"
},
"enableParquet": false,
"disableBaseUrlValidation": true,
"disableOperationOutcomes": true,
"duplicationCheck": false,
@@ -1554,7 +1552,6 @@ Example of `host` based access:
"accessKeyId": "example",
"secretAccessKey": "example-password"
},
"enableParquet": false,
"disableBaseUrlValidation": true,
"disableOperationOutcomes": true,
"duplicationCheck": false,
@@ -1611,15 +1608,6 @@ This feature is useful for imports which follow a prefix pattern:
### 4.10.3 Integration Testing
For integration testing, the `fhir-server-test` module includes `ExportOperationTest.java`, with server integration test cases for system, patient, and group export, and `ImportOperationTest.java` for import. These tests rely on `fhir-server-config-db2.json`, which specifies two storageProviders.

- ### 4.10.4 Export to Parquet
- Version 4.4 of the IBM FHIR Server introduced experimental support for exporting to Parquet format (as an alternative to the default NDJSON export). However, due to the size of the dependencies needed to make this work, this feature is disabled by default.
-
- To enable export to parquet, an administrator must:
- 1. make Apache Spark (version 3.0) and the IBM Stocator adapter (version 1.1) available to the fhir-bulkdata-webapp by dropping the necessary jar files under the `fhir-server/userlib` directory; and
- 2. set the `/fhirServer/bulkdata/storageProviders/(source)/enableParquet` config property to `true`
-
- An alternative way to accomplish the first part of this is to change the scope of these dependencies in the fhir-bulkdata-webapp pom.xml and rebuild the webapp to include them.
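
For reference, the removed toggle sat under each storageProvider entry; a minimal sketch, assuming a provider named `default`:

```json
{
    "fhirServer": {
        "bulkdata": {
            "storageProviders": {
                "default": {
                    "enableParquet": true
                }
            }
        }
    }
}
```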

### 4.10.5 Job Logs
Because the bulk import and export operations are built on Liberty's java batch implementation, users may need to check the [Liberty batch job logs](https://www.ibm.com/support/knowledgecenter/SSEQTP_liberty/com.ibm.websphere.wlp.doc/ae/rwlp_batch_view_joblog.html) for detailed step information / troubleshooting.

@@ -2232,7 +2220,6 @@ This section contains reference information about each of the configuration prop
|`fhirServer/bulkdata/storageProviders/<source>/fileBase`|string|The absolute path of the output directory. It is recommended that this path not be the mount point of a volume. For instance, if a volume is mounted at /output/bulkdata, use /output/bulkdata/data to ensure a failed mount does not result in writing to the root file system.|
|`fhirServer/bulkdata/storageProviders/<source>/validBaseUrls`|list|The list of supported URLs which are approved for the FHIR server to access|
|`fhirServer/bulkdata/storageProviders/<source>/disableBaseUrlValidation`|boolean|Disables the URL checking feature, allowing all URLs to be imported|
- |`fhirServer/bulkdata/storageProviders/<source>/enableParquet`|boolean|Whether or not the server is configured to support export to parquet; to properly enable it the administrator must first make spark and stocator available to the fhir-bulkdata-webapp (e.g through the shared lib at `wlp/user/shared/resources/lib`)|
|`fhirServer/bulkdata/storageProviders/<source>/disableOperationOutcomes`|boolean|Disables the creation of OperationOutcome resources during import|
|`fhirServer/bulkdata/storageProviders/<source>/duplicationCheck`|boolean|Enables duplication check on import|
|`fhirServer/bulkdata/storageProviders/<source>/validateResources`|boolean|Enables the validation of imported resources|
@@ -2363,7 +2350,6 @@ This section contains reference information about each of the configuration prop
|`fhirServer/bulkdata/cosFileMaxSize`|209715200|
|`fhirServer/bulkdata/patientExportPageSize`|200|
|`fhirServer/bulkdata/useFhirServerTrustStore`|false|
- |`fhirServer/bulkdata/enableParquet`|false|
|`fhirServer/bulkdata/ignoreImportOutcomes`|false|
|`fhirServer/bulkdata/enabled`|true |
|`fhirServer/bulkdata/core/api/trustAll`|false|
@@ -2386,7 +2372,6 @@ This section contains reference information about each of the configuration prop
|`fhirServer/bulkdata/core/defaultOutcomeProvider`|default|
|`fhirServer/bulkdata/core/enableSkippableUpdates`|true|
|`fhirServer/bulkdata/storageProviders/<source>/disableBaseUrlValidation`|false|
- |`fhirServer/bulkdata/storageProviders/<source>/enableParquet`|false|
|`fhirServer/bulkdata/storageProviders/<source>/disableOperationOutcomes`|false|
|`fhirServer/bulkdata/storageProviders/<source>/duplicationCheck`|false|
|`fhirServer/bulkdata/storageProviders/<source>/validateResources`|false|
@@ -2554,7 +2539,6 @@ must restart the server for that change to take effect.
|`fhirServer/bulkdata/storageProviders/<source>/fileBase`|Y|Y|
|`fhirServer/bulkdata/storageProviders/<source>/validBaseUrls`|Y|Y|
|`fhirServer/bulkdata/storageProviders/<source>/disableBaseUrlValidation`|Y|Y|
- |`fhirServer/bulkdata/storageProviders/<source>/enableParquet`|Y|Y|
|`fhirServer/bulkdata/storageProviders/<source>/disableOperationOutcomes`|Y|Y|
|`fhirServer/bulkdata/storageProviders/<source>/duplicationCheck`|Y|Y|
|`fhirServer/bulkdata/storageProviders/<source>/validateResources`|Y|Y|
65 changes: 5 additions & 60 deletions fhir-bulkdata-webapp/pom.xml
@@ -107,6 +107,11 @@
<artifactId>fhir-provider</artifactId>
<version>${project.version}</version>
</dependency>
+ <dependency>
+     <groupId>jakarta.servlet</groupId>
+     <artifactId>jakarta.servlet-api</artifactId>
+     <scope>provided</scope>
+ </dependency>
<dependency>
<!-- azure needs to come before spark/stocator -->
<groupId>com.azure</groupId>
@@ -126,66 +131,6 @@
<groupId>com.azure</groupId>
<artifactId>azure-core</artifactId>
</dependency>
- <dependency>
-     <groupId>org.apache.spark</groupId>
-     <artifactId>spark-sql_2.12</artifactId>
-     <scope>provided</scope>
-     <exclusions>
-         <exclusion>
-             <groupId>org.apache.avro</groupId>
-             <artifactId>avro</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.apache.avro</groupId>
-             <artifactId>avro-mapred</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.apache.zookeeper</groupId>
-             <artifactId>zookeeper</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.glassfish.jersey.core</groupId>
-             <artifactId>jersey-client</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.glassfish.jersey.core</groupId>
-             <artifactId>jersey-common</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.glassfish.jersey.containers</groupId>
-             <artifactId>jersey-container-servlet</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.glassfish.jersey.core</groupId>
-             <artifactId>jersey-server</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.glassfish.jersey.inject</groupId>
-             <artifactId>jersey-hk2</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>io.netty</groupId>
-             <artifactId>netty-all</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.apache.hive</groupId>
-             <artifactId>hive-storage-api</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.apache.orc</groupId>
-             <artifactId>orc-mapreduce</artifactId>
-         </exclusion>
-         <exclusion>
-             <groupId>org.apache.arrow</groupId>
-             <artifactId>arrow-vector</artifactId>
-         </exclusion>
-     </exclusions>
- </dependency>
- <dependency>
-     <groupId>com.ibm.stocator</groupId>
-     <artifactId>stocator</artifactId>
-     <scope>provided</scope>
- </dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>

This file was deleted.

@@ -1,5 +1,5 @@
/*
- * (C) Copyright IBM Corp. 2020, 2021
+ * (C) Copyright IBM Corp. 2020, 2022
*
* SPDX-License-Identifier: Apache-2.0
*/
@@ -21,8 +21,6 @@
import javax.enterprise.context.Dependent;
import javax.inject.Inject;

- import org.apache.spark.sql.SparkSession;
-
import com.ibm.fhir.bulkdata.jbatch.context.BatchContextAdapter;
import com.ibm.fhir.bulkdata.jbatch.export.data.ExportCheckpointUserData;
import com.ibm.fhir.exception.FHIRException;
@@ -62,24 +60,6 @@ public void beforeJob() throws Exception {
// Register the context to get the right configuration.
ConfigurationAdapter adapter = ConfigurationFactory.getInstance();
adapter.registerRequestContext(ctx.getTenantId(), ctx.getDatastoreId(), ctx.getIncomingUrl());

-         if (adapter.isStorageProviderParquetEnabled(ctx.getSource())) {
-             try {
-                 Class.forName("org.apache.spark.sql.SparkSession");
-
-                 // Create the global spark session
-                 SparkSession.builder().appName("parquetWriter")
-                     // local : Run Spark locally with one worker thread (i.e. no parallelism at all).
-                     // local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
-                     .master("local[*]")
-                     // This undocumented feature allows us to avoid a bunch of unnecessary dependencies and avoid
-                     // launching the unnecessary SparkUI stuff, but there is some risk in using it as it's
-                     // undocumented.
-                     .config("spark.ui.enabled", false).getOrCreate();
-             } catch (ClassNotFoundException e) {
-                 logger.warning("No SparkSession in classpath; skipping spark session initialization");
-             }
-         }
} catch (Exception e) {
logger.log(Level.SEVERE, "ExportJobListener: beforeJob failed job[" + executionId + "]", e);
throw e;