-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GOBBLIN-1569] Add RDBMS-backed MysqlJobCatalog
, as alternative to file system storage
#3421
Conversation
* upstream/master: Refactor `MysqlSpecStore` into a generalization, `MysqlNonFlowSpecStore` (not limited to `FlowSpec`s), also useable for `TopologySpec`s (apache#3414) [GOBBLIN-1563]Collect more information to analyze the RC for some job cannot emit kafka events to update job status (apache#3416) [GOBBLIN-1521] Create local mode of streaming kafka job to help user quickly onboard (apache#3372) [GOBBLIN-1559] Support wildcard for input paths (apache#3410) [GOBBLIN-1561]Improve error message when flow compilation fails (apache#3412) [GOBBLIN-1556]Add shutdown logic in FsJobConfigurationManager (apache#3407) [GOBBLIN-1542] Integrate with Helix API to add/remove task from a running helix job (apache#3393)
// for backward compatibility... (is this needed for `JobSpec`, or only for (inspiration) `FlowSpec`???) | ||
properties = new Properties(); | ||
try { | ||
properties.load(new StringReader(jsonObject.get(JobSpecSerializer.JOB_SPEC_CONFIG_AS_PROPERTIES_KEY).getAsString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this originated (with the comment "for backward compatibility") in FlowSpecDeserializer
... is it necessary here too or a historical quirk specific to the former?
try { | ||
templateURI = Optional.ofNullable(jsonObject.get(JobSpecSerializer.JOB_SPEC_TEMPLATE_URI_KEY).getAsString()).map(s -> { | ||
try { | ||
return new URI(s); | ||
} catch (URISyntaxException e) { | ||
LOGGER.warn(String.format("error deserializing '%s' as a URI", | ||
jsonObject.get(JobSpecSerializer.JOB_SPEC_TEMPLATE_URI_KEY).getAsString()), e); | ||
throw new RuntimeException(e); | ||
} | ||
}); | ||
} catch (RuntimeException e) { | ||
templateURI = Optional.empty(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This map function seems unnecessarily complicated, couldn't we just do:
try {
templateURI = Optional.ofNullable(new URI(jsonObject.get(JobSpecSerializer.JOB_SPEC_TEMPLATE_URI_KEY).getAsString()));
} catch (URISyntaxException e) {
LOGGER.warn(String.format("error deserializing '%s' as a URI",
jsonObject.get(JobSpecSerializer.JOB_SPEC_TEMPLATE_URI_KEY).getAsString()), e);
templateURI = Optional.empty();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first off, I agree it's unsatisfying. (I welcome any tips toward finessing java's checked exceptions!)
the transformation you suggest doesn't protect against passing null
to the URI
ctor. that's why I was working (via map
) within the Optional
. still, you did helped spot a NPE bug from the parens closed in the wrong place. we need:
Optional.ofNullable(jsonObj.get(...)).map
not the potentially explosive:
Optional.ofNullable(jsonObj.get(...).getAsString()).map
in addition, I'll use Optional.flatMap
to eliminate the second level of exception handling. I'll push those corrections, and you please tell me if you see any other potential improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I see, yeah I think the current way looks fine, it is good to not have the two levels of exception handling.
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/job_catalog/MysqlJobCatalog.java
Outdated
Show resolved
Hide resolved
public class MysqlJobCatalog extends JobCatalogBase implements MutableJobCatalog { | ||
|
||
private static final Logger LOGGER = LoggerFactory.getLogger(MysqlJobCatalog.class); | ||
public static final String DB_CONFIG_PREFIX = "mysqlJobCatalog"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name does not match the value it holds, right? how about MYSQL_JOB_CATALOG or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's named after its purpose/function, to read clearly when used; e.g. from the unit test:
Config config = ConfigBuilder.create()
.addPrimitive(ConfigurationKeys.METRICS_ENABLED_KEY, "true")
.addPrimitive(MysqlJobCatalog.DB_CONFIG_PREFIX + "." + ConfigurationKeys.STATE_STORE_DB_URL_KEY, testDb.getJdbcUrl())
...
what do you think?
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/job_catalog/MysqlJobCatalog.java
Show resolved
Hide resolved
JobSpec jobSpec = (JobSpec) jobSpecStore.getSpec(jobURI); | ||
jobSpecStore.deleteSpec(jobURI); | ||
this.mutableMetrics.updateRemoveJobTime(startTime); | ||
this.listeners.onDeleteJob(jobURI, jobSpec.getVersion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it seems supporting a feature (spec versioning) which may have bugs, and is not being used is costing us more spec store calls. Should we remove this in future, or at least provide a version of APIs without the version (that passes to the existing API with a default version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lack the history on this one. I'm also not following what additional spec store calls arise as a result.
(overall I'm not opposed to streamlining the functionality on offer...)
please either provide more background context, or, if that effort would be a separate follow-on, we could save the discussion for later.
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/spec_serde/GsonJobSpecSerDe.java
Outdated
Show resolved
Hide resolved
/** | ||
* {@link SpecSerDe} that serializes as Json using {@link Gson}. Note that currently only {@link JobSpec}s are supported. | ||
*/ | ||
public class GsonJobSpecSerDe implements SpecSerDe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There must be a way to reuse GsonFlowSpecSerDe
, can we use generics in the interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion... I guess I took the prior impl at face value, but I was able to re-work w/ generics into an abstract base class doing the heavy lifting.
since java lacks a type class resolution capability, we still need to create a specific, named derivation to supply the serializer and the deserializer. handily, that non-generic class name also imparts the ability to refer to the type within HOCON or even simple properties-based config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
…file system storage (apache#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup
…file system storage (apache#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup
…file system storage (apache#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup
…file system storage (apache#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup
…asedSource (#3436) * [GOBBLIN-1582] Fill low/high watermark info in SourceState for QueryBasedSource * add unit test * address comments to make high/low watermark optional * Refactor `MysqlSpecStore` into a generalization, `MysqlNonFlowSpecStore` (not limited to `FlowSpec`s), also useable for `TopologySpec`s (#3414) * Refactor `MysqlSpecStore` into a generalization, `MysqlNonFlowSpecStore` (not limited to `FlowSpec`s), also useable for `TopologySpec`s * Add missing file, `MysqlNonFlowSpecStoreTest` * Fixup `MysqlNonFlowSpecStoreTest` * Simplify implementaiton of `MysqlSpecStore.getSpecsImpl`. * Rename `MysqlNonFlowSpecStore` to `MysqlBaseFlowSpecStore`. * Aid maintainers with additional code comments * [GOBBLIN-1557] Make KafkaSource getFilteredTopics method protected (#3408) The method was originally private, and it is useful to be able to override it in subclasses, to redefine how to get topics to be processed. Change-Id: If94cda2f7a5e65e52e2453427c60f4abb932b3f8 * [GOBBLIN-1567] do not set a custom maxConnLifetime for sql connection (#3418) * do not set a custom maxConnLifetime for sql connection * address review comment * [GOBBLIN-1568] Exponential backoff for Salesforce bulk api polling (#3420) * Exponential backoff for Salesforce bulk api polling * Read min and max wait time from prop with default * set RETENTION_DATASET_ROOT in CleanableIcebergDataset so that any retention job can use this information (#3422) * [GOBBLIN-1569] Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage (#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup * Tag metrics with proxy url if available (#3423) * remove use of deprecated helix class (#3424) codestyle changes * [GOBBLIN-1565]Make GMCEWriter fault tolerant so that one topic failure will not affect other topics in the same container (#3419) * [GOBBLIN-1565]Make GMCEWriter fault tolerant so that one topic failure will not affect other topics in the same container * address comments * change the way we set low watermark to have a better indicate for the watermark range of the snapshot * address comments * fix test error * [GOBBLIN-1552] determine flow status correctly when dag manager is disabled (#3403) * determine flow status based on the fact if dag manager is enabled this is needed because when dag manager is not enabled, flow level events are not emitted and cannot be used to determine flow status. in that case flow status has to be determined by using job statuses. store flow status in the FlowStatus * address review comments * address review comments * removed a commented line * [GOBBLIN-1564] codestyle changes, typo corrections, improved javadoc and fix a sync… (#3415) * codestyle changes, typo corrections, improved javadoc and fix a synchronization issue * address review comments * add review comments * address review comments * address review comments * fix bugsFixMain * do not delete data while dropping a hive table because data is deleted, if needed, separately (#3431) * [GOBBLIN-1574] Added whitelist for iceberg tables to add new partitio… (#3426) * [GOBBLIN-1574] Added whitelist for iceberg tables to add new partition column * fix to failing test case * Updated IncebergMetadataWriterTest to blacklist the test table from non-completeness tests * moved dataset name update in tablemetadata * Added newPartition checks in Table Metadata * Fixed test case to include new_parition_enabled Co-authored-by: Vikram Bohra <[email protected]> * [GOBBLIN-1577] change the multiplier used in ExponentialWaitStrategy to a reasonable… (#3430) * change the multiplier used in ExponentialWaitStrategy to 1 second. old multiplier 2ms was retrying too fast for some use cases * . * [GOBBLIN-1580] Check table exists instead of call create table directly to make sure table exists (#3432) * [hotfix] workaround to catch exception when iceberg does not support get metrics for non-union type * address comments * [GOBBLIN-1580]Check table exists instead of call create table directly to make sure table exists * [GOBBLIN-1573]Fix the ClassNotFoundException in streaming test pipeline (#3425) * [GOBBLIN-1565]Make GMCEWriter fault tolerant so that one topic failure will not affect other topics in the same container * address comments * address comments * [GOBBLIN-1573]Fix the ClassNotFoundException in streaming test pipeline * [GOBBLIN-1576] skip appending record count to staging file if present… (#3429) * [GOBBLIN-1576] skip appending record count to staging file if present already * fixed checkstyle * fixed method Co-authored-by: Vikram Bohra <[email protected]> * fix the NPE in dagManager * fix quota check issue in dagManager * address comments Co-authored-by: Kip Kohn <[email protected]> Co-authored-by: Joseph Allemandou <[email protected]> Co-authored-by: Arjun Singh Bora <[email protected]> Co-authored-by: Jiashuo Wang <[email protected]> Co-authored-by: William Lo <[email protected]> Co-authored-by: vbohra <[email protected]> Co-authored-by: Vikram Bohra <[email protected]>
…file system storage (apache#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup
…asedSource (apache#3436) * [GOBBLIN-1582] Fill low/high watermark info in SourceState for QueryBasedSource * add unit test * address comments to make high/low watermark optional * Refactor `MysqlSpecStore` into a generalization, `MysqlNonFlowSpecStore` (not limited to `FlowSpec`s), also useable for `TopologySpec`s (apache#3414) * Refactor `MysqlSpecStore` into a generalization, `MysqlNonFlowSpecStore` (not limited to `FlowSpec`s), also useable for `TopologySpec`s * Add missing file, `MysqlNonFlowSpecStoreTest` * Fixup `MysqlNonFlowSpecStoreTest` * Simplify implementaiton of `MysqlSpecStore.getSpecsImpl`. * Rename `MysqlNonFlowSpecStore` to `MysqlBaseFlowSpecStore`. * Aid maintainers with additional code comments * [GOBBLIN-1557] Make KafkaSource getFilteredTopics method protected (apache#3408) The method was originally private, and it is useful to be able to override it in subclasses, to redefine how to get topics to be processed. Change-Id: If94cda2f7a5e65e52e2453427c60f4abb932b3f8 * [GOBBLIN-1567] do not set a custom maxConnLifetime for sql connection (apache#3418) * do not set a custom maxConnLifetime for sql connection * address review comment * [GOBBLIN-1568] Exponential backoff for Salesforce bulk api polling (apache#3420) * Exponential backoff for Salesforce bulk api polling * Read min and max wait time from prop with default * set RETENTION_DATASET_ROOT in CleanableIcebergDataset so that any retention job can use this information (apache#3422) * [GOBBLIN-1569] Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage (apache#3421) * Add RDBMS-backed `MysqlJobCatalog`, as alternative to file system storage * Streamline `JobSpecDeserializer` error handling, on review feedback. * Refactor `GsonJobSpecSerDe` into a reusable `GenericGsonSpecSerDe`. * Fix javadoc slipup * Tag metrics with proxy url if available (apache#3423) * remove use of deprecated helix class (apache#3424) codestyle changes * [GOBBLIN-1565]Make GMCEWriter fault tolerant so that one topic failure will not affect other topics in the same container (apache#3419) * [GOBBLIN-1565]Make GMCEWriter fault tolerant so that one topic failure will not affect other topics in the same container * address comments * change the way we set low watermark to have a better indicate for the watermark range of the snapshot * address comments * fix test error * [GOBBLIN-1552] determine flow status correctly when dag manager is disabled (apache#3403) * determine flow status based on the fact if dag manager is enabled this is needed because when dag manager is not enabled, flow level events are not emitted and cannot be used to determine flow status. in that case flow status has to be determined by using job statuses. store flow status in the FlowStatus * address review comments * address review comments * removed a commented line * [GOBBLIN-1564] codestyle changes, typo corrections, improved javadoc and fix a sync… (apache#3415) * codestyle changes, typo corrections, improved javadoc and fix a synchronization issue * address review comments * add review comments * address review comments * address review comments * fix bugsFixMain * do not delete data while dropping a hive table because data is deleted, if needed, separately (apache#3431) * [GOBBLIN-1574] Added whitelist for iceberg tables to add new partitio… (apache#3426) * [GOBBLIN-1574] Added whitelist for iceberg tables to add new partition column * fix to failing test case * Updated IncebergMetadataWriterTest to blacklist the test table from non-completeness tests * moved dataset name update in tablemetadata * Added newPartition checks in Table Metadata * Fixed test case to include new_parition_enabled Co-authored-by: Vikram Bohra <[email protected]> * [GOBBLIN-1577] change the multiplier used in ExponentialWaitStrategy to a reasonable… (apache#3430) * change the multiplier used in ExponentialWaitStrategy to 1 second. old multiplier 2ms was retrying too fast for some use cases * . * [GOBBLIN-1580] Check table exists instead of call create table directly to make sure table exists (apache#3432) * [hotfix] workaround to catch exception when iceberg does not support get metrics for non-union type * address comments * [GOBBLIN-1580]Check table exists instead of call create table directly to make sure table exists * [GOBBLIN-1573]Fix the ClassNotFoundException in streaming test pipeline (apache#3425) * [GOBBLIN-1565]Make GMCEWriter fault tolerant so that one topic failure will not affect other topics in the same container * address comments * address comments * [GOBBLIN-1573]Fix the ClassNotFoundException in streaming test pipeline * [GOBBLIN-1576] skip appending record count to staging file if present… (apache#3429) * [GOBBLIN-1576] skip appending record count to staging file if present already * fixed checkstyle * fixed method Co-authored-by: Vikram Bohra <[email protected]> * fix the NPE in dagManager * fix quota check issue in dagManager * address comments Co-authored-by: Kip Kohn <[email protected]> Co-authored-by: Joseph Allemandou <[email protected]> Co-authored-by: Arjun Singh Bora <[email protected]> Co-authored-by: Jiashuo Wang <[email protected]> Co-authored-by: William Lo <[email protected]> Co-authored-by: vbohra <[email protected]> Co-authored-by: Vikram Bohra <[email protected]>
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Add RDBMS-backed
MysqlJobCatalog
, as alternative to file system storagePresent storage bindings for
JobCatalog
are in-memory or file system. While the latter is durable, it requires an NFS-like capability for cross-instance sharing.Support alternative multi-instance backing through MySQL, precluding the need for NFS.
Tests
New unit tests
Commits