How to customize originEntity. #1848

cobolbaby · 2020-09-03T09:09:10Z

Currently originEntity can only be set to values such as dev or prod. In practice, we would prefer it to identify the database instance.

It is not clear how to customize this value. Any suggestions would be appreciated.

jplaisted · 2020-09-03T21:22:34Z

Actually, this is the full list of possible values:

/**
 * Fabric group type
 */
enum FabricType {

  /**
   * Designates development fabrics
   */
  DEV

  /**
   * Designates early-integration (staging) fabrics
   */
  EI

  /**
   * Designates production fabrics
   */
  PROD

  /**
   * Designates corporation fabrics
   */
  CORP
}

This is a problem with enums in general. They don't extend well in open source: users cannot easily customize them for their own use case. These values made sense at LI but probably not elsewhere.

@mars-lan thoughts on how to resolve this? Both for the existing enums and going forward.

I feel like, going forward, perhaps we should ban enums in new models and require the fields to be strings. Then the would-be enum values can be constants in a Java utility class somewhere (e.g. StandardFabricTypes).
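
A minimal sketch of that idea, written in Python for brevity rather than Java: the field is a plain string, and the well-known values live as constants in a utility class. The class name StandardFabricTypes comes from the comment above; the is_standard helper and the example value pg-host-01 are hypothetical and not part of DataHub.

```python
class StandardFabricTypes:
    """Well-known fabric names kept as plain string constants.

    Illustrative only: this is not an existing DataHub class.
    """

    DEV = "DEV"    # development fabrics
    EI = "EI"      # early-integration (staging) fabrics
    PROD = "PROD"  # production fabrics
    CORP = "CORP"  # corporation fabrics

    ALL = frozenset({DEV, EI, PROD, CORP})

    @classmethod
    def is_standard(cls, fabric: str) -> bool:
        """Return True if `fabric` is one of the predefined names."""
        return fabric in cls.ALL


# Because the field is just a string, a deployment can use its own value
# without changing the model; it simply is not one of the standard names.
custom_fabric = "pg-host-01"  # hypothetical custom value
print(StandardFabricTypes.is_standard(StandardFabricTypes.PROD))  # True
print(StandardFabricTypes.is_standard(custom_fabric))             # False
```

The trade-off is that the schema no longer rejects typos or unexpected values, so any validation has to happen in code like is_standard instead.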

mars-lan · 2020-09-04T12:09:31Z

I think "data origin" isn't the correct choice here in the first place. @cobolbaby when you say "identify the database instance", do you mean "different database hosts with different tables", or "different database hosts with same tables via replication"?

cobolbaby · 2020-09-04T13:56:33Z

In my opinion, originEntity means the db host that may contain some platformEntity.

mars-lan · 2020-09-05T11:28:44Z

@cobolbaby please help us understand your use case better so we can make the correct recommendation here

cobolbaby · 2020-09-07T02:44:13Z

Data sets often contain some attributes used to indicate the source of the data:

db host -- Which IP the specific data comes from
db instance -- Multiple business databases may be stored under one instance
table schema -- The logical location of a table in a database, such as schema in PG
table -- The table name also contains some prefix information to indicate the level in data warehouse modeling, such as ods_, dw_, dm_

To better reflect the business hierarchy in DataHub, I tentatively came up with the following mapping:

origin -- db host
platform -- db instance
dataset -- table

When the dataset name contains a ".", the front-end page recursively displays the dataset in a directory tree, which keeps datasets visually separated, much like schemas do.

However, origin is currently an enum. Our environment has only prod and no dev, so the distinction doesn't mean much to us. I would therefore like either to skip the prod layer or to customize originEntity.

mars-lan · 2020-09-07T11:35:47Z

I believe this is somewhat related to #1853 (comment)?

Each part of the Dataset URN (or any other URN, for that matter) has a specific use:

platform: type of data platform, e.g. mysql, hive, bigquery, etc, as listed here
name: the name of the dataset. Depending on the platform, it can have multiple segments separated by the separator defined here, e.g. /foo/bar/baz for HDFS/ADLS or <db_name>.<table_name> for hive.
origin: the "nature" of the data, e.g. PROD for production data, EI for staging data, CORP for corporation data

As a result, the correct mapping for your Postgres dataset should be:

platform: postgres (currently missing, will be added by feat(platform): add "postgres" as a supported data platform #1859)
name: <table_schema>.<table_name>
origin: EI/CORP/PROD
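
The mapping above can be sketched as a small helper that assembles the three parts into a URN string. This is only an illustration: the `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<origin>)` layout is assumed from how DataHub dataset URNs are commonly written and is not spelled out in this thread, and the function names and the example table `public.ods_orders` are hypothetical.

```python
# Values of the FabricType enum quoted earlier in this thread.
ALLOWED_ORIGINS = frozenset({"DEV", "EI", "PROD", "CORP"})


def postgres_dataset_name(table_schema: str, table_name: str) -> str:
    """Join schema and table into the dotted dataset name suggested above."""
    return f"{table_schema}.{table_name}"


def make_dataset_urn(platform: str, name: str, origin: str) -> str:
    """Build a dataset URN string (layout is an assumption, see the lead-in)."""
    if origin not in ALLOWED_ORIGINS:
        raise ValueError(
            f"origin must be one of {sorted(ALLOWED_ORIGINS)}, got {origin!r}"
        )
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{origin})"


urn = make_dataset_urn(
    "postgres", postgres_dataset_name("public", "ods_orders"), "PROD"
)
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:postgres,public.ods_orders,PROD)
```

Note that a db host such as an IP address is rejected as an origin here, which mirrors the point that origin describes the nature of the data rather than where it is hosted.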

At this point you'll be asking "what about db host/instance?", which is why I was asking #1848 (comment). Internally we have an entity called DatasetInstance which is designed specifically to capture the "instance" concept. Each Dataset entity can therefore have zero or more DatasetInstances. For example, a Postgres database may have multiple read replicas. DatasetInstance allows you to store replica-specific metadata, in addition to capturing the cluster as a single logical Dataset.

We do plan to open source the DatasetInstance model at some point, but were afraid that doing so prematurely would confuse the open source community.
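
Since the DatasetInstance model is not public, the following is only a guess at the shape of the relationship described above: one logical Dataset holding zero or more instances, each with its own replica-specific metadata. Every field name, host address, and metadata key here is hypothetical and should not be read as the internal model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetInstance:
    """Hypothetical: one physical copy of a dataset, e.g. a read replica."""
    host: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    """One logical dataset, i.e. the cluster viewed as a whole."""
    platform: str
    name: str
    origin: str
    instances: List[DatasetInstance] = field(default_factory=list)  # zero or more


orders = Dataset(platform="postgres", name="public.ods_orders", origin="PROD")
orders.instances.append(DatasetInstance("10.0.0.11", {"role": "primary"}))
orders.instances.append(DatasetInstance("10.0.0.12", {"role": "read-replica"}))

# Replica-specific metadata lives on the instance, not on the logical dataset.
replica_hosts = [
    inst.host for inst in orders.instances
    if inst.metadata.get("role") == "read-replica"
]
print(replica_hosts)  # ['10.0.0.12']
```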

cobolbaby · 2020-09-07T14:54:00Z

Thank you for your suggestion. I will make some adjustments based on this convention.

I'll close the issue for now and keep an eye on the progress of DatasetInstance.

cobolbaby added the question label Sep 3, 2020

mars-lan mentioned this issue Sep 7, 2020

How to extend the data model on Neo4j #1853

Closed

mars-lan self-assigned this Sep 7, 2020

cobolbaby closed this as completed Sep 7, 2020
