How to customize originEntity. #1848

Closed
cobolbaby opened this issue Sep 3, 2020 · 7 comments

@cobolbaby
Contributor

cobolbaby commented Sep 3, 2020

Currently originEntity is defined as dev or prod. In practice, we would prefer it to identify the database instance.

[Screenshot: 2020-09-03 22-47-43]

It is currently unclear how to customize this value. Any suggestions would be welcome.

@cobolbaby added the question label Sep 3, 2020
@jplaisted
Contributor

jplaisted commented Sep 3, 2020

Actually, this is the full list of possible values:

/**
 * Fabric group type
 */
enum FabricType {

  /**
   * Designates development fabrics
   */
  DEV

  /**
   * Designates early-integration (staging) fabrics
   */
  EI

  /**
   * Designates production fabrics
   */
  PROD

  /**
   * Designates corporation fabrics
   */
  CORP
}

This is a problem with enums in general. They don't extend well to open source: users cannot easily customize them for their own use case. These values made sense at LinkedIn (LI) but probably not elsewhere.

@mars-lan thoughts on how to resolve this? Both for the existing enums and going forward.

Going forward, perhaps we should ban enums in new models and require those fields to be strings? The would-be enum values could then be constants in a Java utility class somewhere (e.g. StandardFabricTypes).
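
As a rough illustration of the string-constants idea above, here is a minimal sketch. Only the class name StandardFabricTypes comes from this comment; the isStandard helper and the custom value "pg-host-01" are hypothetical, and nothing here reflects an existing class in the project.

```java
import java.util.Set;

/**
 * Hypothetical utility class: the well-known fabric values become plain
 * string constants, and the model field itself would be typed as a string.
 */
final class StandardFabricTypes {
    public static final String DEV = "DEV";
    public static final String EI = "EI";
    public static final String PROD = "PROD";
    public static final String CORP = "CORP";

    // The values from the FabricType enum, kept for optional validation.
    private static final Set<String> STANDARD = Set.of(DEV, EI, PROD, CORP);

    private StandardFabricTypes() {
        // Constants holder; not meant to be instantiated.
    }

    /** Returns true if the value is one of the standard constants above. */
    static boolean isStandard(String fabric) {
        // Set.of(...) rejects null lookups, so guard against null first.
        return fabric != null && STANDARD.contains(fabric);
    }

    public static void main(String[] args) {
        // Hypothetical custom value an open source user might want instead.
        String custom = "pg-host-01";
        System.out.println(isStandard(PROD));   // prints "true"
        System.out.println(isStandard(custom)); // prints "false"
    }
}
```

With this shape, a custom value such as a database host name is still a legal field value; the constants only document the conventional ones.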

@mars-lan
Contributor

mars-lan commented Sep 4, 2020

I think "data origin" isn't the correct choice here in the first place. @cobolbaby when you say "identify the database instance", do you mean "different database hosts with different tables", or "different database hosts with same tables via replication"?

@cobolbaby
Contributor Author

In my opinion, originEntity means the db host that may contain some platformEntity.

@mars-lan
Contributor

mars-lan commented Sep 5, 2020

> I think "data origin" isn't the correct choice here in the first place. @cobolbaby when you say "identify the database instance", do you mean "different database hosts with different tables", or "different database hosts with same tables via replication"?

@cobolbaby please help us understand your use case better so we can make the correct recommendation here

@cobolbaby
Contributor Author

Datasets often carry attributes that indicate where the data comes from:

  • db host -- the IP address the data comes from
  • db instance -- one instance may host multiple business databases
  • table schema -- the logical location of a table within a database, such as a schema in PG
  • table -- the table name may also carry a prefix indicating its layer in the data warehouse model, such as ods_, dw_, dm_

To better reflect this business hierarchy in DataHub, I tentatively considered the following mapping:

  • origin -- db host
  • platform -- db instance
  • dataset -- table

When the dataset name contains ., the front-end page recursively displays the dataset in a directory tree, which keeps datasets visually isolated, much like a schema does.

However, origin is currently an enumerated constant. Our production environment has no dev, only prod, so this layer carries little meaning for us. I would therefore like either to skip the prod layer or to customize originEntity.

@mars-lan
Contributor

mars-lan commented Sep 7, 2020

I believe this is somewhat related to #1853 (comment)?

Each part of the Dataset URN (or of any other URN, for that matter) has a specific use:

  • platform: type of data platform, e.g. mysql, hive, bigquery, etc, as listed here
  • name: the name of the dataset. Depending on the platform, it can have multiple segments separated by the separator defined here, e.g. /foo/bar/baz for HDFS/ADLS or <db_name>.<table_name> for hive.
  • origin: the "nature" of the data, e.g. PROD for production data, EI for staging data, CORP for corporation data
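
To make the three parts concrete, here is a minimal sketch that assembles them into a single URN string. It assumes the layout urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<origin>), which is not spelled out in this thread; the class and method names are hypothetical and are not part of any DataHub API.

```java
/**
 * Hypothetical helper that joins platform, name and origin into a Dataset
 * URN string. The string layout is an assumption, not taken from this thread.
 */
final class DatasetUrnSketch {
    private DatasetUrnSketch() {
        // Static helper only.
    }

    /**
     * @param platform type of data platform, e.g. "hive"
     * @param name     dataset name, possibly multi-segment, e.g. "db_name.table_name"
     * @param origin   nature of the data, e.g. "PROD"
     */
    static String datasetUrn(String platform, String name, String origin) {
        return "urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + name + "," + origin + ")";
    }

    public static void main(String[] args) {
        // Hive-style name with a "." separator between database and table.
        System.out.println(datasetUrn("hive", "db_name.table_name", "PROD"));
        // prints "urn:li:dataset:(urn:li:dataPlatform:hive,db_name.table_name,PROD)"

        // HDFS-style name with "/" separators, marked as staging data.
        System.out.println(datasetUrn("hdfs", "/foo/bar/baz", "EI"));
        // prints "urn:li:dataset:(urn:li:dataPlatform:hdfs,/foo/bar/baz,EI)"
    }
}
```

Note that the host or instance does not appear anywhere in this identifier: the platform says what kind of system it is, and the origin says what kind of data it is.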

As a result, the correct mapping for your Postgres dataset should be

At this point you'll be asking "what about db host/instance?", which is why I was asking #1848 (comment). Internally we have an entity called DatasetInstance which is designed specifically to capture the "instance" concept. Each Dataset entity can therefore have zero or more DatasetInstances. For example, a Postgres database may have multiple read replicas. DatasetInstance allows you to store replica-specific metadata, in addition to capturing the cluster as a single logical Dataset.

We do plan to open source the DatasetInstance model at some point, but were afraid that it would confuse the open source community if we did so prematurely.

@cobolbaby
Contributor Author

Thank you for your suggestion. I will make some adjustments based on this standard.

I'll close the issue for now and keep following the progress of DatasetInstance.
