Manually populate missing metadata items #51
Somewhat blocked on #52, but I can start putting a spreadsheet together. |
Stuff to populate: see https://docs.google.com/spreadsheets/d/1O1EjO96lyqDbyuzFWU8SNBky3b7tKM8dccZVS7-_t9g/edit#gid=784169301

Data product and table level:
Table level:
Custom properties: ignore for now
Can we bulk update properties via the command line, or do we need to write another script for this? |
https://datahubproject.io/docs/cli/#user-user-entity looks useful for adding some data owners. We can interact with individual data products using https://datahubproject.io/docs/cli/#dataproduct-data-product-entity. For tables, there are enough that we will probably need to script it. We need to
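A sketch of scripting the CLI for many entities: generate one `datahub dataproduct` invocation per (product, owner) pair from a mapping. The `add_owner` subcommand and its flag names here are assumptions based on the linked CLI docs, and the URNs are placeholders, so check against the docs before running.

```python
"""Sketch: bulk-add owners to data products by shelling out to the
DataHub CLI. Subcommand/flag names and URNs are assumptions --
verify against https://datahubproject.io/docs/cli/ before use."""

# Hypothetical mapping of data product URNs to owner user URNs.
OWNERS = {
    "urn:li:dataProduct:nomis": ["urn:li:corpuser:jane.doe"],
}


def build_commands(owners: dict) -> list:
    """Turn the mapping into one CLI invocation per (product, owner) pair."""
    commands = []
    for product_urn, owner_urns in owners.items():
        for owner_urn in owner_urns:
            commands.append([
                "datahub", "dataproduct", "add_owner",
                "--urn", product_urn,
                "--owner-urn", owner_urn,  # flag name is an assumption
            ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands(OWNERS):
        # Swap print for subprocess.run(cmd, check=True) once verified.
        print(" ".join(cmd))
```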
I wonder if we could use the file sink to dump to a file, then edit the files, then load them back in? |
How to import data from remote datahub into local datahub lite for debugging and data wrangling

lite_sink.yaml:

```yaml
pipeline_name: datahub_source_1
datahub_api:
  server: "https://data-platform-datahub-catalogue-dev.apps.live.cloud-platform.service.justice.gov.uk/api/gms"
  token: "xxxxx"
source:
  type: datahub
  config:
    include_all_versions: false
    pull_from_datahub_api: true
sink:
  type: datahub-lite
```
|
How to export data from remote datahub into a big JSON file

Same as above, just replace the sink with:

```yaml
sink:
  type: file
  config:
    filename: ./datahub_export.json
```

Example output (array of aspects):

```json
[
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "browsePaths",
    "aspect": {
      "json": {
        "paths": [
          "/prod/glue/nomis"
        ]
      }
    },
    "systemMetadata": {
      "lastObserved": 1707128162978,
      "runId": "datahub-2024_02_05-10_15_54",
      "lastRunId": "no-run-id-provided"
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetKey",
    "aspect": {
      "json": {
        "platform": "urn:li:dataPlatform:glue",
        "name": "nomis.offender_contact_persons",
        "origin": "PROD"
      }
    },
    "systemMetadata": {
      "lastObserved": 1707128162979,
      "runId": "datahub-2024_02_05-10_15_54",
      "lastRunId": "no-run-id-provided"
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "browsePathsV2",
    "aspect": {
      "json": {
        "path": [
          {
            "id": "nomis"
          }
        ]
      }
    },
    "systemMetadata": {
      "lastObserved": 1707128162980,
      "runId": "datahub-2024_02_05-10_15_54",
      "lastRunId": "no-run-id-provided"
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "dataPlatformInstance",
    "aspect": {
      "json": {
        "platform": "urn:li:dataPlatform:glue"
      }
    },
    "systemMetadata": {
      "lastObserved": 1707128162981,
      "runId": "datahub-2024_02_05-10_15_54",
      "lastRunId": "no-run-id-provided"
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
      "json": {
        "customProperties": {},
        "description": "",
        "tags": []
      }
    },
    "systemMetadata": {
      "lastObserved": 1707128162982,
      "runId": "datahub-2024_02_05-10_15_54",
      "lastRunId": "no-run-id-provided"
    }
  },
  {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
      "json": {
        "schemaName": "offender_contact_persons",
        "platform": "urn:li:dataPlatform:glue",
        ....
```
|
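The dump-edit-load idea could then look like this: read the exported array of aspect records, patch selected aspects, and write the result back for re-ingestion with a `file` source. A minimal sketch using only the stdlib; the description text and URN key are placeholders.

```python
"""Sketch: back-fill empty dataset descriptions in a DataHub file-sink
export (an array of aspect records like the example above). Filenames
and description values are placeholders."""
import json
import pathlib

# Hypothetical descriptions to back-fill, keyed by dataset URN.
NEW_DESCRIPTIONS = {
    "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)":
        "Contact persons for offenders (placeholder description).",
}


def patch_descriptions(records: list, descriptions: dict) -> list:
    """Fill in empty datasetProperties descriptions where we have one."""
    for record in records:
        if (record.get("aspectName") == "datasetProperties"
                and record.get("entityUrn") in descriptions):
            aspect = record["aspect"]["json"]
            if not aspect.get("description"):
                aspect["description"] = descriptions[record["entityUrn"]]
    return records


if __name__ == "__main__":
    src = pathlib.Path("datahub_export.json")
    if src.exists():
        records = json.loads(src.read_text())
        pathlib.Path("datahub_export_edited.json").write_text(
            json.dumps(patch_descriptions(records, NEW_DESCRIPTIONS), indent=2))
```

The edited file could then be loaded back with a recipe whose source is `type: file` pointing at `datahub_export_edited.json`.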
Next steps:
|
This thread includes the source of the common platform metadata and may be useful for filling in missing information: https://asdslack.slack.com/archives/CH935DZGS/p1678203340379569 |
DBT ingestion is currently
We could automate adding owners via https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings or https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer/#pattern-add-dataset-ownership. We can add fixed properties using https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer#simple-add-dataset-datasetproperties |
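As a sketch, a transformer section along these lines could be appended to the dbt ingestion recipe. The owner URN and property values are placeholders, and the exact option names should be checked against the linked transformer docs:

```yaml
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:jane.doe"  # placeholder owner
  - type: "simple_add_dataset_properties"
    config:
      properties:
        source_system: "common_platform"  # placeholder fixed property
```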
Following discussion with UR we will also do the following before datahub is ready for UR testing.
|
You can do negative filtering for views, so I would make a
|
This issue is being marked as stale because it has been open for 60 days with no activity. Remove stale label or comment to keep the issue open. |
First identify what is to be fake populated, share this with the team to agree, then implement.

- Manual population of metadata, based on our best guesses, so that useful information can be used in user research.
- Tag status as draft, to indicate this is information we have put in.
- Data should look as real as possible.
- If we don't know fields, we should leave them blank.
- Reformat the metadata extract into the format that Data Hub expects.
- Script to upload metadata into Data Hub, not using our registration API.
- Pick data products which already have descriptions.
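For the reformat-and-upload steps above, one option is to generate records in the same shape as the file-sink export shown earlier and ingest them with a `file` source rather than the registration API. A minimal sketch; the platform, table, and description values are placeholders.

```python
"""Sketch: build datasetProperties aspect records in the file-sink
format, ready to load with a `file` source recipe. All values passed
in below are placeholders for the real spreadsheet contents."""
import json
import time


def make_properties_record(platform: str, table: str, description: str,
                           custom_properties=None) -> dict:
    """One UPSERT record for the datasetProperties aspect of a table."""
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table},PROD)"
    return {
        "entityType": "dataset",
        "entityUrn": urn,
        "changeType": "UPSERT",
        "aspectName": "datasetProperties",
        "aspect": {
            "json": {
                "customProperties": custom_properties or {},
                "description": description,
                "tags": [],
            }
        },
        "systemMetadata": {"lastObserved": int(time.time() * 1000)},
    }


if __name__ == "__main__":
    records = [
        make_properties_record(
            "glue", "nomis.offender_contact_persons",
            "Draft: best-guess description for UR testing."),
    ]
    with open("manual_metadata.json", "w") as f:
        json.dump(records, f, indent=2)
```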