Manually populate missing metadata items #51

seanprivett · 2024-01-22T13:15:39Z

First identify which fields are to be populated with placeholder values, share the list with the team for agreement, then implement.

Manual population of metadata, based on our best guesses, so that useful information can be used in user research.
Tag status as draft, to indicate this is information we have put in.
Data should look as real as possible.
If we don't know a field's value, we should leave it blank.

Reformat metadata extract into format that Data Hub expects

Script to upload metadata into Data Hub, not using our registration API

Pick data products which already have descriptions

MatMoore · 2024-02-02T17:30:38Z

Somewhat blocked on #52 but I can start putting a spreadsheet together

MatMoore · 2024-02-02T17:38:30Z

Stuff to populate: see https://docs.google.com/spreadsheets/d/1O1EjO96lyqDbyuzFWU8SNBky3b7tKM8dccZVS7-_t9g/edit#gid=784169301

Data product and table level:

Description
Owners
Domains (already set for data product but not showing up for tables)

Table level:

Row count

Custom properties:

dpiaRequired
dpiaLocation
retentionPeriod
Sensitivity level

Ignore for now

s3Location
possible custom property for last data update
sourceDatasetName
sourceDatasetLocation
Column descriptions

Can we bulk update properties via the command line, or do we need to write another script for this?

MatMoore · 2024-02-02T17:52:35Z

https://datahubproject.io/docs/cli/#user-user-entity this looks useful for adding some data owners

We can interact with individual data products using this one https://datahubproject.io/docs/cli/#dataproduct-data-product-entity

For tables, there are enough that we will probably need to script it. We need to:

identify which assets are missing which fields
come up with our best guess for the missing metadata
update the assets

I wonder if we could use the file sink to dump to a file, then edit the file, then load it back in?

MatMoore · 2024-02-05T10:14:30Z

How to import data from remote datahub into local datahub lite for debugging and data wrangling

lite_sink.yaml:

pipeline_name: datahub_source_1
datahub_api:
  server: "https://data-platform-datahub-catalogue-dev.apps.live.cloud-platform.service.justice.gov.uk/api/gms" 
  token: "xxxxx"
source:
  type: datahub
  config:
    include_all_versions: false
    pull_from_datahub_api: true
sink:
  type: datahub-lite

datahub ingest -c lite_sink.yaml

datahub lite ls

See https://datahubproject.io/docs/datahub_lite/

MatMoore · 2024-02-05T10:18:26Z

How to export data from remote datahub into a big json file

Same as above, just replace the sink with

sink:
  type: file
  config:
    filename: ./datahub_export.json

Example output (array of aspects):

[
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "browsePaths",
    "aspect": {
        "json": {
            "paths": [
                "/prod/glue/nomis"
            ]
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162978,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetKey",
    "aspect": {
        "json": {
            "platform": "urn:li:dataPlatform:glue",
            "name": "nomis.offender_contact_persons",
            "origin": "PROD"
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162979,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "browsePathsV2",
    "aspect": {
        "json": {
            "path": [
                {
                    "id": "nomis"
                }
            ]
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162980,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "dataPlatformInstance",
    "aspect": {
        "json": {
            "platform": "urn:li:dataPlatform:glue"
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162981,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
        "json": {
            "customProperties": {},
            "description": "",
            "tags": []
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162982,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
        "json": {
            "schemaName": "offender_contact_persons",
            "platform": "urn:li:dataPlatform:glue",
....
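Given an export in the shape shown above (a JSON array of aspect records), the "identify which assets are missing which fields" step could be sketched roughly as below. This is only an illustration, not something the team has run: the function names and the `guesses` mapping are made up for the example, and it only looks at the `description` field of `datasetProperties`.

```python
import json
from pathlib import Path


def load_export(path):
    """Read the JSON array of aspect records written by the file sink."""
    return json.loads(Path(path).read_text())


def find_missing_descriptions(aspects):
    """Return URNs whose datasetProperties aspect has an empty or absent description."""
    missing = []
    for record in aspects:
        if record.get("aspectName") != "datasetProperties":
            continue
        properties = record.get("aspect", {}).get("json", {})
        if not properties.get("description"):
            missing.append(record["entityUrn"])
    return missing


def fill_descriptions(aspects, guesses):
    """Fill empty descriptions from `guesses` (hypothetical mapping of URN -> text).

    Modifies `aspects` in place and returns the number of records changed.
    Existing non-empty descriptions are left untouched.
    """
    changed = 0
    for record in aspects:
        if record.get("aspectName") != "datasetProperties":
            continue
        properties = record["aspect"]["json"]
        guess = guesses.get(record["entityUrn"])
        if guess and not properties.get("description"):
            properties["description"] = guess
            changed += 1
    return changed
```

For example, `find_missing_descriptions(load_export("datahub_export.json"))` would list the URNs to add to the spreadsheet. Note that an entity with no `datasetProperties` record at all in the export would not be reported by this sketch.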

MatMoore · 2024-02-12T11:54:50Z

Next steps:

Re-run data discovery import, to fix descriptions, sensitivityLevel, rowCount (Matt is doing this)
Change catalogue library to default source system metadata to data product name and table name & rerun data discovery tool import
Use datahub bulk edit UI to add domains to all tables based on their data product (Working on this)
Add owners/maintainers to datahub using CLI
Add short summaries to each data product (less urgent, affects GOV.UK frontend only)
Add some transformation rules to DBT ingestion, to add extra custom properties like sensitivityLevel, rowCount, source system etc

MatMoore · 2024-02-12T14:26:15Z

This thread includes the source of the common platform metadata, which may be useful for filling in missing information: https://asdslack.slack.com/archives/CH935DZGS/p1678203340379569

MatMoore · 2024-02-12T15:11:43Z

DBT ingestion is currently

source:
    type: dbt
    config:
        manifest_path: 'https://mojap-derived-tables.s3.eu-west-1.amazonaws.com/prod/run_artefacts/run_time%3D2024-01-30T07%3A35%3A49/target/manifest.json?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFEaCWV1LXdlc3QtMiJIMEYCIQDN%2BnV1cIYKwElcJ6aIv3JlB67F0Kru%2BvwbOO7TyjMqtAIhAO3oYimobCVdSN6dNwIagfcd11efYgiVZD40Rt9H0FszKvgDCBoQAxoMNTkzMjkxNjMyNzQ5Igy1Zihb274TpbttXZwq1QNNxdNMQsXVK05ABRbcBT6QKpjDcIudz%2F%2BdY5IlJXFMkKVHjz3yRsiccheu%2F%2FR8cy3O4hTu62enLBgpWepOQM6YZElFv895VUYVvsn6aLQ%2BrrEWU9yDRWjINFRAp9FeOrxC41NgIsDncPxjpd5XQKa5TuPTzjT%2BHcotuy%2Bj5n%2BA16f9sXG5vdPvX7oCh4tcTBdHdWUlE%2B5OgNumylrl8LuSMJHrxn5jQXSgA91Eg5JXpK1LphCVG1FBLoZCgL5%2F%2FciDGGWslrCU8kWcLSvXmYZi%2F7e9I9WCvbJGMfYP0O37X51uK4mMM1PemL9mn8QD01sP5pQ3CliiHjgfcdmP4rx8bZyvgyIc17kyYAS4BstGYRSA8QNwXqFTb6RGiDAZLeDbGsD%2Bm6MrIPPb6fbRxOiLrZ0N9lszVkC5W%2Bl%2Fhyd9xogP79QONdPppTXendkNkixKyLo1fwkltxP%2B6i0xOVlPUa09NOD1Nj0JUIqtzJzdYZAFx%2BoJWZuF4IWVHii4btCYzooWLLEl7Osn%2BxFR%2BPOikMqwFg7d7skflyjadsfurQAayfkxJEqcEi%2FMqZ6ESCE0AmqmjJhR4ybhRsH5u3GXUMmK2cNqWElRLv84jx0jtjFQ96U9MNHM5K0GOpMCoJ2Ofy0xccZwHhPbS2hXHacqmD0RhvlAv5QfEj4hsuMNnGJVS02qfOaYcHE8mgJfWyKzK8DfUQBtLshcfeHtizcaomR%2FBwIcnFdEgj0CKqJ62geVLnAoHrEoHJ0MqjXeij6HVmWXCvOScr3VCnoB88IEvm4sobn2ua9mgKtwRq3z3p%2FdfZ7VJrovL%2FynhqptCOb6aFbu5SmpmDPbpVd9M0yCGxfGrB%2FHF9f42asvm62ZtV8MaArJ3GrVZjhcfo0BBRoO5YsYQ6a3jocomK3drnKvQI7zvV0vHa6GmiiXOL%2FD25GsuACnXkUfWpJxKupGWPYq8rmKR0yaryT%2BOmafK62XTDt1qi%2BroQY6m03H8A1aALg%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240130T164430Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAYUIXP4BWSH6USAO2%2F20240130%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=ad654521811b2b2e8ec3344f62b5c427133ac7a8583c12ed2f634c9d7a467ebe'
        catalog_path: 'https://mojap-derived-tables.s3.eu-west-1.amazonaws.com/prod/run_artefacts/run_time%3D2024-01-30T07%3A35%3A49/target/catalog.json?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFEaCWV1LXdlc3QtMiJIMEYCIQDN%2BnV1cIYKwElcJ6aIv3JlB67F0Kru%2BvwbOO7TyjMqtAIhAO3oYimobCVdSN6dNwIagfcd11efYgiVZD40Rt9H0FszKvgDCBoQAxoMNTkzMjkxNjMyNzQ5Igy1Zihb274TpbttXZwq1QNNxdNMQsXVK05ABRbcBT6QKpjDcIudz%2F%2BdY5IlJXFMkKVHjz3yRsiccheu%2F%2FR8cy3O4hTu62enLBgpWepOQM6YZElFv895VUYVvsn6aLQ%2BrrEWU9yDRWjINFRAp9FeOrxC41NgIsDncPxjpd5XQKa5TuPTzjT%2BHcotuy%2Bj5n%2BA16f9sXG5vdPvX7oCh4tcTBdHdWUlE%2B5OgNumylrl8LuSMJHrxn5jQXSgA91Eg5JXpK1LphCVG1FBLoZCgL5%2F%2FciDGGWslrCU8kWcLSvXmYZi%2F7e9I9WCvbJGMfYP0O37X51uK4mMM1PemL9mn8QD01sP5pQ3CliiHjgfcdmP4rx8bZyvgyIc17kyYAS4BstGYRSA8QNwXqFTb6RGiDAZLeDbGsD%2Bm6MrIPPb6fbRxOiLrZ0N9lszVkC5W%2Bl%2Fhyd9xogP79QONdPppTXendkNkixKyLo1fwkltxP%2B6i0xOVlPUa09NOD1Nj0JUIqtzJzdYZAFx%2BoJWZuF4IWVHii4btCYzooWLLEl7Osn%2BxFR%2BPOikMqwFg7d7skflyjadsfurQAayfkxJEqcEi%2FMqZ6ESCE0AmqmjJhR4ybhRsH5u3GXUMmK2cNqWElRLv84jx0jtjFQ96U9MNHM5K0GOpMCoJ2Ofy0xccZwHhPbS2hXHacqmD0RhvlAv5QfEj4hsuMNnGJVS02qfOaYcHE8mgJfWyKzK8DfUQBtLshcfeHtizcaomR%2FBwIcnFdEgj0CKqJ62geVLnAoHrEoHJ0MqjXeij6HVmWXCvOScr3VCnoB88IEvm4sobn2ua9mgKtwRq3z3p%2FdfZ7VJrovL%2FynhqptCOb6aFbu5SmpmDPbpVd9M0yCGxfGrB%2FHF9f42asvm62ZtV8MaArJ3GrVZjhcfo0BBRoO5YsYQ6a3jocomK3drnKvQI7zvV0vHa6GmiiXOL%2FD25GsuACnXkUfWpJxKupGWPYq8rmKR0yaryT%2BOmafK62XTDt1qi%2BroQY6m03H8A1aALg%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240130T164402Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAYUIXP4BWSH6USAO2%2F20240130%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=b352908d72597b667f98f50113c3549b13c772adae33cbc9ee1c78f3bd370dfa'
        target_platform: s3
        entities_enabled:
            test_results: No
            seeds: No
            snapshots: No
            models: Yes
            sources: No
            test_definitions: No
        node_name_pattern:
            allow:
                - '.*oasys_set$'
                - '.*oasys_section$'
                - '.*oasys_question$'
                - '.*oasys_answer$'
                - '.*oasys_assessment_group$'
                - '.*offender$'
                - '.*ref_question$'
                - '.*prison_population_history__imprisonment_spells$'
                - '.*prison_population_history__jicsl_lookup_ao_population_nart$'
                - '.*derived_delius__components_at_latest$'
                - '.*derived_delius__components_at_comm$'
                - '.*derived_delius__components_at_term$'
                - '.*derived_delius__contacts$'
                - '.*derived_delius__court_appearances$'
                - '.*derived_delius__court_reports$'
                - '.*derived_delius__first_release$'
                - '.*derived_delius__releases$'
                - '.*derived_delius__sentences_at_disp$'
                - '.*derived_delius__sentences_at_latest$'
                - '.*derived_delius__sentences_at_term$'
                - '.*derived_delius__upw_appointments$'
                - '.*common_platform_derived__all_offence_fct$'
                - '.*common_platform_derived__cases_fct$'
                - '.*common_platform_derived__crown_trials_fct$'
                - '.*common_platform_derived__def_hearing_summary_fct$'
                - '.*common_platform_derived__defendant_summary_fct$'
                - '.*common_platform_derived__disposal_summary_fct$'
                - '.*common_platform_derived__sjp_all_offence_fct$'
                - '.*common_platform_derived__sjp_defendant_summary_fct$'
                - '.*common_platform_derived__sjp_disposal_summary_fct$'
                - '.*common_platform_derived__sjp_session_summary_fct$'
                - '.*lookup_offence_v2__cjs_offence_code_to_ho_offence_code$'
                - '.*lookup_offence_v2__ho_offence_codes$'
                - '.*lookup_offence_v2__offence_group$'
                - '.*lookup_offence_v2__offence_group_code$'
                - '.*lookup_offence_v2__offence_priority$'
        stateful_ingestion:
            remove_stale_metadata: true

We could automate the adding of owners via https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings or https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer/#pattern-add-dataset-ownership

We can add fixed properties using https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer#simple-add-dataset-datasetproperties
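As a rough sketch of what those two transformer options might look like when added to the recipe, the snippet below builds the `transformers` entries as plain Python dicts. The transformer type names (`simple_add_dataset_properties`, `pattern_add_dataset_ownership`) are the ones from the linked docs; the helper function names, the property values and the owner URN are placeholders invented for this example, and the exact config keys should be checked against the docs for the DataHub version in use.

```python
import copy


def add_properties_transformer(properties):
    """Transformer entry that adds the same fixed custom properties to every dataset."""
    return {
        "type": "simple_add_dataset_properties",
        "config": {"properties": dict(properties)},
    }


def add_ownership_transformer(rules):
    """Transformer entry that assigns owners by matching a regex against the dataset URN.

    `rules` maps a regex pattern to a list of owner URNs.
    """
    return {
        "type": "pattern_add_dataset_ownership",
        "config": {
            "owner_pattern": {
                "rules": {pattern: list(owners) for pattern, owners in rules.items()}
            }
        },
    }


def with_transformers(recipe, *transformers):
    """Return a copy of the recipe dict with the transformers appended."""
    updated = copy.deepcopy(recipe)
    updated.setdefault("transformers", []).extend(transformers)
    return updated


# Placeholder values, not agreed metadata.
example = with_transformers(
    {"source": {"type": "dbt", "config": {}}},
    add_properties_transformer({"dpiaRequired": "unknown"}),
    add_ownership_transformer({".*oasys.*": ["urn:li:corpuser:example.owner"]}),
)
```

The resulting `transformers` list would sit at the top level of the recipe, alongside `source`. Fixed properties apply to every dataset the recipe ingests, so anything that varies per table (such as rowCount) would need a different approach.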

LavMatt · 2024-02-13T16:51:51Z

Following discussion with UR we will also do the following before datahub is ready for UR testing.

Remove datasets from datahub that have not been assigned to a data product - most likely a handful of the dbt models ingested.
Add in an ideal data product, i.e. one that isn't based on what we have now but more what we are aiming to get to.
Add in civil timeliness publication metadata. This should be available in s3 in the data-platform-development account's metadata bucket
Tag examples of well documented data - undecided on what this tag should be called. Or explore creating a globally available search view that doesn't include poorly documented metadata
Make dataset location = to where you'll find it, e.g. the analytical platform
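For the first item, one possible way to find the unassigned datasets is to work from a file-sink export like the one earlier in this thread. The sketch below assumes (this has not been checked against our instance) that the export includes `dataProductProperties` aspects and that each lists its members under `assets` with a `destinationUrn` key. The function names are made up for the example.

```python
def assigned_dataset_urns(aspects):
    """Collect every URN listed as an asset of some data product.

    Assumes dataProductProperties records look like
    {"assets": [{"destinationUrn": "<urn>"}, ...]}.
    """
    assigned = set()
    for record in aspects:
        if record.get("aspectName") != "dataProductProperties":
            continue
        assets = record.get("aspect", {}).get("json", {}).get("assets") or []
        for asset in assets:
            assigned.add(asset["destinationUrn"])
    return assigned


def unassigned_datasets(aspects):
    """Return dataset URNs that do not appear in any data product, sorted."""
    assigned = assigned_dataset_urns(aspects)
    datasets = {
        record["entityUrn"]
        for record in aspects
        if record.get("entityType") == "dataset"
    }
    return sorted(datasets - assigned)
```

The output is only a candidate list for review; it does not delete anything.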

tom-webber · 2024-02-14T11:02:12Z

Tag examples of well documented data

You can do negative filtering for views, so I would make a needsDocumentation tag, add it to everything, then remove it from the good ones, and create a 'not needsDocumentation' view

Make dataset location = to where you'll find it, e.g. the analytical platform

WhereToAccessDataset field name? (to remove ambiguity)
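The needsDocumentation idea could be prepared offline from a file-sink export, roughly as sketched below. This is a variation on the suggestion rather than the exact steps: instead of tagging everything and then removing the tag from the good ones, it skips the well-documented datasets up front. The function names are invented for the example, and the records follow the aspect shape seen in the export earlier in the thread.

```python
TAG_URN = "urn:li:tag:needsDocumentation"  # tag name as proposed above


def dataset_urns(aspects):
    """Distinct dataset URNs, in the order they first appear in the export."""
    seen = {}
    for record in aspects:
        if record.get("entityType") == "dataset":
            seen.setdefault(record["entityUrn"], None)
    return list(seen)


def needs_documentation_records(aspects, well_documented):
    """Build one globalTags record per dataset not listed in `well_documented`.

    Caution: an UPSERT of globalTags replaces the whole aspect, so any tags
    already on a dataset would be overwritten by these records.
    """
    skip = set(well_documented)
    return [
        {
            "entityType": "dataset",
            "entityUrn": urn,
            "changeType": "UPSERT",
            "aspectName": "globalTags",
            "aspect": {"json": {"tags": [{"tag": TAG_URN}]}},
        }
        for urn in dataset_urns(aspects)
        if urn not in skip
    ]
```

The 'not needsDocumentation' view itself would still be created in the DataHub UI; this only covers applying the tag in bulk.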

github-actions · 2024-04-25T01:49:10Z

This issue is being marked as stale because it has been open for 60 days with no activity. Remove stale label or comment to keep the issue open.

seanprivett changed the title ~~Explore default population of metadata fields~~ Manually populate missing metadata items Jan 22, 2024

github-actions bot mentioned this issue Feb 1, 2024

Monthly issue metrics report ministryofjustice/analytical-platform#3148

Closed

murdo-moj assigned murdo-moj and unassigned murdo-moj Feb 2, 2024

MatMoore self-assigned this Feb 2, 2024

LavMatt self-assigned this Feb 13, 2024

moj-data-platform-robot transferred this issue from ministryofjustice/analytical-platform Apr 25, 2024

moj-data-platform-robot removed this from Analytical Platform Apr 25, 2024

MatMoore closed this as completed May 7, 2024
