Manually populate missing metadata items #51

Closed
seanprivett opened this issue Jan 22, 2024 · 11 comments

@seanprivett

seanprivett commented Jan 22, 2024

First identify which fields are to be populated with placeholder values, share the list with the team for agreement, then implement.

Manually populate metadata, based on our best guesses, so that useful information is available for user research.
Tag the status as draft, to indicate that this is information we have put in.
Data should look as real as possible.
If we don't know a field's value, we should leave it blank.

Reformat the metadata extract into the format that DataHub expects

Script to upload metadata into DataHub, not using our registration API

Pick data products which already have descriptions

@seanprivett seanprivett changed the title from "Explore default population of metadata fields" to "Manually populate missing metadata items" Jan 22, 2024
@murdo-moj murdo-moj assigned murdo-moj and unassigned murdo-moj Feb 2, 2024
@MatMoore MatMoore self-assigned this Feb 2, 2024
@MatMoore
Contributor

MatMoore commented Feb 2, 2024

Somewhat blocked on #52 but I can start putting a spreadsheet together

@MatMoore
Contributor

MatMoore commented Feb 2, 2024

Stuff to populate (see https://docs.google.com/spreadsheets/d/1O1EjO96lyqDbyuzFWU8SNBky3b7tKM8dccZVS7-_t9g/edit#gid=784169301):

Data product and table level:

  • Description
  • Owners
  • Domains (already set for data product but not showing up for tables)

Table level:

  • Row count

Custom properties:

  • dpiaRequired
  • dpiaLocation
  • retentionPeriod
  • Sensitivity level

Ignore for now

  • s3Location
  • possible custom property for last data update
  • sourceDatasetName
  • sourceDatasetLocation
  • Column descriptions

Can we bulk update properties via the command line, or do we need to write another script for this?

@MatMoore
Contributor

MatMoore commented Feb 2, 2024

https://datahubproject.io/docs/cli/#user-user-entity this looks useful for adding some data owners

We can interact with individual data products using this one https://datahubproject.io/docs/cli/#dataproduct-data-product-entity

For tables, there are enough that we will probably need to script it. We need to

  • identify which assets are missing which fields
  • come up with our best guess for the missing metadata
  • update the assets

I wonder if we could use the file sink to dump to file, then edit the files, then load it back in ???
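
A rough sketch of the "edit the files" step in that dump/edit/reload idea. It assumes the file sink writes a JSON array of aspect records with `entityUrn`, `aspectName` and `aspect.json` keys. The function name `apply_guesses`, the `guesses` mapping and the `status: draft` custom property are all hypothetical, and loading the edited file back into DataHub is not shown.

```python
import json


def apply_guesses(records, guesses):
    """Fill blank dataset descriptions from `guesses` and mark them as draft.

    `records` is a list of aspect dicts (modified in place).
    `guesses` maps an entity URN to our best-guess description.
    Returns the list of URNs that were updated.
    """
    updated = []
    for record in records:
        if record.get("aspectName") != "datasetProperties":
            continue
        props = record["aspect"]["json"]
        urn = record["entityUrn"]
        # Keep real descriptions, and leave fields blank when we have no guess.
        if props.get("description") or urn not in guesses:
            continue
        props["description"] = guesses[urn]
        # Hypothetical custom property marking this as information we put in.
        props.setdefault("customProperties", {})["status"] = "draft"
        updated.append(urn)
    return updated


if __name__ == "__main__":
    urn = "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)"
    records = [
        {
            "entityType": "dataset",
            "entityUrn": urn,
            "changeType": "UPSERT",
            "aspectName": "datasetProperties",
            "aspect": {"json": {"customProperties": {}, "description": "", "tags": []}},
        }
    ]
    # Placeholder text only; the real guesses would come from the spreadsheet.
    guesses = {urn: "Placeholder description (hypothetical)"}
    print(apply_guesses(records, guesses))
    print(json.dumps(records[0]["aspect"]["json"], indent=2))
```

Records that already have a description are skipped, so running it twice should not overwrite anything that was filled in by hand.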

@MatMoore
Contributor

MatMoore commented Feb 5, 2024

How to import data from remote datahub into local datahub lite for debugging and data wrangling

lite_sink.yaml:

pipeline_name: datahub_source_1
datahub_api:
  server: "https://data-platform-datahub-catalogue-dev.apps.live.cloud-platform.service.justice.gov.uk/api/gms" 
  token: "xxxxx"
source:
  type: datahub
  config:
    include_all_versions: false
    pull_from_datahub_api: true
sink:
  type: datahub-lite

Then run:

datahub ingest -c lite_sink.yaml

datahub lite ls

See https://datahubproject.io/docs/datahub_lite/

@MatMoore
Contributor

MatMoore commented Feb 5, 2024

How to export data from remote datahub into a big json file

Same as above, just replace the sink with

sink:
  type: file
  config:
    filename: ./datahub_export.json

Example output (array of aspects):

[
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "browsePaths",
    "aspect": {
        "json": {
            "paths": [
                "/prod/glue/nomis"
            ]
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162978,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetKey",
    "aspect": {
        "json": {
            "platform": "urn:li:dataPlatform:glue",
            "name": "nomis.offender_contact_persons",
            "origin": "PROD"
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162979,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "browsePathsV2",
    "aspect": {
        "json": {
            "path": [
                {
                    "id": "nomis"
                }
            ]
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162980,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "dataPlatformInstance",
    "aspect": {
        "json": {
            "platform": "urn:li:dataPlatform:glue"
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162981,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
        "json": {
            "customProperties": {},
            "description": "",
            "tags": []
        }
    },
    "systemMetadata": {
        "lastObserved": 1707128162982,
        "runId": "datahub-2024_02_05-10_15_54",
        "lastRunId": "no-run-id-provided"
    }
},
{
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)",
    "changeType": "UPSERT",
    "aspectName": "schemaMetadata",
    "aspect": {
        "json": {
            "schemaName": "offender_contact_persons",
            "platform": "urn:li:dataPlatform:glue",
....
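
Given an export shaped like the one above, something like this could cover the "identify which assets are missing which fields" step for descriptions. It is only a sketch: `parse_dataset_urn` and `missing_descriptions` are hypothetical helper names, and it assumes every dataset URN follows the `(platform, name, env)` pattern seen in the sample.

```python
import re

# Matches dataset URNs of the form shown in the export sample.
DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$"
)


def parse_dataset_urn(urn):
    """Split a dataset URN into (platform, name, env)."""
    match = DATASET_URN.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn}")
    return match.groups()


def missing_descriptions(records):
    """Return sorted URNs of datasets with a blank or absent description."""
    descriptions = {}
    for record in records:
        if record.get("entityType") != "dataset":
            continue
        urn = record["entityUrn"]
        # A dataset with no datasetProperties aspect counts as undescribed.
        descriptions.setdefault(urn, "")
        if record.get("aspectName") == "datasetProperties":
            descriptions[urn] = record["aspect"]["json"].get("description") or ""
    return sorted(urn for urn, text in descriptions.items() if not text.strip())


if __name__ == "__main__":
    # In practice the records would come from json.load() on datahub_export.json.
    urn = "urn:li:dataset:(urn:li:dataPlatform:glue,nomis.offender_contact_persons,PROD)"
    records = [
        {
            "entityType": "dataset",
            "entityUrn": urn,
            "aspectName": "datasetProperties",
            "aspect": {"json": {"customProperties": {}, "description": "", "tags": []}},
        }
    ]
    for found in missing_descriptions(records):
        print(parse_dataset_urn(found))
```

The same loop could be extended to check owners or custom properties, but those aspects are not shown in the sample so they are left out here.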

@MatMoore
Contributor

MatMoore commented Feb 12, 2024

Next steps:

  • Re-run data discovery import, to fix descriptions, sensitivityLevel, rowCount (Matt is doing this)
  • Change catalogue library to default source system metadata to data product name and table name & rerun data discovery tool import
  • Use datahub bulk edit UI to add domains to all tables based on their data product (Working on this)
  • Add owners/maintainers to datahub using CLI
  • Add short summaries to each data product (less urgent, affects GOV.UK frontend only)
  • Add some transformation rules to DBT ingestion, to add extra custom properties like sensitivityLevel, rowCount, source system etc

@MatMoore
Contributor

This thread includes the source of the common platform metadata, which may be useful for filling in missing information: https://asdslack.slack.com/archives/CH935DZGS/p1678203340379569

@MatMoore
Contributor

The DBT ingestion recipe is currently:

source:
    type: dbt
    config:
        manifest_path: 'https://mojap-derived-tables.s3.eu-west-1.amazonaws.com/prod/run_artefacts/run_time%3D2024-01-30T07%3A35%3A49/target/manifest.json?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFEaCWV1LXdlc3QtMiJIMEYCIQDN%2BnV1cIYKwElcJ6aIv3JlB67F0Kru%2BvwbOO7TyjMqtAIhAO3oYimobCVdSN6dNwIagfcd11efYgiVZD40Rt9H0FszKvgDCBoQAxoMNTkzMjkxNjMyNzQ5Igy1Zihb274TpbttXZwq1QNNxdNMQsXVK05ABRbcBT6QKpjDcIudz%2F%2BdY5IlJXFMkKVHjz3yRsiccheu%2F%2FR8cy3O4hTu62enLBgpWepOQM6YZElFv895VUYVvsn6aLQ%2BrrEWU9yDRWjINFRAp9FeOrxC41NgIsDncPxjpd5XQKa5TuPTzjT%2BHcotuy%2Bj5n%2BA16f9sXG5vdPvX7oCh4tcTBdHdWUlE%2B5OgNumylrl8LuSMJHrxn5jQXSgA91Eg5JXpK1LphCVG1FBLoZCgL5%2F%2FciDGGWslrCU8kWcLSvXmYZi%2F7e9I9WCvbJGMfYP0O37X51uK4mMM1PemL9mn8QD01sP5pQ3CliiHjgfcdmP4rx8bZyvgyIc17kyYAS4BstGYRSA8QNwXqFTb6RGiDAZLeDbGsD%2Bm6MrIPPb6fbRxOiLrZ0N9lszVkC5W%2Bl%2Fhyd9xogP79QONdPppTXendkNkixKyLo1fwkltxP%2B6i0xOVlPUa09NOD1Nj0JUIqtzJzdYZAFx%2BoJWZuF4IWVHii4btCYzooWLLEl7Osn%2BxFR%2BPOikMqwFg7d7skflyjadsfurQAayfkxJEqcEi%2FMqZ6ESCE0AmqmjJhR4ybhRsH5u3GXUMmK2cNqWElRLv84jx0jtjFQ96U9MNHM5K0GOpMCoJ2Ofy0xccZwHhPbS2hXHacqmD0RhvlAv5QfEj4hsuMNnGJVS02qfOaYcHE8mgJfWyKzK8DfUQBtLshcfeHtizcaomR%2FBwIcnFdEgj0CKqJ62geVLnAoHrEoHJ0MqjXeij6HVmWXCvOScr3VCnoB88IEvm4sobn2ua9mgKtwRq3z3p%2FdfZ7VJrovL%2FynhqptCOb6aFbu5SmpmDPbpVd9M0yCGxfGrB%2FHF9f42asvm62ZtV8MaArJ3GrVZjhcfo0BBRoO5YsYQ6a3jocomK3drnKvQI7zvV0vHa6GmiiXOL%2FD25GsuACnXkUfWpJxKupGWPYq8rmKR0yaryT%2BOmafK62XTDt1qi%2BroQY6m03H8A1aALg%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240130T164430Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAYUIXP4BWSH6USAO2%2F20240130%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=ad654521811b2b2e8ec3344f62b5c427133ac7a8583c12ed2f634c9d7a467ebe'
        catalog_path: 'https://mojap-derived-tables.s3.eu-west-1.amazonaws.com/prod/run_artefacts/run_time%3D2024-01-30T07%3A35%3A49/target/catalog.json?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFEaCWV1LXdlc3QtMiJIMEYCIQDN%2BnV1cIYKwElcJ6aIv3JlB67F0Kru%2BvwbOO7TyjMqtAIhAO3oYimobCVdSN6dNwIagfcd11efYgiVZD40Rt9H0FszKvgDCBoQAxoMNTkzMjkxNjMyNzQ5Igy1Zihb274TpbttXZwq1QNNxdNMQsXVK05ABRbcBT6QKpjDcIudz%2F%2BdY5IlJXFMkKVHjz3yRsiccheu%2F%2FR8cy3O4hTu62enLBgpWepOQM6YZElFv895VUYVvsn6aLQ%2BrrEWU9yDRWjINFRAp9FeOrxC41NgIsDncPxjpd5XQKa5TuPTzjT%2BHcotuy%2Bj5n%2BA16f9sXG5vdPvX7oCh4tcTBdHdWUlE%2B5OgNumylrl8LuSMJHrxn5jQXSgA91Eg5JXpK1LphCVG1FBLoZCgL5%2F%2FciDGGWslrCU8kWcLSvXmYZi%2F7e9I9WCvbJGMfYP0O37X51uK4mMM1PemL9mn8QD01sP5pQ3CliiHjgfcdmP4rx8bZyvgyIc17kyYAS4BstGYRSA8QNwXqFTb6RGiDAZLeDbGsD%2Bm6MrIPPb6fbRxOiLrZ0N9lszVkC5W%2Bl%2Fhyd9xogP79QONdPppTXendkNkixKyLo1fwkltxP%2B6i0xOVlPUa09NOD1Nj0JUIqtzJzdYZAFx%2BoJWZuF4IWVHii4btCYzooWLLEl7Osn%2BxFR%2BPOikMqwFg7d7skflyjadsfurQAayfkxJEqcEi%2FMqZ6ESCE0AmqmjJhR4ybhRsH5u3GXUMmK2cNqWElRLv84jx0jtjFQ96U9MNHM5K0GOpMCoJ2Ofy0xccZwHhPbS2hXHacqmD0RhvlAv5QfEj4hsuMNnGJVS02qfOaYcHE8mgJfWyKzK8DfUQBtLshcfeHtizcaomR%2FBwIcnFdEgj0CKqJ62geVLnAoHrEoHJ0MqjXeij6HVmWXCvOScr3VCnoB88IEvm4sobn2ua9mgKtwRq3z3p%2FdfZ7VJrovL%2FynhqptCOb6aFbu5SmpmDPbpVd9M0yCGxfGrB%2FHF9f42asvm62ZtV8MaArJ3GrVZjhcfo0BBRoO5YsYQ6a3jocomK3drnKvQI7zvV0vHa6GmiiXOL%2FD25GsuACnXkUfWpJxKupGWPYq8rmKR0yaryT%2BOmafK62XTDt1qi%2BroQY6m03H8A1aALg%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240130T164402Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAYUIXP4BWSH6USAO2%2F20240130%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=b352908d72597b667f98f50113c3549b13c772adae33cbc9ee1c78f3bd370dfa'
        target_platform: s3
        entities_enabled:
            test_results: No
            seeds: No
            snapshots: No
            models: Yes
            sources: No
            test_definitions: No
        node_name_pattern:
            allow:
                - '.*oasys_set$'
                - '.*oasys_section$'
                - '.*oasys_question$'
                - '.*oasys_answer$'
                - '.*oasys_assessment_group$'
                - '.*offender$'
                - '.*ref_question$'
                - '.*prison_population_history__imprisonment_spells$'
                - '.*prison_population_history__jicsl_lookup_ao_population_nart$'
                - '.*derived_delius__components_at_latest$'
                - '.*derived_delius__components_at_comm$'
                - '.*derived_delius__components_at_term$'
                - '.*derived_delius__contacts$'
                - '.*derived_delius__court_appearances$'
                - '.*derived_delius__court_reports$'
                - '.*derived_delius__first_release$'
                - '.*derived_delius__releases$'
                - '.*derived_delius__sentences_at_disp$'
                - '.*derived_delius__sentences_at_latest$'
                - '.*derived_delius__sentences_at_term$'
                - '.*derived_delius__upw_appointments$'
                - '.*common_platform_derived__all_offence_fct$'
                - '.*common_platform_derived__cases_fct$'
                - '.*common_platform_derived__crown_trials_fct$'
                - '.*common_platform_derived__def_hearing_summary_fct$'
                - '.*common_platform_derived__defendant_summary_fct$'
                - '.*common_platform_derived__disposal_summary_fct$'
                - '.*common_platform_derived__sjp_all_offence_fct$'
                - '.*common_platform_derived__sjp_defendant_summary_fct$'
                - '.*common_platform_derived__sjp_disposal_summary_fct$'
                - '.*common_platform_derived__sjp_session_summary_fct$'
                - '.*lookup_offence_v2__cjs_offence_code_to_ho_offence_code$'
                - '.*lookup_offence_v2__ho_offence_codes$'
                - '.*lookup_offence_v2__offence_group$'
                - '.*lookup_offence_v2__offence_group_code$'
                - '.*lookup_offence_v2__offence_priority$'
        stateful_ingestion:
            remove_stale_metadata: true

We could automate the adding of owners via https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings or https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer/#pattern-add-dataset-ownership

We can add fixed properties using https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer#simple-add-dataset-datasetproperties
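
A possible way to generate the transformer section for the recipe, going by the two transformer docs linked above. This is a sketch under assumptions: the transformer type names and config keys (`owner_pattern.rules`, `properties`) are as I understand them from those docs and should be checked against the DataHub version in use. The helper name `build_transformers`, the owner URN and the property value are hypothetical.

```python
import json


def build_transformers(owner_rules, fixed_properties):
    """Build a list of transformer configs for an ingestion recipe.

    `owner_rules` maps a regex (matched against dataset URNs) to owner URNs.
    `fixed_properties` is a dict of custom properties added to every dataset.
    """
    return [
        {
            "type": "pattern_add_dataset_ownership",
            "config": {"owner_pattern": {"rules": dict(owner_rules)}},
        },
        {
            "type": "simple_add_dataset_properties",
            "config": {"properties": dict(fixed_properties)},
        },
    ]


if __name__ == "__main__":
    transformers = build_transformers(
        # Hypothetical owner for the common platform tables.
        owner_rules={".*common_platform_derived.*": ["urn:li:corpuser:example.owner"]},
        # Hypothetical value; the property name comes from the list in this issue.
        fixed_properties={"sensitivityLevel": "TBC"},
    )
    # Recipes are YAML, but this JSON output should be readable by most YAML parsers.
    print(json.dumps({"transformers": transformers}, indent=2))
```

The output would sit alongside the `source` block as a top-level `transformers` key. Note that the fixed properties transformer applies the same values to every dataset in the run, so per-table values like rowCount would still need another route.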

@LavMatt LavMatt self-assigned this Feb 13, 2024
@LavMatt
Contributor

LavMatt commented Feb 13, 2024

Following discussion with UR we will also do the following before datahub is ready for UR testing.

  • Remove datasets from datahub that have not been assigned to a data product, most likely a handful of the ingested dbt models.
  • Add in an ideal data product, i.e. one that isn't based on what we have now but more what we are aiming to get to.
  • Add in civil timeliness publication metadata. This should be available in s3 in the data-platform-development account's metadata bucket
  • Tag examples of well documented data - undecided on what this tag should be called. Or explore creating a globally available search view that doesn't include poorly documented metadata
  • Make dataset location = to where you'll find it, e.g. the analytical platform

@tom-webber
Contributor

> Tag examples of well documented data

You can do negative filtering for views, so I would make a needsDocumentation tag, add it to everything, then remove it from the good ones, and create a 'not needsDocumentation' view

> Make dataset location = to where you'll find it, e.g. the analytical platform

WhereToAccessDataset field name? (to remove ambiguity)
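
To make the negative-filter suggestion above concrete, here is a small model of it in plain Python. It does not call DataHub at all. `well_documented_view` is a hypothetical name, and which datasets count as well documented in the example is made up for illustration.

```python
NEEDS_DOCS = "needsDocumentation"


def well_documented_view(datasets, well_documented):
    """Model the 'not needsDocumentation' view.

    1. Tag every dataset with needsDocumentation.
    2. Remove the tag from the ones judged well documented.
    3. The view is every dataset that does not carry the tag.
    """
    tags = {name: {NEEDS_DOCS} for name in datasets}
    for name in well_documented:
        # Ignore names that are not in the catalogue.
        tags.get(name, set()).discard(NEEDS_DOCS)
    return sorted(name for name, tagset in tags.items() if NEEDS_DOCS not in tagset)


if __name__ == "__main__":
    datasets = [
        "nomis.offender_contact_persons",
        "derived_delius__contacts",
        "common_platform_derived__cases_fct",
    ]
    # Hypothetical choice of a well documented dataset.
    print(well_documented_view(datasets, ["derived_delius__contacts"]))
```

One thing this makes visible: anything ingested after step 1 has no tag, so it would appear in the view by default unless new datasets are also tagged on ingestion.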


This issue is being marked as stale because it has been open for 60 days with no activity. Remove stale label or comment to keep the issue open.

@moj-data-platform-robot moj-data-platform-robot transferred this issue from ministryofjustice/analytical-platform Apr 25, 2024
@MatMoore MatMoore closed this as completed May 7, 2024