Populate DataHub instances with appropriate data #5
Extract glossary terms from a datahub export json
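A rough starting point for this could look like the sketch below. It is only a sketch built on an assumption about the export format: DataHub's file sink usually writes a JSON list of MCP-style records, so the field names used here ("entityType", "entityUrn", "aspectName", "aspect") would need checking against the actual export before relying on it.

```python
# Sketch: pull glossary term records out of a DataHub export JSON file.
# Assumes the export is a JSON list of MCP-style records with "entityType",
# "entityUrn", "aspectName" and an "aspect" payload - field names should be
# verified against the real export.
import json


def extract_glossary_terms(path: str) -> list[dict]:
    with open(path) as f:
        records = json.load(f)

    terms = []
    for record in records:
        # Keep only glossary term entities; skip datasets, domains, users, etc.
        if record.get("entityType") != "glossaryTerm":
            continue
        terms.append(
            {
                "urn": record.get("entityUrn"),
                "aspect": record.get("aspectName"),
                "payload": record.get("aspect"),
            }
        )
    return terms


if __name__ == "__main__":
    for term in extract_glossary_terms("datahub_export.json"):
        print(term["urn"], term["aspect"])
```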
I'll have to create the well documented view manually I think - I can't see a way to export these. I also had a look in the datahub slack to see if anyone's thought about a terraform provider for some of the data that is relatively static. There doesn't seem to be anything available at the moment, but there was some discussion about this about a year ago: https://datahubspace.slack.com/archives/CUMUWQU66/p1635352024067800
Draft runbook section on populating environments

We have 3 datahub environments: dev, test and preprod.

Note: At least during alpha, all of these should be populated from the same sources of metadata, so that research participants are working with a catalogue that is as realistic as possible. This means there should be nothing important in our pre-production catalogue that is not also in the dev & test catalogues (including metadata from the production analytical platform).

Prerequisites for populating environments

One-off ingestions

In each environment, we (the Data Catalogue team) have prepopulated a set of metadata that we have collated ourselves. These ingestions are push based, using the Datahub API and/or command line. After these steps, we expect the following to be created:
Step 1: Import draft domain model from create a derived table
Example yaml:

```yaml
source:
  type: create_derived_table_domains_source.source.CreateDerivedTableDomainsSource
  config:
    manifest_local_path: "manifest.json"
sink:
  type: datahub-rest
  config:
    server: "https://datahub-catalogue-ENV.apps.live.cloud-platform.service.justice.gov.uk/api/gms"
    token: xxxxx
```

Step 2: Import draft glossary and users

Follow the instructions to import via the CLI in https://github.com/ministryofjustice/data-catalogue-metadata

Step 3: Import metadata taken from Data Discovery Tool

Follow the instructions to run the python script in https://github.com/ministryofjustice/data-catalogue-metadata

Scheduled ingestions

Each environment is configured with scheduled ingestions for metadata we expect to be updated. This demonstrates how we can continually pull data from other parts of the MOJ estate that the catalogue has direct access to. These sources are configured from the ingestion tab in Datahub. TBC: should the configuration for these be checked into a github repo?

Note that the ingestion tab may also show ingestions triggered from the command line, although these will show up as view-only and cannot be triggered again from the UI.

Step 4: Schedule DBT ingestion

This brings in derived tables and their lineage. Source tables may overlap with those ingested from other sources.

Step 5: Schedule custom ingestion for Justice Data charts (currently untested)

Manual environment setup

Certain aspects of the environment are not reproducible from code. These include:

These must be set up manually when recreating an environment.
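If it helps when scripting the one-off steps above, a recipe like the Step 1 example can also be run programmatically with the DataHub Python SDK rather than the CLI. This is only a sketch: it assumes acryl-datahub and our custom create_derived_table_domains_source package are installed in the same environment, and the server URL and token are placeholders.

```python
# Sketch only: run the Step 1 recipe via the DataHub SDK instead of `datahub ingest`.
# Assumes `acryl-datahub` and the custom create_derived_table_domains_source package
# are importable; the server URL and token below are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "create_derived_table_domains_source.source.CreateDerivedTableDomainsSource",
            "config": {"manifest_local_path": "manifest.json"},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "https://datahub-catalogue-ENV.apps.live.cloud-platform.service.justice.gov.uk/api/gms",
                "token": "xxxxx",
            },
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the ingestion reported errors
```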
Our previous DBT config used a presigned s3 url to access the manifest. This time round we would like to try:
```yaml
source:
  type: dbt
  config:
    # insert s3 path here
    target_platform: s3
    entities_enabled:
      test_results: No
      seeds: No
      snapshots: No
      models: Yes
      sources: No
      test_definitions: No
    node_name_pattern:
      allow:
        - '.*oasys_set$'
        - '.*oasys_section$'
        - '.*oasys_question$'
        - '.*oasys_answer$'
        - '.*oasys_assessment_group$'
        - '.*offender$'
        - '.*ref_question$'
        - '.*prison_population_history__imprisonment_spells$'
        - '.*prison_population_history__jicsl_lookup_ao_population_nart$'
        - '.*derived_delius__components_at_latest$'
        - '.*derived_delius__components_at_comm$'
        - '.*derived_delius__components_at_term$'
        - '.*derived_delius__contacts$'
        - '.*derived_delius__court_appearances$'
        - '.*derived_delius__court_reports$'
        - '.*derived_delius__first_release$'
        - '.*derived_delius__releases$'
        - '.*derived_delius__sentences_at_disp$'
        - '.*derived_delius__sentences_at_latest$'
        - '.*derived_delius__sentences_at_term$'
        - '.*derived_delius__upw_appointments$'
        - '.*common_platform_derived__all_offence_fct$'
        - '.*common_platform_derived__cases_fct$'
        - '.*common_platform_derived__crown_trials_fct$'
        - '.*common_platform_derived__def_hearing_summary_fct$'
        - '.*common_platform_derived__defendant_summary_fct$'
        - '.*common_platform_derived__disposal_summary_fct$'
        - '.*common_platform_derived__sjp_all_offence_fct$'
        - '.*common_platform_derived__sjp_defendant_summary_fct$'
        - '.*common_platform_derived__sjp_disposal_summary_fct$'
        - '.*common_platform_derived__sjp_session_summary_fct$'
        - '.*lookup_offence_v2__cjs_offence_code_to_ho_offence_code$'
        - '.*lookup_offence_v2__ho_offence_codes$'
        - '.*lookup_offence_v2__offence_group$'
        - '.*lookup_offence_v2__offence_group_code$'
        - '.*lookup_offence_v2__offence_priority$'
    stateful_ingestion:
      remove_stale_metadata: true
```
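For what it's worth, a recipe like this can be tried locally with the DataHub CLI (`datahub ingest -c <recipe>.yaml`) before it is set up as a scheduled ingestion in the UI; the filename is just whatever we save the recipe as.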
Modified version for running on test. I haven't set this up as a scheduled task yet - but we can try this once https://github.com/moj-analytical-services/create-a-derived-table/pull/1269 is merged
manifest_path will change to … with …
This is done as far as the test environment is concerned. The remaining work is blocked on ministryofjustice/find-moj-data#175. Just need to make sure the python script runs on preprod after this. The above DBT job can also be converted into a scheduled ingestion once there is data in s3://mojap-derived-tables/prod/run_artefacts/latest/target/manifest.json
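As a small aid for that last step, something like the snippet below could be used to check whether the manifest has appeared in S3 before flipping the DBT job over to a scheduled ingestion. It's only a sketch and assumes boto3 plus AWS credentials with read access to the mojap-derived-tables bucket.

```python
# Sketch: check whether the derived-tables manifest exists in S3 yet.
# Assumes boto3 is installed and credentials with read access to the bucket are configured.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    s3.head_object(
        Bucket="mojap-derived-tables",
        Key="prod/run_artefacts/latest/target/manifest.json",
    )
    print("Manifest is present - the DBT job can be switched to a scheduled ingestion.")
except ClientError as err:
    # head_object raises a ClientError (404) while the object does not exist
    print(f"Manifest not available yet: {err.response['Error']['Code']}")
```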
We need to populate the different DataHub instances (Preprod, dev, test) with appropriate data. Preprod and test should contain copies of the data currently in dev.

Metadata that needs importing
In theory we could sink from instance to instance using the datahub source; however, I don't think we can filter this(?)
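If filtering turns out to be the blocker, one alternative sketch is to read from one instance and re-emit to the other with the Python SDK, filtering at read time. This is only an illustration: it assumes a recent acryl-datahub SDK where DataHubGraph.get_urns_by_filter is available, the URLs and tokens are placeholders, and it only copies a single aspect per entity rather than everything.

```python
# Sketch of a filtered instance-to-instance copy, as an alternative to the datahub source.
# Assumes a recent acryl-datahub SDK; server URLs, tokens and the platform filter are
# placeholders, and only one aspect per entity is copied here for illustration.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DatasetPropertiesClass

source = DataHubGraph(
    DatahubClientConfig(server="https://dev-datahub.example/api/gms", token="...")
)
target = DatahubRestEmitter(gms_server="https://test-datahub.example/api/gms", token="...")

# Filter at read time (e.g. only dbt datasets), then re-emit aspect by aspect.
for urn in source.get_urns_by_filter(entity_types=["dataset"], platform="dbt"):
    props = source.get_aspect(entity_urn=urn, aspect_type=DatasetPropertiesClass)
    if props is not None:
        target.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=props))
```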
Definition of Done