Scheduled ingestion of CaDeT assets #123

Closed
seanprivett opened this issue May 15, 2024 · 4 comments

seanprivett commented May 15, 2024

Where and how should this ingestion be scheduled?

DataHub actions API?

Should the ingestion recipes live in GitHub?

Do it live

@seanprivett seanprivett converted this from a draft issue May 15, 2024
@seanprivett seanprivett changed the title Scheduled ingestion of CaDeT assets Spike: scheduled ingestion of CaDeT assets May 16, 2024
@YvanMOJdigital YvanMOJdigital changed the title Spike: scheduled ingestion of CaDeT assets Scheduled ingestion of CaDeT assets May 23, 2024
@MatMoore MatMoore self-assigned this May 23, 2024
@MatMoore MatMoore moved this from Todo to In Progress in Data Catalogue May 23, 2024

MatMoore commented May 23, 2024

We think yes, GitHub Actions is the way to go here. It will serve as a proof of concept for other ingestion types, and it will let us add custom transformers if we need to (e.g. to assign domains in ministryofjustice/find-moj-data#343).

MatMoore commented May 23, 2024

TODO:

  • allow GitHub Actions to authenticate with DataHub (use CATALOGUE_TOKEN)
  • allow GitHub Actions to access dbt outputs in S3 (use aws-actions/configure-aws-credentials with a role)
  • set up a scheduled workflow

MatMoore commented May 28, 2024

To allow GitHub Actions to access this bucket, we need:

  • A policy for reading from the bucket
  • A role assumable by GitHub Actions

An OIDC-assumable role looks something like this:

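# Looks up the current AWS account ID, used in the OIDC provider ARN below.
data "aws_caller_identity" "current" {}

# Trust policy: lets GitHub Actions workflows in this repo assume the role via OIDC.
data "aws_iam_policy_document" "ci_ingestion_role" {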
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/token.actions.githubusercontent.com"]
    }
    condition {
      test     = "StringEquals"
      values   = ["sts.amazonaws.com"]
      variable = "token.actions.githubusercontent.com:aud"
    }
    condition {
      test     = "StringLike"
      values   = ["repo:ministryofjustice/find-moj-data:*"]
      variable = "token.actions.githubusercontent.com:sub"
    }
  }
}

resource "aws_iam_role" "ci_ingestion_role" {
  name               = "ci-ingestion"
  assume_role_policy = data.aws_iam_policy_document.ci_ingestion_role.json
  tags               = local.tags
}

resource "aws_iam_role_policy_attachment" "ci_ingestion_role" {
  policy_arn = ...
  role       = aws_iam_role.ci_ingestion_role.name
}

See https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html#idp_oidc_Create_GitHub
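
The trust policy above refers to an OIDC provider for token.actions.githubusercontent.com, so it only works if that provider is registered in the AWS account. If it is not already there, registering it might look roughly like the sketch below. The resource name `github_actions` and the variable `github_oidc_thumbprints` are placeholders for illustration, not names taken from the analytical-platform repo.

```hcl
# Hypothetical variable: certificate thumbprints for GitHub's OIDC endpoint.
# Some newer AWS provider versions make thumbprint_list optional; check the
# version in use before relying on that.
variable "github_oidc_thumbprints" {
  type        = list(string)
  description = "Thumbprints for token.actions.githubusercontent.com (placeholder; supply real values)."
}

# Registers GitHub's OIDC issuer as an identity provider in this account.
# The audience matches the "aud" condition in the trust policy.
resource "aws_iam_openid_connect_provider" "github_actions" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = var.github_oidc_thumbprints
}
```

An account can hold only one provider per issuer URL, so this is only needed if nothing else in the account has already created it.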

If we want to restrict access to builds of the main branch, we can use repo:ministryofjustice/find-moj-data:ref:refs/heads/main for the subject instead of a wildcard.
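
As a sketch of that tighter variant, the trust policy could pin the subject to the main branch. The data source name `ci_ingestion_role_main_only` is made up for this example, and the block assumes `data.aws_caller_identity.current` is declared elsewhere in the module.

```hcl
# Variant trust policy: only workflows running on refs/heads/main may assume the role.
# Assumes data "aws_caller_identity" "current" {} exists elsewhere in the module.
data "aws_iam_policy_document" "ci_ingestion_role_main_only" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/token.actions.githubusercontent.com"]
    }
    condition {
      test     = "StringEquals"
      values   = ["sts.amazonaws.com"]
      variable = "token.actions.githubusercontent.com:aud"
    }
    # No wildcard in the subject, so an exact match is enough.
    condition {
      test     = "StringEquals"
      values   = ["repo:ministryofjustice/find-moj-data:ref:refs/heads/main"]
      variable = "token.actions.githubusercontent.com:sub"
    }
  }
}
```

Scheduled workflows run against the repository's default branch, so this restriction should still permit a scheduled ingestion as long as main is the default branch and the job does not use a GitHub environment (which changes the subject claim).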

@MatMoore

Policy datahubReadCaDeTBucket is already defined in the analytical-platform repo
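
For readers without access to that repo, a read-only bucket policy generally has the shape sketched below. This is not the actual datahubReadCaDeTBucket definition: the variable `cadet_bucket_name` and the `example_read_cadet_bucket` names are hypothetical, and the real policy may be scoped differently (for example to a key prefix).

```hcl
# Hypothetical variable: name of the bucket holding the CaDeT (dbt) outputs.
variable "cadet_bucket_name" {
  type        = string
  description = "Bucket containing CaDeT outputs (placeholder)."
}

data "aws_iam_policy_document" "example_read_cadet_bucket" {
  # Listing applies to the bucket itself.
  statement {
    sid       = "ListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${var.cadet_bucket_name}"]
  }

  # Reading applies to the objects inside the bucket.
  statement {
    sid       = "GetObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::${var.cadet_bucket_name}/*"]
  }
}

resource "aws_iam_policy" "example_read_cadet_bucket" {
  name   = "example-read-cadet-bucket"
  policy = data.aws_iam_policy_document.example_read_cadet_bucket.json
}
```

The ARN of a policy like this is what the `policy_arn` argument of an `aws_iam_role_policy_attachment` expects.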

MatMoore added a commit to ministryofjustice/analytical-platform that referenced this issue May 28, 2024
We want to schedule Datahub ingestions using github actions.
(ministryofjustice/data-catalogue#123)

To do this, Github actions needs to be able to assume a role via OIDC.
See https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html

This role needs read only access to the bucket that contains CaDeT
outputs.
MatMoore added a commit to ministryofjustice/analytical-platform that referenced this issue May 28, 2024
We want to schedule Datahub DBT ingestions using github actions.
(ministryofjustice/data-catalogue#123)

To do this, Github actions needs to be able to assume a role via OIDC,
and use it to access the s3 bucket containing the outputs from DBT.
See https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html

We already had IRSAs (IAM roles for service accounts) which can be assumed by Datahub itself,
but these assume you are running an application in a kubernetes pod on
AWS, whereas in this case we are going to run the ingestion from github
actions.
MatMoore added a commit to ministryofjustice/analytical-platform that referenced this issue May 29, 2024
Access for data-catalogue github actions

We want to schedule Datahub DBT ingestions using github actions.
(ministryofjustice/data-catalogue#123)

To do this, Github actions needs to be able to assume a role via OIDC,
and use it to access the s3 bucket containing the outputs from DBT.
See https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html

We already had IRSAs (IAM roles for service accounts) which can be assumed by Datahub itself,
but these assume you are running an application in a kubernetes pod on
AWS, whereas in this case we are going to run the ingestion from github
actions.
@MatMoore MatMoore moved this from In Progress to Review in Data Catalogue May 29, 2024
@LavMatt LavMatt moved this from Review to Done in Data Catalogue May 31, 2024
@LavMatt LavMatt closed this as completed by moving to Done in Data Catalogue May 31, 2024