
rialto-airflow


Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from the sul_pub, rialto-orgs, OpenAlex and Dimensions APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford research publications from SUL-Pub, OpenAlex, and Dimensions; enrich them with additional metadata from OpenAlex and Dimensions using the DOI; merge in the organizational data found in rialto-orgs; and publish the data to our JupyterHub environment.

flowchart TD
  sul_pub_harvest(SUL-Pub harvest) --> sul_pub_pubs[/SUL-Pub publications/]
  rialto_orgs_export(Manual RIALTO app export) --> org_data[/Stanford organizational data/]
  org_data --> dimensions_harvest_orcid(Dimensions harvest ORCID)
  org_data --> openalex_harvest_orcid(OpenAlex harvest ORCID)
  dimensions_harvest_orcid --> dimensions_orcid_doi_dict[/Dimensions DOI-ORCID dictionary/]
  openalex_harvest_orcid --> openalex_orcid_doi_dict[/OpenAlex DOI-ORCID dictionary/]
  dimensions_orcid_doi_dict -- DOI --> doi_set(DOI set)
  openalex_orcid_doi_dict -- DOI --> doi_set(DOI set)
  sul_pub_pubs -- DOI --> doi_set(DOI set)
  doi_set --> dois[/All unique DOIs/]
  dois --> dimensions_enrich(Dimensions harvest DOI)
  dois --> openalex_enrich(OpenAlex harvest DOI)
  dimensions_enrich --> dimensions_enriched[/Dimensions publications/]
  openalex_enrich --> openalex_enriched[/OpenAlex publications/]
  dimensions_enriched -- DOI --> merge_pubs(Merge publications)
  openalex_enriched -- DOI --> merge_pubs
  sul_pub_pubs -- DOI --> merge_pubs
  merge_pubs --> all_enriched_publications[/All publications/]
  all_enriched_publications --> join_org_data(Join organizational data)
  org_data --> join_org_data
  join_org_data --> publications_with_org[/Publication with organizational data/]
  publications_with_org -- DOI & SUNET --> contributions(Publications to contributions)
  contributions --> contributions_set[/All contributions/]
  contributions_set --> publish(Publish)

Running Locally with Docker

These instructions are based on the Airflow documentation, Running Airflow in Docker.

  1. Clone the repository: git clone git@github.com:sul-dlss/rialto-airflow.git (cloning with the git-over-SSH URL makes it easier to push changes back than using the HTTPS URL)

  2. Start up Docker locally.

  3. Create a .env file with the AIRFLOW_UID, AIRFLOW_GROUP, and AIRFLOW_VAR_DATA_DIR values. For local development these can usually be:

AIRFLOW_UID=50000
AIRFLOW_GROUP=0
AIRFLOW_VAR_DATA_DIR="data"

(See Airflow docs for more info.)

  4. Add to the .env values for any environment variables used by DAGs. (This is not in place yet; once productionized they will usually be applied to the VMs by Puppet.)

Here is a script to generate content for your dev .env file:

# list the secret names stored for rialto-airflow stage, then fetch each value
# and print it as an AIRFLOW_VAR_* line suitable for .env
for i in `vault kv list -format yaml puppet/application/rialto-airflow/stage | sed 's/- //'` ; do
  # upper-case the secret name to form the environment variable name
  val=$(echo $i | tr '[a-z]' '[A-Z]')
  echo AIRFLOW_VAR_$val=`vault kv get -field=content puppet/application/rialto-airflow/stage/$i`
done
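
The loop prints one line per secret stored in vault, which you can copy (or redirect) into your .env file. The output lines look like this (placeholder name and value, not a real secret):

AIRFLOW_VAR_<SECRET_NAME>=<value>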

Additionally, you may want a value for AIRFLOW_VAR_MAIS_BASE_URL. This is available from the rialto-orgs configuration (either in the base config, or overridden via shared_configs).

  5. The harvest DAG requires a CSV file of authors from rialto-orgs to be available. This is not yet automatically available, so to set up locally, download the file at https://sul-rialto-stage.stanford.edu/authors?action=index&commit=Search&controller=authors&format=csv&orcid_filter=&q=. Put the authors.csv file in the data/ directory.
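
If you prefer the command line, something like this should fetch it (assuming the stage site is reachable from your network and does not require you to be signed in):

curl -L -o data/authors.csv "https://sul-rialto-stage.stanford.edu/authors?action=index&commit=Search&controller=authors&format=csv&orcid_filter=&q="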

  6. Bring up containers.

docker compose up -d

  7. The Airflow application will be available at localhost:8080 and can be accessed with the default Airflow username and password.
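
If the UI doesn't come up right away, the standard Docker Compose commands (nothing specific to this repo) are handy for checking on the containers:

docker compose ps
docker compose logs -f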

Development

Set-up

  1. Install uv for dependency management as described in the uv docs. NOTE: As of Feb 2025, at least one developer has had better luck with dependency management using the uv standalone installer, as opposed to installing using pip or pipx. YMMV of course, but if you run into hard-to-explain pyproject.toml complaints or dependency resolution issues, consider uninstalling the pip-managed uv and installing from the uv installation script.

To add a dependency, e.g. flask:

  1. uv add flask
  2. Then commit pyproject.toml and uv.lock files.

Upgrading dependencies

To upgrade Python dependencies:

uv lock --upgrade

Run Tests

docker compose up -d postgres
uv run pytest

Test coverage reporting

In addition to the terminal display of a summary of the test coverage percentages, you can get a detailed look at which lines are covered or not by opening htmlcov/index.html after running the test suite.
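
For example, on macOS you can open the report with the command below (on Linux, use xdg-open or your browser of choice):

open htmlcov/index.html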

Linting and formatting

  1. Run linting: uv run ruff check
  2. Automatically fix lint: uv run ruff check --fix
  3. Run formatting: uv run ruff format (or uv run ruff format --check to identify any unformatted files, or uv run ruff format --diff to see what would change without applying)

Type Checking

To see if there are any type mismatches:

uv run mypy .

Run all the checks

One line for running the formatting check, the linter, the type checker, and the test suite (failing fast if there are errors):

uv run ruff format --diff . && uv run ruff check && uv run mypy . && uv run pytest

Deployment

First you'll need to build a Docker image and publish it to DockerHub:

DOCKER_DEFAULT_PLATFORM="linux/amd64" docker build . -t suldlss/rialto-airflow:latest
docker push suldlss/rialto-airflow

Deployment to https://sul-rialto-airflow-XXXX.stanford.edu/ is handled like other SDR services using Capistrano. You'll need to have Ruby installed and then:

bundle exec cap stage deploy # stage
bundle exec cap prod deploy  # prod
# Note: there is no QA
