Airflow DAGs and supporting files for running pipelines on Apache Airflow with Amazon Elastic MapReduce (EMR).
These scripts have been tested with Amazon Managed Workflows for Apache Airflow (MWAA) and EMR.
This section describes some of the important DAGs in this project.
Steps to load a single dataset:
- Look up the dataset in the collectory
- Retrieve the details of the DwCA associated with the dataset
- Copy the DwCA to S3 for ingestion
- Determine the file size of the dataset and run the pipelines on either:
  - a single-node cluster for a small dataset, or
  - a multi-node cluster for a large dataset (see the branching sketch after this list)
- Run all pipelines to ingest the dataset, excluding SOLR indexing.
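The size-based branching can be expressed with Airflow's branching support. The following is a minimal sketch, not the project's actual DAG: the bucket name, key layout, dataset UID and 2 GB threshold are illustrative assumptions, and the two pipeline paths are left as placeholders.

```python
# Minimal sketch only: bucket, key layout, dataset UID and threshold are assumptions.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

DWCA_BUCKET = "example-dwca-bucket"       # assumption: bucket the DwCA was copied to
LARGE_DATASET_BYTES = 2 * 1024 ** 3       # assumption: size threshold for "large"


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def load_dataset_sketch():

    @task.branch
    def choose_cluster_size(dataset_uid: str = "dr123") -> str:   # "dr123" is a placeholder UID
        """Pick the pipeline path based on the size of the DwCA staged in S3."""
        dwca = S3Hook().get_key(f"dwca-imports/{dataset_uid}.zip", bucket_name=DWCA_BUCKET)
        if dwca.content_length > LARGE_DATASET_BYTES:
            return "run_multi_node_pipelines"
        return "run_single_node_pipelines"

    # Placeholders: the real DAG would create an EMR cluster of the right size
    # and submit the pipeline steps (everything except SOLR indexing) to it.
    single_node = EmptyOperator(task_id="run_single_node_pipelines")
    multi_node = EmptyOperator(task_id="run_multi_node_pipelines")

    choose_cluster_size() >> [single_node, multi_node]


load_dataset_sketch()
```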
Steps to load all datasets for a data provider:
- Look up the data provider in the collectory
- Retrieve the details of the DwCAs associated with those datasets
- Copy the DwCAs for all of the provider's datasets to S3, ready for ingestion
- Run all pipelines to ingest each dataset, excluding SOLR indexing.
This can be used to load all the datasets associated with an IPT; a sketch of the per-dataset copy step follows.
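A minimal sketch of the per-provider fan-out, assuming an illustrative collectory base URL, response fields, bucket and key layout: the provider's datasets are listed once and each DwCA is copied to S3 by a dynamically mapped task.

```python
# Minimal sketch only: the collectory endpoints, response fields and S3 layout are assumptions.
from __future__ import annotations

import pendulum
import requests
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

COLLECTORY_WS = "https://collections.example.org/ws"   # assumption: collectory base URL
DWCA_BUCKET = "example-dwca-bucket"                     # assumption: ingestion bucket


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def load_provider_sketch():

    @task
    def list_provider_datasets(provider_uid: str = "dp42") -> list[str]:
        """Return the dataset (data resource) UIDs registered for the provider."""
        resp = requests.get(f"{COLLECTORY_WS}/dataProvider/{provider_uid}", timeout=30)
        resp.raise_for_status()
        return [dr["uid"] for dr in resp.json().get("dataResources", [])]

    @task
    def copy_dwca_to_s3(dataset_uid: str) -> str:
        """Download one dataset's DwCA and stage it in S3 ready for ingestion."""
        resp = requests.get(f"{COLLECTORY_WS}/dataResource/{dataset_uid}", timeout=30)
        resp.raise_for_status()
        dwca_url = resp.json()["connectionParameters"]["url"]   # assumption: field names
        local_path = f"/tmp/{dataset_uid}.zip"
        with open(local_path, "wb") as fh:
            fh.write(requests.get(dwca_url, timeout=300).content)
        key = f"dwca-imports/{dataset_uid}.zip"
        S3Hook().load_file(local_path, key=key, bucket_name=DWCA_BUCKET, replace=True)
        return key

    # One mapped task instance per dataset belonging to the provider.
    copy_dwca_to_s3.expand(dataset_uid=list_provider_datasets())


load_provider_sketch()
```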
A DAG used by the Ingest_all_datasets DAG to load large numbers of small datasets using a single-node cluster in EMR.
This will not run SOLR indexing.
Includes the following options:
- load_images - whether to load images for archives
- skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive) and just reprocess
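A minimal sketch of how the two options could be exposed as DAG params and used to decide which stages are submitted to the single-node EMR cluster; the stage names here are assumptions, not the project's real step definitions.

```python
# Minimal sketch only: the stage names are assumptions, not the project's real EMR steps.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.models.param import Param


@dag(
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    params={
        "load_images": Param(False, type="boolean"),
        "skip_dwca_to_verbatim": Param(False, type="boolean"),
    },
)
def ingest_small_datasets_sketch():

    @task
    def plan_pipeline_stages(**context) -> list[str]:
        """Build the ordered list of stages to run, honouring the DAG params."""
        params = context["params"]
        stages = [] if params["skip_dwca_to_verbatim"] else ["dwca-to-verbatim"]
        stages += ["interpretation", "uuid", "index-records"]   # assumption: stage names
        if params["load_images"]:
            stages.append("image-load")
        return stages

    @task
    def submit_to_emr(stages: list[str]) -> None:
        """Placeholder: the real DAG submits each stage as a step on a single-node cluster."""
        print(f"Would submit {len(stages)} EMR steps: {stages}")

    submit_to_emr(plan_pipeline_stages())


ingest_small_datasets_sketch()
```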
A DAG used by the Ingest_all_datasets DAG to load large numbers of large datasets using a multi-node cluster in EMR.
This will not run SOLR indexing.
Includes the following options:
- load_images - whether to load images for archives
- skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive) and just reprocess
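A minimal sketch of creating the multi-node EMR cluster for this path; the cluster name, EMR release, instance types and counts are all illustrative assumptions (the small-dataset DAG above would request a single node instead).

```python
# Minimal sketch only: cluster name, EMR release, instance types and counts are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

MULTI_NODE_JOB_FLOW = {
    "Name": "ingest-large-datasets",          # assumption
    "ReleaseLabel": "emr-6.10.0",             # assumption
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.2xlarge", "InstanceCount": 4},   # multi-node: several core nodes
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG(
    dag_id="ingest_large_datasets_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_multi_node_cluster",
        job_flow_overrides=MULTI_NODE_JOB_FLOW,
    )
```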
Steps for the Ingest_all_datasets DAG:
- Retrieve a list of all available DwCAs in S3
- Run all pipelines to ingest each dataset. To do this it creates:
- Several single node clusters for small datasets
- Several multi-node clusters for large datasets
- A single multi-node cluster for the largest dataset (eBird)
Includes the following options:
- load_images - whether to load images for archives
- skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive) and just reprocess
- run_index - whether to run a complete reindex on completion of ingestion
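A minimal sketch of the partitioning step, assuming illustrative bucket and key names, a 2 GB threshold and a placeholder eBird data resource UID: every staged DwCA is grouped by the kind of cluster it should be ingested on.

```python
# Minimal sketch only: bucket, prefix, threshold and the eBird UID are assumptions.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

DWCA_BUCKET = "example-dwca-bucket"       # assumption
LARGE_DATASET_BYTES = 2 * 1024 ** 3       # assumption: small/large cut-off
EBIRD_UID = "dr-ebird"                    # assumption: placeholder eBird data resource UID


def partition_dwcas() -> dict[str, list[str]]:
    """Group every staged DwCA by the kind of EMR cluster it should be ingested on."""
    hook = S3Hook()
    groups: dict[str, list[str]] = {"small": [], "large": [], "ebird": []}
    for key in hook.list_keys(bucket_name=DWCA_BUCKET, prefix="dwca-imports/"):
        if not key.endswith(".zip"):
            continue
        uid = key.rsplit("/", 1)[-1].removesuffix(".zip")
        if uid == EBIRD_UID:
            groups["ebird"].append(uid)
        elif hook.get_key(key, bucket_name=DWCA_BUCKET).content_length > LARGE_DATASET_BYTES:
            groups["large"].append(uid)
        else:
            groups["small"].append(uid)
    return groups
```

Each group could then be handed to the corresponding small- or large-dataset ingest DAG (for example via TriggerDagRunOperator), with eBird always getting a dedicated multi-node cluster.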
Steps to run the all-dataset analyses and a full SOLR index:
- Run Sampling of environmental and contextual layers
- Run Jackknife environmental outlier detection
- Run Clustering
- Run Expert Distribution outlier detection
- Run SOLR indexing for all datasets
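A minimal sketch of the ordering only: each task below stands in for an EMR step (or group of steps), and the point is the dependency chain, with the all-dataset analyses completing before the full SOLR index is built.

```python
# Minimal sketch of the ordering only: each task stands in for an EMR step or group of steps.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="full_index_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    sampling = EmptyOperator(task_id="sample_environmental_layers")
    jackknife = EmptyOperator(task_id="jackknife_outlier_detection")
    clustering = EmptyOperator(task_id="clustering")
    expert_distributions = EmptyOperator(task_id="expert_distribution_outliers")
    solr_index_all = EmptyOperator(task_id="solr_index_all_datasets")

    # The all-dataset analyses must complete before the full SOLR index is built.
    sampling >> jackknife >> clustering >> expert_distributions >> solr_index_all
```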
Run SOLR indexing for a single dataset into the live index. This does not run the all-dataset processes (Jackknife, Clustering, etc.).
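A minimal sketch of single-dataset indexing, assuming a hypothetical Spark class and jar location, with the target cluster id and dataset UID passed in via the DAG run configuration: one EMR step is added and a sensor waits for it to complete.

```python
# Minimal sketch only: the Spark class, jar location and conf keys are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

INDEX_STEP = [{
    "Name": "solr-index-single-dataset",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--class", "au.org.ala.example.IndexDatasetPipeline",   # assumption: class name
            "s3://example-artifacts/pipelines.jar",                  # assumption: jar location
            "--datasetId={{ dag_run.conf['dataset_uid'] }}",
        ],
    },
}]

with DAG(
    dag_id="index_dataset_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    add_step = EmrAddStepsOperator(
        task_id="add_index_step",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",   # index onto an existing cluster
        steps=INDEX_STEP,
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_index_step",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_index_step')[0] }}",
    )
    add_step >> wait_for_step
```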