
pipelines-airflow

Airflow DAGs and supporting files for running pipelines on Apache Airflow with Elastic Map Reduce.

Installation

These scripts have been tested with Airflow (MWAA) and EMR.


DAGS

This section describes some of the important DAGs in this project.

Loads and ingests a single dataset. Steps:

  • Look up the dataset in the collectory
  • Retrieve the details of the DwCA associated with the dataset
  • Copy the DwCA to S3 for ingestion
  • Determine the file size of the dataset, and run the pipelines on either (see the sketch after this list):
    • a single node cluster for a small dataset
    • a multi node cluster for a large dataset
  • Run all pipelines to ingest the dataset, excluding SOLR indexing.
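The size-based choice of cluster could look roughly like the sketch below. This is a minimal illustration, not the project's actual DAG: the bucket name, S3 key layout, size threshold, and DAG/task ids are all assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Assumed threshold and S3 layout; the real DAGs may use different values.
SMALL_DATASET_BYTES = 500 * 1024 * 1024
BUCKET = "my-ingest-bucket"


def choose_cluster(**context):
    """Pick the downstream task based on the size of the staged DwCA."""
    dataset_id = context["dag_run"].conf["dataset_id"]
    dwca = S3Hook().get_key(f"dwca-imports/{dataset_id}/{dataset_id}.zip", bucket_name=BUCKET)
    return "run_on_single_node" if dwca.content_length < SMALL_DATASET_BYTES else "run_on_multi_node"


with DAG("load_dataset_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    branch = BranchPythonOperator(task_id="choose_cluster", python_callable=choose_cluster)
    # Placeholders: in the real DAGs these would create an EMR cluster and submit the pipeline steps.
    single_node = EmptyOperator(task_id="run_on_single_node")
    multi_node = EmptyOperator(task_id="run_on_multi_node")
    branch >> [single_node, multi_node]
```

Triggering such a DAG with a run configuration like {"dataset_id": "dr123"} would route a small archive down the single node path and a large one down the multi node path.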

load_provider

Steps:

  • Look up the data provider in the collectory
  • Retrieve the details of the DwCAs associated with the provider's datasets
  • Copy the DwCAs to S3 for all datasets for this provider, ready for ingestion (see the sketch after this list)
  • Run all pipelines to ingest each dataset, excluding SOLR indexing

This can be used to load all the datasets associated with an IPT.
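A rough illustration of the staging step follows: look up the provider's datasets and copy each DwCA into S3 ready for ingestion. The collectory endpoint and response shape, the S3 bucket, the key layout, and the local file paths are assumptions, not the project's real configuration.

```python
import requests
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

COLLECTORY_WS = "https://collections.ala.org.au/ws"  # assumed base URL
BUCKET = "my-ingest-bucket"                          # assumed bucket


def get_provider_dataset_uids(provider_uid: str) -> list:
    """Assumed collectory lookup returning the uids of the provider's data resources."""
    resp = requests.get(f"{COLLECTORY_WS}/dataProvider/{provider_uid}", timeout=30)
    resp.raise_for_status()
    return [dr["uid"] for dr in resp.json().get("dataResources", [])]


def stage_provider_dwcas(provider_uid: str) -> None:
    """Copy each dataset's DwCA (already downloaded locally) into S3 for ingestion."""
    hook = S3Hook()
    for uid in get_provider_dataset_uids(provider_uid):
        hook.load_file(
            filename=f"/tmp/dwca/{uid}.zip",      # assumed local copy of the published DwCA
            key=f"dwca-imports/{uid}/{uid}.zip",  # assumed S3 layout
            bucket_name=BUCKET,
            replace=True,
        )
```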

A DAG used by the Ingest_all_datasets DAG to load large numbers of small datasets using a single node cluster in EMR. This will not run SOLR indexing. It includes the following options:

  • load_images - whether to load images for archives
  • skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive), and just reprocess

A DAG used by the Ingest_all_datasets DAG to load large numbers of large datasets using a multi node cluster in EMR. This will not run SOLR indexing. It includes the following options (see the sketch after this list for how they might be passed):

  • load_images - whether to load images for archives
  • skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive), and just reprocess
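Options like these are normally supplied in the DAG run configuration when the DAG is triggered. The sketch below shows one way they might be read inside a task; the stage names are illustrative only, not the pipelines' real stage list.

```python
def plan_stages(**context):
    """Decide which pipeline stages to submit based on the trigger options."""
    conf = context["dag_run"].conf or {}
    load_images = conf.get("load_images", False)
    skip_dwca_to_verbatim = conf.get("skip_dwca_to_verbatim", False)

    stages = []
    if not skip_dwca_to_verbatim:
        stages.append("dwca-to-verbatim")          # expensive; skipped when only reprocessing
    stages += ["interpretation", "uuid", "index"]  # illustrative stage names
    if load_images:
        stages.append("image-load")
    return stages
```

Wired into a DAG as a PythonOperator callable, this could then be driven with, for example, `airflow dags trigger -c '{"load_images": true, "skip_dwca_to_verbatim": true}' <dag_id>`.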

Ingest_all_datasets

Steps:

  • Retrieve a list of all available DwCAs in S3 (see the sketch after this list)
  • Run all pipelines to ingest each dataset. To do this it creates:
    • several single node clusters for small datasets
    • several multi-node clusters for large datasets
    • a single multi-node cluster for the largest dataset (eBird)

Includes the following options:

  • load_images - whether to load images for archives
  • skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive), and just reprocess
  • run_index - whether to run a complete reindex on completion of ingestion
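Fanning the datasets out across clusters implies grouping the staged archives by size first. A minimal sketch, with an assumed bucket, prefix and threshold, and a placeholder uid for the largest dataset:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

SMALL_DATASET_BYTES = 500 * 1024 * 1024  # assumed threshold
LARGEST_DATASET_UID = "drXXX"            # placeholder uid for the largest dataset (eBird)


def partition_datasets(bucket: str = "my-ingest-bucket", prefix: str = "dwca-imports/") -> dict:
    """Group every staged DwCA by size so each group can be sent to a suitable cluster."""
    hook = S3Hook()
    groups = {"small": [], "large": [], "largest": []}
    for key in hook.list_keys(bucket_name=bucket, prefix=prefix):
        if not key.endswith(".zip"):
            continue
        if LARGEST_DATASET_UID in key:
            groups["largest"].append(key)
        elif hook.get_key(key, bucket_name=bucket).content_length < SMALL_DATASET_BYTES:
            groups["small"].append(key)
        else:
            groups["large"].append(key)
    return groups
```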


Runs the processes that operate across all datasets, followed by SOLR indexing (see the sketch after this list). Steps:

  • Run Sampling of environmental and contextual layers
  • Run Jackknife environmental outlier detection
  • Run Clustering
  • Run Expert Distribution outlier detection
  • Run SOLR indexing for all datasets
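These stages run in sequence; one way to express that on an existing EMR cluster is to submit them as ordered EMR steps and wait for the last one. Everything below (the cluster id source, wrapper script names, and DAG id) is an assumption, not the project's actual step definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Illustrative stage names mirroring the list above.
STAGES = ["sampling", "jackknife", "clustering", "expert-distribution", "solr-index-all"]

with DAG("all_datasets_processing_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",  # assumed to be supplied at trigger time
        steps=[
            {
                "Name": stage,
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Hypothetical wrapper scripts; the real pipelines are submitted via spark-submit.
                    "Args": ["bash", "-c", f"/usr/local/bin/run-{stage}.sh"],
                },
            }
            for stage in STAGES
        ],
    )
    # EMR runs steps in order, so waiting on the final (indexing) step is enough.
    wait_for_index = EmrStepSensor(
        task_id="wait_for_solr_index",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[-1] }}",
    )
    add_steps >> wait_for_index
```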

Runs SOLR indexing for a single dataset into the live index. This does not run the all-dataset processes (Jackknife etc.).
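A per-dataset indexing run could be submitted as a single EMR step, with the dataset id taken from the DAG run configuration. The jar path, class name, argument names, and DAG/task ids below are illustrative only, not the pipelines' real CLI.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

with DAG("solr_index_dataset_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    index_dataset = EmrAddStepsOperator(
        task_id="solr_index_dataset",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",  # assumed to be supplied at trigger time
        steps=[{
            "Name": "solr-index-{{ dag_run.conf['dataset_id'] }}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--class", "au.org.ala.pipelines.SolrIndexPipeline",  # illustrative class name
                    "/opt/pipelines/pipelines.jar",                       # illustrative jar path
                    "--datasetId={{ dag_run.conf['dataset_id'] }}",
                ],
            },
        }],
    )
```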
