Airflow DAGs and supporting files for running pipelines on Apache Airflow with Amazon Elastic MapReduce (EMR).
These scripts have been tested with Amazon Managed Workflows for Apache Airflow (MWAA) and EMR.
This section describes some of the important DAGs in this project.
Steps to load a single dataset:
- Look up the dataset in the collectory
- Retrieve the details of the DwCA associated with the dataset
- Copy the DwCA to S3 for ingestion
- Determine the file size of the dataset and run the pipelines on either:
  - a single-node cluster for a small dataset, or
  - a multi-node cluster for a large dataset (see the branching sketch after this list)
- Run all pipelines to ingest the dataset, excluding SOLR indexing.
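The size-based branching can be expressed with Airflow's branching support. The following is a minimal sketch, not the project's actual DAG: the bucket name, key layout, dataset UID and 2 GB threshold are illustrative assumptions, and the two pipeline paths are left as placeholders.

```python
# Minimal sketch only: bucket, key layout, dataset UID and threshold are assumptions.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

DWCA_BUCKET = "example-dwca-bucket"       # assumption: bucket the DwCA was copied to
LARGE_DATASET_BYTES = 2 * 1024 ** 3       # assumption: size threshold for "large"


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def load_dataset_sketch():

    @task.branch
    def choose_cluster_size(dataset_uid: str = "dr123") -> str:   # "dr123" is a placeholder UID
        """Pick the pipeline path based on the size of the DwCA staged in S3."""
        dwca = S3Hook().get_key(f"dwca-imports/{dataset_uid}.zip", bucket_name=DWCA_BUCKET)
        if dwca.content_length > LARGE_DATASET_BYTES:
            return "run_multi_node_pipelines"
        return "run_single_node_pipelines"

    # Placeholders: the real DAG would create an EMR cluster of the right size
    # and submit the pipeline steps (everything except SOLR indexing) to it.
    single_node = EmptyOperator(task_id="run_single_node_pipelines")
    multi_node = EmptyOperator(task_id="run_multi_node_pipelines")

    choose_cluster_size() >> [single_node, multi_node]


load_dataset_sketch()
```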
Steps to load all datasets for a data provider:
- Look up the data provider in the collectory
- Retrieve the details of the DwCAs associated with those datasets
- Copy the DwCAs for all of the provider's datasets to S3, ready for ingestion
- Run all pipelines to ingest each dataset, excluding SOLR indexing.
This can be used to load all the datasets associated with an IPT; a sketch of the per-dataset copy step follows.
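A minimal sketch of the per-provider fan-out, assuming an illustrative collectory base URL, response fields, bucket and key layout: the provider's datasets are listed once and each DwCA is copied to S3 by a dynamically mapped task.

```python
# Minimal sketch only: the collectory endpoints, response fields and S3 layout are assumptions.
from __future__ import annotations

import pendulum
import requests
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

COLLECTORY_WS = "https://collections.example.org/ws"   # assumption: collectory base URL
DWCA_BUCKET = "example-dwca-bucket"                     # assumption: ingestion bucket


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def load_provider_sketch():

    @task
    def list_provider_datasets(provider_uid: str = "dp42") -> list[str]:
        """Return the dataset (data resource) UIDs registered for the provider."""
        resp = requests.get(f"{COLLECTORY_WS}/dataProvider/{provider_uid}", timeout=30)
        resp.raise_for_status()
        return [dr["uid"] for dr in resp.json().get("dataResources", [])]

    @task
    def copy_dwca_to_s3(dataset_uid: str) -> str:
        """Download one dataset's DwCA and stage it in S3 ready for ingestion."""
        resp = requests.get(f"{COLLECTORY_WS}/dataResource/{dataset_uid}", timeout=30)
        resp.raise_for_status()
        dwca_url = resp.json()["connectionParameters"]["url"]   # assumption: field names
        local_path = f"/tmp/{dataset_uid}.zip"
        with open(local_path, "wb") as fh:
            fh.write(requests.get(dwca_url, timeout=300).content)
        key = f"dwca-imports/{dataset_uid}.zip"
        S3Hook().load_file(local_path, key=key, bucket_name=DWCA_BUCKET, replace=True)
        return key

    # One mapped task instance per dataset belonging to the provider.
    copy_dwca_to_s3.expand(dataset_uid=list_provider_datasets())


load_provider_sketch()
```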
A DAG used by the Ingest_all_datasets DAG to load large numbers of small datasets using a single-node cluster in EMR.
This will not run SOLR indexing.
Includes the following options:
- load_images - whether to load images for archives
- skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive) and just reprocess
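A minimal sketch of how the two options could be exposed as DAG params and used to decide which stages are submitted to the single-node EMR cluster; the stage names here are assumptions, not the project's real step definitions.

```python
# Minimal sketch only: the stage names are assumptions, not the project's real EMR steps.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.models.param import Param


@dag(
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    params={
        "load_images": Param(False, type="boolean"),
        "skip_dwca_to_verbatim": Param(False, type="boolean"),
    },
)
def ingest_small_datasets_sketch():

    @task
    def plan_pipeline_stages(**context) -> list[str]:
        """Build the ordered list of stages to run, honouring the DAG params."""
        params = context["params"]
        stages = [] if params["skip_dwca_to_verbatim"] else ["dwca-to-verbatim"]
        stages += ["interpretation", "uuid", "index-records"]   # assumption: stage names
        if params["load_images"]:
            stages.append("image-load")
        return stages

    @task
    def submit_to_emr(stages: list[str]) -> None:
        """Placeholder: the real DAG submits each stage as a step on a single-node cluster."""
        print(f"Would submit {len(stages)} EMR steps: {stages}")

    submit_to_emr(plan_pipeline_stages())


ingest_small_datasets_sketch()
```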
A DAG used by the Ingest_all_datasets DAG to load large numbers of large datasets using a multi-node cluster in EMR.
This will not run SOLR indexing.
Includes the following options:
- load_images - whether to load images for archives
- skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive) and just reprocess
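A minimal sketch of creating the multi-node EMR cluster for this path; the cluster name, EMR release, instance types and counts are all illustrative assumptions (the small-dataset DAG above would request a single node instead).

```python
# Minimal sketch only: cluster name, EMR release, instance types and counts are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

MULTI_NODE_JOB_FLOW = {
    "Name": "ingest-large-datasets",          # assumption
    "ReleaseLabel": "emr-6.10.0",             # assumption
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.2xlarge", "InstanceCount": 4},   # multi-node: several core nodes
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG(
    dag_id="ingest_large_datasets_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_multi_node_cluster",
        job_flow_overrides=MULTI_NODE_JOB_FLOW,
    )
```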
Steps for the Ingest_all_datasets DAG:
- Retrieve a list of all available DwCAs in S3
- Run all pipelines to ingest each dataset. To do this it creates:
- Several single node clusters for small datasets
- Several multi-node clusters for large datasets
- A single multi-node cluster for the largest dataset (eBird)
Includes the following options:
- load_images - whether to load images for archives
- skip_dwca_to_verbatim - skip the DWCA to Verbatim stage (which is expensive) and just reprocess
- run_index - whether to run a complete reindex on completion of ingestion
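A minimal sketch of the partitioning step, assuming illustrative bucket and key names, a 2 GB threshold and a placeholder eBird data resource UID: every staged DwCA is grouped by the kind of cluster it should be ingested on.

```python
# Minimal sketch only: bucket, prefix, threshold and the eBird UID are assumptions.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

DWCA_BUCKET = "example-dwca-bucket"       # assumption
LARGE_DATASET_BYTES = 2 * 1024 ** 3       # assumption: small/large cut-off
EBIRD_UID = "dr-ebird"                    # assumption: placeholder eBird data resource UID


def partition_dwcas() -> dict[str, list[str]]:
    """Group every staged DwCA by the kind of EMR cluster it should be ingested on."""
    hook = S3Hook()
    groups: dict[str, list[str]] = {"small": [], "large": [], "ebird": []}
    for key in hook.list_keys(bucket_name=DWCA_BUCKET, prefix="dwca-imports/"):
        if not key.endswith(".zip"):
            continue
        uid = key.rsplit("/", 1)[-1].removesuffix(".zip")
        if uid == EBIRD_UID:
            groups["ebird"].append(uid)
        elif hook.get_key(key, bucket_name=DWCA_BUCKET).content_length > LARGE_DATASET_BYTES:
            groups["large"].append(uid)
        else:
            groups["small"].append(uid)
    return groups
```

Each group could then be handed to the corresponding small- or large-dataset ingest DAG (for example via TriggerDagRunOperator), with eBird always getting a dedicated multi-node cluster.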
Steps to run the all-dataset analyses and a full SOLR index:
- Run Sampling of environmental and contextual layers
- Run Jackknife environmental outlier detection
- Run Clustering
- Run Expert Distribution outlier detection
- Run SOLR indexing for all datasets
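A minimal sketch of the ordering only: each task below stands in for an EMR step (or group of steps), and the point is the dependency chain, with the all-dataset analyses completing before the full SOLR index is built.

```python
# Minimal sketch of the ordering only: each task stands in for an EMR step or group of steps.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="full_index_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    sampling = EmptyOperator(task_id="sample_environmental_layers")
    jackknife = EmptyOperator(task_id="jackknife_outlier_detection")
    clustering = EmptyOperator(task_id="clustering")
    expert_distributions = EmptyOperator(task_id="expert_distribution_outliers")
    solr_index_all = EmptyOperator(task_id="solr_index_all_datasets")

    # The all-dataset analyses must complete before the full SOLR index is built.
    sampling >> jackknife >> clustering >> expert_distributions >> solr_index_all
```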
Run SOLR indexing for a single dataset into the live index. This does not run the all-dataset processes (Jackknife, Clustering, etc.).
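A minimal sketch of single-dataset indexing, assuming a hypothetical Spark class and jar location, with the target cluster id and dataset UID passed in via the DAG run configuration: one EMR step is added and a sensor waits for it to complete.

```python
# Minimal sketch only: the Spark class, jar location and conf keys are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

INDEX_STEP = [{
    "Name": "solr-index-single-dataset",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--class", "au.org.ala.example.IndexDatasetPipeline",   # assumption: class name
            "s3://example-artifacts/pipelines.jar",                  # assumption: jar location
            "--datasetId={{ dag_run.conf['dataset_uid'] }}",
        ],
    },
}]

with DAG(
    dag_id="index_dataset_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    add_step = EmrAddStepsOperator(
        task_id="add_index_step",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",   # index onto an existing cluster
        steps=INDEX_STEP,
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_index_step",
        job_flow_id="{{ dag_run.conf['cluster_id'] }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_index_step')[0] }}",
    )
    add_step >> wait_for_step
```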