The goal of this project is to expand NASA's Cooperative Open Online Landslide Repository (COOLR) by automatically extracting landslide events from online sources.
Here is an overview of how landslide events are extracted from Reddit:
News articles are first collected from online sources and then passed to a model that extracts the landslide event's properties.
For Reddit specifically, the data extraction pipeline looks like this:
Models are needed to extract the following information from the articles: time, location, casualties, landslide category and landslide trigger.
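To make the extraction targets concrete, each article can be pictured as yielding one record per event, as in the sketch below. This `LandslideEvent` class is purely illustrative and is not the repository's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for the event properties listed above;
# field names are illustrative, not the repository's actual schema.
@dataclass
class LandslideEvent:
    time: Optional[str] = None        # e.g. "2021-07-15"
    location: Optional[str] = None    # e.g. "Kathmandu, Nepal"
    casualties: Optional[int] = None  # number of reported victims
    category: Optional[str] = None    # e.g. "mudslide", "rockfall"
    trigger: Optional[str] = None     # e.g. "rain", "earthquake"
```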
The baseline is a straightforward implementation of the extraction process. It combines an NER model (`ontonotes-large`), linear models and, in some cases, post-processing to extract the most likely event properties:
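As a rough illustration of the NER step, and assuming `ontonotes-large` refers to the Flair `flair/ner-english-ontonotes-large` tagger (an assumption, since the README does not name the library), entity extraction looks roughly like this:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Assumption: "ontonotes-large" is the Flair OntoNotes NER tagger.
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

sentence = Sentence("A landslide near Kathmandu on Tuesday killed three people.")
tagger.predict(sentence)

# OntoNotes tags such as GPE (place), DATE and CARDINAL are the raw
# material for the downstream location / time / casualties extractors.
for entity in sentence.get_spans("ner"):
    print(entity.text, entity.tag)
```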
A more complex implementation of the extraction process uses span-based extraction to extract all the event properties:
The model is used to extract the event information directly:
The multitask model was pre-trained using `squad-v2` "where" and "when" questions and text labels extracted from the NASA COOLR database.
The datasets used to train the model are in the `data/processed` folder, and the following notebooks can be consulted for more details on the process:

- `notebooks/multi-task-learning-QA-squad-data.ipynb`: notebook used to train the multitask model.
- `notebooks/multi-task-dataset-squad.ipynb`: notebook showing how the `squad-v2` pre-training data was extracted.
- `notebooks/multi-task-dataset.ipynb`: notebook showing how the training data was extracted.
```
├── data
├── diagrams
├── docs
├── models
├── notebooks
│   ├── Time\ notebook.ipynb
│   ├── article_detection.ipynb
│   ├── data_preprocessing.ipynb
│   ├── location_extraction.ipynb
│   ├── location_pipeline.ipynb
│   ├── multi-task-dataset-squad.ipynb
│   ├── multi-task-dataset.ipynb
│   └── multi-task-learning-QA-squad-data.ipynb
├── requirements.txt
├── setup.py
├── src
│   ├── config.py
│   ├── data
│   │   ├── articles.py
│   │   ├── data.py
│   │   ├── downloader
│   │   │   └── reddit.py
│   │   └── duplicates.py
│   ├── extraction
│   │   ├── casualties
│   │   │   └── casualties.py
│   │   ├── location
│   │   │   ├── landslide_event_location.py
│   │   │   └── location.py
│   │   └── time
│   │       ├── helpers.py
│   │       ├── landslide_event_time.py
│   │       └── time.py
│   ├── main.py
│   └── models
│       ├── baseline
│       │   ├── baseline.py
│       │   └── ner.py
│       └── multitask
│           └── multitask.py
└── user
    ├── baseline_results.csv
    ├── multitask_results.csv
    ├── config.json
    ├── history.log
    └── run.log
```
The config file can be edited to download articles from a given time interval and extract information from them with the chosen model (a sample config is shown after the list below):
- `default`: if set to `"yes"`, the program will read the `history.log` file and download articles from the latest date recorded in that file up to the current date. At each run, the program records the download dates, so if `default` is left at `"yes"`, it will always download new articles starting from the latest date.
- `now`: if set to `"yes"`, the program will ignore the end date and end at the current date.
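For reference, a `config.json` along these lines would match the options above. Apart from `default` and `now`, the key names and values here (model choice and date range) are assumptions for illustration only, so check the actual file shipped in the `user` folder:

```json
{
  "model": "multitask",
  "start_date": "2023-01-01",
  "end_date": "2023-02-01",
  "default": "yes",
  "now": "no"
}
```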
- Install `git-lfs` and run `git lfs install` in your terminal.
- Clone this repository.
- Install the requirements: `pip install -r requirements.txt`
- Edit the config file.
- Run: `python src/main.py`
If you encounter runtime errors related to PyTorch, such as:

```
AttributeError: partially initialized module 'torch' has no attribute 'UntypedStorage' (most likely due to a circular import). Did you mean: '_UntypedStorage'?
```

try uninstalling PyTorch with `pip uninstall torch` and reinstalling it with `conda install pytorch`, or follow the installation instructions on the PyTorch website.
For M1 Macs, you will need an environment with `python=3.8`, and before installing `requirements.txt` you should run:

```
conda install sentencepiece=0.1.95
conda install gensim
```

Finally, change `fasttext` to `fasttext-wheel` in `requirements.txt`.
If the `git lfs` bandwidth limit is exceeded:

```
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
```

or `git lfs` cannot be installed, the models can be downloaded from the link in the models folder and should simply be placed inside the `models` folder of the repository.
- Install Docker.
- Download `landslides.tar` and the `user` folder from this link to your local directory.
- You can choose the model and start/end date by editing the config file.
- In a terminal, go to the directory where you saved `landslides.tar`.
- Type and run `docker load < landslides.tar`. The terminal should show `Loaded image: landslides:latest`.
- Type and run `docker run -v $(pwd)/user:/user landslides`.
- You can see the final results in the `user` folder.
Thanks!!