The CORD-19 dataset https://www.semanticscholar.org/cord19 is an open research collection provided by the Allen Institute for AI. Its intent is to facilitate natural language processing research by collecting and annotating scholarly articles pertaining to COVID-19. The dataset is updated daily with article metadata and, when available, the full text of articles.
If we consider a use case in which analysis of the CORD-19 dataset needs to be performed on an as-needed basis, whenever updated data is made available, we have at least two operations of importance:
- Acquiring the new data
- Processing the data
These are the minimum, and each operation could be broken down into further steps. The first could be extended to generate a delta between dataset versions; the second could become preprocessing, feature extraction, evaluation, result generation, and so on. For the purposes of this work we focus on just these two steps, with the expectation of expanding upon them for specific use cases.
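As an illustration of the first extension, a delta between two metadata snapshots could be computed with a few lines of pandas. This is only a sketch: it assumes the cord_uid column uniquely identifies papers in metadata.csv, and the snapshot file names here are hypothetical.

```python
# Sketch: find papers added between two CORD-19 metadata snapshots.
# Assumes the `cord_uid` column uniquely identifies papers; verify
# against the metadata.csv release you are using.
import pandas as pd

old = pd.read_csv("metadata_old.csv", usecols=["cord_uid"])
new = pd.read_csv("metadata_new.csv")

added = new[~new["cord_uid"].isin(old["cord_uid"])]
added.to_csv("metadata_delta.csv", index=False)
print(f"{len(added)} new papers since the previous snapshot")
```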
For this type of application, a repeatable and reliable set of operations is crucial, but we don't necessarily need constantly running services for these steps. Instead, we can treat them as on-demand operations, which makes an Argo workflow on OpenShift a fitting solution: OpenShift provides an enterprise-grade means of creating and operating containerized instances of these operations, whilst Argo provides a simple and powerful way to orchestrate and manage the flow of those containers.
Our use case, then, is to automatically generate topic models from an instance of the CORD-19 dataset. After doing initial exploratory data analysis and topic modeling on an instance of the dataset, found in the notebook directory of this repository, we extracted the acquisition and processing steps from that notebook. These became our workflow application steps, and they illustrate the convenience and power of the notebook-to-containerized-application development approach provided by OpenShift.

The containerized applications are provided in data-acqusition and data-processing. Data acquisition fetches the data file from a given source URL; for our purposes we stick with the metadata.csv file, but this could be expanded further. Data processing is then run, generating topic models for all paper abstracts in the metadata file. We could extend this to parse the full text of the individual papers, by traversing the metadata for full-text references, and model topics on the entire corpus; for the sake of the demo, however, we limited the complexity of the operation. Once the topics are computed, they are published to a Kafka topic/broker specified via environment variables. A simple Flask application is run which consumes the data sent over the Kafka topic. As the Flask app is a persistent service rather than an on-demand workflow item, it is not included in the workflow specification. The data processing task could also be altered to publish or store results to another persistent resource.
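To make these steps concrete, here is a minimal sketch of the acquisition and processing logic. This is not the code in data-acqusition or data-processing; it assumes requests for the download, scikit-learn for the topic model, and kafka-python for publishing. The DATA_SOURCE_URL variable name is hypothetical, while KAFKA_BROKERS and KAFKA_TOPIC match the environment variables passed to the listener below.

```python
# Sketch of the acquisition and processing steps: download metadata.csv
# from a source URL, fit a topic model over the paper abstracts, and
# publish the top words per topic to Kafka. Library choices (requests,
# scikit-learn, kafka-python) and DATA_SOURCE_URL are illustrative
# assumptions, not the exact contents of the workflow applications.
import json
import os

import pandas as pd
import requests
from kafka import KafkaProducer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Acquisition: fetch the metadata file from the configured source URL.
response = requests.get(os.environ["DATA_SOURCE_URL"])
response.raise_for_status()
with open("metadata.csv", "wb") as f:
    f.write(response.content)

# Processing: bag-of-words features over the abstracts, then a 10-topic LDA.
abstracts = pd.read_csv("metadata.csv")["abstract"].dropna()
vectorizer = CountVectorizer(stop_words="english", max_features=10000)
doc_term = vectorizer.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(doc_term)

# Collect the top ten words for each topic.
vocab = vectorizer.get_feature_names_out()
topics = [
    {"topic": i, "words": [vocab[j] for j in comp.argsort()[-10:][::-1]]}
    for i, comp in enumerate(lda.components_)
]

# Publish the results to the broker/topic given by environment variables.
producer = KafkaProducer(bootstrap_servers=os.environ["KAFKA_BROKERS"])
producer.send(os.environ["KAFKA_TOPIC"], json.dumps(topics).encode("utf-8"))
producer.flush()
```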
On an OpenShift cluster you are logged into, create a project if you haven't already; for the sake of example we'll call it cord-19:
oc new-project cord-19
For the persistent Kafka service and Flask application, we will first create the Kafka resources within our OpenShift cluster:
oc create -f https://raw.githubusercontent.com/EldritchJS/adversarial_pipeline/master/openshift_templates/strimzi-0.1.0.yaml
Then start a new Kafka application via the strimzi template:
oc new-app strimzi
Now the Flask application can be created. Change KAFKA_TOPIC as desired:
oc new-app centos/python-36-centos7~https://gitlab.com/bones-brigade/flask-kafka-python-listener.git \
-e KAFKA_BROKERS=kafka:9092 \
-e KAFKA_TOPIC=cord-19-nlp \
--name=listener
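The deployed listener is built from the flask-kafka-python-listener repository referenced above; the sketch below is not that application's actual source, just a rough illustration of what a Kafka-consuming Flask listener looks like with kafka-python, using the same KAFKA_BROKERS and KAFKA_TOPIC values.

```python
# Minimal illustration of a Kafka-consuming Flask listener; the deployed
# app comes from the flask-kafka-python-listener repository and this
# sketch is not its actual source.
import os
import threading

from flask import Flask, jsonify
from kafka import KafkaConsumer

app = Flask(__name__)
messages = []

def consume():
    # Read records from the configured topic and keep them in memory.
    consumer = KafkaConsumer(
        os.environ["KAFKA_TOPIC"],
        bootstrap_servers=os.environ["KAFKA_BROKERS"],
    )
    for record in consumer:
        messages.append(record.value.decode("utf-8"))

@app.route("/")
def show_messages():
    # Display everything received over the topic so far.
    return jsonify(messages)

if __name__ == "__main__":
    threading.Thread(target=consume, daemon=True).start()
    app.run(host="0.0.0.0", port=8080)
```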
In order to browse to the Flask app, expose the listener service just created:
oc expose svc/listener
Then load the BuildConfig and ImageStream resources for the acquisition and processing applications via the following:
oc apply -f cord-19-resources.yaml
Finally, head over to your Argo workflow web UI, click Submit New Workflow, and copy and paste the contents of argo-workflow/workflow.yaml into the text area that appears. (Note: change <PROJECT_NAME> to your project's name, and delete all the sample YAML that Argo provides.) Click Submit and watch the workflow run.
You can see the topic modeling results by clicking on the log button for the processing node of the workflow as well as by browsing to the listener service.