The CORD-19 dataset https://www.semanticscholar.org/cord19 is an open research collection provided by the Allen Institute for AI. Its intent is to facilitate natural language processing research by collecting and annotating scholarly articles pertaining to COVID-19. The dataset is updated daily with article metadata and, when available, the full text of articles.
If we consider a use case in which analysis of the CORD-19 dataset needs to be performed on an as-needed basis, whenever updated data is made available, we have at least two operations of importance:
- Acquiring the new data
- Processing the data
These are the minimum, and each operation could be broken down into further steps. The first could be extended to generate a delta between dataset versions; the second could become preprocessing, feature extraction, evaluation, result generation, and so on. For the purposes of this work we focus on just these two steps, with the expectation of expanding upon them for specific use cases.
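As an illustration of the first extension, a delta between two metadata snapshots could be computed with a few lines of pandas. This is only a sketch: it assumes the cord_uid column uniquely identifies papers in metadata.csv, and the snapshot file names here are hypothetical.

```python
# Sketch: find papers added between two CORD-19 metadata snapshots.
# Assumes the `cord_uid` column uniquely identifies papers; verify
# against the metadata.csv release you are using.
import pandas as pd

old = pd.read_csv("metadata_old.csv", usecols=["cord_uid"])
new = pd.read_csv("metadata_new.csv")

added = new[~new["cord_uid"].isin(old["cord_uid"])]
added.to_csv("metadata_delta.csv", index=False)
print(f"{len(added)} new papers since the previous snapshot")
```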
For this type of application, a repeatable and reliable set of operations is crucial, but we don't necessarily need constantly running services for these steps. Instead, we can treat them as on-demand operations, which makes an Argo workflow on OpenShift a fitting solution: OpenShift provides an enterprise-grade means of creating and operating containerized instances of these operations, whilst Argo provides a simple and powerful way to orchestrate and manage the flow of those containers.
Our use case, then, is to automatically generate topic models from an instance of the CORD-19 dataset. After doing initial exploratory data analysis and topic modeling on an instance of the dataset, found in the notebook directory of this repository, we extracted the acquisition and processing steps from that notebook. These became our workflow application steps, and they illustrate the convenience and power of the notebook-to-containerized-application development approach provided by OpenShift.

The containerized applications are provided in data-acqusition and data-processing. Data acquisition fetches the data file from a given source URL; for our purposes we stick with the metadata.csv file, but this could be expanded further. Data processing is then run, generating topic models for all paper abstracts in the metadata file. We could extend this to parse the full text of the individual papers, by traversing the metadata for full-text references, and model topics on the entire corpus; for the sake of the demo, however, we limited the complexity of the operation. Once the topics are computed, they are published to a Kafka topic/broker specified via environment variables. A simple Flask application is run which consumes the data sent over the Kafka topic. As the Flask app is a persistent service rather than an on-demand workflow item, it is not included in the workflow specification. The data processing task could also be altered to publish or store results to another persistent resource.
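To make these steps concrete, here is a minimal sketch of the acquisition and processing logic. This is not the code in data-acqusition or data-processing; it assumes requests for the download, scikit-learn for the topic model, and kafka-python for publishing. The DATA_SOURCE_URL variable name is hypothetical, while KAFKA_BROKERS and KAFKA_TOPIC match the environment variables passed to the listener below.

```python
# Sketch of the acquisition and processing steps: download metadata.csv
# from a source URL, fit a topic model over the paper abstracts, and
# publish the top words per topic to Kafka. Library choices (requests,
# scikit-learn, kafka-python) and DATA_SOURCE_URL are illustrative
# assumptions, not the exact contents of the workflow applications.
import json
import os

import pandas as pd
import requests
from kafka import KafkaProducer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Acquisition: fetch the metadata file from the configured source URL.
response = requests.get(os.environ["DATA_SOURCE_URL"])
response.raise_for_status()
with open("metadata.csv", "wb") as f:
    f.write(response.content)

# Processing: bag-of-words features over the abstracts, then a 10-topic LDA.
abstracts = pd.read_csv("metadata.csv")["abstract"].dropna()
vectorizer = CountVectorizer(stop_words="english", max_features=10000)
doc_term = vectorizer.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(doc_term)

# Collect the top ten words for each topic.
vocab = vectorizer.get_feature_names_out()
topics = [
    {"topic": i, "words": [vocab[j] for j in comp.argsort()[-10:][::-1]]}
    for i, comp in enumerate(lda.components_)
]

# Publish the results to the broker/topic given by environment variables.
producer = KafkaProducer(bootstrap_servers=os.environ["KAFKA_BROKERS"])
producer.send(os.environ["KAFKA_TOPIC"], json.dumps(topics).encode("utf-8"))
producer.flush()
```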
On an OpenShift cluster you are logged into, create a project if you haven't already; for the sake of example we'll call it cord-19:
oc new-project cord-19
For the persistent Kafka service and Flask application, we will first create the Kafka resources within our OpenShift cluster:
oc create -f https://raw.githubusercontent.com/EldritchJS/adversarial_pipeline/master/openshift_templates/strimzi-0.1.0.yaml
Then start a new Kafka application via the strimzi template:
oc new-app strimzi
Now the Flask application can be created. Change KAFKA_TOPIC as desired:
oc new-app centos/python-36-centos7~https://gitlab.com/bones-brigade/flask-kafka-python-listener.git \
-e KAFKA_BROKERS=kafka:9092 \
-e KAFKA_TOPIC=cord-19-nlp \
--name=listener
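The deployed listener is built from the flask-kafka-python-listener repository referenced above; the sketch below is not that application's actual source, just a rough illustration of what a Kafka-consuming Flask listener looks like with kafka-python, using the same KAFKA_BROKERS and KAFKA_TOPIC values.

```python
# Minimal illustration of a Kafka-consuming Flask listener; the deployed
# app comes from the flask-kafka-python-listener repository and this
# sketch is not its actual source.
import os
import threading

from flask import Flask, jsonify
from kafka import KafkaConsumer

app = Flask(__name__)
messages = []

def consume():
    # Read records from the configured topic and keep them in memory.
    consumer = KafkaConsumer(
        os.environ["KAFKA_TOPIC"],
        bootstrap_servers=os.environ["KAFKA_BROKERS"],
    )
    for record in consumer:
        messages.append(record.value.decode("utf-8"))

@app.route("/")
def show_messages():
    # Display everything received over the topic so far.
    return jsonify(messages)

if __name__ == "__main__":
    threading.Thread(target=consume, daemon=True).start()
    app.run(host="0.0.0.0", port=8080)
```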
In order to browse to the Flask app, expose the listener service just created:
oc expose svc/listener
Then load the BuildConfig and ImageStream resources for the acquisition and processing applications via the following:
oc apply -f cord-19-resources.yaml
Finally, head over to your Argo workflow web UI, click Submit New Workflow, and copy and paste the contents of argo-workflow/workflow.yaml into the text area that appears. (Note: change <PROJECT_NAME> to your project's name, and delete all the sample YAML that Argo provides.) Click Submit and watch the workflow run.
You can see the topic modeling results by clicking on the log button for the processing node of the workflow as well as by browsing to the listener service.