corpus-build

This is a small repository for building text corpora from a database of full-text entries. The objective is to provide content from the Norwegian Web Archive for Natural Language Processing (NLP) through the DH-lab at the National Library of Norway.

corpus-build is part of a bigger pipeline for building linguistic corpora from WebARChive (WARC) files.

Prerequisites

To build a new corpus from a database of web archive material, this repository provides functionality for extracting full text from specific news websites that have declared a responsible editor:

responsible-editor-filter.yaml contains the domains that should be filtered upon.
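As a rough illustration of what filtering on a list of domains can look like, the sketch below checks whether a URL's host belongs to a listed domain or one of its subdomains. The layout of responsible-editor-filter.yaml and the matching rule used by main.py are not described here, so the domain set, the function name `host_matches`, and the subdomain rule are all assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical domain list; the real entries live in responsible-editor-filter.yaml.
DOMAINS = {"example.no", "news.example.com"}


def host_matches(url: str, domains: set[str]) -> bool:
    """Return True if the URL's host equals a listed domain or is a subdomain of one.

    Assumption: subdomains count as matches; main.py may apply a different rule.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in domains)


if __name__ == "__main__":
    for url in ("https://www.example.no/article/1", "https://notexample.no/"):
        print(url, host_matches(url, DOMAINS))
```

Matching on a leading dot avoids treating notexample.no as part of example.no, which a plain suffix comparison would do.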

A PostgreSQL database is required with the following tables:

warcinfo - contains metadata about the full text entry
fulltext - contains the actual text

Both tables are linked together using the field fulltext_hash.
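The relationship between the two tables can be sketched as a join on fulltext_hash. The example below uses an in-memory SQLite database purely as a stand-in for PostgreSQL so that it runs without a server. Only the table names and fulltext_hash come from this README; the columns `target_uri` and `content` are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- target_uri and content are hypothetical column names.
    CREATE TABLE warcinfo (fulltext_hash TEXT, target_uri TEXT);
    CREATE TABLE fulltext (fulltext_hash TEXT, content TEXT);
    INSERT INTO warcinfo VALUES ('abc123', 'https://example.no/article');
    INSERT INTO fulltext VALUES ('abc123', 'Body of the article.');
    """
)

# Pair each metadata row with its text through the shared fulltext_hash field.
JOIN_QUERY = """
    SELECT w.target_uri, f.content
    FROM warcinfo AS w
    JOIN fulltext AS f ON f.fulltext_hash = w.fulltext_hash
"""

rows = conn.execute(JOIN_QUERY).fetchall()
print(rows)
```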

Local setup

Build

To build the container image locally, run:

docker build . --tag corpus-build

Run

Caution

Note that the password is passed as plain text on the command line, which is a security risk: it can end up in shell history and process listings.

To run the built container, run:

docker run --interactive --tty --rm corpus-build
# Inside the container run:
source /virtual_environment/bin/activate
python main.py \
    --filter-yaml-file=responsible-editor-filter.yaml \
    --hostname=<host-or-ip-of-database-server> \
    --port=<target-port> \
    --database=<name-of-database-to-connect> \
    --user=<database-username> \
    --password=<plain-text-password-for-user> \
    --output-dir="/build/output"

If the database server is set up as main.py expects, files containing the relevant full text for the domains listed in responsible-editor-filter.yaml will appear in /build/output/.
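For readers who want to see how the flags above fit together, here is a minimal argparse sketch that mirrors them. It is not the parser in main.py, whose implementation is not shown here. The environment variable `CORPUS_DB_PASSWORD`, the default port, and the default output directory are assumptions added for illustration; the environment fallback is one way to keep the password off the command line.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser with the same flag names as the command above."""
    parser = argparse.ArgumentParser(description="Hypothetical mirror of the main.py flags")
    parser.add_argument("--filter-yaml-file", required=True)
    parser.add_argument("--hostname", required=True)
    # Assumption: default to 5432, the standard PostgreSQL port.
    parser.add_argument("--port", type=int, default=5432)
    parser.add_argument("--database", required=True)
    parser.add_argument("--user", required=True)
    # Assumption: fall back to an environment variable when --password is omitted.
    parser.add_argument("--password", default=os.environ.get("CORPUS_DB_PASSWORD"))
    parser.add_argument("--output-dir", default="/build/output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.hostname, args.port, args.database, args.output_dir)
```

With a parser like this, the password could be exported once in the shell session and left out of the python main.py invocation.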

Kubernetes

Configuration

Take a look at kubernetes/ for a template of how to deploy it in your cluster.

You need to configure the namespace field in kubernetes/kustomization.yaml and the proxy fields in kubernetes/secrets.yaml (or delete that file entirely if you do not need it).

Deploy

To deploy the pod, run:

kubectl apply --kustomize kubernetes/

Run

When the pod has been created successfully, run:

kubectl exec --stdin --tty corpus-build -- bash
# Inside the container run:
source /virtual_environment/bin/activate
python main.py \
    --filter-yaml-file=responsible-editor-filter.yaml \
    --hostname=<host-or-ip-of-database-server> \
    --port=<target-port> \
    --database=<name-of-database-to-connect> \
    --user=<database-username> \
    --password=<plain-text-password-for-user> \
    --output-dir="/build/output"

To export the output to your local machine, run:

kubectl exec corpus-build -- tar cf - /build/output | tar xf - -C .
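The export command streams a tar archive out of the pod and unpacks it locally. The sketch below reproduces that pack-and-unpack round trip with Python's tarfile module on local temporary directories, without involving kubectl. The helper names `pack` and `unpack` and the sample file are hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def pack(directory: Path) -> bytes:
    """Pack a directory into an in-memory tar stream (the `tar cf -` side)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        # Store members under a relative name, as tar does for /build/output.
        archive.add(directory, arcname="build/output")
    return buffer.getvalue()


def unpack(stream: bytes, destination: Path) -> None:
    """Unpack a tar stream into a directory (the `tar xf - -C .` side)."""
    with tarfile.open(fileobj=io.BytesIO(stream), mode="r") as archive:
        archive.extractall(destination)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source"
        source.mkdir()
        (source / "example.txt").write_text("full text", encoding="utf-8")
        target = Path(tmp) / "target"
        target.mkdir()
        unpack(pack(source), target)
        print(sorted(p.relative_to(target).as_posix() for p in target.rglob("*.txt")))
```

tar implementations typically strip the leading / from member names, so with the command above the files are expected to land under ./build/output in the current directory.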
