This repository contains R code and a Docker image definition that facilitate pulling the CDC's NHANES data files into a SQL DBMS. It is inspired by the this project, retaining its design but using a different implentation.
The image is derived from rocker/tidyverse, which is an Ubuntu GNU/Linux image containing R and RStudio Server, and adds a PostgreSQL database containing data from NHANES.
To run the current pre-built image, install docker and run
docker run --rm --name nhanes-pg -d \
-p 8787:8787 \
-p 2200:22 \
-p 5432:5432 \
-e 'CONTAINER_USER_USERNAME=test' \
-e 'CONTAINER_USER_PASSWORD=test' \
deepayansarkar/nhanes-postgresql:0.11.0
To map a local directory so that it becomes accessible within the container, add
-v <LOCAL_DIRECTORY>:/HostData \
On Windows, run docker desktop, open a terminal, and execute the same command without any line breaks. The following omits a couple of port forwarding flags which are unlikely to be useful (see below).
docker run --rm --name nhanes-pg -d -p 8787:8787 -e 'CONTAINER_USER_USERNAME=test' -e 'CONTAINER_USER_PASSWORD=test' deepayansarkar/nhanes-postgresql:0.11.0
This is the easiest way to get started. The various -p
flags exposes
useful ports in the container:
-
Port 8787 is used for access to the RStudio Server HTTP server (usually through a web browser using the address http://localhost:8787). This is the simplest way to use the container, as it will give access to an RStudio setup (with username
test
and passwordtest
, customizable in the call above), which will have thenhanesA
pre-installed and pre-configured to use the database. -
Port 22 can be used to access the container via SSH. If needed, it is usually mapped to a different port (2200 in the incantation above) as the host machine is likely running its own SSH server on port 22.
-
The PostgreSQL server running in the container can be accessed through port 5432. Mapping it to a port on the host machine allows direct access to the database over the local network, which can be useful in multi-user environments.
In principle, we can build the docker image using the following, but
for the recommended approach, see the version using time
below.
docker build --progress plain --shm-size=1024M --platform=linux/amd64 --tag nhanes-postgres -f Container/Dockerfile .
In principle, one could also build on ARM64 (typically on an Apple Silicon machine) using the following,
docker build --progress plain --shm-size=1024M --platform=linux/arm64 --tag nhanes-postgres -f Container/Dockerfile .
but this is not currently possible due to limitations of the parent image (rocker/tidyverse).
A major part of the build process is to collect NHANES data (specifically, the raw SAS format tables, the HTML docs, and the codebook other metadata extracted from these docs) and insert them into a database. Although these could be obtained directly from the NHANES website maintained by CDC, we prefer to use an intermediate snapshot with a well-defined collection date. This avoids any ambiguity regarding which datasets (and which versions thereof) are included in the docker image.
From the perspective of the build process, the contents of this snapshot are expected to be available through a web server. This serves two purposes: (a) In principle, the docker build can work without external dependencies as long as a reliable internet connection is available. (b) Accessing web resources is easy during the docker build process, whereas accessing local files in the build machine is not as easy.
However, depending on web access has the disadvantage that it slows
down the build process. To avoid this, we can check out the snapshot
repository and serve it locally by running an appropriate local
server. One implementation, using the R httpuv
package, is available
here.
The code below currently assumes that a local server has been initiated using this code.
FIXME: Allow this to be controlled via an environment variable.
Before starting, make sure to edit the files COLLECTION_DATE
and
CONTAINER_VERSION
if necessary. The COLLECTION_DATE
should be a
copy of the corresponding file in the snapshot being used to populate
the database.
It may be useful to redirect the console output to a log file, as it will contain informative messages about failures. This may be done using something like
export CDATE=`cat COLLECTION_DATE`
export CVERSION=`cat CONTAINER_VERSION`
time docker build --progress plain --shm-size=2048M --platform=linux/amd64 --tag nhanes-postgres --build-arg COLLECTION_DATE=${CDATE} --build-arg CONTAINER_VERSION=${CVERSION} -f Container/Dockerfile . &> build.log
To upload to docker hub:
echo ${CVERSION}
docker tag nhanes-postgres <user>/nhanes-postgresql:${CVERSION}
docker push <user>/nhanes-postgresql:${CVERSION}
Run the locally built image as before, replacing the image name:
docker run --rm --name nhanes-pg -d \
-p 8787:8787 \
-p 2200:22 \
-p 5432:5432 \
-e 'CONTAINER_USER_USERNAME=test' \
-e 'CONTAINER_USER_PASSWORD=test' \
nhanes-postgres
As before, to map a local directory, add
-v <LOCAL_DIRECTORY>:/HostData \
-
Based on rocker/tidyverse:4.3.3, which gives us Ubuntu 22.04 LTS along with RStudio Server and DBI packages (and of course tidyverse, which we do not use)
-
All database insertions are done through
DBI
. All data are downloaded directly from GitHub "raw" URLs. No data is saved in files locally. -
By default, raw data is obtained from https://github.com/ccb-hms/nhanes-data, and not directly from the CDC website. The source is customizable in
Dockerfile
, which may be useful if one wants to avoid build-time downloads by making a a local clone of the repository available via a (local) server such as httpuv. -
Codebook data is obtained from https://github.com/ccb-hms/NHANES-metadata (ontology tables are not loaded yet)
-
There are a handful of failures, which are not yet tracked properly, but some information is saved inside the
/status/
directory inside the image, as well as in thebuild.log
file (see build instructions above).
The DB schema and other interface details are largely similar to the SQL Server implementation. However, Postgres has one big difference: unquoted table and column names in SQL statements are automatically converted to lowercase, which means we need to change how queries are constructed.
This is mainly an issue with the hard-coded SQL queries in
nhanesA
. Work on this is being done in the postgres-backend
branch
of https://github.com/cjendres1/nhanes.
Use RPostgres
to establish connections, with port 5432 by default. For example:
DBI::dbConnect(RPostgres::Postgres(),
dbname = "NhanesLandingZone",
host = "localhost",
port = 5432L,
password = "NHAN35",
user = "sa")
There is no need to install separate drivers, as in the case of Microsoft SQL Server.
See https://www.postgresql.org/docs/current/populate.html.
We follow two of these recommendations explicitly:
-
In lieu of bulk insert, we use RPostgres::dbWriteTable(), which uses
COPY
by default. -
Primary keys are added after all tables have been created.
We should explore others as well.
See https://www.pgadmin.org/download/
- Not sure if there is anything analogous to SQL Server's
DBCC SHRINKFILE
or MariaDB'sFLUSH BINARY LOG
. The one log file produced invar/log/postgresql
is fairly small.