This repository has been archived by the owner on Feb 24, 2022. It is now read-only.

Code to parse and clean the CDC's Ambulatory Health Care Data (AHCD) (NAMCS and NHAMCS)


humandx/hdx-data-extraction-ahcd

hdx-data-extraction-ahcd


Code to convert the CDC's Ambulatory Health Care Data (AHCD) (NAMCS and NHAMCS) into human-readable form. See https://www.cdc.gov/nchs/ahcd/about_ahcd.htm.

  • NAMCS

    The National Ambulatory Medical Care Survey (NAMCS) is a national survey designed to meet the need for objective, reliable information about the provision and use of ambulatory medical care services in the United States. Findings are based on a sample of visits to nonfederally employed office-based physicians who are primarily engaged in direct patient care.
  • NHAMCS

    The National Hospital Ambulatory Medical Care Survey (NHAMCS) is designed to collect data on the utilization and provision of ambulatory care services in hospital emergency and outpatient departments, and in ambulatory surgery centers.

Code Structure

hdx_ahcd serves as the base directory.

hdx_ahcd
├── api.py
├── controllers
│   ├── __init__.py
│   ├── namcs_converter.py
│   ├── namcs_extractor.py
│   └── namcs_processors.py
├── helpers
│   ├── functions.py
│   └── __init__.py
├── __init__.py
├── mappers
│   ├── functions.py
│   ├── __init__.py
│   └── years.py
├── namcs
│   ├── config.py
│   ├── constants.py
│   ├── enums.py
│   └── __init__.py
├── scripts
│   ├── __init__.py
│   └── namcs_validators.py
└── utils
    ├── context.py
    ├── decorators.py
    ├── exceptions.py
    ├── __init__.py
    └── utils.py
namcs_test.py
  • api - API to process NAMCS dataset file(s).
  • controllers
    • namcs_extractor - Download and extract public NAMCS data.
    • namcs_converter - Process and convert NAMCS data into human-readable form.
    • namcs_processors - Provide common entry point for execution.
  • helpers - Various methods for manipulating the dataset and its details.
  • mappers
    • functions - Methods to translate raw data from the dataset into human-readable format.
    • years - Year-wise NAMCS details such as fields, field locations, and field lengths.
  • namcs - Contains configurable parameters and constants.
  • scripts
    • namcs_validators - Validation of the dataset and of the parameters provided when invoking the namcs_processors script.
  • utils - Contains useful decorators, context managers etc.
  • namcs_test - Script to run regression tests for all NAMCS years (for development use only).

Supported fields


  • date_of_visit - Patient date of visit.
  • date_of_birth - Patient date of birth.
  • year_of_visit - Patient year of visit.
  • year_of_birth - Patient year of birth.
  • month_of_visit - Patient month of visit.
  • month_of_birth - Patient month of birth.
  • patient_age - Patient age in days.
  • gender - Patient gender.
  • physician_diagnoses - ICD-9-CM codes (International Classification of Diseases, 9th Revision, Clinical Modification) for diagnostic information.
  • visit_weight - The "patient visit weight" is a vital component in the process of producing national estimates from sample data, and its use should be clearly understood by all micro-data file users. The statistics contained on the micro-data file reflect data concerning only a sample of patient visits, not a complete count of all the visits that occurred in the United States. Each record on the data file represents one visit in the sample of 27,369 visits. In order to obtain national estimates from the sample, each record is assigned an inflation factor called the "patient visit weight."
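
As a minimal sketch of how the visit weight is used, a national estimate can be obtained by summing the weights of the records of interest. The function name and the two sample records below are illustrative assumptions; the `patient_visit_weight` and `sex` keys follow the sample output shown in the Usage section.

```python
from typing import Dict, Iterable


def estimate_national_visits(records: Iterable[Dict]) -> float:
    """Sum the per-record visit weights to estimate a national visit count."""
    return sum(float(record["patient_visit_weight"]) for record in records)


# Illustrative records only; real records come from the per-year generator.
sample_records = [
    {"sex": "Female", "patient_visit_weight": 13479.0},
    {"sex": "Male", "patient_visit_weight": 3722.0},
]

# Estimate for all sampled visits, then for a subgroup of them.
total = estimate_national_visits(sample_records)
female = estimate_national_visits(
    record for record in sample_records if record["sex"] == "Female"
)
print(total, female)
```

An unweighted count of records only describes the sample, so any subgroup estimate should sum weights in the same way.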

Installation


The currently supported Python version is 3.6.x. To check your Python version:

python --version
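
The same check can be made from inside Python. This is only a sketch: the helper name `is_supported_python` is hypothetical and not part of hdx_ahcd.

```python
import sys


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is a 3.6.x release."""
    return tuple(version_info[:2]) == (3, 6)


# Reports whether the running interpreter matches the supported version.
print(is_supported_python())
```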

Ensure pip, setuptools, and wheel are up to date

python -m pip install --upgrade pip setuptools wheel

If you have a local copy of this repo, you can install directly from it:

pip install ${PATH_FOR_hdx-data-extraction-ahcd_REPO}

Alternatively, you can run the setup file:

python3 ${PATH_FOR_hdx-data-extraction-ahcd_REPO}/setup.py install

For example:

pip install /var/tmp/hdx-data-extraction-ahcd/

or

python3 /var/tmp/hdx-data-extraction-ahcd/setup.py install

You can also install the package by name with pip:

pip install hdx_ahcd

Usage


>>> import hdx_ahcd
>>> from hdx_ahcd.api import get_cleaned_data_by_year
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)

Case 1: Download NAMCS data for a single year (say, 1973). If the file is already present in downloaded_files, the existing file is processed instead of being downloaded again.

>>> gen = get_cleaned_data_by_year(year=1973)
INFO:hdx_ahcd:Downloading file:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe for year:1973
>>> pp.pprint(gen)
defaultdict(<class 'dict'>,
        {   1973: {   'generator': <generator object get_generator_by_year at 0x7fe4b6480150>,
                      'source_file_info': {
                      'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe',
                      'year': 1973,
                      'zip_file_name': 'namcs73.exe'}}})

Case 2: Download NAMCS data for multiple years (say, 1973 and 1975). If a file is already present in downloaded_files, the existing file is processed instead of being downloaded again.

>>> gen = get_cleaned_data_by_year(year=(1973, 1975))
INFO:hdx_ahcd:Downloading file:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs75.exe for year:1975
>>> pp.pprint(gen)
defaultdict(<class 'dict'>,
            {   1973: {   'generator': <generator object get_generator_by_year at 0x7fe4b5fa9e08>,
                          'source_file_info': {
                          'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe',
                          'year': 1973,
                          'zip_file_name': 'namcs73.exe'}},
                1975: {   'generator': <generator object get_generator_by_year at 0x7fe4b45e7e60>,
                          'source_file_info': {
                          'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs75.exe',
                          'year': 1975,
                          'zip_file_name': 'namcs75.exe'}}})

Iterating over the generator for each year:

>>> pp.pprint(next(gen.get(1973).get('generator')))
{   'age': 22889.0,
    'month_of_visit': 6,
    'patient_visit_weight': 13479.0,
    'physician_diagnoses': ['470.0', 'V03.2'],
    'sex': 'Female',
    'source_file_ID': '1973_NAMCS',
    'source_file_row': 1,
    'year_of_visit': 1973}
>>> pp.pprint(next(gen.get(1975).get('generator')))
{   'age': 14610.0,
    'month_of_visit': 4,
    'patient_visit_weight': 3722.0,
    'physician_diagnoses': ['492.0'],
    'sex': 'Male',
    'source_file_ID': '1975_NAMCS',
    'source_file_row': 1,
    'year_of_visit': 1975}
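
Because each year maps to a generator, records can be consumed lazily rather than loaded all at once. The sketch below takes the first few records with `itertools.islice` and converts the age from days to years. The stand-in generator `sample_generator` and the helper `age_in_years` are hypothetical, and dividing by 365.25 is an assumption about how to convert days back to years, not something taken from the hdx_ahcd code.

```python
from itertools import islice

DAYS_PER_YEAR = 365.25  # Assumption: average year length including leap years.


def age_in_years(record):
    """Convert the 'age' field (in days) to years."""
    return record["age"] / DAYS_PER_YEAR


def sample_generator():
    """Stand-in for gen.get(1973).get('generator'); yields sample-style records."""
    yield {"age": 22889.0, "sex": "Female", "physician_diagnoses": ["470.0", "V03.2"]}
    yield {"age": 14610.0, "sex": "Male", "physician_diagnoses": ["492.0"]}


# Take at most two records without exhausting the rest of the generator.
first_two = list(islice(sample_generator(), 2))
ages = [round(age_in_years(record), 1) for record in first_two]
print(ages)
```

With real data, replace `sample_generator()` with the generator returned for the year of interest. A generator can be consumed only once, so call `get_cleaned_data_by_year` again if a second pass is needed.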

Case 3: Force a fresh download, then process the NAMCS data for multiple years (say, 1973 and 1975).

>>> gen = get_cleaned_data_by_year(year=(1973, 1975), force_download=True)
INFO:hdx_ahcd:Downloading file:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe for year:1973
INFO:hdx_ahcd:Downloading file:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs75.exe for year:1975
>>> pp.pprint(gen)
defaultdict(<class 'dict'>,
            {   1973: {   'generator': <generator object get_generator_by_year at 0x7fe4b5fa9e08>,
                          'source_file_info': {
                          'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe',
                          'year': 1973,
                          'zip_file_name': 'namcs73.exe'}},
                1975: {   'generator': <generator object get_generator_by_year at 0x7fe4b45e7e60>,
                          'source_file_info': {
                          'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs75.exe',
                          'year': 1975,
                          'zip_file_name': 'namcs75.exe'}}})

Case 4: Force a fresh download, then export the processed NAMCS data for multiple years (say, 1973 and 1975) to separate CSV files.

>>> gen = get_cleaned_data_by_year(year=(1973, 1975), force_download=True, do_export=True)
INFO:hdx_ahcd:Downloading file:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe for year:1973
INFO:hdx_ahcd:Finished writing to the file /home/velotio/.hdx_ahcd/data/1973_NAMCS_CONVERTED.csv
INFO:hdx_ahcd:Downloading file:ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs75.exe for year:1975
INFO:hdx_ahcd:Finished writing to the file /home/velotio/.hdx_ahcd/data/1975_NAMCS_CONVERTED.csv
>>> pp.pprint(gen)
defaultdict(<class 'dict'>,
        {   1973: {   'file_name': '/home/velotio/.hdx_ahcd/data/1973_NAMCS_CONVERTED.csv',
                      'generator': <generator object get_generator_by_year at 0x7fe4b6480150>,
                      'source_file_info': {
                      'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs73.exe',
                      'year': 1973,
                      'zip_file_name': 'namcs73.exe'}},
            1975: {   'file_name': '/home/velotio/.hdx_ahcd/data/1975_NAMCS_CONVERTED.csv',
                      'generator': <generator object get_generator_by_year at 0x7fe4b45e7e60>,
                      'source_file_info': {
                      'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/namcs75.exe',
                      'year': 1975,
                      'zip_file_name': 'namcs75.exe'}}})
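
The exported file path is available under the 'file_name' key for each year, so the CSV can be read back with the standard library. The sketch below assumes the exported CSV starts with a header row; that layout is an assumption, not something confirmed here, and `read_converted_csv` is a hypothetical helper. A temporary demo file stands in for the real export.

```python
import csv
import os
import tempfile


def read_converted_csv(path):
    """Read a converted CSV into a list of dict rows (assumes a header row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Build a tiny demo file in place of e.g. gen[1973]['file_name'].
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "1973_NAMCS_CONVERTED.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["year_of_visit", "sex"])
        writer.writeheader()
        writer.writerow({"year_of_visit": 1973, "sex": "Female"})
    rows = read_converted_csv(demo_path)

print(rows)
```

Note that `csv.DictReader` returns every value as a string, so numeric fields need explicit conversion after reading.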

Case 5: Process a NAMCS data set file that you provide. The file name is assumed to follow the "YEAR_NAMCS" format, for example 2015_NAMCS.

>>> gen = get_cleaned_data_by_year(file_name="/var/tmp/2015_NAMCS")
>>> pp.pprint(gen)
defaultdict(<class 'dict'>,
            {   2015: {   'generator': <generator object get_generator_by_year at 0x7f7ba17acf68>,
                          'source_file_info': {   'url': 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NAMCS/namcs2015.zip',
                                                  'year': 2015,
                                                  'zip_file_name': 'namcs2015.zip'}}})
>>> pp.pprint(next(gen.get(2015).get("generator")))
{   'age': 23725.0,
    'month_of_visit': 10,
    'patient_visit_weight': 414200.0481,
    'physician_diagnoses': ['723.10', '719.41', '729.50', 'V50.80', 'V00.009'],
    'sex': 'Female',
    'source_file_ID': '2015_NAMCS',
    'source_file_row': 1,
    'year_of_visit': 2015}
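
To illustrate the "YEAR_NAMCS" naming convention, the sketch below extracts the year from such a file name. It is not the library's own parsing logic: `year_from_file_name` is a hypothetical helper, and the strict four-digit pattern is an assumption.

```python
import os
import re


def year_from_file_name(file_name):
    """Return the year encoded in a 'YEAR_NAMCS' file name, e.g. 2015_NAMCS."""
    base_name = os.path.basename(file_name)
    match = re.fullmatch(r"(\d{4})_NAMCS", base_name)
    if match is None:
        raise ValueError(
            "Expected a file name like '2015_NAMCS', got: {}".format(base_name)
        )
    return int(match.group(1))


print(year_from_file_name("/var/tmp/2015_NAMCS"))
```

Checking the name up front gives a clear error before any processing starts if a file has been renamed.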

Uninstall


To uninstall, use either

easy_install -m hdx_ahcd

or

pip uninstall hdx_ahcd

Scope


  • Support for NHAMCS data set to be added in subsequent releases.
  • The following years are unsupported because their data sets are missing from the CDC server:
    • 1974, 1982, 1983, 1984, 1986, 1987, 1988, 1991
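
When requesting a range of years, it may help to drop the unsupported ones first. The following is a minimal sketch: the constant and function names are hypothetical and not part of hdx_ahcd, and years outside the list are passed through without checking whether they are actually available.

```python
# Years listed above as unsupported because the CDC data sets are missing.
UNSUPPORTED_YEARS = frozenset({1974, 1982, 1983, 1984, 1986, 1987, 1988, 1991})


def drop_unsupported_years(years):
    """Split years into (kept, skipped) based on the unsupported list."""
    years = list(years)
    kept = [year for year in years if year not in UNSUPPORTED_YEARS]
    skipped = [year for year in years if year in UNSUPPORTED_YEARS]
    return kept, skipped


kept, skipped = drop_unsupported_years(range(1973, 1977))
print(kept, skipped)
```

The `kept` list can then be converted to a tuple and passed as the `year` argument, in the same way as the multi-year cases in the Usage section.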
