Awk CSV parser

AWK and Bash code to easily parse CSV files, with possibly embedded commas and quotes.

Features

  • Parse CSV files with only Bash and Awk.
  • Lets you process CSV data with standard UNIX shell commands.
  • Properly handle CSV data that contain field separators (commas by default) and field enclosures (double quotes by default) inside enclosed data fields.
  • Process CSVs from stdin pipe as well as from multiple command line file arguments.
  • Handle any character both for field separator and field enclosure.
  • Can rewrite CSV records with a multi-character output field separator, with CSV enclosure characters removed and escaped enclosures unescaped.
  • Records need not all contain the same number of fields.

Known limitations

  • Does not yet handle embedded newlines inside data fields.
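
Since embedded newlines are not handled, it can be useful to check a file before parsing it. The sketch below is not part of awk-csv-parser: it assumes the default double-quote enclosure, and the function name find_odd_quote_lines is made up for illustration. It prints the number of every physical line that holds an odd number of double quotes, which suggests either a quoted field continuing on the next line or a malformed record.

```shell
# Hypothetical helper (not shipped with awk-csv-parser).
# Reads the given files, or stdin when no file is given.
find_odd_quote_lines() {
    awk '{
        line = $0
        # gsub() returns the number of substitutions, i.e. the number of quotes.
        # Escaped quotes ("") count twice, so they do not change the parity.
        if (gsub(/"/, "", line) % 2 == 1) {
            print NR
        }
    }' "$@"
}

# Lines 1 and 2 together form one record with an embedded newline.
printf '%s\n' 'a,"b' 'c",d' 'e,"f ""g""",h' | find_odd_quote_lines
```

An empty result means every line has balanced enclosures; it does not prove the file is otherwise valid.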

Links

Other Awk implementations:

Requirements

  • Bash v4 (2009) and above
  • GNU Awk 3.1+

Tested on Debian/Ubuntu Linux.

Usage

Displayed by:

$ awk-csv-parser.sh --help

Text version
Description
    AWK and Bash code to easily parse CSV files, with possibly embedded commas and quotes.

Usage
    awk-csv-parser.sh [OPTION]… [<CSV-file>]…

Options
    -e <character>, --enclosure=<character>
        Set the CSV field enclosure. One character only, '"' (double quote) by default.

    -o <string>, --output-separator=<string>
        Set the output field separator. Multiple characters allowed, '|' (pipe) by default.

    -s <character>, --separator=<character>
        Set the CSV field separator. One character only, ',' (comma) by default.

    -h, --help
        Display this help.

    <CSV-file>
        CSV file to parse.

Discussion
    – The last record in the file may or may not have an ending line break.
    – Each line may not contain the same number of fields throughout the file.
    – The last field in the record must not be followed by a field separator.
    – Fields containing field enclosures or field separators must be enclosed in field
      enclosure.
    – A field enclosure appearing inside a field must be escaped by preceding it with
      another field enclosure. Example: "aaa","b""bb","ccc"

Examples
    Parse a CSV and display records without field enclosure, fields pipe-separated:
        awk-csv-parser.sh --output-separator='|' resources/iso_3166-1.csv

    Remove CSV's header before parsing:
        tail -n+2 resources/iso_3166-1.csv | awk-csv-parser.sh

    Keep only first column of multiple files:
        awk-csv-parser.sh a.csv b.csv c.csv | cut -d'|' -f1

    Keep only first column, using multiple UTF-8 characters output separator:
        awk-csv-parser.sh -o '⇒⇒' resources/iso_3166-1.csv | awk -F '⇒⇒' '{print $1}'

    You can directly call the Awk script:
        awk -f csv-parser.awk -v separator=',' -v enclosure='"' --source '{
            csv_parse_record($0, separator, enclosure, csv)
            print csv[2] " ⇒ " csv[0]
        }' resources/iso_3166-1.csv
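
The enclosure rules in the Discussion above (enclose a field that contains a separator or an enclosure, and double every enclosure inside it) can also be applied in the other direction, when preparing input for the parser. The following is a minimal sketch of that rule, not code from this project; csv_quote_field and the CSV_FIELD environment variable are hypothetical names. The field is passed through the environment rather than with -v so that awk does not interpret backslashes in it.

```shell
# Hypothetical helper: csv_quote_field <field> [separator] [enclosure]
# Prints the field, enclosed and escaped only when the rules require it.
csv_quote_field() {
    CSV_FIELD=$1 awk -v sep="${2:-,}" -v enc="${3:-\"}" 'BEGIN {
        field = ENVIRON["CSV_FIELD"]
        if (index(field, sep) > 0 || index(field, enc) > 0) {
            escaped = ""
            n = length(field)
            for (i = 1; i <= n; i++) {
                c = substr(field, i, 1)
                # Double each enclosure character, copy everything else as is.
                if (c == enc) {
                    escaped = escaped enc enc
                } else {
                    escaped = escaped c
                }
            }
            field = enc escaped enc
        }
        print field
    }'
}

# Build a record from three raw values.
printf '%s,%s,%s\n' \
    "$(csv_quote_field 'aaa')" \
    "$(csv_quote_field 'b"bb')" \
    "$(csv_quote_field 'c,cc')"
```

Fields that need no enclosure are left bare, which the parser accepts according to the same Discussion.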

Examples

Excerpt from resources/iso_3166-1.csv (full version):

Country or Area Name,ISO ALPHA-2 Code,ISO ALPHA-3 Code,ISO Numeric Code
Brazil,BR,BRA,076
British Virgin Islands,VG,VGB,092
British Indian Ocean Territory,IO,IOT,086
Brunei Darussalam,BN,BRN,096
Burkina Faso,BF,BFA,854
"Hong Kong, Special Administrative Region of China",HK,HKG,344
"Macao, Special Administrative Region of China",MO,MAC,446
Christmas Island,CX,CXR,162
Cocos (Keeling) Islands,CC,CCK,166

1. Parse a CSV and display records without field enclosure, output fields pipe-separated

$ awk-csv-parser.sh --output-separator='|' resources/iso_3166-1.csv | head -n10
# or:
$ cat resources/iso_3166-1.csv | awk-csv-parser.sh --output-separator='|' | head -n10

Result:

Country or Area Name|ISO ALPHA-2 Code|ISO ALPHA-3 Code|ISO Numeric Code|
Brazil|BR|BRA|076|
British Virgin Islands|VG|VGB|092|
British Indian Ocean Territory|IO|IOT|086|
Brunei Darussalam|BN|BRN|096|
Burkina Faso|BF|BFA|854|
Hong Kong, Special Administrative Region of China|HK|HKG|344|
Macao, Special Administrative Region of China|MO|MAC|446|
Christmas Island|CX|CXR|162|
Cocos (Keeling) Islands|CC|CCK|166|
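
Once records are pipe-separated, ordinary field-oriented tools can take over. As a small sketch, the function below (alpha2_and_name is a made-up name) reorders columns with plain awk. Two lines copied from the result above stand in for real parser output, so the snippet runs without the parser installed.

```shell
# Hypothetical helper: print "ALPHA-2 code<TAB>name" for each parsed record.
# A single-character FS such as "|" is taken literally by awk, not as a regex.
alpha2_and_name() {
    awk -F'|' '{ printf "%s\t%s\n", $2, $1 }'
}

# Sample lines copied from the result shown above.
sample='Brazil|BR|BRA|076|
Hong Kong, Special Administrative Region of China|HK|HKG|344|'

printf '%s\n' "$sample" | alpha2_and_name
```

If the data itself may contain a pipe, a different output separator can be chosen with -o so that the split stays unambiguous.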

2. Remove CSV header, keep only first column and grep fields containing separator

$ tail -n+2 resources/iso_3166-1.csv | awk-csv-parser.sh | cut -d'|' -f1 | grep ,

Result:

Hong Kong, Special Administrative Region of China
Macao, Special Administrative Region of China
Congo, Democratic Republic of the
Iran, Islamic Republic of
Korea, Democratic People's Republic of
Korea, Republic of
Micronesia, Federated States of
Taiwan, Republic of China
Tanzania, United Republic of

3. You can directly call the Awk script

$ awk -f csv-parser.awk -v separator=',' -v enclosure='"' --source '{
    csv_parse_record($0, separator, enclosure, csv)
    print csv[2] " ⇒ " csv[0]
}' resources/iso_3166-1.csv | head -n10

Result:

ISO ALPHA-3 Code ⇒ Country or Area Name
BRA ⇒ Brazil
VGB ⇒ British Virgin Islands
IOT ⇒ British Indian Ocean Territory
BRN ⇒ Brunei Darussalam
BFA ⇒ Burkina Faso
HKG ⇒ Hong Kong, Special Administrative Region of China
MAC ⇒ Macao, Special Administrative Region of China
CXR ⇒ Christmas Island
CCK ⇒ Cocos (Keeling) Islands

4. Technical example

Content of tests/resources/ok.csv:

,,
a, b,c , d ,e e
"","a","a,",",a",",,"
"a""b","""","c"""""

Test:

$ awk-csv-parser.sh tests/resources/ok.csv

Result:

|| |
a| b|c | d |e e|
|a|a,|,a|,,|
a"b|"|c""|

5. Errors

Content of tests/resources/invalid.csv:

"
"a,
a"
"a"b

Test:

$ awk-csv-parser.sh tests/resources/invalid.csv

Result:

[CSV ERROR: 3] Missing closing quote after '' in following record: '"'
[CSV ERROR: 3] Missing closing quote after 'a,' in following record: '"a,'
[CSV ERROR: 1] Missing opening quote before 'a' in following record: 'a"'
[CSV ERROR: 2] Missing separator after 'a' in following record: '"a"b'
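
In the messages above, code 1 marks a missing opening quote, code 2 a missing separator and code 3 a missing closing quote. For a large file it may help to summarise them. The sketch below is an illustration only: count_csv_errors is a hypothetical name, and it assumes the messages keep the exact "[CSV ERROR: N]" prefix shown here. This README does not say whether the messages are written to stdout or stderr, so the stream may need redirecting (for example with 2>&1) before piping.

```shell
# Hypothetical helper: print "<error code> <count>" per code, sorted by code.
count_csv_errors() {
    awk 'match($0, /^\[CSV ERROR: [0-9]+\]/) {
        # The code starts at character 13 and ends just before the closing "]".
        code = substr($0, 13, RLENGTH - 13)
        count[code]++
    }
    END {
        for (code in count) {
            print code, count[code]
        }
    }' "$@" | sort -n
}

# The four messages from the result above.
count_csv_errors <<'EOF'
[CSV ERROR: 3] Missing closing quote after '' in following record: '"'
[CSV ERROR: 3] Missing closing quote after 'a,' in following record: '"a,'
[CSV ERROR: 1] Missing opening quote before 'a' in following record: 'a"'
[CSV ERROR: 2] Missing separator after 'a' in following record: '"a"b'
EOF
```

Lines that do not start with the prefix, such as successfully parsed records, are ignored.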

Installation

Debian/Ubuntu

  1. Move to the directory where you wish to store the source.

  2. Clone the repository:

$ git clone https://github.com/geoffroy-aubry/awk-csv-parser.git

  3. You should be on stable branch. If not, switch your clone to that branch:

$ cd awk-csv-parser && git checkout stable

  4. You can create a symlink to awk-csv-parser.sh:

$ sudo ln -s /path/to/src/awk-csv-parser.sh /usr/local/bin/awk-csv-parser

  5. It's ready for use:

$ awk-csv-parser

OS X

Because the Mac OS X versions of readlink and sed are BSD-based and differ slightly from the GNU versions, you need to install the GNU utilities:

$ brew install coreutils gnu-sed [--with-default-names]

With the --with-default-names option, the GNU utilities replace those of OS X. Otherwise they are prefixed with a g, and you have to edit the scripts src/awk-csv-parser.sh and tests/all-tests.sh to replace readlink and sed with greadlink and gsed respectively.

Then follow the Debian/Ubuntu installation process.
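
When writing your own wrapper scripts around the parser, the g-prefixed names can be detected at run time instead of being hard-coded. The following is only a sketch of that idea and not how the project's scripts work; gnu_tool, READLINK and SED are hypothetical names.

```shell
# Hypothetical helper: print "g<name>" when a g-prefixed command is on the
# PATH (as installed by Homebrew without --with-default-names), else "<name>".
gnu_tool() {
    if command -v "g$1" >/dev/null 2>&1; then
        printf '%s\n' "g$1"
    else
        printf '%s\n' "$1"
    fi
}

READLINK=$(gnu_tool readlink)
SED=$(gnu_tool sed)

# Use the selected commands through the variables.
printf 'a,b\n' | "$SED" 's/,/|/'
```

On a typical Linux system both variables resolve to the plain names, so the same wrapper can run on either platform.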

Copyrights & licensing

Licensed under the GNU Lesser General Public License v3 (LGPL version 3). See LICENSE file for details.

Change log

See CHANGELOG file for details.

Continuous integration

Launch unit tests:

$ tests/all-tests.sh

Git branching model

The git branching model used for development is the one described and supported by the twgit tool: https://github.com/Twenga/twgit.