Tab-separated values (TSV) is a simple and popular format for storing and transferring data, and for exporting data from and importing data into relational databases. For example, PostgreSQL COPY moves data between PostgreSQL tables and standard file-system files or in-memory stores, and its text format (a text file with one line per table row) is a generic version of TSV. Meanwhile, packages like asyncpg help efficiently insert, update or query data in bulk with binary data transfer between Python and PostgreSQL.
This package offers a high-performance alternative for converting data between a TSV text file and Python objects. The parser can read a TSV record into a Python tuple consisting of built-in Python types, one for each field. The generator can produce a TSV record from a tuple.
Even though tsv2py contains native code, the package is already pre-built for several target architectures. In most cases, you can install directly from a binary wheel, selected automatically by `pip`:
python3 -m pip install tsv2py
If a binary wheel is not available for the target platform, `pip` will attempt to install tsv2py from the source distribution. This will build the package on the fly as part of the installation process, which requires a C compiler such as `gcc` or `clang`. The following commands install a C compiler and the Python development headers on Amazon Linux:
sudo yum groupinstall -y "Development Tools"
sudo yum install -y python3-devel python3-pip
If you lack a C compiler or the Python development headers, you will get error messages similar to the following:
error: command 'gcc' failed: No such file or directory
lib/tsv_parser.c:2:10: fatal error: Python.h: No such file or directory
from datetime import date, datetime
from uuid import UUID
from tsv.helper import Parser
# specify the column structure
parser = Parser(fields=(bytes, date, datetime, float, int, str, UUID, bool))
# read and parse an entire file (tsv_path is the path to your TSV file)
with open(tsv_path, "rb") as f:
    py_records = parser.parse_file(f)
# read and parse a file line by line
with open(tsv_path, "rb") as f:
    for line in f:
        py_record = parser.parse_line(line)
Text format is a simple tabular format in which each record (table row) occupies a single line.
- Output always begins with a header row, which lists data field names.
- Fields (table columns) are delimited by tab characters.
- Non-printable characters and special values are escaped with backslash (`\`), as shown below:
Escape | Interpretation |
---|---|
`\N` | NULL value |
`\0` | NUL character (ASCII 0) |
`\b` | Backspace (ASCII 8) |
`\f` | Form feed (ASCII 12) |
`\n` | Newline (ASCII 10) |
`\r` | Carriage return (ASCII 13) |
`\t` | Tab (ASCII 9) |
`\v` | Vertical tab (ASCII 11) |
`\\` | Backslash (single character) |
This format allows data to be easily imported into a database engine, e.g. with PostgreSQL COPY.
Output in this format is transmitted as media type `text/plain` or `text/tab-separated-values` in UTF-8 encoding.
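The escape rules in the table above can be sketched in pure Python. This is only an illustration of the text format, not the library's implementation; the helper names `escape_field`, `unescape_field` and `encode_record` are hypothetical and are not part of tsv2py.

```python
# Illustrative helpers only; these names are hypothetical and not part of tsv2py.
_ESCAPES = {
    "\\": "\\\\",
    "\0": "\\0",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "\v": "\\v",
}
# Map the character that follows the backslash back to the raw character.
_UNESCAPES = {escaped[1]: raw for raw, escaped in _ESCAPES.items()}


def escape_field(value):
    """Encode one field: None becomes \\N, special characters are backslash-escaped."""
    if value is None:
        return "\\N"
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(text):
    """Decode one field: \\N becomes None, escape sequences are reversed."""
    if text == "\\N":
        return None
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)  # character following the backslash
        if code not in _UNESCAPES:
            raise ValueError(f"invalid escape sequence in field: {text!r}")
        out.append(_UNESCAPES[code])
    return "".join(out)


def encode_record(values):
    """Join escaped fields with tab delimiters to form one line (no trailing newline)."""
    return "\t".join(escape_field(v) for v in values)
```

Because a literal backslash is always doubled on output, a string field that happens to contain the two characters `\N` is written with an extra backslash and cannot be confused with a NULL value.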
The parser understands the following Python types:
- `None`. This special value is returned for the TSV escape sequence `\N`.
- `bool`. A literal `true` or `false` is converted into a boolean value.
- `bytes`. TSV escape sequences are reversed before the data is passed to Python as a `bytes` object. NUL bytes are permitted.
- `datetime`. The input has to comply with RFC 3339 and ISO 8601. The timezone must be UTC (a.k.a. suffix `Z`).
- `date`. The input has to conform to the format `YYYY-MM-DD`.
- `time`. The input has to conform to the format `hh:mm:ssZ` with no fractional seconds, or `hh:mm:ss.ffffffZ` with fractional seconds. Fractional seconds allow up to 6 digits of precision.
- `float`. Interpreted as double precision floating point numbers.
- `int`. Arbitrary-length integers are allowed.
- `str`. TSV escape sequences are reversed before the data is passed to Python as a `str`. NUL bytes are not allowed.
- `uuid.UUID`. The input has to comply with RFC 4122, or be a string of 32 hexadecimal digits.
- `decimal.Decimal`. Interpreted as arbitrary precision decimal numbers.
- `ipaddress.IPv4Address`.
- `ipaddress.IPv6Address`.
- `list` and `dict`, which are understood as JSON, and invoke the equivalent of `json.loads` to parse a serialized JSON string.
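To make the type mapping concrete, the following is a rough standard-library sketch of how already-unescaped field texts could be converted according to a declared column structure. It is not the library's native code; `CONVERTERS` and `convert_fields` are hypothetical names, and `bytes`, `datetime` and `time` are left out for brevity.

```python
import json
from datetime import date
from decimal import Decimal
from ipaddress import IPv4Address, IPv6Address
from uuid import UUID


def _parse_bool(text):
    # Only the literals "true" and "false" are accepted.
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {text!r}")


# Hypothetical converter table: one callable per declared field type.
CONVERTERS = {
    bool: _parse_bool,
    int: int,                      # arbitrary-length integers
    float: float,                  # double precision
    str: str,
    date: date.fromisoformat,      # YYYY-MM-DD
    UUID: UUID,                    # hyphenated form or 32 hexadecimal digits
    Decimal: Decimal,
    IPv4Address: IPv4Address,
    IPv6Address: IPv6Address,
    list: json.loads,
    dict: json.loads,
}


def convert_fields(texts, types):
    """Convert field texts to Python objects; None (a decoded \\N) passes through."""
    if len(texts) != len(types):
        raise ValueError("field count does not match the declared column structure")
    return tuple(
        None if text is None else CONVERTERS[tp](text)
        for text, tp in zip(texts, types)
    )
```

A call such as `convert_fields(["true", "42", None], (bool, int, str))` yields a tuple with one Python object per field, mirroring the shape of the tuples described above.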
The backslash character `\` is both a TSV and a JSON escape sequence initiator. When JSON data is written to TSV, several backslash characters may be needed, e.g. `\\n` in a quoted JSON string translates to a single newline character. First, `\\` in `\\n` is understood as an escape sequence by the TSV parser to produce a single `\` character followed by an `n` character, and in turn `\n` is understood as a single newline embedded in a JSON string by the JSON parser. Specifically, you need four consecutive backslash characters in TSV to represent a single backslash in a JSON quoted string.
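The two escaping layers can be observed with a short standard-library experiment. The helpers `tsv_escape` and `tsv_unescape` below are hypothetical and deliberately minimal (they handle only backslash, tab and newline); they stand in for the TSV layer and are not tsv2py functions.

```python
import json
import re


def tsv_escape(text):
    # Minimal TSV escaping for this demonstration: backslash first, then tab and newline.
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def tsv_unescape(text):
    # Reverse of tsv_escape; each backslash consumes exactly one following character.
    table = {"\\": "\\", "t": "\t", "n": "\n"}
    return re.sub(r"\\(.)", lambda m: table[m.group(1)], text)


value = {"path": "C:\\temp", "note": "line1\nline2"}

# Layer 1: JSON escapes the backslash as two characters and the newline as backslash + n.
json_text = json.dumps(value)

# Layer 2: TSV doubles every backslash produced by the JSON layer.
tsv_field = tsv_escape(json_text)

# Reading reverses the layers in the opposite order: TSV first, then JSON.
round_trip = json.loads(tsv_unescape(tsv_field))
```

After running this, `tsv_field` contains four consecutive backslashes where the original value had one, and two backslashes followed by `n` where it had a newline, while `round_trip` equals `value`.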
Internally, the implementation uses AVX2 instructions to
- parse RFC 3339 date-time strings into Python `datetime` objects,
- parse RFC 4122 UUID strings or 32-digit hexadecimal strings into Python `UUID` objects,
- and find `\t` delimiters between fields in a line.
For parsing integers up to the range of the `long` type, the parser calls the C standard library function `strtol`.
For parsing IPv4 and IPv6 addresses, the parser calls the C function inet_pton in libc or Windows Sockets (WinSock2).
If orjson is installed, the parser uses it to speed up parsing of nested JSON structures. Otherwise, the library falls back to the built-in JSON decoder.
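An optional-dependency fallback of this kind is commonly written as a guarded import. The snippet below is a generic sketch of that pattern, not a copy of the library's internals; `parse_json` is a hypothetical helper name.

```python
import json

try:
    import orjson  # optional third-party dependency
except ImportError:
    orjson = None


def parse_json(text):
    """Parse a JSON document with orjson when available, else with the standard library."""
    if orjson is not None:
        return orjson.loads(text)
    return json.loads(text)
```

Both decoders accept `str` and `bytes` input and return the same Python structures, so callers do not need to know which one was used.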
Date-time format:
YYYY-MM-DDThh:mm:ssZ
YYYY-MM-DDThh:mm:ss.fZ
YYYY-MM-DDThh:mm:ss.ffZ
YYYY-MM-DDThh:mm:ss.fffZ
YYYY-MM-DDThh:mm:ss.ffffZ
YYYY-MM-DDThh:mm:ss.fffffZ
YYYY-MM-DDThh:mm:ss.ffffffZ
Date format:
YYYY-MM-DD
Time format:
hh:mm:ssZ
hh:mm:ss.fZ
hh:mm:ss.ffZ
hh:mm:ss.fffZ
hh:mm:ss.ffffZ
hh:mm:ss.fffffZ
hh:mm:ss.ffffffZ
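For comparison, the same patterns can be parsed with the Python standard library. The functions below are a hedged sketch with hypothetical names; attaching `timezone.utc` to the result is an assumption made for this example, and `strptime` is somewhat more lenient than the fixed-width patterns listed above (for instance, it accepts single-digit months).

```python
from datetime import date, datetime, time, timezone


def parse_datetime(text):
    # "%f" matches 1 to 6 fractional digits and pads the value on the right.
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in text else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def parse_date(text):
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_time(text):
    fmt = "%H:%M:%S.%fZ" if "." in text else "%H:%M:%SZ"
    # strptime yields a naive datetime; keep the time part and mark it as UTC.
    return datetime.strptime(text, fmt).time().replace(tzinfo=timezone.utc)
```

Inputs without the trailing `Z` raise `ValueError`, which matches the requirement that the timezone be UTC.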
Depending on the field types, tsv2py parses TSV records up to 7 times faster than a functionally equivalent Python implementation based on the Python standard library. Savings in execution time are largest for dates, UUIDs and longer strings with special characters (up to 90%), and more moderate for simple types like small integers (approx. 60%).