Add clear indication of non-GPU accelerated parameters in read_json docstring #11825

Merged
Changes from 2 commits
88 changes: 45 additions & 43 deletions python/cudf/cudf/utils/ioutils.py
@@ -473,8 +473,36 @@
engine : {{ 'auto', 'cudf', 'cudf_experimental', 'pandas' }}, default 'auto'
Parser engine to use. If 'auto' is passed, the engine will be
automatically selected based on the other parameters.
orient : string,
Indication of expected JSON string format (pandas engine only).
lines : boolean, default False
Read the file as a json object per line.
dtype : boolean or dict, default True
If True, infer dtypes, if a dict of column to dtype, then use those,
if False, then don't infer dtypes at all, applies only to the data.
byte_range : list or tuple, default None (cudf engine only)
Byte range within the input file to be read.
The first number is the offset in bytes, the second number is the range
size in bytes. Set the size to zero to read all data after the offset
location. Reads the row that starts before or at the end of the range,
even if it ends after the end of the range.
Contributor

Pair of `(offset, length)` specifying a subrange of the file to be read, in bytes.
To read from `offset` to the end of the file, set `length=0`. Reads the row starting
before or at the end of the range, even if it ends past the end of the range.

What does "at the end of the range" mean? I guess the byte range specifies a semi-open interval `[offset, offset + length)`. Does that mean that if a row starts at `offset + length - 1`, we read the entire row?
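
To make the question concrete, here is a minimal pure-Python sketch of the semi-open reading: a row is selected when its first byte falls in `[offset, offset + length)`, and a selected row is returned whole even if it runs past the end of the range. The helper `select_rows` is hypothetical and models only this one interpretation; it is not cudf's reader and may not match what the docstring intends.

```python
def select_rows(data: bytes, offset: int, length: int) -> list:
    """Return the newline-delimited rows whose first byte lies in the range.

    Hypothetical model of the semi-open interpretation [offset, offset + length).
    A length of 0 means "from offset to the end of the data".
    """
    end = len(data) if length == 0 else offset + length
    rows = []
    start = 0  # byte position at which the current row begins
    for line in data.splitlines(keepends=True):
        # A row is selected by where it starts, not by where it ends.
        if offset <= start < end:
            rows.append(line.rstrip(b"\n"))
        start += len(line)
    return rows


data = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
# Each row is 9 bytes long including its newline, so rows start at 0, 9 and 18.
print(select_rows(data, 0, 10))  # [b'{"a": 1}', b'{"a": 2}']
print(select_rows(data, 9, 0))   # [b'{"a": 2}', b'{"a": 3}']
```

Under this reading, the second row starts at byte 9, which is `offset + length - 1` for the range `(0, 10)`, so it is read in full although it ends at byte 18.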

Aside: this is not a very ergonomic way of specifying "read from this offset to the end of the file". Could we accept either an offset int or an (offset, length) pair?
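
A rough sketch of what accepting either form could look like. The name `normalize_byte_range` and its error messages are made up for illustration and nothing like it exists in cudf as far as this thread shows; a length of 0 keeps the docstring's current meaning of "read everything after the offset".

```python
from typing import Sequence, Tuple, Union

ByteRange = Union[int, Sequence[int]]


def normalize_byte_range(byte_range: ByteRange) -> Tuple[int, int]:
    """Turn an int offset or an (offset, length) pair into (offset, length).

    Hypothetical helper, not part of cudf.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(byte_range, int) and not isinstance(byte_range, bool):
        offset, length = byte_range, 0
    else:
        try:
            offset, length = byte_range
        except (TypeError, ValueError):
            raise TypeError(
                "byte_range must be an int offset or an (offset, length) pair"
            ) from None
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    return int(offset), int(length)


print(normalize_byte_range(128))        # (128, 0)
print(normalize_byte_range((128, 64)))  # (128, 64)
```

Existing callers that pass a pair would be unaffected, and the bare-offset form would remove the need for the `length=0` sentinel at the call site.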

Contributor

This too needs a bit of re-work like you said. I'll try to address this in #11780

keep_quotes : bool, default False (cudf_experimental engine only)
If `True`, any string values are read literally (and wrapped in an
additional set of quotes).
If `False` string values are parsed into Python strings.
typ : type of object to recover (series or frame), default 'frame'
With cudf engine, only frame output is supported.
encoding : str, default is 'utf-8'
The encoding to use to decode py3 bytes.
With cudf engine, only utf-8 is supported.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use
gzip, bz2, zip or xz if path_or_buf is a string ending in
'.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression
Contributor

Could we not use https://pypi.org/project/python-magic/ and just detect the appropriate decompression scheme?

Contributor

Do you mean we dynamically populate?

Contributor

I meant that if `'infer'` is provided, one could detect the actual file type not from the extension (brittle) but from the file's magic bytes (reasonably robust).
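
As a rough illustration of the idea, and not of python-magic itself, the four formats named in the docstring can be told apart with the standard library alone by looking at their leading bytes. The function name `sniff_compression` is hypothetical.

```python
import bz2
import gzip
import io
import lzma
import zipfile

# Leading bytes ("magic numbers") of the formats named in the docstring.
_MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"PK\x03\x04", "zip"),
    (b"\xfd7zXZ\x00", "xz"),
)


def sniff_compression(head: bytes):
    """Return the compression name suggested by the first bytes, or None."""
    for magic, name in _MAGIC:
        if head.startswith(magic):
            return name
    return None


payload = b'{"a": 1}\n'

# Build a one-file ZIP archive in memory.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("data.json", payload)

print(sniff_compression(gzip.compress(payload)))  # gzip
print(sniff_compression(bz2.compress(payload)))   # bz2
print(sniff_compression(lzma.compress(payload)))  # xz
print(sniff_compression(zip_buffer.getvalue()))   # zip
print(sniff_compression(payload))                 # None
```

Only the first six bytes are needed, so a reader could peek at the start of the file before choosing a decompressor and fall back to the extension when nothing matches.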

Contributor

I just read the email with this GH notification without the context of the thread. It sounded super viable.

otherwise. If using 'zip', the ZIP file must contain only one data
file to be read in. Set to None for no decompression.


orient : string, (pandas engine only)
Indication of expected JSON string format .
Compatible JSON strings can be produced by ``to_json()`` with a
corresponding orient value.
The set of possible orients is:
@@ -504,66 +532,40 @@
``'columns'``.
- The DataFrame columns must be unique for orients ``'index'``,
``'columns'``, and ``'records'``.
typ : type of object to recover (series or frame), default 'frame'
With cudf engine, only frame output is supported.
dtype : boolean or dict, default True
If True, infer dtypes, if a dict of column to dtype, then use those,
if False, then don't infer dtypes at all, applies only to the data.
convert_axes : boolean, default True
Try to convert the axes to the proper dtypes (pandas engine only).
convert_dates : boolean, default True
List of columns to parse for dates (pandas engine only); If True, then try
convert_axes : boolean, default True (pandas engine only)
Try to convert the axes to the proper dtypes.
convert_dates : boolean, default True (pandas engine only)
List of columns to parse for dates; If True, then try
to parse datelike columns default is True; a column label is datelike if

* it ends with ``'_at'``,
* it ends with ``'_time'``,
* it begins with ``'timestamp'``,
* it is ``'modified'``, or
* it is ``'date'``
keep_default_dates : boolean, default True
If parsing dates, parse the default datelike columns (pandas engine only)
numpy : boolean, default False
Direct decoding to numpy arrays (pandas engine only). Supports numeric
keep_default_dates : boolean, default True (pandas engine only)
If parsing dates, parse the default datelike columns
numpy : boolean, default False (pandas engine only)
Direct decoding to numpy arrays. Supports numeric
data only, but non-numeric column and index labels are supported. Note
also that the JSON ordering MUST be the same for each term if numpy=True.
precise_float : boolean, default False
precise_float : boolean, default False (pandas engine only)
Set to enable usage of higher precision (strtod) function when
decoding string to double values (pandas engine only). Default (False)
decoding string to double values. Default (False)
is to use fast but less precise builtin functionality
date_unit : string, default None
The timestamp unit to detect if converting dates (pandas engine only).
date_unit : string, default None (pandas engine only)
The timestamp unit to detect if converting dates.
The default behavior is to try and detect the correct precision, but if
this is not desired then pass one of 's', 'ms', 'us' or 'ns' to force
parsing only seconds, milliseconds, microseconds or nanoseconds.
encoding : str, default is 'utf-8'
The encoding to use to decode py3 bytes.
With cudf engine, only utf-8 is supported.
lines : boolean, default False
Read the file as a json object per line.
chunksize : integer, default None
Return JsonReader object for iteration (pandas engine only).
chunksize : integer, default None (pandas engine only)
Return JsonReader object for iteration.
See the `line-delimited json docs
<http://pandas.pydata.org/pandas-docs/stable/io.html#io-jsonl>`_
for more information on ``chunksize``.
This can only be passed if `lines=True`.
If this is None, the file will be read into memory all at once.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use
gzip, bz2, zip or xz if path_or_buf is a string ending in
'.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression
otherwise. If using 'zip', the ZIP file must contain only one data
file to be read in. Set to None for no decompression.
byte_range : list or tuple, default None
Byte range within the input file to be read (cudf engine only).
The first number is the offset in bytes, the second number is the range
size in bytes. Set the size to zero to read all data after the offset
location. Reads the row that starts before or at the end of the range,
even if it ends after the end of the range.
keep_quotes : bool, default False
This parameter is only supported in ``cudf_experimental`` engine.
If `True`, any string values are read literally (and wrapped in an
additional set of quotes).
If `False` string values are parsed into Python strings.


Returns
-------
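
The default datelike-label rule listed under `convert_dates` in the hunk above can be sketched as a small predicate. This follows the docstring text literally, so it is case-sensitive; the name `is_datelike` is hypothetical, and the pandas implementation the parameter delegates to may differ in details such as case handling.

```python
def is_datelike(label) -> bool:
    """Apply the docstring's default rule for datelike column labels."""
    if not isinstance(label, str):
        return False
    return (
        label.endswith("_at")
        or label.endswith("_time")
        or label.startswith("timestamp")
        or label in ("modified", "date")
    )


columns = ["created_at", "timestamp_utc", "date", "user_id", "update"]
print([c for c in columns if is_datelike(c)])
# ['created_at', 'timestamp_utc', 'date']
```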