Add chunksize param to read_json when lines=True #17168
Conversation
@louispotok : Thanks for doing this! Before we can merge, you will need to add tests and
pandas/io/json/json.py
Outdated
Also note this is different from the `chunksize` parameter in
`read_csv`, which returns a FileTextReader.
If the JSON input is a string, this argument has no effect.
Add a version-added tag.
Will do.
pandas/io/json/json.py
Outdated
@@ -322,6 +332,27 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

    filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
                                                      encoding=encoding)

    def _read_json_as_lines(fh, chunksize):
Add a docstring for this. Will be useful for developers.
Will do.
pandas/io/json/json.py
Outdated
def _get_obj(typ, json, orient, dtype, convert_axes, convert_dates,
             keep_default_dates, numpy, precise_float,
             date_unit):
Why did you do this? I like abstraction but not sure how this relates to your PR.
Before, this code was only called once. Now it needs to be called from two places, so I extracted it into a function (see line 343).
One possible improvement I was considering is to define the function inline in the read_json function so that we don't need to pass so many parameters.
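For reference, the chunked reading path under discussion (the `_read_json_as_lines` helper) could be sketched roughly like this — a stdlib-only illustration of the idea, not the actual pandas implementation, with the parsing step simplified to `json.loads`:

```python
import json
from io import StringIO
from itertools import islice

def read_json_as_lines(fh, chunksize):
    """Read a line-delimited JSON stream chunksize lines at a time.

    Sketch of the chunked-reading idea in this PR; the real helper
    would parse each chunk with pandas rather than the json module.
    """
    results = []
    while True:
        # pull at most `chunksize` lines without loading the whole file
        lines = list(islice(fh, chunksize))
        if not lines:
            break
        results.extend(json.loads(line) for line in lines)
    return results

records = read_json_as_lines(StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n'), 2)
# records -> [{'a': 1}, {'a': 2}, {'a': 3}]
```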
needs tests that validate chunksize works and errors correctly; see how read_csv does this
@jreback thanks for the tip. I looked at the read_csv tests and I'm still a bit confused. Specifically I'm looking at "test_read_chunksize" in tests/io/parser/common.py. I see that it's testing two main things:
Okay, the latest commits should address all these comments:
Codecov Report
@@ Coverage Diff @@
## master #17168 +/- ##
=========================================
Coverage ? 90.97%
=========================================
Files ? 162
Lines ? 49500
Branches ? 0
=========================================
Hits ? 45035
Misses ? 4465
Partials ? 0
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #17168 +/- ##
==========================================
- Coverage 91.27% 91.25% -0.02%
==========================================
Files 163 163
Lines 49766 49766
==========================================
- Hits 45423 45414 -9
- Misses 4343 4352 +9
Continue to review full report at Codecov.
Also, do you prefer to leave these atomic commits or should I squash them?
@louispotok : Whatever is most convenient for you. We'll squash them at the end once we merge.
Thanks @gfyoung , this is ready for another round of review.
@louispotok : This looks pretty good (there are some minor stylistic things, I think... @jreback?). Could you post timing information so that we can see the performance difference with and without the changes in this PR? Link: https://pandas.pydata.org/pandas-docs/stable/contributing.html#id53
Should we raise an error when a value for `chunksize` is passed and `lines` is not True?
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -197,6 +197,7 @@ Other API Changes
   raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
 - :func:`read_csv` now treats ``'null'`` strings as missing values by default (:issue:`16471`)
 - :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
 - :func:`read_json` now accepts a ``chunksize`` parameter that can reduce memory usage when ``lines=True``. (:issue:`17048`)
Can you put this in the "Other Enhancements" section?
Will do.
update for this
pandas/io/json/json.py
Outdated
If this is None, the file will be read into memory all at once.
Passing a chunksize helps with memory usage, but is slower.
Also note this is different from the `chunksize` parameter in
`read_csv`, which returns a FileTextReader.
Shouldn't our goal be to have it with a similar behaviour?
(Formatting comment: there is some extra indentation on this line that is not needed.)
You need to return an iterator (as @jorisvandenbossche indicates). You return a new class that inherits from `BaseIterator` (in `pandas.io.common`); see how `pandas.io.stata.StataReader` does this. You pretty much just return a class and define `__next__`, which processes the chunked rows (there is also an example of this in `pandas.io.pytables`, though that doesn't use `BaseIterator`; it should, which is another issue).
Is this needed (the subclassing from `BaseIterator`)? We could also just do a `yield` of the chunk in the `_read_json_as_lines` function of the PR (instead of returning the concatted frame of all chunks). That seems less invasive to add given the current implementation of `read_json`.
Yes, because that's how it's done elsewhere, and it avoids weird things like Python 2/3 compat issues (this is in fact the point of `BaseIterator`).
What is the Python 2/3 compat issue with using `yield`? (A generator has no `.next` method in Python 3, but you should use `next(..)` anyway?)
(That's how it is done in `read_sql`, so if there is an issue it should be fixed there.)
need to update these doc-strings
pandas/io/json/json.py
Outdated
@@ -263,6 +266,16 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

    .. versionadded:: 0.19.0

    chunksize: integer, default None
        If `lines=True`, how many lines to read into memory at a time.
Best to use double backticks here (so ``lines=True``); then it will be formatted as code.
Will do.
My main comment/question here is: should the API be consistent with how `read_csv` handles `chunksize`?
@louispotok I understand that you don't need it in your specific case (as once concatenated, your dataframe fits into memory). But in general it could be needed in certain cases to have it as an iterator. For your case you can then do the concat yourself:
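The snippet that followed appears to have been lost in extraction; with the iterator API that was eventually merged, the "concat yourself" pattern would look something like this (illustrative data, not from the thread):

```python
import pandas as pd
from io import StringIO

data = StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n')

# with chunksize, read_json returns an iterator of DataFrames;
# a single concat recovers the full frame
reader = pd.read_json(data, lines=True, chunksize=2)
df = pd.concat(reader)
```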
pandas/io/json/json.py
Outdated
@@ -322,6 +335,39 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

    filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
                                                      encoding=encoding)

    if chunksize is not None:
        _validate_integer("chunksize", chunksize, 1)
this should raise if lines!=True
Will do.
Done
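The combined rule from these review comments — chunksize must be a positive integer and requires `lines=True` — can be sketched as a standalone check. The function name is hypothetical; pandas' actual helper is `_validate_integer`, which has a different shape:

```python
def validate_chunksize(chunksize, lines):
    """Validate the chunksize argument as discussed in this review:
    it must be an integer >= 1 and may only be used with lines=True."""
    if chunksize is None:
        return None
    if not lines:
        raise ValueError("chunksize can only be passed if lines=True")
    if not (isinstance(chunksize, int) and chunksize >= 1):
        raise ValueError("chunksize must be an integer >= 1")
    return chunksize
```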
pandas/tests/io/json/test_pandas.py
Outdated
@@ -1032,6 +1032,32 @@ def test_to_jsonl(self):
        assert result == expected
        assert_frame_equal(pd.read_json(result, lines=True), df)

    def test_read_jsonchunks(self):
        df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
add issue number as a comment
Will do.
Done
pandas/tests/io/json/test_pandas.py
Outdated
    pd.read_json(strio, lines=True, chunksize=2.2)

with tm.assert_raises_regex(ValueError, msg):
    pd.read_json(strio, lines=True, chunksize='foo')
test with chunksize and lines=False
Will do.
Done
It looks like we still need a decision on whether to return an iterator. I'll point out one thing though. I think there are really two issues here worth separating:
I was originally attempting to fix (1), but there may be another way to fix it instead. The culprit seems to be line 349, which on my laptop kills 10G of available memory with a 1.7GB json object. My solution to (1) was to read in the file in chunks, but maybe there are other possible fixes. Then, I think a separate pull request could solve issue (2). What do you think?
Make a class that inherits `BaseIterator`; call it `JsonLineReader`. Return this to the user. This solves both issues generally.
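The suggested class could be sketched as follows. This is an assumption-laden illustration: pandas' real `BaseIterator` lives in `pandas.io.common`, so a plain iterator class stands in for it here, and each chunk is parsed with the stdlib `json` module rather than into a DataFrame:

```python
import json
from io import StringIO
from itertools import islice

class JsonLineReader:
    """Minimal stand-in for the suggested BaseIterator subclass:
    iterate over a line-delimited JSON stream, chunksize lines at a time."""

    def __init__(self, fh, chunksize):
        self.fh = fh
        self.chunksize = chunksize

    def __iter__(self):
        return self

    def __next__(self):
        lines = list(islice(self.fh, self.chunksize))
        if not lines:
            raise StopIteration
        # a real implementation would parse the chunk into a DataFrame
        return [json.loads(line) for line in lines]

reader = JsonLineReader(StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n'), 2)
chunks = list(reader)
# chunks -> [[{'a': 1}, {'a': 2}], [{'a': 3}]]
```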
Hello @louispotok! Thanks for updating the PR.
Comment last updated on September 28, 2017 at 21:47 UTC.
Note: not ready for re-review -- found some bugs that were not caught by tests. Fixing behavior and tests.
Okay, this is ready for re-review. @jreback I believe I've addressed all your comments. One other point I want to confirm: I'm assuming (based on my reading of the code and documentation) that currently when
I'm not sure why the AppVeyor build broke; it looks like an unrelated issue.
@louispotok : Maybe not.
Ah, I think I caught it. Let's see. To document fully, in my understanding, there are 4 cases to consider. Here's the current behavior:
Do those decisions make sense? I could see the argument for changing (4) so that it leaves the buffer open and the caller has to close it. I think (and we'll see with this build) the problem was in situation (2): I was checking that chunksize was valid after the file was opened, so the test suite correctly caught the error (
Hm, still failing and I'm not sure how to investigate further. I do notice that the method that errors (
Split out tests with lines=True into separate test class; parametrize tests; replace """ comments with #.
db2749b to 28d1cbe (Compare)
Thanks @louispotok, nice PR! Very responsive! Keep 'em coming!
This was fun, thanks for all your help!
Previous behavior: reading the whole file into memory and then splitting it into lines.

New behavior: if `lines=True` and `chunksize` is passed, read in `chunksize` lines at a time, and concat. This only covers some kinds of input to `read_json`. When `chunksize` is passed, `read_json` becomes slower but more memory-efficient.

Closes #17048.

Tests and style-check pass, no new tests added.