
Add chunksize param to read_json when lines=True #17168

Merged: 62 commits into pandas-dev:master from louispotok:read-json-lines, Sep 28, 2017

Conversation

@louispotok (Contributor) commented Aug 3, 2017

Previous behavior: read the whole file into memory, then split it into lines.
New behavior: if lines=True and chunksize is passed, read chunksize lines at a time and concat the results.

This only covers some kinds of input to read_json. When chunksize is passed, read_json becomes slower but more memory-efficient.
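
A minimal sketch of the new behavior described above (the helper name is illustrative, not the PR's actual code; assumes a non-empty JSON-lines file):

from io import StringIO
from itertools import islice

import pandas as pd

def read_json_lines_chunked(path, chunksize):
    # Read `chunksize` lines at a time, parse each batch with lines=True,
    # and concatenate the pieces at the end.
    chunks = []
    with open(path) as fh:
        while True:
            lines = list(islice(fh, chunksize))
            if not lines:
                break
            chunks.append(pd.read_json(StringIO("".join(lines)), lines=True))
    return pd.concat(chunks, ignore_index=True)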

Closes #17048.

Tests and style-check pass, no new tests added.

@gfyoung added the IO JSON (read_json, to_json, json_normalize) and Performance (memory or execution speed) labels Aug 3, 2017
@gfyoung (Member) commented Aug 3, 2017

@louispotok: Thanks for doing this! Before we can merge, you will need to add tests and a whatsnew entry to this PR.

Also note this is different from the `chunksize` parameter in
`read_csv`, which returns a FileTextReader.
If the JSON input is a string, this argument has no effect.

Member: Add a version-added tag.

Contributor Author: Will do.

@@ -322,6 +332,27 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
encoding=encoding)

def _read_json_as_lines(fh, chunksize):
Member: Add a docstring for this. Will be useful for developers.

Contributor Author: Will do.


def _get_obj(typ, json, orient, dtype, convert_axes, convert_dates,
keep_default_dates, numpy, precise_float,
date_unit):
Member: Why did you do this? I like abstraction but not sure how this relates to your PR.

Contributor Author:

Before, this code was only called once. Now it needs to be called from two places, so I extracted it into a function (see line 343).

One possible improvement I was considering is to define the function inline in the read_json function so that we don't need to pass so many parameters.

@jreback (Contributor) commented Aug 3, 2017

needs tests that validate chunksize works correctly and errors correctly

see how read_csv does this

@louispotok (Contributor Author)

@jreback thanks for the tip. I looked at the read_csv tests and I'm still a bit confused. Specifically I'm looking at "test_read_chunksize" in tests/io/parser/common.py. I see that it's testing two main things:

  1. Each returned chunk is equal to that slice of the input dataframe -- not relevant in this case, since read_json is still returning the whole dataframe.
  2. Ensure that read_csv is using _validate_integer. I wasn't using that on read_json, but I added it and am now testing for it (a simplified sketch of that check follows).
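
For reference, a simplified stand-in for the contract _validate_integer enforces (the real helper is a pandas internal; this sketch only mirrors its behavior):

def validate_integer(name, val, min_val=0):
    # Accept None, ints, and integral floats; reject everything else.
    msg = "'{}' must be an integer >={}".format(name, min_val)
    if val is None:
        return val
    if isinstance(val, float):
        if int(val) != val:
            raise ValueError(msg)
        val = int(val)
    elif not isinstance(val, int):
        raise ValueError(msg)
    if val < min_val:
        raise ValueError(msg)
    return val

validate_integer("chunksize", 2, 1)        # returns 2
# validate_integer("chunksize", 2.2, 1)    # raises ValueError
# validate_integer("chunksize", "foo", 1)  # raises ValueError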

@louispotok (Contributor Author)

Okay, the latest commits should address all these comments:

  • describe change in whatsnew
  • add version-added tag to the chunksize param in the read_json docstring
  • add docstring for _read_json_as_lines internal function
  • validate that chunksize is integer-like and >=1
  • add tests for chunksize: reads correctly and rejects invalid chunksize argument.

codecov bot commented Aug 4, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@9b07ef4). The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #17168   +/-   ##
=========================================
  Coverage          ?   90.97%
=========================================
  Files             ?      162
  Lines             ?    49500
  Branches          ?        0
=========================================
  Hits              ?    45035
  Misses            ?     4465
  Partials          ?        0

Flag       Coverage Δ
#multiple  88.75% <100%> (?)
#single    40.26% <0%> (?)

Impacted Files           Coverage Δ
pandas/io/json/json.py   100% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 9b07ef4...70a43a1.

codecov bot commented Aug 4, 2017

Codecov Report

Merging #17168 into master will decrease coverage by 0.01%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17168      +/-   ##
==========================================
- Coverage   91.27%   91.25%   -0.02%
==========================================
  Files         163      163
  Lines       49766    49766
==========================================
- Hits        45423    45414       -9
- Misses       4343     4352       +9

Flag       Coverage Δ
#multiple  89.05% <ø> (ø) ⬆️
#single    40.34% <ø> (-0.07%) ⬇️

Impacted Files           Coverage Δ
pandas/io/json/json.py   100% <ø> (ø) ⬆️
pandas/io/gbq.py         25% <0%> (-58.34%) ⬇️
pandas/core/frame.py     97.73% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update cc58b84...28d1cbe.

@louispotok (Contributor Author)

Also, do you prefer to leave these atomic commits or should I squash them?

@gfyoung (Member) commented Aug 4, 2017

@louispotok : Whatever is most convenient for you. We'll squash them at the end once we merge.

@louispotok (Contributor Author)

Thanks @gfyoung, this is ready for another round of review.

@gfyoung added this to the 0.21.0 milestone Aug 8, 2017
@gfyoung (Member) commented Aug 8, 2017

@louispotok : This looks pretty good (there are some minor stylistic things I think...@jreback?)

Could you post timing information so that we can see the performance difference with and without your changes in this PR?

Link: https://pandas.pydata.org/pandas-docs/stable/contributing.html#id53

@jorisvandenbossche (Member) left a comment:

Should we raise an error when a value for chunksize is passed and lines is not True?

@@ -197,6 +197,7 @@ Other API Changes
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now treats ``'null'`` strings as missing values by default (:issue:`16471`)
- :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
- :func:`read_json` now accepts a ``chunksize`` parameter that can reduce memory usage when ``lines=True``. (:issue:`17048`)
Member: Can you put this in the "Other Enhancements" section?

Contributor Author: Will do.

Contributor: update for this

If this is None, the file will be read into memory all at once.
Passing a chunksize helps with memory usage, but is slower.
Also note this is different from the `chunksize` parameter in
`read_csv`, which returns a FileTextReader.
Member: Shouldn't our goal be to have it with a similar behaviour?

(formatting comment: there is some extra indentation on this line that is not needed)

Contributor: you need to return an iterator (as @jorisvandenbossche indicates).

You return a new class that inherits from BaseIterator (in pandas.io.common); see how pandas.io.stata.StataReader does this. You pretty much just return a class and define __next__, which processes the chunked rows. (There is also an example of this in pandas.io.pytables, though that doesn't use BaseIterator; it should, which is another issue.)

Member: Is this needed (the subclassing from BaseIterator)? We could also just do a yield of the chunk in the _read_json_as_lines function of the PR (instead of returning the concatted frame of all chunks). That seems less invasive given the current implementation of read_json; the yield-based shape is sketched below.
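
As a rough illustration of that yield-based alternative (a sketch, not the PR's code):

from io import StringIO
from itertools import islice

import pandas as pd

def _read_json_as_lines(fh, chunksize):
    # Lazily yield one parsed DataFrame per batch of `chunksize` lines;
    # a caller who wants a single frame can pd.concat() the generator.
    while True:
        lines = list(islice(fh, chunksize))
        if not lines:
            return
        yield pd.read_json(StringIO("".join(lines)), lines=True)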

Contributor: yes, because that's how it's done elsewhere, and it avoids weird things like py 2/3 compat issues (this is in fact the point of BaseIterator).

Member: What is the python 2/3 compat issue with using yield? (it has no .next method in python 3, but you should use next(..) anyway?) That's how it is done in read_sql, so if there is an issue it should be fixed there.
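
For context, read_sql with chunksize does hand back a plain generator of DataFrames, which iterates the same way via for/next() under python 2 and 3:

import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
pd.DataFrame({"a": range(5)}).to_sql("t", con, index=False)
for chunk in pd.read_sql("SELECT * FROM t", con, chunksize=2):
    print(len(chunk))  # 2, 2, 1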

Contributor: need to update these doc-strings

@@ -263,6 +266,16 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

.. versionadded:: 0.19.0

chunksize: integer, default None
If `lines=True`, how many lines to read into memory at a time.
Member: best to use double backticks here (so ``lines=True``), then it will be formatted as code

Contributor Author: Will do.

@jorisvandenbossche (Member)

My main comment/question here is: should the API be consistent with how chunksize works for read_csv?
This would mean that when using this, the return value is an iterator.

@louispotok I understand that you don't need it in your specific case (as once concatenated, your dataframe fits into memory). But in general it could be needed in certain cases to have it as an iterator. For your case you can then do the concat yourself:

In [41]: s = """a,b
    ...: 1,2
    ...: 3,4
    ...: 5,6
    ...: 7,8"""

In [42]: pd.read_csv(StringIO(s))
Out[42]: 
   a  b
0  1  2
1  3  4
2  5  6
3  7  8

In [43]: pd.read_csv(StringIO(s), chunksize=2)
Out[43]: <pandas.io.parsers.TextFileReader at 0x7f2e9c2ef630>

In [44]: pd.concat(pd.read_csv(StringIO(s), chunksize=2))
Out[44]: 
   a  b
0  1  2
1  3  4
2  5  6
3  7  8


@@ -322,6 +335,39 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
encoding=encoding)

if chunksize is not None:
_validate_integer("chunksize", chunksize, 1)
Contributor: this should raise if lines!=True

Contributor Author: Will do.

Contributor Author: Done

@@ -1032,6 +1032,32 @@ def test_to_jsonl(self):
assert result == expected
assert_frame_equal(pd.read_json(result, lines=True), df)

def test_read_jsonchunks(self):
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Contributor: add issue number as a comment

Contributor Author: Will do.

Contributor Author: Done

pd.read_json(strio, lines=True, chunksize=2.2)

with tm.assert_raises_regex(ValueError, msg):
pd.read_json(strio, lines=True, chunksize='foo')
Contributor: test with chunksize and lines=False

Contributor Author: Will do.

Contributor Author: Done
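
The requested check might look like the following (a sketch in pytest style with a hypothetical test name; the PR itself used tm.assert_raises_regex):

from io import StringIO

import pandas as pd
import pytest

def test_chunksize_requires_lines():
    strio = StringIO('{"a": 1}\n{"a": 2}\n')
    # chunksize without lines=True should be rejected outright
    with pytest.raises(ValueError):
        pd.read_json(strio, lines=False, chunksize=2)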

@jreback removed this from the 0.21.0 milestone Aug 8, 2017
@louispotok (Contributor Author)

It looks like we still need a decision on whether to return a BaseIterator or just yield chunks. I don't have a strong opinion on that and will wait for a decision there.

I'll point out one thing though. I think there are really two issues here worth separating:

  1. The existing implementation doesn't permit loading large JSON files even when the full dataframe will fit in memory (this was the original bug report).
  2. Should we provide a way for users to handle JSON dataframes that don't fit in memory by returning an iterator / generator?

I was originally attempting to fix (1), but there may be another way to fix it instead. The culprit seems to be line 349, which on my laptop kills 10G of available memory with a 1.7GB json object. My solution to (1) was to read the file in chunks, but maybe there are other possible fixes. Then a separate pull request could solve issue (2).

What do you think?

@jreback (Contributor) commented Aug 8, 2017

It looks like we still need a decision on whether to return a BaseIterator or just yield chunks. I don't have a strong opinion on that and will wait for a decision there.

make a class that inherits BaseIterator, call it JsonLineReader. return this to the user. this solves both issues generally.
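
A rough sketch of that shape (the 2017-era pandas.io.common.BaseIterator mostly supplied __iter__ plus a py2-compatible .next(), so the plain python 3 iterator protocol stands in for it here; names are illustrative, not the merged code):

from io import StringIO
from itertools import islice

import pandas as pd

class JsonLineReader:
    def __init__(self, fh, chunksize):
        self.fh = fh
        self.chunksize = chunksize

    def __iter__(self):
        return self

    def __next__(self):
        lines = list(islice(self.fh, self.chunksize))
        if not lines:
            self.fh.close()  # close on exhaustion (see the buffer discussion below)
            raise StopIteration
        return pd.read_json(StringIO("".join(lines)), lines=True)

This covers both issues: callers who want the old all-at-once behavior can write pd.concat(JsonLineReader(fh, chunksize)), while callers whose frames don't fit in memory can iterate chunk by chunk.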

pep8speaks commented Aug 13, 2017

Hello @louispotok! Thanks for updating the PR.

Line 205:80: E501 line too long (97 > 79 characters)
Line 206:35: E231 missing whitespace after ','
Line 206:52: E231 missing whitespace after ','
Line 211:9: E722 do not use bare 'except'
Line 218:72: E226 missing whitespace around arithmetic operator
Line 224:72: E226 missing whitespace around arithmetic operator

Comment last updated on September 28, 2017 at 21:47 Hours UTC

@louispotok (Contributor Author)

Note: not ready for re-review -- found some bugs that were not caught by tests. Fixing behavior and tests.

@louispotok (Contributor Author)

Okay, this is ready for re-review.

@jreback I believe I've addressed all your comments.
@gfyoung In my latest (coarse) performance testing, with large enough chunks this is not noticeably slower than reading it all at once. Sorry for the false alarm.

One other point I want to confirm. I'm assuming (based on my reading of the code and documentation) that currently when lines=True, we always have orient="records" and the resulting object should always have an integer index [0, 1, ... , len(obj) - 1 ]. Is that a true assumption?
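
That assumption is easy to spot-check; with lines=True each input line is parsed as one record, and the result gets the default integer index:

from io import StringIO

import pandas as pd

s = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'
df = pd.read_json(StringIO(s), lines=True)
print(df.index)  # RangeIndex(start=0, stop=2, step=1)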

@louispotok (Contributor Author)

I'm not sure why the AppVeyor build broke; it looks like an unrelated issue.

@gfyoung (Member) commented Aug 14, 2017

@louispotok : Maybe not. ResourceWarning could be indicative of a stream not having been properly closed, which is related to what you are doing.

@louispotok (Contributor Author)

Ah, I think I caught it. Let's see. To document fully, in my understanding, there are 4 cases to consider. Here's the current behavior:

  1. filepath passed in without chunksize: file is opened, then read in, then closed (not changed from previous)
  2. filepath passed in with chunksize: file is opened, then JsonLineReader is returned. When JsonLineReader gets to the end of the file, it attempts to close the file.
  3. Buffer passed in without chunksize: json is read out of the buffer using filepath_or_buffer.read(), and then the object is returned, but the buffer stays open. (unchanged from previous)
  4. Buffer passed in with chunksize: JsonLineReader is returned and assumes that buffer will stay open until it has finished iterating through. When it is done, it closes the buffer.

Do those decisions make sense? I could see the argument for changing (4) so that it leaves the buffer open and the caller has to close it.

I think (and we'll see with this build) the problem was in situation (2): I was checking that chunksize was valid after the file was opened, so the test suite correctly caught the error (assert_raises_regex) but then exited while leaving the file open. If that's true, we'll have to find a way to write a test for this. (The fixed ordering is sketched below.)
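
A sketch of that fixed ordering (reusing the validate_integer and JsonLineReader names from the earlier sketches; not pandas' actual read_json):

import pandas as pd

def read_json_sketch(path, lines=False, chunksize=None):
    # Validate arguments *before* opening the file, so a bad chunksize
    # raises without leaking an open handle (the situation-(2) bug above).
    if chunksize is not None:
        if not lines:
            raise ValueError("chunksize can only be passed if lines=True")
        chunksize = validate_integer("chunksize", chunksize, 1)
    fh = open(path)
    if chunksize is not None:
        return JsonLineReader(fh, chunksize)  # reader closes fh when exhausted
    with fh:  # no chunksize: read everything and close immediately
        return pd.read_json(fh, lines=lines)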

@louispotok (Contributor Author)

Hm, still failing and I'm not sure how to investigate further. I do notice that the method that errors (assert_produces_warning) says it is not thread safe, not sure if that is relevant. Would appreciate some help here figuring out how to fix this.

@jreback merged commit 42adf7d into pandas-dev:master Sep 28, 2017
@jreback (Contributor) commented Sep 28, 2017

thanks @louispotok nice PR! very responsive! keep em coming!

@louispotok deleted the read-json-lines branch September 28, 2017 23:56
@louispotok (Contributor Author)

This was fun, thanks for all your help!

Labels: IO JSON (read_json, to_json, json_normalize), Performance (memory or execution speed)

6 participants