
Add chunksize param to read_json when lines=True #17168

Merged: 62 commits into pandas-dev:master from louispotok:read-json-lines, Sep 28, 2017

Conversation

@louispotok (Contributor) commented Aug 3, 2017

Previous behavior: read the whole file into memory, then split it into lines.
New behavior: if lines=True and chunksize is passed, read chunksize lines at a time and concat the results.

This only covers some kinds of input to read_json. When chunksize is passed, read_json becomes slower but more memory-efficient.
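
A minimal sketch of the new behavior described above (the helper name is illustrative, not the PR's actual code; assumes a non-empty JSON-lines file):

from io import StringIO
from itertools import islice

import pandas as pd

def read_json_lines_chunked(path, chunksize):
    # Read `chunksize` lines at a time, parse each batch with lines=True,
    # and concatenate the pieces at the end.
    chunks = []
    with open(path) as fh:
        while True:
            lines = list(islice(fh, chunksize))
            if not lines:
                break
            chunks.append(pd.read_json(StringIO("".join(lines)), lines=True))
    return pd.concat(chunks, ignore_index=True)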

Closes #17048.

Tests and style-check pass, no new tests added.

@gfyoung added the IO JSON (read_json, to_json, json_normalize) and Performance (memory or execution speed) labels Aug 3, 2017
@gfyoung (Member) commented Aug 3, 2017

@louispotok: Thanks for doing this! Before we can merge, you will need to add tests and a whatsnew entry to this PR.

Also note this is different from the `chunksize` parameter in
`read_csv`, which returns a FileTextReader.
If the JSON input is a string, this argument has no effect.

Member: Add a version-added tag.

Contributor Author: Will do.

@@ -322,6 +332,27 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
encoding=encoding)

def _read_json_as_lines(fh, chunksize):
Member: Add a docstring for this. Will be useful for developers.

Contributor Author: Will do.


def _get_obj(typ, json, orient, dtype, convert_axes, convert_dates,
keep_default_dates, numpy, precise_float,
date_unit):
Member: Why did you do this? I like abstraction but not sure how this relates to your PR.

Contributor Author:

Before, this code was only called once. Now it needs to be called from two places, so I extracted it into a function (see line 343).

One possible improvement I was considering is to define the function inline in the read_json function so that we don't need to pass so many parameters.

@jreback (Contributor) commented Aug 3, 2017

needs tests that validate chunksize works correctly and errors correctly

see how read_csv does this

@louispotok (Contributor Author)

@jreback thanks for the tip. I looked at the read_csv tests and I'm still a bit confused. Specifically I'm looking at "test_read_chunksize" in tests/io/parser/common.py. I see that it's testing two main things:

  1. Each returned chunk is equal to that slice of the input dataframe -- not relevant in this case, since read_json is still returning the whole dataframe.
  2. Ensure that read_csv is using _validate_integer. I wasn't using that on read_json, but I added it and am now testing for it (a simplified sketch of that check follows).
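
For reference, a simplified stand-in for the contract _validate_integer enforces (the real helper is a pandas internal; this sketch only mirrors its behavior):

def validate_integer(name, val, min_val=0):
    # Accept None, ints, and integral floats; reject everything else.
    msg = "'{}' must be an integer >={}".format(name, min_val)
    if val is None:
        return val
    if isinstance(val, float):
        if int(val) != val:
            raise ValueError(msg)
        val = int(val)
    elif not isinstance(val, int):
        raise ValueError(msg)
    if val < min_val:
        raise ValueError(msg)
    return val

validate_integer("chunksize", 2, 1)        # returns 2
# validate_integer("chunksize", 2.2, 1)    # raises ValueError
# validate_integer("chunksize", "foo", 1)  # raises ValueError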

@louispotok (Contributor Author)

Okay, the latest commits should address all these comments:

  • describe change in whatsnew
  • add version-added tag to the chunksize param in the read_json docstring
  • add docstring for _read_json_as_lines internal function
  • validate that chunksize is integer-like and >=1
  • add tests for chunksize: reads correctly and rejects invalid chunksize argument.

codecov bot commented Aug 4, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@9b07ef4). The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #17168   +/-   ##
=========================================
  Coverage          ?   90.97%
=========================================
  Files             ?      162
  Lines             ?    49500
  Branches          ?        0
=========================================
  Hits              ?    45035
  Misses            ?     4465
  Partials          ?        0

Flag       Coverage Δ
#multiple  88.75% <100%> (?)
#single    40.26% <0%> (?)

Impacted Files           Coverage Δ
pandas/io/json/json.py   100% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 9b07ef4...70a43a1.

codecov bot commented Aug 4, 2017

Codecov Report

Merging #17168 into master will decrease coverage by 0.01%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17168      +/-   ##
==========================================
- Coverage   91.27%   91.25%   -0.02%
==========================================
  Files         163      163
  Lines       49766    49766
==========================================
- Hits        45423    45414       -9
- Misses       4343     4352       +9

Flag       Coverage Δ
#multiple  89.05% <ø> (ø) ⬆️
#single    40.34% <ø> (-0.07%) ⬇️

Impacted Files           Coverage Δ
pandas/io/json/json.py   100% <ø> (ø) ⬆️
pandas/io/gbq.py         25% <0%> (-58.34%) ⬇️
pandas/core/frame.py     97.73% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update cc58b84...28d1cbe.

@louispotok (Contributor Author)

Also, do you prefer to leave these atomic commits or should I squash them?

@gfyoung (Member) commented Aug 4, 2017

@louispotok : Whatever is most convenient for you. We'll squash them at the end once we merge.

@louispotok (Contributor Author)

Thanks @gfyoung, this is ready for another round of review.

@gfyoung added this to the 0.21.0 milestone Aug 8, 2017
@gfyoung (Member) commented Aug 8, 2017

@louispotok : This looks pretty good (there are some minor stylistic things I think...@jreback?)

Could you post timing information so that we can see the performance difference with and without your changes in this PR?

Link: https://pandas.pydata.org/pandas-docs/stable/contributing.html#id53

@jorisvandenbossche (Member) left a comment:

Should we raise an error when a value for chunksize is passed and lines is not True?

@@ -197,6 +197,7 @@ Other API Changes
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now treats ``'null'`` strings as missing values by default (:issue:`16471`)
- :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
- :func:`read_json` now accepts a ``chunksize`` parameter that can reduce memory usage when ``lines=True``. (:issue:`17048`)
Member: Can you put this in the "Other Enhancements" section?

Contributor Author: Will do.

Contributor: update for this

If this is None, the file will be read into memory all at once.
Passing a chunksize helps with memory usage, but is slower.
Also note this is different from the `chunksize` parameter in
`read_csv`, which returns a FileTextReader.
Member: Shouldn't our goal be to have it with a similar behaviour?

(formatting comment: there is some extra indentation on this line that is not needed)

Contributor: you need to return an iterator (as @jorisvandenbossche indicates).

You return a new class that inherits from BaseIterator (in pandas.io.common); see how pandas.io.stata.StataReader does this. You pretty much just return a class and define __next__, which processes the chunked rows. (There is also an example of this in pandas.io.pytables, though that doesn't use BaseIterator; it should, which is another issue.)

Member: Is this needed (the subclassing from BaseIterator)? We could also just do a yield of the chunk in the _read_json_as_lines function of the PR (instead of returning the concatted frame of all chunks). That seems less invasive given the current implementation of read_json; the yield-based shape is sketched below.
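
As a rough illustration of that yield-based alternative (a sketch, not the PR's code):

from io import StringIO
from itertools import islice

import pandas as pd

def _read_json_as_lines(fh, chunksize):
    # Lazily yield one parsed DataFrame per batch of `chunksize` lines;
    # a caller who wants a single frame can pd.concat() the generator.
    while True:
        lines = list(islice(fh, chunksize))
        if not lines:
            return
        yield pd.read_json(StringIO("".join(lines)), lines=True)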

Contributor: yes, because that's how it's done elsewhere, and it avoids weird things like py 2/3 compat issues (this is in fact the point of BaseIterator).

Member: What is the python 2/3 compat issue with using yield? (it has no .next method in python 3, but you should use next(..) anyway?) That's how it is done in read_sql, so if there is an issue it should be fixed there.
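
For context, read_sql with chunksize does hand back a plain generator of DataFrames, which iterates the same way via for/next() under python 2 and 3:

import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
pd.DataFrame({"a": range(5)}).to_sql("t", con, index=False)
for chunk in pd.read_sql("SELECT * FROM t", con, chunksize=2):
    print(len(chunk))  # 2, 2, 1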

Contributor: need to update these doc-strings

@@ -263,6 +266,16 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

.. versionadded:: 0.19.0

chunksize: integer, default None
If `lines=True`, how many lines to read into memory at a time.
Member: best to use double backticks here (so ``lines=True``), then it will be formatted as code

Contributor Author: Will do.

@jorisvandenbossche (Member)

My main comment/question here is: should the API be consistent with how chunksize works for read_csv?
This would mean that when using this, the return value is an iterator.

@louispotok I understand that you don't need it in your specific case (as once concatenated, your dataframe fits into memory). But in general it could be needed in certain cases to have it as an iterator. For your case you can then do the concat yourself:

In [41]: s = """a,b
    ...: 1,2
    ...: 3,4
    ...: 5,6
    ...: 7,8"""

In [42]: pd.read_csv(StringIO(s))
Out[42]: 
   a  b
0  1  2
1  3  4
2  5  6
3  7  8

In [43]: pd.read_csv(StringIO(s), chunksize=2)
Out[43]: <pandas.io.parsers.TextFileReader at 0x7f2e9c2ef630>

In [44]: pd.concat(pd.read_csv(StringIO(s), chunksize=2))
Out[44]: 
   a  b
0  1  2
1  3  4
2  5  6
3  7  8


@@ -322,6 +335,39 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,

filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
encoding=encoding)

if chunksize is not None:
_validate_integer("chunksize", chunksize, 1)
Contributor: this should raise if lines!=True

Contributor Author: Will do.

Contributor Author: Done

@@ -1032,6 +1032,32 @@ def test_to_jsonl(self):
assert result == expected
assert_frame_equal(pd.read_json(result, lines=True), df)

def test_read_jsonchunks(self):
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Contributor: add issue number as a comment

Contributor Author: Will do.

Contributor Author: Done

pd.read_json(strio, lines=True, chunksize=2.2)

with tm.assert_raises_regex(ValueError, msg):
pd.read_json(strio, lines=True, chunksize='foo')
Contributor: test with chunksize and lines=False

Contributor Author: Will do.

Contributor Author: Done
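
The requested check might look like the following (a sketch in pytest style with a hypothetical test name; the PR itself used tm.assert_raises_regex):

from io import StringIO

import pandas as pd
import pytest

def test_chunksize_requires_lines():
    strio = StringIO('{"a": 1}\n{"a": 2}\n')
    # chunksize without lines=True should be rejected outright
    with pytest.raises(ValueError):
        pd.read_json(strio, lines=False, chunksize=2)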

@jreback removed this from the 0.21.0 milestone Aug 8, 2017
@louispotok (Contributor Author)

It looks like we still need a decision on whether to return a BaseIterator or just yield chunks. I don't have a strong opinion on that and will wait for a decision there.

I'll point out one thing though. I think there are really two issues here worth separating:

  1. The existing implementation doesn't permit loading large JSON files even when the full dataframe will fit in memory (this was the original bug report).
  2. Should we provide a way for users to handle JSON dataframes that don't fit in memory by returning an iterator / generator?

I was originally attempting to fix (1), but there may be another way to fix it instead. The culprit seems to be line 349, which on my laptop kills 10G of available memory with a 1.7GB json object. My solution to (1) was to read the file in chunks, but maybe there are other possible fixes. Then a separate pull request could solve issue (2).

What do you think?

@jreback (Contributor) commented Aug 8, 2017

It looks like we still need a decision on whether to return a BaseIterator or just yield chunks. I don't have a strong opinion on that and will wait for a decision there.

make a class that inherits BaseIterator, call it JsonLineReader. return this to the user. this solves both issues generally.
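
A rough sketch of that shape (the 2017-era pandas.io.common.BaseIterator mostly supplied __iter__ plus a py2-compatible .next(), so the plain python 3 iterator protocol stands in for it here; names are illustrative, not the merged code):

from io import StringIO
from itertools import islice

import pandas as pd

class JsonLineReader:
    def __init__(self, fh, chunksize):
        self.fh = fh
        self.chunksize = chunksize

    def __iter__(self):
        return self

    def __next__(self):
        lines = list(islice(self.fh, self.chunksize))
        if not lines:
            self.fh.close()  # close on exhaustion (see the buffer discussion below)
            raise StopIteration
        return pd.read_json(StringIO("".join(lines)), lines=True)

This covers both issues: callers who want the old all-at-once behavior can write pd.concat(JsonLineReader(fh, chunksize)), while callers whose frames don't fit in memory can iterate chunk by chunk.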

pep8speaks commented Aug 13, 2017

Hello @louispotok! Thanks for updating the PR.

Line 205:80: E501 line too long (97 > 79 characters)
Line 206:35: E231 missing whitespace after ','
Line 206:52: E231 missing whitespace after ','
Line 211:9: E722 do not use bare 'except'
Line 218:72: E226 missing whitespace around arithmetic operator
Line 224:72: E226 missing whitespace around arithmetic operator

Comment last updated on September 28, 2017 at 21:47 Hours UTC

@louispotok (Contributor Author)

Note: not ready for re-review -- found some bugs that were not caught by tests. Fixing behavior and tests.

@louispotok (Contributor Author)

Okay, this is ready for re-review.

@jreback I believe I've addressed all your comments.
@gfyoung In my latest (coarse) performance testing, with large enough chunks this is not noticeably slower than reading it all at once. Sorry for the false alarm.

One other point I want to confirm. I'm assuming (based on my reading of the code and documentation) that currently when lines=True, we always have orient="records" and the resulting object should always have an integer index [0, 1, ... , len(obj) - 1 ]. Is that a true assumption?
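
That assumption is easy to spot-check; with lines=True each input line is parsed as one record, and the result gets the default integer index:

from io import StringIO

import pandas as pd

s = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'
df = pd.read_json(StringIO(s), lines=True)
print(df.index)  # RangeIndex(start=0, stop=2, step=1)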

@louispotok (Contributor Author)

I'm not sure why the AppVeyor build broke; it looks like an unrelated issue.

@gfyoung (Member) commented Aug 14, 2017

@louispotok : Maybe not. ResourceWarning could be indicative of a stream not having been properly closed, which is related to what you are doing.

@louispotok (Contributor Author)

Ah, I think I caught it. Let's see. To document fully, in my understanding, there are 4 cases to consider. Here's the current behavior:

  1. filepath passed in without chunksize: file is opened, then read in, then closed (not changed from previous)
  2. filepath passed in with chunksize: file is opened, then JsonLineReader is returned. When JsonLineReader gets to the end of the file, it attempts to close the file.
  3. Buffer passed in without chunksize: json is read out of the buffer using filepath_or_buffer.read(), and then the object is returned, but the buffer stays open. (unchanged from previous)
  4. Buffer passed in with chunksize: JsonLineReader is returned and assumes that buffer will stay open until it has finished iterating through. When it is done, it closes the buffer.

Do those decisions make sense? I could see the argument for changing (4) so that it leaves the buffer open and the caller has to close it.

I think (and we'll see with this build) the problem was in situation (2): I was checking that chunksize was valid after the file was opened, so the test suite correctly caught the error (assert_raises_regex) but then exited while leaving the file open. If that's true, we'll have to find a way to write a test for this. (The fixed ordering is sketched below.)
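
A sketch of that fixed ordering (reusing the validate_integer and JsonLineReader names from the earlier sketches; not pandas' actual read_json):

import pandas as pd

def read_json_sketch(path, lines=False, chunksize=None):
    # Validate arguments *before* opening the file, so a bad chunksize
    # raises without leaking an open handle (the situation-(2) bug above).
    if chunksize is not None:
        if not lines:
            raise ValueError("chunksize can only be passed if lines=True")
        chunksize = validate_integer("chunksize", chunksize, 1)
    fh = open(path)
    if chunksize is not None:
        return JsonLineReader(fh, chunksize)  # reader closes fh when exhausted
    with fh:  # no chunksize: read everything and close immediately
        return pd.read_json(fh, lines=lines)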

@louispotok (Contributor Author)

Hm, still failing and I'm not sure how to investigate further. I do notice that the method that errors (assert_produces_warning) says it is not thread safe, not sure if that is relevant. Would appreciate some help here figuring out how to fix this.

@jreback merged commit 42adf7d into pandas-dev:master Sep 28, 2017
@jreback (Contributor) commented Sep 28, 2017

thanks @louispotok nice PR! very responsive! keep em coming!

@louispotok deleted the read-json-lines branch September 28, 2017 23:56
@louispotok (Contributor Author)

This was fun, thanks for all your help!

Labels: IO JSON (read_json, to_json, json_normalize), Performance (memory or execution speed)

6 participants