
read_json with lines=True not using buff/cache memory #17048

Closed
louispotok opened this issue Jul 21, 2017 · 27 comments
Labels: IO JSON (read_json, to_json, json_normalize), Performance (Memory or execution speed performance)

Comments

@louispotok (Contributor)

I have a 3.2 GB json file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12GB of available memory. This is Pandas version 0.20.2.

I'm on Ubuntu, and the free command shows >12GB of "Available" memory, most of which is "buff/cache".

I'm able to read the file into a dataframe by iterating over the file like so:

import itertools
from io import StringIO

import pandas as pd

dfs = []
with open(fp, 'r') as f:
    while True:
        # Read the file 1,000 lines at a time instead of all at once
        lines = list(itertools.islice(f, 1000))

        if lines:
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            break

df = pd.concat(dfs)

You'll notice that at the end of this I have the original data in memory twice (in the list and in the final df), but no problems.

It seems that pd.read_json with lines=True doesn't use the available memory, which looks to me like a bug.

@gfyoung added the IO JSON label on Jul 21, 2017
@gfyoung (Member) commented Jul 21, 2017

@louispotok : that behavior does sound buggy to me, but before I label it as such, could you provide a minimal reproducible example for us?

@louispotok (Contributor, Author)

Happy to, but what exactly would constitute an example here? I can provide an example json file, but how would you suggest I reproduce the memory capacity and allocation on my machine?

@gfyoung (Member) commented Jul 21, 2017

I can provide an example json file, but how would you suggest I reproduce the memory capacity and allocation on my machine?

Just provide the smallest possible JSON file that causes this MemoryError to occur.

@jreback (Contributor) commented Jul 21, 2017

The lines=True implementation is currently not designed this way. If you substitute your solution into the current implementation, does it pass the test suite?

@jreback added the Performance label on Jul 21, 2017
@jreback (Contributor) commented Jul 21, 2017

cc @aterrel

@louispotok (Contributor, Author)

@gfyoung I'm still not sure exactly what would be most helpful for you here.

I tried piping the file into a small script, head -n <N> path/to/file | python testing.py, where testing.py contains df = pd.read_json(sys.stdin, lines=True), and varied how many lines to pass (a minimal sketch of testing.py follows the results below).

Results (every million lines is about 0.8 GB, according to head -n 1000000 path/to/file | wc -c; I ran each of these a few times, in varying orders, always with the same results):

  • 1 million lines: success.
  • 1.3M lines: success.
  • 2M lines: got "Killed"; it also killed a watch in another terminal window, with the message "unable to fork process: Cannot allocate memory".
  • 3M lines: got MemoryError (I had a watch running here too, no problems at all).
  • Full file: got MemoryError.
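
For reference, a minimal sketch of what testing.py might look like, based on the description above (the final print is just to confirm the frame was built):

import sys
import pandas as pd

# Read line-delimited JSON from stdin in one shot, as in the tests above.
df = pd.read_json(sys.stdin, lines=True)
print(df.shape)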

@louispotok (Contributor, Author)

@jreback I think your question was for me, but I don't know how to do what you described. Are there instructions you could point me to?

@gfyoung (Member) commented Jul 24, 2017

1.3M lines: success

Yikes! That's a pretty massive file. That certainly helps us understand what we would need to do to reproduce this issue.

@gfyoung (Member) commented Jul 24, 2017

I think your question was for me, but I don't know how to do what you described.

Here is the documentation for making contributions to the repository. Essentially, @jreback is asking if you could somehow incorporate the workaround from your issue description into the implementation of read_json, which you can find in pandas/io/json/json.py.

A quick glance there indicates what might be the issue: we're putting ALL of the lines into a list in memory! Your workaround might be able to address that.
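
A rough sketch of the chunked approach under discussion (not the actual pandas internals; the function name here is just illustrative): read a fixed number of lines per batch and yield one DataFrame per batch, so a list of every line is never built.

import itertools
from io import StringIO

import pandas as pd

def iter_json_lines(path, chunksize=10000):
    # Yield one DataFrame per batch of `chunksize` lines.
    with open(path, 'r') as f:
        while True:
            batch = list(itertools.islice(f, chunksize))
            if not batch:
                break
            yield pd.read_json(StringIO(''.join(batch)), lines=True)

# Callers then decide how much to hold in memory at once, e.g.:
# df = pd.concat(iter_json_lines('path/to/file.jsonl'), ignore_index=True)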

@louispotok (Contributor, Author)

Thanks! I added it for one of the possible input types. You can see it here. It passes all the existing tests, and I'm now able to use it to load that file.

I think this is much slower than the previous implementation, and I don't know whether it can be extended to other input types. We could make it faster by increasing the chunk size or doing fewer concats, but at the cost of more memory usage.

@gfyoung (Member) commented Jul 24, 2017

We could make it faster by increasing the chunk size or doing fewer concats, but at the cost of more memory usage.

I think it would make sense to add such a parameter. We have it for read_csv. Try adding that and let us know how that works! This looks pretty good so far.

@louispotok (Contributor, Author)

Using the chunksize param in read_csv returns a TextFileReader, though, right? Won't that be confusing?
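
For comparison, here is roughly what the read_csv chunksize behavior being referred to looks like (the path and chunk size are placeholders):

import pandas as pd

# read_csv with chunksize returns a TextFileReader, which is iterated over
# to get one DataFrame per chunk rather than a single frame.
reader = pd.read_csv('path/to/file.csv', chunksize=100000)
df = pd.concat(reader, ignore_index=True)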

@gfyoung (Member) commented Jul 25, 2017

@louispotok : IMO, it would not, because there's more confusion when people try to pass the same parameters to one read_* function that they're used to passing to another and find out they don't work or don't exist. Thus, you would be doing all read_json users a service by adding a parameter similar to the one that exists in read_csv. 😄

@louispotok (Contributor, Author)

@gfyoung Makes sense. Here's the latest with the chunksize param.

I still don't know how to make it work on any of the other filepath_or_buffer branches, or, really, what input types would trigger those. I would need an explanation of what's happening there to extend this.

@gfyoung (Member) commented Jul 27, 2017

I would need an explanation of what's happening there to extend this.

Certainly. We accept three types of inputs for read_json:

  • file-path (this option BTW is not clearly documented, so a PR to make this clearer is welcome!)
  • file-object
  • valid JSON string

Your contribution would address the first two options. You have at this point addressed the first one. The second comes in the conditional that checks whether the filepath_or_buffer has a read method. Thus, you should also add your logic under that check (we'll handle refactoring later).
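
A rough illustration (not the actual pandas source) of how those three input types can be told apart:

import os

def classify_input(filepath_or_buffer):
    # File-like objects expose a read method; strings are either paths or raw JSON.
    if hasattr(filepath_or_buffer, 'read'):
        return 'file-object'
    if isinstance(filepath_or_buffer, str) and os.path.exists(filepath_or_buffer):
        return 'file-path'
    return 'JSON string'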

@louispotok (Contributor, Author)

Okay @gfyoung, thanks for your help. I added it to that conditional you mentioned as well. Latest here. Passes the tests.

I also changed the behavior so that if chunksize is not explicitly passed, we try to read it all at once. My thinking is that using chunksize changes the performance drastically, and it's better to let people make this tradeoff explicitly without changing the default behavior.

From here, what are the next steps? There's probably a bit of cleanup you'd like me to do -- let me know. Thanks again!

@gfyoung (Member) commented Aug 3, 2017

@louispotok : Sure thing. Just submit a PR, and we'll be happy to review!

@louispotok (Contributor, Author)

Here goes: #17168.

@alessandrobenedetti

Hi,
I am experimenting with JSON files of various sizes.
I am using pandas read_json with lines=True and noticing very high memory usage in the parsing phase.
Using a chunksize of 10,000 to experiment, for example:

  • Input JSON: 280 MB. Memory usage: up to 2.6 GB. Resulting DataFrame: 400 MB (because of dtypes, not much I can do about this).
  • Input JSON: 4 GB. Memory usage: up to 28 GB. Resulting DataFrame: 6 GB.

It seems the memory needed to parse the JSON is far too high (not sure if there are better ways to read big JSON files in pandas).
Furthermore, this memory seems to remain allocated to the Python process.
Now, I am a Python newbie, so this may be perfectly fine, and the memory may just be kept around as a buffer for Python to reuse if needed (it doesn't grow when the DataFrame starts getting processed).
But it looks suspicious.
Let me know if you have noticed the same and found any tips or tricks for it!
Thanks in advance.

@louispotok (Contributor, Author)

@alessandrobenedetti

I've definitely experienced some of what you're describing.

First, the read_json function probably uses more memory overall than it needs to. I don't fully know why that is or how to improve it - that probably belongs in a separate issue if it's important to what you're doing.

Second, when lines=True, I think you're right that all the memory isn't actually being used; it's just not being released back to the OS, so the reported number is a bit spurious.

Third, if you read with lines=True and a small chunksize, you should be fine either way.

@alessandrobenedetti

Hi @louispotok, thank you for the kind answer.
I just noticed that even using simpler approaches such as:

import pandas as pd

with open(interactions_input_file) as json_file:
    data_lan = []
    for line in json_file:
        # Parse one line-delimited JSON record at a time
        data_lan.append(pd.io.json.loads(line))

all_columns = data_lan[0].keys()
print("Size " + str(len(data_lan)))
interactions = pd.DataFrame(columns=all_columns, data=data_lan)

I have similar memory usage.
I will stop the conversation here as it's slightly off topic.
Should I assume that parsing JSON lines in Python is just that expensive?
We are talking about 5-7 times more RAM than the initial file...

@rosswait
Copy link

I'm having a similar experience with this function as well, @alessandrobenedetti. I ended up regenerating my data so I could use read_csv instead, which uses a dramatically smaller amount of RAM.

@alessandrobenedetti

Thanks @rosswait, I have a small update, if that helps...

My file was heavily string- and list-based (each line was a JSON object with a lot of strings and lists of strings).
As a matter of fact, those strings were actually integer ids, so once I realized that I switched the strings to ints and the lists to lists of ints.
This brought the size of the JSON down from 4.5 GB to 3 GB, and the memory usage from 30 GB to 10 GB.
If I end up with stricter memory requirements I will definitely take a look at the CSV option.
Thanks!

@vamlumber

The problem still exists. I am loading a 5 GB JSON file with 16 GB of RAM, but I still get a MemoryError. The lines=True option still does not work as expected.

@vongohren

If anyone is going to implement a better version of this, it's worth looking at these libraries for doing out-of-memory JSON reads:
https://pypi.org/project/jsonslicer/#description
https://github.com/ICRAR/ijson

I experienced heavy memory issues just opening a JSON file, but these libraries fixed that and added parsing functionality on top of it, without being too expensive :D
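
For anyone landing here later, a rough sketch of streaming with ijson when the file is one large top-level JSON array (line-delimited files would instead be parsed line by line; the filename is a placeholder):

import ijson

count = 0
with open('big.json', 'rb') as f:
    # 'item' matches each element of a top-level JSON array, streamed one at a time.
    for record in ijson.items(f, 'item'):
        count += 1  # replace with whatever per-record work is needed
print(count)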

@wvwhome commented Mar 25, 2020

json.loads on a single line creates about a 50K dictionary for me, for each of the 7,000 lines. The resulting dataframe is 38 MB as measured by the asizeof function. The memory issue is that about 300 MB is used overall and stays as a high-water mark. The statement is:
data_frame = pd.read_json(in_file_name, lines=True)
Further analysis of the RSS (Resident Set Size) values indicates that the actual memory used by the dataframe is about 300 MB.

@louispotok (Contributor, Author) commented Mar 25, 2020

Just to be clear, the PR that closed this issue did NOT solve the underlying issue of memory usage when reading JSON. Instead, it added a chunksize parameter to the read_json method, which lets you balance speed vs. memory usage for your use case. See the docs on line-delimited JSON for more info; a short usage sketch follows below.

@wvwhome
@vamlumber
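
For anyone arriving here from search, usage of that parameter looks roughly like this (the file path and chunk size are placeholders):

import pandas as pd

# With lines=True and chunksize set, read_json returns an iterator of
# DataFrames instead of a single frame, bounding peak memory usage.
reader = pd.read_json('path/to/file.jsonl', lines=True, chunksize=100000)
df = pd.concat(reader, ignore_index=True)  # or process each chunk and discard it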


9 participants