
read_json with lines=True not using buff/cache memory #17048

Closed
louispotok opened this issue Jul 21, 2017 · 27 comments
Labels: IO JSON (read_json, to_json, json_normalize), Performance (Memory or execution speed performance)

Comments

@louispotok (Contributor)

I have a 3.2 GB json file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12GB of available memory. This is Pandas version 0.20.2.

I'm on Ubuntu, and the free command shows >12GB of "Available" memory, most of which is "buff/cache".

I'm able to read the file into a dataframe by iterating over the file like so:

import itertools
from io import StringIO

import pandas as pd

dfs = []
with open(fp, 'r') as f:
    while True:
        # Read the file 1,000 lines at a time instead of all at once
        lines = list(itertools.islice(f, 1000))

        if lines:
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            break

df = pd.concat(dfs)

You'll notice that at the end of this I have the original data in memory twice (in the list and in the final df), but no problems.

It seems that pd.read_json with lines=True doesn't use the available memory, which looks to me like a bug.

@gfyoung added the IO JSON label on Jul 21, 2017
@gfyoung (Member) commented Jul 21, 2017

@louispotok : that behavior does sound buggy to me, but before I label it as such, could you provide a minimal reproducible example for us?

@louispotok (Contributor, Author)

Happy to, but what exactly would constitute an example here? I can provide an example json file, but how would you suggest I reproduce the memory capacity and allocation on my machine?

@gfyoung (Member) commented Jul 21, 2017

I can provide an example json file, but how would you suggest I reproduce the memory capacity and allocation on my machine?

Just provide the smallest possible JSON file that causes this MemoryError to occur.

@jreback (Contributor) commented Jul 21, 2017

The lines=True implementation is currently not designed this way. If you substitute your solution into the current implementation, does it pass the test suite?

@jreback added the Performance label on Jul 21, 2017
@jreback (Contributor) commented Jul 21, 2017

cc @aterrel

@louispotok (Contributor, Author)

@gfyoung I'm still not sure exactly what would be most helpful for you here.

I tried piping the file into a small script, head -n <N> path/to/file | python testing.py, where testing.py contains df = pd.read_json(sys.stdin, lines=True), and varied how many lines to pass (a minimal sketch of testing.py follows the results below).

Results (every million lines is about 0.8 GB, according to head -n 1000000 path/to/file | wc -c; I ran each of these a few times, in varying orders, always with the same results):

  • 1 million lines: success.
  • 1.3M lines: success.
  • 2M lines: got "Killed"; it also killed a watch in another terminal window, with the message "unable to fork process: Cannot allocate memory".
  • 3M lines: got MemoryError (I had a watch running here too, no problems at all).
  • Full file: got MemoryError.
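
For reference, a minimal sketch of what testing.py might look like, based on the description above (the final print is just to confirm the frame was built):

import sys
import pandas as pd

# Read line-delimited JSON from stdin in one shot, as in the tests above.
df = pd.read_json(sys.stdin, lines=True)
print(df.shape)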

@louispotok (Contributor, Author)

@jreback I think your question was for me, but I don't know how to do what you described. Are there instructions you could point me to?

@gfyoung (Member) commented Jul 24, 2017

1.3M lines: success

Yikes! That's a pretty massive file. That certainly helps us understand what we would need to do to reproduce this issue.

@gfyoung (Member) commented Jul 24, 2017

I think your question was for me, but I don't know how to do what you described.

Here is the documentation for making contributions to the repository. Essentially, @jreback is asking if you could somehow incorporate the workaround from your issue description into the implementation of read_json, which you can find in pandas/io/json/json.py.

A quick glance there indicates what might be the issue: we're putting ALL of the lines into a list in memory! Your workaround might be able to address that.
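
A rough sketch of the chunked approach under discussion (not the actual pandas internals; the function name here is just illustrative): read a fixed number of lines per batch and yield one DataFrame per batch, so a list of every line is never built.

import itertools
from io import StringIO

import pandas as pd

def iter_json_lines(path, chunksize=10000):
    # Yield one DataFrame per batch of `chunksize` lines.
    with open(path, 'r') as f:
        while True:
            batch = list(itertools.islice(f, chunksize))
            if not batch:
                break
            yield pd.read_json(StringIO(''.join(batch)), lines=True)

# Callers then decide how much to hold in memory at once, e.g.:
# df = pd.concat(iter_json_lines('path/to/file.jsonl'), ignore_index=True)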

@louispotok (Contributor, Author)

Thanks! I added it for one of the possible input types. You can see it here. It passes all the existing tests, and I'm now able to use it to load that file.

I think this is much slower than the previous implementation, and I don't know whether it can be extended to other input types. We could make it faster by increasing the chunk size or doing fewer concats, but at the cost of more memory usage.

@gfyoung (Member) commented Jul 24, 2017

We could make it faster by increasing the chunk size or doing fewer concats, but at the cost of more memory usage.

I think it would make sense to add such a parameter. We have it for read_csv. Try adding that and let us know how that works! This looks pretty good so far.

@louispotok (Contributor, Author)

Using the chunksize param in read_csv returns a TextFileReader, though, right? Won't that be confusing?
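
For comparison, here is roughly what the read_csv chunksize behavior being referred to looks like (the path and chunk size are placeholders):

import pandas as pd

# read_csv with chunksize returns a TextFileReader, which is iterated over
# to get one DataFrame per chunk rather than a single frame.
reader = pd.read_csv('path/to/file.csv', chunksize=100000)
df = pd.concat(reader, ignore_index=True)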

@gfyoung (Member) commented Jul 25, 2017

@louispotok : IMO, it would not, because there's more confusion when people try to pass the same parameters to one read_* function that they're used to passing to another and find out they don't work or don't exist. Thus, you would be doing all read_json users a service by adding a parameter similar to the one that exists in read_csv. 😄

@louispotok (Contributor, Author)

@gfyoung Makes sense. Here's the latest with the chunksize param.

I still don't know how to make it work on any of the other filepath_or_buffer branches, or, really, what input types would trigger those. I would need an explanation of what's happening there to extend this.

@gfyoung (Member) commented Jul 27, 2017

I would need an explanation of what's happening there to extend this.

Certainly. We accept three types of inputs for read_json:

  • file-path (this option BTW is not clearly documented, so a PR to make this clearer is welcome!)
  • file-object
  • valid JSON string

Your contribution would address the first two options. You have at this point addressed the first one. The second comes in the conditional that checks whether the filepath_or_buffer has a read method. Thus, you should also add your logic under that check (we'll handle refactoring later).
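
A rough illustration (not the actual pandas source) of how those three input types can be told apart:

import os

def classify_input(filepath_or_buffer):
    # File-like objects expose a read method; strings are either paths or raw JSON.
    if hasattr(filepath_or_buffer, 'read'):
        return 'file-object'
    if isinstance(filepath_or_buffer, str) and os.path.exists(filepath_or_buffer):
        return 'file-path'
    return 'JSON string'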

@louispotok (Contributor, Author)

Okay @gfyoung, thanks for your help. I added it to that conditional you mentioned as well. Latest here. Passes the tests.

I also changed the behavior so that if chunksize is not explicitly passed, we try to read it all at once. My thinking is that using chunksize changes the performance drastically, and it's better to let people make this tradeoff explicitly without changing the default behavior.

From here, what are the next steps? There's probably a bit of cleanup you'd like me to do -- let me know. Thanks again!

@gfyoung (Member) commented Aug 3, 2017

@louispotok : Sure thing. Just submit a PR, and we'll be happy to review!

@louispotok (Contributor, Author)

Here goes: #17168.

@alessandrobenedetti

Hi,
I am experimenting with JSON files of various sizes.
I am using pandas read_json with lines=True and noticing very high memory usage in the parsing phase.
Using a chunksize of 10,000 to experiment, for example:

  • Input JSON: 280 MB. Memory usage: up to 2.6 GB. Resulting DataFrame: 400 MB (because of dtypes, not much I can do about this).
  • Input JSON: 4 GB. Memory usage: up to 28 GB. Resulting DataFrame: 6 GB.

It seems the memory needed to parse the JSON is far too high (not sure if there are better ways to read big JSON files in pandas).
Furthermore, this memory seems to remain allocated to the Python process.
Now, I am a Python newbie, so this may be perfectly fine, and the memory may just be kept around as a buffer for Python to reuse if needed (it doesn't grow when the DataFrame starts getting processed).
But it looks suspicious.
Let me know if you have noticed the same and found any tips or tricks for it!
Thanks in advance.

@louispotok (Contributor, Author)

@alessandrobenedetti

I've definitely experienced some of what you're describing.

First, the read_json function probably uses more memory overall than it needs to. I don't fully know why that is or how to improve it - that probably belongs in a separate issue if it's important to what you're doing.

Second, when lines=True, I think you're right that all the memory isn't actually being used; it's just not being released back to the OS, so the reported number is a bit spurious.

Third, if you read with lines=True and a small chunksize, you should be fine either way.

@alessandrobenedetti

Hi @louispotok, thank you for the kind answer.
I just noticed that even using simpler approaches such as:

import pandas as pd

with open(interactions_input_file) as json_file:
    data_lan = []
    for line in json_file:
        # Parse one line-delimited JSON record at a time
        data_lan.append(pd.io.json.loads(line))

all_columns = data_lan[0].keys()
print("Size " + str(len(data_lan)))
interactions = pd.DataFrame(columns=all_columns, data=data_lan)

I have similar memory usage.
I will stop the conversation here as it's slightly off topic.
Should I assume that parsing JSON lines in Python is just that expensive?
We are talking about 5-7 times more RAM than the initial file...

@rosswait
Copy link

I'm having a similar experience with this function as well, @alessandrobenedetti. I ended up regenerating my data so I could use read_csv instead, which uses a dramatically smaller amount of RAM.

@alessandrobenedetti

Thanks @rosswait, I have a small update, if that helps...

My file was heavily string- and list-based (each line was a JSON object with a lot of strings and lists of strings).
As a matter of fact, those strings were actually integer ids, so once I realized that I switched the strings to ints and the lists to lists of ints.
This brought the size of the JSON down from 4.5 GB to 3 GB, and the memory usage from 30 GB to 10 GB.
If I end up with stricter memory requirements I will definitely take a look at the CSV option.
Thanks!

@vamlumber

The problem still exists. I am loading a 5 GB JSON file with 16 GB of RAM, but I still get a MemoryError. The lines=True option still does not work as expected.

@vongohren

If anyone is going to implement a better version of this, it's worth looking at these libraries for doing out-of-memory JSON reads:
https://pypi.org/project/jsonslicer/#description
https://github.com/ICRAR/ijson

I experienced heavy memory issues just opening a JSON file, but these libraries fixed that and added parsing functionality on top of it, without being too expensive :D
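
For anyone landing here later, a rough sketch of streaming with ijson when the file is one large top-level JSON array (line-delimited files would instead be parsed line by line; the filename is a placeholder):

import ijson

count = 0
with open('big.json', 'rb') as f:
    # 'item' matches each element of a top-level JSON array, streamed one at a time.
    for record in ijson.items(f, 'item'):
        count += 1  # replace with whatever per-record work is needed
print(count)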

@wvwhome commented Mar 25, 2020

json.loads on a single line creates about a 50K dictionary for me, for each of the 7,000 lines. The resulting dataframe is 38 MB as measured by the asizeof function. The memory issue is that about 300 MB is used overall and stays as a high-water mark. The statement is:
data_frame = pd.read_json(in_file_name, lines=True)
Further analysis of the RSS (Resident Set Size) values indicates that the actual memory used by the dataframe is about 300 MB.

@louispotok (Contributor, Author) commented Mar 25, 2020

Just to be clear, the PR that closed this issue did NOT solve the underlying issue of memory usage when reading JSON. Instead, it added a chunksize parameter to the read_json method, which lets you balance speed vs. memory usage for your use case. See the docs on line-delimited JSON for more info; a short usage sketch follows below.

@wvwhome
@vamlumber
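
For anyone arriving here from search, usage of that parameter looks roughly like this (the file path and chunk size are placeholders):

import pandas as pd

# With lines=True and chunksize set, read_json returns an iterator of
# DataFrames instead of a single frame, bounding peak memory usage.
reader = pd.read_json('path/to/file.jsonl', lines=True, chunksize=100000)
df = pd.concat(reader, ignore_index=True)  # or process each chunk and discard it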


9 participants