read_json with lines=True not using buff/cache memory #17048
Comments
@louispotok : that behavior does sound buggy to me, but before I label it as such, could you provide a minimal reproducible example for us?
Happy to, but what exactly would constitute an example here? I can provide an example JSON file, but how would you suggest I reproduce the memory capacity and allocation on my machine?
Just provide the smallest possible JSON file that causes this.
cc @aterrel
@gfyoung I'm still not sure exactly what would be most helpful for you here. I tried the following. Results:
@jreback I think your question was for me, but I don't know how to do what you described. Are there instructions you could point me to?
Yikes! That's a pretty massive file. That does certainly help us with regards to what we would need to do to reproduce this issue.
Here is the documentation for making contributions to the repository. Essentially @jreback is asking if you could somehow incorporate the workaround from your issue description into the implementation of `read_json`. A quick glance there indicates what might be the issue: we're putting ALL of the lines into a list in memory! Your workaround might be able to address that.
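For illustration, a rough sketch of the difference between collecting every line up front and processing the file in chunks, assuming a line-delimited JSON file; this is a standalone example, not the actual pandas internals:

```python
import json

import pandas as pd


# Pattern described above: every raw line is materialized in memory at once.
def read_all_at_once(path):
    with open(path) as f:
        lines = f.readlines()  # the whole file held as a list of strings
    return pd.DataFrame([json.loads(line) for line in lines])


# Chunked alternative: only up to `chunksize` parsed records are buffered at a time.
def read_in_chunks(path, chunksize=10_000):
    frames = []
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(json.loads(line))
            if len(chunk) >= chunksize:
                frames.append(pd.DataFrame(chunk))
                chunk = []
    if chunk:
        frames.append(pd.DataFrame(chunk))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```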
Thanks! I added it for one of the possible input types. You can see it here. It passes all the existing tests, and I'm now able to use it to load that file. I think this is much slower than the previous implementation, and I don't know whether it can be extended to other input types. We could make it faster by increasing the chunk size or doing fewer
I think it would make sense to add such a parameter. We have it for `read_csv`.
Using the chunksize param in
@louispotok : IMO, it would not because there's more confusion when people try to pass in the same parameters to one
@gfyoung Makes sense. Here's the latest with the chunksize param. I still don't know how to make it work on any of the other filepath_or_buffer branches, or really what the input types are that would trigger those. I would need an explanation of what's happening there to extend this.
Certainly. We accept three types of inputs for `read_json`: a path to a file on disk, an open file-like object (buffer), or a string containing the JSON itself.
Your contribution would address the first two options. You have at this point addressed the first one. The second comes in the conditional that checks if the
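For concreteness, a quick sketch of those three input forms (the filename is hypothetical, and this reflects `read_json` as it behaved in the pandas versions discussed in this thread, not necessarily current ones):

```python
import io

import pandas as pd

json_lines = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'

# 1) A path to a file on disk (hypothetical filename).
df_from_path = pd.read_json("records.jsonl", lines=True)

# 2) An open file-like object / buffer.
df_from_buffer = pd.read_json(io.StringIO(json_lines), lines=True)

# 3) A string containing the JSON itself (accepted at the time of this thread;
#    newer pandas versions deprecate passing raw JSON strings).
df_from_string = pd.read_json(json_lines, lines=True)
```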
Okay @gfyoung, thanks for your help. I added it to that conditional you mentioned as well. Latest here. It passes the tests. I also changed the behavior so that if chunksize is not explicitly passed, we try to read it all at once. My thinking is that using chunksize changes the performance drastically, and it's better to let people make this tradeoff explicitly without changing the default behavior. From here, what are the next steps? There's probably a bit of cleanup you'd like me to do -- let me know. Thanks again!
@louispotok : Sure thing. Just submit a PR, and we'll be happy to review!
Here goes: #17168.
Hi,
I've definitely experienced some of what you're describing. First, the read_json function probably uses more memory overall than it needs to; I don't fully know why that is or how to improve it, and that probably belongs in a separate issue if it's important to what you're doing. Second, when lines=True, I think you're right that all the memory isn't actually being used; it's just not being released back to the OS, so the reported number is a bit spurious. Third, if you read with lines=True and a small chunksize, you should be fine either way.
Hi @louispotok, thank you for the kind answer. In my code I do `all_columns = data_lan[0].keys()`, and I have similar memory results.
I'm having a similar experience with this function as well, @alessandrobenedetti. I ended up regenerating my data to use read_csv instead, which uses a dramatically smaller amount of RAM.
Thanks @rosswait, I have a small update if that helps... My file was heavily string- and list-based (each line was a JSON object with a lot of strings and lists of strings).
The problem still exists. I am loading a 5 GB JSON file on a machine with 16 GB of RAM, but I still get a MemoryError. The lines=True option still does not work as expected.
If anyone is going to implement a better version of this, it's worth looking at existing libraries that do an out-of-memory JSON read. I experienced heavy memory issues just opening a JSON file, but these libraries fixed that issue and added parsing functionality on top of it, without being too expensive :D
json.loads for a single item creates roughly a 50K dictionary for me for each of the 7,000 lines. The resulting DataFrame is 38 MB as measured by the asizeof function. The memory issue is that about 300 MB is used overall and stays as a high-water mark. The statement is:
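For reference, asizeof here presumably refers to pympler.asizeof; a rough sketch of that kind of measurement, with purely illustrative record sizes:

```python
import json

import pandas as pd
from pympler import asizeof  # third-party package: pip install pympler

# Build records of roughly the shape described above (sizes are illustrative).
line = json.dumps({"text": "x" * 50_000})
records = [json.loads(line) for _ in range(7_000)]
df = pd.DataFrame(records)

print(asizeof.asizeof(records))  # deep size of the parsed dictionaries
print(asizeof.asizeof(df))       # deep size of the resulting DataFrame
```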
Just to be clear, the PR that closed this issue did NOT solve the underlying issue of memory usage when reading JSON. Instead it added a chunksize parameter.
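For anyone landing here, a usage sketch of that parameter (the filename is hypothetical); it bounds how many lines are parsed at a time, but the final DataFrame still has to fit in memory if you keep every chunk:

```python
import pandas as pd

# With lines=True and chunksize, read_json returns an iterator of DataFrames
# instead of parsing the whole file at once.
reader = pd.read_json("big_file.jsonl", lines=True, chunksize=100_000)

chunks = []
for chunk in reader:      # each chunk is a DataFrame of up to 100,000 rows
    chunks.append(chunk)  # or filter/aggregate here to avoid keeping everything

df = pd.concat(chunks, ignore_index=True)
```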
I have a 3.2 GB json file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12GB of available memory. This is Pandas version 0.20.2.
I'm on Ubuntu, and the `free` command shows >12 GB of "Available" memory, most of which is "buff/cache". I'm able to read the file into a dataframe by iterating over the file like so:
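A rough sketch of that kind of line-by-line workaround (an illustration with a hypothetical path, not necessarily the original snippet):

```python
import json

import pandas as pd

records = []
with open("data.jsonl") as f:  # hypothetical path to the large line-delimited file
    for line in f:             # read and parse one line at a time
        records.append(json.loads(line))

df = pd.DataFrame(records)     # the parsed list and the DataFrame now both live in memory
```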
You'll notice that at the end of this I have the original data in memory twice (in the list and in the final df), but no problems.
It seems that `pd.read_json` with `lines=True` doesn't use the available memory, which looks to me like a bug.