
[Python] Processes killed and semaphore objects leaked when reading pandas data #28936

Closed
asfimport opened this issue Jul 2, 2021 · 12 comments

Comments

@asfimport
Collaborator

asfimport commented Jul 2, 2021

When I run pa.Table.from_pandas(df) on a >1 GB dataframe, it reports:

Killed: 9 ../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

Environment: macOS 11.4
Python version: 3.8.10
PyArrow version: 4.0.1
Reporter: Koyomi Akaguro
Assignee: Weston Pace / @westonpace

Related issues:

Note: This issue was originally created as ARROW-13254. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
I believe these are two different messages. The first (Killed: 9) is coming from the OOM killer. You must be running out of RAM on the device. Is this expected?

The second message, "../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown", is Python stating that resources weren't quite released properly. Given that the process was suddenly killed with a -9 signal, this isn't too surprising. I don't think this message is relevant.

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  Thanks for the reply. How can I tell whether loading the data runs out of RAM? And if that is the case, do you have any suggestions?

@asfimport
Collaborator Author

Weston Pace / @westonpace:
First, you should figure out how large your dataframe is. You could use df.memory_usage(deep=True) to get this information.

Second, you should determine how much memory you have available. The Linux command "free -h" can be used to get this information.

To convert from Pandas safely you will probably need around double the amount of memory required to store the dataframe.  If you do not have this much memory then you can convert the table in parts.
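
A minimal sketch of that check, assuming the psutil package is available (the 2x factor is only the rough rule of thumb above, not an exact requirement):

import pandas as pd
import psutil

def can_convert_safely(df: pd.DataFrame) -> bool:
    # Compare the dataframe's in-memory size against available RAM;
    # converting to Arrow may briefly need roughly twice the dataframe's size.
    df_bytes = df.memory_usage(deep=True).sum()
    available = psutil.virtual_memory().available
    print(f"dataframe: {df_bytes / 1e9:.1f} GB, available RAM: {available / 1e9:.1f} GB")
    return 2 * df_bytes <= available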

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  If it needs double the amount of memory, then yes, it goes over what I have. Though weirdly, I ran exactly the same code several months ago and it went fine.

Regarding converting the table in parts, do you mean splitting the dataframe, converting each part to a pa.Table, and then combining them?

@asfimport
Collaborator Author

Weston Pace / @westonpace:
There are a few reasons it may have worked previously.  If the data or data type changed then the amount of memory used in either representation may have changed.  It's possible your OS was previously allowing swap and that was allowing you to run over the amount of physical memory on the device.  It's also possible the amount of available memory on the server has changed because other processes are running that were not running previously.

 

Regarding converting the table in parts, do you mean splitting the dataframe, converting each part to a pa.Table, and then combining them?

Yes, but you will need to make sure to delete the old parts of the dataframe as they are no longer needed. For example:

 

import pyarrow as pa

# Convert the first million rows, then the rest, dropping each pandas
# chunk as soon as it has been converted so it can be garbage collected.
df_1 = df.iloc[:1000000, :]
df_2 = df.iloc[1000000:, :]
del df
table_1 = pa.Table.from_pandas(df_1)
del df_1
table_2 = pa.Table.from_pandas(df_2)
del df_2
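
To answer the "then combine" part: the per-chunk tables can be stitched back together with pa.concat_tables, which by default combines the chunks without copying the underlying buffers. A minimal sketch continuing the example above:

table = pa.concat_tables([table_1, table_2])
del table_1, table_2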

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  Yes, my OS does use swap. I watched Activity Monitor and saw the whole Python process jump from 12 GB of memory use (the data is around 9 GB) to 60+ GB during pa.Table.from_pandas, and then it was killed. Why would the process use more than six times as much memory as the data?

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  I tried several times, cutting my data to different sizes. When I use only 1/4 of the data, memory use doubles from 3+ GB to 6 GB and the conversion finishes smoothly. However, when I use half of the data, which is 4.4 GB, memory use again jumps from 6+ GB to 60 GB and the process is killed. It seems that once my data is larger than some limit, pa.Table.from_pandas just blows up memory without bound.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
6x memory usage is not normal.  This may be blowup from the dynamic allocator or it could be a bug.  In fact, it sounds a bit like ARROW-12983.  You mentioned you did not encounter this in the past.  Do you get this error with the exact same data on 3.0.0?  If you do not get the error then ARROW-12983 is most likely the culprit.  Are you able to try with the latest nightly build (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)?
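
A small sketch of how that comparison could be run in each environment (pa.total_allocated_bytes reports what Arrow's default memory pool has allocated; the function name here is just for illustration):

import pyarrow as pa

def convert_and_report(df):
    # Record the pyarrow version and how much memory Arrow allocates for
    # the table, so runs on 3.0.0, 4.0.1 and a nightly can be compared.
    print("pyarrow version:", pa.__version__)
    before = pa.total_allocated_bytes()
    table = pa.Table.from_pandas(df)
    after = pa.total_allocated_bytes()
    print(f"Arrow allocated {(after - before) / 1e9:.1f} GB for the table")
    return table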

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  I downgraded to 3.0 and the problem disappears. I haven't tried the latest nightly build yet, but as you said, this seems to be a bug in 4.0.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
I'm going to go ahead and close this as a duplicate of ARROW-12983.  If you try it on 5.0.0 (the next version to have the fix for ARROW-12983) or the latest nightly and the issue is still there then feel free to reopen.

@asfimport
Collaborator Author

Todd Farmer / @toddfarmer:
Transitioning issue from Resolved to Closed based on resolution field value.

@frank-amy

How can I remove the message above?
