
[Python] Processes killed and semaphore objects leaked when reading pandas data #28936

Closed
asfimport opened this issue Jul 2, 2021 · 12 comments

Comments

@asfimport
Collaborator

asfimport commented Jul 2, 2021

When I run pa.Table.from_pandas(df) on a >1 GB dataframe, it reports:

Killed: 9 ../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

Environment: macOS 11.4
Python version: 3.8.10
PyArrow version: 4.0.1
Reporter: Koyomi Akaguro
Assignee: Weston Pace / @westonpace

Related issues:

Note: This issue was originally created as ARROW-13254. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
I believe these are two different messages. The first (Killed: 9) is coming from the OOM killer. You must be running out of RAM on the device. Is this expected?

The second message, "../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown", is Python stating that resources weren't quite released properly. Given that the process was suddenly killed with a -9 signal, this isn't too surprising. I don't think this message is relevant.

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  Thanks for the reply. How can I tell whether loading the data runs out of RAM? And if that is the case, do you have any suggestions?

@asfimport
Collaborator Author

Weston Pace / @westonpace:
First, you should figure out how large your dataframe is. You could use df.memory_usage(deep=True) to get this information.

Second, you should determine how much memory you have available. The Linux command "free -h" can be used to get this information.

To convert from Pandas safely you will probably need around double the amount of memory required to store the dataframe.  If you do not have this much memory then you can convert the table in parts.
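
A minimal sketch of that check, assuming the psutil package is available (the 2x factor is only the rough rule of thumb above, not an exact requirement):

import pandas as pd
import psutil

def can_convert_safely(df: pd.DataFrame) -> bool:
    # Compare the dataframe's in-memory size against available RAM;
    # converting to Arrow may briefly need roughly twice the dataframe's size.
    df_bytes = df.memory_usage(deep=True).sum()
    available = psutil.virtual_memory().available
    print(f"dataframe: {df_bytes / 1e9:.1f} GB, available RAM: {available / 1e9:.1f} GB")
    return 2 * df_bytes <= available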

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  If it needs double the amount of memory, then yes, it goes over what I have. Though weirdly, I ran exactly the same code several months ago and it went fine.

Regarding converting the table in parts, do you mean splitting the dataframe, converting each part to a pa.Table, and then combining them?

@asfimport
Collaborator Author

Weston Pace / @westonpace:
There are a few reasons it may have worked previously.  If the data or data type changed then the amount of memory used in either representation may have changed.  It's possible your OS was previously allowing swap and that was allowing you to run over the amount of physical memory on the device.  It's also possible the amount of available memory on the server has changed because other processes are running that were not running previously.

 

Regarding converting the table in parts, do you mean splitting the dataframe, converting each part to a pa.Table, and then combining them?

Yes, but you will need to make sure to delete the old parts of the dataframe as they are no longer needed. For example:

 

import pyarrow as pa

# Convert the first million rows, then the rest, dropping each pandas
# chunk as soon as it has been converted so it can be garbage collected.
df_1 = df.iloc[:1000000, :]
df_2 = df.iloc[1000000:, :]
del df
table_1 = pa.Table.from_pandas(df_1)
del df_1
table_2 = pa.Table.from_pandas(df_2)
del df_2
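
To answer the "then combine" part: the per-chunk tables can be stitched back together with pa.concat_tables, which by default combines the chunks without copying the underlying buffers. A minimal sketch continuing the example above:

table = pa.concat_tables([table_1, table_2])
del table_1, table_2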

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  Yes, my OS does use swap. I watched Activity Monitor and saw the whole Python process jump from 12 GB of memory use (the data is around 9 GB) to 60+ GB during pa.Table.from_pandas, and then it was killed. Why would the process use more than six times as much memory as the data?

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  I tried several times, cutting my data to different sizes. When I use only 1/4 of the data, memory use doubles from 3+ GB to 6 GB and the conversion finishes smoothly. However, when I use half of the data, which is 4.4 GB, memory use again jumps from 6+ GB to 60 GB and the process is killed. It seems that once my data is larger than some limit, pa.Table.from_pandas just blows up memory without bound.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
6x memory usage is not normal.  This may be blowup from the dynamic allocator or it could be a bug.  In fact, it sounds a bit like ARROW-12983.  You mentioned you did not encounter this in the past.  Do you get this error with the exact same data on 3.0.0?  If you do not get the error then ARROW-12983 is most likely the culprit.  Are you able to try with the latest nightly build (https://arrow.apache.org/docs/python/install.html#installing-nightly-packages)?
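
A small sketch of how that comparison could be run in each environment (pa.total_allocated_bytes reports what Arrow's default memory pool has allocated; the function name here is just for illustration):

import pyarrow as pa

def convert_and_report(df):
    # Record the pyarrow version and how much memory Arrow allocates for
    # the table, so runs on 3.0.0, 4.0.1 and a nightly can be compared.
    print("pyarrow version:", pa.__version__)
    before = pa.total_allocated_bytes()
    table = pa.Table.from_pandas(df)
    after = pa.total_allocated_bytes()
    print(f"Arrow allocated {(after - before) / 1e9:.1f} GB for the table")
    return table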

@asfimport
Collaborator Author

Koyomi Akaguro:
@westonpace  I downgraded to 3.0 and the problem disappears. I haven't tried the latest nightly build yet, but as you said, this seems to be a bug in 4.0.

@asfimport
Collaborator Author

Weston Pace / @westonpace:
I'm going to go ahead and close this as a duplicate of ARROW-12983.  If you try it on 5.0.0 (the next version to have the fix for ARROW-12983) or the latest nightly and the issue is still there then feel free to reopen.

@asfimport
Collaborator Author

Todd Farmer / @toddfarmer:
Transitioning issue from Resolved to Closed based on resolution field value.

@frank-amy

How can I remove the message above?
