Potential memory issue when using COPY with PARTITIONED BY #11042
Comments
Does the issue happen if you do COPY without PARTITIONED BY? The partitioning code spawns a potentially large number of tokio tasks (one for each partition), so if those tasks are not being cleaned up properly, it could lead to a memory leak.
Yes, I can see the same behavior by running a COPY query (without PARTITIONED BY) multiple times.
However, the memory increase is smaller and it takes many more queries to make it noticeable (10+).
@devinjdangelo do you have any suggestions on how to investigate this issue further? I am happy to take the lead on it. I was chatting with @alamb yesterday and he suggested using heaptrack, but I was wondering if you would suggest other options. Thanks!
A self contained example script may be helpful. I have used the peak_alloc crate in the past as a very simple way to measure how much memory is being consumed. Heaptrack will provide more detail and will likely help narrow down the source of the issue faster, but a self contained script I think is useful for demonstration and sanity checking.
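For what it's worth, a minimal self-contained sketch along those lines might look like the following. The table name, input path, and partition column are placeholders, and it assumes the datafusion, tokio, and peak_alloc crates are available in Cargo.toml:

```rust
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use peak_alloc::PeakAlloc;

// Route all allocations through peak_alloc so we can report current/peak usage.
#[global_allocator]
static PEAK_ALLOC: PeakAlloc = PeakAlloc;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // "my_table.parquet" and "col1" are placeholders for the real input data.
    ctx.register_parquet("my_table", "my_table.parquet", ParquetReadOptions::default())
        .await?;

    for run in 1..=5 {
        ctx.sql(
            "COPY (SELECT * FROM my_table) TO './output' \
             STORED AS PARQUET PARTITIONED BY (col1)",
        )
        .await?
        .collect()
        .await?;

        println!(
            "run {run}: current = {:.0} MB, peak = {:.0} MB",
            PEAK_ALLOC.current_usage_as_mb(),
            PEAK_ALLOC.peak_usage_as_mb()
        );
    }
    Ok(())
}
```

If memory is leaking across COPY invocations, the "current" number should keep climbing from run to run rather than returning to roughly the same baseline.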
BTW something we have seen in InfluxDB, especially for very compressible data, was that the arrow writer was consuming substantial memory. Something that might be worth testing would be to set the parquet writer's limit on the number of rows per data page; by default it is unlimited. We just changed the default upstream in arrow-rs (apache/arrow-rs#5957), but that is not yet released.
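Assuming the setting being referred to is the parquet writer's data_page_row_count_limit (which matches the "unlimited by default" description above), a minimal sketch of the arrow-rs-level knob:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Hedged sketch: cap the number of rows buffered per data page.
    // The default is effectively unlimited at the time of this discussion;
    // how DataFusion forwards this setting to COPY is discussed further below.
    let props = WriterProperties::builder()
        .set_data_page_row_count_limit(20_000)
        .build();
    println!("row count limit: {}", props.data_page_row_count_limit());
}
```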
Possibly related to #11344, where the memory tracking for the parquet writing could be improved.
Thanks for the suggestion @alamb. I tested with a different value for that setting. In general I have been having a hard time trying to debug this since there is no memory tracking for the write path yet. I'll wait for #11344 to land and test again.
I have had good luck with the Instruments tool that comes with Xcode on macOS, specifically the Allocations instrument. But I think @wiedld said she didn't have good luck with it, so your mileage may vary.
While using the xcode allocations tool, I was getting <10% of the allocations vs the peak measured with the time builtin. (Note: our process was an application-triggered datafusion query and not via the datafusion-cli.) As a result I ended up using heaptrack.
I ran into the same problems. Here is the work-around I used (I'm sure there are others):
If you run into any issues, or find any better alternatives, please let me know @hveiga.
Ah, I forgot to mention a key point. When extracting data via heaptrack_print, I was looking at memory peaks and hence used the option that reports peak memory consumption.
I finally have some time to continue investigating this issue. I have not been able to make heaptrack work (yet!) but I did try using dhat and I got an interesting lead. I won't claim I am experienced with this tool, but I was curious why it was highlighting that particular code path. After disabling it I see the memory increasing only marginally for every invocation (in the 100-200MB range), while with it enabled the increase is much larger.

I don't have a root cause of the issue yet but wanted to share this behavior in case somebody else might find this pattern familiar. This might also only be a red herring and not the actual issue. I also found apache/arrow-rs#5828 which might be related and/or relevant.
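For anyone who wants to reproduce this kind of profile, a minimal dhat-rs harness looks roughly like this (a sketch only, not the exact setup used above; the workload function is a placeholder):

```rust
// Hedged sketch of a dhat-rs harness (requires the `dhat` crate in Cargo.toml).
// Running this writes a dhat-heap.json file that can be inspected with dhat's viewer.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // The profiler records allocations until it is dropped at the end of main.
    let _profiler = dhat::Profiler::new_heap();

    // Placeholder for the workload under investigation, e.g. driving the
    // COPY ... PARTITIONED BY query through DataFusion as in this issue.
    run_workload();
}

fn run_workload() {
    // Intentionally left as a stub in this sketch.
}
```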
I would expect that the memory usage highlighted in apache/arrow-rs#5828 would be directly reduced by setting the rows-per-page limit discussed above.
I wonder if this could be related to DataFusion overriding the default parquet writer settings. I think you can set this option like:
COPY (SELECT col1, timestamp, col10, col12 FROM my_table ORDER BY col1 ASC, timestamp ASC)
TO './output' STORED AS PARQUET PARTITIONED BY (col1)
OPTIONS (
compression 'uncompressed',
'format.parquet.data_pagesize_limit' 20000
);
@alamb is mentioning the rows-per-page limit. There are some current gotchas with using the defaults for it.
@hveiga is correct that this is one suspected place with extra memory usage (specifically in the dict_encoder) when processing many rows per page. But that is not the only possible impact from very-many-rows-per-page, which is why we focused on changing the default for that config setting.
Thanks for all the feedback, we are actively testing this and can report our findings later today. Regarding the option name, we want to make sure we are setting the right one.
There is another similarly named option, so it is easy to mix the two up. Sorry this is so confusing.
I believe the right one is the rows-per-page limit mentioned above.
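To make the two similarly named knobs concrete, here is a hedged sketch of setting both through DataFusion's session configuration. The field names are taken from DataFusion's ParquetOptions; data_pagesize_limit is a best-effort byte limit per page, while data_page_row_count_limit is a best-effort row limit per page:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    let mut config = SessionConfig::new();
    {
        let parquet = &mut config.options_mut().execution.parquet;
        // Best-effort maximum number of bytes per data page.
        parquet.data_pagesize_limit = 20_000;
        // Best-effort maximum number of rows per data page.
        parquet.data_page_row_count_limit = 20_000;
    }
    let _ctx = SessionContext::new_with_config(config);
    // The intent is that COPY ... PARTITIONED BY queries run through this
    // context pick up these writer settings; whether they are correctly
    // forwarded is part of what is being discussed above.
}
```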
Describe the bug
Memory does not get freed after executing multiple COPY ... TO ... PARTITIONED BY ... queries. I have not been able to identify what is causing this behavior.

To Reproduce
The behavior can be observed using datafusion-cli. I have been monitoring the memory usage through Activity Monitor.
- Start datafusion-cli.
- Run a COPY .. PARTITIONED BY query and monitor memory usage.
- Run the same COPY .. PARTITIONED BY query again and continue monitoring memory usage.
Expected behavior
My expectation is to be able to run the COPY command multiple times without the memory usage increasing every time.

Additional context
There is more context of what I am trying to do in Discord: https://discord.com/channels/885562378132000778/1166447479609376850/1253419900043526236
I am also experiencing the same behavior when running my application in Kubernetes. K8s terminates my pod once it exceeds the pod memory limits.