[Data] Reimplement of fix memory pandas #48970
Conversation
sampled_indices = np.random.choice(
    total_size, sample_size, replace=False
)
sampled_data = sampled_column[sampled_indices]
Let's see if the tests work -- I remember last time I had to implement `def take` in PythonObjectArray, but maybe you can get around it more easily.
Relevant tests are failing, it looks like.
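For readers following along, here is a minimal sketch of the sampling approach the quoted diff implements. The function name, the `max_sample_size` default, and the use of `sys.getsizeof` are assumptions for illustration, not the PR's exact code:

```python
import sys

import numpy as np
import pandas as pd


def estimate_object_column_nbytes(column: pd.Series, max_sample_size: int = 100) -> int:
    """Estimate the in-memory size of an object-dtype column by sampling.

    Hypothetical sketch: sample up to max_sample_size elements without
    replacement, measure them, and scale the mean up to the full length.
    """
    total_size = len(column)
    if total_size == 0:
        return 0
    sample_size = min(max_sample_size, total_size)
    sampled_indices = np.random.choice(total_size, sample_size, replace=False)
    sampled_data = column.to_numpy()[sampled_indices]
    # Measure each sampled Python object, then scale up to the full column.
    sampled_bytes = sum(sys.getsizeof(value) for value in sampled_data)
    return int(sampled_bytes / sample_size * total_size)


# Usage: a string column whose .nbytes would otherwise only count pointers.
series = pd.Series(["a" * n for n in range(1000)])
print(estimate_object_column_nbytes(series))
```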
It can pass this time, but two tests were retried. These two tests have no problem locally, so I'm not sure it's related to the memory problem; i.e., when running the full suite there is an OOM, but a retry can succeed. Let me try to make it stable first. Currently I fixed many tests by checking for numeric tensors or NumPy data. I think in those cases it's safe to use `nbytes` directly, while object/string columns need a further check.
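A minimal sketch of the dispatch this comment describes, reusing the estimator sketched earlier. The dtype check here is illustrative only; the PR's real logic also handles extension types such as TensorDtype:

```python
import numpy as np
import pandas as pd


def column_size_bytes(column: pd.Series) -> int:
    """Return exact nbytes for fixed-width data; estimate otherwise."""
    if isinstance(column.dtype, np.dtype) and column.dtype != np.dtype("object"):
        # Numeric and other fixed-width NumPy dtypes report exact sizes cheaply.
        return column.to_numpy().nbytes
    # Object columns (e.g. Python strings): nbytes would only count the
    # pointers, not the referenced objects, so fall back to sampling.
    return estimate_object_column_nbytes(column)
```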
# TensorDtype for ray.air.util.tensor_extensions.pandas.TensorDtype
object_need_check = (TensorDtype,)
min_sample_size = _PANDAS_SIZE_BYTES_MIN_COUNT
Looks like it should be `max_sample_size`? As in "max number of items to sample".
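A tiny illustration of the naming point, with an assumed constant value; it shows the constant acting as an upper bound on the sample rather than a minimum:

```python
# Assumed value for illustration; the real constant lives in Ray's code.
_PANDAS_SIZE_BYTES_MIN_COUNT = 100

total_size = 10  # a short column
sample_size = min(_PANDAS_SIZE_BYTES_MIN_COUNT, total_size)
assert sample_size == total_size  # the constant caps, not floors, the sample
```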
@richardliaw @raulchen I think it's OK now.
Thanks for this PR! I've had to hack around this exact issue so many times - it would be great if we could get a release with the fix ASAP 🙌
Why are these changes needed?
Fix of #46939, with sampling to reduce the overhead. Key improvement: for numeric tensors and NumPy-backed columns, `nbytes` can be used directly, so no sampling is needed (see the illustration after the issue link below).

Related issue number
#46785
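For background on why the `nbytes` fast path matters, a small comparison using pandas' public API; the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": np.arange(1_000_000, dtype=np.float64),
    "obj": ["x" * 8] * 1_000_000,
})

# Cheap: shallow memory_usage only counts the underlying buffers
# (for object columns, that's just the pointers), like .nbytes.
print(df.memory_usage(deep=False))

# Expensive: deep=True walks every Python object in object columns,
# which is the overhead that sampling is meant to avoid.
print(df.memory_usage(deep=True))
```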
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.