-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: operations with NPartitions not a power of 2 fails in some cases. BUG IN PARTITON_MANAGER?. possible solution in here! #7383
Comments
This method, in @classmethod
def map_partitions_joined_by_column(
cls,
partitions,
column_splits,
map_func,
map_func_args=None,
map_func_kwargs=None,
):
"""
Combine several blocks by column into one virtual partition and apply "map_func" to them.
Parameters
----------
partitions : NumPy 2D array
Partitions of Modin Frame.
column_splits : int
The number of splits by column.
map_func : callable
Function to apply.
map_func_args : iterable, optional
Positional arguments for the 'map_func'.
map_func_kwargs : dict, optional
Keyword arguments for the 'map_func'.
Returns
-------
NumPy array
An array of new partitions for Modin Frame.
"""
if column_splits < 1:
raise ValueError(
"The value of columns_splits must be greater than or equal to 1."
)
# step cannot be less than 1
step = max(partitions.shape[0] // column_splits, 1)
preprocessed_map_func = cls.preprocess_func(map_func)
kw = {
"num_splits": step,
}
result = np.empty(partitions.shape, dtype=object)
for i in range(
0,
partitions.shape[0],
step,
):
joined_column_partitions = cls.column_partitions(partitions[i : i + step])
for j in range(partitions.shape[1]):
result[i : i + step, j] = joined_column_partitions[j].apply(
preprocessed_map_func,
*map_func_args if map_func_args is not None else (),
**kw,
**map_func_kwargs if map_func_kwargs is not None else {},
)
return result So when joined_column_partitions[j].apply(
preprocessed_map_func,
*map_func_args if map_func_args is not None else (),
**kw,
**map_func_kwargs if map_func_kwargs is not None else {},
) while this just wants 1 element: result[i : i + step, j] which means in this case... result[74 : 75, 0] # where this just means ..
result[74,0] # because the shape[0] is just 75 This all seams to happen because in if np.prod(partitions.shape) <= 1.5 * CpuCount.get(): this evaluates to false when i go above 64 partitions, so the behaviour changes. Still I fail to see how I can fix this. I dont need nor want column partition, as we just have very little columns. |
back to the joined_column_partitions = cls.column_partitions(partitions[i : i + step])
for j in range(partitions.shape[1]):
result[i : i + step, j] = joined_column_partitions[j].apply(
preprocessed_map_func,
*map_func_args if map_func_args is not None else (),
**kw,
**map_func_kwargs if map_func_kwargs is not None else {},
) I fund something I dont quite understand
Makes me nuts. I need to solve this but I dont understand |
So it seams the issue is that the apply function gets the number of splits with the this is set right before with: kw = {
"num_splits" : step
} This is an issue because in case of step changing it to kw = {
"num_splits": len(partitions[i : i + step]),
} The issue does not appear. But I cant see how I can apply that fix for my pipeline now.... Without pulling and building from source.. right? |
I am also affected by this issue. In my case I'm using awswrangler and a ray cluster and this has been very difficult to reproduce. It seems that it's sensitive to partitions and only happens sometimes. |
I thought i could help myself out by just using even number of partitions, but that does only help for small number of partitions.
This is a mayor issue, as it completely undermines the whole purpose of the package. I need to be able to set partition sizes without it crashing... |
Could some collaborator look into this solution if it would be viable for a hotfix? |
…th custom NPartitions Signed-off-by: Anatoly Myachev <[email protected]>
Hi @Liquidmasl! Thanks for researching the problem. I opened #7399 to fix this problem. Will you be able to upgrade to the new version of Modin? |
I cannot tell you how hapy you are making me right now! I will be able to upgrade to the new version immediatly yes! |
…rtitions (#7399) Signed-off-by: Anatoly Myachev <[email protected]>
Thank you very much for the super fast response time here, amazing! I suppose it might be some time until the next release, so until then I will try and install directly from github, lets see how successful I am Thanks again! |
Modin version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
I was am really struggling creating a reproducer, all the data I create does not lead to the error.
I will further try to find artificial data that works.
I have a pointcloud that I save in 75 partitions to parquets
When I load it again (using a fresh process), i get issues
leads to
works fine
Setting MinColumnsPerPartition to something larger then the amount of columns I have (its just 15 columns) nothing changes.
Issue Description
using partitions numbers that are no power of 2 can lead to issues. Details below
Expected Behavior
I would like the operations to not fail with the given error
Error Logs
Installed Versions
INSTALLED VERSIONS
commit : c8bbca8
python : 3.11.9.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.0-40-generic
Version : #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
Modin dependencies
modin : 0.31.0
ray : 2.24.0
dask : 2024.6.0
distributed : 2024.6.0
pandas dependencies
pandas : 2.2.2
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : 2.0.32
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: