searchsorted function implementation #1284

aregm · 2020-04-15T04:36:11Z

No description provided.

amyskov · 2020-07-02T13:51:37Z

The first version of searchsorted was implemented with the MapReduce approach, but its performance turned out to be worse than pure pandas. The reason is that numpy.searchsorted (which pandas calls) has very low complexity, so the overheads of Modin conversions and of the reduce stage become significant. Looking for a way to solve this issue.
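
For readers unfamiliar with the idea, here is a minimal sketch of what a MapReduce-style searchsorted could look like, assuming the sorted data is split into contiguous row blocks. It uses plain NumPy rather than Modin partitions, and `mapreduce_searchsorted` and `n_partitions` are hypothetical names used only for illustration, not the code from the PR.

```python
import numpy as np


def mapreduce_searchsorted(sorted_values, value, n_partitions=4, side="left"):
    """Toy map-reduce searchsorted over contiguous chunks of a sorted array."""
    # "Partition" the sorted data into contiguous row blocks.
    partitions = np.array_split(sorted_values, n_partitions)
    # Map: each block reports how many of its elements precede `value`.
    partial = [np.searchsorted(part, value, side=side) for part in partitions]
    # Reduce: the global insertion index is the sum of the per-block counts.
    return np.sum(partial, axis=0)


data = np.sort(np.random.default_rng(0).uniform(0, 10, size=1_000).round(1))
print(mapreduce_searchsorted(data, 5), np.searchsorted(data, 5))
```

The map step is cheap, but every block still returns a result that has to be collected and summed, which is where the extra overhead relative to a single binary search comes from.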

devin-petersohn · 2020-07-07T23:00:38Z

@amyskov Is it possible with just a MapFunction?

MapFunction.register(lambda x, value, side, sorter: x.squeeze(axis=1).searchsorted(value, side, sorter))

That is a start. You will need to convert the result of searchsorted back to a DataFrame, then in the API layer (dataframe.py) convert it to pandas, then to a list. Does that make sense to you? Do you think it would work?
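
A rough sketch of the conversion steps described above, written against plain pandas so it can be run without Modin. The function name `searchsorted_map` and the sample frame are assumptions made for illustration; the Modin-side registration and the dataframe.py wiring are not shown.

```python
import numpy as np
import pandas as pd


def searchsorted_map(partition, value, side="left", sorter=None):
    """Hypothetical per-partition function: one-column frame in, frame out."""
    series = partition.squeeze(axis=1)  # one-column DataFrame -> Series
    result = series.searchsorted(value, side=side, sorter=sorter)
    # searchsorted returns a scalar for a scalar `value`; wrap it uniformly.
    return pd.DataFrame(np.atleast_1d(result))


frame = pd.DataFrame({"data_column1": [0.5, 1.0, 2.5, 7.0]})
as_frame = searchsorted_map(frame, [1.0, 3.0])  # result as a DataFrame
as_list = as_frame.iloc[:, 0].tolist()          # API layer: back to a list
print(as_list)
```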

amyskov · 2020-07-08T12:46:06Z

@devin-petersohn I don't think that will solve the problem: after calling the map function on each partition, we get one result per partition, and those results have to be processed (reduced) anyway.
I measured the execution time of Series.searchsorted in pure pandas for different Series sizes (see attached graph; 1e8 elements in a Series is equivalent to a ~500 MB .csv file). The curve is very close to k*log(N), as mentioned in the numpy docs, and even for 1e9 elements the execution time is ~1e-4 seconds, which is less than the potential Modin overheads. Based on this, I think that applying the function to each partition separately is not a good approach in this case and that we have to apply the function to the full frame at once. One possible solution is implemented in the PR linked to this issue, in commit 97bcd1e, using the query_compiler._modin_frame._apply_full_axis function. What do you think about it?
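
A small sketch of how such a pure-pandas measurement could be reproduced in memory, assuming much smaller sizes than in the graph so that it runs quickly. `best_searchsorted_time` is a hypothetical helper, and the printed timings depend on the machine.

```python
import math
from timeit import default_timer as timer

import numpy as np
import pandas as pd


def best_searchsorted_time(n_rows, value=5, runs=5):
    """Best-of-`runs` wall time of Series.searchsorted on a sorted Series."""
    ser = pd.Series(np.linspace(0, 10, n_rows))  # already sorted
    best = math.inf
    for _ in range(runs):
        t0 = timer()
        ser.searchsorted(value)
        best = min(best, timer() - t0)
    return best


for n_rows in (10**3, 10**4, 10**5, 10**6):
    # A binary search needs roughly log2(N) comparisons.
    steps = math.ceil(math.log2(n_rows))
    print(f"N={n_rows:>8}  ~{steps} comparisons  {best_searchsorted_time(n_rows):.2e} s")
```

Because the number of comparisons grows only logarithmically (about 30 for 1e9 elements), the search itself stays cheap even for very large inputs.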

devin-petersohn · 2020-07-08T14:37:18Z

What is the execution time of the MapReduce version you wrote in comparison with that chart?

amyskov · 2020-07-08T16:27:12Z

The execution time of the MapReduce version is greater than that of pure pandas (see attached graph).

amyskov · 2020-07-09T10:41:50Z

Additional measurements of the Modin searchsorted implementations (one using the default_to_pandas function and one using query_compiler._modin_frame._apply_full_axis, introduced in commit 97bcd1e6b9c112ce4e9e7aea5179e309cffa6436) were made with the Python and Ray engines.

As can be seen from the graph, the curve shapes of all implementations are close, so I think the curve shape is determined by Modin-specific processing of partitions.

amyskov · 2020-07-09T16:10:43Z

Script for execution time measurement:

import csv
import os
import random
from timeit import default_timer as timer

import modin.pandas as pd

# Series sizes to benchmark (number of rows)
rows = [10, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
value_low = 0
value_high = 10
runs_number = 5
values = 5  # value to search for
csv_name_series = "../local_benches_data/local_bench_series.csv"

times = {}
for rows_number in rows:
    # creation of the file with test data
    try:
        with open(csv_name_series, "w", newline="") as f1:
            w1 = csv.writer(f1, delimiter=",")
            w1.writerow(["data_column1"])
            for _ in range(int(rows_number)):
                w1.writerow([round(random.uniform(value_low, value_high), 1)])
    except Exception:
        # do not benchmark a partially written file
        if os.path.exists(csv_name_series):
            os.remove(csv_name_series)
        raise

    t_ser_modin_array = []
    for _ in range(runs_number):
        # searchsorted requires sorted input; this preparation is not timed
        ser_modin = pd.read_csv(csv_name_series).squeeze().sort_values().reset_index(drop=True)
        t0 = timer()
        ans_ser_modin = ser_modin.searchsorted(values)
        t_ser_modin_array.append(timer() - t0)

    # keep the best of runs_number runs
    times[rows_number] = min(t_ser_modin_array)

print("times \n", times)

amyskov · 2020-07-15T15:26:43Z

To find the reason for this performance degradation compared with pandas, the execution time of pure partitioning and of pure function application was measured in modin.engines.base.frame.partition_manager.py::map_axis_partition (the function called from query_compiler._modin_frame._apply_full_axis) for commit 97bcd1e6b9c112ce4e9e7aea5179e309cffa6436.
Partitioning part:

partitions = (
    cls.column_partitions(partitions)
    if not axis
    else cls.row_partitions(partitions)
)

Function applying part:

result_blocks = np.array(
    [
        part.apply(preprocessed_map_func, num_splits=num_splits)
        for part in partitions
    ]
)
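
One way such a per-phase measurement could be set up is sketched below. The `timed` helper and the `phase_times` dictionary are hypothetical, and the two timed blocks are plain-Python stand-ins for the partitioning and function applying parts rather than the Modin code itself.

```python
from collections import defaultdict
from contextlib import contextmanager
from timeit import default_timer as timer

phase_times = defaultdict(float)


@contextmanager
def timed(phase):
    """Accumulate the wall time spent inside the block under `phase`."""
    t0 = timer()
    try:
        yield
    finally:
        phase_times[phase] += timer() - t0


# Stand-ins for the two measured phases.
with timed("partitioning"):
    partitions = [list(range(i, i + 3)) for i in range(0, 9, 3)]
with timed("apply"):
    result_blocks = [sum(part) for part in partitions]

print(dict(phase_times))
```

With an asynchronous engine, the timed block would presumably also need to wait for the results; otherwise only task submission is measured.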

The results are shown in the attached chart.

As can be seen, the full_time curve (the execution time of the whole searchsorted call) is mostly determined by the partitioning time, most of which is spent in modin.engines.python.pandas_on_python.frame.axis_partition.py::PandasOnPythonFrameAxisPartition::__init__, in this code block:

for obj in list_of_blocks:
    obj.drain_call_queue()

With the workaround in ba7594bdcdb697e01936ee6c58c5b2bff775f3c5 (clearing the partitions' call_queue before applying functions), the partitioning time can be decreased.

For the Python engine, the full execution time decreases, but it is now determined by the function apply time (the apply time is fairly large, which may be caused by data.copy in the apply function for the Python engine).
For the Ray engine, the apply time is almost constant, but most of the time is spent on creating the new _modin_frame in the query_compiler._modin_frame._apply_full_axis function (not shown on the chart).
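
To illustrate why draining a call queue can dominate the "partitioning" time, here is a toy model of a partition with deferred calls. `ToyPartition` is a hypothetical stand-in written for this example; it borrows the call_queue and drain_call_queue names from the snippets above but is not Modin's implementation.

```python
class ToyPartition:
    """Hypothetical stand-in for a partition with a deferred call queue."""

    def __init__(self, data):
        self.data = data
        self.call_queue = []

    def add_to_apply_calls(self, func):
        # Defer the work: only remember the function.
        self.call_queue.append(func)

    def drain_call_queue(self):
        # Run everything that was deferred, in order.
        for func in self.call_queue:
            self.data = func(self.data)
        self.call_queue = []

    def apply(self, func):
        self.drain_call_queue()
        return func(self.data)


parts = [ToyPartition([3, 1, 2]), ToyPartition([6, 5, 4])]
for part in parts:
    part.add_to_apply_calls(sorted)  # e.g. a pending sort

# Draining up front moves the deferred cost out of the later step.
for part in parts:
    part.drain_call_queue()

lengths = [part.apply(len) for part in parts]
print(lengths)
```

In this model the deferred work is paid wherever the queue is first drained, so a step that looks like pure bookkeeping can end up absorbing the cost of earlier operations.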

dchigarev · 2021-02-03T10:52:14Z

The implementation of this function was reverted and now defaults to pandas (#2655); a new implementation is needed.

aregm added the pandas.series label Apr 15, 2020

aregm added the P2 Minor bugs or low-priority feature requests label May 12, 2020

anmyachev assigned amyskov Jun 4, 2020

anmyachev added this to the 0.7.4 milestone Jun 4, 2020

amyskov mentioned this issue Jun 26, 2020

FIX-#1284: Series.searchsorted #1668

Merged

5 tasks

devin-petersohn modified the milestones: 0.8.0, 0.8.1 Jul 29, 2020

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: single value implementation

4ced2d9

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: several values implementation

455275d

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: fix

3bbd0bc

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: fixes, code clean-up

4825893

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: fix

3954b79

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: searchsorted implementation rework

10e650c

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: remove redudant code block

6f559d1

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: workaround to improve performance

0263bf4

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: slightly improve performance for pandas_on_ra…

cfec73d

…y case Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: try to combine partitions

3db76ea

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: use MapReduce approach

9c21bbe

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: add tests

34c00aa

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: remove data copy

bde6ce7

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: return data.copy()

da9b909

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: remove comment

c5c5d5b

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: correct random in tests

963d4af

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: remove inplace operations

9abdc09

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: improve code readability

4adef11

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: add test case

314805b

Signed-off-by: Alexander Myskov <[email protected]>

amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020

FIX-modin-project#1284: change formatting

6f62a7b

Signed-off-by: Alexander Myskov <[email protected]>

devin-petersohn closed this as completed in #1668 Sep 4, 2020

devin-petersohn pushed a commit that referenced this issue Sep 4, 2020

FIX-#1284: Series.searchsorted (#1668)

ac85a49

Signed-off-by: Alexander Myskov <[email protected]>

aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020

FIX-modin-project#1284: Series.searchsorted (modin-project#1668)

81a299f

Signed-off-by: Alexander Myskov <[email protected]>

dchigarev reopened this Feb 3, 2021

anmyachev modified the milestones: 0.8.1, Someday Feb 9, 2021

Garra1980 unassigned amyskov Oct 25, 2021

mvashishtha added the new feature/request 💬 Requests and pull requests for new features label Sep 10, 2022
