Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

searchsorted function implementation #1284

Open
aregm opened this issue Apr 15, 2020 · 10 comments · Fixed by #1668
Open

searchsorted function implementation #1284

aregm opened this issue Apr 15, 2020 · 10 comments · Fixed by #1668
Labels
new feature/request 💬 Requests and pull requests for new features P2 Minor bugs or low-priority feature requests pandas.series
Milestone

Comments

@aregm
Copy link
Collaborator

aregm commented Apr 15, 2020

No description provided.

@aregm aregm added the P2 Minor bugs or low-priority feature requests label May 12, 2020
@anmyachev anmyachev added this to the 0.7.4 milestone Jun 4, 2020
@amyskov
Copy link
Contributor

amyskov commented Jul 2, 2020

First version of searchsorted function was implemented via MapReduce approach, but it was found that function performance degraded in comparison with pure Pandas. The reason is that complexity of numpy.searchsorted (that is called in Pandas) is very low and Modin conversions and reduce stage overheads became significant. Looking for a way to solve this issue.

@devin-petersohn
Copy link
Collaborator

@amyskov Is it possible with just a MapFunction?

MapFunction.register(lambda x, value, side, sorter: x.squeeze(axis=1).searchsorted(value, side, sorter))

That is a start, you will need to convert the result of searchsorted back to a DataFrame, then in the API layer (dataframe.py) convert it to pandas, then to a list. Does that make sense to you? Do you think it would work?

@amyskov
Copy link
Contributor

amyskov commented Jul 8, 2020

@devin-petersohn i don't think that it can help us to solve this problem, because after calling map function on each partition, we will obtain results from each function call, which have to be processed (reduced) anyway.
I have measured execution time of Series.searchsorted function for pure Pandas with different Series sizes (see attached graph, 1e8 elements in Series is equivalent to ~500 MB .csv file). The curve shape is very close to k*log(N) curve, as it mentioned in numpy docs and even for 1e9 elements, execution time is ~1e-4 second, that is less than potential Modin overheads. Based on this, i think that applying function on each partition separately is not good approach in this case and we have to function on the full frame at once. One of possible solutions is implemented in the linked to this issue PR in commit 97bcd1e using query_compiler._modin_frame._apply_full_axis function. What do you think about it?
image

@devin-petersohn
Copy link
Collaborator

What is the execution time of the MapReduce version you wrote in comparison with that chart?

@amyskov
Copy link
Contributor

amyskov commented Jul 8, 2020

Execution time of the MapReduce version is greater than pure Pandas, see attached graph
image

@amyskov
Copy link
Contributor

amyskov commented Jul 9, 2020

Additional measurements of Modin searchsorted implementations (with usage of default_to_pandas function and with query_compiler._modin_frame._apply_full_axis introduced in the commit 97bcd1e6b9c112ce4e9e7aea5179e309cffa6436) with python and ray engines were made
image
As it can be seen from the graph, curve shapes of all implementations are close, so i think, that curve shape can be determined by Modin specific processes with partitions.

@amyskov
Copy link
Contributor

amyskov commented Jul 9, 2020

Script for execution time measurement

import modin.pandas as pd
from timeit import default_timer as timer
import csv
import random
import os

rows = [10, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
value_low = 0
value_high = 10
runs_number = 5
values = 5
csv_name_series = "../local_benches_data/local_bench_series.csv"

times = {}
for rows_number in rows:
    try:
        # creation of files with test data
        with open(csv_name_series, "w", newline='') as f1:
            w1=csv.writer(f1, delimiter=',')
            w1.writerow(['data_column1'])
            for i in range(int(rows_number)):
                w1.writerow([round(random.uniform(value_low, value_high), 1)])
    except Exception as exc:
        print(exc)
        os.remove(csv_name_series)

    t_ser_modin_array = {}
    for run in range(runs_number):
        ser_modin = pd.read_csv(csv_name_series).squeeze().sort_values().reset_index(drop=True)
        t0 = timer()
        ans_ser_modin = ser_modin.searchsorted(values)
        t_ser_modin_array[run] = timer() - t0

    times[rows_number] = min(t_ser_modin_array.values())

print("times \n", times)

@amyskov
Copy link
Contributor

amyskov commented Jul 15, 2020

In order to find out the reason of such performance degradation in comparison with pandas, execution time of pure partitioning and pure functions applying was measured in modin.engines.base.frame.partition_manager.py::map_axis_partition function (function that called from query_compiler._modin_frame._apply_full_axis function) for 97bcd1e6b9c112ce4e9e7aea5179e309cffa6436 commit.
Partitioning part:

partitions = (
    cls.column_partitions(partitions)
    if not axis
    else cls.row_partitions(partitions)
)

Function applying part:

result_blocks = np.array(
    [
        part.apply(preprocessed_map_func, num_splits=num_splits)
        for part in partitions
    ]
)

Results are next
image
As it can be seen full_time curve (execution time of all searchsorted function) is mostly determined by partitioning time, which mostly spent by modin.engines.python.pandas_on_python.frame.axis_partition.py::PandasOnPythonFrameAxisPartition::__init__ in code block

for obj in list_of_blocks:
    obj.drain_call_queue()

Using workaround in ba7594bdcdb697e01936ee6c58c5b2bff775f3c5 (clearing partitions call_queue before functions applying), partitioning time can be decreased
image
For Python engine full_execution time is decreased, but now it determined by function apply time (function apply time is large enough, that can be caused by data.copy in apply function for python engine).
For Ray engine apply time is almost constant, but most of the time is spent on the new _modin_frame creation in the query_compiler._modin_frame._apply_full_axis function (doesn't shown on the chart).

@devin-petersohn devin-petersohn modified the milestones: 0.8.0, 0.8.1 Jul 29, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
Signed-off-by: Alexander Myskov <[email protected]>
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
Signed-off-by: Alexander Myskov <[email protected]>
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
Signed-off-by: Alexander Myskov <[email protected]>
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
amyskov added a commit to amyskov/modin that referenced this issue Sep 3, 2020
devin-petersohn pushed a commit that referenced this issue Sep 4, 2020
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020
@dchigarev
Copy link
Collaborator

implementation of this function was reverted and now defaults to pandas #2655, new implementation is needed

@dchigarev dchigarev reopened this Feb 3, 2021
@anmyachev anmyachev modified the milestones: 0.8.1, Someday Feb 9, 2021
@mvashishtha mvashishtha added the new feature/request 💬 Requests and pull requests for new features label Sep 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature/request 💬 Requests and pull requests for new features P2 Minor bugs or low-priority feature requests pandas.series
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants