
Enabling binary operations with list-like Python objects. #2054

Open: wants to merge 6 commits into master
Conversation

@itholic itholic commented Feb 15, 2021

Koalas doesn't currently support list-like Python objects in Series binary operations.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64

>>> kser + [10, 20, 30, 40, 50, 60]
Traceback (most recent call last):
...

This PR enables it.

>>> kser
0    1
1    2
2    3
3    4
4    5
5    6
Name: x, dtype: int64
>>> kser + [10, 20, 30, 40, 50, 60]
0    11
1    22
2    33
3    44
4    55
5    66
Name: x, dtype: int64
>>> kser - [10, 20, 30, 40, 50, 60]
0    -9
1   -18
2   -27
3   -36
4   -45
5   -54
Name: x, dtype: int64
>>> kser * [10, 20, 30, 40, 50, 60]
0     10
1     40
2     90
3    160
4    250
5    360
Name: x, dtype: int64
>>> kser / [10, 20, 30, 40, 50, 60]
0    0.1
1    0.1
2    0.1
3    0.1
4    0.1
5    0.1
Name: x, dtype: float64

ref #2022 (comment)
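The approach can be sketched in pandas alone. This is a hedged, minimal illustration, not the PR's actual implementation (the PR routes through Koalas internals such as a `check_same_length` utility); `add_listlike` is a hypothetical name: validate the operand lengths, then let pandas apply the operator element-wise.

```python
import pandas as pd

def add_listlike(ser, other):
    # Hypothetical helper: mirrors the PR's length check for list/tuple
    # operands before delegating to the element-wise operator.
    if isinstance(other, (list, tuple)):
        if len(other) != len(ser):
            raise ValueError(
                "operands could not be broadcast together with shapes "
                "({},) ({},)".format(len(ser), len(other))
            )
    return ser + other

print(add_listlike(pd.Series([1, 2, 3]), [10, 20, 30]).tolist())  # [11, 22, 33]
```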

@itholic itholic marked this pull request as draft February 15, 2021 13:22
@@ -495,7 +495,7 @@ def __rsub__(self, other) -> Union["Series", "Index"]:
return -column_op(F.datediff)(self, F.lit(other)).astype("long")
else:
raise TypeError("date subtraction can only be applied to date series.")
return column_op(Column.__rsub__)(self, other)
return column_op(lambda left, right: right - left)(self, other)
Contributor (author):
FYI: Column.__rsub__ doesn't support a pyspark.sql.column.Column as its second parameter.

>>> kdf = ks.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
>>> sdf = kdf.to_spark()
>>> col1 = sdf.A
>>> col2 = sdf.B
>>> Column.__rsub__(col1, col2)
Traceback (most recent call last):
...
TypeError: Column is not iterable

Member:
It does support:

>>> Column.__rsub__(df.id, 1)
Column<'(1 - id)'>

It doesn't work in your case above because the second operand is a Spark column. In practice that wouldn't happen, because __rsub__ is only called when the first operand doesn't know how to handle a Spark column, e.g. 1 - df.id.
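The dispatch rule described above is plain Python semantics and can be shown without Spark. In this minimal illustration (the `Num` class is hypothetical), `right.__rsub__(left)` only fires after `left.__sub__(right)` returns `NotImplemented`:

```python
class Num:
    # Toy numeric wrapper to demonstrate reflected-operator dispatch.
    def __init__(self, v):
        self.v = v

    def __sub__(self, other):
        if isinstance(other, Num):
            return Num(self.v - other.v)
        return NotImplemented  # signals Python to try the reflected op

    def __rsub__(self, other):
        # Reached for e.g. 10 - Num(3), since int.__sub__ can't handle Num.
        return Num(other - self.v)

print((10 - Num(3)).v)  # 7
```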

Member:
Does it cause any exception?

Contributor (author):

If we use column_op(Column.__rsub__)(self, other) as it is, it raises TypeError: Column is not iterable for the case below.

>>> kser = ks.Series([1, 2, 3, 4])
>>> [10, 20, 30, 40] - kser
Traceback (most recent call last):
...
TypeError: Column is not iterable

@ueshin (Collaborator) commented Feb 18, 2021:

Note that this case must be handled in lines 490-492. We can move back to Column.__rsub__.

@itholic itholic marked this pull request as ready for review February 15, 2021 14:45
# other = tuple with the different length
other = (np.nan, 1, 3, 4, np.nan)
with self.assertRaisesRegex(
ValueError, "operands could not be broadcast together with shapes"
Member:
The error message looks weird. Does it match pandas'?

@itholic (author) commented Feb 17, 2021:

The original error message from pandas looks like :

ValueError: operands could not be broadcast together with shapes (4,) (8,)

@ueshin, maybe we shouldn't include the (4,) (8,) part, since it requires computing the length of both objects, which can be expensive?
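For reference, pandas itself raises this error for a mismatched list, which is where the quoted message comes from (the exact wording may vary slightly across pandas versions):

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])
try:
    # Adding an 8-element list to a 4-element Series fails in pandas.
    ser + [1, 2, 3, 4, 5, 6, 7, 8]
except ValueError as exc:
    print(exc)
```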

codecov-io commented Feb 18, 2021

Codecov Report

Merging #2054 (0fd3666) into master (87f5b18) will decrease coverage by 1.44%.
The diff coverage is 91.17%.


@@            Coverage Diff             @@
##           master    #2054      +/-   ##
==========================================
- Coverage   94.71%   93.26%   -1.45%     
==========================================
  Files          54       54              
  Lines       11503    11735     +232     
==========================================
+ Hits        10895    10945      +50     
- Misses        608      790     +182     
Impacted Files Coverage Δ
databricks/koalas/utils.py 93.66% <75.00%> (-1.71%) ⬇️
databricks/koalas/base.py 97.35% <96.00%> (+0.06%) ⬆️
databricks/koalas/indexes/base.py 97.43% <100.00%> (ø)
databricks/koalas/usage_logging/__init__.py 26.66% <0.00%> (-65.84%) ⬇️
databricks/koalas/usage_logging/usage_logger.py 47.82% <0.00%> (-52.18%) ⬇️
databricks/koalas/__init__.py 80.00% <0.00%> (-12.00%) ⬇️
databricks/conftest.py 91.30% <0.00%> (-8.70%) ⬇️
databricks/koalas/accessors.py 86.43% <0.00%> (-7.04%) ⬇️
databricks/koalas/spark/accessors.py 88.67% <0.00%> (-6.29%) ⬇️
databricks/koalas/typedef/typehints.py 91.06% <0.00%> (-2.75%) ⬇️
... and 13 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -813,3 +813,31 @@ def compare_disallow_null(left, right, comp):

def compare_allow_null(left, right, comp):
return left.isNull() | right.isNull() | comp(left, right)


def check_same_length(left: "IndexOpsMixin", right: Union[list, tuple]):
@xinrong-meng (Contributor) commented Feb 18, 2021:
Nice utility! The function name might be misleading considering its return type. Would it be possible to annotate the return type or rename the function?

@ueshin (Collaborator) left a comment:

Also, could you try to reduce the amount of test code by using a loop or parameterization, if there is no difference except for the operators?
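The suggested parameterization can be sketched as follows. This is a hedged illustration of the loop shape only (data and names are hypothetical); the real suite would compare Koalas results against pandas results:

```python
import operator
import pandas as pd

pser = pd.Series([1, 2, 3])
other = [10, 20, 30]

# One loop body covers all four operators instead of four copied tests.
for op in [operator.add, operator.sub, operator.mul, operator.truediv]:
    result = op(pser, other)
    expected = op(pser, pd.Series(other))
    assert result.tolist() == expected.tolist()
print("all operators OK")
```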

Comment on lines +832 to +833
if LooseVersion(pd.__version__) < LooseVersion("1.2.0"):
right = pd.Index(right, name=pindex_ops.name)
Collaborator:

What happens with pandas<1.2?
Seems like it's working with pandas >= 1.0 in the test?

Actually it works:

>>> pd.__version__
'1.0.5'
>>> pd.Index([1,2,3]) + [4,5,6]
Int64Index([5, 7, 9], dtype='int64')
>>> [4,5,6] + pd.Index([1,2,3])
Int64Index([5, 7, 9], dtype='int64')

@itholic (author) commented Feb 19, 2021:

Ah, it seems it doesn't work only for rmod in pandas < 1.2.

>>> [4, 5, 6] % pd.Index([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Int64Index' object has no attribute 'rmod'

Let me address only this case.

Thanks!
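The guard under discussion can be sketched like this. It is a hedged illustration, not the PR's exact code (the PR's snippet uses LooseVersion; a dependency-free tuple comparison is used here, and `wrap_right` is a hypothetical name):

```python
import pandas as pd

def wrap_right(right, name=None):
    # For pandas < 1.2, reflected ops such as rmod failed on a plain
    # list operand, so wrap the right operand in an Index first.
    major, minor = (int(p) for p in pd.__version__.split(".")[:2])
    if (major, minor) < (1, 2):
        right = pd.Index(right, name=name)
    return right

# On any recent pandas, rmod against an Index works:
print(([4, 5, 6] % pd.Index([1, 2, 3])).tolist())  # [0, 1, 0]
```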

Comment on lines +838 to +841
raise ValueError(
"operands could not be broadcast together with shapes ({},) ({},)".format(
len_pindex_ops, len_right
)
Collaborator:

We can show the length of left if it's less than the length of right, but if it's greater, the actual length is unknown.
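The idea behind that asymmetry can be illustrated in plain Python. In this hedged sketch (`lengths_for_error` is a hypothetical name, with an iterator standing in for a distributed column), pulling len(right) + 1 elements is enough to detect a mismatch without computing the full left length:

```python
from itertools import islice

def lengths_for_error(left_iter, right):
    # Fetch at most len(right) + 1 elements from the (possibly huge) left side.
    head = list(islice(left_iter, len(right) + 1))
    if len(head) <= len(right):
        return len(head), len(right)  # exact left length is known
    return None, len(right)           # left is longer; exact length unknown

print(lengths_for_error(iter([1, 2]), [10, 20, 30]))  # (2, 3)
print(lengths_for_error(iter(range(100)), [10, 20]))  # (None, 2)
```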

@@ -321,6 +322,9 @@ def spark_column(self) -> Column:
__neg__ = column_op(Column.__neg__)

def __add__(self, other) -> Union["Series", "Index"]:
if isinstance(other, (list, tuple)):
pindex_ops, other = check_same_length(self, other)
return ks.from_pandas(pindex_ops + other) # type: ignore
Collaborator:
Shall we avoid using # type: ignore where possible? We can use cast instead.
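The difference can be shown with a small standalone example (the `parse` function is hypothetical): `typing.cast` is a no-op at runtime but tells the type checker the intended type, rather than silencing it line-wide as `# type: ignore` does:

```python
from typing import Union, cast

def parse(x: Union[int, str]) -> int:
    if isinstance(x, int):
        return x
    # We know x is a str on this branch; cast() records that for the
    # checker instead of suppressing the error with `# type: ignore`.
    return int(cast(str, x))

print(parse("42"))  # 42
```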

Comment on lines +475 to +476
if isinstance(other, (list, tuple)):
other = ks.Index(other, name=self.name) # type: ignore
Collaborator:

not needed?


Comment on lines +599 to +600
if isinstance(other, (list, tuple)):
other = ks.Index(other, name=self.name) # type: ignore
Collaborator:
not needed?
