Add array_distance function #12211

Merged
merged 10 commits into from
Aug 29, 2024

austin362667
Contributor

@austin362667 austin362667 commented Aug 28, 2024

Which issue does this PR close?

Partially closes #8782.

Rationale for this change

Add distance functionality to DataFusion. This function is particularly useful for scenarios involving spatial analysis, clustering, or similarity computations, where distance metrics are crucial.

It might be valuable to add scalar UDFs such as list_distance/array_distance, similar to DuckDB, along with other distance measures (e.g., cosine).

XREF: DuckDB list functions, array functions

Side note: DuckDB defines list and array as distinct types, whereas DataFusion treats them as the same.

What changes are included in this PR?

Adds a new function, array_distance(arr1, arr2), which computes the Euclidean distance between two arrays.
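For readers unfamiliar with the metric, here is a minimal standalone sketch of the Euclidean distance computation in plain Rust. It is not the code from this PR: the function name `euclidean_distance` is hypothetical, it works on `f64` slices rather than Arrow arrays, and returning `None` for mismatched lengths is an assumption made for the sketch, not a statement about how DataFusion reports that case.

```rust
/// Euclidean distance between two equal-length slices:
/// sqrt(sum over i of (a[i] - b[i])^2).
///
/// Hypothetical helper for illustration only. Returning `None` on a
/// length mismatch is an assumption of this sketch.
fn euclidean_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    // Sum the squared element-wise differences.
    let sum_of_squares: f64 = a
        .iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y).powi(2))
        .sum();
    Some(sum_of_squares.sqrt())
}
```

Calling `euclidean_distance(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0])` computes the square root of 27, which agrees with the 5.196152422706632 shown in the first SQL example of this description.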

Are these changes tested?

Yes, added SQL logic tests.


Are there any user-facing changes?

New function array_distance(arr1, arr2) is added.

Examples:

> SELECT array_distance([1, 2, 3], [4, 5, 6]);
+-----------------------------------------------------------------------------------------------+
| array_distance(make_array(Int64(1),Int64(2),Int64(3)),make_array(Int64(4),Int64(5),Int64(6))) |
+-----------------------------------------------------------------------------------------------+
| 5.196152422706632                                                                             |
+-----------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.
> CREATE TABLE points (
    point_a DOUBLE[],
    point_b DOUBLE[]
);
0 row(s) fetched.
Elapsed 0.039 seconds.

> INSERT INTO points VALUES
([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
([2.0, 4.0, 6.0], [2.0, 4.0, 6.0]),
([1.5, 2.5, 3.5], [4.5, 6.5, 8.5]);
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.036 seconds.

> SELECT
    point_a,
    point_b,
    list_distance(point_a, point_b) AS euclidean_distance
FROM
    points;
+-----------------+-----------------+--------------------+
| point_a         | point_b         | euclidean_distance |
+-----------------+-----------------+--------------------+
| [1.0, 2.0, 3.0] | [1.0, 2.0, 5.0] | 2.0                |
| [2.0, 4.0, 6.0] | [2.0, 4.0, 6.0] | 0.0                |
| [1.5, 2.5, 3.5] | [4.5, 6.5, 8.5] | 7.0710678118654755 |
+-----------------+-----------------+--------------------+
3 row(s) fetched.
Elapsed 0.013 seconds.

No breaking change.

Signed-off-by: Austin Liu <[email protected]>

Add `distance` aggregation function

Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Aug 28, 2024
datafusion/functions-aggregate/src/distance.rs (outdated)
from data
----
5.196152422707

Member

> select distance(sq.column1, sq.column2) from (values (NULL, 2), (0,0)) as sq;
+---------------------------------+
| distance(sq.column1,sq.column2) |
+---------------------------------+
| 0.0                             |
+---------------------------------+

I prefer it to be NULL
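The behavior requested here can be sketched as follows. This is only an illustration of NULL propagation, not the PR's implementation: the name `null_aware_distance` is hypothetical, SQL NULL is modeled as `Option::None`, and the rule "any NULL input yields a NULL result" is one reading of the preference stated above.

```rust
/// Distance over two columns of nullable values.
/// Any `None` (SQL NULL) on either side makes the whole result `None`,
/// instead of silently treating the missing value as 0.
///
/// Hypothetical helper; the length-mismatch handling is also an assumption.
fn null_aware_distance(a: &[Option<f64>], b: &[Option<f64>]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let mut sum: f64 = 0.0;
    for (x, y) in a.iter().zip(b.iter()) {
        match (x, y) {
            // Both values present: accumulate the squared difference.
            (Some(x), Some(y)) => sum += (x - y).powi(2),
            // A NULL on either side propagates to the result.
            _ => return None,
        }
    }
    Some(sum.sqrt())
}
```

With the inputs from the query above, column1 = [NULL, 0] and column2 = [2, 0], this sketch yields `None` rather than 0.0.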

Contributor Author

@austin362667 austin362667 Aug 28, 2024

Sure let's do it this way~

@Weijun-H
Member

I am unsure how useful this feature would be, because it only calculates the distance between two scalar columns. 🤔

@jayzhan211
Contributor

@austin362667 I think array_distance should be an array function rather than an aggregate function. We tend to add new functions when other well-known databases also support them. In this case, it makes sense to provide array_distance as an array function, not an aggregate function.

Signed-off-by: Austin Liu <[email protected]>
@austin362667
Contributor Author

Got it! Thanks for reviewing.
I'm adding list_distance to functions-nested and removing the aggregate distance function from functions-aggregate.

Signed-off-by: Austin Liu <[email protected]>
@austin362667 austin362667 changed the title Add distance aggregate function Add array_distance function Aug 28, 2024
Contributor

@jayzhan211 jayzhan211 left a comment

👍

We could support column values in a follow-up PR.

@Weijun-H
Member

Nice! Please add some doc here
## Array Functions

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 29, 2024
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
@Weijun-H Weijun-H merged commit bd50698 into apache:main Aug 29, 2024
26 checks passed
@Weijun-H
Member

Thanks, @austin362667, for the contribution, and @jayzhan211 for the review. We'll support the column value in a follow-up PR.

Labels
documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Support array_distance in array_expression
3 participants