Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce integer division operator // #2426

Closed
eitsupi opened this issue Apr 14, 2023 · 22 comments · Fixed by #2684
Closed

Introduce integer division operator // #2426

eitsupi opened this issue Apr 14, 2023 · 22 comments · Fixed by #2684
Labels
compiler language-design Changes to PRQL-the-language

Comments

@eitsupi
Copy link
Member

eitsupi commented Apr 14, 2023

What's up?

DuckDB seems to introduce a new division operator / and the current division operator / will be moved to // (duckdb/duckdb#7082).

I'm wondering if the same thing could be done with PRQL.

In my opinion, PRQL should not do the division itself as it currently does, but should convert // to / for targets that do not have two division operators.

Integer Division: SQLite, Postgres, SQL Server
Floating Point Division: MySQL, SparkSQL, BigQuery, Snowflake, ClickHouse

Note that My SQL seems to require the DEV operator for integer division.

@eitsupi eitsupi added language-design Changes to PRQL-the-language compiler labels Apr 14, 2023
@aljazerzen
Copy link
Member

Ref #1712

@max-sixty
Copy link
Member

In lieu of the full framework at #1712 (comment), I think this could be a method on the DialectHandler trait, such that:

should convert // to / for targets that do not have two division operators.

I don't think it would be a huge PR — it would require:

  • adding // to the parser,
  • adding to the DialectHandler as above,
  • ...and then adjusting the expression generation depending on the dialect at
    BinOp::Div => BinaryOperator::Divide,

@aljazerzen
Copy link
Member

aljazerzen commented Apr 15, 2023

Wait, before we continue, what should be the result of this query:

from_text format:json '[{ "int": 13, "float": 13.0 }]'
select [
    int /  5, int /  5.0, float /  5, float /  5.0,
    int // 5, int // 5.0, float // 5, float // 5.0,
]

I think that:

  • the result must be same on all DBMSs,
  • int / float may be a type error,

@eitsupi
Copy link
Member Author

eitsupi commented Apr 15, 2023

from_text format:json '[{ "i": 13, "f": 13.0 }]'
select [
    i/5, i/5.0, f/5, f/5.0,
]

is returns

(i / 5) (i / 5.0) (f / 5) (f / 5.0)
2 2.6 2.6 2.6

on playground now.

This behavior can also be seen in the DuckDB web shell.
https://shell.duckdb.org/

duckdb> select 13 / 5;
┌──────────┐
│ (13 / 5) │
╞══════════╡
│        2 │
└──────────┘
Elapsed: 3 ms

duckdb> select 13.0 / 5;
┌────────────┐
│ (13.0 / 5) │
╞════════════╡
│        2.6 │
└────────────┘
Elapsed: 7 ms

duckdb> select 13 / 5.0;
┌────────────┐
│ (13 / 5.0) │
╞════════════╡
│        2.6 │
└────────────┘
Elapsed: 7 ms

This is the same type as Python and others, but differs in that Python's // operator always returns 2.

In [1]: 13 // 5
Out[1]: 2

In [2]: 13.0 // 5
Out[2]: 2.0

In [3]: 13 // 5.0
Out[3]: 2.0

@max-sixty
Copy link
Member

int / float may be a type error,

Do you mean for PRQL or for the DB? Are there DBs that don't coerce there?

@eitsupi
Copy link
Member Author

eitsupi commented May 9, 2023

I have looked to see if this could be implemented, perhaps apache/datafusion-sqlparser-rs#868 needs to be merged?
(Given the s-string formatting issues, does apache/datafusion-sqlparser-rs#873 need to be resolved as well?)

@max-sixty
Copy link
Member

Yes, probably:

@eitsupi
Copy link
Member Author

eitsupi commented May 10, 2023

I noticed that MySQL's DIV operator always returns an integer.
So in DuckDB and Python etc. int // float is float, but in MySQL, int DIV float is int.

Could this be an important difference on PRQL?

@aljazerzen
Copy link
Member

@eitsupi Yes, that is an important difference. We should inject a cast into the appropriate type after the operation - if that dialect does not handle the types correctly.


So this is the table we are going with?

op int int int float float int float float
/ 2 2.6 2.6 2.6
// 2 2.0 2.0 2.0

This would mean that we compile like this:

# DuckDB
13 // 5.0   ->    13 // 5.0

# Postgres
13 // 5.0   ->    DIV(13, 5.0)

# MySQL
13 // 5.0   ->    CAST(13 DIV 5.0 AS FLOAT)

@aljazerzen
Copy link
Member

@max-sixty Yes, I wanted to say I we may want to produce an error for division of non-matching types:

13 / 5      -> 2
13 / 5.0    -> Error, type mismatch.
13.0 / 5    -> Error, type mismatch.
13.0 / 5.0  -> 2.6

I don't have a clear argument here, but I've heard a lot of stories about implicit type casting causing unexpected bugs. If we want to double down on "throwing errors early", this is the way to go.

If we decide to go with this, I would also consider / always producing floats and // always producing ints:

op int int int float float int float float
/ 2.6 err err 2.6
// 2 err err 2

A bit inconvenient, but we can provide really good errors around this.

@eitsupi
Copy link
Member Author

eitsupi commented May 10, 2023

Thanks for the summary.

The // operator in DuckDB returns 2.6 instead of 2.0 when 13 // 5.0 or 13.0 // 5, so it doesn't work that way......

Maybe?

# DuckDB
13 // 5.0   ->    trunc(13 // 5.0)  (or trunc(13 / 5.0))

# Postgres
13 // 5.0   ->    DIV(13, 5.0)

# MySQL
13 // 5.0   ->    CAST(13 DIV 5.0 AS FLOAT)

I'm just not sure that returning a Float is the best thing to do here, could it be that an integer return behavior like MySQL would be more reasonable?

@aljazerzen
Copy link
Member

Yes, my compilation examples were about currently agreed-on behavior. We can implement any behavior we decide on, so the real question is "what is the behavior we want?".


could it be that an integer return behavior like MySQL would be more reasonable?

I'd say it is. This is aligned with my proposal in the last comment: // would always return an int.

@eitsupi
Copy link
Member Author

eitsupi commented May 11, 2023

Since the integer division is easily computed by using something like trunc function, perhaps only int / int -> float is sufficient to potentially generate an error?

op int int int float float int float float
/ maybe err 2.6 2.6 2.6
// 2 2 2 2
# SQLite
13 / 5		->	err
13 / 5.0	->	13 / 5.0
13 // 5		->	CAST(13 / 5 AS INT)

# DuckDB
13 / 5		->	13 / 5
13 // 5		->	trunc(13 / 5)

# Postgres
13 / 5		->	err
13 / 5.0	->	13 / 5.0
13 // 5		->	trunc(13 / 5)

# MySQL
13 / 5		->	13 / 5
13 // 5		->	13 DIV 5

@eitsupi
Copy link
Member Author

eitsupi commented May 19, 2023

PEP 238 – Changing the Division Operator
https://peps.python.org/pep-0238/

The correct work-around is subtle: casting an argument to float() is wrong if it could be a complex number; adding 0.0 to an argument doesn’t preserve the sign of the argument if it was minus zero. The only solution without either downside is multiplying an argument (typically the first) by 1.0. This leaves the value and sign unchanged for float and complex, and turns int and long into a float with the corresponding value.

# SQLite
13 / 5		->	13 * 1.0 / 5
13 / 5.0	->	13 * 1.0 / 5.0
13 // 5		->	CAST(13 / 5 AS INT)

# DuckDB
13 / 5		->	13 / 5
13 // 5		->	trunc(13 / 5)

# Postgres
13 / 5		->	13 * 1.0 / 5
13 / 5.0	->	13 * 1.0 / 5.0
13 // 5		->	trunc(13 / 5)

# MySQL
13 / 5		->	13 / 5
13 // 5		->	13 DIV 5

How about this?

@aljazerzen
Copy link
Member

aljazerzen commented May 20, 2023

Oh, I see. At first I didn't understand what you meant with "maybe err". Now I assume you meant "error during compilation saying 'unsupported operation in this dialect'". With Python's impl notes, you now found how we could implement it in all dialects.

If I understand correctly, this is a table you are now proposing:

op int int int float float int float float
/ 2.6 2.6 2.6 2.6
// 2 2 2 2

I like this due to one key fact: the result type of an operator is always the same. This is nice because it means we don't need generic functions or operator overloading in the language. This makes the semantics of the language simpler. But the main benefit is that operators in PRQL are simpler. You want result to be an int? Use //. You want it to be float? Use /.

Sidenote: PRQL should test what the output of 1000000000000000000000000.0 // 10.0 is what we expect it to be. Cases like this might reveal quirks of query engines or shortfalls of our impl.

@aljazerzen
Copy link
Member

aljazerzen commented May 20, 2023

Another sidenote: I just want to complain about how SQLite works:

with a as (
   select 5 as b union all select 5.0 as b
)
select 13 / b from a;
2
2.6

Edit: Postgres and DuckDB both return 2.6 for both results.

@eitsupi
Copy link
Member Author

eitsupi commented May 20, 2023

Another thing I learned while looking into this is that integer division in Python, R, etc. is "Floor division", which produces different results than MySQL's DIV operator......

Like:

>>> 13 // -5
-3
>>> 13 // 5
2

If we were to mimic this, we would use the floor function, which would look something like these:

# SQLite
13 / 5		->	13 * 1.0 / 5
13 / 5.0	->	13 * 1.0 / 5.0
13 // 5		->	ROUND(13 / 5 - 0.5)

# DuckDB
13 / 5		->	13 / 5
13 // 5		->	FLOOR(13 / 5)

# Postgres
13 / 5		->	13 * 1.0 / 5
13 / 5.0	->	13 * 1.0 / 5.0
13 // 5		->	FLOOR(13 / 5)

# MySQL
13 / 5		->	13 / 5
13 // 5		->	FLOOR(13 / 5)

@max-sixty max-sixty changed the title Introduce integer devision operator // Introduce integer division operator // Jun 1, 2023
@aljazerzen
Copy link
Member

@max-sixty @snth Are we sure we want this behavior?

>>> 13 // -5
-3

Python agrees, but Rust says it's -2.

aljazerzen added a commit that referenced this issue Jun 1, 2023
Closes #2426

Also makes sure that float division is always producing floats.
This is done by casting to float before division. This can be removed
when match is implemented and we can check types and use only the
appropriate operation.
@snth
Copy link
Member

snth commented Jun 1, 2023

Hmm, just looking at the last question and not having read the rest of the thread, I don't know whether rounding towards zero or taking the FLOOR is preferred.

Here's what ChatGPT has to say about it:

The behavior of integer division, also known as truncating division or floored division, can vary depending on the programming language or mathematical context. In some programming languages, integer division rounds towards zero, while in others it takes the floor function of the result.

Rounding towards zero means that the quotient of the division is rounded to the nearest integer towards zero. For example:

  • 7 divided by 3 (7/3) is 2.33, so rounding towards zero gives 2.

  • -7 divided by 3 (-7/3) is -2.33, so rounding towards zero gives -2.
    On the other hand, taking the floor function means that the quotient is rounded down to the nearest integer. For example:

  • 7 divided by 3 (7/3) is 2.33, so taking the floor function gives 2.

  • -7 divided by 3 (-7/3) is -2.33, so taking the floor function also gives -2.
    Both approaches have their uses depending on the specific requirements of a given situation. Rounding towards zero is often useful in mathematical calculations that aim to minimize rounding errors, while taking the floor function can be advantageous in certain algorithmic or number theory applications.

To determine the behavior of integer division in a specific programming language, it is recommended to consult the language's documentation or specifications.

In relational databases that use SQL, the behavior of integer division is typically defined to take the floor function of the result. This means that the quotient is rounded down to the nearest integer.

The SQL standard specifies the behavior of integer division using the floor function. For example, in the SQL language, the division operator ("/") performs integer division and returns the floor of the division result. Here are some examples:

  • 7 divided by 3 (7/3) would result in 2.
  • -7 divided by 3 (-7/3) would also result in -2.
    This behavior aligns with the mathematical concept of floor division, which is often considered more appropriate for data processing tasks in relational databases. When working with integer values in a database, it is common to expect the division operation to yield an integer result without any decimal places, rather than rounding towards zero.

It's worth noting that different database systems or SQL dialects may have slight variations in their behavior, so it's always a good practice to consult the specific documentation or reference material for the database system you are using to confirm the behavior of integer division in that context.


I don't know if we should take its word about the SQL Standard definition. I just tried to google for that and didn't come up with anything.

I would try to think of some examples.

The truncating / rounding to zero definition seems more intuitive because then apart from the sign difference, you get the same result whether you're dealing with positive or negative numbers.

@eitsupi
Copy link
Member Author

eitsupi commented Jun 1, 2023

Floor division seems to be related to Floor modulo.
The R help page says:

%% indicates x mod y (“x modulo y”), i.e., computes the ‘remainder’ r <- x %% y, and %/% indicates integer division, where R uses “floored” integer division, i.e., q <- x %/% y := floor(x/y), as promoted by Donald Knuth, see the Wikipedia page on ‘Modulo operation’, and hence sign(r) == sign(y). It is guaranteed that

x == (x %% y) + y * (x %/% y)

The Wikipedia page has a list of behaviors for each programing language.
https://en.wikipedia.org/wiki/Modulo

@aljazerzen
Copy link
Member

Looking at the modulo Wikipedia article, it would seem we want this property to hold:

x = ?  # dividend
k = ?  # divisor

q = x // k      # quotient
r = x % k       # remainder
x == q * k + r  # multiplied back together

@aljazerzen
Copy link
Member

It turns out that this holds iff -13 // 5 == -2, because -13 % 5 == -3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler language-design Changes to PRQL-the-language
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants