Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential planning time (100s of seconds) with UNION and ORDER BY queries #13748

Open
Tracked by #14008
alamb opened this issue Dec 12, 2024 · 14 comments · May be fixed by #14049
Open
Tracked by #14008

Exponential planning time (100s of seconds) with UNION and ORDER BY queries #13748

alamb opened this issue Dec 12, 2024 · 14 comments · May be fixed by #14049
Assignees
Labels
bug Something isn't working regression Something that used to work no longer does

Comments

@alamb
Copy link
Contributor

alamb commented Dec 12, 2024

Describe the bug

Our internal system generates queries that look like the following

select c1, null as c2, ... null as cn from t ORDER BY c1
   UNION ALL
 select null as c1, c2, ... null as cn from t ORDER BY c2
 ...
 select null as c1, null as c2, ... cn from t ORDER BY cn
  ORDER BY c1, c2 ... CN

When there are 10 columns this takes 22 ms (in release mode) to plan (which is still quite a while)

When there are 100 columns, it takes over 2 minutes (!!) to plan, which basically caused two production incidents

Here are some timings with numbers of columns (you can see the exponential growth):

Running with 10 columns...completed in 22.65575ms
Running with 20 columns...completed in 107.885ms
Running with 30 columns...completed in 481.31775ms
Running with 40 columns...completed in 1.656844042s
Running with 50 columns...completed in 4.560470708s
Running with 60 columns...completed in 10.54814975s
Running with 70 columns...completed in 21.993968458s
Running with 80 columns...completed in 41.614843209s
Running with 90 columns...completed in 73.642939542s
Running with 100 columns...completed in 123.150163417s

To Reproduce

With this data file (has 100 columns): data.csv

Create Table

CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'data.csv' WITH ORDER (c1)

Run query

40 column version (this takes over a second to plan in release mode)

(SELECT c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c0)
  UNION ALL
(SELECT null as c0, c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c1)
  UNION ALL
(SELECT null as c0, null as c1, c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c2)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c3)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c4)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c5)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c6)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c7)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c8)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c9)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c10)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c11)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c12)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c13)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c14)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c15)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c16)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c17)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c18)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c19)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c20)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c21)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c22)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c23)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c24)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c25)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c26)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c27)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c28)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c29)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c30)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c31)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c32)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, c33, null as c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c33)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, c34, null as c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c34)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, c35, null as c36, null as c37, null as c38, null as c39 FROM t ORDER BY c35)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, c36, null as c37, null as c38, null as c39 FROM t ORDER BY c36)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, c37, null as c38, null as c39 FROM t ORDER BY c37)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, c38, null as c39 FROM t ORDER BY c38)
  UNION ALL
(SELECT null as c0, null as c1, null as c2, null as c3, null as c4, null as c5, null as c6, null as c7, null as c8, null as c9, null as c10, null as c11, null as c12, null as c13, null as c14, null as c15, null as c16, null as c17, null as c18, null as c19, null as c20, null as c21, null as c22, null as c23, null as c24, null as c25, null as c26, null as c27, null as c28, null as c29, null as c30, null as c31, null as c32, null as c33, null as c34, null as c35, null as c36, null as c37, null as c38, c39 FROM t ORDER BY c39)
ORDER BY c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39

Expected behavior

I expect that the query plans within a second with 100 columns

Additional context

The code spends an increasing amount of time in :

27.22 Gc  99.9%	-	                  datafusion_physical_expr::equivalence::properties::calculate_union::h9a6d0c834d33dfa8
27.20 Gc  99.9%	2.00 Mc	                   datafusion_physical_expr::equivalence::properties::UnionEquivalentOrderingBuilder::add_satisfied_orderings::haff5d34880826c26
27.20 Gc  99.8%	-	                    datafusion_physical_expr::equivalence::properties::EquivalenceProperties::ordering_satisfy_requirement::h7800d9c061e4551e
26.94 Gc  98.9%	2.00 Mc	                     datafusion_physical_expr::equivalence::properties::EquivalenceProperties::with_constants::h9b627573961f7ac1
26.72 Gc  98.1%	17.83 Mc	                      datafusion_physical_expr::equivalence::properties::EquivalenceProperties::discover_new_orderings::hdb50bd332b5dedd2
21.92 Gc  80.5%	3.85 Gc	                       datafusion_physical_expr::equivalence::ordering::OrderingEquivalenceClass::remove_redundant_entries::h3394f8a44ca387d2
2.52 Gc   9.2%	5.25 Mc	                       _$LT$alloc..vec..Vec$LT$T$GT$$u20$as$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$GT$::from_iter::hb5d026af61b973ba
2.12 Gc   7.8%	2.03 Gc	                       _$LT$datafusion_physical_expr..expressions..column..Column$u20$as$u20$core..cmp..PartialEq$LT$dyn$u20$core..any..Any$GT$$GT$::eq::h7faccabfc76823d5

This particularly bad behavior was introduced in influxdata@577e4bb / #12562 (🤦 myself)

I think this is also what @berkaysynnada was warning us in #12446 (comment)

@alamb alamb added bug Something isn't working regression Something that used to work no longer does labels Dec 12, 2024
@alamb
Copy link
Contributor Author

alamb commented Dec 12, 2024

I tested this with various released versions:

42.0.0:

Worst case takes 7.5 seconds

Running with 10 columns...completed in 25.767125ms
Running with 20 columns...completed in 59.605292ms
Running with 30 columns...completed in 151.040583ms
Running with 40 columns...completed in 352.528417ms
Running with 50 columns...completed in 701.194625ms
Running with 60 columns...completed in 1.259548375s
Running with 70 columns...completed in 2.142141458s
Running with 80 columns...completed in 3.42683925s
Running with 90 columns...completed in 5.22279025s
Running with 100 columns...completed in 7.59148425s

43.0.0:

Same as reported above (123.150163417s)

main

I also tested with main as of now: 36a1361

It is better, but still quite a bit worse than 42

Running with 10 columns...completed in 35.857458ms
Running with 20 columns...completed in 102.454125ms
Running with 30 columns...completed in 429.636584ms
Running with 40 columns...completed in 1.403798625s
Running with 50 columns...completed in 3.770518708s
Running with 60 columns...completed in 8.758160834s
Running with 70 columns...completed in 18.022044708s
Running with 80 columns...completed in 34.337127625s
Running with 90 columns...completed in 60.926777084s
Running with 100 columns...completed in 101.470021875s

@alamb alamb self-assigned this Dec 12, 2024
@alamb
Copy link
Contributor Author

alamb commented Dec 12, 2024

BTW here is the code I am using to generate those numbers (it is pretty grotty)

Details

// DataFusion spilling sort benchmark / exmaples
// Idea is to replicate a report from  https://discord.com/channels/885562378132000778/1166447479609376850/1276137008435298335
// where sort doesn't spill

// Related link: sorting strings

use std::fs::File;
use std::path::PathBuf;
use datafusion::arrow::array::{ArrayRef, Int32Array};
use datafusion::arrow::datatypes::{Field, Fields, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use std::sync::Arc;
use datafusion::arrow;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    // initialize logging to see DataFusion's internal logging
    std::env::set_var("RUST_LOG", "info");
    env_logger::init();


    let ctx = SessionContext::new();
    let n = 100;
    make_table(n);
    ctx.sql(&format!("CREATE EXTERNAL TABLE t({})  STORED AS CSV LOCATION 'data.csv' WITH ORDER (c1)", column_list(100))).await?
        .show().await?;

    println!("10 columns");
    println!("{}", make_query(10));

    println!("40 columns");
    println!("{}", make_query(40));


    for n in [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] {
        let sql = make_query(n);
        print!("Running with {n} columns");
        let start = std::time::Instant::now();
        ctx.sql(&sql).await?.collect().await?;
        let elapsed = start.elapsed();
        println!("...completed in {:?}", elapsed);
    }
    Ok(())
}

fn column_list(n: usize) -> String {
    (0..n)
        .map(|i| format!("c{} int", i))
        .collect::<Vec<_>>()
        .join(", ")
}

/// Writes a table like this to CSV file
/// c1: int, c2: int, c3: int....cn:int
///
/// returns the filename
fn make_table(n: usize) -> PathBuf {
    let path = PathBuf::from("data.csv");
    let arrays = (0..n)
        .map(|i| {
            let i = i as i32;
            let n = n as i32;
            Arc::new(Int32Array::from(vec![i * n, 2 * i * n, 3 * i * n])) as ArrayRef
        })
        .collect::<Vec<_>>();

    let schema = Schema::new(Fields::from(
        arrays
            .iter()
            .enumerate()
            .map(|(i, arr)| Field::new(&format!("c{}", i), arr.data_type().clone(), false))
            .collect::<Vec<_>>(),
    ));

    let batch = RecordBatch::try_new(Arc::new(schema), arrays).unwrap();
    let file = File::create(&path).unwrap();
    let mut writer = arrow::csv::Writer::new(file);
    writer.write(&batch).unwrap();
    // flush on drop
    // writer.into_inner();
    path
}

/// return a query like
/// ```sql
/// select c1, null as c2, ... null as cn from t ORDER BY c1
///   UNION ALL
/// select null as c1, c2, ... null as cn from t ORDER BY c2
/// ...
/// select null as c1, null as c2, ... cn from t ORDER BY cn
///  ORDER BY c1, c2 ... CN
/// ```
fn make_query(n: usize) -> String {
    let mut query = String::new();
    for i in 0..n {
        if i != 0 {
            query.push_str("\n  UNION ALL \n");
        }
        let select_list = (0..n)
            .map(|j| {
                if i == j {
                    format!("c{j}")
                } else {
                    format!("null as c{j}")
                }
            })
            .collect::<Vec<_>>()
            .join(", ");
        query.push_str(&format!("(SELECT {} FROM t ORDER BY c{})", select_list, i));
    }
    query.push_str(&format!(
        "\nORDER BY {}",
        (0..n)
            .map(|i| format!("c{}", i))
            .collect::<Vec<_>>()
            .join(", ")
    ));
    query
}

@findepi
Copy link
Member

findepi commented Dec 13, 2024

Do you know where the time is being spent?
esp is this some plan property that's being recalculated in an expensive manner?

@alamb
Copy link
Contributor Author

alamb commented Dec 13, 2024

Do you know where the time is being spent? esp is this some plan property that's being recalculated in an expensive manner?

Yes the time is being spent normalizing equivalence orderings.

I just need to spend some time staring at the code and figuring out how it could be done better

@alamb
Copy link
Contributor Author

alamb commented Dec 13, 2024

Here is a flamegraph showing where the time is spent

flamegraph

@alamb
Copy link
Contributor Author

alamb commented Dec 13, 2024

I spent some time reviewing the flamegraph.

Screenshot 2024-12-13 at 4 58 05 PM

  • remove_redundant_entries is called / takes up 1/3 of the time
  • normalized_oeq_class requires 2/3 of the time

The high level observation is that continually recomputing normalized equivalence groups requires significaint amounts of time. What I am going to try is to make these implementations much more efficient (avoid allocation and recomputation) and

Sketch:

  1. try and avoid calling normalized_oeq_class by ensuring that the oeq_class is always normalized
  2. I think we could reduce remove_redundant_entries by storing it in an IndexSet rather than a Vec

@Omega359
Copy link
Contributor

FYI I am pretty sure I was seeing this with some of the sqlite test files, specifically https://github.com/Omega359/arrow-datafusion/blob/feature/sqllogictest_add_sqlite/datafusion/sqllogictest/test_files/sqlite/select4.slt

@alamb
Copy link
Contributor Author

alamb commented Dec 15, 2024

I will make this much better, but it will take me a while

@alamb
Copy link
Contributor Author

alamb commented Dec 15, 2024

I have big plans on how to fix this holistically, but I also want to be able to resolve the issue quickly if possible. I am going to see if I can find some way to work around this for now while I work on a better long term plan

@alamb
Copy link
Contributor Author

alamb commented Dec 18, 2024

FYI I am pretty sure I was seeing this with some of the sqlite test files, specifically https://github.com/Omega359/arrow-datafusion/blob/feature/sqllogictest_add_sqlite/datafusion/sqllogictest/test_files/sqlite/select4.slt

For anyone following along, we are tracking completing the sqllogictest work in

@alamb
Copy link
Contributor Author

alamb commented Jan 7, 2025

I have a few PRs open to make the fields non pub. I have also been studying the code -- my first POC will be to figure out how to avoid calling normalized_oeq_class() so much as that is expensive and gets called many times.

@alamb
Copy link
Contributor Author

alamb commented Jan 8, 2025

Update:
I spent some time trying to avoid calling OrderingEquivalenceClass::normalized_oeq_class as much. This helped but not enough to really fix the problem:

I made a flamegraph with #14049 and it looks like almost all the time now is in "remove_reundant_entries"

Screenshot 2025-01-08 at 4 31 59 PM

(here is the raw flamegraph):
flamegraph

My next step will be to attempt to use some sort of Set rather than Vec for OrderingEquivalenceClass

@alamb
Copy link
Contributor Author

alamb commented Jan 16, 2025

I think I am close to actually making measurable progress

With some more flame graph profiling, I am convinced that the time is largely spent in

  • OrderingEquivalenceProperties::resolve_overlap (this shows up as OrderingEquivalenceProperties::remove_redundant_entries
  • normalize_sort_exprs

Image

My next steps are to create a benchmark in the repository and then work on validating some optimizations (such as #14136)

@alamb
Copy link
Contributor Author

alamb commented Jan 16, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regression Something that used to work no longer does
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants