Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MariaDB statistics support #19408

Merged
merged 2 commits into from
Oct 16, 2023
Merged

Conversation

findepi
Copy link
Member

@findepi findepi commented Oct 16, 2023

Description

Get table statistics for MariaDB tables.

@cla-bot cla-bot bot added the cla-signed label Oct 16, 2023
@findepi
Copy link
Member Author

findepi commented Oct 16, 2023

cc @homar @findinpath

@findepi findepi requested review from ebyhr and hashhar October 16, 2023 08:39
@findepi findepi force-pushed the findepi/mariadb-stats branch from c5df09c to 4e50d1f Compare October 16, 2023 09:24
@findepi findepi force-pushed the findepi/mariadb-stats branch from 4e50d1f to 8c1747a Compare October 16, 2023 12:01
Change `TestingMySqlServer.execute(String sql)` and
`TestingMySqlServer.execute(String sql)` to run in context of the
default database ('tpch'), simplifying usage.
Get table statistics for MariaDB tables.
@findepi findepi force-pushed the findepi/mariadb-stats branch from 8c1747a to fcd57b9 Compare October 16, 2023 12:06
@findepi
Copy link
Member Author

findepi commented Oct 16, 2023

( just rebased, after #19391 merged )

@findepi findepi merged commit 4f3efb7 into trinodb:master Oct 16, 2023
22 checks passed
@findepi findepi deleted the findepi/mariadb-stats branch October 16, 2023 13:30
@github-actions github-actions bot added this to the 430 milestone Oct 16, 2023
Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if (!columnIndexStatistics.nullable()) {
double knownNullFraction = columnStatisticsBuilder.build().getNullsFraction().getValue();
if (knownNullFraction > 0) {
log.warn("Inconsistent statistics, null fraction for a column %s, %s, that is not nullable according to index statistics: %s", table, columnName, knownNullFraction);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean "column %s.%s"? otherwise it would log foo, bar where foo is table and bar is column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i admit having copied this from MySQL's io.trino.plugin.mysql.MySqlClient#updateColumnStatisticsFromIndexStatistics

yes, we can improve both places

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of this.

Comment on lines +734 to +735
// row count from INFORMATION_SCHEMA.TABLES may be very inaccurate
rowCount = max(rowCount, columnIndexStatistics.cardinality());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code seems to assume inaccuracy is always towards the lower values - is this a fair assumption to make?

i.e. rowCount is always lower than column index cardinality value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think it is. I think the estimate may be well under-estimated and also over-estimated.
However, row count and NDV are related. We meed to do something (either way).
I copied the current approach from MySQL connector stats code (where it's probably where i implemented it earlier)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, looking at a few other places the logic is that overestimating stats is better than underestimation because underestimation can actually lead to objectively bad plans.

e.g. broadcast of a table that is too large to broadcast is probably worse than a hash join even if the table could be broadcast.

Marking as resolved.

Comment on lines +845 to +848
AND SUB_PART IS NULL -- ignore cases where only a column prefix is indexed
AND CARDINALITY IS NOT NULL -- CARDINALITY might be null (https://stackoverflow.com/a/42242729/65458)
AND CARDINALITY != 0 -- CARDINALITY is initially 0 until analyzed
GROUP BY COLUMN_NAME -- there might be multiple indexes on a column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to port this to MySQL as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was copied from MySQL and slightly adapted.

i think here AND CARDINALITY != 0 line was added (based on observation with MariaDB)
i suspect MySQL benefits from AND CARDINALITY IS NOT NULL line and maybe it's redundant for mariadb. it is very hard to test this, so i retained old condition and just added a new one

someone would need to observe what mysql is doing to determine whether AND CARDINALITY != 0 would make sense there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of this.

getQueryRunner()::execute,
"test_numeric_corner_cases_",
ImmutableMap.<String, List<String>>builder()
// TODO Infinity and NaNs not supported by MySQL. Are they not supported in MariaDB as well?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd look at type mapping tests to find the answer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

@homar please follow-up here. See all // TODO Infinity and NaNs not supported by MySQL occurrences in the code.


@Test
@Override
public void testStatsWithPredicatePushdownWithStatsPrecalculationDisabled()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one case to test here should also be when multi-column indexes exist.

I'm thinking of case where cardinality of a multi-column index can be > actual row count. (I don't know what "CARDINALITY" refers to in MariaDB case so maybe this is an impractical situation to test for).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of case where cardinality of a multi-column index can be > actual row count.

that's possible because cardinality is just an estimate.

one case to test here should also be when multi-column indexes exist.

good point, we should have a test for this.
For both MySQL and MariaDB.
@homar can you please follow-up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants