Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise filtering ID columns #10860

Closed
emteknetnz opened this issue Jul 6, 2023 · 6 comments
Closed

Optimise filtering ID columns #10860

emteknetnz opened this issue Jul 6, 2023 · 6 comments

Comments

@emteknetnz
Copy link
Member

emteknetnz commented Jul 6, 2023

When peer reviewing #10855 looking at query speed it was discovered that

DataList::byIDs($id), which simply uses $this->filter('ID', $ids), generates this style of MySQL query with placeholders:

SELECT DISTINCT "Player"."ClassName", "Player"."LastEdited", "Player"."Created", "Player"."Name", "Player"."ID", CASE WHEN "Player"."ClassName" IS NOT NULL THEN "Player"."ClassName" ELSE 'Player' END AS "RecordClassName" FROM "Player" WHERE ("Player"."ID" IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ... )

Which for 10k records took [0.1772s, 0.1981s, 0.1684s] = avg 0.1812s

However simply updating the query to this

->where(sprintf('"Player"."ID" in (%s)', implode(', ', $ids)))

Which uses this style of query without placeholders

SELECT DISTINCT "Player"."ClassName", "Player"."LastEdited", "Player"."Created", "Player"."Name", "Player"."ID", CASE WHEN "Player"."ClassName" IS NOT NULL THEN "Player"."ClassName" ELSE 'Player' END AS "RecordClassName" FROM "Player" WHERE ("Player"."ID" in (1, 2, 3, 4, 5, 6, 7, 8, ... )

Resulted in far better performance - [0.0615s, 0.0602s, 0.0606s] = avg 0.0608s

So we should look to use this far more optimal non-placeholder version when only filtering by IDs due to the 3x performance gain that will impact many areas of the CMS.

A few things to be careful of (some of these will turn into ACs)

  • Only do this is only a single column being filtered on and that column name === 'ID' - doesn't even do this for substr($column, -2) == 'ID' - since within a project this could contain anything
  • Remember to prefix the table name in the query e.g. "Player"."ID" instead of just "ID"
  • Sort $ids prior to query - may as well be nice to MySQL - sort($ids); - do this after validation
  • Should have config to disable converting the prepared statements to raw queries
  • Validation of $ids - if a single $id is not a valid $id then throw a hard exception fallback to the old method, e.g. -- while we COULD do this for non ints, I think it makes sense not to because if you're chosen to use non int IDs then you'll have a very strong security focus and having queries with a bunch of ? in them is going to be more secure than queries that include the ID as plain text.
foreach ($ids as $id) {
    if (!ctype_digit((string) $id) || $id != (int) $id) {
        // throw new InvalidArgumentException("ID $id does not look like a valid ID");
        // fallback to the old method here
    }
});

Note: prepared statements are should be faster if you have them in an optimised setup and the same query is being run many times. However this won't always be the case and wasn't the case on my local. For Silverstripe which is designed to be an easy to setup and accessible system it seems we should probably assume "not an optimised setup". And even in optimised setups they can still be slower e.g. here's an example of someone who knows what they're doing intentionally turning them off - https://orangematter.solarwinds.com/2014/11/19/analyzing-prepared-statement-performance/

Acceptance

  • Validate that there's a performance improvements in multiple scenarios including on SC/SCPS and in multiple types of queries.
  • Placeholders are not use when filtering by DBPRimaryKey or DBForeignKey. But we validate that the value is an integer.
  • If the value is not an integer, we fall back to using placeholder.
  • You can opt-out of this behaviour.
  • Security consideration are documented for a future audit.

Notes

  • Double check what happens on other DB types and raise community issues if things starts breaking.

PRs

@michalkleiner
Copy link
Contributor

I'm all for speeding up queries, but I wouldn't throw a hard exception if any of the IDs weren't numeric, just fallback to the old system.
I think (very) long-term we want to support non-numeric ID's to prevent enumeration which is often raised in security reviews, don't we?

@emteknetnz
Copy link
Member Author

emteknetnz commented Jul 6, 2023

Yeah that's a fair point, some website may have a higher up mandate using md5($id . $salt) as their primary ID.

Have updated description

@GuySartorelli
Copy link
Member

This is about byIDs which is already pretty explicitly only filtering by the ID column - but you have a point here that seems to indicate a wider scope of optimisation that might apply to other methods:

Only do this is only a single column being filtered on and that column name === 'ID' - doesn't even do this for substr($column, -2) == 'ID' - since within a project this could contain anything

I don't know if that wider scope should be done or not - but if it is, then instead of relying on the name of the field we could (and probably should) check if the field type is PrimaryKey or ForeignKey and only optimise for those.

@GuySartorelli
Copy link
Member

Merged, ACs met.

@maxime-rainville
Copy link
Contributor

There's still an open PR attached to this thing.

@GuySartorelli
Copy link
Member

Oops. Merged it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants