Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Resolve problems with eagerloading performance #10855

Conversation

GuySartorelli
Copy link
Member

@GuySartorelli GuySartorelli commented Jul 4, 2023

Multiple commits

There are two commits here. They are both important.

Commit 1: Refactors the existing code so that each relation type is fetched in its own method. There's more information about why I needed this in the commit message.
Commit 2: This is the commit that holds the actual change of logic. This commit also includes changes to some variable names, to make them more explicit or self explanatory.

It'll almost certainly be easier to review this PR by looking at the changes in each commit, so you can more easily see the actual change to logic. But for convenience's sake, I've added a github comment to every change in the second commit that isn't just a change of variable name or linting fix or added comment.

Performance before and after

The following results are for effectively the same setup as in the linked issue - except the Team class also has a separate has_many relation for the Player class, which holds another (separate) 10 players per team. This means there are a total of 20,000 player records in the database.

The following numbers are for the full request of loading a page, and come from the lekoala debugbar, so they're not 100% accurate - but they give a clear indication of the improvement this PR makes.

has_many

This PR doesn't affect the performance of eager loading has_many relations - but it's nice to see that eagerloading does improve the speed of loading has_many relations in general.

without eager loading with eager loading (no PR) with eager loading (with PR)
~400ms ~350ms ~350ms

many_many

Without this PR, eager loading many_many relations is effectively impossible.
Even with this PR, we can see that eager loading many_many relations is slower than without it, so there's still room for further optimisations (it's worth noting the database query itself (WHERE ("Player"."ID" IN (<IDs of teams>)) is taking ~1 second of that time so the PHP side of it is about as good as it's likely to get).

without eager loading with eager loading (no PR) with eager loading (with PR)
~500ms times out (30+ seconds) ~1.5seconds

Issue

@GuySartorelli GuySartorelli marked this pull request as draft July 4, 2023 21:51
@GuySartorelli GuySartorelli force-pushed the pulls/5/fix-eagerloading-performance branch 2 times, most recently from b4f0636 to 1570b3c Compare July 5, 2023 00:35
It is hard to work with the current structure - having each relation
type in its own method makes it way easier to see where my concerns
start and end with a given relation.

This also reduces the chance of variable bleed-over, where the value in
a variable in the first loop (in the same or other relation type) could
bleed into the next iteration of the loop.
@GuySartorelli GuySartorelli force-pushed the pulls/5/fix-eagerloading-performance branch 4 times, most recently from c29d7cf to 24948c8 Compare July 5, 2023 03:30
Comment on lines -1316 to +1323
$this->eagerLoadRelations = array_merge($this->eagerLoadRelations, $arr);
$this->eagerLoadRelations = array_unique(array_merge($this->eagerLoadRelations, $arr));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, we'd end up processing the same set of relations twice.
For example, you added

SomeClass::get()->eagerLoad(
    'MyManyMany.MyHasMany',
    'MyManyMany.MySecondHasMany',
);

you would have fetched MyManyMany for both relation chains.

Comment on lines -1308 to +1320
for ($i = 0; $i < $count; $i++) {
if ($i === 0) {
$arr[] = $parts[$i];
} else {
$arr[] = $arr[count($arr) - 1] . '.' . $parts[$i];
}
// Add each relation in the chain as its own entry to be eagerloaded
// e.g. for "Players.Teams.Coaches" you'll have three entries:
// "Players", "Players.Teams", and "Players.Teams.Coaches
$usedParts = [];
foreach ($parts as $part) {
$usedParts[] = $part;
$arr[] = implode('.', $usedParts);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't at all clear what was happening here - this change doesn't improve performance but it does improve code readability.

Comment on lines 1256 to 1263
$joinTableQuery = DB::query('SELECT * FROM "' . $join . '" WHERE "' . $parentField . '" IN (' . implode(',', $ids) . ')');
$relationItemIDs = [];
$rows = [];
while ($row = $joinTableQuery->record()) {
$rows[] = [
$parentField => $row[$parentField],
$childField => $row[$childField]
];
$relationItemIDs[] = $row[$childField];
$joinTableQuery = DB::query('SELECT * FROM "' . $join . '" WHERE "' . $parentIDField . '" IN (' . implode(',', $parentIDs) . ')');
$relationItemIDs = $joinTableQuery->column($childIDField);
$joinRows = $joinTableQuery;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the query result's own iterator rather than re-inventing the wheel.

Comment on lines 1268 to 1292
// Get ALL of the items for this relation up front, for ALL of the parents
// Fetched as a map so we can get the ID for all records up front (instead of in another nested loop)
// Fetched after that as an array because for some reason that performs better in the loop
$relationArray = DataObject::get($relationDataClass)->byIDs($relationItemIDs)->map('ID', 'Me')->toArray();

// Build a map of which children belong to which parent
$map = [];
foreach ($joinRows as $row) {
$parentID = $row[$parentIDField];
$childID = $row[$childIDField];
if (!isset($map[$parentID])) {
$map[$parentID] = [];
}
$map[$parentID][] = $relationArray[$childID];
}
$relationArray = DataObject::get($relationDataClass)->filter(['ID' => $relationItemIDs])->toArray();
foreach ($rows as $row) {
$eagerLoadID = $row[$parentField];
if (!isset($this->eagerLoadedData[$eagerLoadRelation][$eagerLoadID][$relation])) {
$arrayList = ArrayList::create();
$arrayList->setDataClass($manyManyLastComponent['childClass']);
$this->eagerLoadedData[$eagerLoadRelation][$eagerLoadID][$relation] = $arrayList;

// Store the children in an ArrayList against the correct parent
foreach ($map as $parentID => $children) {
// There shouldn't be a list for this relation already - if there is, something went wrong.
if (isset($this->eagerLoadedData[$eagerLoadRelation][$parentID][$relationName])) {
throw new LogicException("Relation list for $relationName on $parentDataClass cannot be fetched twice");
}
$relationItem = array_values(array_filter($relationArray, function ($relationItem) use ($row, $childField) {
return $relationItem->ID === $row[$childField];
}))[0];
$this->eagerLoadedData[$eagerLoadRelation][$eagerLoadID][$relation]->push($relationItem);

$arrayList = ArrayList::create($children);
$this->eagerLoadedData[$eagerLoadRelation][$parentID][$relationName] = $arrayList;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole section of changes is the main performance booster and fixes the root problem in the issue.

It basically boils down to this:

  • Previously we were looping through all of the join records, and for every join record we'd then loop through all of the relation records, and call ->ID on it.
  • Now we're getting the IDs up front, and the nested loop is removed entirely.

Comment on lines 1244 to 1249
$joinThroughObjs = $join::get()->filter([$parentField => $ids]);
$joinThroughObjs = DataObject::get($join)->filter([$parentIDField => $parentIDs]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to be more consistent with the other DataObject::get() calls in the eager loading code

Comment on lines 1158 to 1155
if ($dataClass === $dataClasses[0]) {
while ($row = $query->record()) {

// It's a has_one directly on the records in THIS list
if ($parentDataClass === $this->dataClass()) {
foreach ($query as $itemData) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the query's own iterator rather than reinventing the wheel.

@emteknetnz
Copy link
Member

emteknetnz commented Jul 6, 2023

Just noting in my own testing with this PR that eager loading actually improved performance on many-many than has-many, counter to the testing in the description on this PR where it hurt it. I guess it's a case of "your mileage may vary" and "you should validate whether or not eager loading works for your particular website"

I've raised a new issue to look at optimising the performance of byIDs()

Copy link
Member

@emteknetnz emteknetnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally, seems to work great.

src/ORM/DataList.php Outdated Show resolved Hide resolved
src/ORM/DataList.php Show resolved Hide resolved
@GuySartorelli GuySartorelli force-pushed the pulls/5/fix-eagerloading-performance branch from 24948c8 to 7af0fe2 Compare July 6, 2023 02:33
@emteknetnz emteknetnz merged commit 730a03d into silverstripe:5 Jul 6, 2023
@emteknetnz emteknetnz deleted the pulls/5/fix-eagerloading-performance branch July 6, 2023 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants