Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DistinctBy to avoid duplications #10677

Closed
wants to merge 4 commits into from
Closed

Add DistinctBy to avoid duplications #10677

wants to merge 4 commits into from

Conversation

erikjanwestendorp
Copy link
Contributor

Add DistinctBy to avoid duplications

Prerequisites

See issue description:
#10383

Description

Add DistinctBy to avoid duplications

Add DistinctBy to avoid duplications
@ronaldbarendse
Copy link
Contributor

@erikjanwestendorp Thanks for creating a PR to fix the linked issue 👍🏻 Although this does indeed remove the duplications, it doesn't fix the underlying issue when more than 2000 items are in the collection.

var docCultureVariationsToUpdate = editedLanguageVersions.InGroupsOf(2000)
.SelectMany(_ => Database.Fetch<DocumentCultureVariationDto>(
Sql().Select<DocumentCultureVariationDto>().From<DocumentCultureVariationDto>()
.WhereIn<DocumentCultureVariationDto>(x => x.LanguageId, editedLanguageVersions.Keys.Select(x => x.langId).ToList())
.WhereIn<DocumentCultureVariationDto>(x => x.NodeId, editedLanguageVersions.Keys.Select(x => x.nodeId))))
//convert to dictionary with the same key type
.ToDictionary(x => (x.NodeId, (int?)x.LanguageId), x => x);

This issue is twofold:

  • Splitting the editedLanguageVersions into groups of 2000 items, but then not using those groups, but all items in the SQL query: this could result in multiple queries (for every group of 2000 items) doing the exact same query, causing the duplicate items
  • The grouping is added to avoid hitting the maximum SQL parameter count of 2100, but for every item 2 parameters are added (for both the language and node ID): this could result in 4000 parameters.

I expect the WhereIn removes duplicate values, so that would explain why the maximum SQL parameter count isn't always reached... So this fix should also ensure the amount of items per group is lowered to 1000 and the actual items in that group are used (the SelectMany(_ => ...) currently discards the grouped items).

@umbrabot
Copy link

Hi there @erikjanwestendorp, thank you for this contribution! 👍

While we wait for one of the Core Collaborators team to have a look at your work, we wanted to let you know about that we have a checklist for some of the things we will consider during review:

  • It's clear what problem this is solving, there's a connected issue or a description of what the changes do and how to test them
  • The automated tests all pass (see "Checks" tab on this PR)
  • The level of security for this contribution is the same or improved
  • The level of performance for this contribution is the same or improved
  • Avoids creating breaking changes; note that behavioral changes might also be perceived as breaking
  • If this is a new feature, Umbraco HQ provided guidance on the implementation beforehand
  • The contribution looks original and the contributor is presumably allowed to share it

Don't worry if you got something wrong. We like to think of a pull request as the start of a conversation, we're happy to provide guidance on improving your contribution.

If you realize that you might want to make some changes then you can do that by adding new commits to the branch you created for this work and pushing new commits. They should then automatically show up as updates to this pull request.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

@ronaldbarendse
Copy link
Contributor

But thinking about this a little bit more, grouping the items and doing multiple SQL queries could result in missing rows. To illustrate:

  • This would return a row with languageId = NULL and nodeId = 3000:
    WHERE languageId IN (null, 1, 2) AND WHERE nodeId IN (1000, 2000, 3000)
  • But splitting this into groups of 2 won't:
    WHERE languageId IN (null, 1) AND WHERE nodeId IN (1000, 2000)
    WHERE languageId IN (2) AND WHERE nodeId IN (3000)

So we should ensure one of the WHERE IN statements is always completely added, which should be the least amount of items (typically we have less languages than nodes) and then try to add the remaining maximum amount of parameters for the other WHERE IN statement. So the above example would end up as:

WHERE languageId IN (null, 1, 2) AND WHERE nodeId IN (1000, 2000)
WHERE languageId IN (null, 1, 2) AND WHERE nodeId IN (3000)

@erikjanwestendorp
Copy link
Contributor Author

erikjanwestendorp commented Jul 14, 2021

@ronaldbarendse Thanks for your review and your feedback. Just updated the PR, would you like to look at this again? 😄.

  • Work with groups now
  • Items per group lowered to 1000

@ronaldbarendse
Copy link
Contributor

Hi @erikjanwestendorp, I just got around to looking at this again and unfortunately this still has the issue I described in #10677 (comment) (we can't group both language and node IDs at the same time, as that might result in missing results).

I've created PR #11369 to introduce a constant for the 'magic' number we use in quite a few places to prevent going over the maximum amount of SQL parameters. It also fixes this and a similar issue when fetching user group permissions, so I'll close this one in favor of that PR. Thanks for the effort you've put into this though! 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants