Add DistinctBy to avoid duplications #10677

erikjanwestendorp · 2021-07-14T11:35:49Z

Add DistinctBy to avoid duplications

Prerequisites

See issue description:
#10383

Description

Add DistinctBy to avoid duplications

ronaldbarendse · 2021-07-14T12:20:28Z

@erikjanwestendorp Thanks for creating a PR to fix the linked issue 👍🏻 Although this does indeed remove the duplications, it doesn't fix the underlying issue when more than 2000 items are in the collection.

Umbraco-CMS/src/Umbraco.Core/Persistence/Repositories/Implement/ContentTypeRepositoryBase.cs

Lines 1099 to 1105 in b8540ca

    
           var docCultureVariationsToUpdate = editedLanguageVersions.InGroupsOf(2000) 
        
               .SelectMany(_ => Database.Fetch<DocumentCultureVariationDto>( 
        
                   Sql().Select<DocumentCultureVariationDto>().From<DocumentCultureVariationDto>() 
        
                           .WhereIn<DocumentCultureVariationDto>(x => x.LanguageId, editedLanguageVersions.Keys.Select(x => x.langId).ToList()) 
        
                           .WhereIn<DocumentCultureVariationDto>(x => x.NodeId, editedLanguageVersions.Keys.Select(x => x.nodeId)))) 
        
               //convert to dictionary with the same key type 
        
               .ToDictionary(x => (x.NodeId, (int?)x.LanguageId), x => x);

This issue is twofold:

Splitting the editedLanguageVersions into groups of 2000 items, but then not using those groups, but all items in the SQL query: this could result in multiple queries (for every group of 2000 items) doing the exact same query, causing the duplicate items
The grouping is added to avoid hitting the maximum SQL parameter count of 2100, but for every item 2 parameters are added (for both the language and node ID): this could result in 4000 parameters.

I expect the WhereIn removes duplicate values, so that would explain why the maximum SQL parameter count isn't always reached... So this fix should also ensure the amount of items per group is lowered to 1000 and the actual items in that group are used (the SelectMany(_ => ...) currently discards the grouped items).

umbrabot · 2021-07-14T12:20:45Z

Hi there @erikjanwestendorp, thank you for this contribution! 👍

While we wait for one of the Core Collaborators team to have a look at your work, we wanted to let you know about that we have a checklist for some of the things we will consider during review:

It's clear what problem this is solving, there's a connected issue or a description of what the changes do and how to test them
The automated tests all pass (see "Checks" tab on this PR)
The level of security for this contribution is the same or improved
The level of performance for this contribution is the same or improved
Avoids creating breaking changes; note that behavioral changes might also be perceived as breaking
If this is a new feature, Umbraco HQ provided guidance on the implementation beforehand
The contribution looks original and the contributor is presumably allowed to share it

Don't worry if you got something wrong. We like to think of a pull request as the start of a conversation, we're happy to provide guidance on improving your contribution.

If you realize that you might want to make some changes then you can do that by adding new commits to the branch you created for this work and pushing new commits. They should then automatically show up as updates to this pull request.

Thanks, from your friendly Umbraco GitHub bot 🤖 🙂

ronaldbarendse · 2021-07-14T12:35:35Z

But thinking about this a little bit more, grouping the items and doing multiple SQL queries could result in missing rows. To illustrate:

This would return a row with languageId = NULL and nodeId = 3000:

WHERE languageId IN (null, 1, 2) AND WHERE nodeId IN (1000, 2000, 3000)

But splitting this into groups of 2 won't:

WHERE languageId IN (null, 1) AND WHERE nodeId IN (1000, 2000)
WHERE languageId IN (2) AND WHERE nodeId IN (3000)

So we should ensure one of the WHERE IN statements is always completely added, which should be the least amount of items (typically we have less languages than nodes) and then try to add the remaining maximum amount of parameters for the other WHERE IN statement. So the above example would end up as:

WHERE languageId IN (null, 1, 2) AND WHERE nodeId IN (1000, 2000)
WHERE languageId IN (null, 1, 2) AND WHERE nodeId IN (3000)

erikjanwestendorp · 2021-07-14T16:13:40Z

@ronaldbarendse Thanks for your review and your feedback. Just updated the PR, would you like to look at this again? 😄.

Work with groups now
Items per group lowered to 1000

ronaldbarendse · 2021-10-15T21:44:01Z

Hi @erikjanwestendorp, I just got around to looking at this again and unfortunately this still has the issue I described in #10677 (comment) (we can't group both language and node IDs at the same time, as that might result in missing results).

I've created PR #11369 to introduce a constant for the 'magic' number we use in quite a few places to prevent going over the maximum amount of SQL parameters. It also fixes this and a similar issue when fetching user group permissions, so I'll close this one in favor of that PR. Thanks for the effort you've put into this though! 🤗

DistinctBy

d683447

Add DistinctBy to avoid duplications

erikjanwestendorp added 3 commits July 14, 2021 18:04

Splitting the editedLanguageVersions into groups

c3c167a

fewer items per group (1000)

808e41e

Delete web.config

19573f4

ronaldbarendse mentioned this pull request Oct 13, 2021

Add Constants.Sql.MaxParameterCount and update query logic #11369

Merged

1 task

ronaldbarendse closed this Oct 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add DistinctBy to avoid duplications #10677

Add DistinctBy to avoid duplications #10677

erikjanwestendorp commented Jul 14, 2021

ronaldbarendse commented Jul 14, 2021

umbrabot commented Jul 14, 2021

ronaldbarendse commented Jul 14, 2021

erikjanwestendorp commented Jul 14, 2021 •

edited

Loading

ronaldbarendse commented Oct 15, 2021

Add DistinctBy to avoid duplications #10677

Add DistinctBy to avoid duplications #10677

Conversation

erikjanwestendorp commented Jul 14, 2021

Prerequisites

Description

ronaldbarendse commented Jul 14, 2021

umbrabot commented Jul 14, 2021

ronaldbarendse commented Jul 14, 2021

erikjanwestendorp commented Jul 14, 2021 • edited Loading

ronaldbarendse commented Oct 15, 2021

erikjanwestendorp commented Jul 14, 2021 •

edited

Loading