Improve correlation by keeping arch info declared in manifest arp DisplayName entry #3100

yao-msft · 2023-03-23T00:35:32Z

Microsoft Reviewers: Open in CodeFlow

Trenly · 2023-03-23T21:04:19Z

src/AppInstallerCommonCore/NameNormalization.cpp

+
+        if (WI_IsFlagSet(fieldsToInclude, NormalizationField::Architecture) && m_arch != Utility::Architecture::Unknown)
+        {
+            result += '(' + std::string(Utility::ToString(m_arch)) + ')';


What about packages that include their architecture without the parentheses, or where the text isn't simply the architecture's string form? Won't this still cause issues? I feel that where DisplayName is included in the manifest, normalization shouldn't occur.

AppsAndFeaturesEntries:
- DisplayName: Bluebeam Revu x64

- DisplayName: calibre 64bit
  UpgradeCode: '{5DD881FF-756B-4097-9D82-8C0F11D521EA}'

- ProductCode: SyncBackPro64_is1_is1
  DisplayName: SyncBackPro x64

- DisplayName: 7-Zip 22.01 (x64 edition)
  Publisher: Igor Pavlov

- Publisher: Adobe
  DisplayName: Adobe Acrobat DC (64-bit)

Our normalization code uses a regex that should handle the different formats of arch representation: x64, (x64), 64bit, etc.
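To illustrate the idea of regex-based architecture detection, here is a minimal sketch. `DetectArchitecture` and both patterns are hypothetical and written only for this example; the regex actually used by the normalizer in NameNormalization.cpp may differ.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper, not winget's actual implementation: looks for an
// architecture token in a display name and returns a canonical label.
std::optional<std::string> DetectArchitecture(const std::string& displayName)
{
    // Illustrative patterns only. Each requires a non-alphanumeric character
    // (or the string boundary) on both sides of the architecture token.
    static const std::regex x64Pattern(
        R"((^|[^A-Za-z0-9])(x64|amd64|x86_64|64[- ]?bit)([^A-Za-z0-9]|$))",
        std::regex::icase);
    static const std::regex x86Pattern(
        R"((^|[^A-Za-z0-9])(x86|32[- ]?bit)([^A-Za-z0-9]|$))",
        std::regex::icase);

    // Check x64 first so that "x86_64" is not reported as x86.
    if (std::regex_search(displayName, x64Pattern))
    {
        return std::string("X64");
    }
    if (std::regex_search(displayName, x86Pattern))
    {
        return std::string("X86");
    }
    return std::nullopt;
}
```

With these assumed patterns, the DisplayName values quoted in this thread (for example `calibre 64bit` and `Adobe Acrobat DC (64-bit)`) are all detected as 64-bit, with or without parentheses.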

The reason we still need to do normalization on DisplayName comes from various considerations:

Including every declared DisplayName would explode the index (currently around 13 MB) to at least several times its size. Since by default winget updates the source every 3 mins when a new version is available, that is a lot of data to download.

If DisplayName contains the version (or any other package-version-specific string) and we don't normalize it, our correlation will rely on the repo containing every version of the package, with every package version declaring DisplayName. For example, if the DisplayNames are Foo v1.0.0 and Foo v2.0.0 (2 versions exist in the winget-pkgs repo) and we don't normalize them in the index, a locally installed Foo v1.5.0 cannot be correlated. We cannot guarantee that winget-pkgs lists every version of a package, and the client should not rely on that assumption.
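The Foo example can be sketched as follows. `StripVersion` and its pattern are hypothetical simplifications made up for illustration, not the normalizer's real logic; the point is only that removing the version gives every release the same index key.

```cpp
#include <regex>
#include <string>

// Hypothetical, simplified normalizer: removes version-like tokens such as
// "v1.0.0" or "22.01" so that different releases share a single index key.
std::string StripVersion(const std::string& displayName)
{
    // Illustrative pattern: optional leading whitespace, optional "v",
    // digits, and at least one ".digits" group, bounded on both sides.
    static const std::regex versionPattern(
        R"(\s*\bv?\d+(\.\d+)+\b)", std::regex::icase);
    return std::regex_replace(displayName, versionPattern, "");
}
```

Under this assumption, `Foo v1.0.0`, `Foo v2.0.0`, and the locally installed `Foo v1.5.0` all normalize to `Foo`, so the installed version correlates even though the repo never listed it.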

According to our brief analysis, arch is the main source of false correlation, so it makes sense to keep arch info specifically. We could expand this to include locale, or even version, if we decide to in the future. But each expansion comes at the cost of making the index several times larger and reducing correlation success when the installed version is not in the repo, unless we redesign the index structure in a future winget v2 (if winget v2 becomes a thing).

I see; thanks for the detailed explanation!

JohnMcPMS · 2023-03-30T00:19:13Z

src/AppInstallerRepositoryCore/Public/winget/RepositorySearch.h


        PackageMatchFilter(PackageMatchField f, MatchType t) : RequestMatch(t), Field(f) { EnsureRequiredValues(); }
        PackageMatchFilter(PackageMatchField f, MatchType t, Utility::NormalizedString& v) : RequestMatch(t, v), Field(f) { EnsureRequiredValues(); }
        PackageMatchFilter(PackageMatchField f, MatchType t, const Utility::NormalizedString& v) : RequestMatch(t, v), Field(f) { EnsureRequiredValues(); }
        PackageMatchFilter(PackageMatchField f, MatchType t, Utility::NormalizedString&& v) : RequestMatch(t, std::move(v)), Field(f) { EnsureRequiredValues(); }
-        PackageMatchFilter(PackageMatchField f, MatchType t, std::string_view v1, std::string_view v2) : RequestMatch(t, v1, v2), Field(f) { EnsureRequiredValues(); }
+        PackageMatchFilter(PackageMatchField f, MatchType t, std::string_view v1, std::string_view v2, Utility::NormalizationField n = Utility::NormalizationField::None) : RequestMatch(t, v1, v2), Field(f), NameNormalizationField(n) { EnsureRequiredValues(); }


The searching should not know or understand how normalization is being done.

JohnMcPMS · 2023-03-30T00:23:20Z

src/AppInstallerRepositoryCore/CompositeSource.cpp

-                void AddToFilters(std::vector<PackageMatchFilter>& filters) const
+                void AddToFilters(
+                    std::vector<PackageMatchFilter>& filters,
+                    Utility::NormalizationField nameNormalizationField = Utility::NormalizationField::None) const


I don't think that CompositeSource should need to be aware of this, other than potentially to be able to get the necessary data (names and publishers). But that is likely a separate issue that isn't directly related to this one; for instance, we probably aren't getting the full set of names and publishers from REST searches.

JohnMcPMS · 2023-03-30T00:31:56Z

src/AppInstallerRepositoryCore/CompositeSource.cpp

@@ -1054,14 +1099,17 @@ namespace AppInstaller::Repository
                auto names = installedVersion->GetMultiProperty(PackageVersionMultiProperty::Name);
                auto publishers = installedVersion->GetMultiProperty(PackageVersionMultiProperty::Publisher);

+                Utility::NameNormalizer normer(Utility::NormalizationVersion::Initial);


Different versions of the index will use different versions of the normalization; I don't think that this is future proof. The goal was that from the outside, normalization was known to be used, but not understood. I don't think that this level of detail is a workable solution for an arbitrary source to implement.

JohnMcPMS · 2023-03-30T00:46:27Z

src/AppInstallerRepositoryCore/CompositeSource.cpp

+                    {
+                        candidateSearches.emplace_back(installedPackageData.CreateInclusionsSearchRequest(Utility::NormalizationField::Architecture));
+                    }
+                    candidateSearches.emplace_back(installedPackageData.CreateInclusionsSearchRequest());


I don't really understand why we need to search for both in any scenario; it seems like it could lead to the same problem, just in the less likely reverse case. If winget-pkgs only had one architecture version, and we had the other installed, we would get a mismatch. The reason to leave the non-architectured index entry is for older clients.

If we find architecture, we should only search for the one with architecture. This does mean that we need to roll out the updated index before the updated client of course. This could be done by putting the search behavior behind an experimental feature for now, while leaving the index construction to include both. Once the index update is on the server, we can try it out with the experimental feature and then make it non-experimental if it is working.
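A minimal sketch of the behavior proposed in this comment, using hypothetical names (`CandidateSearch`, `BuildCandidateSearches`, and the `archSearchEnabled` flag standing in for an experimental feature are all assumptions, not the winget API). The `name(arch)` form mirrors the suffix appended in the NameNormalization.cpp diff above.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical type; this is not the winget search request.
struct CandidateSearch
{
    std::string normalizedName;
};

// Builds the names to search for when correlating an installed package.
// 'baseName' is the name normalized without architecture, 'arch' is the
// architecture detected in the installed DisplayName (if any), and
// 'archSearchEnabled' stands in for an experimental feature toggle.
std::vector<CandidateSearch> BuildCandidateSearches(
    const std::string& baseName,
    const std::optional<std::string>& arch,
    bool archSearchEnabled)
{
    if (archSearchEnabled && arch)
    {
        // Architecture found: search only for the architecture-qualified name.
        return { CandidateSearch{ baseName + "(" + *arch + ")" } };
    }

    // No architecture, or feature disabled: use the legacy name only.
    return { CandidateSearch{ baseName } };
}
```

Because exactly one search is issued in either case, an installed x64 package can no longer fall through to an index entry that only describes another architecture.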

JohnMcPMS · 2023-03-30T00:49:42Z

src/AppInstallerRepositoryCore/CompositeSource.cpp

I wonder if this is part of a larger problem where we are taking the first match rather than treating multiple matches as an issue (and thus resulting in no matches, with some logging of course).

Also, some analysis that I hadn't thought of is to look at how many normalized name and publisher values (or any of the system reference string cohort) map onto multiple package ids. Any time that happens, it is likely cause for concern.

There are 153 unique norm_names that map to more than one package.

ID_Name_Map_DuplicatesOnly_2023-03-29.csv

If you look at the mapping where different package ids have a duplicate name AND publisher, there are 339 ids with duplication.

ID_Name_Publisher_Map_DuplicatesOnly_2023-03-29.csv
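The kind of duplicate analysis described above can be sketched like this. `FindAmbiguousNames` is a hypothetical helper written for illustration; it is not how the attached CSV files were produced, and it takes the (normalized name, package id) pairs as plain strings.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical analysis helper: given (normalized name, package id) pairs,
// returns the names that map onto more than one distinct package id.
std::map<std::string, std::set<std::string>> FindAmbiguousNames(
    const std::vector<std::pair<std::string, std::string>>& nameToId)
{
    std::map<std::string, std::set<std::string>> idsByName;
    for (const auto& [name, id] : nameToId)
    {
        // A set ignores repeated (name, id) pairs from multiple versions.
        idsByName[name].insert(id);
    }

    // Keep only the names shared by two or more package ids.
    for (auto it = idsByName.begin(); it != idsByName.end();)
    {
        if (it->second.size() > 1)
        {
            ++it;
        }
        else
        {
            it = idsByName.erase(it);
        }
    }
    return idsByName;
}
```

A search could use the same check at runtime: if a normalized name resolves to more than one package id, log it and report no match rather than taking the first result.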

JohnMcPMS · 2023-03-30T00:57:50Z

src/AppInstallerRepositoryCore/Microsoft/Schema/1_2/Interface_1_2.cpp

@@ -218,7 +224,7 @@ namespace AppInstaller::Repository::Microsoft::Schema::V1_2
            if (filter.Field == PackageMatchField::NormalizedNameAndPublisher && filter.Type == MatchType::Exact)
            {
                Utility::NormalizedName normalized = m_normalizer.Normalize(Utility::FoldCase(filter.Value), Utility::FoldCase(filter.Additional.value()));
-                filter.Value = normalized.Name();
+                filter.Value = normalized.GetNormalizedName(filter.NameNormalizationField);


I think maybe we do need a new minor version so that we can always call GetNormalizedName with architecture.

yao-msft · 2023-03-30T21:30:29Z

After offline discussion, we will address the above comments by moving all normalization-related logic inside the index search logic.
We will also add a new enum called SearchPurpose to the search request; together with the source type, SearchInternal can then decide on the fallback logic if needed.

yao-msft · 2023-04-05T04:35:39Z

src/AppInstallerRepositoryCore/Public/winget/RepositorySearch.h

+        // Default search purpose.
+        Default,
+        // The result is used for correlation to an installed package.
+        CorrelationToInstalled,


In the end I had to use 3 enum values for the search purpose, because the SQLite index itself does not know whether it is the installed source; this information is only available at the Source level.
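A rough sketch of how a search purpose could drive the fallback decision inside the index search. Only `Default` and `CorrelationToInstalled` appear in the diff above; the third value `CorrelationToAvailable`, the `NamesToTry` function, and the fallback order are assumptions made for this example and are not confirmed by the thread.

```cpp
#include <string>
#include <vector>

enum class SearchPurpose
{
    // Default search purpose.
    Default,
    // The result is used for correlation to an installed package.
    CorrelationToInstalled,
    // Assumed name for the third value; not shown in the diff above.
    CorrelationToAvailable,
};

// Hypothetical helper: returns the normalized names to try, in order.
// 'arch' is empty when no architecture was found in the name.
std::vector<std::string> NamesToTry(
    SearchPurpose purpose,
    const std::string& baseName,
    const std::string& arch)
{
    if (purpose == SearchPurpose::Default || arch.empty())
    {
        // Ordinary searches never use the architecture-qualified form.
        return { baseName };
    }

    // Assumed policy for correlation: architecture-qualified name first,
    // then the plain name as a fallback for indexes that lack it.
    return { baseName + "(" + arch + ")", baseName };
}
```

Keeping this decision inside the index search means callers such as CompositeSource only state why they are searching, not how names are normalized.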

yao-msft added 3 commits March 22, 2023 15:51

Correlation with arch

fd08bb9

e2e test data

e9b89a2

e2e test

93d88fd

yao-msft changed the title ~~Improve correlation by using arch info if declared in manifest arp entry~~ Improve correlation by keeping arch info declared in manifest arp DisplayName entry Mar 23, 2023

fix

754cc9c

Trenly reviewed Mar 23, 2023

yao-msft added 3 commits March 24, 2023 15:22

Fix

055c6e7

fix

6746056

fix e2e

f846d02

spelling

c9a0a82

yao-msft marked this pull request as ready for review March 28, 2023 00:29

yao-msft requested a review from a team as a code owner March 28, 2023 00:29

JohnMcPMS requested changes Mar 30, 2023

microsoft-github-policy-service bot added the Needs-Author-Feedback Issue needs attention from issue or PR author label Mar 30, 2023

microsoft-github-policy-service bot added Needs-Attention Issue needs attention from Microsoft and removed Needs-Author-Feedback Issue needs attention from issue or PR author labels Mar 30, 2023

pr comments

dba1b71

yao-msft commented Apr 5, 2023

Fix build

0e6da40

JohnMcPMS approved these changes Apr 6, 2023

yao-msft merged commit 3165abe into microsoft:master Apr 6, 2023

yao-msft deleted the archmatch branch April 6, 2023 18:55
