Error The WMI service or the WMI provider returned an unknown error: HRESULT 0x80041033 when importing ActiveSyncDeviceAccessRules #4982

Open
JbMachtMuschel opened this issue Aug 22, 2024 · 82 comments

@JbMachtMuschel

Description of the issue

I am trying to import more than 400 ActiveSyncDeviceAccessRules into another tenant, but after approx. 90 rules this error appears:
The WS-Management service cannot process the request. The WMI service or the WMI provider returned an unknown error: HRESULT 0x80041033
+ CategoryInfo : ResourceUnavailable: (root/Microsoft/...gurationManager:String) [], CimException
+ FullyQualifiedErrorId : HRESULT 0x80041033
+ PSComputerName : localhost

Without setting this: Set-WSManInstance -ResourceURI winrm/config -ValueSet @{MaxEnvelopeSizekb = "1048576"}
an import stops almost immediately.

Microsoft 365 DSC Version

1.24.731.1

Which workloads are affected

Exchange Online

The DSC configuration

I exported EXOClientAccessRule in the prod tenant using
Authentication methods specified:
- Service Principal with Certificate Thumbprint

On the destination tenant I created the MOF file and adapted the ConfigurationData.psd1. I tried several machines.

PS Version:
Name             : ConsoleHost
Version          : 5.1.17763.6189
InstanceId       : aa499f6c-4b52-401a-a18d-78ce4d475acc
UI               : System.Management.Automation.Internal.Host.InternalHostUserInterface
CurrentCulture   : de-DE
CurrentUICulture : en-US
PrivateData      : Microsoft.PowerShell.ConsoleHost+ConsoleColorProxy
DebuggerEnabled  : True
IsRunspacePushed : False
Runspace         : System.Management.Automation.Runspaces.LocalRunspace

Verbose logs showing the problem

VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)] Test-TargetResource returned False
VERBOSE: [ServerName,]: LCM:  [ End    Test     ]  [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)]  in 5.5740 seconds.
VERBOSE: [ServerName,]: LCM:  [ Start  Set      ]  [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)]
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)] Setting Active Sync Device Access Rule configuration for iOS 8.3 12F70 (DeviceOS)
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)] Getting Active Sync Device Access Rule configuration for iOS 8.3 12F70 (DeviceOS)
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)] Trying to retrieve instance by Identity
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)] Active Sync Device Access Rule iOS 8.3 12F70 (DeviceOS) does not exist.
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)] Active Sync Device Access Rule 'iOS 8.3 12F70 (DeviceOS)' does not exist but it should. Create and configure it.
VERBOSE: [ServerName,]: LCM:  [ End    Set      ]  [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)]  in 9.8730 seconds.
VERBOSE: [ServerName,]: LCM:  [ End    Resource ]  [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.3 12F70 (DeviceOS)]
VERBOSE: [ServerName,]: LCM:  [ Start  Resource ]  [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.4 12H143 (DeviceOS)]
VERBOSE: [ServerName,]: LCM:  [ Start  Test     ]  [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.4 12H143 (DeviceOS)]
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.4 12H143 (DeviceOS)] Testing Active Sync Device Access Rule configuration for iOS 8.4 12H143 (DeviceOS)
VERBOSE: [ServerName,]:                            [[EXOActiveSyncDeviceAccessRule]EXOActiveSyncDeviceAccessRule-iOS 8.4 12H143 (DeviceOS)] Getting Active Sync Device Access Rule configuration for iOS 8.4 12H143 (DeviceOS)
The WS-Management service cannot process the request. The WMI service or the WMI provider returned an unknown error: HRESULT 0x80041033
    + CategoryInfo          : ResourceUnavailable: (root/Microsoft/...gurationManager:String) [], CimException
    + FullyQualifiedErrorId : HRESULT 0x80041033
    + PSComputerName        : localhost

Environment Information + PowerShell Version

OsName               :
OsOperatingSystemSKU :
OsArchitecture       :
WindowsVersion       : 1809
WindowsBuildLabEx    : 17763.1.amd64fre.rs5_release.180914-1434
OsLanguage           :
OsMuiLanguages       :
@JbMachtMuschel
Author

Command I run:
Start-DscConfiguration -Path C:\DscCert\20240822_170552\EXOActiveSyncDeviceAccessRule20240822_170552M365TenantConfig -Verbose -Force -Wait # tried several ThrottleLimit values

@jadamones

I'm still having this issue. It's only happening with EXO resources but not the same one consistently. The only consistency seems to be that the WMI Provider Host process hits about 4100 MB and then it crashes. If I run 1.24.403.1 the WMI Provider Host doesn't use nearly as much memory, and it appears to return memory to the system throughout the test. I've tried pwsh 7 as well as three other machines (Win11 and 2022) and it doesn't help.

@FabienTschanz
Contributor

@jadamones Quick question: When you check the memory consumption in Task Manager, is the WMI Provider Host process running in 32-bit? If yes, that would explain why it fails, since 32-bit applications have a memory limit of 4 GB. Unfortunately, I'm not aware of any way to switch the WMI process to 64-bit, which would mitigate that issue.

@JbMachtMuschel
Author

I checked my system and WmiPrvSE.exe is a 64-bit application. After consuming a bit more than 4 GB of memory, the process changes its status to "suspended" and afterwards the import fails with the mentioned error.
Any hints? This test is the beginning of our DSC evaluation and I am importing a low number of policies, so imports with more than 10000 items will most likely fail.

@JbMachtMuschel
Author

Another interesting point: after WmiPrvSE.exe crashes, the process starts again, even after the PowerShell command has reported the error:
The WS-Management service cannot process the request. The WMI service or the WMI provider returned an unknown error: HRESULT 0x80041033
+ CategoryInfo : ResourceUnavailable: (root/Microsoft/...gurationManager:String) [], CimException
+ FullyQualifiedErrorId : HRESULT 0x80041033
+ PSComputerName : localhost
WmiPrvSE.exe starts again and resumes importing the ActiveSyncDeviceAccessRules from the MOF file. After reaching the process memory limit of 4194304 KB, it loops.

@jadamones

Hey @FabienTschanz I wondered that as well and should've mentioned that I already confirmed that. It is indeed the 64-bit provider.
image

@jadamones

The changes to the EXO workload between 1.24.403.1 and the next version, where this problem started for me, are listed below. Nothing is jumping out at me, but I'm not sure what went into the Misc changes. Interestingly, @JbMachtMuschel, EXOActiveSyncDeviceAccessRules were introduced. I don't have those rules defined in any of the tenants where this is a problem, and I don't monitor or define the EXOMailboxSettings resource in any of my configurations, so in my case it doesn't appear to be an issue with either of those resources.

EXOActiveSyncDeviceAccessRule

  • Retrieve instance by Identity if not found by characteristic.

EXOMailboxSettings

  • Simplified the Set logic and removed Timezone validation to remove checks to registry key which caused issues in Linux.

MISC

  • Provided the ability to force reload the EXO or SC modules to prevent calling the wrong cmdlet where the same names are defined (e.g. Get-RoleGroup).

Telemetry

  • Get operating system using faster method to speed up telemetry calls.

@FabienTschanz
Contributor

@jadamones In that view, only the application name is visible, but not if it's 32-Bit or not. That is only visible in the "Processes" view of the Task Manager, when you scroll down to "Background Processes" --> WMI Provider Host (32 bit). If you go to details for a 32-bit process, you'll end up with the same view as you wrote.
image

Unfortunately I don't have an Exchange infrastructure in my test lab (and I have no idea how to manage that 😅), so I'm not of much help even if I have your configuration.

Still I will take a look at what changed in the code just as you did, maybe we can find something.

@jadamones

Also @JbMachtMuschel I do not experience the loop. The WMI Provider just crashes and does not restart.

@JbMachtMuschel
Author

image

I will restart soon.
@jadamones: If you like, simply create some rules:
1..400 |%{$OS = "iOS" + $_; write-host "Creating $OS";New-ActiveSyncDeviceAccessRule -QueryString $OS -Characteristic UserAgent -AccessLevel Quarantine -confirm:$false}

QueryString : iOS1
Characteristic : UserAgent
AccessLevel : Quarantine
Name : iOS1 (UserAgent)
AdminDisplayName :
ExchangeVersion : 0.10 (14.0.100.0)
DistinguishedName : CN=iOS1 (UserAgent),CN=Mobile Mailbox Settings,CN=Configuration,CN=bdfgrptest.onmicrosoft.com,
CN=ConfigurationUnits,DC=DEUP281A012,DC=PROD,DC=OUTLOOK,DC=COM
Identity : iOS1 (UserAgent)
ObjectCategory : DEUP281A012.PROD.OUTLOOK.COM/Configuration/Schema/ms-Exch-Device-Access-Rule
ObjectClass : {top, msExchDeviceAccessRule}
WhenChanged : 8/27/2024 5:50:58 PM
WhenCreated : 8/27/2024 5:50:58 PM
WhenChangedUTC : 8/27/2024 3:50:58 PM
WhenCreatedUTC : 8/27/2024 3:50:58 PM
ExchangeObjectId : a6617681-e938-4296-81bc-32b8d0671cbf
OrganizationalUnitRoot : bdfgrptest.onmicrosoft.com
OrganizationId : DEUP281A012.PROD.OUTLOOK.COM/Microsoft Exchange Hosted
Organizations/bdfgrptest.onmicrosoft.com -
DEUP281A012.PROD.OUTLOOK.COM/ConfigurationUnits/bdfgrptest.onmicrosoft.com/Configuration
Id : iOS1 (UserAgent)
Guid : a6617681-e938-4296-81bc-32b8d0671cbf
OriginatingServer : FR3P281A12DC003.DEUP281A012.PROD.OUTLOOK.COM
IsValid : True
ObjectState : Unchanged

@jadamones

Hey @FabienTschanz thank you for helping out with this and looking into it! The 7th column over in that screenshot shows the system architecture as x64 for the process but regardless, I do not have the (32 bit) tag in the name under the processes view for any of them or the one that's running up the memory during a Test-DscConfiguration. Can also confirm that the file path is System32 and not SysWow64 for that particular process.

image

@jadamones

@JbMachtMuschel Thanks, I will create some when I have a chance just to see if I get the loop.

@JbMachtMuschel
Author

@jadamones Thank you. The loop is one problem, but the memory consumption of WmiPrvSE.exe seems to be the underlying problem. I checked https://learn.microsoft.com/en-us/troubleshoot/windows-server/system-management-components/scenario-guide-troubleshoot-wmiprvse-quota-exceeded-issues
I applied the configuration from the article's section "Client application performs abnormal, inefficient, or large queries".
But I do not know whether this other scenario from the article applies: "The WmiPrvse.exe process doesn't release resources as expected."

@JbMachtMuschel
Author

image

@FabienTschanz
Contributor

@jadamones Of course, silly me. Sometimes I'm like a blind chicken. Aaanyways, under the assumption that the WMI process has a memory limit of 4GB, we could try to increase the quota per process and overall like the following in an elevated Windows PowerShell session (according to https://techcommunity.microsoft.com/t5/ask-the-performance-team/wmi-high-memory-usage-by-wmi-service-or-wmiprvse-exe/ba-p/375491):

$quotaConfiguration = Get-WmiObject -Class __ProviderHostQuotaConfiguration -Namespace Root
$quotaConfiguration.MemoryAllHosts = 4 * 4GB
$quotaConfiguration.MemoryPerHost = 3 * 4GB
Set-WmiInstance -InputObject $quotaConfiguration

I don't know if a restart is required, but just to be sure, I would recommend one.
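For reference, the two expressions in the snippet above can be evaluated on their own to see which byte values end up in the quota. This is only an arithmetic sketch; the variable names are illustrative and nothing here touches WMI.

```powershell
# PowerShell's GB suffix multiplies by 1024^3, so 4GB is 4294967296 bytes.
$memoryAllHosts = 4 * 4GB   # value assigned to MemoryAllHosts above
$memoryPerHost  = 3 * 4GB   # value assigned to MemoryPerHost above

'MemoryAllHosts: {0} bytes = {1} GB' -f $memoryAllHosts, ($memoryAllHosts / 1GB)
'MemoryPerHost : {0} bytes = {1} GB' -f $memoryPerHost, ($memoryPerHost / 1GB)
```

So the snippet allows 16 GB across all provider hosts and 12 GB for a single host.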

Another approach we can take is to bump up the ExchangeOnlineManagement version to 3.5.1. For this to work, you can update the file C:\Program Files\WindowsPowerShell\Modules\Microsoft365DSC\1.24.731.1\Dependencies\Manifest.psd1 so that the version for it is now 3.5.1 instead of 3.4.0. After that, run Update-M365DSCDependencies to install the updated module (or Install-Module ExchangeOnlineManagement -RequiredVersion 3.5.1 -Force for direct install).

@JbMachtMuschel
Author

@jadamones Thank you... the quotaConfiguration seems to be an approach, but I guess that this will reach the 16 GB limit. I will let you know. Currently I am importing the rules and approx. 110 are already imported, but the memory consumption is very high:
image

@ricmestre
Contributor

ricmestre commented Aug 28, 2024

The EXO management module has been a known memory hog ever since it moved to the REST API, and it seems to get worse with each release; a web search shows several people complaining about this. Heck, there's even an official article [0] on how to reduce its memory usage, which by the way doesn't solve the problem, since the module also leaks the huge amounts of memory it allocates and never frees them!!! In addition, the module is prone to spitting out random 0x800... errors, so to work around this problem I always loop the deployment, up to a maximum of 3 attempts, if it fails.

I have this set up in DevOps pipelines which run on disposable containers, so they are memory constrained and always fail if no changes are made. With the code below, which is a variation of what @FabienTschanz posted before, I can get it going along with all the other workloads I use. Please bear in mind that this allows all available memory of the machine to be allocated to the WMI process (memory is not allocated all at once, only when required), so you may need to do some math to calculate better values for your requirements.
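The "loop the deployment until a max of 3 attempts" idea could look roughly like the sketch below. It is not the pipeline code referred to above: `Invoke-WithRetry` and its parameters are hypothetical names, and it only retries on terminating errors, so the wrapped command needs `-ErrorAction Stop`.

```powershell
# Hypothetical helper: run a script block, retrying on terminating errors.
function Invoke-WithRetry {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [scriptblock] $ScriptBlock,

        [ValidateRange(1, 100)]
        [int] $MaxAttempts = 3,

        [int] $DelaySeconds = 0
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # Success: pass the output through and stop retrying.
            return & $ScriptBlock
        }
        catch {
            # Out of attempts: rethrow the last error to the caller.
            if ($attempt -eq $MaxAttempts) { throw }
            Write-Warning "Attempt $attempt of $MaxAttempts failed: $($_.Exception.Message)"
            if ($DelaySeconds -gt 0) { Start-Sleep -Seconds $DelaySeconds }
        }
    }
}

# Example call (the path is a placeholder):
# Invoke-WithRetry -MaxAttempts 3 -ScriptBlock {
#     Start-DscConfiguration -Path 'C:\Path\To\MofFolder' -Wait -Force -ErrorAction Stop
# }
```

Because any output produced before a failure has already been passed on, this works best around commands whose result is a side effect rather than pipeline output, such as a deployment.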

    #region Increase Memory Quota for WMI processes
    # Read the machine's physical memory; -ErrorAction Stop aborts the script on any CIM failure.
    $ComputerSystem = Get-CimInstance -ClassName "Win32_ComputerSystem" -ErrorAction "Stop"
    $TotalPhysicalMemory = $ComputerSystem.TotalPhysicalMemory

    if ($TotalPhysicalMemory) {
        # The provider host quota configuration lives in the Root namespace.
        $Quota = Get-CimInstance -Namespace "Root" `
            -ClassName "__ProviderHostQuotaConfiguration" -ErrorAction "Stop"

        # Only change the quota if it does not already match the physical memory.
        if ($null -ne $Quota -and $Quota.MemoryAllHosts -ne $TotalPhysicalMemory) {
            $Quota.MemoryAllHosts = $TotalPhysicalMemory
            $Quota.MemoryPerHost = $TotalPhysicalMemory
            $Quota.HandlesPerHost = 8192
            $Quota.ThreadsPerHost = 512

            Write-Output "Increasing WMI processes memory quota"
            Set-CimInstance -InputObject $Quota -ErrorAction "Stop"

            # Stop the running WMI provider host processes.
            $WMIProcesses = @(Get-Process -Name "WMIPrvSE" `
                -ErrorAction "SilentlyContinue")
            if ($WMIProcesses.Count -ne 0) {
                Write-Output "Restarting WMI processes"
                foreach ($WMIProcess in $WMIProcesses) {
                    Stop-Process -Id $WMIProcess.Id -Force -ErrorAction "Stop"
                }
            }
        }
    }
    #endregion Increase Memory Quota for WMI processes

[0] https://techcommunity.microsoft.com/t5/exchange-team-blog/reducing-memory-consumption-of-the-exchange-online-powershell-v3/ba-p/3970086

@JbMachtMuschel
Author

@ricmestre Thank you for the information.
Just now the process shown in the screenshot above released memory as it reached 16 GB, and it is still running and importing:
image

@JbMachtMuschel
Author

I was able to import the ActiveSyncDeviceAccessRules now :) I did a retry to observe how DSC handles already created rules, and after a while the command threw this error:
VERBOSE: [HAMS010288]: LCM: [ End Set ]
The SendConfigurationApply function did not succeed.
+ CategoryInfo : NotSpecified: (root/Microsoft/...gurationManager:String) [], CimException
+ FullyQualifiedErrorId : MI RESULT 1
+ PSComputerName : localhost

VERBOSE: Operation 'Invoke CimMethod' complete.
VERBOSE: Time taken for configuration job to complete is 4292.016 seconds

Interestingly, the WMI process "never gives up": even after deleting all ActiveSyncDeviceAccessRules, it starts creating them again. PowerShell is not involved, as I terminated the shell. So I killed the WMI process now.

@jadamones

jadamones commented Aug 28, 2024

Thank you all for all the effort on this. Unfortunately, these haven't resolved my issue :( . I updated the host quota as suggested, and I attempted an upgrade of the Exchange Online module (maybe I'm missing something here). After upgrading the Exchange module to 3.5.1, I get this error MSFT_EXOOrganizationConfig failed to execute Test-TargetResource functionality with error message: IDX12729: Unable to decode the header '[PII of type 'System.String' is hidden. FWIW I'm running into this issue using the checkdsccompliancy.ps1 script in a DevOps pipeline.

Here are my WMI quotas (system has 32GB)
image

@JbMachtMuschel
Author

@jadamones I have had the same error with the 3.5.1 module and certificate based login. Just now I am using 3.4.0.

@jadamones

Oh SMH... 🤦‍♂️ I realized that I didn't reboot the machine after updating the quotas. That seems to have done the trick! Interesting that the documentation just says to restart the WMI service, but definitely seems to be working after a reboot now. Thank you all for your input here. Much appreciated!

@FabienTschanz
Contributor

@ricmestre Do we want to document this as a workaround somewhere? Just for the sake that it is documented and that there is something we can actually do to circumvent the issue?

@ricmestre
Contributor

@FabienTschanz I wouldn't mind if you added something, for example here [0], which looks rather empty. Both you and I know that there are other issues out there with no documented fix or workaround, but for this particular case I'd really like to see some kind of improvement in the module itself.

@NikCharlebois @ykuijs @andikrueger @desmay @malauter Hi, is this something that any of you can take to the EXO team in order to solve it or at least improve the experience? Running a cmdlet here and there is one thing, another one is trying to import the whole workload to other tenants using M365DSC, or even just trying to deploy a single resource but that contains hundreds or even thousands of children which will only exacerbate this known memory issue.

[0] https://microsoft365dsc.com/user-guide/get-started/troubleshooting/

@FabienTschanz
Contributor

@ricmestre I will add an entry in the troubleshooting section for this issue and raise a PR.

@JbMachtMuschel
Author

Shall I open another issue regarding this?
Interestingly, the WMI process "never gives up": even after deleting all ActiveSyncDeviceAccessRules, it starts creating them again. PowerShell is not involved, as I terminated the shell. So I killed the WMI process now.
To clarify:
The import of the more than 400 rules is done and the PowerShell command reflects this, but the WmiPrvSE process doesn't care. It seems it never stops unless someone intervenes.

@ricmestre
Contributor

That's how the LCM works, check https://learn.microsoft.com/en-us/powershell/dsc/managing-nodes/metaconfig?view=dsc-1.1 and the settings ConfigurationMode and ConfigurationModeFrequencyMins
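As a rough illustration of the two settings named above, the following sketch first reads them from the local LCM and then pushes a meta-configuration with `ConfigurationMode = 'ApplyOnly'`. It assumes Windows PowerShell 5.1 in an elevated session; the configuration name `LcmApplyOnly` and the output path are placeholders, and whether `ApplyOnly` is the right mode for the behaviour reported above is an assumption, not something confirmed in this thread.

```powershell
# Show the current LCM settings and state on the local node.
Get-DscLocalConfigurationManager |
    Select-Object ConfigurationMode, ConfigurationModeFrequencyMins, LCMState

# Meta-configuration: apply a pushed configuration once and do not
# re-check it afterwards. Name and output path are placeholders.
[DSCLocalConfigurationManager()]
configuration LcmApplyOnly
{
    Node localhost
    {
        Settings
        {
            RefreshMode       = 'Push'
            ConfigurationMode = 'ApplyOnly'
        }
    }
}

# Compile the meta-configuration and hand it to the LCM.
LcmApplyOnly -OutputPath 'C:\Temp\LcmApplyOnly'
Set-DscLocalConfigurationManager -Path 'C:\Temp\LcmApplyOnly' -Verbose
```

A configuration job that is still running after the console has been closed can normally be stopped with `Stop-DscConfiguration -Force` rather than by killing WmiPrvSE.exe.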

@JbMachtMuschel
Author

Hi again, I did several tests with different ExchangeOnlineManagement module versions and only 3.4.0 works; 3.5.0 and 3.5.1 both throw "Unable to decode the header '[PII of type 'System.String' is hidden".
Powershell 7 makes no difference.

Unfortunately the error is back. It seems that the WMI process does NOT release memory as in the image below; it crashes again. I then restarted the process and it released memory once, but a while afterwards it crashed again.

image

@ricmestre Thanks for the info, I was not aware of this.

@FabienTschanz
Contributor

This means that all resources are executed in the same PowerShell session. Only when the deployment is finished (successful or not) and I do not push another config within a couple of minutes, the runspace is ended and my debug session stops.

That's what I expected it to be as well, but this would mean that the $global scope should be available as well, but it isn't. I'll quickly check why the Confirm-M365DSCDependencies works over multiple resources - My guess is because the $script scope of the M365DSC module is kept "alive". I will report back with my findings.

About the event log for MSCloudLoginAssistant: Instead of just flooding the event log with many events, why not "hide" it behind a custom environment variable that can be set, which will then set the VerbosePreference to Continue instead of SilentlyContinue? That way we avoid cluttering the event log when it's totally unnecessary while still being able to enable debugging in the normal view when we want to.

@FabienTschanz
Contributor

FabienTschanz commented Nov 29, 2024

That's what I expected it to be as well, but this would mean that the $global scope should be available as well, but it isn't. I'll quickly check why the Confirm-M365DSCDependencies works over multiple resources - My guess is because the $script scope of the M365DSC module is kept "alive". I will report back with my findings.

Indeed, the $script scope is persisted over multiple resources. The $global scope is cleared after every resource, making it unusable for our purpose, unless we find a way to persist it over multiple runs.

If we cannot use the global scope, but the script scope is persisted, the logic of MSCloudLoginAssistant most likely has to cache all its state internally and expose it through public functions to retrieve. Currently, access to the $global scope is possible because every resource reauthenticates again, setting all of the connection information like $Global:MSCloudLoginConnectionProfile.MicrosoftGraph.ResourceUrl.

Edit: To be blunt, this should be the approach: Caching relevant state internally and expose it through specific APIs is best practice and much easier than knowing which global variable is responsible for what without documentation. The best documentation from an API is through its clear naming convention.
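A minimal sketch of that approach, using an in-memory module so it can be pasted into a console: the state lives in a `$script:` variable and is only reachable through exported functions. All names here (`ConnectionStateSketch`, `Connect-SketchWorkload`, `Get-SketchConnectionState`, `Reset-SketchConnectionState`) are hypothetical and are not the functions in MSCloudLoginAssistant; the "connect" function only records state and does not authenticate.

```powershell
# Hypothetical in-memory module illustrating module-private state.
$null = New-Module -Name 'ConnectionStateSketch' -ScriptBlock {
    # Lives in the module's script scope, not in $global:.
    $script:ConnectionProfile = @{}

    function Connect-SketchWorkload {
        param(
            [Parameter(Mandatory)] [string] $Workload,
            [Parameter(Mandatory)] [string] $ResourceUrl
        )
        # A real implementation would authenticate here; this only records state.
        $script:ConnectionProfile[$Workload] = @{
            Connected   = $true
            ResourceUrl = $ResourceUrl
        }
    }

    function Get-SketchConnectionState {
        param([Parameter(Mandatory)] [string] $Workload)
        if ($script:ConnectionProfile.ContainsKey($Workload)) {
            # Return a copy so callers cannot modify the cached state.
            return $script:ConnectionProfile[$Workload].Clone()
        }
    }

    function Reset-SketchConnectionState {
        # Forget all cached state so the next connect starts from scratch.
        $script:ConnectionProfile = @{}
    }

    Export-ModuleMember -Function Connect-SketchWorkload, Get-SketchConnectionState, Reset-SketchConnectionState
}

Connect-SketchWorkload -Workload 'MicrosoftGraph' -ResourceUrl 'https://graph.microsoft.com'
(Get-SketchConnectionState -Workload 'MicrosoftGraph').ResourceUrl
```

Callers read a value through the getter instead of reaching into something like `$Global:MSCloudLoginConnectionProfile` directly, which is what makes the naming of the functions double as documentation.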

@Borgquite
Contributor

Borgquite commented Nov 29, 2024

I'm not qualified to comment authoritatively on the 'big picture' solution here, but wanted to mention a couple of details which may be relevant:

On verbose logging: When trying to work out why Verbose Logging was so difficult to enable, I learned that in PowerShell, code executed in modules doesn't inherit the VerbosePreference from the calling script. Could we implement this from Stack Overflow in some of the modules so that it's not hard for end users to enable (or could we change modules like MSCloudLoginAssistant to log using Write-Debug, i.e. Verbose logging is used for resource-level logging, and Debug is used for modules like MSCloudLoginAssistant?)

https://stackoverflow.com/questions/44900568/how-to-propagate-verbose-to-module-functions

On $Global vs $Script: what @FabienTschanz says makes sense based on what 'appears to work' - $script is persisted between resources, but $global is not - that's why Confirm-M365DSCDependencies works whereas the other stuff doesn't.

On why debug mode may work differently: a couple of possibly relevant quotes

When debugging DSC

the script is run in a special LCM debug mode

(so behaviour may change).

And in general

the LCM resets the runspace each time it runs a class method.

(which may explain why $Global does not persist).

https://devblogs.microsoft.com/powershell/debugging-powershell-dsc-class-resources/

Also someone asking what appears to be the same question (although in May 2016 so don't know if things changed e.g. when PowerShell 5.1 was released in August 2016)

Unfortunately due to how CIM is implemented the runspace is reset between invocations so even a globally scoped variable won’t be accessible.

https://forums.powershell.org/t/passing-variable-between-dsc-resources-at-runtime/6397/2

On general points

Please be aware that all my logs above are from Microsoft365DSC 1.24.904.1 - it's a bit out of date, but I've found upgrades very painful in the past (not only due to issues with this project, but also assembly issues with the underlying PowerShell modules interacting with other PowerShell modules on the same server) - just be aware to avoid any head-scratching as you debug with the latest version.

Also I am compiling everything using Azure Desired State Configuration so this is Windows PowerShell 5.1 throughout - I don't think this is relevant since I'm fairly sure this is an LCM-related issue and the LCM isn't even supported in PowerShell 7+, but just wanted to mention it.

@FabienTschanz
Contributor

Thank you @Borgquite for the two links, I was not aware of them. As you said, they confirm what I suspected unfortunately for Start-DscConfiguration. Now we need some strategy on how to deal with this issue and what the consequences are with cross-dependencies and cross-module access using $global scope. This will be difficult, unless we update MSCloudLoginAssistant and Microsoft365DSC in the same release, otherwise if we switch away from the $global scope, everything will break.

@Borgquite
Contributor

Borgquite commented Nov 29, 2024

@FabienTschanz No worries. Have we confirmed that the connections themselves to Microsoft Graph, Exchange Online et al are able to survive when the LCM resets the runspace? (i.e. it's not necessary to run Connect-* again after the reset?)

If not, this is all a bit moot?

@FabienTschanz
Contributor

@Borgquite Get-MgContext returns a state even after switching to another resource. If it was to be reset (or disconnected), Get-MgContext would return $null. I did check that, but I need to verify if actually calling e.g. Get-MgGroup works. Currently also working on another Intune resource.

@Borgquite
Contributor

Borgquite commented Nov 29, 2024

@FabienTschanz Yep - I guess it would be good to specifically check Exchange too, as I know that module can work quite differently to the rest.

(Myself also working on other DSC stuff so appreciate this is 'as we get time :)'

(NB: Even if the modules don't persist, the specific Exchange memory issue is still solvable by creating a new PowerShell process for each Exchange Online connection, per tip 3 in the article below. This should solve the memory issue, but would be a last resort, as it doesn't help performance-wise.)

https://techcommunity.microsoft.com/blog/exchange/reducing-memory-consumption-of-the-exchange-online-powershell-v3-module/3970086
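One way to picture the "new PowerShell process per connection" idea is the sketch below, which uses `Start-Job` because each background job runs in its own child process whose memory is returned when the process exits. `Invoke-InChildProcess` is a hypothetical helper, results come back as deserialized objects, and whether jobs behave well inside the LCM is an untested assumption.

```powershell
# Hypothetical helper: run a script block in a separate PowerShell process.
function Invoke-InChildProcess {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [scriptblock] $ScriptBlock,

        [object[]] $ArgumentList
    )

    $jobParams = @{ ScriptBlock = $ScriptBlock }
    if ($ArgumentList) { $jobParams.ArgumentList = $ArgumentList }

    # Start-Job launches a child process for the job.
    $job = Start-Job @jobParams
    try {
        # Wait for completion and surface job errors as terminating errors.
        Receive-Job -Job $job -Wait -ErrorAction Stop
    }
    finally {
        # Removing the job also ends its child process.
        Remove-Job -Job $job -Force
    }
}

# The child reports a process ID that differs from the caller's $PID.
Invoke-InChildProcess -ScriptBlock { $PID }
```

In a real use the script block would connect to Exchange Online, run the cmdlets it needs and return plain data, so the cost is one new connection per call.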

@FabienTschanz
Contributor

@Borgquite I can confirm that also the authentication context of ExchangeOnline seems to survive. I was able to call Get-AcceptedDomain without problems starting from the second resource before again authenticating, which means that the context from first resource was carried over.

@ykuijs and @NikCharlebois: Before we start throwing around everything, what do you think of the following proposition:

  • Switch from $global to $script scope
  • Expose public functions to get the current state
  • Modify the state only internally in MSCloudLoginAssistant

This way, we would ensure that the whole thing is only modifiable inside of MSCloudLoginAssistant, removing the potential hazard of somebody interacting with our authentication context, and it also fixes the issue with the unavailable $global scope inside of Start-DscConfiguration.

@FabienTschanz
Contributor

@ykuijs @NikCharlebois Please give us your opinion so that we can continue towards a working version. Thank you.

@ykuijs
Member

ykuijs commented Dec 6, 2024

Indeed, the $script scope is persisted over multiple resources. The $global scope is cleared after every resource, making it unusable for our purpose, unless we find a way to persist it over multiple runs.

If we cannot use the global scope, but the script scope is persisted, the logic of MSCloudLoginAssistant most likely has to cache all its state internally and expose it through public functions to retrieve. Currently, access to the $global scope is possible because every resource reauthenticates again, setting all of the connection information like $Global:MSCloudLoginConnectionProfile.MicrosoftGraph.ResourceUrl.

Excellent work in discovering this. Never knew this and always assumed the Global scope acted the same way as in normal PS sessions.......assumptions 😄

About the event log for MSCloudLoginAssistant: Instead of just flooding the event log with many events, why not "hide" it behind a custom environment variable that can be set, which will then set the VerbosePreference to Continue instead of SilentlyContinue? That way we avoid cluttering the event log when it's totally unnecessary while still being able to enable debugging in the normal view when we want to.

I like that idea. This environment variable needs to be a system wide variable......it is possible to create a user session environment variable as well. Since the LCM is running as System, it doesn't have access to the user session variable.
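A sketch of such a switch, assuming a made-up variable name `SKETCH_LOGIN_VERBOSE` (the thread does not name one) and a hypothetical function `Get-SketchVerbosePreference`. It reads the variable at machine level by default, which matches the point about the LCM running as System; the `-Target` parameter exists only so the logic can be tried with a process-level variable.

```powershell
# Hypothetical: map an environment variable to a VerbosePreference value.
function Get-SketchVerbosePreference {
    param(
        # Made-up variable name, used for illustration only.
        [string] $Name = 'SKETCH_LOGIN_VERBOSE',

        # Machine-level by default so a process running as System sees it.
        [System.EnvironmentVariableTarget] $Target = 'Machine'
    )

    $value = [System.Environment]::GetEnvironmentVariable($Name, $Target)

    # Treat '1' or 'true' (case-insensitive) as "verbose on".
    if ($value -in @('1', 'true')) { 'Continue' } else { 'SilentlyContinue' }
}

# Inside a module, the result could be assigned before logging starts:
# $VerbosePreference = Get-SketchVerbosePreference
```

On Windows the machine-level variable could be set from an elevated session with `[System.Environment]::SetEnvironmentVariable('SKETCH_LOGIN_VERBOSE', '1', 'Machine')`.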

Just one question: Within M365DSC we already log a lot. If we add the verbose logging of the MSCloudLoginAssistant to that, doesn't the output become huge and therefore difficult to read? We could combine both ideas to only log to the event log after setting the env variable.

  • Switch from $global to $script scope
  • Expose public functions to get the current state
  • Modify the state only internally in MSCloudLoginAssistant

Great idea, totally agree! We should also add a function to reset the current state (Disconnect all sessions correctly, clear the variable data, etc). Just to make sure it is easy to start over.

@FabienTschanz
Contributor

@ykuijs Thank you very much for your opinion. Then I'll go ahead and migrate from the global to script scope, expose some functions to fetch and clear the state and also move the output of MSCloudLoginAssistant to the event log instead of the console.

@Borgquite
Contributor

@FabienTschanz @ykuijs Thanks guys! Hopefully this will improve performance as well as solve the high memory usage issue - a really impactful change.

@FabienTschanz
Contributor

@ykuijs @NikCharlebois PR open: microsoft/MSCloudLoginAssistant#188. There is a dependency for some other functionality necessary: See microsoft/MSCloudLoginAssistant#185.

I tested it on my machine and it seems like the connection now is cached and used for all subsequent runs of any *-DscConfiguration command. No more connection retries, no entries in the event log after the initial login (which indicates that the connections are not being created again).

@ykuijs
Member

ykuijs commented Dec 9, 2024

Excellent work. I just reviewed the PR and have one small question. After that, we can merge this PR and release a new version, which can then be included in the next version of M365DSC.

@ykuijs
Member

ykuijs commented Dec 11, 2024

@FabienTschanz Nik and I just merged the PRs in the MSCloudLoginAssistant. During the release of the new version of M365DSC we discovered an issue that we hadn't thought of:
The code in M365DSC is using the global MSCloudLoginConnectionProfile variable. There are 152 references to that variable in the code.

Would you have the opportunity to change these to the new Get function?

@FabienTschanz
Contributor

@ykuijs PR already open: #5540

@ykuijs
Member

ykuijs commented Dec 11, 2024

Man, you are quick 😄 👍

@FabienTschanz
Contributor

Indeed 😆 I was so surprised to see it being merged and just released, I expected it to not be released for about another day so we would have time to properly migrate and check if it actually works. I expect it to work, but don't want to guarantee it 😓

@FabienTschanz
Contributor

FabienTschanz commented Dec 12, 2024

Soo, after using it for today, it seems very much okay. I ran it repeatedly against a quick setup of mine using a couple of EXO distribution and AAD groups, with everything switching between them. Even after five consecutive runs, the memory stayed roughly the same after the first run (around 1.3GB, +/- 250MB). I can't say what it looks like for a configuration file with, let's say, 300 items; my biggest setup was ~50 elements on my test tenant.

Very curious to see what others get - A couple of optimizations in the module will follow. Please post your experiences with the new version here.
@ykuijs @NikCharlebois Can we pin this issue to the top of the repository? So we don't lose track of it? And can we maybe close some other issues that reported a high memory usage so we only track it in this one? Or should we open a separate issue that only deals with performance improvements?

@Borgquite
Contributor

Borgquite commented Dec 17, 2024

@FabienTschanz I updated Microsoft365DSC to 1.24.1211.1 yesterday, and it was very promising! The configuration which was previously taking multiple hours (increasing as the memory is consumed) now appears to be consistently taking 7 minutes :) Great work!

I've set the system-wide environment variable 'MSCLOUDLOGINASSISTANT_WRITETOEVENTLOG' to $true to see what's happening - as far as I can tell, I still receive a lot of 'Connect-M365Tenant' 'Resetting connection profile' events though (using the configuration file I previously sent you via LinkedIn) - is that expected?
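For anyone wanting to reproduce this, a minimal sketch of setting the variable system-wide follows. The function name Set-LoginAssistantEventLogFlag is hypothetical, and the exact string the module expects is an assumption ('true' is used here). The Machine target matters because the LCM runs as System and cannot see user-level variables; setting it requires an elevated session.

```powershell
# Hypothetical helper: sets or removes the event log flag at the given scope.
function Set-LoginAssistantEventLogFlag {
    param(
        [bool]$Enabled = $true,
        # Machine scope is needed for the LCM; Process scope is handy for a quick test.
        [System.EnvironmentVariableTarget]$Target = [System.EnvironmentVariableTarget]::Machine
    )

    $name = 'MSCLOUDLOGINASSISTANT_WRITETOEVENTLOG'

    # Assumption: the module treats the string 'true' as enabled.
    # Passing $null removes the variable again.
    $value = if ($Enabled) { 'true' } else { $null }

    [System.Environment]::SetEnvironmentVariable($name, $value, $Target)

    # Return what is now stored at that scope.
    [System.Environment]::GetEnvironmentVariable($name, $Target)
}
```

Calling Set-LoginAssistantEventLogFlag from an elevated prompt sets the flag for the whole machine. Processes that are already running, including a running LCM host, may keep their old environment until they restart.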

@FabienTschanz
Contributor

@Borgquite Thanks so much for testing and super cool it's now massively quicker. Some resources need to perform a reset of the connection profile, so that might be explicit, e.g. EXOManagementRoleAssignment (which needs it for permissions to be effective) or O365OrgSettings. If I'm not mistaken, you do have some EXO role assignments, correct? That's most likely why.

# Need to force reconnect to Exchange for the new permissions to kick in.
if ($null -ne (Get-MSCloudLoginConnectionProfile -Workload ExchangeOnline))
{
    Write-Verbose -Message 'Waiting for 20 seconds for new permissions to be effective.'
    Start-Sleep 20
    Write-Verbose -Message 'Disconnecting from Exchange Online'
    Reset-MSCloudLoginConnectionProfileContext
}

I don't know if this is necessary. My PR #5565 was just merged, so only the requested context will be reset (instead of every workload). This will probably also help a bit for the future. Do you have a couple of logs to share?

@Borgquite
Contributor

Borgquite commented Dec 18, 2024

@FabienTschanz Yeah - the 'Resetting connection profile' events have 'Connect-M365Tenant' as the source, and many relate to MicrosoftGraph (not Reset-MSCloudLoginConnectionProfileContext or EXO) so I don't think that fix will help. I would share logs, but it's quite difficult to show how this relates to different resources running, since the Verbose logging for the resources in the PowerShell window is now separate from the MSCloudLoginAssistant event logs (I do wonder if we'd be better off making that environment variable just enable Verbose logging for MSCloudLoginAssistant, so it's easier to see the interaction, even despite the concern about noise).

I did wonder whether the Compare-InputParametersForChange call might be faulty. As far as I am aware, all my connection strings are identical (see the files I sent you earlier). If you can't immediately spot a potential issue, let me know what logs you need to troubleshoot this and in what format, and I'll do my best to share.

@Borgquite
Contributor

Borgquite commented Dec 19, 2024

@FabienTschanz Have been looking through the logs & am fairly sure this 'reconnect lots and lots' issue is only affecting my Graph resources, not Exchange Online (I can see in the Azure sign in logs that the LCM connects to EXO roughly every 15 minutes, i.e. once per LCM session, whereas the LCM connects to Microsoft Graph every few seconds).

Just wondering if this could be because I haven't defined TenantId for my Microsoft Graph resources (it works fine without). Is this possibly confusing Connect-M365Tenant, when it calls Compare-InputParametersForChange to check if it should reset the connection profile?

Sorry - above statement is incorrect, I just wasn't seeing the TenantId in my configs for Graph, but it's there. Nonetheless there may be an issue with Connect-M365Tenant when it comes to Graph, which results in connection resets, that doesn't affect EXO.

@FabienTschanz
Contributor

@Borgquite Alright, I'll take a look at it once I have time with a (hopefully) large enough sample of my tenant.

@FabienTschanz
Contributor

@Borgquite I just created a PR that will fix those Microsoft Graph reconnections. It was indeed the Compare-InputParametersForChange which was faulty.
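To illustrate the kind of check involved, here is a minimal sketch of comparing cached connection parameters against the ones requested for the next resource. It is not the actual Compare-InputParametersForChange implementation and does not reproduce the bug fixed in the PR; the function name Test-DemoParametersChanged and the parameter names in the usage are hypothetical.

```powershell
# Hypothetical helper: returns $true when any connection parameter differs.
function Test-DemoParametersChanged {
    param(
        [hashtable]$Cached,
        [hashtable]$Requested
    )

    # Look at every key that appears on either side.
    $keys = @($Cached.Keys) + @($Requested.Keys) | Select-Object -Unique

    foreach ($key in $keys) {
        # Casting to string makes a missing key and an empty value compare as equal.
        $old = [string]$Cached[$key]
        $new = [string]$Requested[$key]

        # -ne on strings is case-insensitive in PowerShell.
        if ($old -ne $new) { return $true }
    }

    return $false
}
```

A connection reset would then only be triggered when the function returns $true, for example when the TenantId differs between the cached and the requested parameters. If a check like this reports a change when nothing changed, every resource forces a new connection, which matches the symptom described above.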

@Borgquite
Contributor

@FabienTschanz Awesome! I'm on Christmas break (visiting Switzerland with family!) from now until January, but it feels like you've nailed it. I'll test and confirm when I get back :)

Happy Christmas!


6 participants