Resolved – Trending Issue: Distributed Cache problems after applying January 2023 CU

With the January 2023 CU for SharePoint Server 2016, 2019 and Subscription Edition we released a security fix that increased transport security for communication between the SharePoint applications and the Distributed Cache cluster.

This fix introduced a problem for some customers, which until now could only be resolved using the workaround described in my blog post.

With the May 2023 CU we released a fix for this issue.

 
Update from May 25th, 2023:
Thanks to several customer reports we identified an additional aspect of the distributed cache problem introduced with January 2023 CU which is not addressed by this fix:

Application pools using a managed account that was added after the farm was created might not be able to talk to the Distributed Cache server.
The reason is that these accounts are not automatically added to the WSS_WPG security group on dedicated Distributed Cache servers where no web applications are provisioned.

In a classic MinRole farm with separate machines for WFE, Distributed Cache, Search and Application Server, the WSS_WPG group is only updated on servers holding the WFE or the Application Server role, but not on servers holding the Distributed Cache or Search role. As a result, granting read permission to the WSS_WPG group on the remote registry access key on the Distributed Cache servers does not grant permissions to these application pool accounts.

A fix for this is now in development.

Workaround: Manually add all accounts used as application pool identities for SharePoint web applications to the WSS_WPG security group on each Distributed Cache server.
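To make the workaround a bit more concrete, here is a minimal sketch of how the missing group memberships could be worked out. It is only an illustration: the account names and the list of current WSS_WPG members are made-up sample data, the helper functions `missing_members` and `build_commands` are hypothetical, and the script only prints standard Windows `net localgroup ... /add` commands. It does not change anything itself.

```python
# Illustrative sketch only: account names and helper names are assumptions,
# not part of SharePoint or of the fix described in this post.

def missing_members(app_pool_accounts, group_members):
    """Return the app pool accounts that are not yet in the group.

    Windows account names are case-insensitive, so compare in lower case.
    """
    present = {member.lower() for member in group_members}
    return [acct for acct in app_pool_accounts if acct.lower() not in present]


def build_commands(accounts, group="WSS_WPG"):
    """Build 'net localgroup' commands to run from an elevated prompt
    on the Distributed Cache server."""
    return [f'net localgroup {group} "{acct}" /add' for acct in accounts]


# Sample data (assumed): identities used by web application pools in the farm
app_pool_accounts = ["CONTOSO\\sp_apppool1", "CONTOSO\\sp_apppool2"]
# Sample data (assumed): current WSS_WPG members on one Distributed Cache server
current_members = ["CONTOSO\\sp_farm", "contoso\\SP_AppPool1"]

to_add = missing_members(app_pool_accounts, current_members)
for command in build_commands(to_add):
    print(command)
```

With the sample data above, only the second account is reported as missing, because the first one is already a member under a different casing. In a real farm the two input lists would have to be collected from the farm and from each Distributed Cache server.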

17 Comments


  1. Thank you Stefan for the update on this issue.


    1. You’re welcome!
      🙂


  2. Sorry for the duplicate reply … this post is a better spot for my experience.

    With respect to the fix applied in the May CU: I have installed the May CU in a 2016 and a 2019 test environment. Both gave me the following error when running the configuration wizard on the first server.

    Failed updating SecurityMode and ProtectionLevel in cluster. [Exception=System.Management.Automation.CmdletInvocationException: Cannot start service AppFabricCachingService on computer 'wps-spsrch-t05.corp.wpsic.com'.
     ---> System.InvalidOperationException: Cannot start service AppFabricCachingService on computer 'wps-spsrch-t05.corp.wpsic.com'.
     ---> System.ComponentModel.Win32Exception: The service cannot be started, either because it is disabled or because it has no enabled devices associated with it
     --- End of inner exception stack trace ---
     at System.ServiceProcess.ServiceController.Start(String[] args)
     at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.StartHostHelper(IHostConfiguration hostConfig, Boolean clusterCommand, Int32 hostTimeout)
     at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.StartExternalManagedCluster(IClusterConfigurationReader reader, ProgressDelegate progressDelegate, WriteErrorDelegate writeError, Int32 hostTimeout)
     at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.StartCluster(ProgressDelegate progressDelegate, WriteErrorDelegate writeError, Nullable`1 quorumTimeout, Nullable`1 hostTimeout)
     at Microsoft.ApplicationServer.Caching.Commands.StartAFCacheClusterCommand.BeginProcessing()
     at System.Management.Automation.Cmdlet.DoBeginProcessing()
     at System.Management.Automation.CommandProcessorBase.DoBegin()
     --- End of inner exception stack trace ---
     at System.Management.Automation.Runspaces.PipelineBase.Invoke(IEnumerable input)
     at Microsoft.SharePoint.DistributedCaching.Utilities.SPVelocityPowerShellWrapper.StartCacheCluster(String provider, String connectionString)
     at Microsoft.SharePoint.Upgrade.UpdateClusterConfigurationInSPDistributedCache.Upgrade()]

    I dumped the cluster configuration and it appears to be correct per the article posted.

    The AppFabricCachingService was disabled on this server. Why would this service be disabled, and since the cluster configuration is correct, do I need to worry about this?


    1. Hi Tom,
      indeed the config is correct.
      I don’t know why the service is disabled in your environment but this is what caused the “Cannot start service AppFabricCachingService on computer ‘wps-spsrch-t05.corp.wpsic.com’” error.
      I would suggest to manually enable and start the service.
      Cheers,
      Stefan


      1. It appears the reason the AppFabric Caching Service is disabled is that the Distributed Cache service (DCS) is not provisioned on that server. I’ve read that best practice is NOT to run DCS on App servers. If I enable the AppFabric Caching Service manually, it is shut down after about an hour. I assume SharePoint is shutting it down. I’ve seen this occur with other services, like Search.

        Should DCS be provisioned on all servers in the farm? Seems to go against some articles I have read. https://learn.microsoft.com/en-us/sharepoint/administration/plan-for-feeds-and-the-distributed-cache-service. I guess I need to dig into this further. 🙁


  3. We are still seeing the following in the ULS log after the May CU in the 2019 farm. Is the issue really fixed in this CU?

    w3wp.exe (0x285C) 0x2DA4 SharePoint Foundation DistributedCache ah24w Unexpected Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage ‘DistributedLogonTokenCache’ – Exception ‘Microsoft.SharePoint.Internal.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. …’


  4. We have dedicated Distributed Cache servers that don’t have the web applications deployed to them. As a result, even after patching with the May CU, we had to manually add the app pool account to the WSS_WPG group on the DC servers as documented in the fix in the blog.


    1. Hi Robert,
      thanks for the report!
      I understand the scenario and will circle it back to the product group.
      Cheers,
      Stefan


      1. Hi Robert, we have not been able to reproduce this issue.
        Can you please confirm the SharePoint version you are using that caused the issue?
        Thanks,
        Stefan


        1. SharePoint Server 2019 with the December 2022 CU.


          1. I meant to say, we had SharePoint Server 2019 with the December 2022 CU before encountering this with the May 2023 CU.


  5. There is still a bug. When a new application pool is provisioned, the application pool account of the affected web application is only added as a member of the WSS_WPG group on servers that have the Web Front End (WFE) or Application MinRole. The account is not added to the WSS_WPG group if the server only has the “Distributed Cache” MinRole. This causes “ah24w” errors in the ULS log even after the May 2023 CU.

    https://learn.microsoft.com/en-us/sharepoint/administration/description-of-minrole-and-associated-services-in-sharepoint-server-2016


    1. Hi Girish,
      thanks for the report!
      I understand the scenario and will circle it back to the product group.
      Cheers,
      Stefan


      1. Hi Girish, we have not been able to reproduce this issue.
        Can you please confirm the SharePoint version you are using that caused the issue?
        Thanks,
        Stefan


        1. We were on December 2022 CU for SharePoint Server 2019 and we didn’t see “ah24w” errors in the ULS log, but as soon as we deployed the May 2023 CU, we started seeing them.

          We skipped CUs from Jan 2023 to April 2023 due to the known issue with the Jan CU which was not fixed until the May CU.

          Again, we have servers running just the “Distributed Cache”, other servers just running the “Front End”, and other servers just running the “Application” minrole in this farm.


  6. While running PSConfig in one farm I got the following:
    “UpdateClusterConfigurationInSPDistributedCache (16.0.7.3) 4wtrk ERROR Failed updating SecurityMode and ProtectionLevel in cluster. [Exception=System.Management.Automation.CmdletInvocationException: ErrorCode:SubStatus:Hosts are already running in the cluster.”

    I noticed that the AppFabric Caching service was stuck on Stopping. I ended the DC process in Task Manager and started the service again. I then ran PSConfig again, but when I noticed the service was stuck on Stopping once more, I ended the process again and PSConfig finished successfully. I re-ran PSConfig a third time on the server, didn’t have a problem with the service, and it finished successfully.


  7. Dear Stefan,
    We ran into the issue after patching with the March 2023 CU, applied your fix, and it was resolved. Recently we patched with the May 2023 CU, and during PSConfig the issue appeared again on the distributed cache/WFE nodes: Unexpected error while executing ExportCacheClusterConfig with parameters provider: ‘SPDistributedCacheClusterProvider’
    BUT! After running PSConfig a second time it completed successfully and the cluster security settings are correct. The AppFabric Caching services are up and running. So is this normal behaviour, and is no further action needed?
    Thanks
    Andrei.

