Thursday, January 16, 2014

High Processor Utilization on Lync 2013 Front-End Servers

UPDATE (2015-Oct-02): The problem has finally been fixed! It took nearly 2 years, but here's a link to the KB article. The fix is in the September 2015 CU for Lync Server 2013. According to the KB article, the issue occurs "because the topology snapshot was recomputed multiple times by the Response Group Service." Thanks to @sublimeashish for telling me.

We have a customer who is about to migrate from Lync 2010 to Lync 2013. They've got a few lightly loaded Lync 2013 Enterprise Edition pools with 3 servers each. All are running Windows Server 2008 R2 Standard Edition on VMware. All patches are up-to-date.

For inexplicable reasons, some of the servers will suddenly see their processor utilization spike to near 100% for extended periods of time, when their typical utilization is less than 5%. A look at Task Manager shows two instances of w3wp.exe (the IIS worker process) consuming large amounts of processor resources. There are no events in the Event Logs to indicate an issue.

Performing an IISReset on the affected node returns processor utilization to normal, but this is obviously not a real solution. We opened a ticket with Microsoft PSS, and they confirmed there are others seeing the same thing. It seems the source of the problem is the "garbage collection" process in the LyncIntFeature and LyncExtFeature application pools in IIS. Recycling those pools makes processor utilization return to normal (for a while at least).
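Task Manager doesn't show which w3wp.exe instance belongs to which application pool. As a rough sketch (not something from the PSS ticket), the pool name can be read from each worker process's command line, since IIS starts w3wp.exe with an -ap "<pool name>" argument. The function names below are my own for illustration, and this assumes an elevated Windows PowerShell session on the front-end server:

```powershell
# Extracts the application pool name from a w3wp.exe command line.
# IIS launches each worker process with an -ap "<pool name>" argument.
function Get-AppPoolNameFromCommandLine {
    param([string]$CommandLine)
    if ($CommandLine -match '-ap\s+"([^"]+)"') {
        return $Matches[1]
    }
    return $null
}

# Lists the local w3wp.exe processes together with their app pool names.
# Needs an elevated session, otherwise CommandLine may come back empty.
function Get-WorkerProcessPool {
    Get-WmiObject Win32_Process -Filter "Name = 'w3wp.exe'" | ForEach-Object {
        New-Object PSObject -Property @{
            ProcessId = $_.ProcessId
            AppPool   = Get-AppPoolNameFromCommandLine $_.CommandLine
        }
    }
}
```

Running Get-WorkerProcessPool and matching the ProcessId values against the PID column in Task Manager should show whether the busy processes are the LyncIntFeature and LyncExtFeature pools.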

Microsoft is actively working to resolve the issue, and I will post a permanent solution for all to see as soon as one becomes available.

UPDATE: Thanks to @dannydpa on Twitter, it appears the trigger may be Lync topology publishing. I confirmed this by updating the topology and publishing it. Less than 10 minutes later, all the servers' processor utilization spiked. Recycling the aforementioned app pools resolved the issue.
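To see which servers are affected after a topology publish, one option is to sample the processor counter remotely and pick out the servers above a threshold. This is only a sketch: the function names, the 90% threshold, and the sampling interval are my own assumptions, and Get-Counter with -ComputerName requires that performance counters are reachable remotely on each server.

```powershell
# Averages total CPU utilization on a remote server over a few samples.
function Get-AverageCpu {
    param([string]$ComputerName, [int]$Samples = 5)
    $Sets = Get-Counter -ComputerName $ComputerName `
        -Counter '\Processor(_Total)\% Processor Time' `
        -SampleInterval 2 -MaxSamples $Samples
    ($Sets | ForEach-Object { $_.CounterSamples[0].CookedValue } |
        Measure-Object -Average).Average
}

# Returns the names of servers whose CPU is at or above the threshold.
# $CpuByServer maps server name -> average CPU percentage.
function Select-BusyServer {
    param([hashtable]$CpuByServer, [double]$ThresholdPercent = 90)
    $CpuByServer.GetEnumerator() |
        Where-Object { $_.Value -ge $ThresholdPercent } |
        ForEach-Object { $_.Key } |
        Sort-Object
}

# Example usage (server names are placeholders):
# $Cpu = @{}
# 'fe01.contoso.local', 'fe02.contoso.local' | ForEach-Object { $Cpu[$_] = Get-AverageCpu $_ }
# Select-BusyServer -CpuByServer $Cpu
```

Keeping the threshold check separate from the counter sampling means the same filter can be used on numbers collected some other way, such as from a monitoring tool.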

To help others with this issue, I've created a little PowerShell script that will recycle the LyncIntFeature and LyncExtFeature app pools on all Lync servers. For the script to work, you need to make sure that remote management is enabled on all Lync servers. On Windows Server 2012, this is on by default, but on Windows Server 2008 R2, you need to log on locally and run Enable-PSRemoting -Force before running the script.

# Get the FQDN of every pool that hosts Lync web services
$WebPools = (Get-CsService -WebServer).PoolFqdn

ForEach ($Pool in $WebPools)
{
  $PoolMembers = (Get-CsPool $Pool).Computers
  ForEach ($Computer in $PoolMembers)
  {
    Write-Host "Recycling LyncExtFeature and LyncIntFeature app pools on $Computer"
    # One remote call per server; Invoke-Command opens and closes its own session
    Invoke-Command -ComputerName $Computer -ScriptBlock {
      # Import explicitly in case module auto-loading is unavailable in the remote session
      Import-Module WebAdministration
      Restart-WebAppPool LyncExtFeature
      Restart-WebAppPool LyncIntFeature
    }
  }
}

15 comments:

  1. The CPU spike only happens when adding or removing an object from the topology. Changing a value does not cause this. It also appears to be somehow related to response groups.

    If these are VMs, disable NUMA support.

  2. We have NUMA spanning disabled and also made sure there is no CPU overcommit. We have a little over 100 RGS workflows on our pool. We created all of them via PowerShell on our new Lync 2013 pool; we never migrated them from Lync 2010 to Lync 2013. Microsoft is investigating new traces. Hopefully they find something.

    Replies
    1. Seems as though MS is homing in on response groups, but we had just one for testing. Probably not the cause in our case.

      Ken

  3. From what I am seeing, it only affects the pool that hosts the CMS as well. We have no Response Groups but it appears an addition to the topology caused it.

    Replies
    1. We saw this happen on the same pool in December, and it wasn't the CMS at the time. It is now, but I suspect the CMS isn't part of the issue.

  4. Had the same issue and opened a ticket with MS.
    They saw nothing suspicious, but CPU was around 60% at all times.
    Figured out the Call Park Service consumes up to 70% CPU at times, and the worst part is that it's not being used at all.
    Disabled CP in all policies, same issue. Restarted the service, same issue.
    Only fixed after I manually removed it from "Programs and Features" and stopped the service.
    CPU is now around 10-15% at most times.

  5. We have had this high CPU spike on the w3wp processes when publishing topology changes. We had this issue even before we upgraded to 2013, though. Shortly after installing our first 2013 pool we had the issue again, but it did seem to have an impact on other 2010 servers in the pool, and it was not just restricted to servers hosting the CMS. It does not seem to happen every time: we have made several topology additions without having the issue.
    I have just had the issue now, after the topology was modified to remove a 2010 pool after migration. In this instance all four cores on the CMS hosting server were at 100%, with two of the eight or so w3wp.exe processes consuming the resource between them. One strange detail in this instance is that I only noticed it because my remote PowerShell session timed out connecting, and I only noticed that because I was attempting to access the rgsconfig page on a 2010 pool. I checked the CPU on the 2010 front end and it was really low, but there was an RGS error in the Lync event log and an ASP.NET error in the Application event log.

    Mike Dickin
    Hempel

  6. Problem Solved:
    I had this problem on a 2 node Enterprise FE pool. It started on one of the front ends, and then spread to the second one. After a lot of investigation, the solution was to apply the SQL 2012 Express SP1 on both the LYNCLOCAL and the RTCLOCAL instances. The LYNCLOCAL instance would not install using the unattended install but did install using the GUI install.

  7. Hi Ken,
    I have found some additional problems. IISReset helps, but I have seen a huge amount of traffic coming in via the load balancer. I'm still investigating this. It's unclear why, because there are no active users right now.
    I can confirm it is related to topology publishing. I will check where the traffic on the LB is coming from. If you check netstat -ano, you will see the connections from the LB.
    More when I find something interesting.
    Cheers
    Thomas

    Replies
    1. I need to correct my last statement: nothing is solved.
      After reading the internal IIS log files, I figured out that we have issues every time a Lync client/user logs on. After they log on, LSASS.EXE on the client consumes nearly 100% CPU. At the same time, on the Front End servers, we see error 500 in the IIS logs on RSGClient and also WebSessionTickets.
      The clients are all Windows 7, and most of them are Citrix users with roaming profiles.
      So now we are checking whether it's a well-known issue with roaming profiles.
      If not, we must open a support ticket. I'll let you know what the solution is.

  8. Michael Papalabrou, June 6, 2014 at 9:45 AM

    Same issue here - after changing a file store in the topology, the 2013 FE server of another site (not related to the change) spiked. iisreset helped. Network utilisation not high in our case.

  9. I have a customer that is experiencing the same issue with high CPU on all FEs after making a change in the topology (6 servers, 2 pools). It is not restricted to the CMS pool and is resolved by performing an IISRESET or recycling the application pools (LyncIntFeature, LyncExtFeature). We applied the 8/5/14 Lync Server updates and the issue still persists. Has anyone found a resolution to this issue?
    Thanks, -John Lockett

    Replies
    1. Hey John, this is definitely on the radar at Microsoft. Some MS folks have asked some of us MVPs for some information about it, so we may see some activity soon.

      Ken

    2. Any update on this issue?

      We have 4 Lync pools (Lync 2010). It looks like after adding a new pool and publishing the topology, the FEs in one of the Lync pools are at 100% CPU utilization. Task Manager shows two instances of the w3wp.exe process. We have published the topology many times, but it never happened before.

      To avoid a service outage, we moved users to a different pool and rebooted the servers, but it made no difference. Resetting IIS helped to bring CPU utilization back to normal. However, as soon as I moved a few users back to the pool, CPU started to spike again. I had to reset IIS again to bring it down.

      Did we get any permanent solution?

      Hemat

    3. This is the first I've heard of the issue on Lync 2010 pools, and user moves having a role in it. The symptoms you describe are exactly the same. Moving users to different pools triggers a replication of the central management database between pools, so if you recycle the LyncIntFeature and LyncExtFeature app pools (or do IISRESET), it will probably fix the issue, until the next user move.

      MS still does not have an ETA on solving this issue. Rest assured it is definitely on their radar.

      Ken
