rebelpeon.com

Thursday, July 22, 2010

TokenAndPermUserStore Cache on x64 SQL 2005

We’ve been having yet another performance issue that has taken awhile to fix.  This one isn’t related to code though.  However, it has been a nightmare to solve.  Thankfully we have some Microsoft guys on sight who were able to quickly diagnose the problem and get us moving along.

Some background:

  • x64 SQL 2005 SP3
  • x64 Windows 2003 R2
  • 64GB RAM per cluster node
  • 12 HT enabled cores

With that, we were noticing a lot of CPU activity even though we really shouldn’t have, but it was strange as CPU would peak at about 45% and the SQL would appear to become very sluggish.  Running ad hoc queries took forever, and it was causing our app to run slowly too.  In fact, we were normally front-end CPU bound, but when we were having these issues, the front-end sat at about 25% CPU, SQL was at 45% and there were no waits on the SQL server.  It was very annoying as there was no place to point the finger at.

As I was digging through things, I noticed that the last geography we did we had set MDOP to 1 (see previous post).  As I was slowly going through various SQL configurations I noticed that it was set to 0 on all of the cluster nodes.  Needless to say, I talked to a few people, and 1 is apparently the default setting during their buildout, so I changed it. 

Miraculously everything was solved!  A quality day’s worth of work!  Ran some validation steps, and then handed it off to our offshore team.  The next day, I notice that the results were looking a lot worse that what I was seeing.  I spin up LoadRunner (ugh, another post I should probably write about) and run a test.  My results are for crap too.  Looking at front-end CPU, I notice it is once again not maxing out, and SQL is running hot again. 

I know how to fix this though, so I go and look at the MDOP setting.  Interesting, it is still set to 1.  Well, maybe something is hung someplace, so I set it to 0 then back to 1.  Run the test, problem solved, move on to something else. 

Well, this continually happens for the next week, and so I send an email asking if anyone has seen this before to the local DBAs and the onsite Microsoft DBAs if they have ever seen MDOP “revert”, but not really.  Needless to say, they had never heard of this, but the Microsoft DBA quickly narrowed in on the fact that the cache plan is being dumped when I re-run that configuration.  Looking at some memory dumps, he quickly jumped on the TokenAndPermUserStore cache.  We also verified it by only flushing that cache and watching the improvements on the site. 

The good news is that we are not the only ones having issues with this.  MS has actually tried to fix this issue since before SP2.  However, nothing, not even in SP3 has actually fixed it.  Yet, in SP3 they finally added a few trace flags that can be used to manually set the size of this specific cache.  Before I get to telling you how to fix it, here is what is going on.

On the previous geography our SQL tier looked the same except that we only had 24GB of RAM on the SQL nodes, plus our databases were a lot larger because of legacy data.  Therefore, we have memory pressure on the box.  SQL wants to load all the databases into RAM, but it can’t.  Therefore, the caches are continually going through garbage collection.  Now, on this current project, we have more than doubled the amount of RAM, and our databases are are tiny in comparison.  In fact, all of our databases are effectively in memory.  Therefore, we have no memory pressure, and the caches are never collected.  They keep growing to sizes that make them useless as they are spending more time in CPU finding the corresponding item (the security token in this case) than just recreating it.

Our SQL nodes are too big.  Who knew that was possible?

Now the solution.  Microsoft has tons of articles on this, but the one that describes it the best is 927396.  The top bullet points explain exactly what we were seeing:

  • Queries that typically run faster take a longer time to finish running.
  • CPU utilization for the SQL Server process is more than usual.
  • When you experience decreased performance when you run an ad hoc query, you view the query from the sys.dm_exec_requests or sys.dm_os_waiting_tasks dynamic management view. However, the query does not appear to be waiting for any resource.
  • The size of the TokenAndPermUserStore cache store grows at a steady rate.
  • The size of the TokenAndPermUserStore cache store is in the order of several hundred megabytes (MB).
  • In some cases, execution of the DBCC FREEPROCCACHE command provides temporary relief.

Now there are multiple ways of fixing it, it really depends on how many users are accessing your database.  For us, it is very few as we only have application accounts.  However, here are your options:

  1. Enable trace flag 4618 to set a quota of 1,024 entries.
  2. Enable both trace flag 4618 and 4610 to set a quota of 8,192 entries.
  3. Put a custom quota in the registry and enable trace flag 4621.

We went with option #2, so we have the default settings, and guess what it works!  The downside is that the memory keeps increasing, but we haven’t had a performance issue.  I am guessing is that the used memory space is staying consistently the same size, but it is not reclaiming memory, which is causing a memory leak.  I am going to work with our DBA this next week to validate that assumption (and make sure it will reclaim the memory at some point), so I will keep this post updated based on what we found.

In addition, I am going to switch MDOP back to 0, and see if SQL isn’t quite as dumb with parallelism as we think it is now.

 

posted by aaron at 01:23 PM
posted in work • (0) commentspermalink

Tuesday, June 29, 2010

Base Perf Improvements

Since I keep forgetting to check these base performance assumptions prior to digging in, I thought I would list them here with some description.  Yes, this is mostly for my personal edification, so deal with it.

  1. SQL - These items can be done when you see Buffer Latch waits
    1. Max Degree of Parallelism (DOP) - This setting should be set to 1.  By default it is set to 0, so it needs to be changed.  This setting becomes more important as CPU cores scale up.  It turns out that SQL is not good at parallelizing T-SQL queries.  When set to the default of 0, CPU is artificially increased due to SQL attempting to parallelize threads.  Some additional info can be found here.  This affects all versions of SQL.  Be sure to test first!
    2. T1118 Flag - This is a trace flag that much be added to the SQL startup parameters (-T1118).  This allows SQL to access multiple pages.  When you have a multi-core (processor) machine, you should definitely enable this flag on all instances of SQL.  In addition to this change you must do the following item to see benefits.  You can get more information here.
    3. File Partitioning - In addition to the T1118 flag, you also need to break out your files.  TempDB is the most common bottleneck that you should do this to.  In our testing it is at least worth while to create the number of data files (with autogrowth turned off) to 1/2 the number of cores you have.  We added all our This also helps other databases that have high contention.  Log files can also be broken out if lots or log contention is occurring.  However, in our cases, it has typically been limited to the data files.
  2. Web/App Server
    1. Wildcarding Disabled - By enabling wildcarding you can have pretty URLs that don’t end in an extension without using a rewrite module (not free for IIS6).  However, by enabling scriptmap wildcarding all files go through the asp.net handler.  This is obviously a huge CPU burden on the web server, plus an even larger burden if you are backing session in the database because the database CPU and IO goes through the charts.  Nothing like having session information for loading a JPG file!  You can see what not to do here.
    2. Kernel Mode SSL - This adds ~10% CPU relief on the web servers.  You can’t do anything with client certifications though, so be sure to test and read all the gotchas.  Find out more here.
posted by aaron at 03:35 PM
posted in work • (0) commentspermalink

Friday, June 11, 2010

Black Logon Screen for RDP

I get this every once in a while, and it is quite annoying.  Basically all the colors are messed up when you try to log in via RDP.  Everything appears black.  You can see your cursor, so you can log into machines, but it is very difficult when dialog boxes are around.  I found the solution over at server intellect, and wanted to slap it here so that I have an easy place to find it.

A black Remote Desktop Login Screen when connecting to your Windows Server 2003 Server is caused by registry entries changed either through minor corruption or user error. To resolve it, replace the registry entries with the ones in the registry script below. Simply copy and paste the below script into a text file, and rename it “rdpreset.reg”. Right click on the resulting file, and select Merge.

Windows Registry Editor Version 5.00

[HKEY_USERS\.DEFAULT\Control Panel\Colors]
“ActiveBorder”=“212 208 200”
“ActiveTitle”=“10 36 106”
“AppWorkSpace”=“128 128 128”
“Background”=“102 111 116”
“ButtonAlternateFace”=“181 181 181”
“ButtonDkShadow”=“64 64 64”
“ButtonFace”=“212 208 200”
“ButtonHilight”=“255 255 255”
“ButtonLight”=“212 208 200”
“ButtonShadow”=“128 128 128”
“ButtonText”=“0 0 0”
“GradientActiveTitle”=“166 202 240”
“GradientInactiveTitle”=“192 192 192”
“GrayText”=“128 128 128”
“Hilight”=“10 36 106”
“HilightText”=“255 255 255”
“HotTrackingColor”=“0 0 128”
“InactiveBorder”=“212 208 200”
“InactiveTitle”=“128 128 128”
“InactiveTitleText”=“212 208 200”
“InfoText”=“0 0 0”
“InfoWindow”=“255 255 225”
“Menu”=“212 208 200”
“MenuText”=“0 0 0”
“Scrollbar”=“212 208 200”
“TitleText”=“255 255 255”
“Window”=“255 255 255”
“WindowFrame”=“0 0 0”
“WindowText”=“0 0 0”
This will correct the colors on the login screen to Windows Server default.

posted by aaron at 02:44 PM
posted in work • (0) commentspermalink

Thursday, October 09, 2008

SQL Performance Optimizations

The team that I’m currently working on is focused on performance testing. None of us on the current team are really SQL experts though and we’re running into an issue that appears to be SQL related. Well, it turns out in SQL 2005, there’s these great new things: Database Management Views. What makes them even cooler is that you can pool them all together. And the hottest thing about this whole thing, is a script we found to do most of the work on our own. This things is awesome for SQL performance testing, and will definitely get a lot of reuse, that’s for sure.

posted by aaron at 01:56 PM
posted in work • (0) commentspermalink

Tuesday, September 23, 2008

Commerce Server 2007 Performance Counters

Having issues accessing Commerce Server 2007 performance counters remotely?  We were too.  It seemed like only specific ones were able to be remotely accessed.  Well, it turns out that is “by design” (bullet point #2).  However, you can get around this by allowing the Remote Registry service run as an account that has access to the database.  Down side to that is I’m not exactly sure what permissions are actually required for the Remote Registry service (something above a normal user), and who knows what that breaks.

posted by aaron at 10:27 AM
posted in work • (0) commentspermalink

Tuesday, August 12, 2008

Visual Studio Test Load Agent Brain Dump

On my current project, I’m doing a lot with performance testing using Microsoft Visual Studio 2008 Test Load Agent.  Now, there’s not a whole lot of things on Microsoft’s site about it, and blog postings are fairly sparse too.  Hopefully some small nuggets listed here will be beneficial.

The first thing to note is that there is some documentation for 2005, but very little about the 2008 version.  Do not use the 2005 version if possible.  It has many blocking issues you probably will encounter.  Use 2008 as many have been fixed.  Many of the guides below are for 2005, but work just as well for 2008.

Installation:

Configuration:

  • Misc Good Configuration”
  • In the installation documentation, it talks about if you have a domain or workgroup.  However, in many test scenarios you will have multiple domains with no trusts in place (i.e. controller sits in one domain while the machines to be tested and monitor sit in another).  Don’t panic.  Simply create the Controller service account with the same username and password in all the domains.  Make the Controller service account a member of the Performance Log Users and Performance Monitor Users groups, and it will do synchronized pass-though.

How-To’s and Walkthroughs:

FAQs and Indexes:

Hopefully those links will help people.  This is usually the set of links I send people who are knew to the whole Visual Studio Load Test Agent stuff.

posted by aaron at 01:55 PM
posted in work • (0) commentspermalink

Monday, April 07, 2008

WCF Extensions and .NET 3.0 SP1

Having issues installing the WCF Extensions on Visual Studio 2005 when you have .NET 3.0 SP1 installed?  Getting an error message the looks something like the following? 

Setup has detected that a prerequisite is missing. To use Visual Studio 2005 extensions for .NET Framework 3.0 (WCF & WPF), November 2006 CTP you must have the .NET Framework 3.0 runtime installed. Please install the .NET Framework 3.0 runtime and restart setup.

Well, instead of just installing the application by double clicking on the MSI, run it the following way.  This ends up bypassing the prereq checks and it installs successfully.  Yay to broken installers!

msiexec /i vsextwfx.msi WRC_INSTALLED_OVERRIDE=1

posted by aaron at 11:17 AM
posted in workcomputers • (0) commentspermalink
Page 1 of 9 pages  1 2 3 >  Last »