What sort of CPU capture ratios can I expect on my Windows machine?

Pretty good ones for the most part, unless you encounter one of several possible problems.

Capture ratios deal with the difference between the theory and practice of computer performance monitoring. In theory, since Windows derives all the CPU time measurements from the same data collection mechanism, the following relationship should hold:

Σ Thread(_Total) % Processor Time = Σ Process(_Total) % Processor Time = Processor(_Total) % Processor Time

which says that the amount of total Thread % Processor Time should be equal to the total Process % Processor Time, which should be equal to the total Processor % Processor Time. If you examine the numbers over any meaningful interval, however, you are likely to find that

Σ Thread(_Total) % Processor Time < Σ Process(_Total) % Processor Time < Processor(_Total) % Processor Time

which is inconsistent, to say the least. These inconsistencies become an issue primarily when you need to be able to project CPU requirements in the future, for example, and you are not sure which set of numbers to work from. When these numbers do not add up the way they should, you have a capture ratio problem.

To compute the % Processor Time capture ratio in Windows 2000, calculate the following ratio, which should fall between 0 and 1:

Σ Process(_Total) % Processor Time / (Processor(_Total) % Processor Time - (Processor % Interrupt Time + Processor % DPC Time))

When this ratio is at or near 1, almost 100% of the CPU time consumed can be attributed to specific application (or system) workloads. When this ratio is less than 0.7 (70%), for example, you have a lot of explaining to do to figure out which application is consuming so many CPU cycles.
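The ratio above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample values are hypothetical, and it assumes that all four inputs are percentages averaged over the same interval and expressed on the same scale, with the process total summed over process instances other than Idle.

```python
def cpu_capture_ratio(process_total, processor_total, interrupt_time, dpc_time):
    """Return the % Processor Time capture ratio for one measurement interval.

    All arguments are percentages over the same interval (an assumption of this sketch).
    """
    # Interrupt and DPC time are not charged to any process, so remove them first.
    attributable = processor_total - (interrupt_time + dpc_time)
    if attributable <= 0:
        raise ValueError("no attributable processor time in this interval")
    return process_total / attributable


# Hypothetical interval: 60% busy, 3% interrupts, 2% DPCs, 52% charged to processes.
ratio = cpu_capture_ratio(process_total=52.0, processor_total=60.0,
                          interrupt_time=3.0, dpc_time=2.0)
print(f"{ratio:.2f}")  # 0.95
```

A result this close to 1 means nearly all of the measured CPU time can be traced back to a process.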

Normally, CPU capture ratios in Windows are good enough that you do not need to do a whole lot of explaining. However, under some circumstances, you may run into one of the following problems.

Transient processes.

The most serious capture ratio problem you can encounter in Windows is the result of transient processes. These are processes that do not run for very long compared to the duration of a measurement interval. Using the built-in performance monitoring interface, it is only possible to collect resource usage information about processes that are currently executing. Once a process terminates, there is no more resource usage information to gather. (Unlike MVS and Unix, there is no separate resource accounting log that tallies the total resource usage of a process.) The effect of this approach is that the resource accounting information about an executing process is lost during the measurement interval in which the process terminates. This is not normally a concern on most Windows servers because most important server processes execute more or less continuously from boot to shutdown. It does suggest that using shorter measurement intervals will minimize exposure to this problem, which is one of the reasons why we like to recommend data collection intervals of one minute or less.

This can cause capture ratio problems if you have processes that come and go relatively quickly. This is something that you might see on machines that run a lot of scripts, for example. The problem often manifests itself as multiple instances of processes for the script host (e.g., cmd.exe) where the Process Elapsed Time Counter is frequently less than the measurement interval duration. Then you have a case where these transient processes come and go so quickly that their resource usage is not being accounted for completely.
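One rough way to check for this symptom is to compare the Elapsed Time readings of the observed process instances against the measurement interval. The sketch below assumes the readings have already been exported as a list of seconds; the function name and the sample values are hypothetical.

```python
def transient_fraction(elapsed_times, interval_seconds):
    """Fraction of observed process instances younger than one measurement interval."""
    if not elapsed_times:
        return 0.0
    short_lived = sum(1 for t in elapsed_times if t < interval_seconds)
    return short_lived / len(elapsed_times)


# Hypothetical Elapsed Time readings (seconds) for cmd.exe instances, 60-second interval.
observed = [4.0, 12.5, 3600.0, 7.2, 45.0]
print(transient_fraction(observed, interval_seconds=60))  # 0.8
```

A high fraction suggests that much of the work done by these instances is finishing between samples and is likely going unrecorded.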

The best way to address the transient process capture ratio problem is to increase the rate of sampling. There is a result in communications theory (the sampling theorem) that suggests that you need to sample at a rate of at least twice the frequency of a waveform in order to assess it accurately. If you calculate the average Process Elapsed Time and sample at an interval of half that duration, you should obtain significantly improved capture ratios. For instance, shortening the data collection interval to once every 15 or 30 seconds has worked effectively in the several cases where we have tried it. Of course, shortening the data collection interval inevitably leads to increased measurement overhead, so there is no magic bullet that addresses this potentially serious problem.
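The effect of the sampling rate can be illustrated with a small simulation. This is a deliberately simplified model, not a description of how Windows collects the data: it assumes samples are taken at multiples of the interval, that each process accrues CPU time uniformly over its lifetime, and that only the CPU time accumulated up to the last sample taken while the process was still running is captured. The process list and all names are hypothetical.

```python
import math


def captured_cpu(processes, interval):
    """Return (captured, total) CPU seconds under the simplified sampling model.

    processes: list of (start, end, cpu_seconds) tuples, times in seconds.
    """
    captured = 0.0
    total = 0.0
    for start, end, cpu in processes:
        total += cpu
        # Latest sample time that falls strictly before the process terminates.
        last_sample = (math.ceil(end / interval) - 1) * interval
        if last_sample > start:
            captured += cpu * (last_sample - start) / (end - start)
    return captured, total


# Three hypothetical short-lived script processes: (start, end, CPU seconds used).
scripts = [(5, 25, 2.0), (70, 100, 3.0), (130, 170, 4.0)]

# Rule of thumb from the text: sample at half the average process lifetime.
average_lifetime = sum(end - start for start, end, _ in scripts) / len(scripts)
suggested_interval = average_lifetime / 2  # 15.0 seconds for this data

for interval in (60, suggested_interval):
    captured, total = captured_cpu(scripts, interval)
    print(f"interval {interval:>4.0f}s: captured {captured:.1f} of {total:.1f} CPU seconds")
```

With the 60-second interval none of the three processes is ever seen by a sample, while the 15-second interval captures most, though still not all, of their CPU time.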

Capping Process % Processor Time at 100%

A CPU capture ratio problem that can occur on multiprocessors has an easy fix. The definition of the % Processor Time Counter type specifies that it can never exceed 100%. Logically, this is a legitimate limitation for instances of both Thread % Processor Time and Processor % Processor Time, but it is problematic for Process % Processor Time whenever a multithreaded process is executing on a multiprocessor machine (one with more than one CPU). If a multithreaded process does accumulate more than 100% in % Processor Time, then the default behavior of the System Monitor is to report a truncated value of 100%. The registry value HKCU\Software\Microsoft\PerfMon\CapPercentsAt100 can be set to 0 to turn off the default truncation rule that System Monitor uses.
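To see why the cap distorts process-level numbers, consider a minimal model in which a process value is the sum of its thread values. The function below is a hypothetical illustration of the truncation rule described above, not System Monitor's actual implementation.

```python
def process_percent(thread_percents, cap_at_100=True):
    """Sum per-thread % Processor Time into a process value, optionally capped at 100."""
    total = sum(thread_percents)
    return min(total, 100.0) if cap_at_100 else total


# Hypothetical: two busy threads of one process on a two-CPU machine.
threads = [80.0, 70.0]
print(process_percent(threads))                    # 100.0
print(process_percent(threads, cap_at_100=False))  # 150.0
```

In this example the capped value hides a third of the CPU time the process actually used, and that missing time lowers the capture ratio.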

Using Performance Sentry, the default behavior is to never truncate % Processor Time values, so this potential problem is addressed. (We do not truncate values for the % Disk Time counters either. For an explanation of what is going on with this Counter, refer to “FAQ’s about DISK.”) If for some reason you would like NTSMF to mirror the default behavior of the Microsoft tools, specify “Yes” for the File Contents, Truncate CPU Utilization at 100% runtime parameter in your Data Collection Set definition.

Rounding errors

This is a relatively minor problem that only affects very fine-grained measurements of thread and process CPU execution time data. Going back to the initial version of Windows NT, there is a HAL function that creates a machine-independent virtual clock denominated in 100 nanosecond timer ticks. However, processor utilization samples are usually taken at 10 millisecond intervals. On today’s fast machines, this somewhat leisurely sampling rate of the system processor utilization measurement function can lead to inconsistencies. These sampling errors are usually associated with very short measurement intervals (5 seconds or less), so they will not affect you when you are looking at coarser-grained measurements for capacity planning purposes.
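A toy model shows how tick-based sampling can misstate the CPU time of a thread that runs in short bursts. It assumes, as a simplification, that each 10 millisecond tick charges one full tick to whichever thread is running at that instant. The burst patterns and function names are hypothetical.

```python
def sampled_cpu_ms(bursts, duration_ms, tick_ms=10):
    """CPU time charged to a thread when each tick that lands in a burst costs a full tick.

    bursts: list of (start_ms, end_ms) periods during which the thread is running.
    """
    charged = 0
    for tick in range(tick_ms, duration_ms + 1, tick_ms):
        if any(start <= tick < end for start, end in bursts):
            charged += tick_ms
    return charged


def actual_cpu_ms(bursts):
    """CPU time the thread really used."""
    return sum(end - start for start, end in bursts)


# 3 ms bursts that straddle each tick: heavily overcharged.
straddling = [(10 * k - 1, 10 * k + 2) for k in range(1, 10)]
# 3 ms bursts that always fall between ticks: never charged at all.
between = [(10 * k + 1, 10 * k + 4) for k in range(0, 10)]

print(sampled_cpu_ms(straddling, 100), actual_cpu_ms(straddling))  # 90 27
print(sampled_cpu_ms(between, 100), actual_cpu_ms(between))        # 0 30
```

Over a long interval with irregular bursts these errors tend to average out, which fits the observation that the problem mainly shows up in very short measurement intervals.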

