1. What is the most reliable indicator of CPU contention?
2. What sort of CPU capture ratios can I expect on my Windows 2000 machine?
3. Can I use the published clock speed in MHz of the processor reliably as a relative speed rating?
1. What is the most reliable indicator of CPU contention?
Look at a combination of processor utilization and processor queuing.
(1) The primary indicator of processor utilization is the % Processor Time Counter in the Processor Object. Note that the _Total instance of the Processor Object is actually the average value over all processors. The System Object in NT 4.0 contains a Counter named % Total Processor Time, which is also the average value over all processors.
The thread is the unit of execution in Windows. Each process address space that is launched has at least one thread, and many applications, of course, are multithreaded. There is an operating system function in Windows that keeps track of how much CPU time each thread consumes using a sampling technique. Samples are normally taken approximately one or two hundred times per second, which suggests that this technique is probably accurate for measurement intervals of 30 seconds or more. These samples are used to maintain the Thread % Processor Time Counter: execution time, recorded in 100 nanosecond timer tick units, is normalized to a percentage of the measurement interval duration. Thread % Processor Time is also summarized at the Process level. The table below summarizes this overall measurement scheme:
| Object | Counter | Derivation |
| --- | --- | --- |
| Thread | % Processor Time | Dispatcher timing mechanism |
| Process | % Processor Time | Σ Thread % Processor Time |
| Processor | % Processor Time | 100% - Idle Thread % Processor Time |
| Processor (Win2K, XP) | Processor(_Total) % Processor Time | Σ Processor % Processor Time / # processors |
| System (NT) | % Total Processor Time | Σ Processor % Processor Time / # processors |
Processor busy is measured using an Idle thread mechanism. The operating system dispatches an Idle thread whenever there are no ready threads to run. Whenever the processor accounting routine finds the Idle thread dispatched, processor time is accumulated for the Idle thread. By the way, the Idle thread is not an actual execution thread. It is a HAL function that exists essentially to perform this bookkeeping.
At the end of a measurement interval, % Processor Time is
calculated by subtracting the amount of accumulated Idle thread time from 100%.
On a multiprocessor, there is a dedicated Idle thread per processor so that
reliable measurements are kept. % Processor Time at the processor level can be
broken down into Privileged mode execution time, User mode execution time,
execution time in Interrupt mode, and execution time in Deferred Procedure Calls
(DPC), as illustrated in Figure 1 below. % Interrupt Time and % DPC Time
are both subsets of % Privileged Time.
Figure 1. Processor utilization breakdown.
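The Idle thread calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the operating system's actual accounting code: the function name and the sample figures are hypothetical, and the only facts taken from the text are the 100 nanosecond tick unit and the "100% minus Idle time" rule.

```python
TICKS_PER_SECOND = 10_000_000  # one timer tick is 100 nanoseconds


def percent_processor_time(idle_ticks: int, interval_seconds: float) -> float:
    """Return processor busy as 100% minus the Idle thread's share of the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    interval_ticks = interval_seconds * TICKS_PER_SECOND
    idle_percent = 100.0 * idle_ticks / interval_ticks
    return 100.0 - idle_percent


# Hypothetical sample: the Idle thread accumulated 45 seconds of a 60 second interval.
busy = percent_processor_time(idle_ticks=450_000_000, interval_seconds=60)
print(f"{busy:.1f}% busy")
```

On a multiprocessor the same calculation would be applied once per processor, since each processor has its own Idle thread.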
% Processor Time approaches 100% as an absolute upper limit on CPU capacity. A system that is running consistently at greater than 90% busy is clearly out of capacity. However, this is not a hard and fast rule. Some workloads show signs of significant CPU contention at lower levels of processor utilization. For example, Figure 1 shows an IIS web server at a large e-commerce site where processor utilization remains consistently below 85%, except for two peak processing intervals. However, we will see that this system suffers from a serious CPU capacity constraint. So start with % Processor Time, but do not stop there.
(2) The System Object contains an instantaneous Counter called Processor Queue Length. This Counter shows the number of threads that are currently in the Ready state, but are delayed waiting for a processor to be available. Maintaining a value of no more than two Ready threads per processor is the usual recommendation. More than two Ready threads per processor normally indicates a CPU resource shortage.
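The two-per-processor guideline amounts to a simple check. The sketch below assumes you already have a Processor Queue Length sample and the processor count in hand. The function names are hypothetical, and the threshold of 2 is just the usual recommendation quoted above, not a hard limit.

```python
def ready_threads_per_processor(queue_length: int, processors: int) -> float:
    """Spread the system-wide Processor Queue Length across the processors."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return queue_length / processors


def indicates_cpu_shortage(queue_length: int, processors: int,
                           threshold: float = 2.0) -> bool:
    """True when there are more than `threshold` Ready threads per processor."""
    return ready_threads_per_processor(queue_length, processors) > threshold


# Hypothetical samples from a 4-way machine.
print(indicates_cpu_shortage(queue_length=12, processors=4))  # 3 per processor
print(indicates_cpu_shortage(queue_length=8, processors=4))   # exactly 2 per processor
```

Because the Counter is instantaneous, a single sample over the threshold means little. It is the pattern across many intervals that matters.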
The Processor Queue Length Counter is often well-correlated with % Processor Time, even though the former is an instantaneous value obtained at the time the last processor sample was collected, while the latter is based on continuous samples during the measurement interval. Figure 2 shows the overall processor utilization from the 4-way multiprocessor system in Figure 1 with an overlay of the Processor Queue Length for the same interval (charted against the right hand y-axis). The expected correlation between processor utilization and the number of Ready and Waiting threads is apparent.
Figure 2. % Processor time vs. the Processor Queue Length.
In this instance, we also saw a correlation between periods of poor Active Server Pages (ASP) response time and corresponding spikes in the size of the Processor Ready Queue. On this particular system, % Processor Time values consistently greater than 70% appeared to cause spikes in web site response time.
(3) At the Thread level, there is a Counter called Thread State. Threads waiting in the Ready Queue have a Thread State code of 1 (see the Explain text for the Counter). This Counter tells you precisely which threads are waiting for service at the processor. Since Windows uses priority queuing to order the Ready Queue, knowing which threads are delayed in the queue can be quite useful to help pinpoint the impact of CPU contention on specific applications that might be experiencing performance problems.
Because of the quantity of threads on a typical NT machine, collecting thread execution state data using tools like System Monitor is normally prohibitive. Consequently, we designed NTSMF to allow efficient collection of this potentially useful information. The Ready Threads Counter that NTSMF provides at the process level (beginning with version 2.4.2) shows the number of Ready and Waiting threads for that process at the end of each measurement interval.
2. What sort of CPU capture ratios can I expect on my Windows machine?
Pretty good ones for the most part, unless you encounter one of several possible problems.
Capture ratios deal with the difference between the theory and practice of computer performance monitoring. In theory, since Windows derives all the CPU time measurements from the same data collection mechanism, the following relationship should hold:
Σ Thread(_Total) % Processor Time = Σ Process(_Total) % Processor Time = Processor(_Total) % Processor Time
which says that the amount of total Thread % Processor Time should be equal to the total Process % Processor Time, which should be equal to the total Processor % Processor Time. If you examine the numbers over any meaningful interval, however, you are likely to find that
Σ Thread(_Total) % Processor Time < Σ Process(_Total) % Processor Time < Processor(_Total) % Processor Time
which is inconsistent, to say the least. These inconsistencies become an issue primarily when you need to be able to project CPU requirements in the future, for example, and you are not sure which set of numbers to work from. When these numbers do not add up the way they should, you have a capture ratio problem.
To compute the % Processor Time capture ratio in Windows 2000, calculate the following ratio between 0 and 1:
Σ Process(_Total) % Processor Time / ( Processor(_Total) % Processor Time - (Processor % Interrupt Time + Processor % DPC Time))
When this ratio is at or near 1, almost 100% of the CPU time consumed can be attributed to specific application (or system) workloads. When this ratio is less than 0.7 (70%), for example, you have a lot of explaining to do to figure out what application is consuming so many CPU cycles.
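As a worked example of the capture ratio formula, here is a minimal Python sketch. It assumes all four inputs are expressed on the same scale (for instance, all as averages over the processors), and the sample numbers are made up for illustration.

```python
def capture_ratio(process_total: float, processor_total: float,
                  interrupt_time: float, dpc_time: float) -> float:
    """Process-level CPU time divided by processor busy net of Interrupt and DPC time."""
    attributable = processor_total - (interrupt_time + dpc_time)
    if attributable <= 0:
        raise ValueError("no attributable processor time in this interval")
    return process_total / attributable


# Hypothetical interval: 80% busy, 3% Interrupt, 2% DPC, 66% accounted for by processes.
ratio = capture_ratio(process_total=66.0, processor_total=80.0,
                      interrupt_time=3.0, dpc_time=2.0)
print(f"capture ratio = {ratio:.2f}")
```

Interrupt and DPC time are subtracted from the denominator so that the ratio compares process-level measurements only against the processor time that remains once that time is set aside.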
Normally, CPU capture ratios in Windows are good enough that you do not need to do a whole lot of explaining. However, under some circumstances, you may run into one of the following problems.
Transient processes.
The most serious capture ratio problem you can encounter in Windows is the result of transient processes. These are processes which do not run for very long, compared to the duration of a measurement interval. Using the built-in performance monitoring interface, it is only possible to collect resource usage information about processes that are currently executing. Once a process terminates, there is no more resource usage information to gather. (Unlike MVS and Unix, there is no separate resource accounting log which tallies the total resource usage of a process.) The effect of this approach is that the resource accounting information about an executing process is lost during the measurement interval in which the process terminates. This is not normally a concern on most Windows 2000 machines because most important server processes execute more or less continuously from boot to shutdown. It does suggest that using shorter measurement intervals will minimize exposure to this problem, which is one of the reasons why we like to recommend data collection intervals of one minute or less.
This can cause capture ratio problems if you have processes that come and go relatively quickly. This is something that you might see on machines that run a lot of scripts, for example. The problem often manifests itself as multiple instances of processes for the script host (e.g., cmd.exe) where the Process Elapsed Time Counter is frequently less than the measurement interval duration. Then you have a case where these transient processes come and go so quickly that their resource usage is not being accounted for completely.
The best way to address the transient process capture ratio problem is to increase the rate of sampling. There is a result in communications theory that suggests that you need to sample at about twice the frequency of the waveform in order to assess it accurately. If you calculate the average Process Elapsed Time and sample at twice that rate (that is, at about half that interval duration), you should obtain significantly improved capture ratios. For instance, increasing the data collection rate to once every 15 or 30 seconds has worked effectively in the several cases where we have tried it. Of course, shortening the data collection interval inevitably leads to increased measurement overhead, so there is no magic bullet that addresses this potentially serious problem.
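Reading the sampling guideline as "sample at twice the frequency", which means an interval of roughly half the average Process Elapsed Time, the calculation might look like the sketch below. The function name, the one-second floor, and the sample elapsed times are all assumptions made for illustration.

```python
from typing import Sequence


def suggested_interval_seconds(elapsed_times: Sequence[float],
                               floor_seconds: float = 1.0) -> float:
    """Half the average Process Elapsed Time, but never below floor_seconds."""
    if not elapsed_times:
        raise ValueError("need at least one Process Elapsed Time sample")
    average = sum(elapsed_times) / len(elapsed_times)
    return max(average / 2.0, floor_seconds)


# Hypothetical elapsed times (in seconds) observed for transient script processes.
print(suggested_interval_seconds([20, 40, 60, 120]))
```

The floor is there only to keep the suggestion from collapsing toward zero on machines with extremely short-lived processes, where measurement overhead would dominate.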
Capping Process % Processor Time at 100%
A CPU capture ratio problem that can occur on multiprocessors has an easy fix. The definition of the % Processor Time Counter type specifies that it can never exceed 100%. Logically, this is a legitimate limitation for instances of both Thread % Processor Time and Processor % Processor Time, but it is problematic for Process % Processor Time whenever there is a multithreaded process executing on a multiprocessor machine (more than one CPU). If a multithreaded process does accumulate a % Processor Time value greater than 100%, then the default behavior of the System Monitor is to report a truncated value of 100%. The CapPercentsAt100 value under HKCU\Software\Microsoft\PerfMon can be set to 0 to turn off the default truncation rule that System Monitor uses.
Using Performance Sentry, the default behavior is never to truncate % Processor Time values, so this potential problem is addressed. (We do not truncate values for the % Disk Time counters either.) If for some reason you would like NTSMF to mirror the default behavior of the Microsoft tools, specify "Yes" for the File Contents, Truncate CPU Utilization at 100% runtime parameter in your Data Collection set definition.
Rounding errors
This is a relatively minor problem that only affects very fine-grained measurements of thread and process CPU execution time data. Going back to the initial version of Windows NT, there is a HAL function which creates a machine-independent virtual clock denominated in 100 nanosecond timer ticks. However, processor utilization samples are usually taken at 10 millisecond intervals. On today's fast machines, this somewhat leisurely sampling rate of the system processor utilization measurement function can lead to inconsistencies. These sampling errors are usually associated with very short measurement intervals (5 seconds or less), so they will not affect you when you are looking at coarser grained measurements for capacity planning purposes.
3. Can I use the published clock speed in MHz of the processor reliably as a relative speed rating?
For back-of-the-envelope capacity planning, it is nice to have a relative speed rating for various processors. You would like to be able to say with confidence that a given processor-bound workload running on machine A at 400 MHz will execute in 1/3 the time on machine B, which runs at 1.2 GHz and is therefore 3 times faster. For the most part, so long as you stay within the same processor family, you can do that with Intel processors.
Benchmark results consistently show that within a processor family, the performance of an Intel processor usually scales linearly with clock speed (in MHz), all other factors like cache size and bus speed being equal. Figure 3 below shows a chart which illustrates this point. Four representative sets of published benchmark results are plotted for Pentium II processors in the range of 300-450 MHz. That performance scales linearly as the clock rate increases is evident. Within a processor family, it is reasonable to expect a Pentium IV at 1.8 GHz to run roughly 50% faster than a Pentium IV 1.2 GHz box.

Figure 3. Within a family of processors, performance of Intel CPUs scales linearly with clock speed.
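The back-of-the-envelope projection can be written down as a small Python sketch. It assumes strictly linear scaling with clock speed, a processor-bound workload, and two machines from the same processor family. The function names and the 90-second workload are hypothetical.

```python
def relative_speed(old_mhz: float, new_mhz: float) -> float:
    """Speed of the new processor relative to the old one, assuming linear scaling."""
    if old_mhz <= 0 or new_mhz <= 0:
        raise ValueError("clock speeds must be positive")
    return new_mhz / old_mhz


def projected_runtime(runtime_seconds: float, old_mhz: float, new_mhz: float) -> float:
    """Projected run time of a processor-bound workload on the new processor."""
    return runtime_seconds / relative_speed(old_mhz, new_mhz)


# A 90 second job on a 400 MHz machine, moved to a 1.2 GHz machine of the same family.
print(projected_runtime(90.0, old_mhz=400, new_mhz=1200))
# 1.8 GHz versus 1.2 GHz: a factor of 1.5, i.e. roughly 50% faster.
print(relative_speed(1200, 1800))
```

The projection says nothing about workloads that are bound by memory, disk or network, and it should not be applied across processor families.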
However, if you try to compare machines from different processor families, you are apt to find that while clock speed is still important, there are other architectural features that matter. For instance, the 386, 486, Pentium, and Pentium Pro machines represent four different Intel processor families, the P3, P4, P5 and P6, respectively. The P4 introduced instruction pipelining to the Intel processor line, the P5 uses a dual integer pipeline (also known as a superscalar architecture), while the P6 features a highly parallel microarchitecture design. Running at similar clock speeds, P4, P5 and P6 machines will show markedly different results.
Since the advent of the first P6 Pentium Pro machines, subsequent versions of the Pentium II, III and IV are all P6 family machines with a similar internal microarchitecture. Machines within this processor family can be expected to scale roughly as a function of clock speed, as illustrated above. (In fact, Intel tweaked the internal architecture of the Pentium IV chip specifically to help it scale linearly with faster clock speeds.)
Intel has also introduced Itanium systems based on the 64-bit P7 architecture, which is vastly different from its predecessors. Our preliminary testing with an early Itanium running a pre-release copy of the 64-bit version of Windows XP indicates that P7 processors are much faster than their clock speed alone would lead you to believe.
4. Is there a table somewhere that will take the information in the CPU Family field in the NTCONFIG record (e.g., X86 FAMILY 15 MODEL 2 STEPPING 7) and convert that to a specific processor chip manufacturer name and model?
The short answer is that Intel knows what these things mean, but does not publish a mapping anywhere of how these internal names correspond to external products. The closest Intel comes is this document at http://www.intel.com/design/chipsets/mature/mature.pdf, which does not mention either Family or Stepping names.
In semiconductor fabrication, "stepping" refers to the stepper, the equipment used in the chip manufacturing process that builds up successive layers of etched conducting material and insulation. Our best guess is that this is an internal reference to the plant/stepper technology that produced the chip.
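Although there is no published table to translate these values into product names, the identifier string itself is easy to split into its numeric fields for reporting or grouping. The Python sketch below assumes the string always follows the "X86 FAMILY n MODEL n STEPPING n" layout shown in the question. The function name is hypothetical, and no attempt is made to map the numbers to a chip name.

```python
import re
from typing import Dict, Optional, Union

_CPU_ID = re.compile(
    r"(?P<arch>\S+)\s+FAMILY\s+(?P<family>\d+)"
    r"\s+MODEL\s+(?P<model>\d+)\s+STEPPING\s+(?P<stepping>\d+)",
    re.IGNORECASE,
)


def parse_cpu_identifier(text: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split a CPU identifier string into its fields, or return None if it does not match."""
    match = _CPU_ID.search(text)
    if match is None:
        return None
    return {
        "architecture": match.group("arch"),
        "family": int(match.group("family")),
        "model": int(match.group("model")),
        "stepping": int(match.group("stepping")),
    }


print(parse_cpu_identifier("X86 FAMILY 15 MODEL 2 STEPPING 7"))
```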
The CPU configuration data in the Registry is actually a narrow subset of the information Intel places in WMI today. See for example, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wmisdk/wmi/win32_processor.asp for the complete win32_processor spec. See the table that documents the processor family. However, most of these fields are null when you query them in WMI, as in the following script, for example:
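' Connect to the WMI service on the local machine.
Set objWMIService = GetObject("winmgmts:\\.\root\cimv2")
Set colSettings = objWMIService.ExecQuery _
    ("SELECT * FROM Win32_Processor")
' Echo selected properties for each processor instance.
For Each objProcessor In colSettings
    WScript.Echo "System Type: " & objProcessor.Architecture
    WScript.Echo "Processor: " & objProcessor.Description
    WScript.Echo "Family: " & objProcessor.Family
Next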
This script returns a Processor Description identical to what is contained in the Registry, but Processor.Family is null.
There is a new field available for some of Intel's newer processors at HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0\ProcessorNameString that may provide you with what you are looking for.

You can harvest this Registry value using the NTSMF Registry data collection feature. To gather the information stored in this Registry field, add the HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0\ProcessorNameString field name to the Registry values on the DCS Parameters "File Contents" Tab, as in the following illustration:
5. How should I report on processor utilization for machines running Intel Hyper-threading (HT) technology?
Hyper-threading (HT) is the brand name for the technology Intel uses in many of its Xeon 32-bit processors that enables one physical processor core to execute two instruction streams (or threads) concurrently. On an HT machine, when HT is enabled, each physical processor currently presents two "logical" CPU interfaces to the operating system so that two program threads can be dispatched at a time. The best way to report on processor utilization for an HT machine is to calculate the average utilization of the logical processors associated with the same physical processor core.
Figuring out whether or not HT is beneficial or detrimental on a specific workload is difficult today unless you can do an apples-to-apples comparison between an HT machine and a non-HT machine running the exact same workload. On an HT machine, all the processor level resource usage measurements such as % Processor Time represent utilization of a logical processor. Some authorities recommend averaging the processor utilization of the two logical processors that share a physical processor core to calculate utilization of the physical processor. To do this, you must understand which logical processors are associated with the same physical processor core.
Two new Processor configuration records, introduced in NTSMF version 2.4.7, allow you to identify HT machines definitively and determine which logical processors share a physical processor core. An instance of a DTS.CPU configuration record that identifies a physical processor is written for each physical processor that is present. These records contain a counter called # Logical Processors Supported that will tell you if it is an HT machine, along with a counter called # Logical Processors Active that shows you if HT is enabled. If the # Logical Processors Supported counter contains a null value, then the machine is not HT-capable. If the # Logical Processors Supported counter contains valid numeric data, then it is an HT-capable machine. (You should see a numeric value of 2 for current HT-ready processors. Note that Intel's processor roadmap shows them contemplating building HT machines with more than 2 logical processors sometime in the future.) You can also tell if HT is enabled on the machine. On an HT machine, if the # Logical Processors Active is less than # Logical Processors Supported, then the HT support has been disabled.
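The rules for interpreting the two counters can be summarized in a short Python sketch. It assumes the DTS.CPU record has already been read, with a null counter represented as None. The function name and the returned labels are hypothetical and are not part of NTSMF.

```python
from typing import Optional


def hyperthreading_status(supported: Optional[int], active: Optional[int]) -> str:
    """Classify a physical processor from its # Logical Processors counters."""
    if supported is None:
        # A null "# Logical Processors Supported" means the machine is not HT-capable.
        return "not HT-capable"
    if active is None:
        return "HT-capable, state unknown"
    if active < supported:
        # Fewer active than supported logical processors: HT has been disabled.
        return "HT-capable, disabled"
    return "HT-capable, enabled"


print(hyperthreading_status(supported=None, active=None))
print(hyperthreading_status(supported=2, active=1))
print(hyperthreading_status(supported=2, active=2))
```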
The DTS.CPU records contain some additional CPU hardware configuration data that you might find interesting, like the amount of L1, L2 and L3 cache memory that is installed, where that information is available.
DTS.LogicalProcessor records are also written that associate a logical processor instance name (the same instance name used in the Processor records) with a DTS.CPU physical processor core parent instance. Both sets of Processor configuration records are automatically written once to the beginning of each NTSMF data file, just before the first interval data records.
The core technology that Intel uses in its HT machines is known as Simultaneous Multithreading (SMT), which you can learn about at this University of Washington, Computer Science department web site. Much of the research published here shows SMT to be quite promising. Multiple threads executing simultaneously on the same processor core works well when an instruction from one thread blocks inside the instruction pipeline, but the processor can continue to make forward progress executing instructions from another thread. On the other hand, in practice HT is sometimes detrimental to overall performance when it comes to real-world workloads, forcing customers to disable HT in some instances. Threads executing concurrently on the same processor core must contend for shared resources inside the processor, particularly the same processor cache. Multiple threads can also interfere with each other's instruction execution progress, which leads to degraded performance levels. One suggestion is that this interference is more likely to occur when you are attempting to run a homogeneous workload and less likely to occur when the processor is executing threads from unrelated processes. In other words, on a machine dedicated to a specific application or one instance of SQL Server, HT could do more harm than good.
According to this white paper posted on Microsoft's web site, logical processor instance names on an HT machine are generated in sequence, one to a physical processor until all the physical processors have one logical processor, and then in sequence again until all the HT logical processors have been accounted for. For example, on an HT-enabled machine with 4 processor cores, processor instances 0 and 4 are associated with the first physical processor present, instances 1 and 5 are associated with the next, etc. Since the assignment of logical processor numbers to physical processor cores is a BIOS function, the authors of the Microsoft white paper were not entirely certain that every HT machine you ever come across will look this way, but at least every one that they have seen so far conforms to this numbering scheme.
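Under that numbering scheme, logical processor i belongs to physical processor i modulo the number of physical processors, so the per-core averages recommended earlier can be computed as in the Python sketch below. The function name and the sample utilization values are hypothetical. If a machine's BIOS numbers its processors differently, the DTS.LogicalProcessor records should be used to build the mapping instead.

```python
from typing import List


def physical_utilization(logical_busy: List[float], physical_count: int) -> List[float]:
    """Average % Processor Time of the logical processors sharing each physical core.

    Assumes instances are numbered one per physical processor in sequence, then
    in sequence again (e.g. 0 and 4 share the first core on a 4-core machine).
    """
    if physical_count < 1 or len(logical_busy) < physical_count:
        raise ValueError("need at least one logical processor per physical processor")
    groups: List[List[float]] = [[] for _ in range(physical_count)]
    for index, busy in enumerate(logical_busy):
        groups[index % physical_count].append(busy)
    return [sum(group) / len(group) for group in groups]


# Hypothetical % Processor Time for logical processors 0..7 on a 4-core HT machine.
samples = [80.0, 60.0, 40.0, 20.0, 40.0, 20.0, 60.0, 0.0]
print(physical_utilization(samples, physical_count=4))
```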