1612283 - Hardware Configuration Standards and Guidance (Virtualization)

Symptom
Information is required on the correct and efficient specification and configuration of Intel/AMD x64 hardware running Windows for SAP ABAP and SAP Java application server environments.  This note also discusses virtualized SAP on Windows environments.

Cause
x86 based hardware (from either Intel or AMD, hereafter called “Intel”) has evolved rapidly in recent years.  Many new technologies and features in Windows and Intel H/W platforms directly impact the optimal configuration of SAP systems.
SAP ABAP and SAP Java application servers should be deployed after reviewing the recommendations in this note.  The configurations in this SAP Note have been tested and proven by SAP, Microsoft and hardware vendors in lab tests, benchmarks and customer deployments.
More information on SAP standard benchmarks and the term “SAPS” can be found at http://www.sap.com/benchmark/

Resolution
Before purchasing new hardware, and when installing and configuring SAP on Windows in physical or virtual environments, follow the deployment guidelines in the PDF file attached to this SAP Note.

General

SAP server throughput (as measured by SAPS) has increased significantly on Intel based server hardware in recent years. Intel/AMD and OEM hardware manufacturers achieved these performance increases by introducing many new technologies and concepts.  SAP applications require appropriate hardware configurations and parameterization to achieve the performance and throughput increases demonstrated in the SAP Standard Application benchmarks.  Inappropriately configured Intel systems can exhibit significant performance problems, unpredictable (sometimes slow) response times, or throughput well below the SAP Standard Application benchmarks. Provided the concepts and configurations documented in this note are followed, these problems should not occur.

1. Overview of Modern Intel  Server Technologies

1.1. Clock Speed

All SAP work processes other than the Message Server and Enqueue Server execute their logic within a single thread.  The performance of batch jobs in particular, and of other work process types in general, is largely determined by the latency of database requests and by the time an SAP work process spends running on a single CPU thread in Windows. SCU (Single Computing Unit – note 1501701) is the SAP specific terminology for per-thread throughput. SCU is very important in determining the performance of an SAP system.
SAP Standard Application benchmarks have shown a strong correlation between clock speed (GHz) and SCU on the same processor architecture.  On some Intel servers, disabling Hyperthreading may increase SCU, thereby improving the throughput of a single work process (e.g. a batch job) while decreasing the total aggregate throughput of the entire server. If you need to speed up a single transaction or report, you might try switching off Hyperthreading. The exact performance increase per thread depends on factors beyond the scope of this note; please contact Intel for further information on Hyperthreading and performance.  SAP benchmarks on Windows Intel systems have shown higher SCU on higher clock speed processors.
2 socket servers have significantly higher SCU than 4 or 8 socket servers.  As of June 2012, benchmarks show 8 socket Intel servers delivering 55% lower SAPS per thread than 2 socket servers.
Some SAP components and some specific SAP processes (see note 1501701) are particularly sensitive to SCU performance.  Hypothetical examples below show how to calculate SCU performance (which corresponds to per thread performance on Windows):
Example  – Intel Server with Hyperthreading ON:
SAPS = 32,000
H/W configuration = Intel E5 2 processors / 16 cores / 32 threads
SCU = 32,000 / 32 threads = 1,000 SCU SAPS
Same Intel server with Hyperthreading OFF:
SAPS = 22,000
H/W configuration = Intel E5 2 processors / 16 cores / 16 threads
SCU = 22,000 / 16 threads = 1,375 SCU SAPS
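The SCU arithmetic above can be sketched in a few lines of code. This is an illustrative calculation only, using the hypothetical SAPS figures from the example; real values come from SAP Standard Application benchmarks:

```python
def scu_saps(total_saps: int, threads: int) -> float:
    """Per-thread throughput (SCU SAPS) = total SAPS / number of CPU threads."""
    return total_saps / threads

# Hypothetical 2-processor Intel E5 server from the example above
print(scu_saps(32_000, 32))  # Hyperthreading ON,  32 threads -> 1000.0
print(scu_saps(22_000, 16))  # Hyperthreading OFF, 16 threads -> 1375.0
```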
Windows Power Saving features can lower the clock speed when the CPU is idle.  Hardware vendors and Microsoft can provide more information on the optimal energy/performance configuration.
Additional information about Hyperthreading and SCU on Virtualized systems can be found in note 1246467 - Hyper-V Configuration Guideline and note 1056052 - Windows: VMware vSphere configuration guidelines.   Microsoft and VMware provide additional whitepapers and blogs on this topic.

1.2. Multi-core

When talking about performance we need to distinguish between throughput, e.g. the number of sales orders or payroll calculations per hour that can be executed on given hardware, and the time it takes to execute a single elementary operation, such as one payroll calculation or, on the database side, a single lookup of a row in a table. The throughput a single server can deliver is indicated by SAP benchmarks and the associated SAPS number; the time to execute an elementary operation is better expressed by the SCU introduced above. When changing the hardware of an SAP system to a more recent model with more processor cores, it may be necessary to adjust the configuration or the number of SAP instances running on a single server in order to leverage the increased number of CPU cores. On the SAP application side, the scale-out capability of the SAP application layer offers high flexibility to leverage hardware with high SCU, whereas on the DBMS side the focus in selecting a server is often more on throughput and the ability to execute as many requests as possible in parallel.

1.3. Large Physical Memory

Windows Zero Administration Memory Management is generally recommended and is documented in note 88416.  SAP generally recommends against very large ABAP or Java instances, as documented in note 9942.  A 2 socket server is very powerful, and a single instance with around 50 work processes is unlikely to leverage the full CPU power of the H/W. Increasing the number of work processes (beyond about 50) and users on a single instance may not improve throughput linearly.  For example, three ABAP instances with 50 work processes each have shown much better performance than one ABAP instance with 150 work processes.
Installing multiple ABAP or Java instances on a single physical server will allow the H/W resources to be fully leveraged. 
Solution: install multiple smaller ABAP instances per physical server, balance the workload with SAP Logon Load Balancing, and keep the instance configurations identical by setting most parameters in DEFAULT.PFL.
In general, use Windows Zero Administration Memory Management.  Remove the profile parameters listed in note 88416 and set only PHYS_MEMSIZE.  The ZAMM parameters will then be calculated automatically based on the value of PHYS_MEMSIZE.
Suggested profile parameters for ABAP instances sharing the same H/W and operating system:
PHYS_MEMSIZE = physical RAM / number of instances, leaving a small amount for the operating system
em/max_size_MB = ZAMM default = 1.5 x PHYS_MEMSIZE*
abap/heap_area_dia = 2GB (2000000000) or slightly higher
abap/heap_area_nondia = 0 (up to the maximum value of abap/heap_area_total)
abap/heap_area_total = ZAMM default = PHYS_MEMSIZE
*As of 720_EXT downwards compatible kernel, patch 315 or higher
The attached PDF file contains sample configurations
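As a rough sketch of the sizing rule above; the 4GB operating system reserve and the 256GB example server are illustrative assumptions, not recommendations from this note:

```python
def suggest_phys_memsize(total_ram_mb: int, instances: int,
                         os_reserve_mb: int = 4096) -> int:
    """Split physical RAM evenly across the SAP instances on one server,
    leaving a small amount for the OS (reserve size is an assumption)."""
    return (total_ram_mb - os_reserve_mb) // instances

# Hypothetical 256GB server hosting three ABAP instances
phys_memsize = suggest_phys_memsize(256 * 1024, 3)
em_max_size_mb = int(phys_memsize * 1.5)  # ZAMM default: 1.5 x PHYS_MEMSIZE
print(phys_memsize, em_max_size_mb)
```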

1.4. NUMA

Non-Uniform Memory Access (NUMA) directly impacts the performance of SAP ABAP application servers.  The SAP Kernel for Windows is single threaded and does not contain NUMA handling logic to localize memory storage for a specific process to a specific NUMA node.  
Performance is therefore maximized on processors with a high clock speed and the fewest NUMA nodes.  Both conditions are met on 2 socket commodity Intel systems.
Local memory access times are very fast on NUMA based systems because the memory controller is directly connected to one processor.  Remote memory access is many times slower than local.  The calculation of local versus remote memory access for SAP application instances is a simple mathematical formula:
2 socket = 50% chance of a local NUMA node access
4 socket = 25% chance of a local NUMA node access
8 socket = 12.5% chance of a local NUMA node access
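The percentages above follow directly from the socket count: assuming memory is spread evenly and each socket forms one NUMA node, a non-NUMA-aware process has a 1/sockets chance that any given access is local. A minimal sketch:

```python
def local_access_probability(sockets: int) -> float:
    """Chance that a memory access by a non-NUMA-aware process lands on
    the local NUMA node (even memory spread, one node per socket assumed)."""
    return 1.0 / sockets

for sockets in (2, 4, 8):
    print(f"{sockets} socket: {local_access_probability(sockets):.1%} local")
```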
2 socket Intel commodity servers have a higher clock speed and better NUMA characteristics and are therefore suitable for SAP application servers.  Excessive remote memory accesses on 8 socket or higher servers running SAP ABAP instances will adversely impact performance.  This can occur with or without virtualization. Virtualization software does not prevent NUMA induced latencies nor change the physical structure of the processor/memory layout. Modern virtualization software may avoid remote memory communication if a Virtual Machine is equal to or smaller than the resources of one NUMA node.
RDBMS software from Microsoft, IBM, Oracle and SAP is NUMA aware in current releases.  NUMA aware RDBMS software attempts to keep memory structures local and to avoid remote memory access.  Modern DBMS software has demonstrated very good scalability on 8 socket or higher Intel servers.

 1.5. Processor Groups (K-Groups)

Windows 2008 R2 and higher introduced a concept called “Processor Groups”.  Processor groups are required to address > 64 threads. See SAP Note 1635387 - Windows Processor Groups. 
Processor Groups are required on most 4 socket servers (4 socket * 10 core * hyperthreading = 80 threads)
Applications and DBMS software must be Processor Group aware; otherwise the maximum number of threads the application or DBMS can address is limited to 64 and performance will be somewhat less than the H/W capability.  See section 2 of this note for further information.
Current status (August 2012) 
  1. SAP Kernel = no automatic processor group handling – see note 1635387
  2. SQL Server 2008 R2 and higher = processor group aware
  3. Oracle 11g = processor group support planned with patch 11.2.0.4
  4. Other DBMS = check with DBMS vendor for support status (MaxDB/Livecache, DB2, Sybase etc)

1.6. Performance Bottlenecks

1.6.1. Network

SAP 3 tier configurations require a very high performance, low latency and 100% reliable network connection between the SAP application server(s), the message server and the database.   
Large or busy systems strongly benefit from: 
  1. 10 Gigabit network
  2. A separate network for SAP application servers to communicate with the RDBMS
  3. Offload, SR-IOV, VM-FEX and parallelism features built into modern network cards and drivers. TCPIP v4 and v6 offload and Receive Side Scaling have been tested by Microsoft, HP and other vendors.  Contact Microsoft and/or H/W vendor for recommended NIC and drivers
  4. TCPIP Loopback (127.0.0.1) communication is single threaded and cannot be distributed over multiple threads with technologies such as RSS.  Some RDBMS and SAP instances may attempt to use loopback rather than shared memory by default
  5. The attached PDF file contains links with additional information about network topologies and configuration for Intel  systems

1.6.2. Memory

SAP and DBMS performance testing and customer deployments have shown that RAM is a determining factor in scalability. A modern Intel or AMD system with insufficient memory will be unable to run efficiently or achieve peak throughput.  SAP benchmarks provide an indication of the appropriate amount of RAM for a particular hardware configuration. Customers should use the H/W configurations published on the SAP benchmark website as guidance for how much RAM to specify. SAP Quicksizer also provides some guidance.  As of August 2012 the minimum RAM for a 2 socket Intel server should be 128GB.

1.6.3. Insufficient IO Performance

IO can be a significant performance bottleneck.  Common causes are insufficient LUNs, a single LUN presented to the hypervisor and partitioned into multiple drive letters, insufficient HBAs, and incorrectly configured MPIO software. Microsoft and SAN vendors can provide additional information on optimal IO configurations.

1.7. Energy Consumption

Customers are encouraged to compare the energy consumption of different H/W configurations.  In most cases it is observed that 8 socket systems use proportionately more energy than 2 socket systems.

2.  Summary of Physical Hardware Configurations

SAP benchmarks show several clear trends: 
  1. Total SAPS on Intel servers has increased significantly in recent years
  2. A substantial increase in SAPS per core on 2 socket Intel servers and a somewhat smaller increase on 4 socket and 8 socket Intel servers
  3. A significant but more moderate increase in SAPS per CPU thread.  Increase in SAPS per CPU thread (SCU) is most significant on 2 socket Intel servers
  4. Total number of cores and threads has increased dramatically.  Servers with 12 to 80 cores and 24 to 160 threads are available from most H/W vendors as of 2012.
OS Limitations:
Windows 2012 supports up to 640 threads* and 4TB RAM
Windows 2008 R2 supports up to 256 threads and 2TB RAM
Windows 2008 supports 64 threads and 2TB RAM
Hyper-V 3.0 (Windows 2012) supports 64 vCPU & 1TB RAM per Virtual Machine
Hyper-V 2.0 (Windows 2008 R2) supports 4 vCPU & 64GB RAM per Virtual Machine
VMware vSphere 4.x supports 8 vCPU and 255GB RAM per Virtual Machine
VMware vSphere 5.0 supports 32 vCPU and 1TB RAM per Virtual Machine
VMware vSphere 5.1 supports 64 vCPU and 1TB RAM per Virtual Machine
*thread = Sockets x cores per processor x Hyperthreading. 
4 socket x 10 core Intel server with Hyperthreading = 80 threads (e.g. HP DL580 G7, Dell R910)
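The thread formula in the footnote can be checked mechanically. A small sketch, assuming an SMT factor of 2 when Hyperthreading is enabled:

```python
def total_threads(sockets: int, cores_per_socket: int,
                  hyperthreading: bool = True) -> int:
    """Logical threads = sockets x cores per socket x 2 if Hyperthreading
    (SMT factor of 2) is enabled."""
    return sockets * cores_per_socket * (2 if hyperthreading else 1)

# 4 socket x 10 core server (e.g. HP DL580 G7): exceeds the 64-thread
# limit of one Windows processor group when Hyperthreading is enabled.
print(total_threads(4, 10))         # -> 80: Processor Groups required
print(total_threads(4, 10, False))  # -> 40: fits in a single group
```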
Balanced configurations tested and deployed at customer sites:  
  1. 2 socket Intel or AMD = 32,000-42,000 SAPS, 128-384GB RAM, 10G network and 2 x Dual Port HBA
  2. 4 socket Intel or AMD = 62,000-75,000 SAPS, 512GB-1TB RAM, 10G network and 2-4 Dual Port HBA
  3. 8 socket Intel = 130,000-140,000 SAPS, 1TB RAM or more, 10G network and 4-8 Dual Port HBA
 SAP ABAP & Java servers and DBMS software will perform well on configuration #1.
#1 has generally demonstrated best performance with simple configuration and tuning for most SAP applications relative to #2 and #3.
#2 may also be possible for ABAP & Java servers, though performance will not be as good as #1 and additional configuration is required.
Configuration #3 requires special expert configuration and tuning to run SAP application servers, or DBMS together with SAP application servers (with or without virtualization). The SAP ASCS/SCS is not a full application server and can run without problems on configuration #1, #2 or #3.
Configurations #1, #2 and #3 are suitable for modern DBMS software and will deliver nearly linear scalability with addition of CPU sockets. DBMS software running on 2, 4 and 8 socket servers with large amounts of memory will achieve very good scalability and performance without the need for complex configurations and tuning. 
SAP application server installation, configuration and tuning on 2 socket servers is simple and largely automatic.  2 socket servers have a high clock speed, high SAPS per thread/SCU, efficient energy consumption and demonstrate good NUMA characteristics.
Select 4 socket or higher servers if SAP sizing indicates that the DBMS layer requires capacity in excess of a 2 socket server (currently 32,000-42,000 SAPS), if additional availability and reliability features are required, or if many databases are consolidated onto a single server/cluster.
Note: on average 10-30% of the total SAPS resources is consumed by the DBMS layer. SAP application servers typically consume 70%+ of the overall CPU resources of most customer systems.  The SAP application server layer should be scaled out horizontally on 2 socket commodity servers or Virtual Machines. The PDF file attached to this note provides examples.

3.  Configuration of SAP ABAP Server on 4 socket & 8 or higher socket servers

3.1. Guidelines for running SAP ABAP server on 4 socket systems

4 socket servers can be configured to run SAP application servers if 2 socket servers are unavailable. Additional configuration and tuning is required. Knowledge of modern Processor technologies, NUMA, K-Groups and SAP profile parameters is required.
 Recommended Configuration Steps:  
  1. If possible, disable Hyperthreading to reduce the total number of threads below 64 so that all threads are in one K-Group. Note that total aggregate throughput will be significantly reduced compared to running with Hyperthreading enabled.
  2. If Hyperthreading remains enabled AND the server has more than 64 logical processors, implement Microsoft KB 2510206 as per note 1635387. This forces the Windows OS to create evenly sized K-Groups/Processor Groups.
  3. Implement NUMA affinity as detailed in SAP note 1667863
  4. Determine the amount of local memory per NUMA node(s) and size the SAP instance accordingly
  5. If Virtualization is configured on 4 socket systems please consult the Hypervisor vendor for further information, guidance and best practices regarding the configuration of non-NUMA aware applications on VMs

3.2. Guidelines for running SAP ABAP server on 8+ socket systems

Complex configuration and tuning is required to achieve good, stable and predictable performance on SAP application servers or DBMS together with SAP application servers (with or without virtualization) on 8 socket or higher servers.
SAP is unable to provide generalized documentation regarding 8 socket or higher configurations because: 
  1. Some hardware architectures only provide 4 QPI/HyperTransport links.  H/W configurations with > 4 sockets require specialized Hubs/Node controllers.  The implementation of > 4 socket servers differs significantly between the various hardware vendors
  2. Disabling hyperthreading is often insufficient to reduce the number of threads below 64, therefore K-Group configuration is generally required
  3. Placement of PCI HBA, NIC or SSD cards into an inappropriate PCI slot can have a dramatic impact on performance on some 8 socket systems
  4. The impact of device drivers, some backup software and some Anti-Virus software that was not designed for K-Groups, 8 socket servers with OEM designed Hubs/Node controllers and NUMA architectures is likely to be pronounced and significant
  5. Remote memory accesses are vastly more probable on 8 socket systems
  6. Total physical memory will be (most often evenly) distributed over 8 sockets which may lead to very little local memory per NUMA node
Hardware vendors are responsible for the specification, implementation, configuration and performance support of SAP application servers on 8 socket servers. Poor SAP application server performance on 8 socket servers should be referred to the hardware vendor.  Configuration, tuning and performance support of SAP application servers on 8 socket servers requires a Consulting engagement.
8 socket or higher servers offer excellent performance, reliability and scalability for DBMS software (or other software that is NUMA aware).  Typically there would be no need to engage expert consulting to install, configure and tune DBMS software on 8 socket servers.  It is generally recommended to obtain the latest “Best Practices” deployment guides from the relevant hardware vendor.  Hardware vendors will often provide a deployment guide for each specific DBMS. Standard readily available documentation is sufficient to deploy DBMS software on large 8 socket or higher systems. Provided only DBMS and (A)SCS software is installed the SAP Support procedures for 2, 4 or 8 socket servers are the same.

4.  Virtualization

4.1. Virtual platforms supported for Windows

Windows Hyper-V and VMware vSphere are both supported for SAP and documented in note 1409608.

4.2. Virtual CPU (vCPU)

Hyper-V and VMware vSphere both map each individual vCPU to one physical core/thread.  Hypervisors will try to run all vCPUs of a Virtual Machine on the same physical processor. This is only possible if the number of vCPUs is equal to or less than the number of cores on a physical processor. A server with 2 processors each with 8 cores would be able to run 8 vCPUs on a single processor. If the number of vCPUs was increased to 12, the hypervisor would run the VM across both processors.
Hypervisors can automatically “relocate” a VM from a busy processor socket to another processor that is less busy. Moving a VM from one NUMA node to another ultimately requires copying the entire memory context across the QPI or HyperTransport (AMD) links.  Frequent VM relocations are likely to impact overall system performance and the predictability of performance (sometimes a VM will run slowly, then run fast after a migration to another NUMA node).

4.3. Virtual RAM + Virtual NUMA (vRAM)

Hypervisors allocate vRAM from physical RAM.  SAP systems should not “overcommit” memory, meaning vRAM should be equal to or less than physical RAM for production systems.
If vRAM is larger than the physical RAM connected to one NUMA node or if the number of vCPU exceeds the number of cores on a single processor, the Virtual Machine will be performing Remote NUMA memory access (which is many times slower than Local access). 
Hyper-V 2.0 and VMware vSphere 4.x did not provide NUMA information to the Virtual Machines.  RDBMS software performance was therefore significantly decreased if vRAM > local NUMA memory or vCPU > cores on one single processor.
Hyper-V 3.0 and VMware vSphere 5.0 do provide NUMA topology information to the Virtual Machine. Both Hyper-V 3.0 and VMWare 5.0 greatly improve the alignment between VMs, the NUMA node, the vCPU and local memory.
The amount of Local NUMA memory (therefore the maximum vRAM before remote access occurs) is a function of Total RAM and number of processors.
  1. 2 socket with 8 cores each and 128GB RAM.  Each processor has 8 cores and 64GB local memory directly connected to one processor + 64GB remote memory.
  2. 4 socket with 8 cores each and 128GB RAM.  Each processor has 8 cores and 32GB local memory directly connected to one processor + 96GB remote memory.
  3. 8 socket with 8 cores each and 128GB RAM.  Each processor has 8 cores and 16GB local memory directly connected to one processor + 112GB remote memory.
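The three examples above all follow the same rule, sketched here under the assumption that RAM is distributed evenly across sockets and each socket forms one NUMA node:

```python
def numa_memory_split(total_ram_gb: int, sockets: int) -> tuple[float, float]:
    """Return (local, remote) memory in GB per NUMA node, assuming RAM is
    distributed evenly across sockets (one NUMA node per socket)."""
    local = total_ram_gb / sockets
    return local, total_ram_gb - local

for sockets in (2, 4, 8):
    local, remote = numa_memory_split(128, sockets)
    print(f"{sockets} socket: {local:.0f}GB local + {remote:.0f}GB remote")
```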
Partitioning large 4 socket or 8 socket servers into many Virtual Machines is unlikely to achieve good predictable and stable performance without expert knowledge and configuration. 2 socket servers with large amounts of physical memory (up to 768GB as of 2012) have shown consistent and predictable results running virtual workloads. The configuration and operation of 2 socket servers with large amounts of RAM is relatively simple.   Virtualization vendors provide additional documentation and recommendations on NUMA configurations and best practices.   
Virtualization software does not prevent NUMA induced latencies nor change the physical structure of the processor/memory layout. Modern Virtualization software may avoid remote memory communication if a Virtual Machine is equal to or smaller than the resources of one NUMA node.
Author:
Cameron Gardiner, Microsoft Corporation
Contact Person for questions and comments on this article:
cgardin@microsoft.com
Reviewer:
Karl-Heinz Hochmuth, SAP AG
Bernd Lober, SAP AG
Matthias Schlarb, VMware Global Inc.
Peter Simon, SAP AG
Jürgen Thomas, Microsoft Corporation

Keywords
Intel, AMD, NUMA, local memory, remote memory, single threaded, Zero Memory Management, ZMM, PHYS_MEMSIZE, em/max_size_MB, vCPU, vRAM, x64, 64 bit, sizing, Wintel, per thread performance, Multi-SID, consolidation, QPI, Hyperthreading, abap/heap_area, energy


Header Data

Released On 07.01.2013 09:01:41
Release Status Released to Customer
Component BC-OP-NT Windows
Priority Normal
Category How To
Operating System
WIN 2008 R2

Product
This document is not restricted to a product or product version
Attachments
File Name
File Size (KB)
Mime Type
282618
application/pdf
