Considering Nehalem

Audience: Seasoned IT Professionals
Read Time: 10 Min

There has been a lot of buzz about the new Nehalem processor and chipset family from Intel, and about how to optimize its memory configuration for speed. I had a lengthy discussion with some colleagues at work on how best to organize the RAM to get the most performance from Nehalem's 5500 chipset, and I wanted to share some of what I found while researching.

The Nehalem uses an integrated memory controller on each CPU, with 3 channels and 3 DIMM slots on each channel. Most dual-CPU servers therefore have 2 memory controllers (one per CPU) and 6 channels with 3 DIMM slots per channel, for a total of 18 slots. That leaves many, many options for configuring a server. My discussion with my colleagues today was about the HP BL490c G6.

Here is a great graphic from HP that shows how the Nehalem is organized; it includes parts of the BL490c G6:

[Figure: 5500bl490 (HP diagram of the Nehalem platform)]

With the Nehalem, I prefer not to fully populate the memory slots, because memory bandwidth is higher when fewer DIMMs sit on each channel. With all 18 DIMM slots in use, the memory bus speed drops to 800 MHz, compared with the fastest speed of 1333 MHz when only a single slot per channel is occupied. Dropping from 1333 MHz to 800 MHz means almost 40% less peak memory throughput, which is a considerable loss.
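As a quick check on the "almost 40%" figure, here is a minimal Python sketch. It assumes that peak theoretical memory bandwidth scales linearly with the bus speed, which is a simplification; real workloads will see a different (usually smaller) difference. The function name is made up for illustration.

```python
# Assumption: peak theoretical bandwidth is proportional to the memory bus speed,
# so the relative loss is simply the relative drop in speed.
FAST_MHZ = 1333  # one DIMM per channel (figure from the text)
SLOW_MHZ = 800   # all three slots per channel populated (figure from the text)


def relative_drop(fast_mhz: float, slow_mhz: float) -> float:
    """Fraction of peak bandwidth lost when the bus drops from fast_mhz to slow_mhz."""
    return (fast_mhz - slow_mhz) / fast_mhz


drop = relative_drop(FAST_MHZ, SLOW_MHZ)
print(f"Peak bandwidth lost: {drop:.1%}")
```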

Now I want to talk about all of this in relation to virtualization with VMware. In this scenario I am going to use 8 GB dual-ranked registered ECC DIMMs and the 95 W versions of the Nehalem*, but the concepts can be applied to smaller or larger DIMMs. The generally accepted rule of thumb is 2 vCPUs per core (not a limit, just a good starting point when planning). So, assuming 1 vCPU per VM (statistically, most VMs only require a single vCPU), two quad-core CPUs give a starting point of 16 machines on each blade. With 2 GB of RAM each (again, an average starting point), 32 GB of RAM would be required. In an ideal situation, however, I would double that to 64 GB so that physical hosts run at 50% and can absorb the load if a host fails.
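The sizing arithmetic above can be sketched in a few lines of Python. This is only a planning aid under the stated rules of thumb; the quad-core count is an assumption implied by the 16-VM figure, and the names `HostPlan` and `plan_host` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostPlan:
    vms: int                   # VMs the host is planned to run
    ram_needed_gb: int         # RAM for those VMs alone
    ram_with_headroom_gb: int  # RAM so the host sits at the target utilization


def plan_host(sockets: int = 2,
              cores_per_socket: int = 4,   # assumption: quad-core CPUs
              vcpus_per_core: int = 2,     # rule of thumb from the text
              vcpus_per_vm: int = 1,
              gb_per_vm: int = 2,
              target_utilization: float = 0.5) -> HostPlan:
    """Estimate VM count and RAM for one host from simple rules of thumb."""
    total_vcpus = sockets * cores_per_socket * vcpus_per_core
    vms = total_vcpus // vcpus_per_vm
    ram_needed = vms * gb_per_vm
    # Running at 50% utilization means provisioning twice the RAM needed.
    ram_with_headroom = round(ram_needed / target_utilization)
    return HostPlan(vms, ram_needed, ram_with_headroom)


plan = plan_host()
print(plan)
```

Changing `target_utilization` shows how the extra RAM shrinks as the acceptable load per host goes up, for example in a larger cluster.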

One issue with using 64 GB (Scenario 1) is that it gives a non-uniform memory distribution across memory channels**: 8 DIMMs is 4 per CPU, which cannot be spread evenly over 3 channels. The next logical move would be to 80 GB of memory (Scenario 2), which is 10 × 8 GB memory modules, still allowing for expansion and still giving us a bus speed above 800 MHz. This again presents an issue: 80 GB also creates a non-uniform memory distribution across CPUs and channels (80 GB / 2 CPUs / 3 channels comes to about 13.3 GB per channel, which is not a whole number of 8 GB DIMMs), which is less than ideal. In the end (Scenario 3), one 8 GB DIMM per channel per CPU gives 48 GB in total, 24 GB per CPU and 6 GB per core, and uses the fastest bus speed. This is probably more than adequate for most designs, because the next uniform step up would be 96 GB, which is probably overkill for most environments. The 50% headroom also becomes less necessary as more blades are added to the cluster, because the load from a single node failure can be spread across more hosts.
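To make the uniformity argument concrete, here is a rough Python sketch that spreads a given number of 8 GB DIMMs across 2 CPUs and 3 channels per CPU and reports whether every channel ends up with the same count. The 1333 MHz and 800 MHz values come from the text; the 1066 MHz value for two DIMMs per channel, and the rule that the most heavily populated channel sets the speed, are my assumptions and should be checked against the vendor documentation. All function names are hypothetical.

```python
# Bus speed by DIMMs per channel. 1333 and 800 are from the text;
# 1066 for two DIMMs per channel is an assumption, check vendor docs.
SPEED_BY_DPC = {1: 1333, 2: 1066, 3: 800}


def per_socket_layout(total_gb, dimm_gb=8, sockets=2, channels=3, slots_per_channel=3):
    """Return DIMM counts per channel for one CPU, or None if not buildable."""
    dimms, leftover = divmod(total_gb, dimm_gb)
    if leftover or dimms == 0 or dimms % sockets:
        return None  # not a whole number of DIMMs, or uneven across CPUs
    per_socket = dimms // sockets
    if per_socket > channels * slots_per_channel:
        return None  # more DIMMs than slots
    base, extra = divmod(per_socket, channels)
    # Fill channels as evenly as possible; the first `extra` channels get one more.
    return [base + 1 if i < extra else base for i in range(channels)]


def describe(total_gb):
    layout = per_socket_layout(total_gb)
    if layout is None:
        return f"{total_gb} GB: not buildable with this DIMM size"
    balanced = len(set(layout)) == 1
    # Assumption: the most heavily populated channel sets the bus speed.
    speed = SPEED_BY_DPC[max(layout)]
    state = "uniform" if balanced else "non-uniform"
    return f"{total_gb} GB: {layout} per CPU, {state}, {speed} MHz"


for total in (48, 64, 80, 96, 144):
    print(describe(total))
```

Under these assumptions, only 48 GB, 96 GB, and 144 GB come out uniform with 8 GB DIMMs, which matches the reasoning above.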

Here are a few tables summarizing that last paragraph so it makes more sense:

[Table image: nehalem_memory]

Scenarios 3 and 4 illustrate how to get to 96 GB in the interest of cost savings and speed, respectively. I would consider either of these scenarios acceptable; I just want to stay away from the 800 MHz speed because of the large performance loss. There really isn't a wrong way to go about implementing servers based on Nehalem. It just requires some forethought to achieve the best results for each organization's goals. Both Intel and VMware (as well as most server vendors) have published best-practice white papers on this subject, so be sure to check those as well. I would be interested in hearing about anyone's experiences with Nehalem, and about anything that I may have missed. Leave a comment or drop me a note!


* HP does offer the BL490c with a 1333 MHz bus using 6 DIMMs, but this was not the original design of the chipset. The 95 W processors are required to achieve the 1333 MHz bus speed.
** It is desirable to have uniform distribution across memory channels and CPU cores because the hypervisor then needs less calculation to place workloads. This is also much more efficient for the hardware as a whole.

Date: Monday 24 Aug, 2009

All content (c) 2009+ BeyondVM, LLC