Dell High Performance Computing Solution Resources Owner's manual

New vs. Old: Comparing Broadwell performance for CAE Applications across generations
Mayura Deshmukh, Ashish K Singh and Neha Kashyap, April 2016
With the refresh of Dell’s 13th generation servers with the recently released Broadwell (BDW) processors, an
obvious question is how the new processors compare with the older generation processors.
This blog, fourth in the series “Broadwell Performance for HPC”, focuses on answering this question. It compares
the performance of various CAE applications for five Broadwell Intel Xeon E5-2600 v4 series processor models with
previous generation Intel processors.
Last week’s blog discussed the impact of BIOS options on each of the CAE applications. Here we focus on how
much better the Broadwell processors perform than the previous generation Haswell (HSW)
and Ivy-bridge (IVB) processors for these CAE applications. Table 1 shows the applications that we are comparing
and Table 2 describes the server configuration used for the study. The LS-DYNA benchmarks run on IVB and
HSW (sse binary), and the ANSYS Fluent benchmarks run on Westmere (WSM), Sandy-bridge (SB), Ivy-bridge (IVB)
and HSW, used different software versions (the latest version available at the time) than those
listed in Table 1. The STAR-CCM+ and OpenFOAM versions were the same for the benchmarks run on HSW and BDW.
Table 1 - Applications and benchmarks

Application      Version     Metric                 MPI                    Benchmark
LS-DYNA®         8.0.0       Elapsed time           Platform MPI 9.1.0     car2car with endtime=0.02
STAR-CCM+®       10.04.011   Average elapsed time   Platform MPI 9.1.3     Civil_20m
                                                                           EglinStoreSeparation
                                                                           HlMach10
                                                                           Kcs
                                                                           LeMans_100M
                                                                           Lemans_17m
                                                                           Reactor9M
                                                                           TurboCharger
                                                                           Vtm
ANSYS® Fluent®   v16         Solver rating          Platform MPI 9.1.2.1   truck_poly_14m
OpenFOAM         2.4.0       Clock time             Open MPI 1.10.0        Motorbike 11M
Table 2 - Server configuration

Components         Details
Server             PowerEdge R630
Processor          2 x E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 30MB cache
                   2 x E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 35MB cache
                   2 x E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 40MB cache
                   2 x E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 50MB cache
                   2 x E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 55MB cache
Memory             256GB - 16 x 16GB 2400 MHz DDR4 RDIMMs
Hard drive         6 x 300GB SAS 6Gbps 10K rpm
RAID controller    PERC H330 mini
Operating System   Red Hat Enterprise Linux 7.2 (3.10.0-327.el7.x86_64)
BIOS options       System profile - Performance
                   Logical Processor - Disabled
                   Power Supply Redundant Policy - Not Redundant
                   Power Supply Hot Spare Policy - Disabled
                   I/O Non-Posted Prefetch - Disabled
                   Snoop Mode - Opportunistic Snoop Broadcast (OSB) for OpenFOAM and Cluster on Die
                   (COD) for all the other applications
                   Node interleaving - Disabled
BIOS               2.0.0
iDRAC Firmware     2.30.30.02
Figure 1 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models with HSW Intel
Xeon E5-2600 v3 series processors and the IVB E5-2680 v2 for the LS-DYNA car2car benchmark (with endtime set to 0.02).
Figure 1: IVB vs. HSW vs BDW for LS-DYNA
The performance of every processor is shown relative to the E5-2680 v2, which is the red baseline set at 1. The
green bars show the HSW processors with the LS-DYNA single precision sse binary, the grey bar
shows the HSW E5-2697 v3 with the single precision avx2 binary, the blue bars show the BDW
processors with the sse binary, and the orange bars show the BDW processors with the
avx2 binary. For BDW, the avx2 binaries perform 12-19% better than the sse binaries across all the
processor models. The purple diamonds show the performance per core relative to the E5-2680 v2. The
percentages at the top of the BDW avx2 orange bars show the improvement of each BDW processor
over the HSW E5-2697 v3 with avx2 (the grey bar in the graph). The 12-core BDW E5-2650 v4, which has fewer cores and a lower
frequency, understandably performs 11% lower than the Haswell E5-2697 v3. The 14-core E5-2690 v4,
which has the same number of cores and similar avx2 frequencies, performs 7% better than the E5-2697 v3; this can be
attributed to the higher memory bandwidth of Broadwell. BDW processors also measure better power
efficiency than Haswell processors. The 16-core, 20-core and 22-core processors perform 16% to 30%
higher than the HSW E5-2697 v3 (avx2). Considering performance, performance per core and the higher memory
bandwidth per core, the E5-2690 v4 14c and E5-2697A v4 16c look like attractive options for CAE/CFD codes,
particularly when considering per-core licensing costs.

Figure 1 data: LS-DYNA (car2car, endtime=0.02), IVB vs. HSW vs. BDW
(baseline = IVB E5-2680 v2 10c, 2.8 GHz, 115W, 1866 MT/s, set to 1.00)

Processor                                               Binary   Relative perf.   Relative perf. per core
HSW E5-2640 v3 8c, 2.6/2.2 GHz, 90W, 1866 MT/s          sse      0.84             1.05
HSW E5-2660 v3 10c, 2.6/2.2 GHz, 105W, 2133 MT/s        sse      1.04             1.04
HSW E5-2680 v3 12c, 2.5/2.1 GHz, 120W, 2133 MT/s        sse      1.21             1.01
HSW E5-2697 v3 14c, 2.6/2.2 GHz, 145W, 2133 MT/s        sse      1.38             0.99
HSW E5-2697 v3 14c, 2.6/2.2 GHz, 145W, 2133 MT/s        avx2     1.61             1.15
BDW E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 2400 MT/s        sse      1.16             0.96
BDW E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 2400 MT/s        sse      1.51             1.08
BDW E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 2400 MT/s       sse      1.63             1.02
BDW E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 2400 MT/s        sse      1.75             0.87
BDW E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 2400 MT/s        sse      1.83             0.83
BDW E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 2400 MT/s        avx2     1.43             1.19
BDW E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 2400 MT/s        avx2     1.73             1.23
BDW E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 2400 MT/s       avx2     1.88             1.17
BDW E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 2400 MT/s        avx2     2.01             1.00
BDW E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 2400 MT/s        avx2     2.10             0.95
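
The per-core diamonds in Figure 1 appear to follow from the bar values by scaling each processor's relative performance by the ratio of baseline cores to its own cores. The sketch below checks that reading against the BDW avx2 bars. The formula and the names `BASELINE_CORES`, `BDW_AVX2` and `per_core` are assumptions made for illustration, not something the authors state. The figure labels are rounded to two decimals, so the computed values agree only to within that rounding.

```python
# Sketch: reproduce the per-core diamonds in Figure 1 from the bar values.
# Assumption: per-core ratio = relative performance * baseline cores / cores.

BASELINE_CORES = 10  # the Figure 1 baseline, IVB E5-2680 v2, has 10 cores

# (processor, cores, perf. relative to E5-2680 v2, per-core label in the figure)
BDW_AVX2 = [
    ("E5-2650 v4", 12, 1.43, 1.19),
    ("E5-2690 v4", 14, 1.73, 1.23),
    ("E5-2697A v4", 16, 1.88, 1.17),
    ("E5-2698 v4", 20, 2.01, 1.00),
    ("E5-2699 v4", 22, 2.10, 0.95),
]


def per_core(relative_perf, cores, baseline_cores=BASELINE_CORES):
    """Scale a whole-processor performance ratio to a per-core ratio."""
    return relative_perf * baseline_cores / cores


for name, cores, perf, label in BDW_AVX2:
    computed = per_core(perf, cores)
    print(f"{name}: computed {computed:.3f}, figure label {label:.2f}")
```

The per-core ratio is the number that matters when software is licensed per core, because it estimates how much work each licensed core delivers.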
CD-adapco’s STAR-CCM+ is a CFD application widely used in industry for solving problems involving fluid
flows, heat transfer, and other phenomena. STAR-CCM+ shows performance patterns similar to those of LS-DYNA.
Figure 2: HSW vs BDW for STAR-CCM+
Figure 2 compares the performance of the five BDW Intel Xeon E5-2600 v4 series processor models (shown as the
five bars in the graph) with the HSW E5-2697 v3, shown as the red line set at one. The numbers at the top of the bars
show the per-core performance relative to the E5-2697 v3. As seen from the bars, the relative performance of the
14-core, 16-core, 20-core and 22-core processors is higher by 8% to 40% across all the benchmarks. The lower core
count, lower frequency 12-core E5-2650 v4 performs 11-20% lower than the E5-2697 v3. Similar to LS-DYNA, the per-core
performance of the 14-core and 16-core processors is 2% to 11% better than the HSW E5-2697 v3, making them good
options for STAR-CCM+ as well.

Figure 2 data labels: STAR-CCM+ benchmarks, HSW vs. BDW
(performance per core relative to the baseline HSW E5-2697 v3 14c, 2.6/2.2 GHz, 145W, 2133 MT/s, set to 1.00;
nine labels per processor, one per benchmark, listed in the order they appear in the figure)

E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 2400 MT/s    0.98 1.00 0.98 0.93 1.01 0.97 0.99 0.96 1.03
E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 2400 MT/s    1.08 1.11 1.11 1.10 1.09 1.08 1.10 1.10 1.10
E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 2400 MT/s   1.02 1.04 1.06 1.04 1.02 1.02 1.10 1.03 1.04
E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 2400 MT/s    0.88 0.87 0.91 0.87 0.87 0.89 0.94 0.88 0.90
E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 2400 MT/s    0.85 0.83 0.87 0.89 0.83 0.86 0.89 0.86 0.83
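
The whole-processor ranges quoted for STAR-CCM+ can be approximately recovered from the per-core data labels of Figure 2 by multiplying each label by the ratio of the processor's cores to the 14 cores of the baseline. The sketch below does this under two assumptions: that the 45 labels are grouped as nine per processor, and that the per-core label equals relative performance scaled by the core ratio. The names `PER_CORE_LABELS` and `overall` are illustrative. Because the labels are rounded to two decimals, the recovered ranges can differ from the quoted percentages by about a point.

```python
# Sketch: recover whole-processor relative performance from per-core labels.
# Assumption: overall ratio = per-core ratio * cores / baseline cores.

HSW_CORES = 14  # the Figure 2 baseline, HSW E5-2697 v3, has 14 cores

# {(processor, cores): per-core labels read from Figure 2, one per benchmark}
PER_CORE_LABELS = {
    ("E5-2650 v4", 12): [0.98, 1.00, 0.98, 0.93, 1.01, 0.97, 0.99, 0.96, 1.03],
    ("E5-2690 v4", 14): [1.08, 1.11, 1.11, 1.10, 1.09, 1.08, 1.10, 1.10, 1.10],
    ("E5-2697A v4", 16): [1.02, 1.04, 1.06, 1.04, 1.02, 1.02, 1.10, 1.03, 1.04],
    ("E5-2698 v4", 20): [0.88, 0.87, 0.91, 0.87, 0.87, 0.89, 0.94, 0.88, 0.90],
    ("E5-2699 v4", 22): [0.85, 0.83, 0.87, 0.89, 0.83, 0.86, 0.89, 0.86, 0.83],
}


def overall(per_core_ratio, cores, baseline_cores=HSW_CORES):
    """Convert a per-core ratio back to a whole-processor ratio."""
    return per_core_ratio * cores / baseline_cores


# Lowest and highest whole-processor ratio per processor across the benchmarks
ranges = {
    name: (min(overall(v, cores) for v in labels),
           max(overall(v, cores) for v in labels))
    for (name, cores), labels in PER_CORE_LABELS.items()
}

for name, (low, high) in ranges.items():
    print(f"{name}: {low:.2f} to {high:.2f} relative to the E5-2697 v3")
```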
ANSYS Fluent is a computational fluid dynamics application. The graph in Figure 3 shows the performance of
truck_poly_14m for Sandy-bridge (SB), Ivy-bridge (IVB), HSW and BDW processors compared to the Westmere
(WSM) processor, shown as the red line set at one.
Figure 3: WSM vs. SB vs. IVY vs. HSW vs. BDW for ANSYS Fluent
The Fluent benchmark exhibits a pattern similar to the LS-DYNA and STAR-CCM+ benchmarks. The purple diamonds in
Figure 3 show the performance per core relative to the WSM 2.93GHz processor. The percentages at the top of
the BDW blue bars show the improvement of each BDW processor over the HSW E5-2697 v3 (the green bar in
the graph). The 12-core BDW E5-2650 v4, which has fewer cores and a lower frequency, performs 14% lower than the
Haswell E5-2697 v3. With higher performance per core and higher memory bandwidth per core, the
E5-2690 v4 14c and E5-2697A v4 16c are good options, particularly when considering per-core software licensing costs,
and perform 11% and 21% better than the E5-2697 v3, respectively. The 20-core and 22-core BDW processors perform
32% to 39% better than the HSW E5-2697 v3.
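
The percentages quoted against the HSW E5-2697 v3 can be derived from the WSM-relative bar values in Figure 3 by dividing each BDW value by the HSW value. A minimal sketch of that rebasing follows. The names `FLUENT_VS_WSM`, `HSW_VS_WSM` and `percent_change` are illustrative and not part of the original study.

```python
# Sketch: rebase the Figure 3 bars from the WSM baseline to the HSW E5-2697 v3.

HSW_VS_WSM = 3.27  # HSW E5-2697 v3 bar, relative to WSM

# BDW bars from Figure 3, relative to the WSM 2.93 GHz 6c baseline
FLUENT_VS_WSM = {
    "E5-2650 v4 12c": 2.82,
    "E5-2690 v4 14c": 3.62,
    "E5-2697A v4 16c": 3.96,
    "E5-2698 v4 20c": 4.32,
    "E5-2699 v4 22c": 4.53,
}


def percent_change(value, reference):
    """Percentage by which value is above (+) or below (-) reference."""
    return (value / reference - 1.0) * 100.0


vs_hsw = {name: round(percent_change(bar, HSW_VS_WSM))
          for name, bar in FLUENT_VS_WSM.items()}

for name, pct in vs_hsw.items():
    direction = "better" if pct >= 0 else "lower"
    print(f"{name}: {abs(pct)}% {direction} than the HSW E5-2697 v3")
```

The common WSM baseline cancels in the division, so the comparison stays valid across generations whose results were originally normalized to an older processor.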
Figure 3 data: Fluent truck_poly_14m, WSM vs. SB vs. IVB vs. HSW vs. BDW
(baseline = 11G WSM 2.93 GHz, 6c, 1333 MT/s, set to 1.00)

Generation   Processor                                       Perf. relative   Relative perf.   vs. HSW
                                                             to WSM           per core         E5-2697 v3
12G - SB     E5-2680 8c, 2.7 GHz, 130W, 1333 MT/s            1.76             1.32
12G - IVB    E5-2680 v2 10c, 2.8 GHz, 115W, 1866 MT/s        2.28             1.37
12G - IVB    E5-2697 v2 12c, 2.7 GHz, 130W, 1866 MT/s        2.60             1.30
13G - HSW    E5-2697 v3 14c, 2.6/2.2 GHz, 145W, 2133 MT/s    3.27             1.40
13G - BDW    E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 2400 MT/s    2.82             1.41             14% lower
13G - BDW    E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 2400 MT/s    3.62             1.55             11%
13G - BDW    E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 2400 MT/s   3.96             1.48             21%
13G - BDW    E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 2400 MT/s    4.32             1.30             32%
13G - BDW    E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 2400 MT/s    4.53             1.24             39%

OpenFOAM (Open source Field Operation And Manipulation) is free, open source software for computational fluid
dynamics (CFD).
Figure 4: HSW vs. BDW for OpenFOAM Motorbike 11M benchmark
As shown in Figure 4 for the OpenFOAM Motorbike 11M benchmark, all the Broadwell processors perform 12% to
21% better than the Haswell E5-2697 v3 processor, shown as the red line set at one. Per-core performance for the
16-core, 14-core and 12-core processors is 4% to 30% better than the E5-2697 v3. The performance of the 20-core and
22-core BDW processors is the same for the Motorbike 11M benchmark. The increase in the number of cores does not
provide a significant performance boost for the 20-core and 22-core parts, likely due to lower memory bandwidth per core, as
explained in the first blog’s STREAM results.
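
The memory-bandwidth-per-core argument can be illustrated with theoretical peak numbers. The sketch below assumes four DDR4 channels per socket and 8 bytes per transfer per channel, which is the standard configuration for the E5-2600 v4 family, and all channels running at 2400 MT/s as in Table 2. These are theoretical peaks, not the measured STREAM results from the first blog, which are not reproduced here. The names `peak_bandwidth_gbs` and `BDW_CORES` are illustrative.

```python
# Sketch: theoretical peak memory bandwidth per core for the five BDW parts.
# Assumptions: 4 DDR4 channels per socket, 8 bytes per transfer per channel,
# all channels at 2400 MT/s. Measured (STREAM) bandwidth will be lower.

CHANNELS_PER_SOCKET = 4
BYTES_PER_TRANSFER = 8
MEMORY_SPEED_MTS = 2400

BDW_CORES = {
    "E5-2650 v4": 12,
    "E5-2690 v4": 14,
    "E5-2697A v4": 16,
    "E5-2698 v4": 20,
    "E5-2699 v4": 22,
}


def peak_bandwidth_gbs(speed_mts, channels=CHANNELS_PER_SOCKET):
    """Theoretical peak bandwidth of one socket in GB/s (1 GB = 1e9 bytes)."""
    return speed_mts * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


socket_bw = peak_bandwidth_gbs(MEMORY_SPEED_MTS)
bw_per_core = {name: socket_bw / cores for name, cores in BDW_CORES.items()}

print(f"Peak per socket: {socket_bw:.1f} GB/s")
for name, bw in bw_per_core.items():
    print(f"{name}: {bw:.2f} GB/s per core")
```

Every part shares the same per-socket peak, so the per-core share shrinks as cores are added. That is consistent with the 20-core and 22-core parts gaining little on a bandwidth-sensitive case like Motorbike 11M.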
Conclusion
Along with more cores than HSW, BDW measures better power efficiency than HSW. Looking at the absolute
performance, performance per core and the higher memory bandwidth per core, the E5-2690 v4 14c and E5-2697A v4
16c are attractive options for CAE/CFD codes, particularly if per-core licensing costs are involved. For applications like
OpenFOAM (Motorbike case), all the BDW processors performed better than the Haswell E5-2697 v3, but the increase in the
number of cores does not provide a significant performance boost for the 20-core and 22-core parts due to lower memory
bandwidth per core.

Figure 4 data: HSW vs. BDW, OpenFOAM Motorbike 11M
(baseline = HSW E5-2697 v3 14c, 2.6/2.2 GHz, 145W, 2133 MT/s, set to 1.00)

Processor                                       Relative perf.   Relative perf. per core
E5-2650 v4 12c, 2.2/1.8 GHz, 105W, 2400 MT/s    1.12             1.30
E5-2690 v4 14c, 2.6/2.1 GHz, 135W, 2400 MT/s    1.17             1.17
E5-2697A v4 16c, 2.6/2.2 GHz, 145W, 2400 MT/s   1.19             1.04
E5-2698 v4 20c, 2.2/1.8 GHz, 135W, 2400 MT/s    1.21             0.85
E5-2699 v4 22c, 2.2/1.8 GHz, 145W, 2400 MT/s    1.21             0.77