We see a very different picture for the Floating Point benchmarks,
however. Only 3 out of 14 have a cache efficiency of 97% or
higher with the 3.0 MByte cache. This becomes 5 out of 14
with the 6.0 MByte cache. The fourth and fifth are the
"infamous" 179.art and 178.galgel. 179.art becomes
entirely cache resident in the 6.0 MByte cache. 178.galgel
has a memory footprint much larger than the cache, but its re-use
of data makes it very efficient in a 6 MByte cache.

Most of these Scientific and Technical benchmarks operate on large
data structures. The re-use becomes very high if a significant
part of the data structure fits into the cache. For
instance, if the data structure is a 3D volume, it may be
enough to hold three planes in the cache. This is because new
values are often calculated from the directly surrounding
points:
New Data [ i, j, k ] = Function { Data [ i-1, j-1, k-1 ] ... Data [ i+1, j+1, k+1 ] }
The re-use of data then becomes 26 / 27 = 96%, where 27 is the
number of points in the 3x3x3 sub-cube: only one new point needs
to be loaded from memory when the 3x3x3 sub-cube moves a single
position through the volume.

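The arithmetic behind the three-plane argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model of the Itanium 2 cache: it assumes 8-byte (double precision) values, counts 1 MB as 2^20 bytes, assumes the cache holds nothing but the three planes, and ignores cache-line granularity and associativity. The function names are made up for this example.

```python
from itertools import product


def stencil_offsets(radius=1):
    """All (di, dj, dk) offsets of a cubic stencil; 27 of them for radius 1."""
    span = range(-radius, radius + 1)
    return list(product(span, repeat=3))


def reuse_fraction(radius=1):
    """Share of stencil reads served from cache in the idealised case
    where only one point per step has to come from memory."""
    points = len(stencil_offsets(radius))
    return (points - 1) / points


def three_plane_bytes(n, bytes_per_value=8):
    """Bytes needed to hold three n x n planes of an n x n x n volume.
    ASSUMPTION: 8-byte values unless stated otherwise."""
    return 3 * n * n * bytes_per_value


def largest_volume_edge(cache_bytes, bytes_per_value=8):
    """Largest n for which three n x n planes still fit in cache_bytes.
    ASSUMPTION: the cache is used for nothing else."""
    n = 0
    while three_plane_bytes(n + 1, bytes_per_value) <= cache_bytes:
        n += 1
    return n


if __name__ == "__main__":
    print(f"re-use for a 3x3x3 stencil: {reuse_fraction():.1%}")
    for megabytes in (3, 6):
        edge = largest_volume_edge(megabytes * 1024 * 1024)
        print(f"{megabytes} MB cache: three planes fit up to n = {edge}")
```

Under these assumptions, three planes fit for volumes up to about 362 points per edge in a 3 MB cache and 512 points per edge in a 6 MB cache, which hints at why the step from 3 MB to 6 MB can move a benchmark from poor to very high re-use.
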
We see that 2 benchmarks are completely bandwidth starved at 1500
MHz for a single processor: 171.swim and 172.mgrid.

8 out of 14 benchmarks become less cache efficient if we go from
the 1000 MHz / 3 MB cache Itanium 2 to the newer 1500 MHz / 6 MB
Itanium 2. The Memory Controller bottleneck more than
cancels the advantage of the larger cache.

If you're looking to use the Itanium 2 for Large Scale Scientific
or Technical calculations, then you'll have to look at SGI with
its proprietary memory controllers. SGI uses two SHUB Memory
Controllers for every four processors. The fully loaded 64
processor Altix 3000 has a 65% higher
throughput than the half loaded 32 processor Altix 3000.

The memory footprint of the SPEC2000 benchmarks is kept below 200
MByte so that they can run on systems with 256 MByte of DRAM. Heavier
applications using multi-Gigabyte data structures are likely to
see much greater degradation. AMD's distributed memory solution
based on HyperTransport links is likely to pay off in these cases.
A four processor 2200 MHz Opteron may reach a SPEC2000_rate
performance similar to that of a four way 1500 MHz Itanium 2, even
though the latter has a much higher single processor
score. Again, larger floating point memory footprints may
skew the results even further.