| 
                   
                We
                see a very different picture for the Floating Point benchmarks,
                however. Only 3 out of 14 have a cache efficiency of 97% or
                higher for the 3.0 MByte cache. This becomes 5 out of 14
                for the 6.0 MByte cache. Numbers 4 and 5 are the
                "infamous" 179.art and 178.galgel. 179.art becomes
                entirely cache resident in the 6.0 MByte cache. 178.galgel
                has a memory footprint much larger than the cache. The re-use of
                data, however, makes it very efficient in a 6.0 MByte
                cache.
                  
                Most
                of these Scientific and Technical benchmarks operate on large
                data-structures. The re-use becomes very high if a significant
                part of the data structure fits into the cache. For
                instance: if the data-structure is a 3D-volume then it may be
                enough to hold three planes in the cache. This is because new
                values are often calculated from the directly surrounding
                points:
                  
                New
                Data [ i, j, k ]  = Function { Data [ i-1, j-1, k-1 ] .....
                Data [ i+1, j+1, k+1 ]   } 
                  
                The
                re-use of data then becomes 26 / 27 = 96%, where 27 is the
                number of points in the 3x3x3 sub-cube: only one new point needs
                to be loaded from memory when the 3x3x3 sub-cube moves a single
                position through the volume.
                  
                We
                see that 2 benchmarks are completely bandwidth starved at 1500
                MHz for a single processor: 171.swim and 172.mgrid.
                8
                out of 14 benchmarks become less cache efficient if we go from
                the 1000 MHz / 3 MB cache Itanium 2s to the newer 1500 MHz /
                6
                MB Itanium 2s. The Memory Controller bottleneck more than
                cancels the advantage of the larger caches.
                  
                If
                you're looking to use the Itanium 2 for large-scale scientific
                or technical calculations then you'll have to look at SGI with
                its proprietary memory controllers. SGI uses two SHUB Memory
                Controllers for every four processors. The fully loaded
                64-processor Altix 3000 has a 65% higher
                throughput than the half loaded 32-processor Altix 3000.
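
To put the 65% figure in perspective, it can be turned into a scaling efficiency: the achieved speedup divided by the ideal, linear speedup. The snippet below is a minimal sketch of that arithmetic. The term "scaling efficiency" and the function name are introduced here for illustration only; the inputs are the 65% throughput gain and the doubling from 32 to 64 processors quoted above.

```python
def scaling_efficiency(throughput_gain, processor_ratio):
    """Achieved speedup divided by the ideal (linear) speedup.

    throughput_gain: fractional gain in throughput, e.g. 0.65 for +65%.
    processor_ratio: factor by which the processor count grew, e.g. 2.0.
    """
    speedup = 1.0 + throughput_gain
    return speedup / processor_ratio


# 32 -> 64 processors with 65% more throughput: 1.65 / 2 = 0.825
altix_efficiency = scaling_efficiency(0.65, 2.0)
```

Doubling the processor count therefore retains roughly 82% of the ideal linear gain.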
                  
                The
                memory footprint of the SPEC2000 benchmarks is kept below 200
                MByte so that they can run on systems with 256 MByte of DRAM. Heavier
                applications using multi-Gigabyte data structures are likely to
                see much greater degradations. AMD's distributed memory solution
                based on HyperTransport links is likely to pay off in these cases.
                A four-processor 2200 MHz Opteron may reach a SPEC2000_rate
                performance similar to that of a four-way 1500 MHz Itanium 2, even
                though the latter has a much higher single processor
                score. Again, larger floating point memory footprints may
                skew the results even further.
                  
               |