U.S. Bancorp Piper Jaffray's analyst Ashok Kumar
published his report on Intel today in which he apparently mentions
170 mm2 as Willamette's die size. It's
well known that Ashok has a very
friendly relationship with Intel
so we take this very seriously.
A die size of 170 mm2
means that the actual core size (excluding the 256 kbyte L2 cache)
would be on the order of 140 mm2. This
is twice the size of the Coppermine core (70 mm2).
This is larger than most estimates. We previously estimated a 50-60% larger
core (110 mm2) based on the amount by which
the individual units on the P6 die photograph were supposed to grow. We
estimate that this large die size would be equivalent to that of
a Mustang with 1 Megabyte of on-chip L2 cache.
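The arithmetic behind these figures can be written out as a quick check.
This is only a sketch of our own reasoning, not something taken from the
analyst report: the 30 mm2 for the L2 cache is an assumption implied by
the 170 and 140 mm2 numbers above, and the function name is purely
illustrative.

```python
# Rough die-area bookkeeping for the figures quoted in the text.
# ASSUMPTION: the 256 kbyte L2 cache takes about 30 mm2, which is simply
# the difference between the 170 mm2 die and the 140 mm2 core estimate.

def core_area_mm2(die_mm2: float, l2_mm2: float) -> float:
    """Core area is what is left of the die once the L2 cache is removed."""
    return die_mm2 - l2_mm2

WILLAMETTE_DIE_MM2 = 170.0   # figure attributed to the analyst report
ASSUMED_L2_MM2 = 30.0        # assumption, see above
COPPERMINE_CORE_MM2 = 70.0   # Coppermine core size quoted in the text
EARLIER_ESTIMATE_MM2 = 110.0 # our previous core estimate

core = core_area_mm2(WILLAMETTE_DIE_MM2, ASSUMED_L2_MM2)
ratio_to_coppermine = core / COPPERMINE_CORE_MM2
earlier_growth = EARLIER_ESTIMATE_MM2 / COPPERMINE_CORE_MM2 - 1.0

print(f"core area: {core:.0f} mm2")
print(f"ratio to Coppermine core: {ratio_to_coppermine:.1f}x")
print(f"growth implied by earlier estimate: {earlier_growth:.0%}")
```

The last line shows that 110 mm2 corresponds to a core roughly 57% larger
than Coppermine, which is where the 50-60% range comes from.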
Why should the Willamette die be this big?
Here are just a few possibilities:
A much larger Trace Cache:
The Trace Cache needs significantly more bits
to store the (decoded) instructions: About six times as much as for
the un-decoded instructions. Willamette would need about 400 kbyte of SRAM
to store the same number of instructions as the Athlon stores in its 64
kbyte L1 instruction cache. This amount of storage, however, doesn't seem
to be necessary with a closely coupled on-chip L2 cache.
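As a rough check on the 400 kbyte figure, the scaling can be sketched as
follows. The factor of six is the estimate given above, not a published
Intel number, and the function name is our own.

```python
# Storage needed to hold decoded instructions in a trace cache, scaled
# from an ordinary instruction cache holding un-decoded x86 instructions.
# ASSUMPTION: decoded micro-ops need about six times as many bits.

def trace_cache_kbyte(icache_kbyte: int, expansion_factor: int) -> int:
    """Equivalent trace cache size for the same number of instructions."""
    return icache_kbyte * expansion_factor

ATHLON_L1I_KBYTE = 64   # Athlon L1 instruction cache, from the text
EXPANSION_FACTOR = 6    # estimate from the text

needed = trace_cache_kbyte(ATHLON_L1I_KBYTE, EXPANSION_FACTOR)
print(f"equivalent trace cache: {needed} kbyte")  # 384, i.e. about 400
```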
A larger part of the pipeline is running at
A number of "twin" pipeline stages such as "rename-rename",
which can be difficult to "hyper-pipeline", may
actually be single stages with twice the amount of logic running at half
the frequency. The Trace Cache for instance is known to emit six micro-ops
at a rate of 750 MHz
into the three-way super-scalar pipeline which
is supposed to run at 1.5 GHz. A three-way super-scalar pipeline running
at 1.5 GHz is largely equivalent in functionality and performance to
a six-way super-scalar pipeline running at 750 MHz. Stages of the pipeline
running at 750 MHz may be easier to design but need about twice the logic.
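The equivalence is just width times clock. A minimal sketch of that
bookkeeping (peak rates only, ignoring latency and dependency effects;
the function name is illustrative):

```python
# Peak micro-op throughput of a pipeline stage: issue width x clock.
# Clocks are kept in MHz as integers so the comparison is exact.

def peak_uops_per_us(width: int, clock_mhz: int) -> int:
    """Micro-ops per microsecond at peak (width x MHz)."""
    return width * clock_mhz

narrow_fast = peak_uops_per_us(3, 1500)  # three-way at 1.5 GHz
wide_slow = peak_uops_per_us(6, 750)     # six-way at 750 MHz

print(narrow_fast, wide_slow)            # both 4500, i.e. 4.5 G micro-ops/s
print("equivalent peak rate:", narrow_fast == wide_slow)
```

The wide, slow variant reaches the same peak rate with relaxed timing,
at the price of roughly double the logic per stage.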
On-chip L3 Cache Tags for Foster:
Willamette will not have external L3 cache memory
but Foster will. A commonly used practice to avoid the difficult problem
of how to split the production between processors of type A and type B
is simply to produce a processor which can be sold as type A and
as type B. A good example is the 180 nm Pentium III which is sold as Celeron
II with half of its 256 kbyte cache disabled. If Intel can justify this
for a processor at the value end then it can certainly justify this for
a high end processor. It is thus not unlikely that Willamette and Foster
will have the same die. Foster should be able to handle 2 Megabyte of external,
full speed, L3 cache SRAM. The cache tags needed are estimated to occupy
circa 10 mm2 with 128 byte cache lines.
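To see why the line size matters for the tag area, the number of tags
and tag bits can be counted as below. This does not reproduce the 10 mm2
estimate; it only shows how the tag storage scales. The 36-bit physical
address, the 8-way associativity and the 2 state bits per line are all
hypothetical values chosen for illustration, not known Foster parameters.

```python
# Count the tag entries and tag bits for an external cache.
# ASSUMPTIONS (hypothetical): 36-bit physical addresses, 8-way set
# associative, 2 state bits per line. Sizes must be powers of two.

def tag_array(cache_bytes: int, line_bytes: int, ways: int,
              addr_bits: int, state_bits: int) -> tuple[int, int]:
    """Return (number of lines, total tag-array bits)."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2 of line size
    index_bits = sets.bit_length() - 1          # log2 of number of sets
    tag_bits = addr_bits - index_bits - offset_bits
    return lines, lines * (tag_bits + state_bits)

L3_BYTES = 2 * 1024 * 1024  # 2 Megabyte external L3, from the text

lines_128, bits_128 = tag_array(L3_BYTES, 128, 8, 36, 2)
lines_64, bits_64 = tag_array(L3_BYTES, 64, 8, 36, 2)

print(f"128 byte lines: {lines_128} tags, {bits_128 // 8 // 1024} kbyte")
print(f" 64 byte lines: {lines_64} tags, {bits_64 // 8 // 1024} kbyte")
```

Under these assumptions, halving the line size doubles the tag storage,
which is why the estimate is tied to 128 byte lines.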
Single cycle 128 bit SSE2:
The P6 executes the 128 bit SSE operations by
issuing two 64 bit micro-ops. We believe that the hardware in the Willamette
also handles 128 bit operations in two 64 bit cycles. Willamette's software
guide mentions that "A few units accept an instruction once every cycle".
Intel has also completely refrained from using those nice Giga-flops numbers
which should result from handling four 32 or two 64 bit floating point
multiplications or additions per cycle at a speed of 1.5 GHz. A single
cycle 128 bit throughput would require extra die real estate. It is doubtful
whether this extra hardware would translate into much higher performance, however.
The small number of eight 128 bit XMM registers in combination with the
long latency of the floating point operations would limit the number of
instructions which can be executed in parallel. The single load port would
throttle the performance for data-independent code which needs to load
operands from memory.
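The Giga-flops numbers in question are easy to work out. The sketch below
gives peak figures for a single multiply or add unit only, comparing a
single cycle 128 bit unit with one that takes two 64 bit cycles per
operation; the function name and the one-unit simplification are ours.

```python
# Peak floating point rate of one SSE2 execution unit:
# lanes per 128 bit operation x clock / cycles needed per operation.

def peak_gflops(lanes: int, clock_ghz: float, cycles_per_op: int) -> float:
    """Peak Giga-flops for one unit (ignores latency and load limits)."""
    return lanes * clock_ghz / cycles_per_op

CLOCK_GHZ = 1.5  # Willamette clock assumed in the text

# Single cycle 128 bit throughput
single_sp = peak_gflops(4, CLOCK_GHZ, 1)  # four 32 bit operations
single_dp = peak_gflops(2, CLOCK_GHZ, 1)  # two 64 bit operations

# 128 bit operations handled in two 64 bit cycles
split_sp = peak_gflops(4, CLOCK_GHZ, 2)
split_dp = peak_gflops(2, CLOCK_GHZ, 2)

print(f"single cycle: {single_sp} / {single_dp} Gflops (32 / 64 bit)")
print(f"two cycles:   {split_sp} / {split_dp} Gflops (32 / 64 bit)")
```

These are upper bounds; the register count, latency and single load port
mentioned above would keep real code well below them.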
We'll use the 170 mm2 figure
in our report; see version 0.83.