March 26,
2003
The clues for Yamhill
(
by Hans de Vries )
64
bit processing using twin 32 bit cores
Three
clues for Yamhill seem to provide substantial prove.
The
industry has speculated a year now on the existence of 64 bit extensions
to the x86 ISA in Intel's future 90 nm processor codenamed Prescott. We
could show in our March 6 article that Prescott contains two instead of
one 32 bit integer execution cores. The question arises for the purpose
of such a second core? In fact there are many different
possibilities: Use it to run a separate trace to improve hyper threading.
Use it to check the results of the first core (IBM has a processor that
does just this). And of course, Yamhill, is just one of them. searching
for clues we started looking at the highest resolution die-plot of the
Pentium 4 we could find and try if we could make some sense of all these
little artificial colored rectangles and lines. (The photo shows 5
micrometer details) We made progress, studied code optimization manuals
for clues, Went through all the presentations, then looked at Pentium 4
related patents from known P4 architects, made more progress, gained
confidence and started to write an article about the Integer execution
core with the die photo as the visual base. This long article will be
published in the near future. For now we have stumbled on a number of
clues that seem to provide substantial prove for the existence of
Yamhill. If (or when) it will be enabled is a different question. They
might even call it the Pentium 6.... (Tejas = 7, Nehalem = 8)
(
Edit, March 29,2003: The rumors
are that it will be enabled in Potomac.
The
MP version for systems with more than 2
processors in late 2004 )
And
then now the clues, They are handled in more in detail later in the
article.
|
Clue
1: The second Integer Unit has no AGU's (Fast double
clocked Address Generator Units)
This
unit provides the address bits 32 and higher. We will show that there is
no need to provide these bits very fast
in
the NetBurst Architecture with its replay capabilities. nor do we need
all bits 32 through 63 A virtual address size of 40 or 48 bits
would be sufficient for the time being. (It's 48 bits in the first
implementation of the Hammer family)
|
Clue
2: The second Integer Unit register file has a
smaller size, 1.30 x 0.64 mm versus 1.30 x 0.71 mm
The
(renamed) register file of the Pentium 4 has 128 entries for 32 bit data
plus 6 bit status flags. We could show that Prescott has two 256 entry
register files. The width of the two is equal meaning that they have
the same number of entries. The height of the second one is however less,
indicating that is has less data bits per entry. We presume that it has all its
32 data bits but that the 6 status flags are lacking. A 64 bit processor
needs only one set of status flags per 64 bit word. This
clue also implies that the second core can not be used to run an
independent 32 bit thread.
|
Clue
3: The data caches have been shifted in order to
balance a critical path in 64 bit processing
The
first core has to provide the address bits for the data caches of both
cores. Most critical in Northwood are bits 6..11 that select one
of 32 cache lines in a 2k page and bits 12..16 that are used to
predict which of the 4 ways contains the cache line ( 4 x 2kByte =
8 kByte cache size ). These paths should be as short as possible.
Going from one core to another introduces a long path for this critical
signal. However, it turns out that the path to both caches are equal in
length. They managed to do this by shifting both caches upwards. (see second
image below)
|
Pentium
5 improvements over Pentium 4
A
list of improvements we found on the die until now.
Only
two of them are officially disclosed by Intel.
(
so it's all unofficial )
|
Specifications
and
Enabled
Features
|
Pentium
4
Northwood
Prestonia
(DP)
Gallatin
(MP) |
Pentium
5
Prescott
(Q4-03)
Nocona
(DP) (Q4-03)
Potomac
(MP) (H2-04)
|
Data
Width |
32
bit |
Prescott
32 bit
Nocona
32 bit
Potomac
64
bit |
Logical
Processors
(number
of threads) |
Northwood:
1,2
Prestonia:
2
Gallatin:
2 |
Prescott:
2
Nocona:
2?
Potomac:
4 |
L1
Data Cache |
8
kByte |
Prescott
16
kByte
Nocona
32 16 kByte
Potomac
32 kByte |
Instruction
Trace Cache |
12
k uOps |
16
k uOps |
Trace
Cache Bandwidth |
3
uOps/cycle |
4
uOps/cycle |
L2
Unified Cache |
512
kByte |
1024
kByte |
Instructions
in Flight |
126 |
256
128
|
Integer
Register File |
128
x 32 bit |
256
x 64 bit |
Floating
Point Register File |
128 x 128 bit |
256 x 128 bit |
Load Buffer |
48 entries |
96 entries |
Store Buffer |
24 entries |
48 entries |
(updated
May 7, 2003)
|
The
Image below will be featured in a coming article that will zoom in on
all the individual units and discuss them in detail
Must
be very interesting for all the assembly level programmers to see how
all their instructions fly around through the architecture. You can
click here for a higher resolution version.
|
|
The
next image below visualizes all three clues:
1)
The missing AGU's in the second core.
2)
The second register file with less bits per entry.
3)
The balanced timing critical load address to cache paths.
|
|
The
high address bits 32 and higher don't need to be calculated fast
I
said that I would explain that, however that will become much too
technical for now. Lets put things in a table, say for instance for a
Northwood Memory Execution Unit and larger then 32 bit addressing, see
below. Now, all actions that need the full address must be
designed in such a way that they are not timing critical. The
probably did manage to do that. Prescott has however a different cache
size and thus a different table. Have a look at David Sager's patents to
find out more about the stuff below.
load/
store |
Action
|
address
bit
used |
CSA/XOR
used
to by-
pass
AGU |
Optimization
Rule |
load |
write
load address
in
load buffer |
full
address |
no |
|
load |
Index
cache-line in
each
of the four ways |
6..11 |
yes |
2k
Data Cache aliasing, max four with same tag |
load |
Way
prediction |
12..15 |
yes |
64k
Data Cache aliasing, only one with same tag |
load |
Read
Store Data
from
Store Buffer |
2..13 |
yes? |
16k
store-forwarding aliasing, only one with same bits 2..13 in the
store buffer |
load |
Check
Store Buffer
Address
14..31 |
14:31 |
no |
4
Gbyte aliassing. Must take store buffer address bits 32 and
higher along to check later when full load address ready and
force a replay if needed |
load |
Check
Cache Hierarchy Hit |
full
address |
no |
General
cache stuff |
store |
write
store address
in
store buffer |
full
address |
no |
|
store |
Check
against the
load
buffer addresses |
2:16? |
no? |
Incorrect
match only produces unnecessary replays. Less bits = more
replays |
store |
Store
Data in
Cache
Hierarchy |
full
address |
no |
|
|
Why
does Intel say that the cache is 16kByte while there is 32 kByte on the
die??
Good
question. The answer has been written years ago in the CPUID
table!
66h: |
8kB first level data cache, 4-way set associative, 64 byte cache
line
|
67h: |
16kB first level data cache, 4-way set associative, 64 byte
cache line |
68h: |
32kB first level data cache, 4-way set associative, 64 byte
cache line |
How
can a 32 kB L1 cache be 4-way set associative with 4 kB page memory
management!.. Don't you
need at least 8 ways? (8 x 4 =32) Aren't we missing a
selection bit here. Ahaa... They
must be using a Thread ID bit so 2 threads get half of the cache (4
ways) and the other 2 get the other 4 ways. So Nocona will have a
32 kByte L1 Data Cache!
And
the other way around: It proves that Nocona will handle 4 threads!
Regards,
Hans
|
Related
articles
March
6, 2003: Looking
at Intel's Prescott die
April
20, 2003: Looking
at Intel's Prescott die, part II
|
|