An analysis of the newly presented
Pentium 4 core.
Ample information
The new Pentium 4 core micro architecture contains
many interesting aspects. Intel, however, has given very little detailed
information. Here I try to fill the void with a detailed analysis.
A car pool lane for memory loads
Stage nine of the Pentium 4 contains the queues
where instructions line up and wait for further processing. The
diagram shows a separate queue for memory instructions that gives memory
loads a “car pool lane” on the Pentium 4 micro architecture highway that
allows them to pass the other instruction traffic. This is essential since
each instruction ultimately depends on some data loaded from memory. It
is important to execute the loads as early as possible, also because loads
can incur significant delays in case of cache misses.
Load-Load reordering
Stages 10 to 12 contain a separate Memory Scheduler
for the loads. This means that loads can be issued in a different order
than the program order. This is useful since not all loads can be issued
in advance. A load may depend on the value of a register that is still
unknown. The scheduler can use the addressing mode as a clue for load-load
re-ordering. Loads which can typically be executed well in advance are
those that use the base pointer together with an immediate offset given
by the instruction itself. A modern compiler will avoid pops and pushes
to access the stack. Instead it may use the base-pointer with immediate
displacements like EBP+4 or EBP-8. Accesses into structures with
EDI+disp are also good candidates. A structure is often accessed
multiple times so the compiler can try to keep the pointer in the register
constant for a while. Loads with complex addressing modes like Base+shift*Index+displacement
should be scheduled last because they depend on so many registers.
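The addressing-mode heuristic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Pentium 4's actual scheduling logic, which Intel has not disclosed. The `Load` class, the `issue_order` function and the example instructions are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    text: str                 # assembly text, for display only
    address_registers: tuple  # registers the address calculation reads


def issue_order(loads):
    """Order loads so that those with the fewest address registers go first.

    sorted() is stable, so loads with the same register count stay in
    program order.
    """
    return sorted(loads, key=lambda load: len(load.address_registers))


program = [
    Load("mov eax, [ebx+4*ecx+16]", ("ebx", "ecx")),  # base + scaled index
    Load("mov edx, [ebp-8]", ("ebp",)),               # stack variable
    Load("mov esi, [edi+12]", ("edi",)),              # structure field
]

for load in issue_order(program):
    print(load.text)
```

The two single-register loads move ahead of the load that needs both a base and an index register.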
The actual availability of the needed data in the register
file may also be used as a clue for load scheduling. There are, however, quite
a few stages between the start of the scheduler and the register file,
so it may be too early in the pipeline to be very useful. Maybe it is
used in the last part of the scheduler, which is closer to the register file.
Load-Store reordering
The memory stores also follow the separate queue/scheduler
path. Reordering loads and stores is a very different story however and
it is unlikely that the Pentium 4 can do this. The point is that the memory
addresses are still unknown during scheduling so a load following a store
may read from the same memory address that the store wants to write. Loads
are therefore scheduled in program order with stores. Stores are not reordered
with other stores either for the same reason. This kind of reordering would
require speculative processing much like the speculative processing after
a branch prediction. If the direction of the branch is predicted wrong
then all speculative instructions must be canceled and the pipeline must
be restarted. If it turns out that a load rescheduled ahead of a store has provided
the wrong memory data then all dependent instructions must be canceled
and restarted. (An architecture that does this is the Alpha EV6). The now
one and a half year old Athlon does not do any load-load or load-store
reordering. This is the reason that it is not very much faster than the
Coppermine even though it has a superior number of integer and floating
point units. The following pseudo code example shows how important load-store
reordering can be for floating point processing.
STORE (A*B+C) to MEM1;
LOAD (D) from MEM2;
Must the load then wait ~8 cycles (Athlon)
or ~14 cycles (Pentium 4) until the floating-point calculations are finished,
even though MEM1 and MEM2 may be completely different addresses? Well,
this problem is solved for a large part in the Load/Store unit. This unit
is directly coupled to the L1 Data Cache RAM and Address Generator Units.
The dispatcher issues the load and the store without waiting for the floating-point
result. The load and the store would end up in the Load/Store unit where
the addresses are calculated. The Store may then say to the Load: "Hey
guy, you can go now. I probably have to wait quite a while for my floating
point result data from the FP move/store unit but we now know that we have
different addresses".
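The conversation between the Store and the Load amounts to a simple address comparison. The Python sketch below shows that check under simplifying assumptions: addresses are compared for exact equality (access sizes and partial overlaps are ignored), and an address that has not been calculated yet is represented by `None`. The function name `load_may_proceed` is hypothetical.

```python
def load_may_proceed(load_address, earlier_store_addresses):
    """Return True when a load may go ahead of the older, pending stores.

    earlier_store_addresses has one entry per older store whose data has
    not been written yet: the calculated address, or None while the
    address is still unknown.
    """
    for store_address in earlier_store_addresses:
        if store_address is None:
            return False  # address unknown: the load has to wait
        if store_address == load_address:
            return False  # same location: the load needs the store data
    return True  # all older store addresses are known and different


MEM1, MEM2 = 0x1000, 0x2000
print(load_may_proceed(MEM2, [MEM1]))  # store address known and different
print(load_may_proceed(MEM2, [None]))  # store address not yet calculated
```

Note that the load never overtakes a store whose address is unknown, so nothing here is speculative.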
This doesn’t handle any speculative load-store
reordering however and loads still have to wait for the store addresses.
About 37% of the x86 instructions contain a load while 23% contain a
store.
Some instructions contain both. This means that any Out Of Order x86 processor
is severely limited in its abilities to reorder code without speculative
load-store reordering. Finally: shared memory and memory-mapped I/O constitute
a problem for both load-load and load-store reordering. Any processor needs
to have the operating system hooks to handle these issues.
The Instruction schedulers
Stage 9 of the pipeline also shows a large general
instruction queue. I expect that it is actually divided into several smaller
queues connected to the five different schedulers shown in stages 10 to
12. Here we encounter the first double frequency units of the Rapid
Execution Engine: The two Fast-Integer-uop schedulers. Each scheduler serves
a double frequency ALU. It can accept two uops per cycle from the queue.
The first uop is handled directly while the second starts half a cycle
later. These schedulers have a total of 6 pipeline stages running at the
double clock frequency. They handle additions, subtractions, increments,
decrements and boolean functions. Then there is the Slow-Integer-uop scheduler.
It handles other integer instructions like shifts, bit-field functions
and a lot of legacy functions like decimal and ASCII adjust. It runs at
the normal frequency, has three pipeline stages and can accept 1 uop per
cycle. The Slow-Integer-uop functions are handled by the Slow ALU, which
also runs at the normal frequency.
The floating-point schedulers may be longer than
three stages. This would explain why the FP register file is drawn farther
to the right in the original Intel block diagram. Floating point scheduling
is more complex because the floating-point execution units have many stages.
The fact that the x87 instructions are stack based should not be an issue
anymore because the register-stack-to-renamed-register-re-mapping is already
handled in the rename stages before the queue stage.
Load data speculative execution
This is a new feature in the Pentium 4 core.
Instructions that depend on load data from the L1 data cache are scheduled
and dispatched to coincide with the arrival of the data from the L1 data
cache. These instructions are tracked, canceled in case of a cache miss
and later replayed to coincide with the arrival of the valid data. Only
the instructions dependent on the load data are replayed, independent instructions
are allowed to complete. The Alpha EV6A has a similar feature.
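The selective replay can be pictured as a walk over the dependency graph: everything that directly or indirectly consumes the missed load is replayed, and everything else completes. The Python sketch below illustrates only that selection rule. The function name, the instruction names and the `sources` table are hypothetical, and the real tracking hardware is not described by Intel.

```python
def instructions_to_replay(missed_load, sources):
    """Return the instructions that depend on missed_load.

    sources maps each instruction name to the names of the instructions
    whose results it reads. Dependence is followed transitively.
    """
    replay = set()
    changed = True
    while changed:
        changed = False
        for instruction, inputs in sources.items():
            if instruction in replay:
                continue
            if any(i == missed_load or i in replay for i in inputs):
                replay.add(instruction)
                changed = True
    return replay


sources = {
    "load1": [],
    "add1": ["load1"],  # uses the load data directly
    "sub1": ["add1"],   # uses it indirectly
    "add2": [],         # independent
    "or1": ["add2"],    # independent
}
print(sorted(instructions_to_replay("load1", sources)))
```

Here only `add1` and `sub1` are replayed, while `add2` and `or1` are allowed to complete.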
The double frequency register file
This file contains both the real x86 integer
register set and the renamed registers which are used for speculative execution.
The renamed registers can be retired to the real registers when it is sure
that the branch direction chosen by the branch prediction unit was the
right one and that the speculative results of the renamed registers are
valid. The register file also runs at the double frequency and thus
contains 4 pipeline stages. It would have been possible to run the register
file at the normal frequency but with twice the number of data ports.
This, however, would have made the register file four times bigger. The Intel
design team therefore decided to give it extra pipeline stages instead.
Interestingly, the Elbrus design team uses this method to limit the
size of the huge combined Integer / Floating point register file of their
E2K processor. This seems much more achievable now that Intel does the
same.
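The "four times bigger" figure follows from a common rule of thumb, assumed here rather than taken from Intel: each extra port adds a word line and a bit line to every register cell, so the cell grows in both dimensions and the area grows roughly with the square of the port count. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def relative_register_file_area(port_factor):
    """Relative area when the number of ports is multiplied by port_factor.

    Assumes area grows with the square of the port count (rule of thumb).
    """
    return port_factor ** 2


print(relative_register_file_area(2))  # twice the ports
```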
The double frequency ALU’s
The double frequency ALU’s were already disclosed
during the Spring IDF. The 2 ALU’s can together execute 4 operations per
cycle. This, however, is not the reason for their design. (Four operations
per cycle is overkill compared to the number of load/store ports of the
cache.) Much more important is that they can do back-to-back additions
and subtractions as well as boolean functions with an effective latency
of ½ a cycle. This can significantly speed up serial code.
We still don’t believe that they can complete a full 32 bit addition within
½ a cycle. We suspect that the ALU uses CSA (Carry Save Adder)
methods to reach an effective latency of ½ a cycle. This means
that the ½ cycle does not apply to all cases: An additive function
followed by a boolean function would still see a full cycle latency. (This
full cycle latency can indeed be found in the Spring IDF presentation).
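For readers unfamiliar with carry-save addition, the Python sketch below shows the generic technique, not Intel's circuit, which remains a suspicion here. A carry-save step reduces three numbers to a sum word and a carry word using only bitwise logic, so no carry has to ripple through all 32 bits. A chain of additions can stay in this redundant form, and only the final result needs one ordinary carry-propagating addition. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keep everything within 32 bits


def carry_save_add(a, b, c):
    """Reduce three 32-bit values to (sum_word, carry_word).

    sum_word + carry_word equals a + b + c modulo 2**32.
    Each output bit depends only on the input bits in one position.
    """
    sum_word = (a ^ b ^ c) & MASK32
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return sum_word, carry_word


def chained_sum(values):
    """Add a chain of values, propagating carries only once at the end."""
    sum_word, carry_word = 0, 0
    for value in values:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, value & MASK32)
    return (sum_word + carry_word) & MASK32  # the single full addition


print(chained_sum([1, 2, 3, 4]))
```

A boolean function needs the fully resolved value rather than the sum/carry pair, which would fit with the full cycle latency for an additive function followed by a boolean function.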
There is another latency issue that is not immediately visible in the block
diagram. The x86 instructions use implicit flags like zero, sign, parity
et cetera. These flags are calculated in stage 18 and then they are forwarded
together with the result data to the Bypass network and the Integer register
file. This means that operations that depend on result data from another
ALU may see a latency of 2 cycles instead of 1 cycle. This is probably
the reason that shifts are relatively slower than on the P6.
The double frequency AGU’s
The AGU’s are a part of the Rapid Execution Engine
and also run at the double frequency. This is somewhat puzzling since the
MPF presentation explicitly shows that the L1 data cache runs at the normal
speed. A reasonable explanation might be that the total Load-Use latency
of the L1 data cache (AGU+RAM access) is 2 ½ cycles. This might
be reduced to an effective latency of 2 cycles if the preceding operation
is an additive function (add, subtract, increment, decrement). This is
exactly the case in the example used during the Spring IDF presentation.
The same CSA method that is probably used in the ALU’s can combine the
additive function with the address generation and thereby cut ½
a cycle from the total latency. The worst case latency can be 3 cycles
in this case. This happens when the preceding function is not an additive
function and the result becomes available in the wrong ½ cycle (the
last ½ cycle of the normal clock). Via’s Centaur team uses
the same CSA method in their recent designs.
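The three latency cases suggested above can be summarized in a small Python function. It encodes only the estimates made in this section (2, 2 ½ and 3 cycles), which are guesses and not figures published by Intel. The function and parameter names are hypothetical.

```python
def load_use_latency(previous_op_is_additive, result_in_last_half_cycle=False):
    """Estimated L1 load-use latency in normal clock cycles.

    previous_op_is_additive: the address register was produced by an add,
        subtract, increment or decrement that can merge with the address
        generation.
    result_in_last_half_cycle: the preceding result arrives in the last
        half of the normal clock cycle.
    """
    if previous_op_is_additive:
        return 2.0  # additive function combined with address generation
    if result_in_last_half_cycle:
        return 3.0  # worst case: wait for the next normal clock cycle
    return 2.5      # plain AGU + RAM access


print(load_use_latency(True))
print(load_use_latency(False))
print(load_use_latency(False, result_in_last_half_cycle=True))
```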
So much for a more detailed technical analysis
of the newly presented Pentium 4 core micro architecture.
Send your comments to: hansdevries@chip-architect.org