Wednesday, December 8, 2010

Microcode and Microarchitecture

Microcode is a layer of hardware-level instructions and/or data structures involved in the implementation of higher level machine code instructions in many computers and other processors; it resides in a special high-speed memory and translates machine instructions into sequences of detailed circuit-level operations.

It helps separate the machine instructions from the underlying electronics so that instructions can be designed and altered more freely. It also makes it feasible to build complex multi-step instructions while still reducing the complexity of the electronic circuitry compared to other methods.

Writing microcode is often called microprogramming and the microcode in a particular processor implementation is sometimes called a microprogram.

Modern microcode is normally written by an engineer during the processor design phase and stored in a ROM and/or PLA structure, although machines exist which have some writable microcode in SRAM or flash memory.

Microcode is generally not visible to or changeable by a normal programmer, not even an assembly programmer. Unlike machine code, which often retains some compatibility among different processors in a family, microcode only runs on the exact electronic circuitry for which it is designed, as it constitutes an inherent part of the particular processor design itself.

More extensive microcoding has also been used to allow small and simple microarchitectures to emulate more powerful architectures with wider word length, more execution units and so on. This is a relatively simple way to achieve software compatibility between different products in a processor family.

Some hardware vendors, especially IBM, use the term as a synonym for firmware, so that all code in a device, whether microcode or machine code, is termed microcode (such as in a hard drive for instance, which typically contains both).

Overview

The elements composing a microprogram exist on a lower conceptual level than a normal application program. Each element is differentiated by the "micro" prefix to avoid confusion: microinstruction, microassembler, microprogrammer, microarchitecture, etc.

The microcode usually does not reside in the main memory, but in a special high speed memory, called the control store. It might be either read-only or read-write memory. In the latter case the microcode would be loaded into the control store from some other storage medium as part of the initialization of the CPU, and it could be altered to correct bugs in the instruction set, or to implement new machine instructions.

Microprograms consist of series of microinstructions. These microinstructions control the CPU at a very fundamental level of hardware circuitry. For example, a single typical microinstruction might specify the following operations:

  • Connect Register 1 to the "A" side of the ALU
  • Connect Register 7 to the "B" side of the ALU
  • Set the ALU to perform two's-complement addition
  • Set the ALU's carry input to zero
  • Store the result value in Register 8
  • Update the "condition codes" with the ALU status flags ("Negative", "Zero", "Overflow", and "Carry")
  • Microjump to MicroPC nnn for the next microinstruction
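The operations listed above can be pictured as the fields of one record. The following is a minimal sketch, not any real machine's format: the field names, the 16-bit register width and the `step` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroInstruction:
    """One hypothetical microinstruction; every field name is illustrative."""
    src_a: int       # register routed to the "A" side of the ALU
    src_b: int       # register routed to the "B" side of the ALU
    alu_op: str      # only "ADD" (two's-complement addition) in this sketch
    carry_in: int    # value forced onto the ALU carry input
    dest: int        # register that receives the result
    set_flags: bool  # whether to update the condition codes
    next_upc: int    # microjump target for the next microinstruction


def step(regs, mi, width=16):
    """Apply one microinstruction to a register file (a list of ints)."""
    if mi.alu_op != "ADD":
        raise ValueError("this sketch only models ADD")
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a, b = regs[mi.src_a], regs[mi.src_b]
    total = a + b + mi.carry_in
    result = total & mask
    regs[mi.dest] = result
    flags = None
    if mi.set_flags:
        flags = {
            "N": bool(result & sign),                    # Negative
            "Z": result == 0,                            # Zero
            "C": total > mask,                           # Carry out
            "V": bool(~(a ^ b) & (a ^ result) & sign),   # Overflow
        }
    return flags, mi.next_upc


# Register 1 + Register 7 -> Register 8, carry in 0, flags updated.
regs = [0] * 16
regs[1], regs[7] = 0x7FFF, 1
flags, next_upc = step(regs, MicroInstruction(1, 7, "ADD", 0, 8, True, 0x42))
```

All of these fields take effect in the same cycle, which is why a real control word has to be wide enough to hold them side by side.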

To simultaneously control all of the processor's features in one cycle, the microinstruction is often wider than 50 bits, e.g., 128 bits on a 360/85 with an emulator feature. Microprograms are carefully designed and optimized for the fastest possible execution, since a slow microprogram would yield a slow machine instruction which would in turn cause all programs using that instruction to be slow.

The reason for microprogramming

Microcode was originally developed as a simpler method of developing the control logic for a computer. Initially CPU instruction sets were "hard wired".

Each step needed to fetch, decode and execute the machine instructions (including any operand address calculations, reads and writes) was controlled directly by combinatorial logic and rather minimal sequential state machine circuitry. While very efficient, the need for powerful instruction sets with multi-step addressing and complex operations (see below) made such "hard-wired" processors difficult to design and debug; highly encoded and varied-length instructions can contribute to this as well, especially when very irregular encodings are used.

Microcode simplified the job by allowing much of the processor's behaviour and programming model to be defined via microprogram routines rather than by dedicated circuitry.

Even late in the design process, microcode could easily be changed, whereas hard wired CPU designs were very cumbersome to change, so this greatly facilitated CPU design.

From the 1940s to the late 1970s, much programming was done in assembly language; higher level instructions meant greater programmer productivity, so an important advantage of microcode was the relative ease with which powerful machine instructions could be defined. During the 1970s, CPU speeds grew more quickly than memory speeds, and numerous techniques such as memory block transfer, memory pre-fetch and multi-level caches were used to alleviate this.

High level machine instructions, made possible by microcode, helped further, as fewer, more complex machine instructions require less memory bandwidth. For example, an operation on a character string could be done as a single machine instruction, thus avoiding multiple instruction fetches.

Architectures with instruction sets implemented by complex microprograms included the IBM System/360 and Digital Equipment Corporation VAX. The approach of increasingly complex microcode-implemented instruction sets was later called CISC.

A middle way, used in many microprocessors, is to use PLAs and/or ROMs (instead of combinatorial logic) mainly for instruction decoding, and let a simple state machine (without much, or any, microcode) do most of the sequencing. The practical uses of microcode and related techniques (such as PLAs) have been numerous over the years, as have the approaches to where, and to what extent, they should be used. Microcode is still used in modern CPU designs.

Other benefits

A processor's microprograms operate on a more primitive, totally different and much more hardware-oriented architecture than the assembly instructions visible to normal programmers. In coordination with the hardware, the microcode implements the programmer-visible architecture.

The underlying hardware need not have a fixed relationship to the visible architecture. This makes it possible to implement a given instruction set architecture on a wide variety of underlying hardware micro-architectures.

Doing so is important if binary program compatibility is a priority. That way previously existing programs can run on totally new hardware without requiring revision and recompilation. However there may be a performance penalty for this approach. The tradeoffs between application backward compatibility vs CPU performance are hotly debated by CPU design engineers.

The IBM System/360 has a 32-bit architecture with 16 general-purpose registers, but most of the System/360 implementations actually used hardware that implemented a much simpler underlying microarchitecture; for example, the System/360 Model 30 had 8-bit data paths to the arithmetic logic unit (ALU) and main memory and implemented the general-purpose registers in a special unit of higher-speed core memory, and the System/360 Model 40 had 8-bit data paths to the ALU and 16-bit data paths to main memory and also implemented the general-purpose registers in a special unit of higher-speed core memory.

The Model 50 and Model 65 had full 32-bit data paths and implemented the general-purpose registers in faster transistor circuits. In this way, microprogramming enabled IBM to design many System/360 models with substantially different hardware and spanning a wide range of cost and performance, while making them all architecturally compatible.

This dramatically reduced the amount of unique system software that had to be written for each model.

A similar approach was used by Digital Equipment Corporation in their VAX family of computers. Initially a 32-bit TTL processor in conjunction with supporting microcode implemented the programmer-visible architecture.

Later VAX versions used different microarchitectures, yet the programmer-visible architecture did not change.

Microprogramming also reduced the cost of field changes to correct defects (bugs) in the processor; a bug could often be fixed by replacing a portion of the microprogram rather than by changes being made to hardware logic and wiring.

History

In 1947, the design of the MIT Whirlwind introduced the concept of a control store as a way to simplify computer design and move beyond ad hoc methods. The control store was a two-dimensional lattice: one dimension accepted "control time pulses" from the CPU's internal clock, and the other connected to control signals on gates and other circuits.

A "pulse distributor" would take the pulses generated by the CPU clock and break them up into eight separate time pulses, each of which would activate a different row of the lattice. When the row was activated, it would activate the control signals connected to it.

Described another way, the signals transmitted by the control store are played much like a player piano roll. That is, they are controlled by a sequence of very wide words constructed of bits, and they are "played" sequentially. In a control store, however, the "song" is short and repeated continuously.

In 1951 Maurice Wilkes enhanced this concept by adding conditional execution, a concept akin to a conditional in computer software. His initial implementation consisted of a pair of matrices: the first one generated signals in the manner of the Whirlwind control store, while the second matrix selected which row of signals (the microprogram instruction word, as it were) to invoke on the next cycle.

Conditionals were implemented by providing a way that a single line in the control store could choose from alternatives in the second matrix. This made the control signals conditional on the detected internal signal. Wilkes coined the term microprogramming to describe this feature and distinguish it from a simple control store.
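The two-matrix arrangement can be mimicked in a few lines of software. This is only a toy model under stated assumptions: the signal names, the three-row program and the single "zero" condition are invented for illustration and do not describe Wilkes's actual hardware.

```python
# Matrix A: row -> control signals asserted while that row is active.
MATRIX_A = {
    0: {"LOAD_COUNT"},
    1: {"DECREMENT", "TEST_ZERO"},
    2: {"DONE"},
}

# Matrix B: row -> (next row if the condition is false, next row if true).
MATRIX_B = {
    0: (1, 1),   # unconditional: both alternatives are the same row
    1: (1, 2),   # conditional: repeat row 1 until the counter reaches zero
    2: (2, 2),
}


def run(initial_count, max_cycles=100):
    """Return the sequence of rows activated, one entry per cycle."""
    row, count, zero, trace = 0, 0, False, []
    for _ in range(max_cycles):
        signals = MATRIX_A[row]
        trace.append(row)
        if "LOAD_COUNT" in signals:
            count = initial_count
        if "DECREMENT" in signals:
            count -= 1
        if "TEST_ZERO" in signals:
            zero = (count == 0)
        if "DONE" in signals:
            break
        # The detected internal signal picks one of the alternatives.
        row = MATRIX_B[row][1 if zero else 0]
    return trace


print(run(3))  # rows visited: [0, 1, 1, 1, 2]
```

Without the second matrix the rows would simply replay in a fixed order, as in the Whirlwind control store; the pair of alternatives per row is what makes the sequence depend on data.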

Examples of microprogrammed systems

  • In common with many other complex mechanical devices, Charles Babbage's analytical engine used banks of cams to control each operation, i.e. it had a read-only control store. As such it deserves to be recognised as the first microprogrammed computer to be designed, even if it has not yet been realised in hardware.
  • The EMIDEC 1100 reputedly used a hard-wired control store consisting of wires threaded through ferrite cores, known as 'the laces'.
  • Most models of the IBM System/360 series were microprogrammed:
      • The Model 25 was unique among System/360 models in using the top 16k bytes of core storage to hold the control storage for the microprogram. The 2025 used a 16-bit microarchitecture with seven control words (or microinstructions). At power up, or full system reset, the microcode was loaded from the card reader. The IBM 1410 emulation for this model was loaded this way.
      • The Model 30, the slowest model in the line, used an 8-bit microarchitecture with only a few hardware registers; everything that the programmer saw was emulated by the microprogram. The microcode for this model was also held on special punched cards, which were stored inside the machine in a dedicated reader per card, called "CROS" units (Capacitor Read-Only Storage). A second CROS reader was installed for machines ordered with 1620 emulation.
      • The Model 40 used 56-bit control words. The 2040 box implements both the System/360 main processor and the multiplex channel (the I/O processor). This model used "TROS" dedicated readers similar to "CROS" units, but with an inductive pickup (Transformer Read-only Store).
      • The Model 50 had two internal datapaths which operated in parallel: a 32-bit datapath used for arithmetic operations, and an 8-bit data path used in some logical operations. The control store used 90-bit microinstructions.
      • The Model 85 had separate instruction fetch (I-unit) and execution (E-unit) to provide high performance. The I-unit is hardware controlled. The E-unit is microprogrammed; the control words are 108 bits wide on a basic 360/85 and wider if an emulator feature is installed.
  • The NCR 315 was microprogrammed with hand wired ferrite cores (a ROM) pulsed by a sequencer with conditional execution. Wires routed through the cores were enabled for various data and logic elements in the processor.
  • The Digital Equipment Corporation PDP-11 processors, with the exception of the PDP-11/20, were microprogrammed.
  • Many systems from Burroughs were microprogrammed:
      • The B700 "microprocessor" executed application-level opcodes using sequences of 16-bit microinstructions stored in main memory; each of these was either a register-load operation or mapped to a single 56-bit "nanocode" instruction stored in read-only memory. This allowed comparatively simple hardware to act either as a mainframe peripheral controller or to be packaged as a standalone computer.
      • The B1700 was implemented with radically different hardware including bit-addressable main memory but had a similar multi-layer organisation. The operating system would preload the interpreter for whatever language was required. These interpreters presented different virtual machines for COBOL, Fortran, etc.
  • Microdata produced computers in which the microcode was accessible to the user; this allowed the creation of custom assembler level instructions. Microdata's Reality operating system design made extensive use of this capability.
  • The Nintendo 64's Reality Co-Processor, which serves as the console's graphics processing unit and audio processor, uses microcode; it is possible to implement new effects or tweak the processor to achieve the desired output. Some well-known examples of custom microcode include Factor 5's N64 ports of Indiana Jones and the Infernal Machine, Star Wars: Rogue Squadron and Star Wars: Battle for Naboo.
  • The VU0 and VU1 vector units in the Sony PlayStation 2 are microprogrammable; in fact, VU1 was only accessible via microcode for the first several generations of the SDK.

Implementation

Each microinstruction in a microprogram provides the bits which control the functional elements that internally compose a CPU. The advantage over a hard-wired CPU is that internal CPU control becomes a specialized form of a computer program.

Microcode thus transforms a complex electronic design challenge (the control of a CPU) into a less-complex programming challenge.

To take advantage of this, computers were divided into several parts:

A microsequencer picked the next word of the control store. A sequencer is mostly a counter, but usually also has some way to jump to a different part of the control store depending on some data, usually data from the instruction register and always some part of the control store. The simplest sequencer is just a register loaded from a few bits of the control store.
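A rough software analogue of such a sequencer is sketched below. The three sequencer operations ("NEXT", "JMP", "DISPATCH"), the control-store contents and the dispatch table are assumptions chosen for illustration, not a description of any particular machine.

```python
def next_micro_address(upc, seq_op, target, opcode, dispatch):
    """Pick the control-store address of the next microinstruction."""
    if seq_op == "NEXT":       # plain counter behaviour
        return upc + 1
    if seq_op == "JMP":        # address taken from bits of the control word
        return target
    if seq_op == "DISPATCH":   # address chosen by the instruction register
        return dispatch[opcode]
    raise ValueError(f"unknown sequencer operation: {seq_op}")


# Hypothetical control store: address -> (sequencer operation, jump target).
STORE = {
    0: ("DISPATCH", 0),   # instruction decode
    10: ("NEXT", 0),      # first step of a two-step routine
    11: ("JMP", 0),       # second step, then back to decode
    20: ("JMP", 0),       # one-step routine
}
DISPATCH = {"JUMP": 10, "NOP": 20}


def trace(opcode, steps):
    """List the control-store addresses visited over a number of cycles."""
    upc, visited = 0, []
    for _ in range(steps):
        visited.append(upc)
        seq_op, target = STORE[upc]
        upc = next_micro_address(upc, seq_op, target, opcode, DISPATCH)
    return visited


print(trace("JUMP", 4))  # [0, 10, 11, 0]
```

The "simplest sequencer" mentioned above corresponds to keeping only the "JMP" case, where every control word names its successor.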

A register set is a fast memory containing the data of the central processing unit. It may include the program counter, stack pointer, and other numbers that are not easily accessible to the application programmer. Often the register set is a triple-ported register file, that is, two registers can be read, and a third written at the same time.

An arithmetic and logic unit performs calculations, usually addition, logical negation, a right shift, and logical AND. It often performs other functions, as well.

There may also be a memory address register and a memory data register, used to access the main computer storage.

Together, these elements form an "execution unit". Most modern CPUs have several execution units. Even simple computers usually have one unit to read and write memory, and another to execute user code.

These elements could often be bought together as a single chip. This chip came in a fixed width which would form a 'slice' through the execution unit. These were known as 'bit slice' chips. The AMD Am2900 family is one of the best known examples of bit slice elements.

The parts of the execution units, and the execution units themselves, are interconnected by a bundle of wires called a bus.

Programmers develop microprograms. The basic tools are software: a microassembler allows a programmer to define the table of bits symbolically, and a simulator program executes the bits in the same way as the electronics (hopefully), allowing much more freedom to debug the microprogram.

After the microprogram is finalized, and extensively tested, it is sometimes used as the input to a computer program that constructs logic to produce the same data. This program is similar to those used to optimize a programmable logic array. No known computer program can produce optimal logic, but even pretty good logic can vastly reduce the number of transistors from the number required for a ROM control store. This reduces the cost and power used by a CPU.

Microcode can be characterized as horizontal or vertical. This refers primarily to whether each microinstruction directly controls CPU elements (horizontal microcode), or requires subsequent decoding by combinatorial logic before doing so (vertical microcode). Consequently each horizontal microinstruction is wider (contains more bits) and occupies more storage space than a vertical microinstruction.

Horizontal microcode

Horizontal microcode is typically contained in a fairly wide control store; it is not uncommon for each word to be 56 bits or more. On each tick of a sequencer clock a microcode word is read, decoded, and used to control the functional elements which make up the CPU.

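In a typical implementation a horizontal microprogram word comprises fairly tightly defined groups of bits. For example, one simple arrangement might be, in order: register source A, register source B, destination register, arithmetic and logic unit operation, type of jump, and jump address.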

For this type of micromachine to implement a JUMP instruction with the address following the opcode, the microcode might require two clock ticks; the engineer designing it would write microassembler source code looking something like this:

# Any line starting with a number-sign is a comment
# This is just a label, the ordinary way assemblers symbolically represent a
# memory address.
InstructionJUMP:

# To prepare for the next instruction, the instruction-decode microcode has already
# moved the program counter to the memory address register. This instruction fetches
# the target address of the jump instruction from the memory word following the
# jump opcode, by copying from the memory data register to the memory address register.
# This gives the memory system two clock ticks to fetch the next
# instruction to the memory data register for use by the instruction decode.
# The sequencer instruction "next" means just add 1 to the control word address.
MDR, NONE, MAR, COPY, NEXT, NONE

# This places the address of the next instruction into the PC.
# This gives the memory system a clock tick to finish the fetch started on the
# previous microinstruction.
# The sequencer instruction is to jump to the start of the instruction decode.
MAR, 1, PC, ADD, JMP, InstructionDecode

# The instruction decode is not shown, because it is usually a mess, very particular
# to the exact processor being emulated. Even this example is simplified.
# Many CPUs have several ways to calculate the address, rather than just fetching
# it from the word following the op-code. Therefore, rather than just one
# jump instruction, those CPUs have a family of related jump instructions.

For each tick it is common to find that only some portions of the CPU are used, with the remaining groups of bits in the microinstruction being no-ops. With careful design of hardware and microcode this property can be exploited to parallelise operations which use different areas of the CPU; for example, in the case above the ALU is not required during the first tick, so it could potentially be used to complete an earlier arithmetic instruction.
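To make the idea of a microassembler concrete, the sketch below packs the two microinstructions of the JUMP routine into control words. It assumes the six comma-separated operands are, in order, source A, source B, destination, ALU operation, sequencer operation and jump address; the field widths, the numeric encodings and the address given to InstructionDecode are all invented for illustration.

```python
# Hypothetical field layout: (name, width in bits), most significant first.
FIELDS = [("src_a", 3), ("src_b", 3), ("dest", 3),
          ("alu", 2), ("seq", 2), ("addr", 8)]

# Hypothetical encodings; "1" stands for a constant-one source.
REGISTERS = {"NONE": 0, "PC": 1, "MAR": 2, "MDR": 3, "1": 4}
ALU_OPS = {"COPY": 0, "ADD": 1}
SEQ_OPS = {"NEXT": 0, "JMP": 1}


def assemble(lines, labels):
    """Turn symbolic microinstructions into integer control words."""
    words = []
    for line in lines:
        src_a, src_b, dest, alu, seq, addr = (t.strip() for t in line.split(","))
        values = [
            REGISTERS[src_a], REGISTERS[src_b], REGISTERS[dest],
            ALU_OPS[alu], SEQ_OPS[seq],
            0 if addr == "NONE" else labels[addr],
        ]
        word = 0
        for (name, width), value in zip(FIELDS, values):
            if value >= (1 << width):
                raise ValueError(f"{name} value {value} does not fit in {width} bits")
            word = (word << width) | value   # append the field to the word
        words.append(word)
    return words


source = [
    "MDR, NONE, MAR, COPY, NEXT, NONE",
    "MAR, 1, PC, ADD, JMP, InstructionDecode",
]
# The address of InstructionDecode is an assumption made for this example.
words = assemble(source, {"InstructionDecode": 0x10})
print([hex(w) for w in words])  # ['0xc2000', '0xa1510']
```

In the first word the source-B, ALU and address fields all encode "nothing to do", which is the unused capacity that the paragraph above suggests could be given to other work in the same tick.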

Vertical microcode

In vertical microcode, each microinstruction is encoded; that is, the bit fields may pass through intermediate combinatory logic which in turn generates the actual control signals for internal CPU elements (ALU, registers, etc.).

In contrast, with horizontal microcode the bit fields themselves directly produce the control signals. Consequently vertical microcode requires smaller instruction lengths and less storage, but requires more time to decode, resulting in a slower CPU clock.
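The storage saving and the extra decode step can be seen in a small sketch. It assumes a group of mutually exclusive control lines (for instance, "which register drives the bus"); the function names are hypothetical.

```python
def bits_needed(n_lines):
    """Bits a vertical (encoded) field needs to select one of n_lines."""
    return max(1, (n_lines - 1).bit_length())


def decode_one_hot(code, n_lines):
    """Model of the intermediate logic: encoded field -> one active line."""
    if not 0 <= code < n_lines:
        raise ValueError("code does not name an existing control line")
    return 1 << code


# Eight exclusive control lines: a horizontal word stores all 8 bits,
# while a vertical word stores a 3-bit code and decodes it each cycle.
horizontal_bits = 8
vertical_bits = bits_needed(8)
lines = decode_one_hot(5, 8)
print(horizontal_bits, vertical_bits, format(lines, "08b"))  # 8 3 00100000
```

The decode function stands in for the combinatory logic that sits between the control store and the CPU elements, which is where the additional delay of vertical microcode comes from.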

Some vertical microcodes are just the assembly language of a simple conventional computer that is emulating a more complex computer. This technique was popular in the time of the PDP-8. Another form of vertical microcode has two fields:

The "field select" selects which part of the CPU will be controlled by this word of the control store.

The "field value" actually controls that part of the CPU. With this type of microcode, a designer explicitly chooses to make a slower CPU to save money by reducing the unused bits in the control store; however, the reduced complexity may increase the CPU's clock frequency, which lessens the effect of an increased number of cycles per instruction.

As transistors became cheaper, horizontal microcode came to dominate the design of CPUs using microcode, with vertical microcode no longer being used.

Writable control stores

A few computers were built using "writable microcode": rather than storing the microcode in ROM or hard-wired logic, the microcode was stored in a RAM called a Writable Control Store or WCS.

Such a computer is sometimes called a Writable Instruction Set Computer or WISC. Many of these machines were experimental laboratory prototypes, such as the WISC CPU/16 and the RTX 32P.

There were also commercial machines that used writable microcode, such as early Xerox workstations, the DEC VAX 8800 ("Nautilus") family, the Symbolics L- and G-machines, and a number of IBM System/370 implementations. Some DEC PDP-10 machines stored their microcode in SRAM chips (about 80 bits wide x 2 Kwords), which was typically loaded on power-on through some other front-end CPU.

Many more machines offered user-programmable writable control stores as an option (including the HP 2100, DEC PDP-11/60 and Varian Data Machines V-70 series minicomputers). WCS offered several advantages including the ease of patching the microprogram and, for certain hardware generations, faster access than ROMs could provide.

User-programmable WCS allowed the user to optimize the machine for specific purposes.

Some CPU designs compile the instruction set to a writable RAM or FLASH inside the CPU (such as the Rekursiv processor and the Imsys Cjip), or an FPGA (reconfigurable computing).

The Western Digital MCP-1600 is an older example, using a dedicated, separate ROM for microcode.

A CPU that uses microcode generally takes several clock cycles to execute a single instruction, one clock cycle for each step in the microprogram for that instruction. Some CISC processors include instructions that can take a very long time to execute.

Such variations interfere with both interrupt latency and, what is far more important in modern systems, pipelining.

Several Intel CPUs in the IA32 architecture family have writable microcode. This has allowed bugs in the Intel Core 2 microcode and Intel Xeon microcode to be fixed in software, rather than requiring the entire chip to be replaced.

Such fixes can be installed by Linux, FreeBSD, Microsoft Windows, or the motherboard BIOS.

Microcode patches on a PC

Linux, FreeBSD for x86, as well as Microsoft Windows have patch programs (not always labeled as such, since Windows XP) that can be employed to fix problematic microcode in the main CPU of a PC-compatible computer.

Also, on all UNIX-like operating systems on x86-based PCs there has been an ongoing requirement to patch erroneous microcode since the FPU multiplier problem on some Pentium CPUs. For Windows, the PC-specific firmware, called BIOS on older computers, or Extensible Firmware Interface on newer, is responsible for patching microcode in the main CPU.

Such patches can be done before the operating system is loaded, preventing many potential problems. So far, x86 is the only modern CPU-family where parts of the microcode or internal firmware may be modified.

Microcode versus VLIW and RISC

The design trend toward heavily microcoded processors with complex instructions began in the early 1960s and continued until roughly the mid-1980s. At that point the RISC design philosophy started becoming more prominent. Its arguments included the following points:

  • Analysis shows complex instructions are rarely used, hence the machine resources devoted to them are largely wasted.
  • Programming has largely moved away from assembly level, so it's no longer worthwhile to provide complex instructions for productivity reasons.
  • The machine resources devoted to rarely-used complex instructions are better used for expediting performance of simpler, commonly-used instructions.
  • Complex microcoded instructions requiring many, varying clock cycles are difficult to pipeline for increased performance.
  • Simpler instruction sets allow direct execution by hardware, avoiding the performance penalty of microcoded execution.

It should be mentioned that there are counter-points as well:

  • The complex instructions in heavily microcoded implementations may not take much extra machine resources (except microcode space); for instance, the same ALU is often used to calculate an effective address as well as to compute the result from the actual operands, e.g. in the original Z80, 8086, and others.
  • The simpler non-RISC instructions (i.e. those involving direct memory operands) are frequently used by modern compilers; even immediate-to-stack (i.e. memory result) arithmetic operations are commonly employed.
  • Although such memory operations, often with varying length encodings, are more difficult to pipeline, doing so is still fully feasible, as clearly exemplified by the i486, AMD K5, Cyrix 6x86, etc.
  • Non-RISC instructions inherently perform more work per instruction (on average), and are also normally highly encoded, so they enable smaller overall size of the same program, and thus better use of limited cache memories.
  • Modern CISC/RISC implementations, e.g. x86 designs, decode instructions into dynamically buffered micro-operations with instruction encodings similar to traditional fixed microcode. Ordinary static microcode is used as hardware assistance for complex multistep operations such as auto-repeating instructions and for transcendental functions in the floating point unit; it is also used for special purpose instructions (such as CPUID) and internal control and configuration purposes.
  • The simpler instructions in CISC architectures are also directly executed in hardware in modern implementations.

Many RISC and VLIW processors are designed to execute every instruction (as long as it is in the cache) in a single cycle. This is very similar to the way CPUs with microcode execute one microinstruction per cycle. VLIW processors have instructions that behave similarly to very wide horizontal microcode, although typically without such fine-grained control over the hardware as provided by microcode. RISC instructions are sometimes similar to the narrow vertical microcode.

In computer engineering, microarchitecture (sometimes abbreviated to µarch or uarch), also called computer organization, is the way a given instruction set architecture (ISA) is implemented on a processor.

A given ISA may be implemented with different microarchitectures. Implementations might vary due to different goals of a given design or due to shifts in technology. Computer architecture is the combination of microarchitecture and instruction set design.

Relation to instruction set architecture

The ISA is roughly the same as the programming model of a processor as seen by an assembly language programmer or compiler writer. The ISA includes the execution model, processor registers, and address and data formats, among other things.

The microarchitecture includes the constituent parts of the processor and how these interconnect and interoperate to implement the ISA.

The microarchitecture of a machine is usually represented as (more or less detailed) diagrams that describe the interconnections of the various microarchitectural elements of the machine, which may be everything from single gates and registers, to complete arithmetic logic units (ALUs) and even larger elements. These diagrams generally separate the data path (where data is placed) and the control path (which can be said to steer the data).

Each microarchitectural element is in turn represented by a schematic describing the interconnections of logic gates used to implement it. Each logic gate is in turn represented by a circuit diagram describing the connections of the transistors used to implement it in some particular logic family.

Machines with different microarchitectures may have the same instruction set architecture, and thus be capable of executing the same programs. New microarchitectures and/or circuitry solutions, along with advances in semiconductor manufacturing, are what allow newer generations of processors to achieve higher performance while using the same ISA.

In principle, a single microarchitecture could execute several different ISAs with only minor changes to the microcode.
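As a rough illustration of that idea, the sketch below models one tiny accumulator datapath and two microcode tables that expose different instruction sets on top of it. Everything here is hypothetical: the micro-operation names, the two toy ISAs and their mnemonics are invented for the example and do not correspond to any real processor.

```python
# One hypothetical datapath: a single accumulator and three micro-operations.
MICRO_OPS = {
    "load_acc": lambda state, x: state.update(acc=x),
    "add_acc": lambda state, x: state.update(acc=state["acc"] + x),
    "neg_acc": lambda state, x: state.update(acc=-state["acc"]),
}

# Two invented ISAs, each defined only by its microcode table.
MICROCODE_ISA_A = {
    "LDA": ["load_acc"],
    "ADD": ["add_acc"],
    "SUB": ["neg_acc", "add_acc", "neg_acc"],  # acc - x, built from negate and add
}
MICROCODE_ISA_B = {
    "MOV": ["load_acc"],
    "PLUS": ["add_acc"],
    "RSUB": ["neg_acc", "add_acc"],  # reverse subtract: x - acc
}

def run(microcode, program):
    """Execute (mnemonic, operand) pairs by expanding each into micro-operations."""
    state = {"acc": 0}
    for mnemonic, operand in program:
        for micro_op in microcode[mnemonic]:
            MICRO_OPS[micro_op](state, operand)
    return state["acc"]

result_a = run(MICROCODE_ISA_A, [("LDA", 10), ("SUB", 3)])
result_b = run(MICROCODE_ISA_B, [("MOV", 10), ("RSUB", 3)])
```

The datapath (MICRO_OPS) and the sequencer (run) are shared; only the table passed in changes which instruction set the machine appears to implement.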

Aspects of microarchitecture

The pipelined datapath is the most commonly used datapath design in microarchitecture today.

This technique is used in most modern microprocessors, microcontrollers, and DSPs. The pipelined architecture allows multiple instructions to overlap in execution, much like an assembly line. The pipeline includes several different stages which are fundamental in microarchitecture designs.

Some of these stages include instruction fetch, instruction decode, execute, and write back. Some architectures include other stages such as memory access. The design of pipelines is one of the central microarchitectural tasks.

Execution units are also essential to microarchitecture.

Execution units include arithmetic logic units (ALU), floating point units (FPU), load/store units, branch prediction, and SIMD. These units perform the operations or calculations of the processor. The choice of the number of execution units, and of their latency and throughput, is a central microarchitectural design task.

The size, latency, throughput and connectivity of memories within the system are also microarchitectural decisions.

System-level design decisions such as whether or not to include peripherals, such as memory controllers, can be considered part of the microarchitectural design process.

This includes decisions on the performance-level and connectivity of these peripherals.

Unlike architectural design, where achieving a specific performance level is the main goal, microarchitectural design pays closer attention to other constraints.

Since microarchitecture design decisions directly affect what goes into a system, attention must be paid to such issues as:

Microarchitectural concepts

In general, all CPUs, single-chip microprocessors or multi-chip implementations run programs by performing the following steps:

  1. Read an instruction and decode it
  2. Find any associated data that is needed to process the instruction
  3. Process the instruction
  4. Write the results out
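The four steps above can be sketched as a small interpreter loop. This is only a toy model: the instruction tuple format, the opcode names and the three registers are assumptions made for the example, not part of any real instruction set.

```python
def run(program, memory):
    """Run a toy program; each instruction is (opcode, dest, src_a, src_b)."""
    registers = {"r0": 0, "r1": 0, "r2": 0}
    pc = 0
    while pc < len(program):
        # Step 1: read an instruction and decode it
        opcode, dest, src_a, src_b = program[pc]
        pc += 1
        if opcode == "halt":
            break
        # Step 2: find any associated data needed to process the instruction
        if opcode == "load":
            operand_a, operand_b = memory[src_a], None
        else:
            operand_a, operand_b = registers[src_a], registers[src_b]
        # Step 3: process the instruction
        if opcode == "load":
            result = operand_a
        elif opcode == "add":
            result = operand_a + operand_b
        elif opcode == "mul":
            result = operand_a * operand_b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        # Step 4: write the results out
        registers[dest] = result
    return registers

program = [
    ("load", "r0", 0, None),
    ("load", "r1", 1, None),
    ("add", "r2", "r0", "r1"),
    ("halt", None, None, None),
]
final_registers = run(program, memory=[5, 7])
```

In this model every step finishes before the next instruction starts, which is the unpipelined behaviour the later sections improve on.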

Complicating this simple-looking series of steps is the fact that the memory hierarchy (caches, main memory and non-volatile storage such as hard disks, where the program instructions and data reside) has always been slower than the processor itself.

Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. Over the years, a central goal was to execute more instructions in parallel, thus increasing the effective execution speed of a program.

These efforts introduced complicated logic and circuit structures. Initially these techniques could only be implemented on expensive mainframes or supercomputers because of the amount of circuitry they required. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip.

What follows is a survey of micro-architectural techniques that are common in modern CPUs.

Instruction set choice

Instruction sets have shifted over the years, from originally very simple to sometimes very complex (in various respects).

In recent years, load-store architectures, VLIW and EPIC types have been in fashion. Architectures that deal with data parallelism include SIMD and vector designs. Some labels used to denote classes of CPU architectures are not particularly descriptive, especially the CISC label; many early designs retroactively denoted "CISC" are in fact significantly simpler than modern RISC processors (in several respects).

However, the choice of instruction set architecture may greatly affect the complexity of implementing high performance devices.

The prominent strategy, used to develop the first RISC processors, was to simplify instructions to a minimum of individual semantic complexity combined with high encoding regularity and simplicity. Such uniform instructions were easily fetched, decoded and executed in a pipelined fashion, and they allowed a simple strategy of reducing the number of logic levels in order to reach high operating frequencies; instruction cache memories compensated for the higher operating frequency and inherently low code density, while large register sets were used to factor out as many of the (slow) memory accesses as possible.

Instruction pipelining

One of the first, and most powerful, techniques to improve performance is the use of the instruction pipeline.

Early processor designs would carry out all of the steps above for one instruction before moving on to the next. Large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution and so on.

Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast.

Although any one instruction takes just as long to complete (there are still four steps) the CPU as a whole "retires" instructions much faster and can be run at a much higher clock speed.
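The "four times as fast" figure is an upper bound that a short calculation makes concrete. The sketch below assumes an idealized four-stage pipeline in which every stage takes one cycle and nothing ever stalls; the stage names and function names are chosen for the example.

```python
STAGES = ["read/decode", "find data", "process", "write results"]

def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Each instruction finishes all stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """The first instruction fills the pipeline; each later one retires one cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def speedup(n_instructions):
    return sequential_cycles(n_instructions) / pipelined_cycles(n_instructions)

for n in (1, 10, 100, 1000):
    print(n, sequential_cycles(n), pipelined_cycles(n), round(speedup(n), 2))
```

Under these assumptions a single instruction gains nothing, while long instruction streams approach, but never quite reach, a speedup equal to the number of stages.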

RISC makes pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making each stage take the same amount of time: one cycle.

The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Due to the reduced complexity of the classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise fit the core alone on a CISC design. This was the real reason that RISC was faster.

Early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.

Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX implementation (VAX 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs.

Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines.

Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.

Cache

It was not long before improvements in chip manufacturing allowed for even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory: memory that can be accessed in a few cycles, as opposed to the "many" needed to talk to main memory.

The CPU includes a cache controller which automates reading and writing from the cache. If the data is already in the cache it simply "appears," whereas if it is not, the processor is "stalled" while the cache controller reads it in.
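The hit-or-stall behaviour can be modelled with a very small simulation. The sketch below assumes a direct-mapped cache holding one word per line, with made-up costs of 2 cycles for a hit and 100 cycles for a miss; real caches differ in organization, line size and timing.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: each address maps to exactly one line."""

    def __init__(self, n_lines=4, hit_cycles=2, miss_cycles=100):
        self.n_lines = n_lines
        self.hit_cycles = hit_cycles      # assumed cost when the data "appears"
        self.miss_cycles = miss_cycles    # assumed cost of a stall on main memory
        self.tags = [None] * n_lines      # which address block each line holds
        self.hits = 0
        self.misses = 0

    def read(self, address):
        """Return the number of cycles this read costs the processor."""
        index = address % self.n_lines
        tag = address // self.n_lines
        if self.tags[index] == tag:
            self.hits += 1
            return self.hit_cycles
        # Miss: the controller fetches the word and the processor stalls.
        self.misses += 1
        self.tags[index] = tag
        return self.miss_cycles

cache = DirectMappedCache()
addresses = [0, 1, 2, 3] * 3   # a loop that touches the same four words three times
total_cycles = sum(cache.read(address) for address in addresses)
```

With this access pattern only the first pass over the four words misses; the two later passes are served from the cache, so most of the total cost comes from the initial stalls.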

RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total.

This number grew over time, and typical CPUs now have at least 512 KB, while more powerful CPUs come with 1, 2, 4, 6, 8 or even 12 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more performance, due to reduced stalling.

Caches and pipelines were a perfect match for each other. Previously, it didn't make much sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much smaller length of time.

This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory.

Branch prediction

One barrier to achieving higher performance through instruction-level parallelism stems from pipeline stalls and flushes due to branches.

Normally, whether a conditional branch will be taken isn't known until late in the pipeline, as conditional branches depend on results coming from a register. From the time that the processor's instruction decoder has figured out that it has encountered a conditional branch instruction to the time that the deciding register value can be read out, the pipeline needs to be stalled for several cycles, or if it's not and the branch is taken, the pipeline needs to be flushed. As clock speeds increase, the depth of the pipeline increases with them, and some modern processors may have 20 stages or more. On average, every fifth instruction executed is a branch, so without any intervention, that's a high amount of stalling.
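A back-of-the-envelope model shows how quickly this adds up. The sketch below assumes an otherwise ideal pipeline that retires one instruction per cycle and charges a fixed number of stall cycles for every branch; the one-in-five branch frequency comes from the text, while the penalty values are arbitrary examples.

```python
def effective_cpi(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when every branch stalls the pipeline."""
    return base_cpi + branch_fraction * penalty_cycles

BRANCH_FRACTION = 1 / 5   # "every fifth instruction executed is a branch"

for penalty in (3, 10, 20):   # hypothetical stall lengths, in cycles
    cpi = effective_cpi(BRANCH_FRACTION, penalty)
    print(f"penalty {penalty:2d} cycles: CPI {cpi:.1f}, {cpi:.1f}x slower than ideal")
```

Because the penalty scales with pipeline depth in this model, deeper pipelines lose proportionally more to unhandled branches.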

Techniques such as branch prediction and speculative execution are used to lessen these branch penalties. Branch prediction is where the hardware makes educated guesses on whether a particular branch will be taken. In practice, one side of a branch is usually taken much more often than the other.

Modern designs have rather complex statistical prediction systems, which watch the results of past branches to predict the future with greater accuracy. The guess allows the hardware to prefetch instructions without waiting for the register read.
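One of the simplest history-based schemes is a two-bit saturating counter kept per branch, sketched below. It is a textbook technique chosen for illustration and is far simpler than the predictors in modern designs; the class name, the starting counter value and the example branch address are assumptions.

```python
class TwoBitPredictor:
    """Per-branch saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}   # branch address -> counter value in 0..3

    def predict(self, branch_address):
        return self.counters.get(branch_address, 1) >= 2

    def update(self, branch_address, taken):
        counter = self.counters.get(branch_address, 1)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.counters[branch_address] = counter

def accuracy(outcomes, branch_address=0x40):
    """Fraction of outcomes guessed correctly for a single branch."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_address) == taken:
            correct += 1
        predictor.update(branch_address, taken)
    return correct / len(outcomes)

# A loop branch: taken nine times, then not taken once on exit, repeated ten times.
loop_branch = ([True] * 9 + [False]) * 10
loop_accuracy = accuracy(loop_branch)
```

For a loop-closing branch like this one, the counter settles on "taken" and is wrong mainly when the loop exits, which is why a single surprise does not flip the prediction.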

Speculative execution is a further enhancement in which the code along the predicted path is not just prefetched but also executed before it is known whether the branch should be taken or not. This can yield better performance when the guess is good, with the risk of a huge penalty when the guess is bad because instructions need to be undone.
