The CPU (Central Processing Unit), also called simply the processor or microprocessor, is the component of a computer that interprets the instructions contained in computer programs and processes data.
What Is A Processor, What Does It Do?
CPUs provide the defining feature of the digital computer, programmability, and, along with primary storage and I/O devices, are one of the essential components of any computer. A microprocessor is a CPU built from integrated circuits. Since the mid-1970s, single-chip microprocessors have almost completely replaced all other CPU types, and today the term “CPU” is usually applied to microprocessors.
The term “central processing unit” is generally a description of a particular class of logic machines capable of running complex computer programs. This broad definition can easily be applied to most of the oldest computers that existed long before the term “CPU” was widely used.
However, the term itself and its abbreviation have been used in the computer industry at least since the early 1960s. The shape, design, and implementation of the CPUs have changed significantly from the oldest examples, but their basic operation has remained quite similar.
The first CPUs were custom-designed as part of a larger, usually one-of-a-kind, computer. However, this expensive approach of designing a CPU for one particular application has largely disappeared, replaced by inexpensive, standardized classes of processors suited to one or many purposes.
This standardization trend generally began in the era of discrete-transistor mainframes and minicomputers and accelerated with the spread of the integrated circuit (IC), which allowed increasingly complex CPUs to be designed and manufactured in very small spaces.
Both the miniaturization and the standardization of CPUs have increased the presence of these digital devices in modern life far beyond the limited applications of dedicated computing machines. Modern microprocessors appear in cars, televisions, refrigerators, calculators, airplanes, mobile phones, toys, and many other devices.
Almost all CPUs deal with discrete states and therefore require some kind of switching elements to distinguish between and change those states. Before the commercial acceptance of the transistor, electrical relays and vacuum tubes (thermionic valves) were widely used as switching elements.
Although these had distinct speed advantages over earlier, purely mechanical designs, they were unreliable for various reasons. For example, building direct-current sequential logic circuits from relays requires additional hardware to cope with contact bounce.
Vacuum tubes, on the other hand, do not suffer from contact bounce, but they must heat up before becoming fully operational and eventually fail and stop working altogether.
When a tube failed, the CPU generally had to be diagnosed to locate the faulty component so that it could be replaced. Therefore, the first electronic (tube-based) computers were generally faster but less reliable than electromechanical (relay-based) computers.
While tube computers like EDVAC tended to average eight hours between failures, relay computers like the Harvard Mark I rarely failed.
Eventually, tube-based CPUs became dominant because their significant speed advantages generally outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock frequencies compared to modern microelectronic designs.
Clock signal frequencies from 100 kHz to 4 MHz were very common at the time, limited largely by the speed of the switching devices the CPUs were built with.
The complexity of CPU design increased as various technologies made it easier to build smaller, more reliable electronic devices. The first of these improvements came with the arrival of the transistor.
Transistorized CPUs of the 1950s and 1960s no longer had to be built from bulky, unreliable, and fragile switching elements such as vacuum tubes and electrical relays. With this improvement, more complex and more reliable CPUs were built on one or more printed circuit boards containing discrete (individual) components.
During this period, a method of manufacturing many transistors in a compact space gained popularity. The integrated circuit (IC) allowed a large number of transistors to be manufactured on a single semiconductor-based die, or “chip”.
At first, only very basic, non-specialized digital circuits such as NOR gates were miniaturized into ICs.
CPUs based on these building-block ICs are often called “small-scale integration” (SSI) devices.
SSI integrated circuits, such as those used in the Apollo Guidance Computer, typically contained up to a few dozen transistors.
Building a complete CPU out of SSI ICs required thousands of individual chips, but still consumed much less space and power than earlier discrete-transistor designs.
As microelectronic technology advanced, an increasing number of transistors were placed on ICs, reducing the number of individual ICs needed for a complete CPU. MSI and LSI (medium- and large-scale integration) ICs increased transistor counts to hundreds and then thousands.
In 1964, IBM introduced its System/360 computer architecture, which was used in a series of computers that could run the same programs at different speeds and levels of performance.
This was significant at a time when most electronic computers were incompatible with one another, even those made by the same manufacturer. To make this possible, IBM used the concept of a microprogram, often called “microcode”, which is still widely used in modern CPUs.
The System/360 architecture was so popular that it dominated the mainframe market for decades and left a legacy that is continued by similar modern computers such as the IBM zSeries.
In the same year, Digital Equipment Corporation (DEC) introduced another influential computer, the PDP-8, aimed at the scientific and research markets. DEC later introduced the extremely popular PDP-11 line, which was originally built with SSI ICs but was eventually implemented with LSI components once these became practical.
In stark contrast to its SSI and MSI predecessors, the first LSI implementation of the PDP-11 contained a CPU composed of only four LSI integrated circuits.
Transistor-based computers had several distinct advantages over their predecessors. Besides offering greater reliability and lower power consumption, transistors allowed CPUs to operate at much higher speeds because a transistor switches much faster than a tube or relay.
Thanks to both the increased reliability and the dramatically increased speed of the switching elements, which by this time were almost exclusively transistors, CPU clock rates in the tens of megahertz were achieved during this period.
In addition, while discrete-transistor and IC CPUs were in heavy use, new high-performance designs such as SIMD (Single Instruction, Multiple Data) vector processors began to appear. These early experimental designs later gave rise to the era of specialized supercomputers such as those made by Cray Inc.
Since the introduction of the first microprocessor, the Intel 4004, in 1971, and of the first widely used microprocessor, the Intel 8080, in 1974, this class of CPU has almost completely overtaken all other methods of implementing a central processing unit.
Mainframe and minicomputer manufacturers of the time launched proprietary IC development programs to upgrade their older computer architectures, and eventually produced microprocessors with instruction sets that were backward compatible with their older hardware and software.
Combined with the advent and eventual vast success of the now ubiquitous personal computer, the term “CPU” is now applied almost exclusively to microprocessors.
Previous generations of CPUs were implemented as discrete components and numerous small-scale integrated circuits on one or more circuit boards. Microprocessors, on the other hand, are CPUs manufactured on a very small number of ICs, usually just one.
The smaller overall size of a CPU implemented on a single die means faster switching times because of physical factors such as reduced parasitic capacitance of the gates. This has allowed synchronous microprocessors to have clock rates ranging from tens of megahertz to several gigahertz.
In addition, as the ability to build exceedingly small transistors on an IC has improved, the complexity and number of transistors in a single CPU have increased dramatically. This widely observed trend is described by Moore’s law, which has proven to be a fairly accurate predictor of the growth in complexity of CPUs and other ICs.
Although the complexity, size, construction, and general form of CPUs have changed drastically over the past sixty years, the basic design and operation have not changed much. Almost all of today’s common CPUs can be accurately described as von Neumann stored-program machines.
As Moore’s law continues to hold, concerns have arisen about the limits of integrated circuit transistor technology. Extreme miniaturization of electronic gates is making the effects of phenomena such as electromigration and subthreshold leakage much more significant.
These newer concerns are among the many factors leading researchers to investigate new methods of computing, such as the quantum computer, and to expand the use of parallelism and other methods that extend the usefulness of the classical von Neumann model.
How Does a Processor/CPU Work?
The fundamental operation of most CPUs is to execute a sequence of stored instructions called a program. The program is represented by a series of numbers kept in some kind of computer memory. Almost all von Neumann architecture CPUs operate in four steps: read (fetch), decode, execute, and write (write back).
The first step, read, involves retrieving an instruction from program memory. The location in program memory is determined by the program counter, which stores a number identifying the current position in the program. In other words, the program counter keeps track of the CPU’s place in the current program.
After an instruction is read, the program counter is incremented by the length of the instruction word, measured in memory units. The instruction to be read often has to be retrieved from relatively slow memory, which stalls the CPU while it waits for the instruction to be returned. This problem is largely addressed in modern processors by caches and pipeline architectures.
The instruction that the CPU reads from memory determines what the CPU is to do. In the decode step, the instruction is broken up into parts that have meaning to other units of the CPU. The way the numerical instruction value is interpreted is defined by the CPU’s instruction set architecture (ISA).
Often, one group of numbers in the instruction, called the opcode, indicates which operation to perform. The remaining parts of the number usually provide information required for that instruction, such as the operands for an addition.
Such operands may be given as a constant value (called an immediate value), or as the location of a value, which may be a register or a memory address, as determined by some addressing mode.
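The decode step can be sketched in a few lines of Python. The 16-bit instruction layout, the opcode names, and the function name below are invented purely for illustration and do not correspond to any real ISA; the point is only that fixed groups of bits in the instruction word are separated into an opcode and its operands.

```python
# Hypothetical 16-bit instruction format (illustrative only, not a real ISA):
#   bits 15..12: opcode
#   bits 11..8 : destination register number
#   bits  7..0 : immediate (constant) operand
OPCODES = {0x1: "LOAD_IMM", 0x2: "ADD_IMM", 0x3: "JUMP"}


def decode(word):
    """Split a 16-bit instruction word into (operation, register, immediate)."""
    opcode = (word >> 12) & 0xF   # top four bits select the operation
    reg = (word >> 8) & 0xF       # next four bits name a register
    imm = word & 0xFF             # low eight bits hold an immediate value
    return OPCODES.get(opcode, "UNKNOWN"), reg, imm


print(decode(0x2305))  # ('ADD_IMM', 3, 5)
```

A real decoder would also have to handle several instruction formats and addressing modes; this sketch assumes a single fixed format.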
In older designs, the units of the CPU responsible for decoding instructions were fixed hardware devices. However, in more abstract and complicated CPUs and ISAs, a microprogram is often used to help translate instructions into the various configuration signals for the CPU.
This microprogram is sometimes rewritable, so that the way the CPU decodes instructions can be changed even after it has been manufactured.
After the read and decode steps, the execute step is performed. During this step, various units of the CPU are connected so that they can perform the desired operation.
For example, if an addition is requested, an arithmetic logic unit (ALU) is connected to a set of inputs and a set of outputs. The inputs provide the numbers to be added, and the outputs contain the final sum.
The ALU contains the circuitry for performing simple arithmetic and logic operations on its inputs, such as addition and bitwise operations. If the addition produces a result too large for the CPU to handle, an arithmetic overflow flag may also be set in a flags register.
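As a rough illustration of an addition that sets flags, the following Python sketch models an 8-bit adder. The 8-bit width, the function name, and the choice to report both a carry flag (unsigned result too large) and an overflow flag (signed result too large) are assumptions made for the example; real ALUs differ in which flags they provide.

```python
def alu_add8(a, b):
    """Add two 8-bit values and return (result, carry, overflow).

    Assumed model: 'carry' means the unsigned sum did not fit in 8 bits,
    'overflow' means the signed (two's complement) sum did not fit.
    """
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF          # keep only the low 8 bits
    carry = total > 0xFF
    # Signed overflow: inputs have the same sign bit, result has a different one.
    overflow = bool(~(a ^ b) & (a ^ result) & 0x80)
    return result, carry, overflow


print(alu_add8(1, 2))      # (3, False, False)
print(alu_add8(100, 100))  # (200, False, True)
print(alu_add8(200, 100))  # (44, True, False)
```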
The final step, write, simply writes the results of the execute step to some form of memory. Very often, the results are written to an internal CPU register for quick access by subsequent instructions.
In other cases, results may be written to slower but cheaper and larger main memory. Some types of instructions manipulate the program counter rather than directly producing result data.
These are generally called jumps, and they make possible behavior such as loops, conditional program execution (through a conditional jump), and functions in programs. Many instructions also change the state of bits in a “flags” register. These flags can be used to influence how a program behaves, since they often indicate the outcome of various operations.
For example, one type of compare instruction considers two values and sets a number in the flags register according to which one is greater. This flag can then be used by a later jump instruction to determine program flow.
After the instruction has been executed and the resulting data written, the whole process repeats, with the next instruction cycle normally reading the next instruction in sequence because of the incremented value in the program counter. If the completed instruction was a jump, the program counter is instead set to the address of the instruction that was jumped to, and program execution continues normally. In CPUs more complex than the one described here, multiple instructions can be read, decoded, and executed simultaneously.
This section describes what is generally called the classic RISC pipeline, which is quite common among the simple CPUs used in many electronic devices, often called microcontrollers.
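The whole cycle described in this section can be sketched as a toy interpreter. The instruction names (LOADI, ADD, SUBI, JNZ), the two registers, and the single zero flag are hypothetical and chosen only to keep the example short; instructions are stored as already-separated tuples, so the decode step is reduced to unpacking them.

```python
def run(program, max_steps=1000):
    """Run a tiny hypothetical instruction set and return the registers."""
    regs = {"r0": 0, "r1": 0}
    flags = {"zero": False}
    pc = 0                               # program counter: next instruction index
    steps = 0
    while pc < len(program) and steps < max_steps:
        instr = program[pc]              # read the instruction at the program counter
        pc += 1                          # advance to the following instruction
        op, *operands = instr            # decode into opcode and operands
        if op == "LOADI":                # execute, then write the result to a register
            dst, value = operands
            regs[dst] = value
        elif op == "ADD":
            dst, src = operands
            regs[dst] = regs[dst] + regs[src]
        elif op == "SUBI":
            dst, value = operands
            regs[dst] = regs[dst] - value
            flags["zero"] = regs[dst] == 0   # record the outcome in a flag
        elif op == "JNZ":                # conditional jump: overwrite the program counter
            (target,) = operands
            if not flags["zero"]:
                pc = target
        else:
            raise ValueError(f"unknown opcode: {op}")
        steps += 1
    return regs


program = [
    ("LOADI", "r0", 0),    # r0 holds the running total
    ("LOADI", "r1", 5),    # r1 is the loop counter
    ("ADD", "r0", "r1"),   # total += counter
    ("SUBI", "r1", 1),     # counter -= 1, sets the zero flag when it reaches 0
    ("JNZ", 2),            # jump back to the ADD while the counter is not zero
]
print(run(program))  # {'r0': 15, 'r1': 0}
```

The loop adds 5 + 4 + 3 + 2 + 1, showing how a flag set by one instruction steers a later jump.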
Design and Application
How a CPU represents numbers is a design choice that affects the most basic ways in which the device works. Some early digital computers used an electrical model of the common decimal (base ten) numeral system to represent numbers internally.
A few other computers have used more exotic numeral systems, such as ternary (base three). Almost all modern CPUs represent numbers in binary form, with each digit represented by some two-valued physical quantity such as a “high” or “low” voltage.
Related to number representation is the size and precision of the numbers a CPU can represent. In a binary CPU, a bit refers to one significant place in the numbers the CPU deals with.
The number of bits (or numeral places) a CPU uses to represent numbers is often called “word size”, “bit width”, “data path width”, or, when dealing strictly with integers, “integer precision”.
This number differs between architectures, and often between different units of the very same CPU. For example, an 8-bit CPU deals with a range of numbers that can be represented by eight binary digits, each with two possible values, so that the 8 bits in combination have 2^8, or 256, distinct values. In effect, integer size sets a hardware limit on the range of integers that software can run and that the CPU can use directly.
Integer range can also affect the number of memory locations the CPU can address. For example, if a binary CPU uses 32 bits to represent a memory address, and each memory address refers to one octet (8 bits), the maximum amount of memory the CPU can address is 2^32 octets, or 4 GiB.
This is a very simple view of the CPU address space, and many modern designs use much more complex addressing methods, such as paging, to address more memory than their integer range would allow with a flat address space.
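The arithmetic behind these limits is easy to check. The short Python sketch below (function names are illustrative) computes how many distinct values a given word size can hold and how much memory a flat address space of a given width can reach, assuming one octet per address as in the example above.

```python
def unsigned_range(bits):
    """Return (number of distinct values, largest unsigned value) for a word size."""
    count = 2 ** bits
    return count, count - 1


def addressable_bytes(address_bits, bytes_per_location=1):
    """Bytes reachable with a flat address space of the given width."""
    return (2 ** address_bits) * bytes_per_location


print(unsigned_range(8))                # (256, 255)
print(addressable_bytes(32))            # 4294967296
print(addressable_bytes(32) // 2**30)   # 4 (GiB)
```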
Larger integer ranges require more structures to handle the additional digits, and therefore more complexity, size, power usage, and overall expense.
So even though CPUs with much larger word sizes (16, 32, 64, and even 128 bits) are available, it is not at all uncommon to see 4-bit and 8-bit microcontrollers used in modern applications. Simpler microcontrollers are generally cheaper, use less power, and therefore dissipate less heat, all of which can be major design considerations for electronic devices.
In higher-end applications, however, the benefits of the additional range are more significant and often drive design choices.
To gain some of the advantages of both lower and higher bit lengths, many CPUs are designed with different bit widths for different units of the device. For example, the IBM System/370 used a CPU that was primarily 32-bit but used 128-bit precision inside its floating-point units to provide greater accuracy and range in floating-point numbers.
Many later CPU designs use a similar mix of bit widths, especially when the processor is meant for general-purpose use where a reasonable balance of integer and floating-point capability is required.
Most CPUs, and indeed most sequential logic devices, are synchronous in nature. That is, they are designed around, and operate on, a synchronization signal. This signal, known as the clock signal, usually takes the form of a periodic square wave.
By calculating the maximum time that electrical signals take to move through the various branches of a CPU’s many circuits, designers can select an appropriate period for the clock signal.
This period must be longer than the time it takes a signal to move, or propagate, in the worst case. By setting the clock period well above the worst-case propagation delay, it is possible to design the entire CPU, and the way it moves data, around the rising and falling edges of the clock signal.
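The relationship between the worst-case delay and the clock can be sketched numerically. In the Python example below, the path delays and the safety margin are made-up figures, and the function names are illustrative; the sketch only shows that the slowest path, plus some margin, fixes the shortest usable clock period and therefore the highest clock frequency.

```python
def min_clock_period_ns(path_delays_ns, margin=0.10):
    """Clock period: worst-case propagation delay plus a fractional safety margin."""
    worst = max(path_delays_ns)
    return worst * (1.0 + margin)


def max_clock_hz(path_delays_ns, margin=0.10):
    """Highest clock frequency that still respects the worst-case path."""
    return 1e9 / min_clock_period_ns(path_delays_ns, margin)


delays = [1.2, 2.0, 0.8]   # hypothetical path delays in nanoseconds
print(min_clock_period_ns(delays, margin=0.25))   # 2.5
print(max_clock_hz(delays, margin=0.25) / 1e6)    # 400.0 (MHz)
```

Note that the two faster paths (1.2 ns and 0.8 ns) have no influence on the result: the whole design is paced by the 2.0 ns path.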
This has the advantage of simplifying the CPU significantly, both from a design perspective and from a component-count perspective. However, it also has the disadvantage that the entire CPU must wait on its slowest elements, even though some units are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism.
However, architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs. For example, a clock signal is subject to the same delays as any other electrical signal. Higher clock rates in increasingly complex CPUs make it harder to keep the clock signal in phase (synchronized) throughout the entire unit.
This has led many modern CPUs to require that multiple identical clock signals be provided, so that no single signal is delayed enough to cause the CPU to malfunction. Another major issue as clock rates increase dramatically is the amount of heat dissipated by the CPU.
The constantly changing clock causes many components to switch whether or not they are being used at the time. In general, a component that is switching uses more energy than an element in a static state. Therefore, as the clock rate increases, so does heat dissipation, and the CPU requires more effective cooling.
One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the clock signal to unneeded components, effectively disabling them. However, this is often regarded as difficult to implement and therefore does not see common use outside very low-power designs.
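The link between clock rate and power can be made concrete with the usual first-order approximation for switching power in CMOS logic, P = a · C · V^2 · f (activity factor, switched capacitance, supply voltage, clock frequency). This formula is not taken from the text above, and all figures and names in the Python sketch below are illustrative assumptions; clock gating is modeled very crudely as removing a fraction of the switching load.

```python
def dynamic_power_watts(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order switching power estimate: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def gated_power_watts(base_power_w, gated_fraction):
    """Power remaining when the clock is gated off for a fraction of the load."""
    if not 0.0 <= gated_fraction <= 1.0:
        raise ValueError("gated_fraction must be between 0 and 1")
    return base_power_w * (1.0 - gated_fraction)


slow = dynamic_power_watts(0.5, 1e-9, 1.0, 1e9)   # hypothetical chip at 1 GHz
fast = dynamic_power_watts(0.5, 1e-9, 1.0, 2e9)   # same chip at 2 GHz
print(slow, fast)                     # doubling the clock doubles the estimate
print(gated_power_watts(fast, 0.25))  # a quarter of the load gated off
```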
Another method of addressing some of the problems of a global clock signal is to remove the clock signal altogether. Removing the global clock signal makes the design process considerably more complex in many ways, but asynchronous (clockless) designs have marked advantages in power consumption and heat dissipation compared with similar synchronous designs.
Although somewhat uncommon, entire CPUs have been built without a global clock signal. Two notable examples are AMULET, which implements the ARM architecture, and MiniMIPS, which is compatible with the MIPS R3000.
Rather than removing the clock signal entirely, some CPU designs make only certain units of the device asynchronous, for example by using asynchronous ALUs together with superscalar pipelining to achieve some gains in arithmetic performance.
While it is not altogether clear whether fully asynchronous designs can perform at a comparable or better level than their synchronous counterparts, it is evident that they at least excel at simpler mathematical operations. This, combined with their excellent power consumption and heat dissipation properties, makes them very suitable for embedded computers.
The description of the basic operation of a CPU given in the previous section describes the simplest form a CPU can take. This type of CPU, usually called subscalar, operates on and executes one instruction on one or two pieces of data at a time.
This process gives rise to an inherent inefficiency in subscalar CPUs. Since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next one.
As a result, a subscalar CPU gets “hung up” on instructions that take more than one clock cycle to complete. Even adding a second execution unit does not improve performance much: instead of one pathway being stalled, two pathways are now stalled, and the number of unused transistors increases.
This design, in which the CPU’s execution resources can operate on only one instruction at a time, can at best reach scalar performance (one instruction per clock cycle). In practice, however, performance is almost always subscalar (less than one instruction per cycle).
Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that make the CPU behave less linearly and more in parallel. Two terms are generally used to classify these parallel design techniques.
ILP (Instruction-Level Parallelism): seeks to increase the rate at which instructions are executed within a CPU, that is, to increase the utilization of on-die execution resources.
TLP (Thread-Level Parallelism): seeks to increase the number of threads that a CPU can execute simultaneously.
The two methodologies differ both in the way they are implemented and in the relative effectiveness they afford in increasing the CPU’s performance for an application.
ILP: Instruction Pipelining and Superscalar Architecture
One of the simplest methods of achieving increased parallelism is to begin the first steps of reading and decoding an instruction before the previous instruction has finished executing.
This is the simplest form of a technique known as instruction pipelining, and it is used in almost all modern general-purpose CPUs. By breaking the execution pathway into discrete stages, pipelining allows more than one instruction to be in execution at any given time.
This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired.
Pipelining does, however, introduce the possibility that the result of a previous operation is needed to complete the next operation, a condition often termed a data dependency conflict. To cope with this, additional care must be taken to check for such conditions and to delay part of the instruction pipeline when one occurs.
Naturally, accomplishing this requires additional circuitry, so pipelined processors are more complex than subscalar ones, though not very significantly so. A pipelined processor can become very nearly scalar, inhibited only by pipeline stalls (instructions that spend more than one clock cycle in a stage).
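The benefit of pipelining, and the cost of stalls, can be estimated with a simple cycle count. The Python sketch below assumes an idealized machine in which every stage takes exactly one clock cycle; the function names and the example figures are illustrative only.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages, stall_cycles=0):
    """The first instruction fills the pipeline; each later one finishes a cycle apart."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


n, stages = 100, 4
print(cycles_unpipelined(n, stages))                 # 400
print(cycles_pipelined(n, stages))                   # 103
print(cycles_pipelined(n, stages, stall_cycles=20))  # 123
print(round(n / cycles_pipelined(n, stages), 2))     # 0.97 instructions per cycle
```

Without stalls the pipelined design approaches, but never quite reaches, one instruction per cycle, which matches the "very nearly scalar" behavior described above.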
Further improvement on the idea of instruction pipelining led to a method that decreases the idle time of CPU components even further.
Designs said to be superscalar include a long instruction pipeline and multiple identical execution units. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously).
If so, they are dispatched to available execution units, which allows several instructions to be executed simultaneously.
In general, the more instructions a superscalar CPU can dispatch simultaneously to waiting execution units, the more instructions are completed in a given cycle.
Most of the difficulty in designing a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to determine quickly and correctly whether instructions can be executed in parallel, and to dispatch them so as to keep as many execution units busy as possible.
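The core decision a dispatcher makes can be sketched as a register dependency check. The Python example below is a deliberately simplified, in-order model: the instruction representation, the register names, and the greedy grouping policy are all assumptions for illustration, and real hardware uses techniques such as register renaming to remove some of these conflicts.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Instr = namedtuple("Instr", ["dst", "srcs"])


def can_issue_together(first, second):
    """True if `second` neither depends on nor conflicts with `first`."""
    raw = first.dst in second.srcs    # second reads what first writes
    waw = first.dst == second.dst     # both write the same register
    war = second.dst in first.srcs    # second overwrites what first reads
    return not (raw or waw or war)


def dispatch(instructions, width=2):
    """Greedily group in-order instructions into bundles of up to `width`."""
    bundles = []
    current = []
    for instr in instructions:
        fits = len(current) < width and all(
            can_issue_together(prev, instr) for prev in current
        )
        if current and not fits:
            bundles.append(current)   # close the bundle and start a new one
            current = []
        current.append(instr)
    if current:
        bundles.append(current)
    return bundles


i1 = Instr("r1", ("r2", "r3"))
i2 = Instr("r4", ("r1", "r5"))   # needs r1, so it cannot issue with i1
i3 = Instr("r6", ("r7", "r8"))   # independent of i2
print([len(bundle) for bundle in dispatch([i1, i2, i3])])  # [1, 2]
```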
This requires that the instruction pipeline be filled as often as possible and gives rise to the need for significant amounts of CPU cache in superscalar architectures. It also makes hazard-avoiding techniques such as branch prediction, speculative execution, and out-of-order execution crucial to maintaining high levels of performance.
Branch prediction attempts to predict which branch (or path) a conditional instruction will take, so that the CPU can minimize the number of times the entire pipeline must wait until a conditional instruction is completed.
Speculative execution often provides modest performance increases by executing portions of code that may or may not be needed after a conditional operation completes.
Out-of-order execution rearranges, to some extent, the order in which instructions are executed to reduce delays due to data dependencies.
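To give a feel for how branch prediction can work, the Python sketch below implements a two-bit saturating counter, one widely taught prediction scheme. The text above does not say which scheme any particular CPU uses, so this is only an illustrative choice, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Saturating counter: states 0-1 predict 'not taken', 2-3 predict 'taken'."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, staying within 0..3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes):
    """Fraction of branch outcomes the predictor guesses correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch: taken nine times, then not taken once when the loop exits.
loop_branch = [True] * 9 + [False]
print(accuracy(loop_branch))  # 0.8
```

The predictor misses only the first iteration and the loop exit, which is why loop-heavy code tends to predict well.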
When only part of the CPU is superscalar, the part that is not suffers a performance penalty due to scheduling stalls. The original Intel Pentium (P5) had two superscalar ALUs that could each accept one instruction per clock cycle, but its FPU could not accept one instruction per clock cycle.
Thus the P5 was integer superscalar but not floating-point superscalar. Intel’s successor to the Pentium architecture, the P6, added superscalar capabilities to its floating-point features, and therefore afforded a significant increase in floating-point instruction performance.
Both simple pipelining and superscalar design increase a CPU’s ILP by allowing a single processor to complete instructions at rates surpassing one instruction per cycle (IPC). Most modern CPU designs are at least somewhat superscalar, and nearly all general-purpose CPUs designed in the last decade are superscalar.
In recent years, some of the emphasis in designing high-ILP computers has moved out of the CPU’s hardware and into its software interface, or ISA. The VLIW (Very Long Instruction Word) strategy causes some ILP to be implied directly by the software, reducing the amount of work the CPU must do to boost ILP and thereby reducing the design’s complexity.
TLP: Simultaneous Execution of Threads
Another commonly used strategy for increasing the parallelism of CPUs is to include the ability to run multiple threads (programs) at the same time.
In general, high-TLP CPUs have been in use much longer than high-ILP ones. Many of the designs pioneered by Seymour Cray in the 1970s and early 1980s concentrated on TLP as their primary method of enabling enormous computing capability. In fact, TLP in the form of multiple-thread execution improvements has been in use since the 1950s.
In the context of individual processor design, the two main methods used to obtain TLP are CMP (Chip-level multiprocessing) and SMT (Simultaneous multithreading).
It is also very common to build computers with multiple completely independent CPUs in arrangements such as SMP (symmetric multiprocessing) and NUMA (non-uniform memory access). Although very different means are used, all of these techniques accomplish the same goal: increasing the number of threads that the CPU or CPUs can run in parallel.
The CMP and SMP methods of parallelism are similar to each other and the most straightforward. They involve little more, conceptually, than the use of two or more complete and independent CPUs. In the case of CMP, multiple processor cores are included in the same package, sometimes on the very same integrated circuit.
SMP, on the other hand, involves multiple independent packages. NUMA is somewhat similar to SMP but uses a non-uniform memory access model. This is important for computers with many CPUs, because under SMP’s shared memory model each processor’s access to memory quickly becomes a bottleneck, resulting in significant slowdown as processors wait for memory.
NUMA is therefore considered a much more scalable model, allowing many more CPUs to be used in one computer than SMP can feasibly support. SMT differs somewhat from the other TLP improvements in that it attempts to duplicate as few portions of the CPU as possible.
Although it is considered a TLP strategy, its implementation actually resembles a superscalar design more closely, and indeed it is often used in superscalar microprocessors such as IBM’s POWER5.
Rather than duplicating the entire CPU, SMT designs duplicate only the parts needed for reading, decoding, and dispatching instructions, as well as things like the general-purpose registers.
This allows an SMT CPU to keep its execution units busy more often by feeding them instructions from two different software threads. Again, this is very similar to the superscalar ILP method, but instead of simultaneously executing multiple instructions from the same thread, it simultaneously executes instructions from multiple threads.
A less common but increasingly important processor paradigm deals with vectors. The processors discussed so far are all referred to as some type of scalar device.
As the name implies, vector processors deal with multiple pieces of data in the context of one instruction, in contrast to scalar processors, which deal with one piece of data for every instruction.
These two schemes of dealing with data are generally referred to as SISD (Single Instruction, Single Data) and SIMD (Single Instruction, Multiple Data), respectively.
The great utility of processors that deal with vectors of data lies in optimizing tasks that require the same operation, for example a sum or a dot product, to be performed on a large set of data. Some classic examples of such tasks are multimedia applications (images, video, and sound) and many types of scientific and engineering work.
Whereas a scalar processor must complete the entire process of reading, decoding, and executing each instruction and value in a set of data, a vector processor can perform a single operation on a comparatively large set of data with one instruction.
Of course, this is only possible when the application requires many steps that apply one operation to a large set of data.
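The difference between the two schemes can be illustrated by counting how many add instructions each would need. The Python sketch below is only a model: plain lists stand in for registers, the four-lane width is an arbitrary assumption, the inputs are assumed to have equal length, and the function names are invented for the example.

```python
def scalar_add(a, b):
    """SISD model: one add instruction per pair of elements."""
    out = []
    for x, y in zip(a, b):
        out.append(x + y)
    return out, len(out)          # (result, instructions issued)


def vector_add(a, b, lanes=4):
    """SIMD model: one add instruction per group of `lanes` elements."""
    out = []
    issued = 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        issued += 1
    return out, issued


a = list(range(10))
b = [10] * 10
scalar_result, scalar_count = scalar_add(a, b)
vector_result, vector_count = vector_add(a, b, lanes=4)
print(scalar_result == vector_result)   # True
print(scalar_count, vector_count)       # 10 3
```

Both versions produce the same sums, but the vector model issues three instructions instead of ten because each instruction covers up to four elements.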
Most early vector CPUs, such as the Cray-1, were associated almost exclusively with scientific research and cryptography applications. However, as multimedia has largely shifted to digital media, the need for some form of SIMD in general-purpose CPUs has become significant.
Shortly after floating-point units became commonplace in general-purpose processors, specifications for and implementations of SIMD execution units for general-purpose CPUs also began to appear. Some of these early SIMD specifications, such as Intel’s MMX, were integer-only.
This proved to be a significant impediment for some software developers, since many of the applications that benefit from SIMD deal primarily with floating-point numbers.
Progressively, these early designs were refined and remade into some of the common, modern SIMD specifications, which are usually associated with one ISA. Some notable modern examples are Intel’s SSE and the PowerPC-related AltiVec.