How Does a Simple Processor Work?

Barış Ekin Yıldırım
10 min read · Jul 3, 2021


The Central Processing Unit (CPU), also known as a microprocessor or simply a processor, processes data, as the name suggests. The data it processes varies with the program, which could be a game or a LibreOffice file. Whatever the program is, the CPU doesn't care, because it doesn't see the data the way the end user does. It only looks at the instructions the program contains.

Here is what happens when you click an icon to run a program:

  • The program, located on the storage unit, is loaded into RAM. The program consists of instructions.
  • The CPU pulls the program from RAM and loads it into itself using a circuit called the memory controller.
  • The data is now inside the CPU, and the CPU starts processing it.
  • What happens next depends on the program itself. The processor either continues by fetching the next instruction or does something with the data it has processed, such as printing the output to the screen.

In the past, the CPU itself controlled the data transfer between storage and RAM. Since storage units and RAM were slow, the CPU had to wait a long time for each transfer to finish, so the whole system slowed down. This scheme is called programmed I/O (PIO), and it disappeared with the advent of Direct Memory Access (DMA). Data transfers are now made through the DMA controller, which no longer ties up the processor unnecessarily.

Clock and Multipliers

The clock is a signal used to synchronize operations inside the computer. A clock signal is a constant, continuous square wave, which is simply the zeros and ones we already know. In the image below you can see three complete cycles, each called a "tick". The clock signal is measured in Hertz (Hz), the number of cycles produced per second. A 100 MHz clock means 100 million cycles per second.

Figure 1.0 — Clock Signal
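
To make the numbers concrete, here is a minimal sketch that converts the 100 MHz figure from above into the duration of a single tick:

```c
#include <stdio.h>

int main(void) {
    double freq_hz = 100e6;          /* the 100 MHz clock from the text */
    double period_s = 1.0 / freq_hz; /* duration of one full cycle ("tick") */
    printf("One tick lasts %.0f ns\n", period_s * 1e9); /* prints 10 ns */
    return 0;
}
```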

In computers, everything is measured in cycles. For example, a memory with a latency of 10 starts transferring data only after 10 full clock cycles. Each CPU instruction also has its own latency; for example, instruction x might start running only after 6 full cycles. One of the interesting things about the CPU is that it knows the latency of every instruction: it keeps latency tables on the chip and consults them as needed.

Now you might ask, "What is the relationship between clock speed and performance?" And you are right to ask, because it often confuses people. Between two otherwise identical CPUs, the higher-clocked one performs better, because it gets through the same cycles in less time. Between two different processors, however, that logic does not hold. Comparing ARM and Intel on clock speed alone would be ridiculous, because the two designs are different: if an instruction takes 5 cycles on Intel and 7 on ARM at the same clock speed, Intel naturally finishes it first.
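
Here is a quick worked sketch of that logic, with purely hypothetical cycle counts and clock speeds: the time an instruction takes is its cycle count divided by the clock frequency, so a design that needs more cycles can still win with a high enough clock.

```c
#include <stdio.h>

/* Hypothetical numbers; neither line reflects a real product. */
int main(void) {
    double cpu_a_cycles = 5.0, cpu_a_hz = 3.0e9; /* 5 cycles @ 3.0 GHz */
    double cpu_b_cycles = 7.0, cpu_b_hz = 4.5e9; /* 7 cycles @ 4.5 GHz */

    /* time = cycles / frequency */
    printf("CPU A: %.2f ns\n", cpu_a_cycles / cpu_a_hz * 1e9); /* 1.67 ns */
    printf("CPU B: %.2f ns\n", cpu_b_cycles / cpu_b_hz * 1e9); /* 1.56 ns */
    return 0;
}
```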

When it comes to modern processors, there are also different cache sizes, numbers of execution units, instruction types and runtimes, and so on, all of which affect performance beyond the raw clock speed.

If the processor's clock speed is pushed too high, a serious problem appears: the clock signal on the motherboard can no longer stay synchronized with the processor's. If you have ever examined a motherboard, you have seen the data lines running across it. When the clock signal gets too fast, those data lines start to act like antennas, and the data turns into stray radio emissions.

Figure 1.1 — Motherboard bus lines

To solve this issue, processor manufacturers introduced a concept called the clock multiplier. An internal clock is placed inside the processor, and that internal clock multiplies the clock coming from the motherboard. For example, suppose an Intel Core i7 has a clock multiplier of 19: multiplying a 200 MHz motherboard clock by 19 gives a processor clock of 3.8 GHz (the motherboard speed here is purely hypothetical).

Even so, the difference between the internal and external clocks still causes a large performance loss, because the processor running at 3.8 GHz drops back to 200 MHz whenever it reads data from RAM, since the transfer has to pass through the northbridge. That speed gap can be avoided to some extent with the caches built into the processor. Another method is to transfer more data per cycle: two transfers per cycle is called DDR (Double Data Rate), and four transfers per cycle is called QDR (Quad Data Rate).

Figure 1.2 — Transferring multiple data in a single cycle

Summarized Architecture of a Modern Processor

The image below shows a highly simplified architecture of a modern processor. The schematic depicts a generic, general-purpose processor; AMD, Intel, and ARM each use more specific designs.

Figure 1.3 — A scheme of a Modern CPU

The striped part of the diagram represents the inside of the CPU. The bus between RAM and CPU is usually 64 bits wide, or 128 bits when dual-channel memory is active. It runs at the RAM or CPU clock speed, whichever is lower at that moment.

The theoretical upper limit of the transfer rate can be found with a formula like this:

x clocks per second × y transfers per clock × 64 bits per transfer × z interfaces = transfer rate in GB/s
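
For instance, plugging in hypothetical single-channel DDR figures (a 1600 MHz I/O clock with two transfers per cycle on a 64-bit bus, roughly DDR4-3200):

```c
#include <stdio.h>

/* Hypothetical single-channel DDR4-3200-style figures. */
int main(void) {
    double clocks_per_sec      = 1.6e9; /* x: 1600 MHz I/O clock */
    double transfers_per_clock = 2.0;   /* y: DDR, two transfers per cycle */
    double bits_per_transfer   = 64.0;  /* 64-bit memory bus */
    double interfaces          = 1.0;   /* z: single channel */

    double bits_per_sec = clocks_per_sec * transfers_per_clock
                        * bits_per_transfer * interfaces;
    printf("%.1f GB/s\n", bits_per_sec / 8.0 / 1e9); /* prints 25.6 GB/s */
    return 0;
}
```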

All units within the striped section operate according to the CPU's internal clock. Depending on the CPU, some units may even run faster than the internal clock, and the buses between units can be faster as well. For example, in modern processors the bus between the L2 cache and the L1 data cache is usually 256 bits wide. The higher the bandwidth, the higher the transfer rate. The colors of the arrows in the diagram indicate the different clock speeds.

Memory Cache

Memory cache is built from a type of high-performance memory called static memory, while the memory used as general-purpose RAM is dynamic memory. Static memory consumes more power, takes up more physical space, and is more expensive, but it can run at the same clock speed as the CPU, which dynamic memory cannot.

We use caches because going off-chip to fetch data forces the CPU to work at a lower clock speed. When the CPU pulls data from a certain position, a circuit called the memory cache controller steps in and loads the entire surrounding block into the cache. Since programs usually run as a sequential stream, the next memory position the CPU requests will already be waiting in the cache. Therefore, the larger the cache, the better. When the CPU finds the data it needs in its cache, that is called a "hit"; when it does not, a "miss".

For example, if the CPU pulls data from address 1000, the cache controller loads everything up to address 1000 + n and caches it. The amount indicated as "n" here is called a "page": if your CPU works with 4 KB pages, the cache controller also caches the 4096 addresses that follow the requested one.

Figure 1.4 — Basic explanation of memory cache controller’s working principle
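
Here is a minimal sketch of that behavior, assuming a toy cache that remembers only the most recently loaded 4 KB block:

```c
#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE 4096 /* bytes loaded around each request */

/* Toy model: the "cache" remembers which page was loaded last.
 * A request inside that page is a hit; anything else is a miss
 * that makes the controller load the whole surrounding page. */
static long cached_page = -1;

static bool access_address(long addr) {
    long page = addr / PAGE_SIZE;
    if (page == cached_page)
        return true;    /* hit: the data is already in the cache */
    cached_page = page; /* miss: the controller loads the block */
    return false;
}

int main(void) {
    long addresses[] = {1000, 1004, 2000, 5000, 5096};
    for (int i = 0; i < 5; i++)
        printf("addr %ld -> %s\n", addresses[i],
               access_address(addresses[i]) ? "hit" : "miss");
    return 0;
}
```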

L1 and L2 stand for "Level 1" and "Level 2", an indication of how far each cache is from the CPU core. If you look at Figure 1.3, you might wonder why there are two L1 caches but a single L2. That is because the L1 instruction cache works as an input cache while the L1 data cache works as an output cache. The L1 instruction cache is usually smaller than the L2 cache: it sits right next to the core and mostly holds tight, loop-style code, so it does not need as much space.

Branching

One of the most important problems for the CPU is running into too many misses. On a miss, the CPU must go out to RAM, which operates at a slower clock speed. In general the memory cache is enough to prevent this, but in some cases it falls short. Imagine a program running smoothly, line by line through a page, until it encounters a JMP instruction: execution jumps to some new location n. If that location has not been loaded into the L2 cache, it has to be fetched straight from RAM. As a solution, when modern CPUs come across a JMP on a page, they go and fetch the target's page in advance, without waiting for the jump to run.
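
A rough sketch of that idea, with a made-up page of instructions: the prefetcher scans the current page for a JMP and requests the target's page before the jump is ever executed.

```c
#include <stdio.h>

#define PAGE_SIZE 4096 /* page granularity, as in the cache example */

/* Toy model: while the current page runs, scan ahead for a JMP and
 * pull its target page into the cache before the jump executes. */
int main(void) {
    const char *page[] = {"ADD", "SUB", "JMP 20000", "ADD"};
    for (int i = 0; i < 4; i++) {
        long target;
        if (sscanf(page[i], "JMP %ld", &target) == 1)
            printf("prefetch page %ld for the jump target\n",
                   target / PAGE_SIZE); /* prints page 4 */
    }
    return 0;
}
```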

Processing Instructions

The fetch unit pulls instructions from memory. First it checks whether the instruction the CPU needs is in the L1 instruction cache. If not, it goes to the L2 memory cache. If it still can't find what it's looking for, it goes to RAM, which is much slower.

After the fetch unit pulls an instruction from memory, it sends it to the decode unit, which works out what the instruction does. To do so, the decode unit refers to the microcode, a ROM that resides inside the CPU. After the microcode reveals the meaning of the instruction, the decoder sends it on to the appropriate execute unit. Without microcode, the CPU cannot understand what instructions mean: microcode contains step-by-step directions describing how each instruction works. Since the ROM on the CPU itself cannot be rewritten, the operating system can load updated microcode at every boot. Also, in modern CPUs multiple instructions can run in parallel, and each execution unit is specialized. Although we are describing a general-purpose processor here, it is worth going into some detail.
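
As a toy illustration of the fetch > decode > execute flow (the opcodes and their meanings below are entirely made up):

```c
#include <stdio.h>

/* A made-up four-instruction program sitting in "RAM". */
enum opcode { LOAD, ADD, STORE, HALT };

int main(void) {
    enum opcode program[] = {LOAD, ADD, STORE, HALT};
    int pc = 0;  /* program counter used by the fetch stage */
    int acc = 0; /* accumulator the execute stage works on */

    for (;;) {
        enum opcode inst = program[pc++];        /* fetch */
        switch (inst) {                          /* decode + execute */
        case LOAD:  acc = 40;             break; /* load a constant */
        case ADD:   acc += 2;             break; /* do some math */
        case STORE: printf("%d\n", acc);  break; /* "output": prints 42 */
        case HALT:  return 0;                    /* stop the loop */
        }
    }
}
```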

The best example of such a specialized execution unit is the FPU. The Floating-Point Unit is designed at the circuit level to perform complex mathematical operations. If the CPU decides that an incoming operation is mathematical (the unit that decides is called the dispatch or schedule unit), it sends the operation to the FPU rather than the ALU. When the operation finishes, the result is sent to the L1 data cache.

Another interesting thing about CPUs is that they can run multiple instructions simultaneously at different stages, which is called pipelining. For example, the fetch unit would sit idle after sending an instruction to the decode unit; to avoid that, it goes and pulls the next instruction instead. This keeps the fetch > decode > execute loop constantly fed. Pipelines in new-generation Intel processors can go up to 14 stages. Architectures that can execute several instructions at the same time across multiple units are called "superscalar".
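
The sketch below prints which instruction occupies each stage of an idealized three-stage pipeline, cycle by cycle, assuming no stalls; from cycle 3 onward all three stages are busy at once:

```c
#include <stdio.h>

/* Idealized three-stage pipeline with four instructions and no stalls. */
int main(void) {
    const char *stages[] = {"Fetch", "Decode", "Exec"};
    int n_inst = 4, n_stages = 3;

    for (int cycle = 0; cycle < n_inst + n_stages - 1; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int s = 0; s < n_stages; s++) {
            int inst = cycle - s; /* which instruction is in stage s */
            if (inst >= 0 && inst < n_inst)
                printf("  %s<-I%d", stages[s], inst + 1);
        }
        printf("\n");
    }
    return 0;
}
```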

Out-Of-Order Execution (OOO)

I mentioned the parallel processing capabilities of CPUs just above, and that there is more than one execution unit (ALU, FPU, and so on). As an example, imagine that our CPU has 4 ALUs and 2 FPUs, and that we have the flow below.

1. General Instruction
2. General Instruction
3. General Instruction
4. General Instruction
5. General Instruction
6. General Instruction
7. Mathematical Instruction
8. General Instruction
9. General Instruction
10. Mathematical Instruction

What happens here: the schedule/dispatch unit sends the first 4 instructions to the ALUs and would then have to wait for an ALU to free up before issuing the 5th. The CPU does not sit idle during this time; the OOO unit kicks in, skips instructions 5 and 6, and checks the rest of the queue. When it reaches the 7th instruction, a mathematical one, it sends it to an FPU. Then it keeps scanning the queue until it sends the 10th instruction to the second, idle FPU. Thus, no unit is left empty.
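
Here is a minimal sketch of a single dispatch pass over that queue ('G' for general, 'M' for mathematical), assuming 4 free ALUs and 2 free FPUs, purely to show the skipping behavior:

```c
#include <stdio.h>

/* One dispatch pass: 'G' needs an ALU, 'M' needs an FPU. Instructions
 * whose units are all busy simply wait while later ones proceed. */
int main(void) {
    char queue[] = {'G','G','G','G','G','G','M','G','G','M'};
    int free_alus = 4, free_fpus = 2;

    for (int i = 0; i < 10; i++) {
        if (queue[i] == 'G' && free_alus > 0) {
            free_alus--;
            printf("inst %2d -> ALU\n", i + 1);
        } else if (queue[i] == 'M' && free_fpus > 0) {
            free_fpus--;
            printf("inst %2d -> FPU\n", i + 1);
        } else {
            printf("inst %2d -> waits for a free unit\n", i + 1);
        }
    }
    return 0;
}
```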

Speculative Execution

So what would the CPU do if there were a conditional branch among the instructions above? It would work on both possibilities. For example:

1. General Instruction
2. General Instruction
3. If A <= B, go to the 15th instruction
4. General Instruction
5. General Instruction
6. General Instruction
7. Mathematical Instruction
8. General Instruction
9. General Instruction
10. Mathematical Instruction
...
15. Mathematical Instruction
16. General Instruction

After the out-of-order unit analyzes this program, it takes the 15th instruction and throws it into an idle FPU, because the 15th instruction is a possible outcome of the 3rd. If the 3rd instruction resolves to A > B, the CPU cancels the work done for the 15th instruction and the program runs normally; otherwise the program runs faster, because the result of the 15th instruction has already been computed and loaded into the cache.
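
A loose sketch of the same idea in ordinary code, where the "7 * 6" stands in for the work of instruction 15 and all values are made up:

```c
#include <stdio.h>

int main(void) {
    int a = 3, b = 5;

    /* Speculation: compute instruction 15's result while the
     * comparison in instruction 3 is still "in flight". */
    int speculative = 7 * 6;

    int condition = (a <= b); /* instruction 3 finally resolves */
    if (condition)
        printf("use precomputed result: %d\n", speculative); /* branch taken */
    else
        printf("discard the speculative result\n"); /* wasted, but harmless */
    return 0;
}
```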

Please feel free to correct me if you catch any mistakes, I'm open to constructive criticism. Thank you!
