1. Introduce¶

约 1419 个字预计阅读时间 7 分钟

冯·偌伊曼架构
计算与存储分离
数据和指令放在同一个存储器

1.1 Eight Great Ideas¶

Design for Moore’s Law （设计紧跟摩尔定律）
- Moore's Law: Integrated circuit resources double every 18-24 months. 摩尔定律：集成电路资源每18-24个月翻一番。
- Design for where it will be when finishes rather than design for where it starts. 设计完成时的位置，而不是设计开始的位置。
Use Abstraction to Simplify Design (采用抽象简化设计)
- 层次化、模块化的设计
- Instruction set architecture---- the interface between hardware and lowest-level software ISA 连接硬件与底层软件
Make the Common Case Fast (加速大概率事件)
Performance via Parallelism (通过并行提高性能)
Performance via Pipelining (通过流水线提高性能)
- 换句话说就是，每个流程同时进行，只不过每一个流程工作的对象是时间上相邻的若干产品；
- 相比于等一个产品完全生产完再开始下一个产品的生产，会快很多；
- 希望每一个流程的时间是相对均匀的；
Performance via Prediction(通过预测提高性能)
- 例如先当作 if() 条件成立，执行完内部内容，如果后来发现确实成立，那么直接 apply，否则就再重新正常做；
- 这么做就好在（又或者说只有这种情况适合预测），预测成功了就加速了，预测失败了纠正的成本也不高；
Hierarchy of Memories (存储器层次)
- Disk / Tape -> Main Memory(DRAM) -> L2-Cache(SRAM) -> L1-Cache(On-Chip) -> Registers 容量从左到右，依次减小；速度从左到右，依次增大
Dependability via Redundancy (通过冗余提高可靠性)
- 类似于卡车的多个轮胎，一个模块 down 了以后不会剧烈影响整个系统；

1.2 Below Your Program¶

Application software ： aimed at users
System software : aimed at programming.
- 包含： Operation System；Compiler；Assembler

1.3 Computer Language¶

Machine language 机器语言
- Binary numbers express machine instructions
  
  ex. 1000110010100000 means to add two numbers
Assembly language 汇编语言
- Symbolic notations
  
  ex. add A, B
- The assembler translates them into machine instruction
  
  汇编器将汇编语言翻译为机器语言
High-level programming language 高级编程语言
- Notations more closer to the natural language ex. A + B
- The compiler translates them into assembly language statements
  
  编译器将高级编程语言翻译为汇编语言

1.4 Computer Organization and Hardware System¶

The five classic components of a computer: 计算机的五个经典组件：

input;
output;
memory(DRAM);
- 程序、数据存储的地方，也是程序“运行的位置”；
- cache memory (SRAM): buffer for the DRAM memory;
datapath
- 负责实际的数据处理；
control
- 负责指挥控制如何进行数据处理，给出控制信号；

processor / (central processor unit)CPU = datapath + control

处理器/（中央处理器单元）CPU = 数据路径 + 控制

1.6 Performance¶

1.6.1 Execution Time¶

Response time && Throughout
- Response time / execution time: how long it takes to do a task
  
  响应时间：从程序开始到结束的时间（个人）
- Throughout： total work done per unit time
  
  吞吐量：单位时间内完成的任务数量（服务器）

Replacing the processor with a faster version? 改变response time 和 throughout

Adding more processors ? 只改变throughout

定义$performance = \dfrac{1}{Execution \ time}$

X is n times faster than Y

$$ \dfrac{{Performance}_x}{{performance}_y} = \dfrac{{execution time}_y}{{execution time}_x} = n\ $$

$也就是说x的运行时间是y的\dfrac{1}{n}$
关于 Execution Time
- Elapsed time
  
  包含 Processing，I/O， OS overhead， idle time
- CPU time

1.6.2 CPU Clocking¶

Clock period: duration of a clock cycle

$eg:250ps = 0.25ns = 250 \times 10^{-12}s$
Clock frequency (rate): cycles per second

$eg:4.0GHz = 4000MHz = 4.0 \times 10^9Hz$

\[ \begin{aligned} CPU \ Time &= CPU \ Clock \ Cycles \times Clock \ Cycle \ Time \\ &= \dfrac{CPU \ Clock \ Cycles}{Clock \ Rate} \end{aligned} \]

Practice:

Computer A: 2 GHz clock, 10s CPU time

Designing Computer B :

Aim for 6 s CPU time ;

Can do faster clock, but causes 1.2 × clock cycle

How fast must Computer B clock be?

\[ \begin{aligned} &{Clock \ Cycle}_A = \dfrac{CPU \ Time}{CPU \ Clock \ Time } = CPU \ Time \times CPU \ Rate = 2 \times 10 ^{10}\\ &{Clock \ Cycle}_B = 1.2 \times {Clock \ Cycle}_A = 2.4 \times 10^{10}\\ &Clock \ Rate_B = \dfrac{ Clock \ Cycle_B}{CPU \ Time_B } = 4GHz \end{aligned} \]

每条指令的平均周期数(Average cycles per instruction)又缩写为 CPI。

由CPU的硬件决定，不同的指令有不同的CPI $$ CPU Time = Instruction count \times CPI \times Clock cycle time $$

Performance depends on

Algorithm: affects IC(指令条数), possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture ： IC，CPI，T_c($\dfrac{Seconds}{ClockCycle}$)

1.7 Incredible performance improvement¶

Power Wall

$Power = Capactive \ load \times {Voltage}^2 \times Frequency$

频率负载电容电压频率

主频提高了很多，但功耗并没有得到这么多的提升，因为我们降低了工作电压 (5V-1V)

电压不能一味下降，会造成电流过大，引发功耗浪费
Memory Wall

Memory 的性能增长不如 CPU 的性能增长，大部分时间花在读写内存了，影响整体性能
ITP Wall（指令集并行）

refers to increasing difficulty to find enough parallelism in the instructions stream of a single process to keep higher performance processor cores busy.

很难在单个进程的指令流中找到足够的并行度，以保持更高性能的处理器内核繁忙。

$ITP \rightarrow TLP \ and DLP$

从指令集成到线程集成+数据集成

1.8 Multiprocessors 多核¶

Amdahl's Law: Improve an aspect of a computer and expecting improvement in overall performance.

阿姆达尔定律：改进计算机的某个方面，并期望整体性能得到改善。实际上, $T_{improve} = \dfrac{T_{affected}}{improvement \ factor} + T_{unaffected}$

e.g.** 对某一方面优化 90%, 并不能使 CPU 整体性能优化 90%. Corollary: make the common case fast. 推论：使常见情况快速。**
Low Power Not at Idle.

低功耗，不处于空闲状态。机器在没有工作时也有功耗损失。
MIPS as a Performance Metric

MIPS 作为性能指标,不同的 ISA 之间不能仅凭 MIPS 比较。
- MIPS: Millions of Instructions Per Second
  
  MIPS：每秒数百万条指令
  
  \[ \begin{aligned} MIPS &= \dfrac{Instruction \ count}{Execution time \times 10^6} \\ &=\dfrac{Instruction \ count}{CPI \times Instruction \ count \times cycle \ clock\ time \times 10^6}\\ &= \dfrac{clock \ rate}{CPI \times 10^6} \end{aligned} \]
唯一的性能标准是CPU Time,不同的 ISA 之间不能仅凭 MIPS 比较。