Atomic Variable Demo

Posted on 2022-12-07 Edited on 2022-12-08

接上一篇文章, 这一篇我们来用一些示例程序, 来探讨一下在不同平台下原子变量的行为.

示例代码库我放在了 https://github.com/RiversJin/AtomicVariableDemo
使用Rust实现(想顺便温习一下Rust的语法)

项目结构很简单

abc@310a0ca1fb3d:~/workspace/t/src$ tree
.
├── bin
│   ├── t1.rs
│   ├── t2.rs
│   ├── t3.rs
│   ├── t4.rs
│   └── t5.rs
└── main.rs

1 directory, 6 files

计数器

首先是main.rs, 这里比较了两种计数器相加的方法, 一个是使用普通变量全局变量R_MUT, 另一个是使用原子变量R, 其中, 原子变量使用Relaxed的内存序约束.

这种情况下, 无论什么平台, 普通变量的数字都可能会发生错误, 而原子变量正常.

Windows amd64:

PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run  --bin t
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target\debug\t.exe`
R:100000000 r:24134332
PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run  --bin t
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target\debug\t.exe`
R:100000000 r:24366177
PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run -r --bin t
    Finished release [optimized] target(s) in 0.01s
     Running `target\release\t.exe`
R:100000000 r:10000004
PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run -r --bin t
    Finished release [optimized] target(s) in 0.01s
     Running `target\release\t.exe`
R:100000000 r:10000004

可以看到, 如果不使用release配置编译, 普通变量的结果每次都不同, 发生了竞态行为.
而采用release编译后, 编译器可能对循环相加的操作做出了一些优化, 每次运行的值稳定了, 虽然结果不对.

接下来, 使用弱内存序的Aarch64架构实验, 得到的结论类似:

abc@310a0ca1fb3d:~/workspace/t/src$ cargo run --bin t
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `/config/workspace/t/target/debug/t`
R:100000000 r:74990565
abc@310a0ca1fb3d:~/workspace/t/src$ cargo run --bin t
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `/config/workspace/t/target/debug/t`
R:100000000 r:74348213
abc@310a0ca1fb3d:~/workspace/t/src$ cargo run -r --bin t
   Compiling t v0.1.0 (/config/workspace/t)
    Finished release [optimized] target(s) in 2.51s
     Running `/config/workspace/t/target/release/t`
R:100000000 r:10000004
abc@310a0ca1fb3d:~/workspace/t/src$ cargo run -r --bin t
    Finished release [optimized] target(s) in 0.01s
     Running `/config/workspace/t/target/release/t`
R:100000000 r:10000004

标志位

这个测试中, 我们可以看到存在 load-load(读取flag后再读值), store-store(先写值再写flag), 还有load-store(读取完成后清空flag)

t1.rs, t2.rs与t3.rs分别展示了使用普通变量, 误用原子变量以及正常使用原子变量做并发标志位的情形.我们先测试一下Amd64的情况:

PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run  --bin t1
   Compiling t v0.1.0 (C:\Users\rivers\Desktop\AtomicVariableDemo-master)
    Finished dev [unoptimized + debuginfo] target(s) in 0.59s
     Running `target\debug\t1.exe`
PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run  --bin t2
   Compiling t v0.1.0 (C:\Users\rivers\Desktop\AtomicVariableDemo-master)
    Finished dev [unoptimized + debuginfo] target(s) in 0.52s
     Running `target\debug\t2.exe`
PS C:\Users\rivers\Desktop\AtomicVariableDemo-master> cargo run  --bin t3
   Compiling t v0.1.0 (C:\Users\rivers\Desktop\AtomicVariableDemo-master)
    Finished dev [unoptimized + debuginfo] target(s) in 0.51s
     Running `target\debug\t3.exe`

而Aarch64下, 稍稍等一会(我这里跑了3分钟), 就可以看到t1, t2出现了异常值:

abc@310a0ca1fb3d:~/workspace/t/src$ cargo run --bin t1
   Compiling t v0.1.0 (/config/workspace/t)
    Finished dev [unoptimized + debuginfo] target(s) in 1.60s
     Running `/config/workspace/t/target/debug/t1`
Get! V1.0=0
^C
abc@310a0ca1fb3d:~/workspace/t/src$ cargo run --bin t2
   Compiling t v0.1.0 (/config/workspace/t)
    Finished dev [unoptimized + debuginfo] target(s) in 1.59s
     Running `/config/workspace/t/target/debug/t2`
Get! V1.0=0
abc@310a0ca1fb3d:~/workspace/t/src$ cargo run --bin t3
   Compiling t v0.1.0 (/config/workspace/t)
    Finished dev [unoptimized + debuginfo] target(s) in 1.61s
     Running `/config/workspace/t/target/debug/t3`

多核心观测写入顺序

t4.rs与t5.rs简单地复现了一下上篇文章提到的多线程观察修改顺序的问题, 来区分acquire-release与acquire-release. 这个地方和理论上有一些出入. 理论上来说, 对于Aarch64架构, 硬件是不保证多核心的观察必定一致的, 但实际上运行代码会发现, t4与t5使用acquire-release和sequentially-consistent都是不能复现出z==0. 这说明其实Aarch64实际上是保证了这个一致性(至少我的Arm架构的路由器是这样).

我在StackOverflow上找到了这样一篇回答, 听上去很有道理, 大家可以参考一下
https://stackoverflow.com/questions/67397460/does-stlrb-provide-sequential-consistency-on-arm64

“ARMv8 originally allowed that on paper, but no ARM CPUs ever did.”

更深层次的, 如果不保证这种观察上的一致性, 说明CPU的缓存同步使用的是某种广播机制, 但是工业界的cpu没这么搞, 这篇回答中也提到了只有一些很少的POWER架构的CPU才可能存在这种问题.

所以, 保险起见, 在遇到需要acquire-release语义的原子变量时, 简单使用sequentially-consistent即可, 效率都一样, 反正大部分硬件不支持更灵活的控制.

浅析Rust的#[derive(Clone)]

Posted on 2022-10-13 Edited on 2022-10-16

在学习rust的模板的时候, 我遇到了一个奇怪的问题. 就是关于derive(Clone)这个派生宏的. 有点意思, 跟大家分享一下.

问题来由

我们先来看这段代码:

use std::rc::Rc;
trait MayNotClone{}

#[derive(Clone)]
struct CloneByPtr<T> where T: MayNotClone{
    item: Rc<T>
}

trait MustBeClone: Clone {}
impl <T> MustBeClone for CloneByPtr<T> where T:MayNotClone{}

代码本身很简单, 首先声明一个trait叫做MayNotClone, 它不一定满足Clone这个trait, 然后, 再声明一个名为MustBeClone的trait, 这个trait必须实现Clone. 到这里, 一切正常, 对吧?

接下来, 就开始有点奇怪了, 当我们尝试为CloneByPtr这个结构体实现MustBeClone时, 编译器就会报错:

PS C:\Users\rivers\Desktop\example> cargo build
   Compiling example v0.1.0 (C:\Users\rivers\Desktop\example)
error[E0277]: the trait bound `T: Clone` is not satisfied
 --> src\main.rs:9:10
  |
9 | impl <T> MustBeClone for CloneByPtr<T> where T:MayNotClone{}
  |          ^^^^^^^^^^^ the trait `Clone` is not implemented for `T`
  |
note: required because of the requirements on the impl of `Clone` for `CloneByPtr<T>`
 --> src\main.rs:4:10
  |
4 | #[derive(Clone)]
  |          ^^^^^
note: required by a bound in `MustBeClone`
 --> src\main.rs:8:20
  |
8 | trait MustBeClone: Clone {}
  |                    ^^^^^ required by this bound in `MustBeClone`
  = note: this error originates in the derive macro `Clone` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider further restricting this bound
  |
9 | impl <T> MustBeClone for CloneByPtr<T> where T:MayNotClone + std::clone::Clone{}
  |                                                            +++++++++++++++++++

For more information about this error, try `rustc --explain E0277`.
error: could not compile `example` due to previous error

编译器提示得很清晰, MustBeClone这个trait要求实现它的结构体必须满足trait, 但是T(也就是那个MayNotClone)不满足Copy的trait, 请考虑为T添加一个clone trait的限制.

我不是加了#[derive(Clone)]么, 怎么还不好使.

这个就有点无理取闹了啊. 这个CloneByPtr明明持有的是T的指针, 你指针能复制就好了嘛, 要指针指向的值能复制做干嘛.

我们来看一下#[derive(Clone)]吧!

https://doc.rust-lang.org/std/clone/trait.Clone.html#derivable

根据官方文档说, This trait can be used with #[derive] if all fields are Clone.
如果一个struct的全部field都是Clone的, 那这个宏可以让这个struct也变为Clone的.

那我这个没毛病啊, std::rc::Rc不是实现了Clone么. 它直接生成item.clone()不就完事了么.

那问题出在哪了呢? 我觉得应该还是在于derive(Clone)上, 我们来研究一下这个宏, 看看怎么回事.

首先, 让我们先考虑一下, 这个宏都干了什么. 如果不用它, 我们自己手动实现Clone, 应该是什么样子呢? 下面我们以一个比较简单的struct举个栗子

struct ImStruct{
    a: i32,
    b: Vec<i32>
}

如果为它实现Clone那么就应该是

impl Clone for ImStruct {
  fn clone(&self) -> Self {
    Foo {
      a: self.a.clone(),
      b: self.b.clone(),
    }
  }
}

那如果加入泛型呢? 像这样:

struct ImGenericStruct<T, U> {
  a: u32,
  b: Rc<T>,
  c: U,
}

那如果为它实现Clone就应该是:

impl<T, U> Clone for ImGenericStruct<T, U> {
  fn clone(&self) -> Self {
    Foo {
      a: self.a.clone(),
      b: self.b.clone(),
      c: self.c.clone(), // 这里8行, C类型不一定
    }
  }
}

所以如果想要为这个struct实现Clone, 就必须确保U是Clone的:

struct ImGenericStruct<T, U: Clone> {
  a: u32,
  b: Rc<T>,
  c: U,
}

到这里, 一切正常, 对吧? 那#[derive(Clone)]问题出在哪了呢? 这个简单, 我们直接看它生成了什么代码就可以了. 比如gcc,clang, 有一个-E参数, 可以显示模板,宏展开后的代码. rust也有这个功能, 不过需要使用nightly版本的才可以启用此特性.

1
2
3

rustup toolchain install nightly 
cargo install cargo-expand 
rustup default nightly #这里将rust暂时切换为nightly的版本, 记得之后改回去

cargo-expand

我们以上面 struct ImGenericStruct<T, U> 为例

#[derive(Clone)]
struct ImGenericStruct<T, U> {
  a: u32,
  b: Rc<T>,
  c: U,
}

其展开代码为:

struct ImGenericStruct<T, U> {
    a: u32,
    b: Rc<T>,
    c: U,
}
#[automatically_derived]
impl<T: ::core::clone::Clone, U: ::core::clone::Clone> ::core::clone::Clone
for ImGenericStruct<T, U> {
    #[inline]
    fn clone(&self) -> ImGenericStruct<T, U> {
        ImGenericStruct {
            a: ::core::clone::Clone::clone(&self.a),
            b: ::core::clone::Clone::clone(&self.b),
            c: ::core::clone::Clone::clone(&self.c),
        }
    }
}

由此可见, derive(Clone)画蛇添足地为T添加了Clone限制. emm, 这是rust的宏自身的限制, 它能做到读取代码的token流用来自动生成一些其他代码, 但是应该还不具备与编译器交互的能力, 也就没办法在宏展开时做出完善的类型检查, 所以只能简单粗暴地为里面出现的每一个模板参数都直接加上Clone的限制… 也许有一些其他的trait能避开这个问题, 不过直接手动实现一下, 也是一个可以考虑的方案.

Atomic Variable & Memory Order

Posted on 2022-10-04 Edited on 2022-11-05

这一篇文章, 我们浅谈一下原子变量以及内存序模型. 有什么问题欢迎留言指出, 我们可以一起讨论.

Atomic Variable

首先我们先来聊一下什么是”原子性”. 假如有这样一段代码

int v;
void add(){
    v = v + 2;
}

在执行add函数时, CPU需要先从内存中取出v的值, 然后计算+2, 最后写回到内存中. 如果只有一个执行序, 那当然无所谓. 但如果有并发呢? 那就比较麻烦了. 我们讨论最简单的情况, 即双线程并发执行一次这个函数.

这就可能出现多种情况. 比如, 其中一个线程先执行成功, 然后另一个线程执行. 那么当二者都执行成功后. v的值变为4.

或者, 它们一起开始执行. 线程1先读取到v的值, 开始计算+2, 但还没来得及写回; 而线程2已经开始运行, 开始读取v值, 为0, 开始计算+2. 最终, 无论这两个执行绪谁先执行完成, 都会只是向内存写入2. 而不是预期中的4. 就像下面这样

 T0             T1
┌──────────────┬───────────────┐
│  load v (=0) │               │
├──────────────┼───────────────┤
│              │  load v (=0)  │
├──────────────┼───────────────┤
│  add 2  (=2) │  add 2 (=2)   │
├──────────────┼───────────────┤
│              │  store v (=2) │
├──────────────┼───────────────┤
│ store v (=2) │               │
└──────────────┴───────────────┘

从这段代码来说, 这个+2的操作, 显然就不是原子性的. 如果是”原子性”的操作, 那就该这三个步骤合并为一个步骤, 一下子完成. 像这样

 T0             T1
┌──────────────┬───────────────┐
│     v+=2     │               │
├──────────────┼───────────────┤
│              │               │   // v == 2
├──────────────┼───────────────┤
│              │     v+=2      │   
└──────────────┴───────────────┘
                                   // v == 4

或者, 我们可以直接简单地加一个互斥锁, 让f不要并行执行, 也可以.

但是, 比较反直觉的是, 即使我们不讨论对变量修改并写回的操作, 只考虑单纯地读取一个变量或者加一个变量(Load/Store)也不一定是原子性的. 这个就涉及到一些硬件上的原理, 比如, 某种平台, 每次只能读取内存中4个字节, 那如果在这种平台上, 读取一个int64即8个字节的值, CPU就不得不读取两次, 先读取一半, 再读取另一半. 或者在普通我们常见的平台, 也可能出现这种问题, 比如变量没有对齐, 那么CPU就有可能也不得不将一次读取拆分为多次.

另外多说一句, 如果使用gcc/clang-x64并打开优化, 可以观察到add函数被编译为
    add():                                # @add()
    add     dword ptr [rip + v], 1
    ret
看起来像是一条单个的add汇编指令, 似乎是满足了上文提到的"原子性"加法, 将三个操作合并为一个, 但这并不意味着这个加法操作就是原子性的, 只是CPU内部将这一条指令分成多个步骤执行了而已. 对此感兴趣的朋友可以搜索一下关键词"CISC 微指令"

在类似于amd64这种强内存序(Strong memory order)下, 如果变量已经对齐, 那其实不需要纠结什么, 对于整数的读取和写入本身就已经是原子的. 现代的编译器在生成变量时会注意内存对齐的问题. 所以毫不夸张的说, 如果说我们只讨论单一变量的原子性读和写, 并且只讨论amd64或x86架构的话, 那么这个话题就结束了. 没有什么需要注意的, 它们天然就是原子性的. 就当是普通地读取变量即可.

那如果类似于aarch64这类的弱内存(Weak memory order)呢? 那就需要稍微注意一下了, 在这种架构下, 如果是没有对齐的存取, 就很有可能喜提Alignment fault.

比如MIPS和低版本的ARM会不支持对齐访问. 高版本的ARM比如aarch64倒是支持, 不过也很有限(栈和指令依旧必须对齐), 并且性能会有损失

那么好, 我记住了, 以后存取变量注意内存对齐, 结束, 下课~

别急, 还没讲完 :P

从我个人的观点来看, 所谓”原子变量”是一个很有迷惑性的术语, 它会让人误以为所谓”原子性”仅仅是指对于单个变量的原子性操作. 实际上, 当我们谈论原子变量时, 我们更多地是在讨论基于原子变量的缓存控制.
也就是并发编程的三要素–原子, 有序, 可见中的后面两位, 有序性&&可见性.

内存序模型

可见性

我们首先说可见性. 可见性是指, 当一个执行绪修改了内存上的一个变量, 其他执行绪应该能”看见”这次修改. 在现代CPU架构下, 这点其实并不是一定的, 比如Java下的volatile关键字, 就是要确保其他线程能看到这次变量的修改.

 不过, 在C/C++下, *volatile*有着其他的意义, 我的建议是, 在写C/++时, 如果你不是在处理硬件相关的东西, 就不要用.

那如何保证可见性呢? 这个我们和下面的有序性一起讲.

有序性

为了提高执行性能, 现代的编译器,以及CPU都会对执行指令进行重排. 不过重排还是有一个前提, 不影响单执行绪下的执行结果,我们可以看一下这一段代码:

int a;
int b;
void f(){
    a = a + 1 ;
    b = b + 2 ;
}

使用armv8-a clang 10.0.0开启O1优化, 可以得到下面的汇编

f():                                  // @f()
        adrp    x8, a
        adrp    x9, b
        ldr     w10, [x8, :lo12:a]
        ldr     w11, [x9, :lo12:b]
        add     w10, w10, #1            // =1
        add     w11, w11, #2            // =2
        str     w10, [x8, :lo12:a]
        str     w11, [x9, :lo12:b]
        ret
a:
        .word   0                       // 0x0

b:
        .word   0                       // 0x0

可以看到, 编译器生成的汇编指令先读取两个变量的值, 然后相加, 最后写回两个变量.
如果不优化, 那么生成的汇编就会是

f():                                  // @f()
        adrp    x8, a
        add     x8, x8, :lo12:a
        adrp    x9, b
        add     x9, x9, :lo12:b
        ldr     w10, [x8]
        add     w10, w10, #1            // =1
        str     w10, [x8]
        ldr     w10, [x9]
        add     w10, w10, #2            // =2
        str     w10, [x9]
        ret
a:
        .word   0                       // 0x0

b:
        .word   0                       // 0x0

这下, 编译器生成的汇编和我们写的c代码基本就是逐行对应的了.

不仅如此, 除了编译器, CPU内部在硬件层面上, 也会对指令进行重新排序, 所以, 如果以为加了编译屏障, 就万事大吉, 显然是不可能的.

编译器屏障, 通常见到的都是这样的:
__asm__ __volatile__("": : :"memory")
这个东西, 还是那句话, 除非你真的知道自己在做什么再用, 如果希望使用它来禁止指令重排, 保证有序性, 那它大概率是做不到的.

相比于编译器造成的指令重排, 硬件造成的重排, 要更为棘手, 这不仅涉及到指令重排这么简单, 还会影响CPU多层缓存的更新.

比如, 因为有缓存的存在, CPU0修改了内存上的某个值, 但是因为配置的缓存策略是”Write Back”, 此次修改还没来得及写回到内存上, 只是CPU0独享的缓存中, 此时其他CPU核心就读不到最新的修改.

那是不是等这次写回完成后就可以了呢? 无非就是等一等而已. 但是问题要更复杂一些, 上文说到, 不仅是编译器, CPU本身还会对指令进行重排以提高性能. 假如有这样一段代码:

int v = 0;
int flag = 0;
void set_v(){
    v = 42;
    flag = 1;
}
int get_v(){
    while(flag == 0){
        continue;
    }
    return v;
}

在函数set_v中, 首先对v赋值, 然后对flag置位. 然后在get_v中, 会轮询标志位, 当标志位非0时, 代表变量v准备好, 可以被读取, 读取v的值后返回.

但是这段代码, 在一些CPU平台上, 大概率是不能运行的. 两个函数都可能存在问题.

首先说set_v, 因为CPU乱序执行以及缓存的缘故, 可能出现下面的情况

flag = 1有可能先于v=42执行, 被写入到内存.
v=42确实在flag=1之前执行了, 但只是写入到缓存中. 而随后的缓存写回时, flag所在的缓存先于v=42被写回. 导致其他线程先观察到了flag置位, 然后立即去读取v的值, 但是内存上v还没有被更新, 就会导致读出的值是错误的.

然后再说get_v, 在被乱序执行后, 可能它会有自己的关于v的缓存, 还未更新, 或者说, 更新的晚于flag所在的缓存, 这也可能导致读出错误.

当然, 你可能会提到缓存一致性协议, 会保证执行结果的. 但是, 缓存一致性协议是需要一些hint才会工作的, 不然所有核心全用同一个缓存算了. 具体是什么hint, 我们下面马上就讲

set_v可能出现的两种状况中, 如果我们作为一个观察者观察内存时, 看到的结果其实是一样的, 都是修改了flag, 后修改了v.

或者说, 我们如何定义store操作的完成呢? 是写到缓存就算执行完成呢? 还是写到内存才算一条store指令执行完成?

如果是后者的话, 那么其实那两种情况, 指的就是同一种, 即store flag的指令错误地先于store v的指令执行了. 那我们可以简化一下, 假设我们不知道缓存的存在, 就只考虑存取指令的乱序执行. 在这种情况下, 我们就只考虑指令之间的排列顺序即可. 假如一个指令完成, 那就意味着store已经将值写到内存.

只是简化理解噢, 实际上L2及以下的缓存是核心共享的, 那么就意味着这种store, load只要读取到缓存上的最新值即可. 但是对于总线上的其他设备, 可能又不行. 所以大家不必纠结这个, 暂时忘掉缓存即可:P

在这种简化模型下, 如果要保证存取指令之间执行的有序性, 我们可以简单地插入一些屏障指令, 禁止一些CPU的重排行为, 这样, 强制存取指令按照约定的顺序执行, 就不会再有莫名其妙的重排问题了, 这也就是下面讲的东西, 内存屏障.

内存屏障

上文提到了, 有编译器屏障, 那么自然, 也就存在CPU的指令屏障.

各个平台都会提供一些特别的汇编指令, 我们可以称之为”完全内存屏障”, 硬件保证: 这条指令之前的指令一定会在这条指令执行前执行完成, 这条指令之后的指令一定在这条指令完成之后开始执行.

x86下是LFENCE,SFENCE,MFENCE,或者什么带有lock前缀的指令, 或者什么隐含锁语义的指令(比如exchg)
而aarch64下, 则是DMB, DSB, ISB指令

但是, 这种屏障指令, 有的时候可能粒度过大.

比如, 对于我们常见的内存变量存取, 我们只需要保证相应的load, store指令之间不要重排即可. 没必要让所有指令都跟着一起停下来. 因为这样需要让整个CPU流水线都会被排空, 同时缓存也要完全同步一次才能实现. 相对来说, 性能上还有进一步提升的空间.

所以, 因此就出现了更加细致的内存屏障指令. 在不少平台上, 都会存在类似这种细粒度的屏障指令, 只禁止某些特定的指令重排, 这样可以对缓存更新行为做更细粒度的控制, 也就相对可以获得更高的性能.

那么, 考虑最简单的情况, 即两个存取指令, 可以有4种情况. 那么根据这四种情况, 就需要四种内存屏障.

┌──────────────┬───────────────┐
│  load, load  │  load, store  │
├──────────────┼───────────────┤
│ store, load  │  store, store │
└──────────────┴───────────────┘

我们接下来, 用一些例子, 来尝试解释一些这些屏障的行为

load-load

load-load 屏障可以保证, 两个读取行为间不会被乱序. 但是要注意, 这其实并不保证读取到的就一定是最新的值. 准确来说, 它只保证第一次读取到的值, 比第二次读取到的更老.

这个屏障就可以用来实现上面的get_v函数.像这样:
int get_v(){
    while(flag == 0){
        continue;
    }
    *LOAD_LOAD_FENCE();*
    return v;
}
只要读取到flag被置位, 再加上一个load-load屏障, 就可以确保接下来读取v的动作发生在读取flag之后.保证了读取行为的正确性

Store-store

Store-Store 屏障与load-load类似, 它保证两个写入行为间不会乱序. 并且和load-load一样, 它不保证屏障之前的值一定是完成了的, 只是保证, 在内存视角, 前面的修改一定是先于后面的修改.

这个屏障可以用来实现上面set_v函数
void set_v(){
    v = 42;
    *STORE_STORE_FENCE()*
    flag = 1;
}

Load-store

指令重排/乱序执行有一个重要前提, 那就是不会更改单线程下的执行结果. 所以, Load-Store的重排听起来可能有点奇怪? 假如, 必须先读取到一个特定的标志位, 才能开始写入. 那这种重排不就会影响单线程下的执行结果了.

比如

int l = 0;
int v = 1;

void f(){
    if(l == 0){
        v = 2;
    }
}

像是这种, 即使乱序执行, 也肯定不会影响语义.

所以, 这种重排只会发生在一些特殊情况. 当前面的load指令读取恰好遇到了cache miss并且load,store的变量之间完全没有联系的情况下. 还是可能发生这种重排的.

比如有下面f,g两个函数分别有core1和core2执行

int x = 0;
int y = 0;
void f(){ // execute by core 1
    int t0 = x;
    y = 1;
}
void g(){ // execute by core 2
    int t1 = y;
    x = 1;
}

如果f先执行, 那么可以得到

x	y	t0	t1
1	1	0	1

如果g先执行, 那么可以得到

x	y	t0	t1
1	1	1	0

或者二者交替执行, 即两个读取x,y的先执行, 对x,y后执行的可以得到

x	y	t0	t1
1	1	0	0

但是, 如果发生了load-store重排, 最终也可能获得x == y == t0 == t1 == 1的结果

Store-load

Store-load相对来说是最”重”的屏障, 它要确保load在Store执行完成之后才能读取. 还记得我们之前定义的Store操作吧? Store指令的完成, 意味着修改写入到内存, 让其他观察者可见. 并且, 读取到的值, 一定是最新可见的值.

以上, 就是各种单向内存屏障的定义. 但实际上, 我们实际在编程时极少直接使用这类屏障指令. 一方面, 不同平台的屏障指令特点都不尽相同, 不能完全和这些标准的屏障对应; 另一方面, 直接使用屏障对于使用高级语言编程来说, 可能过于底层了, 从这个角度来说, 我们应该告诉编译器我们想要达成什么效果, 而不是直接告诉编译器使用什么指令. 那这就引出我们今天最后的内容了—-内存序.

比如, x86, amd64, 安腾架构是所谓的"强内存序(Strong memory order)", 平台会保证在一些情况下不会乱序执行. 以amd64为例, 它只需要store-load屏障, 其余情况不会发生乱序执行.
或者类似于Aarch64, 它不提供任何保证, 所有执行情况必须由显式的屏障指令才能保证其不会重排, 也就是所谓的"弱内存序(Weak memory order)"

内存序 (memory order)

C++11定义了6种内存序(memory order), 用来控制对原子变量的读写. 当访问原子变量时, 需要给出一个内存序参数, 通过这个参数, 就可以控制以这个原子变量为分界线的内存乱序执行行为. 编译器会根据对应的内存序, 生成不同的屏障指令.

relaxed
acquire
release
acquire-release
consume
sequentially-consistent

我们接下来一个一个讲

relaxed

Relaxed 是最轻的内存序限制, 它只保证对于指定的原子变量的操作是原子的, 而对于在访问原子变量周围的同步或者内存重排行为, 则一概不管. 所以显而易见, 这也是原子变量中性能最好的内存序. 不过嘛, 似乎只能用来做计数器? 如果涉及到多个变量的同步都可能发生问题.

acquire

Acquire 要求, 在读取到此原子变量后面的所有读取/写入指令, 不可以被重排到读取此原子变量的前面. 这个就非常适合用于我们上面的get_v

相当于load-store+load-load两个屏障加在一起

release

与acquire相反, release要求, 在写入此变量之前的所有读取/写入指令, 不可以重排到写入此原子变量的后面. 适合我们刚才的set_v.

相当于store-store+load-store两个屏障相加

acquire-release

这个就是Acquire和release的相加在一起, 相当于一次完全内存屏障. 这种内存序需要一次read-modify-write操作, 既有acquire又有release. 用在两个线程交换什么东西的时候.

这个所谓的read-modify-write操作要和上面的计数器做一些区分. 计数器加1当然也是read-modify-wirte,但是只是相对于原子变量自身而言. 比如, 两个线程需要交换指针, 这种情况, 就需要acquire-release语义了.

sequentially-consistent

序列化一致性和acquire-release区别不大. 比如, 在单生产者单消费者的情况下, 二者没有区别. 区别在于多线程的情况下, 序列化一致性保证, 多个线程观察到修改时, 顺序是一致的.

比如有4个线程, 分别执行a, b, c, d四个函数

int t1 = 0, t2 = 0;
int z = 0;
void a(){
    t1 = 1;
}
void b(){
    t2 = 1;
}
void c(){
    while(t1 == 0)continue;
    if(t2 == 1)z+=1;
}
void d(){
    while(t2 == 0)continue;
    if(t1 == 1)z+=1;
}

c函数执行时, 如果t1先于t2赋值, 是有可能不执行z+=1的. 同理, 对于d函数来说, 如果t2先于t1赋值, 也是有可能不执行z+=1的. 那么按照直觉来说, d或者c应该至少有一个将z+=1, 对吧? 其实不一定. 如果不使用序列化一致性模型, C++标准不保证两个线程观察到的赋值顺序是相同的.

多说一句, 如果用的是amd64, 硬件保证所有线程观察到的顺序是一定的, 在这个平台上, 是看不到z==0的, 硬件上没有这种灵活性...

consume

那这种内存序就比较高端了. 它和acquire-release类似, 但是粒度更细一点. 假如一个线程通过这个内存序load的一个值, 那么**与这个值相关(或者叫数据依赖 carries dependency)的load和store, 不能重排到前面; 如果通过这个内存序Store一个值, 那么与这个值相关(或者叫数据依赖 carries dependency)**的load和store, 不能重排到后面.

但是实际上, 好像还没有硬件平台支持这么细粒度的屏障. 所以这个内存序只会影响一些编译器优化的行为(比如编译器可能会禁止预加载依赖链上的数据).

END

这篇关于原子变量的文章到这里就结束了, emm 也没有完全结束, 还都是理论介绍, 没有实际的代码例子, 之后我会再写一篇实操的(咕咕咕咕).

但其实我最想说的是, 大部分情况, 除了计数器这种非常简单的需求, 没事不要折腾什么无锁编程, 性能不一定快多少, 但可以保证的是bug会很难调. 很多时候程序的瓶颈根本就不差这几个锁, 现在的Linux的futex什么的, 用锁其实也没慢多少的… 同理, 你确定一定要用原子变量的时候, 就直接用顺序一致性就好, 也没差多少, 尤其是在x86这种平台, 很多细粒度的屏障都是不支持的, 没有什么搞事的余地:(

MIT 6.824 Lab1 要求

Posted on 2021-10-11 Edited on 2022-10-04 In MIT 6.824

这一篇文章主要是对实验1要求的翻译以及顺便需要记录的碎碎念.

Introduction

In this lab you’ll build a MapReduce system. You’ll implement a worker process that calls application Map and Reduce functions and handles reading and writing files, and a coordinator process that hands out tasks to workers and copes with failed workers. You’ll be building something similar to the MapReduce paper. (Note: the lab uses “coordinator” instead of the paper’s “master”.)

在这个实验中,你需要实现一个MapReduce系统.你需要实现一个调用Map和Reduce的worker process. 以及一个负责将任务分发到worker process并处理work process错误的coordinator process.

Getting started

You’ll fetch the initial lab software with git (a version control system). To learn more about git, look at the Pro Git book or the git user’s manual. To fetch the 6.824 lab software:

通过git拉取代码,如下:

$ git clone git://g.csail.mit.edu/6.824-golabs-2021 6.824
$ cd 6.824
$ ls
Makefile src
$

We supply you with a simple sequential mapreduce implementation in src/main/mrsequential.go. It runs the maps and reduces one at a time, in a single process. We also provide you with a couple of MapReduce applications: word-count in mrapps/wc.go, and a text indexer in mrapps/indexer.go. You can run word count sequentially as follows:

在src/main/mrsequential.go中,以及实现了一个简单的顺序的MapReduce实现(即非并发的版本).它在一个进程中,执行map和reduce.

另外,还提供了一些MapReduce的应用程序,比如mrapps/wc.go的word count,以及mrapps/indexer.go的文本索引器.

以下的命令可以以顺序版本执行word count:

$ cd ~/6.824
$ cd src/main
$ go build -race -buildmode=plugin ../mrapps/wc.go
$ rm mr-out*
$ go run -race mrsequential.go wc.so pg*.txt
$ more mr-out-0
A 509
ABOUT 2
ACT 8
...

注意: 如果编译的时候没有-race参数,那么运行的时候也不可加-race参数

-race参数是golang内置的竞态检测,一般来说,内存使用增加5-10倍,运行速度减慢2-20倍.
具体参考 https://golang.org/doc/articles/race_detector

Your Job

Your job is to implement a distributed MapReduce, consisting of two programs, the coordinator and the worker. There will be just one coordinator process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you’ll run them all on a single machine. The workers will talk to the coordinator via RPC. Each worker process will ask the coordinator for a task, read the task’s input from one or more files, execute the task, and write the task’s output to one or more files. The coordinator should notice if a worker hasn’t completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.

你的工作是实现一个分布式的MapReduce.它主要由Worker和Coordinator组成. 一个Coordinator与若干个Worker并行执行. 在真实系统中,这些进程会在不同的机器上运行,但是对于这个实验来说,Coordinator与Worker会在同一个机器上执行. 进程间通过RPC通信.Workder会向Coordinator请求发布任务,从一个或若干个文件中读取输入,执行任务,并将输出写另一些文件中.Coordinator要注意到每个Workder是否在合理的时间内完成任务(在这个实验中,这个时间是10s),如果没有,将这个任务交给另一个Worker

We have given you a little code to start you off. The “main” routines for the coordinator and worker are in main/mrcoordinator.go and main/mrworker.go; don’t change these files. You should put your implementation in mr/coordinator.go, mr/worker.go, and mr/rpc.go.

我们已经给了你一小部分代码.Worker和Coordinator的main routine位于main/mrcoordinator.go和main/mrworker.go中.不要更改这些文件。您应该将实现放在mr/coordinator.go、mr/worker.go和mr/rpc.go中。

Here’s how to run your code on the word-count MapReduce application. First, make sure the word-count plugin is freshly built:

以下,介绍如何使用你的代码运行word count.首先,确保word-count是最新构建出的:

1	$ go build -race -buildmode=plugin ../mrapps/wc.go

In the main directory, run the coordinator.
在main目录中,运行Coordinator程序

1 2	$ rm mr-out* $ go run -race mrcoordinator.go pg-*.txt

The pg-*.txt arguments to mrcoordinator.go are the input files; each file corresponds to one “split”, and is the input to one Map task. The -race flags runs go with its race detector.

传入”pg-*.txt”参数到mrcoordinator.go是输入文件,每个文件都是一个”切片”(即输入文件已经被拆分为若干个小文件了),这对应一个Map task. -race 参数告诉编译器启用竞态检测.

In one or more other windows, run some workers:

在其他窗口上,执行worker进程:

1	$ go run -race mrworker.go wc.so

When the workers and coordinator have finished, look at the output in mr-out-*. When you’ve completed the lab, the sorted union of the output files should match the sequential output, like this:

当执行成功后,查看mr-out-*中的输出,这些输出结果应该与提供的顺序版本的MapReduce的输出一致.像这样:

$ cat mr-out-* | sort | more
A 509
ABOUT 2
ACT 8
...

We supply you with a test script in main/test-mr.sh. The tests check that the wc and indexer MapReduce applications produce the correct output when given the pg-xxx.txt files as input. The tests also check that your implementation runs the Map and Reduce tasks in parallel, and that your implementation recovers from workers that crash while running tasks.

我们提供了一个测试脚本,位于main/test-mr.sh.它会检查wc和indexer能否在以pg-xxx.txt文件作为输入时,产生正确的输出. 也会检查你的实现是否是并行的,以及能否在某些worker崩溃后恢复.

If you run the test script now, it will hang because the coordinator never finishes:

如果现在运行测试脚本，它将挂起，因为还没有完成Coordinator.

You can change ret := false to true in the Done function in mr/coordinator.go so that the coordinator exits immediately. Then:

你可以将mr/coordinator.go中的Done函数中的reg:=false改为reg:=true.这样Coordinator会立即退出.这样脚本就不会卡住了.

$ bash test-mr.sh
*** Starting wc test.
sort: No such file or directory
cmp: EOF on mr-wc-all
--- wc output is not the same as mr-correct-wc.txt
--- wc test: FAIL
$

The test script expects to see output in files named mr-out-X, one for each reduce task. The empty implementations of mr/coordinator.go and mr/worker.go don’t produce those files (or do much of anything else), so the test fails.

测试脚本期望的输出文件为mr-out-X,每个reduce任务对应一个文件.mr/coordinator.go与mr/worker.go的空实现不会生成这些输出,所以上面的测试会失败.

When you’ve finished, the test script output should look like this:

当你完成这个实验后,运行测试脚本,会得到类似于下面的输出:

$ bash test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
--- reduce parallelism test: PASS
*** Starting crash test.
--- crash test: PASS
*** PASSED ALL TESTS
$

You’ll also see some errors from the Go RPC package that look like

你会在Go PRC包中看到类似于下面的报错:

1	2019/12/16 13:27:09 rpc.Register: method "Done" has 1 input parameters; needs exactly three

Ignore these messages; registering the coordinator as an RPC server checks if all its methods are suitable for RPCs (have 3 inputs); we know that Done is not called via RPC.

忽略这些报错即可.将Coordinator注册为RPC server时检查Coordinator中的所有导出方法是否适合RPC的调用,但是我们知道Done并不是通过RPC调用的,所以不必理会.

A few rules

The map phase should divide the intermediate keys into buckets for nReduce reduce tasks, where nReduce is the argument that main/mrcoordinator.go passes to MakeCoordinator().
The worker implementation should put the output of the X’th reduce task in the file mr-out-X.
worker实现应该将第X个reduce任务的输出放在mr-out-X文件中
A mr-out-X file should contain one line per Reduce function output. The line should be generated with the Go “%v %v” format, called with the key and value. Have a look in main/mrsequential.go for the line commented “this is the correct format”. The test script will fail if your implementation deviates too much from this format.
You can modify mr/worker.go, mr/coordinator.go, and mr/rpc.go. You can temporarily modify other files for testing, but make sure your code works with the original versions; we’ll test with the original versions.
你可以修改mr/worker.go、mr/coordinator.go和mr/rpc.go。你可以临时修改其他文件进行测试，但请确保代码与原始版本兼容；我们将使用原始版本进行测试。
The worker should put intermediate Map output in files in the current directory, where your worker can later read them as input to Reduce tasks.
Worker应该将中间的Map输出放在当前目录中,这样稍后的Workder可以将其作为Reduce任务的输入
main/mrcoordinator.go expects mr/coordinator.go to implement a Done() method that returns true when the MapReduce job is completely finished; at that point,mrcoordinator.go will exit.
main/mrcoordinator.go期望mr/coordinator.go实现一个Done方法.当MapReduce任务完成时,此函数返回true. 整个coordinator.go程序退出.
When the job is completely finished, the worker processes should exit. A simple way to implement this is to use the return value from call(): if the worker fails to contact the coordinator, it can assume that the coordinator has exited because the job is done, and so the worker can terminate too. Depending on your design, you might also find it helpful to have a “please exit” pseudo-task that the coordinator can give to workers.
当任务全部完成时,Worker进程应该退出.实现这一点可以使用call()的返回值: 如果Workder未能联系到Coordinator,可以假设Coordinator已经退出,因为任务已经完成,那么Worker也可以退出.这取决与你的设计,你可能会发现,如果Coordinator可以向Worker发送一个”退出”的伪任务,会很有用.

Hints

One way to get started is to modify mr/worker.go’s Worker() to send an RPC to the coordinator asking for a task. Then modify the coordinator to respond with the file name of an as-yet-unstarted map task. Then modify the worker to read that file and call the application Map function, as in mrsequential.go.
你可以从这个方面入手: 修改mr/worker.go的worker(),通过RPC向Coordinator请求任务,然后修改Coordiantor,向Worker返回尚未启动的map任务对应的文件名.再然后,修改Worker,读取此文件,调用map程序.
The application Map and Reduce functions are loaded at run-time using the Go plugin package, from files whose names end in .so.
Map和Reduce应用程序在运行的时候从.so文件中通过Go plugin package动态加载
If you change anything in the mr/ directory, you will probably have to re-build any MapReduce plugins you use, with something like go build -race -buildmode=plugin ../mrapps/wc.go
如果你修改了mr/目录下的内容,可能需要重新构建MapReduce插件,使用类似于go build-race-buildmode=plugin../mrapps/wc.go的命令
This lab relies on the workers sharing a file system. That’s straightforward when all workers run on the same machine, but would require a global filesystem like GFS if the workers ran on different machines.
这个实验需要worker程序们共享同一个文件系统.如果worker在同一台机器上,实现这一点很简单.如果在不同的机器上,就需要类似于GFS这种分布式文件系统了.
A reasonable naming convention for intermediate files is mr-X-Y, where X is the Map task number, and Y is the reduce task number.
一个比较合理的中间文件命名规则是这样的: mr-X-Y. 其中X是Map任务号,Y是Reduce任务号.
The worker’s map task code will need a way to store intermediate key/value pairs in files in a way that can be correctly read back during reduce tasks. One possibility is to use Go’s encoding/json package. To write key/value pairs to a JSON file:
Worker的map任务需要一种在文件中存取中间键\值对的方法,你可以使用Go的encoding/json包
The map part of your worker can use the ihash(key) function (in worker.go) to pick the reduce task for a given key.
worker的map部分可以使用ihash(key)函数(在worker.go中)为给定的键选择reduce任务
You can steal some code from mrsequential.go for reading Map input files, for sorting intermedate key/value pairs between the Map and Reduce, and for storing Reduce output in files.
你可以借鉴一些mrsequential.go中代码来实现读取map输入,对map,reduce中间的键值对排序,以及将reduce的输入写入到文件(读书人的事情,怎么能叫偷呢)
The coordinator, as an RPC server, will be concurrent; don’t forget to lock shared data.
协调器作为RPC服务器会并发执行,别忘了加锁
Use Go’s race detector, with go build -race and go run -race. test-mr.sh by default runs the tests with the race detector.
在go build以及go run时加入-race来启用go的竞态检测,提前发现问题.不然test-mr.sh也会检测竞态问题的.
Workers will sometimes need to wait, e.g. reduces can’t start until the last map has finished. One possibility is for workers to periodically ask the coordinator for work, sleeping with time.Sleep() between each request. Another possibility is for the relevant RPC handler in the coordinator to have a loop that waits, either with time.Sleep() or sync.Cond. Go runs the handler for each RPC in its own thread, so the fact that one handler is waiting won’t prevent the coordinator from processing other RPCs.
Worker有时需要等待.比如,reduce任务需要所有map任务执行完成之后才能开始. 可以在每个请求间使用time.sleep(这个Sleep放在Worker或者Coordinator中都行),或者通过条件变量(sync.Cond)进行同步.Go的RPC调用是并发的,所以其中一个调用阻塞时,不会影响处理其他RPC请求.
The coordinator can’t reliably distinguish between crashed workers, workers that are alive but have stalled for some reason, and workers that are executing but too slowly to be useful. The best you can do is have the coordinator wait for some amount of time, and then give up and re-issue the task to a different worker. For this lab, have the coordinator wait for ten seconds; after that the coordinator should assume the worker has died (of course, it might not have).
Coordinator无法可靠地区分以下几种Worker
1. 已经崩溃的
2. 仍然运行但是因为某些原因停止运行
3. 执行速度太慢而无法使用
所以,最好让Coordinator等待一段时间,如果仍然不能响应,将这个任务重新分配给其他的Worker.在这个实验中,这个时间是10s,在此之后,Coordinator可以认为对应的Worker已经死亡
If you choose to implement Backup Tasks (Section 3.6), note that we test that your code doesn’t schedule extraneous tasks when workers execute tasks without crashing. Backup tasks should only be scheduled after some relatively long period of time (e.g., 10s).
如果你选择实现备份任务(3.6节),要注意,当worker执行备份任务并且没有崩溃时,我们会测试你的代码没有调度其他无关的任务.备份任务应该只在一段比较长的时间后被调度(这段还没看懂,先留个坑,回头再更详细地补充)
To test crash recovery, you can use the mrapps/crash.go application plugin. It randomly exits in the Map and Reduce functions.
为了测试崩溃恢复,你可以使用mrapps/crash.go崩溃会在map和reduce任务时随机发生
To ensure that nobody observes partially written files in the presence of crashes, the MapReduce paper mentions the trick of using a temporary file and atomically renaming it once it is completely written. You can use ioutil.TempFile to create a temporary file and os.Rename to atomically rename it.
为了确保在程序崩溃时不会出现部分写入的文件,MapReduce论文中提到了使用临时文件并在完全写入后自动重命名的技巧.你可以使用ioutil.TempFile创建一个临时文件,并使用os.Rename对其进行原子地重命名
test-mr.sh runs all the processes in the sub-directory mr-tmp, so if something goes wrong and you want to look at intermediate or output files, look there. You can modify test-mr.sh to exit after the failing test, so the script does not continue testing (and overwrite the output files).
test-mr.sh运行子目录mr-tmp中的所有进程,因此如果出现问题,您希望查看中间文件或输出文件,请查看那里.你可以修改test-mr.sh以在测试失败后退出,这样脚本就不会继续测试
test-mr-many.sh provides a bare-bones script for running test-mr.sh with a timeout (which is how we’ll test your code). It takes as an argument the number of times to run the tests. You should not run several test-mr.sh instances in parallel because the coordinator will reuse the same socket, causing conflicts.
test-mr-many.sh提供了一个用于运行带有超时的test-mr.sh的基本脚本(这就是我们测试代码的方式).它将运行测试的次数作为参数.因为你不应该并行运行多个test-mr.sh实例,因为协调器将重用同一套接字,从而导致冲突.