Updated documentation

6 years ago · 4fc4b840f5
parent d6ca408ce2
commit 4fc4b840f5
4 changed files with 341 additions and 343 deletions
--- a/README.md
+++ b/README.md
@@ -1,365 +1,46 @@
-
 # RandomX
-RandomX ("random ex") is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.
+RandomX is an experimental proof of work (PoW) algorithm that uses random code execution to achieve ASIC resistance.

-RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
+### Key features

-*Software implementation details and design notes are written in italics.*
+* CPU-friendly (especially for x86 and ARM architectures)
+* ASIC-resistant, FPGA-resistant, GPU-resistant
+* Memory-hard (requires >4 GiB of memory)
+* Web-mining resistant due to high memory requirement

 ## Virtual machine
-RandomX is intended to be run efficiently and easily on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
-
-![Imgur](https://i.imgur.com/Xx5QVOV.png)
-
-#### DRAM
-The VM has access to 4 GiB of external memory in read-only mode. The DRAM memory blob is generated from the hash of the previous block using AES encryption (TBD). The contents of the DRAM blob change on average every 2 minutes. The DRAM blob is read with a maximum rate of 2.5 GiB/s per thread.
-
-*CPUs without hardware AES support can use a GPU to generate the DRAM blob quickly. Dual channel DDR4 memory has enough bandwidth to support up to 16 mining threads.*
-
-#### MMU
-The memory management unit (MMU) interfaces the CPU with the DRAM blob. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from DRAM. The DRAM blob is read mostly sequentially. After an average of 8192 sequential reads, a random read is performed. An average program reads a total of 4 MiB of DRAM and has 64 random reads.
-
-The MMU uses two internal registers:
-* **ma** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
-* **mx** - A 32-bit counter that determines if the next read is sequential or random. After each read, the read address is mixed with the counter: `mx ^= addr`. If the right-most 13 bits of the register are zero:  `(mx & 0x1FFF) == 0`, the value of the `mx` register is copied into register `ma`.
-
-*When the value of the `ma` register is changed to a random address, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
-
-#### Scratchpad
-The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
-
-*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance, which should limit the impact of L1 cache misses.*

-#### Program
-The actual program is stored in a 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
+RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:

-*For high-performance mining, the program should be translated directly into machine code. The whole program should fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.*
+![Imgur](https://i.imgur.com/ZAfbX9m.png)

-#### Control unit
-The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:
-* **pc** - Address of the next instruction in the program buffer to be executed (64-bit, 8 byte aligned).
-* **sp** - Address of the top of the stack (64-bit, 8 byte aligned).
-* **ic** - Instruction counter contains the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when `ic` reaches `0`.
+Full description: [vm.md](doc/vm.md).

-*Fixed number of executed instructions should ensure roughly equal runtime of each random program.*
+## Dataset

-#### Stack
-To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
+RandomX uses a 4 GiB read-only dataset. The dataset is constructed using a combination of the [Argon2d](https://en.wikipedia.org/wiki/Argon2) hashing function, [AES](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) encryption/decryption and a random permutation. The dataset is regenerated every ~34 hours.

-*Although there is no explicit limit of the stack size, the maximum theoretical size of the stack is 16 MiB for a program that contains only unconditional CALL instructions (the probability of randomly generating such program is about 5×10<sup>-912</sup>). In reality, the stack size will rarely exceed 1 MiB.*
-
-#### Register file
-The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7`. All registers are 64 bits wide.
-
-*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
-
-#### ALU
-The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
-
-#### FPU
-The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root.
-
-#### Endianness
-The VM stores and loads all data in little-endian byte order.
+Full description: [dataset.md](doc/dataset.md).

 ## Instruction set
-The 128-bit instruction is encoded as follows:
-
-![Imgur](https://i.imgur.com/thpvVHN.png)
-
-*All flags are aligned to an 8-bit boundary for easier decoding.*
-
-#### Opcode
-There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is following:
-
-|operation|number of opcodes||
-|---------|-----------------|----|
-|ALU operations|142|55.5%|
-|FPU operations|82|32.0%|
-|Control flow |32|12.5%|
-
-#### Operand A
-The first operand is read from memory. The location is determined by the `loc(a)` flag:
-
-|loc(a)[2:0]|read A from|address size (W)
-|---------|-|-|
-|000|DRAM|32 bits|
-|001|DRAM|32 bits|
-|010|DRAM|32 bits|
-|011|DRAM|32 bits|
-|100|scratchpad|15 bits|
-|101|scratchpad|11 bits|
-|110|scratchpad|11 bits|
-|111|scratchpad|11 bits|
-
-Flag `reg(a)` encodes an integer register `r0`-`r7`.  The read address is calculated as:
-```
-reg(a) ^= addr0
-addr(a) = reg(a)[W-1:0]
-```
-For reading from the scratchpad, `addr(a)` is multiplied by 8 for 8-byte aligned access.
-
-#### Operand B
-The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).
-
-|loc(b)[2:0]|read B from|
-|---------|-|
-|000|register `reg(b)`|
-|001|register `reg(b)`|
-|010|register `reg(b)`|
-|011|register `reg(b)`|
-|100|register `reg(b)`|
-|101|register `reg(b)`|
-|110|`imm0` or `imm1`|
-|111|`imm0` or `imm1`|
-
-`imm0` is an 8-bit immediate value, which is used for shift and rotate ALU operations.
-
-`imm1` is a 32-bit immediate value which is used for most operations. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is first left-shifted by 32 bits, treated as a signed 64-bit integer and converted to a double precision floating point format.
-
-#### Operand C
-The third operand is the location where the result is stored.
-
-|loc\(c\)[2:0]|write C to|address size (W)
-|---------|-|-|
-|000|scratchpad|15 bits|
-|001|scratchpad|11 bits|
-|010|scratchpad|11 bits|
-|011|scratchpad|11 bits|
-|100|register `reg(c)`|-|
-|101|register `reg(c)`|-|
-|110|register `reg(c)`|-|
-|111|register `reg(c)`|-|
-
-The `reg(c)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).  For writing to the scratchpad, an integer register is always used and the write address is calculated as:
-```
-addr(c) = (addr1 ^ reg(c))[W-1:0] * 8
-```
-*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 write to memory.*
-
-#### imm0
-An 8-bit immediate value that is used as the shift/rotate count by some ALU instructions and as the jump offset of the CALL instruction.
-
-#### addr0
-A 32-bit address mask that is used to calculate the read address for the A operand.
-
-#### addr1
-A 32-bit address mask that is used to calculate the write address for the C operand. `addr1` is equal to `imm1`.
-
-### ALU instructions
-
-|weight|instruction|signed|A width|B width|C|C width|
-|-|-|-|-|-|-|-|
-|16|ADD_64|no|64|64|A + B|64|
-|8|ADD_32|no|32|32|A + B|32|
-|16|SUB_64|no|64|64|A - B|64|
-|8|SUB_32|no|32|32|A - B|32|
-|7|MUL_64|no|64|64|A * B|64|
-|7|MULH_64|no|64|64|A * B|64|
-|7|MUL_32|no|32|32|A * B|64|
-|7|IMUL_32|yes|32|32|A * B|64|
-|7|IMULH_64|yes|64|64|A * B|64|
-|1|DIV_64|no|64|32|A / B|32|
-|1|IDIV_64|yes|64|32|A / B|32|
-|4|AND_64|no|64|64|A & B|64|
-|3|AND_32|no|32|32|A & B|32|
-|4|OR_64|no|64|64|A &#124; B|64|
-|3|OR_32|no|32|32|A &#124; B|32|
-|4|XOR_64|no|64|64|A ^ B|64|
-|3|XOR_32|no|32|32|A ^ B|32|
-|6|SHL_64|no|64|6|A << B|64|
-|6|SHR_64|no|64|6|A >> B|64|
-|6|SAR_64|yes|64|6|A >> B|64|
-|9|ROL_64|no|64|6|A <<< B|64|
-|9|ROR_64|no|64|6|A >>> B|64|

-##### 32-bit operations
-Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are zero.
+RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program. Each RandomX instruction has a length of 128 bits.

-##### Multiplication
-There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
-
-The following table shows an example of the output of the 5 multiplication instructions for inputs `A = 0xBC550E96BA88A72B` and `B = 0xF5391FA9F18D6273`:
-
-|instruction|A interpreted as|B interpreted as|result C|
-|-|-|-|-|
-|MUL_64|13570769092688258859|17670189427727360627|`0x28723424A9108E51`|
-|MULH_64|13570769092688258859|17670189427727360627|`0xB4676D31D2B34883`|
-|MUL_32|3129517867|4052574835|`0xB001AA5FA9108E51`|
-|IMUL_32|-1165449429|-242392461|`0x03EBA0C1A9108E51`|
-|IMULH_64|-4875974981021292757|-776554645982190989|`0x02D93EF1269D3EE5`|
-
-##### Division
-For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend `A`.
-
-*Division by zero can be handled without branching by conditional move (`IF B == 0 THEN B = 1`). Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. In this extremely rare case, ARM produces the "correct" result, but x86 throws a hardware exception, which must be handled.*
-
-##### Shift and rotate
-The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm0` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
-
-### FPU instructions
-
-|weight|instruction|C|
-|-|-|-|
-|22|FADD|A + B|
-|22|FSUB|A - B|
-|22|FMUL|A * B|
-|8|FDIV|A / B|
-|6|FABSQRT|sqrt(A)|
-|2|FROUND|A|
-
-FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is RN (Round to Nearest). Denormal values may not be produced by any operation.
-
-*Denormals can be disabled by setting the FTZ flag in x86 SSE and ARM Neon engines. This is done for performance reasons.*
-
-Operands loaded from memory are treated as signed 64-bit integers and converted to double precision floating point format. Operands loaded from floating point registers are used directly.
-
-##### FABSQRT
-The sign bit of the FABSQRT operand is always cleared first, so only non-negative values are used.
-
-*In x86, the `SQRTSD` instruction must be used. The legacy `FSQRT` instruction doesn't produce correctly rounded results in all cases.*
-
-##### FROUND
-The FROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two right-most bits of A:
-
-|A[1:0]|rounding mode|
-|-------|------------|
-|00|Round to Nearest (RN) mode|
-|01|Round towards Minus Infinity (RM) mode
-|10|Round towards Plus Infinity (RP) mode
-|11|Round towards Zero (RZ) mode
-
-*The two-bit flag value exactly corresponds to bits 13-14 of the x86 `MXCSR` register and bits 23 and 22 (reversed) of the ARM `FPSCR` register.*
-
-### Control flow instructions
-The following 2 control flow instructions are supported:
-
-|weight|instruction|function|
-|-|-|-|
-|17|CALL|near procedure call|
-|15|RET|return from procedure|
-
-Both instructions are conditional in 75% of cases. The jump is taken only if `B <= imm1`. For the 25% of cases when `B` is equal to `imm1`, the jump is unconditional. In case the branch is not taken, both instructions become "arithmetic no-op" `C = A`.
-
-##### CALL
-Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm0[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
-
-##### RET
-The RET instruction behaves like "not taken" when the stack is empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A ^ retval`. Finally, the instruction jumps back to `raddr`.
+Full description: [isa.md](doc/isa.md).

 ## Proof of work

-### Hash functions
-
-#### Blake2b
-The primary cryptographically secure hash function used by RandomX is [Blake2b](https://blake2.net/) with an output size of 256 bits. Blake2b was specifically designed to be fast in software, especially on modern 64-bit processors, where it's around three times faster than SHA-3 and can run at a speed of around 3 clock cycles per byte of input.
-
-`Blake2b(X)` refers to the 256-bit plain hash and `Blake2b(K, X)` refers to the 256-bit keyed hash.
-
-#### HighwayHash
-[HighwayHash](https://github.com/google/highwayhash) is a fast keyed pseudorandom function, which can take advantage of SIMD instructions available in modern CPUs. It's used to calculate the scratchpad digest. HighwayHash can run at a speed of about 0.3 clocks per byte using SSE 4.1.
-
-The function is called as `HighwayHash(K, X)`, where `K` is a 256-bit key.
-
-### Pseudo-random number generator
-RandomX uses a permuted congruential generator (PCG) for VM initialization. A minimal C implementation is available [here](http://www.pcg-random.org/download.html#minimal-c-implementation). The generator has an internal state of 64 bits and additional 63 bits are used to select the output stream. The generator produces 32 random bits per call.
-
-### DRAM blob initialization
-TBD
-
-### VM initialization
-#### Scratchpad initialization
-The scratchpad is initialized by copying a 256 KiB block from the DRAM blob. The starting offset of the block is `262144 * i`, where `i` is a 14-bit input parameter.
-
-Pseudocode:
-```python
-# initializes the scratchpad
-def InitializeScratchpad(i):
-    memcpy(Scratchpad, DRAM + 262144 * i, 262144)
-```
-
-#### Program initialization
-The program is initialized from a 256-bit seed value `S`.
-1. The PCG random number generator is initialized with state `S[63:0]` and increment `S[127:64] | 1` (odd number).
-2. The generator is used to generate 8324 random bytes.
-3. The integer registers `r0`-`r7` are initialized with bytes 0-63.
-4. Floating point registers `f0`-`f7` are initialized with bytes 64-127 interpreted as 8 64-bit signed integers converted to a double precision floating point format.
-5. The program buffer is initialized with bytes 128-8319.
-6. The initial value of the `ma` register is set to bytes 8320-8323, XORed with `S[159:128]` and the last 3 bits are cleared (8-byte aligned).
-7. The value of the `mx` register is initialized as `S[191:160]`.
-8. The remaining registers are initialized with constant values: `pc = 0`, `sp = 0`, `ic = 1048576`.
-
-Pseudocode:
-```python
-# S is a 256-bit seed value
-# initializes the program buffer and registers
-def InitializeProgram(S):
-    rng = Pcg32(S[63:0], S[127:64] | 1)
-    a = []
-    loop 2081 times:
-        a.append(rng.next())
-    r0 = a[0..1]
-    r1 = a[2..3]
-    r2 = a[4..5]
-    r3 = a[6..7]
-    r4 = a[8..9]
-    r5 = a[10..11]
-    r6 = a[12..13]
-    r7 = a[14..15]
-    f0 = double(a[16..17])
-    f1 = double(a[18..19])
-    f2 = double(a[20..21])
-    f3 = double(a[22..23])
-    f4 = double(a[24..25])
-    f5 = double(a[26..27])
-    f6 = double(a[28..29])
-    f7 = double(a[30..31])
-    ProgramBuffer = a[32..2079]
-    ma = (a[2080] ^ S[159:128]) & 0xFFFFFFF8
-    mx = S[191:160]
-    pc = 0
-    sp = 0
-    ic = 1048576
-```
-
-### PoW hash calculation
 RandomX produces a 256-bit final hash value to be used for a Hashcash-style proof evaluation.
-The hash of the input (block header for a cryptocurrency) is used for the first VM initiazation. The program initialization and program execution are chained three times to discourage mining strategies that search for programs with particular properties. The scratchpad is preserved between the 3 program executions.
-
-Pseudocode:
-```python
-# H is the input value
-# returns a 256-bit PoW hash
-def RandomXPoW(H):
-    K = Blake2b(H)
-    InitializeScratchpad(K[205:192])
-    S = K
-    loop 3 times:
-        InitializeProgram(S)
-        ExecuteProgram()
-        S = Blake2b(K, RegisterFile)
-    W = HighwayHash(K, Scratchpad)
-    return Blake2b(K, RegisterFile + W)
-```
-*The stack is not included in the result calculation to enable platform-specific return addresses.*
-*An average program takes roughly 2.5 ms to execute on a recent CPU (preliminary tests). VM initialization and result calculation should take less than 0.1 ms. The total time to calculate the PoW should be under 10 ms (depends on the overhead of translating RandomX code into machine code).*
-
-## Test code
-A python generator is available to generate a random program and output its C source code.
+The hash of the block header is used for the first VM initialization. The program initialization and execution are chained multiple times to prevent mining strategies that search for programs with particular properties (for example, without division).

-Generate a random program:
-```
-python rx2c.py > rx-sample.c
-```
-Compile the program:
-```
-gcc -O2 -maes -DRAM -DPREF rx-sample.c -o rx-sample
-```
+The final result is obtained by calculating a Blake2b hash of the Register file and a checksum of the Scratchpad.

-*(Note that the test program can be compiled only by the GCC compiler due to the use of non-standard C features such as computed goto.)*
+## Acknowledgements
+The following people have contributed to the design of RandomX:
+* [SChernykh](https://github.com/SChernykh)
+* [hyc](https://github.com/hyc)

-Run the program:
-```
-./rx-sample
-```
-*(Note that the test program execution requires more than 4 GiB of available virtual memory and the AES-NI instruction set support.)*
+RandomX uses some source code from the following 3rd party repositories:
+* Argon2d, Blake2b hashing functions: https://github.com/P-H-C/phc-winner-argon2
+* PCG32 random number generator: https://github.com/imneme/pcg-c-basic
+* Software AES implementation: https://github.com/fireice-uk/xmr-stak
--- a/doc/dataset.md
+++ b/doc/dataset.md
@@ -0,0 +1,82 @@
+
+## Dataset
+
+The dataset serves as the source of the first operand of all instructions and provides the memory-hardness of RandomX. The size of the dataset is fixed at 4 GiB and it's divided into 65536 blocks, each 64 KiB in size.
+
+In order to allow PoW verification with less than 4 GiB of memory, the dataset is constructed from a 64 MiB cache, which can be used to calculate dataset blocks on the fly. To facilitate this, all random reads from the dataset are aligned to the beginning of a block.
+
+Because the initialization of the dataset is computationally intensive, it's recalculated only once every 1024 blocks (~34 hours on average). The following figure visualizes the construction of the dataset:
+
+![Imgur](https://i.imgur.com/JgLCjeq.png)
+
+### Seed block
+The whole dataset is constructed from a 256-bit hash of the last block whose height is divisible by 1024 **and** has at least 64 confirmations.
+
+|block|Seed block|
+|------|---------------------------------|
+|1-1088|Genesis block|
+|1089-2112|1024|
+|2113-3136|2048|
+|...|...
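The seed block rule above can be sketched as a small function. This is a minimal illustration, not the reference implementation: the names `seed_block_height`, `SEED_INTERVAL` and `CONFIRMATIONS` are hypothetical, and the boundary convention (block 1088 still uses the genesis block, block 2113 is the first to use block 2048) is read off the table rather than taken from a formal definition of "confirmations".

```python
SEED_INTERVAL = 1024   # seed blocks have heights divisible by 1024
CONFIRMATIONS = 64     # assumed number of confirmations required


def seed_block_height(height: int) -> int:
    """Return the height of the seed block used for the block at `height`.

    Assumption: a seed block at height s is first used at height
    s + CONFIRMATIONS + 1, which matches the table rows above.
    """
    if height <= SEED_INTERVAL + CONFIRMATIONS:
        return 0  # genesis block
    return ((height - CONFIRMATIONS - 1) // SEED_INTERVAL) * SEED_INTERVAL
```

For example, `seed_block_height(2112)` gives 1024 and `seed_block_height(2113)` gives 2048 under this convention.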
+
+### Cache construction
+
+The 32-byte seed block hash is expanded into the 64 MiB cache using the "memory fill" function of Argon2d. [Argon2](https://github.com/P-H-C/phc-winner-argon2) is a memory-hard password hashing function, which is highly customizable. The variant with "d" suffix uses a data-dependent memory access pattern and provides the highest resistance against time-memory tradeoffs.
+
+Argon2 is used with the following parameters:
+
+|parameter|value|
+|------------|--|
+|parallelism|1|
+|output size|0|
+|memory|65536 KiB (64 MiB)|
+|iterations|12|
+|version|`0x13`|
+|hash type|0 (Argon2d)|
+|password|seed block hash (32 bytes)|
+|salt|`4d 6f 6e 65 72 6f 1a 24` (8 bytes)|
+|secret size|0|
+|assoc. data size|0|
+
+The finalizer and output calculation steps of Argon2 are omitted. The output is the filled memory array.
+
+The use of 12 iterations makes time-memory tradeoffs infeasible and thus 64 MiB is the minimum amount of memory required by RandomX.
+
+When the memory fill is complete, the whole memory array is cyclically shifted backwards by 512 bytes (i.e. bytes 0-511 are moved to the end of the array). This is done to misalign the array so that each 1024-byte cache block spans two subsequent Argon2 blocks.
+
+### Dataset block generation
+The full 4 GiB dataset can be generated from the 64 MiB cache. Each block is generated separately: a 1024 byte block of the cache is expanded into 64 KiB of the dataset. The algorithm has 3 steps: expansion, AES and shuffle.
+
+#### Expansion
+The 1024 cache bytes are split into 128 quadwords and interleaved with 504-byte chunks of null bytes. The resulting sequence is: 8 cache bytes + 504 null bytes + 8 cache bytes + 504 null bytes etc. Total length of the expanded block is 65536 bytes.
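The expansion step can be sketched in a few lines of Python. This is only an illustration of the layout described above; the function name `expand_cache_block` is hypothetical.

```python
def expand_cache_block(cache_block: bytes) -> bytes:
    """Expand a 1024-byte cache block into a 65536-byte buffer.

    Each of the 128 quadwords (8 bytes) is followed by 504 null bytes,
    so every quadword starts at a multiple of 512 bytes.
    """
    if len(cache_block) != 1024:
        raise ValueError("cache block must be exactly 1024 bytes")
    out = bytearray()
    for offset in range(0, 1024, 8):
        out += cache_block[offset:offset + 8]  # 8 cache bytes
        out += bytes(504)                      # 504 null bytes
    return bytes(out)
```

The result has 128 × 512 = 65536 bytes, which matches the total length stated above.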
+
+#### AES
+The 256-bit seed block hash is expanded into 10 AES round keys `k0`-`k9`. Let `i = 0...65535` be the index of the block that is being expanded. If `i` is an even number, this step uses AES *decryption* and if `i` is an odd number, it uses AES *encryption*.  Since both encryption and decryption scramble random data, no distinction is made between them in the text below.
+
+The AES encryption is performed with 10 identical rounds using round keys `k0`-`k9`. Note that this is different from the typical AES procedure, which uses a different key schedule for decryption and a modified last round.
+
+Before the AES encryption is applied, each 16-byte chunk is XORed with the ciphertext of the previous chunk. This is similar to the [AES-CBC](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Cipher_Block_Chaining_%28CBC%29) mode of operation and forces the encryption to be sequential. For XORing the initial block, an initialization vector is formed by zero-extending `i` to 128 bits.
+
+#### Shuffle
+When the AES step is complete, the last 16-byte chunk of the block is used to initialize a PCG32 random number generator. Bits 0-63 are used as the initial state and bits 64-127 are used as the increment. The least-significant bit of the increment is always set to 1 to form an odd number.
+
+The whole block is then divided into 16384 doublewords (4 bytes) and the [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle) algorithm is applied to it. The algorithm generates a random in-place permutation of the 16384 doublewords. The result of the shuffle is the `i`-th block of the dataset.
+
+The shuffle algorithm requires a uniform distribution of random numbers. The output of the PCG32 generator is always properly filtered to avoid the [modulo bias](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modulo_bias).
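A rough sketch of the shuffle step follows. Several details are assumptions that the text does not settle: the 16-byte chunk is read as little-endian, the generator state is set directly without the extra seeding steps of `pcg-c-basic`, the bias filter is the rejection method used by `pcg32_boundedrand_r`, and the shuffle iterates from the last element down. The names `Pcg32` and `shuffle_block` are hypothetical.

```python
MASK64 = (1 << 64) - 1
PCG_MULTIPLIER = 6364136223846793005  # constant from the PCG32 reference code


class Pcg32:
    def __init__(self, state: int, increment: int):
        self.state = state & MASK64
        self.inc = (increment | 1) & MASK64  # increment must be odd

    def next(self) -> int:
        """Return 32 random bits (XSH-RR output function)."""
        old = self.state
        self.state = (old * PCG_MULTIPLIER + self.inc) & MASK64
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF

    def bounded(self, bound: int) -> int:
        """Return a uniform value in [0, bound) without modulo bias."""
        threshold = (1 << 32) % bound  # reject the biased low range
        while True:
            r = self.next()
            if r >= threshold:
                return r % bound


def shuffle_block(block: bytes) -> bytes:
    """Fisher-Yates shuffle of the 4-byte doublewords of a block."""
    last = block[-16:]
    rng = Pcg32(int.from_bytes(last[:8], "little"),   # bits 0-63: state
                int.from_bytes(last[8:], "little"))   # bits 64-127: increment
    words = [block[i:i + 4] for i in range(0, len(block), 4)]
    for i in range(len(words) - 1, 0, -1):
        j = rng.bounded(i + 1)
        words[i], words[j] = words[j], words[i]
    return b"".join(words)
```

Because the generator is seeded from the block itself, the permutation is deterministic: shuffling the same AES output always yields the same dataset block.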
+
+### Performance
+The initial 64-MiB cache construction using Argon2d takes around 1 second using an older laptop with an Intel i5-3230M CPU (Ivy Bridge). Cache generation is strictly serial and cannot be easily parallelized.
+
+Dataset generation performance depends on the support of the AES-NI instruction set. The following table lists the generation runtimes using the same Ivy Bridge laptop with a single thread:
+
+|AES|4 GiB dataset generation|single block generation|
+|-----|-----------------------------|----------------|
+|hardware (AES-NI)|25 s|380 µs|
+|software|53 s|810 µs|
+
+While the generation of a single block is strictly serial, multiple blocks can be easily generated in parallel, so the dataset generation time decreases linearly with the number of threads. Using a recent 6-core CPU with AES-NI support, the whole dataset can be generated in about 4 seconds.
+
+Moreover, the seed block hash is known up to 64 blocks in advance, so miners can slowly precalculate the whole dataset by generating ~512 dataset blocks per minute (corresponds to less than 1% utilization of a single CPU core).
+
+### Light clients
+Light clients, who cannot or do not want to generate and keep the whole dataset in memory, can generate just the cache and then generate blocks on the fly as the program is being executed. In this case, the program execution time will be increased by roughly 100 times the single block generation time. For the Intel Ivy Bridge laptop, this amounts to around 40 milliseconds per program.
--- a/doc/isa.md
+++ b/doc/isa.md
@@ -0,0 +1,181 @@
+
+## RandomX instruction set
+RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
+
+Each RandomX instruction has a length of 128 bits. The encoding is as follows:
+
+![Imgur](https://i.imgur.com/thpvVHN.png)
+
+*All flags are aligned to an 8-bit boundary for easier decoding.*
+
+#### Opcode
+There are 256 opcodes, which are distributed between various operations depending on their weight (how often they will occur in the program on average). The distribution of opcodes is as follows:
+
+|operation|number of opcodes||
+|---------|-----------------|----|
+|ALU operations|146|57.0%|
+|FPU operations|78|30.5%|
+|Control flow |32|12.5%|
+
+#### Operand A
+The first operand is read from memory. The location is determined by the `loc(a)` flag:
+
+|loc(a)[2:0]|read A from|address size (W)
+|---------|-|-|
+|000|dataset|32 bits|
+|001|dataset|32 bits|
+|010|dataset|32 bits|
+|011|dataset|32 bits|
+|100|scratchpad|15 bits|
+|101|scratchpad|11 bits|
+|110|scratchpad|11 bits|
+|111|scratchpad|11 bits|
+
+Flag `reg(a)` encodes an integer register `r0`-`r7`.  The read address is calculated as:
+```
+reg(a)[31:0] = reg(a)[31:0] XOR addr0
+addr(a) = reg(a)[W-1:0]
+```
+`W` is the address width from the above table. For reading from the scratchpad, `addr(a)` is multiplied by 8 for 8-byte aligned access.
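The read address calculation for operand A can be sketched as follows. The function names and the returned tuple are hypothetical; the widths come from the `loc(a)` table above, and the sketch assumes that only the low 32 bits of the register are modified while the upper half is preserved.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def address_width(loc_a: int):
    """Map loc(a)[2:0] to (memory, W) as listed in the table."""
    loc = loc_a & 0b111
    if loc < 4:
        return "dataset", 32
    if loc == 4:
        return "scratchpad", 15
    return "scratchpad", 11


def read_operand_a_address(reg_a: int, addr0: int, loc_a: int):
    """Return (updated register, memory, address) for operand A."""
    low = (reg_a ^ addr0) & MASK32              # reg(a)[31:0] XOR addr0
    new_reg = (reg_a & MASK64 & ~MASK32) | low  # upper 32 bits unchanged
    memory, width = address_width(loc_a)
    addr = low & ((1 << width) - 1)             # reg(a)[W-1:0]
    if memory == "scratchpad":
        addr *= 8                               # 8-byte aligned access
    return new_reg, memory, addr
```

With `W = 15` the largest scratchpad address is `0x7FFF * 8 = 262136`, the last quadword of the 256 KiB scratchpad.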
+
+#### Operand B
+The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).
+
+|loc(b)[2:0]|read B from|
+|---------|-|
+|000|register `reg(b)`|
+|001|register `reg(b)`|
+|010|register `reg(b)`|
+|011|register `reg(b)`|
+|100|register `reg(b)`|
+|101|register `reg(b)`|
+|110|`imm0` or `imm1`|
+|111|`imm0` or `imm1`|
+
+`imm0` is an 8-bit immediate value, which is used for shift and rotate ALU operations.
+
+`imm1` is a 32-bit immediate value which is used for most operations. For operands larger than 32 bits, the value is zero-extended for unsigned instructions and sign-extended for signed instructions. For FPU instructions, the value is considered a signed 32-bit integer and then converted to a double precision floating point format.
+
+#### Operand C
+The third operand is the location where the result is stored.
+
+|loc\(c\)[2:0]|write C to|address size (W)
+|---------|-|-|
+|000|scratchpad|15 bits|
+|001|scratchpad|11 bits|
+|010|scratchpad|11 bits|
+|011|scratchpad|11 bits|
+|100|register `reg(c)`|-|
+|101|register `reg(c)`|-|
+|110|register `reg(c)`|-|
+|111|register `reg(c)`|-|
+
+The `reg(c)` flag encodes an integer register (ALU operations) or a floating point register (FPU operations).  For writing to the scratchpad, an integer register is always used and the write address is calculated as:
+```
+addr(c) = 8 * (addr1 XOR reg(c)[31:0])[W-1:0]
+```
+*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 write to memory.*
+
+#### imm0
+An 8-bit immediate value that is used as the shift/rotate count by some ALU instructions and as the jump offset of the CALL instruction.
+
+#### addr0
+A 32-bit address mask that is used to calculate the read address for the A operand.
+
+#### addr1
+A 32-bit address mask that is used to calculate the write address for the C operand. `addr1` is equal to `imm1`.
+
+### ALU instructions
+
+|weight|instruction|signed|A width|B width|C|C width|
+|-|-|-|-|-|-|-|
+|16|ADD_64|no|64|64|A + B|64|
+|4|ADD_32|no|32|32|A + B|32|
+|16|SUB_64|no|64|64|A - B|64|
+|4|SUB_32|no|32|32|A - B|32|
+|15|MUL_64|no|64|64|A * B|64|
+|11|MULH_64|no|64|64|A * B|64|
+|11|MUL_32|no|32|32|A * B|64|
+|11|IMUL_32|yes|32|32|A * B|64|
+|11|IMULH_64|yes|64|64|A * B|64|
+|1|DIV_64|no|64|32|A / B|32|
+|1|IDIV_64|yes|64|32|A / B|32|
+|4|AND_64|no|64|64|A & B|64|
+|2|AND_32|no|32|32|A & B|32|
+|4|OR_64|no|64|64|A &#124; B|64|
+|2|OR_32|no|32|32|A &#124; B|32|
+|4|XOR_64|no|64|64|A ^ B|64|
+|2|XOR_32|no|32|32|A ^ B|32|
+|3|SHL_64|no|64|6|A << B|64|
+|3|SHR_64|no|64|6|A >> B|64|
+|3|SAR_64|yes|64|6|A >> B|64|
+|9|ROL_64|no|64|6|A <<< B|64|
+|9|ROR_64|no|64|6|A >>> B|64|
+
+##### 32-bit operations
+Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are zero.
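A minimal Python sketch of the 32-bit behavior described above, assuming operands are passed as non-negative integers holding the 64-bit register contents. The function names are illustrative only.

```python
MASK32 = (1 << 32) - 1


def add_32(a: int, b: int) -> int:
    # Only the low 32 bits of each operand are used; bits 32-63 of C are zero.
    return ((a & MASK32) + (b & MASK32)) & MASK32


def sub_32(a: int, b: int) -> int:
    return ((a & MASK32) - (b & MASK32)) & MASK32


def and_32(a: int, b: int) -> int:
    return (a & b) & MASK32


def or_32(a: int, b: int) -> int:
    return (a | b) & MASK32


def xor_32(a: int, b: int) -> int:
    return (a ^ b) & MASK32
```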
+
+##### Multiplication
+There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
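The five multiplication variants can be illustrated with Python's arbitrary-precision integers. This is a sketch under the assumption that register values are passed as unsigned 64-bit integers; the helper `to_signed` and the function names are hypothetical and not taken from the reference implementation.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mul_64(a: int, b: int) -> int:
    # Low 64 bits of the unsigned product.
    return ((a & MASK64) * (b & MASK64)) & MASK64


def mulh_64(a: int, b: int) -> int:
    # High 64 bits of the unsigned 128-bit product.
    return ((a & MASK64) * (b & MASK64)) >> 64


def mul_32(a: int, b: int) -> int:
    # Unsigned 32 x 32 -> 64-bit product.
    return (a & MASK32) * (b & MASK32)


def imul_32(a: int, b: int) -> int:
    # Signed 32 x 32 -> 64-bit product, stored as two's complement.
    return (to_signed(a, 32) * to_signed(b, 32)) & MASK64


def imulh_64(a: int, b: int) -> int:
    # High 64 bits of the signed 128-bit product (Python's >> is arithmetic).
    return ((to_signed(a, 64) * to_signed(b, 64)) >> 64) & MASK64
```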
+
+##### Division
+For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend `A`.
+
+*Division by zero can be handled without branching by a conditional move. Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. This rare case must be handled in x86 (ARM produces the "correct" result).*
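The special cases can be sketched in Python as shown below. Several details are assumptions rather than statements from this document: signed division is taken to truncate toward zero (as x86 `IDIV` does), the quotient is kept at full 64-bit width instead of modeling the 32-bit C width from the table, and all names are illustrative.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
INT64_MIN = -(1 << 63)


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def div_64(a: int, b: int) -> int:
    dividend = a & MASK64
    divisor = b & MASK32
    if divisor == 0:
        return dividend  # division by zero yields the dividend A
    return dividend // divisor


def idiv_64(a: int, b: int) -> int:
    dividend = to_signed(a, 64)
    divisor = to_signed(b, 32)
    # Division by zero and INT64_MIN / -1 both yield the dividend A.
    if divisor == 0 or (dividend == INT64_MIN and divisor == -1):
        return a & MASK64
    # Assumed rounding: truncate toward zero.
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    return quotient & MASK64
```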
+
+##### Shift and rotate
+The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm0` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
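A short Python sketch of the shift and rotate semantics, assuming `a` holds the 64-bit value of operand A and `b` the shift count before masking. The function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value: int) -> int:
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def shl_64(a: int, b: int) -> int:
    return (a << (b & 63)) & MASK64


def shr_64(a: int, b: int) -> int:
    # Logical right shift: A is treated as unsigned.
    return (a & MASK64) >> (b & 63)


def sar_64(a: int, b: int) -> int:
    # Arithmetic right shift: the sign bit is copied into the vacated bits.
    return (to_signed64(a) >> (b & 63)) & MASK64


def rol_64(a: int, b: int) -> int:
    n = b & 63
    a &= MASK64
    return ((a << n) | (a >> (64 - n))) & MASK64


def ror_64(a: int, b: int) -> int:
    n = b & 63
    a &= MASK64
    return ((a >> n) | (a << (64 - n))) & MASK64
```

Because only the bottom 6 bits of the count are used, a count of 64 behaves like a count of 0.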
+
+### FPU instructions
+
+|weight|instruction|conversion method|C|
+|-|-|-|-|
+|20|FPADD|`convertSigned52`|A + B|
+|20|FPSUB|`convertSigned52`|A - B|
+|22|FPMUL|`convertSigned51`|A * B|
+|8|FPDIV|`convertSigned51`|A / B|
+|6|FPSQRT|`convert52`|sqrt(A)|
+|2|FPROUND|`convertSigned52`|A|
+
+#### Rounding
+FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. The initial rounding mode is *roundTiesToEven*. The rounding mode can be changed by the `FPROUND` instruction. Denormal values are not produced by any operation.
+
+#### Conversion of operand A
+Operand A is loaded from memory as a 64-bit signed integer and then converted to a double-precision floating point format using one of the following 3 methods:
+
+* *convertSigned52* - Clears the 11 least-significant bits before conversion. This is done so the number fits exactly into the 52-bit mantissa without rounding.
+* *convertSigned51* - Clears the 11 least-significant bits and sets the 12th bit before conversion. This is done so the number fits exactly into the 52-bit mantissa without rounding and avoids 0.
+* *convert52* - Clears the 11 least-significant bits and the sign bit before conversion. This is done so the number fits exactly into the 52-bit mantissa without rounding and avoids negative values.
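The three conversion methods can be sketched in Python as follows, assuming the operand arrives as an unsigned 64-bit integer that is reinterpreted as signed. The function names mirror the method names above but are otherwise illustrative.

```python
MASK64 = (1 << 64) - 1
LOW11 = (1 << 11) - 1


def to_signed64(value: int) -> int:
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def convert_signed52(raw: int) -> float:
    # Clear the 11 least-significant bits, then convert as a signed integer.
    cleared = (raw & MASK64) & ~LOW11
    return float(to_signed64(cleared))


def convert_signed51(raw: int) -> float:
    # Clear the 11 least-significant bits and set the 12th bit, so the
    # result can never be zero.
    adjusted = ((raw & MASK64) & ~LOW11) | (1 << 11)
    return float(to_signed64(adjusted))


def convert52(raw: int) -> float:
    # Clear the 11 least-significant bits and the sign bit, so the result
    # can never be negative.
    cleared = (raw & MASK64) & ~LOW11 & ~(1 << 63)
    return float(cleared)
```

In each case the integer is a multiple of 2048 with at most 53 significant bits, so the conversion to a double is exact and does not depend on the current rounding mode.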
+
+##### FPROUND
+The FPROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two least-significant bits of A.
+
+|A[1:0]|rounding mode|
+|-------|------------|
+|00|roundTiesToEven|
+|01|roundTowardNegative|
+|10|roundTowardPositive|
+|11|roundTowardZero|
+
+The rounding modes are defined by the IEEE-754 standard.
+
+*The two-bit flag value exactly corresponds to bits 13-14 of the x86 `MXCSR` register and bits 23 and 22 (reversed) of the ARM `FPSCR` register.*
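As an illustration of the note above, the following Python sketch maps the two least-significant bits of A to the rounding mode name and to the corresponding control-register bit patterns. The function names are hypothetical; the bit positions are the ones stated in the note.

```python
ROUNDING_MODES = (
    "roundTiesToEven",      # 00
    "roundTowardNegative",  # 01
    "roundTowardPositive",  # 10
    "roundTowardZero",      # 11
)


def fpround_mode(a: int) -> str:
    # Only the two least-significant bits of A select the mode.
    return ROUNDING_MODES[a & 3]


def mxcsr_rounding_bits(a: int) -> int:
    # x86: the flag maps directly onto bits 13-14 of MXCSR.
    return (a & 3) << 13


def fpscr_rounding_bits(a: int) -> int:
    # ARM: the same two bits land in FPSCR bits 23 and 22 in reversed order,
    # so flag bit 0 goes to bit 23 and flag bit 1 goes to bit 22.
    flag = a & 3
    return ((flag & 1) << 23) | ((flag >> 1) << 22)
```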
+
+### Control flow instructions
+The following 2 control flow instructions are supported:
+
+|weight|instruction|function|
+|-|-|-|
+|17|CALL|near procedure call|
+|15|RET|return from procedure|
+
+Both instructions are conditional in 75% of cases. The jump is taken only if `B <= imm1`. In the remaining 25% of cases, `B` is equal to `imm1` and the jump is therefore unconditional. If the branch is not taken, both instructions become an "arithmetic no-op" `C = A`.
+
+##### CALL
+A taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm0[6:0] + 1)`. The maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
+
+##### RET
+The RET instruction behaves like "not taken" when the stack is empty. A taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A XOR retval`. Finally, the instruction jumps back to `raddr`.
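The CALL and RET behavior can be modeled with a Python list as the stack. This is a rough sketch with several assumptions that are not stated in this section: `pc` is taken to already point at the instruction after the current one, jump targets wrap around an 8 KiB program buffer (512 instructions of 16 bytes), `B <= imm1` is evaluated on the integers as passed, and the value written to C by a taken CALL is left unmodeled (returned as `None`). All names are illustrative.

```python
PROGRAM_BYTES = 512 * 16  # assumed size of the program ring buffer


def call_offset(imm0: int) -> int:
    # Forward offset in bytes: 16 * (imm0[6:0] + 1).
    return 16 * ((imm0 & 0x7F) + 1)


def call(stack: list, pc: int, a: int, b: int, imm0: int, imm1: int):
    """Return (next_pc, c); c is None when the CALL is taken."""
    if b <= imm1:
        stack.append(a)   # return value, popped second by RET
        stack.append(pc)  # return address, popped first by RET
        return (pc + call_offset(imm0)) % PROGRAM_BYTES, None
    return pc, a  # not taken: arithmetic no-op C = A


def ret(stack: list, pc: int, a: int, b: int, imm1: int):
    """Return (next_pc, c)."""
    if stack and b <= imm1:
        raddr = stack.pop()
        retval = stack.pop()
        return raddr, a ^ retval
    return pc, a  # empty stack or condition false: C = A
```

The largest offset is `call_offset(0x7F)`, which is 2048 bytes or 128 instructions, matching the maximum jump distance given above.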
+
+## Reference implementation
+A portable C++ implementation of all ALU and FPU instructions is available in [instructionsPortable.cpp](../src/instructionsPortable.cpp).
--- a/doc/vm.md
+++ b/doc/vm.md
@ -0,0 +1,54 @@
+
+## RandomX virtual machine
+RandomX is intended to be run efficiently on a general-purpose CPU. The virtual machine (VM) which runs RandomX code attempts to simulate a generic CPU using the following set of components:
+
+![Imgur](https://i.imgur.com/ZAfbX9m.png)
+
+#### Dataset
+The VM has access to a read-only dataset which has a size of 4 GiB and changes every ~34 hours. See [dataset.md](dataset.md) for details on how the dataset is generated.
+
+#### MMU
+The memory management unit (MMU) interfaces the CPU with the external memory. The purpose of the MMU is to translate the random memory accesses generated by the random program into a DRAM-friendly access pattern, where memory reads are not bound by access latency. The MMU accepts a 32-bit address `addr` and outputs a 64-bit value from the dataset. The dataset is read mostly sequentially. On average, there is one random read for every 8192 sequential reads. An average program reads a total of 4 MiB of the dataset and has 64 random reads.
+
+The MMU uses two internal registers:
+* **ma** - Address of the next quadword to be read from memory (32-bit, 8-byte aligned).
+* **mx** - A 32-bit counter that determines if the next read is sequential or random. After each read, the read address is XORed with the counter and if bits 3-15 of the register are zero, bits 0-2 are cleared and the value of the `mx` register is copied into register `ma`. Thus, all random reads are aligned to a 64 KiB block boundary.
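One possible reading of the two registers is sketched below in Python. The text does not spell out every step, so the following are assumptions: the "read address" that is XORed into `mx` is the `addr` input of the MMU, a sequential read advances `ma` by 8 bytes, and the dataset is a small `bytes` object that is indexed modulo its length purely for demonstration. The class name `Mmu` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


class Mmu:
    def __init__(self, dataset: bytes, ma: int = 0, mx: int = 0):
        self.dataset = dataset
        self.ma = ma  # address of the next quadword to read
        self.mx = mx  # counter deciding between sequential and random reads

    def read(self, addr: int) -> int:
        # Fetch the 64-bit little-endian value at the current address.
        offset = self.ma % len(self.dataset)
        value = int.from_bytes(self.dataset[offset:offset + 8], "little")
        # Mix the supplied address into the counter.
        self.mx = (self.mx ^ addr) & MASK32
        if (self.mx & 0xFFF8) == 0:
            # Bits 3-15 are zero: clear bits 0-2 and jump to a random,
            # 64 KiB aligned address.
            self.mx &= ~0x7 & MASK32
            self.ma = self.mx
        else:
            # Otherwise continue with the next quadword (assumed step).
            self.ma = (self.ma + 8) & MASK32
        return value
```

Bits 3-15 form a 13-bit field, so for uniformly random input the jump condition holds once in 8192 reads on average, which matches the ratio of random to sequential reads given above.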
+
+*When the value of the `ma` register is changed to a random address, the memory location can be preloaded into CPU cache using the x86 `PREFETCH` instruction or ARM `PRFM` instruction. Implicit prefetch should ensure that sequentially accessed memory is already in the cache.*
+
+#### Scratchpad
+The VM contains a 256 KiB scratchpad, which is accessed randomly both for reading and writing. The scratchpad is split into two segments (16 KiB and 240 KiB). 75% of accesses are into the first 16 KiB.
+
+*The scratchpad access pattern mimics the usual CPU cache structure. The first 16 KiB should be covered by the L1 cache, while the remaining accesses should hit the L2 cache. In some cases, the read address can be calculated in advance, which should limit the impact of L1 cache misses.*
+
+#### Program
+The actual program is stored in an 8 KiB ring buffer structure. Each program consists of 512 random 128-bit instructions. The ring buffer structure makes sure that the program forms a closed infinite loop.
+
+*For high-performance mining, the program should be translated directly into machine code. The whole program will fit into the L1 instruction cache and hot execution paths should stay in the µOP cache that is used by newer x86 CPUs. This should limit the number of front-end stalls and keep the CPU busy most of the time.*
+
+#### Control unit
+The control unit (CU) controls the execution of the program. It reads instructions from the program buffer and sends commands to the other units. The CU contains 3 internal registers:
+* **pc** - Address of the next instruction in the program buffer to be executed (64-bit, 8-byte aligned).
+* **sp** - Address of the last element on the stack (64-bit, 8-byte aligned).
+* **ic** - Instruction counter that holds the number of instructions to execute before terminating. The register is decremented after each instruction and the program execution stops when `ic` reaches `0`.
+
+*A fixed number of executed instructions ensures a roughly equal runtime for each random program.*
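The interaction of `pc`, `ic` and the ring buffer can be sketched as a simple fetch loop. This Python sketch ignores CALL and RET, starts at address 0 (an assumption), and only records which addresses would be fetched; `fetch_addresses` is a hypothetical helper name.

```python
INSTRUCTION_BYTES = 16  # each instruction is 128 bits wide


def fetch_addresses(program: bytes, ic: int) -> list:
    """Return the program-buffer addresses fetched until ic reaches 0."""
    pc = 0
    fetched = []
    while ic > 0:
        fetched.append(pc)
        # The ring buffer wraps around, so the program never runs off the end.
        pc = (pc + INSTRUCTION_BYTES) % len(program)
        # The counter is decremented after each instruction.
        ic -= 1
    return fetched
```

For a three-instruction buffer and `ic = 5`, the fetched addresses are 0, 16, 32, 0, 16: execution wraps around and stops only because the counter runs out.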
+
+#### Stack
+To simulate function calls, the VM uses a stack structure. The program interacts with the stack using the CALL and RET instructions. The stack has unlimited size and each stack element is 64 bits wide.
+
+*Although there is no explicit limit on the stack size, the maximum theoretical size of the stack is 16 MiB for a program that contains only unconditional CALL instructions (the probability of randomly generating such a program is about 5×10<sup>-912</sup>). In reality, the stack size will rarely exceed 1 MiB.*
+
+#### Register file
+The VM has 8 integer registers `r0`-`r7` and 8 floating point registers `f0`-`f7`. All registers are 64 bits wide.
+
+*The number of registers is low enough so that they can be stored in actual hardware registers on most CPUs.*
+
+#### ALU
+The arithmetic logic unit (ALU) performs integer operations. The ALU can perform binary integer operations from 7 groups (addition, subtraction, multiplication, division, bitwise operations, shift, rotation) with operand sizes of 64 or 32 bits.
+
+#### FPU
+The floating-point unit performs IEEE-754 compliant math using 64-bit double precision floating point numbers. Five basic operations are available: addition, subtraction, multiplication, division and square root.
+
+#### Binary encoding
+The VM stores and loads all data in little-endian byte order. Signed numbers are represented using two's complement.
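The encoding rules can be illustrated with Python's standard `struct` module, where the `<` prefix selects little-endian byte order. The helper names are illustrative only.

```python
import struct


def store_int64(value: int) -> bytes:
    # "<q" packs a signed 64-bit integer, little-endian, two's complement.
    return struct.pack("<q", value)


def load_int64(data: bytes) -> int:
    return struct.unpack("<q", data)[0]


def store_double(value: float) -> bytes:
    # "<d" packs an IEEE-754 double, little-endian.
    return struct.pack("<d", value)
```

For example, `store_int64(-2)` yields the bytes `fe ff ff ff ff ff ff ff`: the least-significant byte comes first, and the negative value is encoded in two's complement.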