commit
98c4ccf5ca
@ -0,0 +1,103 @@
|
||||
# RandomX instruction listing
|
||||
|
||||
## Integer instructions
|
||||
For integer instructions, the destination is always an integer register (register group R). Source operand (if applicable) can be either an integer register or memory value. If `dst` and `src` refer to the same register, most instructions use `imm32` as the source operand instead of the register. This is indicated in the 'src == dst' column.
|
||||
|
||||
Memory operands are loaded as 8-byte values from the address indicated by `src`. This indirect addressing is marked with square brackets: `[src]`.
|
||||
|
||||
|frequency|instruction|dst|src|`src == dst ?`|operation|
|
||||
|-|-|-|-|-|-|
|
||||
|12/256|IADD_R|R|R|`src = imm32`|`dst = dst + src`|
|
||||
|7/256|IADD_M|R|mem|`src = imm32`|`dst = dst + [src]`|
|
||||
|16/256|IADD_RC|R|R|`src = dst`|`dst = dst + src + imm32`|
|
||||
|12/256|ISUB_R|R|R|`src = imm32`|`dst = dst - src`|
|
||||
|7/256|ISUB_M|R|mem|`src = imm32`|`dst = dst - [src]`|
|
||||
|9/256|IMUL_9C|R|-|-|`dst = 9 * dst + imm32`|
|
||||
|16/256|IMUL_R|R|R|`src = imm32`|`dst = dst * src`|
|
||||
|4/256|IMUL_M|R|mem|`src = imm32`|`dst = dst * [src]`|
|
||||
|4/256|IMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64`|
|
||||
|1/256|IMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64`|
|
||||
|4/256|ISMULH_R|R|R|`src = dst`|`dst = (dst * src) >> 64` (signed)|
|
||||
|1/256|ISMULH_M|R|mem|`src = imm32`|`dst = (dst * [src]) >> 64` (signed)|
|
||||
|4/256|IDIV_C|R|-|-|`dst = dst + dst / imm32`|
|
||||
|4/256|ISDIV_C|R|-|-|`dst = dst + dst / imm32` (signed)|
|
||||
|2/256|INEG_R|R|-|-|`dst = -dst`|
|
||||
|16/256|IXOR_R|R|R|`src = imm32`|`dst = dst ^ src`|
|
||||
|4/256|IXOR_M|R|mem|`src = imm32`|`dst = dst ^ [src]`|
|
||||
|10/256|IROR_R|R|R|`src = imm32`|`dst = dst >>> src`|
|
||||
|4/256|ISWAP_R|R|R|`src = dst`|`temp = src; src = dst; dst = temp`|
|
||||
|
||||
#### IMULH and ISMULH
|
||||
These instructions output the high 64 bits of the whole 128-bit multiplication result. The result differs for signed and unsigned multiplication (`IMULH` is unsigned, `ISMULH` is signed). The variants with a register source operand do not use `imm32` (they perform a squaring operation if `dst` equals `src`).
|
||||
|
||||
#### IDIV_C and ISDIV_C
|
||||
The division instructions use a constant divisor, so they can be optimized into a [multiplication by fixed-point reciprocal](https://en.wikipedia.org/wiki/Division_algorithm#Division_by_a_constant). `IDIV_C` performs unsigned division (`imm32` is zero-extended to 64 bits), while `ISDIV_C` performs signed division. In the case of division by zero, the instructions become a no-op. In the very rare case of signed overflow, the destination register is set to zero.
|
||||
|
||||
#### ISWAP_R
|
||||
This instruction swaps the values of two registers. If source and destination refer to the same register, the result is a no-op.
|
||||
|
||||
## Floating point instructions
|
||||
For floating point instructions, the destination can be a group F or group E register. Source operand is either a group A register or a memory value.
|
||||
|
||||
Memory operands are loaded as 8-byte values from the address indicated by `src`. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format. The lower and upper memory operands are marked as `[src][0]` and `[src][1]`.
|
||||
|
||||
|frequency|instruction|dst|src|operation|
|
||||
|-|-|-|-|-|
|
||||
|8/256|FSWAP_R|F+E|-|`(dst0, dst1) = (dst1, dst0)`|
|
||||
|20/256|FADD_R|F|A|`(dst0, dst1) = (dst0 + src0, dst1 + src1)`|
|
||||
|5/256|FADD_M|F|mem|`(dst0, dst1) = (dst0 + [src][0], dst1 + [src][1])`|
|
||||
|20/256|FSUB_R|F|A|`(dst0, dst1) = (dst0 - src0, dst1 - src1)`|
|
||||
|5/256|FSUB_M|F|mem|`(dst0, dst1) = (dst0 - [src][0], dst1 - [src][1])`|
|
||||
|6/256|FNEG_R|F|-|`(dst0, dst1) = (-dst0, -dst1)`|
|
||||
|20/256|FMUL_R|E|A|`(dst0, dst1) = (dst0 * src0, dst1 * src1)`|
|
||||
|4/256|FDIV_M|E|mem|`(dst0, dst1) = (dst0 / [src][0], dst1 / [src][1])`|
|
||||
|6/256|FSQRT_R|E|-|`(dst0, dst1) = (√dst0, √dst1)`|
|
||||
|
||||
#### Denormal and NaN values
|
||||
Due to restrictions on the values of the floating point registers, no operation results in `NaN`.
|
||||
`FDIV_M` can produce a denormal result. In that case, the result is set to `DBL_MIN = 2.22507385850720138309e-308`, which is the smallest positive normal number.
|
||||
|
||||
#### Rounding
|
||||
All floating point instructions give correctly rounded results. The rounding mode depends on the value of the `fprc` register:
|
||||
|
||||
|`fprc`|rounding mode|
|
||||
|-------|------------|
|
||||
|0|roundTiesToEven|
|
||||
|1|roundTowardNegative|
|
||||
|2|roundTowardPositive|
|
||||
|3|roundTowardZero|
|
||||
|
||||
The rounding modes are defined by the IEEE 754 standard.
|
||||
|
||||
## Other instructions
|
||||
There are 4 special instructions that have more than one source operand or the destination operand is a memory value.
|
||||
|
||||
|frequency|instruction|dst|src|operation|
|
||||
|-|-|-|-|-|
|
||||
|7/256|COND_R|R|R|`if(condition(src, imm32)) dst = dst + 1`
|
||||
|1/256|COND_M|R|mem|`if(condition([src], imm32)) dst = dst + 1`
|
||||
|1/256|CFROUND|`fprc`|R|`fprc = src >>> imm32`
|
||||
|16/256|ISTORE|mem|R|`[dst] = src`
|
||||
|
||||
#### COND
|
||||
|
||||
These instructions conditionally increment the destination register. The condition function depends on the `mod.cond` flag and takes the lower 32 bits of the source operand and the value `imm32`.
|
||||
|
||||
|`mod.cond`|signed|`condition`|probability|*x86*|*ARM*
|
||||
|---|---|----------|-----|--|----|
|
||||
|0|no|`src <= imm32`|0% - 100%|`JBE`|`BLS`
|
||||
|1|no|`src > imm32`|0% - 100%|`JA`|`BHI`
|
||||
|2|yes|`src - imm32 < 0`|50%|`JS`|`BMI`
|
||||
|3|yes|`src - imm32 >= 0`|50%|`JNS`|`BPL`
|
||||
|4|yes|`src - imm32` overflows|0% - 50%|`JO`|`BVS`
|
||||
|5|yes|`src - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
|
||||
|6|yes|`src < imm32`|0% - 100%|`JL`|`BLT`
|
||||
|7|yes|`src >= imm32`|0% - 100%|`JGE`|`BGE`
|
||||
|
||||
The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected probability the condition is true (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
|
||||
|
||||
#### CFROUND
|
||||
This instruction sets the value of the `fprc` register to the 2 least significant bits of the source register rotated right by `imm32`. This changes the rounding mode of all subsequent floating point instructions.
|
||||
|
||||
#### ISTORE
|
||||
The `ISTORE` instruction stores the value of the source integer register to the memory at the address specified by the destination register. The `src` and `dst` register can be the same.
|
@ -1,213 +1,91 @@
|
||||
|
||||
## RandomX instruction set
|
||||
RandomX uses a simple low-level language (instruction set), which was designed so that any random bitstring forms a valid program.
|
||||
|
||||
Each RandomX instruction has a length of 128 bits. The encoding is following:
|
||||
|
||||
![Imgur](https://i.imgur.com/mbndESz.png)
|
||||
|
||||
*All flags are aligned to an 8-bit boundary for easier decoding.*
|
||||
|
||||
#### Opcode
|
||||
There are 256 opcodes, which are distributed between 30 instructions based on their weight (how often they will occur in the program on average). Instructions are divided into 5 groups:
|
||||
|
||||
|group|number of opcodes||comment|
|
||||
|---------|-----------------|----|------|
|
||||
|IA|115|44.9%|integer arithmetic operations
|
||||
|IS|21|8.2%|bitwise shift and rotate
|
||||
|FA|70|27.4%|floating point arithmetic operations
|
||||
|FS|8|3.1%|floating point single-input operations
|
||||
|CF|42|16.4%|control flow instructions (branches)
|
||||
||**256**|**100%**
|
||||
|
||||
#### Operand A
|
||||
The first 64-bit operand is read from memory. The location is determined by the `loc(a)` flag:
|
||||
|
||||
|loc(a)[2:0]|read A from|address size (W)
|
||||
|---------|-|-|
|
||||
|000|dataset|32 bits|
|
||||
|001|dataset|32 bits|
|
||||
|010|dataset|32 bits|
|
||||
|011|dataset|32 bits|
|
||||
|100|scratchpad|15 bits|
|
||||
|101|scratchpad|11 bits|
|
||||
|110|scratchpad|11 bits|
|
||||
|111|scratchpad|11 bits|
|
||||
|
||||
Flag `reg(a)` encodes an integer register `r0`-`r7`. The read address is calculated as:
|
||||
```
|
||||
reg(a) = reg(a) XOR signExtend(addr(a))
|
||||
read_addr = reg(a)[W-1:0]
|
||||
```
|
||||
`W` is the address width from the above table. For reading from the scratchpad, `read_addr` is multiplied by 8 for 8-byte aligned access.
|
||||
|
||||
#### Operand B
|
||||
The second operand is loaded either from a register or from an immediate value encoded within the instruction. The `reg(b)` flag encodes an integer register (instruction groups IA and IS) or a floating point register (instruction group FA). Instruction group FS doesn't use operand B.
|
||||
|
||||
|loc(b)[2:0]|B (IA)|B (IS)|B (FA)|B (FS)
|
||||
|---------|-|-|-|-|
|
||||
|000|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
|
||||
|001|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
|
||||
|010|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
|
||||
|011|integer `reg(b)`|integer `reg(b)`|floating point `reg(b)`|-
|
||||
|100|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
|
||||
|101|integer `reg(b)`|`imm8`|floating point `reg(b)`|-
|
||||
|110|`imm32`|`imm8`|floating point `reg(b)`|-
|
||||
|111|`imm32`|`imm8`|floating point `reg(b)`|-
|
||||
|
||||
`imm8` is an 8-bit immediate value, which is used for shift and rotate integer instructions (group IS). Only bits 0-5 are used.
|
||||
|
||||
`imm32` is a 32-bit immediate value which is used for integer instructions from group IA.
|
||||
|
||||
Floating point instructions don't use immediate values.
|
||||
|
||||
#### Operand C
|
||||
The third operand is the location where the result is stored. It can be a register or a 64-bit scratchpad location, depending on the value of flag `loc(c)`.
|
||||
|
||||
|loc\(c\)[2:0]|address size (W)| C (IA, IS)|C (FA, FS)
|
||||
|---------|-|-|-|-|-|
|
||||
|000|15 bits|scratchpad|floating point `reg(c)`
|
||||
|001|11 bits|scratchpad|floating point `reg(c)`
|
||||
|010|11 bits|scratchpad|floating point `reg(c)`
|
||||
|011|11 bits|scratchpad|floating point `reg(c)`
|
||||
|100|15 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
|
||||
|101|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
|
||||
|110|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
|
||||
|111|11 bits|integer `reg(c)`|floating point `reg(c)`, scratchpad
|
||||
|
||||
Integer operations write either to the scratchpad or to a register. Floating point operations always write to a register and can also write to the scratchpad. In that case, bit 3 of the `loc(c)` flag determines if the low or high half of the register is written:
|
||||
|
||||
|loc\(c\)[3]|write to scratchpad|
|
||||
|------------|-----------------------|
|
||||
|0|floating point `reg(c)[63:0]`
|
||||
|1|floating point `reg(c)[127:64]`
|
||||
|
||||
The FPROUND instruction is an exception and always writes the low half of the register.
|
||||
|
||||
For writing to the scratchpad, an integer register is always used to calculate the address:
|
||||
```
|
||||
write_addr = 8 * (addr(c) XOR reg(c)[31:0])[W-1:0]
|
||||
```
|
||||
*CPUs are typically designed for a 2:1 load:store ratio, so each VM instruction performs on average 1 memory read and 0.5 writes to memory.*
|
||||
|
||||
#### imm8
|
||||
An 8-bit immediate value that is used as the shift/rotate count by group IS instructions and as the jump offset of the CALL instruction.
|
||||
|
||||
#### addr(a)
|
||||
A 32-bit address mask that is used to calculate the read address for the A operand. It's sign-extended to 64 bits.
|
||||
|
||||
#### addr\(c\)
|
||||
A 32-bit address mask that is used to calculate the write address for the C operand. `addr(c)` is equal to `imm32`.
|
||||
|
||||
### ALU instructions
|
||||
|
||||
|weight|instruction|group|signed|A width|B width|C|C width|
|
||||
|-|-|-|-|-|-|-|-|
|
||||
|10|ADD_64|IA|no|64|64|`A + B`|64|
|
||||
|2|ADD_32|IA|no|32|32|`A + B`|32|
|
||||
|10|SUB_64|IA|no|64|64|`A - B`|64|
|
||||
|2|SUB_32|IA|no|32|32|`A - B`|32|
|
||||
|21|MUL_64|IA|no|64|64|`A * B`|64|
|
||||
|10|MULH_64|IA|no|64|64|`A * B`|64|
|
||||
|15|MUL_32|IA|no|32|32|`A * B`|64|
|
||||
|15|IMUL_32|IA|yes|32|32|`A * B`|64|
|
||||
|10|IMULH_64|IA|yes|64|64|`A * B`|64|
|
||||
|1|DIV_64|IA|no|64|32|`A / B`|32|
|
||||
|1|IDIV_64|IA|yes|64|32|`A / B`|32|
|
||||
|4|AND_64|IA|no|64|64|`A & B`|64|
|
||||
|2|AND_32|IA|no|32|32|`A & B`|32|
|
||||
|4|OR_64|IA|no|64|64|`A | B`|64|
|
||||
|2|OR_32|IA|no|32|32|`A | B`|32|
|
||||
|4|XOR_64|IA|no|64|64|`A ^ B`|64|
|
||||
|2|XOR_32|IA|no|32|32|`A ^ B`|32|
|
||||
|3|SHL_64|IS|no|64|6|`A << B`|64|
|
||||
|3|SHR_64|IS|no|64|6|`A >> B`|64|
|
||||
|3|SAR_64|IS|yes|64|6|`A >> B`|64|
|
||||
|6|ROL_64|IS|no|64|6|`A <<< B`|64|
|
||||
|6|ROR_64|IS|no|64|6|`A >>> B`|64|
|
||||
|
||||
##### 32-bit operations
|
||||
Instructions ADD_32, SUB_32, AND_32, OR_32, XOR_32 only use the low-order 32 bits of the input operands. The result of these operations is 32 bits long and bits 32-63 of C are set to zero.
|
||||
|
||||
##### Multiplication
|
||||
There are 5 different multiplication operations. MUL_64 and MULH_64 both take 64-bit unsigned operands, but MUL_64 produces the low 64 bits of the result and MULH_64 produces the high 64 bits. MUL_32 and IMUL_32 use only the low-order 32 bits of the operands and produce a 64-bit result. The signed variant interprets the arguments as signed integers. IMULH_64 takes two 64-bit signed operands and produces the high-order 64 bits of the result.
|
||||
|
||||
##### Division
|
||||
For the division instructions, the dividend is 64 bits long and the divisor 32 bits long. The IDIV_64 instruction interprets both operands as signed integers. In case of division by zero or signed overflow, the result is equal to the dividend `A`.
|
||||
|
||||
*Division by zero can be handled without branching by a conditional move. Signed overflow happens only for the signed variant when the minimum negative value is divided by -1. This rare case must be handled in x86 (ARM produces the "correct" result).*
|
||||
|
||||
##### Shift and rotate
|
||||
The shift/rotate instructions use just the bottom 6 bits of the `B` operand (`imm8` is used as the immediate value). All treat `A` as unsigned except SAR_64, which performs an arithmetic right shift by copying the sign bit.
|
||||
|
||||
### FPU instructions
|
||||
|
||||
|weight|instruction|group|C|
|
||||
|-|-|-|-|
|
||||
|20|FPADD|FA|`A + B`|
|
||||
|20|FPSUB|FA|`A - B`|
|
||||
|22|FPMUL|FA|`A * B`|
|
||||
|8|FPDIV|FA|`A / B`|
|
||||
|6|FPSQRT|FS|`sqrt(abs(A))`|
|
||||
|2|FPROUND|FS|`convertSigned52(A)`|
|
||||
|
||||
All floating point instructions apart FPROUND are vector instructions that operate on two packed double precision floating point values.
|
||||
# RandomX instruction set architecture
|
||||
RandomX VM is a complex instruction set computer ([CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer)). All data are loaded and stored in little-endian byte order. Signed integer numbers are represented using [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement). Floating point numbers are represented using the [IEEE 754 double precision format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).
|
||||
|
||||
#### Conversion of operand A
|
||||
Operand A is loaded from memory as a 64-bit value. All floating point instructions apart FPROUND interpret A as two packed 32-bit signed integers and convert them into two packed double precision floating point values.
|
||||
|
||||
The FPROUND instruction has a scalar output and interprets A as a 64-bit signed integer. The 11 least-significant bits are cleared before conversion to a double precision format. This is done so the number fits exactly into the 52-bit mantissa without rounding. Output of FPROUND is always written into the lower half of the result register and only this lower half may be written into the scratchpad.
|
||||
|
||||
#### Rounding
|
||||
FPU instructions conform to the IEEE-754 specification, so they must give correctly rounded results. Initial rounding mode is *roundTiesToEven*. Rounding mode can be changed by the `FPROUND` instruction. Denormal values must be flushed to zero.
|
||||
|
||||
#### NaN
|
||||
If an operation produces NaN, the result is converted into positive zero. NaN results may never be written into registers or memory. Only division and multiplication must be checked for NaN results (`0.0 / 0.0` and `0.0 * Infinity` result in NaN).
|
||||
|
||||
##### FPROUND
|
||||
The FPROUND instruction changes the rounding mode for all subsequent FPU operations depending on the two least-significant bits of A.
|
||||
|
||||
|A[1:0]|rounding mode|
|
||||
|-------|------------|
|
||||
|00|roundTiesToEven|
|
||||
|01|roundTowardNegative|
|
||||
|10|roundTowardPositive|
|
||||
|11|roundTowardZero|
|
||||
## Registers
|
||||
|
||||
The rounding modes are defined by the IEEE-754 standard.
|
||||
|
||||
*The two-bit flag value exactly corresponds to bits 13-14 of the x86 `MXCSR` register and bits 23 and 22 (reversed) of the ARM `FPSCR` register.*
|
||||
|
||||
### Control instructions
|
||||
The following 2 control instructions are supported:
|
||||
RandomX has 8 integer registers `r0`-`r7` (group R) and a total of 12 floating point registers split into 3 groups: `a0`-`a3` (group A), `f0`-`f3` (group F) and `e0`-`e3` (group E). Integer registers are 64 bits wide, while floating point registers are 128 bits wide and contain a pair of floating point numbers. The lower and upper half of floating point registers are not separately addressable.
|
||||
|
||||
|weight|instruction|function|condition|
|
||||
|-|-|-|-|
|
||||
|20|CALL|near procedure call|(see condition table below)
|
||||
|22|RET|return from procedure|stack is not empty
|
||||
|
||||
Both instructions are conditional. If the condition evaluates to `false`, CALL and RET behave as "arithmetic no-op" and simply copy operand A into destination C without jumping.
|
||||
|
||||
##### CALL
|
||||
The CALL instruction uses a condition function, which takes the lower 32 bits of integer register `reg(b)` and the value `imm32` and evaluates a condition based on the `loc(b)` flag:
|
||||
|
||||
|loc(b)[2:0]|signed|jump condition|probability|*x86*|*ARM*
|
||||
|---|---|----------|-----|--|----|
|
||||
|000|no|`reg(b)[31:0] <= imm32`|0% - 100%|`JBE`|`BLS`
|
||||
|001|no|`reg(b)[31:0] > imm32`|0% - 100%|`JA`|`BHI`
|
||||
|010|yes|`reg(b)[31:0] - imm32 < 0`|50%|`JS`|`BMI`
|
||||
|011|yes|`reg(b)[31:0] - imm32 >= 0`|50%|`JNS`|`BPL`
|
||||
|100|yes|`reg(b)[31:0] - imm32` overflows|0% - 50%|`JO`|`BVS`
|
||||
|101|yes|`reg(b)[31:0] - imm32` doesn't overflow|50% - 100%|`JNO`|`BVC`
|
||||
|110|yes|`reg(b)[31:0] < imm32`|0% - 100%|`JL`|`BLT`
|
||||
|111|yes|`reg(b)[31:0] >= imm32`|0% - 100%|`JGE`|`BGE`
|
||||
|
||||
The 'signed' column specifies if the operands are interpreted as signed or unsigned 32-bit numbers. Column 'probability' lists the expected jump probability (range means that the actual value for a specific instruction depends on `imm32`). *Columns 'x86' and 'ARM' list the corresponding hardware instructions (following a `CMP` instruction).*
|
||||
|
||||
Taken CALL instruction pushes the values `A` and `pc` (program counter) onto the stack and then performs a forward jump relative to the value of `pc`. The forward offset is equal to `16 * (imm8[6:0] + 1)`. Maximum jump distance is therefore 128 instructions forward (this means that at least 4 correctly spaced CALL instructions are needed to form a loop in the program).
|
||||
|
||||
##### RET
|
||||
The RET instruction is taken only if the stack is not empty. Taken RET instruction pops the return address `raddr` from the stack (it's the instruction following the previous CALL), then pops a return value `retval` from the stack and sets `C = A XOR retval`. Finally, the instruction jumps back to `raddr`.
|
||||
*Table 1: Addressable register groups*
|
||||
|
||||
## Reference implementation
|
||||
A portable C++ implementation of all ALU and FPU instructions is available in [instructionsPortable.cpp](../src/instructionsPortable.cpp).
|
||||
|index|R|A|F|E|F+E|
|
||||
|--|--|--|--|--|--|
|
||||
|0|`r0`|`a0`|`f0`|`e0`|`f0`|
|
||||
|1|`r1`|`a1`|`f1`|`e1`|`f1`|
|
||||
|2|`r2`|`a2`|`f2`|`e2`|`f2`|
|
||||
|3|`r3`|`a3`|`f3`|`e3`|`f3`|
|
||||
|4|`r4`||||`e0`|
|
||||
|5|`r5`||||`e1`|
|
||||
|6|`r6`||||`e2`|
|
||||
|7|`r7`||||`e3`|
|
||||
|
||||
Besides the directly addressable registers above, there is a 2-bit `fprc` register for rounding control, which is an implicit destination register of the `CFROUND` instruction, and two architectural 32-bit registers `ma` and `mx`, which are not accessible to any instruction.
|
||||
|
||||
Integer registers `r0`-`r7` can be the source or the destination operands of integer instructions or may be used as address registers for loading the source operand from the memory (scratchpad).
|
||||
|
||||
Floating point registers `a0`-`a3` are read-only and may not be written to except at the moment a program is loaded into the VM. They can be the source operand of any floating point instruction. The value of these registers is restricted to the interval `[1, 4294967296)`.
|
||||
|
||||
Floating point registers `f0`-`f3` are the *additive* registers, which can be the destination of floating point addition and subtraction instructions. The absolute value of these registers will not exceed `1.0e+12`.
|
||||
|
||||
Floating point registers `e0`-`e3` are the *multiplicative* registers, which can be the destination of floating point multiplication, division and square root instructions. Their value is always positive.
|
||||
|
||||
## Instruction encoding
|
||||
|
||||
Each instruction word is 64 bits long and has the following format:
|
||||
|
||||
![Imgur](https://i.imgur.com/FtkWRwe.png)
|
||||
|
||||
### opcode
|
||||
There are 256 opcodes, which are distributed between 32 distinct instructions. Each instruction can be encoded using multiple opcodes (the number of opcodes specifies the frequency of the instruction in a random program).
|
||||
|
||||
*Table 2: Instruction groups*
|
||||
|
||||
|group|# instructions|# opcodes||
|
||||
|---------|-----------------|----|-|
|
||||
|integer |19|137|53.5%|
|
||||
|floating point |9|94|36.7%|
|
||||
|other |4|25|9.8%|
|
||||
||**32**|**256**|**100%**
|
||||
|
||||
Full description of all instructions: [isa-ops.md](isa-ops.md).
|
||||
|
||||
### dst
|
||||
Destination register. Only bits 0-1 (register groups A, F, E) or 0-2 (groups R, F+E) are used to encode a register according to Table 1.
|
||||
|
||||
### src
|
||||
|
||||
The `src` flag encodes a source operand register according to Table 1 (only bits 0-1 or 0-2 are used).
|
||||
|
||||
Immediate value `imm32` is used as the source operand in cases when `dst` and `src` encode the same register.
|
||||
|
||||
For register-memory instructions, the source operand determines the `address_base` value for calculating the memory address (see below).
|
||||
|
||||
### mod
|
||||
|
||||
The `mod` flag is encoded as:
|
||||
|
||||
*Table 3: mod flag encoding*
|
||||
|
||||
|`mod`|description|
|
||||
|----|--------|
|
||||
|0-1|`mod.mem` flag|
|
||||
|2-4|`mod.cond` flag|
|
||||
|5-7|Reserved|
|
||||
|
||||
The `mod.mem` flag determines the address mask when reading from or writing to memory:
|
||||
|
||||
*Table 3: memory address mask*
|
||||
|
||||
|`mod.mem`|`address_mask`|(scratchpad level)|
|
||||
|---------|-|---|
|
||||
|0|262136|(L2)|
|
||||
|1-3|16376|(L1)|
|
||||
|
||||
Table 3 applies to all memory accesses except for cases when the source operand is an immediate value. In that case, `address_mask` is equal to 2097144 (L3).
|
||||
|
||||
The address for reading/writing is calculated by applying bitwise AND operation to `address_base` and `address_mask`.
|
||||
|
||||
The `mod.cond` flag is used only by the `COND` instruction to select a condition to be tested.
|
||||
|
||||
### imm32
|
||||
A 32-bit immediate value that can be used as the source operand. The immediate value is sign-extended to 64 bits unless specified otherwise.
|
||||
|
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
@ -0,0 +1,123 @@
|
||||
/*
|
||||
Copyright (c) 2019 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#include "LightClientAsyncWorker.hpp"
|
||||
#include "dataset.hpp"
|
||||
#include "Cache.hpp"
|
||||
|
||||
namespace RandomX {
|
||||
|
||||
template<bool softAes>
|
||||
LightClientAsyncWorker<softAes>::LightClientAsyncWorker(const Cache* c) : ILightClientAsyncWorker(c), output(nullptr), hasWork(false),
|
||||
#ifdef TRACE
|
||||
sw(true),
|
||||
#endif
|
||||
workerThread(&LightClientAsyncWorker::runWorker, this) {
|
||||
|
||||
}
|
||||
|
||||
template<bool softAes>
|
||||
void LightClientAsyncWorker<softAes>::prepareBlock(addr_t addr) {
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": prepareBlock-enter " << addr / CacheLineSize << std::endl;
|
||||
#endif
|
||||
{
|
||||
std::lock_guard<std::mutex> lk(mutex);
|
||||
startBlock = addr / CacheLineSize;
|
||||
blockCount = 1;
|
||||
output = currentLine.data();
|
||||
hasWork = true;
|
||||
}
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": prepareBlock-notify " << startBlock << "/" << blockCount << std::endl;
|
||||
#endif
|
||||
notifier.notify_one();
|
||||
}
|
||||
|
||||
template<bool softAes>
|
||||
const uint64_t* LightClientAsyncWorker<softAes>::getBlock(addr_t addr) {
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": getBlock-enter " << addr / CacheLineSize << std::endl;
|
||||
#endif
|
||||
uint32_t currentBlock = addr / CacheLineSize;
|
||||
if (currentBlock != startBlock || output != currentLine.data()) {
|
||||
initBlock(cache->getCache(), (uint8_t*)currentLine.data(), currentBlock, cache->getKeys());
|
||||
}
|
||||
else {
|
||||
sync();
|
||||
}
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": getBlock-return " << addr / CacheLineSize << std::endl;
|
||||
#endif
|
||||
return currentLine.data();
|
||||
}
|
||||
|
||||
template<bool softAes>
|
||||
void LightClientAsyncWorker<softAes>::prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) {
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": prepareBlocks-enter " << startBlock << "/" << blockCount << std::endl;
|
||||
#endif
|
||||
{
|
||||
std::lock_guard<std::mutex> lk(mutex);
|
||||
this->startBlock = startBlock;
|
||||
this->blockCount = blockCount;
|
||||
output = out;
|
||||
hasWork = true;
|
||||
notifier.notify_one();
|
||||
}
|
||||
}
|
||||
|
||||
template<bool softAes>
|
||||
void LightClientAsyncWorker<softAes>::getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) {
|
||||
for (uint32_t i = 0; i < blockCount; ++i) {
|
||||
initBlock(cache->getCache(), (uint8_t*)out + CacheLineSize * i, startBlock + i, cache->getKeys());
|
||||
}
|
||||
}
|
||||
|
||||
template<bool softAes>
|
||||
void LightClientAsyncWorker<softAes>::sync() {
|
||||
std::unique_lock<std::mutex> lk(mutex);
|
||||
notifier.wait(lk, [this] { return !hasWork; });
|
||||
}
|
||||
|
||||
template<bool softAes>
|
||||
void LightClientAsyncWorker<softAes>::runWorker() {
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": runWorker-enter " << std::endl;
|
||||
#endif
|
||||
for (;;) {
|
||||
std::unique_lock<std::mutex> lk(mutex);
|
||||
notifier.wait(lk, [this] { return hasWork; });
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": runWorker-getBlocks " << startBlock << "/" << blockCount << std::endl;
|
||||
#endif
|
||||
//getBlocks(output, startBlock, blockCount);
|
||||
initBlock(cache->getCache(), (uint8_t*)output, startBlock, cache->getKeys());
|
||||
hasWork = false;
|
||||
#ifdef TRACE
|
||||
std::cout << sw.getElapsed() << ": runWorker-finished " << startBlock << "/" << blockCount << std::endl;
|
||||
#endif
|
||||
lk.unlock();
|
||||
notifier.notify_one();
|
||||
}
|
||||
}
|
||||
|
||||
template class LightClientAsyncWorker<true>;
|
||||
template class LightClientAsyncWorker<false>;
|
||||
}
|
@ -0,0 +1,60 @@
|
||||
/*
|
||||
Copyright (c) 2019 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
//#define TRACE
|
||||
#include "common.hpp"
|
||||
|
||||
#include <thread>
|
||||
#include <mutex>
|
||||
#include <condition_variable>
|
||||
#include <array>
|
||||
#ifdef TRACE
|
||||
#include "Stopwatch.hpp"
|
||||
#include <iostream>
|
||||
#endif
|
||||
|
||||
namespace RandomX {
|
||||
|
||||
class Cache;
|
||||
|
||||
using DatasetLine = std::array<uint64_t, CacheLineSize / sizeof(uint64_t)>;
|
||||
|
||||
template<bool softAes>
|
||||
class LightClientAsyncWorker : public ILightClientAsyncWorker {
|
||||
public:
|
||||
LightClientAsyncWorker(const Cache*);
|
||||
void prepareBlock(addr_t) final;
|
||||
void prepareBlocks(void* out, uint32_t startBlock, uint32_t blockCount) final;
|
||||
const uint64_t* getBlock(addr_t) final;
|
||||
void getBlocks(void* out, uint32_t startBlock, uint32_t blockCount) final;
|
||||
void sync() final;
|
||||
private:
|
||||
void runWorker();
|
||||
std::condition_variable notifier;
|
||||
std::mutex mutex;
|
||||
alignas(16) DatasetLine currentLine;
|
||||
void* output;
|
||||
uint32_t startBlock, blockCount;
|
||||
bool hasWork;
|
||||
#ifdef TRACE
|
||||
Stopwatch sw;
|
||||
#endif
|
||||
std::thread workerThread;
|
||||
};
|
||||
}
|
@ -1,72 +0,0 @@
|
||||
/*
|
||||
Copyright (c) 2018 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
// Based on:
|
||||
// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
|
||||
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
|
||||
|
||||
#pragma once
|
||||
#include <cstdint>
|
||||
|
||||
#if defined(_MSC_VER)
|
||||
#pragma warning (disable : 4146)
|
||||
#endif
|
||||
|
||||
class Pcg32 {
|
||||
public:
|
||||
typedef uint32_t result_type;
|
||||
static constexpr result_type min() { return 0U; }
|
||||
static constexpr result_type max() { return UINT32_MAX; }
|
||||
Pcg32(const void* seed) {
|
||||
auto* u64seed = (const uint64_t*)seed;
|
||||
state = *(u64seed + 0);
|
||||
inc = *(u64seed + 1) | 1ull;
|
||||
}
|
||||
Pcg32(uint64_t state, uint64_t inc) : state(state), inc(inc | 1ull) {
|
||||
}
|
||||
result_type operator()() {
|
||||
return next();
|
||||
}
|
||||
result_type getUniform(result_type min, result_type max) {
|
||||
const result_type range = max - min;
|
||||
const result_type erange = range + 1;
|
||||
result_type ret;
|
||||
|
||||
for (;;) {
|
||||
ret = next();
|
||||
if (ret / erange < UINT32_MAX / erange || UINT32_MAX % erange == range) {
|
||||
ret %= erange;
|
||||
break;
|
||||
}
|
||||
}
|
||||
return ret + min;
|
||||
}
|
||||
private:
|
||||
uint64_t state;
|
||||
uint64_t inc;
|
||||
result_type next() {
|
||||
uint64_t oldstate = state;
|
||||
// Advance internal state
|
||||
state = oldstate * 6364136223846793005ULL + inc;
|
||||
// Calculate output function (XSH RR), uses old state for max ILP
|
||||
uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
|
||||
uint32_t rot = oldstate >> 59u;
|
||||
return (xorshifted >> rot) | (xorshifted << (-rot & 31));
|
||||
}
|
||||
};
|
@ -0,0 +1,28 @@
|
||||
mov rdx, rax
|
||||
and eax, 2097088
|
||||
lea rcx, [rsi+rax]
|
||||
push rcx
|
||||
xor r8, qword ptr [rcx+0]
|
||||
xor r9, qword ptr [rcx+8]
|
||||
xor r10, qword ptr [rcx+16]
|
||||
xor r11, qword ptr [rcx+24]
|
||||
xor r12, qword ptr [rcx+32]
|
||||
xor r13, qword ptr [rcx+40]
|
||||
xor r14, qword ptr [rcx+48]
|
||||
xor r15, qword ptr [rcx+56]
|
||||
ror rdx, 32
|
||||
and edx, 2097088
|
||||
lea rcx, [rsi+rdx]
|
||||
push rcx
|
||||
cvtdq2pd xmm0, qword ptr [rcx+0]
|
||||
cvtdq2pd xmm1, qword ptr [rcx+8]
|
||||
cvtdq2pd xmm2, qword ptr [rcx+16]
|
||||
cvtdq2pd xmm3, qword ptr [rcx+24]
|
||||
cvtdq2pd xmm4, qword ptr [rcx+32]
|
||||
cvtdq2pd xmm5, qword ptr [rcx+40]
|
||||
cvtdq2pd xmm6, qword ptr [rcx+48]
|
||||
cvtdq2pd xmm7, qword ptr [rcx+56]
|
||||
andps xmm4, xmm14
|
||||
andps xmm5, xmm14
|
||||
andps xmm6, xmm14
|
||||
andps xmm7, xmm14
|
@ -0,0 +1,18 @@
|
||||
pop rcx
|
||||
mov qword ptr [rcx+0], r8
|
||||
mov qword ptr [rcx+8], r9
|
||||
mov qword ptr [rcx+16], r10
|
||||
mov qword ptr [rcx+24], r11
|
||||
mov qword ptr [rcx+32], r12
|
||||
mov qword ptr [rcx+40], r13
|
||||
mov qword ptr [rcx+48], r14
|
||||
mov qword ptr [rcx+56], r15
|
||||
pop rcx
|
||||
mulpd xmm0, xmm4
|
||||
mulpd xmm1, xmm5
|
||||
mulpd xmm2, xmm6
|
||||
mulpd xmm3, xmm7
|
||||
movapd xmmword ptr [rcx+0], xmm0
|
||||
movapd xmmword ptr [rcx+16], xmm1
|
||||
movapd xmmword ptr [rcx+32], xmm2
|
||||
movapd xmmword ptr [rcx+48], xmm3
|
@ -1,63 +1,21 @@
|
||||
mov rbp, rsp ;# beginning of VM stack
|
||||
mov rdi, 1048577 ;# number of VM instructions to execute + 1
|
||||
|
||||
xorps xmm10, xmm10
|
||||
cmpeqpd xmm10, xmm10
|
||||
psrlq xmm10, 1 ;# mask for absolute value = 0x7fffffffffffffff7fffffffffffffff
|
||||
|
||||
;# reset rounding mode
|
||||
mov dword ptr [rsp-8], 40896
|
||||
ldmxcsr dword ptr [rsp-8]
|
||||
|
||||
;# load integer registers
|
||||
mov r8, qword ptr [rcx+0]
|
||||
mov r9, qword ptr [rcx+8]
|
||||
mov r10, qword ptr [rcx+16]
|
||||
mov r11, qword ptr [rcx+24]
|
||||
mov r12, qword ptr [rcx+32]
|
||||
mov r13, qword ptr [rcx+40]
|
||||
mov r14, qword ptr [rcx+48]
|
||||
mov r15, qword ptr [rcx+56]
|
||||
|
||||
;# initialize floating point registers
|
||||
xorps xmm8, xmm8
|
||||
cvtsi2sd xmm8, qword ptr [rcx+72]
|
||||
pslldq xmm8, 8
|
||||
cvtsi2sd xmm8, qword ptr [rcx+64]
|
||||
|
||||
xorps xmm9, xmm9
|
||||
cvtsi2sd xmm9, qword ptr [rcx+88]
|
||||
pslldq xmm9, 8
|
||||
cvtsi2sd xmm9, qword ptr [rcx+80]
|
||||
|
||||
xorps xmm2, xmm2
|
||||
cvtsi2sd xmm2, qword ptr [rcx+104]
|
||||
pslldq xmm2, 8
|
||||
cvtsi2sd xmm2, qword ptr [rcx+96]
|
||||
|
||||
xorps xmm3, xmm3
|
||||
cvtsi2sd xmm3, qword ptr [rcx+120]
|
||||
pslldq xmm3, 8
|
||||
cvtsi2sd xmm3, qword ptr [rcx+112]
|
||||
|
||||
lea rcx, [rcx+64]
|
||||
|
||||
xorps xmm4, xmm4
|
||||
cvtsi2sd xmm4, qword ptr [rcx+72]
|
||||
pslldq xmm4, 8
|
||||
cvtsi2sd xmm4, qword ptr [rcx+64]
|
||||
|
||||
xorps xmm5, xmm5
|
||||
cvtsi2sd xmm5, qword ptr [rcx+88]
|
||||
pslldq xmm5, 8
|
||||
cvtsi2sd xmm5, qword ptr [rcx+80]
|
||||
|
||||
xorps xmm6, xmm6
|
||||
cvtsi2sd xmm6, qword ptr [rcx+104]
|
||||
pslldq xmm6, 8
|
||||
cvtsi2sd xmm6, qword ptr [rcx+96]
|
||||
|
||||
xorps xmm7, xmm7
|
||||
cvtsi2sd xmm7, qword ptr [rcx+120]
|
||||
pslldq xmm7, 8
|
||||
cvtsi2sd xmm7, qword ptr [rcx+112]
|
||||
mov rax, rbp
|
||||
|
||||
;# zero integer registers
|
||||
xor r8, r8
|
||||
xor r9, r9
|
||||
xor r10, r10
|
||||
xor r11, r11
|
||||
xor r12, r12
|
||||
xor r13, r13
|
||||
xor r14, r14
|
||||
xor r15, r15
|
||||
|
||||
;# load constant registers
|
||||
lea rcx, [rcx+120]
|
||||
movapd xmm8, xmmword ptr [rcx+72]
|
||||
movapd xmm9, xmmword ptr [rcx+88]
|
||||
movapd xmm10, xmmword ptr [rcx+104]
|
||||
movapd xmm11, xmmword ptr [rcx+120]
|
||||
movapd xmm13, xmmword ptr [minDbl]
|
||||
movapd xmm14, xmmword ptr [absMask]
|
||||
movapd xmm15, xmmword ptr [signMask]
|
||||
|
@ -0,0 +1,17 @@
|
||||
xor rbp, rax ;# modify "mx"
|
||||
xor eax, eax
|
||||
and rbp, -64 ;# align "mx" to the start of a cache line
|
||||
mov edx, ebp ;# edx = mx
|
||||
prefetchnta byte ptr [rdi+rdx]
|
||||
ror rbp, 32 ;# swap "ma" and "mx"
|
||||
mov edx, ebp ;# edx = ma
|
||||
lea rcx, [rdi+rdx] ;# dataset cache line
|
||||
xor r8, qword ptr [rcx+0]
|
||||
xor r9, qword ptr [rcx+8]
|
||||
xor r10, qword ptr [rcx+16]
|
||||
xor r11, qword ptr [rcx+24]
|
||||
xor r12, qword ptr [rcx+32]
|
||||
xor r13, qword ptr [rcx+40]
|
||||
xor r14, qword ptr [rcx+48]
|
||||
xor r15, qword ptr [rcx+56]
|
||||
|
@ -1,13 +0,0 @@
|
||||
mov edx, dword ptr [rbx] ;# ma
|
||||
mov rax, qword ptr [rbx+8] ;# dataset
|
||||
cvtdq2pd xmm0, qword ptr [rax+rdx]
|
||||
add dword ptr [rbx], 8
|
||||
xor ecx, dword ptr [rbx+4] ;# mx
|
||||
mov dword ptr [rbx+4], ecx
|
||||
test ecx, 65528
|
||||
jne short rx_read_dataset_f_ret
|
||||
and ecx, -8
|
||||
mov dword ptr [rbx], ecx
|
||||
prefetcht0 byte ptr [rax+rcx]
|
||||
rx_read_dataset_f_ret:
|
||||
ret 0
|
@ -1,13 +0,0 @@
|
||||
mov eax, dword ptr [rbx] ;# ma
|
||||
mov rdx, qword ptr [rbx+8] ;# dataset
|
||||
mov rax, qword ptr [rdx+rax]
|
||||
add dword ptr [rbx], 8
|
||||
xor ecx, dword ptr [rbx+4] ;# mx
|
||||
mov dword ptr [rbx+4], ecx
|
||||
test ecx, 65528
|
||||
jne short rx_read_dataset_r_ret
|
||||
and ecx, -8
|
||||
mov dword ptr [rbx], ecx
|
||||
prefetcht0 byte ptr [rdx+rcx]
|
||||
rx_read_dataset_r_ret:
|
||||
ret 0
|
@ -0,0 +1,154 @@
|
||||
;# 90 address transformations
|
||||
;# forced REX prefix is used to make all transformations 4 bytes long
|
||||
lea eax, [rax+rax*8+109]
|
||||
db 64
|
||||
xor eax, 96
|
||||
lea eax, [rax+rax*8-19]
|
||||
db 64
|
||||
add eax, -98
|
||||
db 64
|
||||
add eax, -21
|
||||
db 64
|
||||
xor eax, -80
|
||||
lea eax, [rax+rax*8-92]
|
||||
db 64
|
||||
add eax, 113
|
||||
lea eax, [rax+rax*8+100]
|
||||
db 64
|
||||
add eax, -39
|
||||
db 64
|
||||
xor eax, 120
|
||||
lea eax, [rax+rax*8-119]
|
||||
db 64
|
||||
add eax, -113
|
||||
db 64
|
||||
add eax, 111
|
||||
db 64
|
||||
xor eax, 104
|
||||
lea eax, [rax+rax*8-83]
|
||||
lea eax, [rax+rax*8+127]
|
||||
db 64
|
||||
xor eax, -112
|
||||
db 64
|
||||
add eax, 89
|
||||
db 64
|
||||
add eax, -32
|
||||
db 64
|
||||
add eax, 104
|
||||
db 64
|
||||
xor eax, -120
|
||||
db 64
|
||||
xor eax, 24
|
||||
lea eax, [rax+rax*8+9]
|
||||
db 64
|
||||
add eax, -31
|
||||
db 64
|
||||
xor eax, -16
|
||||
db 64
|
||||
add eax, 68
|
||||
lea eax, [rax+rax*8-110]
|
||||
db 64
|
||||
xor eax, 64
|
||||
db 64
|
||||
xor eax, -40
|
||||
db 64
|
||||
xor eax, -8
|
||||
db 64
|
||||
add eax, -10
|
||||
db 64
|
||||
xor eax, -32
|
||||
db 64
|
||||
add eax, 14
|
||||
lea eax, [rax+rax*8-46]
|
||||
db 64
|
||||
xor eax, -104
|
||||
lea eax, [rax+rax*8+36]
|
||||
db 64
|
||||
add eax, 100
|
||||
lea eax, [rax+rax*8-65]
|
||||
lea eax, [rax+rax*8+27]
|
||||
lea eax, [rax+rax*8+91]
|
||||
db 64
|
||||
add eax, -101
|
||||
db 64
|
||||
add eax, -94
|
||||
lea eax, [rax+rax*8-10]
|
||||
db 64
|
||||
xor eax, 80
|
||||
db 64
|
||||
add eax, -108
|
||||
db 64
|
||||
add eax, -58
|
||||
db 64
|
||||
xor eax, 48
|
||||
lea eax, [rax+rax*8+73]
|
||||
db 64
|
||||
xor eax, -48
|
||||
db 64
|
||||
xor eax, 32
|
||||
db 64
|
||||
xor eax, -96
|
||||
db 64
|
||||
add eax, 118
|
||||
db 64
|
||||
add eax, 91
|
||||
lea eax, [rax+rax*8+18]
|
||||
db 64
|
||||
add eax, -11
|
||||
lea eax, [rax+rax*8+63]
|
||||
db 64
|
||||
add eax, 114
|
||||
lea eax, [rax+rax*8+45]
|
||||
db 64
|
||||
add eax, -67
|
||||
db 64
|
||||
add eax, 53
|
||||
lea eax, [rax+rax*8-101]
|
||||
lea eax, [rax+rax*8-1]
|
||||
db 64
|
||||
xor eax, 16
|
||||
lea eax, [rax+rax*8-37]
|
||||
lea eax, [rax+rax*8-28]
|
||||
lea eax, [rax+rax*8-55]
|
||||
db 64
|
||||
xor eax, -88
|
||||
db 64
|
||||
xor eax, -72
|
||||
db 64
|
||||
add eax, 36
|
||||
db 64
|
||||
xor eax, -56
|
||||
db 64
|
||||
add eax, 116
|
||||
db 64
|
||||
xor eax, 88
|
||||
db 64
|
||||
xor eax, -128
|
||||
db 64
|
||||
add eax, 50
|
||||
db 64
|
||||
add eax, 105
|
||||
db 64
|
||||
add eax, -37
|
||||
db 64
|
||||
xor eax, 112
|
||||
db 64
|
||||
xor eax, 8
|
||||
db 64
|
||||
xor eax, -24
|
||||
lea eax, [rax+rax*8+118]
|
||||
db 64
|
||||
xor eax, 72
|
||||
db 64
|
||||
xor eax, -64
|
||||
db 64
|
||||
add eax, 40
|
||||
lea eax, [rax+rax*8-74]
|
||||
lea eax, [rax+rax*8+82]
|
||||
lea eax, [rax+rax*8+54]
|
||||
db 64
|
||||
xor eax, 56
|
||||
db 64
|
||||
xor eax, 40
|
||||
db 64
|
||||
add eax, 87
|
@ -0,0 +1,6 @@
|
||||
minDbl:
|
||||
db 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 16, 0
|
||||
absMask:
|
||||
db 255, 255, 255, 255, 255, 255, 255, 127, 255, 255, 255, 255, 255, 255, 255, 127
|
||||
signMask:
|
||||
db 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 128
|
@ -0,0 +1,87 @@
|
||||
mov rax, 1613783669344650115
|
||||
add rax, rcx
|
||||
mul rax
|
||||
sub rax, rdx ;# 1
|
||||
mul rax
|
||||
sub rax, rdx ;# 2
|
||||
mul rax
|
||||
sub rax, rdx ;# 3
|
||||
mul rax
|
||||
sub rax, rdx ;# 4
|
||||
mul rax
|
||||
sub rax, rdx ;# 5
|
||||
mul rax
|
||||
sub rax, rdx ;# 6
|
||||
mul rax
|
||||
sub rax, rdx ;# 7
|
||||
mul rax
|
||||
sub rax, rdx ;# 8
|
||||
mul rax
|
||||
sub rax, rdx ;# 9
|
||||
mul rax
|
||||
sub rax, rdx ;# 10
|
||||
mul rax
|
||||
sub rax, rdx ;# 11
|
||||
mul rax
|
||||
sub rax, rdx ;# 12
|
||||
mul rax
|
||||
sub rax, rdx ;# 13
|
||||
mul rax
|
||||
sub rax, rdx ;# 14
|
||||
mul rax
|
||||
sub rax, rdx ;# 15
|
||||
mul rax
|
||||
sub rax, rdx ;# 16
|
||||
mul rax
|
||||
sub rax, rdx ;# 17
|
||||
mul rax
|
||||
sub rax, rdx ;# 18
|
||||
mul rax
|
||||
sub rax, rdx ;# 19
|
||||
mul rax
|
||||
sub rax, rdx ;# 20
|
||||
mul rax
|
||||
sub rax, rdx ;# 21
|
||||
mul rax
|
||||
sub rax, rdx ;# 22
|
||||
mul rax
|
||||
sub rax, rdx ;# 23
|
||||
mul rax
|
||||
sub rax, rdx ;# 24
|
||||
mul rax
|
||||
sub rax, rdx ;# 25
|
||||
mul rax
|
||||
sub rax, rdx ;# 26
|
||||
mul rax
|
||||
sub rax, rdx ;# 27
|
||||
mul rax
|
||||
sub rax, rdx ;# 28
|
||||
mul rax
|
||||
sub rax, rdx ;# 29
|
||||
mul rax
|
||||
sub rax, rdx ;# 30
|
||||
mul rax
|
||||
sub rax, rdx ;# 31
|
||||
mul rax
|
||||
sub rax, rdx ;# 32
|
||||
mul rax
|
||||
sub rax, rdx ;# 33
|
||||
mul rax
|
||||
sub rax, rdx ;# 34
|
||||
mul rax
|
||||
sub rax, rdx ;# 35
|
||||
mul rax
|
||||
sub rax, rdx ;# 36
|
||||
mul rax
|
||||
sub rax, rdx ;# 37
|
||||
mul rax
|
||||
sub rax, rdx ;# 38
|
||||
mul rax
|
||||
sub rax, rdx ;# 39
|
||||
mul rax
|
||||
sub rax, rdx ;# 40
|
||||
mul rax
|
||||
sub rax, rdx ;# 41
|
||||
mul rax
|
||||
sub rax, rdx ;# 42
|
||||
ret
|
@ -0,0 +1,99 @@
|
||||
#pragma once
|
||||
#include <stdint.h>
|
||||
#include <string.h>
|
||||
|
||||
#if defined(_MSC_VER)
|
||||
#define FORCE_INLINE __inline
|
||||
#elif defined(__GNUC__) || defined(__clang__)
|
||||
#define FORCE_INLINE __inline__
|
||||
#else
|
||||
#define FORCE_INLINE
|
||||
#endif
|
||||
|
||||
/* Argon2 Team - Begin Code */
|
||||
/*
|
||||
Not an exhaustive list, but should cover the majority of modern platforms
|
||||
Additionally, the code will always be correct---this is only a performance
|
||||
tweak.
|
||||
*/
|
||||
#if (defined(__BYTE_ORDER__) && \
|
||||
(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) || \
|
||||
defined(__LITTLE_ENDIAN__) || defined(__ARMEL__) || defined(__MIPSEL__) || \
|
||||
defined(__AARCH64EL__) || defined(__amd64__) || defined(__i386__) || \
|
||||
defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) || \
|
||||
defined(_M_ARM)
|
||||
#define NATIVE_LITTLE_ENDIAN
|
||||
#endif
|
||||
/* Argon2 Team - End Code */
|
||||
|
||||
static FORCE_INLINE uint32_t load32(const void *src) {
|
||||
#if defined(NATIVE_LITTLE_ENDIAN)
|
||||
uint32_t w;
|
||||
memcpy(&w, src, sizeof w);
|
||||
return w;
|
||||
#else
|
||||
const uint8_t *p = (const uint8_t *)src;
|
||||
uint32_t w = *p++;
|
||||
w |= (uint32_t)(*p++) << 8;
|
||||
w |= (uint32_t)(*p++) << 16;
|
||||
w |= (uint32_t)(*p++) << 24;
|
||||
return w;
|
||||
#endif
|
||||
}
|
||||
|
||||
static FORCE_INLINE uint64_t load64(const void *src) {
|
||||
#if defined(NATIVE_LITTLE_ENDIAN)
|
||||
uint64_t w;
|
||||
memcpy(&w, src, sizeof w);
|
||||
return w;
|
||||
#else
|
||||
const uint8_t *p = (const uint8_t *)src;
|
||||
uint64_t w = *p++;
|
||||
w |= (uint64_t)(*p++) << 8;
|
||||
w |= (uint64_t)(*p++) << 16;
|
||||
w |= (uint64_t)(*p++) << 24;
|
||||
w |= (uint64_t)(*p++) << 32;
|
||||
w |= (uint64_t)(*p++) << 40;
|
||||
w |= (uint64_t)(*p++) << 48;
|
||||
w |= (uint64_t)(*p++) << 56;
|
||||
return w;
|
||||
#endif
|
||||
}
|
||||
|
||||
static FORCE_INLINE void store32(void *dst, uint32_t w) {
|
||||
#if defined(NATIVE_LITTLE_ENDIAN)
|
||||
memcpy(dst, &w, sizeof w);
|
||||
#else
|
||||
uint8_t *p = (uint8_t *)dst;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
#endif
|
||||
}
|
||||
|
||||
static FORCE_INLINE void store64(void *dst, uint64_t w) {
|
||||
#if defined(NATIVE_LITTLE_ENDIAN)
|
||||
memcpy(dst, &w, sizeof w);
|
||||
#else
|
||||
uint8_t *p = (uint8_t *)dst;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
w >>= 8;
|
||||
*p++ = (uint8_t)w;
|
||||
#endif
|
||||
}
|
@ -0,0 +1,169 @@
|
||||
/*
|
||||
Reference implementations of computing and using the "magic number" approach to dividing
|
||||
by constants, including codegen instructions. The unsigned division incorporates the
|
||||
"round down" optimization per ridiculous_fish.
|
||||
|
||||
This is free and unencumbered software. Any copyright is dedicated to the Public Domain.
|
||||
*/
|
||||
|
||||
#include <limits.h> //for CHAR_BIT
|
||||
#include <assert.h>
|
||||
|
||||
#include "divideByConstantCodegen.h"
|
||||
|
||||
struct magicu_info compute_unsigned_magic_info(unsigned_type D, unsigned num_bits) {
|
||||
|
||||
//The numerator must fit in a unsigned_type
|
||||
assert(num_bits > 0 && num_bits <= sizeof(unsigned_type) * CHAR_BIT);
|
||||
|
||||
// D must be larger than zero and not a power of 2
|
||||
assert(D & (D - 1));
|
||||
|
||||
// The eventual result
|
||||
struct magicu_info result;
|
||||
|
||||
// Bits in a unsigned_type
|
||||
const unsigned UINT_BITS = sizeof(unsigned_type) * CHAR_BIT;
|
||||
|
||||
// The extra shift implicit in the difference between UINT_BITS and num_bits
|
||||
const unsigned extra_shift = UINT_BITS - num_bits;
|
||||
|
||||
// The initial power of 2 is one less than the first one that can possibly work
|
||||
const unsigned_type initial_power_of_2 = (unsigned_type)1 << (UINT_BITS - 1);
|
||||
|
||||
// The remainder and quotient of our power of 2 divided by d
|
||||
unsigned_type quotient = initial_power_of_2 / D, remainder = initial_power_of_2 % D;
|
||||
|
||||
// ceil(log_2 D)
|
||||
unsigned ceil_log_2_D;
|
||||
|
||||
// The magic info for the variant "round down" algorithm
|
||||
unsigned_type down_multiplier = 0;
|
||||
unsigned down_exponent = 0;
|
||||
int has_magic_down = 0;
|
||||
|
||||
// Compute ceil(log_2 D)
|
||||
ceil_log_2_D = 0;
|
||||
unsigned_type tmp;
|
||||
for (tmp = D; tmp > 0; tmp >>= 1)
|
||||
ceil_log_2_D += 1;
|
||||
|
||||
|
||||
// Begin a loop that increments the exponent, until we find a power of 2 that works.
|
||||
unsigned exponent;
|
||||
for (exponent = 0; ; exponent++) {
|
||||
// Quotient and remainder is from previous exponent; compute it for this exponent.
|
||||
if (remainder >= D - remainder) {
|
||||
// Doubling remainder will wrap around D
|
||||
quotient = quotient * 2 + 1;
|
||||
remainder = remainder * 2 - D;
|
||||
}
|
||||
else {
|
||||
// Remainder will not wrap
|
||||
quotient = quotient * 2;
|
||||
remainder = remainder * 2;
|
||||
}
|
||||
|
||||
// We're done if this exponent works for the round_up algorithm.
|
||||
// Note that exponent may be larger than the maximum shift supported,
|
||||
// so the check for >= ceil_log_2_D is critical.
|
||||
if ((exponent + extra_shift >= ceil_log_2_D) || (D - remainder) <= ((unsigned_type)1 << (exponent + extra_shift)))
|
||||
break;
|
||||
|
||||
// Set magic_down if we have not set it yet and this exponent works for the round_down algorithm
|
||||
if (!has_magic_down && remainder <= ((unsigned_type)1 << (exponent + extra_shift))) {
|
||||
has_magic_down = 1;
|
||||
down_multiplier = quotient;
|
||||
down_exponent = exponent;
|
||||
}
|
||||
}
|
||||
|
||||
if (exponent < ceil_log_2_D) {
|
||||
// magic_up is efficient
|
||||
result.multiplier = quotient + 1;
|
||||
result.pre_shift = 0;
|
||||
result.post_shift = exponent;
|
||||
result.increment = 0;
|
||||
}
|
||||
else if (D & 1) {
|
||||
// Odd divisor, so use magic_down, which must have been set
|
||||
assert(has_magic_down);
|
||||
result.multiplier = down_multiplier;
|
||||
result.pre_shift = 0;
|
||||
result.post_shift = down_exponent;
|
||||
result.increment = 1;
|
||||
}
|
||||
else {
|
||||
// Even divisor, so use a prefix-shifted dividend
|
||||
unsigned pre_shift = 0;
|
||||
unsigned_type shifted_D = D;
|
||||
while ((shifted_D & 1) == 0) {
|
||||
shifted_D >>= 1;
|
||||
pre_shift += 1;
|
||||
}
|
||||
result = compute_unsigned_magic_info(shifted_D, num_bits - pre_shift);
|
||||
assert(result.increment == 0 && result.pre_shift == 0); //expect no increment or pre_shift in this path
|
||||
result.pre_shift = pre_shift;
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
struct magics_info compute_signed_magic_info(signed_type D) {
|
||||
// D must not be zero and must not be a power of 2 (or its negative)
|
||||
assert(D != 0 && (D & -D) != D && (D & -D) != -D);
|
||||
|
||||
// Our result
|
||||
struct magics_info result;
|
||||
|
||||
// Bits in an signed_type
|
||||
const unsigned SINT_BITS = sizeof(signed_type) * CHAR_BIT;
|
||||
|
||||
// Absolute value of D (we know D is not the most negative value since that's a power of 2)
|
||||
const unsigned_type abs_d = (D < 0 ? -D : D);
|
||||
|
||||
// The initial power of 2 is one less than the first one that can possibly work
|
||||
// "two31" in Warren
|
||||
unsigned exponent = SINT_BITS - 1;
|
||||
const unsigned_type initial_power_of_2 = (unsigned_type)1 << exponent;
|
||||
|
||||
// Compute the absolute value of our "test numerator,"
|
||||
// which is the largest dividend whose remainder with d is d-1.
|
||||
// This is called anc in Warren.
|
||||
const unsigned_type tmp = initial_power_of_2 + (D < 0);
|
||||
const unsigned_type abs_test_numer = tmp - 1 - tmp % abs_d;
|
||||
|
||||
// Initialize our quotients and remainders (q1, r1, q2, r2 in Warren)
|
||||
unsigned_type quotient1 = initial_power_of_2 / abs_test_numer, remainder1 = initial_power_of_2 % abs_test_numer;
|
||||
unsigned_type quotient2 = initial_power_of_2 / abs_d, remainder2 = initial_power_of_2 % abs_d;
|
||||
unsigned_type delta;
|
||||
|
||||
// Begin our loop
|
||||
do {
|
||||
// Update the exponent
|
||||
exponent++;
|
||||
|
||||
// Update quotient1 and remainder1
|
||||
quotient1 *= 2;
|
||||
remainder1 *= 2;
|
||||
if (remainder1 >= abs_test_numer) {
|
||||
quotient1 += 1;
|
||||
remainder1 -= abs_test_numer;
|
||||
}
|
||||
|
||||
// Update quotient2 and remainder2
|
||||
quotient2 *= 2;
|
||||
remainder2 *= 2;
|
||||
if (remainder2 >= abs_d) {
|
||||
quotient2 += 1;
|
||||
remainder2 -= abs_d;
|
||||
}
|
||||
|
||||
// Keep going as long as (2**exponent) / abs_d <= delta
|
||||
delta = abs_d - remainder2;
|
||||
} while (quotient1 < delta || (quotient1 == delta && remainder1 == 0));
|
||||
|
||||
result.multiplier = quotient2 + 1;
|
||||
if (D < 0) result.multiplier = -result.multiplier;
|
||||
result.shift = exponent - SINT_BITS;
|
||||
return result;
|
||||
}
|
@ -0,0 +1,117 @@
|
||||
/*
|
||||
Copyright (c) 2018 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#pragma once
|
||||
#include <stdint.h>
|
||||
|
||||
#if defined(__cplusplus)
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
typedef uint64_t unsigned_type;
|
||||
typedef int64_t signed_type;
|
||||
|
||||
/* Computes "magic info" for performing signed division by a fixed integer D.
|
||||
The type 'signed_type' is assumed to be defined as a signed integer type large enough
|
||||
to hold both the dividend and the divisor.
|
||||
Here >> is arithmetic (signed) shift, and >>> is logical shift.
|
||||
|
||||
To emit code for n/d, rounding towards zero, use the following sequence:
|
||||
|
||||
m = compute_signed_magic_info(D)
|
||||
emit("result = (m.multiplier * n) >> SINT_BITS");
|
||||
if d > 0 and m.multiplier < 0: emit("result += n")
|
||||
if d < 0 and m.multiplier > 0: emit("result -= n")
|
||||
if m.post_shift > 0: emit("result >>= m.shift")
|
||||
emit("result += (result < 0)")
|
||||
|
||||
The shifts by SINT_BITS may be "free" if the high half of the full multiply
|
||||
is put in a separate register.
|
||||
|
||||
The final add can of course be implemented via the sign bit, e.g.
|
||||
result += (result >>> (SINT_BITS - 1))
|
||||
or
|
||||
result -= (result >> (SINT_BITS - 1))
|
||||
|
||||
This code is heavily indebted to Hacker's Delight by Henry Warren.
|
||||
See http://www.hackersdelight.org/HDcode/magic.c.txt
|
||||
Used with permission from http://www.hackersdelight.org/permissions.htm
|
||||
*/
|
||||
|
||||
struct magics_info {
|
||||
signed_type multiplier; // the "magic number" multiplier
|
||||
unsigned shift; // shift for the dividend after multiplying
|
||||
};
|
||||
struct magics_info compute_signed_magic_info(signed_type D);
|
||||
|
||||
|
||||
/* Computes "magic info" for performing unsigned division by a fixed positive integer D.
|
||||
The type 'unsigned_type' is assumed to be defined as an unsigned integer type large enough
|
||||
to hold both the dividend and the divisor. num_bits can be set appropriately if n is
|
||||
known to be smaller than the largest unsigned_type; if this is not known then pass
|
||||
(sizeof(unsigned_type) * CHAR_BIT) for num_bits.
|
||||
|
||||
Assume we have a hardware register of width UINT_BITS, a known constant D which is
|
||||
not zero and not a power of 2, and a variable n of width num_bits (which may be
|
||||
up to UINT_BITS). To emit code for n/d, use one of the two following sequences
|
||||
(here >>> refers to a logical bitshift):
|
||||
|
||||
m = compute_unsigned_magic_info(D, num_bits)
|
||||
if m.pre_shift > 0: emit("n >>>= m.pre_shift")
|
||||
if m.increment: emit("n = saturated_increment(n)")
|
||||
emit("result = (m.multiplier * n) >>> UINT_BITS")
|
||||
if m.post_shift > 0: emit("result >>>= m.post_shift")
|
||||
|
||||
or
|
||||
|
||||
m = compute_unsigned_magic_info(D, num_bits)
|
||||
if m.pre_shift > 0: emit("n >>>= m.pre_shift")
|
||||
emit("result = m.multiplier * n")
|
||||
if m.increment: emit("result = result + m.multiplier")
|
||||
emit("result >>>= UINT_BITS")
|
||||
if m.post_shift > 0: emit("result >>>= m.post_shift")
|
||||
|
||||
The shifts by UINT_BITS may be "free" if the high half of the full multiply
|
||||
is put in a separate register.
|
||||
|
||||
saturated_increment(n) means "increment n unless it would wrap to 0," i.e.
|
||||
if n == (1 << UINT_BITS)-1: result = n
|
||||
else: result = n+1
|
||||
A common way to implement this is with the carry bit. For example, on x86:
|
||||
add 1
|
||||
sbb 0
|
||||
|
||||
Some invariants:
|
||||
1: At least one of pre_shift and increment is zero
|
||||
2: multiplier is never zero
|
||||
|
||||
This code incorporates the "round down" optimization per ridiculous_fish.
|
||||
*/
|
||||
|
||||
struct magicu_info {
|
||||
unsigned_type multiplier; // the "magic number" multiplier
|
||||
unsigned pre_shift; // shift for the dividend before multiplying
|
||||
unsigned post_shift; //shift for the dividend after multiplying
|
||||
int increment; // 0 or 1; if set then increment the numerator, using one of the two strategies
|
||||
};
|
||||
struct magicu_info compute_unsigned_magic_info(unsigned_type D, unsigned num_bits);
|
||||
|
||||
#if defined(__cplusplus)
|
||||
}
|
||||
#endif
|
@ -0,0 +1,136 @@
|
||||
/*
|
||||
Copyright (c) 2019 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#include "softAes.h"
|
||||
|
||||
/*
|
||||
Calculate a 512-bit hash of 'input' using 4 lanes of AES.
|
||||
The input is treated as a set of round keys for the encryption
|
||||
of the initial state.
|
||||
|
||||
'inputSize' must be a multiple of 64.
|
||||
|
||||
For a 2 MiB input, this has the same security as 32768-round
|
||||
AES encryption.
|
||||
|
||||
Hashing throughput: >20 GiB/s per CPU core with hardware AES
|
||||
*/
|
||||
template<bool softAes>
|
||||
void hashAes1Rx4(const void *input, size_t inputSize, void *hash) {
|
||||
const uint8_t* inptr = (uint8_t*)input;
|
||||
const uint8_t* inputEnd = inptr + inputSize;
|
||||
|
||||
__m128i state0, state1, state2, state3;
|
||||
__m128i in0, in1, in2, in3;
|
||||
|
||||
//intial state
|
||||
state0 = _mm_set_epi32(0x9d04b0ae, 0x59943385, 0x30ac8d93, 0x3fe49f5d);
|
||||
state1 = _mm_set_epi32(0x8a39ebf1, 0xddc10935, 0xa724ecd3, 0x7b0c6064);
|
||||
state2 = _mm_set_epi32(0x7ec70420, 0xdf01edda, 0x7c12ecf7, 0xfb5382e3);
|
||||
state3 = _mm_set_epi32(0x94a9d201, 0x5082d1c8, 0xb2e74109, 0x7728b705);
|
||||
|
||||
//process 64 bytes at a time in 4 lanes
|
||||
while (inptr < inputEnd) {
|
||||
in0 = _mm_load_si128((__m128i*)inptr + 0);
|
||||
in1 = _mm_load_si128((__m128i*)inptr + 1);
|
||||
in2 = _mm_load_si128((__m128i*)inptr + 2);
|
||||
in3 = _mm_load_si128((__m128i*)inptr + 3);
|
||||
|
||||
state0 = aesenc<softAes>(state0, in0);
|
||||
state1 = aesdec<softAes>(state1, in1);
|
||||
state2 = aesenc<softAes>(state2, in2);
|
||||
state3 = aesdec<softAes>(state3, in3);
|
||||
|
||||
inptr += 64;
|
||||
}
|
||||
|
||||
//two extra rounds to achieve full diffusion
|
||||
__m128i xkey0 = _mm_set_epi32(0x4ff637c5, 0x053bd705, 0x8231a744, 0xc3767b17);
|
||||
__m128i xkey1 = _mm_set_epi32(0x6594a1a6, 0xa8879d58, 0xb01da200, 0x8a8fae2e);
|
||||
|
||||
state0 = aesenc<softAes>(state0, xkey0);
|
||||
state1 = aesdec<softAes>(state1, xkey0);
|
||||
state2 = aesenc<softAes>(state2, xkey0);
|
||||
state3 = aesdec<softAes>(state3, xkey0);
|
||||
|
||||
state0 = aesenc<softAes>(state0, xkey1);
|
||||
state1 = aesdec<softAes>(state1, xkey1);
|
||||
state2 = aesenc<softAes>(state2, xkey1);
|
||||
state3 = aesdec<softAes>(state3, xkey1);
|
||||
|
||||
//output hash
|
||||
_mm_store_si128((__m128i*)hash + 0, state0);
|
||||
_mm_store_si128((__m128i*)hash + 1, state1);
|
||||
_mm_store_si128((__m128i*)hash + 2, state2);
|
||||
_mm_store_si128((__m128i*)hash + 3, state3);
|
||||
}
|
||||
|
||||
template void hashAes1Rx4<false>(const void *input, size_t inputSize, void *hash);
|
||||
template void hashAes1Rx4<true>(const void *input, size_t inputSize, void *hash);
|
||||
|
||||
/*
|
||||
Fill 'buffer' with pseudorandom data based on 512-bit 'state'.
|
||||
The state is encrypted using a single AES round per 16 bytes of output
|
||||
in 4 lanes.
|
||||
|
||||
'outputSize' must be a multiple of 64.
|
||||
|
||||
The modified state is written back to 'state' to allow multiple
|
||||
calls to this function.
|
||||
*/
|
||||
template<bool softAes>
|
||||
void fillAes1Rx4(void *state, size_t outputSize, void *buffer) {
|
||||
const uint8_t* outptr = (uint8_t*)buffer;
|
||||
const uint8_t* outputEnd = outptr + outputSize;
|
||||
|
||||
__m128i state0, state1, state2, state3;
|
||||
__m128i key0, key1, key2, key3;
|
||||
|
||||
key0 = _mm_set_epi32(0x9274f206, 0x79498d2f, 0x7d2de6ab, 0x67a04d26);
|
||||
key1 = _mm_set_epi32(0xe1f7af05, 0x2a3a6f1d, 0x86658a15, 0x4f719812);
|
||||
key2 = _mm_set_epi32(0xd1b1f791, 0x9e2ec914, 0x14c77bce, 0xba90750e);
|
||||
key3 = _mm_set_epi32(0x179d0fd9, 0x6e57883c, 0xa53bbe4f, 0xaa07621f);
|
||||
|
||||
state0 = _mm_load_si128((__m128i*)state + 0);
|
||||
state1 = _mm_load_si128((__m128i*)state + 1);
|
||||
state2 = _mm_load_si128((__m128i*)state + 2);
|
||||
state3 = _mm_load_si128((__m128i*)state + 3);
|
||||
|
||||
while (outptr < outputEnd) {
|
||||
state0 = aesdec<softAes>(state0, key0);
|
||||
state1 = aesenc<softAes>(state1, key1);
|
||||
state2 = aesdec<softAes>(state2, key2);
|
||||
state3 = aesenc<softAes>(state3, key3);
|
||||
|
||||
_mm_store_si128((__m128i*)outptr + 0, state0);
|
||||
_mm_store_si128((__m128i*)outptr + 1, state1);
|
||||
_mm_store_si128((__m128i*)outptr + 2, state2);
|
||||
_mm_store_si128((__m128i*)outptr + 3, state3);
|
||||
|
||||
outptr += 64;
|
||||
}
|
||||
|
||||
_mm_store_si128((__m128i*)state + 0, state0);
|
||||
_mm_store_si128((__m128i*)state + 1, state1);
|
||||
_mm_store_si128((__m128i*)state + 2, state2);
|
||||
_mm_store_si128((__m128i*)state + 3, state3);
|
||||
}
|
||||
|
||||
template void fillAes1Rx4<true>(void *state, size_t outputSize, void *buffer);
|
||||
template void fillAes1Rx4<false>(void *state, size_t outputSize, void *buffer);
|
@ -0,0 +1,26 @@
|
||||
/*
|
||||
Copyright (c) 2019 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#include "softAes.h"
|
||||
|
||||
template<bool softAes>
|
||||
void hashAes1Rx4(const void *input, size_t inputSize, void *hash);
|
||||
|
||||
template<bool softAes>
|
||||
void fillAes1Rx4(void *state, size_t outputSize, void *buffer);
|
@ -1,63 +0,0 @@
|
||||
/*
|
||||
Copyright (c) 2018 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#include <cstdint>
|
||||
#include "common.hpp"
|
||||
|
||||
namespace RandomX {
|
||||
|
||||
//Clears the 11 least-significant bits before conversion. This is done so the number
|
||||
//fits exactly into the 52-bit mantissa without rounding.
|
||||
inline double convertSigned52(int64_t x) {
|
||||
return (double)(x & -2048L);
|
||||
}
|
||||
|
||||
extern "C" {
|
||||
void ADD_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void ADD_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void SUB_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void SUB_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void MUL_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void MULH_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void MUL_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void IMUL_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void IMULH_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void DIV_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void IDIV_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void AND_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void AND_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void OR_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void OR_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void XOR_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void XOR_32(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void SHL_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void SHR_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void SAR_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void ROL_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
void ROR_64(convertible_t& a, convertible_t& b, convertible_t& c);
|
||||
bool JMP_COND(uint8_t, convertible_t&, int32_t);
|
||||
void FPINIT();
|
||||
void FPADD(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
|
||||
void FPSUB(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
|
||||
void FPMUL(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
|
||||
void FPDIV(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
|
||||
void FPSQRT(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
|
||||
void FPROUND(convertible_t& a, fpu_reg_t& b, fpu_reg_t& c);
|
||||
}
|
||||
}
|
File diff suppressed because it is too large
Load Diff
@ -0,0 +1,17 @@
|
||||
.intel_syntax noprefix
|
||||
#if defined(__APPLE__)
|
||||
.text
|
||||
#else
|
||||
.section .text
|
||||
#endif
|
||||
#if defined(__WIN32__) || defined(__APPLE__)
|
||||
#define DECL(x) _##x
|
||||
#else
|
||||
#define DECL(x) x
|
||||
#endif
|
||||
|
||||
.global DECL(squareHash)
|
||||
|
||||
DECL(squareHash):
|
||||
mov rcx, rsi
|
||||
#include "asm/squareHash.inc"
|
@ -0,0 +1,9 @@
|
||||
PUBLIC squareHash
|
||||
|
||||
.code
|
||||
|
||||
squareHash PROC
|
||||
include asm/squareHash.inc
|
||||
squareHash ENDP
|
||||
|
||||
END
|
@ -0,0 +1,76 @@
|
||||
/*
|
||||
Copyright (c) 2019 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
/*
|
||||
Based on the original idea by SChernykh:
|
||||
https://github.com/SChernykh/xmr-stak-cpu/issues/1#issuecomment-414336613
|
||||
*/
|
||||
|
||||
#include <stdint.h>
|
||||
|
||||
#if !defined(_M_X64) && !defined(__x86_64__)
|
||||
|
||||
typedef struct {
|
||||
uint64_t lo;
|
||||
uint64_t hi;
|
||||
} uint128_t;
|
||||
|
||||
#define LO(x) ((x)&0xffffffff)
|
||||
#define HI(x) ((x)>>32)
|
||||
static inline uint128_t square128(uint64_t x) {
|
||||
uint64_t xh = HI(x), xl = LO(x);
|
||||
uint64_t xll = xl * xl;
|
||||
uint64_t xlh = xl * xh;
|
||||
uint64_t xhh = xh * xh;
|
||||
uint64_t m1 = 2 * LO(xlh) + HI(xll);
|
||||
uint64_t m2 = 2 * HI(xlh) + LO(xhh) + HI(m1);
|
||||
uint64_t m3 = HI(xhh) + HI(m2);
|
||||
|
||||
uint128_t x2;
|
||||
|
||||
x2.lo = (m1 << 32) + LO(xll);
|
||||
x2.hi = (m3 << 32) + LO(m2);
|
||||
|
||||
return x2;
|
||||
}
|
||||
#undef LO(x)
|
||||
#undef HI(x)
|
||||
|
||||
inline uint64_t squareHash(uint64_t x) {
|
||||
x += 1613783669344650115;
|
||||
for (int i = 0; i < 42; ++i) {
|
||||
uint128_t x2 = square128(x);
|
||||
x = x2.lo - x2.hi;
|
||||
}
|
||||
return x;
|
||||
}
|
||||
|
||||
#else
|
||||
|
||||
#if defined(__cplusplus)
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
uint64_t squareHash(uint64_t);
|
||||
|
||||
#if defined(__cplusplus)
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif
|
File diff suppressed because it is too large
Load Diff
@ -0,0 +1,112 @@
|
||||
/*
|
||||
Copyright (c) 2018 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#include "virtualMemory.hpp"
|
||||
|
||||
#include <stdexcept>
|
||||
|
||||
#ifdef _WIN32
|
||||
#include <windows.h>
|
||||
#else
|
||||
#ifdef __APPLE__
|
||||
#include <mach/vm_statistics.h>
|
||||
#endif
|
||||
#include <sys/types.h>
|
||||
#include <sys/mman.h>
|
||||
#ifndef MAP_ANONYMOUS
|
||||
#define MAP_ANONYMOUS MAP_ANON
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#ifdef _WIN32
|
||||
std::string getErrorMessage(const char* function) {
|
||||
LPSTR messageBuffer = nullptr;
|
||||
size_t size = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
|
||||
NULL, GetLastError(), MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), (LPSTR)&messageBuffer, 0, NULL);
|
||||
std::string message(messageBuffer, size);
|
||||
LocalFree(messageBuffer);
|
||||
return std::string(function) + std::string(": ") + message;
|
||||
}
|
||||
|
||||
void setPrivilege(const char* pszPrivilege, BOOL bEnable) {
|
||||
HANDLE hToken;
|
||||
TOKEN_PRIVILEGES tp;
|
||||
BOOL status;
|
||||
DWORD error;
|
||||
|
||||
if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &hToken))
|
||||
throw std::runtime_error(getErrorMessage("OpenProcessToken"));
|
||||
|
||||
if (!LookupPrivilegeValue(NULL, pszPrivilege, &tp.Privileges[0].Luid))
|
||||
throw std::runtime_error(getErrorMessage("LookupPrivilegeValue"));
|
||||
|
||||
tp.PrivilegeCount = 1;
|
||||
|
||||
if (bEnable)
|
||||
tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
|
||||
else
|
||||
tp.Privileges[0].Attributes = 0;
|
||||
|
||||
status = AdjustTokenPrivileges(hToken, FALSE, &tp, 0, (PTOKEN_PRIVILEGES)NULL, 0);
|
||||
|
||||
error = GetLastError();
|
||||
if (!status || (error != ERROR_SUCCESS))
|
||||
throw std::runtime_error(getErrorMessage("AdjustTokenPrivileges"));
|
||||
|
||||
if (!CloseHandle(hToken))
|
||||
throw std::runtime_error(getErrorMessage("CloseHandle"));
|
||||
}
|
||||
#endif
|
||||
|
||||
void* allocExecutableMemory(std::size_t bytes) {
|
||||
void* mem;
|
||||
#ifdef _WIN32
|
||||
mem = VirtualAlloc(nullptr, bytes, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
|
||||
if (mem == nullptr)
|
||||
throw std::runtime_error(getErrorMessage("allocExecutableMemory - VirtualAlloc"));
|
||||
#else
|
||||
mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
|
||||
if (mem == MAP_FAILED)
|
||||
throw std::runtime_error("allocExecutableMemory - mmap failed");
|
||||
#endif
|
||||
return mem;
|
||||
}
|
||||
|
||||
constexpr std::size_t align(std::size_t pos, uint32_t align) {
|
||||
return ((pos - 1) / align + 1) * align;
|
||||
}
|
||||
|
||||
void* allocLargePagesMemory(std::size_t bytes) {
|
||||
void* mem;
|
||||
#ifdef _WIN32
|
||||
setPrivilege("SeLockMemoryPrivilege", 1);
|
||||
mem = VirtualAlloc(NULL, align(bytes, 2 * 1024 * 1024), MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES, PAGE_READWRITE);
|
||||
if (mem == nullptr)
|
||||
throw std::runtime_error(getErrorMessage("allocLargePagesMemory - VirtualAlloc"));
|
||||
#else
|
||||
#ifdef __APPLE__
|
||||
mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, VM_FLAGS_SUPERPAGE_SIZE_2MB, 0);
|
||||
#else
|
||||
mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
|
||||
#endif
|
||||
if (mem == MAP_FAILED)
|
||||
throw std::runtime_error("allocLargePagesMemory - mmap failed");
|
||||
#endif
|
||||
return mem;
|
||||
}
|
@ -0,0 +1,25 @@
|
||||
/*
|
||||
Copyright (c) 2018 tevador
|
||||
|
||||
This file is part of RandomX.
|
||||
|
||||
RandomX is free software: you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation, either version 3 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
RandomX is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with RandomX. If not, see<http://www.gnu.org/licenses/>.
|
||||
*/
|
||||
|
||||
#pragma once
|
||||
|
||||
#include <cstddef>
|
||||
|
||||
void* allocExecutableMemory(std::size_t);
|
||||
void* allocLargePagesMemory(std::size_t);
|
Loading…
Reference in new issue