* Optimized dataset read
Reading a dataset item had a false dependency on readReg2 and readReg3, caused by the `xor rbp, rax` instruction (see design.md - 4.6.2 Loop execution, steps 5 and 7). This change uses the `ma` register to read the dataset item before the whole `rbp` (`ma` and `mx`) is modified, so superscalar, out-of-order CPUs can start executing the read earlier.
Results: https://i.imgur.com/Bpeq9mx.png
~1% speedup on modern Intel/AMD CPUs.
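
The reordering can be illustrated with a toy C++ model. This is only a sketch under stated assumptions, not the actual JIT code: the names `read_after_xor`, `read_before_xor`, `StepResult`, `Dataset` and `kItems` are hypothetical, and placing `mx` in the low half and `ma` in the high half of `rbp` is an assumption made for illustration. Both orderings return the same item and the same final `rbp`; only the second lets the load be issued without waiting for readReg2 and readReg3.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy dataset size (hypothetical, much smaller than the real dataset).
constexpr uint32_t kItems = 256;
using Dataset = std::array<uint64_t, kItems>;

struct StepResult {
    uint64_t item;  // dataset item that was read
    uint64_t rbp;   // packed register after the step
};

inline uint32_t low32(uint64_t v)  { return static_cast<uint32_t>(v); }
inline uint32_t high32(uint64_t v) { return static_cast<uint32_t>(v >> 32); }

// Old ordering: the xor writes the whole packed register first, so the
// load address (taken from the ma half) appears to depend on
// readReg2/readReg3 even though only the mx half actually changes.
inline StepResult read_after_xor(uint64_t rbp, uint64_t readReg2,
                                 uint64_t readReg3, const Dataset& ds) {
    uint64_t rax = low32(readReg2 ^ readReg3);
    rbp ^= rax;                                // models `xor rbp, rax`
    uint64_t item = ds[high32(rbp) % kItems];  // waits on the xor result
    rbp = (rbp << 32) | (rbp >> 32);           // swap mx and ma
    return {item, rbp};
}

// New ordering: the read uses ma before rbp is modified, so it depends
// only on the old rbp value and can start earlier.
inline StepResult read_before_xor(uint64_t rbp, uint64_t readReg2,
                                  uint64_t readReg3, const Dataset& ds) {
    uint64_t item = ds[high32(rbp) % kItems];  // independent of readReg2/3
    uint64_t rax = low32(readReg2 ^ readReg3);
    rbp ^= rax;                                // models `xor rbp, rax`
    rbp = (rbp << 32) | (rbp >> 32);           // swap mx and ma
    return {item, rbp};
}
```

In plain C++ a compiler may schedule both versions identically; the difference matters in generated machine code, where `ma` and `mx` share one 64-bit register.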
* ARMv8: optimized dataset read
Breaks the false dependency on readReg2 and readReg3.
* Fixed light mode hashing
* this better matches CPU capabilities since execution ports are usually split 1:1 between fadd and fmul
* the frequency of FSWAP_R decreased from 8 to 4 (it's ASIC-friendly)
* activated the IROL_R instruction
* added detailed guidelines for the selection of configuration values
* added additional compile-time checks to prevent bad configurations
* removed RANDOMX_SUPERSCALAR_MAX_SIZE parameter
Fixed some undefined behavior with signed types
Fixed different results on big endian systems
Removed unused code files
Restored FNEG_R instructions
Updated documentation