* Optimized dataset read
There was a false dependency on readReg2 and readReg3 (caused by `xor rbp, rax` instruction) when reading dataset item (see design.md - 4.6.2 Loop execution, steps 5 and 7). This change uses `ma` register to read dataset item before the whole `rbp` (`ma` and `mx`) is changed, so superscalar and out-of-order CPU can start executing it earlier.
Results: https://i.imgur.com/Bpeq9mx.png
~1% speedup on modern Intel/AMD CPUs.
* ARMv8: optimized dataset read
Break dependency from readReg2 and readReg3.
* Fixed light mode hashing