Adds more parallelism to the AES loop so modern CPUs can take advantage of it. Also, scratchpad data now moves between the L1 and L3 caches only once, which saves time and energy per hash.
* Blake2Generator::getInt32 renamed to getUInt32 to avoid confusion
* isPowerOf2 renamed to isZeroOrPowerOf2 to avoid confusion
* added asserts to validate the input/output size of AES functions
* fixed possible overflow in JitCompilerX86::getCodeSize (unused function)
* this fixes identical sequences of columns 0/2 and 1/3 if their states are the same
* added TestU01 results for AesGenerator1R and AesGenerator4R
* added a note about the reversibility of AesHash1R
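
For context on the `isPowerOf2` rename: the usual bit trick `(x & (x - 1)) == 0` also returns true for zero, which is presumably why the old name was misleading. The sketch below is illustrative only. The signatures, `isStrictPowerOf2` and `wrapIndex` are hypothetical and not taken from the RandomX sources.

```cpp
#include <cassert>
#include <cstdint>

// Assumed shape of the renamed helper: true for 0 and for any power of two.
// For x == 0 the unsigned subtraction wraps around, and 0 & anything is 0.
constexpr bool isZeroOrPowerOf2(uint64_t x) {
    return (x & (x - 1)) == 0;
}

// Hypothetical stricter variant that rejects zero.
constexpr bool isStrictPowerOf2(uint64_t x) {
    return x != 0 && isZeroOrPowerOf2(x);
}

// Hypothetical caller: wraps an index into a buffer whose size must be a
// non-zero power of two, so the mask (size - 1) is valid.
inline uint64_t wrapIndex(uint64_t index, uint64_t size) {
    assert(isStrictPowerOf2(size));
    return index & (size - 1);
}

static_assert(isZeroOrPowerOf2(0), "zero passes the bit trick");
static_assert(isZeroOrPowerOf2(64), "64 is a power of two");
static_assert(!isZeroOrPowerOf2(48), "48 is not a power of two");
static_assert(!isStrictPowerOf2(0), "the strict variant rejects zero");
```

With the new name, a call site that must exclude zero has to check for it explicitly.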