Description
ProgPoW currently uses a source register's content modulo the cache size (in words) as the word index into the cache. This requires an (implicit) left shift by 2 in the GPU hardware to produce the byte offset. Such a left shift might or might not have a runtime performance cost, depending on the compiler and (micro-)architecture.
This (potential) runtime cost may be reliably avoided by applying a mask to the source register content such that the byte offset is extracted right away, without needing a further shift. This will change the computed hash values, but not other properties of ProgPoW (those values are supposed to be random anyway).
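To make the two addressing schemes concrete, here is a minimal standalone C sketch of the per-load address computation (not actual ProgPoW source; PROGPOW_CACHE_WORDS is assumed to be a power of two, and the value below is only for illustration):
#include <stdint.h>
#define PROGPOW_CACHE_WORDS 4096 /* illustrative value; any power of two */
/* Current scheme: a word index, so the hardware still has to shift it left by 2
   (multiply by sizeof(uint32_t)) to form the byte address. */
uint32_t load_word_index(const uint32_t *c_dag, uint32_t src)
{
    uint32_t offset = src % PROGPOW_CACHE_WORDS; /* word index from the low bits of src */
    return c_dag[offset]; /* byte address = base + (offset << 2) */
}
/* Proposed scheme: mask src down to an already 4-byte-aligned byte offset, so the
   load needs no further shift. This uses a range of src's bits shifted up by two
   positions relative to the word index, hence the computed hash values change,
   but they remain effectively random. */
uint32_t load_byte_offset(const uint32_t *c_dag, uint32_t src)
{
    uint32_t offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2); /* byte offset */
    return *(const uint32_t *)((const unsigned char *)c_dag + offset);
}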
Here are the changes I tested on top of my current (revised in other ways) ProgPoW tree:
+++ b/libprogpow/ProgPow.cpp
@@ -113,9 +113,13 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
ret << "uint32_t offset, data;\n";
if (kern == KERNEL_CUDA)
+ {
+ ret << "const unsigned char *c_dag_uc = (const unsigned char *)c_dag;\n";
ret << "const uint32_t lane_id = threadIdx.x & (PROGPOW_LANES-1);\n";
+ }
else
{
+ ret << "__local const unsigned char *c_dag_uc = (__local const unsigned char *)c_dag;\n";
ret << "const uint32_t lane_id = get_local_id(0) & (PROGPOW_LANES-1);\n";
ret << "const uint32_t group_id = get_local_id(0) / PROGPOW_LANES;\n";
}
@@ -152,8 +157,15 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
std::string dest = mix_dst();
uint32_t r = rnd();
ret << "// cache load " << i << "\n";
- ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
- ret << "data = c_dag[offset];\n";
+ ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+ if (kern == KERNEL_CUDA)
+ {
+ ret << "data = *(const uint32_t *)&c_dag_uc[offset];\n";
+ }
+ else
+ {
+ ret << "data = *(__local const uint32_t *)&c_dag_uc[offset];\n";
+ }
ret << merge(dest, "data", r);
}
if (i < PROGPOW_CNT_MATH)
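With this patch, a cache load in the generated kernel text ends up looking roughly like the following (CUDA flavor; hypothetical register indices and merge() form, just to illustrate the emitted code):
// cache load 0
offset = mix[13] & ((PROGPOW_CACHE_WORDS - 1) << 2);
data = *(const uint32_t *)&c_dag_uc[offset];
mix[7] = ROTL32(mix[7], 14) ^ data;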
For me, this improves the hashrate on Vega 64 and GTX 1080 by about 1%, and on GTX Titan X Maxwell by about 2%. Yes, this is in my Maxwell-friendly tree. The speedups on the newer GPUs still need to be confirmed on the original, Maxwell-unfriendly ProgPoW, which I haven't done yet (I am experimenting with more tweaks anyway), but I expect them to be about 1% there as well (unless ProgPoW is fully memory-bound, in which case the would-be speedup could instead be spent on more random math, etc.)
Another way to implement this change is:
+++ libprogpow/ProgPow.cpp 2019-05-06 14:31:44.081259833 +0000
@@ -153,8 +157,8 @@
std::string dest = mix_dst();
uint32_t r = rnd();
ret << "// cache load " << i << "\n";
- ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
- ret << "data = c_dag[offset];\n";
+ ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+ ret << "data = c_dag[offset >> 2];\n";
ret << merge(dest, "data", r);
}
if (i < PROGPOW_CNT_MATH)
This is simpler in source code, but relies on the compiler figuring out that the explicit right shift by 2 cancels out with the array indexing's implicit left shift by 2. In my testing, this appears to provide a slightly smaller speedup than the lengthier patch above.
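Concretely, what the compiler has to prove is that the explicit >> 2 and the array indexing's implicit << 2 cancel in the address arithmetic. A small C sketch of this second variant (same illustrative PROGPOW_CACHE_WORDS assumption as in the sketch above):
#include <stdint.h>
#define PROGPOW_CACHE_WORDS 4096 /* illustrative value; any power of two */
/* Byte address = base + ((offset >> 2) << 2). Since offset is already a multiple
   of 4 after the mask, this is just base + offset, so an optimizing compiler can
   drop both shifts; if it fails to, it emits a >> 2 followed by a << 2. */
uint32_t load_variant2(const uint32_t *c_dag, uint32_t src)
{
    uint32_t offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2); /* 4-byte-aligned byte offset */
    return c_dag[offset >> 2];
}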
Edits: fixed bugs in the first version of the patch, which were really nasty yet didn't significantly affect the observed speedups.