
Index cache with byte offsets #40

@solardiz

Description

ProgPoW currently uses a source register's content modulo the cache size as the word index into the cache. This requires an (implicit) left shift by 2 in the GPU hardware to produce the byte offset. Such a left shift may or may not have a runtime performance cost, depending on the compiler and (micro-)architecture.

This (potential) runtime cost can be reliably avoided by applying a mask to the source register's content so that the byte offset is extracted right away, with no further shift needed. This will change the computed hash values, but not ProgPoW's other properties (those values are supposed to look random anyway).
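
To illustrate, here is roughly what the generated per-load kernel code looks like before and after the change (a sketch only, assuming PROGPOW_CACHE_WORDS is a power of two, which the mask requires, with src standing for the selected source register and c_dag_uc being a byte-typed alias of c_dag as in the patch below):

// Current: word index; the compiler/hardware still has to scale it by
// sizeof(uint32_t), i.e. an implicit left shift by 2, to form the address.
offset = src % PROGPOW_CACHE_WORDS;              // word index in [0, PROGPOW_CACHE_WORDS)
data   = c_dag[offset];                          // address = c_dag + (offset << 2)

// Proposed: the mask keeps only bits [2, 2 + log2(PROGPOW_CACHE_WORDS)) of src,
// so offset is already a 4-byte-aligned byte offset and no further shift is needed.
offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2); // byte offset, multiple of 4
data   = *(const uint32_t *)&c_dag_uc[offset];   // c_dag_uc = (const unsigned char *)c_dag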

Here are the changes I tested on top of my current (revised in other ways) ProgPoW tree:

+++ b/libprogpow/ProgPow.cpp
@@ -113,9 +113,13 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
   ret << "uint32_t offset, data;\n";
 
   if (kern == KERNEL_CUDA)
+  {
+      ret << "const unsigned char *c_dag_uc = (const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = threadIdx.x & (PROGPOW_LANES-1);\n";
+  }
   else
   {
+      ret << "__local const unsigned char *c_dag_uc = (__local const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = get_local_id(0) & (PROGPOW_LANES-1);\n";
       ret << "const uint32_t group_id = get_local_id(0) / PROGPOW_LANES;\n";
   }
@@ -152,8 +157,15 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          if (kern == KERNEL_CUDA)
+          {
+              ret << "data = *(const uint32_t *)&c_dag_uc[offset];\n";
+          }
+          else
+          {
+              ret << "data = *(__local const uint32_t *)&c_dag_uc[offset];\n";
+          }
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

For me, this improves the hashrate by about 1% on Vega 64 and GTX 1080, and by about 2% on GTX Titan X Maxwell. Yes, this is in my Maxwell-friendly tree. The speedups on the newer GPUs still need to be confirmed on the original, Maxwell-unfriendly ProgPoW, which I haven't done yet (I am experimenting with more tweaks anyway), but I expect them to be about 1% there as well (unless ProgPoW is fully memory-bound, in which case the would-be speedup could instead be spent on more random math, etc.)

Another way to implement this change is:

+++ libprogpow/ProgPow.cpp  2019-05-06 14:31:44.081259833 +0000
@@ -153,8 +157,8 @@
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          ret << "data = c_dag[offset >> 2];\n";
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

This is simpler in source code, but it relies on the compiler figuring out that the explicit right shift by 2 cancels out with the array indexing's implicit left shift by 2. In my testing, this appears to provide a slightly smaller speedup than the lengthier patch above.
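
For reference, here is why the shifts can cancel in this variant (again just a sketch, not the generated code verbatim):

// The mask clears the low two bits, so offset is always a multiple of 4.
offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2);
// c_dag[offset >> 2] addresses c_dag plus ((offset >> 2) << 2) bytes; since
// (offset & 3) == 0, that equals offset, so the compiler may use offset directly
// as the byte offset -- but it is not obliged to, which likely explains the
// slightly smaller speedup of this variant in my testing.
data = c_dag[offset >> 2];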

Edits: fixed bugs in the first version of the patch; they were really nasty, yet didn't significantly affect the observed speedups.
