
Index cache with byte offsets #40

@solardiz

Description

ProgPoW currently uses a source register's content modulo the cache size as the word index into the cache. This requires an (implicit) left shift by 2 in the GPU hardware to produce the byte offset. Such a left shift may or may not have a runtime performance cost, depending on the compiler and (micro-)architecture.

This (potential) runtime cost can be reliably avoided by applying a mask to the source register's content so that the byte offset is extracted right away, with no further shift needed. This will change the computed hash values, but not ProgPoW's other properties (those values are supposed to look random anyway).
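
To illustrate, here is roughly what the generated per-load kernel code looks like before and after the change (a sketch only, assuming PROGPOW_CACHE_WORDS is a power of two, which the mask requires, with src standing for the selected source register and c_dag_uc being a byte-typed alias of c_dag as in the patch below):

// Current: word index; the compiler/hardware still has to scale it by
// sizeof(uint32_t), i.e. an implicit left shift by 2, to form the address.
offset = src % PROGPOW_CACHE_WORDS;              // word index in [0, PROGPOW_CACHE_WORDS)
data   = c_dag[offset];                          // address = c_dag + (offset << 2)

// Proposed: the mask keeps only bits [2, 2 + log2(PROGPOW_CACHE_WORDS)) of src,
// so offset is already a 4-byte-aligned byte offset and no further shift is needed.
offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2); // byte offset, multiple of 4
data   = *(const uint32_t *)&c_dag_uc[offset];   // c_dag_uc = (const unsigned char *)c_dag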

Here are the changes I tested on top of my current (revised in other ways) ProgPoW tree:

+++ b/libprogpow/ProgPow.cpp
@@ -113,9 +113,13 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
   ret << "uint32_t offset, data;\n";
 
   if (kern == KERNEL_CUDA)
+  {
+      ret << "const unsigned char *c_dag_uc = (const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = threadIdx.x & (PROGPOW_LANES-1);\n";
+  }
   else
   {
+      ret << "__local const unsigned char *c_dag_uc = (__local const unsigned char *)c_dag;\n";
       ret << "const uint32_t lane_id = get_local_id(0) & (PROGPOW_LANES-1);\n";
       ret << "const uint32_t group_id = get_local_id(0) / PROGPOW_LANES;\n";
   }
@@ -152,8 +157,15 @@ std::string ProgPow::getKern(uint64_t block_number, kernel_t kern)
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          if (kern == KERNEL_CUDA)
+          {
+              ret << "data = *(const uint32_t *)&c_dag_uc[offset];\n";
+          }
+          else
+          {
+              ret << "data = *(__local const uint32_t *)&c_dag_uc[offset];\n";
+          }
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

For me, this improves the hashrate by about 1% on Vega 64 and GTX 1080, and by about 2% on GTX Titan X Maxwell. Yes, this is in my Maxwell-friendly tree. The speedups on the newer GPUs still need to be confirmed on the original, Maxwell-unfriendly ProgPoW, which I haven't done yet (I am experimenting with more tweaks anyway), but I expect them to be about 1% there as well (unless ProgPoW is fully memory-bound, in which case the would-be speedup could instead be spent on more random math, etc.)

Another way to implement this change is:

+++ libprogpow/ProgPow.cpp  2019-05-06 14:31:44.081259833 +0000
@@ -153,8 +157,8 @@
           std::string dest = mix_dst();
           uint32_t    r = rnd();
           ret << "// cache load " << i << "\n";
-          ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
-          ret << "data = c_dag[offset];\n";
+          ret << "offset = " << src << " & ((PROGPOW_CACHE_WORDS - 1) << 2);\n";
+          ret << "data = c_dag[offset >> 2];\n";
           ret << merge(dest, "data", r);
       }
       if (i < PROGPOW_CNT_MATH)

This is simpler in source code, but it relies on the compiler figuring out that the explicit right shift by 2 cancels out with the array indexing's implicit left shift by 2. In my testing, this appears to provide a slightly smaller speedup than the lengthier patch above.
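
For reference, here is why the shifts can cancel in this variant (again just a sketch, not the generated code verbatim):

// The mask clears the low two bits, so offset is always a multiple of 4.
offset = src & ((PROGPOW_CACHE_WORDS - 1) << 2);
// c_dag[offset >> 2] addresses c_dag plus ((offset >> 2) << 2) bytes; since
// (offset & 3) == 0, that equals offset, so the compiler may use offset directly
// as the byte offset -- but it is not obliged to, which likely explains the
// slightly smaller speedup of this variant in my testing.
data = c_dag[offset >> 2];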

Edits: fixed bugs in the first version of the patch; they were really nasty, yet didn't significantly affect the observed speedups.
