-
Notifications
You must be signed in to change notification settings - Fork 3
Cache Prefetching
For a single-thread application, prefetching helps overlapping I/O of next page to be accessed. Calling prefetch method only for one element is enough to fully load the page that holds the element at index. Just like get/set/bulk operations, prefetch method is thread-safe too. Any number of prefetch calls can be made as long as number of cache lines (active pages) is big enough to serve all asnchronous page loads. All prefetch calls are enqueued and run by single dedicated prefetcher thread with a cpu-friendly waiting policy.
Example usage:
VirtualMultiArray<Object> arr(...,pageSize,...);
for(size_t i=0;i<testSize;i++)
{
Object o = arr[i];
// when first element of a page is reached, asynchronously load next page into cache
if((i%pageSize)==0)
arr.prefetch(i+pageSize);
}
Prefetching efficiency depends on:
- latency from pcie for loading a page (+storing if previous accesses have edited elements)
- time required for computing all elements until next page
- if prefetch operation is complete before the processing loop arrives at that page
-
- then: loop continues iterations without pcie latency
-
- else: prefetching adds an extra latency of a mutex lock and LRU iteration
- if prefetch performance isn't good enough, prefetch distance should be increased. Prefetching an element from 2(or more) pages away gives more time to prefetcher to complete its task but requires 1(or more) extra cache lines (active pages) to hold those pages between.
Benchmark timings from an arbitrary no-compute example:
-------------------------------------------FX8150 2.1GHz---------------------------------------------------------------
simple write: 7583454002 nanoseconds (bandwidth = 522.19 MB/s) (throughput = 505.56 nanoseconds per iteration)
simple write: 7703129767 nanoseconds (bandwidth = 514.08 MB/s) (throughput = 513.54 nanoseconds per iteration)
simple write: 7704452143 nanoseconds (bandwidth = 513.99 MB/s) (throughput = 513.63 nanoseconds per iteration)
simple write: 7700202845 nanoseconds (bandwidth = 514.27 MB/s) (throughput = 513.35 nanoseconds per iteration)
simple write: 7701270510 nanoseconds (bandwidth = 514.20 MB/s) (throughput = 513.42 nanoseconds per iteration)
prefetch write: 4735132852 nanoseconds (bandwidth = 836.30 MB/s) (throughput = 315.68 nanoseconds per iteration)
prefetch write: 4723029476 nanoseconds (bandwidth = 838.44 MB/s) (throughput = 314.87 nanoseconds per iteration)
prefetch write: 4728696222 nanoseconds (bandwidth = 837.44 MB/s) (throughput = 315.25 nanoseconds per iteration)
prefetch write: 4725454628 nanoseconds (bandwidth = 838.01 MB/s) (throughput = 315.03 nanoseconds per iteration)
prefetch write: 4741174428 nanoseconds (bandwidth = 835.24 MB/s) (throughput = 316.08 nanoseconds per iteration)
simple read: 3767858716 nanoseconds (bandwidth = 1050.99 MB/s) (throughput = 251.19 nanoseconds per iteration)
simple read: 3636024989 nanoseconds (bandwidth = 1089.10 MB/s) (throughput = 242.40 nanoseconds per iteration)
simple read: 3637292732 nanoseconds (bandwidth = 1088.72 MB/s) (throughput = 242.49 nanoseconds per iteration)
simple read: 3635396139 nanoseconds (bandwidth = 1089.29 MB/s) (throughput = 242.36 nanoseconds per iteration)
simple read: 3627524875 nanoseconds (bandwidth = 1091.65 MB/s) (throughput = 241.83 nanoseconds per iteration)
prefetch read: 2258548128 nanoseconds (bandwidth = 1753.34 MB/s) (throughput = 150.57 nanoseconds per iteration)
prefetch read: 2266880289 nanoseconds (bandwidth = 1746.89 MB/s) (throughput = 151.13 nanoseconds per iteration)
prefetch read: 2267620311 nanoseconds (bandwidth = 1746.32 MB/s) (throughput = 151.17 nanoseconds per iteration)
prefetch read: 2269081075 nanoseconds (bandwidth = 1745.20 MB/s) (throughput = 151.27 nanoseconds per iteration)
prefetch read: 2268357389 nanoseconds (bandwidth = 1745.76 MB/s) (throughput = 151.22 nanoseconds per iteration)
-------------------------------------------FX8150 3.6GHz---------------------------------------------------------------
simple write: 6284796895 nanoseconds (bandwidth = 630.09 MB/s) (throughput = 418.99 nanoseconds per iteration)
simple write: 6452773880 nanoseconds (bandwidth = 613.69 MB/s) (throughput = 430.18 nanoseconds per iteration)
simple write: 6452111324 nanoseconds (bandwidth = 613.75 MB/s) (throughput = 430.14 nanoseconds per iteration)
simple write: 6459163342 nanoseconds (bandwidth = 613.08 MB/s) (throughput = 430.61 nanoseconds per iteration)
simple write: 6452326896 nanoseconds (bandwidth = 613.73 MB/s) (throughput = 430.16 nanoseconds per iteration)
prefetch write: 4470454530 nanoseconds (bandwidth = 885.82 MB/s) (throughput = 298.03 nanoseconds per iteration)
prefetch write: 4469158477 nanoseconds (bandwidth = 886.07 MB/s) (throughput = 297.94 nanoseconds per iteration)
prefetch write: 4477282203 nanoseconds (bandwidth = 884.47 MB/s) (throughput = 298.49 nanoseconds per iteration)
prefetch write: 4494766503 nanoseconds (bandwidth = 881.02 MB/s) (throughput = 299.65 nanoseconds per iteration)
prefetch write: 4471666851 nanoseconds (bandwidth = 885.58 MB/s) (throughput = 298.11 nanoseconds per iteration)
simple read: 3072195790 nanoseconds (bandwidth = 1288.98 MB/s) (throughput = 204.81 nanoseconds per iteration)
simple read: 2979333197 nanoseconds (bandwidth = 1329.16 MB/s) (throughput = 198.62 nanoseconds per iteration)
simple read: 2979167113 nanoseconds (bandwidth = 1329.23 MB/s) (throughput = 198.61 nanoseconds per iteration)
simple read: 2980474802 nanoseconds (bandwidth = 1328.65 MB/s) (throughput = 198.70 nanoseconds per iteration)
simple read: 2982577528 nanoseconds (bandwidth = 1327.71 MB/s) (throughput = 198.84 nanoseconds per iteration)
prefetch read: 2135030398 nanoseconds (bandwidth = 1854.77 MB/s) (throughput = 142.34 nanoseconds per iteration)
prefetch read: 2133755388 nanoseconds (bandwidth = 1855.88 MB/s) (throughput = 142.25 nanoseconds per iteration)
prefetch read: 2132800590 nanoseconds (bandwidth = 1856.71 MB/s) (throughput = 142.19 nanoseconds per iteration)
prefetch read: 2134220811 nanoseconds (bandwidth = 1855.48 MB/s) (throughput = 142.28 nanoseconds per iteration)
prefetch read: 2137878750 nanoseconds (bandwidth = 1852.30 MB/s) (throughput = 142.53 nanoseconds per iteration)
For object size of 264 bytes, 150 nanoseconds is equivalent to 1750MB/s but some of this latency is made of mutex lock, LRU computation and object copying/construction/destruction.
Source code:
#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "CpuBenchmarker.h"
#include <random>
class Object
{
public:
Object(){id=-1;}
Object(size_t i){id=i; data[id%(256)]='A';}
char data[256];
size_t id;
};
int main()
{
GraphicsCardSupplyDepot depot;
const size_t n = 15000000;
const size_t pageSize=10000;
const int maxActivePagesPerGpu = 5;
VirtualMultiArray<Object> arr(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu);
const int testSize = n;
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"simple write",testSize);
std::random_device rd;
std::mt19937 rng(rd());
std::uniform_real_distribution<float> rnd(0,testSize);
for(size_t i=0;i<testSize;i++)
{
arr[i]=Object(rnd(rng));
}
}
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"prefetch write",testSize);
std::random_device rd;
std::mt19937 rng(rd());
std::uniform_real_distribution<float> rnd(0,testSize);
for(size_t i=0;i<testSize;i++)
{
arr[i]=Object(rnd(rng));
if((i%pageSize)==0 && (i+pageSize<n))
arr.prefetch(i+pageSize);
}
}
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"simple read",testSize);
for(size_t i=0;i<testSize;i++)
{
Object o = arr[i];
}
}
for(int j=0;j<5;j++)
{
CpuBenchmarker bench(testSize*sizeof(Object),"prefetch read",testSize);
for(size_t i=0;i<testSize;i++)
{
Object o = arr[i];
if((i%pageSize)==0 && (i+pageSize<n))
arr.prefetch(i+pageSize);
}
}
std::cout<<arr.get(0).data[0]<<std::endl;
return 0;
}