Skip to content

Cache Prefetching

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Mar 18, 2021 · 14 revisions

For a single-thread application, prefetching helps overlapping I/O of next page to be accessed. Calling prefetch method only for one element is enough to fully load the page that holds the element at index. Just like get/set/bulk operations, prefetch method is thread-safe too. Any number of prefetch calls can be made as long as number of cache lines (active pages) is big enough to serve all asnchronous page loads. All prefetch calls are enqueued and run by single dedicated prefetcher thread with a cpu-friendly waiting policy.

Example usage:

VirtualMultiArray<Object> arr(...,pageSize,...);
for(size_t i=0;i<testSize;i++)
{
    Object o = arr[i];

    // when first element of a page is reached, asynchronously load next page into cache
    if((i%pageSize)==0)
        arr.prefetch(i+pageSize);
}

Prefetching efficiency depends on:

  • latency from pcie for loading a page (+storing if previous accesses have edited elements)
  • time required for computing all elements until next page
  • if prefetch operation is complete before the processing loop arrives at that page
    • then: loop continues iterations without pcie latency
    • else: prefetching adds an extra latency of a mutex lock and LRU iteration
  • if prefetch performance isn't good enough, prefetch distance should be increased. Prefetching an element from 2(or more) pages away gives more time to prefetcher to complete its task but requires 1(or more) extra cache lines (active pages) to hold those pages between.

Benchmark timings from an arbitrary no-compute example:

-------------------------------------------FX8150 2.1GHz---------------------------------------------------------------
simple write: 7306081611 nanoseconds     (bandwidth = 542.01 MB/s)      (throughput = 487.07 nanoseconds per iteration) 
simple write: 7414693501 nanoseconds     (bandwidth = 534.07 MB/s)      (throughput = 494.31 nanoseconds per iteration) 
simple write: 7424773182 nanoseconds     (bandwidth = 533.35 MB/s)      (throughput = 494.98 nanoseconds per iteration) 
simple write: 7410889679 nanoseconds     (bandwidth = 534.35 MB/s)      (throughput = 494.06 nanoseconds per iteration) 
simple write: 7427219218 nanoseconds     (bandwidth = 533.17 MB/s)      (throughput = 495.15 nanoseconds per iteration) 
prefetch write: 4625263464 nanoseconds     (bandwidth = 856.17 MB/s)      (throughput = 308.35 nanoseconds per iteration) 
prefetch write: 4623544046 nanoseconds     (bandwidth = 856.49 MB/s)      (throughput = 308.24 nanoseconds per iteration) 
prefetch write: 4639479430 nanoseconds     (bandwidth = 853.54 MB/s)      (throughput = 309.30 nanoseconds per iteration) 
prefetch write: 4632937040 nanoseconds     (bandwidth = 854.75 MB/s)      (throughput = 308.86 nanoseconds per iteration) 
prefetch write: 4631165947 nanoseconds     (bandwidth = 855.08 MB/s)      (throughput = 308.74 nanoseconds per iteration) 
simple read: 3421668563 nanoseconds     (bandwidth = 1157.33 MB/s)      (throughput = 228.11 nanoseconds per iteration) 
simple read: 3319603713 nanoseconds     (bandwidth = 1192.91 MB/s)      (throughput = 221.31 nanoseconds per iteration) 
simple read: 3325157628 nanoseconds     (bandwidth = 1190.92 MB/s)      (throughput = 221.68 nanoseconds per iteration) 
simple read: 3317791892 nanoseconds     (bandwidth = 1193.56 MB/s)      (throughput = 221.19 nanoseconds per iteration) 
simple read: 3324708277 nanoseconds     (bandwidth = 1191.08 MB/s)      (throughput = 221.65 nanoseconds per iteration) 
prefetch read: 2157070232 nanoseconds     (bandwidth = 1835.82 MB/s)      (throughput = 143.80 nanoseconds per iteration) 
prefetch read: 2156060071 nanoseconds     (bandwidth = 1836.68 MB/s)      (throughput = 143.74 nanoseconds per iteration) 
prefetch read: 2155241474 nanoseconds     (bandwidth = 1837.38 MB/s)      (throughput = 143.68 nanoseconds per iteration) 
prefetch read: 2160351019 nanoseconds     (bandwidth = 1833.04 MB/s)      (throughput = 144.02 nanoseconds per iteration) 
prefetch read: 2161684529 nanoseconds     (bandwidth = 1831.90 MB/s)      (throughput = 144.11 nanoseconds per iteration) 

-------------------------------------------FX8150 3.6GHz---------------------------------------------------------------
simple write: 6130302550 nanoseconds     (bandwidth = 645.97 MB/s)      (throughput = 408.69 nanoseconds per iteration) 
simple write: 6274952934 nanoseconds     (bandwidth = 631.08 MB/s)      (throughput = 418.33 nanoseconds per iteration) 
simple write: 6412885022 nanoseconds     (bandwidth = 617.51 MB/s)      (throughput = 427.53 nanoseconds per iteration) 
simple write: 6469412523 nanoseconds     (bandwidth = 612.11 MB/s)      (throughput = 431.29 nanoseconds per iteration) 
simple write: 6280109509 nanoseconds     (bandwidth = 630.56 MB/s)      (throughput = 418.67 nanoseconds per iteration) 
prefetch write: 4429019772 nanoseconds     (bandwidth = 894.10 MB/s)      (throughput = 295.27 nanoseconds per iteration) 
prefetch write: 4411352188 nanoseconds     (bandwidth = 897.68 MB/s)      (throughput = 294.09 nanoseconds per iteration) 
prefetch write: 4419218238 nanoseconds     (bandwidth = 896.09 MB/s)      (throughput = 294.61 nanoseconds per iteration) 
prefetch write: 4416441335 nanoseconds     (bandwidth = 896.65 MB/s)      (throughput = 294.43 nanoseconds per iteration) 
prefetch write: 4413349236 nanoseconds     (bandwidth = 897.28 MB/s)      (throughput = 294.22 nanoseconds per iteration) 
simple read: 2875923400 nanoseconds     (bandwidth = 1376.95 MB/s)      (throughput = 191.73 nanoseconds per iteration) 
simple read: 2782026411 nanoseconds     (bandwidth = 1423.42 MB/s)      (throughput = 185.47 nanoseconds per iteration) 
simple read: 2781819041 nanoseconds     (bandwidth = 1423.53 MB/s)      (throughput = 185.45 nanoseconds per iteration) 
simple read: 2796056443 nanoseconds     (bandwidth = 1416.28 MB/s)      (throughput = 186.40 nanoseconds per iteration) 
simple read: 2785293858 nanoseconds     (bandwidth = 1421.75 MB/s)      (throughput = 185.69 nanoseconds per iteration) 
prefetch read: 2096412465 nanoseconds     (bandwidth = 1888.94 MB/s)      (throughput = 139.76 nanoseconds per iteration) 
prefetch read: 2099907062 nanoseconds     (bandwidth = 1885.80 MB/s)      (throughput = 139.99 nanoseconds per iteration) 
prefetch read: 2094567844 nanoseconds     (bandwidth = 1890.60 MB/s)      (throughput = 139.64 nanoseconds per iteration) 
prefetch read: 2093896249 nanoseconds     (bandwidth = 1891.21 MB/s)      (throughput = 139.59 nanoseconds per iteration) 
prefetch read: 2092521891 nanoseconds     (bandwidth = 1892.45 MB/s)      (throughput = 139.50 nanoseconds per iteration) 

For object size of 264 bytes, 150 nanoseconds is equivalent to 1750MB/s but some of this latency is made of mutex lock, LRU computation and object copying/construction/destruction.

Source code:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "CpuBenchmarker.h"
#include <random>

class Object
{
public:
	Object(){id=-1;}
	Object(size_t i){id=i; data[id%(256)]='A';}
	char data[256];
	size_t id;
};
int main()
{
	GraphicsCardSupplyDepot depot;
	const size_t n = 15000000;
	const size_t pageSize=10000;
	const int maxActivePagesPerGpu = 5;

	VirtualMultiArray<Object> arr(n,depot.requestGpus(),pageSize,maxActivePagesPerGpu);
	const int testSize = n;
	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"simple write",testSize);
		std::random_device rd;
		std::mt19937 rng(rd());
		std::uniform_real_distribution<float> rnd(0,testSize);

		for(size_t i=0;i<testSize;i++)
		{
			arr[i]=Object(rnd(rng));
		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"prefetch write",testSize);
		std::random_device rd;
		std::mt19937 rng(rd());
		std::uniform_real_distribution<float> rnd(0,testSize);
		for(size_t i=0;i<testSize;i++)
		{
			arr[i]=Object(rnd(rng));
			if((i%pageSize)==0 && (i+pageSize<n))
				arr.prefetch(i+pageSize);

		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"simple read",testSize);
		for(size_t i=0;i<testSize;i++)
		{
			Object o = arr[i];
		}
	}

	for(int j=0;j<5;j++)
	{
		CpuBenchmarker bench(testSize*sizeof(Object),"prefetch read",testSize);
		for(size_t i=0;i<testSize;i++)
		{
			Object o = arr[i];
			if((i%pageSize)==0  && (i+pageSize<n))
				arr.prefetch(i+pageSize);

		}
	}
	std::cout<<arr.get(0).data[0]<<std::endl;
	return 0;
}

When code was changed to do pageSize*2 distanced prefetching, benchmark timings improved:

-------------------------------------------FX8150 2.1GHz---------------------------------------------------------------
simple write: 7228987739 nanoseconds     (bandwidth = 547.79 MB/s)      (throughput = 481.93 nanoseconds per iteration) 
simple write: 7406752912 nanoseconds     (bandwidth = 534.65 MB/s)      (throughput = 493.78 nanoseconds per iteration) 
simple write: 7415119836 nanoseconds     (bandwidth = 534.04 MB/s)      (throughput = 494.34 nanoseconds per iteration) 
simple write: 7416596358 nanoseconds     (bandwidth = 533.94 MB/s)      (throughput = 494.44 nanoseconds per iteration) 
simple write: 7389326431 nanoseconds     (bandwidth = 535.91 MB/s)      (throughput = 492.62 nanoseconds per iteration) 
prefetch write: 4375411306 nanoseconds     (bandwidth = 905.06 MB/s)      (throughput = 291.69 nanoseconds per iteration) 
prefetch write: 4375833916 nanoseconds     (bandwidth = 904.97 MB/s)      (throughput = 291.72 nanoseconds per iteration) 
prefetch write: 4373170853 nanoseconds     (bandwidth = 905.52 MB/s)      (throughput = 291.54 nanoseconds per iteration) 
prefetch write: 4374209347 nanoseconds     (bandwidth = 905.31 MB/s)      (throughput = 291.61 nanoseconds per iteration) 
prefetch write: 4377141487 nanoseconds     (bandwidth = 904.70 MB/s)      (throughput = 291.81 nanoseconds per iteration) 
simple read: 3415854079 nanoseconds     (bandwidth = 1159.30 MB/s)      (throughput = 227.72 nanoseconds per iteration) 
simple read: 3327793328 nanoseconds     (bandwidth = 1189.98 MB/s)      (throughput = 221.85 nanoseconds per iteration) 
simple read: 3334623880 nanoseconds     (bandwidth = 1187.54 MB/s)      (throughput = 222.31 nanoseconds per iteration) 
simple read: 3339431674 nanoseconds     (bandwidth = 1185.83 MB/s)      (throughput = 222.63 nanoseconds per iteration) 
simple read: 3325240729 nanoseconds     (bandwidth = 1190.89 MB/s)      (throughput = 221.68 nanoseconds per iteration) 
prefetch read: 2067014975 nanoseconds     (bandwidth = 1915.81 MB/s)      (throughput = 137.80 nanoseconds per iteration) 
prefetch read: 2067774046 nanoseconds     (bandwidth = 1915.10 MB/s)      (throughput = 137.85 nanoseconds per iteration) 
prefetch read: 2066604811 nanoseconds     (bandwidth = 1916.19 MB/s)      (throughput = 137.77 nanoseconds per iteration) 
prefetch read: 2067883533 nanoseconds     (bandwidth = 1915.00 MB/s)      (throughput = 137.86 nanoseconds per iteration) 
prefetch read: 2067226263 nanoseconds     (bandwidth = 1915.61 MB/s)      (throughput = 137.82 nanoseconds per iteration) 
Clone this wiki locally