
Bandwidth and latency

Hüseyin Tuğrul BÜYÜKIŞIK edited this page Mar 18, 2021 · 22 revisions

Before getting into details, here is some benchmark data from the development system, which has 3 old low-end graphics cards (PCIe v2 x4 + x8 + x4) and 8 CPU cores (source code below):

  • random write bandwidth (object size = 512kB, 8 threads): 4011 MB/s

  • random read bandwidth (object size = 512kB, 64 threads): 6383 MB/s

  • random write latency (object size = 8 bytes, 64 threads): 28334 nanoseconds @ 2.1GHz, 21127 nanoseconds @ 3.6 GHz

  • random read latency (object size = 8 bytes, 64 threads): 18347 nanoseconds @ 2.1GHz, 14188 nanoseconds @ 3.6GHz

  • RAM: 10600 MB/s (single channel ddr3 at 1333 MHz), ~80 nanoseconds latency

  • L1 cache bandwidth: ~200 GB/s (fx8150 at 2.1GHz)

Not all object types have the same read/write efficiency, not all objects have the same size, and not all algorithms have the same access pattern. There is always a trade-off between bandwidth and latency, which can be optimized by tuning the parameters of the virtual array.

Sequential Scalar Access

A sequential access pattern has locality after the first access. Once any element of a page is fetched or written, the whole page is cached in RAM (and in the L1/L2/L3 caches if the page or object is small enough). The following elements are then accessed at the cost of only a page lock, which is held just for computing the page index, element index, graphics card index, etc. until the relevant data is reached, in constant time. PCIe data copies are also constant-time, but the system topology gives different pages different latencies and bandwidths. To keep this per-page performance difference as local as possible, consecutive page indices are interleaved across all graphics cards: the first page is on the first card, the second page is on the second card, and so on. Since there are multiple OpenCL channels per card, the interleaving repeats over a longer cycle. With cards a, b, c, each with 2 data channels, the cycle is a1 b1 c1 a2 b2 c2 a1 b1 c1 a2 b2 c2 ..., so it takes 6 pages to reach the same channel of the same graphics card again. This enables high-volume threading for I/O and spreads access-time differences over the whole virtual array instead of slowing down one part of it.
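The interleaving cycle described above can be sketched as a small mapping from page index to card and channel. This is only an illustration: locatePage and label are hypothetical helpers, not part of the library, and the sketch assumes every card has the same number of channels.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper types and functions, for illustration only.
struct PageLocation
{
    std::size_t card;    // 0 = a, 1 = b, 2 = c, ...
    std::size_t channel; // 0-based channel on that card
};

// Round-robin over cards first, then over channels, so that
// consecutive pages always land on different cards.
PageLocation locatePage(std::size_t pageIndex, std::size_t numCards, std::size_t channelsPerCard)
{
    assert(numCards > 0 && channelsPerCard > 0);
    const std::size_t cycle = numCards * channelsPerCard; // pages before the pattern repeats
    const std::size_t slot = pageIndex % cycle;
    return PageLocation{ slot % numCards, slot / numCards };
}

// Formats a location as "a1", "b2", ... to match the notation in the text.
std::string label(const PageLocation& p)
{
    return std::string(1, static_cast<char>('a' + p.card)) + std::to_string(p.channel + 1);
}
```

With 3 cards and 2 channels per card, pages 0 to 5 map to a1 b1 c1 a2 b2 c2, and page 6 maps to a1 again.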

Caching is done at page level, and a larger page size means lower average element access latency. In multi-threaded access, however, page locking causes heavy contention and performance drops. To avoid this, each thread should access a different region of the array rather than the same page. For example, thread 1 can access the first page while thread 2 accesses the second page concurrently. If more concurrent accesses are needed, the memMult parameter of the virtual array constructor needs higher values. For example, if memMult contains {10,10,20}, the array can have 40 concurrent pages for updating/accessing.
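One way to keep threads off each other's pages is to hand every thread a page-aligned range of indices. The following is a minimal sketch of that idea; Range, splitByPages and totalChannels are hypothetical helpers, not library functions, and totalChannels only reproduces the {10,10,20} -> 40 arithmetic from the paragraph above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Range { std::size_t begin; std::size_t end; }; // half-open: [begin, end)

// Splits n elements into one contiguous chunk per thread. Chunk boundaries
// are multiples of pageSize, so no two threads ever share a page.
std::vector<Range> splitByPages(std::size_t n, std::size_t pageSize, std::size_t numThreads)
{
    assert(pageSize > 0 && numThreads > 0);
    const std::size_t numPages = (n + pageSize - 1) / pageSize;
    const std::size_t pagesPerThread = (numPages + numThreads - 1) / numThreads;
    std::vector<Range> ranges;
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        const std::size_t begin = std::min(n, t * pagesPerThread * pageSize);
        const std::size_t end = std::min(n, (t + 1) * pagesPerThread * pageSize);
        ranges.push_back(Range{ begin, end });
    }
    return ranges;
}

// Sum of the per-card channel multipliers, e.g. {10,10,20} gives 40.
int totalChannels(const std::vector<int>& memMult)
{
    return std::accumulate(memMult.begin(), memMult.end(), 0);
}
```

Each thread would then loop over its own [begin, end) range with get/set, so lock contention on a single page is avoided by construction.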

Sequential Bulk Access

Like scalar access, bulk access consists of PCIe data transfers and page locking, but it does several of them at once, which greatly reduces average latency. In the tiled nbody example, the average latency per x,y,z position float mapped from the virtual array is ~100 nanoseconds (which lets CPU-cache tiling work at 50 GB/s or more, with 0.2 nanoseconds average element access). This is orders of magnitude faster than the scalar get/set methods. Since each bulk access is still single-threaded, it scales to multiple threads and can hide I/O latency behind math operations in a many-threaded application.

Sequential access can achieve both moderate bandwidth and moderate latency regardless of object/element size, but it needs a properly tuned page size.

Random Scalar Access

Random access is the weak point of this virtual array's caching. If random indices do not land on close-enough elements, paging with more than 1 element per page adds latency because unneeded elements are fetched from the cards. But random access is a strong point of VRAM (video memory): a card can access any element within its VRAM in constant time regardless of the element's position. So random access is not much of a problem if the object (element type) has enough bytes to make up for the minimum PCIe access latency.

Random access cannot have both high bandwidth and low latency at the same time; object size decides its performance. Single char elements are randomly accessed at minimum PCIe latency (several microseconds or more) and 1-2 MB/s bandwidth. Objects of 512kB are accessed with 5-10 GB/s bandwidth and 100-150 microseconds latency. Randomly accessing 4kB objects (as in HDD benchmarking) is very close to the low-latency side (~16 microseconds and 260 MB/s on the development computer). The PCIe bandwidth-efficiency gain usually diminishes at around 256kB page size, so there is no need to go bigger than this except for benchmarking.
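The link between object size, latency and bandwidth can be checked with simple arithmetic. The sketch below assumes one object is transferred per round trip with no overlap between requests; bandwidthMBps is a hypothetical helper for illustration, and 1 MB is taken as 10^6 bytes.

```cpp
#include <cassert>
#include <cmath>

// Bandwidth in MB/s when one object of objectBytes is moved per round trip
// of latencyNs nanoseconds (no overlap between requests is assumed).
double bandwidthMBps(double objectBytes, double latencyNs)
{
    assert(latencyNs > 0.0);
    // bytes per nanosecond equals GB/s; multiply by 1000 to get MB/s
    return objectBytes / latencyNs * 1000.0;
}
```

For the 4kB case, bandwidthMBps(4096, 16000) gives 256 MB/s, which is in line with the ~16 microseconds and 260 MB/s quoted above. The model ignores concurrent requests, so totals measured with many threads can be higher than this single-request estimate.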

Random Bulk Access

This is not very different from the scalar version, apart from reduced lock latency and increased bandwidth. Since it does multiple data copies, it works better with page size > 1. With page size > 1, it is a combination of sequential and random access and performs well, though not as well as sequential bulk access. In the sequential version, one thread may update a cache and the next thread can benefit directly from that cache, especially when bulk access regions overlap.

If a virtual array is needed as a read-only data source for both sequential and random access patterns concurrently, it can be duplicated with different settings: one copy with M elements per page (sequential access) and another with only 1 element per page (random access). Alternatively, the index list for random access can be prepared and sorted before accessing the array. For example, in a ray tracer with randomly scattered rays, the rays can be sorted by location or direction so that neighboring rays in the list point to similar elements in the virtual array.
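The effect of sorting an index list can be illustrated by counting how often consecutive accesses move to a different page. This is a rough sketch under a simplifying assumption (only the most recently used page counts as cached); countPageSwitches and sortedCopy are hypothetical helpers, not library functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times consecutive accesses land on a different page than
// the previous access (the first access always counts as one switch).
std::size_t countPageSwitches(const std::vector<std::size_t>& indices, std::size_t pageSize)
{
    assert(pageSize > 0);
    std::size_t switches = 0;
    for (std::size_t i = 0; i < indices.size(); ++i)
    {
        if (i == 0 || indices[i] / pageSize != indices[i - 1] / pageSize)
            ++switches;
    }
    return switches;
}

// Returns the indices in ascending order so that accesses to the same page
// become neighbors in the list.
std::vector<std::size_t> sortedCopy(std::vector<std::size_t> indices)
{
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

For the index list {5, 130, 7, 250, 128, 3, 255, 129} with 128 elements per page, the unsorted order switches pages 6 times, while the sorted order switches only 2 times.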

Limitations For Concurrency

  • Even though the virtual array can serve many more threads than the CPU has logical cores, DMA capability per graphics card is still limited. Some cards have only 1 async copy engine, some have 5. Using too many OpenCL channels (in the memMult vector elements) may not help much.
  • Some algorithms may not have enough work to hide I/O latency.
  • There may not be enough RAM bandwidth for all tasks, and any DMA between RAM and VRAM steals bandwidth from other tasks unless they run out of the L1/L2/L3 caches. This effect is visible on the development computer with its single-channel, low-frequency RAM (6 GB/s achieved instead of 8 GB/s).
  • Pages that are too big, combined with sequential multi-threaded access, cause lock contention and lower performance.
  • Too few active pages (in the memMult vector parameter of the constructor) cannot overlap enough I/O.
  • Virtual arrays have their own OpenCL contexts, so multiple virtual arrays may not enable extra concurrency on the same graphics card unless they are backed by different cards.
  • A write that triggers a page fault is actually a write (of old data) followed by a read (of new data), so it does twice the work and has lower average bandwidth than a read that triggers a page fault. Because of this, writes scale to fewer threads (around 8 on an 8-core CPU) than reads (up to 64 threads on an 8-core CPU).
  • Interleaved OpenCL data channels (independent graphics card data copy lanes) may lower performance for strided access to elements if the PCIe topology is very asymmetrical and the number of channels (or the number of elements per page) is not a prime number. For example, using 17 as the total number of channels (or as the page size) helps avoid low performance when elements are accessed like 0,2,4,6,... or 1,11,21,31,41,... because strided access then does not always arrive at the same graphics card. Tuning the page size may not be possible for some algorithms, but tuning the OpenCL channels is very easy. The number of active pages depends on the number of channels, so the more channels there are, the more caching there is. The array only needs at least as many frozen pages (total pages) as active pages. Having all pages active means all data is in RAM, and there is no point in using a virtual array.
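The prime-number advice for strided access can be checked with a small model. The sketch below assumes one element per page and plain round-robin interleaving, so that element i lands on channel i % numChannels; distinctChannels is a hypothetical helper, and the real mapping in the library may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Returns how many distinct channels are touched by the access pattern
// start, start+stride, start+2*stride, ... under the assumed mapping
// channel = index % numChannels.
std::size_t distinctChannels(std::size_t start, std::size_t stride,
                             std::size_t numAccesses, std::size_t numChannels)
{
    assert(numChannels > 0);
    std::set<std::size_t> touched;
    for (std::size_t k = 0; k < numAccesses; ++k)
        touched.insert((start + k * stride) % numChannels);
    return touched.size();
}
```

Under this model, the pattern 0,2,4,6,... touches only 3 of 6 channels but all 17 of 17 channels, and the pattern 1,11,21,31,... behaves the same way, because a prime channel count shares no common factor with these strides.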

Source code for random-access max bandwidth & min latency test:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"
#include "CpuBenchmarker.h"

// testing
#include <random>
#include <iostream>
#include <string>    // std::string, std::to_string
#include <stdexcept> // std::invalid_argument
#include "omp.h"

constexpr bool TEST_BANDWIDTH=true;
constexpr bool TEST_LATENCY=false;
constexpr bool testType = TEST_BANDWIDTH;

class Object
{
public:
	Object():id(-1){}
	Object(int p):id(p){}
	int getId() const {return id;}
private:
	char data[testType?(1024*512 - 4):(4)];
	int id;
};

int main()
{
	const long long pageSize = 1;
	const long long n = pageSize*(testType?1000:100000);
	const int numTestsPerThread = 25;
	VirtualMultiArray<Object> test(n,GraphicsCardSupplyDepot().requestGpus(),pageSize,3,PcieBandwidthBenchmarker().bestBandwidth(10));

	#pragma omp parallel for
	for(long long j=0;j<n;j++)
	{
		test.set(j,Object(j));
	}


	for(int i=1;i<=64;i++)
	{
		{
			CpuBenchmarker bench(i*numTestsPerThread*sizeof(Object),std::string("scalar set, ")+std::to_string(i)+std::string("threads"),i*numTestsPerThread);
			#pragma omp parallel for num_threads(i)
			for(long long j=0;j<i;j++)
			{
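				std::random_device rd;
				std::mt19937 rng(rd());
				// integer distribution covers every index in [0, n-1] without float rounding
				std::uniform_int_distribution<long long> rnd(0,n-1);
				for(int k=0;k<numTestsPerThread;k++)
				{
					const int rndv = static_cast<int>(rnd(rng));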
					test.set(rndv,Object(rndv));
				}
			}
		}

		{
			CpuBenchmarker bench(i*numTestsPerThread*sizeof(Object),std::string("scalar get, ")+std::to_string(i)+std::string("threads"),i*numTestsPerThread);
			#pragma omp parallel for num_threads(i)
			for(long long j=0;j<i;j++)
			{
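				std::random_device rd;
				std::mt19937 rng(rd());
				// integer distribution covers every index in [0, n-1] without float rounding
				std::uniform_int_distribution<long long> rnd(0,n-1);
				for(int k=0;k<numTestsPerThread;k++)
				{
					const int rndv = static_cast<int>(rnd(rng));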
					const auto obj = test.get(rndv);
					if(obj.getId()!=rndv)
					{
						throw std::invalid_argument("Error: set/get");
					}
				}
			}
		}

		std::cout<<"==================================================================="<<std::endl;
	}
	return 0;
}

output for bandwidth test:

(The CPU has 8 logical cores, but the benchmark code does no math, so I/O latency can be hidden by using many more threads than cores.)

scalar set, 1threads: 12044469 nanoseconds     (bandwidth = 1088.23 MB/s)      (throughput = 481778.76 nanoseconds per iteration) 
scalar get, 1threads: 9934338 nanoseconds     (bandwidth = 1319.38 MB/s)      (throughput = 397373.52 nanoseconds per iteration) 
===================================================================
scalar set, 2threads: 15049618 nanoseconds     (bandwidth = 1741.86 MB/s)      (throughput = 300992.36 nanoseconds per iteration) 
scalar get, 2threads: 14757744 nanoseconds     (bandwidth = 1776.31 MB/s)      (throughput = 295154.88 nanoseconds per iteration) 
===================================================================
scalar set, 3threads: 14624487 nanoseconds     (bandwidth = 2688.75 MB/s)      (throughput = 194993.16 nanoseconds per iteration) 
scalar get, 3threads: 16327495 nanoseconds     (bandwidth = 2408.31 MB/s)      (throughput = 217699.93 nanoseconds per iteration) 
===================================================================
scalar set, 4threads: 15923864 nanoseconds     (bandwidth = 3292.47 MB/s)      (throughput = 159238.64 nanoseconds per iteration) 
scalar get, 4threads: 15666750 nanoseconds     (bandwidth = 3346.50 MB/s)      (throughput = 156667.50 nanoseconds per iteration) 
===================================================================
scalar set, 5threads: 19624400 nanoseconds     (bandwidth = 3339.52 MB/s)      (throughput = 156995.20 nanoseconds per iteration) 
scalar get, 5threads: 18376494 nanoseconds     (bandwidth = 3566.30 MB/s)      (throughput = 147011.95 nanoseconds per iteration) 
===================================================================
scalar set, 6threads: 23283685 nanoseconds     (bandwidth = 3377.61 MB/s)      (throughput = 155224.57 nanoseconds per iteration) 
scalar get, 6threads: 23514552 nanoseconds     (bandwidth = 3344.45 MB/s)      (throughput = 156763.68 nanoseconds per iteration) 
===================================================================
scalar set, 7threads: 23991961 nanoseconds     (bandwidth = 3824.21 MB/s)      (throughput = 137096.92 nanoseconds per iteration) 
scalar get, 7threads: 24789213 nanoseconds     (bandwidth = 3701.22 MB/s)      (throughput = 141652.65 nanoseconds per iteration) 
===================================================================
scalar set, 8threads: 24721163 nanoseconds     (bandwidth = 4241.61 MB/s)      (throughput = 123605.82 nanoseconds per iteration) 
scalar get, 8threads: 26509004 nanoseconds     (bandwidth = 3955.55 MB/s)      (throughput = 132545.02 nanoseconds per iteration) 
===================================================================
scalar set, 9threads: 34087377 nanoseconds     (bandwidth = 3460.66 MB/s)      (throughput = 151499.45 nanoseconds per iteration) 
scalar get, 9threads: 30353076 nanoseconds     (bandwidth = 3886.42 MB/s)      (throughput = 134902.56 nanoseconds per iteration) 
===================================================================
scalar set, 10threads: 28806228 nanoseconds     (bandwidth = 4550.13 MB/s)      (throughput = 115224.91 nanoseconds per iteration) 
scalar get, 10threads: 29689703 nanoseconds     (bandwidth = 4414.73 MB/s)      (throughput = 118758.81 nanoseconds per iteration) 
===================================================================
scalar set, 11threads: 33307579 nanoseconds     (bandwidth = 4328.72 MB/s)      (throughput = 121118.47 nanoseconds per iteration) 
scalar get, 11threads: 32129205 nanoseconds     (bandwidth = 4487.48 MB/s)      (throughput = 116833.47 nanoseconds per iteration) 
===================================================================
scalar set, 12threads: 37564603 nanoseconds     (bandwidth = 4187.09 MB/s)      (throughput = 125215.34 nanoseconds per iteration) 
scalar get, 12threads: 33679315 nanoseconds     (bandwidth = 4670.12 MB/s)      (throughput = 112264.38 nanoseconds per iteration) 
===================================================================
scalar set, 13threads: 42089135 nanoseconds     (bandwidth = 4048.40 MB/s)      (throughput = 129505.03 nanoseconds per iteration) 
scalar get, 13threads: 37127030 nanoseconds     (bandwidth = 4589.48 MB/s)      (throughput = 114237.02 nanoseconds per iteration) 
===================================================================
scalar set, 14threads: 39826189 nanoseconds     (bandwidth = 4607.54 MB/s)      (throughput = 113789.11 nanoseconds per iteration) 
scalar get, 14threads: 40159614 nanoseconds     (bandwidth = 4569.29 MB/s)      (throughput = 114741.75 nanoseconds per iteration) 
===================================================================
scalar set, 15threads: 43406877 nanoseconds     (bandwidth = 4529.42 MB/s)      (throughput = 115751.67 nanoseconds per iteration) 
scalar get, 15threads: 42253638 nanoseconds     (bandwidth = 4653.04 MB/s)      (throughput = 112676.37 nanoseconds per iteration) 
===================================================================
scalar set, 16threads: 52377322 nanoseconds     (bandwidth = 4003.93 MB/s)      (throughput = 130943.30 nanoseconds per iteration) 
scalar get, 16threads: 40575603 nanoseconds     (bandwidth = 5168.50 MB/s)      (throughput = 101439.01 nanoseconds per iteration) 
===================================================================
scalar set, 17threads: 54093263 nanoseconds     (bandwidth = 4119.23 MB/s)      (throughput = 127278.27 nanoseconds per iteration) 
scalar get, 17threads: 43468946 nanoseconds     (bandwidth = 5126.01 MB/s)      (throughput = 102279.87 nanoseconds per iteration) 
===================================================================
scalar set, 18threads: 58901439 nanoseconds     (bandwidth = 4005.50 MB/s)      (throughput = 130892.09 nanoseconds per iteration) 
scalar get, 18threads: 48320660 nanoseconds     (bandwidth = 4882.58 MB/s)      (throughput = 107379.24 nanoseconds per iteration) 
===================================================================
scalar set, 19threads: 57317337 nanoseconds     (bandwidth = 4344.88 MB/s)      (throughput = 120668.08 nanoseconds per iteration) 
scalar get, 19threads: 47879715 nanoseconds     (bandwidth = 5201.30 MB/s)      (throughput = 100799.40 nanoseconds per iteration) 
===================================================================
scalar set, 20threads: 56880821 nanoseconds     (bandwidth = 4608.65 MB/s)      (throughput = 113761.64 nanoseconds per iteration) 
scalar get, 20threads: 51301546 nanoseconds     (bandwidth = 5109.87 MB/s)      (throughput = 102603.09 nanoseconds per iteration) 
===================================================================
scalar set, 21threads: 65669130 nanoseconds     (bandwidth = 4191.49 MB/s)      (throughput = 125084.06 nanoseconds per iteration) 
scalar get, 21threads: 53249203 nanoseconds     (bandwidth = 5169.11 MB/s)      (throughput = 101427.05 nanoseconds per iteration) 
===================================================================
scalar set, 22threads: 117282036 nanoseconds     (bandwidth = 2458.67 MB/s)      (throughput = 213240.07 nanoseconds per iteration) 
scalar get, 22threads: 80246068 nanoseconds     (bandwidth = 3593.43 MB/s)      (throughput = 145901.94 nanoseconds per iteration) 
===================================================================
scalar set, 23threads: 85605689 nanoseconds     (bandwidth = 3521.56 MB/s)      (throughput = 148879.46 nanoseconds per iteration) 
scalar get, 23threads: 59164064 nanoseconds     (bandwidth = 5095.42 MB/s)      (throughput = 102894.02 nanoseconds per iteration) 
===================================================================
scalar set, 24threads: 94704633 nanoseconds     (bandwidth = 3321.62 MB/s)      (throughput = 157841.05 nanoseconds per iteration) 
scalar get, 24threads: 57603306 nanoseconds     (bandwidth = 5461.02 MB/s)      (throughput = 96005.51 nanoseconds per iteration) 
===================================================================
scalar set, 25threads: 90020628 nanoseconds     (bandwidth = 3640.05 MB/s)      (throughput = 144033.00 nanoseconds per iteration) 
scalar get, 25threads: 64262309 nanoseconds     (bandwidth = 5099.10 MB/s)      (throughput = 102819.69 nanoseconds per iteration) 
===================================================================
scalar set, 26threads: 90324529 nanoseconds     (bandwidth = 3772.92 MB/s)      (throughput = 138960.81 nanoseconds per iteration) 
scalar get, 26threads: 66267715 nanoseconds     (bandwidth = 5142.58 MB/s)      (throughput = 101950.33 nanoseconds per iteration) 
===================================================================
scalar set, 27threads: 84455240 nanoseconds     (bandwidth = 4190.32 MB/s)      (throughput = 125118.87 nanoseconds per iteration) 
scalar get, 27threads: 66927255 nanoseconds     (bandwidth = 5287.75 MB/s)      (throughput = 99151.49 nanoseconds per iteration) 
===================================================================
scalar set, 28threads: 86489726 nanoseconds     (bandwidth = 4243.30 MB/s)      (throughput = 123556.75 nanoseconds per iteration) 
scalar get, 28threads: 64505288 nanoseconds     (bandwidth = 5689.48 MB/s)      (throughput = 92150.41 nanoseconds per iteration) 
===================================================================
scalar set, 29threads: 98129856 nanoseconds     (bandwidth = 3873.53 MB/s)      (throughput = 135351.53 nanoseconds per iteration) 
scalar get, 29threads: 71603803 nanoseconds     (bandwidth = 5308.50 MB/s)      (throughput = 98763.87 nanoseconds per iteration) 
===================================================================
scalar set, 30threads: 107287325 nanoseconds     (bandwidth = 3665.07 MB/s)      (throughput = 143049.77 nanoseconds per iteration) 
scalar get, 30threads: 69077720 nanoseconds     (bandwidth = 5692.37 MB/s)      (throughput = 92103.63 nanoseconds per iteration) 
===================================================================
scalar set, 31threads: 106661113 nanoseconds     (bandwidth = 3809.48 MB/s)      (throughput = 137627.24 nanoseconds per iteration) 
scalar get, 31threads: 73251634 nanoseconds     (bandwidth = 5546.95 MB/s)      (throughput = 94518.24 nanoseconds per iteration) 
===================================================================
scalar set, 32threads: 104326775 nanoseconds     (bandwidth = 4020.35 MB/s)      (throughput = 130408.47 nanoseconds per iteration) 
scalar get, 32threads: 74904525 nanoseconds     (bandwidth = 5599.53 MB/s)      (throughput = 93630.66 nanoseconds per iteration) 
===================================================================
scalar set, 33threads: 106740853 nanoseconds     (bandwidth = 4052.22 MB/s)      (throughput = 129382.85 nanoseconds per iteration) 
scalar get, 33threads: 76224533 nanoseconds     (bandwidth = 5674.52 MB/s)      (throughput = 92393.37 nanoseconds per iteration) 
===================================================================
scalar set, 34threads: 119854672 nanoseconds     (bandwidth = 3718.21 MB/s)      (throughput = 141005.50 nanoseconds per iteration) 
scalar get, 34threads: 74974366 nanoseconds     (bandwidth = 5943.96 MB/s)      (throughput = 88205.14 nanoseconds per iteration) 
===================================================================
scalar set, 35threads: 119714275 nanoseconds     (bandwidth = 3832.06 MB/s)      (throughput = 136816.31 nanoseconds per iteration) 
scalar get, 35threads: 83676187 nanoseconds     (bandwidth = 5482.47 MB/s)      (throughput = 95629.93 nanoseconds per iteration) 
===================================================================
scalar set, 36threads: 115623557 nanoseconds     (bandwidth = 4081.00 MB/s)      (throughput = 128470.62 nanoseconds per iteration) 
scalar get, 36threads: 84048530 nanoseconds     (bandwidth = 5614.13 MB/s)      (throughput = 93387.26 nanoseconds per iteration) 
===================================================================
scalar set, 37threads: 120987558 nanoseconds     (bandwidth = 4008.40 MB/s)      (throughput = 130797.36 nanoseconds per iteration) 
scalar get, 37threads: 86777777 nanoseconds     (bandwidth = 5588.60 MB/s)      (throughput = 93813.81 nanoseconds per iteration) 
===================================================================
scalar set, 38threads: 128336066 nanoseconds     (bandwidth = 3881.01 MB/s)      (throughput = 135090.60 nanoseconds per iteration) 
scalar get, 38threads: 93384708 nanoseconds     (bandwidth = 5333.57 MB/s)      (throughput = 98299.69 nanoseconds per iteration) 
===================================================================
scalar set, 39threads: 127154851 nanoseconds     (bandwidth = 4020.14 MB/s)      (throughput = 130415.23 nanoseconds per iteration) 
scalar get, 39threads: 92697129 nanoseconds     (bandwidth = 5514.53 MB/s)      (throughput = 95073.98 nanoseconds per iteration) 
===================================================================
scalar set, 40threads: 134929853 nanoseconds     (bandwidth = 3885.63 MB/s)      (throughput = 134929.85 nanoseconds per iteration) 
scalar get, 40threads: 89044254 nanoseconds     (bandwidth = 5887.95 MB/s)      (throughput = 89044.25 nanoseconds per iteration) 
===================================================================
scalar set, 41threads: 141794482 nanoseconds     (bandwidth = 3789.96 MB/s)      (throughput = 138336.08 nanoseconds per iteration) 
scalar get, 41threads: 94989329 nanoseconds     (bandwidth = 5657.43 MB/s)      (throughput = 92672.52 nanoseconds per iteration) 
===================================================================
scalar set, 42threads: 138018927 nanoseconds     (bandwidth = 3988.60 MB/s)      (throughput = 131446.60 nanoseconds per iteration) 
scalar get, 42threads: 93692083 nanoseconds     (bandwidth = 5875.66 MB/s)      (throughput = 89230.56 nanoseconds per iteration) 
===================================================================
scalar set, 43threads: 142291775 nanoseconds     (bandwidth = 3960.94 MB/s)      (throughput = 132364.44 nanoseconds per iteration) 
scalar get, 43threads: 100883762 nanoseconds     (bandwidth = 5586.72 MB/s)      (throughput = 93845.36 nanoseconds per iteration) 
===================================================================
scalar set, 44threads: 147218678 nanoseconds     (bandwidth = 3917.42 MB/s)      (throughput = 133835.16 nanoseconds per iteration) 
scalar get, 44threads: 103558932 nanoseconds     (bandwidth = 5568.97 MB/s)      (throughput = 94144.48 nanoseconds per iteration) 
===================================================================
scalar set, 45threads: 160263860 nanoseconds     (bandwidth = 3680.33 MB/s)      (throughput = 142456.76 nanoseconds per iteration) 
scalar get, 45threads: 105228102 nanoseconds     (bandwidth = 5605.19 MB/s)      (throughput = 93536.09 nanoseconds per iteration) 
===================================================================
scalar set, 46threads: 151100080 nanoseconds     (bandwidth = 3990.28 MB/s)      (throughput = 131391.37 nanoseconds per iteration) 
scalar get, 46threads: 105698294 nanoseconds     (bandwidth = 5704.27 MB/s)      (throughput = 91911.56 nanoseconds per iteration) 
===================================================================
scalar set, 47threads: 177625915 nanoseconds     (bandwidth = 3468.18 MB/s)      (throughput = 151170.99 nanoseconds per iteration) 
scalar get, 47threads: 108074864 nanoseconds     (bandwidth = 5700.11 MB/s)      (throughput = 91978.61 nanoseconds per iteration) 
===================================================================
scalar set, 48threads: 165096898 nanoseconds     (bandwidth = 3810.77 MB/s)      (throughput = 137580.75 nanoseconds per iteration) 
scalar get, 48threads: 111755268 nanoseconds     (bandwidth = 5629.67 MB/s)      (throughput = 93129.39 nanoseconds per iteration) 
===================================================================
scalar set, 49threads: 165494883 nanoseconds     (bandwidth = 3880.80 MB/s)      (throughput = 135097.86 nanoseconds per iteration) 
scalar get, 49threads: 113930199 nanoseconds     (bandwidth = 5637.25 MB/s)      (throughput = 93004.24 nanoseconds per iteration) 
===================================================================
scalar set, 50threads: 168037490 nanoseconds     (bandwidth = 3900.08 MB/s)      (throughput = 134429.99 nanoseconds per iteration) 
scalar get, 50threads: 120633489 nanoseconds     (bandwidth = 5432.65 MB/s)      (throughput = 96506.79 nanoseconds per iteration) 
===================================================================
scalar set, 51threads: 177678374 nanoseconds     (bandwidth = 3762.23 MB/s)      (throughput = 139355.59 nanoseconds per iteration) 
scalar get, 51threads: 110331406 nanoseconds     (bandwidth = 6058.72 MB/s)      (throughput = 86534.44 nanoseconds per iteration) 
===================================================================
scalar set, 52threads: 172810356 nanoseconds     (bandwidth = 3944.06 MB/s)      (throughput = 132931.04 nanoseconds per iteration) 
scalar get, 52threads: 119626565 nanoseconds     (bandwidth = 5697.52 MB/s)      (throughput = 92020.43 nanoseconds per iteration) 
===================================================================
scalar set, 53threads: 168286062 nanoseconds     (bandwidth = 4127.98 MB/s)      (throughput = 127008.35 nanoseconds per iteration) 
scalar get, 53threads: 116776813 nanoseconds     (bandwidth = 5948.80 MB/s)      (throughput = 88133.44 nanoseconds per iteration) 
===================================================================
scalar set, 54threads: 174877156 nanoseconds     (bandwidth = 4047.35 MB/s)      (throughput = 129538.63 nanoseconds per iteration) 
scalar get, 54threads: 127269515 nanoseconds     (bandwidth = 5561.34 MB/s)      (throughput = 94273.71 nanoseconds per iteration) 
===================================================================
scalar set, 55threads: 189869811 nanoseconds     (bandwidth = 3796.79 MB/s)      (throughput = 138087.14 nanoseconds per iteration) 
scalar get, 55threads: 124078367 nanoseconds     (bandwidth = 5810.01 MB/s)      (throughput = 90238.81 nanoseconds per iteration) 
===================================================================
scalar set, 56threads: 182990853 nanoseconds     (bandwidth = 4011.15 MB/s)      (throughput = 130707.75 nanoseconds per iteration) 
scalar get, 56threads: 114976085 nanoseconds     (bandwidth = 6383.96 MB/s)      (throughput = 82125.77 nanoseconds per iteration) 
===================================================================
scalar set, 57threads: 193549162 nanoseconds     (bandwidth = 3860.05 MB/s)      (throughput = 135823.97 nanoseconds per iteration) 
scalar get, 57threads: 125784258 nanoseconds     (bandwidth = 5939.62 MB/s)      (throughput = 88269.65 nanoseconds per iteration) 
===================================================================
scalar set, 58threads: 202275286 nanoseconds     (bandwidth = 3758.33 MB/s)      (throughput = 139500.20 nanoseconds per iteration) 
scalar get, 58threads: 134799997 nanoseconds     (bandwidth = 5639.60 MB/s)      (throughput = 92965.52 nanoseconds per iteration) 
===================================================================
scalar set, 59threads: 197567201 nanoseconds     (bandwidth = 3914.24 MB/s)      (throughput = 133943.87 nanoseconds per iteration) 
scalar get, 59threads: 138207522 nanoseconds     (bandwidth = 5595.39 MB/s)      (throughput = 93700.01 nanoseconds per iteration) 
===================================================================
scalar set, 60threads: 198432837 nanoseconds     (bandwidth = 3963.22 MB/s)      (throughput = 132288.56 nanoseconds per iteration) 
scalar get, 60threads: 138086937 nanoseconds     (bandwidth = 5695.19 MB/s)      (throughput = 92057.96 nanoseconds per iteration) 
===================================================================
scalar set, 61threads: 208980698 nanoseconds     (bandwidth = 3825.90 MB/s)      (throughput = 137036.52 nanoseconds per iteration) 
scalar get, 61threads: 147400316 nanoseconds     (bandwidth = 5424.27 MB/s)      (throughput = 96655.94 nanoseconds per iteration) 
===================================================================
scalar set, 62threads: 197225930 nanoseconds     (bandwidth = 4120.38 MB/s)      (throughput = 127242.54 nanoseconds per iteration) 
scalar get, 62threads: 143633852 nanoseconds     (bandwidth = 5657.76 MB/s)      (throughput = 92667.00 nanoseconds per iteration) 
===================================================================
scalar set, 63threads: 216191555 nanoseconds     (bandwidth = 3819.55 MB/s)      (throughput = 137264.48 nanoseconds per iteration) 
scalar get, 63threads: 140489377 nanoseconds     (bandwidth = 5877.69 MB/s)      (throughput = 89199.60 nanoseconds per iteration) 
===================================================================
scalar set, 64threads: 221455925 nanoseconds     (bandwidth = 3787.94 MB/s)      (throughput = 138409.95 nanoseconds per iteration) 
scalar get, 64threads: 141562704 nanoseconds     (bandwidth = 5925.72 MB/s)      (throughput = 88476.69 nanoseconds per iteration) 
===================================================================
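
The bandwidth and throughput figures in each output line above appear consistent with total bytes divided by elapsed time, and elapsed time divided by the number of iterations. The sketch below reproduces the first line under that assumption; reportedBandwidthMBps and nsPerIteration are hypothetical helpers, not the CpuBenchmarker implementation, sizeof(Object) is taken to be 524288 bytes in the bandwidth test, and 1 MB is taken as 10^6 bytes.

```cpp
#include <cassert>
#include <cmath>

// Total bytes moved divided by elapsed time, in MB/s (1 MB = 10^6 bytes).
double reportedBandwidthMBps(long long totalBytes, long long elapsedNs)
{
    assert(elapsedNs > 0);
    // bytes per nanosecond equals GB/s; multiply by 1000 to get MB/s
    return 1000.0 * static_cast<double>(totalBytes) / static_cast<double>(elapsedNs);
}

// Elapsed time divided by the total number of get/set calls.
double nsPerIteration(long long elapsedNs, long long iterations)
{
    assert(iterations > 0);
    return static_cast<double>(elapsedNs) / static_cast<double>(iterations);
}
```

For the line "scalar set, 1threads", 25 iterations of 524288 bytes in 12044469 nanoseconds give about 1088.23 MB/s and 481778.76 nanoseconds per iteration, matching the printed values.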

output for latency test:

scalar set, 1threads: 796015 nanoseconds     (bandwidth = 0.25 MB/s)      (throughput = 31840.60 nanoseconds per iteration) 
scalar get, 1threads: 728776 nanoseconds     (bandwidth = 0.27 MB/s)      (throughput = 29151.04 nanoseconds per iteration) 
===================================================================
scalar set, 2threads: 2257726 nanoseconds     (bandwidth = 0.18 MB/s)      (throughput = 45154.52 nanoseconds per iteration) 
scalar get, 2threads: 1415864 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28317.28 nanoseconds per iteration) 
===================================================================
scalar set, 3threads: 1931549 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25753.99 nanoseconds per iteration) 
scalar get, 3threads: 1768852 nanoseconds     (bandwidth = 0.34 MB/s)      (throughput = 23584.69 nanoseconds per iteration) 
===================================================================
scalar set, 4threads: 2168005 nanoseconds     (bandwidth = 0.37 MB/s)      (throughput = 21680.05 nanoseconds per iteration) 
scalar get, 4threads: 2321361 nanoseconds     (bandwidth = 0.34 MB/s)      (throughput = 23213.61 nanoseconds per iteration) 
===================================================================
scalar set, 5threads: 3092328 nanoseconds     (bandwidth = 0.32 MB/s)      (throughput = 24738.62 nanoseconds per iteration) 
scalar get, 5threads: 2740735 nanoseconds     (bandwidth = 0.36 MB/s)      (throughput = 21925.88 nanoseconds per iteration) 
===================================================================
scalar set, 6threads: 3125355 nanoseconds     (bandwidth = 0.38 MB/s)      (throughput = 20835.70 nanoseconds per iteration) 
scalar get, 6threads: 3356256 nanoseconds     (bandwidth = 0.36 MB/s)      (throughput = 22375.04 nanoseconds per iteration) 
===================================================================
scalar set, 7threads: 4414571 nanoseconds     (bandwidth = 0.32 MB/s)      (throughput = 25226.12 nanoseconds per iteration) 
scalar get, 7threads: 3958150 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 22618.00 nanoseconds per iteration) 
===================================================================
scalar set, 8threads: 10917844 nanoseconds     (bandwidth = 0.15 MB/s)      (throughput = 54589.22 nanoseconds per iteration) 
scalar get, 8threads: 5614330 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28071.65 nanoseconds per iteration) 
===================================================================
scalar set, 9threads: 6783931 nanoseconds     (bandwidth = 0.27 MB/s)      (throughput = 30150.80 nanoseconds per iteration) 
scalar get, 9threads: 5164546 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 22953.54 nanoseconds per iteration) 
===================================================================
scalar set, 10threads: 7847008 nanoseconds     (bandwidth = 0.25 MB/s)      (throughput = 31388.03 nanoseconds per iteration) 
scalar get, 10threads: 5441004 nanoseconds     (bandwidth = 0.37 MB/s)      (throughput = 21764.02 nanoseconds per iteration) 
===================================================================
scalar set, 11threads: 6401318 nanoseconds     (bandwidth = 0.34 MB/s)      (throughput = 23277.52 nanoseconds per iteration) 
scalar get, 11threads: 6356144 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 23113.25 nanoseconds per iteration) 
===================================================================
scalar set, 12threads: 6943016 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 23143.39 nanoseconds per iteration) 
scalar get, 12threads: 7978162 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26593.87 nanoseconds per iteration) 
===================================================================
scalar set, 13threads: 8543294 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26287.06 nanoseconds per iteration) 
scalar get, 13threads: 6459147 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 19874.30 nanoseconds per iteration) 
===================================================================
scalar set, 14threads: 8088537 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 23110.11 nanoseconds per iteration) 
scalar get, 14threads: 7922350 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 22635.29 nanoseconds per iteration) 
===================================================================
scalar set, 15threads: 8462724 nanoseconds     (bandwidth = 0.35 MB/s)      (throughput = 22567.26 nanoseconds per iteration) 
scalar get, 15threads: 7651472 nanoseconds     (bandwidth = 0.39 MB/s)      (throughput = 20403.93 nanoseconds per iteration) 
===================================================================
scalar set, 16threads: 11109599 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27774.00 nanoseconds per iteration) 
scalar get, 16threads: 8576300 nanoseconds     (bandwidth = 0.37 MB/s)      (throughput = 21440.75 nanoseconds per iteration) 
===================================================================
scalar set, 17threads: 11096916 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 26110.39 nanoseconds per iteration) 
scalar get, 17threads: 8618447 nanoseconds     (bandwidth = 0.39 MB/s)      (throughput = 20278.70 nanoseconds per iteration) 
===================================================================
scalar set, 18threads: 12515788 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27812.86 nanoseconds per iteration) 
scalar get, 18threads: 9277279 nanoseconds     (bandwidth = 0.39 MB/s)      (throughput = 20616.18 nanoseconds per iteration) 
===================================================================
scalar set, 19threads: 11523653 nanoseconds     (bandwidth = 0.33 MB/s)      (throughput = 24260.32 nanoseconds per iteration) 
scalar get, 19threads: 11772960 nanoseconds     (bandwidth = 0.32 MB/s)      (throughput = 24785.18 nanoseconds per iteration) 
===================================================================
scalar set, 20threads: 12526493 nanoseconds     (bandwidth = 0.32 MB/s)      (throughput = 25052.99 nanoseconds per iteration) 
scalar get, 20threads: 11266865 nanoseconds     (bandwidth = 0.36 MB/s)      (throughput = 22533.73 nanoseconds per iteration) 
===================================================================
scalar set, 21threads: 13493774 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25702.43 nanoseconds per iteration) 
scalar get, 21threads: 10337310 nanoseconds     (bandwidth = 0.41 MB/s)      (throughput = 19690.11 nanoseconds per iteration) 
===================================================================
scalar set, 22threads: 14546812 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26448.75 nanoseconds per iteration) 
scalar get, 22threads: 11083035 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 20150.97 nanoseconds per iteration) 
===================================================================
scalar set, 23threads: 14777246 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25699.56 nanoseconds per iteration) 
scalar get, 23threads: 11627917 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 20222.46 nanoseconds per iteration) 
===================================================================
scalar set, 24threads: 15471962 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25786.60 nanoseconds per iteration) 
scalar get, 24threads: 12204815 nanoseconds     (bandwidth = 0.39 MB/s)      (throughput = 20341.36 nanoseconds per iteration) 
===================================================================
scalar set, 25threads: 16321776 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 26114.84 nanoseconds per iteration) 
scalar get, 25threads: 12399694 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 19839.51 nanoseconds per iteration) 
===================================================================
scalar set, 26threads: 16876347 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25963.61 nanoseconds per iteration) 
scalar get, 26threads: 14088312 nanoseconds     (bandwidth = 0.37 MB/s)      (throughput = 21674.33 nanoseconds per iteration) 
===================================================================
scalar set, 27threads: 17893112 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26508.31 nanoseconds per iteration) 
scalar get, 27threads: 14622779 nanoseconds     (bandwidth = 0.37 MB/s)      (throughput = 21663.38 nanoseconds per iteration) 
===================================================================
scalar set, 28threads: 17988429 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25697.76 nanoseconds per iteration) 
scalar get, 28threads: 15610514 nanoseconds     (bandwidth = 0.36 MB/s)      (throughput = 22300.73 nanoseconds per iteration) 
===================================================================
scalar set, 29threads: 19184402 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26461.24 nanoseconds per iteration) 
scalar get, 29threads: 14404094 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 19867.72 nanoseconds per iteration) 
===================================================================
scalar set, 30threads: 20370221 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27160.29 nanoseconds per iteration) 
scalar get, 30threads: 19632563 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 26176.75 nanoseconds per iteration) 
===================================================================
scalar set, 31threads: 20234194 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 26108.64 nanoseconds per iteration) 
scalar get, 31threads: 16520864 nanoseconds     (bandwidth = 0.38 MB/s)      (throughput = 21317.24 nanoseconds per iteration) 
===================================================================
scalar set, 32threads: 21372440 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26715.55 nanoseconds per iteration) 
scalar get, 32threads: 14721429 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18401.79 nanoseconds per iteration) 
===================================================================
scalar set, 33threads: 22024660 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26696.56 nanoseconds per iteration) 
scalar get, 33threads: 16601922 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 20123.54 nanoseconds per iteration) 
===================================================================
scalar set, 34threads: 23254269 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27357.96 nanoseconds per iteration) 
scalar get, 34threads: 16645546 nanoseconds     (bandwidth = 0.41 MB/s)      (throughput = 19583.00 nanoseconds per iteration) 
===================================================================
scalar set, 35threads: 23188697 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26501.37 nanoseconds per iteration) 
scalar get, 35threads: 17003479 nanoseconds     (bandwidth = 0.41 MB/s)      (throughput = 19432.55 nanoseconds per iteration) 
===================================================================
scalar set, 36threads: 23183603 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 25759.56 nanoseconds per iteration) 
scalar get, 36threads: 17907174 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 19896.86 nanoseconds per iteration) 
===================================================================
scalar set, 37threads: 26079365 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28193.91 nanoseconds per iteration) 
scalar get, 37threads: 17549649 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 18972.59 nanoseconds per iteration) 
===================================================================
scalar set, 38threads: 26302146 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27686.47 nanoseconds per iteration) 
scalar get, 38threads: 18822879 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 19813.56 nanoseconds per iteration) 
===================================================================
scalar set, 39threads: 27120067 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27815.45 nanoseconds per iteration) 
scalar get, 39threads: 19517639 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 20018.09 nanoseconds per iteration) 
===================================================================
scalar set, 40threads: 28232050 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28232.05 nanoseconds per iteration) 
scalar get, 40threads: 19440484 nanoseconds     (bandwidth = 0.41 MB/s)      (throughput = 19440.48 nanoseconds per iteration) 
===================================================================
scalar set, 41threads: 27915815 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27234.94 nanoseconds per iteration) 
scalar get, 41threads: 19577403 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 19099.91 nanoseconds per iteration) 
===================================================================
scalar set, 42threads: 31036703 nanoseconds     (bandwidth = 0.27 MB/s)      (throughput = 29558.76 nanoseconds per iteration) 
scalar get, 42threads: 19848544 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 18903.38 nanoseconds per iteration) 
===================================================================
scalar set, 43threads: 28556314 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26564.01 nanoseconds per iteration) 
scalar get, 43threads: 19492210 nanoseconds     (bandwidth = 0.44 MB/s)      (throughput = 18132.29 nanoseconds per iteration) 
===================================================================
scalar set, 44threads: 32160799 nanoseconds     (bandwidth = 0.27 MB/s)      (throughput = 29237.09 nanoseconds per iteration) 
scalar get, 44threads: 20623827 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18748.93 nanoseconds per iteration) 
===================================================================
scalar set, 45threads: 30139185 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26790.39 nanoseconds per iteration) 
scalar get, 45threads: 21152333 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18802.07 nanoseconds per iteration) 
===================================================================
scalar set, 46threads: 30938365 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26902.93 nanoseconds per iteration) 
scalar get, 46threads: 21368780 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18581.55 nanoseconds per iteration) 
===================================================================
scalar set, 47threads: 31432058 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26750.69 nanoseconds per iteration) 
scalar get, 47threads: 23851332 nanoseconds     (bandwidth = 0.39 MB/s)      (throughput = 20299.01 nanoseconds per iteration) 
===================================================================
scalar set, 48threads: 34000969 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28334.14 nanoseconds per iteration) 
scalar get, 48threads: 22016920 nanoseconds     (bandwidth = 0.44 MB/s)      (throughput = 18347.43 nanoseconds per iteration) 
===================================================================
scalar set, 49threads: 33997037 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27752.68 nanoseconds per iteration) 
scalar get, 49threads: 23557869 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 19230.91 nanoseconds per iteration) 
===================================================================
scalar set, 50threads: 37000330 nanoseconds     (bandwidth = 0.27 MB/s)      (throughput = 29600.26 nanoseconds per iteration) 
scalar get, 50threads: 22919347 nanoseconds     (bandwidth = 0.44 MB/s)      (throughput = 18335.48 nanoseconds per iteration) 
===================================================================
scalar set, 51threads: 33965336 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26639.48 nanoseconds per iteration) 
scalar get, 51threads: 25502645 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 20002.07 nanoseconds per iteration) 
===================================================================
scalar set, 52threads: 35548154 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27344.73 nanoseconds per iteration) 
scalar get, 52threads: 23694985 nanoseconds     (bandwidth = 0.44 MB/s)      (throughput = 18226.91 nanoseconds per iteration) 
===================================================================
scalar set, 53threads: 38345789 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28940.22 nanoseconds per iteration) 
scalar get, 53threads: 25453102 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 19209.89 nanoseconds per iteration) 
===================================================================
scalar set, 54threads: 35148156 nanoseconds     (bandwidth = 0.31 MB/s)      (throughput = 26035.67 nanoseconds per iteration) 
scalar get, 54threads: 24864764 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18418.34 nanoseconds per iteration) 
===================================================================
scalar set, 55threads: 36917426 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26849.04 nanoseconds per iteration) 
scalar get, 55threads: 25272193 nanoseconds     (bandwidth = 0.44 MB/s)      (throughput = 18379.78 nanoseconds per iteration) 
===================================================================
scalar set, 56threads: 38059347 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27185.25 nanoseconds per iteration) 
scalar get, 56threads: 26492332 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 18923.09 nanoseconds per iteration) 
===================================================================
scalar set, 57threads: 39666700 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27836.28 nanoseconds per iteration) 
scalar get, 57threads: 26619218 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18680.15 nanoseconds per iteration) 
===================================================================
scalar set, 58threads: 39490829 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27235.05 nanoseconds per iteration) 
scalar get, 58threads: 27374445 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 18878.93 nanoseconds per iteration) 
===================================================================
scalar set, 59threads: 40687856 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27584.99 nanoseconds per iteration) 
scalar get, 59threads: 27655523 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18749.51 nanoseconds per iteration) 
===================================================================
scalar set, 60threads: 40158973 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26772.65 nanoseconds per iteration) 
scalar get, 60threads: 27533915 nanoseconds     (bandwidth = 0.44 MB/s)      (throughput = 18355.94 nanoseconds per iteration) 
===================================================================
scalar set, 61threads: 43636906 nanoseconds     (bandwidth = 0.28 MB/s)      (throughput = 28614.36 nanoseconds per iteration) 
scalar get, 61threads: 30648691 nanoseconds     (bandwidth = 0.40 MB/s)      (throughput = 20097.50 nanoseconds per iteration) 
===================================================================
scalar set, 62threads: 42619747 nanoseconds     (bandwidth = 0.29 MB/s)      (throughput = 27496.61 nanoseconds per iteration) 
scalar get, 62threads: 29547546 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 19062.93 nanoseconds per iteration) 
===================================================================
scalar set, 63threads: 42244608 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26821.97 nanoseconds per iteration) 
scalar get, 63threads: 30267117 nanoseconds     (bandwidth = 0.42 MB/s)      (throughput = 19217.22 nanoseconds per iteration) 
===================================================================
scalar set, 64threads: 43016439 nanoseconds     (bandwidth = 0.30 MB/s)      (throughput = 26885.27 nanoseconds per iteration) 
scalar get, 64threads: 29702208 nanoseconds     (bandwidth = 0.43 MB/s)      (throughput = 18563.88 nanoseconds per iteration) 
===================================================================
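The two bracketed columns in these logs can be reproduced from the total time on each line. The sketch below assumes 25 accesses per thread and decimal megabytes. The figure of 25 is inferred from the logs (on every line, total time divided by the per-iteration value equals 25 × thread count), not taken from the benchmark source. `RunMetrics` and `deriveMetrics` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Figures derived from one line of the benchmark log.
struct RunMetrics {
    double bandwidthMBps;   // decimal megabytes per second
    double nsPerIteration;  // total time divided by total accesses of all threads
};

// Assumption: every thread performs `iterationsPerThread` accesses of
// `objectBytes` each, and 1 MB = 1,000,000 bytes.
inline RunMetrics deriveMetrics(std::uint64_t totalNs,
                                std::uint64_t threads,
                                std::uint64_t iterationsPerThread,
                                std::uint64_t objectBytes)
{
    assert(totalNs > 0 && threads > 0 && iterationsPerThread > 0);
    const double accesses = static_cast<double>(threads * iterationsPerThread);
    const double bytes = accesses * static_cast<double>(objectBytes);
    // bytes per nanosecond equals GB/s, so multiply by 1000 to get MB/s
    return { bytes / static_cast<double>(totalNs) * 1000.0,
             static_cast<double>(totalNs) / accesses };
}
```

Under these assumptions, `deriveMetrics(198432837, 60, 25, 512 * 1024)` gives about 3963.2 MB/s and 132288.56 ns per iteration, which matches the 60-thread `scalar set` line of the bandwidth test. `deriveMetrics(43016439, 64, 25, 8)` gives about 0.30 MB/s and 26885.27 ns, which matches the 64-thread `scalar set` line of the latency test. The per-iteration value is wall time divided by the accesses of all threads, so it is an amortized cost rather than the latency seen by a single thread.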

Single-thread access performance per page (per graphics card OpenCL data channel):

scalar set: 698751 nanoseconds     (bandwidth = 750.32 MB/s)     <---- gt1030 pcie 4x v2.0
scalar set: 353396 nanoseconds     (bandwidth = 1483.57 MB/s)    <---- k420 #1 pcie 8x v2.0
scalar set: 747921 nanoseconds     (bandwidth = 700.99 MB/s)     <---- k420 #2 pcie 4x v2.0
scalar set: 688019 nanoseconds     (bandwidth = 762.03 MB/s)     <---- gt1030
scalar set: 349214 nanoseconds     (bandwidth = 1501.34 MB/s)    <---- k420 #1
scalar set: 744247 nanoseconds     (bandwidth = 704.45 MB/s)     <---- k420 #2
scalar set: 687288 nanoseconds     (bandwidth = 762.84 MB/s)     
scalar set: 350730 nanoseconds     (bandwidth = 1494.85 MB/s)     
scalar set: 757650 nanoseconds     (bandwidth = 691.99 MB/s)     
scalar set: 688658 nanoseconds     (bandwidth = 761.32 MB/s)     
scalar set: 350202 nanoseconds     (bandwidth = 1497.10 MB/s)     
scalar set: 744332 nanoseconds     (bandwidth = 704.37 MB/s)     
scalar set: 685233 nanoseconds     (bandwidth = 765.12 MB/s)     
scalar set: 348773 nanoseconds     (bandwidth = 1503.24 MB/s)     
scalar set: 752406 nanoseconds     (bandwidth = 696.82 MB/s)     
scalar set: 687659 nanoseconds     (bandwidth = 762.42 MB/s)     
scalar set: 348244 nanoseconds     (bandwidth = 1505.52 MB/s)     
scalar set: 744723 nanoseconds     (bandwidth = 704.00 MB/s)     
scalar set: 686572 nanoseconds     (bandwidth = 763.63 MB/s)     
scalar set: 353398 nanoseconds     (bandwidth = 1483.56 MB/s)     
scalar set: 747481 nanoseconds     (bandwidth = 701.41 MB/s)     
scalar set: 688354 nanoseconds     (bandwidth = 761.65 MB/s)     
scalar set: 355543 nanoseconds     (bandwidth = 1474.61 MB/s)     
scalar set: 741487 nanoseconds     (bandwidth = 707.08 MB/s)     
scalar set: 690397 nanoseconds     (bandwidth = 759.40 MB/s)     
scalar get: 683646 nanoseconds     (bandwidth = 766.90 MB/s)     
scalar get: 349947 nanoseconds     (bandwidth = 1498.19 MB/s)     
scalar get: 751107 nanoseconds     (bandwidth = 698.02 MB/s)     
scalar get: 689224 nanoseconds     (bandwidth = 760.69 MB/s)     
scalar get: 348975 nanoseconds     (bandwidth = 1502.37 MB/s)     
scalar get: 745463 nanoseconds     (bandwidth = 703.31 MB/s)     
scalar get: 684979 nanoseconds     (bandwidth = 765.41 MB/s)     
scalar get: 349885 nanoseconds     (bandwidth = 1498.46 MB/s)     
scalar get: 742869 nanoseconds     (bandwidth = 705.76 MB/s)     
scalar get: 325463 nanoseconds     (bandwidth = 1610.90 MB/s)     
scalar get: 179616 nanoseconds     (bandwidth = 2918.94 MB/s)     
scalar get: 368598 nanoseconds     (bandwidth = 1422.38 MB/s)     
scalar get: 326722 nanoseconds     (bandwidth = 1604.69 MB/s)     
scalar get: 172963 nanoseconds     (bandwidth = 3031.21 MB/s)     
scalar get: 366453 nanoseconds     (bandwidth = 1430.71 MB/s)     
scalar get: 327639 nanoseconds     (bandwidth = 1600.20 MB/s)     
scalar get: 174292 nanoseconds     (bandwidth = 3008.10 MB/s)     
scalar get: 367108 nanoseconds     (bandwidth = 1428.16 MB/s)     
scalar get: 327970 nanoseconds     (bandwidth = 1598.59 MB/s)     
scalar get: 172072 nanoseconds     (bandwidth = 3046.91 MB/s)     
scalar get: 367022 nanoseconds     (bandwidth = 1428.49 MB/s)     
scalar get: 325872 nanoseconds     (bandwidth = 1608.88 MB/s)     
scalar get: 179071 nanoseconds     (bandwidth = 2927.82 MB/s)     
scalar get: 365247 nanoseconds     (bandwidth = 1435.43 MB/s)     
scalar get: 325645 nanoseconds     (bandwidth = 1610.00 MB/s)   

About half of the bandwidth comes from the PCIe v2.0 x8 card, which is the first K420. It is plugged into the x16 bridge but shares it 50%/50% with the gt1030. Since the gt1030 cannot go above x4, the remaining x4 of that bridge is wasted. The last K420 is on an x4 bridge, so there is no loss there.
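The "half of the bandwidth" figure can be checked against the single-thread log above. The sketch below takes the first three `scalar set` lines (one page per card) and computes each card's fraction of their sum. `bandwidthShares` and `measuredSetShares` are hypothetical helper names, and summing per-page bandwidths is only a rough stand-in for the aggregate multi-threaded bandwidth.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Fraction of the summed bandwidth contributed by each card.
template <std::size_t N>
std::array<double, N> bandwidthShares(const std::array<double, N>& mbps)
{
    const double total = std::accumulate(mbps.begin(), mbps.end(), 0.0);
    assert(total > 0.0);
    std::array<double, N> shares{};
    for (std::size_t i = 0; i < N; ++i) {
        shares[i] = mbps[i] / total;
    }
    return shares;
}

// First three "scalar set" lines of the log, in MB/s:
// gt1030 (x4), K420 #1 (x8), K420 #2 (x4).
inline std::array<double, 3> measuredSetShares()
{
    const std::array<double, 3> mbps{750.32, 1483.57, 700.99};
    return bandwidthShares(mbps);
}
```

With these three values, the K420 on the x8 link accounts for roughly 50.5% of the sum, while the gt1030 and the second K420 contribute about 25.6% and 23.9%.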
