
Optimizations


Since page uploads go through PCIe, and PCIe has a fixed minimum latency per transfer, it is better to do fewer PCIe transfers for all virtual arrays. For example:

class Particle
{
public:
   int buffer[1024]; // 4096 bytes per element
};


VirtualMultiArray<Particle> test1(...10 elements per page);
VirtualMultiArray<int> test2(...10 elements per page);

Here, test1 and test2 move about 40 kB and 40 bytes per page update, respectively. If the minimum PCIe latency is 10 microseconds, the "int" array reaches only about 4 MB/s under random access (indexing with a Mersenne Twister, etc.), while the "Particle" array can reach about 4 GB/s (assuming PCIe/RAM is that fast).
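
As a rough, latency-bound estimate, the achievable bandwidth is bytes moved per page divided by the minimum transfer latency. Below is a minimal sketch of that arithmetic (the 10-microsecond latency is an assumed figure, not a measurement):

#include <cstdio>

int main()
{
    const double latencySeconds    = 10e-6;        // assumed minimum PCIe latency per transfer
    const double particlePageBytes = 1024.0*4*10;  // 10 Particle elements per page ~ 40 kB
    const double intPageBytes      = 4.0*10;       // 10 int elements per page = 40 bytes
    std::printf("Particle page: %.0f MB/s\n", particlePageBytes/latencySeconds/1e6); // ~4096 MB/s
    std::printf("int page     : %.0f MB/s\n", intPageBytes/latencySeconds/1e6);      // ~4 MB/s
    return 0;
}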


Bandwidth Optimizations

  • With 8 logical cores on a math-heavy problem, up to 64 (OpenMP, std::thread, etc.) threads can be used to hide I/O latencies: once a thread starts waiting for VRAM data, it yields execution to another thread so that math compute can continue (see the combined sketch after this list). This is handled automatically by event-based waiting instead of OpenCL's clFinish command, so that different vendors (Nvidia/AMD/Intel) with different clFinish waiting policies (spin-wait vs idle-wait) can reach the same performance.
  • The more virtual GPUs, the more bandwidth potential. The memMult parameter can take values in the tens (if not hundreds) per physical GPU. {40,100,5} enables up to 145 data channels in parallel, ready to serve up to 145 threads waiting on I/O concurrently: 40 channels (virtual GPUs) on the 1st physical GPU, 100 on the 2nd, 5 on the last. The default value is 4 per physical card.
  • PcieBandwidthBenchmarker().bestBandwidth(virtual gpus for slowest pcie) helps compute the per-physical-GPU multipliers for memMult that approach the combined maximum bandwidth of all PCIe bridges in the system. Its parameter is an integer that sets the number of data channels (virtual GPUs) for the physical card on the slowest PCIe connection. If there are multiple cards, the other cards get equal or higher multipliers depending on their PCIe performance; with this method, a PCIe 16x card gets 4x the multiplier of a PCIe 4x card.
  • Bulk read/write operations increase the average bandwidth per element access because they lock a page only once and transfer all necessary data at once.
  • Different threads can concurrently lock active pages that live on different virtual GPUs (data channels), which increases page-locking throughput. Even the same physical GPU can be used concurrently as long as different virtual GPUs on it are accessed concurrently. Some cards also have multiple async copy engines that help overlap data transfers. Since virtual GPUs are directly mapped over all pages, every neighboring page can be served concurrently with another.
  • Temporal locality helps LRU caching of pages.
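
Below is a minimal sketch combining these options. It assumes the library's GraphicsCardSupplyDepot/requestGpus helpers, a constructor taking (element count, devices, page size, active pages per virtual GPU, memMult), and get/set element accessors; the exact header names and signatures may differ:

#include "GraphicsCardSupplyDepot.h"
#include "VirtualMultiArray.h"
#include "PcieBandwidthBenchmarker.h"
#include <omp.h>
#include <vector>

// class Particle as defined above

int main()
{
    GraphicsCardSupplyDepot depot;
    const size_t n = 1000000;          // total elements (bigger than RAM in real use)
    const size_t pageSize = 10;        // elements per page
    const int activePagesPerGpu = 5;   // LRU cache size per virtual GPU

    // equal channels for the slowest-PCIe card, proportionally more for faster ones
    std::vector<int> memMult = PcieBandwidthBenchmarker().bestBandwidth(4);

    VirtualMultiArray<Particle> arr(n, depot.requestGpus(), pageSize, activePagesPerGpu, memMult);

    // many more threads than logical cores, to hide I/O latency behind compute
    #pragma omp parallel for num_threads(64)
    for (long long i = 0; i < (long long)n; i++)
    {
        Particle p = arr.get(i);   // waits on its virtual GPU's data channel
        p.buffer[0] += 1;          // math-heavy work goes here
        arr.set(i, p);
    }
    return 0;
}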

Latency Optimizations

  • When the virtual array is used for linear streaming only, it doesn't need caching. Setting the parameter numActivePage=1 disables caching so that each element is accessed more quickly. The default value is 50, so it is advised to allocate (size) a large enough virtual array to include all active pages (= active pages per virtual GPU x number of virtual GPUs). Since the virtual array is meant for bigger-than-RAM data, the actual virtual array size may be 3x, 5x, 100x, or any other multiple of total active pages x elements per page.
  • On Windows, data-access latency can be lowered by putting Nvidia cards into TCC driver mode.

  • Choosing the numActivePage parameter of VirtualMultiArray's constructor as 1 disables the LRU eviction policy; every access to a different page then causes a page fault (a PCIe data transfer) per virtual GPU (the three regimes are shown in the sketch after this list).

  • Choosing the numActivePage parameter of VirtualMultiArray's constructor between 2 and 12 enables a vector & insertion-sort based LRU eviction policy. This works faster than the map & list version.

  • Choosing the numActivePage parameter of VirtualMultiArray's constructor greater than 12 enables a map & list based LRU eviction policy, so that an increased cache size per virtual GPU does not increase latency much.
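
A minimal sketch of the three regimes; the constructor shape (numActivePage as the fourth parameter) and the depot/memMult helpers follow the assumptions of the earlier sketch and may differ from the exact API:

// streaming only: no LRU, every new page access is a page fault
VirtualMultiArray<Particle> streaming(n, depot.requestGpus(), pageSize, 1, memMult);

// small cache: vector & insertion-sort based LRU (numActivePage in 2..12)
VirtualMultiArray<Particle> smallCache(n, depot.requestGpus(), pageSize, 8, memMult);

// large cache: map & list based LRU (numActivePage > 12; default is 50)
VirtualMultiArray<Particle> largeCache(n, depot.requestGpus(), pageSize, 50, memMult);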
