This repository was archived by the owner on Apr 28, 2023. It is now read-only.

Commit 30eae2a

register promotion: insert syncs if scoped above thread mapping
Promotion to registers is performed in a particular scope. Within this scope, if a promotion is deemed valid, tensor elements are only accessed by the same thread, making thread synchronization unnecessary. However, outside this scope, promoted elements may be accessed by different threads, which would require synchronization.

If different threads (or, in practice, different iterations of thread-mapped loops) access the same element in a way that requires synchronization, this is reflected in the dependence relation. The OuterBlockInnerThread mapping strategy detects it and introduces synchronization statements above thread mapping. It only performs promotion to registers below thread mapping. Therefore, synchronizations were unnecessary around copies to or from registers.

PR #489 introduced functionality to promote to registers at any scope, including above thread mapping. In this case, synchronizations inserted below may not suffice. For example, in the tree of the shape

  band(  // contains a sequential loop
    // <- promotion scope
    extension(
      sequence(
        filter(  // main computation
          mapping(...)),  // to threads
        filter(...)  // synchronization

different iterations of the outer sequential loop may lead to different threads accessing the same tensor element (but in one iteration, only one thread accesses it). Copies from global memory to registers will be inserted at the promotion scope, i.e. after the synchronization statement. A write to global memory by one thread will not be synchronized with a read from the same address by a potentially different thread in the following iteration of the loop above the scoping point. A synchronization must be inserted either before the read from global memory or after the write-back. In this particular case, one may want to insert the write-back before the existing synchronization, but this is not always possible in the general case, where the scoping point may not be immediately above the thread mapping. Furthermore, it may also be necessary to synchronize due to dependences with sibling subtrees that have a different mapping.

When register promotion copies are inserted above thread mapping, introduce thread synchronizations before the copy to register and after the copy from register. This is a conservative approximation. Exact analysis would require analyzing dependences between an instance of the scope and the rest of the elements and is left for future work.
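The hazard described in the message can be illustrated on the host, without a GPU. The following is a minimal sketch under an assumed access pattern: in iteration `iter` of the outer sequential loop, thread `t` touches element `(t + iter) % n`. The pattern and all function names are hypothetical and do not come from the commit; they only reproduce the situation where each element has a single owner within one iteration but a different owner in the next one.

```cpp
#include <cassert>
#include <vector>

// Hypothetical access pattern (an assumption, not taken from the commit):
// in iteration `iter`, thread `thread` accesses element (thread + iter) % n.
int ownedElement(int thread, int iter, int n) {
  return (thread + iter) % n;
}

// Thread that accesses `element` in iteration `iter` under the pattern above.
int ownerOf(int element, int iter, int n) {
  return ((element - iter) % n + n) % n;
}

// True if every element is accessed by exactly one thread within `iter`,
// i.e. promotion to registers is valid inside the scope without a barrier.
bool exclusiveWithinIteration(int iter, int n) {
  std::vector<int> count(n, 0);
  for (int t = 0; t < n; ++t) {
    ++count[ownedElement(t, iter, n)];
  }
  for (int c : count) {
    if (c != 1) {
      return false;
    }
  }
  return true;
}

// True if some element is accessed by different threads in `iter` and
// `iter + 1`: the write-back of one thread is followed by a read of another
// thread, so a barrier is required between the two iterations.
bool ownerChangesAcrossIterations(int iter, int n) {
  for (int e = 0; e < n; ++e) {
    if (ownerOf(e, iter, n) != ownerOf(e, iter + 1, n)) {
      return true;
    }
  }
  return false;
}
```

With four threads, `exclusiveWithinIteration(0, 4)` holds while `ownerChangesAcrossIterations(0, 4)` is also true, which is the case where synchronization around the copies becomes necessary.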
1 parent eb39d2c commit 30eae2a

2 files changed: 41 additions, 0 deletions
tc/core/polyhedral/cuda/memory_promotion_heuristic.cc

Lines changed: 20 additions & 0 deletions
@@ -698,6 +698,26 @@ void promoteToRegistersBelow(MappedScop& mscop, detail::ScheduleTree* scope) {
           partialSched);
     }
   }
+
+  // Return immediately if nothing was promoted.
+  if (scope->numChildren() == 0 ||
+      !matchOne(extension(sequence(any())), scope->child({0}))) {
+    return;
+  }
+
+  // If promoting above thread mapping, insert synchronizations.
+  // It is possible that promoted array elements are accessed by different
+  // threads outside the current scope (either in different iterations of the
+  // scope loops, or in sibling subtrees). For now, always insert
+  // synchronizations, similarly to copies to shared memory.
+  //
+  // TODO: The exact check for sync insertion requires the dependences between
+  // the elements in the scope and those before/after the scope and a check if
+  // the dependent instances belong to the same thread.
+  auto ancestors = scope->ancestors(root);
+  if (functional::Filter(isMappingTo<mapping::ThreadId>, ancestors).empty()) {
+    scop.insertSyncsAroundSeqChildren(scope->child({0, 0}));
+  }
 }
 
 // Promote at the positions of the thread specific markers.
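The decision made by the added code can be sketched on a toy model. This is a simplified illustration, not the Tensor Comprehensions implementation: `NodeKind`, `scopedAboveThreadMapping` and `withSyncs` are hypothetical names, the schedule tree is reduced to a list of ancestor kinds, and the placement of the syncs follows the commit message (before the copy to registers, after the copy from registers) rather than the body of `insertSyncsAroundSeqChildren`, which is not shown in this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified node kinds; real schedule trees carry more data.
enum class NodeKind { Band, Filter, Sequence, Extension, ThreadMapping };

// The promotion scope is above thread mapping when none of its ancestors
// maps to threads. This mirrors the emptiness check on the filtered
// ancestor list in the diff.
bool scopedAboveThreadMapping(const std::vector<NodeKind>& ancestors) {
  return std::none_of(
      ancestors.begin(), ancestors.end(), [](NodeKind kind) {
        return kind == NodeKind::ThreadMapping;
      });
}

// Conservatively surround the children of the promotion sequence with
// synchronization markers, but only when the scope is above thread mapping.
// Children are represented by plain labels for illustration.
std::vector<std::string> withSyncs(
    const std::vector<std::string>& seqChildren,
    const std::vector<NodeKind>& ancestors) {
  if (seqChildren.empty() || !scopedAboveThreadMapping(ancestors)) {
    return seqChildren; // nothing promoted, or syncs are not required
  }
  std::vector<std::string> result;
  result.push_back("sync");
  result.insert(result.end(), seqChildren.begin(), seqChildren.end());
  result.push_back("sync");
  return result;
}
```

For a sequence `{"copy_to_reg", "compute", "copy_from_reg"}` whose only ancestor is a band, the sketch yields five entries with a sync at both ends. With a thread-mapping ancestor the sequence is returned unchanged, matching the earlier behavior for promotion below thread mapping.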

test/test_cuda_mapper_memory_promotion.cc

Lines changed: 21 additions & 0 deletions
@@ -493,10 +493,12 @@ def fun(float(N,K) A, float(K,M) B, float(N,M) C) -> (O) {
 
   expectNoABCPromotion(code);
 
+  auto preCopySyncPos = code.find("__syncthreads()", oDeclPos);
   auto o00Pos = code.find("_O_0[0][0]");
   auto o10Pos = code.find("_O_0[1][0]");
   auto o20Pos = code.find("_O_0[2][0]");
   auto o30Pos = code.find("_O_0[3][0]");
+  auto postCopySyncPos = code.find("__syncthreads()", o30Pos);
 
   EXPECT_TRUE(o00Pos != std::string::npos)
       << "expected constant subscripts in _O_0";
@@ -506,6 +508,25 @@ def fun(float(N,K) A, float(K,M) B, float(N,M) C) -> (O) {
       << "expected constant subscripts in _O_0";
   EXPECT_TRUE(o30Pos != std::string::npos)
       << "expected constant subscripts in _O_0";
+
+  EXPECT_TRUE(preCopySyncPos != std::string::npos)
+      << "expected synchronization to be inserted";
+  EXPECT_TRUE(postCopySyncPos != std::string::npos)
+      << "expected synchronization to be inserted";
+
+  EXPECT_TRUE(
+      preCopySyncPos < o00Pos && preCopySyncPos < o10Pos &&
+      preCopySyncPos < o20Pos && preCopySyncPos < o30Pos)
+      << "expected synchronization before copies to registers";
+
+  o00Pos = code.find("_O_0[0][0]", postCopySyncPos);
+  o10Pos = code.find("_O_0[1][0]", postCopySyncPos);
+  o20Pos = code.find("_O_0[2][0]", postCopySyncPos);
+  o30Pos = code.find("_O_0[3][0]", postCopySyncPos);
+  EXPECT_TRUE(
+      o00Pos == std::string::npos && o10Pos == std::string::npos &&
+      o20Pos == std::string::npos && o30Pos == std::string::npos)
+      << "expected synchronization after copies from registers";
 }
 };
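The test relies on comparing positions returned by `std::string::find` in the generated code. A standalone sketch of the same idea is given below. The helper names `syncBeforeFirst` and `syncAfterLast` are hypothetical. Unlike the test, which starts its search at `oDeclPos`, the sketch searches from the beginning of the string and uses `rfind` to locate the last use of the token.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `sync` occurs in `code` before the first
// occurrence of `token`.
bool syncBeforeFirst(
    const std::string& code,
    const std::string& token,
    const std::string& sync = "__syncthreads()") {
  auto tokenPos = code.find(token);
  auto syncPos = code.find(sync);
  return tokenPos != std::string::npos && syncPos != std::string::npos &&
      syncPos < tokenPos;
}

// Hypothetical helper: true if `sync` occurs in `code` after the last
// occurrence of `token`.
bool syncAfterLast(
    const std::string& code,
    const std::string& token,
    const std::string& sync = "__syncthreads()") {
  auto tokenPos = code.rfind(token);
  if (tokenPos == std::string::npos) {
    return false;
  }
  return code.find(sync, tokenPos) != std::string::npos;
}
```

On a toy string in which a barrier precedes the copy into `_O_0[0][0]` and another follows the write-back, both helpers return true. Dropping the trailing barrier makes `syncAfterLast` return false.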
