This repository was archived by the owner on Apr 28, 2023. It is now read-only.

Commit 62fa601

Author: Sven Verdoolaege
insert reduction synchronization outside thread mapping
Regular synchronization should never appear underneath a thread mapping since the synchronization should be performed by all threads and the mapping to threads may leave some thread instances unmapped.

Inserting reduction synchronization was apparently deemed safe because the partial tile separation makes sure only complete blocks are mapped to reductions. However, by having the synchronization inside the mapping, the isl AST generator may generate tests outside this synchronization that involve thread identifiers (even if it is known to the user that those same conditions could be represented without involving thread identifiers, in combination with other constraints in the code). Insert the synchronization outside the mapping to prevent this from happening. This also means that the reduction member no longer needs to be split off, such that the thread mapping now always corresponds to a single band.

Note that while the partial tile separation makes sure that only complete blocks are mapped to reductions, multiple such complete blocks may still get mapped by the thread mapping, including in the parallel directions. The current reduction handling does not support this as it stores the partial reductions in a single (per-thread) scalar variable. The band mapped to threads therefore needs to be tiled first such that it contains exactly one complete block in the parallel directions.
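The last point can be illustrated with a toy model. The sketch below is not TC code: `rowSums`, `threadsY`, and `tileRows` are hypothetical names, threads are simulated by an ordinary loop, and the number of rows is assumed to be a multiple of `threadsY` (complete blocks only). It shows how a single per-thread scalar mixes partial results when one thread covers several blocks in a parallel direction, and how an outer loop over blocks avoids this.

```cpp
#include <cstddef>
#include <vector>

// Toy model (assumption, not TC code): sum each row of a matrix using
// `threadsY` simulated threads in the parallel direction, where each
// thread owns exactly one scalar accumulator.
// Assumes m.size() is a multiple of threadsY (only complete blocks).
std::vector<int> rowSums(const std::vector<std::vector<int>>& m,
                         std::size_t threadsY, bool tileRows) {
  const std::size_t rows = m.size();
  std::vector<int> out(rows, 0);
  if (tileRows) {
    // Outer loop over blocks of threadsY rows; the "thread mapping" below
    // then covers exactly one block, so one scalar per thread suffices.
    for (std::size_t base = 0; base < rows; base += threadsY) {
      for (std::size_t ty = 0; ty < threadsY; ++ty) {
        int acc = 0; // per-thread scalar, fresh for every block
        for (int v : m[base + ty]) {
          acc += v;
        }
        out[base + ty] = acc;
      }
    }
  } else {
    // The "thread mapping" covers several blocks: thread ty handles rows
    // ty, ty + threadsY, ... with the same scalar, so the partial
    // reductions of different rows are mixed together.
    for (std::size_t ty = 0; ty < threadsY; ++ty) {
      int acc = 0; // single per-thread scalar shared by all its rows
      for (std::size_t row = ty; row < rows; row += threadsY) {
        for (int v : m[row]) {
          acc += v;
        }
      }
      for (std::size_t row = ty; row < rows; row += threadsY) {
        out[row] = acc; // wrong whenever the thread handled more than one row
      }
    }
  }
  return out;
}
```

With a 4x2 matrix and `threadsY = 2`, the tiled variant produces one sum per row, while the untiled variant writes the combined sum of rows 0 and 2 (and of rows 1 and 3) to both rows.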
1 parent 9c29c10 commit 62fa601

2 files changed: +30 -12 lines changed

tc/core/polyhedral/cuda/mapped_scop.cc

Lines changed: 25 additions & 9 deletions
@@ -372,7 +372,8 @@ size_t MappedScop::mapToThreads(detail::ScheduleTree* band) {
     bandSplit(scop_->scheduleRoot(), band, nCanMap - nMappedThreads);
     auto child = band->child({0});
     if (isReduction) {
-      // Update reductionBandUpdates_ such that splitOutReductionAndInsertSyncs
+      // Update reductionBandUpdates_ such that
+      // splitOutReductionTileAndInsertSyncs
       // can find the information it needs.
       reductionBandUpdates_.emplace(child, reductionBandUpdates_.at(band));
       reductionBandUpdates_.erase(band);
@@ -387,12 +388,12 @@ size_t MappedScop::mapToThreads(detail::ScheduleTree* band) {
 
   CHECK_GT(nMappedThreads, 0u) << "not mapping to threads";
 
-  mapThreadsBackward(band);
-
   if (isReduction) {
-    splitOutReductionAndInsertSyncs(band);
+    band = splitOutReductionTileAndInsertSyncs(band);
   }
 
+  mapThreadsBackward(band);
+
   return numThreads.view.size();
 }
 
@@ -946,17 +947,32 @@ std::tuple<std::string, tc::Grid, tc::Block> MappedScop::codegen(
       mappedScopForCodegen->numThreads);
 }
 
-// Split out reduction member in "band" and
-// insert reduction synchronizations outside this split off band.
-void MappedScop::splitOutReductionAndInsertSyncs(
+// Split out a single reduction tile (in the directions other than
+// the reduction) and insert reduction synchronizations outside this tile.
+// Return a pointer to the split off tile.
+detail::ScheduleTree* MappedScop::splitOutReductionTileAndInsertSyncs(
     detail::ScheduleTree* band) {
   using namespace polyhedral::detail;
   size_t n = numThreads.view.size();
 
-  auto tree = bandSplitOut(scop_->scheduleRoot(), band, n - 1);
+  // The current band contains only full blocks.
+  // Split off a band that iterates over these blocks,
+  // such that only a single block gets mapped to thread identifiers.
+  // The mapping to thread identifier X is allowed to iterate
+  // over multiple blocks, so this direction is not tiled.
+  std::vector<size_t> sizes(n);
+  for (size_t i = 1; i < n; ++i) {
+    sizes[n - 1 - i] = numThreads.view[i];
+  }
+  sizes[n - 1] = 0;
+  bandTile(band, sizes, TileOptions::ScaleTileLoops);
+
+  // Insert synchronization outside the single block.
+  auto child = band->child({0});
   for (auto updateId : reductionBandUpdates_.at(band).ids) {
-    scop_->insertReductionSync1D(tree, updateId);
+    scop_->insertReductionSync1D(child, updateId);
   }
+  return child;
 }
 
 std::unique_ptr<MappedScop> MappedScop::makeWithOuterBlockInnerThreadStrategy(
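
The tile-size computation in the hunk above can be tried in isolation. The following is a minimal standalone sketch, not the TC implementation: `reductionTileSizes` is a hypothetical helper and a plain `std::vector<std::size_t>` stands in for `numThreads.view`. Judging from the comments in the diff, the thread sizes are applied to band members in reverse order, and a size of 0 appears to mark the innermost member (the one mapped to thread identifier X) as left untiled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the loop in the diff: band member
// n - 1 - i receives the block size of thread dimension i, for i >= 1.
// The last member (mapped to thread identifier X) keeps size 0, which
// the diff comment describes as "this direction is not tiled".
std::vector<std::size_t> reductionTileSizes(
    const std::vector<std::size_t>& numThreads) {
  const std::size_t n = numThreads.size();
  std::vector<std::size_t> sizes(n, 0); // sizes[n - 1] stays 0
  for (std::size_t i = 1; i < n; ++i) {
    sizes[n - 1 - i] = numThreads[i];
  }
  return sizes;
}
```

For example, block sizes {32, 4, 2} (x, y, z) would give tile sizes {2, 4, 0}: the two outer band members are tiled by the z and y block sizes, and the innermost member is left alone.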

tc/core/polyhedral/cuda/mapped_scop.h

Lines changed: 5 additions & 3 deletions
@@ -182,9 +182,11 @@ class MappedScop {
  private:
   // Insert the optimal combination of synchronizations in the sequence
   void insertBestSyncInSeq(detail::ScheduleTree* seq);
-  // Split out reduction member in "band" and
-  // insert reduction synchronizations.
-  void splitOutReductionAndInsertSyncs(detail::ScheduleTree* band);
+  // Split out a single reduction tile (in the directions other than
+  // the reduction) and insert reduction synchronizations.
+  // Return a pointer to the split off tile.
+  detail::ScheduleTree* splitOutReductionTileAndInsertSyncs(
+      detail::ScheduleTree* band);
   // Map "band" to thread identifiers using as many blockSizes values as outer
   // coincident dimensions (plus reduction dimension, if any),
   // insert synchronization in case of a reduction, and
