-
Notifications
You must be signed in to change notification settings - Fork 11
(feat): Add a straightforward implementation for tile iterator. #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
using NewTile = SharedTile<typename Tile::DType, TileLayout>; | ||
using Iter = SharedTileIterator<NewTile, ChunkShape>; | ||
static_assert(Iter::sc0 == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this means that the step size of the rows is an integer multiple of the step size of the columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current rules for indexing and slicing an iterator are designed as follows:
Suppose there is a 2D grid composed of several sub-tiles like this. An iterator
can iterate over these sub-tiles using logical array indices.
|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|
tiles(0, _)
will return a 1D Iterator
like this:
|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|
tiles(1, _)
will return a 1D Iterator
like this:
|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|
Therefore, line 97 checks if the strip count of the first dimension is equal to 1.
The iterator
is used to iterate over 2D grids of tiles
, and it is a simple wrapper that transforms the logical array index into an offset to the physical address. It modifies the descriptor (including (1) re-computing the layout for the data that a returned iterator or tile covers, and (2) advancing the pointer of the starting position of the returned data.) of the addressing space of the physical memory when it is indexed or sliced:
- When a 2D
iterator
is indexed using a 2D array index, a tile is returned. - If a 2D
iterator
is sliced, a 1Diterator
is returned; thus, the returned value can be indexed using a 1D array index. - If the 2D iterator has any dimension equal to 1, it can be indexed using a 1D array index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential issue is that, instead of having the user compute the physical address manually, addressing using an Iterator
introduces its own implementation overhead. However, I am not sure how significant this overhead will be. This may not be a primary consideration at the moment, but just mention it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current rules for indexing and slicing an iterator are designed as follows:
Suppose there is a 2D grid composed of several sub-tiles like this. An
iterator
can iterate over these sub-tiles using logical array indices.|--|---------|---------|---------| |0 |sub-tile0|sub-tile1|sub-tile2| |--|---------|---------|---------| |1 |sub-tile3|sub-tile4|sub-tile5| |--|---------|---------|---------|
tiles(0, _)
will return a 1DIterator
like this:|--|---------|---------|---------| |0 |sub-tile0|sub-tile1|sub-tile2| |--|---------|---------|---------|
tiles(1, _)
will return a 1DIterator
like this:|--|---------|---------|---------| |1 |sub-tile3|sub-tile4|sub-tile5| |--|---------|---------|---------|
Therefore, line 97 checks if the strip count of the first dimension is equal to 1.
The
iterator
is used to iterate over 2D grids oftiles
, and it is a simple wrapper that transforms the logical array index into an offset to the physical address. It modifies the descriptor (including (1) re-computing the layout for the data that a returned iterator or tile covers, and (2) advancing the pointer of the starting position of the returned data.) of the addressing space of the physical memory when it is indexed or sliced:
- When a 2D
iterator
is indexed using a 2D array index, a tile is returned.- If a 2D
iterator
is sliced, a 1Diterator
is returned; thus, the returned value can be indexed using a 1D array index.- If the 2D iterator has any dimension equal to 1, it can be indexed using a 1D array index.
I see, that's a clear explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential issue is that, instead of having the user compute the physical address manually, addressing using an
Iterator
introduces its own implementation overhead. However, I am not sure how significant this overhead will be. This may not be a primary consideration at the moment, but just mention it.
Yes, this is not the primary consideration at the moment. In fact, good abstraction can introduce some overhead, but a small overhead is negligible.
printf("Iterate over rows.\n\n"); | ||
for (int i = 0; i < Iterator::sc0; ++i) { | ||
printf("Iteration-[%d, _]:\n", i); | ||
tiles(i, _).to_tile().dump_value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slicing a 2D iterator will return a 2D iterator with one dimension reduced to 1, resulting in a 1D iterator. The to_tile
function can then flatten the iterator into a large tile.
for (int i = 0; i < Iterator::sc0; ++i) { | ||
for (int j = 0; j < Iterator::sc1; ++j) { | ||
printf("Iteration-[%d, %d]:\n", i, j); | ||
tiles(i, j).dump_value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indexing a 2D iterator with a 2D array index returns a Tile
.
printf("\n"); | ||
for (int j = 0; j < decltype(cols)::sc1; ++j) { | ||
printf("Iteration-[%d, %d]:\n", i, j); | ||
cols(j).dump_value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slicing a 2D iterator will return a 2D iterator with one dimension reduced to 1, resulting in a 1D iterator. This 1D iterator can be indexed using a 1D array index.
In the current implementation, a 2D iterator with any of its dimensions being 1 can be indexed using a 1D index.
tests/cpp/cell/test_g2s_copy.cu
Outdated
@@ -1,4 +1,4 @@ | |||
#include "cell/copy/mod.hpp" | |||
#include "cell/copy/dyn_copy.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why such modifications are necessary in order for it to compile successfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The include order in copy/mod.hpp
is as follows:
TiledCUDA/include/cell/copy/mod.hpp
Lines 3 to 5 in b31db2a
#include "cell/copy/copy.hpp" | |
#include "cell/copy/dyn_copy.hpp" | |
#include "cell/copy/static_copy.hpp" |
copy/copy.hpp
is before dyn_copy.hpp
.
copy.hpp
includes <cute/algorithm/copy.hpp>
:
TiledCUDA/include/cell/copy/copy.hpp
Line 5 in b31db2a
#include <cute/algorithm/copy.hpp> |
and dyn_copy.hpp
include <cute/tensor.hpp>
#include <cute/tensor.hpp> |
In a conclusion, <cute/algorithm/copy.hpp>
is included before <cute/tensor.hpp>
. When I upgrade g++ into 10.5.0., it complains that:
error: namespace "cute::detail" has no member "is_prefetch"
Similar issues can be found in: NVIDIA/cutlass#1508 and NVIDIA/cutlass#1484
I am not sure why the compilation is successful in g++ 9.4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this might be a bug in CuTe, where they haven't handled the dependencies between header files very well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The include order in
copy/mod.hpp
is as follows:TiledCUDA/include/cell/copy/mod.hpp
Lines 3 to 5 in b31db2a
#include "cell/copy/copy.hpp" #include "cell/copy/dyn_copy.hpp" #include "cell/copy/static_copy.hpp"
copy/copy.hpp
is beforedyn_copy.hpp
.
copy.hpp
includes<cute/algorithm/copy.hpp>
:TiledCUDA/include/cell/copy/copy.hpp
Line 5 in b31db2a
#include <cute/algorithm/copy.hpp> and
dyn_copy.hpp
include<cute/tensor.hpp>
#include <cute/tensor.hpp> In a conclusion,
<cute/algorithm/copy.hpp>
is included before<cute/tensor.hpp>
. When I upgrade g++ into 10.5.0., it complains that:error: namespace "cute::detail" has no member "is_prefetch"Similar issues can be found in: NVIDIA/cutlass#1508 and NVIDIA/cutlass#1484
I am not sure why the compilation is successful in g++ 9.4.
So does that mean when using copy-related functions later, we won't be able to directly include all the copy-related header files, but can only include one specific header file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect things won't be that bad. I suggest leaving this PR to be merged later. I can examine the include order to make it clearer.
@@ -2,7 +2,7 @@ | |||
|
|||
#include "cuda_utils.hpp" | |||
|
|||
#include <cute/algorithm/copy.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KuangjuX I've started cleaning up the include relationships. As we continue to refine the copy implementations, this should help resolve the include order issues more effectively. The current separation into static copy, dynamic copy, and copy is somewhat redundant.
resolve #49