Skip to content

(feat): Add a straightforward implementation for tile iterator. #50

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 23, 2024
Merged

Conversation

haruhi55
Copy link
Member

@haruhi55 haruhi55 commented Jun 20, 2024

resolve #49

  1. This PR adds implementations for these two lines: https://github.com/haruhi55/TiledCUDA/blob/b31db2aa1420b595f4ac01a792c714cd81053d1e/tests/cpp/cell/test_gemm.cu#L74-L75
  2. You can find potential uses of a shared memory tile iterator in the unit tests.
  3. The current unit tests are not sufficiently meaningful. I plan to add more stringent unit tests to ensure correctness once load/store operations are implemented.
  4. Improve code organizations and interfaces for copy tile from shared memory to register. I plan to add implementations for it in the next PR.

@haruhi55 haruhi55 marked this pull request as draft June 20, 2024 10:32
@haruhi55 haruhi55 marked this pull request as ready for review June 21, 2024 06:50
@haruhi55 haruhi55 requested a review from KuangjuX June 21, 2024 07:00

using NewTile = SharedTile<typename Tile::DType, TileLayout>;
using Iter = SharedTileIterator<NewTile, ChunkShape>;
static_assert(Iter::sc0 == 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this means that the step size of the rows is an integer multiple of the step size of the columns?

Copy link
Member Author

@haruhi55 haruhi55 Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current rules for indexing and slicing an iterator are designed as follows:

Suppose there is a 2D grid composed of several sub-tiles like this. An iterator can iterate over these sub-tiles using logical array indices.

|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|

tiles(0, _) will return a 1D Iterator like this:

|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|

tiles(1, _) will return a 1D Iterator like this:

|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|

Therefore, line 97 checks if the strip count of the first dimension is equal to 1.

The iterator is used to iterate over 2D grids of tiles, and it is a simple wrapper that transforms the logical array index into an offset to the physical address. It modifies the descriptor (including (1) re-computing the layout for the data that a returned iterator or tile covers, and (2) advancing the pointer of the starting position of the returned data.) of the addressing space of the physical memory when it is indexed or sliced:

  1. When a 2D iterator is indexed using a 2D array index, a tile is returned.
  2. If a 2D iterator is sliced, a 1D iterator is returned; thus, the returned value can be indexed using a 1D array index.
  3. If the 2D iterator has any dimension equal to 1, it can be indexed using a 1D array index.

Copy link
Member Author

@haruhi55 haruhi55 Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential issue is that, instead of having the user compute the physical address manually, addressing using an Iterator introduces its own implementation overhead. However, I am not sure how significant this overhead will be. This may not be a primary consideration at the moment, but just mention it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current rules for indexing and slicing an iterator are designed as follows:

Suppose there is a 2D grid composed of several sub-tiles like this. An iterator can iterate over these sub-tiles using logical array indices.

|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|

tiles(0, _) will return a 1D Iterator like this:

|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|

tiles(1, _) will return a 1D Iterator like this:

|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|

Therefore, line 97 checks if the strip count of the first dimension is equal to 1.

The iterator is used to iterate over 2D grids of tiles, and it is a simple wrapper that transforms the logical array index into an offset to the physical address. It modifies the descriptor (including (1) re-computing the layout for the data that a returned iterator or tile covers, and (2) advancing the pointer of the starting position of the returned data.) of the addressing space of the physical memory when it is indexed or sliced:

  1. When a 2D iterator is indexed using a 2D array index, a tile is returned.
  2. If a 2D iterator is sliced, a 1D iterator is returned; thus, the returned value can be indexed using a 1D array index.
  3. If the 2D iterator has any dimension equal to 1, it can be indexed using a 1D array index.

I see, that's a clear explanation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential issue is that, instead of having the user compute the physical address manually, addressing using an Iterator introduces its own implementation overhead. However, I am not sure how significant this overhead will be. This may not be a primary consideration at the moment, but just mention it.

Yes, this is not the primary consideration at the moment. In fact, good abstraction can introduce some overhead, but a small overhead is negligible.

printf("Iterate over rows.\n\n");
for (int i = 0; i < Iterator::sc0; ++i) {
printf("Iteration-[%d, _]:\n", i);
tiles(i, _).to_tile().dump_value();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slicing a 2D iterator will return a 2D iterator with one dimension reduced to 1, resulting in a 1D iterator. The to_tile function can then flatten the iterator into a large tile.

for (int i = 0; i < Iterator::sc0; ++i) {
for (int j = 0; j < Iterator::sc1; ++j) {
printf("Iteration-[%d, %d]:\n", i, j);
tiles(i, j).dump_value();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexing a 2D iterator with a 2D array index returns a Tile.

printf("\n");
for (int j = 0; j < decltype(cols)::sc1; ++j) {
printf("Iteration-[%d, %d]:\n", i, j);
cols(j).dump_value();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slicing a 2D iterator will return a 2D iterator with one dimension reduced to 1, resulting in a 1D iterator. This 1D iterator can be indexed using a 1D array index.

In the current implementation, a 2D iterator with any of its dimensions being 1 can be indexed using a 1D index.

@@ -1,4 +1,4 @@
#include "cell/copy/mod.hpp"
#include "cell/copy/dyn_copy.hpp"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why such modifications are necessary in order for it to compile successfully?

Copy link
Member Author

@haruhi55 haruhi55 Jun 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The include order in copy/mod.hpp is as follows:

#include "cell/copy/copy.hpp"
#include "cell/copy/dyn_copy.hpp"
#include "cell/copy/static_copy.hpp"

copy/copy.hpp is before dyn_copy.hpp.

copy.hpp includes <cute/algorithm/copy.hpp>:

#include <cute/algorithm/copy.hpp>

and dyn_copy.hpp include <cute/tensor.hpp>

#include <cute/tensor.hpp>

In a conclusion, <cute/algorithm/copy.hpp> is included before <cute/tensor.hpp>. When I upgrade g++ into 10.5.0., it complains that:

error: namespace "cute::detail" has no member "is_prefetch"

Similar issues can be found in: NVIDIA/cutlass#1508 and NVIDIA/cutlass#1484

I am not sure why the compilation is successful in g++ 9.4.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this might be a bug in CuTe, where they haven't handled the dependencies between header files very well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The include order in copy/mod.hpp is as follows:

#include "cell/copy/copy.hpp"
#include "cell/copy/dyn_copy.hpp"
#include "cell/copy/static_copy.hpp"

copy/copy.hpp is before dyn_copy.hpp.

copy.hpp includes <cute/algorithm/copy.hpp>:

#include <cute/algorithm/copy.hpp>

and dyn_copy.hpp include <cute/tensor.hpp>

#include <cute/tensor.hpp>

In a conclusion, <cute/algorithm/copy.hpp> is included before <cute/tensor.hpp>. When I upgrade g++ into 10.5.0., it complains that:

error: namespace "cute::detail" has no member "is_prefetch"

Similar issues can be found in: NVIDIA/cutlass#1508 and NVIDIA/cutlass#1484

I am not sure why the compilation is successful in g++ 9.4.

So does that mean when using copy-related functions later, we won't be able to directly include all the copy-related header files, but can only include one specific header file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect things won't be that bad. I suggest leaving this PR to be merged later. I can examine the include order to make it clearer.

@@ -2,7 +2,7 @@

#include "cuda_utils.hpp"

#include <cute/algorithm/copy.hpp>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KuangjuX I've started cleaning up the include relationships. As we continue to refine the copy implementations, this should help resolve the include order issues more effectively. The current separation into static copy, dynamic copy, and copy is somewhat redundant.

@haruhi55 haruhi55 merged commit afb0092 into TiledTensor:master Jun 23, 2024
@haruhi55 haruhi55 deleted the iter branch June 23, 2024 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a straightforward implementation for TileIterator
2 participants