(feat): Add a straightforward implementation for tile iterator. #50

haruhi55 · 2024-06-20T10:31:57Z

resolve #49

This PR adds implementations for these two lines: https://github.com/haruhi55/TiledCUDA/blob/b31db2aa1420b595f4ac01a792c714cd81053d1e/tests/cpp/cell/test_gemm.cu#L74-L75
You can find potential uses of a shared memory tile iterator in the unit tests.
The current unit tests are not sufficiently meaningful. I plan to add more stringent unit tests to ensure correctness once load/store operations are implemented.
Improve code organization and interfaces for copying a tile from shared memory to registers. I plan to add the implementation in the next PR.

KuangjuX · 2024-06-22T01:51:26Z

include/types/tile_iterator.hpp

+
+        using NewTile = SharedTile<typename Tile::DType, TileLayout>;
+        using Iter = SharedTileIterator<NewTile, ChunkShape>;
+        static_assert(Iter::sc0 == 1);


I wonder if this means that the step size of the rows is an integer multiple of the step size of the columns?

The current rules for indexing and slicing an iterator are designed as follows:

Suppose there is a 2D grid composed of several sub-tiles like this. An iterator can iterate over these sub-tiles using logical array indices.

|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|

tiles(0, _) will return a 1D Iterator like this:

|--|---------|---------|---------|
|0 |sub-tile0|sub-tile1|sub-tile2|
|--|---------|---------|---------|

tiles(1, _) will return a 1D Iterator like this:

|--|---------|---------|---------|
|1 |sub-tile3|sub-tile4|sub-tile5|
|--|---------|---------|---------|

Therefore, line 97 (the static_assert on Iter::sc0 above) checks that the strip count of the first dimension of the sliced iterator is equal to 1.

The iterator is used to iterate over 2D grids of tiles. It is a simple wrapper that transforms a logical array index into an offset from the physical address. When it is indexed or sliced, it modifies the descriptor of the addressing space of the physical memory by (1) re-computing the layout for the data that the returned iterator or tile covers, and (2) advancing the pointer to the starting position of the returned data:

When a 2D iterator is indexed using a 2D array index, a tile is returned.

If a 2D iterator is sliced, a 1D iterator is returned; thus, the returned value can be indexed using a 1D array index.

If the 2D iterator has any dimension equal to 1, it can be indexed using a 1D array index.
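
To make the three rules concrete, here is a minimal host-side sketch. It is not the SharedTileIterator from this PR: every name in it (sketch::TileView, sketch::TileIterator, the runtime sc0/sc1 fields, the row-major layout with a single row stride) is an assumption chosen for illustration. It only shows how indexing and slicing can be expressed as "advance the pointer and re-compute the extent".

```cpp
#include <cassert>

namespace sketch {  // hypothetical names, not TiledCUDA's API

struct Underscore {};
constexpr Underscore _{};  // placeholder used to slice a whole dimension

// Descriptor of a row-major region: start pointer, extent, and row stride.
struct TileView {
    const float* data;
    int rows, cols;
    int row_stride;
    float at(int r, int c) const { return data[r * row_stride + c]; }
};

// Iterates over an sc0 x sc1 grid of sub-tiles, each chunk_rows x chunk_cols.
struct TileIterator {
    const float* data;
    int sc0, sc1;                // number of sub-tiles along each dimension
    int chunk_rows, chunk_cols;  // shape of one sub-tile
    int row_stride;              // elements between consecutive rows

    // Rule 1: a 2D index returns a tile.
    TileView operator()(int i, int j) const {
        assert(i >= 0 && i < sc0 && j >= 0 && j < sc1);
        const float* p = data + i * chunk_rows * row_stride + j * chunk_cols;
        return {p, chunk_rows, chunk_cols, row_stride};
    }

    // Rule 2: slicing returns an iterator with one dimension reduced to 1.
    TileIterator operator()(int i, Underscore) const {
        assert(i >= 0 && i < sc0);
        return {data + i * chunk_rows * row_stride, 1, sc1,
                chunk_rows, chunk_cols, row_stride};
    }
    TileIterator operator()(Underscore, int j) const {
        assert(j >= 0 && j < sc1);
        return {data + j * chunk_cols, sc0, 1,
                chunk_rows, chunk_cols, row_stride};
    }

    // Rule 3: a 1D index is accepted only when one dimension equals 1.
    TileView operator()(int k) const {
        assert(sc0 == 1 || sc1 == 1);
        return sc0 == 1 ? (*this)(0, k) : (*this)(k, 0);
    }

    // Flatten everything the iterator covers into one large tile.
    TileView to_tile() const {
        return {data, sc0 * chunk_rows, sc1 * chunk_cols, row_stride};
    }
};

}  // namespace sketch
```

With a 4 x 6 row-major buffer split into a 2 x 3 grid of 2 x 2 sub-tiles, tiles(1, 2) starts at element offset 16, and tiles(1, _) is a 1 x 3 iterator whose to_tile() covers rows 2 and 3 of the buffer.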

A potential issue is that, instead of having the user compute the physical address manually, addressing through an Iterator introduces its own implementation overhead. I am not sure how significant this overhead will be. It may not be a primary consideration at the moment, but I wanted to mention it.

I see, that's a clear explanation.

Yes, this is not the primary consideration at the moment. In fact, good abstraction can introduce some overhead, but a small overhead is negligible.

haruhi55 · 2024-06-22T12:12:52Z

tests/cpp/cell/test_tile_iterator.cu

+    printf("Iterate over rows.\n\n");
+    for (int i = 0; i < Iterator::sc0; ++i) {
+        printf("Iteration-[%d, _]:\n", i);
+        tiles(i, _).to_tile().dump_value();


Slicing a 2D iterator will return a 2D iterator with one dimension reduced to 1, resulting in a 1D iterator. The to_tile function can then flatten the iterator into a large tile.

haruhi55 · 2024-06-22T12:13:49Z

tests/cpp/cell/test_tile_iterator.cu

+    for (int i = 0; i < Iterator::sc0; ++i) {
+        for (int j = 0; j < Iterator::sc1; ++j) {
+            printf("Iteration-[%d, %d]:\n", i, j);
+            tiles(i, j).dump_value();


Indexing a 2D iterator with a 2D array index returns a Tile.

haruhi55 · 2024-06-22T12:15:39Z

tests/cpp/cell/test_tile_iterator.cu

+        printf("\n");
+        for (int j = 0; j < decltype(cols)::sc1; ++j) {
+            printf("Iteration-[%d, %d]:\n", i, j);
+            cols(j).dump_value();


Slicing a 2D iterator will return a 2D iterator with one dimension reduced to 1, resulting in a 1D iterator. This 1D iterator can be indexed using a 1D array index.

In the current implementation, a 2D iterator with any of its dimensions being 1 can be indexed using a 1D index.
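
The test above reads the strip counts through the type (Iterator::sc0, decltype(cols)::sc1), which suggests they are compile-time constants. A minimal sketch of that idea follows, under the assumption that the grid shape is carried as template parameters. StaticTileIterator and its parameters are hypothetical names, and the 1D index returns a raw pointer to the selected sub-tile instead of a tile type to keep the example short.

```cpp
#include <cassert>

namespace static_sketch {  // hypothetical names, not TiledCUDA's API

struct Underscore {};
constexpr Underscore _{};

// Iterator whose grid shape (Sc0 x Sc1) is part of its type.
template <int Sc0, int Sc1, int ChunkRows, int ChunkCols, int RowStride>
struct StaticTileIterator {
    static constexpr int sc0 = Sc0;
    static constexpr int sc1 = Sc1;

    const float* data;

    // Slicing a row keeps sc1 and reduces sc0 to 1 in the returned type.
    StaticTileIterator<1, Sc1, ChunkRows, ChunkCols, RowStride>
    operator()(int i, Underscore) const {
        return {data + i * ChunkRows * RowStride};
    }

    // 1D indexing, rejected at compile time unless one dimension is 1.
    // Returns a pointer to the first element of the selected sub-tile.
    const float* operator()(int k) const {
        static_assert(Sc0 == 1 || Sc1 == 1,
                      "1D indexing requires an iterator with a dimension of 1");
        return Sc0 == 1 ? data + k * ChunkCols
                        : data + k * ChunkRows * RowStride;
    }
};

}  // namespace static_sketch
```

Because the static_assert sits inside a member of a class template, it only fires when the 1D operator() is actually used on a type that has no dimension equal to 1; a full 2 x 3 iterator can still be declared and sliced.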

KuangjuX · 2024-06-23T03:52:31Z

tests/cpp/cell/test_g2s_copy.cu

@@ -1,4 +1,4 @@
-#include "cell/copy/mod.hpp"
+#include "cell/copy/dyn_copy.hpp"


I wonder why such modifications are necessary in order for it to compile successfully?

The include order in copy/mod.hpp is as follows:

TiledCUDA/include/cell/copy/mod.hpp

Lines 3 to 5 in b31db2a

#include "cell/copy/copy.hpp"

#include "cell/copy/dyn_copy.hpp"

#include "cell/copy/static_copy.hpp"

copy/copy.hpp is before dyn_copy.hpp.

copy.hpp includes <cute/algorithm/copy.hpp>:

TiledCUDA/include/cell/copy/copy.hpp

Line 5 in b31db2a

#include <cute/algorithm/copy.hpp>

and dyn_copy.hpp includes <cute/tensor.hpp>:

TiledCUDA/include/cell/copy/dyn_copy.hpp

Line 5 in b31db2a

#include <cute/tensor.hpp>

In conclusion, <cute/algorithm/copy.hpp> is included before <cute/tensor.hpp>. After I upgraded g++ to 10.5.0, the build fails with:

error: namespace "cute::detail" has no member "is_prefetch"

Similar issues can be found in: NVIDIA/cutlass#1508 and NVIDIA/cutlass#1484

I am not sure why the compilation is successful in g++ 9.4.

It seems this might be a bug in CuTe, where they haven't handled the dependencies between header files very well.
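
For readers unfamiliar with how include order can produce a "has no member" error, here is a generic, self-contained illustration. It is not CuTe's code and not a diagnosis of the actual bug: the lib namespace and needs_prefetch are made-up names, and only the trait name is borrowed from the error message. The two blocks stand in for two headers.

```cpp
#include <type_traits>

namespace lib {
namespace detail {
// Stands in for the header that declares the trait.
template <class T>
struct is_prefetch : std::false_type {};
}  // namespace detail

// Stands in for the header that uses the trait. The qualified name
// detail::is_prefetch is looked up here, where the template is defined,
// not later when it is instantiated, so the declaration above must
// already be visible at this point.
template <class T>
constexpr bool needs_prefetch() {
    return detail::is_prefetch<T>::value;
}
}  // namespace lib
```

If the two blocks were swapped, a conforming compiler would reject the use of detail::is_prefetch because no such member of lib::detail has been declared yet. A header that relies on another header having been included first behaves the same way.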

So does that mean when using copy-related functions later, we won't be able to directly include all the copy-related header files, but can only include one specific header file?

I suspect things won't be that bad. I suggest leaving this PR to be merged later. I can examine the include order to make it clearer.

haruhi55 · 2024-06-23T06:07:51Z

include/cell/copy/copy.hpp

@@ -2,7 +2,7 @@

 #include "cuda_utils.hpp"

-#include <cute/algorithm/copy.hpp>


@KuangjuX I've started cleaning up the include relationships. As we continue to refine the copy implementations, this should help resolve the include order issues more effectively. The current separation into static copy, dynamic copy, and copy is somewhat redundant.

Add a straightforward implementation for tile iterator.

6020263

haruhi55 marked this pull request as draft June 20, 2024 10:32

haruhi55 marked this pull request as ready for review June 21, 2024 06:50

haruhi55 requested a review from KuangjuX June 21, 2024 07:00

haruhi55 added 2 commits June 21, 2024 00:15

refine the implementation.

b7bb0e5

refine interfaces for copy from shared to register.

6895977

KuangjuX reviewed Jun 22, 2024


haruhi55 commented Jun 22, 2024


fix disordered includes.

3f67f84

KuangjuX reviewed Jun 23, 2024


KuangjuX approved these changes Jun 23, 2024


haruhi55 commented Jun 23, 2024


clean include relations.

f80c65b

haruhi55 merged commit afb0092 into TiledTensor:master Jun 23, 2024

haruhi55 deleted the iter branch June 23, 2024 06:13
