Skip to content

Conversation

@lucbv
Copy link
Contributor

@lucbv lucbv commented Oct 16, 2025

Pushing some preliminary changes based of the previous work by @yasahi-hpc
The goal is to isolate the failure that ensue in Trilinos' BTDS solver.
The current changes are sufficient to make the BTDS fail which might indicate that the underlying issue is a bad usage of Kokkos View?

@lucbv lucbv self-assigned this Oct 16, 2025
@lucbv lucbv added the Cleanup Code maintenance that isn't a bugfix or new feature label Oct 16, 2025
Copy link
Contributor

@yasahi-hpc yasahi-hpc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucbv

Thanks a lot for your help.
I think we can update the extent computation as attempted in #2667
I am not sure the correct (or expected) behavior for the stride computation

v.stride(r) for r >= View::rank()

Comment on lines 761 to 693
if (r == 0) {
int V_extent_0 = V_rank == 0 ? 0 : v.extent_int(0);
return V_extent_0;
} else if (r == 1) {
int V_extent_1 = V_rank == 0 ? 0 : V_rank == 1 ? 1 : v.extent_int(1);
return V_extent_1;
} else {
return -1;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (r == 0) {
int V_extent_0 = V_rank == 0 ? 0 : v.extent_int(0);
return V_extent_0;
} else if (r == 1) {
int V_extent_1 = V_rank == 0 ? 0 : V_rank == 1 ? 1 : v.extent_int(1);
return V_extent_1;
} else {
return -1;
}
}
if (r == 0) {
int V_extent_0 = V_rank == 0 ? 1 : v.extent_int(0);
return V_extent_0;
} else if (r == 1) {
int V_extent_1 = V_rank <= 1 ? 1 : v.extent_int(1);
return V_extent_1;
} else {
return 1;
}

I think extent would be computed like this.

@ndellingwood
Copy link
Contributor

ndellingwood commented Oct 21, 2025

I tested this PR with/without Yuichi's suggestion, gcc/11.3, OpenMPI/4.1.5 Serial build with the new View enabled, I'm still seeing the test failures:

[ndellin@blake01 build]$ ctest -R Ifpack2_BlockTriDiContainer_
Test project /home/ndellin/trilinos/Trilinos-1/Build/Blake-serial-gcc-1130-ompi-415/build
    Start 700: Ifpack2_BlockTriDiContainer_BTD_UnitTests_MPI_4
1/2 Test #700: Ifpack2_BlockTriDiContainer_BTD_UnitTests_MPI_4 ......***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
]  0.94 sec
    Start 701: Ifpack2_BlockTriDiContainer_Jacobi_UnitTests_MPI_4
2/2 Test #701: Ifpack2_BlockTriDiContainer_Jacobi_UnitTests_MPI_4 ...***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
]  1.05 sec

0% tests passed, 2 tests failed out of 2

Subproject Time Summary:
Ifpack2    =   7.99 sec*proc (2 tests)

Total Test time (real) =   2.57 sec

The following tests FAILED:
	700 - Ifpack2_BlockTriDiContainer_BTD_UnitTests_MPI_4 (Failed)
	701 - Ifpack2_BlockTriDiContainer_Jacobi_UnitTests_MPI_4 (Failed)
Errors while running CTest

Tested with:
kokkos/kokkos@4747e23
kernels with PR #2829 +/- Yuichi's suggestion
trilinos/Trilinos@69d24c6

@yasahi-hpc
Copy link
Contributor

@ndellingwood Thank you for the information
Then, it would be related to the stride computation
Concerning the correct implementation in #2593

The stride computation may be like this?

template <typename ViewType>
KOKKOS_INLINE_FUNCTION std::size_t get_stride(const ViewType &v, const int r) {
  static_assert(Kokkos::is_view_v<ViewType>, "KokkosBatched: ViewType is not a Kokkos::View.");
  constexpr std::size_t V_rank = ViewType::rank();
  static_assert(V_rank <= 2, "KokkosBatched: ViewType must have rank 0, 1 or 2.");

  if (r == 0) {
    std::size_t V_stride_0 = V_rank == 0 ? 1 : v.stride(0);
    return V_stride_0;
  } else if (r == 1) {
    std::size_t V_stride_1 = V_rank <= 1 ? 1 : v.stride(1);
    return V_stride_1;
  } else {
    return 1;
  }
}

@ndellingwood
Copy link
Contributor

Cross-referencing issue #2622

@ndellingwood
Copy link
Contributor

ndellingwood commented Oct 28, 2025

I retested as a sanity check

  • Serial build with SKX enabled had the failures, as posted above
  • Serial build with SKX disabled passed!
  • Cuda build passed!

All jobs above enabled the new View impl and complex types

@ndellingwood
Copy link
Contributor

I rebuilt with @yasahi-hpc suggestion (thank you!) , and the Serial job with SKX still exhibits the failures

@ndellingwood
Copy link
Contributor

I retested Trilinos@develop (containing the 4.7.01 releases as packages) in the Serial build with SKX and the legacy View, and the Ifpack2_BlockTriDiContainer* tests failed in that case as well, looks like we encountered a combo of issues

To move forward, it looks like we need

  1. Resolve merge conflicts
  2. DCO signoff
  3. Whether to include the extent and stride modifications from the comments
  4. CI

lucbv added 3 commits October 28, 2025 17:51
)"

This reverts commit 82605a9.

Signed-off-by: Luc Berger-Vergiat <lberge@sandia.gov>
This reverts commit 386663c.

Signed-off-by: Luc Berger-Vergiat <lberge@sandia.gov>
We have been liberally querrying the extent and stride of Views
without checking if the rank of the view is high enough to return
a valid value. The fixes lead to failures in BTDS which might point
to a bug that relied on the old UB behavior of Kokkos::View x(

Signed-off-by: Luc Berger-Vergiat <lberge@sandia.gov>
@lucbv lucbv force-pushed the ifpack2_btds_gemm branch from 337b1f5 to c301882 Compare October 28, 2025 23:52
Signed-off-by: Luc Berger-Vergiat <lberge@sandia.gov>
@yasahi-hpc
Copy link
Contributor

@ndellingwood @lucbv

Thanks a lot for your help.
Glad to hear that tests pass without SKX option.
Let us keep in touch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cleanup Code maintenance that isn't a bugfix or new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants