Skip to content

Test Suite Flag --rdma-mpi Implemented (#598) #878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: master
Choose a base branch
from

Conversation

Malmahrouqi3
Copy link
Collaborator

@Malmahrouqi3 Malmahrouqi3 commented Jun 11, 2025

Description

I added the flag under ./mfc.sh test. It can be invoked with -r or --rdma-mpi
It can be used on any test cases. It is intended to skip tests whenever the rdma_mpi is false or unspecified (false by default).

/rdma_mpi/MFC-mo2$ ./mfc.sh test  -r -o HLLD

Test Summary: 0 passed, 0 failed, 2 skipped.

 Skipped Cases

 1D -> MHD -> HLLD
 2D -> MHD -> HLLD

@Malmahrouqi3 Malmahrouqi3 requested a review from a team as a code owner June 11, 2025 15:50
@Malmahrouqi3 Malmahrouqi3 changed the title added flag --rdma-mpi Test Suite Flag --rdma-mpi Implemented (#598) Jun 11, 2025
Copy link
Member

@sbryngelson sbryngelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an RDMA flag T/F in the case files right now (there's probably an example of this in examples/). What you've added is very useful for testing, but perhaps not for running cases (since RDMA_MPI doesn't need to be known at compile-time, we can just put it in the case file).

For testing (the purpose of this GH issue), it's useful to be able to flip RDMA_MPI on/off from the command line since some compilers/computers support it and some do not.

I would suggest:

  1. Making this flag only an option in the ./mfc.sh test environment
  2. If --rdma-mpi is provided as an argument, then we run three additional tests. They would be copies of the current 2-rank test cases, but have RDMA_MPI = 'T' in the case file. The goldenfiles would be identical. The hashes for the current 2-rank tests are below. Note that if the flag is enabled, we should still run the existing non-RDMA 2-rank tests (if the compiler supports RDMA_MPI then it also supports not using it). Current 2-rank tests:
  0FCCE9F1   1D -> 2 MPI Ranks
  8C7AA13B   2D -> 2 MPI Ranks
  CE232828   3D -> 2 MPI Ranks
  1. If --rdma-mpi flag is not provided, then the new/extra cases are 'skipped'. This is already a feature of the test suite.
  2. This flag should be used in the Frontier CI workflow. ./mfc.sh test -a --rdma-mpi (or whatever) since Frontier supports this feature.

Copy link

codecov bot commented Jun 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.71%. Comparing base (adcc0dd) to head (c0169e7).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #878   +/-   ##
=======================================
  Coverage   43.71%   43.71%           
=======================================
  Files          68       68           
  Lines       18360    18360           
  Branches     2292     2292           
=======================================
  Hits         8026     8026           
  Misses       8945     8945           
  Partials     1389     1389           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Malmahrouqi3
Copy link
Collaborator Author

Malmahrouqi3 commented Jun 11, 2025

Six 2-Rank MPI cases will be created. Pairs with the RDMA enabled and disabled. RDMA-On cases will be run only when the flag is given. Otherwise, they will be written but skipped.

0FCCE9F1 1D -> 2 MPI Ranks
8C7AA13B 2D -> 2 MPI Ranks
CE232828 3D -> 2 MPI Ranks

Example Usage:

./mfc.sh test --rdma-mpi                                        #only MPI Ranks cases.
./mfc.sh test -o Hypoelasticity --rdma-mpi                      #Hypoelasticity cases with the RDMA enabled.
./mfc.sh test -a --rdma-mpi                                     #all tests along with RDMA 2-Ranks MPI cases.

./mfc.sh test --help
  -r, --rdma-mpi        Enable RDMA MPI for tests (default: False)
  --no-build            (Testing) Do not rebuild MFC. (default: False)
  --no-examples         Do not test example cases. (default: False)
  --case-optimization   (GPU Optimization) Compile MFC targets with some case parameters hard-coded. (default: False)
  --dry-run             Build and generate case files but do not run tests. (default: False)
  --generate            (Test Generation) Generate golden files. (default: False)
  --add-new-variables   (Test Generation) If new variables are found in D/ when running tests, add them to the golden files. (default: False)
  --remove-old-tests    (Test Generation) Delete tests directories that are no longer. (default: False)

@Malmahrouqi3
Copy link
Collaborator Author

Just need to update the testing docs for the flag and add ./mfc.sh test -a --rdma-mpi to Frontier CI.

@sbryngelson
Copy link
Member

I'm not sure if this is the case right now, but if one enables the flag --rdma-mpi (i would remove the -r option for it) then it should run all the cases, not only the ones with RDMA MPI. that's what i was trying to suggest.

@Malmahrouqi3
Copy link
Collaborator Author

@sbryngelson
When the flag is enabled, it should run whatever category specified or just all + RDMA cases.

@Malmahrouqi3
Copy link
Collaborator Author

@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)??

@sbryngelson
Copy link
Member

@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier, is RDMA available on (Phoenix, Bridges2, or Delta)??

You can ask the MFC Slack what other computers have compilers that support it. But the Frontier failure appears to have nothing to do with your PR. It's having some network issues.

@Malmahrouqi3
Copy link
Collaborator Author

Working on it right now on Phoenix

Mohammed Said Hamed Humaid Al-Mahrouqi and others added 2 commits June 15, 2025 14:27
@Malmahrouqi3
Copy link
Collaborator Author

Malmahrouqi3 commented Jun 15, 2025

This should do it hopefully. The difference between the two commands is right now 3 tests. I will keep an eye on the test suite
./mfc.sh test -j $(nproc) -a
./mfc.sh test -j $(nproc) -a --rdma-mpi

I removed -r so it only can be run by --rdma-mpi at this moment.

@Malmahrouqi3 Malmahrouqi3 changed the title Test Suite Flag --rdma-mpi Implemented (#598) Test Suite Flag --rdma-mpi Implemented (#598) Jun 15, 2025
@Malmahrouqi3
Copy link
Collaborator Author

Self-Hosted Frontier failed due to irrelevant reasons at all. I will rerun momentarily.

@Malmahrouqi3
Copy link
Collaborator Author

@anandrdbz Self-Hosted Frontier has failed on multiple re-runs for unclear reasons. Do you mind checking it out?

@sbryngelson
Copy link
Member

the problem seems to be that it's trying to use the internet on a compute node (in the test part of the workflow). the compute nodes don't have internet access. it's trying to access silo. I am not sure why it thinks it isn't available, because all of the other PR workflows are working fine... which suggests something specific to this PR

@Malmahrouqi3
Copy link
Collaborator Author

I have not clue to be honest

@sbryngelson
Copy link
Member

frankly me either, your PR doesn't seem to tickle that part of the code. i would try commenting out some of your changes until it works. not an elegant method. probably leave the flag but comment out the extra test cases and how you modify the case files, and make sure that works. if that works, add in how you modify the case files but not the tests themselves.

as i look closer, i'm unsure if how you added the RDMA flag is proper - though i'm not entirely sure. what happens if you try to use it on Phoenix or something? If you run the ./mfc.sh test --dry-run -a and then run ./mfc.sh test --rdma-mpi -a does it try to build silo again? You can just try that out on Phoenix on your own (or whatever computer, for that matter).

@Malmahrouqi3
Copy link
Collaborator Author

on Phoenix it ran just fine, but failed on the RDMA cases only

@Malmahrouqi3
Copy link
Collaborator Author

I will edit frontier template to reflect that no-need for silo. I will wait for the self-hosted tests to pass, then push another change.

@sbryngelson sbryngelson self-requested a review July 1, 2025 17:26
@sbryngelson
Copy link
Member

sbryngelson commented Jul 3, 2025

Hmm something wrong now...

 Failed Cases

 Failed test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
 TestCase.run() got an unexpected keyword argument 'rdma_mpi'
 Failed test tests/1B300F28: 2D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
 TestCase.run() got an unexpected keyword argument 'rdma_mpi'
 Failed test tests/2C9844EF: 3D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
 TestCase.run() got an unexpected keyword argument 'rdma_mpi'

Edit/update: Looks like this has been an error for several commits.

@sbryngelson sbryngelson self-requested a review July 4, 2025 01:57
@Malmahrouqi3
Copy link
Collaborator Author

Failed test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI after 1 attempt(s).
Test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI: Failed to execute MFC.

Prior, it was when run on Phoenix

Failed test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
TestCase.run() got an unexpected keyword argument 'rdma_mpi'

@Malmahrouqi3
Copy link
Collaborator Author

Self Hosted (frontier) were canceled and needs to be re-run whenever convenient. appreciate that.

@sbryngelson
Copy link
Member

yeah frontier keeps failing, I suspect someone has been using the computer for a long time or there's some backlog of jobs. I'm going to look shortly and try to fix it up -- hopefully easy fix.

@sbryngelson sbryngelson self-requested a review July 14, 2025 06:31
@Malmahrouqi3
Copy link
Collaborator Author

Failed Cases

 Failed test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
 Test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI: The golden file does not exist! To generate golden files, use the '--generate' flag.
 Failed test tests/1B300F28: 2D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
 Test tests/1B300F28: 2D -> 2 MPI Ranks -> RDMA MPI: The golden file does not exist! To generate golden files, use the '--generate' flag.
 Failed test tests/2C9844EF: 3D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
 Test tests/2C9844EF: 3D -> 2 MPI Ranks -> RDMA MPI: The golden file does not exist! To generate golden files, use the '--generate' flag.

@sbryngelson
Copy link
Member

Umm, yeah, this makes sense. I would create the goldenfile on an RDMA-capable computer and call it a day. This should work since you 'SKIP' the non-RDMA case when appropriate, or skip RDMA case when appropriate. Unless this is because the goldenfile on the master branch doesn't exist? Or is something else going wrong that I'm not noticing?

@Malmahrouqi3
Copy link
Collaborator Author

yup, I could not find their golden files anywhere.

@sbryngelson
Copy link
Member

In that case, failure is expected. Tests work by comparing the UUID's goldenfile to the results that come out. There is a clean way to generate goldenfiles via test --generate (or something similar). You would only commit the new goldenfiles (associated with the new UUIDs).

Note that the non-RDMA tests have different UUIDs (hashes):

[I] MFC/MFC $ ./mfc.sh test -l -o 2\ MPI\ Ranks
mfc: OK > (venv) Entered the Python 3.11.7 virtual environment (>= 3.9).

      .=++*:          -+*+=.        | sbryngelson@login03 [Linux]
     :+   -*-        ==   =* .      | ---------------------------
   :*+      ==      ++    .+-       |
  :*##-.....:*+   .#%+++=--+=:::.   | --jobs 1
  -=-++-======#=--**+++==+*++=::-:. | --mpi --gpu --no-debug --no-gcov --no-unified --no-single
 .:++=----------====+*= ==..:%..... |
  .:-=++++===--==+=-+=   +.  :=     |
  +#=::::::::=%=. -+:    =+   *:    | ----------------------------------------------------------
 .*=-=*=..    :=+*+:      -...--    | $ ./mfc.sh (build, run, test, clean, count, packer) --help

         MFC Test Cases

    UUID     Trace
 ──────────────────────────────
  0FCCE9F1   1D -> 2 MPI Ranks
  8C7AA13B   2D -> 2 MPI Ranks
  CE232828   3D -> 2 MPI Ranks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants