Test Suite Flag --rdma-mpi Implemented (#598) #878
Conversation
There is an RDMA flag ('T'/'F') in the case files right now (there's probably an example of this in examples/). What you've added is very useful for testing, but perhaps not for running cases (since RDMA_MPI doesn't need to be known at compile-time, we can just put it in the case file). For testing (the purpose of this GH issue), it's useful to be able to flip RDMA_MPI on/off from the command line since some compilers/computers support it and some do not.
I would suggest:
- Make this flag an option only in the ./mfc.sh test environment.
- If --rdma-mpi is provided as an argument, then we run three additional tests. They would be copies of the current 2-rank test cases, but have RDMA_MPI = 'T' in the case file (see the sketch after this list). The goldenfiles would be identical. Note that if the flag is enabled, we should still run the existing non-RDMA 2-rank tests (if the compiler supports RDMA_MPI, then it also supports not using it). The current 2-rank tests are:
  - 0FCCE9F1 1D -> 2 MPI Ranks
  - 8C7AA13B 2D -> 2 MPI Ranks
  - CE232828 3D -> 2 MPI Ranks
- If the --rdma-mpi flag is not provided, then the new/extra cases are 'skipped'. This is already a feature of the test suite.
- This flag should be used in the Frontier CI workflow, e.g. ./mfc.sh test -a --rdma-mpi (or whatever), since Frontier supports this feature.
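For illustration, here is a minimal Python sketch of the duplicate-and-skip scheme suggested above. The Case class and add_rdma_variants helper are hypothetical stand-ins, not MFC's actual toolchain API:

```python
from dataclasses import dataclass, field, replace

@dataclass
class Case:
    # Hypothetical stand-in for a toolchain test case, not MFC's real class.
    trace: str
    params: dict = field(default_factory=dict)
    skip: bool = False

def add_rdma_variants(cases, rdma_mpi):
    # For every 2-rank case, add a copy with RDMA_MPI = 'T' in its case file.
    # The copy reuses the original's goldenfile and is skipped unless the
    # --rdma-mpi flag was passed to ./mfc.sh test.
    variants = [
        replace(case,
                trace=case.trace + " -> RDMA MPI",
                params={**case.params, "rdma_mpi": "T"},
                skip=not rdma_mpi)
        for case in cases
    ]
    return cases + variants

two_rank = [Case("1D -> 2 MPI Ranks"), Case("2D -> 2 MPI Ranks"), Case("3D -> 2 MPI Ranks")]
all_cases = add_rdma_variants(two_rank, rdma_mpi=False)  # RDMA copies exist but are skipped
```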
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##           master     #878   +/-  ##
=======================================
  Coverage   43.71%   43.71%
=======================================
  Files          68       68
  Lines       18360    18360
  Branches     2292     2292
=======================================
  Hits         8026     8026
  Misses       8945     8945
  Partials     1389     1389

☔ View full report in Codecov by Sentry. |
Six 2-rank MPI cases will be created: pairs with RDMA enabled and disabled. The RDMA-on cases run only when the flag is given; otherwise, they are written but skipped (a sketch of the flag declaration follows this comment).

Example usage:
  ./mfc.sh test --rdma-mpi                    # only the MPI Ranks cases
  ./mfc.sh test -o Hypoelasticity --rdma-mpi  # Hypoelasticity cases with RDMA enabled
  ./mfc.sh test -a --rdma-mpi                 # all tests along with the RDMA 2-rank MPI cases

./mfc.sh test --help now shows:
  -r, --rdma-mpi        Enable RDMA MPI for tests (default: False)
  --no-build            (Testing) Do not rebuild MFC. (default: False)
  --no-examples         Do not test example cases. (default: False)
  --case-optimization   (GPU Optimization) Compile MFC targets with some case parameters hard-coded. (default: False)
  --dry-run             Build and generate case files but do not run tests. (default: False)
  --generate            (Test Generation) Generate golden files. (default: False)
  --add-new-variables   (Test Generation) If new variables are found in D/ when running tests, add them to the golden files. (default: False)
  --remove-old-tests    (Test Generation) Delete tests directories that are no longer. (default: False) |
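A flag like this is ordinary argparse plumbing. The sketch below shows one way it could be declared; MFC's real CLI lives in its Python toolchain, so the parser setup here is an assumption, though the help string matches the --help output above:

```python
import argparse

# Hypothetical sketch of the flag declaration; MFC's actual parser setup
# differs, but the flag's documented behavior is the same.
parser = argparse.ArgumentParser(prog="./mfc.sh test")
parser.add_argument("-r", "--rdma-mpi", action="store_true", default=False,
                    help="Enable RDMA MPI for tests (default: False)")

args = parser.parse_args(["--rdma-mpi"])
print(args.rdma_mpi)  # True: argparse maps --rdma-mpi to the rdma_mpi attribute
```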
Just need to update the testing docs for the flag and add |
I'm not sure if this is the case right now, but if one enables the flag |
@sbryngelson |
@sbryngelson I do not think it is viable to debug locally. I do not have access to Frontier; is RDMA available on Phoenix, Bridges2, or Delta? |
You can ask the MFC Slack what other computers have compilers that support it. But the Frontier failure appears to have nothing to do with your PR. It's having some network issues. |
Working on it right now on Phoenix |
This should do it, hopefully. The difference between the two commands is right now 3 tests. I will keep an eye on the test suite; I removed --rdma-mpi. |
Self-Hosted Frontier failed for entirely unrelated reasons. I will rerun momentarily. |
@anandrdbz Self-Hosted Frontier has failed on multiple re-runs for unclear reasons. Do you mind checking it out? |
the problem seems to be that it's trying to use the internet on a compute node (in the test part of the workflow). the compute nodes don't have internet access. it's trying to access silo. I am not sure why it thinks it isn't available, because all of the other PR workflows are working fine... which suggests something specific to this PR |
I have no clue, to be honest. |
frankly, me neither; your PR doesn't seem to tickle that part of the code. i would try commenting out some of your changes until it works. not an elegant method. probably leave the flag but comment out the extra test cases and how you modify the case files, and make sure that works. if that works, add in how you modify the case files but not the tests themselves. as i look closer, i'm unsure if how you added the RDMA flag is proper - though i'm not entirely sure. what happens if you try to use it on Phoenix or something? If you run the |
on Phoenix it ran just fine, but failed on the RDMA cases only |
I will edit the Frontier template to reflect that Silo is not needed. I will wait for the self-hosted tests to pass, then push another change. |
Hmm, something is wrong now... Failed cases:

  Failed test tests/FA4D8FEF: 1D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
    TestCase.run() got an unexpected keyword argument 'rdma_mpi'
  Failed test tests/1B300F28: 2D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
    TestCase.run() got an unexpected keyword argument 'rdma_mpi'
  Failed test tests/2C9844EF: 3D -> 2 MPI Ranks -> RDMA MPI after 3 attempt(s).
    TestCase.run() got an unexpected keyword argument 'rdma_mpi'

Edit/update: Looks like this has been an error for several commits. |
Prior to that, it appeared when run on Phoenix. |
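For readers hitting the same error: that TypeError is Python's standard complaint when a call site passes a keyword argument the method's signature does not declare. A minimal reproduction, with illustrative class names:

```python
class TestCase:
    def run(self, targets):  # old signature: no rdma_mpi parameter
        pass

# TestCase().run(targets=[], rdma_mpi=True) raises:
#   TypeError: TestCase.run() got an unexpected keyword argument 'rdma_mpi'

class FixedTestCase:
    def run(self, targets, rdma_mpi=False):  # thread the new argument through
        pass
```

The fix is to update every signature along the call chain, not just the caller.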
The Self-Hosted (Frontier) runs were canceled and need to be re-run whenever convenient. Appreciate that. |
yeah frontier keeps failing, I suspect someone has been using the computer for a long time or there's some backlog of jobs. I'm going to look shortly and try to fix it up -- hopefully easy fix. |
Umm, yeah, this makes sense. I would create the goldenfile on an RDMA-capable computer and call it a day. This should work since you 'SKIP' the non-RDMA case when appropriate, or skip the RDMA case when appropriate. Unless this is because the goldenfile on the master branch doesn't exist? Or is something else going wrong that I'm not noticing? |
yup, I could not find their golden files anywhere. |
In that case, failure is expected. Tests work by comparing the UUID's goldenfile to the results that come out (sketched below). There is a clean way to generate goldenfiles via --generate. Note that the non-RDMA tests have different UUIDs (hashes):

  [I] MFC/MFC $ ./mfc.sh test -l -o 2\ MPI\ Ranks
  mfc: OK > (venv) Entered the Python 3.11.7 virtual environment (>= 3.9).
  [MFC startup banner elided; it reports --jobs 1 and --mpi --gpu --no-debug --no-gcov --no-unified --no-single]

  MFC Test Cases
  UUID      Trace
  ──────────────────────────────
  0FCCE9F1  1D -> 2 MPI Ranks
  8C7AA13B  2D -> 2 MPI Ranks
  CE232828  3D -> 2 MPI Ranks |
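As a sketch of the workflow just described (paths and the helper name are illustrative assumptions, not MFC's actual harness): each test's output is compared against a goldenfile stored under the case's UUID, so a case with a new UUID cannot pass until a goldenfile is generated for it on a capable machine.

```python
import filecmp
from pathlib import Path

# Hypothetical sketch of goldenfile comparison; the tests/<UUID>/golden
# layout and the function name are assumptions.
def check_against_golden(uuid: str, output: Path) -> bool:
    golden = Path("tests") / uuid / "golden"
    if not golden.exists():
        # A brand-new case (new UUID) has no goldenfile yet; it will fail
        # until one is generated, e.g. via ./mfc.sh test --generate.
        raise FileNotFoundError(f"no goldenfile for {uuid}")
    return filecmp.cmp(golden, output, shallow=False)
```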
Description

I added the flag under ./mfc.sh test. It can be invoked with -r or --rdma-mpi, and it can be used with any test cases. The RDMA test cases are skipped whenever rdma_mpi is false or unspecified (false by default).