
Fixing bug with FYPP macros #931


Merged: 2 commits merged into MFlowCode:master on Jul 15, 2025

Conversation

prathi-wind (Collaborator) commented on Jul 8, 2025

User description

Description

There was a bug in how I had replaced acc kernels with acc parallel. This pull request fixes it.

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that it can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Bug fix


Description

  • Replace incorrect acc kernels with proper GPU_PARALLEL macros

  • Fix GPU parallelization directives in time stepping and data output

  • Add GPU parallelization documentation reference


Changes diagram

flowchart LR
  A["acc kernels directives"] -- "replace with" --> B["GPU_PARALLEL macros"]
  B --> C["proper copyout/copyin parameters"]
  D["documentation"] -- "add" --> E["GPU parallelization reference"]

Changes walkthrough 📝

Relevant files

Bug fix

src/simulation/m_data_output.fpp: Fix GPU parallelization in data output (+7/-7)
  • Replace acc kernels with GPU_PARALLEL macro calls
  • Add proper copyout and copyin parameters for GPU data transfer
  • Fix parallelization for CFL and viscous calculations

src/simulation/m_time_steppers.fpp: Fix GPU parallelization in time stepping (+3/-3)
  • Replace acc kernels with GPU_PARALLEL macro for dt calculation
  • Add copyout and copyin parameters for time step computation

Documentation

docs/documentation/readme.md: Add GPU parallelization documentation link (+1/-0)
  • Add reference to GPU parallelization documentation

(The before/after shape of these macro replacements is sketched below.)
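
As an illustration of the pattern applied in m_time_steppers.fpp: the variable names below are placeholders, not copied from the source, but the macro call and copyin/copyout clauses mirror the m_data_output.fpp diff quoted in the review further down.

! Before (hypothetical names): reduction offloaded via a compiler-managed kernels region
!$acc kernels
dt_min = minval(max_dt)
!$acc end kernels

! After: explicit GPU_PARALLEL region with copyin/copyout data clauses
#:call GPU_PARALLEL(copyout='[dt_min]', copyin='[max_dt]')
    dt_min = minval(max_dt)
#:endcall GPU_PARALLEL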

prathi-wind linked an issue on Jul 8, 2025 that may be closed by this pull request

codecov bot commented on Jul 9, 2025

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 43.68%. Comparing base (adcc0dd) to head (2afef60).
Report is 6 commits behind head on master.

Files with missing lines             Patch %   Lines
src/simulation/m_data_output.fpp       0.00%   1 Missing ⚠️
src/simulation/m_time_steppers.fpp    50.00%   1 Missing ⚠️

Additional details and impacted files
    @@            Coverage Diff             @@
    ##           master     #931      +/-   ##
    ==========================================
    - Coverage   43.71%   43.68%   -0.03%     
    ==========================================
      Files          68       68              
      Lines       18360    18363       +3     
      Branches     2292     2295       +3     
    ==========================================
    - Hits         8026     8022       -4     
    - Misses       8945     8949       +4     
    - Partials     1389     1392       +3     


sbryngelson marked this pull request as ready for review on July 12, 2025 at 14:11
sbryngelson requested review from a team as code owners on July 12, 2025 at 14:11

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The GPU_PARALLEL macro calls are placed inside conditional blocks that may not execute on GPU. The original acc kernels directives were outside the viscous condition, but the new GPU_PARALLEL calls are inside the conditional blocks, potentially changing execution behavior.

#:call GPU_PARALLEL(copyout='[icfl_max_loc]', copyin='[icfl_sf]')
    icfl_max_loc = maxval(icfl_sf)
#:endcall GPU_PARALLEL
if (viscous) then
    #:call GPU_PARALLEL(copyout='[vcfl_max_loc, Rc_min_loc]', copyin='[vcfl_sf,Rc_sf]')
        vcfl_max_loc = maxval(vcfl_sf)
        Rc_min_loc = minval(Rc_sf)
    #:endcall GPU_PARALLEL
end if

Missing File

A reference to 'gpuParallelization.md' is added to the documentation index, but the actual file may not exist in the repository, which could result in broken documentation links.

- [GPU Parallelization](gpuParallelization.md)
- [MFC's Authors](authors.md)


PR Code Suggestions ✨

No code suggestions found for the PR.

The following review thread is attached to this diff hunk in src/simulation/m_data_output.fpp:

!$acc kernels
icfl_max_loc = maxval(icfl_sf)
!$acc end kernels
#:call GPU_PARALLEL(copyout='[icfl_max_loc]', copyin='[icfl_sf]')

Member commented:

Are you sure this works? Why don't you just specify acc kernels via an option to GPU_PARALLEL? Right now it looks like you're doing things differently than before when it would be easy to make them the same, but I'm not quite sure why. I guess if this parallel loop works it is nicer than invoking kernels?

prathi-wind (Collaborator, Author) replied:

acc kernels asks the compiler to parallelize the enclosed code for you, and the compiler will refuse to parallelize loops it cannot prove are free of data dependencies. OpenMP has no construct that lets the compiler analyze and parallelize code on the developer's behalf in this way, so if we want maximum performance with OpenMP, the codebase can't use GPU_KERNELS or its equivalent, since that would not be GPU-accelerated under OpenMP.
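
To illustrate the distinction with generic Fortran (not MFC code): kernels leaves the offload decision to the compiler, parallel loop asserts loop independence explicitly, and only the explicit form has a direct OpenMP offload analogue.

program saxpy_demo
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real :: a, x(n), y(n)
   a = 2.0; x = 1.0; y = 0.0

   ! With kernels, the compiler decides whether and how to parallelize:
   !$acc kernels
   do i = 1, n
      y(i) = a*x(i) + y(i)
   end do
   !$acc end kernels

   ! With parallel loop, the programmer asserts the loop is dependency-free:
   !$acc parallel loop
   do i = 1, n
      y(i) = a*x(i) + y(i)
   end do

   ! Closest OpenMP offload equivalent of the explicit form; OpenMP has no
   ! analogue of the compiler-discovered kernels construct:
   !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
   do i = 1, n
      y(i) = a*x(i) + y(i)
   end do

   print *, y(1)
end program saxpy_demo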

Member replied:

What I'm saying is you would use kernels if the compiler is NVHPC and the offload engine is OpenACC; otherwise it goes to OpenMP and uses whatever is appropriate for that compiler. This would all be taken care of in the macro. Of course, if you can find an acc parallel shortcut that also works for NVHPC + OpenACC, that's fine with me too.
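
A hypothetical sketch of the kind of dispatch being described; GPU_REGION, MFC_OpenACC, and the mode argument are illustrative names, not MFC's actual macro API.

#! Illustrative only: choose the backend-appropriate directive inside the macro,
#! so call sites never mention acc kernels or omp target directly.
#:set MFC_OpenACC = True

#:def GPU_REGION(code, mode)
#:if MFC_OpenACC and mode == 'kernels'
!$acc kernels
$:code
!$acc end kernels
#:elif MFC_OpenACC
!$acc parallel
$:code
!$acc end parallel
#:else
!$omp target teams
$:code
!$omp end target teams
#:endif
#:enddef

#! A call site stays backend-agnostic, mirroring how GPU_PARALLEL is invoked:
#:call GPU_REGION(mode='kernels')
icfl_max_loc = maxval(icfl_sf)
#:endcall GPU_REGION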

sbryngelson merged commit 55b50a5 into MFlowCode:master on Jul 15, 2025
81 of 87 checks passed

Successfully merging this pull request may close these issues.

Metadirectives kernels fixup