Skip to content

Commit d2b4ee8

Browse files
committed
Implemented ~20x faster GPU-accelerated force/torque summation, simplified calculating object force/torque in setups, improved coloring in VIS_FIELD/ray_grid_traverse_sum()
1 parent 83ef22e commit d2b4ee8

File tree

10 files changed

+184
-131
lines changed

10 files changed

+184
-131
lines changed

DOCUMENTATION.md

Lines changed: 10 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -386,29 +386,22 @@
386386
387387
### Lift/Drag Forces
388388
- Enable (uncomment) the [`FORCE_FIELD`](src/defines.hpp) extension. This extension allows computing boundary forces on every solid cell (`TYPE_S`) individually, as well as placing an individual volume force on every fluid cell (not used here).
389-
- In the [`main_setup()`](src/setup.cpp) function's main simulation loop, alternatingly call:
389+
- In the [`main_setup()`](src/setup.cpp) function, voxelize the mesh with a unique flag combination, such as `(TYPE_S|TYPE_X)` or `(TYPE_S|TYPE_Y)` or `(TYPE_S|TYPE_X|TYPE_Y)`, to distinguish it from all other `(TYPE_S)` cells that might be needed to define other geometry, and compute its center of mass:
390390
```c
391-
lbm.run(lbm_dt); // run lbm_dt LBM time steps
392-
lbm.calculate_force_on_boundaries(); // compute boundary forces on GPU on all solid cells (TYPE_S)
391+
lbm.voxelize_mesh_on_device(mesh, TYPE_S|TYPE_X); // voxelize mesh with unique flag combination
392+
const float3 lbm_com = lbm.object_center_of_mass(TYPE_S|TYPE_X); // object center of mass in LBM unit coordinates
393393
```
394-
The latter computes the boundary forces on the GPU into the `lbm.F` field in VRAM.
395-
- To copy `lbm.F` from GPU VRAM to CPU RAM, call:
394+
- To sum over all the individual boundary cells that belong to the object, in the [`main_setup()`](src/setup.cpp) function's main simulation loop call:
396395
```c
397-
lbm.F.read_from_device();
396+
const float3 lbm_force = lbm.object_force(TYPE_S|TYPE_X); // force on object
397+
const float3 lbm_torque = lbm.object_torque(lbm_com, TYPE_S|TYPE_X); // torque on object around lbm_com rotation point
398398
```
399-
You can then access the boundary forces at each individual cell with:
399+
These functions sum over all cells marked `(TYPE_S|TYPE_X)` that belong to the object. The summation happens GPU-accelerated in VRAM, and only the result is copied to CPU RAM.
400+
- You may also access the force field on individual grid cells. Note that copying the entire `lbm.F` force field from GPU VRAM to CPU RAM is slow:
400401
```c
401-
float lbm_force_x_n = lbm.F.x[lbm.index(x, y, z)];
402+
lbm.F.read_from_device(); // copy entire force field from GPU VRAM to CPU RAM (slow)
403+
lbm_force_x_n = lbm.F.x[lbm.index(x, y, z)]; // access force at one particular grid cell with integer coordinates x, y, z
402404
```
403-
- To sum over all the individual boundary cells that belong to the body, to get the total force on the body, first voxelize the body with
404-
```c
405-
lbm.voxelize_mesh_on_device(mesh, TYPE_S|TYPE_X);
406-
```
407-
with the additional `TYPE_X` flagging, and then call
408-
```c
409-
const float3 lbm_force = lbm.calculate_force_on_object(TYPE_S|TYPE_X);
410-
```
411-
to sum over all cells marked `TYPE_S|TYPE_X` that belong to the body. You can also use `TYPE_Y` for this.
412405
- Finally, [convert from LBM to SI units](#unit-conversion) with
413406
```c
414407
const float si_force_x = units.si_F(lbm_force.x);

README.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -213,6 +213,12 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
213213
- added workaround for compiler bug in Intel CPU Runtime for OpenCL that causes Q-criterion isosurface rendering corruption
214214
- fixed TFlops estimate for Intel Battlemage GPUs
215215
- fixed wrong device name reporting for AMD GPUs
216+
- [v3.2](https://github.com/ProjectPhysX/FluidX3D/releases/tag/v3.2) (09.03.2025) [changes](https://github.com/ProjectPhysX/FluidX3D/compare/v3.1...v3.2) (fast force/torque summation)
217+
- implemented GPU-accelerated force/torque summation (~20x faster than CPU-multithreaded implementation before)
218+
- simplified calculating object force/torque in setups
219+
- improved coloring in `VIS_FIELD`/`ray_grid_traverse_sum()`
220+
- updated OpenCL-Wrapper now compiles OpenCL C code with `-cl-std=CL3.0` if available
221+
- fixed compiling on macOS with new OpenCL headers
216222

217223
</details>
218224

src/defines.hpp

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@
1616
#define BENCHMARK // disable all extensions and setups and run benchmark setup instead
1717

1818
//#define VOLUME_FORCE // enables global force per volume in one direction (equivalent to a pressure gradient); specified in the LBM class constructor; the force can be changed on-the-fly between time steps at no performance cost
19-
//#define FORCE_FIELD // enables computing the forces on solid boundaries with lbm.calculate_force_on_boundaries(); and enables setting the force for each lattice point independently (enable VOLUME_FORCE too); allocates an extra 12 Bytes/cell
19+
//#define FORCE_FIELD // enables computing the forces on solid boundaries with lbm.update_force_field(); and enables setting the force for each lattice point independently (enable VOLUME_FORCE too); allocates an extra 12 Bytes/cell
2020
//#define EQUILIBRIUM_BOUNDARIES // enables fixing the velocity/density by marking cells with TYPE_E; can be used for inflow/outflow; does not reflect shock waves
2121
//#define MOVING_BOUNDARIES // enables moving solids: set solid cells to TYPE_S and set their velocity u unequal to zero
2222
//#define SURFACE // enables free surface LBM: mark fluid cells with TYPE_F; at initialization the TYPE_I interface and TYPE_G gas domains will automatically be completed; allocates an extra 12 Bytes/cell
@@ -32,9 +32,9 @@
3232
#define GRAPHICS_FRAME_HEIGHT 1080 // set frame height if only GRAPHICS is enabled
3333
#define GRAPHICS_BACKGROUND_COLOR 0x000000 // set background color; black background (default) = 0x000000, white background = 0xFFFFFF
3434
#define GRAPHICS_U_MAX 0.18f // maximum velocity for velocity coloring in units of LBM lattice speed of sound (c=1/sqrt(3)) (default: 0.18f)
35-
#define GRAPHICS_RHO_DELTA 0.01f // coloring range for density rho will be [1.0f-GRAPHICS_RHO_DELTA, 1.0f+GRAPHICS_RHO_DELTA] (default: 0.01f)
35+
#define GRAPHICS_RHO_DELTA 0.001f // coloring range for density rho will be [1.0f-GRAPHICS_RHO_DELTA, 1.0f+GRAPHICS_RHO_DELTA] (default: 0.001f)
3636
#define GRAPHICS_T_DELTA 1.0f // coloring range for temperature T will be [1.0f-GRAPHICS_T_DELTA, 1.0f+GRAPHICS_T_DELTA] (default: 1.0f)
37-
#define GRAPHICS_F_MAX 0.001f // maximum force in LBM units for visualization of forces on solid boundaries if VOLUME_FORCE is enabled and lbm.calculate_force_on_boundaries(); is called (default: 0.001f)
37+
#define GRAPHICS_F_MAX 0.001f // maximum force in LBM units for visualization of forces on solid boundaries if VOLUME_FORCE is enabled and lbm.update_force_field(); is called (default: 0.001f)
3838
#define GRAPHICS_Q_CRITERION 0.0001f // Q-criterion value for Q-criterion isosurface visualization (default: 0.0001f)
3939
#define GRAPHICS_STREAMLINE_SPARSE 8 // set how many streamlines there are every x lattice points
4040
#define GRAPHICS_STREAMLINE_LENGTH 128 // set maximum length of streamlines

src/info.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,7 @@ void Info::print_logo() const {
4242
print("| "); print("\\ \\ / /", c); print(" |\n");
4343
print("| "); print("\\ ' /", c); print(" |\n");
4444
print("| "); print("\\ /", c); print(" |\n");
45-
print("| "); print("\\ /", c); print(" FluidX3D Version 3.1 |\n");
45+
print("| "); print("\\ /", c); print(" FluidX3D Version 3.2 |\n");
4646
print("| "); print( "'", c); print(" Copyright (c) Dr. Moritz Lehmann |\n");
4747
print("|-----------------------------------------------------------------------------|\n");
4848
}

src/kernel.cpp

Lines changed: 73 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -14,13 +14,13 @@ string opencl_c_container() { return R( // ########################## begin of O
1414
)+R(float angle(const float3 v1, const float3 v2) {
1515
return acos(dot(v1, v2)/(length(v1)*length(v2)));
1616
}
17-
)+R(float fast_rsqrt(const float x) { // slightly fastwer approximation
17+
)+R(float fast_rsqrt(const float x) { // slightly faster approximation
1818
return as_float(0x5F37642F-(as_int(x)>>1));
1919
}
20-
)+R(float fast_asin(const float x) { // slightly fastwer approximation
20+
)+R(float fast_asin(const float x) { // slightly faster approximation
2121
return x*fma(0.5702f, sq(sq(sq(x))), 1.0f); // 0.5707964f = (pi-2)/2
2222
}
23-
)+R(float fast_acos(const float x) { // slightly fastwer approximation
23+
)+R(float fast_acos(const float x) { // slightly faster approximation
2424
return fma(fma(-0.5702f, sq(sq(sq(x))), -1.0f), x, 1.5712963f); // 0.5707964f = (pi-2)/2
2525
}
2626
)+R(void swap(float* x, float* y) {
@@ -52,22 +52,6 @@ string opencl_c_container() { return R( // ########################## begin of O
5252
const float x1=p.x, y1=p.y, z1=p.z, x0=1.0f-x1, y0=1.0f-y1, z0=1.0f-z1; // calculate interpolation factors
5353
return (x0*y0*z0)*v[0]+(x1*y0*z0)*v[1]+(x1*y0*z1)*v[2]+(x0*y0*z1)*v[3]+(x0*y1*z0)*v[4]+(x1*y1*z0)*v[5]+(x1*y1*z1)*v[6]+(x0*y1*z1)*v[7]; // perform trilinear interpolation
5454
}
55-
//bool workgroup_any(const bool condition) { // returns true if any thread within the workgroup enters true
56-
// volatile local uint workgroup_condition; // does not work on AMD GPUs (error: non-kernel function variable cannot be declared in local address space)
57-
// workgroup_condition = 0u;
58-
// barrier(CLK_LOCAL_MEM_FENCE);
59-
// atomic_or(&workgroup_condition, (uint)condition);
60-
// barrier(CLK_LOCAL_MEM_FENCE);
61-
// return (bool)workgroup_condition;
62-
//}
63-
//bool workgroup_all(const bool condition) { // returns true if all threads within the workgroup enter true
64-
// volatile local uint workgroup_condition; // does not work on AMD GPUs (error: non-kernel function variable cannot be declared in local address space)
65-
// workgroup_condition = 1u;
66-
// barrier(CLK_LOCAL_MEM_FENCE);
67-
// atomic_and(&workgroup_condition, (uint)condition);
68-
// barrier(CLK_LOCAL_MEM_FENCE);
69-
// return (bool)workgroup_condition;
70-
//}
7155

7256

7357

@@ -1916,9 +1900,9 @@ string opencl_c_container() { return R( // ########################## begin of O
19161900
} // update_fields()
19171901

19181902
)+"#ifdef FORCE_FIELD"+R(
1919-
)+R(kernel void calculate_force_on_boundaries(const global fpxx* fi, const global uchar* flags, const ulong t, global float* F) { // calculate force from the fluid on solid boundaries from fi directly
1903+
)+R(kernel void update_force_field(const global fpxx* fi, const global uchar* flags, const ulong t, global float* F) { // calculate force from the fluid on solid boundaries from fi directly
19201904
const uxx n = get_global_id(0); // n = x+(y+z*Ny)*Nx
1921-
if(n>=(uxx)def_N||is_halo(n)) return; // don't execute calculate_force_on_boundaries() on halo
1905+
if(n>=(uxx)def_N||is_halo(n)) return; // don't execute update_force_field() on halo
19221906
if((flags[n]&TYPE_BO)!=TYPE_S) return; // only continue for solid boundary cells
19231907
uxx j[def_velocity_set]; // neighbor indices
19241908
neighbors(n, j); // calculate neighbor indices
@@ -1929,19 +1913,15 @@ string opencl_c_container() { return R( // ########################## begin of O
19291913
F[ n] = 2.0f*fx*Fb; // 2 times because fi are reflected on solid boundary cells (bounced-back)
19301914
F[ def_N+(ulong)n] = 2.0f*fy*Fb;
19311915
F[2ul*def_N+(ulong)n] = 2.0f*fz*Fb;
1932-
} // calculate_force_on_boundaries()
1916+
} // update_force_field()
19331917
)+R(kernel void reset_force_field(global float* F) { // reset force field
19341918
const uxx n = get_global_id(0); // n = x+(y+z*Ny)*Nx
19351919
if(n>=(uxx)def_N) return; // execute reset_force_field() also on halo
19361920
F[ n] = 0.0f;
19371921
F[ def_N+(ulong)n] = 0.0f;
19381922
F[2ul*def_N+(ulong)n] = 0.0f;
19391923
} // reset_force_field()
1940-
)+"#endif"+R( // FORCE_FIELD
1941-
1942-
)+"#ifdef PARTICLES"+R(
1943-
)+"#ifdef FORCE_FIELD"+R(
1944-
void atomic_add_f(volatile global float* addr, const float val) {
1924+
)+R(void atomic_add_f(volatile global float* addr, const float val) {
19451925
)+"#if cl_nv_compute_capability>=20"+R( // use hardware-supported atomic addition on Nvidia GPUs with inline PTX assembly
19461926
float ret;)+"asm volatile(\"atom.global.add.f32\t%0,[%1],%2;\":\"=f\"(ret):\"l\"(addr),\"f\"(val):\"memory\");"+R(
19471927
)+"#elif defined(__opencl_c_ext_fp32_global_atomic_add)"+R( // use hardware-supported atomic addition on some Intel GPUs
@@ -1952,6 +1932,68 @@ void atomic_add_f(volatile global float* addr, const float val) {
19521932
float old = val; while((old=atomic_xchg(addr, atomic_xchg(addr, 0.0f)+old))!=0.0f);
19531933
)+"#endif"+R(
19541934
}
1935+
)+R(kernel void object_center_of_mass(const global uchar* flags, const uchar flag_marker, volatile global float* object_sum) {
1936+
const uxx n = get_global_id(0); // n = x+(y+z*Ny)*Nx
1937+
const uint lid = get_local_id(0); // local memory reduction of cl_workgroup_size:1
1938+
local float3 cache[cl_workgroup_size];
1939+
local uint cells[cl_workgroup_size];
1940+
cache[lid] = n<(uxx)def_N&&flags[n]==flag_marker ? position(coordinates(n)) : (float3)(0.0f, 0.0f, 0.0f);
1941+
cells[lid] = (uint)(n<(uxx)def_N&&flags[n]==flag_marker);
1942+
barrier(CLK_GLOBAL_MEM_FENCE);
1943+
for(uint s=1u; s<cl_workgroup_size; s*=2u) {
1944+
if(lid%(2u*s)==0u) {
1945+
cache[lid] += cache[lid+s];
1946+
cells[lid] += cells[lid+s];
1947+
}
1948+
barrier(CLK_LOCAL_MEM_FENCE);
1949+
}
1950+
const uint local_cells = cells[0];
1951+
if(lid==0u&&local_cells>0u) { // global memory reduction with atomic addition of local_sum
1952+
const float3 local_sum = cache[0];
1953+
atomic_add_f(&object_sum[0], local_sum.x);
1954+
atomic_add_f(&object_sum[1], local_sum.y);
1955+
atomic_add_f(&object_sum[2], local_sum.z);
1956+
atomic_add((volatile global uint*)&object_sum[3], local_cells);
1957+
}
1958+
} // object_center_of_mass()
1959+
)+R(kernel void object_force(const global float* F, const global uchar* flags, const uchar flag_marker, volatile global float* object_sum) {
1960+
const uxx n = get_global_id(0); // n = x+(y+z*Ny)*Nx
1961+
const uint lid = get_local_id(0); // local memory reduction of cl_workgroup_size:1
1962+
local float3 cache[cl_workgroup_size];
1963+
cache[lid] = n<(uxx)def_N&&flags[n]==flag_marker ? load3(n, F) : (float3)(0.0f, 0.0f, 0.0f);
1964+
barrier(CLK_GLOBAL_MEM_FENCE);
1965+
for(uint s=1u; s<cl_workgroup_size; s*=2u) {
1966+
if(lid%(2u*s)==0u) cache[lid] += cache[lid+s];
1967+
barrier(CLK_LOCAL_MEM_FENCE);
1968+
}
1969+
if(lid==0u) { // global memory reduction with atomic addition of local_sum
1970+
const float3 local_sum = cache[0];
1971+
if(local_sum.x!=0.0f) atomic_add_f(&object_sum[0], local_sum.x);
1972+
if(local_sum.y!=0.0f) atomic_add_f(&object_sum[1], local_sum.y);
1973+
if(local_sum.z!=0.0f) atomic_add_f(&object_sum[2], local_sum.z);
1974+
}
1975+
} // object_force()
1976+
)+R(kernel void object_torque(const global float* F, const global uchar* flags, const uchar flag_marker, const float cx, const float cy, const float cz, volatile global float* object_sum) {
1977+
const uxx n = get_global_id(0); // n = x+(y+z*Ny)*Nx
1978+
const uint lid = get_local_id(0); // local memory reduction of cl_workgroup_size:1
1979+
local float3 cache[cl_workgroup_size];
1980+
cache[lid] = n<(uxx)def_N&&flags[n]==flag_marker ? cross(position(coordinates(n))-(float3)(cx, cy, cz), load3(n, F)) : (float3)(0.0f, 0.0f, 0.0f);
1981+
barrier(CLK_GLOBAL_MEM_FENCE);
1982+
for(uint s=1u; s<cl_workgroup_size; s*=2u) {
1983+
if(lid%(2u*s)==0u) cache[lid] += cache[lid+s];
1984+
barrier(CLK_LOCAL_MEM_FENCE);
1985+
}
1986+
if(lid==0u) { // global memory reduction with atomic addition of local_sum
1987+
const float3 local_sum = cache[0];
1988+
if(local_sum.x!=0.0f) atomic_add_f(&object_sum[0], local_sum.x);
1989+
if(local_sum.y!=0.0f) atomic_add_f(&object_sum[1], local_sum.y);
1990+
if(local_sum.z!=0.0f) atomic_add_f(&object_sum[2], local_sum.z);
1991+
}
1992+
} // object_torque()
1993+
)+"#endif"+R( // FORCE_FIELD
1994+
1995+
)+"#ifdef PARTICLES"+R(
1996+
)+"#ifdef FORCE_FIELD"+R(
19551997
)+R(void spread_force(volatile global float* F, const float3 p, const float3 Fn) {
19561998
const float xa=p.x-0.5f+1.5f*(float)def_Nx, ya=p.y-0.5f+1.5f*(float)def_Ny, za=p.z-0.5f+1.5f*(float)def_Nz; // subtract lattice offsets
19571999
const uint xb=(uint)xa, yb=(uint)ya, zb=(uint)za; // integer casting to find bottom left corner
@@ -2545,14 +2587,14 @@ void atomic_add_f(volatile global float* addr, const float val) {
25452587
const uxx n = index((uint3)((uint)clamp(xyz.x, 0, (int)Nx-1), (uint)clamp(xyz.y, 0, (int)Ny-1), (uint)clamp(xyz.z, 0, (int)Nz-1)));
25462588
if(!(flags[n]&(TYPE_S|TYPE_E|TYPE_G))) {
25472589
const float un = length(load3(n, u));
2548-
const float weight = sq(un)+sq(un-0.5f/def_scale_u);
2590+
const float weight = fmin(un, fabs(un-0.5f/def_scale_u));
25492591
sum = fma(weight, un, sum);
25502592
traversed_cells_weighted += weight;
25512593
}
25522594
traversed_cells++;
25532595
}
25542596
color = colorscale_rainbow(def_scale_u*sum/traversed_cells_weighted);
2555-
traversed_cells_weighted *= sq(def_scale_u);
2597+
traversed_cells_weighted *= 2.0f*def_scale_u;
25562598
break;
25572599
case 1: // coloring by density
25582600
while(traversed_cells<Nx+Ny+Nz) { // limit number of traversed cells to space diagonal
@@ -2562,14 +2604,14 @@ void atomic_add_f(volatile global float* addr, const float val) {
25622604
const uxx n = index((uint3)((uint)clamp(xyz.x, 0, (int)Nx-1), (uint)clamp(xyz.y, 0, (int)Ny-1), (uint)clamp(xyz.z, 0, (int)Nz-1)));
25632605
if(!(flags[n]&(TYPE_S|TYPE_E|TYPE_G))) {
25642606
const float rhon = rho[n];
2565-
const float weight = sq(rhon-1.0f);
2607+
const float weight = fabs(rhon-1.0f);
25662608
sum = fma(weight, rhon, sum);
25672609
traversed_cells_weighted += weight;
25682610
}
25692611
traversed_cells++;
25702612
}
25712613
color = colorscale_twocolor(0.5f+def_scale_rho*(sum/traversed_cells_weighted-1.0f));
2572-
traversed_cells_weighted *= sq(def_scale_rho);
2614+
traversed_cells_weighted *= def_scale_rho;
25732615
break;
25742616
)+"#ifdef TEMPERATURE"+R(
25752617
case 2: // coloring by temperature

0 commit comments

Comments
 (0)