Summary
Array constructor assignments (var = [a, b, c]) to private local arrays inside GPU_PARALLEL_LOOP regions produce silently wrong values on AMD GPUs when compiled with amdflang using both -fopenmp-assume-threads-oversubscription and -fopenmp-assume-teams-oversubscription. This is the flag combination MFC uses for all AMD GPU builds.
The bug is present in all AFAR drops we tested: 23.1.0, 23.2.0, 23.2.1 (gfx90a / MI250X).
Errors are large — O(10–100) in absolute value, not floating-point noise — affecting ~50% of loop iterations non-deterministically.
Trigger Conditions
The bug fires when all three conditions hold:
- -fopenmp-assume-threads-oversubscription and -fopenmp-assume-teams-oversubscription are both present (either flag alone does not trigger it)
- An array constructor [expr1, expr2, ...] is assigned to a private array inside a target teams distribute parallel do
- The loop has enough iterations (≥ ~64)
Affected Code in MFC
src/simulation/m_ibm.fpp
s_ibm_correct_state (line ~195 GPU_PARALLEL_LOOP, lines 207/209):
real(wp), dimension(3) :: physical_loc ! declared private
...
$:GPU_PARALLEL_LOOP(private='[..., physical_loc, ...]')
do i = 1, num_gps
    ...
    if (p > 0) then
        physical_loc = [x_cc(j), y_cc(k), z_cc(l)] ! BUG: wrong values ~50% of iterations
    else
        physical_loc = [x_cc(j), y_cc(k), 0._wp] ! BUG: same
    end if
    ...
end do
This subroutine is called every time step for all IBM cases. The wrong physical_loc values mean ghost point corrections are computed using corrupted spatial coordinates.
s_find_ghost_points (line ~397 GPU_PARALLEL_LOOP, lines 407/409): same pattern.
src/simulation/m_collisions.fpp
s_detect_ib_collisions_n2 (lines 337–338):
centroid_1 = [patch_ib(pid1)%x_centroid, patch_ib(pid1)%y_centroid, 0._wp]
centroid_2 = [patch_ib(pid2)%x_centroid, patch_ib(pid2)%y_centroid, 0._wp]
Inside a GPU_PARALLEL_LOOP. Only reachable when collision_model > 0 (not currently in CI).
Why CI Passes
The affected code paths (s_ibm_correct_state, s_find_ghost_points) are exercised by the GPU CI IBM test suite. However, the current regression tests appear to tolerate the corrupted ghost point coordinates — likely because the test geometries and flow conditions happen to be insensitive enough that wrong physical_loc values don't push the final solution norm outside the 1e-10 tolerance. This is a silent correctness bug.
Minimal Reproducer
program minimal_array_constructor
    implicit none
    integer, parameter :: N = 64
    real(8) :: x(N), y(N), z(N), out(N,3)
    real(8) :: loc(3)
    integer :: i, nerr

    ! Fill the inputs with known, distinct values
    do i = 1, N
        x(i) = real(i,8)*0.1d0; y(i) = real(i,8)*0.2d0; z(i) = real(i,8)*0.3d0
    end do

    !$omp target teams distribute parallel do map(to:x,y,z) map(from:out) private(loc)
    do i = 1, N
        loc = [x(i), y(i), z(i)] ! array constructor assigned to a private array
        out(i,1) = loc(1); out(i,2) = loc(2); out(i,3) = loc(3)
    end do
    !$omp end target teams distribute parallel do

    ! Count iterations whose output does not match the input
    nerr = 0
    do i = 1, N
        if (abs(out(i,1)-x(i)) > 1d-14 .or. &
            abs(out(i,2)-y(i)) > 1d-14 .or. &
            abs(out(i,3)-z(i)) > 1d-14) nerr = nerr + 1
    end do

    if (nerr == 0) then
        print *, "PASS"
    else
        print *, "FAIL:", nerr, "of", N, "(max abs error O(10-100))"
    end if
end program
Compile and run:
amdflang -fopenmp --offload-arch=gfx90a -fopenmp-target-fast \
-fopenmp-assume-threads-oversubscription \
-fopenmp-assume-teams-oversubscription \
-o minimal minimal_array_constructor.f90
./minimal
# FAIL: 31 of 64 (max abs error O(10-100))
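The reproducer above prints a fixed "O(10-100)" string rather than a measured value. As a sketch only, the variant below records the largest absolute deviation so the magnitude can be checked on a given machine. The program name, the dp kind parameter, and the err/max_err variables are illustrative and not part of MFC; the offloaded loop is unchanged from the reproducer.

```fortran
program array_constructor_error_stats
    implicit none
    integer, parameter :: dp = kind(1.0d0)   ! double precision, as in the reproducer
    integer, parameter :: N = 64
    real(dp), parameter :: tol = 1.0e-14_dp
    real(dp) :: x(N), y(N), z(N), out(N,3)
    real(dp) :: loc(3), err, max_err
    integer :: i, nerr

    do i = 1, N
        x(i) = real(i,dp)*0.1_dp
        y(i) = real(i,dp)*0.2_dp
        z(i) = real(i,dp)*0.3_dp
    end do

    ! Same offloaded loop as the minimal reproducer
    !$omp target teams distribute parallel do map(to:x,y,z) map(from:out) private(loc)
    do i = 1, N
        loc = [x(i), y(i), z(i)]
        out(i,1) = loc(1)
        out(i,2) = loc(2)
        out(i,3) = loc(3)
    end do
    !$omp end target teams distribute parallel do

    ! Count bad iterations and track the largest per-iteration deviation
    nerr = 0
    max_err = 0.0_dp
    do i = 1, N
        err = max(abs(out(i,1) - x(i)), abs(out(i,2) - y(i)), abs(out(i,3) - z(i)))
        if (err > tol) nerr = nerr + 1
        max_err = max(max_err, err)
    end do

    if (nerr == 0) then
        print *, "PASS"
    else
        print *, "FAIL:", nerr, "of", N, " max abs error =", max_err
    end if
end program array_constructor_error_stats
```

It can be built with the same amdflang command as above, substituting the source file name. Because the failures are non-deterministic, the count and the maximum error may differ from run to run.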
Workaround
Replace array constructor assignments with explicit element-by-element assignment:
! Instead of:
physical_loc = [x_cc(j), y_cc(k), z_cc(l)]
! Use:
physical_loc(1) = x_cc(j)
physical_loc(2) = y_cc(k)
physical_loc(3) = z_cc(l)
This is confirmed to produce correct results.
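For comparison, here is a sketch of the minimal reproducer with the workaround applied. The program name is illustrative; the only change from the reproducer is that the array constructor is replaced by three element assignments. If the workaround behaves as described above, this version is expected to report no mismatches under the same flags.

```fortran
program array_constructor_workaround
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: N = 64
    real(dp) :: x(N), y(N), z(N), out(N,3)
    real(dp) :: loc(3)
    integer :: i, nerr

    do i = 1, N
        x(i) = real(i,dp)*0.1_dp
        y(i) = real(i,dp)*0.2_dp
        z(i) = real(i,dp)*0.3_dp
    end do

    !$omp target teams distribute parallel do map(to:x,y,z) map(from:out) private(loc)
    do i = 1, N
        ! Workaround: assign each element explicitly instead of loc = [x(i), y(i), z(i)]
        loc(1) = x(i)
        loc(2) = y(i)
        loc(3) = z(i)
        out(i,1) = loc(1)
        out(i,2) = loc(2)
        out(i,3) = loc(3)
    end do
    !$omp end target teams distribute parallel do

    ! Same host-side check as the reproducer
    nerr = 0
    do i = 1, N
        if (abs(out(i,1) - x(i)) > 1.0e-14_dp .or. &
            abs(out(i,2) - y(i)) > 1.0e-14_dp .or. &
            abs(out(i,3) - z(i)) > 1.0e-14_dp) nerr = nerr + 1
    end do

    if (nerr == 0) then
        print *, "PASS"
    else
        print *, "FAIL:", nerr, "of", N
        error stop 1   ! non-zero exit status so scripts can detect the failure
    end if
end program array_constructor_workaround
```

Running both programs with identical flags isolates the array constructor as the only difference between a failing and a passing build.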
Upstream Compiler Bug
The issue is being investigated in ROCm/llvm-project. A related fix for private VLA arrays is in ROCm/llvm-project#2422. The array constructor bug is a separate issue; an upstream report will follow.
cc @danieljvickers