Performance notes
=================

This page describes the performance characteristics of ``vmec_jax``, the
algorithmic and implementation choices that determine them, and practical
tuning advice.

Overview: cold vs warm runtime
--------------------------------

``vmec_jax`` uses **XLA JIT compilation** (via JAX).  The first call in a
process compiles the iteration kernels; subsequent calls reuse the compiled
code:

- **Cold run**: includes Python/JAX startup and XLA compilation (one-time cost
  per process).  Depending on input size and whether the persistent cache is
  warm, typical fixed-boundary CPU cases range from a few seconds to tens of
  seconds.
- **Warm run**: steady-state solve time after the kernels are in memory — the
  fair comparison against VMEC2000.

Persistent XLA compilation caching is enabled automatically for
accelerator-selected runs.  CPU cache use is opt-in because XLA:CPU persistent
cache entries are native AOT executables and can emit host-feature mismatch
errors on some JAX versions.  Compiled kernels are stored under
``~/.cache/vmec_jax/jax_cache/<machine-fingerprint>`` unless the user sets
``VMEC_JAX_COMPILATION_CACHE_DIR`` or upstream ``JAX_COMPILATION_CACHE_DIR``.
The suffix includes host CPU details plus Python/JAX/JAXLIB versions to avoid
reusing native XLA:CPU AOT executables compiled for an incompatible runtime.
Set ``VMEC_JAX_COMPILATION_CACHE=1`` to opt in for CPU runs, or
``VMEC_JAX_COMPILATION_CACHE=0`` to disable the persistent cache.

VMEC2000 is a pre-compiled Fortran binary with no JIT overhead — it is always
effectively "cold".  When benchmarking, compare ``vmec_jax`` warm runtime
against VMEC2000 runtime.

2026-05-25 rerun snapshot
-------------------------

The current post-performance-lane audit was rerun under
``outputs/rerun_20260525_123334``.  This snapshot is intentionally a
measurement checkpoint, not a new broad speedup claim.

Parity gates:

.. list-table::
   :header-rows: 1

   * - Matrix
     - Cases
     - Failures
     - Coverage
   * - ``parity_smoke``
     - 6
     - 0
     - fixed/free boundary, axisymmetry/non-axisymmetry, LASYM false/true
   * - ``parity_full``
     - 1
     - 0
     - QH warm-start fixed-boundary stage trace against VMEC2000

Representative profiler timings:

``Wrapper time`` is the complete subprocess launch used by
``gpu_cpu_performance_matrix.py``.  ``Profile time`` is the measured body of
the diagnostic report.  For short fixed-boundary cases, the difference is
mostly Python/JAX startup, import, and first-use compilation overhead.

.. list-table::
   :header-rows: 1

   * - Case
     - Backend
     - Wrapper time
     - Profile time
     - Main exposed cost
   * - QH warm-start fixed-boundary
     - CPU
     - ``7.99 s``
     - ``0.24 s``
     - cold process/import/JIT dominates this short case
   * - QH warm-start fixed-boundary
     - GPU
     - ``7.80 s``
     - ``0.70 s``
     - accelerator launch/JIT overhead dominates this short case
   * - QH finite-beta fixed-boundary
     - CPU
     - ``14.53 s``
     - ``3.16 s``
     - finite-beta solve body
   * - QH finite-beta fixed-boundary
     - GPU
     - ``14.97 s``
     - ``3.52 s``
     - finite-beta solve body plus launch overhead
   * - LASYM pressure fixed-boundary
     - CPU
     - ``22.85 s``
     - ``0.43 s``
     - cold process/JIT dominates
   * - LASYM pressure fixed-boundary
     - GPU
     - ``10.10 s``
     - ``0.31 s``
     - GPU is faster for this short LASYM profile
   * - QH mode-2 exact dense Jacobian
     - CPU
     - ``13.78 s``
     - ``9.64 s``
     - replay/tangent construction
   * - QH mode-2 exact dense Jacobian
     - GPU
     - ``15.20 s``
     - ``11.36 s``
     - JVP-only tape plus replay/tangent construction
   * - QI Boozer/residual isolation
     - CPU
     - ``31.29 s``
     - ``28.89 s``
     - VMEC solve ``19.93 s`` plus first Boozer/QI call ``8.68 s``
   * - QI Boozer/residual isolation
     - GPU
     - ``22.79 s``
     - ``19.34 s``
     - VMEC solve ``8.16 s`` plus first Boozer/QI call ``10.76 s``

Low-hanging finding from the rerun: the GPU exact-callback profile was already
using the JVP-only exact tape path, but the matrix summary reported
``jvp_tape=False`` because it echoed only the requested CLI flag.  The profiler
and report comparator now record requested and effective JVP/basepoint-carry
metadata separately and infer effective behavior from the profile timers.  This
was a reporting fix rather than a solver-path change.

The exposed remaining bottlenecks are still:

- exact dense-Jacobian accepted-point replay/tangent construction,
- first-call Boozer/QI setup on GPU,
- cold-process import/JIT cost for short CLI solves.

Key performance decisions
--------------------------

The following design choices explain the current performance profile.  They
reduce warm runtime and are especially useful inside differentiable
optimization workflows, but the present single-solve CPU matrix should still be
read as a mixed result rather than a broad VMEC2000 speedup claim:

**1. Single-grid default for fixed-boundary**
  The default CLI path skips staged multi-grid schedules and goes directly to
  the final grid.  VMEC2000 uses a ``NS_ARRAY`` continuation schedule; for
  many cases the intermediate stages add overhead without improving convergence.
  Pass ``--parity`` to force the VMEC2000 continuation schedule.

**2. Backend-aware fixed-boundary default**
  Public CLI runs default to the VMEC-control non-scan loop on CPU because the
  auto-selected ``accelerated`` CPU policy can lose the scan win once
  VMEC-style progress printing and staged finish behavior are included.
  Quiet API/profiler CPU runs may select scan automatically for high-work
  decks, while low-mode CPU cases stay on the lower-latency host loop. Public
  GPU/CUDA/ROCm fixed-boundary runs default to the ``jax.lax.scan`` loop
  because the current May 2026 ``office`` profiles show materially better
  warmed raw throughput on accelerator backends.  Each scan chunk compiles to a
  single XLA program and returns only scalar convergence diagnostics to the
  host.  Pass ``use_scan=False`` or profiler ``--no-use-scan`` to force the
  VMEC-control loop.

**3. Dynamic replay bucketing**
  ``VMEC_JAX_DYNAMIC_REPLAY_BUCKET`` pads nearby exact-adjoint tape lengths so
  compiled replay kernels can be reused across Jacobian calls with slightly
  different iteration counts.  The default is backend-adaptive: ``32`` on CPU
  and ``128`` on CUDA/ROCm/GPU backends.  Override it only as a profiling knob;
  larger buckets can still be slower if they pad a workload too aggressively.

**4. Preconditioner caching**
  The 1-D preconditioner (``clear_preconditioner_jit_caches``) is JIT-compiled
  once and reused.  Explicit cache clearing between Gauss-Newton Jacobian
  evaluations (``post_jacobian_callback``) releases accumulated caches without
  paying the cost of recompilation on the next call.

**5. Strict-update + no backtracking**
  The VMEC iteration algorithm uses ``strict_update=True, backtracking=False``
  to match the VMEC2000 step-accept path.  Using backtracking causes the
  adaptive step size to collapse to machine epsilon on stiff 3D geometries
  (e.g. QH stellarators), resulting in non-convergence.

**6. FFT-based spectral synthesis**
  ``vmec_jax`` uses FFT-based real-space synthesis (``vmec_tomnsp.py``) for
  the R/Z/λ transforms, replacing the matrix–vector approach of VMEC2000.
  This reduces transform cost from O(N_modes × N_real) to O(N_real log N_real).

**7. Relaxed trial residuals in optimisation line search**
  The Gauss-Newton line search uses a *relaxed* forward solve (fewer iterations,
  looser ftol) for trial step evaluations.  The tight solve is only run for
  accepted steps and Jacobian builds.  This roughly halves the per-iteration
  wall time in optimisation loops.

**8. Discrete-adjoint Jacobian (solver-consistent, not finite differences)**
  For optimisation, the Jacobian is computed via discrete-adjoint replay
  (``build_residual_checkpoint_tape_direct`` + ``checkpoint_tape_state_jvp_columns``).
  This gives a Jacobian consistent with the converged VMEC iteration tape
  without finite differences.  The cost scales with the number of boundary
  degrees of freedom because all parameter tangent columns must be propagated
  through the tape.

**9. Cached quasisymmetry angular grids**
  QS optimisation callbacks reuse the same angular quadrature grid and
  trigonometric tables across residual and Jacobian calls.  The public
  ``quasisymmetry_ratio_residual_from_wout`` API still builds these arrays on
  demand, but ``make_qs_residuals_fn`` / ``make_qh_residuals_fn`` precompute
  them once for the fixed optimisation problem.  This reduces cold
  residual/Jacobian trace overhead without changing the residual values.

**10. Minimal history in optimisation loops**
  The ``light_history=True`` flag suppresses the full per-step diagnostic record
  during optimisation solves, reducing host/device traffic and memory pressure.

**11. On-disk XLA kernel cache**
  The persistent XLA compilation cache is enabled by default for repeated
  cold-process accelerator runs when the requested JAX platform is
  GPU/CUDA/ROCm/TPU or a public call explicitly requests
  ``solver_device="gpu"``. Merely exposing a GPU through
  ``CUDA_VISIBLE_DEVICES`` no longer enables the cache during import, so CPU
  runs on GPU workstations do not accidentally load CPU AOT cache entries. CPU
  cache use remains opt-in to avoid XLA:CPU AOT host-feature mismatch warnings. Set
  ``VMEC_JAX_COMPILATION_CACHE=1`` to enable it for CPU runs,
  ``VMEC_JAX_COMPILATION_CACHE=0`` to disable it, or
  ``VMEC_JAX_COMPILATION_CACHE_DIR`` to choose the cache location.

**12. GPU demand allocation**
  Before importing JAX, vmec_jax defaults
  ``XLA_PYTHON_CLIENT_PREALLOCATE=false`` unless the user already configured
  the allocator.  This keeps GPU memory available for worker/profiling
  processes and improved the exact-Jacobian replay profile on ``office``.
  Set ``XLA_PYTHON_CLIENT_PREALLOCATE=true`` or
  ``VMEC_JAX_GPU_PREALLOCATE=1`` before import to keep JAX's preallocation
  default.

**13. Tape exact default**
  Accepted-point exact optimizer callbacks use the discrete-adjoint tape path by
  default on both CPU and GPU.  The scan exact path is still available through
  ``VMEC_JAX_OPT_EXACT_PATH=scan`` for targeted profiling and parity studies, but
  it is not the GPU default.

**14. Scan trial residuals**
  Relaxed trial residual solves in optimization loops use a backend-aware
  policy: CPU and current GPU/CUDA/ROCm exact-optimization runs stay on the
  VMEC-control non-scan path because high-mode trial-solve profiles showed the
  scan path paying too much cold compile/dispatch overhead.  TPU keeps the scan
  default.  Set
  ``VMEC_JAX_OPT_TRIAL_SCAN=1`` or ``0`` for targeted diagnostics.

**15. Fused accelerator update step**
  The exact optimizer's strict fixed-boundary accepted-point solve uses a
  cached JIT helper for the velocity/state update on non-CPU backends.  This
  removes many small eager GPU dispatches per VMEC iteration while leaving the
  CPU host-update path unchanged.  Set ``VMEC_JAX_JIT_STRICT_UPDATE=0`` only
  for diagnostics.

**16. Residual-derived accepted-point history**
  For standard QS residual factories, accepted-point Jacobian callbacks
  reconstruct aspect-ratio and QS history metrics directly from the cached
  residual vector.  This avoids a second accepted-point solve after every
  Jacobian callback.  The conservative state-backed path is still used for
  custom residuals and histories that request an explicit ``iota_fn``.

**17. Device-resident exact-tape final state**
  The direct discrete-adjoint tape keeps the accepted-point final packed state
  as a JAX array until a public NumPy output boundary is reached.  This avoids
  an unnecessary device-to-host transfer during exact Jacobian callbacks and is
  especially relevant for GPU replay profiling.

**18. Async initial-state projection**
  The exact optimizer no longer blocks immediately after the JIT initial-state
  projection.  The projected state is consumed by subsequent JAX solver work,
  so synchronizing there only adds a host/device barrier on accelerator runs.
  Set ``VMEC_JAX_OPT_SYNC_INITIAL_STATE=1`` for diagnostics that need to
  attribute initial-state projection time independently.

**19. Experimental accelerator host-update assembly**
  The production non-scan GPU path is still limited by host/device scalar
  synchronization in VMEC-style time-step control.  For lower-level diagnostics,
  ``VMEC_JAX_HOST_UPDATE_ON_ACCELERATOR=1`` permits an explicitly requested
  host-update assembly path on GPU/accelerator backends.  The public driver
  diagnostic override ``VMEC_JAX_HOST_UPDATE_ASSEMBLY=1`` remains CPU/non-scan
  only.  This path is not a production default because it moves more
  intermediate data through the host and is case-dependent.

**20. GPU JVP-only exact tapes with basepoint carries**
  ``VMEC_JAX_OPT_JVP_ONLY_EXACT_TAPE=1`` and the diagnostics flag
  ``--jvp-only-exact-tape`` omit reverse-mode base-carry storage from accepted
  exact tapes when only JVP columns are needed.  As of the 2026-05-23 office GPU
  profile, accelerator exact callbacks enable this path by default and preserve
  ``VMEC_JAX_JVP_ONLY_EXACT_TAPE_BASEPOINT_CARRIES`` automatically unless the
  user sets either environment variable explicitly.  CPU keeps the established
  full-tape default.

  The production policy is deliberately narrower than the diagnostic knobs:
  JVP-only without basepoint carries is slower on GPU, and the forced scan exact
  path is still a profiling/parity tool rather than a default.

  .. list-table:: QH mode-2 accepted Jacobian profile, ``office``, 2026-05-22
     :header-rows: 1

     * - Device
       - JVP-only
       - Total
       - Tape build
       - Replay
       - RSS growth
     * - CPU
       - off
       - 31.373 s
       - 6.635 s
       - 11.246 s
       - 1934.6 MiB
     * - CPU
       - on
       - 28.199 s
       - 3.432 s
       - 18.724 s
       - 2760.3 MiB
     * - GPU
       - off
       - 20.356 s
       - 8.395 s
       - 5.543 s
       - 922.4 MiB
     * - GPU
       - on
       - 56.347 s
       - 8.043 s
       - 41.794 s
       - 1378.2 MiB
     * - GPU
       - on + basepoint carries
       - 24.249 s
       - 11.069 s
       - 6.461 s
       - 922.2 MiB

  A 2026-05-23 bounded QH mode-2 exact-Jacobian profile on ``office`` showed the
  production GPU lane clearly: full tape measured ``44.1 s`` total, JVP-only
  without carries measured ``44.7 s``, and JVP-only with basepoint carries
  measured ``15.1 s``.  CPU still uses the full tape by default.

**21. NumPy multigrid interpolation for ordinary stage transfers**
  Non-differentiated ``NS_ARRAY`` stage transfers now use the host NumPy path
  for radial VMEC interpolation, while traced/autodiff calls stay on the JAX
  implementation.  This avoids one-time XLA dispatch during ordinary
  multigrid startup without changing the interpolation convention or the final
  residual.

**22. NumPy initialization fast path for ordinary CPU solves**
  Public CPU performance-mode fixed-boundary runs now build non-traced initial
  guesses under the same NumPy compatibility shim used by the CPU force loop.
  Differentiated/traced calls and explicit JAX/GPU paths still use the JAX
  implementation.  The pure-NumPy VMEC axis recompute scan is also vectorized.
  The CPU update path is now workload-sensitive: low-mode non-scan solves keep
  the host NumPy update path, while larger spectral/radial stages use the fused
  strict-update JIT path.  On the local finite-beta QH multigrid profile this
  reduced solve-only wall time from ``21.6 s`` to ``11.9 s`` while keeping the
  converged residual at ``5.6e-13``.  Set
  ``VMEC_JAX_HOST_UPDATE_CPU_WORK_LIMIT`` to tune the threshold.

**23. CPU host solves use an adaptive NumPy preconditioner apply**
  The non-scan CPU host-update path uses the pure-NumPy R/Z preconditioner
  apply for short solves (default threshold
  ``VMEC_JAX_NUMPY_PRECOND_MAX_ITER=240``) and for moderate/high spectral
  mode counts (default threshold
  ``VMEC_JAX_NUMPY_PRECOND_MIN_MODES=16``, with mode count
  ``mpol * (ntor + 1)``).  This avoids cold JAX preconditioner compilation in
  optimization trial/probe solves and also helps long CPU host solves when the
  R/Z preconditioner apply is bandwidth/dispatch dominated.  Low-mode long
  solves keep the compiled preconditioner because it still amortizes better.
  On repeated local finite-beta QH multigrid profiles this reduced in-process
  wall time from the previous ``15.61 s`` baseline to ``13.62--14.29 s``
  without a material regression in the small QH warm-start profile
  (``1.64--1.67 s``).  Set
  ``VMEC_JAX_NUMPY_PRECOND_MAX_ITER=0`` and
  ``VMEC_JAX_NUMPY_PRECOND_MIN_MODES=0`` to disable this policy.

**24. Fixed-boundary startup avoids free-boundary imports**
  Ordinary fixed-boundary CLI/API runs no longer import or validate the
  free-boundary/NESTOR mgrid path.  Free-boundary inputs still validate and
  prepare mgrids exactly when ``LFREEB`` is active.  This reduced
  ``import vmec_jax.cli`` startup on the development Mac from about ``0.61 s``
  to about ``0.36 s`` after the related public-API lazy import pass.

**25. Accelerator scan preflight uses a one-step cached runner**
  VMEC2000-style scan solves sometimes need a one-step preflight after an
  initial axis reset.  On accelerator backends, running that step through the
  host/non-JIT path can dominate short LASYM profiles even when the scan kernel
  itself is fast.  Quiet GPU/CUDA/ROCm/TPU scan runs now use the same cached
  one-step scan runner for preflight.  CPU and verbose/printing paths keep the
  conservative host preflight.  Set ``VMEC_JAX_SCAN_JIT_PREFLIGHT=0`` or ``1``
  to force either behavior during diagnostics.

**26. Multigrid profiles expose per-stage wall time**
  ``profile_fixed_boundary.py`` now records multigrid stage wall time and
  per-stage solver-loop timing in diagnostics.  Chunked accelerated stages are
  aggregated before reporting so stage wall time and VMEC-loop timing are
  comparable, and scan stages report their ``scan_total_s`` timing through the
  same generic per-stage solve column as non-scan stages.  On the local
  finite-beta QH case, the two CPU stages measured
  about ``5.1 s`` and ``20.5 s`` wall time, while the aggregated final-stage
  solver-loop timing was about ``20.45 s``.  A
  ``VMEC_JAX_JIT_PRECOMPILE=0`` probe was slower (``29.2 s`` total), so the
  remaining cold-CPU target is compiled transform/preconditioner work inside
  the VMEC iteration loop plus shape-stable reuse, not WOUT output or
  driver-side staging overhead.

May 2026 policy validation snapshot
-----------------------------------

The detailed tables below combine the latest
``outputs/rerun_20260525_123334`` matrix with earlier May 2026 targeted probes.
Use the rerun snapshot above as the current end-to-end timing checkpoint; older
rows are retained as profiling provenance when they isolate a particular code
path.  The main conclusion is unchanged: fixed-boundary GPU scan throughput can
be competitive on warmed/profiled stages, but full subprocess wall time and
exact-optimizer callbacks still have large startup, tape/replay, and tangent
construction costs.

.. list-table:: CPU fixed-boundary policy check, local Mac, 2026-05-25
   :header-rows: 1

   * - Case
     - Default warm
     - Accelerated warm
     - Warm speedup
     - Final residual
   * - ``input.nfp4_QH_warm_start``
     - ``1.15 s``
     - ``0.165 s``
     - ``7.0x``
     - ``1.11e-13``
   * - ``input.nfp4_QH_finite_beta``
     - ``11.80 s``
     - ``10.62 s``
     - ``1.1x``
     - ``5.52e-13``

The same finite-beta deck also passed the optional converged VMEC2000 external
parity test against ``~/bin/xvmec2000`` on the same checkout
(``1 passed`` in ``382 s``), so the CPU policy change did not trade accuracy
for speed on this promoted reference.

.. list-table:: GPU fixed-boundary profile, ``office`` RTX A4000, 2026-05-25
   :header-rows: 1

   * - Case
     - Wall time
     - Iterations
     - Dominant exposed phase
     - Final residual
   * - ``input.nfp4_QH_warm_start``
     - ``0.70 s``
     - ``450``
     - scan dispatch ``0.50 s``
     - ``1.11e-13``
   * - ``input.nfp4_QH_finite_beta``
     - ``3.52 s``
     - ``2020``
     - scan dispatch ``3.31 s``
     - ``5.51e-13``

The finite-beta GPU run uses scan for the explicit-stage monitor chunks and
precomputed tridiagonal coefficients.  This replaced the previous non-scan
stage-monitor profile (``30.21 s`` wall time, dominated by ``12.81 s`` in the
preconditioner) with a single cached scan runner hit and no scan-cache misses.
The profiled solve body is approximately VMEC2000-class for the promoted
finite-beta deck (``~/bin/xvmec2000`` measured ``3.32 s`` on the same input in
the local external parity run), but the current subprocess wrapper still
measured ``14.97 s`` because it includes startup/import/JIT.  Force-kernel
tuning alone is no longer the first raw fixed-boundary GPU bottleneck for this
case.

2026-06-02 fixed-boundary and exact-callback checkpoint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After the fixed-boundary scan timing reporter was promoted, the same two
fixed-boundary references were rerun from clean timing subprocesses.  The
``--use-input-niter`` report stem now records ``input_niter`` instead of the
default diagnostic ``iters20`` label, so these artifacts are easier to compare
against VMEC2000 runs that use the input deck budgets.

.. list-table:: Fixed-boundary solve-body timing, 2026-06-02
   :header-rows: 1

   * - Case
     - Backend
     - vmec_jax scan body
     - VMEC2000 total computational time
     - Notes
   * - ``input.nfp4_QH_warm_start``
     - CPU
     - ``0.190 s``
     - ``0.20 s``
     - 450 iterations; warm scan body is VMEC2000-class
   * - ``input.nfp4_QH_warm_start``
     - GPU
     - ``0.448 s``
     - ``0.20 s``
     - small warm solve remains GPU dispatch dominated
   * - ``input.nfp4_QH_finite_beta``
     - CPU
     - ``3.957 s``
     - ``3.32 s``
     - 2459 iterations; close to VMEC2000, but still slower
   * - ``input.nfp4_QH_finite_beta``
     - GPU
     - ``4.005 s``
     - ``3.32 s``
     - 2508 iterations; GPU dispatch/preconditioner launch overhead dominates

The same checkpoint also reran the optimization-critical QH exact dense
Jacobian path on the ``office`` RTX A4000.  The current default basepoint
dynamic replay remains faster than ``VMEC_JAX_DYNAMIC_REPLAY_MODE=whole_scan``:
mode-2 warm callbacks measured about ``3.47 s`` with basepoint replay versus
``3.76 s`` with whole-scan replay, and mode-3 warm callbacks measured about
``3.60 s`` versus ``3.80 s``.  The remaining warm GPU cost is therefore not a
mode-selection issue.  The repeated hotspots are accepted tape-build solve
dispatch, preconditioner apply, update-state dispatch, and projected
replay/residual-projection dispatch.

.. list-table:: LASYM fixed-boundary scan preflight, ``input.basic_non_stellsym_pressure``, 20 iterations
   :header-rows: 1

   * - Policy
     - Backend
     - Wall time
     - Scan total
     - Preflight
     - Device run
     - Final residual
   * - host preflight
     - CPU
     - ``0.323 s``
     - ``0.199 s``
     - ``0.156 s``
     - ``0.0188 s``
     - ``1.0575e-3``
   * - host preflight
     - ``office`` GPU
     - ``1.116 s``
     - ``0.945 s``
     - ``0.805 s``
     - ``0.0322 s``
     - ``1.0575e-3``
   * - cached one-step preflight
     - ``office`` GPU
     - ``0.285 s``
     - ``0.164 s``
     - ``0.0100 s``
     - ``0.0360 s``
     - ``1.0575e-3``

This profile is intentionally a short, non-converged raw throughput probe.  It
shows why the accelerator preflight path matters: the actual GPU scan kernel
was already fast, but the old host preflight dominated the short LASYM run.
The patched one-step preflight keeps the same residual and cuts the GPU wall
time by about ``3.9x`` for this diagnostic.  The promoted converged QH decks
remain stable under the same policy: ``input.nfp4_QH_warm_start`` measured
``0.689 s`` with final residual ``1.11e-13``, and
``input.nfp4_QH_finite_beta`` measured ``3.45 s`` with final residual
``5.61e-13`` on ``office``.

.. list-table:: QH mode-2 exact dense-Jacobian callback, latest rerun, 2026-05-25
   :header-rows: 1

   * - Backend
     - Wall time
     - Tape build
     - Tape replay
     - Initial tangents
     - Residual tangents
   * - CPU
     - ``9.64 s``
     - ``0.48 s``
     - ``3.00 s``
     - ``2.21 s``
     - ``2.19 s``
   * - ``office`` GPU
     - ``11.36 s``
     - ``3.89 s``
     - ``3.38 s``
     - ``1.40 s``
     - ``2.42 s``

This callback has the same Jacobian shape and norm on CPU/GPU
(``44353 x 24``, norm ``5.705e2``), so the remaining gap is performance rather
than a different solve.  The latest GPU run used the effective JVP-only exact
tape path with basepoint carries.  The next optimizer-performance work should
target accepted-point tape build/replay and tangent construction; forcing scan
for this mode-2 dense callback is not the production fix.  Projected replay is
now enabled for non-LASYM GPU callbacks with at least 24 columns because later
bounded profiles show a small mode-2 win, but the remaining bottleneck is still
accepted-point tape/replay dispatch rather than residual-projection kernel math.
Older same-point cache probes still show that repeated identical callbacks can
return from cache in milliseconds, so the production cost is at new accepted
optimizer points rather than redundant repeated evaluations.

.. list-table:: QH mode-2 perturbed accepted-point dense-Jacobian matrix, 2026-05-25
   :header-rows: 1

   * - Backend
     - Wall time
     - Tape build
     - Tape replay
     - Initial tangents
     - Residual tangents
   * - CPU
     - ``12.94 s``
     - ``0.97 s``
     - ``3.87 s``
     - ``3.96 s``
     - ``2.22 s``
   * - ``office`` GPU
     - ``12.79 s``
     - ``4.99 s``
     - ``3.99 s``
     - ``1.30 s``
     - ``2.25 s``

The perturbed profile uses ``--perturb-scale 1e-4`` to mimic a new accepted
optimizer point.  It has the same Jacobian shape and norm on CPU/GPU
(``44353 x 24``, norm ``4.976e2``).  An earlier explicit
``--solver-device gpu`` matrix took ``53.18 s`` because the optimizer wrapped
every callback in a redundant ``jax.default_device`` context even though the
process backend was already GPU.  The current policy treats explicit requests
for the already-active backend as a no-op, removing that overhead while still
preserving explicit CPU/GPU routing when the requested backend differs from the
active process backend.

.. list-table:: Mode-3 perturbed exact dense-Jacobian callbacks, 2026-05-25
   :header-rows: 1

   * - Case
     - Backend
     - Wall time
     - Tape build
     - Replay path
     - Replay time
     - Initial tangents
     - Residual tangents
   * - QH
     - CPU
     - ``11.46 s``
     - ``0.79 s``
     - standard
     - ``4.29 s``
     - ``2.39 s``
     - ``2.14 s``
   * - QH
     - ``office`` GPU
     - ``13.20 s``
     - ``4.82 s``
     - projected
     - ``6.52 s``
     - ``1.58 s``
     - ``2.21 s``
   * - QA
     - CPU
     - ``11.87 s``
     - ``0.77 s``
     - standard
     - ``4.32 s``
     - ``2.37 s``
     - ``2.61 s``
   * - QA
     - ``office`` GPU
     - ``13.12 s``
     - ``4.36 s``
     - projected
     - ``7.00 s``
     - ``1.48 s``
     - ``2.70 s``

The mode-3 callbacks have 48 boundary parameters.  GPU replay now uses
8-column chunks for this size and enables projected replay for
stellarator-symmetric callbacks.  The 2026-05-30 ``office`` rerun supersedes
the older 24-column policy for these shapes.

.. list-table:: LASYM QH mode-2 perturbed exact dense-Jacobian callback, 2026-05-25
   :header-rows: 1

   * - Backend / policy
     - Wall time
     - Tape build
     - Replay time
     - Initial tangents
     - Residual tangents
   * - CPU standard
     - ``13.87 s``
     - ``0.84 s``
     - ``7.07 s``
     - ``2.40 s``
     - ``2.70 s``
   * - ``office`` GPU, 24-column chunks
     - ``15.90 s``
     - ``5.63 s``
     - ``5.65 s``
     - ``1.64 s``
     - ``2.66 s``

LASYM doubles the active boundary families, so the mode-2 callback also reaches
48 columns.  It should not use the symmetric mode-3 projected-replay heuristic:
the measured LASYM projected path took ``51.1 s`` total.  Chunking remains the
right default for this tested LASYM matrix, and the current GPU chunk policy is
the same 8-column policy used for stellarator-symmetric mode-2/mode-3
callbacks.  Explicit environment overrides are still available for profiling,
but the production automatic policy is now:

- GPU callbacks with at least 24 columns use 8-column replay chunks.
- Projected replay is only auto-enabled for stellarator-symmetric callbacks
  with at least 24 columns.
- LASYM uses the conservative chunked standard replay path unless explicitly
  overridden.

CPU/GPU profiling playbook
--------------------------

Use the diagnostics scripts from the repository root.  Always record whether a
timing is cold or warm, the selected JAX backend, JAX/JAXLIB versions, input
deck, ``max_mode`` or iteration budget, and whether the command is measuring
raw solver throughput or optimization callback overhead.
For sweep and matrix outputs, do not infer the runtime backend from an output
directory or ``--backend-label`` alone.  Use the JSON/CSV provenance fields
(``jax_backend``, ``jax_device_kind``, ``solver_device``, ``jax_platforms``)
and the profiler ``runtime`` block when comparing CPU/GPU results.

For optimization work, keep three measurements separate:

- raw fixed-boundary throughput, measured with ``profile_fixed_boundary.py``;
- accepted-point exact optimizer callbacks, measured with
  ``profile_exact_optimizer.py`` or ``gpu_cpu_performance_matrix.py``;
- QI Boozer/residual cost, measured with ``profile_qi_boozer_gpu.py`` before
  attributing a slow QI run to VMEC replay or SciPy trust-region behavior.

For CPU-only timings, force a CPU process with ``JAX_PLATFORMS=cpu`` or an
explicit ``--solver-device cpu``.  For GPU timings, use
``JAX_PLATFORM_NAME=gpu`` plus ``--solver-device gpu``.  On NVIDIA-only JAX
installs, ``JAX_PLATFORMS=cuda`` is also valid; avoid ``JAX_PLATFORMS=gpu`` on
mixed CUDA/ROCm installations because some JAX versions try to initialize both
backends.

Raw fixed-boundary throughput:

.. code-block:: bash

   JAX_PLATFORMS=cpu JAX_ENABLE_X64=1 PYTHONPATH=. python tools/diagnostics/profile_fixed_boundary.py \
     --input examples/data/input.nfp4_QH_warm_start \
     --iters 20 \
     --simple-profile \
     --no-multigrid \
     --no-auto-cli-policy \
     --solver-mode accelerated \
     --no-use-scan \
     --solver-device cpu \
     --json-out /tmp/vmec_jax_qh20_raw_cpu.json

   JAX_PLATFORM_NAME=gpu JAX_ENABLE_X64=1 PYTHONPATH=. python tools/diagnostics/profile_fixed_boundary.py \
     --input examples/data/input.nfp4_QH_warm_start \
     --iters 20 \
     --simple-profile \
     --no-multigrid \
     --no-auto-cli-policy \
     --solver-mode accelerated \
     --solver-device gpu \
     --json-out /tmp/vmec_jax_qh20_raw_gpu.json

The profiler follows production scan policy when neither ``--use-scan`` nor
``--no-use-scan`` is supplied.  Use ``--use-scan`` only to force scan on a CPU
run or diagnostic comparison, and use ``--no-use-scan`` to force the
Python/VMEC-control loop on GPU.

Short exact-optimizer runs:

.. code-block:: bash

   JAX_PLATFORMS=cpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 3 --max-nfev 2 --run-repeats 3 \
     --inner-max-iter 120 --trial-max-iter 120 --solver-device cpu \
     --json-out /tmp/vmec_jax_qh_m3_cpu.json

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 3 --max-nfev 2 --run-repeats 3 \
     --inner-max-iter 120 --trial-max-iter 120 --solver-device gpu \
     --vmec-timing --json-out /tmp/vmec_jax_qh_m3_gpu.json

Accepted-point Jacobian callback cost at realistic new optimizer points:

.. code-block:: bash

   JAX_PLATFORMS=cpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 2 --callback jacobian --repeats 3 \
     --perturb-scale 1e-4 --inner-max-iter 80 --trial-max-iter 40 \
     --solver-device cpu --vmec-timing --json-out /tmp/qh_m2_cpu_jacobian.json

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 2 --callback jacobian --repeats 3 \
     --perturb-scale 1e-4 --inner-max-iter 80 --trial-max-iter 40 \
     --solver-device gpu --vmec-timing --json-out /tmp/qh_m2_gpu_jacobian.json

Compare the JSON reports before launching a full sweep or a long GPU run:

.. code-block:: bash

   PYTHONPATH=. python tools/diagnostics/compare_profile_reports.py \
     /tmp/qh_m2_cpu_jacobian.json /tmp/qh_m2_gpu_jacobian.json \
     --label cpu --label gpu \
     --json-out /tmp/qh_m2_cpu_gpu_comparison.json

The comparison summary exposes exact optimizer phases separately:
``exact_tape_build_s``, ``exact_tape_build_unattributed_s``,
``initial_tangents_s``, ``initial_projection_s``, ``residual_tangents_s``,
``trial_solve_s``, and ``exact_solve_s``.  Use these fields before changing
solver kernels: in recent GPU runs the dominant cost has been accepted-point
tape/replay and tangent construction, not VMEC force assembly.
The same tool also prints an "Exact optimizer patch targets" section that
ignores enclosing timers such as ``jacobian_total`` and
``exact_solve_with_tape_total``.  Treat that row as the next concrete profiling
target when a total timer is largest; it points at leaf-like accepted-point
work such as ``*_tape_replay``, ``exact_tape_build_unattributed``,
``*_initial_vjp``, or ``*_residual_tangents``.

QI Boozer/residual isolation:

.. code-block:: bash

   JAX_PLATFORMS=cpu PYTHONPATH=. python tools/diagnostics/profile_qi_boozer_gpu.py \
     --solver-device cpu --repeat 2 --jit-booz \
     --output results/diagnostics/qi_boozer_cpu.json

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_qi_boozer_gpu.py \
     --solver-device gpu --repeat 2 --jit-booz \
     --output results/diagnostics/qi_boozer_gpu.json

The user-facing QI optimization helpers default to the jitted Boozer path
(``QuasiIsodynamicOptions(jit_booz=True)``).  Keep the non-jitted profiler mode
available for parity isolation; use ``--jit-booz`` for production-like QI
timings.

May 2026 QI Boozer/GPU split
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the bundled ``input.nfp2_QI`` low-resolution diagnostic
(``mpol=ntor=mboz=nboz=3``, one Boozer surface, ``repeat=3``), ``--jit-booz``
improved the QI/Boozer phase on both backends:

- CPU: total profiler time ``20.17 s`` without Boozer JIT and ``18.73 s`` with
  Boozer JIT; first QI/Boozer call improved from ``8.38 s`` to ``6.78 s``.
- ``office`` GPU: total profiler time ``27.39 s`` without Boozer JIT and
  ``20.88 s`` with Boozer JIT; first QI/Boozer call improved from ``11.01 s``
  to ``4.98 s``.

The remaining gap in this small case comes from the VMEC solve, not from the QI
residual itself.  A raw 80-iteration fixed-boundary ``input.nfp2_QI`` profile
with the accelerated single-grid path measured ``0.65 s`` on CPU and ``3.11 s``
on GPU; forcing the scan path measured ``0.55 s`` on CPU and ``3.35 s`` on GPU.
The next GPU work should therefore target fixed-boundary force/update kernel
launch and tape structure, while keeping Boozer JIT enabled for QI production
runs.

A bounded QH warm-start forced non-scan profile on 2026-05-23
(``input.nfp4_QH_warm_start``, 20 iterations, ``--no-warmup``) measured
``9.70 s`` total on local CPU and ``7.48 s`` total on ``office`` GPU.  The GPU
was faster end-to-end in that cold profile, but the VMEC phase timers still
showed slower accelerator micro-kernels: ``compute_forces`` was ``1.03 s`` on
GPU versus ``0.094 s`` on CPU, and ``update`` was ``0.133 s`` on GPU versus
``0.008 s`` on CPU.  This is retained as historical forced-path evidence.  The
current policy is narrower: ordinary GPU fixed-boundary solves use the scan path
by default, while exact accepted-point GPU Jacobians use the tape path with
JVP-only/basepoint carries enabled by default.

A follow-up ``LASYM=true`` finite-beta fixed-boundary profile on ``office``
(``input.basic_non_stellsym_pressure``, 20 iterations, forced non-scan
diagnostic path, no warmup) measured ``21.38 s`` profile time on CPU and
``26.15 s`` on GPU.  GPU force assembly was faster (``1.29 s`` versus ``2.31 s``), but GPU
preconditioner time was slower (``2.48 s`` versus ``1.09 s``), so the next raw
GPU solve target for this lane is preconditioner/update launch structure rather
than the force kernel alone.  Forcing the scan path was slower on both devices
in the same case: ``35.76 s`` CPU and ``44.96 s`` GPU profile time.

The next ``office`` pass isolated the preconditioner piece: enabling
precomputed Thomas coefficients for the same historical raw LASYM GPU profile reduced
wrapper time from ``30.50 s`` to ``11.33 s`` and detailed preconditioner time
from ``2.66 s`` to ``0.74 s``.  Short VMEC2000 trace parity was then checked for
the LASYM finite-pressure case and the QH warm-start case.  The public
production policy now enables this path automatically for non-scan,
performance-mode accelerator LASYM solves, and for higher-mode accelerator
non-LASYM solves where the mode count is large enough to amortize the
coefficient setup.  Small non-LASYM decks keep the legacy default: the
2026-05-23 QH warm-start profile regressed from ``13.4 s`` to ``14.4 s`` when
precomputation was forced, while the 50-mode finite-beta QH profile improved
from ``52.6 s`` to ``49.4 s`` with the same iteration count and final residual.
Set ``VMEC_JAX_TRIDI_PRECOMPUTE=0`` to disable this narrow accelerator default
during diagnostics, or ``=1`` to force it in lower-level experiments.

The same commit was profiled with a one-callback QH mode-2 exact Jacobian
matrix on ``office`` (``inner_max_iter=40``, ``trial_max_iter=20``,
``--sync-replay-timing``).  The GPU profile was faster end-to-end than CPU
(``15.45 s`` versus ``29.88 s``), with GPU accepted exact solve ``7.20 s``,
tape build ``5.45 s``, replay ``6.16 s``, and initial tangent construction
``2.09 s``.  CPU spent ``11.02 s`` in the exact solve, ``7.16 s`` in tape build,
``6.72 s`` in replay, ``8.19 s`` in initial tangents, and ``3.95 s`` in
residual tangents.  This keeps the current GPU optimization target focused on
cold accepted-point tape/replay/tangent construction; forcing scan for GPU is
not justified by these profiles.
A matched run with ``--jvp-only-exact-tape --jvp-only-basepoint-carries`` was
nearly neutral: CPU profile time was ``29.97 s`` and GPU profile time was
``14.95 s``.  That is a small GPU win for this one callback, but not enough to
change the default without larger mode and full optimizer-trajectory coverage.

A 2026-05-23 GPU sidecar pass on ``office`` at ``4d61eab`` used one QH mode-2
cold Jacobian callback (``inner_max_iter=20``, ``trial_max_iter=20``,
``--sync-replay-timing``) plus matched trial and raw-LASYM probes.  Full-tape
exact took ``13.29 s`` on CPU and ``36.81 s`` on GPU; the GPU regression was
the accepted replay dispatch bucket (``21.10 s`` dispatch, ``0.00 s`` ready),
not tangent construction (GPU initial/residual tangents ``2.04 s``/``2.46 s``).
The same GPU callback with
``--jvp-only-exact-tape --jvp-only-basepoint-carries`` dropped to ``14.48 s``
and replay to ``3.62 s``, but the matched CPU JVP-only callback regressed to
``27.86 s`` with ``8.96 s`` in initial tangents.  Treat this as a GPU-only
candidate that needs mode-3 and full-trajectory validation before changing
defaults.

The same sidecar confirmed the trial-scan policy.  Forced scan trial solves
spent ``6.76 s`` in the CPU scan block and ``11.68 s`` in the GPU scan block;
the GPU device bucket was almost entirely dispatch/compile-like
(``10.24 s`` dispatch, ``0.001 s`` ready).  Disabling scan reduced the trial
profiles to ``4.46 s`` on CPU and ``6.76 s`` on GPU.  Later QH ``max_mode=2``
profiles briefly favored a patched scan GPU default, but the 2026-06-02
``office`` QH ``max_mode=4`` optimizer trace reversed that policy: forced
non-scan trial solves reduced the short run from ``89.23 s`` to ``77.73 s`` and
the trial solve itself from ``15.95 s`` to ``4.56 s``.  Trial callbacks
therefore default to non-scan on CPU/GPU/CUDA/ROCm and scan only on TPU; use
``VMEC_JAX_OPT_TRIAL_SCAN=1`` or ``0`` to force either path.

For historical raw ``LASYM=true`` ``input.basic_non_stellsym_pressure`` non-scan solves
(``20`` iterations, single-grid, no warmup, detailed timing), the same pass
measured ``7.98 s`` on CPU and ``8.84 s`` on GPU.  Force assembly was comparable
(``1.29 s`` CPU, ``1.27 s`` GPU), GPU update was faster (``0.14 s`` versus
``0.41 s``), and GPU preconditioner apply remained slower (``0.44 s`` versus
``0.27 s``).  The current scan-lifted profile above has moved that short-case
target to preflight and scan-runner setup rather than force assembly alone.

A larger historical QH mode-3 exact-Jacobian callback on ``office`` showed
that the pre-JVP-only projected-replay residual path was the wrong default for
GPU.  With projected replay enabled the one-callback profile took ``69.06 s``
wall time (``63.53 s`` profile time, ``33.40 s`` replay, ``10.00 s``
projected residual tangents).  The same callback with projected replay
disabled took ``21.59 s`` wall time (``16.29 s`` profile time, ``4.29 s``
replay, ``2.38 s`` residual tangents).  Later JVP-only/basepoint-carry
profiles changed that conclusion for non-LASYM callbacks, so production GPU
exact callbacks now enable projected replay for stellarator-symmetric cases
with at least 24 columns.  ``VMEC_JAX_OPT_PROJECTED_REPLAY_RESIDUALS`` remains
available as an explicit diagnostic override.

For raw ``input.nfp2_QI`` follow-up profiling, keep the production-like scan
measurement separate from phase attribution.  The scan path is best inspected
with XProf traces because the force/preconditioner/update work is inside one
``lax.scan`` program.  The non-scan path supports lightweight JSON phase timing
with ``--vmec-timing``:

.. code-block:: bash

   PYTHONPATH=. python tools/diagnostics/gpu_cpu_performance_matrix.py \
     --mode fixed-boundary --backend cpu --backend gpu --keep-going \
     --input examples/data/input.nfp2_QI \
     --iters 80 \
     --solver-mode accelerated \
     --single-grid \
     --raw-solver-policy \
     --use-scan \
     --outdir outputs/performance_profiles/qi_raw_scan80

   JAX_PLATFORMS=cpu JAX_ENABLE_X64=1 PYTHONPATH=. python tools/diagnostics/profile_fixed_boundary.py \
     --input examples/data/input.nfp2_QI \
     --iters 20 \
     --simple-profile \
     --no-warmup \
     --no-multigrid \
     --no-auto-cli-policy \
     --solver-mode accelerated \
     --solver-device cpu \
     --vmec-timing \
     --vmec-timing-detail \
     --json-out outputs/performance_profiles/qi_raw_cpu_phase20.json

   JAX_PLATFORM_NAME=gpu JAX_ENABLE_X64=1 PYTHONPATH=. python tools/diagnostics/profile_fixed_boundary.py \
     --input examples/data/input.nfp2_QI \
     --iters 20 \
     --simple-profile \
     --no-warmup \
     --no-multigrid \
     --no-auto-cli-policy \
     --solver-mode accelerated \
     --solver-device gpu \
     --vmec-timing \
     --vmec-timing-detail \
     --json-out outputs/performance_profiles/qi_raw_gpu_phase20.json

The fixed-boundary raw-solve source path is:
``vmec_jax/driver.py::run_fixed_boundary`` for accelerated/single-grid policy,
``vmec_jax/solve.py::solve_fixed_boundary_residual_iter`` for the VMEC loop and
``_run_vmec2000_scan`` scan runner, ``vmec_jax/solve.py::_compute_forces`` for
per-iteration force assembly and residual norms,
``vmec_jax/vmec_forces.py::vmec_forces_rz_from_wout`` and
``vmec_residual_internal_from_kernels`` for bcovar/force/tomnsps assembly, and
``vmec_jax/vmec_tomnsp.py::tomnsps_rzl`` for the Fourier transform hot kernel.
The fixed-boundary profiler now keeps any ``diagnostics.timing`` block in its
JSON, and ``compare_profile_reports.py`` / ``gpu_cpu_performance_matrix.py``
surface ``vmec_compute_forces_s``, ``vmec_preconditioner_s``, and
``vmec_update_s`` when those timings exist.

The comparison table reports ratios for total runtime, compile/replay/cache
time when those timings exist, callback count, observed RSS peak, solve count,
accepted-point replay count, and cache growth.  The JSON output is stable enough
for CI dashboards or follow-up scripts that track whether a GPU regression is
coming from tape replay, extra callbacks, cache retention, or a cold
compile-like phase.
It also emits a cheap ``bottleneck_hint`` for each report by selecting the
largest exposed phase timing (QI/Boozer first call, VMEC force assembly,
preconditioner/update, accepted-point replay, compile, or cache).  Treat this
as a triage pointer only: it cannot identify work hidden inside a fused
``lax.scan`` unless the source profile exported phase timings.

For repeatable CPU/GPU matrix launches, use the wrapper below.  Its default
``--backend auto`` does not set ``JAX_PLATFORMS`` or ``JAX_PLATFORM_NAME``; the
child process inherits the user's active JAX/GPU selection.  Add explicit
``--backend cpu`` or ``--backend gpu`` only when the comparison should launch
separate CPU/GPU processes.

.. code-block:: bash

   PYTHONPATH=. python tools/diagnostics/gpu_cpu_performance_matrix.py \
     --mode fixed-boundary \
     --backend auto \
     --input examples/data/input.nfp4_QH_warm_start \
     --use-input-niter \
     --public-cli-policy \
     --allow-multigrid \
     --outdir outputs/performance_profiles/qh20_auto

   PYTHONPATH=. python tools/diagnostics/gpu_cpu_performance_matrix.py \
     --mode exact-callback \
     --backend cpu --backend gpu --keep-going \
     --problem qh --max-mode 2 --callback jacobian --repeats 3 \
     --perturb-scale 1e-4 --inner-max-iter 80 --trial-max-iter 40 \
     --method scipy_matrix_free --dynamic-replay-mode whole_scan \
     --vmec-timing \
     --outdir outputs/performance_profiles/qh_m2_cpu_gpu

For fixed-boundary production-policy profiles, omit both ``--use-scan`` and
``--no-use-scan`` so the child profiler follows backend-aware production scan
selection.  Add ``--use-input-niter`` to use the input deck's full
``NITER``/``NITER_ARRAY`` schedule instead of the wrapper's short ``--iters``
budget.

The wrapper delegates to ``profile_fixed_boundary.py`` or
``profile_exact_optimizer.py`` and writes one child JSON per backend plus a
matrix JSON.  The printed table reports wrapper wall time, child profile wall
time, and replay time when the child report exposes it.  The matrix JSON embeds
the normalized ``compare_profile_reports.py`` summary, so dashboards can track
``total_runtime_s``, ``replay_time_s``, ``accepted_point_replay_count``,
``cache_entry_growth``, and RSS peak without parsing profiler-specific output
shapes.  Each child process also writes ``*.stdout.log`` and ``*.stderr.log``
files next to its JSON report so failed GPU jobs can be diagnosed without
rerunning the matrix.

Use ``--dry-run`` before scheduling cluster jobs; it prints and records the
exact child commands and backend environment overrides without importing JAX in
the child.  Use ``--replay-column-chunk`` or
``--dynamic-replay-bucket`` to make tape/replay tuning explicit in the report.
Use ``--dynamic-replay-mode`` and ``--method`` to compare accepted-point replay
paths and dense/matrix-free optimizer behavior under the same child-launch
policy.
For cold exact-callback bucket audits, start with one perturbed callback and a
small iteration budget before launching larger GPU sweeps:

.. code-block:: bash

   PYTHONPATH=. python tools/diagnostics/gpu_cpu_performance_matrix.py \
     --mode exact-callback \
     --backend cpu --backend gpu --keep-going \
     --problem qh --max-mode 2 --callback jacobian --repeats 1 \
     --perturb-scale 1e-4 --inner-max-iter 20 --trial-max-iter 20 \
     --method scipy_matrix_free --vmec-timing --sync-replay-timing \
     --outdir outputs/performance_profiles/qh_m2_cold_bucket_smoke

The matrix table surfaces ``exact_s``, ``tape_build_s``,
``tape_unattr_s``, ``replay_s``, ``replay_dispatch_s``,
``replay_ready_s``, ``init_tangent_s``, ``resid_tangent_s``, ``callbacks``,
and ``replays`` when the child exact-callback report exposes those metrics.
When the child reports contain the detailed buckets, the wrapper also prints
separate ``Trial scan timing``, ``Scan cache details``, and
``Projected replay / JVP details`` tables and embeds the same sections in the
matrix JSON.  Use those sections to distinguish scan-cache misses from
projected-replay residual tangent cost before changing solver kernels.  The
scan-cache diagnostic helpers also expose stable cache-miss categories
(``iteration_budget``, ``tolerance``, ``scan_policy``,
``stage_transition``, ``fallback_policy``, etc.) so profiler summaries can
report why a trial solve missed the cached scan runner without relying on raw
tuple offsets.
Use ``--sync-replay-timing`` only for targeted cold-bucket diagnostics: it
adds ``block_until_ready`` synchronization so dispatch and device-ready buckets
are attributable, but that synchronization is not representative of production
sweep throughput.
Malformed ``VMEC_JAX_REPLAY_COLUMN_CHUNK`` values now fall back to the automatic
replay memory guard rather than aborting the Jacobian callback; set
``VMEC_JAX_REPLAY_COLUMN_CHUNK=off`` or ``0`` only when chunking should be
disabled for a targeted profiling run.  The optimizer-specific
``VMEC_JAX_LASYM_REPLAY_COLUMN_CHUNK`` override follows the same safe parsing
policy, so malformed values fall back to the backend/input auto heuristic.

Use ``--trace-outdir`` for TensorBoard/XProf traces and
``--device-memory-profile-out`` for JAX device-memory snapshots when GPU memory
or launch overhead is the bottleneck.  Use ``--no-auto-cli-policy`` only when
you want raw solver throughput; omit it when measuring the public
``run_fixed_boundary`` policy that users see through the CLI/API.

Exact optimizer profiling
-------------------------

Use ``tools/diagnostics/profile_exact_optimizer.py`` to time the exact
optimization callback stack for the QA, QH, and QP quasisymmetry objectives:

.. code-block:: bash

   PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qa --max-mode 2 --max-nfev 2 \
     --trial-max-iter 300 --trial-ftol 1e-10

Use ``--problem qp`` for the quasi-poloidal fixed-boundary profile preset.  QI
uses Boozer-space residuals and promotion gates, so profile it with the
dedicated ``tools/diagnostics/profile_qi_boozer_gpu.py`` path before launching
full QI optimization sweeps.

The callback profile reports separate timings for relaxed trial solves, exact
tape construction, checkpoint-tape JVP replay, residual tangent projection, and
``wout`` writing.  The profiler does not compute initial aspect/QS metrics by
default because that requires an exact solve that is immediately cleared before
the measured callback/run; add ``--initial-metrics`` only when you want that
sanity check and do not need cold-start timing purity.  On the current
exact-adjoint implementation, the dominant
term for ``max_mode=2`` and ``max_mode=3`` is
``jacobian_tape_replay``.  Ordinary fixed-boundary solves can benefit from GPU
``lax.scan`` after warmup.  For exact optimization, accepted-point Jacobians use
the discrete-adjoint tape path on both CPU and GPU by default.  May 2026
``office`` RTX A4000 profiling showed that forced scan accepted-point Jacobians
are still useful for parity/profiling experiments but can be much slower than
tape on production-like cold GPU callbacks.  Set
``VMEC_JAX_OPT_EXACT_PATH=tape`` or ``VMEC_JAX_OPT_EXACT_PATH=scan`` to force
one accepted-point path for parity or profiling.  Relaxed trial residuals are
backend-aware: CPU/GPU/CUDA/ROCm default to the VMEC-control non-scan loop, and
TPU defaults to scan.  May 2026 bounded QH mode-2 profiles measured local CPU
non-scan trial callbacks at ``3.64 s`` total for two repeats with sane
residuals; forced CPU scan was slower and produced less reliable finite-step
diagnostics.  A 2026-06-02 ``office`` RTX A4000 QH mode-4 optimizer trace then
showed the non-scan GPU trial path reducing the trial solve from ``15.95 s`` to
``4.56 s``.  Set ``VMEC_JAX_OPT_TRIAL_SCAN=1`` or ``0`` to force either path for
diagnostics.
``solver_device=None``, ``"auto"``, and ``"default"`` inherit JAX's active
backend; pass ``solver_device="cpu"`` or ``"gpu"`` only when you want an explicit
override.

The tape implementation is backend-aware.  CPU accepted-point Jacobians keep the
full-tape default.  GPU/CUDA/ROCm accepted-point Jacobians default to JVP-only
tapes while preserving basepoint carries so the fast basepoint replay path
remains available.  ``VMEC_JAX_OPT_JVP_ONLY_EXACT_TAPE=0`` restores the full
GPU tape, ``=1`` forces JVP-only on any backend, and
``VMEC_JAX_JVP_ONLY_EXACT_TAPE_BASEPOINT_CARRIES=0`` disables the GPU carry
default for profiling.  The diagnostics wrappers expose the pair as
``--jvp-only-exact-tape --jvp-only-basepoint-carries`` so CPU/GPU matrices can
record the exact environment without hand-written shell exports.

Representative May 2026 callback timings after the backend-adaptive replay
bucket, scalar-gradient tangent-cache, and GPU replay-chunk changes were:

.. list-table::
   :header-rows: 1

   * - Case
     - Device/path
     - Budget
     - Wall time
   * - QH ``max_mode=1`` dense Jacobian callback
     - ``office`` RTX A4000, tape exact
     - ``inner_max_iter=20``, one cold callback
     - ``9.29 s``
   * - QH ``max_mode=1`` dense Jacobian callback
     - same ``office`` host CPU/JAX stack, tape exact
     - ``inner_max_iter=20``, one cold callback
     - ``16.97 s``
   * - QH ``max_mode=2`` dense Jacobian callbacks
     - ``office`` RTX A4000, tape exact
     - ``inner_max_iter=80``, two perturbed callbacks
     - ``16.48 s``
   * - QH ``max_mode=2`` dense Jacobian callbacks
     - local CPU, tape exact
     - ``inner_max_iter=80``, two perturbed callbacks
     - ``10.34 s``
   * - QH ``max_mode=1`` dense Jacobian callback
     - local CPU, tape exact
     - ``inner_max_iter=20``, one cold callback
     - ``16.115 s`` total; ``7.428 s`` exact solve/tape, ``3.644 s`` replay,
       ``2.736 s`` initial tangents, ``2.295 s`` residual tangents

These short cases show that GPU is now competitive or faster on some cold
callbacks, but not uniformly faster for production-like mode-2 dense
Jacobians.  The production default is therefore still the tape exact path on
GPU, while CPU remains the conservative recommendation for small dense
least-squares optimizations.  Forced scan exact remains available through
``VMEC_JAX_OPT_EXACT_PATH=scan`` for targeted diagnostics; a mode-2 forced-scan
GPU probe was stopped after the cold compile exceeded the practical profiling
budget, so it is not a production default.  The remaining exact-callback
bottleneck is still split across accepted-point solve/tape construction,
checkpoint-tape replay, initial tangent construction, and residual tangent
projection; further speedups should target reuse/fusion across those stages
rather than raw force-kernel throughput alone.

For same-process warmup studies, repeat a callback at the same point or repeat
the whole short optimizer run while keeping compiled executables warm:

.. code-block:: bash

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 2 --callback jacobian --repeats 2 \
     --inner-max-iter 80 --trial-max-iter 40 --solver-device gpu

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 3 --max-nfev 2 --run-repeats 3 \
     --inner-max-iter 120 --trial-max-iter 120 --solver-device gpu

For realistic accepted-point studies, perturb the parameter vector on each
repeat.  This keeps compiled helper shapes warm while forcing a new equilibrium
tape/state at each point, matching the cost structure of a real optimizer
trajectory more closely than same-point repeats:

.. code-block:: bash

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 2 --callback jacobian --repeats 3 \
     --perturb-scale 1e-4 --inner-max-iter 80 --trial-max-iter 40 \
     --solver-device gpu --vmec-timing --json-out qh_m2_gpu_new_points.json

With ``--vmec-timing``, the callback profile also splits
``exact_tape_build`` into solver compute-force, preconditioner, update, and
unattributed tape-building overhead terms.  Add ``--vmec-timing-detail`` when
the preconditioner bucket is the bottleneck; it further reports
``exact_tape_solver_preconditioner_apply`` and
``exact_tape_solver_preconditioner_mode_scale``.  The detailed mode adds extra
synchronization and should be used for targeted diagnostics, not production
sweeps.

The replay diagnostics also include
``replay_jvp_columns_*`` counters for the accepted-point tangent replay path:
which path ran (identity, dynamic-basepoint scan, segmented dynamic-basepoint
scan, dynamic linearization, dynamic-scan linearization, generic per-trace, or
generic scan), how many tangent-column leaf calls were replayed, how many input
columns they covered, and whether column chunking split the callback.  This
keeps GPU profiling focused on accepted-point replay/tangent construction
instead of conflating scan-cache misses, replay chunks, and residual projection
in one aggregate callback timer.

Add ``--sync-replay-timing`` when the question is whether a cold callback is
spending time in replay dispatch/compile-like overhead or in device-ready
execution.  This exposes ``*_tape_replay_dispatch``,
``*_tape_replay_ready``, ``*_initial_tangents_vmap_dispatch``, and
``*_initial_tangents_vmap_ready`` buckets in the JSON profile.  Dense
Jacobian residual projection now reports the same split through
``jacobian_residual_tangents_dispatch`` and
``jacobian_residual_tangents_ready``; projected replay reports
``jacobian_projected_replay_residual_tangents_dispatch`` and
``jacobian_projected_replay_residual_tangents_ready``.  Keep this option off
for normal CPU/GPU comparison sweeps because the explicit synchronization
changes the measured workload.

A 2026-06-01 ``office`` rerun on RTX A4000/JAX 0.6.2 rechecked QH
``max_mode=2`` dense exact-Jacobian callbacks with perturbed accepted points,
``--inner-max-iter 80``, and single-GPU sequential timing.  For non-LASYM
projected replay, larger chunks were faster than the older conservative
8-column policy: two callbacks took ``81.0 s`` with chunk 24 versus ``83.5 s``
with chunk 8, and the replay-dispatch bucket dropped from ``30.5 s`` to
``25.1 s``.  Additional cold checks gave ``73.2 s`` for QH ``max_mode=3`` with
chunk 48 versus ``81.4 s`` with chunk 8, and ``69.7 s`` for a reduced
``max_mode=4`` check with chunk 40 versus ``74.0 s`` with chunk 8.  The default
GPU policy therefore keeps LASYM at conservative 8-column chunks, but uses
larger bounded chunks for non-LASYM projected replay.  The remaining
bottlenecks are accepted-tape build, replay dispatch/compile-like overhead,
and dense residual-tangent projection.

A 2026-06-02 follow-up on QH ``max_mode=4`` rechecked smaller and disabled
non-LASYM replay chunks after the GPU trial-solve policy change.  With the
same ``inner_max_iter=80`` exact-Jacobian callback budget, chunk 20 took
``69.4 s`` and no explicit chunking took ``71.0 s``.  Both were slower than the
current bounded default policy for this case, so the production heuristic was
left unchanged.  The actionable GPU target was therefore reducing projected
replay dispatch and residual-tangent projection overhead, not shrinking the
replay chunk size further.

The same mode-4 callback also ruled out two simple replay-mode promotions.
Opting into the fused replay/projection helper with chunking disabled took
``69.9 s`` and spent ``38.6 s`` inside fused projected replay, so it did not
improve on the default bounded chunk policy.  Forcing
``VMEC_JAX_DYNAMIC_REPLAY_MODE=whole_scan`` was worse: ``81.5 s`` total with
``50.0 s`` in projected replay.  Keep both modes as diagnostics rather than
defaults for high-mode QH until a future implementation changes the underlying
dispatch structure.

The next source change kept the same bounded replay chunks but changed how
chunked projected replay is assembled.  For accelerator non-LASYM callbacks
where the parameter count exceeds the replay chunk size, ``vmec-jax`` now
projects residual tangents immediately after each replay chunk and concatenates
the resulting Jacobian blocks.  This avoids materializing one full
``n_params x state_size`` tangent block before residual projection.  A QH
``max_mode=4`` callback on ``office`` with ``inner_max_iter=80`` dropped from
the previous ``~69--71 s`` range to ``61.5 s`` by default; an explicit opt-in
probe of the same path measured ``60.1 s``.  The new timing bucket is
``jacobian_chunked_projected_replay_projection_total``.  Set
``VMEC_JAX_OPT_CHUNKED_PROJECTED_REPLAY_PROJECTION=0`` to restore the old
full-tangent projection path for diagnostics.  A short QH ``max_mode=4``
optimizer run with ``max_nfev=3`` then confirmed the callback result at the
solver level: wall time dropped from the previous dense SciPy ``94.8 s`` trace
to ``78.4 s`` while reaching the same final objective scale
(``1.74024e-1``).  The remaining high-mode GPU buckets are chunked
replay/projection, accepted exact-tape build, and relaxed trial solves.
The same diagnostics now cover QP.  On ``office`` at commit ``80ccd51``, QP
``max_mode=4`` with 80 boundary DOFs took ``63.8 s`` for one exact Jacobian
callback, with ``31.9 s`` in
``jacobian_chunked_projected_replay_projection_total`` and ``22.5 s`` in tape
build.  A short QP ``max_mode=4`` optimizer run with ``max_nfev=3`` took
``70.5 s`` for one accepted Jacobian plus two trial solves.  QP ``max_mode=5``
with 120 boundary DOFs took ``68.2 s`` for one exact Jacobian, with ``36.2 s``
in chunked projected replay/projection and ``22.8 s`` in tape build.  The
mode-5 result confirms that the current chunked projection path scales
reasonably with parameter count; the next production target remains reducing
the replay/projection dispatch and cold accepted-tape build costs.

The same QP ``max_mode=4`` short optimizer was then profiled with
``method="auto_scalar"``.  With cost-only scalar trial filtering enabled it took
``90.6 s`` and reduced the objective to ``6.32``; with cost-only trials disabled
it took ``78.1 s`` and reduced the objective to ``6.55``.  Both scalar-adjoint
runs made more progress than the dense three-evaluation diagnostic, but the
no-cost-only variant removed three extra forward trial solves with only a small
objective penalty.  Therefore ``auto_scalar`` now leaves
``scalar_cost_only_trials`` disabled by default and keeps it as an explicit
globalization knob.

A follow-up QH ``max_mode=2`` GPU profile on ``office`` with
``--inner-max-iter 80``, ``--trial-max-iter 40``,
``VMEC_JAX_DYNAMIC_REPLAY_MODE=basepoint``,
``VMEC_JAX_REPLAY_COLUMN_CHUNK=0``, and ``--sync-replay-timing`` confirmed
that the standard dense residual projection is dispatch/tracing dominated:
three perturbed exact-Jacobian callbacks took ``13.04 s``, ``3.98 s``, and
``3.27 s``.  The cumulative ``jacobian_residual_tangents`` bucket was
``2.263 s``, split into ``2.204 s`` dispatch and only ``0.059 s``
device-ready time.  The next GPU exact-callback optimization should therefore
reduce Python/JAX dispatch and callback construction around accepted-point
replay/projection before spending effort on residual-projection kernel math.

A 2026-06-02 ``office`` profile on the same RTX A4000/JAX 0.6.2 setup then
isolated the accepted-tape preconditioner cost.  For QH ``max_mode=2`` with
24 boundary DOFs, three perturbed GPU exact-Jacobian callbacks dropped from
``68.16 s`` to ``65.04 s`` when
``VMEC_JAX_OPT_EXACT_TRIDI_PRECOMPUTE_MAX_DOFS=24`` was enabled; the first warm
callback dropped from ``4.39 s`` to ``3.63 s``.  For QH ``max_mode=3`` with
48 DOFs and a 60-iteration budget, two callbacks dropped from ``64.98 s`` to
``63.70 s`` with the threshold set to 48; the warm callback dropped from
``3.57 s`` to ``3.38 s``.  The production default therefore precomputes tridi
coefficients for accelerator exact tapes up to 48 DOFs and leaves larger
parameter spaces on the legacy solver policy unless explicitly overridden.

The 2026-05-28 mode-2 GPU policy check also compared regular projected replay
against the fused replay/projection helper.  Regular projected replay took
``12.66 s`` for one QH mode-2 exact-Jacobian callback
(``jacobian_projected_replay_total=6.16 s``), while the fused helper took
``13.38 s`` (``jacobian_fused_projected_replay_total=6.25 s``).  Fusion is
therefore opt-in through ``VMEC_JAX_OPT_FUSED_PROJECTED_REPLAY=1``; the default
uses regular projected replay for non-LASYM GPU mode-2+ cases.

For production cache-growth audits, use the same accepted-point callback mode
with JSON output and explicit budgets.  The report records per-repeat phase
deltas, optimizer/global JIT cache cardinalities before and after each repeat,
RSS growth, cumulative callback profile, and a ``budget_status`` block.  A
budget breach exits with status ``2`` by default; pass ``--budget-action warn``
to keep the run informational:

.. code-block:: bash

   JAX_PLATFORMS=cpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 2 --callback jacobian --repeats 3 \
     --perturb-scale 1e-4 --inner-max-iter 80 --trial-max-iter 40 \
     --solver-device cpu --vmec-timing \
     --budget-total-wall-s 45 --budget-repeat-wall-s 20 \
     --budget-tape-build-wall-s 20 --budget-replay-wall-s 15 \
     --budget-residual-tangent-wall-s 10 --budget-accepted-replays 3 \
     --budget-cache-entry-growth 12 --budget-rss-growth-mb 1024 \
     --json-out /tmp/qh_m2_cpu_callback_cache.json

   JAX_PLATFORM_NAME=gpu PYTHONPATH=. python tools/diagnostics/profile_exact_optimizer.py \
     --problem qh --max-mode 2 --callback jacobian --repeats 3 \
     --perturb-scale 1e-4 --inner-max-iter 80 --trial-max-iter 40 \
     --solver-device gpu --vmec-timing \
     --budget-total-wall-s 90 --budget-repeat-wall-s 45 \
     --budget-tape-build-wall-s 45 --budget-replay-wall-s 30 \
     --budget-residual-tangent-wall-s 20 --budget-accepted-replays 3 \
     --budget-cache-entry-growth 12 --budget-rss-growth-mb 4096 \
     --json-out /tmp/qh_m2_gpu_callback_cache.json

Use ``--callback accepted`` when you only want the accepted-point residual/tape
build without dense Jacobian replay.  Keep ``--clear-between-repeats`` off for
cache-growth audits; enabling it intentionally drops optimizer/JIT caches
between repeats and measures cold callback behavior instead.  Malformed
``VMEC_JAX_DYNAMIC_REPLAY_BUCKET`` values fall back to the backend-adaptive
default instead of forcing the CPU-sized bucket on GPU diagnostics.  The
``--budget-tape-build-wall-s``, ``--budget-replay-wall-s``,
``--budget-residual-tangent-wall-s``, and ``--budget-accepted-replays`` limits
are intended for regression guards around the accepted-point tape/replay lane:
they catch extra tape replays or dense residual-tangent projection regressions
even when total callback wall time is noisy.

For the standalone sweep scripts, worker subprocesses also inherit the parent
JAX backend by default.  Use ``JAX_PLATFORMS=cpu`` or
``--worker-jax-platforms cpu`` only when an explicit CPU-only worker process is
desired.

After each accepted-point Jacobian, the optimizer drops the heavy adjoint tape
but keeps a single solved-state cache entry.  This keeps RSS bounded while
avoiding a duplicate exact VMEC replay for final objective evaluation and
``wout`` writing.  The SciPy callback path also reuses this solved-state cache
when it asks for a residual at a point whose exact Jacobian was just built,
avoiding an otherwise unnecessary relaxed forward replay.  If
``save_wout(..., state=result["_state_final"])`` is used immediately after
``run()``, no additional equilibrium solve is performed.

The tape Jacobian callback also returns the residual primal from the same
``jax.linearize`` used for residual tangent projection.  History recording and
the custom Gauss-Newton gradient path reuse that residual instead of evaluating
the accepted-point residual block a second time after every dense exact
Jacobian.

SciPy can also revisit the same rejected trust-region trial point.  The
optimizer therefore keeps a small residual-only LRU cache for relaxed trial
callbacks.  The cache stores NumPy residual vectors, not VMEC states or JAX
tapes, so it removes repeated forward solves without retaining large XLA
buffers.  To audit the exact callback sequence, run the diagnostic script with
``--trace-callbacks``; the JSON history then includes a ``callback_trace`` block
that labels each residual/Jacobian callback as an exact-state cache hit, a
trial-residual cache hit, a fresh trial solve, or an exact tape replay.

The lowest accepted-point callback count currently comes from the custom
``method="gauss_newton"`` loop: it uses relaxed residuals only for line-search
trial points and one exact Jacobian per accepted point.  The SciPy trust-region
path remains more robust for the documented full QA/QH continuation examples,
but it may request additional Jacobians around trust-region updates.  A bounded
CPU diagnostic on QH ``max_mode=1`` with ``inner_max_iter=trial_max_iter=80``
and ``max_nfev=2`` gave:

.. list-table::
   :header-rows: 1
   :widths: 24 18 18 28

   * - Method
     - Wall time
     - Jacobian calls
     - Notes
   * - ``scipy``
     - ``6.55 s``
     - 2
     - one cached exact-state residual hit
   * - ``gauss_newton``
     - ``4.42 s``
     - 1
     - same first accepted step, fewer exact callbacks

GPU exact-optimizer diagnostics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current GPU bottleneck is the exact Jacobian replay path, not ordinary
fixed-boundary force evaluation.  The replay is a long sequence of
linearized VMEC iteration steps, and cold GPU processes pay heavy XLA compile
costs.  For GPU profiling, always separate the first run from cache-warm runs:

.. code-block:: bash

   JAX_PLATFORM_NAME=gpu python tools/diagnostics/profile_exact_optimizer.py \
     --problem qa --max-mode 1 --inner-max-iter 20 \
     --trial-max-iter 20 --solver-device default --max-nfev 1 \
     --trace-callbacks --json-out gpu_trace.json

April 2026 callback diagnostics for the full input-deck QH ``max_mode=1`` case
show where the GPU path loses today.  Local CPU used JAX 0.9.2 on an Apple
workstation; ``office`` used JAX 0.6.2 on an NVIDIA RTX A4000 host.

.. list-table::
   :header-rows: 1
   :widths: 26 18 18 30

   * - Callback / process
     - Wall time
     - Backend
     - Dominant terms
   * - Trial residual, local CPU
     - ``3.35 s``
     - CPU
     - trial solve ``2.08 s``, residual ``1.16 s``
   * - Trial residual, forced GPU
     - ``18.32 s``
     - GPU
     - trial solve ``17.43 s``
   * - Exact residual/tape, local CPU
     - ``10.70 s``
     - CPU
     - tape build ``9.47 s``
   * - Exact residual/tape, forced GPU
     - ``93.92 s``
     - GPU
     - tape build ``93.06 s``
   * - Exact residual/tape, GPU process with CPU device
     - ``65.44 s``
     - GPU process, CPU device
     - tape build ``63.16 s``
   * - Exact residual/tape, CPU-only process on ``office``
     - ``28.47 s``
     - CPU
     - tape build ``28.00 s``
   * - Dense Jacobian, local CPU
     - ``18.48 s``
     - CPU
     - tape build ``7.49 s``, replay ``9.46 s``
   * - Dense Jacobian, forced GPU
     - ``136.65 s``
     - GPU
     - tape build ``96.53 s``, replay ``32.14 s``
   * - Dense Jacobian, GPU with demand allocation, inner-10 QH smoke
     - ``7.54 s``
     - GPU
     - replay ``3.72 s``; faster than same-host CPU smoke
   * - Full QH ``max_mode=1`` optimizer, forced GPU
     - ``676 s``
     - GPU
     - ``nfev=9``; still much slower than the CPU sweep case

This makes the present conclusion narrower and more actionable: the exact
optimizer is not GPU-ready just because the fixed-boundary force kernels are in
JAX.  The bottleneck is the accepted-point tape build/replay path.  A CPU
``jax.default_device`` context inside a GPU-initialized process is still much
slower than a CPU-only process, but vmec_jax does not force CPU execution for
GPU-enabled users.  Use explicit CPU-only workers for controlled CPU studies,
and explicit GPU backends for GPU profiling.

A follow-up QH ``max_mode=2`` GPU trace with ``max_nfev=4`` and
``inner_max_iter=trial_max_iter=120`` showed the same bottleneck.  The warm
run spent about ``74 s`` total: four accepted-point Jacobian callbacks consumed
about ``49 s`` end-to-end, while three relaxed trial solves consumed about
``24 s``.  The trace contained one exact-state residual cache hit and no
repeated trial residuals, so the next GPU optimization lane is reducing
accepted-point tape build/replay cost rather than adding more residual caching.

The perturbed accepted-point profiler separates same-tape cache hits from real
new optimizer points.  On ``office`` with QH ``max_mode=2``,
``inner_max_iter=80``, ``trial_max_iter=40``, and ``--perturb-scale 1e-4``, the
default tape path gave three GPU dense-Jacobian callbacks of about
``13.8 s``, ``7.8 s``, and ``6.9 s``.  The mean profile was dominated by exact
tape construction (``5.4 s`` per point), with tape replay around ``2.1 s`` per
point.  The same CPU run gave about ``9.3 s``, ``5.8 s``, and ``5.6 s``.  The
scan exact path remained unsuitable for this workload: the first perturbed GPU
scan Jacobian took about ``118 s``.

After adding solver-phase timing to the same perturbed callback, the GPU
accepted-point tape build split into about ``3.0 s`` of VMEC update work,
``1.3 s`` of preconditioner work, ``1.4 s`` of unattributed trace/build
overhead, and only ``0.10 s`` of force evaluation per new point.  The matching
CPU tape build was about ``2.0 s`` per point and was instead dominated by force
evaluation (``1.26 s`` per point).  This identifies the next GPU target more
precisely: reduce host-dispatched update/preconditioner/tape bookkeeping and
replay overhead, not the already-fast GPU force kernels.

The first mitigation keeps dynamic exact-tape trace arrays on device until the
compact replay payload is assembled.  On the warm ``office`` GPU QH
``max_mode=2`` profile above, that reduced mean tape-build time from about
``5.75 s`` to ``5.32 s`` per accepted point and update bookkeeping from about
``3.02 s`` to ``2.70 s``.  It does not solve the main GPU gap; it narrows the
next target to the update/preconditioner replay graph itself.

Splitting the update timer confirmed that conclusion: on the same QH
``max_mode=2`` GPU profile, ``exact_tape_solver_update_state`` accounted for
essentially all of ``exact_tape_solver_update`` (about ``2.7 s`` per accepted
point), while trace-build/finalize bookkeeping was below ``1 ms`` per point.
The next implementation target is therefore fusing or scanning the primal
state-update/replay work, not further reducing Python trace dictionary overhead.

The next production patch skips the ``update_rms`` reduction when it is not
consumed.  Exact optimizer dynamic-tape solves run with light history, no update
clipping, and non-verbose output, so that reduction was wasted work.  On the QH
``max_mode=2`` GPU profile, this reduced mean tape-build time further to about
``4.74 s`` per accepted point and update-state time to about ``2.30 s``.  The
largest remaining per-point GPU costs are now replay (about ``2.67 s``),
state update (about ``2.30 s``), and preconditioner work (about ``1.10 s``).
Dynamic replay payload stacking is backend-aware: GPU uses on-device JAX stacks
to avoid unnecessary host materialization, while CPU keeps the lower-overhead
NumPy stack path.  This avoids the CPU replay regression seen with unconditional
device stacking.

The next accepted-point GPU fix fuses the strict fixed-boundary velocity/state
update into one cached JIT helper for accelerator backends.  On ``office`` for
the same QH ``max_mode=2`` perturbed dense-Jacobian profile
(``inner_max_iter=80``, ``trial_max_iter=40``), two new accepted-point
callbacks dropped from roughly ``20--22 s`` to ``16.5 s``.  The measured
solver-update component dropped from about ``2.6 s`` per accepted point to
about ``0.22 s`` per accepted point; tape replay is now again the largest
remaining GPU term.  Explicit replay column chunks of 8 or 12 were much slower
on this 24-DOF case because they segmented the replay into multiple GPU
launch/compile groups; full-column replay remained best.

The next smaller accepted-point cleanup caches the initial-state tangent matrix
for each VMEC theta-flip branch.  With the magnetic-axis branch frozen by the
accepted-point tape, the initial guess is affine in the boundary coefficients,
so repeated Jacobian callbacks do not need to re-linearize that graph unless
the discrete flip branch changes.  On the same QH ``max_mode=2`` GPU profile,
three perturbed dense-Jacobian callbacks moved from about ``20.7 s`` total to
about ``19.9 s`` total, with matching Jacobian norms.  Two attempted GPU replay
shortcuts were rejected as broad defaults at that point: precomputed
tridiagonal coefficients were correctness-tested but workload-dependent, and
stopping gradients through solver time-control scalars nearly doubled replay
time.

May 2026 follow-up profiling used Python 3.11.15, JAX 0.10.0, and
``jax[cuda13]`` on the same ``office`` RTX A4000 host.  The GPU backend is
working, but it is still not a clear win for these small/medium exact
optimization callbacks.  The important split is raw fixed-boundary throughput
versus accepted-point optimization replay:

.. list-table::
   :header-rows: 1
   :widths: 30 16 16 30

   * - Callback / case
     - CPU warm
     - GPU warm
     - Dominant GPU terms
   * - Raw QH fixed-boundary, 100 iterations
     - ``0.93 s``
     - ``2.06 s``
     - fixed launch/compile overhead dominates
   * - QA ``max_mode=1`` dense Jacobian, 8 DOFs
     - ``2.92 s``
     - ``3.94 s``
     - replay and residual tangent projection
   * - QA ``max_mode=1`` scalar gradient, 8 DOFs
     - ``2.56 s``
     - ``4.56 s``
     - cached scalar cotangent and replay
   * - QA ``max_mode=3`` dense Jacobian, 48 DOFs
     - ``4.12 s``
     - ``3.80 s``
     - replay and residual tangent projection
   * - QA ``max_mode=3`` scalar gradient, 48 DOFs
     - ``1.73 s``
     - ``3.03 s``
     - cached scalar cotangent and replay

Those numbers are perturbed accepted-point repeats with warm compiled helper
shapes, ``inner_max_iter`` of 40--80, relaxed ``ftol`` appropriate for
profiling, and the default tape exact path unless explicitly marked otherwise.
They are not full production optimization timings.  The scalar reverse-adjoint
gradient now uses a cached JIT scalar-objective cotangent hook
for the built-in QS residual factories rather than VJP-ing the full residual
vector from Python on every callback.  That reduced the QA ``max_mode=1`` GPU
gradient callback from about ``9.8 s`` to ``4.6 s`` after warmup, and reduced
the QA ``max_mode=3`` GPU gradient callback from about ``5.8 s`` to ``3.0 s``.
Dense Jacobians remain competitive at low DOF counts; the scalar-adjoint path is
now the better candidate for higher-mode or memory-limited optimizations.

After caching the fixed quasisymmetry angular quadrature grid, the same
``office`` QH ``max_mode=2`` accepted-point callback profile gave two perturbed
GPU dense-Jacobian callbacks of about ``11.8 s`` and ``4.3 s`` with the default
dynamic replay bucket.  Forcing ``VMEC_JAX_DYNAMIC_REPLAY_BUCKET=1024`` made the
same profile much worse (about ``53.9 s`` and ``7.2 s``).  The production
optimizer therefore no longer sets a coarse replay bucket automatically; use
large buckets only for controlled experiments on workloads where recompilation
dominates replay execution.

The next optimization keeps a cached JIT residual evaluator for
non-differentiating optimizer callbacks and avoids recomputing the QS objective
when the residual vector is already available.  On the same ``office`` GPU QH
``max_mode=2`` short production path (``max_nfev=2``), wall time dropped from
about ``15.3 s`` to ``11.2 s``.  This does not change the discrete-adjoint
Jacobian path, which still linearizes the raw residual function where
derivatives are required.

The accepted-point exact tape path now enables precomputed Thomas coefficients
for small-DOF accelerator tapes only.  This is intentionally not applied to
trial solves by default: the same switch made cold GPU trial scans slower.  On
``office`` with JAX 0.6.2 and one RTX A4000, two perturbed QA ``max_mode=1``
dense Jacobian callbacks dropped from ``88.9 s`` total to ``72.6 s`` total,
while the Jacobian Frobenius norms matched to about ``7.5e-11`` relative
difference.  A later QH ``max_mode=2`` profile with 24 boundary DOFs showed the
opposite tradeoff: replaying the larger tape outweighed the preconditioner
savings.  The default therefore enables this optimization only up to 12
optimization DOFs.  Set ``VMEC_JAX_OPT_EXACT_TRIDI_PRECOMPUTE=0`` to disable
this accepted-point optimization for diagnostics, ``=1`` to force it on a
specific backend, or ``VMEC_JAX_OPT_EXACT_TRIDI_PRECOMPUTE_MAX_DOFS`` to adjust
the automatic small-DOF threshold.

The 2026-05-23 TensorBoard/Perfetto pass on ``office`` confirmed the current
GPU exact policy.  A cold QH ``max_mode=2`` accepted Jacobian
(``inner_max_iter=80``, ``trial_max_iter=30``, 24 DOFs) took about ``43.4 s``:
``19.6 s`` in tape replay, ``10.1 s`` in tape build, and ``10.1 s`` in
residual tangent projection, with about ``1.1 GiB`` RSS growth.  Repeating after
the persistent cache was warm reduced the same callback to about ``15.6 s``
(``5.7 s`` tape build, ``4.0 s`` replay, ``2.3 s`` residual tangents).  Forcing
``VMEC_JAX_OPT_EXACT_PATH=scan`` on the same callback took about ``100.5 s``.
This keeps the production optimizer on the tape exact path for GPU; scan exact
is only a parity/profiling fallback.

The ``office`` TensorBoard profiler may emit
``Can't import tensorflow.python.profiler.trace`` when TensorFlow's Python hook
is not installed.  That warning is non-fatal for these runs: the XPlane trace,
Perfetto/trace JSON, and device-memory profile still write successfully.

Fixed-boundary GPU diagnostics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For fixed-boundary inputs, the GPU force kernels are fast, but the best solver
policy depends on whether the workload is a raw fixed-boundary solve or an exact
optimization callback.  A 2026-05-23 ``office`` RTX A4000 profile showed the raw
fixed-boundary scan path is clearly faster for ``input.nfp4_QH_warm_start``:
``500`` GPU iterations took about ``1.37 s`` with scan versus ``8.01 s`` with
the non-scan VMEC-control loop.  Public auto-selected GPU/CUDA/ROCm
fixed-boundary runs therefore default to scan; CPU remains non-scan by default,
and users can still pass ``use_scan=False`` to force the VMEC-control loop.

A deeper 2026-05-23 trace pass separated cold compilation/cache effects from
warmed raw iteration throughput.  On ``office`` with JAX ``0.6.2`` and an RTX
A4000, ``input.nfp4_QH_warm_start`` took about ``13.3 s`` in a first cold GPU
scan process and ``13.6 s`` in a first cold GPU non-scan process.  After the
persistent compilation cache was warm, a fresh scan process dropped to about
``5.7 s``.  In one long-lived process, the warmed scan run took about ``1.4 s``
whereas warmed non-scan took about ``8.5 s``.  For
``input.nfp4_QH_finite_beta``, warmed scan took about ``2.7 s`` for the
profiled iteration budget while warmed non-scan took about ``12.6 s``.  The
non-scan finite-beta GPU phase timer showed force assembly was fast
(``~2.5 ms`` per iteration) but preconditioner/update launch overhead dominated
(``~10.4 ms`` per iteration in the preconditioner path), so production raw GPU
runs should keep the scan default.

A later 2026-05-23 ``office`` pass removed per-iteration host placeholder
allocations from the accelerator R/Z preconditioner-apply wrapper.  On the
same RTX A4000 stack, ``input.nfp4_QH_warm_start`` dropped from about
``22.9 s`` to ``13.9 s`` with identical convergence, while
``input.nfp4_QH_finite_beta`` dropped from about ``93.0 s`` to ``58.6 s``.
The follow-up fused preconditioner payload pass combined accelerator-side
lambda scaling, mode weighting, and preconditioned residual diagnostics.  In a
fresh no-warmup profile this kept the warm-start case at about ``13.5 s`` and
reduced ``input.nfp4_QH_finite_beta`` to about ``50.2 s``.  Enabling the
precomputed Thomas path only for high-mode non-LASYM accelerator cases then
kept the low-mode QH warm-start case on its legacy path while reducing the
finite-beta QH default profile to about ``50.7 s``.  The detailed timer
still attributes most remaining GPU time to ``precond_apply`` (about ``16.2 s``
on the finite-beta final stage), so the next GPU target remains restructuring
the radial tridiagonal preconditioner itself rather than the already-fast force
assembly kernels.

The 2026-05-24 follow-up batched the precomputed R/Z Thomas systems and then
combined the ``m=0`` and ``m>0`` radial blocks through one padded scan.  On the
same ``office`` GPU, the converged finite-beta scan profile measured ``31.6 s``
cold, ``19.8 s`` for a cache-population fresh process, ``8.12 s`` for the next
fresh process, and ``3.39 s`` when warmed in the same process.  The low-mode QH
warm-start case measured ``5.24 s`` in a cache-warm fresh process and ``0.71 s``
when warmed in the same process.  These numbers move the production GPU lane
from raw kernel throughput to cache warmup and compile-shape stability.

Earlier converged QH non-scan comparison, ``max_iter=500``:

.. list-table::
   :header-rows: 1
   :widths: 28 20 20 20

   * - Backend
     - Policy
     - Wall time
     - Final total residual
   * - CPU
     - default, non-scan
     - ``16.65 s``
     - ``1.11e-13``
   * - GPU
     - accelerated, forced non-scan
     - ``16.03 s``
     - ``1.11e-13``

Older GPU scan/non-scan comparison, same host, retained as a reminder that
full-convergence policy changes must be re-profiled case by case:

.. list-table::
   :header-rows: 1
   :widths: 36 16 16 18

   * - Case
     - Scan
     - Non-scan
     - Result
   * - ``input.nfp4_QH_warm_start``
     - ``33.10 s``
     - ``17.85 s``
     - non-scan faster
   * - ``input.LandremanPaul2021_QA_lowres``
     - ``149.27 s``
     - ``69.32 s``
     - non-scan faster
   * - ``input.up_down_asymmetric_tokamak``
     - ``164.57 s``
     - ``140.69 s``
     - non-scan faster
   * - ``input.nfp2_QI``
     - ``108.67 s``
     - ``50.78 s``
     - non-scan faster

The finish policy also distinguishes production input-deck runs from
explicit low-budget diagnostics.  When the user explicitly supplies
``max_iter``, all accelerated/parity finish attempts combined are capped at
twice that budget and the run reports non-convergence if the cap is exhausted.
This avoids spending many hidden extra iteration blocks in profiling or sweep
scripts.  When ``max_iter`` is not overridden, VMEC input-deck budgets retain
the robust finish behavior needed for parity-oriented production solves.
On ``office``, the explicit ``max_iter=100`` QH GPU diagnostic now stops after
two finish blocks (``[100, 100]``) and reports non-convergence at the cap; the
same low-budget diagnostic previously spent five finish blocks and took about
``56.2 s``, while the capped source-tree run took about ``42.9 s``.

Earlier April 2026 diagnostics with the scan-heavy GPU policy are retained
below for historical context:

.. list-table::
   :header-rows: 1
   :widths: 34 18 18 18 18

   * - Case
     - CPU
     - CPU warm cache
     - GPU empty cache
     - GPU warm cache
   * - ``input.nfp4_QH_warm_start``
     - ``13.9 s``
     - ``7.5 s``
     - ``69.4 s``
     - ``12.1 s``
   * - ``input.up_down_asymmetric_tokamak``
     - ``19.9 s``
     - ``7.8 s``
     - ``40.6 s``
     - ``15.0 s``
   * - ``input.basic_non_stellsym_pressure``
     - ``73.5 s``
     - ``46.5 s``
     - ``73.1 s``
     - ``43.3 s``

The current practical policy is therefore:

- use the non-scan VMEC-control loop for auto-selected production fixed-boundary
  solves on CPU,
- use the scan-lifted loop for auto-selected production fixed-boundary solves
  on GPU/CUDA/ROCm,
- keep explicit ``--use-scan`` and ``--no-use-scan`` profiler controls
  available for case-by-case comparisons,
- keep the persistent cache enabled by default so repeated GPU processes are
  not dominated by recompilation,
- expose ``--solver-device cpu`` / ``solver_device="cpu"`` for users who want
  to force CPU execution inside a GPU-enabled process,
- do not treat raw scan throughput as production performance unless the run also
  converges to the requested VMEC tolerance.

Raw solver throughput vs public policy overhead
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The fixed-boundary profiler can now separate the requested raw solver path from
the public CLI-style policy.  This matters because public defaults may add
dynamic scan probes, staged follow-up, or finish attempts around the requested
iteration budget.  The profiler imports ``vmec_jax`` before importing JAX
directly so GPU allocator defaults and the persistent compilation-cache policy
match normal API/CLI runs.  If neither ``--use-scan`` nor ``--no-use-scan`` is
provided, the profiler leaves ``use_scan=None`` and follows the same
backend-aware production policy as ``run_fixed_boundary``.  To benchmark the raw
accelerated scan path explicitly, use:

.. code-block:: bash

   JAX_ENABLE_X64=1 python tools/diagnostics/profile_fixed_boundary.py \
     --input examples/data/input.nfp4_QH_warm_start \
     --iters 20 \
     --simple-profile \
     --no-multigrid \
     --no-auto-cli-policy \
     --solver-mode accelerated \
     --use-scan \
     --solver-device gpu \
     --json-out /tmp/vmec_jax_qh20_raw_gpu.json

May 2026 raw-path diagnostics showed:

.. list-table::
   :header-rows: 1
   :widths: 34 16 16 16 16

   * - Case
     - Device
     - Iterations
     - Timing
     - Notes
   * - ``input.nfp4_QH_warm_start``
     - local CPU, JAX 0.9.2
     - 20
     - ``4.43 s`` cold, ``0.31 s`` warm
     - raw accelerated scan
   * - ``input.nfp4_QH_warm_start``
     - ``office`` GPU, JAX 0.6.2
     - 20
     - ``12.1 s`` cold, ``1.70 s`` warm
     - old Python 3.10/JAX stack
   * - ``input.nfp4_QH_warm_start``
     - ``office`` CPU, JAX 0.6.2
     - 100
     - ``0.97 s`` warm
     - same host as GPU
   * - ``input.nfp4_QH_warm_start``
     - ``office`` GPU, JAX 0.6.2
     - 100
     - ``1.77 s`` warm
     - fixed overhead dominates
   * - ``input.LandremanPaul2021_QA_lowres``
     - ``office`` CPU/GPU, JAX 0.6.2
     - 20
     - ``1.29 s`` CPU, ``1.87 s`` GPU
     - warmed raw path
   * - ``input.nfp2_QI``
     - ``office`` CPU/GPU, JAX 0.6.2
     - 20
     - ``0.99 s`` CPU, ``1.65 s`` GPU
     - warmed raw path

The conclusion is narrower than “GPU is slow”: raw force iterations are fast
once warmed, but the available ``office`` stack still has high GPU fixed
overhead and uses Python 3.10 / JAX 0.6.2.  The next GPU lane should use a
Python 3.11+ environment with current JAX, then target the remaining
compile/launch/replay overhead rather than changing physics tolerances or
forcing CPU fallback.

For high-mode runs, also profile the experimental reverse-adjoint scalar
gradient callback:

.. code-block:: bash

   python tools/diagnostics/profile_exact_optimizer.py \
     --problem qa --max-mode 3 --inner-max-iter 300 \
     --gradient-only --check-gradient

This path computes the gradient of ``0.5 * ||r||^2`` with one reverse replay
through the VMEC tape instead of propagating one tangent column per boundary
degree of freedom.  It is exposed through
``FixedBoundaryExactOptimizer.objective_and_gradient_fun`` and the opt-in
``method="lbfgs_adjoint"`` / ``method="scalar_trust"`` optimizers.  These are
currently profiling/experimental lanes, not the default QA/QH production path,
because the current reverse products are comparable to, but not consistently
faster than, the dense vectorized column replay for the present
``max_mode <= 3`` parameter counts.
The ``lbfgs_adjoint`` wrapper now enforces a hard scalar-gradient callback
budget because SciPy's internal L-BFGS-B line search can otherwise exceed the
requested ``max_nfev``.  It also uses a conservative default trust box
(``lbfgs_step_bound=0.01`` in scaled parameter space) because unbounded L-BFGS-B
can probe extremely distorted boundaries on the first line search.
``scalar_trust`` uses the same scalar-adjoint gradient but accepts only
monotone trust-region steps with a hard callback budget; this makes profiling
predictable even when L-BFGS-B line searches are ineffective.  A May 2026 QH
``max_mode=1`` diagnostic confirms the budget is respected, but the
scalar-adjoint optimizers remain slower and less effective than dense exact
least-squares on that small case.  The next useful step is therefore better
matrix-free/scalar trust-region behavior, not switching the default
least-squares path.

The accepted-point exact path uses the discrete-adjoint tape path by default on
both CPU and GPU.  April and May 2026 diagnostics showed that a naive cold scan
exact Jacobian can be very slow, so scan remains an explicit
``VMEC_JAX_OPT_EXACT_PATH=scan`` diagnostic path rather than the GPU default.
The production optimizer now avoids redundant residual-only exact executables,
uses backend-aware trial residuals, and reconstructs standard QS history
metrics from cached accepted-point residuals instead of re-solving the accepted
state after every Jacobian.  In a fresh May 2026 ``office`` RTX A4000 profile of QH
``max_mode=1`` with ``inner_max_iter=trial_max_iter=20``, forced scan took
``102.9 s`` for one dense Jacobian callback while forced tape took ``36.7 s``;
the same local CPU-style QA ``max_mode=2`` tape callback was ``9.5 s``.  CPU
therefore remains faster for the small/medium diagnostics, so production runs
still document both CPU and GPU timings rather than claiming blanket GPU
speedups.  GPU sweep production runs use calibrated optimizer budgets
(currently ``inner_max_iter =
trial_max_iter = 120`` and ``ftol = trial_ftol = 1e-8`` for deck-controlled
QA/QH cases), rather than the old four-evaluation diagnostic caps.  Final
standalone verification runs can still use the VMEC input-deck ``NITER_ARRAY`` /
``FTOL_ARRAY``.

Replay and preconditioner JIT helper caches are retained across accepted points
and LRU-bounded.  ``VMEC_JAX_SCAN_RUNNER_CACHE_SIZE`` and
``VMEC_JAX_COMPUTE_FORCES_CACHE_SIZE`` default to ``32`` entries;
``VMEC_JAX_STRICT_UPDATE_CACHE_SIZE`` defaults to ``16`` entries.  Set any of
these to ``0`` to disable in-process retention when profiling RSS growth.  Call
``FixedBoundaryExactOptimizer.clear_caches()`` to release compiled replay
helpers explicitly after a long optimization batch.  The optimizer still clears
heavyweight exact tapes between SciPy callbacks where needed to avoid RSS
growth.

The same exact ``Jv``/``J.Tv`` products can be used in SciPy's trust-region
least-squares solver with ``method="scipy_matrix_free"``:

.. code-block:: bash

   python tools/diagnostics/profile_exact_optimizer.py \
     --problem qa --max-mode 3 --inner-max-iter 300 \
     --method scipy_matrix_free --lsmr-maxiter 4 --max-nfev 2

This matrix-free trust-region lane is useful for profiling, memory-pressure
fallbacks, and larger parameter-count cases, but it is not a global default
because it is problem dependent.  Use ``method="auto"`` as an opt-in,
device-preserving policy rather than as a promise of the fastest wall time for
every case.  The current automatic policy may select matrix-free LSMR for
high-mode, stellarator-symmetric QS/QI problems on CPU/default CPU, where
cold-process and memory-pressure profiles motivated the lane.  It still keeps
LASYM cases and explicit GPU runs on the dense SciPy path until case-specific
matrix-free GPU profiles show a benefit.  The retained benchmark table below is
a useful counterexample: on that warm QA run, dense SciPy is still faster.
Validate the case you care about with the profiling commands before promoting
production timings.  The trial residual path can be compared explicitly with
``profile_exact_optimizer.py --trial-scan auto|on|off``.

For high-parameter-count scalar-objective probes, use
``method="auto_scalar"`` (alias ``"auto_adjoint"``).  This policy keeps the
same LASYM safeguards as ``"auto"``, but selects the safeguarded
``scalar_trust`` reverse-adjoint path for high-mode, stellarator-symmetric
QS/QI CPU/default-backend cases.  As of the 2026-06-01 production-policy
refresh, explicit GPU/CUDA/ROCm high-mode, stellarator-symmetric QS/QI
``auto_scalar`` runs also use ``scalar_trust`` rather than falling back to
dense SciPy.  Low-mode, LASYM, and ordinary ``method="auto"`` GPU runs stay on
dense SciPy unless the caller explicitly chooses another method.  ``auto_scalar``
does not enable cost-only trial filtering by default, since QH and QP GPU
profiles showed the extra forward trial solves can dominate short high-mode
runs.  Enable that filter explicitly with ``scalar_cost_only_trials=True`` or
``VMEC_JAX_OPT_SCALAR_COST_ONLY_TRIALS=1`` when a rugged case needs trial
screening before exact gradient builds.  This is the production entry point for
testing scalar-adjoint optimization without relying on environment variables.

A 2026-05-20 matrix-free cleanup removed one redundant initialization AD pass:
``residual_linear_operator`` now obtains the frozen-axis initial-state
transpose from the already-created ``jax.linearize`` object instead of tracing a
second ``jax.vjp`` through the same initial-state graph.  A later 2026-05-31
production pass moved SIMSOPT-style tuple workflows onto the same packed-state
cotangent hooks used by the older direct QS factories.  For larger
stellarator-symmetric CPU/default matrix-free operators it can also precompute
the small initial-state tangent block; by default this is limited to
64--128 boundary DOFs because a 48-DOF QA microprobe showed the build cost can
exceed the single ``J.Tv`` transpose it saves.  Override with
``VMEC_JAX_OPT_LINEAR_OPERATOR_INITIAL_TANGENTS=1`` or tune
``VMEC_JAX_OPT_LINEAR_OPERATOR_INITIAL_TANGENT_{MIN,MAX}_DOFS`` for profiling.
New profiles therefore separate
``linear_operator_initial_tangent_projection`` from the older
``linear_operator_initial_transpose`` bucket; if the transpose bucket is still
large, the case has fallen back to the conservative no-precompute path.

The same 2026-06-01 pass added a scalar-gradient initial-tangent projection
path.  On GPU/CUDA/ROCm high-mode, stellarator-symmetric scalar-adjoint runs,
``objective_and_gradient_fun`` now precomputes the affine initial-state tangent
columns once per flip branch for 24--256 DOFs and projects the reverse tape
cotangent through that block.  This targets repeated accepted-point scalar
gradients where rebuilding the initial-state VJP was visible in profiles.  Tune
or override with ``VMEC_JAX_OPT_SCALAR_GRADIENT_INITIAL_TANGENTS`` and
``VMEC_JAX_OPT_SCALAR_GRADIENT_INITIAL_TANGENT_{MIN,MAX}_DOFS``.

The validation profile for this change used the QA ``max_mode=3`` 48-DOF probe
with two VMEC iterations per accepted/trial solve.  On the ``office`` GPU,
``method="auto_scalar"`` selected ``scalar_trust`` and completed one cold
callback in ``13.43 s``; the profile showed ``gradient_initial_tangents`` and
``gradient_initial_projection`` rather than ``gradient_initial_vjp``.  A
standalone two-repeat GPU scalar-gradient callback took ``12.22 s`` for the
first point and ``0.03 s`` for a same-point repeat, with the second repeat using
``gradient_initial_tangents_cache_hit``.  A matched dense SciPy GPU callback
still took ``29.84 s`` for this one-callback diagnostic, dominated by
``jacobian_projected_replay_total``; dense SciPy was ``2.22x`` slower than the
new ``auto_scalar`` route.  These numbers are diagnostics for the
small cold probe, not a production optimization benchmark, but they confirm the
new route avoids dense-Jacobian fallback on eligible GPU scalar-adjoint runs.

A larger QH ``max_mode=4`` optimizer run on ``office`` then compared the
production dense SciPy path against ``method="scipy_matrix_free"`` and
``method="auto_scalar"`` with ``max_nfev=3``, ``inner_max_iter=80``, and
``trial_max_iter=80``.  Dense SciPy remained the best production choice for
this high-mode GPU case: it finished in ``94.8 s`` with final objective
``1.74e-1``.  ``scipy_matrix_free`` took ``213 s`` and stopped at
``2.50e-1`` because repeated ``Jv``/``J.Tv`` products dominated the trace.
``auto_scalar`` took ``95.6 s`` but stopped at ``2.85e-1`` because the current
scalar trust-region globalization did not make enough progress in the same
budget.  Therefore explicit GPU ``method="auto"`` still preserves dense SciPy
for high-mode QS/QI optimizations; scalar and matrix-free paths remain opt-in
diagnostics until longer production sweeps show comparable objective progress.

The follow-up patch tested JIT compilation of the repeated initial-state
construction inside that accepted-point path.  ``FixedBoundaryExactOptimizer``
can lazily JIT the parameter-to-packed-initial-state map with
``VMEC_JAX_OPT_JIT_INITIAL_STATE=1``.  It is not enabled by default because
new cold CPU exact callbacks pay more compile/dispatch overhead than this small
projection saves.  On a QH ``max_mode=2`` CPU smoke, the default non-JIT path
cut a cold accepted callback from ``3.23 s`` to ``2.13 s`` relative to the JIT
helper, with ``initial_guess_exact`` dropping from ``1.58 s`` to ``0.154 s``.
The next hot bucket remains the compiled accepted VMEC tape solve/replay path,
not initialization.

April 2026 local CPU diagnostics with ``inner_max_iter=trial_max_iter=300`` and
``max_nfev=2``:

.. list-table::
   :header-rows: 1
   :widths: 30 20 20 20

   * - Case
     - Method
     - Wall time
     - Notes
   * - QA ``max_mode=3``
     - dense exact LSQ
     - ``34.2 s`` warm
     - ``14.3 s`` mean Jacobian
   * - QA ``max_mode=3``
     - matrix-free LSQ, ``lsmr_maxiter=2``
     - ``49.2 s`` warm
     - same objective as dense, slower wall time
   * - QH ``max_mode=3``
     - dense exact LSQ
     - ``39.1 s`` warm
     - ``16.4 s`` mean Jacobian
   * - QA ``max_mode=3``
     - reverse scalar gradient
     - ``~14--19 s`` per callback
     - experimental; current L-BFGS wrapper needs optimizer work

The same QA dense run took ``110.6 s`` in a cold process, so benchmark reports
must identify cold and warm timings separately.

Before using this lane for production results, validate the products on the
case of interest:

.. code-block:: bash

   python tools/diagnostics/profile_exact_optimizer.py \
     --problem qa --max-mode 1 --inner-max-iter 20 \
     --check-linear-operator

The forward product ``Jv`` is covered by the same dynamic replay path as the
dense exact Jacobian.  The reverse product ``J.Tv`` uses split residual-block
transposes so inactive aspect/iota/QS blocks are not differentiated with zero
cotangents through singular axis branches.  On current diagnostics,
``Jv`` and ``J @ X`` match dense to near roundoff; QA ``J.Tv`` matches to
about ``1e-6`` relative error because current-driven iota still needs
axis-gauge cotangent cleanup, while QH ``J.Tv`` matches to near roundoff.

Current performance (representative benchmarks)
-----------------------------------------------

The README-facing fixed-boundary CPU matrix is generated from
``docs/_static/figures/readme_runtime_compare.csv`` and visualized in:

.. image:: _static/figures/readme_runtime_compare.png
   :width: 100%
   :align: center
   :alt: VMEC2000 versus vmec_jax fixed-boundary runtime comparison

The current data should be read as a performance reality check, not as a broad
single-solve speedup claim.  On the current local matrix, warm ``vmec_jax``
beats VMEC2000 on 1 of 16 bundled fixed-boundary rows
(``circular_tokamak_aspect_100``, about ``1.33x`` faster).  The median warm
single-solve row is still about ``4.4x`` slower than VMEC2000 on this host.
Cold runs are slower because they include XLA compilation and runtime setup.

A broader 2026-05-24 internal policy matrix compared the default fixed-boundary
policy against the explicit ``accelerated`` policy on all 35 bundled
fixed-boundary input decks.  On the local CPU host, 34 rows completed within the
per-case timeout: 20 were faster with ``accelerated``, 14 were slower, the
median warm speedup was ``1.17x``, and the geometric-mean warm speedup was
``1.61x``.  This confirms that CPU policy remains case-dependent rather than a
single global scan/default choice.  The largest CPU wins were low-latency or
simple-seed rows such as ``solovev`` (``16.9x``),
``up_down_asymmetric_tokamak`` (``15.3x``), ``cth_like_fixed_bdy``
(``11.1x``), and ``nfp4_QH_warm_start`` (``7.17x``).  The largest CPU losses
were high-work or tokamak rows such as ``LandremanPaul2021_QA_lowres``
(``0.20x``), ``DSHAPE`` (``0.33x``), ``shaped_tokamak_pressure`` (``0.38x``),
and ``purely_toroidal_field`` (``0.42x``).  ``NuhrenbergZille_1988_QHS`` timed
out in both policies and remains a stress fixture rather than a promotion row.

The same 2026-05-24 matrix on ``office`` with ``JAX_PLATFORMS=cuda,cpu`` showed
the opposite policy conclusion for CUDA: 33 rows had valid warm
accelerated-vs-default speedups within CUDA, 32 of those were faster with
``accelerated``, the median warm speedup was ``2.06x``, and the geometric-mean
warm speedup was ``3.38x``.  Only
``LandremanPaul2021_QA_lowres`` was slower (``0.33x``), and
``NuhrenbergZille_1988_QHS`` again timed out.  Representative CUDA wins were
``li383_low_res`` (``12.6x``), ``nfp1_QI`` (``11.9x``), minimal-seed NFP
families (``~10-11x``), and ``nfp4_QH_warm_start`` (``10.3x``).  These numbers
are the reason public accelerator fixed-boundary runs prefer the scan-backed
accelerated path while CPU remains backend/case aware.

This does not contradict the optimization motivation: the exact-adjoint
optimization path can avoid the many finite-difference VMEC subprocess columns
that SIMSOPT+VMEC2000 needs.  It does mean that single-solve CPU/GPU runtime
remains an active performance lane before claiming broad VMEC2000 runtime wins.

Finite-beta CPU profile: May 2026
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``examples/data/input.nfp4_QH_finite_beta`` case is a useful stress test
because it combines finite pressure/current profiles, ``mpol=5``, ``ntor=5``,
and a two-stage ``NS_ARRAY=[5, 51]`` schedule.  On the local Apple CPU host,
VMEC2000 converges this case in about ``3.3 s``.  Before the May 2026 profile
cleanup, the public ``vmec_jax`` CPU path took about ``25.3 s`` cold for the
same converged multigrid run.  After moving concrete profile evaluation and the
CPU force helper post-processing out of residual JAX dispatch fragments, the
same diagnostic run takes about ``20.6 s`` cold while preserving the final
residual (``~5.6e-13``).

A later pass moved ordinary, non-differentiated multigrid interpolation through
the NumPy host path while keeping traced/autodiff interpolation on the JAX path.
On the same diagnostic command this reduced the local cold time to about
``20.1 s`` and preserved the final residual (``~5.6e-13``).  It also avoids
unnecessary one-time JAX dispatch during the radial ``NS_ARRAY`` stage transfer.

The 2026-05-23 CPU pass made the VMEC-output writer use the fast bcovar path by
default.  On the same local host, the cold full CLI time for
``input.nfp4_QH_finite_beta`` dropped from about ``18.8-19.1 s`` to about
``16.8 s`` while bundled wout parity tests continued to pass.  The corresponding
``input.nfp4_QH_warm_start`` full CLI time dropped from about ``6.8 s`` to
about ``4.6 s``.  Set ``VMEC_JAX_WOUT_FAST_BCOVAR=0`` only for legacy
force-kernel output debugging.

The 2026-05-24 CPU pass fixed two steady-state host-loop overheads: NumPy
copies of the cached VMEC phase stack now keep the zero-cost phase-stack branch,
and the ``scalxc[:, modes.m]`` table used by real-space synthesis is cached for
static mode/surface grids.  The full public CLI compute time on the local Apple
CPU host is now about ``12.2 s`` for ``input.nfp4_QH_finite_beta`` and about
``1.8 s`` for ``input.nfp4_QH_warm_start``.  End-to-end shell time is still
higher (about ``16.3 s`` and ``5.5 s`` respectively) because a cold process
includes Python/JAX startup and the first netCDF4 import/write for WOUT output.

An opt-in follow-up knob, ``VMEC_JAX_WOUT_REUSE_FINAL_BCOVAR=1``, lets WOUT
assembly reuse the converged final force/bcovar payload already produced by the
solve when that payload is available.  This is intended for profiling and cold
CLI runtime experiments on converged runs; the default remains an independent
fast-bcovar WOUT reconstruction so parity and diagnostics stay conservative.
Another opt-in knob, ``VMEC_JAX_WOUT_IO_WARMUP=1``, starts the heavy netCDF4
import in a background thread while the solve is running.  It is disabled by
default because the May 2026 cold CPU measurements showed that the background
import contended with the VMEC solve and was slightly slower overall.

The same local CPU profiling pass also isolated solve-only cost with
``profile_fixed_boundary.py`` and no wout write.  A 2026-05-25 policy check
showed that host NumPy update assembly is still best for low-mode warm-start
QH: ``input.nfp4_QH_warm_start`` converged in ``1.22 s`` solve-only with final
residual ``1.1e-13``.  The larger finite-beta QH case is the opposite:
the host-update path spent ``8.62 s`` in state update and took ``21.6 s`` wall
time, while the workload-sensitive fused strict-update path spent only
``0.33 s`` in state update and converged in ``11.9 s`` wall time with final
residual ``5.5e-13``.  This is still slower than the local VMEC2000 finite-beta
runtime, but it removes the largest pure-Python/host assembly term from the
current CPU profile.
A short CPU TensorBoard trace
showed about ``440`` ``pjit`` cache misses and ``9`` backend compile/load events
in the cold path, but the long-run steady bottleneck is now the compiled
``vmec_forces_rz_from_wout`` / ``tomnsps_rzl`` transform stack.

With an explicit CPU persistent cache
(``VMEC_JAX_COMPILATION_CACHE=1`` and an isolated
``VMEC_JAX_COMPILATION_CACHE_DIR``), repeated fresh-process CLI runs improve
further, but CPU cache use remains opt-in because XLA:CPU cache entries are
native AOT executables and can emit host-feature mismatch warnings if reused
across incompatible machines or JAX/JAXLIB versions.

The current bottleneck is therefore no longer the cubic current/pressure
profile helper or the wout force recomputation.  The remaining gap is dominated
by per-iteration force synthesis/residual transforms and the host VMEC-control
loop.  Future single-solve CPU work should target fewer materializations in
``vmec_forces_rz_from_wout``/``tomnsps_rzl`` and a less fragmented cold-start
setup path before claiming VMEC2000-competitive finite-beta runtime.

On the ``office`` RTX A4000 GPU, ordinary scan fixed-boundary solves now show a
cleaner split between cold compilation and warmed throughput.  With the current
Python 3/JAX stack, ``input.nfp4_QH_warm_start`` takes about ``27 s`` in a fully
cold fresh process but ``0.71 s`` when warmed in the same process.  The heavier
``input.nfp4_QH_finite_beta`` case takes about ``31.6 s`` cold and ``3.39 s``
when warmed in the same process, with final total residual ``5.51e-13``.  This
means the warmed GPU finite-beta solve is now VMEC2000-level for this case; the
remaining GPU production issue is fresh-process compile/cache latency.

Public GPU requests now enable the persistent compilation cache automatically
even when the user selects the GPU through ``solver_device="gpu"`` instead of a
JAX platform environment variable.  The runtime GPU path mirrors the
import-time cache setup by setting both ``jax_compilation_cache_dir`` and the
GPU XLA autotune cache option before first solve compilation.  A post-fix
``office`` check showed that this alignment is necessary for correct default
configuration, but not sufficient to make fresh-process GPU solves reuse the
compiled scan executable on this stack.  Re-running from the checked-out source
tree with ``PYTHONPATH=$PWD`` gave repeated QH warm-start processes of
``16.3 s`` and ``16.2 s``, and repeated finite-beta QH processes of
``55.9 s`` and ``55.5 s``.  The dominant term remains
``scan_device_dispatch_s`` (about ``12.5 s`` for QH warm start and ``41.9 s``
for finite beta), so the next GPU lane is persistent executable reuse or
reducing scan compile/dispatch cost rather than more driver-side cache toggles.
The first compile/dispatch reduction that survived the finite-beta guardrail is
an adaptive accelerator chunk: quiet GPU scans with more than 512 iterations use
a fixed 512-iteration scan chunk, regardless of Fourier mode count.  On
``office`` this is neutral for the low-mode ``input.nfp4_QH_warm_start`` case
(``13.49 s`` default, ``13.43 s`` at 512, ``15.90 s`` for one full chunk) and
material for the high-mode finite-beta QH case (``54.28 s`` default/full,
``37.84 s`` at 256, ``23.48 s`` at 512, ``51.33 s`` at 1024), with unchanged
final residuals.  That moves the finite-beta GPU path back to the same order as
the local CPU timing while preserving the warmed scan-throughput advantage.

The exact-optimizer profile has a different bottleneck from single fixed-boundary
solves.  A 2026-05-24 bounded ``max_mode=3`` two-evaluation profile on the
minimal-seed QA/QH/QP examples measured:

.. list-table:: Bounded exact-optimizer timing, 2026-05-24
   :header-rows: 1

   * - case
     - backend
     - warm profile time
     - dominant measured term
     - note
   * - QA
     - CPU
     - ``7.79 s``
     - ``5.53 s`` tape replay
     - objective moved, iota stayed on the zero-transform branch
   * - QA
     - CUDA
     - ``12.40 s`` same-process warm; ``31.7 s`` fresh after cache population
     - ``4.39 s`` tape replay, ``6.49 s`` tape build
     - persistent cache reduces but does not remove cold cost
   * - QH
     - CPU
     - ``5.11 s``
     - ``3.30 s`` tape replay
     - no accepted movement in the two-evaluation diagnostic
   * - QH
     - CUDA
     - ``9.47 s`` same-process warm; ``23.7 s`` fresh after cache population
     - ``2.66 s`` tape replay, ``3.65 s`` tape build
     - same-process warm is materially better than fresh-process warm
   * - QP
     - CPU
     - ``21.10 s``
     - ``14.48 s`` Jacobian total
     - minimal seed stayed on the zero-transform branch
   * - QP
     - CUDA
     - ``61.90 s``
     - ``35.50 s`` Jacobian total, ``24.06 s`` trial solve
     - not production-competitive yet
   * - QI
     - CPU
     - ``43.25 s`` for one bounded matrix-free stage
     - Boozer/QI objective setup and exact solve
     - one-evaluation timing only, not a converged QI run
   * - QI
     - CUDA
     - ``64.08 s`` for the same bounded stage
     - Boozer/QI objective setup and exact solve
     - GPU is still slower for this cold QI profile

These optimization profiles point to the next performance target: accepted-point
exact tape build/replay, tangent construction, and zero-transform branch
avoidance in simple-seed QP/QI starts.  The fixed-boundary CUDA scan path is now
often faster than the host loop, but full optimization does not become
production-fast until exact callbacks and Boozer/QI objective setup are less
cold-start dominated.

The finite-beta CPU timing above was checked against the local VMEC2000
executable ``~/bin/xvmec2000`` in a temporary directory.  The VMEC2000 run took
about ``3.3 s``.  The full public ``vmec_jax`` CPU CLI path measured about
``12.2 s`` before the workload-sensitive strict-update policy; the latest
solve-only profile is ``11.9 s`` on the local CPU host, with the state-update
term reduced from ``8.62 s`` to ``0.33 s``.  The resulting WOUT comparison remained in the expected
converged-equilibrium parity band: representative relative RMS differences were
``~1.8e-6`` for iota, ``~3.5e-6`` for ``rmnc``, ``~2.2e-5`` for ``zmns``,
``~1.0e-5`` for ``bmnc``, and ``~1.7e-4`` for ``lmns`` after skipping the
near-axis cells.

A 2026-05-23 cold-process ``cProfile`` pass on a bounded finite-beta run showed
the same split at shorter iteration budgets: package/JAX startup took about
``1.9 s`` of a ``4.9 s`` profiled process, backend compile/load about ``0.9 s``,
and the solver stage about ``2.4 s``.  The largest steady solver terms were
``vmec_forces_rz_from_wout`` (``0.77 s``),
``vmec_bcovar_half_mesh_from_wout`` (``0.47 s``), and ``tomnsps_rzl``
(``0.19 s``).  That older bounded profile was low enough work that forcing
``VMEC_JAX_HOST_UPDATE_ASSEMBLY=0`` was slower; the current production policy
therefore uses the host path for low-work CPU stages and the fused strict-update
path only above the ``VMEC_JAX_HOST_UPDATE_CPU_WORK_LIMIT`` threshold.

The figure rows and provenance are available as:

- :download:`readme_runtime_compare.csv <_static/figures/readme_runtime_compare.csv>`
- :download:`readme_runtime_compare.json <_static/figures/readme_runtime_compare.json>`

Regenerate the current fixed-boundary plot after a runtime sweep with:

.. code-block:: bash

   python tools/diagnostics/readme_runtime_compare.py \
     --cpu-summary outputs/fixed_runtime_accel_cpu_bundle_20260406_r2/summary.json \
     --figure-kind fixed --plot-mode runtime \
     --figure-out docs/_static/figures/readme_runtime_compare.png \
     --csv-out docs/_static/figures/readme_runtime_compare.csv \
     --json-out docs/_static/figures/readme_runtime_compare.json

When a same-host GPU sweep is available, add one or more ``--gpu-summary``
paths from the matching run.  The generated CSV/JSON/table then retain
separate CPU and GPU ``vmec_jax`` cold/warm runtime and memory columns, which
keeps the README plot from silently dropping GPU comparison data.

Profiling and diagnostics

Enable float64
--------------

VMEC2000 is float64-first. For parity, enable x64 in JAX::

  export JAX_ENABLE_X64=1

JIT boundaries and compile latency
----------------------------------

On CPU, compilation can dominate runtime for moderate problem sizes. ``vmec-jax`` uses:

- a jitted geometry kernel (``eval_geom``),
- non-jitted solver gradients by default (to reduce compile latency).

Solver functions accept ``jit_grad=True`` to trade longer compile time for faster
iterations.

To reduce initial compilation overhead during startup, you can disable JIT for
the **initial guess** phase by setting::

  export VMEC_JAX_DISABLE_JIT_INIT=1

This keeps the solver kernel JIT-compiled, but avoids compiling the initial
boundary->state projection path (useful for short runs or rapid profiling).

To reduce per-iteration latency spikes in multigrid runs, ``vmec-jax`` can
precompile the force kernel at the start of each stage. This is enabled by
default when ``jit_forces=True``; you can override it with::

  export VMEC_JAX_JIT_PRECOMPILE=0

If you prefer to run a few iterations without JIT before compiling, set::

  export VMEC_JAX_JIT_WARMUP_ITERS=2

Scan-mode iteration (fast path)
-------------------------------

The scan-based loop lifts the VMEC2000 iteration into ``jax.lax.scan`` to reduce
Python overhead.  Public fixed-boundary defaults are backend-aware: CPU uses the
profiled VMEC-control non-scan loop, while GPU/CUDA/ROCm uses scan for ordinary
raw fixed-boundary solves.  You can explicitly request scan with:

- ``--fast`` on the CLI,
- ``use_scan=True`` in ``run_fixed_boundary``,
- or ``VMEC_JAX_USE_SCAN=1``.

You can force the VMEC-control loop with ``use_scan=False`` or the diagnostics
flag ``--no-use-scan``.

**Important**: scan parity is case-dependent on difficult large-``ns`` stages.
Use scan when profiling a case where the scan loop has been validated. You can
always force the conservative path with ``--parity``.

For LASYM fixed-boundary stages in explicit scan mode, the selector can use:

- a timed scan/non-scan probe on CPU backends,
- a short parity-only probe on accelerator backends.

This keeps explicit GPU scan experiments from paying the full warmed non-scan
timing cost while still rejecting scan when the short parity probe disagrees.

Controls:

- ``VMEC_JAX_DYNAMIC_SCAN_TIMED=1``: force a timed probe even on accelerators.
- ``VMEC_JAX_DYNAMIC_SCAN_TIMED=0``: force parity-only probing.
- ``VMEC_JAX_DYNAMIC_SCAN_ITERS=<int>``: override the probe window
  (defaults to ``10`` on CPU, ``3`` on accelerators).

For quiet accelerator scans, ``vmec-jax`` uses a backend-aware scan chunk.
Short runs use one full-length chunk.  Long-budget runs use fixed
512-iteration chunks to reduce fresh-process compile/dispatch latency while
reusing the same compiled body inside the solve.

Controls:

- ``VMEC_JAX_SCAN_CHUNK_SIZE=<int>``: override the chunk target explicitly.

Debug dump env vars are incompatible with scan mode.

Experimental accelerated mode
-----------------------------

``vmec-jax`` now exposes an explicit experimental solver policy for the
non-parity performance track:

- Python API: ``run_fixed_boundary(..., solver_mode="accelerated")``
- CLI: ``vmec input.name --solver-mode accelerated``

Current behavior of this first slice:

- auto-selected public fixed-boundary API/CLI runs use the profiled non-scan
  VMEC-control loop on CPU and the scan-lifted loop on GPU/CUDA/ROCm,
- explicit ``solver_mode="accelerated"`` callers keep the historical scan
  default unless ``use_scan=False`` is supplied,
- ordinary non-scan production runs skip the parity-oriented scan-selection
  probes; use the dynamic scan controls below only for targeted diagnostics,
- when the caller does not explicitly request multigrid, accelerated
  fixed-boundary runs now default to a single final-grid stage. This avoids
  per-stage interpolation and recompilation overhead that was dominating the
  heavy bundled fixed-boundary cases,
- accelerated fixed-boundary stages still use a scalar total-residual target
  derived from the input ``ftol`` budget as a cheap *in-block* early-stop:
  ``fsq_total_target = ftol * 3`` for the three VMEC residual channels
  (``fsqr``, ``fsqz``, ``fsql``). However, the returned accelerated
  fixed-boundary run now accepts convergence only when the final-stage
  per-channel rule is satisfied, matching the requested ``FTOL`` literally on
  ``fsqr``, ``fsqz``, and ``fsql``,
- the experimental solver controls no longer rely on fixed absolute
  convergence thresholds. By default:

  - gradient-based stopping derives ``grad_tol`` from the initial gradient
    scale and machine precision,
  - the Gauss-Newton path derives its CG tolerance from the current residual
    progress against the same ``ftol`` budget,
  - the Gauss-Newton damping seed is derived from the local normal-equation
    curvature scale instead of a fixed literal damping floor,
  - residual-objective ``m=1`` release thresholds now default to ``ftol``
    instead of hardcoded residual cutoffs,
- accelerated runs now request compact histories and a minimal resume payload
  by default, so the result object does not carry the full parity-era
  momentum/preconditioner cache unless the caller explicitly asks for it,
- the CLI executable now has an extra fixed-boundary-only policy layer on top
  of accelerated mode:

  - the first attempt is the same fast final-grid solve used by the optimized
    Python API path,
  - if a staged input provides explicit ``NS_ARRAY`` / ``NITER_ARRAY`` and the
    fast final-grid attempt misses the target, the CLI replays that staged
    schedule automatically (accelerated coarse stages, strict parity final
    stage),
  - if the input is staged but does not provide ``NITER_ARRAY`` and the user
    explicitly forces accelerated mode, the CLI falls back to a reduced
    warm-start multigrid budget derived from the coarsest-to-finest ``ns``
    ratio,
  - strict parity finish blocks then continue from state only, without reusing
    the parity-era nonlinear-controller caches,
- free-boundary cases currently stay on the existing robust path; accelerated
  free-boundary control is not implemented yet,
- the mode is intended to reduce control overhead while preserving final
  residual quality, not to reproduce the VMEC2000 iteration trace.

Use the dedicated comparison harness to evaluate it against the current default
solver policy:

.. code-block:: bash

  python tools/diagnostics/benchmark_accelerated_mode.py \
    --baseline-mode default \
    --candidate-mode accelerated \
    --candidate-cli-fixed-boundary-mode \
    --kind fixed \
    --jax-platforms cpu

The harness reports:

- cold and warm runtime,
- peak process memory,
- final ``fsq_total``,
- convergence flags,
- reference-``wout`` relRMS metrics when bundled references are available.

Early March 2026 smoke results on the local CPU host:

- ``input.up_down_asymmetric_tokamak``: about ``4.1x`` warm speedup with a
  materially smaller memory footprint than the current default path,
- ``input.circular_tokamak``: approximately neutral in runtime, with good
  final quality (``~1.2e-5`` reference-``wout`` relRMS),
- ``input.LandremanPaul2021_QA_lowres``: approximately neutral with the
  current ftol-derived total target,
- free-boundary accelerated mode is currently a control-path alias for the
  robust baseline, not a new fast free-boundary controller.

Serial fixed-boundary follow-up measurements from
``outputs/accelerated_fixed_boundary_reassessment_20260309/summary.json``
show why the single-grid default is now the accelerated fixed-boundary policy:

- ``input.LandremanSenguptaPlunk_section5p3_low_res``:
  ``45.48s`` current default vs ``0.198s`` accelerated single-grid and
  ``0.232s`` accelerated explicit multigrid; the accelerated single-grid route
  converges and is dramatically faster than both,
- ``input.LandremanPaul2021_QA_lowres``:
  ``8.18s`` current default vs ``7.31s`` accelerated single-grid and
  ``8.10s`` accelerated explicit multigrid; the accelerated single-grid route
  now carries the full staged iteration budget and converges at
  ``~3.0e-13``,
- ``input.LandremanPaul2021_QA_reactorScale_lowres``:
  ``21.15s`` warmed on the optimized CLI track versus ``43.20s`` for VMEC2000
  on the current bundled CPU benchmark, showing the same controller policy
  carries over to a heavier reactor-scale 3D case.

The fixed-boundary CLI path is now best understood as a controller stack,
not a single algorithm:

- easy inputs stay on the fast final-grid optimized path,
- staged inputs can automatically escalate into their input-defined stage
  schedule before paying the cost of parity finish blocks,
- only the genuinely hard cases should reach the final strict continuation
  phase.

The last pre-PR cleanup on ``codex/nonparity-performance`` did not change the
controller policy again. Instead, it trimmed overhead around the existing fast
path:

- performance-oriented non-verbose staged runs now default to the lighter
  history footprint, not just the explicitly accelerated subset,
- ordinary free-boundary runs now skip extra axis syntheses that were only
  needed for ``VMEC_JAX_DUMP_SCALPOT`` diagnostics,
- the VMEC-like dense free-boundary solve path now reuses cached LU
  factorizations when SciPy is available, with a NumPy fallback otherwise.

For an up-to-date side-by-side comparison on your machine, use the bundled
driver example:

.. code-block:: bash

  python examples/fixed_boundary_driver_tracks.py \
    examples/data/input.circular_tokamak \
    --quiet --json

On the current branch, that example produced the following local CPU smoke
result for ``input.circular_tokamak``:

- parity track: ``28.863s`` with ``fsq_total ~2.04e-14``,
- optimized CLI-style track: ``3.445s`` with ``fsq_total ~2.85e-14``.

That example uses the same public Python driver entry point as library users,
but it enables ``cli_fixed_boundary_mode=True`` on the optimized path so the
controller matches the executable behavior exactly.

Latest serial bundled fixed-boundary reassessment (April 2026)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Historical note: this April 2026 accelerated-branch snapshot is retained to
explain why the optimized controller exists, but it is not the current public
VMEC2000 comparison.  Use the README-facing CSV/JSON in the previous section
for current release claims.  This snapshot used NS=151 single-grid inputs
(``examples_single_grid/data/``) and compared
``solver_mode="accelerated"`` warm runtimes against VMEC2000.
Results are in ``outputs/bench_accel_20260413/summary.json``.

**vmec_jax accelerated mode (warm) vs VMEC2000 on NS=151 single-grid:**

- ``ITERModel``: ``0.44s`` vs ``1.72s`` — **3.9x faster**
- ``up_down_asymmetric_tokamak``: ``1.62s`` vs ``7.03s`` — **4.3x faster**
- ``B2_A80``: ``0.16s`` vs ``0.81s`` — **5.1x faster**
- ``LandremanSenguptaPlunk_section5p3_low_res``: ``0.23s`` vs ``1.00s`` — **4.3x faster**
- ``circular_tokamak``: ``0.62s`` vs ``1.50s`` — **2.4x faster**
- ``shaped_tokamak_pressure``: ``0.77s`` vs ``1.99s`` — **2.6x faster**
- ``solovev``: ``0.26s`` vs ``1.23s`` — **4.7x faster**
- ``nfp4_QH_warm_start``: ``0.96s`` vs ``1.71s`` — **1.8x faster**
- ``basic_non_stellsym_pressure``: ``3.61s`` vs ``8.38s`` — **2.3x faster**
- ``circular_tokamak_aspect_100``: ``0.43s`` vs ``0.84s`` — **2.0x faster**
- ``purely_toroidal_field``: ``1.44s`` vs ``1.37s`` — roughly neutral
- ``LandremanPaul2021_QA_lowres``: ``57.26s`` vs ``35.18s`` — 1.6x slower
- ``LandremanPaul2021_QA_reactorScale_lowres``: ``53.81s`` vs ``34.57s`` — 1.6x slower
- ``LandremanPaul2021_QH_reactorScale_lowres``: ``51.14s`` vs ``32.98s`` — 1.6x slower
- ``LandremanPaul2021_QA_lowres1``: ``32.33s`` vs ``16.03s`` — 2.0x slower
- ``cth_like_fixed_bdy``: ``23.66s`` vs ``4.19s`` — 5.7x slower

**11 of 16 cases** are faster in warm accelerated mode than VMEC2000.
The 5 slower cases (QA/QH reactor-scale and cth_like) are the main
remaining fixed-boundary performance targets.

Note: cold (first-run) times include XLA compilation and are
5–30x the warm times. Cold compilation is a one-time cost per
JIT-distinct input configuration.

Earlier reassessments from March 2026:

- results recorded in
  ``outputs/accelerated_cli_fixed_boundary_full_20260311_r2/summary.json``
- all 16 bundled fixed-boundary cases converge on both the baseline and
  optimized paths,
- the optimized path is now faster on 13 of 16 cases and roughly neutral on
  the remaining 3,
- the earlier runtime-regression blocker on the bundled CPU matrix is gone.

Final-``wout`` accuracy is a separate gate from residual convergence. The
earlier full fixed-boundary audit is recorded in
``outputs/fixed_wout_audit_20260310_r3/summary.json``, and the later staged-3D
controller fixes improved several non-axisymmetric cases materially:

- strong final-``wout`` agreement on the current shipped showcase cases:
  ``ITERModel`` (max relRMS ``6.01e-06``),
  ``shaped_tokamak_pressure`` (``1.55e-07``),
  ``circular_tokamak`` (``1.03e-05``),
- targeted staged non-axisymmetric follow-up fixes then brought the reactor
  QA/QH Fourier-channel errors down to the branch target range on direct
  comparisons:
  ``LandremanPaul2021_QA_lowres`` now reaches about
  ``rmnc 5.83e-05``, ``zmns 2.83e-04``, ``lmns 4.75e-03``;
  ``LandremanPaul2021_QA_reactorScale_lowres`` reaches about
  ``rmnc 2.49e-05``, ``zmns 1.61e-04``, ``lmns 2.86e-03``;

Optimization and gradient benchmarking
--------------------------------------

For optimization workflows, runtime alone is not the right metric. The project
needs to track at least four numbers per case:

- primal solve time,
- explicit-gradient time,
- implicit-gradient time,
- peak memory for each of the above.

The current repo already has the pieces needed to build this:

- explicit-diff and implicit-diff examples in ``examples/optimization/``,
- implicit solver support in ``vmec_jax/implicit.py``,
- profiling hooks in ``tools/diagnostics/``.

What is still missing is a canonical benchmark matrix and reporting format. The
recommended first benchmark set is:

- ``circular_tokamak`` for small axisymmetric behavior,
- ``ITERModel`` for a larger axisymmetric case,
- ``LandremanPaul2021_QA_lowres`` for a representative 3D case.

For each case, record:

- backend,
- primal runtime,
- explicit-gradient runtime,
- implicit-gradient runtime,
- peak memory,
- final objective / residual quality.

This will tell us where the real optimization bottlenecks are, and it will
also guide downstream integration work for ``booz_xform_jax`` and ``neo_jax``.

- the runtime picture is now favorable on the bundled CPU matrix, but the
  branch remains experimental because the non-parity scope and GPU/default
  policy questions are broader than this one fixed-boundary CPU result.
- a later ``wout`` audit found that much of the remaining QA/QH benchmark
  error was coming from symmetry-forbidden geometry channels being exported
  for ``lasym=False``:
  zeroing ``rmns`` and ``zmnc`` in ``wout`` for symmetric runs reduced the
  bundled 3D quality metric from about ``3.37e-01`` to ``4.19e-02`` on
  ``LandremanPaul2021_QA_lowres``, from ``3.56e+00`` to ``3.14e-02`` on
  ``LandremanPaul2021_QA_reactorScale_lowres``, and from ``4.61e+00`` to
  ``2.22e-02`` on ``LandremanPaul2021_QH_reactorScale_lowres`` in
  ``outputs/fixed_wout_3d_audit_20260311_r1/summary.json``,
- the next narrowed audit then focused on the staged current-driven 3D
  continuation policy itself:
  for 3-stage ``lasym=True`` current-driven runs, the first attempt used a
  mixed controller that kept the entry/final stages conservative and
  accelerated only the interior stage,
- that same audit kept the remaining QH reactor-scale parity error bounded:
  ``LandremanPaul2021_QH_reactorScale_lowres`` reached about
  ``rmnc 6.12e-05``, ``zmns 2.60e-04``, ``lmns 9.97e-03``,
- that change materially reduced the remaining non-axisymmetric lambda drift:
  ``basic_non_stellsym_pressure`` improved from about ``3.46e-01`` to
  ``3.46e-02`` max relRMS while still running faster than baseline
  (about ``23.69s`` baseline vs ``19.36s`` optimized in the targeted audit),
  and the QA/QH reactor-scale cases held at about ``3.14e-02`` and
  ``2.22e-02`` with large runtime wins,
- the remaining bundled 3D quality gap is therefore much narrower and now
  mostly a lambda-field accuracy question rather than a broad geometry or
  force-balance mismatch,
- a final targeted controller split closed most of that remaining gap:
  ``lasym=False`` current-driven 3D CLI runs now go straight to staged
  multigrid on the conservative non-scan residual path,
- with that split, the latest targeted audit reached:
  ``LandremanPaul2021_QA_lowres`` about ``4.20e-03`` max relRMS at
  about ``100.6s`` warmed runtime,
  ``LandremanPaul2021_QA_reactorScale_lowres`` about ``6.42e-04`` at
  about ``125.1s``,
  ``LandremanPaul2021_QH_reactorScale_lowres`` about ``6.00e-05`` at
  about ``180.2s``,
  while ``basic_non_stellsym_pressure`` remained the last branch-specific
  lambda outlier,
- the follow-on strict-``FTOL`` pass removed that branch-specific regression by
  keeping ``lasym=True`` current-driven 3D staged runs fully on the
  conservative controller:
  ``basic_non_stellsym_pressure`` now lands back at about ``2.98e-02`` max
  relRMS, matching the current baseline quality instead of worsening it, with
  essentially neutral warmed runtime (about ``22.24s`` baseline vs ``22.31s``
  optimized),
- a subsequent CPU-focused profiling pass showed that the next avoidable cost
  was not force balance itself but controller-side JAX overhead:
  moving ``ptau`` sign-change detection to a host NumPy implementation removed
  it from the hot list, and a follow-on host update-assembly path for
  accelerated ``lasym=False`` CPU CLI solves moved the next hotspot out of the
  repeated signed-mode conversion path. On the targeted reassessment artifact
  ``outputs/host_updates_benchmark_20260312/summary.json``,
  ``LandremanPaul2021_QA_lowres`` improved from about ``34.83s`` baseline vs
  ``38.64s`` optimized before the host update path to about ``34.83s`` baseline
  vs ``31.17s`` optimized after it, while keeping about ``4.20e-03`` max
  relRMS against the VMEC2000 reference. The sensitive
  ``basic_non_stellsym_pressure`` case also held baseline-level quality and
  remained slightly faster (about ``9.12s`` baseline vs ``8.95s`` optimized),
- rerunning the full warmed bundled ``lasym=False`` fixed-boundary CPU matrix
  on that new head produced the cleanest branch-level fixed-boundary result so
  far: all 13 bundled ``lasym=False`` cases converged on both paths, and the
  optimized CLI controller was faster on all 13. Representative rows from
  ``outputs/fixed_lasym_false_matrix_20260312/summary.json`` include
  ``LandremanPaul2021_QA_reactorScale_lowres`` (``51.31s`` baseline vs
  ``38.56s`` optimized), ``LandremanPaul2021_QH_reactorScale_lowres``
  (``60.10s`` vs ``46.33s``), ``ITERModel`` (``12.73s`` vs ``5.00s``), and
  ``cth_like_fixed_bdy`` (``4.71s`` vs ``0.97s``),
- the README-facing VMEC2000 comparison was then rerun separately on the same
  host in
  ``outputs/readme_fixed_runtime_vmec2000_accel_cpu_20260312/summary.json``:
  all 13 bundled ``lasym=False`` fixed-boundary cases converged, but the
  optimized branch is still faster than VMEC2000 on only the smallest shipped
  cases (``solovev`` and ``circular_tokamak_aspect_100``). The reactor-scale
  QA/QH cases are now close enough to compare honestly on one plot, but they
  still run somewhat slower than VMEC2000 on CPU,
- carrying the same “reduce host-controlled overhead” approach into
  ``lasym=False`` free-boundary showed the next safe win: batching the
  boundary real-space syntheses in
  ``vmec_jax/free_boundary.py:_sample_external_boundary_arrays`` cut the
  representative ``input.cth_like_free_bdy`` cProfile total from about
  ``60.41s`` to about ``58.21s`` while keeping the NESTOR reuse tests green,
- the next free-boundary profile then showed that
  ``_vmec_nonsingular_terms_from_bexni`` and
  ``_vmec_nonsingular_gsource_from_bexni`` were still rebuilding basis-only
  helper tables on every call; caching those tables on the free-boundary basis
  cut the same representative cProfile total further to about ``32.67s`` and
  brought the direct warmed CPU benchmark for ``cth_like_free_bdy`` to about
  ``11.25s`` in ``outputs/freeb_cth_runtime_20260312/summary.json`` while
  preserving convergence.
- one more pass then removed six separate second-derivative boundary synthesis
  calls from ``_sample_external_boundary_arrays`` and replaced them with two
  stacked batched syntheses. On the same representative free-boundary case,
  that moved the cProfile total further to about ``31.04s`` and improved the
  warmed CPU benchmark to about ``10.41s`` in
  ``outputs/freeb_cth_runtime_20260312_r2/summary.json`` while keeping the
  direct NESTOR reuse tests green.
- the next step replaced the remaining JAX-backed boundary synthesis in that
  host-only external-sampling path with a cached NumPy phase-stack helper.
  That keeps the same VMEC trig algebra but removes more JAX lowering/indexing
  overhead from ``_sample_external_boundary_arrays`` itself. On the same
  representative case, the cProfile total dropped again to about ``30.20s``
  with ``_sample_external_boundary_arrays`` down to about ``5.78s``, and the
  warmed CPU benchmark improved further to about ``9.86s`` in
  ``outputs/freeb_cth_runtime_20260312_r4/summary.json``.
- a deeper 2026-03-13 profiling pass then combined full ``pytest -q`` /
  Sphinx validation with kernel-level auditing:

  - the accelerated fixed-boundary HLO dump for
    ``input.LandremanPaul2021_QA_lowres`` showed the bcovar kernel still
    contains a large gather/scatter footprint (roughly ``96`` gathers and
    ``55`` scatters in the dumped HLO), which means the next fixed-boundary
    wins are more likely to come from refactoring bcovar/state-enforcement
    kernels than from controller tweaks,
  - cProfile on the representative free-boundary case
    ``input.cth_like_free_bdy`` showed the remaining dominant host-side cost
    was still inside ``free_boundary.py``:
    ``_sample_external_boundary_arrays`` plus JAX-backed parity conversions
    and the full-profile ``currents_from_bcovar`` path used only to recover
    the edge ``ctor`` scalar for NESTOR,
  - replacing those host-only parity conversions with pure NumPy helpers in
    ``vmec_parity.py`` and adding a specialized
    ``vmec_lforbal.plascur_edge_from_bcovar`` helper cut the same warmed
    200-iteration ``input.cth_like_free_bdy`` benchmark from about ``8.00s``
    to about ``3.45s`` on the same CPU host while keeping the free-boundary
    regression tests green,
  - after that change, the same cProfile run dropped from about ``8.00s`` to
    about ``6.06s`` total Python-side wall time, and the remaining dominant
    CPU hotspots became the update/preconditioner side
    (``_apply_vmec_scale_m1_precond_rhs``,
    ``_enforce_fixed_boundary_and_axis``, and the free-boundary update block)
    rather than the old external-boundary sampling path itself.

Full same-host readiness sweep
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After the targeted free-boundary optimizations, the branch was rerun on the
full shipped example matrix plus the external DIII-D axisymmetric
free-boundary references. The freshest final-head artifacts are:

- fixed-boundary optimized CLI / automatic Python readiness matrix:
  ``outputs/readiness_fixed_all_20260313/summary.json``
- fixed-boundary VMEC2000-vs-optimized warmed runtime matrix:
  ``outputs/fixed_runtime_vmec2000_accel_cpu_warm_20260313/summary.json``
- free-boundary VMEC2000-vs-default warmed runtime matrix:
  ``outputs/free_runtime_vmec2000_cpu_warm_20260313/summary.json``

Current state:

- fixed-boundary: all 16 rows converged on the optimized branch path,
- fixed-boundary accelerated/default comparison: faster on 13 rows, roughly
  neutral on 1, slower on 2, with all 16 satisfying the requested final-stage
  ``FTOL``,
- free-boundary: all 5 rows now converge on the shipped default path,
- the previous shipped holdout ``cth_like_free_bdy_lasym_small`` was replaced
  with a convergent ``lasym=True`` CTH-like fixture built from the stable
  bundled free-boundary case,
- the free-boundary DIII-D rows remain the main runtime blockers, but the
  latest axisymmetric CPU pass reduced ``input.DIII-D_lasym_false`` from about
  ``173.82s`` warmed earlier in the branch to about ``113.78s`` warmed in the
  final runtime matrix on the same host,
- the full 21-case CPU matrix is still slower than VMEC2000 on this host, so
  the performance story is now “better vmec_jax defaults and broader
  convergence,” not “broad CPU wins over VMEC2000.”

So the branch is useful and now converges across the shipped 21-case matrix,
but any merge/default decision should be framed around vmec_jax usability and
coverage rather than raw same-host VMEC2000 CPU speed.

- boundary ``R/Z`` synthesis and first-derivative synthesis in
  ``_sample_external_boundary_arrays`` cut the representative
  ``input.cth_like_free_bdy`` profile from about ``60.41s`` total wall time to
  about ``58.21s`` while keeping the direct NESTOR regression tests green,
- a follow-on experiment that added a final-grid parity polish to the
  staged 3D accelerated path was rejected because it raised runtime
  substantially without improving those benchmarked quality numbers.

Representative warmed CPU baseline-vs-optimized points from the current full
matrix:

- ``ITERModel``: ``8.86s`` baseline vs ``4.48s`` optimized,
- ``LandremanPaul2021_QA_lowres``:
  ``29.65s`` baseline vs ``35.02s`` optimized, with
  ``quality_max_rel_rms ~ 4.20e-03``,
- ``LandremanPaul2021_QA_reactorScale_lowres``:
  ``47.17s`` baseline vs ``45.52s`` optimized, with
  ``quality_max_rel_rms ~ 6.42e-04``,
- ``LandremanPaul2021_QH_reactorScale_lowres``:
  ``53.28s`` baseline vs ``53.05s`` optimized, with
  ``quality_max_rel_rms ~ 6.00e-05``,
- ``LandremanSenguptaPlunk_section5p3_low_res``:
  ``15.86s`` baseline vs ``7.48s`` optimized,
- ``up_down_asymmetric_tokamak``:
  ``24.58s`` baseline vs ``2.97s`` optimized.

Representative warmed CPU VMEC2000-vs-``vmec_jax`` points from that historical
readiness matrix:

- ``solovev``: VMEC2000 ``0.61s`` vs ``vmec_jax`` ``0.09s``,
- ``circular_tokamak_aspect_100``: ``2.99s`` vs ``0.59s``,
- ``cth_like_fixed_bdy``: ``1.34s`` vs ``1.04s``,
- ``LandremanPaul2021_QA_reactorScale_lowres``: ``39.27s`` vs ``44.48s``,
- ``LandremanPaul2021_QH_reactorScale_lowres``: ``46.34s`` vs ``53.93s``,
- ``DIII-D_lasym_false``: ``19.80s`` vs ``113.78s``,
- ``cth_like_free_bdy``: ``1.79s`` vs ``6.96s``.

Same-host CPU/GPU reassessment on a reference GPU workstation is now complete
for the same 16-case bundled fixed-boundary set:

- both CPU and GPU converge on all 16 cases,
- GPU is already faster on the heavier 3D cases
  (``LandremanPaul2021_QA_lowres``,
  ``LandremanPaul2021_QA_lowres1``,
  ``LandremanPaul2021_QA_reactorScale_lowres``,
  ``LandremanPaul2021_QH_reactorScale_lowres``,
  ``cth_like_fixed_bdy``),
- CPU still wins on the remaining 11 smaller or more launch-latency-dominated
  cases.

Representative same-host CPU/GPU warmed comparisons:

- ``LandremanPaul2021_QA_reactorScale_lowres``:
  CPU ``23.72s`` vs GPU ``13.06s``,
- ``LandremanPaul2021_QH_reactorScale_lowres``:
  CPU ``31.39s`` vs GPU ``16.56s``,
- ``cth_like_fixed_bdy``:
  CPU ``1.45s`` vs GPU ``0.89s``,
- ``circular_tokamak``:
  CPU ``0.70s`` vs GPU ``1.97s``,
- ``solovev``:
  CPU ``0.16s`` vs GPU ``0.55s``.

That makes the current branch state clearer than the earlier stress-case
benchmark story:

- the optimized fixed-boundary CLI path is a mixed but useful warmed CPU result:
  faster on 13 of 16 rows versus the prior/default branch path, roughly neutral
  on 1, and slower on 2,
- relative to VMEC2000, the current README-facing CPU matrix still wins on only
  1 of 16 bundled fixed-boundary rows,
- the GPU path is functional and convergent on the same bundled matrix, but is
  faster only on selected heavier 3D rows; do not claim uniform GPU superiority.

Exact-optimizer replay profiling on ``f0225ff``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The exact optimizer has a different performance profile from a standalone
fixed-boundary solve because accepted-point Jacobians replay a differentiated
VMEC tape for every active boundary direction.  A fresh 2026-05-12 QH
``max_mode=2`` profile used
``tools/diagnostics/profile_exact_optimizer.py --callback jacobian --repeats 2
--perturb-scale 1e-4`` on CPU and on an RTX A4000 GPU, then compared the JSON
reports with ``tools/diagnostics/compare_profile_reports.py``.

The local CPU report took ``11.29 s`` for two new Jacobian points.  The GPU
report took ``54.83 s`` for the same two points, with the first GPU point
dominated by cold accepted-point setup and the second warm point taking
``4.44 s``.  The GPU/CPU ratio was ``4.86x`` overall and ``4.87x`` for replay
time; RSS was similar.  The largest profile terms were:

- CPU: ``jacobian_tape_replay`` ``4.89 s`` (``43%``),
  ``jacobian_initial_tangents`` ``2.75 s`` (``24%``),
  ``jacobian_residual_tangents`` ``2.21 s`` (``20%``).
- GPU: ``jacobian_tape_replay`` ``23.79 s`` (``43%``),
  ``exact_solve_with_tape_total`` ``12.89 s`` (``24%``),
  ``exact_tape_build`` ``11.34 s`` (``21%``).

This narrows the GPU lane: optimizing residual convergence alone will not fix
optimization runtime.  The next GPU work should target accepted-point tape
replay and tangent batching/reuse, while keeping the CPU exact path unchanged.

Backend-adaptive replay bucketing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The first concrete GPU fix from that profile is backend-adaptive dynamic tape
bucketing.  CPU replay remains fastest with the previous bucket size ``32``.
On the ``office`` RTX A4000, the same dense exact-Jacobian profile is much
faster when GPU replay pads dynamic tapes to bucket size ``128``.  This is now
the default for CUDA/ROCm/GPU backends; users can still override it with
``VMEC_JAX_DYNAMIC_REPLAY_BUCKET`` for experiments.

Validation used the same QH ``max_mode=2`` accepted-point profile as above:

- previous GPU default: ``54.83 s`` for two perturbed Jacobian points,
  ``jacobian_tape_replay=23.79 s``;
- backend-adaptive GPU default: ``15.79 s`` for two perturbed Jacobian points,
  ``jacobian_tape_replay=5.43 s``;
- CPU baseline remains ``11.29 s`` for the same profile, so the optimized GPU
  profile is now about ``1.40x`` CPU overall and ``1.11x`` CPU in replay time.

This does not make GPU universally faster yet.  It removes the largest replay
regression for small/medium exact optimizations and leaves tape construction,
residual tangent projection, and accepted-point replay fusion as the next GPU
targets.

Follow-up profiling on 2026-05-12 confirmed the same bottleneck split.  A tiny
cold fixed-boundary smoke case on ``office`` still favored CPU
(``13.80 s`` CPU versus ``68.86 s`` GPU), so cold single solves remain a CPU
use case.  A local CPU exact-Jacobian smoke spent about ``96%`` of callback time
in replay/tangent work: ``jacobian_tape_replay`` ``3.34 s``,
``jacobian_initial_tangents`` ``2.85 s``, and
``jacobian_residual_tangents`` ``2.75 s`` out of a ``9.31 s`` callback.  The
next GPU and CPU target is therefore accepted-point tangent fusion/reuse, not
ordinary force-kernel optimization.

The 2026-05-18 scalar-gradient patch narrows the initial tangent/projection
cost without changing the dense least-squares Jacobian path:
``objective_and_gradient_fun`` now projects reverse tape cotangents through an
initial-state VJP on tangent-cache misses, caches that VJP by the same
shape/flip-branch key, and still reuses the dense tangent map when a previous
Jacobian callback already built it.  On a local CPU QH ``max_mode=2`` bounded
profile with two perturbed gradient callbacks
(``inner_max_iter=40``, ``trial_max_iter=20``), total callback time dropped from
``20.47 s`` to ``18.68 s``; the warm repeat dropped from ``0.69 s`` to
``0.56 s``.  A bounded QH ``max_mode=1`` check with ``inner_max_iter=8``
matched the dense reference gradient (``||g_adj - J.T r|| = 2.10e-12``,
relative ``7.21e-14``).

The 2026-05-25 ``office`` GPU follow-up rechecked the current QH
``max_mode=2`` exact-optimization hotspot with a deliberately short
``inner_max_iter=10`` accepted solve.  The dense least-squares Jacobian still
materializes a ``44353 x 24`` residual derivative, so the one-callback GPU
profile is dominated by accepted-point tape construction and residual-tangent
projection:

.. list-table:: QH mode-2 GPU exact callback, ``office``, 2026-05-25
   :header-rows: 1

   * - Path
     - Wall time
     - Main profile terms
   * - Dense Jacobian
     - ``55.8 s``
     - ``exact_tape_build=20.5 s``, ``jacobian_tape_replay=17.9 s``,
       ``jacobian_residual_tangents=9.4 s``
   * - Forced projected replay
     - ``57.0 s``
     - ``exact_tape_build=21.2 s``,
       ``jacobian_fused_projected_replay_total=27.9 s``
   * - Scalar adjoint, 4 optimizer callbacks, old backtracking
     - ``69.2 s`` total
     - no accepted improvement from the initial objective ``0.717``
   * - Scalar adjoint, L-BFGS/fast backtracking, 4 callbacks
     - ``72.9 s`` total
     - objective ``0.717 -> 0.659``, no dense ``44353 x 24`` Jacobian
   * - Scalar adjoint, L-BFGS/fast backtracking, 8 callbacks
     - ``76.8 s`` total
     - objective ``0.717 -> 0.627``; warm gradient callbacks drop to
       ``6.3 s`` average
   * - Scalar adjoint, default radius plus best-state finalization cache
     - ``74.2 s`` total
     - objective ``0.717 -> 0.615``; final exact tape rebuild avoided
       (``exact_tape_build`` count ``4`` instead of ``5``)
   * - Scalar adjoint, radius re-expansion after backtracked accepts
     - ``75.3 s`` total
     - objective ``0.717 -> 0.612`` in 8 callbacks; avoids the tiny-step
       plateau after rejected probes while keeping the final-state cache
   * - Scalar adjoint, opt-in cost-only trial filter
     - ``86.7 s`` total
     - objective ``0.717 -> 0.612`` with only 4 exact gradients, but
       8 trial solves add ``17.6 s``; enable with
       ``scalar_cost_only_trials=True`` for explicit profiling
   * - Matrix-free SciPy, ``lsmr_maxiter=2``, 4 residual callbacks
     - ``167.1 s`` total
     - objective ``0.717 -> 0.680``; repeated ``Jv``/``J.Tv`` products make
       this slower than scalar adjoints for this microcase

That historical test ruled out the earlier projected-replay implementation for
24 DOFs.  After the JVP-only/basepoint-carry tape changes, the updated bounded
QH mode-2 comparison slightly favors projected replay for non-LASYM GPU
callbacks, while SciPy's generic matrix-free trust-region path still spends too
much time in repeated linear-operator products at low LSMR budgets.  In this QH
mode-2 GPU microcase, the scalar-adjoint path remains useful because it avoids
dense Jacobian materialization and now accepts monotone steps after aggressive
backtracking.  Keeping the best exact
scalar state also removes one final accepted-point solve for runs whose last
probe is rejected.
Backtracked accepts now re-expand the next trust radius to the previous rejected
scale instead of permanently collapsing to the accepted microscopic step.  A
cost-only trial filter was also tested, but it is not the default because this
QH GPU case spends about ``2.2 s`` per trial solve and is slower overall despite
halving exact-gradient builds.  It remains available for diagnostics with
``scalar_cost_only_trials=True`` on ``FixedBoundaryExactOptimizer.run`` (or the
same argument through ``least_squares_solve``).  The legacy environment variable
``VMEC_JAX_OPT_SCALAR_COST_ONLY_TRIALS=1`` remains available for profiling
scripts, and ``tools/diagnostics/profile_exact_optimizer.py`` exposes the same
switch as ``--scalar-cost-only-trials``.  The path still needs problem-aware
step proposal/globalization before it can replace dense SciPy least-squares for
README-quality optimization results.

Additional controller finding from March 2026:

- the existing fully non-VMEC scan path was re-probed as a possible next
  accelerated controller, but it is not yet robust enough to become the
  default accelerated path: on representative fixed-boundary cases it is much
  faster, but it can diverge badly in ``fsq_total`` and final ``wout`` quality.
  The current accelerated mode therefore stays on the masked VMEC-control scan
  until a more stable device-resident controller is in place.

If you want an automatic parity probe when using scan, set::

  export VMEC_JAX_SCAN_PARITY_GUARD=1

This runs a short scan-vs-non-scan probe at the start of each stage and falls
back to the non-scan loop if a mismatch is detected. It is **off by default**
because it adds extra compilation and iteration overhead.

Scan chunking (fixed NSTEP blocks)
----------------------------------

To avoid retracing for variable tail lengths, the scan loop executes in fixed
chunks of length ``NSTEP`` (the VMEC input parameter). Iterations beyond
``NITER`` are masked by the in-scan hold condition, so the extra work is a
no-op and does not affect parity.

Controls:

- ``VMEC_JAX_VMEC2000_CHUNKED=1`` (default): enable chunked scan.
- ``VMEC_JAX_SCAN_CHUNK_SIZE=<int>``: override chunk length (defaults to
  ``NSTEP``).

This reduces compilation cache misses when the stage transition changes
``NITER`` but keeps the same ``NSTEP`` cadence.

Live NSTEP printing (debug callback)
------------------------------------

VMEC2000-style iteration rows (scan and non-scan) are printed using a JAX debug
callback by default. This keeps the output VMEC-like without inserting extra
Python-side synchronization on every step.

Defaults:

- Live printing is **enabled** when ``verbose`` and ``vmec2000_control`` are on.
- The VMEC2000-control scan path defaults to ``debug_callback`` for ordered
  host-side row formatting.  ``debug_print`` remains available for simpler
  device-side debug printing.

Disable live printing with:

::

  export VMEC_JAX_SCAN_PRINT=0

If you want to reduce any remaining host-callback overhead, increase ``NSTEP``
in the input file. Fewer prints means fewer callbacks.

Quiet scan runs (``--quiet`` / ``verbose=False``) automatically switch to a
minimal history mode: only ``fsqr/fsqz/fsql`` (and therefore ``w_history``) are
kept. Per-iteration print scalars (``r00``, ``w_mhd``) and time-step histories
are skipped to reduce host/device traffic. Override with::

  export VMEC_JAX_SCAN_MINIMAL=0   # keep full scan diagnostics
  export VMEC_JAX_SCAN_MINIMAL=1   # force minimal histories

In fast mode (``performance_mode=True`` / ``--fast``), ``scan_minimal`` is the
default for **quiet** runs (``verbose=False``) unless explicitly overridden by
``VMEC_JAX_SCAN_MINIMAL``. When ``verbose=True``, scan keeps the extra scalar
histories needed for VMEC-style printing.

Advanced knobs (not required for normal use):

- ``VMEC_JAX_SCAN_PRINT_MODE=debug_callback`` (default for VMEC2000-control scan)
- ``VMEC_JAX_SCAN_PRINT_MODE=debug_print`` (device-side debug printing)
- ``VMEC_JAX_SCAN_PRINT_ORDERED=1`` to force ordered prints (may reduce parallelism)

DFT tomnsps (GEMM path)
-----------------------

VMEC2000's ``tomnsps`` analysis transform is now implemented as a two-stage
DFT using the precomputed ``fixaray`` trig/weight tables:

- theta stage: multiply by ``cosmui/sinmui`` (endpoint-weighted + ``mscale``),
- zeta stage: multiply by ``cosnv/sinnv`` (with ``nscale`` and ``n*NFP`` in
  ``cosnvn/sinnvn`` for derivative terms).

The core contractions are done with batched ``dot_general`` calls so XLA can
lower them into GEMM kernels. This follows the VMEC++ basis approach (see
References [5-6]) while keeping VMEC2000 parity.

Recent updates to the DFT path:

- **Stacked theta contractions**: multiple force kernels are concatenated into
  a single cosine and sine projection per iteration, reducing the number of
  ``dot_general`` launches.
- **Derivative-factor fusion**: the :math:`n\,\mathrm{NFP}` factor for
  ``cosnvn/sinnvn`` is applied *after* the zeta contraction, so the same
  ``cosnv/sinnv`` basis can be reused for derivative blocks.
- **Stacked zeta contractions**: cosine- and sine-basis transforms for the
  derivative and non-derivative blocks are grouped to reduce kernel dispatches.
- **NumPy force path follows the same policy**: the host CPU force shim no
  longer forces the FFT branch.  It uses the DFT/GEMM path by default and still
  honors ``VMEC_JAX_TOMNSPS_FFT=0/1``.  On the finite-beta
  ``input.nfp4_QH_finite_beta`` CPU diagnostic, this reduced a 200-iteration
  force block from ``1.067 s`` to ``0.868 s`` while preserving the residual
  trajectory, and the full 3000-budget direct solve completed in ``13.05 s``
  with ``10.28 s`` in force assembly.

An FFT-based path remains available for experiments:

- ``VMEC_JAX_TOMNSPS_FFT=1`` enables the FFT implementation (not default).

Preconditioner weight caching
-----------------------------

The 1D radial preconditioner uses angular weights
:math:`w_i=\mathrm{cosmui3}_{i,0}/\mathrm{mscale}_0` on the VMEC internal grid.
These weights are now cached in the trig table as ``wint3_precond`` and reused
whenever the preconditioner diagonal is refreshed. This avoids rebuilding the
same weight tensor in every refresh call and keeps the preconditioner refresh
path purely algebraic in ``bsq``, ``r12``, ``sqrtg``, ``ru12``, and ``zu12``.

Free-boundary WP1 micro-benchmark
---------------------------------

For free-boundary staging, use the dedicated benchmark script:

.. code-block:: bash

  python tools/benchmarks/bench_free_boundary_wp1.py \
    --input examples/data/input.DIII-D_lasym_false \
    --interp-points 20000 \
    --interp-repeats 5

This reports:

- metadata validation/load time,
- full mgrid tensor load time,
- interpolation throughput and sampled ``|B_ext|`` stats.

Solver note: this benchmark isolates external-field staging cost. The sampling
toggle below is diagnostic-only; it does not describe overall free-boundary
solver maturity. You can disable that sampling with:

.. code-block:: bash

  export VMEC_JAX_FREEB_SAMPLE_EXTERNAL=0

WP2 free-boundary runtime controls
----------------------------------

Current free-boundary coupling defaults to the VMEC2000-aligned dense vacuum
operator described in :doc:`free_boundary_plan`.  The earlier lightweight
spectral solve remains useful for historical context and debugging, but it is
not the promoted parity path.  To keep dense-coupling runtime bounded:

- mgrid field tensors are cached by path in-process (avoids per-iteration
  NetCDF reloads),
- Poisson spectral denominators are stage-static,
- ``ivacskip`` reuses prior potential (skip solve) when ``ivac != 1``.

Control flags:

.. code-block:: bash

  export VMEC_JAX_FREEB_COUPLE_EDGE=1         # default: on
  export VMEC_JAX_FREEB_SAMPLE_EXTERNAL=1     # default: on

If profiling free-boundary solver-only cost, disable sampling diagnostics:

.. code-block:: bash

  export VMEC_JAX_FREEB_SAMPLE_EXTERNAL=0

Direct-coil CPU/GPU micro-benchmark snapshot
--------------------------------------------

The direct-coil benchmark matrix isolates the new coil-provider free-boundary
path from the broader example matrix:

.. code-block:: bash

  python tools/benchmarks/bench_freeb_direct_coil_matrix.py \
    --quick --include-gpu \
    --out /tmp/freeb_matrix_office_gpu_followup/summary.json

On the 2026-05-28 ``office`` CPU/CUDA rerun, the matrix used concrete-platform
probing so CUDA rows were recorded even under ``JAX_PLATFORMS=cpu,cuda``.  The
best bounded direct-solve row was ``direct_solve_jit_forces``.  The warm solve
was still CPU-favorable (``0.0525 s`` CPU versus ``0.2346 s`` CUDA,
``4.46x`` GPU/CPU), but the detailed timers show that the remaining GPU cost is
no longer the direct Biot-Savart field sample or the dense NESTOR solve:

.. list-table::
   :header-rows: 1

   * - Timing bucket
     - CPU
     - CUDA
     - CUDA/CPU
   * - setup
     - ``9.31 ms``
     - ``53.81 ms``
     - ``5.78x``
   * - residual metrics
     - ``0.76 ms``
     - ``29.26 ms``
     - ``38.32x``
   * - accepted-control fsq1
     - ``0.15 ms``
     - ``14.17 ms``
     - ``97.06x``
   * - preconditioner apply
     - ``1.09 ms``
     - ``12.65 ms``
     - ``11.55x``
   * - force evaluation
     - ``9.21 ms``
     - ``8.55 ms``
     - ``0.93x``
   * - finalize
     - ``10.67 ms``
     - ``11.05 ms``
     - ``1.04x``

The next GPU optimization target is therefore accepted-control and
preconditioner/update dispatch amortization plus reusable setup/precompute
state, not Biot-Savart sampling or the final dense vacuum solve.  Scalar-defer
is deliberately not the default yet because the residual scalars still drive
VMEC control flow, history, and free-boundary acceptance logic.

The follow-up instrumentation splits ``setup_total_s`` into static-grid
rebuild, free-boundary policy, boundary/profile construction, cache-key hashing,
``ptau`` constants, mode-index constants, and update constants.  Use those keys
in ``warm_solver_timing.timing`` or in the matrix ``cpu_gpu_comparison`` block
before implementing the next setup cache; the split is intentionally
measurement-only and does not change solver numerics.

When ``VMEC_JAX_TIMING=1`` is enabled, the direct-coil solve benchmark also
emits a normalized ``phase_timing_comparison`` block.  It ranks the warm setup,
force-evaluation, residual-metric, accepted-control ``fsq1``, preconditioner,
update, and finalize buckets and reports cold-to-warm improvement for each
bucket.  This is the preferred quick triage view for structural control-loop
staging and preconditioner/residual-scalar dispatch work because it is stable
even when the raw timing dictionary grows new sub-buckets.

When both CPU and GPU rows are present, the matrix summary also writes
``gpu_bottleneck_summary``.  This is a short, sorted list of warm phases for
which GPU is slower than CPU, including the GPU/CPU ratio and absolute
``gpu_minus_cpu_s``.  Use this field as the first-pass triage table before
opening detailed XLA or Perfetto traces.

The first CUDA probe after adding that split reported a warm ``setup_total_s``
of ``40.4 ms`` on the tiny direct-coil case.  The dominant setup sub-buckets
were boundary/profile construction (``18.6 ms``), update constants
(``12.3 ms``), and unattributed setup (``5.9 ms``).  Cold setup was dominated
by compilation-adjacent boundary/profile and axis-reset work, so cold-start
analysis should be kept separate from warm accepted-point optimization timing.
Accelerator host-forward runs now use the existing NumPy row-enforcement setup
path for the initial state when ``VMEC_JAX_HOST_SETUP_ENFORCE=auto``.  This
avoids tiny eager device dispatches in setup without changing traced
differentiable solves; set the variable to ``0`` to disable the policy or ``1``
to force it in non-traced profiling runs.  On the follow-up office CUDA probe,
warm time improved from ``0.180 s`` to ``0.169 s`` and
``setup_update_constants_s`` dropped from ``12.3 ms`` to ``4.7 ms``.  The
remaining warm GPU targets are boundary/profile setup, residual scalar
materialization, accepted-control ``fsq1``, and preconditioner dispatch.

The next pushed patch extends the host flux-profile fast path to concrete
default-``APHI`` iota profiles, so the non-traced setup path no longer falls
back to small eager JAX profile work for common ``AI`` inputs.  A local CPU
quick benchmark at that head reported the tiny direct-coil ``--jit-forces``
warm solve at about ``0.026 s`` with ``setup_boundary_profiles_s≈1.5 ms``.
The matched ``office`` CPU/CUDA matrix still showed the GPU row as
CPU-favorable (``0.0521 s`` CPU versus ``0.2318 s`` CUDA), with force assembly
near parity (``9.11 ms`` CPU versus ``9.68 ms`` CUDA).  This confirms the next
larger performance patch should cache or stage the non-traced setup/control
payload rather than continuing to optimize the direct coil field kernel.

The follow-up host-profile setup policy adds
``VMEC_JAX_HOST_PROFILE_SETUP=auto`` (``0`` disables, ``1`` forces) so
non-traced accelerator setup can also use the host profile path.  On the same
office quick matrix the tiny ``--jit-forces`` direct-coil row improved to
``0.0552 s`` warm on CPU and ``0.1625 s`` warm on CUDA (``2.95x`` GPU/CPU).
The GPU setup/profile bucket dropped to ``5.6 ms`` and force assembly stayed
slightly faster than CPU (``8.28 ms`` CUDA versus ``9.98 ms`` CPU).  The
remaining named GPU buckets are residual scalar materialization
(``18.6 ms``), accepted-control ``fsq1`` (``12.8 ms``), and preconditioner
dispatch (``10.8 ms``).

The next promoted accelerator-forward policy stages the preconditioned
``fsq1`` norm reductions on the host for non-traced accelerator solves:
``VMEC_JAX_HOST_FSQ1_NORMS=auto`` (``0`` disables, ``1`` forces).  The policy
is disabled for traced/AD solves, so exact-gradient paths keep the JAX-native
reduction. Later ``office`` policy-ablation runs showed that host-staging the
primary residual metrics was not robustly beneficial: the leaner device-product
path measured ``0.181 s`` warm on CUDA versus ``0.224 s`` for the old
host-residual default in the tiny direct-coil JIT-forces row. Therefore
``VMEC_JAX_HOST_RESIDUAL_METRICS=auto`` now keeps residual products on device
and only materializes the three residual scalars; set the variable to ``1`` to
force the older six-scalar host-staging path for diagnostics.

The production-like timing-light row, which disables detailed timing
synchronization, moved in the same direction: the tiny ``--jit-forces`` direct
coil row measured ``0.0528 s`` warm on CPU and ``0.1857 s`` warm on CUDA
(``3.52x`` GPU/CPU).  This confirms the policy is not merely improving a
diagnostic bucket, but the tiny case is still launch/synchronization dominated.
Do not claim GPU speedup for this row; use it as the regression target for the
next residual/preconditioner dispatch fusion pass.

A 2026-05-28 fresh-clone ``office`` rerun with the normalized
``phase_timing_comparison`` confirmed the same target after the latest merge:
CPU warm time was ``0.0424 s`` and CUDA warm time was ``0.1466 s`` for the tiny
direct-coil ``--jit-forces`` row.  The CPU named phases were setup
(``14.5 ms``), force evaluation (``4.90 ms``), residual metrics
(``0.64 ms``), and preconditioner (``0.55 ms``).  The CUDA named phases were
residual metrics (``23.8 ms``), preconditioner (``20.5 ms``), setup
(``20.2 ms``), and force evaluation (``10.0 ms``).  This keeps the next
structural performance lane focused on residual scalar synchronization,
preconditioner dispatch/application fusion, and reusable setup context; the
direct-coil field kernel itself is not the limiting bucket for this case.

A follow-up same-head ``office`` matrix on PR head ``599668a`` added the
``--include-timing-light`` row to separate production-like wall time from
timing synchronization.  The tiny direct-coil ``--jit-forces`` row measured
``0.0569 s`` warm on CPU and ``0.1898 s`` warm on CUDA with detailed timing
enabled.  The timing-light row measured ``0.0719 s`` warm on CPU and
``0.1462 s`` warm on CUDA.  The detailed CUDA buckets again showed force
evaluation near CPU parity (``10.7 ms`` versus ``10.4 ms``) while residual
metrics (``20.1 ms``), preconditioner dispatch/application (``15.7 ms``), and
setup (``29.1 ms``) dominate.  The performance lane is therefore still
structural control-loop staging and preconditioner/residual-scalar fusion, not
Biot-Savart evaluation.

Two opt-in policies were checked and are deliberately not promoted.  Allowing
the host-update path on accelerators with ``VMEC_JAX_HOST_UPDATE_ON_ACCELERATOR=1``
made the tiny CUDA row slower in this matrix, and setting
``VMEC_JAX_BADJAC_INITIAL_STATE_PROBE_ITERS=0`` did not produce a robust
speedup after the accepted-control payload fusion.  ``--include-timing-light``
rows also showed that production-like wall time remains GPU-slower, so the
remaining work is structural: stage or fuse the VMEC-control residual scalar
materialization, accepted-control ``fsq1``, and preconditioner dispatch.

The 2026-05-31 post-merge timing split added correctness-critical finalize
sub-buckets to the same benchmark matrix. A local CPU quick run with
``--include-badjac-probe0 --include-timing-light`` reported a warm direct-coil
``solve_total_s=0.1184`` and ``finalize_s=0.00876``; finalize split into
``finalize_nestor_recompute_s=0.00651``,
``finalize_residual_recompute_s=0.00218``, and
``update_state_ready_s=4.25e-6``. A fresh ``office`` CPU/CUDA clone at
``bc00ff4`` completed all 22 ``--include-gpu --include-policy-ablation`` rows.
The default CUDA direct solve remained GPU-slower (``9.42x`` CPU), dominated by
non-JIT force evaluation (``0.602 s``), preconditioner work (``0.369 s``),
setup/axis reset (``0.334 s``), and final residual recompute (``0.296 s``).
With JIT forces enabled, force evaluation became faster than CPU (``0.82x``),
but the warm row was still ``2.48x`` CPU because preconditioner refresh/apply
(``0.0438 s``, ``2.55x`` CPU) and residual/control/finalization dispatch
remained exposed.

The policy-ablation rerun after the direct-coil replay-helper commits reached
the same conclusion. All CPU/GPU rows completed, but disabling host residual
metrics, host ``fsq1`` norms, host profile setup, or all three together did not
produce a single promoted GPU fix. The tiny non-JIT-force direct solve remained
about ``10.48x`` CPU, and JIT-force policy-ablation rows remained
``2.65x``--``3.08x`` CPU. Treat the current benchmark evidence as a regression
target for controller/preconditioner/finalization staging, not as a GPU speedup
claim for the direct-coil free-boundary row.

The 2026-06-01 PR-head rerun at ``8a0ce5c2`` confirmed the bounded
preconditioner reuse patch removes the duplicate refresh after a
``bcovar``-seeded preconditioner build.  A fresh ``office`` clone completed all
22 CPU/GPU quick rows and the timed direct rows reported
``precond_refresh_seed_reuse_count=1``.  On CPU, the JIT-force direct solve
reported ``warm_min=0.0677 s`` and ``solve_total=0.0567 s``.  On CUDA, the
same row reported ``warm_min=0.183 s`` and ``solve_total=0.116 s``:
JIT-force evaluation was faster than CPU (``0.88x``), finalization was
essentially parity (``0.98x``), and the remaining warm GPU penalties were the
seeded preconditioner refresh (``2.01x``), preconditioner apply (``18.8x``),
accepted-control scalar work (``4.09x``), and residual-metric synchronization
(``43.8x``).  The best policy-ablation row was
``direct_solve_jit_forces_host_policies_off`` at ``2.43x`` CPU.  The next
performance target is therefore not Biot-Savart or force assembly; it is
structural staging/fusion of the preconditioner apply and residual/control
scalar synchronization.

A 2026-06-05 follow-up policy-ablation matrix kept the same conclusion but
changed one default: host-staged residual metrics are no longer the accelerator
``auto`` path. The tiny direct-coil JIT-forces row measured ``0.224 s`` warm
on CUDA with the old host-residual policy, ``0.181 s`` when residual products
were kept on device, and ``0.250 s``/``0.230 s`` for the tridiagonal
precompute/lax-solve variants. ``VMEC_JAX_HOST_RESIDUAL_METRICS=1`` remains
available as a diagnostic override, but the production target is now reducing
preconditioner/update/finalization launch overhead.

The same benchmark family now writes a ``final_recompute_guard`` block for
direct-solve children.  This is deliberately a benchmark-only correctness
guard, not a solver shortcut: it records final accepted-state residuals,
pre-update residuals, their maximum absolute delta, final NESTOR metric deltas,
and the measured finalize recompute cost.  A quick CPU probe after the guard
was added reported nonzero final/pre-update residual deltas, so final NESTOR
and final residual recomputes remain required by default.  Any future cached
finalization path must first make this guard pass with parity evidence before
changing production behavior.

Historical bundled example runtime/memory matrix (March 2026)
-------------------------------------------------------------

For repeatable runtime/memory sweeps across the bundled inputs, use:

.. code-block:: bash

  python tools/diagnostics/example_runtime_memory_matrix.py \
    --backend both \
    --vmec-exec /path/to/xvmec2000

Recent artifacts from this tool:

- ``outputs/example_runtime_memory_matrix_cpu_20260306/summary.json``:
  all bundled examples on a reference CPU host, including VMEC2000 timings.
- ``outputs/example_runtime_memory_matrix_gpu_20260306_summary.json``:
  all bundled examples on a reference CUDA host (CUDA JAX).
- ``outputs/example_runtime_memory_matrix_gpu_freeb_20260306_rerun_summary.json``:
  corrected GPU rerun for the bundled free-boundary cases after staging bundled
  ``mgrid`` files in the benchmark clone.

Historical snapshot highlights from this March 2026 matrix:

- Fixed-boundary scan performance on the reference GPU host improved materially
  after the accelerator-aware scan probe and quiet-scan chunking changes:

  - ``input.circular_tokamak`` now runs in about ``13.8s`` / ``1.97 GiB``.
  - ``input.LandremanPaul2021_QA_lowres`` now runs in about ``33.9s`` /
    ``2.66 GiB``.
  - ``input.up_down_asymmetric_tokamak`` now runs in about ``16.5s`` /
    ``1.60 GiB``.
  - ``input.basic_non_stellsym_pressure`` now runs in about ``141.1s`` /
    ``3.68 GiB``.
  - ``input.LandremanSenguptaPlunk_section5p3_low_res`` now runs in about
    ``77.1s`` / ``2.13 GiB``.

- Fixed-boundary ``lasym=True`` on the reference CPU host remains:

  - ``input.up_down_asymmetric_tokamak`` about ``6.7s`` / ``0.89 GiB`` versus
    VMEC2000 about ``0.74s``.
  - ``input.basic_non_stellsym_pressure`` about ``29.7s`` / ``3.22 GiB``
    versus VMEC2000 about ``2.02s``.
  - ``input.LandremanSenguptaPlunk_section5p3_low_res`` about ``46.8s`` /
    ``4.07 GiB`` versus VMEC2000 about ``0.69s``.

- Bundled free-boundary cases remain the dominant default-path outliers:

  - ``input.DIII-D_lasym_false``:
    about ``428.2s`` / ``7.36 GiB`` on the reference CPU host,
    about ``1602.3s`` / ``6.23 GiB`` on the reference GPU host,
    versus VMEC2000 about ``14.4s``.
  - ``input.cth_like_free_bdy``:
    about ``41.8s`` / ``1.64 GiB`` on the reference CPU host,
    about ``155.8s`` / ``2.30 GiB`` on the reference GPU host,
    versus VMEC2000 about ``2.48s``.
  - ``input.cth_like_free_bdy_lasym_small``:
    about ``37.6s`` / ``1.47 GiB`` on the reference CPU host,
    about ``103.5s`` / ``1.97 GiB`` on the reference GPU host,
    versus VMEC2000 about ``0.63s``.

- Recent parity-path free-boundary GPU work narrowed the large-``ns``
  force-kernel overhead:

  - deferring non-scan scalar-history materialization was effectively neutral
    on the smaller ``input.cth_like_free_bdy`` case
    (about ``111.3s`` warm on ``70fc418`` versus about ``111.4s`` warm on
    ``f35ce44``).
  - passing only the free-boundary ``bsqvac`` edge slice into the force kernel
    instead of rebuilding a mostly-zero ``(ns, ntheta, nzeta)`` array every
    iteration materially improves the heavy axisymmetric case. On the reference
    GPU host, a parity-path ``max_iter=10`` probe of
    ``input.DIII-D_lasym_false`` dropped:

    - ``compute_forces`` from about ``5.79s`` total
      (``0.579s/iter`` on ``70fc418``) to about ``2.58s`` total
      (``0.258s/iter`` on ``f35ce44``),
    - ``preconditioner`` from about ``0.675s`` to about ``0.324s``,
    - ``update`` from about ``0.914s`` to about ``0.535s``.

- The current GPU path is not yet a universal speedup:

  - same-host CPU/GPU benchmarking shows GPU is already faster on the heavier
    QA/QH reactor-scale cases, but not yet on the smallest axisymmetric cases,
  - the current mixed result is now backend-selection limited rather than
    convergence limited: all 16 bundled fixed-boundary cases converge on both
    CPU and GPU.

Why the GPU can still be slower than the CPU
--------------------------------------------

This is a consequence of the current solver architecture, not a statement that
VMEC-like equilibria are fundamentally better suited to CPUs. The short
version is:

- the **fast** path is the scan-lifted path, where JAX can keep long stretches
  of work on-device,
- the **parity** path is still a host-controlled VMEC2000-style iteration,
- many of the slowest benchmark rows are exactly those parity-path solves,
  especially free-boundary cases.

In more detail:

1. VMEC2000 parity requires a host-controlled nonlinear loop

   The conservative path preserves VMEC2000-style semantics such as:

   - Garabedian time-step control,
   - Jacobian sign checks,
   - same-iteration restarts,
   - free-boundary ``ivac/ivacskip/nvacskip`` cadence,
   - per-iteration diagnostics and VMEC-style tables,
   - stage transitions and cache refresh rules.

   In the current implementation, those decisions still happen in Python on the
   host. Each iteration therefore launches several short JAX kernels, waits for
   scalar decisions, then launches the next block. CPUs tolerate that control
   pattern much better than GPUs because the launch/synchronization cost is
   smaller.

2. The kernels are mostly moderate-size float64 kernels, not giant batched GPU kernels

   For parity we run in float64, matching VMEC2000 numerics. On many of the
   shipped examples the per-iteration grids are only moderate in size, so the
   GPU never reaches the kind of occupancy that would amortize launch overhead.
   The work is also heavy in transforms, synthesis, and tensor assembly
   (`bcovar`, `tomnsps`, force kernels), which are often memory-traffic bound
   rather than one large dense GEMM.

   The result is that the CPU can look surprisingly competitive, because it is
   executing the same float64 algebra with lower orchestration overhead and
   without paying for many small host->device transitions.

3. Free-boundary parity is the worst case for the current GPU stack

   Free-boundary adds more than just one extra kernel. It adds:

   - external/vacuum field staging,
   - extra edge-force coupling,
   - free-boundary reuse/refresh cadence,
   - more restart-sensitive control flow,
   - larger edge/state tensors on some axisymmetric cases.

   The timing probes in this repo show that on the current parity free-boundary
   GPU path, ``compute_forces`` dominates. For example, on the reference GPU
   host:

   - ``input.cth_like_free_bdy`` with ``performance_mode=False`` spends about
     ``0.278s/iter`` in ``compute_forces``, while preconditioning and update are
     much smaller.
   - ``input.DIII-D_lasym_false`` is even more sensitive to force-path data
     movement because of its large ``ns``.

4. Data movement and edge-coupling details matter a lot on large free-boundary cases

   Recent profiling made this explicit. Passing only the free-boundary
   ``bsqvac`` edge slice into the force kernel, instead of rebuilding a mostly
   zero ``(ns, ntheta, nzeta)`` array each iteration, was nearly neutral on the
   smaller ``input.cth_like_free_bdy`` case but materially improved the large
   axisymmetric case. On a parity-path ``max_iter=10`` probe of
   ``input.DIII-D_lasym_false`` on the reference GPU host:

   - ``compute_forces`` dropped from about ``0.579s/iter`` to about ``0.258s/iter``,
   - ``preconditioner`` dropped from about ``0.067s/iter`` to about ``0.032s/iter``,
   - ``update`` dropped from about ``0.091s/iter`` to about ``0.054s/iter``.

   That is a good example of the current situation: the GPU is not losing
   because of the physics model itself, but because the parity path still
   contains control-flow and data-shaping patterns that are cheap on CPU and
   expensive on GPU.

5. Compilation and warmup amplify the gap on short runs

   JAX/XLA compile cost is front-loaded. On short solves, or on runs that only
   execute a small number of iterations per stage, compile and cache warmup can
   dominate the wall time. This hurts accelerator results more than CPU results
   because the GPU path has higher startup overhead and stricter sensitivity to
   retracing/recompilation.

6. Differentiability and parity constraints limit aggressive GPU-only shortcuts

   ``vmec-jax`` is not trying to be a separate non-parity GPU solver. We are
   preserving:

   - end-to-end differentiability,
   - VMEC2000-compatible iteration behavior where parity is required,
   - VMEC-style outputs and diagnostics.

   That rules out some easy GPU wins that would change ordering, skip
   diagnostics, or replace the parity controller with a different nonlinear
   algorithm. The current performance work is therefore focused on moving more
   of the existing algorithm into longer device-resident regions without
   changing the numerical contract.

7. When the GPU already helps today

   The GPU story is already much better when the solve can remain on the fast
   scan path, or when repeated runs can amortize compile cost. That is why the
   fixed-boundary scan cases improved materially after:

   - accelerator-aware scan probing,
   - larger quiet-scan chunks,
   - reduced launch overhead in the scan path.

   The next large gains on GPU are therefore expected to come from the same
   direction on the parity/free-boundary side and the accepted-point replay
   path: keeping more of the force/residual/control pipeline on-device for
   longer stretches, and reducing per-iteration host orchestration.

Fused tridiagonal solver (scan only)
------------------------------------

The scan preconditioner can use XLA's fused tridiagonal solver with
pretransposed coefficients (``dl/d/du``).  Current bounded profiles show this
is the fastest measured CPU scan path.  GPU exact-optimization trial callbacks
now default to the non-scan loop; when scan is forced for GPU diagnostics, the
older scan preconditioner with precomputed Thomas coefficients remains the
measured faster scan variant.  Keep the explicit env overrides for
parity/perf bisection:

- ``VMEC_JAX_SCAN_PRECOND_LAXTRIDI=0``: force the older Thomas scan path.
- ``VMEC_JAX_SCAN_PRECOND_LAXTRIDI=1``: explicitly request the fused scan path.
- ``VMEC_JAX_SCAN_PRECOND_PRECOMPUTE=0``: disable the GPU scan precomputed
  Thomas-coefficient default.

Boundary decomposition cache + JAX-friendly initial guess
---------------------------------------------------------

``boundary_from_indata`` now caches the boundary decomposition across runs
using the input file path + mtime (or a coefficient fingerprint when the path
is unavailable). This trims repeated host work in workflows that solve the same
input file multiple times in a single process.

The initial-guess path also supports a fully JAX-backed boundary flip and
constraint application, which reduces Python-side overhead and keeps the path
JAX-friendly for future JIT staging. Control this with:

- ``VMEC_JAX_INIT_GUESS_JAX=1`` (default): use JAX boundary flip path.
- ``VMEC_JAX_INIT_GUESS_JAX=0``: fall back to NumPy/Python boundary flips.

Implementation map (performance-critical paths)
------------------------------------------------

- ``vmec_jax/vmec_tomnsp.py``: VMEC ``fixaray`` tables + DFT-based ``tomnsps``.
- ``vmec_jax/init_guess.py``: initial guess, axis blending, JAX boundary flip.
- ``vmec_jax/boundary.py``: input boundary decomposition + cache.
- ``vmec_jax/static.py``: cached grids, phase stacks, and per-solve constants.

Recent profiling snapshot (QA, 3 iterations on CPU)
---------------------------------------------------

- Default loop: ~0.26s total wall time (post-warmup).
- Scan loop: ~0.083s total wall time (~3x faster for this short run).

Longer runs benefit more because Python control-flow overhead scales with the
iteration count in the non-scan path.

VMEC++ bad-progress restarts (optional)
-----------------------------------------------

VMEC++ introduces a "bad progress" restart policy that detects large residuals
on refined grids and restarts the time-step controller more aggressively. This
is now available in ``vmec_jax`` behind an explicit flag so the VMEC2000 parity
path remains unchanged by default.

The VMEC++-style trigger follows the VMEC++ criteria:

- ``iter2 - iter1 > k_preconditioner_update_interval / 2``
- ``iter2 > 2 * k_preconditioner_update_interval``
- ``fsqr + fsqz > 1e-2`` (physical residual on the full grid)

When triggered, the restart path reduces ``delt`` by ``1/1.03`` (the VMEC++
"bad progress" factor) and resets the cached preconditioner state.

Enable it with:

- ``run_fixed_boundary(..., vmecpp_restart=True)``

Note: the VMEC++ restart flag is currently wired to the VMEC2000-control path.
When scan is active, it takes effect on fallback segments that execute in the
non-scan parity controller.

Static precomputation
---------------------

Use ``VMECStatic`` to avoid rebuilding:

- mode tables,
- angle grids,
- Fourier basis tensors,
- radial grid.

VMEC phase-stack cache
----------------------

The VMEC real-space synthesis path builds full ``(m,n)`` phase tables from the
``fixaray`` trig tables. This is correct but expensive to repeat inside the JIT
kernel. ``VMECStatic`` now precomputes and caches stacked phase tensors for the
VMEC grid (including ``dtheta``/``dzeta`` variants) and attaches them to the
cached trig tables. The precompute uses NumPy on the host to avoid extra JAX
compilation work. This reduces both runtime and compilation work because the
kernel no longer rebuilds the phase tables from scratch every iteration.

Control this behavior with:

- ``VMEC_JAX_CACHE_VMEC_PHASE=1`` (default): precompute phase stacks in
  ``build_static`` for fastest execution.
- ``VMEC_JAX_CACHE_VMEC_PHASE=0``: skip the extra cached tensors to save memory.

Compilation cache
-----------------

JAX can persist compiled executables to disk. ``vmec_jax`` enables this
automatically for accelerator-selected runs because cold-start compilation
dominates short fixed-boundary and optimization diagnostics, especially on GPU.
CPU cache use is explicit opt-in with ``VMEC_JAX_COMPILATION_CACHE=1`` because
XLA:CPU AOT cache hits can emit host-feature mismatch errors on some JAX
versions.  By default, the cache directory is machine/CPU-feature/Python/JAX
scoped so shared home directories do not reuse incompatible CPU AOT artifacts
across different hosts or runtime versions. Use
``VMEC_JAX_COMPILATION_CACHE_DIR=/path/to/cache`` (or the upstream
``JAX_COMPILATION_CACHE_DIR``) to choose the cache location, or set
``VMEC_JAX_COMPILATION_CACHE=0`` to disable it.

If XLA prints a message such as ``Loading XLA:CPU AOT result`` followed by a
target-machine feature mismatch, it means an existing persistent-cache entry was
compiled for another CPU. Do not suppress that message blindly: clear the old
cache directory or rerun once with ``VMEC_JAX_COMPILATION_CACHE=0``. Current
``vmec_jax`` defaults avoid enabling the CPU persistent cache unless the user
opts in explicitly or provides a cache directory.

CLI profiling (pre-iteration overhead)
--------------------------------------

To capture a JAX trace for the VMEC2000-style CLI path, set
``VMEC_JAX_PROFILE_DIR`` before invoking ``vmec``. By default the CLI also
emits a Perfetto-compatible trace (``perfetto_trace.json.gz``); disable that
extra file by setting ``VMEC_JAX_PROFILE_PERFETTO=0``. The trace is written in
TensorBoard/Chrome trace format::

  VMEC_JAX_PROFILE_DIR=/tmp/vmec_jax_trace \\
    vmec examples/data/input.ITERModel --max-iter 3 --no-multigrid --no-use-input-niter --quiet

For tighter windows (e.g., pre-iteration or iter-1 only), set
``VMEC_JAX_PROFILE_WINDOW=pre`` (or ``iter1`` / ``iterN``) and optionally start
a profiler server for XProf inspection::

  VMEC_JAX_PROFILE_DIR=/tmp/vmec_jax_trace \\
  VMEC_JAX_PROFILE_WINDOW=pre \\
  VMEC_JAX_PROFILE_SERVER=1 VMEC_JAX_PROFILE_SERVER_PORT=9999 \\
    vmec examples/data/input.ITERModel --max-iter 3 --no-multigrid --quiet

With ``VMEC_JAX_PROFILE_SERVER=1`` you can also capture a tight window using
``python -m jax.collect_profile`` from another terminal (see the JAX profiling
guide for the exact invocation).

Recent traces show that the pre-iteration time is dominated by JIT
compilation/cache misses (``pjit cache_miss`` + backend compile) rather than
the nonlinear iteration itself. This is expected for short runs on CPU.
For repeated CPU or GPU runs, the default compilation cache can significantly
reduce this overhead once the cache is warm. Explicit ``solver_device="gpu"``
requests enable the same accelerator cache policy even when no JAX GPU
environment variable was set before import. Set
``VMEC_JAX_COMPILATION_CACHE=0`` to disable it.

Persistent compilation cache tuning
-----------------------------------

JAX's persistent cache can be made more aggressive via ``vmec_jax`` environment
variables:

- ``VMEC_JAX_CACHE_MIN_COMPILE_TIME_SECS`` (default: 0)
- ``VMEC_JAX_CACHE_MIN_ENTRY_SIZE_BYTES`` (default: -1)
- ``VMEC_JAX_COMPILATION_CACHE_MAX_SIZE`` (optional)

These map to JAX's persistent cache configuration and allow caching more (or
fewer) compiled executables to reduce repeat-start latency for stable shapes.
Enable cache-miss diagnostics by setting
``VMEC_JAX_EXPLAIN_CACHE_MISSES=1``; JAX will log a short summary whenever a
cache miss triggers a compilation.

Batched radial smoothing
------------------------

The scan path now batches the radial tridiagonal smoother across the R/Z
components (and separately for lambda) so the solver does fewer tridi solves per
iteration. This reduces kernel count and Python overhead while preserving the
VMEC update math.

Batched VMEC real-space synthesis
---------------------------------

The VMEC-grid synthesis path now batches base + derivative (dtheta/dzeta)
evaluations into a single stacked ``einsum`` call. This reduces kernel count in
the pre-iteration setup (especially the bcovar/realspace pipeline) while
preserving the original algebra and parity outputs.

Vectorized multigrid conversion
-------------------------------

Multigrid staging now uses the vectorized signed↔(m,n) conversion helpers from
``vmec_parity`` instead of Python loops. In the current path the signed→(m,n)
conversion uses precomputed dense maps (matmul) to avoid repeated gather-heavy
indexing. This trims host-side overhead during grid transitions, which shows up
prominently in short profiling traces.

Multigrid interpolation caches
------------------------------

Radial interpolation now caches the ``(j1,j2,xint)`` weights and ``scalxc``
profiles for reuse across multigrid stages. This reduces host-side setup costs
when multiple grids are visited in a single solve.

Precomputed (m,n)→signed maps
-----------------------------

The fixed-boundary update now builds dense mapping matrices once per solve to
convert ``(m,n>=0)`` force blocks into signed Fourier updates via matmul. This
reduces scatter-heavy updates inside the iteration loop and keeps the JIT graph
more regular.

Batched sin conversions
-----------------------

The scan update now batches the Z/L ``(m,n)`` sin-block conversions into a
single matmul-based mapping, reducing kernel count compared to converting each
field independently.

Scatter-free boundary/axis enforcement
--------------------------------------

The fixed-boundary/axis enforcement step now uses concatenation instead of
scatter updates for the edge and axis rows. This trims scatter-heavy kernels in
the scan loop without changing the VMEC constraints.
Axis m=0 masks are now reused from ``VMECStatic`` to avoid per-iteration mask
construction.

Lambda gauge masking
--------------------

The (m,n)=(0,0) lambda gauge constraint now uses a boolean mask instead of a
scatter update, trimming another small scatter kernel from the iteration loop.

Concatenation-based mode updates
--------------------------------

The current optimized branch also replaces several remaining scatter-style
fixed-boundary hot-path updates with concatenation-based helpers:

- zeroing the lambda gauge coefficient column,
- replacing or scaling a single ``m``-mode slice in the preconditioned force
  blocks,
- and enforcing axis/edge constraints on the R/Z/L state fields.

This keeps the same constraints and algebra, but reduces the number of
scatter-heavy kernels created inside the iteration loop. On the representative
accelerated reactor-scale QA case
(``input.LandremanPaul2021_QA_reactorScale_lowres``), the warmed CPU runtime
improved from about ``12.00s`` to about ``11.04s`` on the same host. The same
pass also nudged the representative free-boundary warmed CPU benchmark
(``input.cth_like_free_bdy``) from about ``3.11s`` to about ``3.07s`` while
keeping the full test suite green.

Skip asymmetric updates when ``lasym=False``
--------------------------------------------

Several fixed/free-boundary update paths previously computed the asymmetric
``Rsin/Zcos/Lcos`` signed-coefficient updates even for ``lasym=False`` cases,
then immediately zeroed them. The current branch now skips those conversions
entirely on the symmetric path.

This is a clean win for the shipped ``lasym=False`` workloads:

- ``input.cth_like_free_bdy`` improved from about ``3.07s`` to about
  ``2.91s`` on the same warmed CPU benchmark,
- ``input.DIII-D_lasym_false`` with ``max_iter=20`` improved from about
  ``0.262s`` to about ``0.262s`` (effectively neutral but not worse),
- the representative ``lasym=True`` fixed-boundary smoke
  (``input.basic_non_stellsym_pressure``) stayed on the same runtime and
  convergence profile because the asymmetric path is still fully active there.

Vectorized axis blending
------------------------

Initial-guess axis blending now updates all ``m=0`` Fourier columns in one
vectorized scatter instead of looping over toroidal modes. This reduces
index-heavy overhead during startup.

Cached mode scaling
-------------------

``VMECStatic`` now caches the per-mode internal scaling factors
``1/(mscale*nscale)`` so initial-guess construction avoids repeated gathers
from the trig tables.

Historical free-boundary cache note
-----------------------------------

A May 2026 free-boundary cleanup pass cached more of the static
external-boundary sampling setup on the host:

- phase-independent second-derivative mode factors,
- angular ``phi`` grids for the sampled boundary mesh,
- even-``m`` masks used by the axis-parity shortcut,
- VMEC angular ``wint`` weights for the sampled mesh,
- and the optional SciPy LU helpers used by the VMEC-like dense solve path.

That keeps the vacuum-step loop focused on state-dependent work instead of
rebuilding static sampling metadata. On the representative
``input.cth_like_free_bdy`` path, the warmed CPU runtime improved from about
``8.00s`` to about ``7.88s`` on the same host while preserving the same final
residuals and free-boundary sampling test coverage.

Avoid Python objects in jitted functions
----------------------------------------

JAX ``jit`` requires inputs to be arrays or PyTrees. ``vmec-jax`` makes the key
containers PyTrees:

- ``VMECState``
- ``HelicalBasis``
- ``Geom``

If you build your own containers, follow the same approach.

Memory considerations
---------------------

The current Fourier implementation stores ``(K, ntheta, nzeta)`` basis tensors
for cos/sin phases. This is acceptable for low-resolution validation cases, but
will become heavy for larger ``mpol/ntor``.

Planned upgrades (post-parity):

- factorized DFTs (theta/phi separable) using precomputed trig/weight tables,
- FFT-based angular transforms only if they reproduce VMEC scaling and weights,
- chunked evaluation in ``theta``/``zeta`` to reduce peak memory.