how to interpret "Results do not match the reference. This is likely a bug/unexpected loss of precision" #23934

mattjj · 2025-03-19T15:52:06Z

In jax-ml/jax#27228, @HeavyCrab observed messages like this one on an A100 machine:

E0317 05:01:29.217242  712317 buffer_comparator.cc:156] Difference at 6: 8.18504, expected 7.16615
E0317 05:01:29.217324  712317 buffer_comparator.cc:156] Difference at 7: 10.2058, expected 8.91452
E0317 05:01:29.217329  712317 buffer_comparator.cc:156] Difference at 8: 8.30671, expected 6.651
E0317 05:01:29.217332  712317 buffer_comparator.cc:156] Difference at 9: 9.57833, expected 8.51998
E0317 05:01:29.217335  712317 buffer_comparator.cc:156] Difference at 11: 12.3298, expected 10.7096
E0317 05:01:29.217339  712317 buffer_comparator.cc:156] Difference at 15: 6.00732, expected 5.25718
E0317 05:01:29.217342  712317 buffer_comparator.cc:156] Difference at 22: 8.97186, expected 7.95259
E0317 05:01:29.217345  712317 buffer_comparator.cc:156] Difference at 24: 9.59525, expected 7.9386
E0317 05:01:29.217348  712317 buffer_comparator.cc:156] Difference at 27: 13.3396, expected 11.7191
E0317 05:01:29.217351  712317 buffer_comparator.cc:156] Difference at 38: 8.77498, expected 7.75621
2025-03-17 05:01:29.217365: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1138] Results do not match the reference. This is likely a bug/unexpected loss of precision.

How should we interpret an error message like that? Like, is it XLA-internal logging and we should ignore it, or should we be concerned about it affecting our computation and look for numerical issues?

mooskagh · 2025-03-20T13:31:18Z

When Triton was introduced in XLA, it turned out that some tiling configurations produced wrong results. Because of that, a correctness check was added (it compares against cuBLAS output), and the autotuner drops such configurations. These configurations were quite exotic and slow, so they wouldn't have been chosen even if they were correct.

However, the wrong results that we saw back then were very far off, like outputting zeros instead of the actual values. Recently, however, we see lots of slight miscompares like the ones above (especially in fp8 kernels), and often it turns out that it's cuBLAS and not Triton that is less precise. These are likely caused by using a less precise GEMM accumulator.
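To make the accumulator point concrete, here is a toy sketch in plain Python. It is only an illustration of how a narrower accumulator can lose precision. It is not how cuBLAS or Triton implement GEMMs, and the helper names (`round_to_fp16`, `dot`) are made up for this example.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot(a, b, narrow_accumulator: bool) -> float:
    """Dot product; optionally round the running sum to fp16 after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
        if narrow_accumulator:
            # Assumption for illustration: the accumulator itself is only fp16.
            acc = round_to_fp16(acc)
    return acc


ones = [1.0] * 4096
wide = dot(ones, ones, narrow_accumulator=False)
narrow = dot(ones, ones, narrow_accumulator=True)
print(wide, narrow)  # prints: 4096.0 2048.0
```

With the fp16 accumulator the sum stalls at 2048, because 2049 is not representable in half precision and rounds back to 2048. Real kernels differ in much subtler ways, but the mechanism (same inputs, different accumulation precision, different output) is the same kind of effect.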

When such a miscompare happens, the GEMM usually falls back to cuBLAS, which may make it slower and potentially less precise (if cuBLAS is indeed less precise). It's also likely that on real, non-synthetic inputs the precision difference is not as dramatic.

So it's safe-ish to ignore, but if it happens in a model which is actually being used, it makes sense to look into it more closely. So far, it hasn't been a high priority, as it was only observed on synthetic models.
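If you do want to look more closely, one option is to compare the outputs yourself on real inputs. The following is a minimal sketch of an elementwise comparison that reports mismatches in the same spirit as the log lines above. The tolerance, the error formula, and the name `find_mismatches` are assumptions made for illustration; XLA's actual comparator may use different rules.

```python
def find_mismatches(actual, expected, rel_tol=0.1, max_report=10):
    """Return (index, actual, expected) triples whose relative error exceeds rel_tol.

    Hypothetical check: the error is scaled by the larger magnitude of the two
    values, with a floor of 1.0 so that values near zero do not blow up.
    """
    mismatches = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        scale = max(abs(a), abs(e), 1.0)
        if abs(a - e) / scale > rel_tol:
            mismatches.append((i, a, e))
            if len(mismatches) == max_report:
                break
    return mismatches


# First pair is taken from the log above; the second pair is made up.
for i, a, e in find_mismatches([8.18504, 7.0], [7.16615, 7.01]):
    print(f"Difference at {i}: {a}, expected {e}")
```

Running the same comparison on outputs from real model inputs, with a tolerance appropriate for the dtype in use, gives a rough idea of whether the difference matters in practice.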

mattjj · 2025-03-25T16:20:28Z

Thanks, that makes sense!

mattjj mentioned this issue Mar 19, 2025

Stacking flax.nnx.Module layers with vmap and forwarding with scan result in loss of precision in XLA backend jax-ml/jax#27228

Open

aniruthraj added the question Further information is requested label Mar 21, 2025

aniruthraj self-assigned this Mar 25, 2025

mattjj closed this as completed Mar 25, 2025
