E0317 05:01:29.217242 712317 buffer_comparator.cc:156] Difference at 6: 8.18504, expected 7.16615
E0317 05:01:29.217324 712317 buffer_comparator.cc:156] Difference at 7: 10.2058, expected 8.91452
E0317 05:01:29.217329 712317 buffer_comparator.cc:156] Difference at 8: 8.30671, expected 6.651
E0317 05:01:29.217332 712317 buffer_comparator.cc:156] Difference at 9: 9.57833, expected 8.51998
E0317 05:01:29.217335 712317 buffer_comparator.cc:156] Difference at 11: 12.3298, expected 10.7096
E0317 05:01:29.217339 712317 buffer_comparator.cc:156] Difference at 15: 6.00732, expected 5.25718
E0317 05:01:29.217342 712317 buffer_comparator.cc:156] Difference at 22: 8.97186, expected 7.95259
E0317 05:01:29.217345 712317 buffer_comparator.cc:156] Difference at 24: 9.59525, expected 7.9386
E0317 05:01:29.217348 712317 buffer_comparator.cc:156] Difference at 27: 13.3396, expected 11.7191
E0317 05:01:29.217351 712317 buffer_comparator.cc:156] Difference at 38: 8.77498, expected 7.75621
2025-03-17 05:01:29.217365: E external/xla/xla/service/gpu/autotuning/gemm_fusion_autotuner.cc:1138] Results do not match the reference. This is likely a bug/unexpected loss of precision.
How should we interpret an error message like that? Like, is it XLA-internal logging and we should ignore it, or should we be concerned about it affecting our computation and look for numerical issues?
When Triton was introduced in XLA, it turned out that some of the tiling configurations produced wrong results. Because of that, a correctness check was added (it compares against the cuBLAS output), and the autotuner dropped such configurations, especially given that these configurations were quite exotic and slow, so they wouldn't have been chosen even if they were correct.
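To make that check concrete, here is a minimal sketch of the idea, not XLA's actual implementation. The `Candidate` class, `mismatches`, `pick_config`, the runtimes and the 5% tolerance are all hypothetical names and values chosen for illustration; the numbers in the demo are the first three "Difference at" entries from the log in this issue.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str          # hypothetical label for a tiling configuration
    runtime_us: float  # made-up runtime, only used to rank candidates
    output: list       # values the candidate kernel produced


def mismatches(actual, expected, rel_tol):
    """Return (index, actual, expected) for entries outside the tolerance."""
    bad = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        # Relative difference; the 1.0 floor avoids dividing by tiny values.
        denom = max(abs(a), abs(e), 1.0)
        if abs(a - e) / denom > rel_tol:
            bad.append((i, a, e))
    return bad


def pick_config(candidates, reference, rel_tol=0.05):
    """Drop candidates that disagree with the reference, then take the fastest."""
    valid = [c for c in candidates if not mismatches(c.output, reference, rel_tol)]
    if not valid:
        return None  # the caller would fall back to the reference implementation
    return min(valid, key=lambda c: c.runtime_us)


# "expected" values from the log play the role of the reference output.
reference = [7.16615, 8.91452, 6.651]
fast_but_off = Candidate("tile_a", 10.0, [8.18504, 10.2058, 8.30671])
slower_but_close = Candidate("tile_b", 12.0, [7.17, 8.91, 6.65])

chosen = pick_config([fast_but_off, slower_but_close], reference)
print("chosen:", chosen.name)
```

In this sketch the faster candidate is rejected because all three of its entries differ from the reference by more than the assumed tolerance, so the slower one wins. If every candidate is rejected, `pick_config` returns `None`, which corresponds to falling back to the reference implementation.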
However, the wrong results that we saw back then were very far off, like outputting zeros instead of the actual values. Recently, however, we have been seeing lots of slight miscompares like this one (especially in fp8 kernels), and often it turns out that it is cuBLAS, not Triton, that is less precise. These are likely caused by a less precise GEMM accumulator.
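As a rough illustration of why the accumulator matters, the sketch below computes the same dot product three times and only changes the precision of the running sum. It is an assumption-laden toy: it uses fp16 and fp32 (via the standard `struct` codes `'e'` and `'f'`) as stand-ins, since the Python standard library has no fp8 type, and it says nothing about what cuBLAS or Triton actually do internally. The helper names `round_to` and `dot` are made up for this example.

```python
import random
import struct


def round_to(fmt, x):
    """Round a Python float to the format of a struct code ('e' = fp16, 'f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot(a, b, acc_fmt=None):
    """Dot product whose running sum is rounded to acc_fmt after every add.

    With acc_fmt=None the sum stays in Python's float64 and serves as the reference.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
        if acc_fmt is not None:
            acc = round_to(acc_fmt, acc)
    return acc


random.seed(0)
K = 4096  # length of the reduction dimension (arbitrary choice)
# Inputs are rounded to fp16 first, so all three runs see identical operands.
a = [round_to("e", random.uniform(0.0, 1.0)) for _ in range(K)]
b = [round_to("e", random.uniform(0.0, 1.0)) for _ in range(K)]

reference = dot(a, b)        # float64 accumulator
low = dot(a, b, "e")         # fp16 accumulator
high = dot(a, b, "f")        # fp32 accumulator

rel_err_low = abs(low - reference) / reference
rel_err_high = abs(high - reference) / reference
print(f"fp16 accumulator relative error: {rel_err_low:.3e}")
print(f"fp32 accumulator relative error: {rel_err_high:.3e}")
```

The operands are the same in every run, so any difference between the printed errors comes purely from how the partial sums are stored.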
When such a miscompare happens, the GEMM usually falls back to cuBLAS, which may make it slower and potentially less precise (if cuBLAS is indeed the less precise one). It's also likely that on real, non-synthetic inputs the precision difference is not as dramatic.
So it's safe-ish to ignore, but if it happens in a model that is actually being used, it makes sense to look into it more closely. So far this hasn't been a high priority, as it has only been observed on synthetic models.
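One cheap first step when looking more closely is to quantify how large the reported differences are. The sketch below assumes the `Difference at <index>: <actual>, expected <expected>` format seen in the log in this issue and computes a relative difference per entry. The function name `relative_differences` and the regular expression are my own, not part of XLA.

```python
import re

# Matches the "Difference at <index>: <actual>, expected <expected>" part of a line.
PATTERN = re.compile(r"Difference at (\d+): ([^,\s]+), expected (\S+)")


def relative_differences(log_text):
    """Return (index, actual, expected, relative_difference) for each reported entry."""
    rows = []
    for m in PATTERN.finditer(log_text):
        index = int(m.group(1))
        actual = float(m.group(2))
        expected = float(m.group(3))
        # Guard against a zero denominator when both values are zero.
        denom = max(abs(actual), abs(expected)) or 1.0
        rows.append((index, actual, expected, abs(actual - expected) / denom))
    return rows


# Three lines copied from the log in this issue.
sample_log = """\
E0317 05:01:29.217242 712317 buffer_comparator.cc:156] Difference at 6: 8.18504, expected 7.16615
E0317 05:01:29.217329 712317 buffer_comparator.cc:156] Difference at 8: 8.30671, expected 6.651
E0317 05:01:29.217339 712317 buffer_comparator.cc:156] Difference at 15: 6.00732, expected 5.25718
"""

rows = relative_differences(sample_log)
for index, actual, expected, rel in rows:
    print(f"index {index}: actual={actual}, expected={expected}, relative diff={rel:.1%}")
```

Relative differences in the low tens of percent, as here, are a different situation from outputs that are all zeros, which matches the distinction made above between slight miscompares and results that are very far off.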
In jax-ml/jax#27228, @HeavyCrab observed the messages shown above on an A100 machine.