Some Conv tests failing with BF16 #569

Open
pranavm-nvidia opened this issue Mar 12, 2025 · 0 comments

Some of the convolution integration tests are failing with BF16.

There are two types of failures:

  1. allclose fails:
>       assert tp.allclose(tripy_out, torch_out)
E       assert False
E        +  where False = <function allclose at 0x76b9495f2280>(tensor(\n    [[[[338]],\n\n      [[880]]]], \n    dtype=bfloat16, loc=gpu:0, shape=(1, 2, 1, 1)), tensor(\n    [[[[338]],\n\n      [[880]]]], \n    dtype=bfloat16, loc=gpu:0, shape=(1, 2, 1, 1)))
E        +    where <function allclose at 0x76b9495f2280> = tp.allclose

Note that the Tripy and torch outputs compare equal when they are evaluated separately. allclose returns False only when it is part of the same computation graph as the Conv, which suggests a bug in a graph transformation.
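As a sanity check on the values in that failure, here is a minimal pure-Python sketch. It assumes tp.allclose follows the usual convention of |a - b| <= atol + rtol * |b| with torch's default tolerances. That is an assumption, not something confirmed by the issue. Both `to_bfloat16` and `reference_allclose` are hypothetical helpers written for illustration, not Tripy or torch APIs. The sketch shows that 338 and 880 are exactly representable in bfloat16, so rounding alone would not explain a False result.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the top 16 bits of a float32, so round off the low 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reference_allclose(a, b, rtol=1e-05, atol=1e-08):
    """Elementwise |a - b| <= atol + rtol * |b| (assumed convention)."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


values = [338.0, 880.0]
rounded = [to_bfloat16(v) for v in values]

# Both values survive the round trip unchanged, so they are exact in bfloat16.
print(rounded == values)
# Identical inputs satisfy the tolerance check trivially.
print(reference_allclose(rounded, values))
```

Under these assumptions, any correct evaluation of allclose on the two printed tensors should return True, which is consistent with the outputs comparing equal when evaluated separately.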

  2. Myelin error:
E       MTRTException: failed to run pass pipeline
E           Internal Error: MyelinCheckException: conv_lowering.cpp:43: CHECK(op->operands()[0]->is_tensor()) failed. 
E           Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [myelin_graph.h:attachExceptionMsgToGraph:840] MyelinCheckException: conv_lowering.cpp:43: CHECK(op->operands()[0]->is_tensor()) failed. 
E           IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[[tensorrt.constant] (%t3288) ...[tensorrt.cast] (%t3288) ]}.)
E           (%t3288) error: failed to translate function 'tensorrt_cluster' to a TensorRT engine

Most likely, the convolution lowering in Myelin is expecting a build-time constant here, whereas the failing node is a constant followed by a cast. Constant folding may solve this.
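To illustrate what constant folding would mean for a constant that feeds a cast, here is a toy sketch on a made-up graph representation. The `Node` class, the op names, and `fold_constant_casts` are all hypothetical and are not the Tripy, MLIR-TensorRT, or Myelin data structures. Which operand of the convolution is affected is also an assumption. The idea is only that cast(constant) collapses into a single constant of the target dtype, so the consumer sees a build-time constant.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node (hypothetical, for illustration only)."""
    op: str                                   # e.g. "constant", "cast", "conv"
    inputs: list = field(default_factory=list)
    value: object = None                      # payload for "constant" nodes
    dtype: str = "float32"


def fold_constant_casts(node: Node) -> Node:
    """Recursively replace cast(constant) with one constant of the cast's dtype."""
    folded_inputs = [fold_constant_casts(i) for i in node.inputs]
    if (
        node.op == "cast"
        and len(folded_inputs) == 1
        and folded_inputs[0].op == "constant"
    ):
        # A real pass would also convert the payload to the new dtype;
        # this toy version only relabels it.
        return Node(op="constant", value=folded_inputs[0].value, dtype=node.dtype)
    return Node(op=node.op, inputs=folded_inputs, value=node.value, dtype=node.dtype)


# constant -> cast -> conv, mirroring the constant/cast pair in the error message.
weights = Node("constant", value=[1.0, 2.0], dtype="float32")
cast = Node("cast", inputs=[weights], dtype="bfloat16")
conv = Node("conv", inputs=[Node("input", dtype="bfloat16"), cast], dtype="bfloat16")

folded = fold_constant_casts(conv)
print(folded.inputs[1].op, folded.inputs[1].dtype)
```

After folding, the second input of the toy conv node is a plain bfloat16 constant rather than a cast.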
