This PR adds kernels for depthwise 2D convolution (CPU only for now).

There is an existing `ggml_conv_2d_dw` based on im2col + mul_mat, but it has high overhead. That approach makes sense for regular conv2d, since it can profit from fast GEMM, but depthwise convolution is much simpler, and I think im2col will always slow it down.
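To illustrate why a direct kernel can win here: depthwise convolution applies one small filter per channel, so it maps to a few tight loops with no intermediate im2col buffer at all. A minimal sketch in plain C (illustrative only, not the PR's kernel; the [C][H][W] layout and parameter names are assumptions):

```c
// Naive direct depthwise conv2d: one filter per channel, no im2col buffer.
// Assumed layout: src is [C][H][W] with W contiguous, kernel is [C][KH][KW].
void depthwise_conv2d(
    float *dst, const float *src, const float *kernel,
    int w, int h, int c,   // input width, height, channels
    int kw, int kh,        // filter width, height
    int stride, int pad)
{
    int dw = (w + 2 * pad - kw) / stride + 1;  // output width
    int dh = (h + 2 * pad - kh) / stride + 1;  // output height

    for (int ic = 0; ic < c; ic++) {           // each channel has its own filter
        for (int oy = 0; oy < dh; oy++) {
            for (int ox = 0; ox < dw; ox++) {
                float sum = 0.0f;
                for (int ky = 0; ky < kh; ky++) {
                    int iy = oy * stride + ky - pad;
                    if (iy < 0 || iy >= h) continue;  // zero padding
                    for (int kx = 0; kx < kw; kx++) {
                        int ix = ox * stride + kx - pad;
                        if (ix < 0 || ix >= w) continue;
                        sum += src[(ic * h + iy) * w + ix]
                             * kernel[(ic * kh + ky) * kw + kx];
                    }
                }
                dst[(ic * dh + oy) * dw + ox] = sum;
            }
        }
    }
}
```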
**Timings (W=256, H=256, C=256)**

- `ggml_conv_2d_dw`
- `ggml_depthwise_conv_2d` (WHCN)
- `ggml_depthwise_conv_2d` (CWHN)

**Timings (W=1024, H=1024, C=3)**

- `ggml_conv_2d_dw`
- `ggml_depthwise_conv_2d` (WHCN)
- `ggml_depthwise_conv_2d` (CWHN)
I didn't replace `ggml_conv_2d_dw` because it supports more backends (and dilation).

**Memory layout**
Having channels/depth most contiguous in memory allows for better vectorization. It also improves memory access for im2col in regular 2D convolutions, and can avoid many costly `ggml_cont(ggml_permute(...))` calls. Since spatial-dimensions-first seems to be the default for 2D ops in the API, that default is kept in place, and the opportunity to use the channels-first kernel is detected from the tensor strides (a sketch of that check follows). This could also be made more explicit.
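For illustration, here is roughly what such a stride check could look like, assuming ggml's `ne`/`nb` (sizes / byte strides) convention where a conv2d input is logically [W, H, C, N]; the exact condition in this PR may differ:

```c
#include <stdbool.h>
#include "ggml.h"

// Sketch: detect a channels-first (CWHN) view from strides alone.
// If the input is a permuted view of CWHN data, the channel dimension
// is the contiguous one: its byte stride nb[2] equals the element size,
// while the width stride nb[0] does not. Illustrative condition only.
static bool is_channels_first(const struct ggml_tensor *src) {
    return src->nb[2] == ggml_type_size(src->type)   // channels contiguous
        && src->nb[0] >  src->nb[2];                 // width is not
}
```

With the channel dimension contiguous, the innermost loop runs over consecutive floats, which is what lets the channels-first kernel vectorize well.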
**Background**

I've implemented MobileSAM (a fast SAM variant with TinyViT as the image encoder) here. Runtime was initially ~2.1 s, with depthwise convolution eating a sizeable chunk. After changing the memory layout and optimizing conv2d, it now runs in 570 ms (PyTorch: 608 ms, ONNX: 549 ms).
Ryzen 5 5600X (6 cores, AVX2), Windows, OpenBLAS.