The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
📖A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, Flash-Attention, Paged-Attention, parallelism, etc. 🎉🎉
📚200+ Tensor Core/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
FlashInfer: Kernel Library for LLM Serving
MoBA: Mixture of Block Attention for Long-Context LLMs
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
[CVPR 2025] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
Triton implementation of FlashAttention2 that adds Custom Masks.
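For context, a minimal PyTorch sketch of the custom-mask idea (not the repo's Triton kernel): an additive mask passed to the fused scaled_dot_product_attention entry point. Arbitrary masks like this are exactly what the stock FlashAttention backend does not accept, which is why custom-mask Triton kernels are useful.

```python
# Minimal sketch: attention with a custom additive mask via PyTorch SDPA.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, heads, seq, dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, dim, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Custom additive mask: block the second half of the keys for every query.
# 0 keeps a position; -inf removes it from the softmax.
mask = torch.zeros(seq, seq, device=device)
mask[:, seq // 2:] = float("-inf")

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```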
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline parallelism; faster than ZeRO/ZeRO++/FSDP.
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference.
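As a rough illustration of what a grouped-query-attention (GQA) decode step computes (plain PyTorch, not the repo's CUDA-core kernels): one new query token attends over a cached K/V whose head count is smaller than the query head count. Shapes below are illustrative.

```python
# Minimal sketch of a single GQA decode step with a KV cache.
import torch
import torch.nn.functional as F

q_heads, kv_heads, head_dim, cached_len = 32, 8, 128, 512
group = q_heads // kv_heads  # query heads sharing one KV head

q = torch.randn(1, q_heads, 1, head_dim)                  # one new token
k_cache = torch.randn(1, kv_heads, cached_len, head_dim)  # cached keys
v_cache = torch.randn(1, kv_heads, cached_len, head_dim)  # cached values

# Expand KV heads so each group of query heads reads the same cache entry.
k = k_cache.repeat_interleave(group, dim=1)
v = v_cache.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 32, 1, 128])
```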
Performance benchmarks of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.
Python package for rematerialization-aware gradient checkpointing
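For background, a minimal sketch of the standard PyTorch checkpointing utility that rematerialization-aware schemes build on; the package's own policy for choosing what to recompute is not shown here.

```python
# Minimal sketch: activation checkpointing trades compute for memory.
import torch
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we do not want to keep.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)
# Forward runs without storing the block's activations; backward recomputes
# (rematerializes) them on the fly.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```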
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Utilities for efficient fine-tuning, inference and evaluation of code generation models
A simple PyTorch implementation of flash multi-head attention.
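A hedged sketch of what such a module typically looks like: a multi-head attention layer that dispatches to PyTorch's fused scaled_dot_product_attention, which can use a FlashAttention backend. The class name and shapes are illustrative, not the repo's API.

```python
# Minimal sketch of a multi-head attention module backed by fused SDPA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMHA(nn.Module):  # illustrative name, not the repo's class
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, s, d) -> (b, heads, s, head_dim)
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

x = torch.randn(2, 128, 512)
print(FlashMHA(512, 8)(x).shape)  # torch.Size([2, 128, 512])
```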
🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations