bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True #79

void-mckenzie · 2025-03-16T07:16:24Z

This PR addresses the unsloth issue : https://github.com/unslothai/unsloth/issues/2008

While the previous release fixes the vLLM component of the issue unslothai/unsloth#2008 , the process still errors out for custom models due to the on the fly bnb_config not being passed to the convert_vllm_to_huggingface method in unsloth_zoo's vllm_utils.py

This PR modifies the vllm_utils.py to take in the on-the-fly generated bnb_config and pass it on to the convert_vllm_to_huggingface method to be parsed for quantization configs.

I have chosen not to bundle it with the model config, since the custom models might also have their own bnb-configs, if they're 4bit quantized already. Hence, the if and elif for parsing the quantization_config and the generated bnb_config.

NOTE: This PR needs to be merged along with unslothai/unsloth#2039 in unsloth, where the llama.py is edited to handle this additional configuration.

Code Snippet:

from unsloth import FastLanguageModel
import torch
from dotenv import load_dotenv
import os

load_dotenv()
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "void-mckenzie/krikri-sft_compound_instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.8, # Reduce if out of memory
)

Output before fix:

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-16 00:00:28 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.14: Fast Llama patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.621 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading void-mckenzie/krikri-sft_compound_instruct with actual GPU utilization = 74.61%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 23.62 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.19 GB. Also swap space = 4 GB.
INFO 03-16 00:00:34 config.py:549] This model supports multiple tasks: {'generate', 'reward', 'score', 'classify', 'embed'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'fp4', 'bnb_4bit_use_double_quant': False, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': [], 'llm_int8_threshold': 6.0}
INFO 03-16 00:00:34 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='void-mckenzie/krikri-sft_compound_instruct', speculative_config=None, tokenizer='void-mckenzie/krikri-sft_compound_instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=void-mckenzie/krikri-sft_compound_instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":160}, use_cached_outputs=False, 
INFO 03-16 00:00:38 cuda.py:229] Using Flash Attention backend.
[W316 00:00:38.381537842 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 03-16 00:00:38 model_runner.py:1110] Starting to load model void-mckenzie/krikri-sft_compound_instruct...
INFO 03-16 00:00:38 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 03-16 00:00:38 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

INFO 03-16 00:00:42 model_runner.py:1115] Loading model weights took 5.6977 GB
INFO 03-16 00:00:42 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-16 00:00:43 worker.py:267] Memory profiling takes 0.77 seconds
INFO 03-16 00:00:43 worker.py:267] the current vLLM instance can use total_gpu_memory (23.62GiB) x gpu_memory_utilization (0.75) = 17.62GiB
INFO 03-16 00:00:43 worker.py:267] model weights take 5.70GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 0.87GiB; the rest of the memory reserved for KV Cache is 10.97GiB.
INFO 03-16 00:00:43 executor_base.py:111] # cuda blocks: 5617, # CPU blocks: 2048
INFO 03-16 00:00:43 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 87.77x
INFO 03-16 00:00:45 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:07<00:00,  2.98it/s]INFO 03-16 00:00:52 model_runner.py:1562] Graph capturing finished in 8 secs, took 0.58 GiB
INFO 03-16 00:00:52 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 10.52 seconds

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[1], line 10
      7 max_seq_length = 1024 # Can increase for longer reasoning traces
      8 lora_rank = 32 # Larger rank = smarter, but slower
---> 10 model, tokenizer = FastLanguageModel.from_pretrained(
     11     model_name = "void-mckenzie/krikri-sft_compound_instruct",
     12     max_seq_length = max_seq_length,
     13     load_in_4bit = True, # False for LoRA 16bit
     14     fast_inference = True, # Enable vLLM fast inference
     15     max_lora_rank = lora_rank,
     16     gpu_memory_utilization = 0.8, # Reduce if out of memory
     17     token='hf_',
     18 )

File ~/anaconda3/envs/llm/lib/python3.12/site-packages/unsloth/models/loader.py:351, in FastLanguageModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, load_in_8bit, full_finetuning, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, revision, use_exact_model_name, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, *args, **kwargs)
    348     pass
    349 pass
--> 351 model, tokenizer = dispatch_model.from_pretrained(
    352     model_name        = model_name,
    353     max_seq_length    = max_seq_length,
    354     dtype             = _get_dtype(dtype),
    355     load_in_4bit      = load_in_4bit,
    356     token             = token,
    357     device_map        = device_map,
    358     rope_scaling      = rope_scaling,
    359     fix_tokenizer     = fix_tokenizer,
    360     model_patcher     = dispatch_model,
    361     tokenizer_name    = tokenizer_name,
    362     trust_remote_code = trust_remote_code,
    363     revision          = revision if not is_peft else None,
    364 
    365     fast_inference    = fast_inference,
    366     gpu_memory_utilization = gpu_memory_utilization,
    367     float8_kv_cache   = float8_kv_cache,
    368     random_state      = random_state,
    369     max_lora_rank     = max_lora_rank,
    370     disable_log_stats = disable_log_stats,
    371     *args, **kwargs,
    372 )
    374 if resize_model_vocab is not None:
    375     model.resize_token_embeddings(resize_model_vocab)

File ~/anaconda3/envs/llm/lib/python3.12/site-packages/unsloth/models/llama.py:1825, in FastLlamaModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map, rope_scaling, fix_tokenizer, model_patcher, tokenizer_name, trust_remote_code, fast_inference, gpu_memory_utilization, float8_kv_cache, random_state, max_lora_rank, disable_log_stats, **kwargs)
   1823 # Convert to HF format
   1824 _, quant_state_dict = get_vllm_state_dict(llm, config = model_config)
-> 1825 model = convert_vllm_to_huggingface(quant_state_dict, model_config, dtype)
   1826 model.vllm_engine = llm
   1827 model.fast_generate = model.vllm_engine.generate

File ~/anaconda3/envs/llm/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/anaconda3/envs/llm/lib/python3.12/site-packages/unsloth_zoo/vllm_utils.py:614, in convert_vllm_to_huggingface(quant_state_dict, config, dtype, bnb_config)
    612 quant_state = quant_state_dict[f"{layer_name}.weight.quant_state"]
    613 n_layers = config.num_hidden_layers
--> 614 layer = Linear4bit(0, 0, device = "cuda:0", bias = has_bias, compute_dtype = compute_dtype, **kwargs)
    615 layer.in_features  = quant_state.shape[1]
    616 layer.out_features = quant_state.shape[0]

UnboundLocalError: cannot access local variable 'compute_dtype' where it is not associated with a value

Output After Fix:

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-16 00:08:17 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.14: Fast Llama patching. Transformers: 4.49.0. vLLM: 0.7.3.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.621 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading void-mckenzie/krikri-sft_compound_instruct with actual GPU utilization = 75.45%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 23.62 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.39 GB. Also swap space = 4 GB.
INFO 03-16 00:08:22 config.py:549] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'fp4', 'bnb_4bit_use_double_quant': False, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': [], 'llm_int8_threshold': 6.0}
INFO 03-16 00:08:22 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='void-mckenzie/krikri-sft_compound_instruct', speculative_config=None, tokenizer='void-mckenzie/krikri-sft_compound_instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=void-mckenzie/krikri-sft_compound_instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":160}, use_cached_outputs=False, 
INFO 03-16 00:08:23 cuda.py:229] Using Flash Attention backend.
[W316 00:08:23.516938937 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 03-16 00:08:23 model_runner.py:1110] Starting to load model void-mckenzie/krikri-sft_compound_instruct...
INFO 03-16 00:08:23 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 03-16 00:08:24 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

INFO 03-16 00:08:26 model_runner.py:1115] Loading model weights took 5.6977 GB
INFO 03-16 00:08:26 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-16 00:08:27 worker.py:267] Memory profiling takes 0.77 seconds
INFO 03-16 00:08:27 worker.py:267] the current vLLM instance can use total_gpu_memory (23.62GiB) x gpu_memory_utilization (0.75) = 17.82GiB
INFO 03-16 00:08:27 worker.py:267] model weights take 5.70GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 0.87GiB; the rest of the memory reserved for KV Cache is 11.18GiB.
INFO 03-16 00:08:27 executor_base.py:111] # cuda blocks: 5723, # CPU blocks: 2048
INFO 03-16 00:08:27 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 89.42x
INFO 03-16 00:08:29 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:08<00:00,  2.68it/s]INFO 03-16 00:08:38 model_runner.py:1562] Graph capturing finished in 9 secs, took 0.59 GiB
INFO 03-16 00:08:38 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 11.35 seconds

Datta0 · 2025-03-16T11:27:54Z

unsloth_zoo/vllm_utils.py

    # All Unsloth Zoo code licensed under LGPLv3
    # Unmerges vLLM modules to create HF compatible model
    config.update({"torch_dtype" : dtype}) # Do not use config file's dtype!
    new_model = create_empty_causal_lm(config, dtype)
    quantization_config = getattr(config, "quantization_config", {})
    kwargs = dict()
-    if quantization_config != {}:
+    if quantization_config != {} or bnb_config:


what if there's no quant involved and we're doing BF16 LoRA or something? compute_dtype will still be not declared...
might want to check that?

@Datta0 This doesn't change the flow for non quantized cases. This comes into effect only when loading custom models which are BF16 on hugging face, but the user mentions load in 4bit and gast inference to be True.

This doesn't change the flow for non quant cases, but that is where the problem lies.
compute_dtype would be undeclared variable only to be referenced a few lines later
I'm saying maybe we should set compute_dtype outside the if quantization .... so that it is always available.

I see what you mean, yes. Regardless of the type of quantization config, the compute_dtype is always set to the same dtype value passed to this method. I'll update this accordingly. Thanks for the catch!

Done @Datta0 !

LGTM! Thanks for the changes :)

shimmyshimmer · 2025-03-16T12:02:15Z

Thank you for the PR we will review it

Datta0 · 2025-03-16T16:22:29Z

unsloth_zoo/vllm_utils.py

    # All Unsloth Zoo code licensed under LGPLv3
    # Unmerges vLLM modules to create HF compatible model
    config.update({"torch_dtype" : dtype}) # Do not use config file's dtype!
    new_model = create_empty_causal_lm(config, dtype)
    quantization_config = getattr(config, "quantization_config", {})
    kwargs = dict()
-    if quantization_config != {}:
+    if quantization_config != {} or bnb_config:


LGTM! Thanks for the changes :)

enochlev · 2025-03-16T18:50:33Z

confirmed working for me

enochlev · 2025-03-16T18:51:44Z

maybe replace

    pass

with

    else:
        compute_dtype = dtype

danielhanchen · 2025-03-16T22:18:29Z

Thanks and appreciate it! I'll add this to nightly and push it a mini release later today!

* Update compiler.py * debugging * remove debugging * num items in batch * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * logs * Update patching_utils.py * VLM attention mask * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Recheck * Update compiler.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update compiler.py * Update patching_utils.py * suppress errors * Update compiler.py * Update patching_utils.py * Update compiler.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update peft_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * bug fixes * Update compiler.py * Update compiler.py * Update vision_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Bug fixes * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * gpu_memory_utilization * Update temporary_patches.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * train on completions VLMs * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * VLM train only on completions * Update loss_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update saving_utils.py * Update llama_cpp.py * Update llama_cpp.py * Update saving_utils.py * Update saving_utils.py * Update __init__.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update llama_cpp.py * Update loss_utils.py * Update compiler.py * Update llama_cpp.py * Update compiler.py * Update vllm_utils.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update training_utils.py * Update dataset_utils.py * Update dataset_utils.py * Revert "Update dataset_utils.py" This reverts commit 3b690ad. * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Remove prints * Update compiler.py * Update saving_utils.py * Update temporary_patches.py * Update __init__.py * Update pyproject.toml * Update vllm_utils.py * bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True (#79) * bug fix #2008 unsloth * non-quant dtype fix * Update vllm_utils.py --------- Co-authored-by: Daniel Han <[email protected]> * Update dataset_utils.py * Update compiler.py * Update temporary_patches.py * Gemma 3 fixes * Update temporary_patches.py * Update compiler.py * Update compiler.py * Gemma 3 fixes * Update patching_utils.py * Update compiler.py * Update compiler.py * Update patching_utils.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * compiler * Update gradient_checkpointing.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * causal mask dtype * Fix checkpoint and save from local file (#74) * Enhance gradient checkpointing and add original model ID retrieval in saving utilities * In case adapter_config.json as well * Update patching_utils.py * Update patching_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update vllm_utils.py * Update compiler.py * Update peft_utils.py * Update rl_replacements.py * Update vllm_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py --------- Co-authored-by: Mukkesh Ganesh <[email protected]> Co-authored-by: Edd <[email protected]>

* Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Recheck * Update compiler.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update compiler.py * Update patching_utils.py * suppress errors * Update compiler.py * Update patching_utils.py * Update compiler.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update peft_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * bug fixes * Update compiler.py * Update compiler.py * Update vision_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Bug fixes * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * gpu_memory_utilization * Update temporary_patches.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * train on completions VLMs * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * VLM train only on completions * Update loss_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update saving_utils.py * Update llama_cpp.py * Update llama_cpp.py * Update saving_utils.py * Update saving_utils.py * Update __init__.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update llama_cpp.py * Update loss_utils.py * Update compiler.py * Update llama_cpp.py * Update compiler.py * Update vllm_utils.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update training_utils.py * Update dataset_utils.py * Update dataset_utils.py * Revert "Update dataset_utils.py" This reverts commit 3b690ad. * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Remove prints * Update compiler.py * Update saving_utils.py * Update temporary_patches.py * Update __init__.py * Update pyproject.toml * Update vllm_utils.py * bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True (#79) * bug fix #2008 unsloth * non-quant dtype fix * Update vllm_utils.py --------- Co-authored-by: Daniel Han <[email protected]> * Update dataset_utils.py * Update compiler.py * Update temporary_patches.py * Gemma 3 fixes * Update temporary_patches.py * Update compiler.py * Update compiler.py * Gemma 3 fixes * Update patching_utils.py * Update compiler.py * Update compiler.py * Update patching_utils.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * compiler * Update gradient_checkpointing.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * causal mask dtype * Fix checkpoint and save from local file (#74) * Enhance gradient checkpointing and add original model ID retrieval in saving utilities * In case adapter_config.json as well * Update patching_utils.py * Update patching_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update vllm_utils.py * Update compiler.py * Update peft_utils.py * Update rl_replacements.py * Update vllm_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update vllm_lora_worker_manager.py * Update utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update dataset_utils.py * bidirectional attention * Update vllm_utils.py * Update __init__.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_lora_worker_manager.py * Update vllm_lora_worker_manager.py * Update vllm_lora_worker_manager.py --------- Co-authored-by: Mukkesh Ganesh <[email protected]> Co-authored-by: Edd <[email protected]>

* Update compiler.py * Update patching_utils.py * Update patching_utils.py * Update patching_utils.py * Update peft_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * bug fixes * Update compiler.py * Update compiler.py * Update vision_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Bug fixes * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * gpu_memory_utilization * Update temporary_patches.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * train on completions VLMs * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * VLM train only on completions * Update loss_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update saving_utils.py * Update llama_cpp.py * Update llama_cpp.py * Update saving_utils.py * Update saving_utils.py * Update __init__.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update llama_cpp.py * Update loss_utils.py * Update compiler.py * Update llama_cpp.py * Update compiler.py * Update vllm_utils.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update training_utils.py * Update dataset_utils.py * Update dataset_utils.py * Revert "Update dataset_utils.py" This reverts commit 3b690ad. * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Remove prints * Update compiler.py * Update saving_utils.py * Update temporary_patches.py * Update __init__.py * Update pyproject.toml * Update vllm_utils.py * bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True (#79) * bug fix #2008 unsloth * non-quant dtype fix * Update vllm_utils.py --------- Co-authored-by: Daniel Han <[email protected]> * Update dataset_utils.py * Update compiler.py * Update temporary_patches.py * Gemma 3 fixes * Update temporary_patches.py * Update compiler.py * Update compiler.py * Gemma 3 fixes * Update patching_utils.py * Update compiler.py * Update compiler.py * Update patching_utils.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * compiler * Update gradient_checkpointing.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * causal mask dtype * Fix checkpoint and save from local file (#74) * Enhance gradient checkpointing and add original model ID retrieval in saving utilities * In case adapter_config.json as well * Update patching_utils.py * Update patching_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update vllm_utils.py * Update compiler.py * Update peft_utils.py * Update rl_replacements.py * Update vllm_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update vllm_lora_worker_manager.py * Update utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update dataset_utils.py * bidirectional attention * Update vllm_utils.py * Update __init__.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_lora_worker_manager.py * Update vllm_lora_worker_manager.py * Update vllm_lora_worker_manager.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update __init__.py * fix: AsyncLLMEngine bugs (#82) * fixed a typo in L119, removing unnecessary len() (#84) Co-authored-by: Xiaochen Zhu <[email protected]> --------- Co-authored-by: Mukkesh Ganesh <[email protected]> Co-authored-by: Edd <[email protected]> Co-authored-by: Brad Hilton <[email protected]> Co-authored-by: SpaceHunter <[email protected]> Co-authored-by: Xiaochen Zhu <[email protected]>

* Update dataset_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * gpu_memory_utilization * Update temporary_patches.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * train on completions VLMs * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * VLM train only on completions * Update loss_utils.py * Update dataset_utils.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update saving_utils.py * Update llama_cpp.py * Update llama_cpp.py * Update saving_utils.py * Update saving_utils.py * Update __init__.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update loss_utils.py * Update loss_utils.py * Update llama_cpp.py * Update loss_utils.py * Update compiler.py * Update llama_cpp.py * Update compiler.py * Update vllm_utils.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update rl_replacements.py * Update training_utils.py * Update dataset_utils.py * Update dataset_utils.py * Revert "Update dataset_utils.py" This reverts commit 3b690ad. * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Remove prints * Update compiler.py * Update saving_utils.py * Update temporary_patches.py * Update __init__.py * Update pyproject.toml * Update vllm_utils.py * bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True (#79) * bug fix #2008 unsloth * non-quant dtype fix * Update vllm_utils.py --------- Co-authored-by: Daniel Han <[email protected]> * Update dataset_utils.py * Update compiler.py * Update temporary_patches.py * Gemma 3 fixes * Update temporary_patches.py * Update compiler.py * Update compiler.py * Gemma 3 fixes * Update patching_utils.py * Update compiler.py * Update compiler.py * Update patching_utils.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * Update compiler.py * compiler * Update gradient_checkpointing.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * causal mask dtype * Fix checkpoint and save from local file (#74) * Enhance gradient checkpointing and add original model ID retrieval in saving utilities * In case adapter_config.json as well * Update patching_utils.py * Update patching_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update loss_utils.py * Update compiler.py * Update vllm_utils.py * Update compiler.py * Update peft_utils.py * Update rl_replacements.py * Update vllm_utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update compiler.py * Update vllm_lora_worker_manager.py * Update utils.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update dataset_utils.py * bidirectional attention * Update vllm_utils.py * Update __init__.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_utils.py * Update vllm_lora_worker_manager.py * Update vllm_lora_worker_manager.py * Update vllm_lora_worker_manager.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update temporary_patches.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update loss_utils.py * Update __init__.py * fix: AsyncLLMEngine bugs (#82) * fixed a typo in L119, removing unnecessary len() (#84) Co-authored-by: Xiaochen Zhu <[email protected]> * Fix gradient checkpointing warning filter implementation * Input grads fix for gemma3 (#96) * gemma require gradients fix * Update peft_utils.py --------- Co-authored-by: Daniel Han <[email protected]> * Update vision_utils.py * Vision requires grad * Check SDPA for Mistral / Pixtral * Update compiler.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update __init__.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vision_utils.py * Update vllm_utils.py (#99) Fix bugs in generate_batches.py.Original output = [] will result in duplication of results. * Update vision_utils.py * Fixes to support IterableDataset (#98) * Support Iterable Datasets * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Update dataset_utils.py * Preserve batch size from iterable dataset * Preserve batch size from iterable dataset * Support train_on_response_only with IterableDataset * Support train_on_response_only with IterableDataset * Support train_on_response_only with IterableDataset * Support train_on_response_only with IterableDataset --------- Co-authored-by: Mukkesh Ganesh <[email protected]> Co-authored-by: Edd <[email protected]> Co-authored-by: Brad Hilton <[email protected]> Co-authored-by: SpaceHunter <[email protected]> Co-authored-by: Xiaochen Zhu <[email protected]> Co-authored-by: Roland Tannous <[email protected]> Co-authored-by: DoubleMathew <[email protected]> Co-authored-by: Michael Han <[email protected]> Co-authored-by: Qian Wu <[email protected]> Co-authored-by: marcandrelarochelle <[email protected]>

bug fix #2008 unsloth

33ef1e1

Datta0 reviewed Mar 16, 2025

View reviewed changes

non-quant dtype fix

0f145cc

Datta0 approved these changes Mar 16, 2025

View reviewed changes

Datta0 mentioned this pull request Mar 16, 2025

UnboundLocalError: local variable 'compute_dtype' referenced before assignment unslothai/unsloth#2030

Open

void-mckenzie mentioned this pull request Mar 16, 2025

bug fix #2008 - load_in_4bit = True + fast_inference = True unslothai/unsloth#2039

Merged

Datta0 mentioned this pull request Mar 16, 2025

UnboundLocalError: cannot access local variable 'compute_dtype' where it is not associated with a value unslothai/unsloth#2041

Closed

void-mckenzie mentioned this pull request Mar 16, 2025

FastLanguageModel load_in_4bit not working, when used with fast_inference=True unslothai/unsloth#2008

Closed

Update vllm_utils.py

62973b4

danielhanchen changed the base branch from main to nightly March 16, 2025 22:13

danielhanchen merged commit 4c72e79 into unslothai:nightly Mar 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True #79

bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True #79

void-mckenzie commented Mar 16, 2025 •

edited by danielhanchen

Loading

Datta0 Mar 16, 2025

void-mckenzie Mar 16, 2025

Datta0 Mar 16, 2025

void-mckenzie Mar 16, 2025

void-mckenzie Mar 16, 2025

Datta0 Mar 16, 2025

shimmyshimmer commented Mar 16, 2025

Datta0 Mar 16, 2025

enochlev commented Mar 16, 2025

enochlev commented Mar 16, 2025

danielhanchen commented Mar 16, 2025

bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True #79

bug fix #2008 unsloth issue - load_in_4bit = True + fast_inference = True #79

Conversation

void-mckenzie commented Mar 16, 2025 • edited by danielhanchen Loading

Datta0 Mar 16, 2025

Choose a reason for hiding this comment

void-mckenzie Mar 16, 2025

Choose a reason for hiding this comment

Datta0 Mar 16, 2025

Choose a reason for hiding this comment

void-mckenzie Mar 16, 2025

Choose a reason for hiding this comment

void-mckenzie Mar 16, 2025

Choose a reason for hiding this comment

Datta0 Mar 16, 2025

Choose a reason for hiding this comment

shimmyshimmer commented Mar 16, 2025

Datta0 Mar 16, 2025

Choose a reason for hiding this comment

enochlev commented Mar 16, 2025

enochlev commented Mar 16, 2025

danielhanchen commented Mar 16, 2025

void-mckenzie commented Mar 16, 2025 •

edited by danielhanchen

Loading