
[feature request] rewrite the model name passed to upstream #69

Closed
matteoserva opened this issue Mar 14, 2025 · 4 comments

Comments

@matteoserva

I would like to configure the model name passed to the upstream server.
From my understanding, right now the proxy sends the model name configured in llama-swap's YAML file.

Use case:

- upstream ollama server running "gemma:27b"
- llama-swap configured with a model called "gemma_27b"

Current behavior:
the operation fails because llama-swap interprets "gemma" as a profile name.

Desired behavior:
gemma_27b in llama-swap becomes gemma:27b in ollama.

Function to modify:
I think a good place to insert the change is proxyOAIHandler() in proxymanager.go (a rough sketch follows the example config below).

Example config file:

  "gemma_27b":
    # environment variables to pass to the command
    env:
      ...

    cmd: ollama serve ...
    proxy: http://127.0.0.1:8080

    upstream_name: "gemma:27b"
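
A rough sketch of the idea, just to illustrate (the function and field names here are mine, not llama-swap's actual code):

package main

import (
    "encoding/json"
    "fmt"
)

// rewriteModelName replaces the "model" field of an OpenAI-style request body
// with the configured upstream name before the request is forwarded.
// Illustrative sketch only, not llama-swap's actual implementation.
func rewriteModelName(body []byte, upstreamName string) ([]byte, error) {
    var req map[string]interface{}
    if err := json.Unmarshal(body, &req); err != nil {
        return nil, err
    }
    if upstreamName != "" {
        req["model"] = upstreamName // e.g. "gemma_27b" -> "gemma:27b"
    }
    return json.Marshal(req)
}

func main() {
    in := []byte(`{"model":"gemma_27b","messages":[{"role":"user","content":"hi"}]}`)
    out, err := rewriteModelName(in, "gemma:27b")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out)) // model is now "gemma:27b"
}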
@mostlygeek (Owner)

I've been considering this feature and with the recent changes it would be fairly easy to implement.

I am curious, what is your use case for llama-swap+ollama vs llama-swap+llama-server?

@matteoserva (Author)

Thanks for your quick answer.

I am curious, what is your use case for llama-swap+ollama vs llama-swap+llama-server?

Why not both?

I'm using llama-swap not only to swap models but also to swap inference engines, depending on the constraints I have (RAM, VRAM, time, inputs...).

For example:

  • ollama supports multimodal inputs but cannot run the biggest models.
  • llama.cpp can run big models thanks to its quantization options and CPU offloading, but it has no tensor parallelism.
  • vLLM is very fast thanks to tensor parallelism and supports multimodal inputs, but it is limited by VRAM.

I'm using llama-swap to select the best upstream backend depending on the chosen model.

@mostlygeek (Owner)

I use it for that too! Makes it a lot easier to swap between engines for capabilities.

In this case, the “:” used for profiles conflicts with ollama’s naming conventions. And “upstream_name” is an override, so the model name sent upstream can be set to anything.

mostlygeek added a commit that referenced this issue Mar 15, 2025
* add test for splitRequestedModel()
* Add `useModelName` parameter to model configuration
* add docs to README
@mostlygeek (Owner)

Fixed in #71 and released in v95!

Example of usage:

models:
  "qwq":
    proxy: http://127.0.0.1:11434
    cmd: my-server

    # use this new configuration parameter to override what's in the request
    useModelName: "qwen:qwq"
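
For reference, a client request like the one below (assuming llama-swap itself listens on port 8080; adjust to your listen address) is matched against the "qwq" entry and forwarded to the upstream at 127.0.0.1:11434 with the model field rewritten to "qwen:qwq":

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq", "messages": [{"role": "user", "content": "hello"}]}'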
