ollama: add reasoning model support (e.g. deepseek) #29689

Merged
15 commits merged into langchain-ai:master from the feat/ollama-reasoning-support branch on Mar 21, 2025

Conversation

BobMerkus
Contributor

Description

This PR adds reasoning model support for langchain-ollama by extracting reasoning token blocks, like those used in deepseek. It was inspired by ollama-deep-researcher, specifically the parsing of thinking blocks:

  # TODO: This is a hack to remove the <think> tags w/ Deepseek models 
  # It appears very challenging to prompt them out of the responses 
  while "<think>" in running_summary and "</think>" in running_summary:
      start = running_summary.find("<think>")
      end = running_summary.find("</think>") + len("</think>")
      running_summary = running_summary[:start] + running_summary[end:]

That snippet notes that it is very hard to prompt the reasoning block away, and we actually want the model to reason in order to improve output quality. This implementation extracts the thinking block instead, so the client can still expect a clean message to be returned by ChatOllama (and use the reasoning content separately when desired).
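
For illustration, this is roughly how a client would consume the result (a minimal sketch: it assumes a locally pulled deepseek-r1 model and the extract_reasoning flag discussed later in this thread, whose exact name and default may differ):

  from langchain_ollama import ChatOllama

  # extract_reasoning is the opt-in flag discussed later in this thread;
  # treat the exact name and default as an assumption rather than settled API.
  llm = ChatOllama(model="deepseek-r1:8b", extract_reasoning=True)

  response = llm.invoke("How many prime numbers are there below 20?")
  print(response.content)                                     # clean answer, no <think> block
  print(response.additional_kwargs.get("reasoning_content"))  # extracted reasoning tokens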

This implementation takes the same approach as ChatDeepseek, which places the reasoning content in chunk.additional_kwargs["reasoning_content"]:

  if hasattr(response.choices[0].message, "reasoning_content"):  # type: ignore
      rtn.generations[0].message.additional_kwargs["reasoning_content"] = (
          response.choices[0].message.reasoning_content  # type: ignore
      )

This should probably be handled upstream in ollama + ollama-python, but this seems like a reasonably effective solution in the meantime. Here is a standalone example of what is happening:

from collections.abc import AsyncIterator
from typing import Any

from langchain_core.language_models import BaseChatModel
from langchain_core.messages import BaseMessage, BaseMessageChunk
from langchain_core.runnables import RunnableConfig


async def deepseek_message_astream(
    llm: BaseChatModel,
    messages: list[BaseMessage],
    config: RunnableConfig | None = None,
    *,
    model_target: str = "deepseek-r1",
    **kwargs: Any,
) -> AsyncIterator[BaseMessageChunk]:
    """Stream responses from Deepseek models, filtering out <think> tags.

    Args:
        llm: The language model to stream from
        messages: The messages to send to the model

    Yields:
        Filtered chunks from the model response
    """
    # check if the model is deepseek based; if not, pass chunks through unchanged
    if (llm.name and model_target not in llm.name) or (
        hasattr(llm, "model") and model_target not in llm.model
    ):
        async for chunk in llm.astream(messages, config=config, **kwargs):
            yield chunk
        return

    # Yield with a buffer; upon completing the <think></think> tags, move the
    # tokens to the reasoning content and start over
    buffer = ""
    async for chunk in llm.astream(messages, config=config, **kwargs):
        # accumulate streamed content so we know when a thinking block opens/closes
        buffer += chunk.content

        # Process buffer to re-route <think> content
        if "<think>" in buffer or "</think>" in buffer:
            if hasattr(chunk, "tool_calls") and chunk.tool_calls:
                raise NotImplementedError(
                    "tool calls during reasoning should be removed?"
                )
            # drop the chunks that carry the tag tokens themselves
            if "<think>" in chunk.content or "</think>" in chunk.content:
                continue
            # move reasoning tokens out of the visible content
            chunk.additional_kwargs["reasoning_content"] = chunk.content
            chunk.content = ""
        # upon block completion, reset the buffer
        if "<think>" in buffer and "</think>" in buffer:
            buffer = ""
        yield chunk
Issue

Integrating reasoning models (e.g. deepseek-r1) into existing LangChain-based workflows is hard because of the thinking blocks included in the message contents. To avoid this, the ChatOllama integration could match ChatDeepseek and return the reasoning content inside message.additional_kwargs["reasoning_content"] instead.

Dependencies

None

@dosubot dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) on Feb 8, 2025

@BobMerkus BobMerkus force-pushed the feat/ollama-reasoning-support branch from be00618 to 052f76c on February 8, 2025, 18:34
Collaborator

@ccurme ccurme left a comment

Hi @BobMerkus, we can move forward with this but I don't think the current implementation is working correctly. I'm finding that reasoning_content is just "<think></think>". The reason is that _generate runs through streaming, so as tokens are generated in the stream, we don't know whether they're part of the thinking block or not.

@BobMerkus BobMerkus force-pushed the feat/ollama-reasoning-support branch from 517a870 to cc91ca9 on March 18, 2025, 23:11
@BobMerkus
Contributor Author

Hey @ccurme,

Good catch; the standalone example I supplied aggregates a buffer to account for this. The initial prompt inside the test did not actually produce any content inside the thinking block (because it was relatively simple), and I overlooked this while integrating into ChatOllama. I've updated the test and implementation, so I think it should be working correctly now. In streaming mode, we keep track of whether we are inside a 'thinking block'; if so, the tokens are added to additional_kwargs["reasoning_content"] instead. When the chunks from the stream are aggregated, the result should match invoke().
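
For illustration, the state tracking described here boils down to something like the following (a simplified sketch, not the code in this PR; route_token and its signature are made up for the example):

  def route_token(token: str, in_thinking_block: bool) -> tuple[str, str, bool]:
      """Return (content, reasoning_content, updated flag) for one streamed token."""
      if token == "<think>":
          return "", "", True        # entering the thinking block
      if token == "</think>":
          return "", "", False       # leaving the thinking block
      if in_thinking_block:
          return "", token, True     # routed to additional_kwargs["reasoning_content"]
      return token, "", False        # regular, user-visible content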

Collaborator

@ccurme ccurme left a comment

Thanks @BobMerkus. I updated the default for extract_reasoning to False. I also refactored this:

  • 7fac207 refactors the original streaming code to share a single iterator method
  • e8ff5a8 adds back in the extraction of reasoning content

The motivation is that extracting reasoning content is arguably a niche use-case (disabled by default, only applies to a subset of models), so I was hesitant to introduce complexity into the basic streaming loops. This separates it out a bit more.
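
For illustration only (not the actual commits; the helper names here are hypothetical), the separation amounts to a plain streaming loop plus an optional reasoning-extraction wrapper applied only when the feature is enabled:

  from collections.abc import Iterator


  def _iterate_over_stream(raw_chunks: Iterator[str]) -> Iterator[str]:
      # the single streaming loop shared by the stream/generate paths stays simple
      yield from raw_chunks


  def _extract_reasoning(chunks: Iterator[str]) -> Iterator[tuple[str, str]]:
      # optional wrapper: yields (content, reasoning_content) pairs,
      # only used when reasoning extraction is enabled
      in_think = False
      for chunk in chunks:
          if "<think>" in chunk:
              in_think = True
              continue
          if "</think>" in chunk:
              in_think = False
              continue
          yield ("", chunk) if in_think else (chunk, "")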

Would appreciate a review :)

@BobMerkus
Contributor Author

Yeah, it makes sense to separate this from the existing runtime. The implementation looks good to me, and thanks for fixing the linter errors! As a side note, some of the other integration tests seem to fail when running locally, although that is not related to the changes in this PR.

@dosubot dosubot bot added the lgtm label (PR looks good. Use to confirm that a PR is ready for merging.) on Mar 21, 2025
@ccurme ccurme enabled auto-merge (squash) March 21, 2025 15:44
@ccurme ccurme merged commit 5700646 into langchain-ai:master Mar 21, 2025
20 checks passed