Improper handling of MultimodalWebSurfer content led to premature task termination #6064

Open
ode126 opened this issue Mar 22, 2025 · 0 comments

What happened?

Describe the bug

When using MultimodalWebSurfer in MagenticOneGroupChat, the task prematurely terminates after the first WebSurfer action. This is caused by improper handling of multimodal content (images + text) in the MagenticOneOrchestrator's progress assessment logic.

To Reproduce

from autogen_ext.teams.magentic_one import MagenticOne

async def main():
    client = xxx  # model client, elided in this report
    autoctf = MagenticOne(client=client)
    task = "open bing.com search weather then click third result link"
    # run_stream returns an async generator, so iterate over it instead of awaiting it
    async for message in autoctf.run_stream(task=task):
        print(message)

# The task terminates after WebSurfer's first action without completing the click

The issue occurs because:

  1. WebSurfer returns MultiModalMessage containing both text and image
  2. MagenticOneOrchestrator fails to properly process this multimodal content when assessing task progress
  3. This leads to incorrect termination assessment in _orchestrate_step

Expected behavior

  1. The WebSurfer should be able to perform multiple actions as needed
  2. Task should continue until the actual goal is achieved
  3. The orchestrator should properly handle multimodal content in progress assessment

Technical Details

The issue occurs in two key components:

  1. MagenticOneOrchestrator._thread_to_context:
# Original problematic code
if isinstance(m, (TextMessage, MultiModalMessage, ToolCallSummaryMessage)):
    context.append(UserMessage(content=m.content, source=m.source))
  2. Progress assessment logic incorrectly processes multimodal content, leading to premature task completion judgment.
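
To make the first point concrete, here is a minimal sketch of what the original branch forwards. It uses stand-in classes (FakeImage and FakeMessage are hypothetical and are not the autogen types), so it only illustrates the shape of the data, not the library's actual behavior: a text message reaches the context as a string, while a WebSurfer-style message reaches it as a mixed list that still contains the image.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FakeImage:
    """Stand-in for an image object (hypothetical, not the autogen Image class)."""
    data: bytes


@dataclass
class FakeMessage:
    """Stand-in for a chat message whose content is text or a mixed list."""
    content: Union[str, List[Union[str, FakeImage]]]
    source: str


def thread_to_context_original(thread: List[FakeMessage]) -> List[FakeMessage]:
    """Mimics the original branch: content is forwarded unchanged."""
    return [FakeMessage(content=m.content, source=m.source) for m in thread]


thread = [
    FakeMessage(content="Please open bing.com", source="user"),
    FakeMessage(content=["I opened bing.com.", FakeImage(b"\x89PNG")], source="WebSurfer"),
]
context = thread_to_context_original(thread)

# Record the content type of each context entry
kinds = [type(entry.content).__name__ for entry in context]
print(kinds)
```

Whether the model behind the orchestrator handles such a mixed list well depends on the model client, which is the part this report suspects.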

Fix Implementation

The fix involves properly handling MultiModalMessage content in the orchestrator:

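# Fixed version: pass only text into the progress-assessment context
if isinstance(m, MultiModalMessage):
    # MultiModalMessage.content is a list mixing strings and images
    if isinstance(m.content, list) and len(m.content) > 0:
        # Keep the first element (expected to be the text part); stringify it otherwise
        content = m.content[0] if isinstance(m.content[0], str) else str(m.content[0])
    else:
        content = str(m.content)
else:
    # TextMessage and ToolCallSummaryMessage already carry string content
    content = m.content
context.append(UserMessage(content=content, source=m.source))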
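
The fix above keeps only the first element of the content list. A possible variant, shown here as a standalone sketch and not as the library's implementation, keeps every text part and marks where images were. The helper name flatten_to_text and the "<image>" placeholder are assumptions made for illustration.

```python
from typing import Any, List, Union


def flatten_to_text(
    content: Union[str, List[Any]], image_placeholder: str = "<image>"
) -> str:
    """Reduce message content to plain text.

    Strings are returned unchanged. For a list, every string item is kept
    and every non-string item (for example an image) is replaced by a
    placeholder, so the text that follows an image is not lost.
    """
    if isinstance(content, str):
        return content
    parts = [item if isinstance(item, str) else image_placeholder for item in content]
    return "\n".join(parts)


# Usage: a mixed list such as a web surfer might return
summary = flatten_to_text(["Opened bing.com.", object(), "Search box is visible."])
print(summary)
```

The trade-off is the same as in the fix above: the orchestrator no longer sees the screenshot, so this only suits cases where the text part describes the page state well enough.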

Additional context

  • This issue specifically affects tasks that require multiple WebSurfer actions
  • The issue affects standard MagenticOneGroupChat implementations

Environment

  • AutoGen Studio Version: 0.4.2
  • Python Version: 3.10
  • OS: Tested on both Linux and Windows

Which packages was the bug in?

Python AgentChat (autogen-agentchat>=0.4.0)

AutoGen library version.

Python dev (main branch)

Other library version.

No response

Model used

qwen-vl-max-latest

Model provider

Other (please specify below)

Other model provider

Qwen

Python version

3.10

.NET version

None

Operating system

None
