WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2024)

Abstract
The advancement of large language models (LLMs) has opened a new era of autonomous real-world applications, driving innovation in advanced web-based agents. Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or on static web snapshots, which greatly limits their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, a Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents that addresses the challenges of automatically evaluating open-ended web agent tasks by leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.