Open
Labels
bug: Something isn't working
Description
Browsergym Version
0.14.1
Playwright Version
1.44.0
Operating System Type
Ubuntu
Operating System Version
Ubuntu 24.04 LTS (Noble Numbat)
Affected Browsers
Chromium
What happened?
When running a GenericAgent evaluation with bgym on WorkArena L1, I noticed that the agent now fails every all-menu task (10 seeds). This was not the case before: it used to succeed at all of them.
This is an example of the trace, where the fill action did not get executed properly:
Task seed: 619
Action: fill('260', 'System Mobile')
AxTree:
Axtree of the step
RootWebArea 'Shared admin dashboard | ServiceNow', focused
[47] generic, live='assertive', atomic, relevant='additions text'
[48] generic, live='polite', atomic, relevant='additions text'
[53] generic, live='polite', atomic, relevant='all'
[56] navigation 'Global skip links', visible
[57] link 'Skip to main content', clickable
[58] link 'Open accessibility preferences', clickable
[251] region 'There are 0 announcements displayed', visible, live='polite', relevant='additions text'
[59] generic, live='polite', atomic, relevant='additions text'
[62] navigation 'Primary', visible
navigation 'Unpinned All menu', live='polite', relevant='additions text'
[252] navigation '', visible
[255] image '', visible
[259] LabelText '', clickable
StaticText 'Enter search term to filter All menu'
[260] textbox 'Enter search term to filter All menu', clickable, visible, focused
[264] generic, visible, live='assertive', relevant='additions text'
[274] button 'Self-Service', visible, expanded=True
StaticText 'Self-Service'
[279] button 'Edit Application Self-Service', clickable, visible
[282] button 'Add Self-Service to favorites', clickable, visible
[285] list ''
[286] listitem '', visible
[288] link 'Business Applications', clickable, visible
StaticText 'Business Applications'
[292] button 'Edit Module Business Applications', clickable, visible
[295] button 'Add Business Applications to favorites', clickable, visible
StaticText ''
[298] listitem '', visible
[300] link 'Dashboards', clickable, visible
StaticText 'Dashboards'
[304] button 'Edit Module Dashboards', clickable, visible
[307] button 'Add Dashboards to favorites', clickable, visible
StaticText ''
[310] listitem '', visible
[312] link 'Service Catalog', clickable, visible
StaticText 'Service Catalog'
[317] button 'Edit Module Service Catalog', clickable, visible
[320] button 'Add Service Catalog to favorites', clickable, visible
StaticText ''
[323] listitem '', visible
[325] link 'Employee Center', clickable, visible
StaticText 'Employee Center'
[329] button 'Edit Module Employee Center', clickable, visible
[332] button 'Add Employee Center to favorites', clickable, visible
StaticText ''
[335] listitem '', visible
[337] link 'Knowledge', clickable, visible
StaticText 'Knowledge'
[342] button 'Edit Module Knowledge', clickable, visible
[345] button 'Add Knowledge to favorites', clickable, visible
StaticText ''
[348] listitem '', visible
[350] listitem '', visible
[352] link 'Visual Task Boards', clickable, visible
StaticText 'Visual Task Boards'
[356] button 'Edit Module Visual Task Boards', clickable, visible
[359] button 'Add Visual Task Boards to favorites', clickable, visible
StaticText ''
[362] listitem '', visible
[364] link 'Incidents', clickable, visible
StaticText 'Incidents'
[369] button 'Edit Module Incidents', clickable, visible
[372] button 'Add Incidents to favorites', clickable, visible
StaticText ''
[375] listitem '', visible
[377] link 'Watched Incidents', clickable, visible
StaticText 'Watched Incidents'
[381] button 'Edit Module Watched Incidents', clickable, visible
[384] button 'Add Watched Incidents to favorites', clickable, visible
StaticText ''
[387] listitem '', visible
[389] link 'My Requests', clickable, visible
StaticText 'My Requests'
[393] button 'Edit Module My Requests', clickable, visible
[396] button 'Add My Requests to favorites', clickable, visible
StaticText ''
[399] listitem '', visible
[401] link 'Requested Items', clickable, visible
StaticText 'Requested Items'
[406] button 'Edit Module Requested Items', clickable, visible
[409] button 'Add Requested Items to favorites', clickable, visible
StaticText ''
[412] listitem ''
[414] link 'Watched Requested Items', clickable
StaticText 'Watched Requested Items'
[418] button 'Edit Module Watched Requested Items', clickable
[421] button 'Add Watched Requested Items to favorites', clickable
StaticText ''
[424] listitem ''
[426] listitem ''
[428] link 'My Connected Apps', clickable
StaticText 'My Connected Apps'
[432] button 'Edit Module My Connected Apps', clickable
[435] button 'Add My Connected Apps to favorites', clickable
StaticText ''
[438] listitem ''
[440] link 'My Profile', clickable
StaticText 'My Profile'
[445] button 'Edit Module My Profile', clickable
[448] button 'Add My Profile to favorites', clickable
StaticText ''
[451] listitem ''
[453] link 'My Tagged Documents', clickable
StaticText 'My Tagged Documents'
[458] button 'Edit Module My Tagged Documents', clickable
[461] button 'Add My Tagged Documents to favorites', clickable
StaticText ''
[464] listitem ''
[466] link 'My Tags', clickable
StaticText 'My Tags'
[471] button 'Edit Module My Tags', clickable
[474] button 'Add My Tags to favorites', clickable
StaticText ''
[477] listitem ''
[479] link 'My Knowledge Articles', clickable
StaticText 'My Knowledge Articles'
[483] button 'Edit Module My Knowledge Articles', clickable
[486] button 'Add My Knowledge Articles to favorites', clickable
StaticText ''
[489] listitem ''
[491] link 'Take Survey', clickable
StaticText 'Take Survey'
[496] button 'Edit Module Take Survey', clickable
[499] button 'Add Take Survey to favorites', clickable
StaticText ''
[502] listitem ''
[504] link 'My Approvals', clickable
StaticText 'My Approvals'
[508] button 'Edit Module My Approvals', clickable
[511] button 'Add My Approvals to favorites', clickable
StaticText ''
[514] listitem ''
[516] link 'My Assessments & Surveys', clickable
StaticText 'My Assessments & Surveys'
[520] button 'Edit Module My Assessments & Surveys', clickable
[523] button 'Add My Assessments & Surveys to favorites', clickable
StaticText ''
[526] listitem ''
[528] link 'My Assets', clickable
StaticText 'My Assets'
[532] button 'Edit Module My Assets', clickable
[535] button 'Add My Assets to favorites', clickable
StaticText ''
[538] listitem ''
[540] link 'My Notification Preferences', clickable
StaticText 'My Notification Preferences'
[544] button 'Edit Module My Notification Preferences', clickable
[547] button 'Add My Notification Preferences to favorites', clickable
StaticText ''
[552] button 'Access Analyzer', expanded=False
StaticText 'Access Analyzer'
[557] button 'Edit Application Access Analyzer', clickable
[560] button 'Add Access Analyzer to favorites', clickable
[566] button 'Activity Subscriptions', expanded=False
StaticText 'Activity Subscriptions'
[572] button 'Edit Application Activity Subscriptions', clickable
[575] button 'Add Activity Subscriptions to favorites', clickable
[581] button 'App Engine', expanded=False
StaticText 'App Engine'
[587] button 'Edit Application App Engine', clickable
[590] button 'Add App Engine to favorites', clickable
[596] button 'Availability', expanded=False
StaticText 'Availability'
[601] button 'Edit Application Availability', clickable
[604] button 'Add Availability to favorites', clickable
[610] button 'Benchmarks', expanded=False
StaticText 'Benchmarks'
[615] button 'Edit Application Benchmarks', clickable
[618] button 'Add Benchmarks to favorites', clickable
[624] button 'Business Calendar', expanded=False
StaticText 'Business Calendar'
[629] button 'Edit Application Business Calendar', clickable
[632] button 'Add Business Calendar to favorites', clickable
[638] button 'Certificate Based Authentication', expanded=False
StaticText 'Certificate Based Authentication'
[644] button 'Edit Application Certificate Based Authentication', clickable
[647] button 'Add Certificate Based Authentication to favorites', clickable
[653] button 'Content Taxonomy', expanded=False
StaticText 'Content Taxonomy'
[658] button 'Edit Application Content Taxonomy', clickable
[661] button 'Add Content Taxonomy to favorites', clickable
[667] button 'Conversational Interfaces', expanded=False
StaticText 'Conversational Interfaces'
[673] button 'Edit Application Conversational Interfaces', clickable
[676] button 'Add Conversational Interfaces to favorites', clickable
[682] button 'Diagram Builder', expanded=False
StaticText 'Diagram Builder'
[687] button 'Edit Application Diagram Builder', clickable
[690] button 'Add Diagram Builder to favorites', clickable
[696] button 'Docker', expanded=False
StaticText 'Docker'
[701] button 'Edit Application Docker', clickable
[704] button 'Add Docker to favorites', clickable
[710] button 'Docker Webhook Answer Subflow', expanded=False
StaticText 'Docker Webhook Answer Subflow'
[715] button 'Edit Application Docker Webhook Answer Subflow', clickable
[718] button 'Add Docker Webhook Answer Subflow to favorites', clickable
navigation 'Unpinned Favorites menu', live='polite', relevant='additions text'
navigation 'Unpinned History menu', live='polite', relevant='additions text'
navigation 'Unpinned Workspaces menu', live='polite', relevant='additions text'
navigation 'Unpinned Admin menu', live='polite', relevant='additions text'
navigation 'More menus', live='polite', relevant='additions text'
[66] button 'My ServiceNow landing page', clickable, visible, describedby='logo-tooltip'
[67] image 'ServiceNow Service Management', visible
[217] button 'All', clickable, visible, expanded=True
[79] button 'More menus', clickable, visible, expanded=False
generic, describedby='title-tooltip'
StaticText 'Shared admin dashboard'
[92] button 'Create favorite for Shared admin dashboard', clickable, visible, live='polite', relevant='additions text', pressed='false'
[224] search '', visible
[225] button 'Search', clickable, visible, controls='sncwsgs-typeahead-input'
[229] region '', live='polite', relevant='additions text'
StaticText 'No exact match. Press Enter for full results.'
[104] button 'Scope selectors', clickable, visible, expanded=False
[236] button 'Sidebar discussions', clickable, visible, expanded=False
[110] button 'Show help', clickable, visible, expanded=False
[138] button 'Show notifications', clickable, visible, expanded=False
[150] button 'Linda Miller: available', clickable, visible, expanded=False
[153] image 'Linda Miller is Available', visible
StaticText 'LM'
[246] main 'Screen content', visible
[247] Section '', visible
[735] Section '', visible
[737] Section '', visible
[739] Section '', visible
StaticText 'Welcome to Admin Home, Linda!'
StaticText 'Manage, monitor, and discover all your day to day administrative actions and tools across the platform.'
[747] Section ''
[749] Section ''
StaticText 'Track what’s important to you'
[762] heading 'Shared admin dashboard', visible
[763] button 'Change dashboard', clickable, visible
[781] button 'Refresh dashboard', clickable, visible
[783] button 'View dashboard details', clickable, visible
[785] button 'Edit', clickable, visible
StaticText 'Edit'
[787] button 'More actions', clickable, visible, hasPopup='menu', expanded=False
[794] progressbar 'Saving Dashboard', visible, valuemin=0, valuemax=100, valuetext=''
[795] alert '', visible, live='assertive', atomic, relevant='additions text'
[796] list 'Expanded alert list with 0 alerts.', visible
[808] image 'Loading'
[813] image 'Loading'
[818] image 'Loading'
[825] heading 'Hardening compliance score', visible
[826] button 'More options', clickable, visible, hasPopup='menu', expanded=False
button '85%\xa0 Trending downward \xa0 -1% (-0.7%)\xa0since Jul 17'
StaticText '85%'
StaticText ''
image 'Trending downward'
StaticText ''
StaticText '-1% (-0.7%)'
StaticText ''
StaticText 'since Jul 17'
[848] image 'Loading'
[853] image 'Loading'
[858] image 'Loading'
[863] image 'Loading'
[868] image 'Loading'
[873] image 'Loading'
[892] Section ''
[894] Section ''
[898] paragraph ''
StaticText 'Tell us how we can make this page more useful'
[899] button 'Share a suggestion', clickable
StaticText 'Share a suggestion'
I have other examples where the fill command works, but when the agent then clicks the desired link, nothing happens (the page does not change as expected).
I also noticed that this happened when I ran multiple seeds; when I ran the task a single time, it was OK.
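To make the "fill did not get executed" symptom easier to check across many seeds, here is a minimal sketch of a helper that scans a flattened AXTree dump for a given bid and reports whether the expected text shows up on that node. The function names `find_node` and `fill_took_effect` are hypothetical, not part of BrowserGym. It also assumes that a filled textbox is rendered with a `value='...'` attribute on its line, which I have not verified for this BrowserGym version.

```python
import re

# Matches one flattened AXTree line, for example:
#   [260] textbox 'Enter search term to filter All menu', clickable, visible, focused
NODE_RE = re.compile(
    r"^\s*\[(?P<bid>\w+)\]\s+(?P<role>\w+)\s+'(?P<name>.*?)'(?:,\s*(?P<attrs>.*))?$"
)


def find_node(axtree_text, bid):
    """Return role, name and raw attribute string of the node with this bid, or None."""
    for line in axtree_text.splitlines():
        match = NODE_RE.match(line)
        if match and match.group("bid") == bid:
            return {
                "role": match.group("role"),
                "name": match.group("name"),
                "attrs": match.group("attrs") or "",
            }
    return None


def fill_took_effect(axtree_text, bid, expected):
    """Hypothetical check: True if the node carries value=<expected>.

    Assumption: the AXTree dump prints a filled textbox with a value='...' attribute.
    """
    node = find_node(axtree_text, bid)
    if node is None:
        return False
    return f"value={expected!r}" in node["attrs"]


# Line copied from the trace above (seed 619); no value attribute is present.
step_axtree = (
    "[259] LabelText '', clickable\n"
    "[260] textbox 'Enter search term to filter All menu', clickable, visible, focused\n"
)
print(fill_took_effect(step_axtree, "260", "System Mobile"))  # False
```

Running this over the AXTree of the step that follows each `fill(...)` action would show which seeds lose the typed text, which may help separate the fill failures from the cases where the click has no effect.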
Reproduction Steps
Run GenericAgent on all-menu task with GPT-4.1-mini. I used the following code:
import logging

import numpy as np

from browsergym.experiments.benchmark import Benchmark
from browsergym.experiments.benchmark.utils import make_env_args_list_from_repeat_tasks
from browsergym.experiments.benchmark.metadata.utils import task_metadata
from browsergym.experiments.benchmark.configs import DEFAULT_HIGHLEVEL_ACTION_SET_ARGS
from agentlab.experiments.study import make_study, Study
from agentlab.llm.chat_api import AzureModelArgs
from agentlab.agents.generic_agent.generic_agent import GenericAgentArgs
from agentlab.agents.generic_agent.agent_configs import FLAGS_GPT_4o

# GenericAgent backed by GPT-4.1-mini on Azure, reusing the GPT-4o flags.
GENERIC_AGENT_4_1_MINI = GenericAgentArgs(
    chat_model_args=AzureModelArgs(
        deployment_name="gpt-4.1-mini-2025-04-14",
        model_name="gpt-4.1-mini",
        max_new_tokens=16_384,
        max_input_tokens=128_000,
        max_total_tokens=128_000,
        vision_support=True,
        temperature=0.0,
    ),
    flags=FLAGS_GPT_4o,
)

# Custom benchmark: only the all-menu task, repeated over 10 seeds.
benchmark = Benchmark(
    name="workarena_l1_tiny",
    high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["workarena"],
    is_multi_tab=False,
    supports_parallel_seeds=False,
    backends=["workarena"],
    env_args_list=make_env_args_list_from_repeat_tasks(
        task_list=[
            "workarena.servicenow.all-menu",
        ],
        max_steps=15,
        n_repeats=10,
        seeds_rng=np.random.RandomState(42),
    ),
    task_metadata=task_metadata("workarena"),
)

study = make_study(
    agent_args=[GENERIC_AGENT_4_1_MINI],
    benchmark=benchmark,
    logging_level_stdout=logging.WARNING,
    ignore_dependencies=True,
)

# Run the 10 seeds in parallel with Ray.
study.run(
    n_jobs=10,
    parallel_backend="ray",
    strict_reproducibility=False,
    n_relaunch=3,
)
Relevant Logs
Additional Context
No response