[Bug]: Issue on all-menu task with GenericAgent #88

@imenelydiaker

Description

Browsergym Version

0.14.1

Playwright Version

1.44.0

Operating System Type

Ubuntu

Operating System Version

Ubuntu 24.04 LTS (Noble Numbat)

Affected Browsers

Chromium

What happened?

When running a GenericAgent evaluation with bgym on WorkArena L1, I noticed that the agent is failing the all-menu task (run with 10 seeds). This was not the case before; it used to succeed on all of them.

Here is an example trace in which the fill action was not executed properly:

Task seed: 619

Action: fill('260', 'System Mobile')

AxTree:

RootWebArea 'Shared admin dashboard | ServiceNow', focused
	[47] generic, live='assertive', atomic, relevant='additions text'
	[48] generic, live='polite', atomic, relevant='additions text'
	[53] generic, live='polite', atomic, relevant='all'
	[56] navigation 'Global skip links', visible
		[57] link 'Skip to main content', clickable
		[58] link 'Open accessibility preferences', clickable
	[251] region 'There are 0 announcements displayed', visible, live='polite', relevant='additions text'
	[59] generic, live='polite', atomic, relevant='additions text'
	[62] navigation 'Primary', visible
		navigation 'Unpinned All menu', live='polite', relevant='additions text'
			[252] navigation '', visible
				[255] image '', visible
				[259] LabelText '', clickable
					StaticText 'Enter search term to filter All menu'
				[260] textbox 'Enter search term to filter All menu', clickable, visible, focused
				[264] generic, visible, live='assertive', relevant='additions text'
				[274] button 'Self-Service', visible, expanded=True
					StaticText 'Self-Service'
				[279] button 'Edit Application Self-Service', clickable, visible
				[282] button 'Add Self-Service to favorites', clickable, visible
				[285] list ''
					[286] listitem '', visible
						[288] link 'Business Applications', clickable, visible
							StaticText 'Business Applications'
						[292] button 'Edit Module Business Applications', clickable, visible
						[295] button 'Add Business Applications to favorites', clickable, visible
						StaticText ''
					[298] listitem '', visible
						[300] link 'Dashboards', clickable, visible
							StaticText 'Dashboards'
						[304] button 'Edit Module Dashboards', clickable, visible
						[307] button 'Add Dashboards to favorites', clickable, visible
						StaticText ''
					[310] listitem '', visible
						[312] link 'Service Catalog', clickable, visible
							StaticText 'Service Catalog'
						[317] button 'Edit Module Service Catalog', clickable, visible
						[320] button 'Add Service Catalog to favorites', clickable, visible
						StaticText ''
					[323] listitem '', visible
						[325] link 'Employee Center', clickable, visible
							StaticText 'Employee Center'
						[329] button 'Edit Module Employee Center', clickable, visible
						[332] button 'Add Employee Center to favorites', clickable, visible
						StaticText ''
					[335] listitem '', visible
						[337] link 'Knowledge', clickable, visible
							StaticText 'Knowledge'
						[342] button 'Edit Module Knowledge', clickable, visible
						[345] button 'Add Knowledge to favorites', clickable, visible
						StaticText ''
					[348] listitem '', visible
					[350] listitem '', visible
						[352] link 'Visual Task Boards', clickable, visible
							StaticText 'Visual Task Boards'
						[356] button 'Edit Module Visual Task Boards', clickable, visible
						[359] button 'Add Visual Task Boards to favorites', clickable, visible
						StaticText ''
					[362] listitem '', visible
						[364] link 'Incidents', clickable, visible
							StaticText 'Incidents'
						[369] button 'Edit Module Incidents', clickable, visible
						[372] button 'Add Incidents to favorites', clickable, visible
						StaticText ''
					[375] listitem '', visible
						[377] link 'Watched Incidents', clickable, visible
							StaticText 'Watched Incidents'
						[381] button 'Edit Module Watched Incidents', clickable, visible
						[384] button 'Add Watched Incidents to favorites', clickable, visible
						StaticText ''
					[387] listitem '', visible
						[389] link 'My Requests', clickable, visible
							StaticText 'My Requests'
						[393] button 'Edit Module My Requests', clickable, visible
						[396] button 'Add My Requests to favorites', clickable, visible
						StaticText ''
					[399] listitem '', visible
						[401] link 'Requested Items', clickable, visible
							StaticText 'Requested Items'
						[406] button 'Edit Module Requested Items', clickable, visible
						[409] button 'Add Requested Items to favorites', clickable, visible
						StaticText ''
					[412] listitem ''
						[414] link 'Watched Requested Items', clickable
							StaticText 'Watched Requested Items'
						[418] button 'Edit Module Watched Requested Items', clickable
						[421] button 'Add Watched Requested Items to favorites', clickable
						StaticText ''
					[424] listitem ''
					[426] listitem ''
						[428] link 'My Connected Apps', clickable
							StaticText 'My Connected Apps'
						[432] button 'Edit Module My Connected Apps', clickable
						[435] button 'Add My Connected Apps to favorites', clickable
						StaticText ''
					[438] listitem ''
						[440] link 'My Profile', clickable
							StaticText 'My Profile'
						[445] button 'Edit Module My Profile', clickable
						[448] button 'Add My Profile to favorites', clickable
						StaticText ''
					[451] listitem ''
						[453] link 'My Tagged Documents', clickable
							StaticText 'My Tagged Documents'
						[458] button 'Edit Module My Tagged Documents', clickable
						[461] button 'Add My Tagged Documents to favorites', clickable
						StaticText ''
					[464] listitem ''
						[466] link 'My Tags', clickable
							StaticText 'My Tags'
						[471] button 'Edit Module My Tags', clickable
						[474] button 'Add My Tags to favorites', clickable
						StaticText ''
					[477] listitem ''
						[479] link 'My Knowledge Articles', clickable
							StaticText 'My Knowledge Articles'
						[483] button 'Edit Module My Knowledge Articles', clickable
						[486] button 'Add My Knowledge Articles to favorites', clickable
						StaticText ''
					[489] listitem ''
						[491] link 'Take Survey', clickable
							StaticText 'Take Survey'
						[496] button 'Edit Module Take Survey', clickable
						[499] button 'Add Take Survey to favorites', clickable
						StaticText ''
					[502] listitem ''
						[504] link 'My Approvals', clickable
							StaticText 'My Approvals'
						[508] button 'Edit Module My Approvals', clickable
						[511] button 'Add My Approvals to favorites', clickable
						StaticText ''
					[514] listitem ''
						[516] link 'My Assessments & Surveys', clickable
							StaticText 'My Assessments & Surveys'
						[520] button 'Edit Module My Assessments & Surveys', clickable
						[523] button 'Add My Assessments & Surveys to favorites', clickable
						StaticText ''
					[526] listitem ''
						[528] link 'My Assets', clickable
							StaticText 'My Assets'
						[532] button 'Edit Module My Assets', clickable
						[535] button 'Add My Assets to favorites', clickable
						StaticText ''
					[538] listitem ''
						[540] link 'My Notification Preferences', clickable
							StaticText 'My Notification Preferences'
						[544] button 'Edit Module My Notification Preferences', clickable
						[547] button 'Add My Notification Preferences to favorites', clickable
						StaticText ''
				[552] button 'Access Analyzer', expanded=False
					StaticText 'Access Analyzer'
				[557] button 'Edit Application Access Analyzer', clickable
				[560] button 'Add Access Analyzer to favorites', clickable
				[566] button 'Activity Subscriptions', expanded=False
					StaticText 'Activity Subscriptions'
				[572] button 'Edit Application Activity Subscriptions', clickable
				[575] button 'Add Activity Subscriptions to favorites', clickable
				[581] button 'App Engine', expanded=False
					StaticText 'App Engine'
				[587] button 'Edit Application App Engine', clickable
				[590] button 'Add App Engine to favorites', clickable
				[596] button 'Availability', expanded=False
					StaticText 'Availability'
				[601] button 'Edit Application Availability', clickable
				[604] button 'Add Availability to favorites', clickable
				[610] button 'Benchmarks', expanded=False
					StaticText 'Benchmarks'
				[615] button 'Edit Application Benchmarks', clickable
				[618] button 'Add Benchmarks to favorites', clickable
				[624] button 'Business Calendar', expanded=False
					StaticText 'Business Calendar'
				[629] button 'Edit Application Business Calendar', clickable
				[632] button 'Add Business Calendar to favorites', clickable
				[638] button 'Certificate Based Authentication', expanded=False
					StaticText 'Certificate Based Authentication'
				[644] button 'Edit Application Certificate Based Authentication', clickable
				[647] button 'Add Certificate Based Authentication to favorites', clickable
				[653] button 'Content Taxonomy', expanded=False
					StaticText 'Content Taxonomy'
				[658] button 'Edit Application Content Taxonomy', clickable
				[661] button 'Add Content Taxonomy to favorites', clickable
				[667] button 'Conversational Interfaces', expanded=False
					StaticText 'Conversational Interfaces'
				[673] button 'Edit Application Conversational Interfaces', clickable
				[676] button 'Add Conversational Interfaces to favorites', clickable
				[682] button 'Diagram Builder', expanded=False
					StaticText 'Diagram Builder'
				[687] button 'Edit Application Diagram Builder', clickable
				[690] button 'Add Diagram Builder to favorites', clickable
				[696] button 'Docker', expanded=False
					StaticText 'Docker'
				[701] button 'Edit Application Docker', clickable
				[704] button 'Add Docker to favorites', clickable
				[710] button 'Docker Webhook Answer Subflow', expanded=False
					StaticText 'Docker Webhook Answer Subflow'
				[715] button 'Edit Application Docker Webhook Answer Subflow', clickable
				[718] button 'Add Docker Webhook Answer Subflow to favorites', clickable
		navigation 'Unpinned Favorites menu', live='polite', relevant='additions text'
		navigation 'Unpinned History menu', live='polite', relevant='additions text'
		navigation 'Unpinned Workspaces menu', live='polite', relevant='additions text'
		navigation 'Unpinned Admin menu', live='polite', relevant='additions text'
		navigation 'More menus', live='polite', relevant='additions text'
		[66] button 'My ServiceNow landing page', clickable, visible, describedby='logo-tooltip'
			[67] image 'ServiceNow Service Management', visible
		[217] button 'All', clickable, visible, expanded=True
		[79] button 'More menus', clickable, visible, expanded=False
		generic, describedby='title-tooltip'
			StaticText 'Shared admin dashboard'
			[92] button 'Create favorite for Shared admin dashboard', clickable, visible, live='polite', relevant='additions text', pressed='false'
		[224] search '', visible
			[225] button 'Search', clickable, visible, controls='sncwsgs-typeahead-input'
			[229] region '', live='polite', relevant='additions text'
				StaticText 'No exact match. Press Enter for full results.'
		[104] button 'Scope selectors', clickable, visible, expanded=False
		[236] button 'Sidebar discussions', clickable, visible, expanded=False
		[110] button 'Show help', clickable, visible, expanded=False
		[138] button 'Show notifications', clickable, visible, expanded=False
		[150] button 'Linda Miller: available', clickable, visible, expanded=False
			[153] image 'Linda Miller is Available', visible
				StaticText 'LM'
	[246] main 'Screen content', visible
		[247] Section '', visible
			[735] Section '', visible
				[737] Section '', visible
					[739] Section '', visible
						StaticText 'Welcome to Admin Home, Linda!'
						StaticText 'Manage, monitor, and discover all your day to day administrative actions and tools across the platform.'
			[747] Section ''
				[749] Section ''
					StaticText 'Track what’s important to you'
					[762] heading 'Shared admin dashboard', visible
					[763] button 'Change dashboard', clickable, visible
					[781] button 'Refresh dashboard', clickable, visible
					[783] button 'View dashboard details', clickable, visible
					[785] button 'Edit', clickable, visible
						StaticText 'Edit'
					[787] button 'More actions', clickable, visible, hasPopup='menu', expanded=False
					[794] progressbar 'Saving Dashboard', visible, valuemin=0, valuemax=100, valuetext=''
					[795] alert '', visible, live='assertive', atomic, relevant='additions text'
						[796] list 'Expanded alert list with 0 alerts.', visible
					[808] image 'Loading'
					[813] image 'Loading'
					[818] image 'Loading'
					[825] heading 'Hardening compliance score', visible
					[826] button 'More options', clickable, visible, hasPopup='menu', expanded=False
					button '85%\xa0 Trending downward \xa0 -1% (-0.7%)\xa0since Jul 17'
						StaticText '85%'
						StaticText ''
						image 'Trending downward'
						StaticText ''
						StaticText '-1% (-0.7%)'
						StaticText ''
						StaticText 'since Jul 17'
					[848] image 'Loading'
					[853] image 'Loading'
					[858] image 'Loading'
					[863] image 'Loading'
					[868] image 'Loading'
					[873] image 'Loading'
			[892] Section ''
			[894] Section ''
				[898] paragraph ''
					StaticText 'Tell us how we can make this page more useful'
				[899] button 'Share a suggestion', clickable
					StaticText 'Share a suggestion'
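
When triaging a trace like this, it can help to confirm that the bid targeted by the action is present in the AxTree and has the expected role. The snippet below is a minimal sketch using only the Python standard library; `parse_action` and `find_node` are hypothetical helper names (not part of BrowserGym), and the regex assumes the one-node-per-line text format shown above. In this trace the check finds bid 260 as a visible, focused textbox, so the target itself looks valid at observation time.

```python
import ast
import re

# Matches AxTree lines such as:
#   [260] textbox 'Enter search term to filter All menu', clickable, visible, focused
# Assumption: one node per line, bid in square brackets, name in single quotes.
AX_LINE = re.compile(
    r"^\s*\[(?P<bid>\w+)\]\s+(?P<role>\w+)\s+'(?P<name>.*?)'(?:,\s*(?P<props>.*))?$"
)


def parse_action(action: str):
    """Split an action string like "fill('260', 'System Mobile')" into (name, args)."""
    call = ast.parse(action, mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    return call.func.id, [ast.literal_eval(arg) for arg in call.args]


def find_node(axtree: str, bid: str):
    """Return role, name, and properties of the node with this bid, or None."""
    for line in axtree.splitlines():
        match = AX_LINE.match(line)
        if match and match.group("bid") == bid:
            raw_props = match.group("props") or ""
            props = [p.strip() for p in raw_props.split(",") if p.strip()]
            return {
                "bid": bid,
                "role": match.group("role"),
                "name": match.group("name"),
                "props": props,
            }
    return None


# Excerpt copied from the AxTree above.
axtree_excerpt = """
    [259] LabelText '', clickable
        StaticText 'Enter search term to filter All menu'
    [260] textbox 'Enter search term to filter All menu', clickable, visible, focused
"""

name, args = parse_action("fill('260', 'System Mobile')")
node = find_node(axtree_excerpt, args[0])
print(name, node["role"], node["props"])
# prints: fill textbox ['clickable', 'visible', 'focused']
```

A missing node or an unexpected role would point to a stale or wrong bid; a valid node, as here, points the investigation toward how the action is executed rather than what it targets.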

Screenshot before action:
Image

Screenshot after action:
Image

I have other examples where the fill action works, but nothing happens when the agent then clicks the desired link (the page does not change as expected).

I also noticed that this happens when I run multiple seeds; running the task once works fine.

Reproduction Steps

Run GenericAgent on the all-menu task with GPT-4.1-mini. I used the following code:

import logging

import numpy as np

from browsergym.experiments.benchmark import Benchmark
from browsergym.experiments.benchmark.utils import make_env_args_list_from_repeat_tasks
from browsergym.experiments.benchmark.metadata.utils import task_metadata
from browsergym.experiments.benchmark.configs import DEFAULT_HIGHLEVEL_ACTION_SET_ARGS
from agentlab.experiments.study import make_study, Study
from agentlab.llm.chat_api import AzureModelArgs
from agentlab.agents.generic_agent.generic_agent import GenericAgentArgs
from agentlab.agents.generic_agent.agent_configs import FLAGS_GPT_4o


GENERIC_AGENT_4_1_MINI = GenericAgentArgs(
    chat_model_args=AzureModelArgs(
        deployment_name="gpt-4.1-mini-2025-04-14",
        model_name="gpt-4.1-mini",
        max_new_tokens=16_384,
        max_input_tokens=128_000,
        max_total_tokens=128_000,
        vision_support=True,
        temperature=0.0,
    ),
    flags=FLAGS_GPT_4o,
)

benchmark = Benchmark(
    name="workarena_l1_tiny",
    high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["workarena"],
    is_multi_tab=False,
    supports_parallel_seeds=False,
    backends=["workarena"],
    env_args_list=make_env_args_list_from_repeat_tasks(
        task_list=[
            "workarena.servicenow.all-menu",
        ],
        max_steps=15,
        n_repeats=10,
        seeds_rng=np.random.RandomState(42),
    ),
    task_metadata=task_metadata("workarena"),
)

study = make_study(
    agent_args=[GENERIC_AGENT_4_1_MINI],
    benchmark=benchmark,
    logging_level_stdout=logging.WARNING,
    ignore_dependencies=True,
)

study.run(
    n_jobs=10,
    parallel_backend="ray",
    strict_reproducibility=False,
    n_relaunch=3,
)

Relevant Logs

Additional Context

No response
