arXiv:2307.13854v1 [cs.AI] 25 Jul 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou♠∗ Frank F. Xu♠∗

Hao Zhu♠† Xuhui Zhou♠† Robert Lo♠† Abishek Sridhar♠†

Xianyi Cheng♠ Yonatan Bisk♠ Daniel Fried♠ Uri Alon♠ Graham Neubig♠♣

♠Carnegie Mellon University ♣Inspired Cognition

{shuyanzh,fangzhex,gneubig}@

Abstract

With generative AI advances, the exciting potential for autonomous agents to manage daily tasks via natural language commands has emerged. However, current agents are primarily created and tested in simplified synthetic environments, substantially limiting real-world scenario representation. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 10.59%. These results highlight the need for further development of robust agents, show that current state-of-the-art LMs are far from perfect performance on these real-life tasks, and indicate that WebArena can be used to measure such progress.

Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.

1 Introduction

Autonomous agents that could perform everyday tasks via human natural language commands could significantly augment human capabilities, given their potential to enhance efficiency and promote broader accessibility. Nonetheless, to fully leverage the power of these autonomous agents, it is crucial to understand their behavior within an environment that is both authentic and reproducible. Such an environment makes it possible to measure the ability of agents on real tasks that human users care about and to evaluate them in a fair and consistent manner.

Current environments tend to over-simplify real-world situations. As a result, the functionality of many environments is a limited version of their real-world counterparts, leading to a lack of task diversity within the environment [27, 1, 10, 23, 29, 30, 37]. In addition, these simplifications often lower the complexity of tasks as compared to their execution in the real world [25, 29, 37]. Finally, some environments are presented as a static resource [27, 7] where agents are confined to accessing only those states that were previously cached during data collection, thus limiting the breadth and diversity of exploration. On the evaluation aspect, many environments focus on comparing the surface form of the predicted action sequences with reference action sequences, disregarding the functional correctness of the executions and possible alternative solutions [25, 12, 35, 19, 7]. These limitations often result in a discrepancy between simulated environments and the real world. Such constraints can potentially impact the generalizability of AI agents to successfully understand, adapt, and operate within complex real-world situations.

∗Lead contributors.

†Equal contribution.

Preprint. Work in progress.

[Figure 1 image: an agent receives natural language intents (“Tell me how much I spent on food purchase in March 2023”; “Create a ‘Nolan Fans’ repo, listing Nolan's Oscar-winning films in a README file”) and issues actions to WebArena, which consists of self-hosted fully functional web applications (including a CMS), a toolbox, and knowledge resources. The environment returns feedback, and the checks check_repo, check_readme, and check_answer mark an execution as a functional success or a functional failure.]

Figure 1: WebArena is a standalone, self-hostable web environment for building autonomous agents. WebArena creates websites from four popular categories with functionality and data mimicking their real-world equivalents. To emulate human problem-solving, WebArena also embeds tools and knowledge resources as independent websites. WebArena introduces a benchmark on interpreting high-level realistic natural language commands into concrete web-based interactions. We provide annotated programs designed to programmatically validate the functional correctness of each task.

In this work, we introduce WebArena, a realistic and reproducible web environment designed to facilitate the development of autonomous agents capable of executing tasks (§2). An overview of WebArena is in Figure 1. Our environment comprises four fully operational, self-hosted web applications, each representing a distinctive domain prevalent on the internet: online shopping, discussion forums, collaborative development, and business content management. Furthermore, WebArena incorporates several utility tools, such as a map, calculator, and scratchpad, to best support possible human-like task executions. Lastly, WebArena is complemented by an extensive collection of documentation and knowledge bases that vary from general resources like English Wikipedia to more domain-specific references, such as manuals for using the integrated development tool [8]. The content populating these websites is extracted from their real-world counterparts, preserving the authenticity of the content served on each platform. We deliver the hosting services using Docker containers with gym-APIs [4], ensuring both the usability and the reproducibility of WebArena.

Along with WebArena, we release a ready-to-use benchmark with 812 long-horizon web-based tasks (§3). Each task is described as a high-level natural language intent, emulating the abstract language usage patterns typically employed by humans [2]. Two example intents are shown in the upper left of Figure 1. We focus on evaluating the functional correctness of these tasks, i.e., does the result of the execution actually achieve the desired goal (§3.3). For instance, to evaluate the example in Figure 2, our evaluation method verifies the concrete contents in the designated repository. This evaluation is not only more reliable [40, 5, 33] than comparing the vanilla action sequences [25, 7] but also accommodates a range of potential valid paths to achieve the same goal, which is a ubiquitous phenomenon in sufficiently complex tasks.

We use this benchmark to evaluate several agents that can follow NL commands and perform web-based tasks (§4). These agents are implemented with a variety of approaches, from those predicting the subsequent action directly based on current observations and history to more sophisticated agents using step-by-step reasoning [34, 39]. These agents are implemented in a few-shot in-context learning fashion with powerful large language models (LLMs) such as GPT-3.5 and GPT-4. Experiment results show that the best GPT-4 agent performance is limited, with an end-to-end task success rate of 10.59%. We hypothesize that the limited performance of current LLMs stems from a lack of crucial capabilities such as active exploration and failure recovery to successfully perform complex tasks (§5.2). These outcomes underscore the necessity for further development towards robust and effective agents [18] in WebArena.

[Figure 2 image: the intent “Create an efficient itinerary to visit all Pittsburgh's art museums with minimal driving distance starting from CMU. Log the order in my ‘awesome-northeast-us-travel’ repository” is shown with three steps: search for museums in Pittsburgh; search for each art museum on the Map; record the optimized results to the repo.]

Figure 2: A high-level task that can be fully executed in WebArena. Completing such tasks requires sophisticated, long-term planning and reasoning capability. To accomplish the goal stated at the top, an agent needs to find out what art museums are located in Pittsburgh by searching Wikipedia. Next, it should identify the location of each museum on a map, optimizing the itinerary based on the information collected. Finally, the agent needs to update the README file in the appropriate repository with the planned route.

2 WebArena: Websites as an Environment for Autonomous Agents

Our goal is to create a realistic and reproducible web environment. We achieve reproducibility by having the environment be standalone, not relying on live websites. This circumvents technical challenges such as bots being subject to CAPTCHAs, unpredictable content modifications, and configuration changes, which obstruct a fair comparison across different systems over time. We achieve realism by using open-source libraries that underlie many in-use sites from several popular categories and importing data to our environment from their real-world counterparts.

2.1 Website Selection

In order to decide which categories of websites to use, we first conducted an analysis of approximately 200 examples from the authors’ actual web browser histories. Each author delved into their browsing histories, summarizing the goal of particular segments of their browser session. Based on this, we classified the visited websites into abstract categories. We then identified the four most salient categories and implemented one instance per category based on this analysis: (1) E-commerce platforms supporting online shopping activities (e.g., Amazon, eBay), (2) social forum platforms for opinion exchanges (e.g., Reddit, StackExchange), (3) collaborative development platforms for software development (e.g., GitLab), and (4) content management systems (CMS) that manage the creation and revision of digital content (e.g., online store management).

In addition to these platforms, we selected three utility-style tools that are frequently used in web-based tasks: (1) a map for navigation and searching for information about points of interest (POIs) such as institutions or locations, (2) a calculator, and (3) a scratchpad for taking notes.

Recognizing the critical role of information-seeking and knowledge acquisition in web-based tasks, we also incorporated various knowledge resources into our environment. These resources range from general information repositories, such as the English Wikipedia, to more specialized knowledge bases, such as the website user manuals.

Implementation We leveraged open-source libraries relevant to each category to build our own versions of an E-commerce website (OneStopShop), GitLab, Reddit, an online store content management system (CMS), a map, and an English Wikipedia. Then we imported data from their real-world counterparts via a sampling method. As an example, our version of GitLab was developed based on the actual GitLab project.³ We carefully emulated the features of a typical code repository by including both popular projects with many issues and pull requests and smaller, personal projects. Details of all websites in WebArena can be found in Appendix A.1. We deliver the environment as Docker containers and provide scripts to reset the environment to a deterministic initial state. These details are in Appendix A.2.

[Figure 3 image: three renderings of the same product listing page. Left: screenshot. Middle: trimmed HTML DOM tree containing the product image link, the product link “Outdoor Patio…”, a “Rating:” span with the value 82%, and a link to 12 reviews. Right: trimmed accessibility tree with the following nodes:
[4] RootWebArea 'Patio, Lawn ..'
[1543] link 'Image'
[1547] img 'Image'
[1552] link 'Outdoor Patio ..'
[1549] LayoutTable ''
[1559] StaticText 'Rating:'
[1557] generic '82%'
[1567] link '12 Reviews'
[1574] StaticText '$49.99'
[1582] button 'Add to Cart' focusable: True
[1585] button 'Wish List' focusable: …
[1586] button 'Compare' focusable: …]

Figure 3: We design the observation to be the URL and the content of a web page, with options to represent the content as a screenshot (left), HTML DOM tree (middle), and accessibility tree (right). The content of the middle and right figures is trimmed to save space.

2.2 Observation Space

We design the observation space to roughly mimic the web browser experience: a web page URL, the opened tabs, and the web page content of the focused tab. WebArena is the first web environment to consider multi-tab web-based tasks to promote tool usage, direct comparisons and references across tabs, and other functionalities. The multi-tab functionality offers a more authentic replication of human web browsing habits compared to maintaining everything in a single tab. We provide flexible configuration to render the page content in several modes (see Figure 3 for an example): (1) the raw web page HTML, composed of a Document Object Model (DOM) tree, as commonly used in past work [27, 7, 19]; (2) the screenshot, a pixel-based representation that represents the current web page as an RGB array; and (3) the accessibility tree of the web page.⁴ The accessibility tree is a subset of the DOM tree with elements that are relevant and useful for displaying the contents of a web page. Every element is represented as its role (e.g., a link), its text content, and its properties (e.g., whether it is focusable). Accessibility trees largely retain the structured information of a web page while being more compact than the DOM representation. We provide an option to limit the content to the contents within a viewport for all modes. This ensures that the observation can be input into a text-based model with limited context length or an image-based model with image size or resolution requirements.
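As a rough illustration of the accessibility-tree observation described above, the following Python sketch renders a few nodes in the `[id] role 'text' properties` style of Figure 3 and optionally keeps only the nodes that fall inside a viewport. The `AXNode` class, its fields, the pixel coordinates, and the `render` function are hypothetical names and values chosen for this example; they are not WebArena's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class AXNode:
    """One accessibility-tree element: role, text content, and properties."""
    node_id: int
    role: str
    text: str
    y: int  # vertical position in pixels (assumed, used only for the viewport filter)
    properties: dict = field(default_factory=dict)


def render(nodes, viewport=None):
    """Render nodes as text lines; keep only those inside viewport=(top, bottom)."""
    lines = []
    for node in nodes:
        if viewport is not None and not (viewport[0] <= node.y < viewport[1]):
            continue  # element lies outside the visible region
        props = "".join(f" {key}: {value}" for key, value in node.properties.items())
        lines.append(f"[{node.node_id}] {node.role} '{node.text}'{props}")
    return "\n".join(lines)


# Illustrative nodes modeled on Figure 3; the y coordinates are made up.
nodes = [
    AXNode(1567, "link", "12 Reviews", y=300),
    AXNode(1574, "StaticText", "$49.99", y=340),
    AXNode(1582, "button", "Add to Cart", y=380, properties={"focusable": True}),
    AXNode(1586, "button", "Compare", y=900, properties={"focusable": True}),
]

print(render(nodes, viewport=(0, 720)))
```

With a 720-pixel viewport, the last node is dropped, which shortens the text passed to a model with a limited context length.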

2.3 Action Space

Following previous work on navigation and operation in web and embodied environments [21, 8], we design a compound action space which emulates the keyboard and mouse operations available on web pages. Table 1 lists all the available actions categorized into three distinct groups. The first category includes elemental operations such as clicking, hovering, typing, and key combination pressing. The second comprises tab management actions such as opening, closing, and switching between tabs. The third category consists of URL navigation actions, such as visiting a specific URL or navigating forward and backward in the browsing history.

WebArena allows agents to refer to elements on web pages with different approaches. Elements can be selected either by their on-screen coordinates, represented as (x, y), or by a unique element ID.

³ /gitlab-org/gitlab

⁴ /en-US/docs/Glossary/Accessibility_tree

Action Type             Description
noop                    Do nothing
click(element)          Click at an element
hover(element)          Hover on the element
type(element, text)     Input the text to the element
keypress(keycomb)       Press the key combination (e.g., ctrl+v)
tabfocus(pagenumber)    Bring the open tab with page number to the front
newtab                  Open a new tab
tabclose                Close the current page and focus the last open page
goback                  Visit the last URL visited
goforward               Undo goback operation
goto(URL)               Go to a URL in the current page

Table 1: Action Space of WebArena

The element ID is an artifact generated when traversing the Document Object Model (DOM) tree or the accessibility tree. With element IDs, the element selection is transformed into an n-way classification problem, thereby eliminating any disambiguation efforts required from the agent or the underlying implementation. An accessibility tree with the assigned element IDs is presented in Figure 3. For example, when the agent issues the action click [1582], where [1582] is the unique ID of the “Add to Cart” element, the underlying implementation of WebArena performs the click and the web page updates accordingly. This flexibility in element selection enables WebArena to support agents designed in various ways (e.g., accepting input from different modalities) without compromising fair comparison metrics such as the number of steps taken.
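To make the ID-based selection concrete, here is a minimal Python sketch that parses a textual action such as `click [1582]` and resolves the ID against the elements of the current page. The string format for actions with a text argument, the `Action` tuple, and the `parse_action` and `resolve` helpers are assumptions made for illustration, not the interface shipped with WebArena.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    action_type: str            # e.g. "click", "hover", "type"
    element_id: Optional[int]   # unique element ID, if the action targets one
    text: Optional[str]         # text argument, e.g. for "type"


# Hypothetical textual format: "click [1582]" or "type [1582] [hello]".
_PATTERN = re.compile(r"^(\w+)(?:\s+\[(\d+)\])?(?:\s+\[(.*)\])?$")


def parse_action(raw: str) -> Action:
    """Parse an action string into its type, element ID, and text argument."""
    match = _PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {raw!r}")
    action_type, element_id, text = match.groups()
    return Action(action_type, int(element_id) if element_id else None, text)


def resolve(action: Action, elements: dict) -> str:
    """Selecting by ID is a lookup among the n elements on the page."""
    if action.element_id not in elements:
        raise KeyError(f"no element with ID {action.element_id}")
    return elements[action.element_id]


# Element IDs taken from the accessibility tree in Figure 3.
elements = {1582: "button 'Add to Cart'", 1585: "button 'Wish List'"}
action = parse_action("click [1582]")
print(action.action_type, "->", resolve(action, elements))
```

Because the target is chosen from a fixed set of IDs, the agent never has to describe the element in words or guess pixel coordinates.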

2.4 User Roles Simulation

Users of the same website often have disparate experiences due to their distinct roles, permissions, and interaction histories. For instance, within an E-commerce CMS, a shop owner might possess full read and write permissions across all content, whereas an employee might only be granted write permissions for products but not for customer data. We aim to emulate this scenario by generating unique user profiles on each platform.

On the shopping site, we created a customer profile that has over 35 orders within a span of two years. On GitLab, we selected a user who maintains several popular open-source projects with numerous merge requests and issues. This user also manages a handful of personal projects privately. On Reddit, our chosen profile was a user who actively participates in discussions, with many posts and comments. Lastly, on our E-commerce CMS, we set up a user profile for a shop owner who has full read-and-write access to all system contents.

All users are automatically logged into their accounts using a pre-cached cookie. To the best of our knowledge, this is the first publicly available agent evaluation environment to implement such a characteristic. Existing literature typically operates under the assumption of universally identical user roles [27, 21, 7].

3 Benchmark Suite of Web-based Tasks

We provide a benchmark with 812 test examples on grounding high-level natural language instructions to interactions in WebArena. Each example comes with a metric to evaluate the functional correctness of the task execution. In this section, we first formally define the task of controlling an autonomous agent through natural language. Then we introduce the annotation process of our benchmark.

3.1 Controlling Agents through High-level Natural Language

The WebArena environment is denoted as $E$ with state space $S$, action space $A$ (§2.3), and observation space $O$ (§2.2). The transition function $T: S \times A \to S$ is deterministic, and it is defined by the underlying implementation of each website in the environment. Performing a task described by a natural language intent $i$ can be formulated as a partially observable Markov decision process (POMDP): at each time step $t$, an agent issues an action $a_t$ given the partial observation $o_t$. Consequently, the action results in a new state $s_{t+1}$ and its corresponding observation $o_{t+1}$. We propose a reward function $r(\mathbf{a}, \mathbf{s})$ to measure the success of a task execution, where $\mathbf{a}$ represents the sequence of actions and $\mathbf{s}$ denotes all intermediate states. This reward function assesses if state transitions align with the expectations of the intents. For example, with an intent to place an order, it verifies whether an order has been placed. Additionally, it evaluates the accuracy of the agent’s actions, such as checking the correctness of the predicted answer.

[Figure 4 image: pie chart with slices labeled Map, CMS, E-commerce, Cross Site, Gitlab, and Reddit; the values shown are 13.4%, 22.4%, 23.0%, 5.9%, 13.1%, and 22.2%.]

Figure 4: The intent distribution across different websites. Cross-site intents necessitate interacting with multiple websites. Notably, regardless of the website, all user intents require interactions with multiple web pages.

Category               Example
Information Seeking    When was the last time I bought shampoo
                       Compare walking and driving time from AMC Waterfront to Randyland
Site Navigation        Checkout merge requests assigned to me
                       Show me the ergonomic chair with the best rating
Content & Config       Post to ask “whether I need a car in NYC”
                       Delete the reviews from the scammer Yoke

Figure 5: Example intents from three categories.
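The POMDP formulation of §3.1 can be illustrated with a toy example. The Python sketch below rolls out a scripted policy in a made-up shop environment with a deterministic transition function, a partial observation, and a reward that checks the resulting states instead of comparing action sequences. Everything here (the tuple-of-orders state, the `order:` action syntax, and all function names) is a hypothetical stand-in, not the WebArena implementation.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # toy state: the items ordered so far


def transition(state: State, action: str) -> State:
    """Deterministic T: S x A -> S for a toy shop (hypothetical)."""
    if action.startswith("order:"):
        return state + (action.split(":", 1)[1],)
    return state  # every other action leaves the state unchanged


def observe(state: State) -> str:
    """Partial observation o_t: the agent only sees how many orders exist."""
    return f"{len(state)} order(s) placed"


def reward(actions: List[str], states: List[State], item: str) -> float:
    """r(a, s): 1.0 if the final state shows the intended order was placed."""
    return 1.0 if item in states[-1] else 0.0


def run_episode(policy: Callable[[str, int], str], s0: State, max_steps: int):
    """Roll out a policy and return the action and state sequences."""
    actions, states = [], [s0]
    for t in range(max_steps):
        a_t = policy(observe(states[-1]), t)  # act on the partial observation
        states.append(transition(states[-1], a_t))
        actions.append(a_t)
    return actions, states


def scripted_policy(observation: str, t: int) -> str:
    """A fixed plan standing in for an LLM-based agent."""
    return ["noop", "order:shampoo"][t] if t < 2 else "noop"


actions, states = run_episode(scripted_policy, s0=(), max_steps=3)
print(reward(actions, states, item="shampoo"))
```

Any policy that ends with the order in place receives the same reward, regardless of which actions it took to get there.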

3.2 Intent Collection

We focus on curating realistic intents to carry out complex and creative tasks within WebArena. To start with, our annotators were guided to spend a few minutes exploring the websites to familiarize themselves with the websites’ content and functionalities. As most of our websites are virtually identical to their open-web counterparts, despite having sampled data, most annotators can quickly comprehend the websites.

Next, we instructed the annotators to formulate intents based on the following criteria:

(1) The intent should be abstract and high-level, implying that the task cannot be fulfilled with merely one or two actions. As an example, instead of “click the science subreddit”, we encouraged annotators to come up with something more complex like “post a greeting message on science subreddit”, which involves performing multiple actions.

(2) The intent should be creative. Common tasks such as account creation can be easily thought of. We encouraged the annotators to add constraints (e.g., “create a Reddit account identical to my GitLab one”) to make the intents more unique.

(3) The intent should be formulated as a template by turning replaceable elements into variables. The annotators were also responsible for developing several instantiations for each variable. For example, the intent “create a Reddit account identical to my GitLab one” can be converted into “create a {{site1}} account identical to my {{site2}} one”, with an instantiation like “{site1: Reddit, site2: GitLab}” and another like “{site1: GitLab, site2: OneStopShopping}”. Notably, tasks derived from the same template can have distinct execution traces. The similarity resides primarily in the high-level semantics rather than the specific implementation.
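The template mechanism in criterion (3) can be sketched in a few lines of Python. The template string and the two instantiations come from the text above; the `instantiate` helper and its regular expression are assumptions made for this example and may differ from how the benchmark files are actually generated.

```python
import re


def instantiate(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with its value from `values`."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for variable {name!r}")
        return values[name]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


template = "create a {{site1}} account identical to my {{site2}} one"
instantiations = [
    {"site1": "Reddit", "site2": "GitLab"},
    {"site1": "GitLab", "site2": "OneStopShopping"},
]

for values in instantiations:
    print(instantiate(template, values))
```

Both intents share the same high-level semantics, yet carrying them out requires different websites and therefore different execution traces.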

We additionally provided a prompt that annotators could give to ChatGPT⁵ for inspiration. This prompt contains an overview of each website and instructs the model to describe potential tasks to be performed on these sites. Furthermore, we offered a curated list of examples for annotators to reference.

⁵ /

Intent Analysis In total, we curated 241 templates and 812 instantiated intents. On average, each template is instantiated to 3.3 examples. The intent distribution is shown in Figure 4. Furthermore, we classify the intents into three primary categories with examples shown in Figure 5:

(1) Information-seeking tasks: These are tasks where a textual response is expected. Importantly, the information-seeking tasks in WebArena often require navigation across multiple pages or focus on user-centric content. This makes them distinct from open-domain question-answering tasks [36, 17], which focus on querying general knowledge with a simple retrieval step. For instance, to answer “When was the last time I bought the shampoo”, an agent must traverse to the user’s purchase history, checking individual order details to identify the most recent shampoo purchase.

(2) Site navigation tasks: This category is composed of tasks that require navigating through web pages using a variety of interactive elements such as search functions and links. The objective is often to locate specific information or navigate to a particular section of a site.

(3) Content and configuration operation tasks: This category encapsulates tasks that require operating in the web environment to create, revise, or configure content or settings. This includes adjusting settings, managing accounts, performing online transactions, generating new web content, and modifying existing content. Examples range from updating a social media status or README file to conducting online purchases and configuring privacy settings.

3.3 Evaluation Annotation

Evaluating Information Seeking Tasks To measure the correctness of information-seeking tasks where a textual answer is expected, we provide the annotated answer $a^*$ for each intent. The $a^*$ is further compared with the predicted answer $\hat{a}$ with one of the following scoring functions $r_{\text{info}}(\hat{a}, a^*)$.

First, we define exact_match where only $\hat{a}$ that is identical with $a^*$ will receive a score of one. This function is primarily applicable to those intent types whose responses follow a more standardized format, similar to the evaluation on question answering literature [26, 36].

Second, we create must_include where any $\hat{a}$ containing $a^*$ receives a score of one. This function is primarily used in scenarios where an unordered list of text is expected or where the emphasis of evaluation is on certain key concepts. In the second example in Table 2, we expect both the correct name and the email address to be presented, irrespective of the precise wording used to convey the answer.
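The two scoring functions defined so far can be sketched as follows. This is a minimal Python reading of the definitions above, assuming plain string comparison with no normalization and treating the annotated answer for must_include as a list of required phrases; the name and email in the usage lines are invented placeholders, not benchmark data.

```python
def exact_match(predicted: str, reference: str) -> float:
    """Score 1.0 only if the prediction is identical to the reference."""
    return 1.0 if predicted == reference else 0.0


def must_include(predicted: str, required: list) -> float:
    """Score 1.0 if every required phrase appears in the prediction."""
    return 1.0 if all(phrase in predicted for phrase in required) else 0.0


# Hypothetical prediction; the wording around the key facts does not matter.
answer = "The customer is Jane Doe, reachable at jane@example.com."

print(exact_match("2", "2"))
print(must_include(answer, ["Jane Doe", "jane@example.com"]))
print(must_include(answer, ["John Doe"]))
```

must_include tolerates extra wording around the key facts, whereas exact_match suits answers with a standardized format such as a number or a date.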

Finally, we intro
