arXiv:2307.13854v1 [cs.AI] 25 Jul 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou♠∗ Frank F. Xu♠∗
Hao Zhu♠† Xuhui Zhou♠† Robert Lo♠† Abishek Sridhar♠†
Xianyi Cheng♠ Yonatan Bisk♠ Daniel Fried♠ Uri Alon♠ Graham Neubig♠♣
♠Carnegie Mellon University ♣Inspired Cognition
{shuyanzh,fangzhex,gneubig}@
Abstract
With generative AI advances, the exciting potential for autonomous agents to manage daily tasks via natural language commands has emerged. However, current agents are primarily created and tested in simplified synthetic environments, substantially limiting real-world scenario representation. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 10.59%. These results highlight the need for further development of robust agents, that current state-of-the-art LMs are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.
Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.
1 Introduction
Autonomous agents that could perform everyday tasks via human natural language commands could significantly augment human capabilities, given their potential to enhance efficiency and promote broader accessibility. Nonetheless, to fully leverage the power of these autonomous agents, it is crucial to understand their behavior within an environment that is both authentic and reproducible. This will measure the ability of agents on real tasks that human users care about and allow them to be evaluated in a fair and consistent manner.
Current environments tend to over-simplify real-world situations. As a result, the functionality of many environments is a limited version of their real-world counterparts, leading to a lack of task diversity within the environment [27, 1, 10, 23, 29, 30, 37]. In addition, these simplifications often lower the complexity of tasks as compared to their execution in the real world [25, 29, 37]. Finally, some environments are presented as a static resource [27, 7] where agents are confined to accessing only those states that were previously cached during data collection, thus limiting the breadth and diversity of exploration. On the evaluation aspect, many environments focus on comparing the surface form of the predicted action sequences with reference action sequences, disregarding the functional correctness of the executions and possible alternative solutions [25, 12, 35, 19, 7]. These limitations often result in a discrepancy between simulated environments and the real world. Such constraints can potentially impact the generalizability of AI agents to successfully understand, adapt, and operate within complex real-world situations.
∗Lead contributors. †Equal contribution.
Preprint. Work in progress.
[Figure 1 image. Example intents (upper left): “Tell me how much I spent on food purchase in March 2023” and “Create a 'Nolan Fans' repo, listing Nolan's Oscar-winning films in a README file”. Panel labels: Agent, Action, Feedback (check_repo, check_readme, check_answer; Functional Success, Functional Failure), Self-hosted fully functional web applications, CMS, Toolbox, Knowledge resources, WebArena.]
Figure 1: WebArena is a standalone, self-hostable web environment for building autonomous agents. WebArena creates websites from four popular categories with functionality and data mimicking their real-world equivalents. To emulate human problem-solving, WebArena also embeds tools and knowledge resources as independent websites. WebArena introduces a benchmark on interpreting high-level realistic natural language command to concrete web-based interactions. We provide annotated programs designed to programmatically validate the functional correctness of each task.
In this work, we introduce WebArena, a realistic and reproducible web environment designed to facilitate the development of autonomous agents capable of executing tasks (§2). An overview of WebArena is in Figure 1. Our environment comprises four fully operational, self-hosted web applications, each representing a distinctive domain prevalent on the internet: online shopping, discussion forums, collaborative development, and business content management. Furthermore, WebArena incorporates several utility tools, such as a map, a calculator, and a scratchpad, to best support possible human-like task executions. Lastly, WebArena is complemented by an extensive collection of documentation and knowledge bases that vary from general resources like English Wikipedia to more domain-specific references, such as manuals for using the integrated development tool [8]. The content populating these websites is extracted from their real-world counterparts, preserving the authenticity of the content served on each platform. We deliver the hosting services using Docker containers with gym-APIs [4], ensuring both the usability and the reproducibility of WebArena.
Along with WebArena, we release a ready-to-use benchmark with 812 long-horizon web-based tasks (§3). Each task is described as a high-level natural language intent, emulating the abstract language usage patterns typically employed by humans [2]. Two example intents are shown in the upper left of Figure 1. We focus on evaluating the functional correctness of these tasks, i.e., does the result of the execution actually achieve the desired goal (§3.3). For instance, to evaluate the example in Figure 2, our evaluation method verifies the concrete contents in the designated repository. This evaluation is not only more reliable [40, 5, 33] than comparing the vanilla action sequences [25, 7] but also accommodates a range of potential valid paths to achieve the same goal, which is a ubiquitous phenomenon in sufficiently complex tasks.
We use this benchmark to evaluate several agents that can follow NL commands and perform web-based tasks (§4). These agents are implemented with a variety of approaches, from those predicting the subsequent action directly based on current observations and history to more sophisticated agents using step-by-step reasoning [34, 39]. These agents are implemented in a few-shot in-context learning fashion with powerful large language models (LLMs) such as GPT-3.5 and GPT-4. Experiment results show that the best GPT-4 agent performance is somewhat limited, with an end-to-end task success rate of 10.59%. We hypothesize that the limited performance of current LLMs stems from a lack of crucial capabilities such as active exploration and failure recovery to successfully perform complex tasks (§5.2). These outcomes underscore the necessity for further development towards robust and effective agents [18] in WebArena.
[Figure 2 image. Intent: “Create an efficient itinerary to visit all Pittsburgh's art museums with minimal driving distance starting from CMU. Log the order in my “awesome-northeast-us-travel” repository”. Steps shown: Search for museums in Pittsburgh; Search for each art museum on the Map; Record the optimized results to the repo.]
Figure 2: A high-level task that can be fully executed in WebArena. Completing such tasks requires sophisticated, long-term planning and reasoning capability. To accomplish the goal stated at the top, an agent needs to find out what art museums are located in Pittsburgh by searching Wikipedia. Next, it should identify the location of each museum on a map, optimizing the itinerary based on the information collected. Finally, the agent needs to update the README file in the appropriate repository with the planned route.
2 WebArena: Web Sites as an Environment for Autonomous Agents
Our goal is to create a realistic and reproducible web environment. We achieve reproducibility by having the environment be standalone, not relying on live websites. This circumvents technical challenges such as bots being subject to CAPTCHAs, unpredictable content modifications, and configuration changes, which obstruct a fair comparison across different systems over time. We achieve realism by using open-source libraries that underlie many in-use sites from several popular categories and importing data to our environment from their real-world counterparts.
2.1 Website Selection
In order to decide which categories of website to use, we first conducted an analysis of approximately 200 examples from the authors’ actual web browser histories. Each author delved into their browsing histories, summarizing the goal of particular segments of their browser session. Based on this, we classified the visited websites into abstract categories. We then identified the four most salient categories and implemented one instance per category based on this analysis: (1) E-commerce platforms supporting online shopping activities (e.g., Amazon, eBay), (2) social forum platforms for opinion exchanges (e.g., Reddit, StackExchange), (3) collaborative development platforms for software development (e.g., GitLab), and (4) content management systems (CMS) that manage the creation and revision of the digital content (e.g., online store management).
In addition to these platforms, we selected three utility-style tools that are frequently used in web-based tasks: (1) a map for navigation and searching for information about points of interest (POIs) such as institutions or locations, (2) a calculator, and (3) a scratchpad for taking notes.
Recognizing the critical role of information-seeking and knowledge acquisition in web-based tasks, we also incorporated various knowledge resources into our environment. These resources range from general information repositories, such as the English Wikipedia, to more specialized knowledge bases, such as the website user manuals.
Implementation. We leveraged open-source libraries relevant to each category to build our own versions of an E-commerce website (OneStopShop), GitLab, Reddit, an online store content management system (CMS), a map, and an English Wikipedia. Then we imported data from their real-world counterparts via a sampling method. As an example, our version of GitLab was developed based on the actual GitLab project.³ We carefully emulated the features of a typical code repository by including both popular projects with many issues and pull requests and smaller, personal projects. Details of all websites in WebArena can be found in Appendix A.1. We deliver the environment as Docker containers and provide scripts to reset the environment to a deterministic initial state. These details are in Appendix A.2.
[Figure 3 image: three renderings of the same product listing. Middle panel (HTML, trimmed): nested <div>, <a href="...">, <img src="...">, and <span> elements showing “Outdoor Patio…”, “Rating:”, “82%”, and “12 Reviews”. Right panel (accessibility tree, trimmed): [4] RootWebArea 'Patio, Lawn..'; [1543] link 'Image'; [1547] img 'Image'; [1549] LayoutTable ''; [1552] link 'Outdoor Patio..'; [1559] StaticText 'Rating:'; [1557] generic '82%'; [1567] link '12 Reviews'; [1574] StaticText '$49.99'; [1582] button 'Add to Cart' focusable: True; [1585] button 'Wish List' focusable: …; [1586] button 'Compare' focusable: …]
Figure 3: We design the observation to be the URL and the content of a web page, with options to represent the content as a screenshot (left), HTML DOM tree (middle) and accessibility tree (right). The content of the middle and right figures are trimmed to save space.
2.2 Observation Space
We design the observation space to roughly mimic the web browser experience: a web page URL, the opened tabs, and the web page content of the focused tab. WebArena is the first web environment to consider multi-tab web-based tasks to promote tool usage, direct comparisons and references across tabs, and other functionalities. The multi-tab functionality offers a more authentic replication of human web browsing habits compared to maintaining everything in a single tab. We provide flexible configuration to render the page content in many modes (see Figure 3 for an example): (1) the raw web page HTML, composed of a Document Object Model (DOM) tree, as commonly used in past work [27, 7, 19]; (2) the screenshot, a pixel-based representation that represents the current web page as an RGB array; and (3) the accessibility tree of the web page.⁴ The accessibility tree is a subset of the DOM tree with elements that are relevant and useful for displaying the contents of a web page. Every element is represented as its role (e.g., a link), its text content, and its properties (e.g., whether it is focusable). Accessibility trees largely retain the structured information of a web page while being more compact than the DOM representation. We provide an option to limit the content to the contents within a viewport for all modes. This ensures that the observation can be input into a text-based model with limited context length or an image-based model with image size or resolution requirements.
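The textual accessibility-tree observation described in this section can be illustrated with a minimal sketch. It is not WebArena's implementation: the `AXNode` class, its field names, the `in_viewport` flag, and the `render` function are hypothetical, and the node values are borrowed from the Figure 3 excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class AXNode:
    """Hypothetical accessibility-tree node: element ID, role, text, properties."""
    node_id: int
    role: str
    text: str
    properties: dict = field(default_factory=dict)
    in_viewport: bool = True  # assumed flag; a real browser would compute this


def render(nodes, viewport_only=False):
    """Render one line per node, e.g. "[1582] button 'Add to Cart' focusable: True"."""
    lines = []
    for node in nodes:
        if viewport_only and not node.in_viewport:
            continue  # drop off-screen elements to shorten the observation
        props = "".join(f" {key}: {value}" for key, value in node.properties.items())
        lines.append(f"[{node.node_id}] {node.role} {node.text!r}{props}")
    return "\n".join(lines)


nodes = [
    AXNode(1567, "link", "12 Reviews"),
    AXNode(1582, "button", "Add to Cart", {"focusable": True}),
    AXNode(1586, "button", "Compare", {"focusable": True}, in_viewport=False),
]
observation = render(nodes, viewport_only=True)
print(observation)
```

With `viewport_only=True` the off-screen `Compare` button is omitted, which is the kind of truncation that keeps the observation within a text model's context length.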
2.3 Action Space
Following previous work on navigation and operation in web and embodied environments [21, 8], we design a compound action space which emulates the keyboard and mouse operations available on web pages. Table 1 lists all the available actions categorized into three distinct groups. The first category includes elemental operations such as clicking, hovering, typing, and key combination pressing. The second comprises tab management actions such as opening, closing, and switching between tabs. The third category consists of URL navigation actions, such as visiting a specific URL or navigating forward and backward in the browsing history.
WebArena allows agents to refer to elements on web pages with different approaches. Elements can be selected either by their on-screen coordinates, represented as (x, y), or by a unique element ID.
³ /gitlab-org/gitlab
⁴ /en-US/docs/Glossary/Accessibility_tree
Action Type            Description
noop                   Do nothing
click(element)         Click at an element
hover(element)         Hover on the element
type(element, text)    Input the text to the element
keypress(keycomb)      Press the key combination (e.g., ctrl+v)
tabfocus(pagenumber)   Bring the open tab with pagenumber to the front
newtab                 Open a new tab
tabclose               Close the current page and focus the last open page
goback                 Visit the last URL visited
goforward              Undo goback operation
goto(URL)              Go to a URL in the current page
Table 1: Action Space of WebArena
This ID is an artifact generated when traversing the Document Object Model (DOM) tree or the accessibility tree. With element IDs, the element selection is transformed into an n-way classification problem, thereby eliminating any disambiguation efforts required from the agent or the underlying implementation. An accessibility tree with the assigned element IDs is presented in Figure 3. For example, when the agent issues the action click [1582], where [1582] is the unique ID for the element “Add to Cart”, the underlying implementation of WebArena will perform the clicking action, and the web page will update accordingly. This flexibility in element selection enables WebArena to support agents designed in various ways (e.g., accepting input from different modalities) without compromising fair comparison metrics such as the number of steps taken.
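To make the element-ID mechanism concrete, the following sketch parses a textual action such as `click [1582]` into a structured form. The `Action` class, the `parse_action` function, the `ARITY` table, and the bracketed argument syntax for multi-argument actions are assumptions made for illustration; only the action names are taken from Table 1.

```python
import re
from dataclasses import dataclass

# Action names from Table 1, mapped to the number of bracketed arguments
# each one takes (the bracket syntax itself is an assumption).
ARITY = {
    "noop": 0, "click": 1, "hover": 1, "type": 2, "keypress": 1,
    "tabfocus": 1, "newtab": 0, "tabclose": 0,
    "goback": 0, "goforward": 0, "goto": 1,
}


@dataclass
class Action:
    name: str
    args: tuple


def parse_action(text: str) -> Action:
    """Parse e.g. 'click [1582]' or 'type [21] [hello]' into an Action."""
    match = re.fullmatch(r"\s*(\w+)((?:\s*\[[^\]]*\])*)\s*", text)
    if match is None or match.group(1) not in ARITY:
        raise ValueError(f"unrecognized action: {text!r}")
    name = match.group(1)
    args = tuple(re.findall(r"\[([^\]]*)\]", match.group(2)))
    if len(args) != ARITY[name]:
        raise ValueError(f"{name} expects {ARITY[name]} argument(s), got {len(args)}")
    return Action(name, args)


action = parse_action("click [1582]")
print(action)
```

Because the argument is an ID drawn from the current observation, the agent only has to pick one of the listed elements rather than describe the target in free text.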
2.4 User Roles Simulation
Users of the same website often have disparate experiences due to their distinct roles, permissions, and interaction histories. For instance, within an E-commerce CMS, a shop owner might possess full read and write permissions across all content, whereas an employee might only be granted write permissions for products but not for customer data. We aim to emulate this scenario by generating unique user profiles on each platform.
On the shopping site, we created a customer profile that has over 35 orders within a span of two years. On GitLab, we selected a user who maintains several popular open-source projects with numerous merge requests and issues. This user also manages a handful of personal projects privately. On Reddit, our chosen profile was a user who actively participates in discussions, with many posts and comments. Lastly, on our E-commerce CMS, we set up a user profile for a shop owner who has full read-and-write access to all system contents.
All users are automatically logged into their accounts using a pre-cached cookie. To our best knowledge, this is the first publicly available agent evaluation environment to implement such a characteristic. Existing literature typically operates under the assumption of universally identical user roles [27, 21, 7].
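The shop-owner versus employee example from this section can be written down as a small permission table. This is only a sketch of the idea of role-dependent access: the role names, resource names, and the `is_allowed` helper are hypothetical, and the employee's read access to customer data is an assumption, since the text only rules out write access.

```python
# Hypothetical permission table mirroring the CMS example in the text.
PERMISSIONS = {
    "shop_owner": {
        "products": {"read", "write"},
        "customer_data": {"read", "write"},
    },
    "employee": {
        "products": {"read", "write"},
        "customer_data": {"read"},  # assumed; the text only excludes write access
    },
}


def is_allowed(role: str, resource: str, operation: str) -> bool:
    """Return True if `role` may perform `operation` on `resource`."""
    return operation in PERMISSIONS.get(role, {}).get(resource, set())


print(is_allowed("shop_owner", "customer_data", "write"))
print(is_allowed("employee", "customer_data", "write"))
```

The same intent can therefore succeed or fail depending on which profile the agent is logged in as, which is why the benchmark fixes one profile per site.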
3 Benchmark Suite of Web-based Tasks
We provide a benchmark with 812 test examples on grounding high-level natural language instructions to interactions in WebArena. Each example comes with a metric to evaluate the functional correctness of the task execution. In this section, we first formally define the task of controlling an autonomous agent through natural language. Then we introduce the annotation process of our benchmark.
3.1 Controlling Agents through High-level Natural Language
[Figure 4 image: pie chart with slices labeled Map, CMS, E-commerce, Cross Site, and Gitlab; slice values 13.4%, 22.4%, 23.0%, 5.9%, 13.1%, 22.2%.]
Figure 4: The intent distribution across different websites. Cross-site intents necessitate interacting with multiple websites. Notably, regardless of the website, all user intents require interactions with multiple web pages.
Category              Example
Information Seeking   When was the last time I bought shampoo
                      Compare walking and driving time from AMC Waterfront to Randyland
Site Navigation       Checkout merge requests assigned to me
                      Show me the ergonomic chair with the best rating
Content & Config      Post to ask “whether I need a car in NYC”
                      Delete the reviews from the scammer Yoke
Figure 5: Example intents from three categories.
The WebArena environment is denoted as $E$ with state space $S$, action space $A$ (§2.3) and observation space $O$ (§2.2). The transition function $T: S \times A \rightarrow S$ is deterministic, and it is defined by the underlying implementation of each website in the environment. Performing a task described by a natural language intent $i$ can be formulated as a partially observable Markov decision process (POMDP): at each time step $t$, an agent issues an action $a_t$ given the partial observation $o_t$. Consequently, the action results in a new state $s_{t+1}$ and its corresponding observation $o_{t+1}$. We propose a reward function $r(\mathbf{a}, \mathbf{s})$ to measure the success of a task execution, where $\mathbf{a}$ represents the sequence of actions, and $\mathbf{s}$ denotes all intermediate states. This reward function assesses if state transitions align with the expectations of the intents. For example, with an intent to place an order, it verifies whether an order has been placed. Additionally, it evaluates the accuracy of the agent’s actions, such as checking the correctness of the predicted answer.
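The interaction loop in this formulation can be sketched with a toy environment. Everything here is hypothetical and far simpler than a real website: the `State` fields, the three action names, the scripted policy, and the reward check are stand-ins chosen to mirror the "place an order" example, not part of WebArena.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Toy environment state; a real website's state is far richer."""
    page: str = "home"
    order_placed: bool = False


def transition(state: State, action: str) -> State:
    """Deterministic T: S x A -> S for a toy shop (hypothetical actions)."""
    if action == "goto_cart":
        return State(page="cart", order_placed=state.order_placed)
    if action == "place_order" and state.page == "cart":
        return State(page="confirmation", order_placed=True)
    return state  # unknown or inapplicable actions leave the state unchanged


def observe(state: State) -> str:
    """Partial observation: the agent sees the page name, not the full state."""
    return state.page


def reward(actions, states) -> float:
    """r(a, s): 1.0 if the intent 'place an order' was functionally achieved."""
    return 1.0 if states[-1].order_placed else 0.0


def scripted_agent(observation: str) -> str:
    """Stand-in policy mapping the current observation to the next action."""
    return {"home": "goto_cart", "cart": "place_order"}.get(observation, "noop")


def run_episode(max_steps: int = 5):
    state = State()
    actions, states = [], [state]
    for _ in range(max_steps):
        action = scripted_agent(observe(state))
        if action == "noop":
            break
        state = transition(state, action)
        actions.append(action)
        states.append(state)
    return actions, states


actions, states = run_episode()
print(actions, reward(actions, states))
```

The reward inspects the resulting state rather than the action sequence, so any other sequence of actions that ends with an order placed would score the same.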
3.2 Intent Collection
We focus on curating realistic intents to carry out complex and creative tasks within WebArena. To start with, our annotators were guided to spend a few minutes exploring the websites to familiarize themselves with the websites’ content and functionalities. As most of our websites are virtually identical to their open-web counterparts, despite having sampled data, most annotators can quickly comprehend the websites.
Next, we instructed the annotators to formulate intents based on the following criteria:
(1) The intent should be abstract and high-level, implying that the task cannot be fulfilled with merely one or two actions. As an example, instead of “click the science subreddit”, we encouraged annotators to come up with something more complex like “post a greeting message on science subreddit”, which involves performing multiple actions.
(2) The intent should be creative. Common tasks such as account creation can be easily thought of. We encouraged the annotators to add constraints (e.g., “create a Reddit account identical to my GitLab one”) to make the intents more unique.
(3) The intent should be formulated as a template by making replaceable elements as variables. The annotators were also responsible for developing several instantiations for each variable. For example, the intent “create a Reddit account identical to my GitLab one” can be converted into “create a {{site1}} account identical to my {{site2}} one”, with an instantiation like “{site1: Reddit, site2: GitLab}” and another like “{site1: GitLab, site2: OneStopShopping}”. Notably, tasks derived from the same template can have distinct execution traces. The similarity resides primarily in the high-level semantics rather than the specific implementation.
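Criterion (3) amounts to simple placeholder substitution, sketched below. The `instantiate` function is a hypothetical helper written for illustration; the template and the two instantiations are the ones quoted in the criterion.

```python
import re


def instantiate(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with values[name]; fail on missing names."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {name!r}")
        return values[name]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


template = "create a {{site1}} account identical to my {{site2}} one"
instantiations = [
    {"site1": "Reddit", "site2": "GitLab"},
    {"site1": "GitLab", "site2": "OneStopShopping"},
]
intents = [instantiate(template, values) for values in instantiations]
print(intents)
```

The two resulting intents share a template but target different sites, so the pages visited and actions taken to complete them differ.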
We additionally provided a prompt that the annotators could give to ChatGPT⁵ for inspiration. This prompt contains an overview of each website and instructs the model to describe potential tasks to be performed on these sites. Furthermore, we offered a curated list of examples for annotators to reference.
⁵ /
Intent Analysis. In total, we curated 241 templates and 812 instantiated intents. On average, each template is instantiated to 3.3 examples. The intent distribution is shown in Figure 4. Furthermore, we classify the intents into three primary categories with examples shown in Figure 5:
(1) Information-seeking tasks: These are tasks where a textual response is expected. Importantly, the information-seeking tasks in WebArena often require navigation across multiple pages or focus on user-centric content. This makes them distinct from open-domain question-answering tasks [36, 17], which focus on querying general knowledge with a simple retrieval step. For instance, to answer “When was the last time I bought the shampoo”, an agent must traverse to the user’s purchase history, checking individual order details to identify the most recent shampoo purchase.
(2) Site navigation tasks: This category is composed of tasks that require navigating through web pages using a variety of interactive elements such as search functions and links. The objective is often to locate specific information or navigate to a particular section of a site.
(3) Content and configuration operation tasks: This category encapsulates tasks that require operating in the web environment to create, revise, or configure content or settings. This includes adjusting settings, managing accounts, performing online transactions, generating new web content, and modifying existing content. Examples range from updating a social media status or README file to conducting online purchases and configuring privacy settings.
3.3 Evaluation Annotation
Evaluating Information Seeking Tasks. To measure the correctness of information-seeking tasks where a textual answer is expected, we provide the annotated answer $a^*$ for each intent. The $a^*$ is further compared with the predicted answer $\hat{a}$ with one of the following scoring functions $r_{\mathrm{info}}(\hat{a}, a^*)$.
First, we define exact_match where only $\hat{a}$ that is identical with $a^*$ will receive a score of one. This function is primarily applicable to those intent types whose responses follow a more standardized format, similar to the evaluation on question answering literature [26, 36].
Second, we create must_include where any $\hat{a}$ containing $a^*$ receives a score of one. This function is primarily used in scenarios where an unordered list of text is expected or where the emphasis of evaluation is on certain key concepts. In the second example in Table 2, we expect both the correct name and the email address to be presented, irrespective of the precise wording used to convey the answer.
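The two scoring functions defined so far can be sketched in a few lines. This is an illustration under stated assumptions rather than the benchmark's evaluator: whitespace stripping, case-insensitive matching, and passing the required phrases as a list are choices made here, and the name and email address in the usage lines are invented placeholder values.

```python
def exact_match(predicted: str, reference: str) -> float:
    """Score 1.0 only if the predicted answer equals the annotated answer.

    Surrounding whitespace is ignored (an assumption of this sketch).
    """
    return float(predicted.strip() == reference.strip())


def must_include(predicted: str, required_phrases) -> float:
    """Score 1.0 if every required phrase appears somewhere in the prediction.

    Matching is case-insensitive (an assumption of this sketch).
    """
    text = predicted.lower()
    return float(all(phrase.lower() in text for phrase in required_phrases))


# Placeholder values for illustration only.
print(exact_match("2023-03-14", "2023-03-14"))
print(must_include("You can reach Jane Doe at jane@example.com.",
                   ["Jane Doe", "jane@example.com"]))
```

`must_include` ignores the wording around the key phrases, which is what lets differently phrased answers carrying the same name and email address receive full credit.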
Finally, we intro