914 | Nature | Vol 651 | 26 March 2026

Article

Towards end-to-end automation of AI research

/10.1038/s41586-026-10265-5

Received: 8 July 2025

Accepted: 11 February 2026

Published online: 25 March 2026

Open access

Chris Lu¹,²,⁵, Cong Lu¹,³,⁴,⁵, Robert Tjarko Lange¹,⁵, Yutaro Yamada¹,⁵ ✉, Shengran Hu¹,³,⁴, Jakob Foerster², David Ha¹ ✉ & Jeff Clune³,⁴ ✉

The automation of science is a long-standing ambition in artificial intelligence (AI) research¹,². Although the community has made substantial progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle, from conception to publication, has remained out of reach. Here we present a pipeline for automating the entire scientific process end to end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyses data, writes the entire scientific manuscript, and performs its own peer review. Its ideas, execution and presentation are of sufficient quality that the manuscript generated by this AI system passed the first round of peer review for a workshop of a top-tier machine learning conference. The workshop had an acceptance rate of 70%. Our system leverages modern foundation models³–⁵ within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold for conducting research on a specific topic and a template-free, open-ended mode that leverages agentic search for wider scientific exploration⁶,⁷. Both settings produce diverse ideas and automatically test, report on and evaluate them. This achievement demonstrates the growing capacity of AI for making scientific contributions and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be important risks, including taxing overwhelmed review systems and adding noise to the scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

¹Sakana AI, Tokyo, Japan. ²FLAIR, University of Oxford, Oxford, UK. ³University of British Columbia, Vancouver, British Columbia, Canada. ⁴Vector Institute, Toronto, Ontario, Canada. ⁵These authors contributed equally: Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada. ✉ e-mail: yutaro.yamada.y@; hadavid@sakana.ai; jclune@

AI has long been used to aid scientific discovery, an ambition with deep roots in the history of the field¹,⁸–¹¹. Before the rise of large language models (LLMs), AI was limited to helping with specific, narrow tasks, such as discovering chemical structures², finding mathematical proofs¹, discovering new materials¹²–¹⁴ and predicting the three-dimensional shape of proteins¹⁵,¹⁶. Other systems focused on analysing pre-collected datasets to find new insights¹⁰,¹⁷,¹⁸. However, with the recent advent of powerful and general foundation models, the role of AI has expanded to include assisting with a wider array of research activities. For example, LLMs now help with generating new hypotheses¹⁹–²³, writing literature reviews²⁴,²⁵ and coding experiments²⁶–²⁹. Despite these advances in automating individual components, a system that autonomously navigates the entire research lifecycle, from conception to publication, has remained out of reach until now.

This paper introduces The AI Scientist, a pipeline that achieves the vision of full end-to-end automation of the scientific process. The AI Scientist uses existing foundation models to perform ideation, literature search, experiment planning and implementation, result analysis, manuscript writing, and peer review to produce complete, new papers. We focus on machine learning science, as experiments typically occur entirely on the computer.

A central challenge in developing such a system is automatically evaluating the quality of its scientific output at scale. To address this, we created an automated reviewer and first evaluated its performance against real, human-generated papers. The Automated Reviewer can accurately predict conference acceptance decisions, performing on par with human reviewers (Supplementary Information section A.3). We then used The Automated Reviewer to compare various configurations of The AI Scientist by assessing how performance changes with the scale of the test-time compute and the quality of the underlying foundation model. We find that The AI Scientist performs better with more compute resources (Fig. 3c). Furthermore, The Automated Reviewer shows that improvements to the base models significantly improve the quality of the generated papers, a finding that strongly implies that future versions of our system will be substantially more capable, as models continue to improve (Fig. 1b).

To assess The AI Scientist in the same setting in which human-authored papers are evaluated, we conducted an experiment where we submitted generated papers to a workshop at the International Conference on Learning Representations (ICLR), with the organizers’ consent. In computer science, such top-tier conferences are the primary and most prestigious venues for archival and rigorously peer-reviewed publication. They also have workshops with a substantially lower but still non-trivial bar for peer-reviewed acceptance.

One of The AI Scientist’s manuscripts achieved high enough scores to exceed the average human acceptance threshold at a workshop, providing an example of a fully AI-generated paper successfully navigating a peer-review process, albeit one with a lower bar.

[Figure 1. Panel a: workflow diagram. Ideation: LLM idea proposal, novelty checking, scoring and archiving. Experimentation: preliminary investigation, hyperparameter tuning, research execution (each marked "write to log" and "best"), then ablation studies. Write-up: plotting and feedback, paper template, paper, paper AI review. Panel b: AI reviewer paper score (axis 0 to 6) against language model release date (July 2023 to July 2025) for template-based and template-free runs, with points labelled GPT-4, Gemini-1.5, Sonnet-3, GPT-4o, Sonnet-3.5, o1, Gemini-2.0, Sonnet-3.7, Gemini-2.5, o3 and Sonnet-4; fit R² = 0.517, P < 0.00001, with 95% confidence interval. Panel c: balanced accuracy (axis 0.4 to 0.7) for human, reject, random and automated reviewer, before cutoff (2017–2024) and after cutoff (2025).]

Fig. 1 | The AI Scientist workflow. a, The AI Scientist consists of distinct phases covering automated idea generation, tree-based experimentation, manuscript writing and reviewing. The experimentation phase uses an agentic tree search to generate and refine code implementations. This is structured into four stages: (1) initial investigation, (2) hyperparameter tuning, (3) research agenda execution and (4) ablation studies. From one experimental stage to the next, the best-performing checkpoint is selected to seed the next stage of the tree search. b, Scores for The AI Scientist papers across model releases. Paper quality consistently improves with the underlying model release date (as judged by The Automated Reviewer), indicating consistent future improvements with improving foundation models. The observed correlation is statistically significant (P < 0.00001). Shaded regions represent the standard error. Points represent mean scores with error bars and shaded regions indicating the standard error (n = 6 for template-free points, n = 3 for template-based points). Full experimental details, including model versions and replication counts, are provided in Supplementary Information section A.2.9. c, Automated review versus conference decisions. The Automated Reviewer achieves performance comparable with that of human reviewers, as validated by openly available decisions from past conferences (Table 1). Bars represent mean balanced accuracy; error bars show 95% bootstrapped confidence intervals (5,000 replicates). For replicability, each automated review is a 5-run ensemble. Two-sample z-tests on subsampled accuracy (automated n = 698/876, human n = 412) showed no significant difference before the training cutoff (P = 0.319) or post-cutoff (P = 0.921). Non-parametric bootstrap tests on F1 scores showed automated outperformance (P < 0.001).
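The 95% bootstrapped confidence intervals reported for balanced accuracy in Fig. 1c can be illustrated with a percentile bootstrap over paper-level decisions. The following is a minimal sketch, not the authors' evaluation code: the function names, the percentile method and the handling of single-class resamples are assumptions, and only the 5,000-replicate default and the 95% level come from the caption.

```python
import random


def balanced_accuracy(labels, preds):
    """Mean of the true-positive rate and the true-negative rate.

    `labels` and `preds` are sequences of booleans (True = accept).
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one accepted and one rejected paper")
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    return 0.5 * (tp / pos + tn / neg)


def bootstrap_ci(labels, preds, replicates=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for balanced accuracy (assumed method)."""
    balanced_accuracy(labels, preds)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        ps = [preds[i] for i in idx]
        try:
            stats.append(balanced_accuracy(ys, ps))
        except ValueError:
            continue  # resample drew a single class; draw again
    stats.sort()
    lower = stats[int((1 - level) / 2 * replicates)]
    upper = stats[int((1 + level) / 2 * replicates) - 1]
    return lower, upper
```

With real data, `labels` would hold the conference decisions and `preds` the reviewer's decisions for the same papers.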

Generating manuscripts

The AI Scientist sequentially completes four main phases (Fig. 1a). In the first phase, The AI Scientist is prompted to iteratively grow an archive³⁰ of high-level research directions and hypotheses that it can explore within a user-specified machine learning research subfield (an example progression is visualized in Supplementary Information section C.4). For each direction, it generates a descriptive title, its reasoning for what the idea is and why it would be interesting to pursue, and a proposed experimental plan (Supplementary Information sections A.1.1 and A.2.6). After idea generation, The AI Scientist filters ideas by connecting the language model to the Semantic Scholar application programming interface (API)³¹ and web access as tools³². This allows The AI Scientist to discard any idea that too closely resembles a work in the existing literature.
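The filtering step can be pictured as a loop that looks up related work for each idea and drops ideas that sit too close to something already published. This is an illustrative sketch only: in the system described here the comparison is made by the language model using Semantic Scholar and web search as tools, whereas the code below substitutes a simple word-overlap (Jaccard) score. The names `search_fn`, `jaccard` and `filter_novel_ideas`, and the 0.6 threshold, are hypothetical.

```python
from typing import Callable, Iterable, List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a crude stand-in for an LLM judgement."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_novel_ideas(
    ideas: Iterable[str],
    search_fn: Callable[[str], List[str]],
    threshold: float = 0.6,
) -> List[str]:
    """Keep only ideas whose closest search hit falls below `threshold`.

    `search_fn` is a hypothetical callable that returns titles of related
    papers for a query; no real literature API is called here.
    """
    kept = []
    for idea in ideas:
        hits = search_fn(idea)
        if all(jaccard(idea, hit) < threshold for hit in hits):
            kept.append(idea)
    return kept
```

Passing the search function in as an argument keeps the sketch independent of any particular literature service.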

The second phase of The AI Scientist executes the proposed experiments and then visualizes their results for the downstream write-up. We tested two different variants of experiment execution:

(1) Template-based: The AI Scientist is provided with a starting code template that reproduces a training run from a popular algorithm. The AI Scientist then executes the proposed experiment plan in linear order (Supplementary Information section A.1).

(2) Template-free: Alternatively, The AI Scientist can generate an initial starting code script by itself. In this case, experimentation includes further stages for optimizing the code it writes from scratch, and experiment execution leverages extra test-time compute with a tree search (Fig. 3a,b and Methods).

After each experiment, The AI Scientist is given the results and is prompted to take notes in the style of an experimental journal for future planning and write-up.
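The staged search used in the template-free variant can be sketched as follows, assuming the structure given in the Fig. 1 caption: four stages, with the best-performing node of each stage seeding the next. The `Node` fields, the `expand` callback (a stand-in for the LLM writing and running code), the greedy parent selection and the per-stage budget are all assumptions made for illustration; the actual procedure is the one described in Methods.

```python
from dataclasses import dataclass
from typing import Callable, Optional

STAGES = (
    "preliminary investigation",
    "hyperparameter tuning",
    "research agenda execution",
    "ablation studies",
)


@dataclass
class Node:
    stage: str
    score: float                      # higher is better (assumed metric)
    buggy: bool = False               # True if the generated code failed
    parent: Optional["Node"] = None


# Hypothetical callback: given a stage and a parent node, produce a child node.
Expand = Callable[[str, Node], Node]


def run_stage(stage: str, root: Node, expand: Expand, budget: int) -> Node:
    """Grow a tree from `root` and return the best non-buggy node."""
    nodes = [root]
    for _ in range(budget):
        last = nodes[-1]
        if last.buggy:
            parent = last             # debug the node that just failed
        else:
            # refine the best working node found so far
            parent = max((n for n in nodes if not n.buggy), key=lambda n: n.score)
        nodes.append(expand(stage, parent))
    return max((n for n in nodes if not n.buggy), key=lambda n: n.score)


def staged_tree_search(expand: Expand, budget_per_stage: int = 5) -> Node:
    """Run the four stages in order, seeding each with the previous best."""
    best = Node(stage="root", score=float("-inf"))
    for stage in STAGES:
        best = run_stage(stage, best, expand, budget_per_stage)
    return best
```

If every child in a stage is buggy, this sketch simply carries the previous best node forward, which is one possible design choice among several.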

The third phase of The AI Scientist produces a concise write-up of its research in the style of a standard machine learning conference paper. The AI Scientist is prompted to fill in a blank LaTeX conference template section by section using its notes and plots (Methods). To construct the related work section and add citations throughout the manuscript, the system queries the Semantic Scholar³¹ API for relevant literature and compares its findings against the generated manuscript over 20 rounds. For each potential citation, the system generates a textual justification for its inclusion, which informs The AI Scientist on how to use the reference appropriately within the manuscript.

Finally, the paper generated by The AI Scientist undergoes a review by The Automated Reviewer, which automatically evaluates the scientific quality of the conducted research.

Automated evaluation of generated papers

The Automated Reviewer provides reviews based on the review guidelines for the top-tier Neural Information Processing Systems (NeurIPS) conference (https://neurips.cc/Conferences/2022/ReviewerGuidelines). The output contains numerical scores (soundness, presentation, contribution, overall quality and reviewer confidence), lists of weaknesses and strengths, as well as a binary decision (accept or reject). The Automated Reviewer pipeline consists of an ensemble of five reviews, followed by a meta-review in which the model acts as an area chair to make a final decision conditioned on all five reviews (Supplementary Information section A.3). We compared Automated Reviewer decisions with ground truth data for ICLR papers extracted from the publicly available OpenReview dataset³³. As shown in Table 1, the agreement of Automated Reviewer assessments with human assessments is comparable with inter-human agreement measured by F1 score and balanced accuracy, as reported in the NeurIPS 2021 consistency study³⁴, which measured agreement between human reviewers on a comparable set of submissions (Supplementary Information section A.3). This demonstrates its ability to replicate the collective judgement of human reviewers with high fidelity. These results are statistically significant (non-parametric bootstrap test³⁵ and two-sample z-test³⁶; Supplementary Information section A.3). Next, to investigate the effect of potential data contamination (the possibility that decisions on a paper were part of the training set for the LLM), we evaluated The Automated Reviewer on two datasets: one containing 1,000 papers from years potentially within the training data used for the model (2017–2024) and a second ‘clean’ dataset from the year after the cutoff (2025), which could not have been seen during training. A comparison between years before and after the knowledge cutoff indicates that data contamination may exist, as balanced decision accuracy decreases from 69% before to 66% in the year after the cutoff. However, the results for the year after the cutoff remain comparable with those of human reviewers (for example, 66% balanced accuracy), showing that potential contamination had, at most, a minimal effect.

Table 1 | Performance comparison of human reviewers and The Automated Reviewer

Reviewer              Balanced accuracy (↑)   Accuracy (↑)   F1 score (↑)   AUC (↑)       FPR (↓)       FNR (↓)
Human (NeurIPS)       0.66                    0.73           0.49           0.65          0.17          0.52

Years before knowledge cutoff (2017–2024)
Random decision       0.50                    0.54           0.47           0.52          0.47          0.43
Always reject         0.50                    0.65           0.00           0.50          0.00          1.00
Automated Reviewer    0.69 ± 0.04             0.65 ± 0.10    0.62 ± 0.09    0.69 ± 0.09   0.45 ± 0.10   0.17 ± 0.08

Year after knowledge cutoff (2025)
Random decision       0.52                    0.51           0.48           0.49          0.50          0.48
Always reject         0.50                    0.56           0.00           0.50          0.00          1.00
Automated Reviewer    0.66 ± 0.03             0.63 ± 0.09    0.67 ± 0.09    0.65 ± 0.10   0.52 ± 0.10   0.17 ± 0.07

Performance comparison of human reviewers (NeurIPS 2021 consistency experiment³⁴) and the Automated Reviewer, evaluated on papers published before (2017–2024) and after (2025) the knowledge cutoff. The Automated Reviewer achieved performance superior or comparable with human reviewer consistency in key metrics such as F1 score, area under the curve (AUC) and balanced accuracy, even for data beyond the knowledge cutoff, highlighting its robustness and reliability across different time periods. Error margins denote the 95% bootstrapped confidence intervals. Arrows indicate whether it is better for a score to be higher (↑) or lower (↓). Supplementary Information section A.3.2 explains each metric and comparison in detail. FNR, false negative rate; FPR, false positive rate.
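The metrics reported in Table 1 are related in a simple way, which a short calculation makes explicit. The sketch below assumes that "positive" means accept, so that FPR is the fraction of rejected papers predicted as accepted and FNR is the fraction of accepted papers predicted as rejected. The function names are illustrative, and the definitions are the standard ones rather than text taken from Supplementary Information section A.3.2.

```python
def balanced_accuracy_from_rates(fpr: float, fnr: float) -> float:
    """Balanced accuracy = (TPR + TNR) / 2 = 1 - (FPR + FNR) / 2."""
    return 1.0 - (fpr + fnr) / 2.0


def metrics_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from a confusion matrix.

    tp/fn count accepted papers, tn/fp count rejected papers.
    """
    total = tp + fp + tn + fn
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = 1.0 - fnr
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {
        "balanced_accuracy": balanced_accuracy_from_rates(fpr, fnr),
        "accuracy": (tp + tn) / total,
        "f1": f1,
        "fpr": fpr,
        "fnr": fnr,
    }
```

For the human row, `balanced_accuracy_from_rates(0.17, 0.52)` gives 0.655, consistent with the reported 0.66. A reviewer that always rejects has FPR = 0 and FNR = 1, hence a balanced accuracy of 0.50 and an F1 score of 0, and its plain accuracy equals the fraction of rejected papers, which matches the pattern of the "Always reject" rows.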

Using The Automated Reviewer, we assessed the quality of the research papers generated by a wide range of LLMs as the core model within The AI Scientist. Our analysis revealed a clear trend: as models improve over time, the quality of the papers produced by The AI Scientist increased correspondingly (Fig. 1b). With recent generations of models, on average, The AI Scientist produced papers that approach borderline acceptability for machine learning conference workshops, as judged by our Automated Reviewer (Supplementary Fig. B2). Additionally, there is a strong correlation between the amount of compute allocated per paper and the resulting quality (Fig. 3c), indicating that both model scale and inference-time investment play important roles in the output quality of The AI Scientist, further indicating the possibility of substantial improvements as the costs of AI systems continue to exponentially decrease and capabilities exponentially increase³⁷.

Human evaluation results

Perhaps the ultimate and fairest test of the quality of the work of The AI Scientist is a version of what we might call an AI scientist Turing test: submitting the work to the same rigorous, blind peer-review systems used to evaluate human science. We submitted three generated manuscripts to the formal peer-review process of a workshop at a top-tier machine learning conference. This experiment was conducted with the approval of the relevant institutional review board (IRB; Supplementary Information section C.3) and the full cooperation of the ICLR 2025 leadership and the organizers of the I Can’t Believe It’s Not Better (ICBINB) workshop. This was the only venue that we submitted to.

The template-free version of The AI Scientist was readily adapted to this setting by simply prompting it with the broad theme of the workshop (which was investigating deep learning limitations, including where previous ideas to improve it had not worked). The overall process was then run to generate ideas, experiments and papers. We manually filtered the most promising outputs at each stage (Supplementary Information section A.4). Had this filtering not occurred, the papers under analysis would still have been produced in their final form, just along with other papers and, thus, at a greater total cost. This process resulted in three complete manuscripts being selected for submission. The selection was based on three criteria: whether the idea was aligned with the workshop topic, whether the code correctly implemented the proposed idea and ran without errors, and the correctness of the manuscript formatting (Supplementary Information section A.4). The entire scientific workflow for each paper, from ideation and coding to manuscript writing, was performed without any human modification. These three submissions were included among the 43 papers reviewed for the workshop. Reviewers were informed that some of the submissions were AI-generated but not which ones, ensuring a blind process.

One of the three AI-generated manuscripts received an average score of 6.33 from the reviewers (individual scores were 6, 7 and 6), placing it above the average acceptance threshold for the workshop (Fig. 2). The organizers said that the paper would have been accepted in all likelihood were it not withdrawn according to our pre-established protocol due to being AI-generated. Notably, the accepted manuscript reported a negative result, aligning with the focus of the workshop on interesting negative results. The other two papers did not meet the bar for acceptance (Supplementary Table D9). Thus, a fully AI-generated paper passed a standard scientific peer-review process. We also conducted our own internal review, using the human AI researchers on our team (Supplementary Information section C.2). The team concluded that although one of the papers did meet the bar for workshop papers, none met the higher bar for a main ICLR conference publication. A full analysis of all three submitted papers, including their strengths, weaknesses and implementations, is provided in Supplementary Information section C.2.

[Figure 2. Page images from the generated paper: title and abstract (page 1), technical methodology (page 2), data visualizations (page 4), references (page 5).]

Fig. 2 | Selected sections from a paper generated by The AI Scientist that was accepted via peer review at a top-tier machine learning conference workshop. The paper received peer-review scores of 6 (weak accept), 7 (accept) and 6 (weak accept) before meta-review and ranked among the top 45% of papers submitted for peer review. This demonstrates that a fully AI-generated paper can navigate the peer-review process successfully at a top-tier conference workshop. A full-sized version of this paper is available in Supplementary Information section D.2.1.

Limitations

Although The AI Scientist generated a workshop paper that passed peer review, there is room for improvement if it is to match the best human-produced science. Only one of three submissions was accepted, and workshops have much higher acceptance rates than main conferences (for example, 70% for the ICLR 2025 ICBINB workshop³⁸ versus 32% for the ICLR 2025 main conference³⁹). Therefore, The AI Scientist cannot yet meet the standards of top-tier publications nor even do so consistently for workshops. Common failure modes include the generation of naive or underdeveloped ideas, incorrect implementations of the main idea, a lack of deep methodological rigour, errors in experimental implementation, duplicating figures in the main text and the appendix, and many types of hallucinations, such as inaccurate citations (a full analysis of failure modes is provided in Supplementary Information sections A.4, C.2 and C.3).

That said, often in machine learning, once something begins to work (even with clear flaws), in a few short years with scale (for example, of compute and data), better core models and better techniques, the capabilities of a system become surprising and can exceed human performance levels. In assessing the impact of a technology, it is, thus, important to keep in mind its probable future trajectory. Crucially, this trajectory is not just about better models but about the complexity of the tasks that AI systems can execute. Recent work indicates that the length of tasks that AI can reliably complete is doubling every 7 months⁴⁰, indicating that many current implementation and debugging bottlenecks may be resolved in the near term. However, some AI weaknesses have proved surprisingly difficult to solve, such as AI being easily fooled⁴¹,⁴² and overconfidently wrong (hallucinations)⁴³, although progress has been made⁴⁴,⁴⁵. Such challenges could persist and would prevent us from reliably trusting the outputs of systems like The AI Scientist. It is also not clear to what extent AI systems can produce new creative ideas that resemble great conceptual leaps in science. Studying and improving AI systems on these fronts are key areas for future research.
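As a worked illustration of the doubling claim, a quantity that doubles every 7 months grows by a factor of 2^(t/7) after t months. The helper below sketches that arithmetic only; the function name is hypothetical, and any extrapolation assumes the reported trend continues unchanged.

```python
def task_length_multiplier(months: float, doubling_months: float = 7.0) -> float:
    """Growth factor after `months` for a quantity doubling every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)
```

For example, 14 months corresponds to a 4-fold increase and 28 months to a 16-fold increase in the length of tasks completed reliably.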

At present, The AI Scientist conducts computational experiments only. In future work, this same playbook could be applied to other scientific domains where one can automatically conduct experiments (or have humans conduct them) and collect data from them (for example, automated chemistry laboratories, on which swift progress is being made⁴⁶).

The ability to automate paper generation raises important ethical and societal concerns, including the potential to overwhelm the peer-review process, artificially inflate research credentials, repurpose the ideas of others without giving proper credit, eliminate scientist jobs, or conduct unethical or dangerous experiments (Supplementary Information section C.3). To conduct this study responsibly, we obtained explicit permission from the ICLR leadership, the workshop organizers and the University of British Columbia’s IRB (H24-02652). Crucially, as part of our experimental protocol, we determined in advance that all AI-generated submissions would be withdrawn after peer review, regardless of outcome. This decision was made to avoid setting a precedent for publishing fully automated research before the scientific community has established clear standards for disclosure and evaluation. Developing these norms is a critical next step to ensure that such systems are used to advance, not undermine, scientific integrity. Finally, more research is needed to ensure that open-ended exploratory AI proceeds safely and in alignment with human values⁴⁷,⁴⁸.

The generation by The AI Scientist of an AI-authored manuscript that passed peer review for a workshop at a top-tier machine learning conference marks a milestone in the centuries-long scientific endeavour. Although challenges remain in terms of consistency and achieving top-tier quality, this success demonstrates the growing capacity of AI for scientific reasoning, and it signals the dawn of a new era in which the process of discovery is no longer a solely human pursuit and in which the pace at which we are able to reap the harvest of scientific discovery could accelerate dramatically.

[Figure 3. Panel a: schematic of the tree search across stage 1 (preliminary investigation), stage 2 (hyperparameter tuning), stage 3 (research agenda execution) and stage 4 (ablation studies), with node types non-buggy, buggy, hyperparameter, ablation, replication, aggregation and best, and edge types refine and debug. Panel b: example search tree from root node to best node across stages 1 to 4, with node annotations including: "Built a colour-biased MNIST dataset to test if slowing learning on specific weights boosts shortcut robustness"; "Balances all colour-digit groups in training and test splits to avoid empty groups"; "Tuned suppression strength using early stopping; enhanced visualizations"; "Fixes training crash, sharpens shortcut signal and adds the CelebA dataset"; "Introduces the Waterbirds dataset"; "Replaces skewed splits with stratified sampling; uses pretrained ResNet"; "Fixes array error by excluding image data from DataFrames"; "Ablation studies: 1. Warm-up period duration, 2. Dynamic penalty factor adaptation, 3. Penalty application strength and so on". Panel c: paper scores given by the automated reviewer (axis 3.2 to 4.0) against number of experimental nodes (5 to 30) for AI Scientist: template-free, n = 30 per point.]

Fig. 3 | The phases and compute scaling of the AI Scientist. a, The research experimentation phase is visualized as a four-stage process. A preliminary baseline code implementation is first constructed (stage 1) and refined by tuning the hyperparameters (stage 2). The resultant code serves as a starting point for executing the research agenda through an agentic tree search (stage 3), followed by ablation experiments (stage 4). Full details of the agentic tree search process are provided in Methods. b, A real example of tree search by The AI Scientist with node annotations outlining the experimen…

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at /10.1038/s41586-026-10265-5.

1. Lenat, D. B. Automated theory formation in mathematics. In Proc. 5th International Joint Conference on Artificial Intelligence 833–842 (ed. Reddy, R.) (William Kaufmann, 1977).

2. Buchanan, B. G. & Feigenbaum, E. A. Dendral and meta-dendral: their applications dimension. Artif. Intell. 11, 5–24 (1978).

3. OpenAI. GPT-4 technical report. Preprint at /10.48550/arXiv.2303.08774 (2023).
