914 | Nature | Vol 651 | 26 March 2026

Article

Towards end-to-end automation of AI research

/10.1038/s41586-026-10265-5

Received: 8 July 2025

Accepted: 11 February 2026

Published online: 25 March 2026

Open access

Chris Lu¹,²,⁵, Cong Lu¹,³,⁴,⁵, Robert Tjarko Lange¹,⁵, Yutaro Yamada¹,⁵ ✉, Shengran Hu¹,³,⁴, Jakob Foerster², David Ha¹ ✉ & Jeff Clune³,⁴ ✉

The automation of science is a long-standing ambition in artificial intelligence (AI) research. Although the community has made substantial progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle, from conception to publication, has remained out of reach. Here we present a pipeline for automating the entire scientific process end to end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyses data, writes the entire scientific manuscript, and performs its own peer review. Its ideas, execution and presentation are of sufficient quality that the manuscript generated by this AI system passed the first round of peer review for a workshop of a top-tier machine learning conference. The workshop had an acceptance rate of 70%. Our system leverages modern foundation models within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold for conducting research on a specific topic and a template-free, open-ended mode that leverages agentic search for wider scientific exploration. Both settings produce diverse ideas and automatically test, report on and evaluate them. This achievement demonstrates the growing capacity of AI for making scientific contributions and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be important risks, including taxing overwhelmed review systems and adding noise to the scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

AI has long been used to aid scientific discovery, an ambition with deep roots in the history of the field. Before the rise of large language models (LLMs), AI was limited to helping with specific, narrow tasks, such as discovering chemical structures, finding mathematical proofs¹, discovering new materials and predicting the three-dimensional shape of proteins. Other systems focused on analysing pre-collected datasets to find new insights. However, with the recent advent of powerful and general foundation models, the role of AI has expanded to include assisting with a wider array of research activities. For example, LLMs now help with generating new hypotheses, writing literature reviews and coding experiments. Despite these advances in automating individual components, a system that autonomously navigates the entire research lifecycle, from conception to publication, has remained out of reach until now.

This paper introduces The AI Scientist, a pipeline that achieves the vision of full end-to-end automation of the scientific process. The AI Scientist uses existing foundation models to perform ideation, literature search, experiment planning and implementation, result analysis, manuscript writing, and peer review to produce complete, new papers. We focus on machine learning science, as experiments typically occur entirely on the computer.

A central challenge in developing such a system is automatically evaluating the quality of its scientific output at scale. To address this, we created an automated reviewer and first evaluated its performance against real, human-generated papers. The Automated Reviewer can accurately predict conference acceptance decisions, performing on par with human reviewers (Supplementary Information section A.3). We then used The Automated Reviewer to compare various configurations of The AI Scientist by assessing how performance changes with the scale of the test-time compute and the quality of the underlying foundation model. We find that The AI Scientist performs better with more compute resources (Fig. ). Furthermore, The Automated Reviewer shows that improvements to the base models significantly improve the quality of the generated papers, a finding that strongly implies that future versions of our system will be substantially more capable, as models continue to improve (Fig. ).

To assess The AI Scientist in the same setting in which human-authored papers are evaluated, we conducted an experiment where we submitted generated papers to a workshop at the International Conference on Learning Representations (ICLR), with the organizers' consent. In computer science, such top-tier conferences are the primary and most prestigious venues for archival and rigorously peer-reviewed publication. They also have workshops with a substantially lower but still non-trivial bar for peer-reviewed acceptance.

One of The AI Scientist's manuscripts achieved high enough scores to exceed the average human acceptance threshold at a workshop, providing an example of a fully AI-generated paper successfully navigating a peer-review process, albeit one with a lower bar.

¹Sakana AI, Tokyo, Japan. ²FLAIR, University of Oxford, Oxford, UK. ³University of British Columbia, Vancouver, British Columbia, Canada. ⁴Vector Institute, Toronto, Ontario, Canada. ⁵These authors contributed equally: Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada. ✉e-mail: yutaro.yamada.y@; hadavid@sakana.ai; jclune@

[Figure 1. Panel a: workflow diagram with blocks labelled Ideation (LLM idea proposal, novelty checking, scoring and archiving), Experimentation (preliminary investigation, hyperparameter tuning, research execution, ablation studies; stages annotated "Write to log" and "Best") and Write-up (plotting and feedback, paper template, paper, paper AI review). Panel b: AI reviewer paper score against language model release date (July 2023 to July 2025) for template-based and template-free AI Scientist runs, with points labelled GPT-4, GPT-4o, Gemini-1.5, Gemini-2.0, Gemini-2.5, Sonnet-3, Sonnet-3.5, Sonnet-3.7 and Sonnet-4, and a fit (R² = 0.517, P < 0.00001) with 95% confidence interval. Panel c: balanced accuracy (axis 0.4 to 0.7) for human, automated reviewer, random and reject baselines, before cutoff (2017–2024) and after cutoff (2025).]

Fig. 1 | The AI Scientist workflow. a, The AI Scientist consists of distinct phases covering automated idea generation, tree-based experimentation, manuscript writing and reviewing. The experimentation phase uses an agentic tree search to generate and refine code implementations. This is structured into four stages: (1) initial investigation, (2) hyperparameter tuning, (3) research agenda execution and (4) ablation studies. From one experimental stage to the next, the best-performing checkpoint is selected to seed the next stage of the tree search. b, Scores for The AI Scientist papers across model releases. Paper quality consistently improves with the underlying model release date (as judged by The Automated Reviewer), indicating consistent future improvements with improving foundation models. The observed correlation is statistically significant (P < 0.00001). Shaded regions represent the standard error. Points represent mean scores with error bars and shaded regions indicating the standard error (n = 6 for template-free points, n = 3 for template-based points). Full experimental details, including model versions and replication counts, are provided in Supplementary Information section A.2.9. c, Automated review versus conference decisions. The Automated Reviewer achieves performance comparable with that of human reviewers, as validated by openly available decisions from past conferences (Table ). Bars represent mean balanced accuracy; error bars show 95% bootstrapped confidence intervals (5,000 replicates). For replicability, each automated review is a 5-run ensemble. Two-sample z-tests on subsampled accuracy (automated n = 698/876, human n = 412) showed no significant difference before the training cutoff (P = 0.319) or post-cutoff (P = 0.921). Non-parametric bootstrap tests on F1 scores showed automated outperformance (P < 0.001).

Generating manuscripts

The AI Scientist sequentially completes four main phases (Fig. ). In the first phase, The AI Scientist is prompted to iteratively grow an archive of high-level research directions and hypotheses that it can explore within a user-specified machine learning research subfield (an example progression is visualized in Supplementary Information section C.4). For each direction, it generates a descriptive title, its reasoning for what the idea is and why it would be interesting to pursue, and a proposed experimental plan (Supplementary Information sections A.1.1 and A.2.6). After idea generation, The AI Scientist filters ideas by connecting the language model to the Semantic Scholar application programming interface (API) and web access as tools. This allows The AI Scientist to discard any idea that too closely resembles a work in the existing literature.
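The filtering step can be pictured as a similarity check between a candidate idea and titles returned by a literature search. The sketch below is illustrative only: it swaps in a simple token-overlap (Jaccard) score for the LLM-driven judgement the paper describes, and `jaccard`, `is_novel`, the example titles and the 0.6 threshold are hypothetical names and values, not part of The AI Scientist.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two titles (stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_novel(idea_title: str, known_titles: list[str], threshold: float = 0.6) -> bool:
    """Keep an idea only if no known title is too similar (threshold is assumed)."""
    return all(jaccard(idea_title, t) < threshold for t in known_titles)


# Hypothetical search results; a real run would fetch these from a literature API.
known = [
    "Adam: a method for stochastic optimization",
    "Deep residual learning for image recognition",
]
duplicate_idea = "Adam: a method for stochastic optimization"
fresh_idea = "Slowing shortcut learning with weight-specific penalties"

print(is_novel(duplicate_idea, known))
print(is_novel(fresh_idea, known))
```

A lexical score like this would miss paraphrased duplicates, which is one reason a language model with search tools is a more plausible judge for this step.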

The second phase of The AI Scientist executes the proposed experiments and then visualizes their results for the downstream write-up. We tested two different variants of experiment execution:

(1) Template-based: The AI Scientist is provided with a starting code template that reproduces a training run from a popular algorithm. The AI Scientist then executes the proposed experiment plan in linear order (Supplementary Information section A.1).

(2) Template-free: Alternatively, The AI Scientist can generate an initial starting code script by itself. In this case, experimentation includes further stages for optimizing the code it writes from scratch, and experiment execution leverages extra test-time compute with a tree search (Fig. 3a,b and Methods).

After each experiment, The AI Scientist is given the results and is prompted to take notes in the style of an experimental journal for future planning and write-up.
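The staged search in the template-free variant, in which the best node of one stage seeds the next, can be sketched as follows. This is a toy model under stated assumptions: `Node`, `expand`, `run_stage`, `run_pipeline` and the numeric scores are hypothetical; `expand` merely perturbs a score deterministically where the real system would have an LLM write and run code; and buggy nodes, debugging and replication are omitted.

```python
from dataclasses import dataclass, field

STAGES = [
    "preliminary investigation",
    "hyperparameter tuning",
    "research agenda execution",
    "ablation studies",
]


@dataclass
class Node:
    stage: str
    score: float
    children: list["Node"] = field(default_factory=list)


def expand(parent: Node, stage: str, k: int) -> Node:
    """Hypothetical stand-in for 'propose a refinement and run it'."""
    delta = ((k * 7) % 5 - 1) * 0.1  # deterministic toy perturbation
    child = Node(stage=stage, score=parent.score + delta)
    parent.children.append(child)
    return child


def run_stage(seed: Node, stage: str, budget: int) -> Node:
    """Greedy best-first search: always expand the best node seen so far."""
    frontier = [seed]
    for k in range(budget):
        best = max(frontier, key=lambda n: n.score)
        frontier.append(expand(best, stage, k))
    return max(frontier, key=lambda n: n.score)


def run_pipeline(budget_per_stage: int = 4) -> Node:
    best = Node(stage="root", score=0.0)
    for stage in STAGES:
        # The best checkpoint of one stage seeds the next stage.
        best = run_stage(best, stage, budget_per_stage)
    return best


final = run_pipeline()
print(final.stage, round(final.score, 2))
```

Raising `budget_per_stage` corresponds loosely to spending more test-time compute per paper, the quantity the paper varies in its compute-scaling analysis.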

The third phase of The AI Scientist produces a concise write-up of its research in the style of a standard machine learning conference paper. The AI Scientist is prompted to fill in a blank LaTeX conference template section by section using its notes and plots (Methods). To construct the related work section and add citations throughout the manuscript, the system queries the Semantic Scholar API for relevant literature and compares its findings against the generated manuscript over 20 rounds. For each potential citation, the system generates a textual justification for its inclusion, which informs The AI Scientist on how to use the reference appropriately within the manuscript.

Finally, the paper generated by The AI Scientist undergoes a review by The Automated Reviewer, which automatically evaluates the scientific quality of the conducted research.
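The following section describes this review as an ensemble of five reviews followed by a meta-review. A minimal sketch of that shape is given here, with loud assumptions: `Review`, `meta_review` and the aggregation rule (mean overall score against a hypothetical threshold of 6) are illustrative stand-ins, whereas the paper has an LLM act as area chair to make the final decision.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Review:
    soundness: int
    presentation: int
    contribution: int
    overall: int
    confidence: int
    accept: bool


def meta_review(reviews: list[Review], threshold: float = 6.0) -> dict:
    """Stand-in area chair: averages overall scores; the threshold is assumed."""
    if len(reviews) != 5:
        raise ValueError("this sketch expects an ensemble of five reviews")
    avg = mean(r.overall for r in reviews)
    votes = sum(r.accept for r in reviews)
    return {"mean_overall": avg, "accept_votes": votes, "decision": avg >= threshold}


# Hypothetical scores for illustration only.
ensemble = [Review(3, 3, 2, s, 4, s >= 6) for s in (5, 6, 7, 6, 6)]
result = meta_review(ensemble)
print(result)
```

Averaging is only one possible rule; conditioning a final decision on the full text of all five reviews, as the paper does, can weigh individual strengths and weaknesses that a mean score discards.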

Automated evaluation of generated papers

The Automated Reviewer provides reviews based on the review guidelines for the top-tier Neural Information Processing Systems (NeurIPS) conference (https://neurips.cc/Conferences/2022/ReviewerGuidelines). The output contains numerical scores (soundness, presentation, contribution, overall quality and reviewer confidence), lists of weaknesses and strengths, as well as a binary decision (accept or reject). The Automated Reviewer pipeline consists of an ensemble of five reviews, followed by a meta-review in which the model acts as an area chair to make a final decision conditioned on all five reviews (Supplementary Information section A.3). We compared Automated Reviewer decisions with ground truth data for ICLR papers extracted from the publicly available OpenReview dataset. As shown in Table , the agreement of Automated Reviewer assessments with human assessments is comparable with inter-human agreement measured by F1 score and balanced accuracy, as reported in the NeurIPS 2021 consistency study, which measured agreement between human reviewers on a comparable set of submissions (Supplementary Information section A.3). This demonstrates its ability to replicate the collective judgement of human reviewers with high fidelity. These results are statistically significant (non-parametric bootstrap test and two-sample z-test; Supplementary Information section A.3).

Next, to investigate the effect of potential data contamination (the possibility that decisions on a paper were part of the training set for the LLM), we evaluated The Automated Reviewer on two datasets: one containing 1,000 papers from years potentially within the training data used for the model (2017–2024) and a second 'clean' dataset from the year after the cutoff (2025), which could not have been seen during training. A comparison between years before and after the knowledge cutoff indicates that data contamination may exist, as balanced decision accuracy decreases from 69% before to 66% in the year after the cutoff. However, the results for the year after the cutoff remain comparable with those of human reviewers (for example, 66% balanced accuracy), showing that potential contamination had, at most, a minimal effect.

Table 1 | Performance comparison of human reviewers and The Automated Reviewer

Reviewer             Balanced accuracy (↑)  Accuracy (↑)  F1 score (↑)  AUC (↑)      FPR (↓)      FNR (↓)
Human (NeurIPS)      0.66                   0.73          0.49          0.65         0.17         0.52

Years before knowledge cutoff (2017–2024)
Random decision      0.50                   0.54          0.47          0.52         0.47         0.43
Always reject        0.50                   0.65          0.00          0.50         0.00         1.00
Automated Reviewer   0.69 ± 0.04            0.65 ± 0.10   0.62 ± 0.09   0.69 ± 0.09  0.45 ± 0.10  0.17 ± 0.08

Year after knowledge cutoff (2025)
Random decision      0.52                   0.51          0.48          0.49         0.50         0.48
Always reject        0.50                   0.56          0.00          0.50         0.00         1.00
Automated Reviewer   0.66 ± 0.03            0.63 ± 0.09   0.67 ± 0.09   0.65 ± 0.10  0.52 ± 0.10  0.17 ± 0.07

Performance comparison of human reviewers (NeurIPS 2021 consistency experiment) and the Automated Reviewer, evaluated on papers published before (2017–2024) and after (2025) the knowledge cutoff. The Automated Reviewer achieved performance superior or comparable with human reviewer consistency in key metrics such as F1 score, area under the curve (AUC) and balanced accuracy, even for data beyond the knowledge cutoff, highlighting its robustness and reliability across different time periods. Error margins denote the 95% bootstrapped confidence intervals. Arrows indicate whether it is better for a score to be higher (↑) or lower (↓). Supplementary Information section A.3.2 explains each metric and comparison in detail. FNR, false negative rate; FPR, false positive rate.
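As a quick consistency check on Table 1, balanced accuracy is the mean of the true positive rate and the true negative rate, which equals 1 − (FPR + FNR)/2. The short sketch below recomputes it from the FPR and FNR columns. `balanced_accuracy` is an illustrative helper rather than code from the paper, "accept" is assumed to be the positive class, and agreement with the table can hold only up to rounding of the published two-decimal values.

```python
def balanced_accuracy(fpr: float, fnr: float) -> float:
    """Mean of the true positive rate (1 - FNR) and true negative rate (1 - FPR)."""
    return ((1.0 - fnr) + (1.0 - fpr)) / 2.0


# (FPR, FNR) pairs as printed in Table 1.
rows = {
    "Human (NeurIPS)": (0.17, 0.52),
    "Automated Reviewer, 2017-2024": (0.45, 0.17),
    "Automated Reviewer, 2025": (0.52, 0.17),
    "Always reject": (0.00, 1.00),
}

for name, (fpr, fnr) in rows.items():
    print(f"{name}: {balanced_accuracy(fpr, fnr):.3f}")
```

The human and post-cutoff rows both come out near 0.655, consistent with the reported 0.66, and the pre-cutoff row gives 0.69. The trade-off is also visible: the Automated Reviewer has a much lower FNR than the human baseline but a much higher FPR.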

Using The Automated Reviewer, we assessed the quality of the research papers generated by a wide range of LLMs as the core model within The AI Scientist. Our analysis revealed a clear trend: as models improve over time, the quality of the papers produced by The AI Scientist increased correspondingly (Fig. ). With recent generations of models, on average, The AI Scientist produced papers that approach borderline acceptability for machine learning conference workshops, as judged by our Automated Reviewer (Supplementary Fig. B2). Additionally, there is a strong correlation between the amount of compute allocated per paper and the resulting quality (Fig. ), indicating that both model scale and inference-time investment play important roles in the output quality of The AI Scientist, further indicating the possibility of substantial improvements as the costs of AI systems continue to exponentially decrease and capabilities exponentially increase.

Human evaluation results

Perhaps the ultimate and fairest test of the quality of the work of The AI Scientist is a version of what we might call an AI scientist Turing test: submitting the work to the same rigorous, blind peer-review systems used to evaluate human science. We submitted three generated manuscripts to the formal peer-review process of a workshop at a top-tier machine learning conference. This experiment was conducted with the approval of the relevant institutional review board (IRB; Supplementary Information section C.3) and the full cooperation of the ICLR 2025 leadership and the organizers of the I Can't Believe It's Not Better (ICBINB) workshop. This was the only venue that we submitted to.

The template-free version of The AI Scientist was readily adapted to this setting by simply prompting it with the broad theme of the workshop (which was investigating deep learning limitations, including where previous ideas to improve it had not worked). The overall process was then run to generate ideas, experiments and papers. We manually filtered the most promising outputs at each stage (Supplementary Information section A.4). Had this filtering not occurred, the papers under analysis would still have been produced in their final form, just along with other papers and, thus, at a greater total cost. This process resulted in three complete manuscripts being selected for submission. The selection was based on three criteria: whether the idea was aligned with the workshop topic, whether the code correctly implemented the proposed idea and ran without errors, and the correctness of the manuscript formatting (Supplementary Information section A.4). The entire scientific workflow for each paper, from ideation and coding to manuscript writing, was performed without any human modification. These three submissions were included among the 43 papers reviewed for the workshop. Reviewers were informed that some of the submissions were AI-generated but not which ones, ensuring a blind process.

One of the three AI-generated manuscripts received an average score of 6.33 from the reviewers (individual scores were 6, 7 and 6), placing it above the average acceptance threshold for the workshop (Fig. ). The organizers said that the paper would have been accepted in all likelihood were it not withdrawn according to our pre-established protocol due to being AI-generated. Notably, the accepted manuscript reported a negative result, aligning with the focus of the workshop on interesting negative results. The other two papers did not meet the bar for acceptance (Supplementary Table D9). Thus, a fully AI-generated paper passed a standard scientific peer-review process. We also conducted our own internal review, using the human AI researchers on our team (Supplementary Information section C.2). The team concluded that although one of the papers did meet the bar for workshop papers, none met the higher bar for a main ICLR conference publication. A full analysis of all three submitted papers, including their strengths, weaknesses and implementations, is provided in Supplementary Information section C.2.

[Figure 2. Excerpts from the generated paper: title and abstract (page 1), technical methodology (page 2), data visualizations (page 4) and references (page 5).]

Fig. 2 | Selected sections from a paper generated by The AI Scientist that was accepted via peer review at a top-tier machine learning conference workshop. The paper received peer-review scores of 6 (weak accept), 7 (accept) and 6 (weak accept) before meta-review and ranked among the top 45% of papers submitted for peer review. This demonstrates that a fully AI-generated paper can navigate the peer-review process successfully at a top-tier conference workshop. A full-sized version of this paper is available in Supplementary Information section D.2.1.

Limitations

Although The AI Scientist generated a workshop paper that passed peer review, there is room for improvement if it is to match the best human-produced science. Only one of three submissions was accepted, and workshops have much higher acceptance rates than main conferences (for example, 70% for the ICLR 2025 ICBINB workshop versus 32% for the ICLR 2025 main conference). Therefore, The AI Scientist cannot yet meet the standards of top-tier publications nor even do so consistently for workshops. Common failure modes include the generation of naive or underdeveloped ideas, incorrect implementations of the main idea, a lack of deep methodological rigour, errors in experimental implementation, duplicating figures in the main text and the appendix, and many types of hallucinations, such as inaccurate citations (a full analysis of failure modes is provided in Supplementary Information sections A.4, C.2 and C.3).

That said, often in machine learning, once something begins to work (even with clear flaws), in a few short years with scale (for example, of compute and data), better core models and better techniques, the capabilities of a system become surprising and can exceed human performance levels. In assessing the impact of a technology, it is, thus, important to keep in mind its probable future trajectory. Crucially, this trajectory is not just about better models but about the complexity of the tasks that AI systems can execute. Recent work indicates that the length of tasks that AI can reliably complete is doubling every 7 months, indicating that many current implementation and debugging bottlenecks may be resolved in the near term. However, some AI weaknesses have proved surprisingly difficult to solve, such as AI being easily fooled and overconfidently wrong (hallucinations), although progress has been made. Such challenges could persist and would prevent us from reliably trusting the outputs of systems like The AI Scientist. It is also not clear to what extent AI systems can produce new creative ideas that resemble great conceptual leaps in science. Studying and improving AI systems on these fronts are key areas for future research.
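To make the cited doubling trend concrete, a task length that doubles every 7 months grows by a factor of 2^(m/7) after m months. The helper below sketches that arithmetic only; `growth_factor` is a hypothetical name, and extrapolating the trend forward is an assumption rather than a result reported in the paper.

```python
def growth_factor(months: float, doubling_period_months: float = 7.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


for m in (7, 14, 21):
    print(m, growth_factor(m))
```

Under this assumption, three doubling periods (21 months) correspond to an eightfold increase in the length of tasks completed reliably.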

At present, The AI Scientist conducts computational experiments only. In future work, this same playbook could be applied to other scientific domains where one can automatically conduct experiments (or have humans conduct them) and collect data from them (for example, automated chemistry laboratories, on which swift progress is being made).

The ability to automate paper generation raises important ethical and societal concerns, including the potential to overwhelm the peer-review process, artificially inflate research credentials, repurpose the ideas of others without giving proper credit, eliminate scientist jobs, or conduct unethical or dangerous experiments (Supplementary Information section C.3). To conduct this study responsibly, we obtained explicit permission from the ICLR leadership, the workshop organizers and the University of British Columbia's IRB (H24-02652). Crucially, as part of our experimental protocol, we determined in advance that all AI-generated submissions would be withdrawn after peer review, regardless of outcome. This decision was made to avoid setting a precedent for publishing fully automated research before the scientific community has established clear standards for disclosure and evaluation. Developing these norms is a critical next step to ensure that such systems are used to advance, not undermine, scientific integrity. Finally, more research is needed to ensure that open-ended exploratory AI proceeds safely and in alignment with human values.

The generation by The AI Scientist of an AI-authored manuscript that passed peer review for a workshop at a top-tier machine learning conference marks a milestone in the centuries-long scientific endeavour. Although challenges remain in terms of consistency and achieving top-tier quality, this success demonstrates the growing capacity of AI for scientific reasoning, and it signals the dawn of a new era in which the process of discovery is no longer a solely human pursuit and in which the pace at which we are able to reap the harvest of scientific discovery could accelerate dramatically.

[Figure 3. Panel a: the four experimentation stages (stage 1, preliminary investigation; stage 2, hyperparameter tuning; stage 3, research agenda execution; stage 4, ablation studies), with nodes marked non-buggy, buggy, hyperparameter, ablation, replication, aggregation or best, and edges marked refine or debug. Panel b: an example search tree from root node to best node across stages 1 to 4, with per-node annotations such as building a colour-biased MNIST dataset, introducing the Waterbirds and CelebA datasets, and ablations on warm-up period duration, dynamic penalty factor adaptation and penalty application strength. A further panel plots paper scores given by the automated reviewer (3.2 to 4.0) against the number of experimental nodes (5 to 30) for the template-free AI Scientist (n = 30).]

Fig. 3 | The phases and compute scaling of the AI Scientist. a, The research experimentation phase is visualized as a four-stage process. A preliminary baseline code implementation is first constructed (stage 1) and refined by tuning the hyperparameters (stage 2). The resultant code serves as a starting point for executing the research agenda through an agentic tree search (stage 3), followed by ablation experiments (stage 4). Full details of the agentic tree search process are provided in Methods. b, A real example of tree search by The AI Scientist with node annotations outlining the experimen

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contribution and competing interests; and statements of data and code availability are available at /10.1038/s41586-026-10265-5

1. Lenat, D. B. Automated theory formation in mathematics. In Proc. 5th International Joint Conference on Artificial Intelligence 833–842 (ed. Reddy, R.) (William Kaufmann, 1977).

2. Buchanan, B. G. & Feigenbaum, E. A. Dendral and meta-dendral: their applications dimension. Artif. Intell. 11, 5–24 (1978).

3. OpenAI. GPT-4 technical report. Preprint at /10.48550/arXiv.2303.08774 (2023).
