914 | Nature | Vol 651 | 26 March 2026
Article
Towards end-to-end automation of AI research
/10.1038/s41586-026-10265-5
Received: 8 July 2025
Accepted: 11 February 2026
Published online: 25 March 2026
Open access
Chris Lu1,2,5, Cong Lu1,3,4,5, Robert Tjarko Lange1,5, Yutaro Yamada1,5 ✉, Shengran Hu1,3,4, Jakob Foerster2, David Ha1 ✉ & Jeff Clune3,4 ✉
The automation of science is a long-standing ambition in artificial intelligence (AI) research [1,2]. Although the community has made substantial progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle, from conception to publication, has remained out of reach. Here we present The AI Scientist, a pipeline that automates the entire scientific process end to end: it creates research ideas, writes code, runs experiments, plots and analyses data, writes the entire scientific manuscript, and performs its own peer review. Its ideas, execution and presentation are of sufficient quality that the manuscript generated by this AI system passed the first round of peer review for a workshop of a top-tier machine learning conference. The workshop had an acceptance rate of 70%. Our system leverages modern foundation models [3–5] within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode that uses human-provided code templates as an initial scaffold for conducting research on a specific topic, and a template-free, open-ended mode that leverages agentic search for wider scientific exploration [6,7]. Both settings produce diverse ideas and automatically test, report on and evaluate them. This achievement demonstrates the growing capacity of AI for making scientific contributions and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be important risks, including taxing overwhelmed review systems and adding noise to the scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.
AI has long been used to aid scientific discovery, an ambition with deep roots in the history of the field [1,8–11]. Before the rise of large language models (LLMs), AI was limited to helping with specific, narrow tasks, such as discovering chemical structures [2], finding mathematical proofs [1], discovering new materials [12–14] and predicting the three-dimensional shape of proteins [15,16]. Other systems focused on analysing pre-collected datasets to find new insights [10,17,18]. However, with the recent advent of powerful and general foundation models, the role of AI has expanded to include assisting with a wider array of research activities. For example, LLMs now help with generating new hypotheses [19–23], writing literature reviews [24,25] and coding experiments [26–29]. Despite these advances in automating individual components, a system that autonomously navigates the entire research lifecycle, from conception to publication, has remained out of reach until now.

This paper introduces The AI Scientist, a pipeline that achieves the vision of full end-to-end automation of the scientific process. The AI Scientist uses existing foundation models to perform ideation, literature search, experiment planning and implementation, result analysis, manuscript writing, and peer review to produce complete, new papers. We focus on machine learning science, as experiments typically occur entirely on the computer.

A central challenge in developing such a system is automatically evaluating the quality of its scientific output at scale. To address this, we created an automated reviewer and first evaluated its performance against real, human-generated papers. The Automated Reviewer can accurately predict conference acceptance decisions, performing on par with human reviewers (Supplementary Information section A.3). We then used The Automated Reviewer to compare various configurations of The AI Scientist by assessing how performance changes with the scale of the test-time compute and the quality of the underlying foundation model. We find that The AI Scientist performs better with more compute resources (Fig. 3c). Furthermore, The Automated Reviewer shows that improvements to the base models significantly improve the quality of the generated papers, a finding that strongly implies that future versions of our system will be substantially more capable as models continue to improve (Fig. 1b).

To assess The AI Scientist in the same setting in which human-authored papers are evaluated, we conducted an experiment in which we submitted generated papers to a workshop at the International Conference on Learning Representations (ICLR), with the organizers' consent. In computer science, such top-tier conferences are the primary and most prestigious venues for archival and rigorously peer-reviewed publication. They also have workshops with a substantially lower but still non-trivial bar for peer-reviewed acceptance.

One of The AI Scientist's manuscripts achieved high enough scores to exceed the average human acceptance threshold at a workshop, providing an example of a fully AI-generated paper successfully navigating a peer-review process, albeit one with a lower bar.

1Sakana AI, Tokyo, Japan. 2FLAIR, University of Oxford, Oxford, UK. 3University of British Columbia, Vancouver, British Columbia, Canada. 4Vector Institute, Toronto, Ontario, Canada. 5These authors contributed equally: Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada. ✉e-mail: yutaro.yamada.y@; hadavid@sakana.ai; jclune@

[Fig. 1 panels. a, workflow: Ideation (LLM idea proposal, novelty checking, scoring and archiving); Experimentation (preliminary investigation, hyperparameter tuning, research execution and ablation studies, with results written to a log and the best node carried forward); Write-up (plotting and feedback, paper template, paper, paper AI review). b, AI reviewer paper score versus language model release date (July 2023 to July 2025) for template-based and template-free runs, with a linear fit (R² = 0.517, P < 0.00001) and 95% confidence interval; models include GPT-4, Sonnet-3, Gemini-1.5, Sonnet-3.5, GPT-4o, o1, Gemini-2.0, Sonnet-3.7, o3, Gemini-2.5 and Sonnet-4. c, balanced accuracy before (2017–2024) and after (2025) the knowledge cutoff for human reviewers, always reject, The Automated Reviewer and random decisions.]

Fig. 1 | The AI Scientist workflow. a, The AI Scientist consists of distinct phases covering automated idea generation, tree-based experimentation, manuscript writing and reviewing. The experimentation phase uses an agentic tree search to generate and refine code implementations. This is structured into four stages: (1) initial investigation, (2) hyperparameter tuning, (3) research agenda execution and (4) ablation studies. From one experimental stage to the next, the best-performing checkpoint is selected to seed the next stage of the tree search. b, Scores for The AI Scientist papers across model releases. Paper quality consistently improves with the underlying model release date (as judged by The Automated Reviewer), indicating consistent future improvements with improving foundation models. The observed correlation is statistically significant (P < 0.00001). Shaded regions represent the standard error. Points represent mean scores with error bars and shaded regions indicating the standard error (n = 6 for template-free points, n = 3 for template-based points). Full experimental details, including model versions and replication counts, are provided in Supplementary Information section A.2.9. c, Automated review versus conference decisions. The Automated Reviewer achieves performance comparable with that of human reviewers, as validated by openly available decisions from past conferences (Table 1). Bars represent mean balanced accuracy; error bars show 95% bootstrapped confidence intervals (5,000 replicates). For replicability, each automated review is a 5-run ensemble. Two-sample z-tests on subsampled accuracy (automated n = 698/876, human n = 412) showed no significant difference before the training cutoff (P = 0.319) or post-cutoff (P = 0.921). Non-parametric bootstrap tests on F1 scores showed automated outperformance (P < 0.001).
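The Fig. 1c caption names two statistical procedures: two-sample z-tests on accuracy and 95% bootstrapped confidence intervals with 5,000 replicates. Below is a minimal sketch of the textbook versions of both. It is not the authors' analysis code, and their subsampling scheme is not reproduced.

- `two_proportion_z_test` compares two accuracies given as correct/total counts, using a pooled standard error. It assumes the pooled proportion lies strictly between 0 and 1.
- `bootstrap_ci` gives a percentile interval for plain accuracy from 0/1 correctness flags.

For balanced accuracy, the same resampling applies if the metric is recomputed on each resample of papers.

```python
from __future__ import annotations

import math
import random


def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Pooled two-sample z-test for a difference between two accuracies.

    k1 of n1 and k2 of n2 decisions are correct; returns (z, two-sided P).
    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided P value.
    return z, math.erfc(abs(z) / math.sqrt(2))


def bootstrap_ci(correct: list[int], replicates: int = 5000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for accuracy from 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the decisions with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(replicates))
    lo = means[int(alpha / 2 * replicates)]
    hi = means[int((1 - alpha / 2) * replicates) - 1]
    return lo, hi
```

A large P value from the z-test (such as the reported P = 0.319 and P = 0.921) means the data give no evidence of a difference. It does not prove the two reviewers are equivalent.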
Generating manuscripts
The AI Scientist sequentially completes four main phases (Fig. 1a). In the first phase, The AI Scientist is prompted to iteratively grow an archive [30] of high-level research directions and hypotheses that it can explore within a user-specified machine learning research subfield (an example progression is visualized in Supplementary Information section C.4). For each direction, it generates:

- a descriptive title;
- its reasoning for what the idea is and why it would be interesting to pursue;
- a proposed experimental plan (Supplementary Information sections A.1.1 and A.2.6).

After idea generation, The AI Scientist filters ideas by connecting the language model to the Semantic Scholar application programming interface (API) [31] and to web access as tools [32]. This allows The AI Scientist to discard any idea that too closely resembles a work in the existing literature.
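The archive-and-filter loop described above can be sketched as follows.

- Each entry holds a title, a reasoning field and a plan, mirroring the three items listed above.
- A new idea is archived only if it is not too similar to an earlier idea or to a paper returned by a literature search.
- `search` is a hypothetical callable standing in for the Semantic Scholar and web tools.
- Word-overlap (Jaccard) similarity on titles replaces the language model's own judgement of similarity; this swap is an assumption.
- The 0.6 cut-off is arbitrary.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Idea:
    """One archive entry: title, rationale and proposed experimental plan."""
    title: str
    reasoning: str
    plan: str


def _words(text: str) -> set[str]:
    """Lower-cased words with surrounding punctuation stripped."""
    cleaned = (w.strip(".,:;()").lower() for w in text.split())
    return {w for w in cleaned if w}


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap in [0, 1]; a crude proxy for 'too similar'."""
    wa, wb = _words(a), _words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


@dataclass
class IdeaArchive:
    ideas: list[Idea] = field(default_factory=list)
    threshold: float = 0.6  # assumed cut-off, not taken from the paper

    def add_if_novel(self, idea: Idea,
                     search: Callable[[str], list[str]]) -> bool:
        """Archive `idea` unless it resembles an archived idea or a paper
        title returned by the hypothetical `search(query)` tool."""
        known = [old.title for old in self.ideas] + search(idea.title)
        if any(overlap(idea.title, t) >= self.threshold for t in known):
            return False
        self.ideas.append(idea)
        return True
```

Comparing against the archive as well as the literature keeps the archive growing with distinct directions rather than near-duplicates.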
The second phase of The AI Scientist executes the proposed experiments and then visualizes their results for the downstream write-up. We tested two different variants of experiment execution:
(1) Template-based: The AI Scientist is provided with a starting code template that reproduces a training run from a popular algorithm. The AI Scientist then executes the proposed experiment plan in linear order (Supplementary Information section A.1).
(2) Template-free: Alternatively, The AI Scientist can generate an initial starting code script by itself. In this case, experimentation includes further stages for optimizing the code it writes from scratch, and experiment execution leverages extra test-time compute with a tree search (Fig. 3a,b and Methods).
After each experiment, The AI Scientist is given the results and is prompted to take notes in the style of an experimental journal for future planning and write-up.
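The template-free variant runs a staged tree search. Per the Fig. 1 caption, each of four stages grows a tree, and the best-performing node seeds the next stage; per Fig. 3, crashed ("buggy") nodes are debugged and working nodes are refined. A minimal sketch under stated assumptions:

- `propose` is a hypothetical stand-in for the LLM call that writes new code.
- `run` is a hypothetical stand-in for executing that code; it returns a score, or None if the code crashed.
- The 30% debug probability is an arbitrary choice, not the policy described in Methods.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Callable, Optional

STAGES = ("preliminary investigation", "hyperparameter tuning",
          "research agenda execution", "ablation studies")


@dataclass
class Node:
    code: str                 # experiment script held by this node
    score: Optional[float]    # metric from running it; None if it crashed
    parent: Optional[Node] = None

    @property
    def buggy(self) -> bool:
        return self.score is None


def run_stage(seed: Node, propose: Callable[[Node], str],
              run: Callable[[str], Optional[float]],
              budget: int, rng: random.Random) -> Node:
    """Grow a tree from `seed` for `budget` expansions; return its best node."""
    nodes = [seed]
    for _ in range(budget):
        buggy = [n for n in nodes if n.buggy]
        working = [n for n in nodes if not n.buggy]
        # Assumed policy: debug a crashed node 30% of the time (or always,
        # if nothing works yet); otherwise refine the current best node.
        if buggy and (not working or rng.random() < 0.3):
            parent = rng.choice(buggy)
        else:
            parent = max(working, key=lambda n: n.score)
        child_code = propose(parent)
        nodes.append(Node(child_code, run(child_code), parent))
    working = [n for n in nodes if not n.buggy]
    return max(working, key=lambda n: n.score) if working else seed


def run_experiments(root: Node,
                    proposer_for: Callable[[str], Callable[[Node], str]],
                    run: Callable[[str], Optional[float]],
                    budget: int = 5, seed: int = 0) -> Node:
    """Run the four stages; each stage's best node seeds the next stage."""
    rng = random.Random(seed)
    best = root
    for stage in STAGES:
        best = run_stage(best, proposer_for(stage), run, budget, rng)
    return best
```

The best node found so far is the seed of every stage and stays in that stage's tree. As a result, the returned score never falls below the starting node's score.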
The third phase of The AI Scientist produces a concise write-up of its research in the style of a standard machine learning conference paper. The AI Scientist is prompted to fill in a blank LaTeX conference template section by section using its notes and plots (Methods). To construct the related work section and add citations throughout the manuscript, the system queries the Semantic Scholar [31] API for relevant literature and compares its findings against the generated manuscript over 20 rounds. For each potential citation, the system generates a textual justification for its inclusion, which informs The AI Scientist how to use the reference appropriately within the manuscript.
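The citation step amounts to a bounded loop: query, judge each candidate, keep a justification for each one accepted. A minimal sketch, assuming:

- `search(draft, round)` is a hypothetical callable standing in for a Semantic Scholar query; it returns (key, abstract) pairs.
- `justify(draft, abstract)` is a hypothetical LLM call; it returns a justification string, or None to skip the paper.

Neither name is an actual API of the system or of Semantic Scholar.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Citation:
    key: str
    justification: str  # why the paper is cited and where it should be used


def gather_citations(
    draft: str,
    search: Callable[[str, int], list[tuple[str, str]]],
    justify: Callable[[str, str], Optional[str]],
    rounds: int = 20,
) -> list[Citation]:
    """Collect justified citations over a fixed number of search rounds."""
    cited: dict[str, Citation] = {}
    for r in range(rounds):
        for key, abstract in search(draft, r):
            if key in cited:
                continue  # already accepted in an earlier round
            reason = justify(draft, abstract)
            if reason is not None:
                cited[key] = Citation(key, reason)
    return list(cited.values())
```

Storing the justification next to each key lets the writing step use it when placing the reference in the text.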
Finally, the paper generated by The AI Scientist undergoes a review by The Automated Reviewer, which automatically evaluates the scientific quality of the conducted research.
Automated evaluation of generated papers
The Automated Reviewer provides reviews based on the review guidelines for the top-tier Neural Information Processing Systems (NeurIPS) conference (https://neurips.cc/Conferences/2022/ReviewerGuidelines). The output contains:

- numerical scores (soundness, presentation, contribution, overall quality and reviewer confidence);
- lists of weaknesses and strengths;
- a binary decision (accept or reject).

The Automated Reviewer pipeline consists of an ensemble of five reviews, followed by a meta-review in which the model acts as an area chair to make a final decision conditioned on all five reviews (Supplementary Information section A.3). We compared Automated Reviewer decisions with ground truth data for ICLR papers extracted from the publicly available OpenReview dataset [33].

As shown in Table 1, the agreement of Automated Reviewer assessments with human assessments is comparable with inter-human agreement, measured by F1 score and balanced accuracy. Inter-human agreement is taken from the NeurIPS 2021 consistency study [34], which measured agreement between human reviewers on a comparable set of submissions (Supplementary Information section A.3). This demonstrates its ability to replicate the collective judgement of human reviewers with high fidelity. These results are statistically significant (non-parametric bootstrap test [35] and two-sample z-test [36]; Supplementary Information section A.3).

Next, we investigated the effect of potential data contamination: the possibility that decisions on a paper were part of the training set for the LLM. We evaluated The Automated Reviewer on two datasets:

- one containing 1,000 papers from years potentially within the training data used for the model (2017–2024);
- a second 'clean' dataset from the year after the cutoff (2025), which could not have been seen during training.

A comparison between the years before and after the knowledge cutoff indicates that data contamination may exist, as balanced decision accuracy decreases from 69% before to 66% in the year after the cutoff. However, the results for the year after the cutoff remain comparable with those of human reviewers (for example, 66% balanced accuracy), showing that potential contamination had, at most, a minimal effect.

Table 1 | Performance comparison of human reviewers and The Automated Reviewer

Reviewer | Balanced accuracy (↑) | Accuracy (↑) | F1 score (↑) | AUC (↑) | FPR (↓) | FNR (↓)
Human (NeurIPS) | 0.66 | 0.73 | 0.49 | 0.65 | 0.17 | 0.52
Years before knowledge cutoff (2017–2024)
Random decision | 0.50 | 0.54 | 0.47 | 0.52 | 0.47 | 0.43
Always reject | 0.50 | 0.65 | 0.00 | 0.50 | 0.00 | 1.00
Automated Reviewer | 0.69 ± 0.04 | 0.65 ± 0.10 | 0.62 ± 0.09 | 0.69 ± 0.09 | 0.45 ± 0.10 | 0.17 ± 0.08
Year after knowledge cutoff (2025)
Random decision | 0.52 | 0.51 | 0.48 | 0.49 | 0.50 | 0.48
Always reject | 0.50 | 0.56 | 0.00 | 0.50 | 0.00 | 1.00
Automated Reviewer | 0.66 ± 0.03 | 0.63 ± 0.09 | 0.67 ± 0.09 | 0.65 ± 0.10 | 0.52 ± 0.10 | 0.17 ± 0.07

Performance comparison of human reviewers (NeurIPS 2021 consistency experiment [34]) and The Automated Reviewer, evaluated on papers published before (2017–2024) and after (2025) the knowledge cutoff. The Automated Reviewer achieved performance superior or comparable to human reviewer consistency in key metrics such as F1 score, area under the curve (AUC) and balanced accuracy, even for data beyond the knowledge cutoff, highlighting its robustness and reliability across different time periods. Error margins denote the 95% bootstrapped confidence intervals. Arrows indicate whether it is better for a score to be higher (↑) or lower (↓). Supplementary Information section A.3.2 explains each metric and comparison in detail. FNR, false negative rate; FPR, false positive rate.
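The reviewer evaluation above combines two pieces: an ensemble decision and the Table 1 metrics computed against human decisions. A minimal sketch, with "accept" as the positive class:

- `meta_decision` is a deliberately simplified, hypothetical stand-in for the area-chair meta-review. It accepts when the mean overall score of the five reviews reaches an assumed threshold of 6.0, whereas the actual pipeline uses the model itself as area chair.
- `review_metrics` computes balanced accuracy, accuracy, F1, FPR and FNR from paired predictions and ground truth. AUC is omitted because it needs continuous scores.

```python
from __future__ import annotations

from statistics import mean


def meta_decision(overall_scores: list[float], threshold: float = 6.0) -> bool:
    """Simplified stand-in for the area-chair meta-review: accept when the
    mean overall score of the ensemble reaches an assumed threshold."""
    return mean(overall_scores) >= threshold


def review_metrics(pred: list[bool], truth: list[bool]) -> dict[str, float]:
    """Table 1 metrics (except AUC), with 'accept' as the positive class."""
    pairs = list(zip(pred, truth))
    tp = sum(p and t for p, t in pairs)
    tn = sum(not p and not t for p, t in pairs)
    fp = sum(p and not t for p, t in pairs)
    fn = sum(not p and t for p, t in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on accepted papers
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall on rejected papers
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {
        "balanced_accuracy": (tpr + tnr) / 2,
        "accuracy": (tp + tn) / len(pairs),
        "f1": f1,
        "fpr": fp / (tn + fp) if tn + fp else 0.0,
        "fnr": fn / (tp + fn) if tp + fn else 0.0,
    }
```

These definitions explain the "Always reject" rows in Table 1. On any dataset, rejecting everything gives balanced accuracy 0.50, F1 0.00, FPR 0.00 and FNR 1.00, while its plain accuracy equals the fraction of rejected papers. This is why balanced accuracy, not accuracy, is the headline metric.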
Using The Automated Reviewer, we assessed the quality of the research papers generated by a wide range of LLMs as the core model within The AI Scientist. Our analysis revealed a clear trend: as models improved over time, the quality of the papers produced by The AI Scientist increased correspondingly (Fig. 1b). With recent generations of models, The AI Scientist on average produced papers that approach borderline acceptability for machine learning conference workshops, as judged by our Automated Reviewer (Supplementary Fig. B2). Additionally, there is a strong correlation between the amount of compute allocated per paper and the resulting quality (Fig. 3c). Together, these results indicate that both model scale and inference-time investment play important roles in the output quality of The AI Scientist. They also suggest that substantial further improvements are possible as the costs of AI systems continue to decrease exponentially and their capabilities continue to increase exponentially [37].
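The trend in Fig. 1b is summarized by a linear fit of reviewer score against model release date, with R² = 0.517. As a reference for what that number measures, here is a minimal ordinary-least-squares sketch that returns the intercept, slope and R².

- Encoding release dates as, for example, months since July 2023 is an assumption made for illustration.
- The sketch does not compute the reported P value.

```python
from __future__ import annotations


def linear_fit(xs: list[float], ys: list[float]) -> tuple[float, float, float]:
    """Ordinary least-squares line y = a + b*x; returns (a, b, R^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot   # R^2: share of variance explained
```

An R² of about 0.5 means the release date accounts for roughly half of the variance in paper scores. The remaining variance comes from run-to-run noise and from other differences between models.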
Human evaluation results
Perhaps the ultimate and fairest test of the quality of the work of The AI Scientist is a version of what we might call an AI scientist Turing test: submitting the work to the same rigorous, blind peer-review systems used to evaluate human science. We submitted three generated manuscripts to the formal peer-review process of a workshop at a top-tier machine learning conference. This experiment was conducted with the approval of the relevant institutional review board (IRB; Supplementary Information section C.3) and the full cooperation of the ICLR 2025 leadership and the organizers of the I Can't Believe It's Not Better (ICBINB) workshop. This was the only venue that we submitted to.
The template-free version of The AI Scientist was readily adapted to this setting by simply prompting it with the broad theme of the workshop, which was investigating deep learning limitations, including where previous ideas to improve it had not worked. The overall process was then run to generate ideas, experiments and papers. We manually filtered the most promising outputs at each stage (Supplementary Information section A.4). Had this filtering not occurred, the papers under analysis would still have been produced in their final form, just alongside other papers and, thus, at a greater total cost.

This process resulted in three complete manuscripts being selected for submission. The selection was based on three criteria (Supplementary Information section A.4):

- whether the idea was aligned with the workshop topic;
- whether the code correctly implemented the proposed idea and ran without errors;
- the correctness of the manuscript formatting.

The entire scientific workflow for each paper, from ideation and coding to manuscript writing, was performed without any human modification. These three submissions were included among the 43 papers reviewed for the workshop. Reviewers were informed that some of the submissions were AI-generated but not which ones, ensuring a blind process.

One of the three AI-generated manuscripts received an average score of 6.33 from the reviewers (individual scores were 6, 7 and 6), placing it above the average acceptance threshold for the workshop (Fig. 2). The organizers said that the paper would in all likelihood have been accepted had it not been withdrawn, as our pre-established protocol required for AI-generated submissions. Notably, the accepted manuscript reported a negative result, aligning with the focus of the workshop on interesting negative results. The other two papers did not meet the bar for acceptance (Supplementary Table D9). Thus, a fully AI-generated paper passed a standard scientific peer-review process.

We also conducted our own internal review, using the human AI researchers on our team (Supplementary Information section C.2). The team concluded that although one of the papers did meet the bar for workshop papers, none met the higher bar for a main ICLR conference publication. A full analysis of all three submitted papers, including their strengths, weaknesses and implementations, is provided in Supplementary Information section C.2.

[Fig. 2 panels: title and abstract (page 1), technical methodology (page 2), data visualizations (page 4) and references (page 5) of the generated paper.]

Fig. 2 | Selected sections from a paper generated by The AI Scientist that was accepted via peer review at a top-tier machine learning conference workshop. The paper received peer-review scores of 6 (weak accept), 7 (accept) and 6 (weak accept) before meta-review and ranked among the top 45% of papers submitted for peer review. This demonstrates that a fully AI-generated paper can navigate the peer-review process successfully at a top-tier conference workshop. A full-sized version of this paper is available in Supplementary Information section D.2.1.
Limitations
Although The AI Scientist generated a workshop paper that passed peer review, there is room for improvement if it is to match the best human-produced science. Only one of three submissions was accepted, and workshops have much higher acceptance rates than main conferences (for example, 70% for the ICLR 2025 ICBINB workshop [38] versus 32% for the ICLR 2025 main conference [39]). Therefore, The AI Scientist cannot yet meet the standards of top-tier publications, nor can it yet do so consistently for workshops. Common failure modes include (a full analysis is provided in Supplementary Information sections A.4, C.2 and C.3):

- the generation of naive or underdeveloped ideas;
- incorrect implementations of the main idea;
- a lack of deep methodological rigour;
- errors in experimental implementation;
- duplicating figures in the main text and the appendix;
- many types of hallucinations, such as inaccurate citations.
That said, in machine learning, once something begins to work (even with clear flaws), the capabilities of a system often become surprising within a few short years and can exceed human performance levels. This happens through scale (for example, of compute and data), better core models and better techniques. In assessing the impact of a technology, it is, thus, important to keep in mind its probable future trajectory. Crucially, this trajectory is not just about better models but about the complexity of the tasks that AI systems can execute. Recent work indicates that the length of tasks that AI can reliably complete is doubling every 7 months [40], suggesting that many current implementation and debugging bottlenecks may be resolved in the near term.

However, some AI weaknesses have proved surprisingly difficult to solve, such as AI being easily fooled [41,42] and overconfidently wrong (hallucinations) [43], although progress has been made [44,45]. Such challenges could persist and would prevent us from reliably trusting the outputs of systems like The AI Scientist. It is also not clear to what extent AI systems can produce new creative ideas that resemble great conceptual leaps in science. Studying and improving AI systems on these fronts are key areas for future research.
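The 7-month doubling trend [40] can be written as an exponential. Let L_0 be the task length AI can reliably complete today and T_d the doubling time. Assuming, purely for illustration, that the trend continues unchanged, a horizon of t months multiplies that length by 2^{t/T_d}. The 24-month horizon below is an arbitrary example.

```latex
\begin{aligned}
L(t) &= L_0 \cdot 2^{t/T_d}, \qquad T_d \approx 7\ \text{months},\\
\frac{L(24\ \text{months})}{L_0} &= 2^{24/7} \approx 10.8 .
\end{aligned}
```

Under that assumption, two years would bring roughly a tenfold increase in reliably completable task length. The estimate is only as good as the premise that the trend persists.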
At present, The AI Scientist conducts computational experiments only. In future work, this same playbook could be applied to other scientific domains where one can automatically conduct experiments (or have humans conduct them) and collect data from them (for example, automated chemistry laboratories, on which swift progress is being made [46]).
The ability to automate paper generation raises important ethical and societal concerns (Supplementary Information section C.3), including the potential to:

- overwhelm the peer-review process;
- artificially inflate research credentials;
- repurpose the ideas of others without giving proper credit;
- eliminate scientist jobs;
- conduct unethical or dangerous experiments.

To conduct this study responsibly, we obtained explicit permission from the ICLR leadership, the workshop organizers and the University of British Columbia's IRB (H24-02652). Crucially, as part of our experimental protocol, we determined in advance that all AI-generated submissions would be withdrawn after peer review, regardless of outcome. This decision was made to avoid setting a precedent for publishing fully automated research before the scientific community has established clear standards for disclosure and evaluation. Developing these norms is a critical next step to ensure that such systems are used to advance, not undermine, scientific integrity. Finally, more research is needed to ensure that open-ended exploratory AI proceeds safely and in alignment with human values [47,48].

The generation by The AI Scientist of an AI-authored manuscript that passed peer review for a workshop at a top-tier machine learning conference marks a milestone in the centuries-long scientific endeavour. Although challenges remain in terms of consistency and achieving top-tier quality, this success demonstrates the growing capacity of AI for scientific reasoning. It signals the dawn of a new era in which the process of discovery is no longer a solely human pursuit, and in which the pace at which we reap the harvest of scientific discovery could accelerate dramatically.

[Fig. 3 panels. a, experimentation stages (stage 1: preliminary investigation; stage 2: hyperparameter tuning; stage 3: research agenda execution; stage 4: ablation studies), with node types (non-buggy, buggy, hyperparameter, ablation, replication, aggregation, best) and refine and debug steps. b, example tree from a root node to a best node on the topic "suppressing fast-learning features to avoid shortcut reliance", with per-node annotations. c, paper scores given by The Automated Reviewer versus number of experimental nodes (5 to 30; n = 30 per point) for the template-free AI Scientist.]

Fig. 3 | The phases and compute scaling of The AI Scientist. a, The research experimentation phase is visualized as a four-stage process. A preliminary baseline code implementation is first constructed (stage 1) and refined by tuning the hyperparameters (stage 2). The resultant code serves as a starting point for executing the research agenda through an agentic tree search (stage 3), followed by ablation experiments (stage 4). Full details of the agentic tree search process are provided in Methods. b, A real example of tree search by The AI Scientist with node annotations outlining the experimen…

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at /10.1038/s41586-026-10265-5.

1. Lenat, D. B. Automated theory formation in mathematics. In Proc. 5th International Joint Conference on Artificial Intelligence 833–842 (ed. Reddy, R.) (William Kaufmann, 1977).
2. Buchanan, B. G. & Feigenbaum, E. A. Dendral and meta-dendral: their applications dimension. Artif. Intell. 11, 5–24 (1978).
3. OpenAI. GPT-4 technical report. Preprint at /10.48550/arXiv.2303.08774 (2023).