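Information Retrieval & Web [...]
Lecture 8: Open Source Search Engines
Dong Shoubin
Communication & Computer Network Laboratory (CCNL), South China University of Technology; Provincial Computer Network [...]

Outline
- Introduction to Open Source Search Engines
- Some [...]

Open Source Search Engines
- Low cost: no licensing [...]
- Source code available for [...]
- Good for modest or even large data [...]
- Performance, [...]

Open Source Search Engines (continued)
- Lucene: a full-text search library with core indexing and search [...]. Competitive in engine performance, relevancy, and code [...].
- Solr: based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration [...].
- [...]: C++ search engine from U. [...]

A Comparison of Open Source Search Engines
[Tables: comparison of open source search engines]

Lucene
- A mature Apache open-source Java library for text indexing and [...]
- Developed by Doug Cutting
- Java-based. Created in 1999, donated to Apache [...]
- Features: no crawler, [...] parsing, [...]
- The search technology behind a lot of websites & applications (ZOË, JIRA, Lookout, Furl, [...])

Lucene index
- Input: content for [...]
- Logical [...]: documents are a collection of [...]
  - Stored: stored verbatim for retrieval with [...]
  - Indexed: tokenized and made [...]
- Indexed terms are stored in an inverted [...]
- Physical structure of the inverted [...]: [...]s stored in [...]
- IndexWriter is the interface object for the entire [...]

Lucene's index structure
- Index: the top level of the Lucene index structure. An index is stored in one folder, and all the files in that folder together make up one Lucene index.
- Segment: an index consists of many segments, which are independent of one another; new [...] _0). segments.gen and segments_5 hold the segments' attribute information; they are metadata files.
- Document: the basic unit of indexing. Documents are stored in segments, and one segment holds many documents. "Document" is a generic term: it can be a web page, a text file, a doc file, a PDF [...]
- Field: a document carries many different kinds of information. To make indexing more targeted, these attributes can be split into separate fields and indexed separately, for example time, body, title and author. Each such part is a Field.
- Term: a string produced by analysis. Terms are stored in their corresponding field.
- Example index files: xxx.tis, xxx.tii: term dictionary (Term Dictionary) information, i.e. all terms contained in this segment, arranged in lexicographic order.

Lucene Index Files: Field infos file (.fnm)
- Format: FieldsCount, <FieldName, [...]
- FieldsCount: the number of fields in the [...]
- FieldName: the name of the field in a [...]
- [...]: a byte and an int, where the [...] bit of the byte shows whether the field is indexed, and the int is the id of the term
- Example: 1, 1, <content, [...]

Lucene Index Files: Term Dictionary file (.tis)
- Format: TermCount, <Term, [...]; Term is <PrefixLength, Suffix, [...]>
- Terms are ordered first by the term's field name, and within that lexicographically by the term's text.
- TermCount: the number of terms in the [...]
- PrefixLength: the number of initial characters from the previous term which must be prepended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
- [...]: the term's field, whose name is stored in the .fnm file
- Example: 4, <<0,football,1>,2> <<0,penn,1>,1> [...]

Lucene Index Files: TermInfo index (.tii)
- Format: IndexTermCount, IndexInterval, <TermInfo, [...]
- This contains every IndexInterval-th entry from the .tis file, with its location in the "tis" file. It is designed to be read into memory and used to provide random access to the "tis" file.
- [...] determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
- Example: 4, <football,1> <penn,3> <layers,2> [...]

Lucene Index Files: Frequency file (.frq)
- Format: DocDelta, [...]
- TermFreqs are ordered by term (the term is implicit, from the .tis file).
- TermFreq entries are ordered by increasing document number.
- DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as the next Int.
- For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven would be the following sequence of Ints: 15, 8, 3. Document 7 with frequency one gives 7*2+1 = 15; document 11 lies 4 after document 7 and has frequency three, giving 4*2 = 8 followed by 3.
- Example: <<2, 2, 3> <3> <5> <3, [...]

Lucene Index Files: Position file
- Format: <PositionDelta [...]
- TermPositions are ordered by term (the term is implicit, from the .tis file).
- Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
- PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
- For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of Ints: 4, 5, 4 (the last value is 9 - 5).
- Example: <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> [...]

BFG - Indexing
[Figure: LexCorp BFG indexing example]

Analyzers
- Specify how the text in a field is to be [...]
- WhitespaceAnalyzer: divides text at [...]
- SimpleAnalyzer: divides text at non-[...]; converts to lower case
- StopAnalyzer: removes stop words
- StandardAnalyzer: good for most European [...]; removes stop words; converts to lower case
- Create your [...]

Query Syntax and [...]
- Terms with fields and [...]:
  - Title:right and text:[...]
  - Title:right and [...] (go appears in default [...])
  - Title:"the right way" and "quick [...]
- Wildcard:
  - [...] (plate or place or [...])
  - [...] (practice or practical or [...])
- Fuzzy (edit distance as [...]):
  - [...] (granting or planning)
  - [...] (default is 0.5)

Query Syntax and [...] (continued)
- Range: author:{king TO [...]
- Ranking weight boosting: title:"Bell" [...]
  - Default boost value is 1. May be <1 (e.g. [...])
- Boolean operators: AND, "+", OR, NOT and "-"
  - "Linux OS" AND [...]
  - Linux OR system, Linux [...]
  - +Linux [...]
  - +Linux [...]
  - Title:(+linux +"operating [...]

BFG - Searching
[Figure: LexCorp BFG searching example]

[Figure: index lookup at query time. Labels: Query; TermInfo Index (in memory), constant [...]; Field [...] (in [...]); [...] (random file [...]); Position (random file [...])]

Lucene Ranking
The score of query q for document d correlates to the [...] between document and query [...]:

$$\mathrm{Sim}(d_i, q) = \frac{\sum_k w_{k,i}\, w_{k,q}}{\sqrt{\sum_k w_{k,i}^2}\;\sqrt{\sum_k w_{k,q}^2}}$$

$$w_{k,i} = tf_{k,i} \cdot idf_k, \qquad w_{k,q} = boost_k \cdot idf_k$$

$$idf_k = \ln\frac{N}{n_k + 1} + 1, \qquad tf_{k,i} = \sqrt{f_{k,i}}$$

where N is the number of total documents, n_k is the number of documents that contain term k, and f_{k,i} is the term frequency of k in document d_i.

Factors in Ranking
- tf(t in d): the term's frequency, defined as the number of times term t appears in the currently scored document.
- idf(t): inverse document frequency of the term.
- fieldBoost(t.field in d): field boost, set by calling field.setBoost() before adding the field to a document.
- fieldNorm(t.field in d): normalization value, computed when the document is added to the index, in accordance with the number of tokens of this field in the document. fieldNorm is computed at indexing time.
- Document boost: set by calling doc.setBoost() before adding the document to the index.
- coord(q, d): coordination factor, a score factor based on how many of the query terms are found in the specified document.
- queryNorm: a normalizing factor used to make scores between queries [...]

Ranking Formula of Term [...]
TermQuery is the simplest kind of query: it has only one query term k.

$$score(d_i, q) = \left(boost_k \cdot idf_k \cdot queryNorm\right)\cdot\left(tf_{k,i} \cdot idf_k \cdot fieldNorm\right), \qquad queryNorm = \frac{1}{\sqrt{boost_k^2\, idf_k^2}}$$

so that $score(d_i, q) = tf_{k,i} \cdot idf_k \cdot fieldNorm$.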
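The single-term score above can be sketched in a few lines of Python. This is only an illustration, not Lucene's code: the square-root tf, the logarithmic idf and the 1/sqrt(number of tokens) length norm follow the classic Lucene defaults and are assumptions here, because the slide's own definitions are only partly legible. All function and parameter names are hypothetical.

```python
import math


def tf(freq: int) -> float:
    """Dampened term frequency (assumed: square root of the raw count)."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency (assumed: 1 + ln(N / (n_k + 1)))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_tokens: int, field_boost: float = 1.0, doc_boost: float = 1.0) -> float:
    """Length normalization times boosts (assumed: boost / sqrt(field length))."""
    return doc_boost * field_boost / math.sqrt(num_tokens)


def term_query_score(freq: int, num_tokens: int, num_docs: int,
                     doc_freq: int, boost: float = 1.0) -> float:
    """Score of a one-term query: queryWeight * fieldWeight."""
    idf_k = idf(num_docs, doc_freq)
    # With a single term, queryNorm cancels boost and idf in the query weight.
    query_norm = 1.0 / math.sqrt((boost * idf_k) ** 2)
    query_weight = boost * idf_k * query_norm
    field_weight = tf(freq) * idf_k * field_norm(num_tokens)
    return query_weight * field_weight


# Term occurs 4 times in a 16-token field; 10 documents, 1 of them contains it.
print(term_query_score(freq=4, num_tokens=16, num_docs=10, doc_freq=1))
```

Because the query weight collapses to 1 for a single term, changing `boost` leaves the score unchanged, and two documents are ordered purely by tf times fieldNorm.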
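- idf_k and boost_k are the same for every document, so they have no effect on [...]
- fieldBoost and doc.Boost can be set manually; the default is 1.
- Thus ranking is proportional to tf / [...]

Ranking Formula of Phrase [...]
Compute the score of each term k:

$$score(d_i, k) = weight_k \cdot fieldWeight_k = \left(boost_k \cdot idf_k \cdot queryNorm\right)\cdot\left(tf_{k,i} \cdot idf_k \cdot fieldNorm\right)$$

$$queryNorm = \frac{1}{\sqrt{\sum_k boost_k^2\, idf_k^2}}$$

Compute the total score:

$$score(d_i, q) = coord \cdot \sum_k score(d_i, k), \qquad coord = \frac{\text{number of matched query terms}}{\text{number of query terms}}$$

Query Example
- The same [...]
- Document 1: 广州华南理工大学争创国家一流大学 ("South China University of Technology in Guangzhou strives to become a first-class national university")
- Document 2: [...]
- "After Chinese [...]": documents 1, 2, 3 [...] 计算[...]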
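The per-term score and the coord-scaled total defined above can be sketched as follows. This is a minimal illustration under stated assumptions: tf and idf are passed in as ready-made numbers, one fieldNorm is used for the whole document, the query is non-empty, and the function name and the sample values are hypothetical rather than taken from the lecture's example.

```python
import math
from typing import Dict, Tuple


def multi_term_score(query: Dict[str, Tuple[float, float]],
                     doc_tf: Dict[str, float],
                     field_norm: float) -> float:
    """Total score of a multi-term query for one document.

    query maps each query term to (idf, boost); doc_tf maps each term that
    occurs in the document to its (already dampened) tf value.
    """
    # queryNorm = 1 / sqrt(sum over query terms of (boost * idf)^2)
    query_norm = 1.0 / math.sqrt(
        sum((boost * idf) ** 2 for idf, boost in query.values()))

    total = 0.0
    matched = 0
    for term, (idf, boost) in query.items():
        if term not in doc_tf:
            continue  # unmatched terms add nothing but lower coord
        matched += 1
        weight = boost * idf * query_norm          # query-side weight
        field_weight = doc_tf[term] * idf * field_norm  # document-side weight
        total += weight * field_weight

    coord = matched / len(query)  # matched query terms / query terms
    return coord * total


query = {"a": (2.0, 1.0), "b": (1.0, 1.0)}
print(multi_term_score(query, {"a": 1.0, "b": 1.0}, field_norm=0.5))
print(multi_term_score(query, {"a": 1.0}, field_norm=0.5))
```

A document matching only one of the two terms is penalized twice: it loses that term's contribution to the sum, and coord halves what remains.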
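Solution to Query Example
- Query terms: 大学 (university) and 计算机 (computer)
- idf_大学 = [...], idf_计算机 = [...], queryNorm = [...]
- tf_大学 and tf_计算机 in documents 1, 2 and 3: [...]

Solution to Query Example (continued)
- Set fieldBoost = 1, boost = 1, [...]
- The fieldNorms of the 3 documents are approximately 0.3125, 0.5 and 0.3125. (Note that the system uses only one byte to store fieldNorm, so there is a small loss of precision.)
- Thus the scores of documents 1, 2, 3 are [...]
- The ranked [...]

Lucene Sub-projects or [...]
- Nutch: web crawler [...]
- Hadoop: distributed file systems and data [...]; implements [...]
- Solr
- Zookeeper: centralized service (directory) with [...]

Nutch
- Nutch is open source web-search software.
- It builds on Lucene Java, adding web-specifics:
  - [...]
  - link [...]
  - anchor [...]
  - HTML and other [...]
  - [...] detection & parsing
  - language, charset detection & [...]
  - extensible indexing & [...]

History of Nutch
- August 2002: Nutch born
- November 2011: version [...] released
- June 2012: version [...] released
- July 2012: version 2.0 released (2.x [...]; Gora born, table-based architecture)
- July 2012: version [...] released
- August 2012: [...]
- [...]: version [...].1 released (2.x begins to support Elasticsearch)
- [...]: version [...] released
- June 2013: version [...] released; [...]mons born
- July 2013: version [...] released
- January 2015: version [...] released

Nutch Related [...]
- Hadoop: cloud computing platform, with HDFS (Hadoop Distributed File System) and [...]
- Tika: a text parsing tool that extracts metadata and text data from documents of various types, e.g. extracting the content from files in different formats
- Gora: Object/Relation Map[...]
- Elasticsearch: [...]

Nutch Workflow in [...]
[Figure: Nutch workflow]

Key Data [...]
- [...]: keeps information on all known pages and the links that connect [...]
- [...]: keeps the URL information of [...]
- [...]: keeps the basic information of crawled pages and the pages linked to [...]
- Parse[...]: keeps the anchor text of all [...]
- Parse[...]: keeps the text information after tags [...]
- [...]: keeps the original HTML source
- [...]: working directory of Nutch, storing files including Fetchlist, Fetcher, Parsedata, Parsetext and Content
- [...]: index files built under [...]

Web Database
- Page: used for fetch [...]
- Link: represents the full link [...]; stores the anchor text associated with each [...]; used [...] anchor text [...]

Nutch score
score = score(url) + score(anchor) + score(title) + [...]

Field     Text scored                                Boost
url       Text in the URL of the current web page    4
anchor    Anchor text of inlinks of the current page 2
title     Text in [...]                              1.5
content   Text after tags [...]                      1

Nutch Ranking

$$score = coord \cdot \sum_{field} queryWeight(field) \cdot fieldWeight(field)$$

$$queryWeight(field) = fieldBoost(field) \cdot idf(field) \cdot queryNorm, \qquad fieldWeight(field) = tf(field) \cdot idf(field) \cdot fieldNorm(field)$$

$$queryNorm = \frac{1}{\sqrt{\sum_{field} idf(field)^2 \cdot fieldBoost(field)^2}}$$

fieldNorm depends on doc.boost and on ln(fieldLength [...]

Example: test pages
- football.html: <title>FOOTBALL</title> THIS IS THE FOOTBALL PAGE FOR TEST <a href="basketball.html">BASKETBALL</a> <a href="golf.html">GOLF</a> <a href="sports.html">SPORTS HOME</a>
- sports.html: <title>Sports</title> THIS IS THE SPORTS PAGE FOR TEST <a href="football.html">FOOTBALL</a> <a href="basketball.html">BASKETBALL</a> <a href="golf.html">GOLF</a>
- golf.html: <title>GOLF</title> THIS IS THE GOLF PAGE FOR [...] <a href="football.html">FOOTBALL</a> <a href="basketball.html">BASKETBALL</a> <a href="sports.html">SPORTS HOME</a>
- basketball.html: <title>BASKETBALL</title> THIS IS THE BASKETBALL PAGE FOR TEST <a href="football.html">FOOTBALL</a> <a href="golf.html">GOLF</a> <a href="sports.html">SPORTS HOME</a>

Example: Web [...]
Analyzing the football page yields the following tokens:
- url: http, http-127, 127, 0, 0, 1, 8080, football
- title: football
- content: football, page, this, is, is-the, the, the-football, football, page, for, for, for-test, test, basketball, golf, sports, home

Example: tf & [...]
- Term frequency of [...]: 1, 1, 1, 1, 2, 1, 1
- The number of documents that contain [...]: 4, 1, 4, 1, 4, 4, 4, 1

Example: queryNorm & [...]
- queryNorm = 1 / sqrt(sum over fields of idf^2 * fieldBoost^2), with the terms 4^2, 2^2, 1.5^2, [...]
- Number of [...]: 1921

Example: score of fields
- score(url) = queryWeight(url) * fieldWeight(url) = (fieldBoost(url) * idf(url) * queryNorm) * (tf(url) * idf(url) * fieldNorm(url)), with fieldBoost(url) = 4
- score(title) = queryWeight(title) * fieldWeight(title) = (fieldBoost(title) * idf(title) * queryNorm) * (tf(title) * idf(title) * fieldNorm(title))
- score(content) = queryWeight(content) * fieldWeight(content) = (fieldBoost(content) * idf(content) * queryNorm) * (tf(content) * idf(content) * fieldNorm(content)), with fieldBoost(content) = 1
- score(anchor) = queryWeight(anchor) * fieldWeight(anchor) = (fieldBoost(anchor) * idf(anchor) * queryNorm) * (tf(anchor) * idf(anchor) * fieldNorm(anchor)), with fieldBoost(anchor) = 2
- score = score(url) + score(title) + score(content) + [...]

Example: Nutch Score
[...] = sum of:
  [...] = weight(url:football^4.0 in 3), product of:
    [...] = queryWeight(url:football^4.0), product of: 4.0 = [...]
    [...] = fieldWeight(url:football in 3), product of: 1.0 = [...], [...], 0.109375 = fieldNorm(field=url, [...])
  [...] = weight(anchor:football^2.0 in 3), product of:
    [...] = queryWeight(anchor:football^2.0), product of: 2.0 = [...]
    [...] = fieldWeight(anchor:football in 3), product of: 1.0 = [...], [...], 0.75 = fieldNorm(field=anchor, [...])
  [...] = weight(content:football in 3), product of:
    [...] = queryWeight(content:football), product of: [...]
    [...] = fieldWeight(content:football in 3), product of: [...], 0.03125 = fieldNorm(field=content, [...])
  [...] = weight(title:football^1.5 in 3), product of:
    [...] = queryWeight(title:football^1.5), product of: 1.5 = [...]
    1.058217 = fieldWeight(title:football in 3), product of: 1.0 = [...], [...], 0.625 = fieldNorm(field=title, [...])

Solr
- Developed by Yonik Seeley at CNET. Donated to Apache in [...]
- Servlet, Web Administration [...]
- XML/HTTP, JSON [...]
- Faceting, Schema to define types and [...]
- Highlighting, Caching, Index Replication (Master/[...])
- Pluggable
- Powered by Solr: Netflix, CNET, Smithsonian, GameSpot, AOL:sports, Drupal [...]

Architecture of Solr
[Figure: Solr architecture. Labels: HTTP Request [...], Update [...], Disjunction [...], Solr [...]]

Application usage of Solr: YouSeer
[Figure: YouSeer. Labels: System, WWW, [...]]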
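Adding documents in [...]
Use HTTP POST to [...]:

  <add><doc>
    <field [...]
    <field name="title">Apache [...]
    <field name="subject">An [...]
    <field [...]
    <field [...]
    <field name="body">Solr is a [...]
  [...]

- Inserting a document with an already present uniqueKey will erase the original.
- Delete by uniqueKey field (e.g. [...])
- Delete by Query [...]
- <commit/> makes changes [...]:
  - closes [...]
  - removes [...]
  - opens new [...]
  - newSearcher/firstSearcher [...]
  - cache [...]
  - "register" the new [...]
- <optimize/>: same as commit, and also merges all index [...]

Default Query [...]
- mission impossible; releaseDate [...]
- +mission +impossible [...]
- "mission impossible" [...]
- title:spiderman^10 [...]
- +HDTV +weight:[0 TO [...]
- Wildcard queries: te?t, te*t, [...]

Default Query Arguments for HTTP GET/POST to [...]

Parameter   Default   Description
q                     The [...]
start       0         Offset into the list of [...]
rows        [...]     Number of documents to [...]
fl          *         Stored fields to [...]
qt          [...]     Query type; maps to query [...]
df          [...]     Default field to [...]

Search [...]

  <result numFound="16173" [...]
    <str name="name">Apple 60 GB iPod with [...]
    <float [...]
    <float [...]
    <str name="name">ASUS Extreme [...]
    <float [...]

Schema
- Lucene has no notion of a [...]
  - Sorting: string vs. [...]
  - Ranges: val:42 included in val:[1 TO 5]?
  - Lucene QueryParser has date-range support, but must guess.
- Defines fields, their types, [...]
- Defines unique key field, default search field, Similarity implementation

Field [...]
- Field attributes: name, type, indexed, stored (means retrievable during search), multiValued, [...]

  <field name="id" [...] indexed="true" [...]
  <field [...] type="textTight" indexed="true" [...]
  <field name="reviews" [...] indexed="true" [...]
  <field name="category" type="text_ws" indexed="true" [...]

- Dynamic fields, in the spirit of [...]:

  <dynamicField name="*_i" [...] indexed="true" [...]
  <dynamicField name="*_s" type="string" indexed="true" [...]
  <dynamicField name="*_t" [...] indexed="true" [...]

Analyzer [...]
[Figure: analyzer chains, each a <tokenizer> followed by several <filter> elements]

More [...]

  <fieldtype name="text" [...]
    <tokenizer [...]
    <filter [...]
    <filter [...]
    <filter [...]
    <filter [...]

[Figure: search example. Labels: PowerShot, PowerShot SD [...], power-[...]]

copyField
- Copies one field to another at index [...]
- Analyze the same field in different [...]:
  - copy into a field with a [...]
  - boost exact-case, exact-punctuation [...]
  - language translations, thesaurus, [...]

  <field name="title" [...]
  <field name="title_exact" type="text_exact" [...]
  <copyField source="title" [...]

Faceted Search/Browsing [...]
[Figure: faceted search. Labels: memory:[1GB TO [...], price:[0 TO [...], price:[500 TO [...]; "= [...]section of [...]", "set of all [...]"]

High [...]
[Figure: Solr deployment. Labels: Query, Load [...], Solr, Index, admin]

Caching
- IndexSearcher's view of an index is [...]
  - Aggressive caching [...]
  - Consistency for multi-query [...]
- filterCache: unordered set of document ids matching a query
- resultCache: ordered subset of document ids matching a query
- [...]Cache: the stored fields [...]
- userCaches: application specific, custom query [...]

Warming for [...]
- Lucene IndexReader [...]: field norms, FieldCache, tii (the term [...])
- Static cache [...]: configurable static requests to warm new [...]
- Smart cache warming: uses MRU items in the current cache to pre-populate the new cache
- Warming happens in parallel with live [...]

Web Admin [...]
- Show Config, Schema, Distribution [...]
- Query [...]
- Caches: lookups, hits, hit ratio, inserts, evictions, [...]
- RequestHandlers: requests, [...]
- UpdateHandler: adds, deletes, commits, [...]
- IndexReader: open-time, index-version, numDocs, [...]
- Analysis [...]:
  - shows tokens after [...] analyzer [...]
  - shows token matches for query vs [...]

[Figure: Web Admin]

Lemur
- Lemur is an open-source toolkit released jointly by the University of Massachusetts (UMass) and Carnegie Mellon University (CMU). It is used mainly by information retrieval researchers to compare advanced search [...]
- Lemur supports retrieval with [...] models; its applications cover [...] (index building), [...] and others.
- For Indri, commonly used applications include IndriBuildIndex and [...]
- Lemur and Indri [...] XML [...]

Example: retrieval with Lemur
- Environment: Windows 7 operating [...]
- Lemur version: lemur-[...]
- Data: the data bundled with Lemur 4.12, located in the folder [...] under the Lemur installation [...]. It contains a test corpus file, database.sgl, which holds 3204 simple documents used to build the index, and a query file ([...].sgl) containing [...] queries.

Steps
- Build the index.
- Parse the queries. Example query: <DOC [...]> 1 What articles exist which deal with [...] (Time Sharing System), an [...] system for IBM [...]
- Generate the results:
  - Generate the supporting information needed for language-model smoothing: [...]
  - Search the index with the TFIDF model, producing result file [...]
  - Search the index with the KL-divergence language model, producing result file [...]
- Retrieval results [...]: store the output in a metric file; select 52 queries [...]
- Visualizing the retrieval results: use irevalGUI.jar for visualization.

Carrot2: Search results [...]
- An Open Source Search Results Clustering [...]
- Carrot2 offers ready-to-use components for fetching search results from various sources [...] API, Bing API, Lucene, SOLR, and more
- Carrot2's [...]

Why cluster web sea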