Big Data Fundamentals Course Design Report

1. Project Overview

This project uses Hive, MapReduce, and HBase on Hadoop to carry out a practical analysis of the public "Sogou 5 million" search dataset. The dataset consists of processed production logs from the Sogou search engine; it is real, large-scale data and fits the requirements of a distributed-computing course project. Each record has the format:

access time\tuser ID\t[query keyword]\trank of the URL in the result list\tordinal of the user's click\tclicked URL

The user ID is assigned automatically from the cookie the browser presents to the search engine, so different queries issued from the same browser session map to the same user ID.

2. Task Requirements

1. Load the raw data onto HDFS.
2. Split the time field of the raw data and append year, month, day, and hour fields.
3. Load the processed data onto HDFS.
4. Implement each of the following, both in MapReduce and in Hive:
   (1) total number of records;
   (2) number of records with a non-empty query;
   (3) number of records with no duplicates;
   (4) number of distinct UIDs;
   (5) query-frequency ranking (top 50 keywords);
   (6) number of users with more than 2 queries;
   (7) proportion of users with more than 2 queries;
   (8) proportion of clicks with rank within 10;
   (9) proportion of queries that directly entered a URL;
   (10) UIDs that searched for "仙剑奇侠传" more than 3 times.
5. Save the result of each step in 4 to HDFS.
6. Import the files generated in 5 into HBase (a single table) via the Java API.
7. Query the results imported in 6 with the HBase shell.

3. Experiment Procedure

Load the raw data onto HDFS, then split and append the time fields with a script, sogou-log-extend.sh:

#!/bin/bash
# infile=/root/sogou.500w.utf8
infile=$1
# outfile=/root/sogou.500w.utf8.ext
outfile=$2
# The first field is a yyyymmddHHmmss timestamp. Note that the hour starts at
# character 9: the original script used substr($1,8,2), which overlaps the day field.
awk -F '\t' '{print $0"\t"substr($1,1,4)"年\t"substr($1,5,2)"月\t"substr($1,7,2)"日\t"substr($1,9,2)"hour"}' $infile > $outfile

Run the script on the data file:

bash sogou-log-extend.sh sogou.500w.utf8 sogou.500w.utf8.ext

(The result screenshot from the original report is not reproduced here.)

Load the processed data onto HDFS:

hadoop fs -put sogou.500w.utf8.ext /

The following operations are implemented both in Hive and in MapReduce.

I. Hive implementation

1. List databases: show databases;
2. Create the database: create database sogou;
3. Switch to it: use sogou;
4. List tables: show tables;
5. Create the sogou table:
   create table sogou(time string, uuid string, name string, num1 int, num2 int, url string)
   row format delimited fields terminated by '\t';
6. Load the local data into the Hive table:
   load data local inpath '/root/sogou.500w.utf8' into table sogou;
7. Inspect the table: desc sogou;

(1) Total number of records:
select count(*) from sogou;

(2) Number of records with a non-empty query:
select count(*) from sogou where name is not null and name != '';

(3) Number of records with no duplicates (rows that occur exactly once):
select count(*) from (select * from sogou group by time, num1, num2, uuid, name, url having count(*) = 1) a;

(4) Number of distinct UIDs:
select count(distinct uuid) from sogou;

(5) Query-frequency ranking (top 50 keywords):
select name, count(*) as pd from sogou group by name order by pd desc limit 50;

(6) Number of users with more than 2 queries:
select count(a.uuid) from (select uuid, count(*) as cnt from sogou group by uuid having cnt > 2) a;

(7) Proportion of users with more than 2 queries (this query yields only the numerator; divide it by the distinct-UID count from (4) to get the proportion):
select count(*) from (select uuid, count(*) as cnt from sogou group by uuid having cnt > 2) a;

(8) Proportion of clicks with rank within 10 (numerator; divide by the total from (1)):
select count(*) from sogou where num1 < 11;

(9) Proportion of queries that directly entered a URL (numerator; divide by the total from (1)):
select count(*) from sogou where url like '%www%';

(10) UIDs that searched for "仙剑奇侠传" more than 3 times:
select uuid, count(*) as uu from sogou where name = '仙剑奇侠传' group by uuid having uu > 3;

II. MapReduce implementation (import statements omitted)

(1) Total number of records:

public class MRCountAll {
    // NOTE: a static counter only works when the job runs in a single JVM
    // (local mode); on a real cluster each mapper has its own copy.
    public static Integer i = 0;
    public static boolean flag = true;

    public static class CountAllMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            i++;
        }
    }

    public static void runcount(String inputPath, String outputPath) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "count");
        } catch (IOException e) {
            e.printStackTrace();
        }
        job.setJarByClass(MRCountAll.class);
        job.setMapperClass(CountAllMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path(inputPath));
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        try {
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        runcount("/sogou/data/sogou.500w.utf8", "/sogou/data/CountAll");
        System.out.println("Total records: " + i);
    }
}

(2) Number of records with a non-empty query:

public class CountNotNull {
    public static String Str = "";
    public static int i = 0;
    public static boolean flag = true;

    public static class wyMap extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            // The original tested !values[2].equals(null) && values[2] != "",
            // which compares references; check null/isEmpty instead.
            if (values[2] != null && !values[2].isEmpty()) {
                context.write(new Text(values[1]), new IntWritable(1));
                i++;
            }
        }
    }

    public static void run(String inputPath, String outputPath) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countnotnull");
        } catch (IOException e) {
            e.printStackTrace();
        }
        assert job != null;
        job.setJarByClass(CountNotNull.class);
        job.setMapperClass(wyMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        try {
            FileInputFormat.addInputPath(job, new Path(inputPath));
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
        try {
            FileOutputFormat.setOutputPath(job, new Path(outputPath));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        run("/sogou/data/sogou.500w.utf8", "/sogou/data/CountNotNull");
        System.out.println("Non-empty records: " + i);
    }
}
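The CountNotNull mapper's original test, `!values[2].equals(null) && values[2] != ""`, looks plausible but compares object references, so a freshly constructed empty string slips through. A minimal standalone demonstration (class and method names are mine, for illustration only):

```java
public class EmptyCheck {
    // Mirrors the original report's test: reference comparison against the literal "".
    public static boolean isNonEmptyBad(String s) {
        return s != "";
    }

    // Content-based check: what the job actually needs.
    public static boolean isNonEmptyGood(String s) {
        return s != null && !s.isEmpty();
    }

    public static void main(String[] args) {
        String fromSplit = new String("");             // e.g. a field produced at runtime
        System.out.println(isNonEmptyBad(fromSplit));  // true: different object than the literal ""
        System.out.println(isNonEmptyGood(fromSplit)); // false: the content is empty
    }
}
```

Also note that `values[2].equals(null)` can never return true: `equals` on a non-null receiver returns false for a null argument, so that half of the original condition was dead code.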
(3) Number of records with no duplicates:

public class CountNotRepeat {
    public static int i = 0;

    public static class NotRepeatMap extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String time = values[0];
            String uid = values[1];
            String name = values[2];
            String url = values[5];
            // Key on the combined fields so identical rows collapse into one group.
            // (The original mapper emitted Text("1") while the reducer expected
            // IntWritable; the value type is fixed to IntWritable here.)
            context.write(new Text(time + uid + name + url), new IntWritable(1));
        }
    }

    public static class NotRepeatReduc extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // One increment per distinct key: this counts distinct rows, whereas the
            // Hive query (HAVING count(*) = 1) counts rows that occur exactly once.
            i++;
            context.write(new Text(key.toString()), new IntWritable(i));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "countnotrepeat");
        job.setJarByClass(CountNotRepeat.class);
        job.setMapperClass(NotRepeatMap.class);
        job.setReducerClass(NotRepeatReduc.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountNotRepeat"));
        job.waitForCompletion(true);
        System.out.println("Records with no duplicates: " + i);
    }
}
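The CountNotRepeat reducer increments once per distinct key, i.e. it counts distinct rows, while the Hive query for the same task (GROUP BY ... HAVING count(*) = 1) counts rows that occur exactly once, dropping duplicated rows entirely. The two definitions give different answers whenever any row repeats. A plain-Java sketch of the difference on toy data (class name and sample rows are illustrative):

```java
import java.util.*;

public class DedupSemantics {
    // Rows that occur exactly once: what the Hive HAVING count(*) = 1 query counts.
    public static long exactlyOnce(List<String> rows) {
        Map<String, Integer> freq = new HashMap<>();
        for (String r : rows) freq.merge(r, 1, Integer::sum);
        return freq.values().stream().filter(c -> c == 1).count();
    }

    // Distinct rows: what the MapReduce one-increment-per-key reducer counts.
    public static long distinct(List<String> rows) {
        return new HashSet<>(rows).size();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "a", "b", "c");
        System.out.println(exactlyOnce(rows)); // 2  (b and c)
        System.out.println(distinct(rows));    // 3  (a, b, c)
    }
}
```

Which definition is intended by "number of records with no duplicates" is worth pinning down before comparing the two implementations' outputs.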
(4) Number of distinct UIDs:

public class CountNotMoreUid {
    public static int i = 0;

    public static class UidMap extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String uid = values[1];
            context.write(new Text(uid), new IntWritable(1));
        }
    }

    public static class UidReduc extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            i++;  // one increment per distinct uid
            context.write(new Text(key.toString()), new IntWritable(i));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "countuid");
        // The original called setJarByClass(CountNotNull.class), a copy-paste slip.
        job.setJarByClass(CountNotMoreUid.class);
        job.setMapperClass(UidMap.class);
        job.setReducerClass(UidReduc.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountNotMoreUid"));
        job.waitForCompletion(true);
        System.out.println("Distinct UIDs: " + i);
    }
}
(5) Query-frequency ranking (top 50 keywords):

public class CountTop50 {
    public static class TopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        Text text = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split("\t");
            String keys = line[2];
            text.set(keys);
            context.write(text, new LongWritable(1));
        }
    }

    public static class TopReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        // Keyed by count: keywords with equal counts overwrite each other,
        // so at most one keyword per count value survives.
        TreeMap<Integer, String> map = new TreeMap<Integer, String>();
        @Override
        protected void reduce(Text key, Iterable<LongWritable> value, Context context)
                throws IOException, InterruptedException {
            int sum = 0;  // occurrences of this keyword
            for (LongWritable ltext : value) {
                sum += ltext.get();
            }
            map.put(sum, key.toString());
            // keep only the 50 largest counts
            if (map.size() > 50) {
                map.remove(map.firstKey());
            }
        }
        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // TreeMap iterates in ascending count order.
            for (Integer count : map.keySet()) {
                context.write(new Text(map.get(count)), new LongWritable(count));
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "count");
        job.setJarByClass(CountTop50.class);
        job.setJobName("Five");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(TopMapper.class);
        job.setReducerClass(TopReducer.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountTop50"));
        job.waitForCompletion(true);
    }
}

(6) Number of users with more than 2 queries:

public class CountQueriesGreater2 {
    public static int total = 0;

    public static class MyMaper extends Mapper<Object, Text, Text, IntWritable> {
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().split("\t");
            Text word = new Text(str[1]);
            IntWritable one = new IntWritable(1);
            context.write(word, one);
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text arg0, Iterable<IntWritable> arg1, Context arg2)
                throws IOException, InterruptedException {
            // arg0 is a uid, arg1 its per-record counts
            int sum = 0;
            for (IntWritable i : arg1) {
                sum += i.get();
            }
            if (sum > 2) {
                total = total + 1;
            }
            // arg2.write(arg0, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        // 1. create the job
        Job job = Job.getInstance(conf, "six");
        // 2. set the mapper class
        job.setMapperClass(MyMaper.class);
        // 3. a combiner is optional
        // job.setCombinerClass(MyReducer.class);
        // 4. set the reducer class
        job.setReducerClass(MyReducer.class);
        // 5. set the output key type
        job.setOutputKeyClass(Text.class);
        // 6. set the output value type
        job.setOutputValueClass(IntWritable.class);
        // set the class used to locate the job jar
        job.setJarByClass(CountQueriesGreater2.class);
        // 7. input path
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        // 8. output path
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountQueriesGreater2"));
        // 9. run the job
        job.waitForCompletion(true);
        System.out.println("Users with more than 2 queries: " + total);
    }
}
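The CountQueriesGreater2 reducer's logic — sum the ones for each uid, then count uids whose sum exceeds the threshold — can be checked locally without a cluster. A plain-Java sketch of the same aggregation (class name and sample uids are illustrative):

```java
import java.util.*;

public class UsersOver2 {
    // Count how many distinct uids appear more than n times,
    // mirroring the per-uid sum-then-threshold step in the reducer.
    public static long usersWithMoreThan(List<String> uids, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String u : uids) freq.merge(u, 1, Integer::sum);  // map phase: (uid, 1) pairs, summed
        return freq.values().stream().filter(c -> c > n).count();
    }

    public static void main(String[] args) {
        List<String> uids = Arrays.asList("u1", "u1", "u1", "u2", "u2", "u3");
        System.out.println(usersWithMoreThan(uids, 2)); // 1  (only u1 has 3 queries)
    }
}
```

Running the logic on a small slice like this is a quick sanity check before submitting the full job.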
(7) Proportion of users with more than 2 queries:

public class CountQueriesGreaterPro {
    public static int total1 = 0;  // users with more than 2 queries
    public static int total2 = 0;  // total input records

    public static class MyMaper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            total2++;
            String[] str = value.toString().split("\t");
            Text word = new Text(str[1]);
            IntWritable one = new IntWritable(1);
            context.write(word, one);  // one (uid, 1) pair per record
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text arg0, Iterable<IntWritable> arg1, Context arg2)
                throws IOException, InterruptedException {
            // arg0 is a uid, arg1 its per-record counts
            int sum = 0;
            for (IntWritable i : arg1) {
                sum += i.get();
            }
            if (sum > 2) {
                total1++;
            }
            arg2.write(arg0, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.out.println("seven begin");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "seven");
        job.setMapperClass(MyMaper.class);
        // job.setCombinerClass(MyReducer.class);  // optional
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setJarByClass(CountQueriesGreaterPro.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountQueriesGreaterPro"));
        job.waitForCompletion(true);
        System.out.println("total1=" + total1 + "\ttotal2=" + total2);
        // Note: total2 counts records, not users, so this ratio is users-over-2 per
        // record; dividing by the distinct-UID count would match the Hive definition.
        float percentage = (float) total1 / (float) total2;
        System.out.println("Proportion of users with more than 2 queries: " + percentage * 100 + "%");
        System.out.println("over");
    }
}
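The ratio at the end of CountQueriesGreaterPro (and the similar ones in the rank and URL jobs) depends on casting before dividing: two ints divided in Java truncate to an int. A small sketch of the pitfall, with values chosen to be exactly representable (the numbers are illustrative, not from the dataset):

```java
public class RatioDemo {
    public static void main(String[] args) {
        int total1 = 2500, total2 = 5000;
        // Integer division truncates: the proportion silently becomes 0.
        System.out.println(total1 / total2);                        // 0
        // Casting either operand first, as the jobs do, gives the real ratio.
        float percentage = (float) total1 / (float) total2 * 100;
        System.out.println(percentage);                             // 50.0
    }
}
```

This is why each job writes `(float) sum1 / (float) sum2` rather than dividing the raw counters.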
(8) Proportion of clicks with rank within 10:

public class CountRank {
    public static int sum1 = 0;  // records with rank < 11
    public static int sum2 = 0;  // total records

    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            sum2++;
            String[] str = value.toString().split("\t");
            int rank = Integer.parseInt(str[3]);
            if (rank < 11) {
                sum1 = sum1 + 1;
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "eight");
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setJarByClass(CountRank.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountRank"));
        job.waitForCompletion(true);
        System.out.println("sum1=" + sum1 + "\tsum2=" + sum2);
        float percentage = (float) sum1 / (float) sum2;
        System.out.println("Proportion of clicks with rank within 10: " + percentage * 100 + "%");
    }
}
(9) Proportion of queries that directly entered a URL:

public class CountURL {
    public static int sum1 = 0;  // queries containing "www"
    public static int sum2 = 0;  // total records

    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().split("\t");
            // The original called matcher.group() inside a try/catch (and the
            // Pattern.compile call was garbled to "Ppile" in the source); testing
            // find() directly avoids the IllegalStateException entirely.
            Matcher matcher = Pattern.compile("www").matcher(str[2]);
            if (matcher.find()) {
                sum1++;
            }
            sum2++;
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "nine");
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setJarByClass(CountURL.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountURL"));
        job.waitForCompletion(true);
        System.out.println("sum1=" + sum1 + "\tsum2=" + sum2);
        float percentage = (float) sum1 / (float) sum2;
        System.out.println("Proportion of queries entered as a URL ('%www%'): " + percentage * 100 + "%");
    }
}

(10) UIDs that searched for "仙剑奇侠传" more than 3 times:

public class CountUidGreater3 {
    public static String Str = "";
    public static int i = 0;

    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String pattern = "仙剑奇侠传";
            if (values[2].equals(pattern)) {
                context.write(new Text(values[1]), new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> value, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : value) {
                sum = sum + v.get();
            }
            if (sum > 3) {
                Str = Str + key.toString() + "\n";
                i++;
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "count");
        job.setJarByClass(CountUidGreater3.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountUidGreater3"));
        job.waitForCompletion(true);
        System.out.println("i: " + i);
        System.out.println(Str);
    }
}
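The CountURL mapper's original try/catch existed only to absorb the IllegalStateException that `Matcher.group()` throws when no match has been made; checking the boolean result of `find()` makes the control flow explicit. A small standalone check (the sample query strings are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlMatch {
    // Same test the job applies to the query field: does it contain "www"?
    public static boolean directUrl(String query) {
        Matcher m = Pattern.compile("www").matcher(query);
        return m.find();  // find() returns false on no match; group() would throw instead
    }

    public static void main(String[] args) {
        System.out.println(directUrl("mail.sogou.com")); // false
        System.out.println(directUrl("www.4399.com"));   // true
    }
}
```

For a fixed literal like "www", `query.contains("www")` would do the same job without compiling a regex on every record; the regex form is kept here to stay close to the report's code.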
5. Saving the results of each step in 4 to HDFS

This is done with INSERT OVERWRITE DIRECTORY. (The example in the original report was a screenshot and is not reproduced here.)

6. Importing the files generated in 5 into HBase (a single table) via the Java API

public class HBaseImport {
    // name of the table the reduce writes to
    private static String tableName = "test";
    // initialize the connection
    static Configuration conf = null;
    static {
        conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://0:9000/hbase");
        conf.set("hbase.master", "hdfs://0:60000");
        // This key was garbled to "perty.clientPort" in the source; the standard
        // HBase property name is restored here.
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("hbase.zookeeper.quorum", "master,slave1,slave2");
        conf.set(TableOutputFormat.OUTPUT_TABLE, tableName);
    }

    public static class BatchMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        protected void map(LongWritable key, Text value,

(The readable preview of the source document ends here, mid-method; the remaining pages, including the rest of HBaseImport and the HBase shell queries of step 7, are not available.)
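Step 5 above names INSERT OVERWRITE DIRECTORY without showing a statement, since the original example was a screenshot. A hedged HiveQL sketch for one of the results — the output path is an assumed example, not taken from the report:

```sql
-- Write the total-record count (Hive query (1)) to an HDFS directory.
-- The path /sogou/results/total is illustrative only.
INSERT OVERWRITE DIRECTORY '/sogou/results/total'
SELECT count(*) FROM sogou;
```

Each of the other Hive queries can be persisted the same way, with one output directory per result, which is what step 6 then imports into HBase.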
