Big Data Fundamentals Course Design Report

I. Project Overview

This project uses Hive, MapReduce, and HBase from the Hadoop ecosystem to carry out a practical analysis of the publicly available Sogou 5-million-record search log. The dataset is processed production data from the Sogou search engine; it is real and large, and therefore meets the data requirements of a course design in distributed computing application development.

Each record has the format: access time \t user ID \t query keyword \t rank of the clicked URL in the result list \t sequence number of the user's click \t clicked URL. The user ID is assigned automatically from the cookie sent when the user visits the search engine in a browser, so different queries issued in the same browser session share the same user ID.

II. Requirements

1. Load the raw data onto HDFS.
2. Split and recombine the time field of the raw data, adding year, month, day, and hour fields.
3. Load the processed data onto HDFS.
4. Implement each of the following with both MapReduce and Hive:
   - total number of records
   - number of records with a non-empty query
   - number of distinct records
   - number of distinct UIDs
   - query frequency ranking (top 50 keywords)
   - number of users with more than 2 queries
   - proportion of users with more than 2 queries
   - proportion of clicks whose rank is within 10
   - proportion of queries that are direct URL inputs
   - UIDs that searched for "仙劍奇?zhèn)b傳" more than 3 times
5. Save the result of every step in 4 to HDFS.
6. Import the files generated in 5 into HBase (a single table) through the Java API.
7. Query the data imported in 6 with HBase shell commands.

III. Experiment Procedure

1. Load the raw data onto HDFS.

2. Split and recombine the time field, adding year, month, day, and hour fields.

(1) Write a script sogou-log-extend.sh with the following content:

#!/bin/bash
# infile=/root/sogou.500w.utf8
infile=$1
# outfile=/root/filesogou.500w.utf8.ext
outfile=$2
# append year/month/day/hour columns derived from the 14-digit timestamp in field 1
awk -F '\t' '{print $0"\t"substr($1,1,4)"年\t"substr($1,5,2)"月\t"substr($1,7,2)"日\t"substr($1,9,2)"hour"}' $infile > $outfile

Run the script:

bash sogou-log-extend.sh sogou.500w.utf8 sogou.500w.utf8.ext

The output file sogou.500w.utf8.ext now carries the additional year, month, day, and hour columns.

3. Load the processed data onto HDFS:

hadoop fs -put sogou.500w.utf8.ext /

4. The following operations are implemented with both MapReduce and Hive.

Hive implementation

1. List databases: show databases;
2. Create a database: create database sogou;
3. Switch to it: use sogou;
4. List tables: show tables;
5. Create the sogou table:

create table sogou(time string, uuid string, name string, num1 int, num2 int, url string)
row format delimited fields terminated by '\t';

6. Load the local data into the Hive table:

load data local inpath '/root/sogou.500w.utf8' into table sogou;

7. Inspect the table: desc sogou;

(1) Total number of records
select count(*) from sogou;

(2) Number of records with a non-empty query
select count(*) from sogou where name is not null and name != '';

(3) Number of distinct records
select count(*) from (select * from sogou group by time, num1, num2, uuid, name, url having count(*) = 1) a;

(4) Number of distinct UIDs
select count(distinct uuid) from sogou;

(5) Query frequency ranking (top 50 keywords)
select name, count(*) as pd from sogou group by name order by pd desc limit 50;

(6) Number of users with more than 2 queries
select count(a.uuid) from (select uuid, count(*) as cnt from sogou group by uuid having cnt > 2) a;

(7) Proportion of users with more than 2 queries (this query gives the numerator; divide it by the distinct-UID count from (4))
select count(*) from (select uuid, count(*) as cnt from sogou group by uuid having cnt > 2) a;

(8) Clicks whose rank is within 10 (the numerator of the proportion; divide it by the total from (1))
select count(*) from sogou where num1 < 11;
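The two proportion items in (7) and (8) only produce numerators as written. As a possible refinement, not part of the original steps and assuming the same sogou table and column meanings, each ratio can also be computed in a single HiveQL statement:

-- (7) proportion of users with more than 2 queries
select t1.cnt / t2.total
from (select count(*) as cnt
      from (select uuid from sogou group by uuid having count(*) > 2) a) t1
cross join (select count(distinct uuid) as total from sogou) t2;

-- (8) proportion of clicks whose rank is within 10
select sum(if(num1 < 11, 1, 0)) / count(*) from sogou;

The cross join of two single-row subqueries avoids a scalar subquery in the select list, which many Hive versions do not accept.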
MapReduce implementation (the various import statements are omitted)

(1) Total number of records

public class MRCountAll {
    public static Integer i = 0;
    public static boolean flag = true;

    public static class CountAllMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            i++;
        }
    }

    public static void runcount(String Inputpath, String Outpath) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "count");
        } catch (IOException e) {
            e.printStackTrace();
        }
        job.setJarByClass(MRCountAll.class);
        job.setMapperClass(CountAllMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path(Inputpath));
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
        FileOutputFormat.setOutputPath(job, new Path(Outpath));
        try {
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        runcount("/sogou/data/sogou.500w.utf8", "/sogou/data/CountAll");
        System.out.println("總條數(shù): " + i);
    }
}

(2) Number of records with a non-empty query

public class CountNotNull {
    public static String Str = "";
    public static int i = 0;
    public static boolean flag = true;

    public static class wyMap extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            if (values[2] != null && !values[2].equals("")) {
                context.write(new Text(values[1]), new IntWritable(1));
                i++;
            }
        }
    }

    public static void run(String inputPath, String outputPath) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countnotnull");
        } catch (IOException e) {
            e.printStackTrace();
        }
        assert job != null;
        job.setJarByClass(CountNotNull.class);
        job.setMapperClass(wyMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        try {
            FileInputFormat.addInputPath(job, new Path(inputPath));
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
        try {
            FileOutputFormat.setOutputPath(job, new Path(outputPath));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        run("/sogou/data/sogou.500w.utf8", "/sogou/data/CountNotNull");
        System.out.println("非空條數(shù): " + i);
    }
}

(3) Number of distinct records

public class CountNotRepeat {
    public static int i = 0;

    public static class NotRepeatMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            String text = value.toString();
            String[] values = text.split("\t");
            String time = values[0];
            String uid = values[1];
            String name = values[2];
            String url = values[5];
            context.write(new Text(time + uid + name + url), new Text("1"));
        }
    }

    public static class NotRepeatReduc extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Reducer<Text, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            i++;
            context.write(new Text(key.toString()), new IntWritable(i));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countnotrepeat");
        } catch (IOException e) {
            e.printStackTrace();
        }
        assert job != null;
        job.setJarByClass(CountNotRepeat.class);
        job.setMapperClass(NotRepeatMap.class);
        job.setReducerClass(NotRepeatReduc.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
        try {
            FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountNotRepeat"));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | IOException | InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("無重復(fù)總條數(shù)為: " + i);
    }
}
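The jobs above accumulate their counts in static fields (i) and print them from main(). That only reflects the real totals when the job runs in local mode, where mapper, reducer, and driver share one JVM; submitted to a cluster, the tasks run in separate JVMs and the driver's static field stays at 0. A common alternative is Hadoop's built-in counter mechanism. The sketch below is a minimal, hypothetical rewrite of the record-count job under that approach, not code from the original report; the class name and the counter group/name "sogou"/"total" are invented for illustration, and imports are omitted as above.

public class MRCountAllWithCounters {
    public static class CountMap extends Mapper<Object, Text, NullWritable, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context) {
            // every input line increments a job-wide counter that the framework aggregates
            context.getCounter("sogou", "total").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        Job job = Job.getInstance(conf, "count-with-counters");
        job.setJarByClass(MRCountAllWithCounters.class);
        job.setMapperClass(CountMap.class);
        job.setNumReduceTasks(0);                        // map-only job, no output records needed
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountAllWithCounters"));
        job.waitForCompletion(true);
        // read the aggregated value back in the driver instead of relying on a static field
        System.out.println("總條數(shù): " + job.getCounters().findCounter("sogou", "total").getValue());
    }
}

The same pattern (increment a counter in the reducer, read it with job.getCounters() after waitForCompletion) would also work for the non-empty, distinct-record, and distinct-UID jobs.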
(4) Number of distinct UIDs

public class CountNotMoreUid {
    public static int i = 0;

    public static class UidMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
                throws IOException, InterruptedException {
            String text = value.toString();
            String[] values = text.split("\t");
            String uid = values[1];
            context.write(new Text(uid), new Text("1"));
        }
    }

    public static class UidReduc extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Reducer<Text, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            i++;
            context.write(new Text(key.toString()), new IntWritable(i));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countuid");
        } catch (IOException e) {
            e.printStackTrace();
        }
        assert job != null;
        job.setJarByClass(CountNotMoreUid.class);
        job.setMapperClass(UidMap.class);
        job.setReducerClass(UidReduc.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        }
        try {
            FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountNotMoreUid"));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException | IOException | InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("獨(dú)立UID條數(shù): " + i);
    }
}

(5) Query frequency ranking (top 50 keywords)

public class CountTop50 {
    public static class TopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        Text text = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split("\t");
            String keys = line[2];
            text.set(keys);
            context.write(text, new LongWritable(1));
        }
    }

    public static class TopReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        Text text = new Text();
        TreeMap<Integer, String> map = new TreeMap<Integer, String>();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> value, Context context)
                throws IOException, InterruptedException {
            int sum = 0;  // number of times this keyword appears
            for (LongWritable ltext : value) {
                sum += ltext.get();
            }
            map.put(sum, key.toString());
            // keep only the top 50 entries
            if (map.size() > 50) {
                map.remove(map.firstKey());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Integer count : map.keySet()) {
                context.write(new Text(map.get(count)), new LongWritable(count));
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        Job job = Job.getInstance(conf, "count");
        job.setJarByClass(CountTop50.class);
        job.setJobName("Five");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(TopMapper.class);
        job.setReducerClass(TopReducer.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountTop50"));
        job.waitForCompletion(true);
    }
}
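TopReducer keys its TreeMap by the frequency, so two keywords with the same count overwrite each other, fewer than 50 words can survive, and the output is ordered from lowest to highest frequency. One possible tie-preserving variant, my own sketch rather than the report's code (the class name Top50TieReducer is invented; imports are omitted as above), groups keywords per frequency and emits them from highest to lowest:

public static class Top50TieReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    // frequency -> all keywords seen with that frequency
    private final TreeMap<Long, List<String>> top = new TreeMap<Long, List<String>>();
    private int kept = 0;  // number of keywords currently held

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        top.computeIfAbsent(sum, k -> new ArrayList<String>()).add(key.toString());
        kept++;
        // drop whole lowest-frequency buckets while at least 50 keywords remain
        while (kept > 50 && kept - top.firstEntry().getValue().size() >= 50) {
            kept -= top.firstEntry().getValue().size();
            top.remove(top.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit from the highest frequency downwards
        for (Map.Entry<Long, List<String>> e : top.descendingMap().entrySet()) {
            for (String word : e.getValue()) {
                context.write(new Text(word), new LongWritable(e.getKey()));
            }
        }
    }
}

Swapping this class in for TopReducer in the job above would leave the mapper and driver unchanged.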
(6) Number of users with more than 2 queries

public class CountQueriesGreater2 {
    public static int total = 0;

    public static class MyMaper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().split("\t");
            Text word;
            IntWritable one = new IntWritable(1);
            word = new Text(str[1]);
            context.write(word, one);
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text arg0, Iterable<IntWritable> arg1,
                Reducer<Text, IntWritable, Text, IntWritable>.Context arg2)
                throws IOException, InterruptedException {
            // arg0 is a key (here the UID), arg1 holds its per-record counts
            int sum = 0;
            for (IntWritable i : arg1) {
                sum += i.get();
            }
            if (sum > 2) {
                total = total + 1;
            }
            // arg2.write(arg0, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        // 1. Create the Job instance
        Job job = Job.getInstance(conf, "six");
        // 2. Set the mapper class
        job.setMapperClass(MyMaper.class);
        // 3. Set the combiner class (optional)
        // job.setCombinerClass(MyReducer.class);
        // 4. Set the reducer class
        job.setReducerClass(MyReducer.class);
        // 5. Set the output key type
        job.setOutputKeyClass(Text.class);
        // 6. Set the output value type
        job.setOutputValueClass(IntWritable.class);
        // Set the class used to locate the job jar
        job.setJarByClass(CountQueriesGreater2.class);
        // 7. Set the input path
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        // 8. Set the output path
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountQueriesGreater2"));
        // 9. Run the job
        job.waitForCompletion(true);
        System.out.println("查詢次數(shù)大于2次的用戶總數(shù): " + total + "條");
    }
}

(7) Proportion of users with more than 2 queries

public class CountQueriesGreaterPro {
    public static int total1 = 0;
    public static int total2 = 0;

    public static class MyMaper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            total2++;
            String[] str = value.toString().split("\t");
            Text word;
            IntWritable one = new IntWritable(1);
            word = new Text(str[1]);
            context.write(word, one);
            // after the map phase each UID is paired with a value of 1 per record
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text arg0, Iterable<IntWritable> arg1,
                Reducer<Text, IntWritable, Text, IntWritable>.Context arg2)
                throws IOException, InterruptedException {
            // arg0 is a key (here the UID), arg1 holds its per-record counts
            int sum = 0;
            for (IntWritable i : arg1) {
                sum += i.get();
            }
            if (sum > 2) {
                total1++;
            }
            arg2.write(arg0, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.out.println("seven begin");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://10.49.47.20:9000");
        // 1. Create the Job instance
        Job job = Job.getInstance(conf, "seven");
        // 2. Set the mapper class
        job.setMapperClass(MyMaper.class);
        // 3. Set the combiner class (optional)
        // job.setCombinerClass(MyReducer.class);
        // 4. Set the reducer class
        job.setReducerClass(MyReducer.class);
        // 5. Set the output key type
        job.setOutputKeyClass(Text.class);
        // 6. Set the output value type
        job.setOutputValueClass(IntWritable.class);
        // Set the class used to locate the job jar
        job.setJarByClass(CountQueriesGreaterPro.class);
        // 7. Set the input path
        FileInputFormat.addInputPath(job, new Path
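Steps 6 and 7 of the requirements, importing the saved result files into one HBase table through the Java API and checking them from the HBase shell, are not shown above. As an illustration only, not the report's own code, the sketch below shows one possible loader; the table name sogou_results, the column family info, the ZooKeeper address, and the assumption that each result file holds one "key \t value" pair per line are all invented for the example, and the table is assumed to have been created beforehand (for instance with create 'sogou_results', 'info' in the HBase shell). Imports are omitted as above.

public class HBaseResultLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "10.49.47.20");   // assumed ZooKeeper address
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("sogou_results"));
             FileSystem fs = FileSystem.get(URI.create("hdfs://10.49.47.20:9000"), conf)) {
            // read one of the step-4 result files line by line ("key \t value")
            Path result = new Path("/sogou/data/CountTop50/part-r-00000");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(result), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t");
                    // row key combines the result name and the keyword so all results fit in one table
                    Put put = new Put(Bytes.toBytes("top50_" + kv[0]));
                    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("value"), Bytes.toBytes(kv[1]));
                    table.put(put);
                }
            }
        }
    }
}

The imported rows can then be inspected from the HBase shell, for example with scan 'sogou_results', {LIMIT => 10}.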
