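The main function of Crawl contains this statement:

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

Quoting [Li Yang]: the inject operation calls the Injector class in the crawl package, one of Nutch's core packages.

What the inject operation does:
1. Normalizes and filters the URL set, removes invalid URLs, sets each URL's status (UNFETCHED), and assigns an initial score according to a fixed method;
2. Merges the URLs and removes duplicate URL entries;
3. Stores each URL together with its status and score in the crawldb database; where an entry duplicates one already in the database, the old entry is deleted and replaced by the new one.

Result of the inject operation: the contents of the crawldb database are updated, including the URLs and their statuses.

Now look at the function that inject calls:

    public void inject(Path crawlDb, Path urlDir) throws IOException {
      // create a temporary directory with a random name
      Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") + "/inject-temp-"
          + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      // map text input file to a <url,CrawlDatum> file
      // (produces a file of <url,CrawlDatum> key-value pairs)
      JobConf sortJob = new NutchJob(getConf());
      sortJob.setJobName("inject" + urlDir);
      FileInputFormat.addInputPath(sortJob, urlDir);
      sortJob.setMapperClass(InjectMapper.class);
      FileOutputFormat.setOutputPath(sortJob, tempDir);
      sortJob.setOutputFormat(SequenceFileOutputFormat.class);
      sortJob.setOutputKeyClass(Text.class);
      sortJob.setOutputValueClass(CrawlDatum.class);
      sortJob.setLong("injector.current.time", System.currentTimeMillis());
      JobClient.runJob(sortJob);

This part relies on Hadoop. The input directory is the URL directory specified by the user, and the output directory is the temporary directory created above. "Hadoop: The Definitive Guide" explains the SequenceFileOutputFormat used here as follows: Imagine a logfile, where each log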
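The three inject steps listed above can be illustrated with a small in-memory sketch. This is not Nutch's code: the class InjectSketch, the Datum type (a stand-in for CrawlDatum), the normalize helper, and the http/https-only filter rule are all hypothetical, and an ordinary Map plays the role of the crawldb. Duplicates are resolved the way the quoted description states (the new entry replaces the old one); the real Injector reducer may resolve them differently depending on the Nutch version.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, in-memory illustration of the inject steps (not Nutch's implementation).
final class InjectSketch {

    // Hypothetical stand-in for CrawlDatum: only a status and a score.
    static final class Datum {
        final String status;
        final float score;

        Datum(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    // Step 1: normalize and filter. Returns null for URLs this sketch rejects.
    // The http/https-only rule is an assumption made for the example.
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim();
        if (s.isEmpty() || s.startsWith("#")) {
            return null;
        }
        try {
            URI u = new URI(s).normalize(); // removes "." and ".." path segments
            String scheme = u.getScheme();
            if (scheme == null || u.getHost() == null) {
                return null;
            }
            boolean web = scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
            return web ? u.toString() : null;
        } catch (URISyntaxException e) {
            return null; // syntactically invalid URL
        }
    }

    static void inject(Map<String, Datum> crawlDb, List<String> seeds, float initialScore) {
        // Step 2: merge the seeds; keying by normalized URL removes duplicates.
        Map<String, Datum> injected = new LinkedHashMap<>();
        for (String raw : seeds) {
            String url = normalize(raw);
            if (url != null) {
                injected.put(url, new Datum("UNFETCHED", initialScore));
            }
        }
        // Step 3: store into the crawldb; per the description above,
        // a new entry replaces an existing entry with the same URL.
        crawlDb.putAll(injected);
    }

    public static void main(String[] args) {
        Map<String, Datum> crawlDb = new HashMap<>();
        inject(crawlDb, Arrays.asList("http://example.com/a/../b", "not a url"), 1.0f);
        for (Map.Entry<String, Datum> e : crawlDb.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().status + " " + e.getValue().score);
        }
    }
}
```

In the real job these steps are spread across MapReduce stages: the excerpt above shows only the first stage, in which InjectMapper turns the text seed files into <url,CrawlDatum> pairs written to the temporary directory.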