




English original

An Approach to Reduce Web Crawler Traffic Using Asp.Net

Nowadays search engines transfer web data from one place to another. They work on a client-server architecture in which a central server manages all the information. A web crawler is a program that extracts information from the web and sends it to the search engine for further processing. It is found that the largest share of traffic (approximately 40.1%) is due to web crawlers. The proposed scheme shows how a web crawler can reduce this traffic using a dynamic web page and HTTP GET.

I. INTRODUCTION

All search engines have powerful crawlers that visit the internet from time to time to extract useful information. The retrieved pages are indexed and stored in a database, as shown in Figure 1. The Internet is in effect a directed graph, with each web page as a node and each hyperlink as an edge, so the search operation can be abstracted as the traversal of a directed graph. By following the link structure of the web, we can reach a number of new pages from a set of starting pages. Web crawlers are designed to retrieve web pages and add them, or their representations, to a local repository or database. Crawlers update their information once a week, and sometimes only monthly or quarterly, so they cannot provide up-to-date versions of frequently updated pages. To keep up with frequent updates without putting a large burden on the content provider, we believe that retrieving and processing data near the data source is inevitable. More than one search engine is currently available in the market. The resulting increase in the complexity of web traffic requires that we base our model on the notion of web requests rather than web pages. Web crawlers are software systems that use the text and links on web pages to create search indexes of those pages, using HTML links to follow, or crawl, the connections between pages.

Figure 1. Architecture of a web search engine.

The WWW is a hyperlinked repository of trillions of hypertext documents [9] residing on different websites. World Wide Web (Web) traffic continues to increase and is now estimated to be more than 70 percent of the total traffic on the Internet.

A. Basic Crawling Terminology

We need to know some basic web crawler terminology, which plays an important role in the implementation of a web crawler.

Seed page: Crawling means traversing the web recursively, starting from a URL picked from a set of URLs. The starting URL is the entry point from which a crawler begins its search. This set of URLs is known as the seed page.

Frontier: The crawling procedure starts with a given URL, extracts the links from it, and adds them to a list of unvisited URLs. This unvisited list is known as the frontier. The frontier is implemented as a queue.

Parser: Parsing may mean simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree. The job of any parser is to parse the fetched pages, extract the list of new URLs from them, and return the new unvisited URLs to the frontier.

The basic algorithm of a web crawler is given below:

    Start
    Read the URL from the seed URL.
    Check whether the document has already been downloaded.
    If the document has already been downloaded, break.
    Else add it to the frontier.
    Now pick a URL from the frontier and extract the new links from it.
    Add all the newly found URLs to the frontier.
    Continue.
    End

The main function of a crawler is to add new links to the frontier and to select a new URL from it.

II. RELATED WORK

To reduce web crawler traffic, many researchers have worked in the following areas. One author used dynamic web pages with an HTTP GET request carrying a last-visit parameter. Another approach uses an active network to reduce unnecessary crawler traffic. Another author proposed using a bandwidth control system to reduce web crawler traffic over the internet. Another approach places a mobile crawler at the web server; the crawler checks for updates on the website and sends them to the search engine for indexing. A further work designs a new web crawler using VB.NET technology.

III. PERFORMANCE METRICS

In implementing the web crawler we made some assumptions to simplify the algorithm, the implementation, and the results. The crawler performs the following steps:

    Remove a URL from the URL list.
    Determine the protocol of the underlying host, such as http or ftp.
    Download the corresponding document.
    Extract any links contained in it.
    Add these links back to the URL list.

IV. SIMULATOR

The simulator was designed to study the behavior of different crawling algorithms on the same set of URLs. We designed the crawler as a Windows application project using VB.NET and ASP.NET. It can work both globally and locally, that is, it can produce results on an intranet and on the internet. It takes a URL in a given format and a location or name under which the crawl results are saved in an MS Access database.

Figure 2. Snapshot of the Web Crawler.

Figure 2 shows the user interface of the web crawler, which runs on either an intranet or the internet. To obtain crawler results we use a website. At each simulation step, the scheduler chooses the topmost website from the queue of websites and sends this site information to a module that simulates downloading pages from the website. For this simulator we apply crawling policies and save the collected or downloaded data in an MS Access database table with several data fields.

Crawling result: The crawling result is presented as a table of rows and columns; the output of the crawler is shown as a snapshot.

Figure 3. Snapshot of the Crawled Result Database.

In this work we observed that the first crawl of the website downloaded all of its pages. On a second crawl of the same site, the crawler fetched all the pages again, even though the site had updated only its dynamic pages and rarely its static pages. To reduce crawler traffic, we propose using a dynamic web page to inform the web crawler about new pages and updates on the website.

In the experiment we use a website of 7 web pages. The website is deployed on ASP.NET using the C# language. The dynamic web page is coded in C#, and the web crawler is coded in VB.NET. The LAST_VISIT parameter passed is the system time in milliseconds, as returned by C#; the millisecond update times are maintained in the "Update" data structure.

First we crawl the website using the old approach, and then using the proposed approach. The results obtained when crawling the website are shown in Table 1. To test the proposed approach, we direct the web crawler to the dynamic web page dynamic.aspx, set the last visit time in the URL, and perform crawling.

Test 1: The update time and URL of the pages index, branch, and person are recorded in the "Update" data structure. At the web crawler, the LAST_VISIT time is set before the update time of these pages. Crawling is performed; the results obtained are shown in Table 2.

Test 2: The update time and URL of the page about are recorded in the "Update" data structure. At the web crawler, the LAST_VISIT time is set before the update time of this page. Crawling is performed; the results obtained are shown in Table 3.

Test 3: The update time and URL of the pages service and query are recorded in the "Update" data structure. At the web crawler, the LAST_VISIT time is set before the update time of these pages. Crawling is performed; the results obtained are shown in Table 4.

Normal crawling is a time-consuming process because the crawler visits every web page to learn of all updated information on the website. In normal crawling it visits a total of 7 pages and takes 1385 milliseconds to visit the complete site. In the proposed approach the crawler visits only the dynamic update page and the updated web pages. The crawler takes about 500 milliseconds when there are three updates and about 450 milliseconds when there are two. With three updates on the experimental website, the proposed scheme is 4.83 times faster than the old approach; with two updates it is 7.03 times faster. Graph 1 shows the time taken by the web crawler to download updates.

In normal crawling the crawler visits 7 pages to find updates, but the number of page visits is much smaller in the proposed approach. With one update the crawler visits only 2 pages, with two updates 3 pages, and with three updates 4 pages.

V. CONCLUSION

With this approach the crawler finds new updates on the web server using a dynamic web page. Using this crawler, queries can be sent with the requested URLs, which reduces crawler traffic over the internet. Approximately 40.1% of traffic is found to be due to web crawlers, so this method can reduce web crawler traffic by 50%, that is, half of the web crawler traffic, or about 20% of the traffic over the internet. As future work, crawler traffic could be reduced further using the page rank method and parameters such as the last-modified parameter, which gives the date and time at which the fetched page was modified. The crawler can use the last-modified parameter to fetch fresh pages from websites.

In high-level terms, the MVC pattern means that an MVC application will be split into at least three pieces:

Models, which contain or represent the data that users work with. These can be simple view models, which just represent data being transferred between views and controllers; or they can be domain models, which contain the data in a business domain as well as the operations, transformations, and rules for manipulating that data.

Views, which are used to render some part of the model as a UI.

Controllers, which process incoming requests, perform operations on the model, and select views to render to the user.

Models are the definition of the universe your application works in. In a banking application, for example, the model represents everything in the bank that the application supports, such as accounts, the general ledger, and credit limits for customers, as well as the operations that can be used to manipulate the data in the model, such as depositing funds and making withdrawals from the accounts. The model is also responsible for preserving the overall state and consistency of the data; for example, making sure that all transactions are added to the ledger, and that a client doesn't withdraw more money than he is entitl
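
The split into models, views, and controllers described above can be illustrated with a minimal Python sketch of the banking example. This is only an illustration under stated assumptions, not ASP.NET MVC code: the names `Account`, `InsufficientFunds`, `render_account`, and `AccountController` are hypothetical, and amounts are plain integers for simplicity. The model owns the consistency rules (every transaction goes to the ledger, no withdrawal beyond the balance), the view only renders, and the controller maps a request onto a model operation and picks what to render.

```python
from dataclasses import dataclass, field


class InsufficientFunds(Exception):
    """Raised when a withdrawal would exceed the account balance."""


@dataclass
class Account:
    """Domain model (hypothetical): holds the data and the rules for changing it."""
    owner: str
    balance: int = 0  # whole currency units, an assumption made for simplicity
    ledger: list = field(default_factory=list)

    def deposit(self, amount: int) -> None:
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount
        self.ledger.append(("deposit", amount))  # every transaction is recorded

    def withdraw(self, amount: int) -> None:
        if amount <= 0:
            raise ValueError("withdrawal must be positive")
        if amount > self.balance:
            # The model, not the controller, enforces the withdrawal limit.
            raise InsufficientFunds(f"balance is only {self.balance}")
        self.balance -= amount
        self.ledger.append(("withdraw", amount))


def render_account(account: Account, message: str = "") -> str:
    """View (hypothetical): renders part of the model as text for the user."""
    lines = [f"Account of {account.owner}: balance {account.balance}"]
    if message:
        lines.append(message)
    return "\n".join(lines)


class AccountController:
    """Controller (hypothetical): handles a request, operates on the model,
    and selects what the view should render."""

    def __init__(self, account: Account) -> None:
        self.account = account

    def handle(self, action: str, amount: int) -> str:
        try:
            if action == "deposit":
                self.account.deposit(amount)
            elif action == "withdraw":
                self.account.withdraw(amount)
            else:
                return render_account(self.account, f"Unknown action: {action}")
        except (ValueError, InsufficientFunds) as exc:
            return render_account(self.account, f"Rejected: {exc}")
        return render_account(self.account)
```

In this sketch a rejected withdrawal leaves both the balance and the ledger untouched, because the rule lives in the model; a controller or view that forgot to check the balance could not corrupt the data.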