Google Search and Information Retrieval on the Internet
Zhi-Ming Ma
May 16, 2008
Email: mazm@/member/mazhiming/index.html

About 626,000 results match the query for the Academy of Mathematics and Systems Science, Chinese Academy of Sciences; results 1-100 are shown below. (Search took 0.45 seconds.)

How can Google make a ranking of 626,000 pages in 0.45 seconds?

A main task of Internet (Web) Information Retrieval = Design and Analysis of Search Engine (SE) Algorithms, involving plenty of Mathematics

HITS (1998): Jon Kleinberg, Cornell University
PageRank (1998): Sergey Brin and Larry Page, Stanford University

Nevanlinna Prize (2006): Jon Kleinberg
One of Kleinberg's most important research achievements focuses on the internetwork structure of the World Wide Web. Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure. Kleinberg introduced the idea of "authorities" and "hubs": an authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.

Zhuzihuthesis.pdf
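The hub and authority idea described above can be sketched as a mutual-reinforcement iteration. This is only a minimal illustration: the toy link graph, the function name `hits`, and the choice to normalise scores so that they sum to 1 (Kleinberg's formulation normalises by Euclidean length) are assumptions made for the example, not part of the talk.

```python
def hits(links, iterations=50):
    """Toy hubs-and-authorities iteration.

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores), each normalised to sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded (sum-to-one is an assumption).
        a_total = sum(auth.values()) or 1.0
        h_total = sum(hub.values()) or 1.0
        auth = {p: v / a_total for p, v in auth.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: h1, h2, h3 are link pages, a1 and a2 are content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy_links)
print("authorities:", {p: round(v, 3) for p, v in auth.items()})
print("hubs:", {p: round(v, 3) for p, v in hub.items()})
```

In this toy graph the page linked from all three hubs gets the highest authority score, and the hubs that point to both authorities score higher than the hub that points to only one.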
PageRank, the ranking system used by the Google search engine.
Query independent and content independent: it uses only the web graph structure.
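To make "using only the web graph structure" concrete, here is a minimal power-iteration sketch of PageRank. The toy graph, the function name `pagerank`, the default damping factor 0.85, and the uniform treatment of pages without out-links are all assumptions chosen for illustration; the slides do not specify them.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a small link graph.

    links maps each page to the list of pages it links to.
    alpha is the damping factor (probability of following a link).
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - alpha) / n + alpha * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new[q] += alpha * rank[p] / len(targets)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page web; no page content is used anywhere.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 4))
```

Page D has no in-links, so its score is just the random-jump share (1 - alpha) / n, while C, which every other page links to, comes out on top.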
PageRank as a Function of the Damping Factor
Paolo Boldi, Massimo Santini, Sebastiano Vigna
DSI, Università degli Studi di Milano
WWW 2005 paper

3 General Behaviour
3.1 Choosing the damping factor
3.2 Getting close to 1

Can we somehow characterise the properties of […]? What makes […] different from the other (infinitely many, if P is reducible) limit distributions of P?

Conjecture 1: […] is the limit distribution of P when the starting distribution is uniform, that is, […]
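As a small numerical illustration of PageRank as a function of the damping factor, the sketch below evaluates PageRank on a reducible toy chain for several values of alpha. The chain, the function name `pagerank_vector`, and the uniform preference vector are assumptions made for this example; it says nothing about the general validity of the conjecture.

```python
def pagerank_vector(P, alpha, iters=5000):
    """Iterate r <- alpha * r P + (1 - alpha) * v with a uniform v."""
    n = len(P)
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [alpha * sum(r[i] * P[i][j] for i in range(n)) + (1.0 - alpha) / n
             for j in range(n)]
    return r


# Reducible toy chain with two closed classes: {1} and {2, 3}.
# State 0 always moves to 1, state 1 stays put, states 2 and 3 swap.
P = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

results = {alpha: pagerank_vector(P, alpha) for alpha in (0.5, 0.85, 0.99)}
for alpha, r in results.items():
    print(alpha, [round(x, 4) for x in r])
```

For this particular chain the fixed-point equation can be solved by hand, giving ((1 - alpha)/4, (1 + alpha)/4, 1/4, 1/4), so the mass of the transient state 0 drains into state 1 as alpha approaches 1.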
Websites provide plenty of information:
Pages in the same website may share the same IP, run on the same web server and database server, and be authored/maintained by the same person or organization.
There might be high correlations between pages in the same website, in terms of content, page layout and hyperlinks.
Websites contain a higher density of hyperlinks inside them (about 75%) and a lower density of edges in between.

HostGraph loses much transition information
Can a surfer jump from page 5 of site 1 to a page in site 2?

From: s06-pc-chairs-email@[mailto:s06-pc-chairs-
Sent: April 4, 2006, 8:36
To: Tie-Yan Liu; wangying@; fengg03@; ybao@; mazm@
Subject: [SIGIR2006] Your Paper #191
Title: AggregateRank: Bring Order to Web Sites
Congratulations!!
29th Annual International Conference on Research & Development on Information Retrieval (SIGIR'06, August 6–11, 2006, Seattle, Washington, USA).

Ranking Websites, a Probabilistic View
Ying Bao, Gang Feng, Tie-Yan Liu, Zhi-Ming Ma, and Ying Wang
Internet Mathematics, Volume 3 (2007), Issue 3-

We suggest evaluating the importance of a website with the mean frequency of visiting the website for the Markov chain on the Internet Graph describing random surfing.
We show that this mean frequency is equal to the sum of the PageRanks of all the webpages in that website (hence it is referred to as PageRankSum).
We propose a novel algorithm (the AggregateRank Algorithm), based on the theory of stochastic complement, to calculate the rank of a website. The AggregateRank Algorithm can approximate the PageRankSum accurately, while the corresponding computational complexity is much lower than that of PageRankSum.
By constructing return-time Markov chains restricted to each website, we also describe the probabilistic relation between PageRank and AggregateRank.
The complexity and the error bound of the AggregateRank Algorithm, with experiments on real data, are discussed at the end of the paper.

n webs in N sites, […]
The stationary distribution, known as the PageRank vector, is given by […]
We may rewrite the stationary distribution as […] with […] as a row vector of length […]
We define the one-step transition probability from the website […] to the website […] by […], where e is an […]-dimensional column vector of all ones.
The N×N matrix C(α) = (c_ij(α)) is referred to as the coupling matrix, whose elements represent the transition probabilities between websites. It can be proved that C(α) is an irreducible stochastic matrix, so that it possesses a unique stationary probability vector. We use ξ(α) to denote this stationary probability, which can be obtained from […]
Since […], one can easily check that […] is the unique solution to […]
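The formulas on these slides did not survive extraction, so the sketch below rests on an assumed reading taken from standard stochastic complementation theory: c_ij(α) = (π_i(α) / ||π_i(α)||_1) P_ij(α) e, where π_i(α) is the part of the PageRank vector belonging to site i and P_ij(α) is the block of the transition matrix from site i to site j. Under that assumption, the vector of per-site PageRank sums should be stationary for C(α). The toy web, the site partition, and all function names are hypothetical.

```python
def google_matrix(link_matrix, alpha):
    """P(alpha) = alpha * link_matrix + (1 - alpha) / n * (all-ones matrix)."""
    n = len(link_matrix)
    return [[alpha * link_matrix[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


def stationary(P, iters=2000):
    """Stationary row vector of a stochastic matrix by power iteration."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


def coupling_matrix(P, pi, sites):
    """Assumed definition: c_ij = (pi_i / ||pi_i||_1) P_ij e."""
    C = []
    for Si in sites:
        mass = sum(pi[p] for p in Si)
        C.append([sum(pi[p] / mass * sum(P[p][q] for q in Sj) for p in Si)
                  for Sj in sites])
    return C


# Hypothetical web of 5 pages: site 0 = pages {0, 1, 2}, site 1 = pages {3, 4}.
link_matrix = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
]
sites = [[0, 1, 2], [3, 4]]

P = google_matrix(link_matrix, alpha=0.85)
pi = stationary(P)                                   # PageRank vector
C = coupling_matrix(P, pi, sites)                    # site-level transitions
xi = [sum(pi[p] for p in S) for S in sites]          # PageRank sum per site
xi_C = [sum(xi[i] * C[i][j] for i in range(len(sites))) for j in range(len(sites))]
print("PageRank sums:", [round(v, 4) for v in xi])
print("xi * C:       ", [round(v, 4) for v in xi_C])
```

The two printed vectors agree, which is the "easily checked" step: multiplying the per-site sums by C(α) regroups the identity πP = π by site.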
We shall refer to […] as the AggregateRank. That is, the probability of visiting a website is equal to the sum of PageRanks of all the pages in that website. This conclusion is consistent with our intuition.

The transition probability from S_i to S_j actually summarizes all the cases in which the random surfer jumps from any page in S_i to any page in S_j within a one-step transition. Therefore, the transition in this new HostGraph is in accordance with the real behavior of Web surfers. In this regard, the rank calculated from the coupling matrix C(α) will be more reasonable than those of previous works.

Let […] denote the number of visits to the website […] during the n times, that is, […]. We have […]
Assume a starting state in website A, i.e. […]
It is clear that all the variables […] are stopping times for X. We define […] and inductively […]
Let […] denote the transition matrix of the return-time Markov chain for site […]. Similarly, we have […]
Since […], therefore […]
Suppose that AggregateRank, i.e. the stationary distribution of […], is […]

Based on the above discussions, the direct approach to computing the AggregateRank ξ(α) is to accumulate PageRank values (denoted by PageRankSum). However, this approach is infeasible because the computation of PageRank is not a trivial task when the number of web pages is as large as several billion. Therefore, efficient computation becomes a significant problem.

AggregateRank
1. Divide the n×n matrix […] into N×N blocks according to the N sites.
2. Construct the stochastic matrix […] for […] by changing the diagonal elements of […] to make each row sum up to 1.
3. Determine […] from […]
4. Form an approximation […] to the coupling matrix […], by evaluating […]
5. Determine the stationary distribution of […] and denote it […], i.e., […]
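The five steps above can be sketched on a toy web as follows. Because the formulas are missing from the slides, this is an assumed reading: step 3 is taken to compute the stationary vector u_i of the adjusted diagonal block, and step 4 to set the (i, j) entry of the approximate coupling matrix to u_i P_ij e. The toy data and every function name are hypothetical.

```python
def google_matrix(link_matrix, alpha):
    """P(alpha) = alpha * link_matrix + (1 - alpha) / n * (all-ones matrix)."""
    n = len(link_matrix)
    return [[alpha * link_matrix[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


def stationary(P, iters=2000):
    """Stationary row vector of a stochastic matrix by power iteration."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


def aggregate_rank(P, sites):
    """Approximate site ranks without computing the full PageRank vector."""
    approx_C = []
    for Si in sites:
        # Steps 1-2: take the diagonal block of the site and add each row's
        # missing mass to its diagonal entry so that the block is stochastic.
        block = [[P[p][q] for q in Si] for p in Si]
        for k in range(len(Si)):
            block[k][k] += 1.0 - sum(block[k])
        # Step 3 (assumed): stationary vector of the adjusted block.
        u = stationary(block)
        # Step 4 (assumed): weight the block row sums P_ij e by u.
        approx_C.append([
            sum(u[k] * sum(P[p][q] for q in Sj) for k, p in enumerate(Si))
            for Sj in sites
        ])
    # Step 5: stationary distribution of the approximate coupling matrix.
    return stationary(approx_C)


# Hypothetical web of 5 pages: site 0 = pages {0, 1, 2}, site 1 = pages {3, 4}.
link_matrix = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.5, 0.0],
]
sites = [[0, 1, 2], [3, 4]]

P = google_matrix(link_matrix, alpha=0.85)
xi_star = aggregate_rank(P, sites)
page_rank = stationary(P)
page_rank_sum = [sum(page_rank[p] for p in S) for S in sites]
print("AggregateRank approximation:", [round(v, 4) for v in xi_star])
print("PageRankSum:                ", [round(v, 4) for v in page_rank_sum])

# Sanity check: if every site is a single page, the scheme reduces to PageRank.
single_page_sites = [[p] for p in range(len(P))]
xi_single = aggregate_rank(P, single_page_sites)
```

The saving in this reading comes from working with one small block per site plus an N×N matrix, rather than iterating on the full n×n matrix.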
Experiments

In our experiments, the data corpus is the benchmark data for the Web track of TREC 2003 and 2004, […] domain in the year of 2002. The dataset contains 1,247,753 web pages. The largest website contains 137,103 web pages while the smallest one contains only 1 page.

Performance Evaluation of Ranking Algorithms based on Kendall's distance
Similarity between PageRankSum and other three ranking results.

From: pcchairs@
Sent: Thursday, April 03, 2008 9:48 AM
Dear Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
We are pleased to inform you that your paper
Title: BrowseRank: Letting Web Users Vote for Page Importance
has been accepted for oral presentation as a full paper and for publication as an eight-page paper in the proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval.
Congratulations!!

Building model
Properties of Q process:
Stationary distribution: […]
Jumping probability: […]
Embedded Markov chain: […] is a Markov chain with the transition probability matrix […]

Main conclusion 1
[…] is the mean of the staying time on page i. The more important a page is, the longer the staying time on it is.
[…] is the mean of the first re-visit time at page i. The more important a page is, the smaller the re-visit time is, and the larger the visit frequency is.
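The interplay between staying time and re-visit time can be illustrated with the standard relation for a continuous-time Markov process: the long-run fraction of time spent on a page is its mean staying time divided by its mean re-visit time. The sketch below is an assumption-laden stand-in for the missing formulas, not the BrowseRank implementation: the jump matrix, the staying times, the function names, and the convention that re-visit time is measured from one arrival at the page to the next are all hypothetical.

```python
def stationary(T, iters=2000):
    """Stationary row vector of a stochastic matrix by power iteration."""
    n = len(T)
    x = [1.0 / n] * n
    for _ in range(iters):
        x = [sum(x[i] * T[i][j] for i in range(n)) for j in range(n)]
    return x


def mean_revisit_time(T, mean_stay):
    """Mean time between two successive arrivals at each page."""
    embedded = stationary(T)
    # Average time consumed per jump, weighted by how often each page is hit.
    time_per_jump = sum(e * m for e, m in zip(embedded, mean_stay))
    return [time_per_jump / e for e in embedded]


def time_fraction(T, mean_stay):
    """Long-run fraction of time on each page: staying time / re-visit time."""
    revisit = mean_revisit_time(T, mean_stay)
    return [m / r for m, r in zip(mean_stay, revisit)]


# Hypothetical jump probabilities of the embedded chain (no self-jumps).
T = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
mean_stay = [10.0, 30.0, 60.0]    # hypothetical mean staying times in seconds

importance = time_fraction(T, mean_stay)
revisit = mean_revisit_time(T, mean_stay)
print("importance:", [round(v, 3) for v in importance])
print("mean re-visit time:", [round(v, 1) for v in revisit])
```

In this symmetric example every page is visited equally often, so the pages differ only through staying time and the scores come out proportional to it.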
Main conclusion 2
[…] is the stationary distribution of […]
The stationary distribution of the discrete model is easy to compute:
Power method for […]
Log data for […]

Further questions
How about inhomogeneous processes? Statistical results show that different periods of time possess different visiting frequencies.
Poisson processes with different intensity.
Marked point process
Hyperlinks are not reliable. Users' real behavior should be considered.

Relevance Ranking
Many features for measuring relevance:
Term distribution (anchor, URL, title, body, proximity, ...)
Recommendation & citation (PageRank, click-through data, ...)
Statistics or knowledge extracted from web data
Questions:
What is the optimal ranking function to combine different features (or evidences)?
How to measure relevance?

Learning to Rank
What are the optimal weightings for combining the various features?
Use machine learning methods to learn the ranking function
Human relevance system (HRS)
Relevance verification tests (RVT)
Wei-Ying Ma, Microsoft Research Asia

Learning to Rank
Model, Learning System, Ranking System, min Loss
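As a rough illustration of learning the weights that combine several features by minimising a loss, the sketch below fits a linear scoring function with a pairwise hinge loss, in the spirit of Ranking SVM. It is not the system described on the slides: the two features, the toy relevance labels, the hyperparameters, and the function names are all made up for the example.

```python
def score(w, x):
    """Linear ranking function: weighted combination of the features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(queries, dim, epochs=200, lr=0.1, reg=0.01):
    """Gradient descent on a regularised pairwise hinge loss.

    queries is a list of queries; each query is a list of (features, label)
    with label 1 for relevant and 0 for irrelevant documents.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for docs in queries:
            positives = [x for x, label in docs if label == 1]
            negatives = [x for x, label in docs if label == 0]
            # Pairs are only formed between documents of the same query.
            for xp in positives:
                for xn in negatives:
                    margin = score(w, xp) - score(w, xn)
                    for k in range(dim):
                        grad = reg * w[k]
                        if margin < 1.0:      # pair not yet separated enough
                            grad -= xp[k] - xn[k]
                        w[k] -= lr * grad
    return w


# Hypothetical features: [link-based score, term-match score].
toy_queries = [
    [([0.2, 0.9], 1), ([0.8, 0.1], 0), ([0.5, 0.2], 0)],
    [([0.6, 0.8], 1), ([0.3, 0.1], 0)],
]
weights = train_pairwise(toy_queries, dim=2)
print("learned weights:", [round(v, 3) for v in weights])
```

On this toy data the relevant documents differ from the irrelevant ones mainly in the second feature, so the learned function puts most of its weight there and ranks every relevant document above the irrelevant ones of the same query.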
State-of-the-art algorithms for learning to rank take the pairwise approach:
Ranking SVM
RankBoost
RankNet (employed at Live Search)

Learning to rank
The goal of learning to rank is to construct a real-valued function that can generate a ranking on the documents associated with the given query. The state-of-the-art methods transform the learning problem into that of classification and then perform the learning task:
For each query, it is assumed that there are two categories of documents: positive and negative (representing relevant and irrelevant with respect to the query). Then document pairs are constructed between positive documents and negative documents. In the training process, the query information is actually ignored.

[5] Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. In Proc. of SIGIR'06, pages 186–193, 2006.
[11] T. Qin, T.-Y. Liu, M.-F. Tsai, X.-D. Zhang, and H. Li. Learning to search web pages with query-level loss functions. Technical Report MSR-TR-2006-156, 2006.

We re-represent the learning to rank problem by introducing the concepts of 'query' and 'distribution given query' into its mathematical formulation. More precisely, we assume that queries are drawn independently from a query space Q according to an (unknown) probability distribution […]

It should be noted that if […], then the bound makes sense. This condition can be satisfied in many practical cases.

As case studies, we investigate Ranking SVM and RankBoost. We show that after introducing query-level normalization to its objective function, Ranking SVM will have query-level stability. For RankBoost, the query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function. These analyses agree largely with our experiments and the experiments in [5] and [11].

Rank aggregation
Rank aggregation is to combine ranking results of entities from multiple ranking functions in order to generate a better one. The individual ranking functions are referred to as base rankers, or simply rankers.

Rank aggregation can be classified into two categories [2].

Score-based aggregation
In the first category, the entities in individual ranking lists are assigned scores and the rank aggregation function is assumed to use the scores (denoted as score-based aggregation) [11][18][28].

Order-based aggregation
In the second category, only the orders of the entities in individual ranking lists a
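The difference between the two categories of rank aggregation can be seen on a tiny example. The sketch below uses two simple, well-known aggregation rules chosen for illustration: summing raw scores for the score-based case and a Borda count for the order-based case. The source does not say which aggregation functions it has in mind, and the rankers, entities, and scores are invented.

```python
def score_based(score_lists):
    """Score-based aggregation: add up the scores given by each base ranker."""
    total = {}
    for scores in score_lists:
        for entity, s in scores.items():
            total[entity] = total.get(entity, 0.0) + s
    return sorted(total, key=lambda e: (-total[e], e))


def borda(rankings):
    """Order-based aggregation (Borda count): only positions are used."""
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for position, entity in enumerate(ranking):
            points[entity] = points.get(entity, 0) + (n - 1 - position)
    return sorted(points, key=lambda e: (-points[e], e))


# Three hypothetical base rankers scoring the same three entities.
ranker_scores = [
    {"a": 1.0, "b": 0.3, "c": 0.1},
    {"a": 0.2, "b": 0.3, "c": 0.1},
    {"a": 0.3, "b": 0.4, "c": 0.2},
]
# The order-based view keeps only each ranker's ordering of the entities.
ranker_orders = [sorted(s, key=lambda e: -s[e]) for s in ranker_scores]

by_score = score_based(ranker_scores)
by_order = borda(ranker_orders)
print("score-based:", by_score)
print("order-based:", by_order)
```

Here the first ranker's very high score for "a" decides the score-based result, while the order-based result follows the two rankers that place "b" first, which shows how discarding score magnitudes can change the outcome.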