




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報或認(rèn)領(lǐng)
文檔簡介
DataStreamMiningandQueryingSlidestakenfromanexcellentTutorialonDataStreamMiningandQueryingbyMinosGarofalakis,
JohannesGehrkeandRajeevRastogi
AndbyMino’slecturesslidesfromthere:/cs286sp07/ProcessingDataStreams:MotivationAgrowingnumberofapplicationsgeneratestreamsofdataPerformancemeasurementsinnetworkmonitoringandtrafficmanagementCalldetailrecordsintelecommunicationsTransactionsinretailchains,ATMoperationsinbanksLogrecordsgeneratedbyWebServersSensornetworkdataApplicationcharacteristicsMassivevolumesofdata(severalterabytes)RecordsarriveatarapidrateGoal:Minepatterns,processqueriesandcomputestatisticsondatastreamsinreal-timeDataStreams:ComputationModelAdatastreamisa(massive)sequenceofelements:
StreamprocessingrequirementsSinglepass:EachrecordisexaminedatmostonceBoundedstorage:LimitedMemory(M)forstoringsynopsisReal-time:Perrecordprocessingtime(tomaintainsynopsis)mustbelowStreamProcessingEngine(Approximate)AnswerSynopsisinMemoryDataStreamsDataStreamProcessingAlgorithmsGenerally,algorithmscomputeapproximateanswersDifficulttocomputeanswersaccuratelywithlimitedmemoryApproximateanswers-DeterministicboundsAlgorithmsonlycomputeanapproximateanswer,butboundsonerrorApproximateanswers-ProbabilisticboundsAlgorithmscomputeanapproximateanswerwithhighprobabilityWithprobabilityatleast,thecomputedansweriswithinafactoroftheactualanswerSingle-passalgorithmsforprocessingstreamsalsoapplicableto(massive)terabytedatabases!
Sampling:BasicsIdea:AsmallrandomsampleSofthedataoftenwell-representsallthedataForafastapproxanswer,apply“modified”querytoSExample:selectaggfromRwhereR.eisodd
(n=12)
Ifaggisavg,returnaverageofoddelementsinSIfaggiscount,returnaverageoverallelementseinSofnifeisodd0ifeisevenUnbiased:Forexpressionsinvolvingcount,sum,avg:theestimatorisunbiased,i.e.,theexpectedvalueoftheansweristheactualanswerDatastream:935271658491SampleS:9518answer:5answer:12*3/4=9ProbabilisticGuaranteesExample:Actualansweriswithin5±1withprob0.9UseTailInequalitiestogiveprobabilisticboundsonreturnedanswerMarkovInequalityChebyshev’sInequalityHoeffding’sInequalityChernoffBoundTailInequalitiesGeneralboundsontailprobabilityofarandomvariable(thatis,probabilitythatarandomvariabledeviatesfarfromitsexpectation)
BasicInequalities:LetXbearandomvariablewithexpectationandvarianceVar[X].ThenforanyProbabilitydistributionTailprobabilityMarkov:Chebyshev:TailInequalitiesforSumsPossibletoderivestrongerboundsontailprobabilitiesforthesumofindependentrandomvariablesHoeffding’sInequality:LetX1,...,Xmbeindependentrandomvariableswith0<=Xi<=r.Letandbetheexpectationof.Then,foranyApplicationtoavgqueries:missizeofsubsetofsampleSsatisfyingpredicate(3inexample)risrangeofelementvaluesinsample(8inexample)TailInequalitiesforSums(Contd.)PossibletoderiveevenstrongerboundsontailprobabilitiesforthesumofindependentBernoullitrialsChernoffBound:LetX1,...,XmbeindependentBernoullitrialssuchthatPr[Xi=1]=p(Pr[Xi=0]=1-p).Letandbetheexpectationof.Then,forany,Applicationtocountqueries:missizeofsampleS(4inexample)pisfractionofoddelementsinstream(2/3inexample)
Remark:ChernoffboundresultsintighterboundsforcountqueriescomparedtoHoeffding’sinequalityComputingStreamSampleReservoirSampling[Vit85]:
MaintainsasampleSofafixed-sizeMAddeachnewelementtoSwithprobabilityM/n,wherenisthecurrentnumberofstreamelementsIfaddanelement,evictarandomelementfromSInsteadofflippingacoinforeachelement,determinethenumberofelementstoskipbeforethenexttobeaddedtoSConcisesampling[GM98]:DuplicatesinsampleSstoredas<value,count>pairs(thus,potentiallyboostingactualsamplesize)AddeachnewelementtoSwithprobability1/T(simplyincrementcountifelementalreadyinS)IfsamplesizeexceedsMSelectnewthresholdT’>TEvicteachelement(decrementcount)fromSwithprobability1-T/T’AddsubsequentelementstoSwithprobability1/T’StreamingModel:SpecialCases
Time-SeriesModelOnlyj-thupdateupdatesA[j](i.e.,A[j]:=c[j])
Cash-RegisterModelc[j]isalways>=0(i.e.,increment-only)Typically,c[j]=1,soweseeamulti-setofitemsinonepass
TurnstileModelMostgeneralstreamingmodelc[j]canbe>0or<0(i.e.,incrementordecrement)
ProblemdifficultyvariesdependingonthemodelE.g.,MIN/MAXinTime-Seriesvs.Turnstile!Linear-Projection(akaAMS)SketchSynopsesGoal:Buildsmall-spacesummaryfordistributionvectorf(i)(i=1,...,N)seenasastreamofi-valuesBasicConstruct:
RandomizedLinearProjectionoff()=projectontoinner/dotproductoff-vectorSimpletocomputeoverthestream:Addwheneverthei-thvalueisseenGenerate‘sinsmall(logN)spaceusingpseudo-randomgeneratorsTunableprobabilisticguaranteesonapproximationerrorDelete-Proof:Justsubtracttodeleteani-thvalueoccurrenceComposable:Simplyaddindependently-builtprojectionsDatastream:3,1,2,4,2,3,5,...Datastream:3,1,2,4,2,3,5,...f(1)f(2)f(3)f(4)f(5)11122where=vectorofrandomvaluesfromanappropriatedistributionExample:Binary-JoinCOUNTQueryProblem:ComputeanswerforthequeryCOUNT(RAS)Example:Exactsolution:tooexpensive,requiresO(N)space!N=sizeof(domain(A))DatastreamR.A:41241412032134DatastreamS.A:31242412213421=10(2+2+0+6)BasicAMSSketchingTechnique[AMS96]KeyIntuition:Userandomizedlinearprojectionsoff()todefinerandomvariableXsuchthatXiseasilycomputedoverthestream(insmallspace)E[X]=COUNT(RAS)Var[X]issmall
BasicIdea:Defineafamilyof4-wiseindependent{-1,+1}randomvariables
Pr[=+1]=Pr[=-1]=1/2Expectedvalueofeach,E[]=0Variablesare4-wiseindependentExpectedvalueofproductof4distinct=0Variablescanbegeneratedusingpseudo-randomgeneratorusingonlyO(logN)space(forseeding)!Probabilisticerrorguarantees(e.g.,actualansweris10±1withprobability0.9)AMSSketchConstructionComputerandomvariables:andSimplyaddtoXR(XS)wheneverthei-thvalueisobservedintheR.A(S.A)streamDefineX=XRXStobeestimateofCOUNTqueryExample:DatastreamR.A:412414DatastreamS.A:3124241202134122134213Binary-JoinAMSSketchingAnalysisExpectedvalueofX=COUNT(RAS)
Using4-wiseindependence,possibletoshowthat
isself-joinsizeofR(second/L2moment)01BoostingAccuracyChebyshev’sInequality:
BoostaccuracytobyaveragingoverseveralindependentcopiesofX(reducesvariance)
ByChebyshev:xxxAverageyBoostingConfidenceBoostconfidencetobytakingmedianof2log(1/)independentcopiesofYEachY=BernoulliTrial
Pr[|median(Y)-COUNT|COUNT](ByChernoffBound)=Pr[#failuresin2log(1/)trials>=log(1/)]yyycopiesmedian2log(1/)“FAILURE”:SummaryofBinary-JoinAMSSketchingStep1:Computerandomvariables:andStep2:DefineX=XRXSSteps3&4:AverageindependentcopiesofX;Returnmedianofaverages
MainTheorem(AGMS99):SketchingapproximatesCOUNTtowithinarelativeerrorofwithprobabilityusingspace
Remember:O(logN)spacefor“seeding”theconstructionofeachX
xxxAverageyxxxAverageyxxxAverageycopiescopiesmedian2log(1/)ASpecialCase:Self-joinSizeEstimateCOUNT(RAR)(originalAMSpaper)Second(L2)momentofdatadistribution,Giniindexofheterogeneity,measureofskewinthedata
Inthiscase,COUNT=SJ(R),sowegetan(e,d)-estimateusingspaceonly
Best-caseforAMSstreamingjoin-sizeestimationDistinctValueEstimationProblem:Findthenumberofdistinctvaluesinastreamofvalueswithdomain[0,...,N-1]Zerothfrequencymoment,L0(Hamming)streamnormStatistics:numberofspeciesorclassesinapopulationImportantforqueryoptimizersNetworkmonitoring:distinctdestinationIPaddresses,source/destinationpairs,requestedURLs,etc.Example(N=64)Datastream:305301751037Numberofdistinctvalues:5Assumeahashfunctionh(x)thatmapsincomingvaluesxin[0,…,N-1]uniformlyacross[0,…,2^L-1],whereL=O(logN)Letlsb(y)denotethepositionoftheleast-significant1bitinthebinaryrepresentationofyAvaluexismappedtolsb(h(x))MaintainHashSketch=BITMAParrayofLbits,initializedto0Foreachincomingvaluex,setBITMAP[lsb(h(x))]=1Hash(akaFM)SketchesforDistinctValueEstimation[FM85]x=5h(x)=101100lsb(h(x))=2000001BITMAP543210Hash(akaFM)SketchesforDistinctValueEstimation[FM85]Byuniformitythroughh(x):Prob[BITMAP[k]=1]=Prob[]=Assumingddistinctvalues:expectd/2tomaptoBITMAP[0],d/4tomaptoBITMAP[1],...LetR=positionofrightmostzeroinBITMAPUseasindicatoroflog(d)[FM85]provethatE[R]=,whereEstimated=Averageseveraliidinstances(differenthashfunctions)toreduceestimatorvariancefringeof0/1saroundlog(d)000001BITMAP000111111111position<<log(d)position>>log(d)0L-1[FM85]assume“ideal”hashfunctionsh(x)(N-wiseindependence) [AMS96]:pairwiseindependenceissufficienth(x)=,wherea,barerandombinaryvectorsin[0,…,2^L-1]Small-spaceestimatesfordistinctvaluesproposedbasedonFMideasDelete-Proof:Justusecountersinsteadofbitsinthesketchlocations+1forinserts,-1fordeletesComposable:Compon
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2025至2030年中國雙層不銹鋼保溫杯數(shù)據(jù)監(jiān)測研究報告
- 廣東省揭陽市新華中學(xué)2024-2025學(xué)年高一下學(xué)期3月第一次月考化學(xué)試卷(含答案)
- 2025年軍隊文職人員招聘之軍隊文職管理學(xué)通關(guān)試題庫(有答案)
- 2019-2025年消防設(shè)施操作員之消防設(shè)備基礎(chǔ)知識押題練習(xí)試題A卷含答案
- 2019-2025年消防設(shè)施操作員之消防設(shè)備基礎(chǔ)知識考前沖刺模擬試卷A卷含答案
- 2025年軍隊文職人員招聘之軍隊文職教育學(xué)綜合練習(xí)試卷B卷附答案
- 2025年軍隊文職人員招聘之軍隊文職法學(xué)每日一練試卷A卷含答案
- 營養(yǎng)與食品衛(wèi)生學(xué)-營養(yǎng)學(xué)566
- 2025年大學(xué)生防詐騙知識競賽題庫試題及答案(共90題)
- 專業(yè)知識培訓(xùn)課件模板
- 2016-2023年大慶醫(yī)學(xué)高等??茖W(xué)校高職單招(英語/數(shù)學(xué)/語文)筆試歷年參考題庫含答案解析
- 泛微協(xié)同OA與SAP集成應(yīng)用解決方案V講訴
- 探討電磁感應(yīng)現(xiàn)象對電能轉(zhuǎn)化效率的影響
- EHS法律法規(guī)清單及合規(guī)性評估
- 橋梁定期檢查-主要部件檢查要點(diǎn)與評定標(biāo)準(zhǔn)
- 長途汽車客運(yùn)站調(diào)研報告
- 陜西各市(精確到縣區(qū))地圖PPT課件(可編輯版)
- JTG C10-2007 公路勘測規(guī)范正式版
- (完整版)國際金融法
- 近代德國的學(xué)前教育課件
- 球墨鑄鐵正火工藝
評論
0/150
提交評論