Apache Samza大規(guī)模數(shù)據(jù)流處理_第1頁
Apache Samza大規(guī)模數(shù)據(jù)流處理_第2頁
Apache Samza大規(guī)模數(shù)據(jù)流處理_第3頁
Apache Samza大規(guī)模數(shù)據(jù)流處理_第4頁
Apache Samza大規(guī)模數(shù)據(jù)流處理_第5頁
已閱讀5頁,還剩23頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、Apache Samza: 領(lǐng)英大規(guī)模數(shù)據(jù)流處理(Secret Kung Fu of Massive Scale Stream Processing with Apache SamzaLinkedIn)介紹Samza LinkedInSamza大規(guī)模流處理的4大秘籍 總結(jié)展望ApacheSamzaApacheSamza is adistributedstream andbatch processingframework.DBchange captureKafkaHadoopKinesisKafkaHadoopCloud ServiceElastic SearchsamzaApacheSamza

2、 社區(qū)Apache頂級(jí)項(xiàng)目(2014年12月起)7次發(fā)布(0.70.13)14Committers62Contributors用 戶 :LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely Samza LinkedIn200 應(yīng)用Production : A Trillion Events +/天單個(gè)節(jié)點(diǎn):1.1M Messages / 秒廣泛應(yīng)用在各個(gè)領(lǐng)域,數(shù)量在過去兩年內(nèi)指數(shù)級(jí)增長(zhǎng)SecurityNotificationsNews classificationPerformanc

3、e monitoring實(shí)時(shí)推送通知系統(tǒng)Notification ProcessorUser Chat EventUser Action EventConnectio n Activity EventRestful Servicesinvitations, mailbox, connection graph,network feed, and commentsMember profile databas eAggregati on EngineChannel SelectionState store6input1input2input3(D Local Data Access Remote D

4、atabase Lookup Remote Service Calloutput介紹Samza LinkedInSamza大規(guī)模流處理的4大秘籍總結(jié)展望Samza 數(shù)據(jù)處理秘籍之面向數(shù)據(jù)流的編程模型High-level API范例分析對(duì)千個(gè)PageViewEvent的Kafka數(shù)據(jù)流(Partitioned by page key),每五分鐘統(tǒng)計(jì)次每個(gè)用戶的event 數(shù)量,然后發(fā)送給另個(gè)Kafka topic.tt+5傳統(tǒng)的事件處理編程模型如果使用基本的Event Processing編程,對(duì)每個(gè)event都需要做如下的工作:將原PageViewEvent對(duì)用戶Id進(jìn)行repartition

5、在repartition后,每個(gè)Event都根據(jù)key = (timestamp, memberId) 寫入個(gè)key-value store.當(dāng)5分鐘window timer到來時(shí),對(duì)這個(gè)kv store進(jìn)行過去五分鐘的range query,對(duì)這五分鐘內(nèi)出現(xiàn)的所有用戶和Pageview進(jìn)行統(tǒng)計(jì)統(tǒng)計(jì)結(jié)果發(fā)送到另個(gè)Kafka topic此編程模型效率低,程序冗長(zhǎng),容易出錯(cuò),可維護(hù)性差。Re#paronInsert into KV StoreWindow &CountsendToPageViewEventPageViewEven tByMemberIdPageViewEventPer Member

6、Stream數(shù)據(jù)流編程模型public class RepartitionAndCounterExample implements StreamApplication Override public void init(StreamGraph graph, Config config) Supplier initialValue = () - 0; MessageStream pageViewEvents =graph.getInputStream(pageViewEventStream, (k, m) - (PageViewEvent) m);OutputStream pageViewEve

7、ntPerMemberStream = graph.getOutputStream(pageViewEventPerMemberStream, m - m.memberId, m - m);pageViewEvents.partitionBy(m - m.memberId).window(Windows.keyedTumblingWindow (m - m.memberId, Duration.ofMinutes(5), initialValue, (m, c) - c + 1).map(MyStreamOutput:new).sendTo(pageViewEventPerMemberStre

8、am);運(yùn)行可視化工具實(shí)例可視化鏈接Samza Operatorsfilterselect a subset of messages from the streammapmap one input message to an output messageflatMapmap one input message to 0 or more output messagesmergeunion all inputs into a single output streampartitionByre-partition the input messages based on a specific fiel

9、dsendTosend the result to an output streamsinksend the result to an external system (e.g. external DB)windowwindow aggregation on the input streamjoinjoin messages from two input streamsstateless functionsI/O functionsstateful functionsSamza 數(shù)據(jù)處理秘籍之二可擴(kuò)展的數(shù)據(jù)存取scalable data access范例分析在上面的PageViewEvent統(tǒng)

10、計(jì)實(shí)例中,用戶需要1.保存統(tǒng)計(jì)的中間結(jié)果以便故障恢復(fù)2.讀取遠(yuǎn)程的用戶數(shù)據(jù)信息本地?cái)?shù)據(jù)存取Samza提供基千內(nèi)存或RocksDb的Key-Value Store 用千高 速本地?cái)?shù)據(jù)存取Samza ProcessorState StoreKafka Change LogSamza ProcessorAdjunct Data StoreChange CaptureDatabase本地狀態(tài)本地?cái)?shù)據(jù)1.1MTPSOn a single machine100 xFaster1.2TBState60 xFasterthan bootstrapInput StreamOutput StreamInput S

11、treamOutput Stream遠(yuǎn)程數(shù)據(jù)存取Samza支持Native異步數(shù)據(jù)處理Samza提供multi-threaded同步數(shù)據(jù)處理Samza ProcessorEvent LoopSingle threadNon-blockingTask max concurrencyRestful ServicesJava NIO, Netty100 xParallelismSamza 數(shù)據(jù)處理秘籍之三統(tǒng)的流處理和批處理Unified Stream & Batch Processing范例分析Kafka提供實(shí)時(shí)的PageViewEvent,同樣的數(shù)據(jù)也被存儲(chǔ)到了Hadoop HDFS。用戶需要隨時(shí)

12、選擇不 同的數(shù)據(jù)源進(jìn)行處理,那需要兩套完全不同的處 理流程嗎?統(tǒng)的流處理和批處理streams.pageViewEventStream.system=kaEa streams.pageViewEventS=PageViewEvent systems.kaEa.samza.factory=org.apache.samza.system.kaEa.KaEaSystemFactory systems.kaEa.consumer.zookeeper.connect=localhost:2181/ systems.kaEducer.bootstrap.servers=localhost:9092Samz

13、a應(yīng)用不需要任何程序流程的改變,只需要 在Configuration里修改數(shù)據(jù)源。Kafka 數(shù)據(jù)源HDFS 數(shù)據(jù)源streams.pageViewEventStream.system=hdfs streams.pageViewEventS=hdfs:/mydbsnapshot/PageViewEvent/ systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory性能比較統(tǒng)計(jì)各國用戶人數(shù):通過統(tǒng)計(jì),得到用戶人數(shù)最多的N個(gè)國家Test data: member profile (242 GB, 487 fi

14、les, around 450 million records)Samza 數(shù)據(jù)處理秘籍之四靈活的部署方式Flexible deployment models范例分析在實(shí)際中,用戶運(yùn)行數(shù)據(jù)處理的機(jī)群是多種多樣 的,怎么才 夠在不同的機(jī)群結(jié)構(gòu)里運(yùn)行同樣的 Samza應(yīng)用?Samza-as-a-library 部署Stream ApplicationStream Processormain() runnLocal Application start RunnerZookeeperStreamProcessorContainerSamzaJobCoordinatorLeaderStreamProce

15、ssorSamzaJobContainerCoordinatorStreamProcessorSamzaJobContainerCoordinatorSamza集成在用戶程序當(dāng)中, 用戶完全掌握自己的流處理, 如應(yīng) 用的生態(tài)周期和資源的分配和管理Samza inaCluster 部署NMNMNMNMNMNMRMRMYARNResource ManagersNodes in the YARNclusterYARN processes (RM/NM)Samza ContainersStream ApplicationJob Runner run-app.sh runRemote ApplicationRunner startsubmitSa

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論