Contents

- Overview
- Manually Building a Cluster: Introduction; Installing Scala; Configuration Files; Startup and Testing
- Application Deployment: Deployment Architecture; Deploying Applications
- Core Principles: RDD Concept; RDD Core Components; RDD Dependencies; DAG; RDD Fault Recovery; Spark Architecture in Standalone Mode; Spark Architecture in YARN Mode; Application Resource Construction
- API: WordCount Example; RDD Construction; RDD Caching and Persistence; RDD Partition Count; Shared Variables; RDD Operations; RDD Operation Implicit Conversions; RDD[T] Partition Operations; RDD[T] Common Aggregation Operations; Operations Between RDDs; DoubleRDDFunctions Common Operations; PairRDDFunctions Aggregation Operations; PairRDDFunctions Between-RDD Operations; OrderedRDDFunctions Common Operations
- Case Study (Mobile-Terminal Internet Access Data Analysis): Data Preparation; Loading & Preprocessing; Counting App Visits; Computing DAU; Computing MAU; Computing App Upload/Download Traffic
Overview

1. Spark's advantages over MapReduce: a) it supports iterative computation; b) intermediate results are kept in memory rather than on disk, which lowers latency.
2. Spark has become a unified platform for fast, lightweight big-data processing ("one stack to rule them all"): a single platform covers ad-hoc queries, batch processing, and stream processing.
3. Ways to set up a Spark cluster: a) with an integrated deployment tool such as Cloudera Manager; b) by hand.
4. Ways to build Spark from source: a) SBT; b) Maven.

Manually Building a Cluster

Introduction

1. Environment:

Role   | Hostname
Master | centos1
Slave  | centos2, centos3

2. Standalone mode must be deployed on the Master and all Slave nodes; YARN mode only needs to be deployed on the machine that submits jobs.
3. It is assumed that the JDK and a Hadoop cluster are already installed.

Installing Scala

1. On the Master (Standalone mode) or the submitting machine (YARN mode), install Scala under /opt/app:
tar zxvf scala-2.10.6.tgz -C /opt/app

2. On the Master (Standalone mode) or the submitting machine (YARN mode), configure the environment variables:

vi /etc/profile
export SCALA_HOME=/opt/app/scala-2.10.6
export PATH=$SCALA_HOME/bin:$PATH

source /etc/profile       # apply the changes
env | grep SCALA_HOME     # verify

Configuration Files

3. On the Master (Standalone mode) or the submitting machine (YARN mode), install Spark and edit its configuration files:

tar zxvf spark-1.6.3-bin-hadoop2.6.tgz -C /opt/app
cd /opt/app/spark-1.6.3-bin-hadoop2.6/conf

cp spark-env.sh.template spark-env.sh
vi spark-env.sh
export JAVA_HOME=/opt/app/jdk1.8.0_121
export SCALA_HOME=/opt/app/scala-2.10.6
export HADOOP_HOME=/opt/app/hadoop-2.6.5
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
# For standalone mode
export SPARK_WORKER_CORES=1
export SPARK_DAEMON_MEMORY=512m

cp spark-defaults.conf.template spark-defaults.conf
hadoop fs -mkdir /spark.eventLog.dir
vi spark-defaults.conf
spark.driver.extraClassPath /opt/app/apache-hive-1.2.2-bin/lib/mysql-connector-java-5.1.22-bin.jar
spark.eventLog.enabled true
spark.eventLog.dir hdfs://centos1:9000/spark.eventLog.dir

cp slaves.template slaves
vi slaves
centos2
centos3

ln -s /opt/app/apache-hive-1.2.2-bin/conf/hive-site.xml .

4. On the Master (Standalone mode), copy the Spark directory from the Master to every Slave. Note: only a Standalone cluster needs this step.
scp -r /opt/app/spark-1.6.3-bin-hadoop2.6 hadoop@centos2:/opt/app
scp -r /opt/app/spark-1.6.3-bin-hadoop2.6 hadoop@centos3:/opt/app

Startup and Testing

5. On the Master (Standalone mode) or the submitting machine (YARN mode), configure the Spark environment variables:

export SPARK_HOME=/opt/app/spark-1.6.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin

6. On the Master (Standalone mode), start Spark and check the processes:
sbin/start-all.sh
jps
Master    # process on the Master machine
Worker    # process on the Slave machines

7. On the Master (Standalone mode) or the submitting machine (YARN mode), run the test jobs:

# Standalone client mode
bin/spark-submit --master spark://centos1:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --num-executors 1 --executor-cores 1 lib/spark-examples-1.6.3-hadoop2.6.0.jar

# Standalone cluster mode
bin/spark-submit --master spark://centos1:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --num-executors 1 --executor-cores 1 lib/spark-examples-1.6.3-hadoop2.6.0.jar

# YARN client mode
bin/spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --num-executors 1 --executor-cores 1 lib/spark-examples-1.6.3-hadoop2.6.0.jar

# YARN cluster mode
bin/spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --num-executors 1 --executor-cores 1 lib/spark-examples-1.6.3-hadoop2.6.0.jar

bin/yarn application -list                  # list applications running on YARN
bin/yarn application -kill ApplicationID    # kill an application running on YARN

# spark-shell, Standalone client mode
bin/spark-shell --master spark://centos1:7077 --deploy-mode client --driver-memory 512m --executor-memory 512m --num-executors 1 --executor-cores 1

# spark-shell, YARN client mode
bin/spark-shell --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --num-executors 1 --executor-cores 1
8. Monitoring pages:

http://centos1:8080    # Spark monitoring
http://centos1:8088    # YARN monitoring

Application Deployment

Deployment Architecture
1. Application: a Spark application, consisting of one Driver Program plus Executors on multiple Worker Nodes in the cluster; each Worker Node provides only one Executor per Application.
2. Driver Program: runs the Application's main function and is usually represented by its SparkContext. It builds the DAG, splits it into Stages, manages and schedules Tasks, and creates the SchedulerBackend used for Akka communication; its main components are the DAGScheduler, the TaskScheduler, and the SchedulerBackend.
3. Cluster Manager: the cluster manager, which abstracts over different managers such as Spark Standalone and YARN. The Driver Program obtains resources through the Cluster Manager and sends tasks to multiple Worker Nodes for execution.
4. Worker Node: a cluster node; at runtime the application's Tasks run in the Executors on the Worker Nodes.
5. Executor: a process started on a Worker Node for an Application; it executes Tasks.
6. Stage: an Application usually contains one or more Stages.
7. Task: the unit of computation that the Driver Program sends to an Executor. A Task usually processes one split (i.e. one partition), and a split is typically one Block in size. A Stage contains one or more Tasks, and parallelism comes from running those Tasks concurrently.
8. DAGScheduler: splits the Application into one or more Stages; each Stage's number of Tasks is determined by the number of RDD partitions, and the resulting TaskSet is handed to the TaskScheduler (see the sketch below).
9. Deploy Mode: how the Driver process is deployed; either cluster or client.
10. Notes: a) the Driver Program must be in the same network environment as the Spark cluster, because the SparkContext sends tasks to Executors on different Worker Nodes and receives their results; b) in production, the machine hosting the Driver Program should be well provisioned, especially its CPU.
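A minimal sketch (not part of the original notes) of how one job maps onto Stages and Tasks. It assumes an existing SparkContext named sc (for example inside spark-shell) and a hypothetical HDFS path; partition counts depend on the input and configuration, so the numbers in the comments are only the typical outcome.

val words  = sc.textFile("hdfs://centos1:9000/tmp/words.txt", 4) // hypothetical file, read as 4 partitions
val pairs  = words.flatMap(_.split(" ")).map(word => (word, 1))  // narrow transformations stay in the same Stage
val counts = pairs.reduceByKey(_ + _)                            // shuffle dependency: the DAGScheduler cuts a new Stage here
counts.collect()                                                 // the action submits the job: typically 2 Stages with 4 Tasks each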

Deploying Applications

1. Two ways of running an application:
a) spark-shell: interactive, used for development and debugging; the shell has already created the instances "val sc: SparkContext" and "val sqlContext: SQLContext" (a short interactive sketch follows below);
b) spark-submit: submits a packaged application, used for production deployment.
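As a quick illustration of a), a hypothetical spark-shell session; the input path is made up, and the table listing only works because hive-site.xml was linked into conf/ earlier.

// Inside spark-shell, sc and sqlContext already exist:
val lines = sc.textFile("hdfs://centos1:9000/tmp/sample.txt")   // hypothetical input file
lines.filter(_.contains("ERROR")).count()                       // an action runs a job immediately and returns the result
sqlContext.tables().show()                                      // lists Hive tables via the linked hive-site.xml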

2. spark-shell options:

bin/spark-shell --help
Usage: ./bin/spark-shell [options]

Options:
  --master MASTER_URL          spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
  --class CLASS_NAME           Your application's main class (for Java / Scala apps).
  --name NAME                  A name of your application.
  --jars JARS                  Comma-separated list of local jars to include on the driver and executor classpaths.
  --packages                   Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
  --exclude-packages           Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.
  --repositories               Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
  --py-files PY_FILES          Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
  --files FILES                Comma-separated list of files to be placed in the working directory of each executor.
  --conf PROP=VALUE            Arbitrary Spark configuration property.
  --properties-file FILE       Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM          Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options        Extra Java options to pass to the driver.
  --driver-library-path        Extra library path entries to pass to the driver.
  --driver-class-path          Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
  --executor-memory MEM        Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME            User to impersonate when submitting the application.
  --help, -h                   Show this help message and exit
  --verbose, -v                Print additional debug output
  --version,                   Print the version of current Spark

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM           Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                  If given, restarts the driver on failure.
  --kill SUBMISSION_ID         If given, kills the driver specified.
  --status SUBMISSION_ID       If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM   Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM         Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM           Number of cores used by the driver, only in cluster mode (Default: 1).
  --queue QUEUE_NAME           The YARN queue to submit to (Default: "default").
  --num-executors NUM          Number of executors to launch (Default: 2).
  --archives ARCHIVES          Comma separated list of archives to be extracted into the working directory of each executor.
  --principal PRINCIPAL        Principal to be used to login to KDC, while running on secure HDFS.
  --keytab KEYTAB              The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.
3. spark-submit options (apart from the Usage lines, the options are the same as for spark-shell):
bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL          spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
  --class CLASS_NAME           Your application's main class (for Java / Scala apps).
  --name NAME                  A name of your application.
  --jars JARS                  Comma-separated list of local jars to include on the driver and executor classpaths.
  --packages                   Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
  --exclude-packages           Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.
  --repositories               Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
  --py-files PY_FILES          Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
  --files FILES                Comma-separated list of files to be placed in the working directory of each executor.
  --conf PROP=VALUE            Arbitrary Spark configuration property.
  --properties-file FILE       Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM          Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options        Extra Java options to pass to the driver.
  --driver-library-path        Extra library path entries to pass to the driver.
  --driver-class-path          Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
  --executor-memory MEM        Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME            User to impersonate when submitting the application.
  --help, -h                   Show this help message and exit
  --verbose, -v                Print additional debug output
  --version,                   Print the version of current Spark

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM           Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                  If given, restarts the driver on failure.
  --kill SUBMISSION_ID         If given, kills the driver specified.
  --status SUBMISSION_ID       If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM   Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM         Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM           Number of cores used by the driver, only in cluster mode (Default: 1).
  --queue QUEUE_NAME           The YARN queue to submit to (Default: "default").
  --num-executors NUM          Number of executors to launch (Default: 2).
  --archives ARCHIVES          Comma separated list of archives to be extracted into the working directory of each executor.
  --principal PRINCIPAL        Principal to be used to login to KDC, while running on secure HDFS.
  --keytab KEYTAB              The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.
4. Default parameters:
a) the default application configuration file is conf/spark-defaults.conf;
b) the default JVM configuration file is conf/spark-env.sh;
c) frequently used jar files can be added with the "--jars" option.
5. Parameter precedence (highest to lowest): a) values set explicitly on SparkConf; b) values passed to spark-submit; c) values in conf/spark-defaults.conf (see the sketch after the table below).
6. MASTER_URL formats:

MASTER_URL        | Meaning
local             | Run locally with a single thread (no parallelism at all)
local[K]          | Run locally with K worker threads; setting K to the number of CPU cores is ideal
local[*]          | K = the number of CPU cores
spark://HOST:PORT | Connect to the Master of a Standalone cluster, i.e. the URL shown on the Spark monitoring page; the default port is 7077 and cannot be omitted
yarn-client       | Connect to a YARN cluster in client mode; the cluster is located via the HADOOP_CONF_DIR environment variable
yarn-cluster      | Connect to a YARN cluster in cluster mode; the cluster is located via the HADOOP_CONF_DIR environment variable
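A minimal sketch (not from the original notes) of the precedence rule in point 5: values set explicitly on a SparkConf win over the corresponding spark-submit flags, which in turn win over conf/spark-defaults.conf. The property values below are examples only.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PriorityDemo")              // overrides --name
  .setMaster("local[*]")                   // overrides --master; uses all local cores (see the table above)
  .set("spark.executor.memory", "512m")    // overrides --executor-memory and spark-defaults.conf
val sc = new SparkContext(conf)
println(sc.getConf.get("spark.executor.memory"))   // prints 512m regardless of what was passed on the command line
sc.stop()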

7. Notes:
a) spark-shell uses port 4040 for its UI by default; when 4040 is occupied, a WARN message is logged and the port is incremented (4041, 4042, ...) until a free one is found;
b) on each Executor node, every Driver Program's jars and files are copied into the working directory and can take up a lot of space; YARN clusters clean them up automatically, while Standalone clusters need "spark.worker.cleanup.appDataTtl" configured to enable automatic cleanup.
8. Application template:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test")
    val sc = new SparkContext(conf)
    // ...
  }
}
9. Submit example:

bin/spark-submit --master spark://ubuntu1:7077 --class org.apache.spark.examples.SparkPi lib/spark-examples-1.6.3-hadoop2.6.0.jar

Core Principles

RDD Concept

1. RDD: Resilient Distributed Dataset.
2. Significance: Spark's most fundamental abstraction; a fault-tolerant, memory-based approach to cluster computing.

RDD Core Components

1. The five core methods:
a) getPartitions: the list of partitions (list of data blocks);
b) compute: the function that computes the data of each partition;
c) getDependencies: the list of dependencies on parent RDDs;
d) partitioner: the partitioner of a key-value RDD;
e) getPreferredLocations: the list of preferred locations of each partition (e.g. the block locations of an HDFS file).
2. Grouping the five methods by purpose:
a) the first three describe the lineage between RDDs and are mandatory;
b) the last two are used to optimize execution.
3. An RDD instance is written RDD[T], where T is the generic element type.
4. Partitions:
a) concept: a large collection of T instances is split into several smaller sub-collections of T instances;
b) in the source code a partition is effectively an Iterator[T];
c) storage: for example, as Blocks on HDFS.
5. Dependencies:
a) dependency list: an RDD may have several parents, so it keeps a list of parent-RDD dependencies;
b) relation to partitions: dependencies are expressed between RDD partitions; from the dependency list and getPartitions one can tell how each partition of an RDD depends on a set of parent-RDD partitions.
6. The compute method:
a) it is lazy: compute only runs when an Action is triggered;
b) its granularity is a partition, not an individual T element.
7. The partitioner method: only meaningful for RDDs whose T instances are key-value pairs (see the sketch below).
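A minimal sketch (not part of the original notes, assuming an existing SparkContext sc) of how these five properties show up from user code; the exact partitioner depends on the configuration, so the values in the comments are only the typical outcome.

val pairs   = sc.parallelize(1 to 100, 4).map(i => (i % 10, i))   // RDD[(Int, Int)] with 4 partitions
val reduced = pairs.reduceByKey(_ + _)

println(pairs.partitions.length)    // getPartitions: 4 partitions, hence 4 Tasks per Stage
println(reduced.dependencies)       // getDependencies: a ShuffleDependency on pairs
println(reduced.partitioner)        // partitioner: typically Some(HashPartitioner(4)) for this key-value RDD
println(reduced.count())            // compute is lazy; this action finally computes each partition
// getPreferredLocations is used by the scheduler, e.g. HDFS block locations for sc.textFile(...)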

8. The RDD abstract class (excerpt from the v1.6.3 source):

package org.apache.spark.rdd

// ...

/**
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such as map, filter, and persist. In addition,
 * org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value
 * pairs, such as groupByKey and join;
 * org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of
 * Doubles; and
 * org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that
 * can be saved as SequenceFiles.
 * All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)])
 * through implicit.
 *
 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
 *  - A function for computing each split
 *  - A list of dependencies on other RDDs
 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
 *
 * All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
 * to implement its own way of computing itself.
 */
