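Big Data Analytics with Spark and Hadoop (《Spark与Hadoop大数据分析》) systematically explains how to perform big data analytics with Hadoop, Spark, and the tools in their ecosystems. It covers the fundamentals of Apache Spark and Hadoop and explores all the Spark components in depth: Spark Core, Spark SQL, DataFrame, Dataset, Spark Streaming, Structured Streaming, MLlib, and GraphX, as well as the core Hadoop components (HDFS, MapReduce, and YARN). Detailed implementation examples accompany the material, making the book a thorough reference for quickly mastering big data analytics infrastructure and how to implement it.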
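The book has 10 chapters. Chapter 1 explains the concepts of big data analytics from a high-level perspective and introduces the tools and techniques used on the Hadoop and Spark platforms, along with some of the most common use cases. Chapter 2 introduces the fundamentals of the Hadoop and Spark platforms. Chapter 3 takes a deep dive into Spark. Chapter 4 covers the Data Sources API, the DataFrame API, and the new Dataset API. Chapter 5 explains how to perform real-time analytics with Spark Streaming. Chapter 6 introduces the notebooks and dataflows used with Spark and Hadoop. Chapter 7 covers machine learning techniques on Spark and Hadoop. Chapter 8 shows how to build recommendation systems. Chapter 9 shows how to perform graph analytics with GraphX. Chapter 10 shows how to use SparkR.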
Contents:
Chapter 1 A High-Level View of Big Data Analytics
1.1 Big Data Analytics and the Role of Hadoop and Spark
1.1.1 The Life Cycle of a Typical Big Data Analytics Project
1.1.2 The Role of Hadoop and Spark
1.2 Big Data Science and the Role of Hadoop and Spark
1.2.1 The Fundamental Shift from Data Analytics to Data Science
1.2.2 The Life Cycle of a Typical Data Science Project
1.2.3 The Role of Hadoop and Spark
1.3 Tools and Techniques
1.4 Real-World Use Cases
1.5 Summary
Chapter 2 Getting Started with Apache Hadoop and Apache Spark
2.1 Overview of Apache Hadoop
2.1.1 Hadoop Distributed File System
2.1.2 Features of HDFS
2.1.3 MapReduce
2.1.4 Features of MapReduce
2.1.5 MapReduce v1 versus MapReduce v2
2.1.6 YARN
2.1.7 Storage Options on Hadoop
2.2 Overview of Apache Spark
2.2.1 History of Spark
2.2.2 What Apache Spark Is
2.2.3 What Apache Spark Is Not
2.2.4 Problems with MapReduce
2.2.5 Architecture of Spark
2.3 Why Use Hadoop and Spark Together
2.3.1 Features of Hadoop
2.3.2 Features of Spark
2.4 Installing Hadoop and Spark Clusters
2.5 Summary
Chapter 3 Deep Dive into Apache Spark
3.1 Starting Spark Daemons
3.1.1 Using CDH
3.1.2 Using HDP, MapR, and Spark Pre-Built Packages
3.2 Learning the Core Concepts of Spark
3.2.1 Ways to Work with Spark
3.2.2 Resilient Distributed Datasets
3.2.3 Spark Context
3.2.4 Transformations and Actions
3.2.5 Parallelism in RDDs
3.2.6 Lazy Evaluation
3.2.7 Lineage Graph
3.2.8 Serialization
3.2.9 Using Hadoop File Formats in Spark
3.2.10 Data Locality
3.2.11 Shared Variables
3.2.12 Key-Value Pair RDDs
3.3 Life Cycle of a Spark Program
3.3.1 Pipelining
3.3.2 Summary of Spark Execution
3.4 Spark Applications
3.4.1 Spark Shell and Spark Applications
3.4.2 Creating a Spark Context
3.4.3 SparkConf
3.4.4 SparkSubmit
3.4.5 Precedence Order of Spark Configuration Settings
3.4.6 Important Application Configurations
3.5.1 Storage Levels
3.5.2 Which Storage Level to Choose
3.6 Spark Resource Managers: Standalone, YARN, and Mesos
3.6.1 Local and Cluster Modes
3.6.2 Cluster Resource Managers
3.7 Summary
Chapter 4 Big Data Analytics with Spark SQL, DataFrames, and Datasets
4.1 History of Spark SQL
4.2 Architecture of Spark SQL
4.3 Introducing the Four Components of Spark SQL
4.4 Evolution of DataFrames and Datasets
4.4.1 What Is Wrong with RDDs
4.4.2 RDD Transformations versus Dataset and DataFrame Transformations
4.5 Why Use Datasets and DataFrames
4.5.1 Optimization
4.5.2 Speed
4.5.3 Automatic Schema Discovery
4.5.4 Multiple Data Sources, Multiple Programming Languages
4.5.5 Interoperability between RDDs and Other APIs
4.5.6 Selecting and Reading Only the Necessary Data
4.6 When to Use RDDs, Datasets, and DataFrames
4.7 Analytics with DataFrames
4.7.1 Creating a SparkSession
4.7.2 Creating DataFrames
4.7.3 Converting DataFrames to RDDs
4.7.4 Common Dataset and DataFrame Operations
4.7.5 Caching Data
4.7.6 Performance Optimization
4.8 Analytics with the Dataset API
4.8.1 Creating Datasets
4.8.2 Converting a DataFrame to a Dataset
4.8.3 Accessing Metadata Using the Data Dictionary
4.9 Data Sources API
4.9.1 Read and Write Functions
4.9.2 Built-in Data Sources
4.9.3 External Data Sources
4.10 Spark SQL as a Distributed SQL Engine
4.10.1 Using the Spark SQL Thrift Server for JDBC/ODBC Access
4.10.2 Querying Data Using the beeline Client
4.10.3 Querying Data from Hive Using the spark-sql CLI
4.10.4 Integration with BI Tools
4.11 Hive on Spark
4.12 Summary
Chapter 5 Real-Time Analytics with Spark Streaming and Structured Streaming
5.1 Overview of Real-Time Processing
5.1.1 Pros and Cons of Spark Streaming
5.1.2 History of Spark Streaming
5.2 Architecture of Spark Streaming
5.2.1 Spark Streaming Application Flow
5.2.2 Stateless and Stateful Stream Processing
5.3 Spark Streaming Transformations and Actions
5.3.1 union
5.3.2 join
5.3.3 transform Operation
5.3.4 updateStateByKey
5.3.5 mapWithState
5.3.6 Window Operations
5.3.7 Output Operations
5.4 Input Sources and Output Stores
5.4.1 Basic Sources
5.4.2 Advanced Sources
5.4.3 Custom Sources
5.4.4 Receiver Reliability
5.4.5 Output Stores
5.5 Spark Streaming with Kafka and HBase
5.5.1 Receiver-Based Approach
5.5.2 Direct Approach (No Receivers)
5.5.3 Integration with HBase
5.6 Advanced Concepts of Spark Streaming
5.6.1 Using DataFrames
5.6.2 MLlib Operations
5.6.3 Caching/Persistence
5.6.4 Fault Tolerance in Spark Streaming
5.6.5 Performance Tuning of Spark Streaming Applications
5.7 Monitoring Applications
5.8 Overview of Structured Streaming
5.8.1 Workflow of a Structured Streaming Application
5.8.2 Streaming Datasets and Streaming DataFrames
5.8.3 Operations on Streaming Datasets and Streaming DataFrames
5.9 Summary
Chapter 6 Notebooks and Dataflows with Spark and Hadoop
6.1 Overview of Web-Based Notebooks
6.2 Overview of Jupyter
6.2.1 Installing Jupyter
6.2.2 Analytics with Jupyter
6.3 Overview of Apache Zeppelin
6.3.1 Jupyter versus Zeppelin
6.3.2 Installing Apache Zeppelin
6.3.3 Analytics with Zeppelin
6.4 The Livy REST Job Server and Hue Notebooks
6.4.1 Installing and Setting Up the Livy Server and Hue
6.4.2 Using the Livy Server
6.4.3 Using Livy with Hue Notebooks
6.4.4 Using Livy with Zeppelin
6.5 Overview of Apache NiFi for Dataflows
6.5.1 Installing Apache NiFi
6.5.2 Using NiFi for Dataflows and Analytics
6.6 Summary
Chapter 7 Machine Learning with Spark and Hadoop
7.1 Overview of Machine Learning
7.2 Machine Learning on Spark and Hadoop
7.3 Machine Learning Algorithms
7.3.1 Supervised Learning
7.3.2 Unsupervised Learning
7.3.3 Recommendation Systems
7.3.4 Feature Extraction and Transformation
7.3.5 Optimization
7.3.6 Spark MLlib Data Types
7.4 An Example of Machine Learning Algorithms
7.5 Building Machine Learning Pipelines
7.5.1 An Example of a Pipeline Workflow
7.5.2 Building an ML Pipeline
7.5.3 Saving and Loading Models
7.6 Machine Learning with H2O and Spark
7.6.1 Why Use Sparkling Water
7.6.2 An Application Flow on YARN
7.6.3 Getting Started with Sparkling Water
7.7 Overview of Hivemall
7.8 Overview of Hivemall for Spark
7.9 Summary
Chapter 8 Building Recommendation Systems with Spark and Mahout
8.1 Building Recommendation Systems
8.1.1 Content-Based Filtering
8.1.2 Collaborative Filtering
8.2 Limitations of Recommendation Systems
8.3 Implementing a Recommendation System with MLlib
8.3.1 Preparing the Environment
8.3.2 Creating RDDs
8.3.3 Exploring the Data with DataFrames
8.3.4 Creating Training and Testing Datasets
8.3.5 Creating a Model
8.3.6 Making Predictions
8.3.7 Evaluating the Model with Testing Data
8.3.8 Checking the Accuracy of the Model
8.3.9 Explicit and Implicit Feedback
8.4 Mahout and Spark Integration
8.4.1 Installing Mahout
8.4.2 Exploring the Mahout Shell
8.4.3 Building a Universal Recommendation System with Mahout and a Search Tool
8.5 Summary
Chapter 9 Graph Analytics with GraphX
9.1 Overview of Graph Processing
9.1.1 What Is a Graph
9.1.2 Graph Databases and Graph Processing Systems
9.1.3 Overview of GraphX
9.1.4 Graph Algorithms
9.2 Getting Started with GraphX
9.2.1 Basic Operations of GraphX
9.2.2 Transforming Graphs
9.2.3 GraphX Algorithms
9.3 Analyzing Flight Data with GraphX
9.4 Overview of GraphFrames
9.4.1 Pattern (Motif) Discovery
9.4.2 Loading and Saving GraphFrames
9.5 Summary
Chapter 10 Interactive Analytics with SparkR
10.1 Overview of the R Language and SparkR
10.1.1 What Is the R Language
10.1.2 Overview of SparkR
10.1.3 Architecture of SparkR
10.2 Getting Started with SparkR
10.2.1 Installing and Configuring R
10.2.2 Using the SparkR Shell
10.2.3 Using SparkR Scripts
10.3 Using DataFrames in SparkR
10.4 Using SparkR in RStudio
10.5 Machine Learning with SparkR
10.5.1 Using the Naive Bayes Model
10.5.2 Using the K-Means Model
10.6 Using SparkR in Zeppelin
10.7 Summary
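To get the download address, visit the official website of the Training Center of the Institute of Computing Technology, Chinese Academy of Sciences, and add the WeChat customer service account listed there.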
Reposted from: https://blog.51cto.com/14242083/2363478