Persistence and storage levels
Official Website: http://bigdataelearning.com
What does it mean to persist an RDD?
RDD persistence is an important capability of Spark. When an RDD is persisted, its data is stored (in memory by default) and reused when subsequent actions need it. Why the phrase “subsequent actions”? Because the RDD is computed and stored the first time an action evaluates it. Later actions read the stored copy instead of recomputing the RDD.
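The "first action computes and stores, later actions reuse" behaviour can be sketched with a small toy model in plain Python. This is not Spark code: `ToyRDD`, `total`, `compute_calls` and `squares` are hypothetical names invented for the illustration, and the model ignores partitions, executors and storage levels.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: a lazy computation with optional persistence.

    Hypothetical class for illustration only; it is not part of Spark.
    """

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute              # lazy: nothing runs until an action
        self._persist_requested = False
        self._stored: Optional[List[int]] = None
        self.compute_calls = 0               # how many times the data was built

    def persist(self) -> "ToyRDD":
        # Only marks the dataset for storage; no computation happens here.
        self._persist_requested = True
        return self

    def _materialize(self) -> List[int]:
        if self._stored is not None:
            return self._stored              # later actions: reuse the stored copy
        self.compute_calls += 1
        data = self._compute()
        if self._persist_requested:
            self._stored = data              # first action: compute, then store
        return data

    # Two "actions" that force evaluation.
    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


def squares() -> List[int]:
    return [x * x for x in range(5)]


# Without persist(): every action rebuilds the data.
unpersisted = ToyRDD(squares)
unpersisted.count()
unpersisted.total()

# With persist(): the first action builds and stores, the second one reuses.
persisted = ToyRDD(squares).persist()
calls_after_persist = persisted.compute_calls   # persist() alone computes nothing
persisted.count()
total = persisted.total()

print(unpersisted.compute_calls, persisted.compute_calls, total)
```

In this sketch the unpersisted dataset is built once per action, while the persisted one is built only by its first action.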
Why persist an RDD?
⮚ Since a persisted RDD is not recomputed and can be read directly from memory, later actions on it run much faster.
⮚ Persistence is especially useful when an RDD is expensive to compute, because the expensive work is done once rather than in every action. If a node fails, only the lost partitions are recomputed from the RDD’s lineage.
⮚ RDD persistence is well suited to iterative algorithms and interactive use, where the same dataset is read repeatedly.
Persistence levels
An RDD can be persisted at different storage levels.
1. MEMORY_ONLY - The RDD is stored in memory as deserialized objects. If the entire RDD doesn’t fit in memory, the partitions that don’t fit are not stored and are recomputed on the fly each time they are needed. This is the level used by the cache() method. In other words, rdd1.cache() is the same as rdd1.persist(StorageLevel.MEMORY_ONLY)
2. MEMORY_AND_DISK - The RDD is stored in memory, and the partitions that don’t fit in memory are stored on disk and read from there when needed. E.g. rdd1.persist(StorageLevel.MEMORY_AND_DISK)
3. MEMORY_ONLY_SER - Much like MEMORY_ONLY, but the RDD is stored in memory as serialized objects. This is generally more space-efficient than deserialized objects, at the cost of extra CPU time to deserialize the data when it is read. E.g. rdd1.persist(StorageLevel.MEMORY_ONLY_SER)
4. MEMORY_AND_DISK_SER - Much like MEMORY_AND_DISK, but the RDD is stored as serialized objects. Partitions that don’t fit in memory are spilled to disk. E.g. rdd1.persist(StorageLevel.MEMORY_AND_DISK_SER)
5. DISK_ONLY - The RDD is stored only on disk. E.g. rdd1.persist(StorageLevel.DISK_ONLY)
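To make the difference between the levels concrete, here is a toy model in plain Python of where partitions end up under MEMORY_ONLY, MEMORY_AND_DISK and DISK_ONLY. It is only a sketch under stated assumptions: `ToyBlockStore`, `memory_slots` and `run_two_actions` are hypothetical names, "memory" and "disk" are just dictionaries, memory is assumed to hold two of four partitions, and the serialized (_SER) levels are not modelled. Spark's real block manager is far more involved.

```python
from typing import Dict, List

# Hypothetical constants mirroring three of the level names above.
MEMORY_ONLY = "MEMORY_ONLY"
MEMORY_AND_DISK = "MEMORY_AND_DISK"
DISK_ONLY = "DISK_ONLY"


class ToyBlockStore:
    """Toy model of where persisted partitions end up; not Spark code."""

    def __init__(self, level: str, memory_slots: int) -> None:
        self.level = level
        self.memory_slots = memory_slots     # how many partitions fit in memory
        self.memory: Dict[int, List[int]] = {}
        self.disk: Dict[int, List[int]] = {}
        self.computations = 0                # how many times a partition was built

    def _compute(self, pid: int) -> List[int]:
        self.computations += 1
        return [pid * 10 + i for i in range(3)]

    def get(self, pid: int) -> List[int]:
        if pid in self.memory:
            return self.memory[pid]
        if pid in self.disk:
            return self.disk[pid]
        data = self._compute(pid)
        if self.level != DISK_ONLY and len(self.memory) < self.memory_slots:
            self.memory[pid] = data          # fits in memory
        elif self.level in (MEMORY_AND_DISK, DISK_ONLY):
            self.disk[pid] = data            # spilled or written to disk
        # MEMORY_ONLY with no room left: nothing is stored,
        # so the next read of this partition recomputes it.
        return data


def run_two_actions(level: str) -> ToyBlockStore:
    store = ToyBlockStore(level, memory_slots=2)
    for _ in range(2):                       # two actions over the same 4 partitions
        for pid in range(4):
            store.get(pid)
    return store


results = {
    level: run_two_actions(level)
    for level in (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY)
}
for level, store in results.items():
    print(level, "memory:", sorted(store.memory),
          "disk:", sorted(store.disk),
          "computations:", store.computations)
```

Under these assumptions, MEMORY_ONLY rebuilds the two partitions that did not fit during the second action, while the other two levels build each partition exactly once.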
Video: Persistence and storage levels, from the BigDataElearning channel