Monday, March 23, 2020

Spark Core Components and Execution Model


The core components of a Spark application are:
* Driver
    * When we submit a Spark application in cluster mode using spark-submit, the Driver interacts with the cluster's 
    Resource Manager to start the Application Master.
    * It also converts the user code into a logical plan (DAG) and then into a physical plan.
* Application Master
    * The Driver requests executors from the Application Master to execute the user code; the Application Master 
    negotiates the resources with the Resource Manager to host these executors.
* Spark Context
    * The Driver creates a SparkContext for each application; the SparkContext is the main entry point for executing 
    any Spark functionality (see the sketch after this list).
* Executors
    * Executors are processes on the worker nodes whose job is to execute the assigned tasks.
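A minimal sketch of that entry point, assuming Spark 2.x+ and a hypothetical HDFS input path (in cluster mode the master is supplied by spark-submit, so it is not set here):

    import org.apache.spark.sql.SparkSession

    object WordCountExample {
      def main(args: Array[String]): Unit = {
        // SparkSession wraps the SparkContext and is the single entry point
        // for DataFrame/SQL work; spark.sparkContext exposes the underlying
        // SparkContext for RDD operations.
        val spark = SparkSession.builder()
          .appName("word-count-example")
          .getOrCreate()

        // Transformations only build the DAG; nothing runs on the executors
        // until an action (count/collect/save) is called.
        val lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        println(counts.count())   // action: triggers job submission
        spark.stop()
      }
    }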

The Spark execution model can be defined in three phases: 
* Logical Plan
    * Converting the user code into a series of steps that will be executed when an action is performed.
    * The logical plan builds a DAG describing how Spark will execute all the transformations.
* Physical Plan
    * Converting the logical plan into a physical plan using the Catalyst and Tungsten optimisation techniques.
    * A few of the methods the Catalyst optimiser applies while choosing/translating the best physical plan 
    (see the explain(true) sketch after this list):
        * Remove repeated operations (e.g. the same addition of two numbers performed for every row).
        * Predicate pushdown: pushes filters as close as possible to the data sources.
        * Column pruning: reads only the columns that are actually needed.
    * Tungsten: executes the query plan on the actual cluster, generating optimised code based on the query plan 
    produced by the Catalyst optimiser.
* Execution:
    * The physical plan is divided into a number of stages, which are then broken down into tasks.
    * The Driver requests resources from the Cluster Manager; the Cluster Manager allocates containers, launches 
    executors in the allocated containers, and tasks are assigned to run on them on behalf of the Driver.
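A short sketch of inspecting those plans, assuming a hypothetical Parquet dataset at hdfs:///data/orders with columns id, customer and amount:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PlanInspection {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()

        val orders = spark.read.parquet("hdfs:///data/orders")   // hypothetical dataset

        val result = orders
          .filter(col("amount") > 100)        // candidate for predicate pushdown
          .select("customer", "amount")       // candidate for column pruning

        // explain(true) prints the parsed, analysed and optimised logical plans plus
        // the physical plan; the FileScan node typically shows something like
        //   PushedFilters: [IsNotNull(amount), GreaterThan(amount,100.0)]
        //   ReadSchema: struct<customer:string,amount:double>
        result.explain(true)

        spark.stop()
      }
    }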
 

RCFile vs ORC vs Parquet

Row major data:
 Advantages:
  1. Suppose we select all the columns from a table: we can simply read each row one by one 
  and display the details, which is efficient for full-row access.
 Disadvantages:
  1. If we select only specific columns from a table, we still have to read every row, skip the unwanted 
     columns, and fetch just the required ones, repeating this for every row. All that reading and skipping of 
     unwanted data to reach the required columns hurts performance.
  2. Also, because each row mixes different data types, compression is not very efficient.

Column major data:
 Advantages:
  1. The data is stored column by column: the values of column 1 for all records are stored together.
  2. Selecting specific columns is very fast, because all the data for a given column is stored together 
     (see the sketch below).
  3. Since values of the same type are stored together, compression is very effective.
 Disadvantages:
  1. If we want to select all the columns (full-row reads), it is not very efficient.
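A small, hypothetical sketch contrasting a row-oriented text format (CSV) with a columnar one (Parquet) when only one column is needed; the local paths and data are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    object RowVsColumnar {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("row-vs-columnar").master("local[*]").getOrCreate()
        import spark.implicits._

        val people = Seq(("alice", 30, "NY"), ("bob", 25, "SF"), ("carol", 41, "LA"))
          .toDF("name", "age", "city")

        // Row-major: every field of a record is written next to the record's other fields.
        people.write.mode("overwrite").csv("/tmp/people_csv")
        // Column-major: all "name" values are stored together, then all "age", then all "city".
        people.write.mode("overwrite").parquet("/tmp/people_parquet")

        // CSV: Spark still reads and parses whole lines just to get one column.
        spark.read.csv("/tmp/people_csv").select("_c1").show()
        // Parquet: only the "age" column chunks are read from disk (column pruning).
        spark.read.parquet("/tmp/people_parquet").select("age").show()

        spark.stop()
      }
    }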

RCFile: a combination of the row-major and column-major formats. 
  The data is first partitioned row-wise into row groups; each row group is then stored in a columnar format. 
  Stored in binary format.
  The default row-group (partition) size is 4 MB.
 Disadvantages:
  1. No metadata about the columns.
  2. Many row groups are created for larger data sets because the default row-group size is only 4 MB, 
     so sequential reads suffer.
  3. Because there is no metadata about the columns, compression is less efficient.

ORC: similar to RCFile, ORC also divides the data into row-then-columnar partitions called stripes. Along with the 
data it also stores indexes/statistics at the stripe level as well as in the file footer, 
such as MIN and MAX values for each column. 

Parquet: also a columnar file format, similar to RCFile and ORC, but Parquet stores nested data 
structures in a flat columnar format (see the sketch below). 
  It also supports very good compression methods.
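A minimal sketch of writing a nested structure to Parquet and reading back a single nested field; the case classes, local path and snappy codec choice are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    // Nested case classes: Parquet stores this nested structure as flat columnar chunks
    // (id, customer.name, customer.city, amount).
    case class Customer(name: String, city: String)
    case class Order(id: Long, customer: Customer, amount: Double)

    object ParquetNestedExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-nested").master("local[*]").getOrCreate()
        import spark.implicits._

        val orders = Seq(
          Order(1L, Customer("alice", "NY"), 120.0),
          Order(2L, Customer("bob", "SF"), 80.5)
        ).toDS()

        // snappy is Spark's default Parquet codec; gzip and zstd are also supported.
        orders.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/orders_parquet")

        // Only the customer.name column chunks are read back (column pruning).
        spark.read.parquet("/tmp/orders_parquet").select("customer.name").show()

        spark.stop()
      }
    }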