Row major data:

Advantages:
1. When a query selects all columns of a table, each row can be read sequentially and returned as is, because all of the values for a row are stored together.

Disadvantages:
1. When a query selects only specific columns, the reader still has to read every column of a row, skip the unwanted values, pick out the requested ones, and repeat this for every row. Reading and skipping unwanted data hurts performance.
2. Each row mixes values of different data types, so compression is not very efficient.

Column major data:

Advantages:
1. Data is stored column by column: the values of column 1 for all records are stored together, then the values of column 2, and so on.
2. Selecting specific columns is very fast, because all of the data for a column is stored together.
3. Values of the same type are stored together, so compression works well.

Disadvantages:
1. Selecting all columns is less efficient, because each row has to be reassembled from values stored in different places.

RCFile:
A combination of the row major and column major formats. The data is first partitioned horizontally into row groups, and each row group is then stored in columnar form. Data is stored in a binary format. The default row group size is 4 MB.

Disadvantages:
1. No metadata about the columns.
2. Because the default row group size is only 4 MB, large data sets produce many row groups, so sequential reads suffer.
3. Because there is no metadata about the columns, compression is less efficient.

ORC:
Like RCFile, ORC splits the data into row groups, called stripes, and stores each stripe in columnar form. Along with the data, it stores indexes and statistics at the stripe level and in the file footer, such as the MIN and MAX values of each column.

Parquet:
Also a columnar file format, similar to RCFile and ORC, but Parquet can store nested data structures in a flat columnar format. It also supports very good compression methods.
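The layouts described above can be sketched in a few lines of Python. This is only a toy in-memory model and not how any of these formats is actually implemented. The sample table, the function names and the group size of 2 rows are assumptions made up for illustration. It shows how many values a reader touches to fetch one column in each layout, and how per-group MIN/MAX statistics (the idea behind ORC stripe statistics) let a reader skip whole row groups.

```python
# Hypothetical sample table with columns (id, name, age).
ROWS = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
    (4, "dave", 35),
]


def to_row_major(rows):
    """Lay values out row after row: r1c1, r1c2, r1c3, r2c1, ..."""
    return [value for row in rows for value in row]


def to_column_major(rows):
    """Lay values out column after column: all of c1, then all of c2, ..."""
    return [list(column) for column in zip(*rows)]


def read_column_row_major(flat, n_cols, col):
    """Walk over every stored value and keep only the requested column.

    Returns (values, number_of_values_touched).
    """
    values = [v for i, v in enumerate(flat) if i % n_cols == col]
    return values, len(flat)


def read_column_column_major(columns, col):
    """Read only the requested column, which is stored contiguously.

    Returns (values, number_of_values_touched).
    """
    values = list(columns[col])
    return values, len(values)


def to_row_groups(rows, group_size):
    """Hybrid layout: split rows into groups, store each group by column.

    Each group also keeps (min, max) per column, similar in spirit to the
    statistics ORC keeps per stripe. group_size is counted in rows here,
    not in bytes, to keep the sketch simple.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        columns = to_column_major(rows[start:start + group_size])
        stats = [(min(c), max(c)) for c in columns]
        groups.append({"columns": columns, "stats": stats})
    return groups


def find_greater(groups, col, threshold):
    """Return values of one column above threshold, skipping groups by MAX.

    Returns (matching_values, number_of_groups_skipped).
    """
    matches, skipped = [], 0
    for group in groups:
        _, col_max = group["stats"][col]
        if col_max <= threshold:
            skipped += 1  # no value in this group can match
            continue
        matches.extend(v for v in group["columns"][col] if v > threshold)
    return matches, skipped


if __name__ == "__main__":
    flat = to_row_major(ROWS)
    columns = to_column_major(ROWS)
    print(read_column_row_major(flat, 3, 2))
    print(read_column_column_major(columns, 2))
    print(find_greater(to_row_groups(ROWS, 2), 2, 32))
```

In this sketch the row major reader touches all 12 stored values to return the 4 ages, while the column major reader touches only 4. With row groups of 2 rows, a query for ages above 32 skips the first group entirely, because its recorded MAX is 30.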
Monday, March 23, 2020
RCFile vs ORC vs Parquet