Handling Large Amounts of Data with Parquet – Part 2

Parquet provides various configurations that let applications control how the library handles writes. In this blog post, we will go over these configurations and understand how they affect the overall throughput of writes, reads, and compression.

Parquet provides the following configurations, which applications can tweak to fit their use case.

  • Row Group Size
  • Compression Codecs
  • Dictionary for Column Values
  • Data Page Size and Dictionary Page Size

Row Group Size

The row group size threshold decides when we need to flush the in-memory data structures into a row group and then append it to the parquet file.

if (memSize > nextRowGroupSize) {
  flushRowGroupToStore();
  initStore();
}

Note: Checking the size of the parquet writer's in-memory data structures is a somewhat costly operation, so it is not performed for every record that is added. Instead, the check is carried out after every k records, and the value of k is recalculated each time those k records have been written.
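
To make this concrete, here is a minimal sketch of setting the row group size from application code. It assumes the parquet-avro bindings (AvroParquetWriter); the schema, output path, and record contents are made up purely for illustration.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class RowGroupSizeExample {
  public static void main(String[] args) throws Exception {
    // Illustrative schema: a record with a string and a long column.
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("name")
        .requiredLong("ts")
        .endRecord();

    // The row group size is a threshold in bytes: once the writer's in-memory
    // buffers cross it, the current row group is flushed to the file.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/events.parquet"))
        .withSchema(schema)
        .withRowGroupSize(8 * 1024 * 1024) // 8 MB row groups
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("name", "click");
      record.put("ts", System.currentTimeMillis());
      writer.write(record);
    }
  }
}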

Compression Codecs

Parquet 1.10 supports these 6 compression codecs:

  • Snappy
  • GZIP
  • LZO
  • Brotli
  • LZ4
  • ZSTD

Along with these 6 compression codecs, we also have the option of not applying any compression codec, i.e. UNCOMPRESSED. With this option, the data page bytes as well as the dictionary page bytes are stored as-is.
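
As a rough sketch (again assuming the parquet-avro writer builder, with an illustrative schema and path), the codec is chosen on the writer builder via the CompressionCodecName enum:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CompressionCodecExample {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("name")
        .endRecord();

    // Pick any of the supported codecs: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD,
    // or UNCOMPRESSED to skip compression entirely.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/events-zstd.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .build()) {
      // ... write records here ...
    }
  }
}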

Dictionary for the Column Values

As discussed in the previous blog article, parquet provides the ability to encode the values of a column using a dictionary. The application can control whether dictionary encoding should be enabled or not. Maintaining a dictionary has several advantages:

  • Due to dictionary encoding, in practice the total size of the dictionary + encoded values is usually smaller than the total size of the actual raw values. This is because column values are often repeated across records, so building a dictionary removes the duplicates and each occurrence is stored as a compact dictionary index.
  • Dictionary encoding also lets us filter on the column values. As discussed in the previous blog, the dictionary can be used to quickly reject row groups that do not contain particular terms. E.g. if the dictionary of a column does not contain the term “Sheldon”, we do not need to read each column value to figure out which records contain that term; we can simply skip the row groups whose column dictionary does not contain it.
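
Toggling dictionary encoding is a single builder flag. A minimal sketch, again assuming the parquet-avro builder and an illustrative schema and path:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class DictionaryEncodingExample {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("name")
        .endRecord();

    // Enable (or disable) dictionary encoding for the columns of this file.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/events-dict.parquet"))
        .withSchema(schema)
        .withDictionaryEncoding(true) // set to false to write plain values only
        .build()) {
      // ... write records here ...
    }
  }
}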

Data Page Size and Dictionary Page Size

Parquet allows us to specify the data page size and the dictionary page size. These configurations define the maximum in-memory size of each column's data values and dictionary values respectively. The dictionary page size is also used to decide whether a column needs to fall back from the DictionaryValuesWriter to the PlainValuesWriter.

@Override
public boolean shouldFallBack() {
  // if the dictionary reaches the max byte size or the values can not be encoded on 4 bytes anymore.
  return dictionaryByteSize > maxDictionaryByteSize
      || getDictionarySize() > MAX_DICTIONARY_ENTRIES;
}
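
Both limits can be set on the writer builder. A minimal sketch, with the same caveats as the earlier examples (parquet-avro builder, made-up schema and path):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class PageSizeExample {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("name")
        .endRecord();

    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/events-pages.parquet"))
        .withSchema(schema)
        .withPageSize(1024 * 1024)          // 1 MB data pages per column
        .withDictionaryPageSize(512 * 1024) // past 512 KB the column falls back to plain encoding
        .build()) {
      // ... write records here ...
    }
  }
}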


Encodings In Parquet

Apart from dictionary encoding and plain encoding, parquet uses multiple other encoding schemes to store the data optimally.

  • Run Length Bit Packing Hybrid Encoding
    • This encoding uses a combination of run length encoding + bit packing to store data more efficiently. In parquet, it is used for encoding repetition and definition levels, dictionary indices, and boolean values.
      Eg. 
      
      We have a list of boolean values say 
      0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1 (0 = false and 1 = true)
      
      This will get encoded to
      1000,0000,1000,0001
      
      where 1000 => 8, which is the number of occurrences of 0
            0000 => 0, which is the value being repeated
            1000 => 8, which is the number of occurrences of 1
            0001 => 1, which is the value being repeated

      Note: There are some other nuances as well, which have been left out of this article for simplicity (a toy run-length sketch follows this list).

  • Delta Binary Packing Encoding
    • This encoding stores the differences (deltas) between consecutive values in bit-packed blocks, and is used for integer columns.
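
To illustrate just the run-length half of the hybrid encoding described above, here is a toy sketch that encodes a sequence of 0/1 values into (count, value) pairs. This is not parquet's actual RLE/bit-packing implementation; it only mirrors the idea in the boolean example.

import java.util.ArrayList;
import java.util.List;

public class ToyRunLengthEncoder {
  // Encodes a sequence of 0/1 values into (runLength, value) pairs.
  static List<int[]> encode(int[] bits) {
    List<int[]> runs = new ArrayList<>();
    int i = 0;
    while (i < bits.length) {
      int value = bits[i];
      int count = 0;
      while (i < bits.length && bits[i] == value) {
        count++;
        i++;
      }
      runs.add(new int[] {count, value});
    }
    return runs;
  }

  public static void main(String[] args) {
    int[] bits = {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1};
    for (int[] run : encode(bits)) {
      System.out.println(run[0] + " x " + run[1]); // prints "8 x 0" then "8 x 1"
    }
  }
}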

Some Common Questions

Question: What value should one keep for the row group size?
Answer: This depends entirely on your application's resource requirements. If your application demands a low memory footprint for its parquet writers, then you have little choice but to keep the row group size small, e.g. 100 KB to 1 MB. But if your application can sustain higher memory pressure, you can afford a larger limit, say 10 MB or more. With larger row group sizes, we get better compression and better overall write throughput for each column.

Question: Which compression codec should I choose for my application?
Answer: The right compression codec depends on the data your application handles. Compression and decompression speed also play a crucial role in the decision. Some applications can take a hit in compression speed but want the best possible compression ratios. There is no “one size fits all” compression algorithm. But if I had to choose one, I would go with ZSTD because of the high compression ratios and decompression speed it offers. See this for more details.
