This thought paper explores whether we can combine encryption and compression into a single operation, and do away with the traditional approach of first compressing the data and then encrypting the compressed output to produce the final encrypted, compressed result.

**Problem**

Currently we have huge volumes of data and cheap external storage options, so more often than not we have the use case of transferring data from those external storage options onto our compute machines at runtime. This means that we have to:

- Use an **HTTPS connection**, because we cannot risk third parties getting a sneak peek at this data. Alternatively, the same protection can be achieved via our own encryption mechanism:
    - Encrypting the data with a symmetric key
    - Uploading the encrypted data
    - Downloading the encrypted data from the external storage
    - Decrypting the downloaded data with the same symmetric key
- Then **decompress the downloaded data**. We compress in the first place because we want to save $$$: these external data stores charge for data stored, so the better the compression, the lower the cost.
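The round trip described above can be sketched end to end. This is a minimal illustration assuming zlib for compression and a toy XOR keystream (derived with SHA-256) standing in for a real symmetric cipher such as AES-CTR; the toy cipher is for illustration only and is not secure.

```python
import hashlib
import zlib

def keystream(key: bytes, n: int) -> bytes:
    # Toy keystream derived by hashing (key, counter) pairs.
    # Stand-in for a real cipher such as AES-CTR; NOT secure.
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying it twice restores the input.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

payload = b"some highly repetitive payload " * 100
key = b"shared-symmetric-key"

# Traditional pipeline: compress first, then encrypt.
compressed = zlib.compress(payload)       # smaller -> cheaper to store
encrypted = xor_cipher(compressed, key)   # unreadable to third parties

# ... upload to external storage, later download ...

decrypted = xor_cipher(encrypted, key)    # same symmetric key
restored = zlib.decompress(decrypted)
assert restored == payload
```

Note that every byte of the compressed stream passes through the cipher twice (encrypt and decrypt); that per-byte cost is exactly what the rest of this paper tries to reduce.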

So essentially a **lot of CPU cycles get wasted** on these two operations. What if we could combine these two seemingly orthogonal operations into a single one? The only motivation is to save CPU cycles, or instructions.

Traditionally, encryption and compression have been considered orthogonal to each other, but the two operations have one thing in common: both reshape the payload from one representation of the data into another.

Let us understand this statement with a simple example:

**Compression**

Payload (10101010100010010010) + CA() = Result (11011001010)

where Payload is the uncompressed data, CA() is the compression algorithm, and Result is the compressed data.

**Encryption**

Payload (10101010100010010010) + ECA(Salt) = Result (01101001011011001001)

where Payload is the unencrypted data, ECA(Salt) is the encryption algorithm, which takes a salt as input, and Result is the encrypted data. Note that, unlike compression, encryption does not shrink the payload.
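Both "equations" can be made concrete in a few lines of Python. Here zlib stands in for CA() and a one-time-pad XOR stands in for ECA(Salt); both choices are illustrative assumptions, not part of the original scheme.

```python
import os
import zlib

payload = b"10101010100010010010" * 50  # repetitive input compresses well

# Compression: Payload + CA() = Result (a smaller representation)
result = zlib.compress(payload)
assert len(result) < len(payload)

# Encryption: Payload + ECA(Salt) = Result (same size, but unreadable)
salt = os.urandom(len(payload))  # one-time pad, purely for illustration
encrypted = bytes(p ^ s for p, s in zip(payload, salt))
assert len(encrypted) == len(payload)
```

Both operations reshape the payload, but only compression changes its size.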

## Now let’s save some CPU Cycles

We will try to construct a function that compresses as well as encrypts the data in fewer CPU cycles than the traditional compress-then-encrypt pipeline.

We will first take a deep dive into GZIP compression and its internals. Then we will figure out ways of encrypting that GZIP-compressed data that save CPU cycles.

### GZIP

“GZIP provides a lossless compression, that is, we can recover the original data when decompressing it. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.”

Source: http://blog.servergrove.com/2014/04/14/gzip-compression-works/
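The lossless property is easy to verify with Python's gzip module, which implements DEFLATE:

```python
import gzip

original = b"ServerGrove, the PHP hosting company, provides hosting solutions for PHP projects"
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)
assert restored == original  # lossless: the original data is fully recovered
```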

### LZ77

LZ77 is an algorithm that replaces **subsequent occurrences of a substring** with a **back-reference to its first occurrence**.

Original text: "ServerGrove, the PHP hosting company, provides hosting solutions for PHP projects" (81 bytes)

LZ77: "ServerGrove, the PHP hosting company, p(rov)ides (hosting )solutions for( PHP pro)jects" (73 bytes), where the parenthesized pieces are back-references to earlier occurrences rather than literal text.

Source: http://blog.servergrove.com/2014/04/14/gzip-compression-works/
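A greedy toy version of this scheme can be sketched in a few lines. This is an illustration of the idea, not DEFLATE's actual matcher: literals stay as-is, and repeats within a sliding window become (distance, length) back-references.

```python
def lz77_tokens(data: str, window: int = 64, min_match: int = 3):
    """Greedy toy LZ77: emit literals and (distance, length) back-references."""
    i, tokens = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))  # back-reference
            i += best_len
        else:
            tokens.append(data[i])                # literal character
            i += 1
    return tokens

def lz77_decode(tokens) -> str:
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):      # byte-by-byte copy handles overlaps
                out.append(out[-dist])
        else:
            out.append(t)
    return "".join(out)

text = "ServerGrove, the PHP hosting company, provides hosting solutions for PHP projects"
tokens = lz77_tokens(text)
assert lz77_decode(tokens) == text                 # round trip is lossless
assert any(isinstance(t, tuple) for t in tokens)   # some repeats were found
```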

### Huffman Encoding

The same idea of shrinking the representation applies to Huffman coding, which assigns a different encoding to every character.

“Huffman coding is a variable-length coding method that assigns shorter codes to more frequent “characters”.”

Quoted from: http://blog.servergrove.com/2014/04/14/gzip-compression-works/
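A minimal sketch of this construction, using Python's heapq: the two lightest subtrees are repeatedly merged, so rare symbols end up deeper in the tree and get longer codes.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Return the Huffman code length for each symbol in `text`."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate single-symbol input
        return {sym: 1 for sym in freq}
    # Heap entries: (weight, tiebreak, {symbol: depth-so-far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # two lightest subtrees...
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # ...sink one level
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaabbbccd")
# The most frequent symbol gets the shortest code.
assert lengths["a"] < lengths["b"] < lengths["c"] == lengths["d"]
```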

### Deflate Algorithm

The DEFLATE algorithm combines LZ77 and Huffman coding to achieve better compression. It divides the uncompressed data into chunks and compresses each one. Within every chunk, it first applies LZ77 to replace repeated pieces of strings with pointers to their first occurrence, and then encodes the literals in the chunk with Huffman trees. These Huffman trees are appended to the compressed chunk and transferred along with the data.
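We can see this chunk (block) structure directly. Per RFC 1951, every DEFLATE block starts with a 3-bit header: a BFINAL bit, then a 2-bit BTYPE saying whether the block is stored raw, uses the fixed Huffman codes, or carries its own dynamic Huffman trees. A small sketch using zlib's raw-deflate mode:

```python
import zlib

data = b"ServerGrove provides hosting " * 20
co = zlib.compressobj(level=9, wbits=-15)  # wbits=-15 -> raw DEFLATE, no wrapper
stream = co.compress(data) + co.flush()

first = stream[0]
bfinal = first & 0b1           # bit 0: is this the last block in the stream?
btype = (first >> 1) & 0b11    # bits 1-2: 0=stored, 1=fixed, 2=dynamic Huffman
assert btype in (0, 1, 2)      # 3 is reserved/invalid per RFC 1951
assert zlib.decompress(stream, wbits=-15) == data
```

A BTYPE of 2 means the block embeds its own Huffman code-length tables, which is exactly the part the next section proposes to encrypt.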

**Suggested Change to the current compression technique**

We already know that our GZIP-compressed data (DEFLATE algorithm) consists of chunks of compressed data along with their Huffman trees.

So here is the suggestion: what if we encrypt only the Huffman trees of those compressed chunks? Because the Huffman trees are very small, this would give us encryption with very few instructions. These Huffman trees are the basis for decompressing the chunks, so if we encrypt them, there is no straightforward way to retrieve the decompressed chunk without first decrypting the trees.

With this approach, we reap the benefits of compression as well as encryption with fewer instructions, or in other words, lower CPU consumption.
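As a rough sketch of the idea: a real implementation would parse the DEFLATE bitstream to locate the code-length tables exactly, but the shape of the scheme can be shown by encrypting only a fixed-size prefix of each compressed chunk. The 64-byte prefix size here is a hypothetical stand-in for the tables' true extent.

```python
import os
import zlib

PREFIX_BYTES = 64  # hypothetical: assume the Huffman tables fit in 64 bytes

def toggle_prefix(chunk: bytes, pad: bytes) -> bytes:
    """XOR a fixed-size prefix of `chunk` with `pad`.

    Illustration only: a real implementation would parse the DEFLATE
    bitstream to find the exact extent of the code-length tables.
    XOR is its own inverse, so calling this twice restores the chunk.
    """
    head = bytes(a ^ b for a, b in zip(chunk[:PREFIX_BYTES], pad))
    return head + chunk[PREFIX_BYTES:]

payload = b"lots of repetitive text " * 200
compressed = zlib.compress(payload, 9)
pad = os.urandom(PREFIX_BYTES)  # one-time pad for the prefix

protected = toggle_prefix(compressed, pad)   # "encrypt" only the leading bytes
restored = toggle_prefix(protected, pad)     # undo with the same pad
assert restored == compressed
assert zlib.decompress(restored) == payload
```

With the prefix scrambled, the stream will almost certainly fail to decompress, yet only 64 bytes had to pass through the cipher instead of the whole stream.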

**So how many CPU cycles can we save?**

Now each of these Huffman trees inside each chunk of text contains the mapping for the literals to their bit codes.

However, the Huffman code table itself is pretty big. It has to include codes for all 255 possible literal bytes, as well as the back pointers themselves. As a result, the lengths of the codes are themselves Huffman encoded! To avoid conceptual infinite recursion, the lengths of those codes are not Huffman encoded, but instead given as fixed three-bit fields. This means that there can only be at most 8 unique length codes for the Huffman codes that represent the compressed data itself. To summarize, then, the Deflate format is illustrated in figure 4:

Quoted from: http://www.infinitepartitions.com/art001.html

Let’s understand this with an example.

Suppose these are the Huffman codes corresponding to the characters (a valid prefix code, so no code is a prefix of another):

00: ‘a’ 01: ‘i’ 10: ‘e’ 110: ‘t’ 1110: ‘s’ 1111: ‘j’

Instead of the codes themselves, we store only their lengths:

2: ‘a’ 2: ‘i’ 2: ‘e’ 3: ‘t’ 4: ‘s’ 4: ‘j’

We encode the lengths because we want to further reduce the space taken by the Huffman code table: the decoder can reconstruct the exact codes from the lengths alone. For further detail, see references [4] and [5].
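The reconstruction from lengths alone is the canonical-code construction from RFC 1951, section 3.2.2. A sketch, using a valid example length table (the lengths are chosen to satisfy the Kraft inequality):

```python
def canonical_codes(lengths: dict) -> dict:
    """Rebuild canonical Huffman codes from code lengths (RFC 1951, 3.2.2)."""
    max_len = max(lengths.values())
    # Step 1: count how many codes there are of each length.
    bl_count = [0] * (max_len + 1)
    for l in lengths.values():
        bl_count[l] += 1
    # Step 2: find the smallest code value for each length.
    code, next_code = 0, [0] * (max_len + 1)
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Step 3: hand out consecutive codes, symbols ordered within each length.
    codes = {}
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        l = lengths[sym]
        codes[sym] = format(next_code[l], f"0{l}b")
        next_code[l] += 1
    return codes

table = {"a": 2, "i": 2, "e": 2, "t": 3, "s": 4, "j": 4}
codes = canonical_codes(table)
# Each code has its declared length...
assert all(len(codes[s]) == l for s, l in table.items())
# ...and the result is prefix-free: no code is a prefix of another.
vals = list(codes.values())
assert not any(x != y and y.startswith(x) for x in vals for y in vals)
```

Because the assignment is deterministic, compressor and decompressor agree on the codes while shipping only one small number per symbol.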

Now assume the compressed Huffman code tables amount to 20% of the compressed data. If encryption cost is roughly proportional to the number of bytes processed, then encrypting only the tables costs about 20% of encrypting the whole stream, saving around 80% of the encryption CPU cycles.

**Conclusion**

So far we have talked about how we can save CPU cycles during encryption by taking advantage of the layout of the compressed data in the case of GZIP. Apart from GZIP, there are many other algorithms on the market that use Huffman codes to better compress literals. Those, too, can be compressed and encrypted with fewer CPU cycles using this same technique.

For any compression algorithm, we can achieve lower CPU consumption for encryption, provided we encrypt only the dictionary of the compressed data and not the whole compressed stream.

## References

[1] http://blog.servergrove.com/2014/04/14/gzip-compression-works/

[2] https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

[3] https://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression#Adaptive_dictionary_algorithms

[4] http://www.infinitepartitions.com/art001.html

[5] https://jvns.ca/blog/2015/02/22/how-gzip-uses-huffman-coding/