27 Aug 2016
Google Cloud Storage offers hosting for static files. Or, as they call it, durable and highly available object storage.
However, when we download a file from Google Cloud Storage programmatically, many things can go wrong.
We need a way to verify that the downloaded file is valid, and a checksum is the right way to do that!
Google Cloud Storage sends the response header 'x-goog-hash' when a file is downloaded. It carries the object's checksums as name=value pairs, crc32c and md5, each value base64-encoded.
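Pulling one checksum out of that header takes a little care, because the base64 values themselves can end in '=' padding. Here is a minimal sketch, assuming the header value has already been read into a string and that multiple entries are comma-separated. parse_goog_hash is a hypothetical helper name, and the sample value is made up for illustration (it is the base64 form of the CRC32C check value for b'123456789').

```python
def parse_goog_hash(header_value):
    """Split an x-goog-hash value like 'crc32c=...,md5=...' into a dict."""
    hashes = {}
    for part in header_value.split(','):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only: base64 values may end in '=' padding.
        name, _, value = part.partition('=')
        hashes[name] = value
    return hashes


# Sample value made up for illustration, not taken from a real response.
print(parse_goog_hash('crc32c=4waSgw=='))  # {'crc32c': '4waSgw=='}
```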
We can simply compute the CRC32C of the downloaded file, encode the value properly, and compare it against the one in x-goog-hash. And that is it!
Well, not so fast.
It turns out that computing the value in Python is quite a journey. It took me 2 hours to figure this out.
The doc casually describes the value as 'The Base64 encoded CRC32c'. It's not as simple as that.
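Here are the actual steps in Python:
1. Compute the CRC32C of the file. This gives an integer checksum.
2. Convert the integer to an 8-digit hex string: '%08x' % checksum.
3. Convert the hex string to raw bytes: binascii.unhexlify(hexstring). The result looks like an ordinary string, but it is really the checksum's four raw bytes, most significant byte first.
4. Base64-encode those bytes.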
The appengine's crc32c returns an integer value. That is what makes it difficult.
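The conversion from the integer to the header value can be sketched as one small function. This is only a sketch, assuming the checksum arrives as a non-negative 32-bit integer. crc32c_int_to_base64 is a name made up for this example, and the code is written for Python 3, where base64.b64encode returns bytes.

```python
import base64
import binascii


def crc32c_int_to_base64(checksum):
    """Encode an integer CRC32C the way x-goog-hash presents it."""
    # Zero-pad to 8 hex digits so leading zero bytes are not dropped.
    hexstring = '%08x' % checksum
    # Turn the hex digits into the 4 raw checksum bytes (big-endian).
    raw = binascii.unhexlify(hexstring)
    # Base64-encode the raw bytes, not the hex text.
    return base64.b64encode(raw).decode('ascii')


# 0xE3069283 is the standard CRC32C check value for b'123456789'.
print(crc32c_int_to_base64(0xE3069283))  # 4waSgw==
```

struct.pack('>I', checksum) would produce the same four bytes in one step. The hex detour here simply mirrors the '%08x' and unhexlify calls mentioned above.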
If you can use crcmod, it should be easier, because it comes with a lot of helpers (e.g. hexstring). I can't use crcmod because it contains *.c files.
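When C extensions are off the table, CRC32C itself is small enough to write in pure Python. The following is a table-driven sketch of the algorithm (Castagnoli polynomial, reflected form 0x82F63B78). It is not the appengine module, the names are made up for this example, and it will be far slower than a compiled implementation.

```python
_POLY = 0x82F63B78  # CRC-32C (Castagnoli) polynomial, bit-reflected


def _make_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data, crc=0):
    """Return the CRC32C of data as an integer.

    Pass the previous result as crc to continue over a further chunk.
    """
    crc ^= 0xFFFFFFFF
    for byte in bytearray(data):
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


print('%08x' % crc32c(b'123456789'))  # e3069283
```

Because the running value can be passed back in, a large download can be checked chunk by chunk without holding the whole file in memory.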
You can also use gsutil to help debug a file:
gsutil hash file.zip prints the base64-encoded values.
gsutil hash -h file.zip prints the hexadecimal values.
Update: Please use crcmod. It is 20x faster than the appengine's crc32c because crcmod is written in C.
Update2: Don't use crc32c at all if you don't have to, because it's difficult to install a compiled crcmod on Windows. Use md5 instead. md5 is not available for composite objects, but I don't have composite objects.
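The md5 route needs nothing beyond the standard library. Here is a minimal sketch, assuming the object is not composite (so x-goog-hash includes an md5 entry) and that the expected value has already been taken from the header. md5_base64 is a hypothetical helper name.

```python
import base64
import hashlib


def md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in chunks so large downloads do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    # Encode the raw 16-byte digest, not the hex string.
    return base64.b64encode(digest.digest()).decode('ascii')
```

Comparing md5_base64('file.zip') against the md5 value from x-goog-hash then validates the download.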