DM-7477 increased the mask size from 16 to 32 bits, resulting in a 20% increase in file sizes for exposures (10 --> 12 bytes per pixel). In order to counteract this increase, DM-11332 introduces the use of FITS tile compression. This RFC proposes that lossless FITS tile compression be the default for writing images to FITS. This yields a minor space gain for floating-point images, but a big gain for masks.
FITS tile compression is distinct from simply applying common *nix compression utilities (e.g., gzip, bzip2) to the FITS file. The latter compresses the entire file, meaning that reading the header or a subimage requires uncompressing the entire file. In contrast, FITS tile compression leaves the headers (relatively) untouched and compresses the pixels only; moreover, the pixels are grouped into subregions ("tiles") before compression, allowing quick reading of subimages. For more details on FITS tile compression, see 2009PASP..121..414P.
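To make the difference concrete, here is a toy sketch using only Python's standard zlib module. It is not the FITS tile-compression format or the cfitsio implementation: the one-tile-per-row layout, the image dimensions, and the helper names read_rows_whole and read_rows_tiled are all assumptions made for illustration.

```python
import zlib

WIDTH, HEIGHT = 64, 32  # illustrative image dimensions (assumption)

# Toy "image": HEIGHT rows of WIDTH one-byte pixels.
rows = [bytes((x + y) % 256 for x in range(WIDTH)) for y in range(HEIGHT)]

# Whole-file compression: a single stream covering every pixel.
whole = zlib.compress(b"".join(rows))

# Tile compression: each row (one tile, in this sketch) is compressed on its own.
tiles = [zlib.compress(row) for row in rows]


def read_rows_whole(blob, first, last):
    """Read rows first..last from the single stream; everything is inflated."""
    data = zlib.decompress(blob)
    return [data[y * WIDTH:(y + 1) * WIDTH] for y in range(first, last + 1)]


def read_rows_tiled(tile_list, first, last):
    """Read rows first..last by inflating only the tiles that are needed."""
    return [zlib.decompress(tile_list[y]) for y in range(first, last + 1)]


subimage = read_rows_tiled(tiles, 10, 12)  # inflates 3 tiles out of 32
```

Both readers return the same rows, but the tiled reader touches only the tiles that overlap the requested region, which is what makes subimage reads cheap.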
FITS tile compression may be either lossless or lossy, depending on the parameters chosen. Lossless compression can be performed on integer images with good compression factors (e.g., 4--10, depending on the distribution of values in the image), but floating-point images do not typically yield good compression factors (e.g., 1.1) when compressed losslessly. Although the implementation in DM-11332 supports lossy compression, and I'm fairly confident that enabling it with reasonable parameter choices would not compromise the scientific information in our images (e.g., see 2010PASP..122.1065P), we have not yet done the work to validate it. We therefore do not (yet) propose to use lossy compression by default.
I propose that lossless FITS tile compression be the default for writing all image types (Image, Mask, MaskedImage, Exposure) to FITS. Specifically, the writeFits methods of these image types will accept write options (in the form of a new class, lsst::afw::fits::ImageWriteOptions), which will default to lossless compression with the GZIP_SHUFFLE (called GZIP_2 in cfitsio) compression scheme. The cfitsio manual explains that this scheme "first shuffles the bytes in all the pixel values so that the most-significant byte of every pixel appears first, followed by the less significant bytes in sequence" before compressing with gzip. It further notes that this scheme "may be more effective in cases where the most significant byte in most of the image pixel values contains the same bit pattern". I've observed that this produces slightly better compression factors than straight compression with gzip (GZIP_1 in cfitsio).
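The byte-shuffling idea quoted above can be sketched with the Python standard library. This is only an illustration of the technique, not cfitsio's code: zlib stands in for gzip, the whole buffer is treated as one tile, the pixel values are synthetic, and the helper names shuffle_bytes and unshuffle_bytes are hypothetical.

```python
import random
import struct
import zlib

ITEM_SIZE = 4  # bytes per 32-bit pixel


def shuffle_bytes(data, item_size=ITEM_SIZE):
    """Put byte 0 of every pixel first, then byte 1 of every pixel, and so on."""
    return b"".join(data[i::item_size] for i in range(item_size))


def unshuffle_bytes(data, item_size=ITEM_SIZE):
    """Invert shuffle_bytes, restoring the original per-pixel byte order."""
    count = len(data) // item_size
    planes = [data[i * count:(i + 1) * count] for i in range(item_size)]
    return bytes(byte for pixel in zip(*planes) for byte in pixel)


# Synthetic pixels (assumption): identical upper three bytes, noisy low byte.
rng = random.Random(12345)
pixels = [0x00012300 + rng.getrandbits(8) for _ in range(10000)]
raw = struct.pack(">%dI" % len(pixels), *pixels)  # big-endian, as FITS stores data

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffle_bytes(raw)))
print("compressed without shuffle:", plain_size, "bytes")
print("compressed with shuffle:   ", shuffled_size, "bytes")
```

After shuffling, the constant high-order bytes form long runs that compress to almost nothing, leaving only the noisy low-order bytes to pay for. Real images will show a smaller or larger difference depending on how their values are distributed.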
Pickling of images will continue to be done without compression, since for this use case I believe that keeping the computational burden down is more important than the size savings (and we don't normally pickle large images anyway).
Compression can be disabled globally in Python or C++, or configured for individual writes.
There should be no negative impact on our day-to-day development and operations, since the old APIs still work (with a modified implementation at the back-end). We will gain from reduced file sizes (maybe the equivalent of about 8 bytes per pixel for an Exposure?). Compressing and uncompressing the image will add some compute overhead, but I believe that to be small compared to the I/O overhead.
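As a rough check on the "about 8 bytes per pixel" figure, the arithmetic can be written out using the compression factors quoted earlier (about 1.1 for floating-point planes, 4 to 10 for integer planes). The per-plane breakdown is an assumption on my part: a 4-byte floating-point image plane, a 4-byte floating-point variance plane, and a mask plane of 2 bytes before DM-7477 and 4 bytes after, which reproduces the 10 and 12 bytes per pixel stated above.

```python
# Bytes per pixel for each plane of an Exposure (assumed layout, see lead-in).
IMAGE_BYTES = 4      # floating-point image plane
VARIANCE_BYTES = 4   # floating-point variance plane
MASK_BYTES_OLD = 2   # 16-bit mask, before DM-7477
MASK_BYTES_NEW = 4   # 32-bit mask, after DM-7477


def compressed_bytes_per_pixel(float_factor, mask_factor):
    """Estimate the on-disk bytes per pixel for given compression factors."""
    float_bytes = (IMAGE_BYTES + VARIANCE_BYTES) / float_factor
    mask_bytes = MASK_BYTES_NEW / mask_factor
    return float_bytes + mask_bytes


old_size = IMAGE_BYTES + VARIANCE_BYTES + MASK_BYTES_OLD  # 10 bytes per pixel
new_size = IMAGE_BYTES + VARIANCE_BYTES + MASK_BYTES_NEW  # 12 bytes per pixel

best_case = compressed_bytes_per_pixel(float_factor=1.1, mask_factor=10)
worst_case = compressed_bytes_per_pixel(float_factor=1.1, mask_factor=4)
print("uncompressed: %d -> %d bytes per pixel" % (old_size, new_size))
print("compressed estimate: %.1f to %.1f bytes per pixel" % (best_case, worst_case))
```

Under these assumptions the estimate lands between roughly 7.7 and 8.3 bytes per pixel, with almost all of the saving coming from the mask plane.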
There are some instances in the stack of determining the size of a persisted image by reading NAXIS1 and NAXIS2 from the header. This has always been naughty (because it assumes a particular persistence implementation), and it will no longer work once we are using FITS tile compression; the function lsst.afw.image.bboxFromMetadata should be used instead. I believe I've already identified and updated all instances of this in the stack as part of the work on DM-11332 (using Jenkins to build lsst_distrib and ci_hsc, and searching our GitHub repos for NAXIS1).
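The reason NAXIS1 and NAXIS2 stop working is that a tile-compressed image is stored as a binary table, so those keywords describe the table, while the original image dimensions move to ZNAXIS1 and ZNAXIS2 (with ZIMAGE set to true). The sketch below illustrates this with plain dictionaries standing in for headers; the header values and the helper name image_dimensions are made up for illustration, and stack code should use lsst.afw.image.bboxFromMetadata rather than anything like this.

```python
def image_dimensions(header):
    """Return (width, height) of the image described by a header-like dict."""
    if header.get("ZIMAGE", False):
        # Tile-compressed: the original image shape is in the Z* keywords.
        return header["ZNAXIS1"], header["ZNAXIS2"]
    return header["NAXIS1"], header["NAXIS2"]


# Illustrative header of an uncompressed image HDU.
plain_header = {"BITPIX": -32, "NAXIS": 2, "NAXIS1": 2048, "NAXIS2": 4176}

# Illustrative header of the same image after tile compression: NAXIS1 is now
# the width of a table row in bytes and NAXIS2 is the number of table rows.
compressed_header = {
    "XTENSION": "BINTABLE", "BITPIX": 8, "NAXIS": 2, "NAXIS1": 8, "NAXIS2": 4176,
    "ZIMAGE": True, "ZBITPIX": -32, "ZNAXIS": 2, "ZNAXIS1": 2048, "ZNAXIS2": 4176,
}

naive = (compressed_header["NAXIS1"], compressed_header["NAXIS2"])
correct = image_dimensions(compressed_header)
print("naive NAXIS lookup:", naive)
print("compression-aware lookup:", correct)
```

The naive lookup reports the table geometry rather than the image size, which is the failure mode described above.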
This proposal has the potential to affect users who read our FITS images with external software. However, most commonly used FITS readers these days support compression, including cfitsio, pyfits and ds9. If a user wants to use a FITS reader that doesn't support FITS tile compression, they can use the funpack utility in cfitsio to uncompress the image and then read it as a normal FITS image. Because we're using tile compression, the headers remain accessible without the need to uncompress the entire image.