# analyze the storage usage of the output butler repos from S17B reprocessing

## Description

The S17B HSC reprocessing (DM-10404) generated some butler repos with processed data:

    /datasets/hsc/repo/rerun/DM-10404/SFM
    /datasets/hsc/repo/rerun/DM-10404/DEEP
    /datasets/hsc/repo/rerun/DM-10404/UDEEP
    /datasets/hsc/repo/rerun/DM-10404/WIDE

We would like to know, for example, how many files there are, how large they are (mostly small files?), and which data products take up the space (likely images, but which ones?). Write up a brief summary.
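The questions above (file count, total size, share of small files) can be answered with a single walk over a repo. The following is only a minimal sketch, not the script used for this ticket: the function name `summarize_tree` and the 1 MiB `small_threshold` cutoff are assumptions chosen for illustration.

```python
import os


def summarize_tree(root, small_threshold=1024 * 1024):
    """Walk `root` and tally file count, total bytes, and small-file count.

    `small_threshold` (bytes) is an arbitrary illustrative cutoff, not a
    value taken from the ticket.
    """
    n_files = 0
    total_bytes = 0
    n_small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link is counted
                # with its own size rather than its target's size.
                size = os.lstat(path).st_size
            except OSError:
                # Skip entries that vanish or cannot be read.
                continue
            n_files += 1
            total_bytes += size
            if size < small_threshold:
                n_small += 1
    return {"files": n_files, "bytes": total_bytes, "small_files": n_small}
```

Calling `summarize_tree("/datasets/hsc/repo/rerun/DM-10404/SFM")` would return one dictionary per repo; on a large shared filesystem a walk like this can take a long time, so it is best run once with the results saved.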

Attachments:
1. butler_size_duw.png (37 kB)
2. butler_size_smf.png (35 kB)
3. file_size_dist.png (28 kB)
4. num_dir_files.png (18 kB)

Samantha Thrush added a comment - - edited

I have finally finished the histograms on the space taken up by various butler datasets. As you will see below, SFM is fundamentally different from the other three directories (DEEP, UDEEP, WIDE).
Let's first consider the SFM butler space graph below:

I found the paths to these files partly through butler and partly by browsing the SFM directory. The locations of these files are listed in the table below, but please note my shorthand: a word in bold is a wildcard for the directories or name components that fit that descriptor. For example, the path /SFM/**number** means that the file resides in one of the many subdirectories of SFM whose name is a number.

| Butler file | Path |
| --- | --- |
| CORR | SFM/**number**/**filter**/corr/CORR-**number**-**number**.fits |
| SRC | SFM/**number**/**filter**/output/SRC-**number**-**number**.fits |
| SRCMATCH | SFM/**number**/**filter**/output/SRCMATCH-**number**-**number**.fits |
| SRCMATCHFULL | SFM/**number**/**filter**/output/SRCMATCHFULL-**number**-**number**.fits |
| BKGD | SFM/**number**/**filter**/corr/BKGD-**number**-**number**.fits |
| flattened files | SFM/**number**/**filter**/thumbs/flattened-**number**-**number**.png |
| oss files | SFM/**number**/**filter**/thumbs/oss-**number**-**number**.png |
| icSrc | SFM/schema/icSrc.fits |
| src | SFM/schema/src.fits |
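Given the naming patterns in the table above, disk usage per butler file type can be tallied by looking at the file-name prefix. This is a hedged sketch rather than the method actually used for the plots: the helper names `dataset_type` and `bytes_by_type`, the `schema/` label prefix, and the sample paths (including the visit and CCD numbers) are made up for illustration.

```python
import re
from collections import defaultdict

# Matches names such as CORR-<number>-<number>.fits or oss-<number>-<number>.png
_NAME_RE = re.compile(r"^(?P<kind>[A-Za-z]+)-\d+-\d+\.(?:fits|png)$")


def dataset_type(relpath):
    """Return a label for the butler file type implied by a relative path."""
    parts = relpath.split("/")
    name = parts[-1]
    # Schema files live directly under a "schema" directory.
    if len(parts) >= 2 and parts[-2] == "schema":
        return "schema/" + name.rsplit(".", 1)[0]
    match = _NAME_RE.match(name)
    return match.group("kind") if match else "other"


def bytes_by_type(entries):
    """Sum sizes per file type from an iterable of (relpath, size) pairs."""
    totals = defaultdict(int)
    for relpath, size in entries:
        totals[dataset_type(relpath)] += size
    return dict(totals)


# Hypothetical sample entries; the numbers are not real visits or sizes.
sample = [
    ("SFM/01234/HSC-I/corr/CORR-0001234-050.fits", 100),
    ("SFM/01234/HSC-I/corr/CORR-0001234-051.fits", 120),
    ("SFM/01234/HSC-I/thumbs/oss-0001234-050.png", 5),
    ("SFM/schema/icSrc.fits", 2),
]
print(bytes_by_type(sample))
```

Anything that does not match the patterns falls into an `other` bucket, which makes it easy to spot file types missing from the table.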

In a separate comment, I will discuss the other three directories since their structures are all so similar.

Samantha Thrush added a comment - - edited

As stated previously, DEEP, UDEEP, and WIDE all have very similar file structures and similar relations between the relative sizes of most of their butler datasets, with the exception of their schema files, which take up the same space in all three directories.

To show where each of these files comes from, here is a table formatted like the one above, with wildcards in bold. For brevity, and unlike above, assume that all of these paths sit inside DEEP, UDEEP, or WIDE.

| Butler file | Path |
| --- | --- |
| FORCEDSRC | **number**/**filter**/tract**number**/FORCEDSRC-**number**-**number**.fits |
| fcr | jointcal-results/**number**/fcr-**number**-**number**.fits |
| wcs | jointcal-results/**number**/wcs-**number**-**number**.fits |
| schema files | schema/* |

Hsin-Fang Chiang added a comment -

Thank you Samantha Thrush, these look great. Feel free to close the ticket.

Pinging Paul Domagala, Andrew Loftus, and Greg Daues, as they may be interested in this summary too and may have suggestions on what to look at in future tickets.

Samantha Thrush added a comment -

Just as a small update, I was reviewing the Butler SFM plot and noticed that I forgot to include two different butler file types. I have since amended the plot and the table.

Samantha Thrush added a comment -

The two scripts I used to gather information for the graphs above can be found at the following links:
https://github.com/Samantha-Thrush/LSST_codes/blob/master/butlerfind.sh
https://github.com/Samantha-Thrush/LSST_codes/blob/master/repostats.sh

