RFC-249

Common Dataset Organization and Policy


    • Type: RFC
    • Status: Implemented
    • Resolution: Done
    • Component/s: DM


      Building upon RFC-95, this RFC nails down the specific format and policies governing shared datasets available in /datasets.


      Discussions on format happened in RFC-95 and on clo (community.lsst.org). This RFC just summarizes the conclusions and relies on those discussions (and those in the comments here) for explanation.

      All data added to /datasets must adhere to the following format (caveats below):

      (note: caps are tokens)

      /datasets/<camera>/[REPO|RERUN|PREPROCESSED|SIM|RAW|CALIB] | /datasets/REFCATS


      REPO = repo (butler root)
      RERUN = REPO/rerun/PUBLIC | REPO/rerun/PRIVATE (processed results)
      PUBLIC = <ticket>/PRIVATE
      PRIVATE = private/<user> | ""
      PREPROCESSED = preprocessed/<label>/ | preprocessed/<label>/<date>/ (ex. 'dr9')
      SIM = <ticket>_<date>/ | <user>/<ticket>/
      RAW = raw/<survey-name>/ (where actual files live)
      CALIB = calib/default/ | calib/<label>/ (ex. master20161025)
      REFCATS = refcats/<label> (ex. astrometry_net_data, gaia_DR1_v1)
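
      As a rough illustration of the grammar above, the following Python sketch classifies a path by the token it matches. It is not part of the RFC: the function name classify, the RERUN_PUBLIC / RERUN_PRIVATE split, and the sample names (hsc, DM-1234, alice) are hypothetical, and any non-empty name without a slash is assumed to be acceptable for <camera>, <label>, <ticket>, and <user>. SIM is left out because the grammar as written gives it no literal directory name to match on.

```python
import re

# Assumption: any non-empty path segment is a valid <camera>, <label>, etc.
SEG = r"[^/]+"
# A <ticket> segment must not be the literal word "private".
TICKET = r"(?!private(?:/|$))[^/]+"

# One pattern per token, checked in order. SIM is intentionally omitted.
PATTERNS = [
    ("REFCATS", rf"/datasets/refcats/{SEG}"),
    ("REPO", rf"/datasets/{SEG}/repo"),
    ("RERUN_PRIVATE", rf"/datasets/{SEG}/repo/rerun/private/{SEG}"),
    ("RERUN_PUBLIC", rf"/datasets/{SEG}/repo/rerun/{TICKET}(?:/private/{SEG})?"),
    ("PREPROCESSED", rf"/datasets/{SEG}/preprocessed/{SEG}(?:/{SEG})?"),
    ("RAW", rf"/datasets/{SEG}/raw/{SEG}"),
    ("CALIB", rf"/datasets/{SEG}/calib/{SEG}"),
]


def classify(path):
    """Return the name of the first token the path matches, or None."""
    path = path.rstrip("/")  # the grammar's trailing slashes are optional here
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, path):
            return token
    return None
```

      A path that returns None from classify would be a candidate for either reorganization or the exemption list.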

      Some data within /datasets does not adhere to this format; it is provided for general consumption, though not as verification data. The following are currently exempted:


      NCSA is working on a simple procedure for making the data both shared and safe. The process is TBD (the design is still being iterated on to emphasize simplicity). Essentially, it consists of applying default DAC permission sets and GPFS immutability. Until the self-service commands are available, it is sufficient to ask Greg Daues to lock down or unlock a repo. Illustration:

      Steps to add to datasets

      • (you) RFC if necessary per policy (below)
      • (you) Ask for write access to a new rerun | new camera | ref cat directory
      • Directory created, write permissions given
      • (you) populate and organize data (as per policy), ask to have it locked down
      • sharing and immutability applied

      Steps to modify/remove from datasets

      • (you) RFC if necessary per policy (below)
      • (you) Ask for write access to the existing rerun | camera | ref cat directory
      • write permissions given, immutability removed
      • (you) reorganize, ask to have it locked down
      • sharing and immutability reapplied (to parent directory, as applicable)
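
      The lock-down and unlock steps can be approximated with ordinary POSIX permission bits. The sketch below is only an analogy, since the actual NCSA procedure is TBD: the function names lock_down and unlock are hypothetical, stripping all write bits stands in for the "default DAC permission sets", and GPFS immutability is not modeled at all.

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def _entries(root):
    """Yield root and every directory and file beneath it."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def lock_down(root):
    """Remove all write bits under root (assumed stand-in for lock-down)."""
    for path in _entries(root):
        if os.path.islink(path):
            continue  # leave symbolic links and their targets alone
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode & ~WRITE_BITS)


def unlock(root):
    """Restore owner write permission under root so it can be reorganized."""
    for path in _entries(root):
        if os.path.islink(path):
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode | stat.S_IWUSR)
```

      Permission bits alone do not stop the owner from changing them back, which is presumably why the procedure also relies on file-system-level immutability.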


      Formatting exists to make data sets easier to consume for the DM project at large. Policy exists to enforce the format and serves to inform whenever policy must change. I suggest the following policies which serve to both enforce and inform:

      /datasets Format Changes - This should be obvious. Future needs will certainly require format changes. We must go through the RFC process to change the format.

      /datasets additions/changes/deletions -

      • Additions / modifications / deletions of any non-private data require an RFC (strictly for input on naming convention, organization, etc.)
      • Additions / modifications / deletions of private data can be performed without an RFC
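
      The two rules reduce to a single check on the path. A minimal sketch, assuming that private data is exactly the data under a directory literally named private (as in the PRIVATE token); the function name requires_rfc and the sample paths are hypothetical:

```python
from pathlib import PurePosixPath


def requires_rfc(path):
    """Return True if changing this path needs an RFC under the policy.

    Assumption: a path is private if any of its components is "private".
    """
    return "private" not in PurePosixPath(path).parts


# Non-private raw data needs an RFC; a user's private rerun does not.
print(requires_rfc("/datasets/hsc/raw/ssp"))
print(requires_rfc("/datasets/hsc/repo/rerun/private/alice"))
```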

      The RFC acts as a gate to confirm that changes are compliant and necessary. The RFC should include:

      • description and reason for addition/change/deletion
      • target top-level-directory for location of addition/change/deletion
      • organization of data
      • other necessary domain knowledge as identified by project members relating to the contents of the data

      Regarding responsibilities on ingest or maintenance:

      • Ticket creator is responsible for butler-ization of dataset (or delegation of responsibility)
      • Responsibility for maintaining usable datasets is a DM-wide effort

      All local non-private data governed by this RFC must reside within /datasets proper; symbolic links to local non-private data residing on alternate file systems are prohibited. This does not prohibit the use of remote URIs, when supported through the butler, that point to external public repos, although adding or deleting such a URI repo does require the RFC process. This is due to operational concerns including immutability, sharing permissions, and developers changing positions or jobs.
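
      One way to audit the symbolic-link rule is to walk the tree and report links that resolve outside it. The sketch below is an assumption about how such a check could look, not an existing tool: find_external_links is a hypothetical name, private/ subtrees are skipped because they are expected to point elsewhere, and "outside the root" is used as a proxy for "on an alternate file system".

```python
import os


def find_external_links(root):
    """Return non-private symlinks under root that resolve outside root."""
    root_real = os.path.realpath(root)
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune private/ directories: links out of /datasets are allowed there.
        dirnames[:] = [d for d in dirnames if d != "private"]
        # Links to directories show up in dirnames, all other links in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            if os.path.commonpath([root_real, target]) != root_real:
                offenders.append(path)
    return sorted(offenders)
```

      Running find_external_links("/datasets") before asking for a lock-down would list anything that needs to be copied in rather than linked.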

      Caveats / Implementation Details for PRIVATE:

      • private is created with the sticky bit to allow user managed contents
      • private only contains symbolic links pointing out of datasets, or subdirectories containing symbolic links (for organization)
      • no data resides in private/ or subdirectories
      • no access or recovery is offered from private/ other than that provided by the target file system
      • it is the user's responsibility to make the private rerun repo shared, or not, and to allow, or disallow, sub rerun directories from other users
      • data retention in private is not guaranteed (points to scratch, points to home and user leaves, user erroneously deletes repo, etc)
      • data in private is not immutable
      • private/ entries do not require Jira tickets for creation/deletion/modification
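
      The first three caveats can be expressed as a short sketch. Everything named here is hypothetical: create_private and find_data_in_private are illustrative helpers, and mode 1777 (world-writable plus sticky bit, as on /tmp) is an assumption, since the RFC only says that the sticky bit is set.

```python
import os


def create_private(path):
    """Create a private/ directory with the sticky bit set.

    Assumption: mode 1777, so any user can add entries but only the owner
    of an entry can remove or rename it.
    """
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o1777)


def find_data_in_private(private_root):
    """Return regular files under private_root, which should hold only links."""
    offenders = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(private_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                offenders.append(path)
    return sorted(offenders)
```

      An empty result from find_data_in_private means the tree contains only symbolic links and organizing subdirectories.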

      Credits: Hsin-Fang Chiang, Paul Price, Jim Bosch, Greg Daues


              • Assignee:
                jalt Jason Alt [X] (Inactive)