# Common Dataset Organization and Policy


#### Details

• Type: RFC
• Status: Implemented
• Resolution: Done
• Component/s:
• Labels: None
• Location: CLO

#### Description

Building upon RFC-95, this RFC nails down the specific format and policies governing shared datasets available in /datasets.

### FORMAT

Discussions on format happened in RFC-95 and clo. This RFC will just summarize the conclusions and rely on those discussions (and those within the comments here) for explanation.

All data added to /datasets must adhere to the following format (caveats below):

(note: caps are tokens)

/datasets/<camera>/[REPO|RERUN|PREPROCESSED|SIM|RAW|CALIB] | /datasets/REFCATS

where

REPO = repo (butler root)
RERUN = REPO/rerun/PUBLIC | REPO/rerun/PRIVATE (processed results)
PUBLIC = <ticket>/PRIVATE
PRIVATE = private/<user> | ""
PREPROCESSED = preprocessed/<label>/ | preprocessed/<label>/<date>/ (ex. 'dr9')
SIM = <ticket>_<date>/ | <user>/<ticket>/
RAW = raw/<survey-name>/ (where actual files live)
CALIB = calib/default/ | calib/<label>/ (ex. master20161025)
REFCATS = refcats/<label> (ex. astrometry_net_data, gaia_DR1_v1)
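
As an illustration only, the format above can be sketched as a small path classifier. The names `classify` and `CAMERA_LEVEL` are hypothetical and not part of any LSST tooling. The sketch assumes the literal directory names `repo`, `rerun`, `private`, `preprocessed`, `raw`, `calib`, and `refcats` shown in the definitions; it labels anything under `rerun/` that is not `private` as public, and it leaves SIM out because the format does not spell out a literal directory prefix for it.

```python
from pathlib import PurePosixPath

# Literal directory names that may follow /datasets/<camera>/ (assumed from
# the token definitions above; SIM is intentionally not handled).
CAMERA_LEVEL = {
    "repo": "REPO",
    "preprocessed": "PREPROCESSED",
    "raw": "RAW",
    "calib": "CALIB",
}


def classify(path):
    """Return the format token a path falls under, or None if unrecognized."""
    parts = PurePosixPath(path).parts
    # For an absolute path, parts[0] is "/" and parts[1] must be "datasets".
    if len(parts) < 3 or parts[0] != "/" or parts[1] != "datasets":
        return None
    if parts[2] == "refcats":
        # REFCATS = refcats/<label>: a label component is required.
        return "REFCATS" if len(parts) >= 4 else None
    if len(parts) < 4:
        return None  # only /datasets/<camera>, no category directory yet
    rest = parts[3:]
    if rest[0] == "repo" and len(rest) >= 2 and rest[1] == "rerun":
        # RERUN = REPO/rerun/PUBLIC | REPO/rerun/PRIVATE
        if len(rest) >= 3 and rest[2] == "private":
            return "RERUN/PRIVATE"
        return "RERUN/PUBLIC" if len(rest) >= 3 else "RERUN"
    return CAMERA_LEVEL.get(rest[0])
```

A path outside the format, such as one of the exempted directories, comes back as `None`, which is one way a compliance check could flag it for review.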

Some data residing within /datasets does not adhere to this format; it is provided for general consumption, though not as verification data. The following directories are currently exempted:
/datasets/all-sky
/datasets/all-sky-ASIVA

### Immutability/Sharing

NCSA is working on a simple procedure for making the data both shared and safe. The process is TBD (the design is still being iterated on, with an emphasis on simplicity). Essentially, it consists of applying default DAC permission sets and GPFS immutability. Until the self-service commands are available, it is sufficient to ask Greg Daues to lock down or unlock a repo. Steps to add to datasets:

• (you) RFC if necessary per policy (below)
• Directory created, write permissions given
• (you) populate and organize data (as per policy), ask to have it locked down
• sharing and immutability applied

Steps to modify/remove from datasets

• (you) RFC if necessary per policy (below)
• write permissions given, immutability removed
• (you) reorganize, ask to have it locked down
• sharing and immutability reapplied (to parent directory, as applicable)

### Policy

Formatting exists to make data sets easier to consume for the DM project at large. Policy exists to enforce the format and serves to inform whenever policy must change. I suggest the following policies which serve to both enforce and inform:

/datasets Format Changes - This should be obvious. Future needs will certainly require format changes. We must go through the RFC process to change the format.

• Additions / modifications / deletions of any non-private data require an RFC (strictly for input on naming convention, organization, etc.)
• Additions / modifications / deletions of private data can be performed without an RFC

The RFC allows a gate to confirm that things are compliant and necessary. The RFC should include:

• description and reason for addition/change/deletion
• target top-level-directory for location of addition/change/deletion
• organization of data
• other necessary domain knowledge as identified by project members relating to the contents of the data

Regarding responsibilities on ingest or maintenance:

• Ticket creator is responsible for butler-ization of dataset (or delegation of responsibility)
• Responsibility for maintaining usable datasets is a DM-wide effort

All local non-private data governed by this RFC must reside within /datasets proper; symbolic links to local non-private data residing on alternate file systems are prohibited. This is due to operational concerns including immutability, sharing permissions, and developers changing positions or jobs. It does not prohibit the use of remote URIs, when supported through the butler, that point to external public repos, although adding or deleting such a URI-repo does require the RFC process.

Caveats / Implementation Details for PRIVATE:

• private is created with the sticky bit to allow user managed contents
• private only contains symbolic links pointing out of datasets or contains sub directories containing symbolic links (for organization)
• no data resides in private/ or subdirectories
• no access or recovery is offered from private/ other than that provided by the target file system
• it is a user responsibility to make the private rerun repo shared, or not, and allow, or disallow, sub rerun directories from other users
• data retention in private is not guaranteed (points to scratch, points to home and user leaves, user erroneously deletes repo, etc)
• data in private is not immutable
• private/ entries do not require Jira tickets for creation/deletion/modification
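
The first three caveats above (only symbolic links, pointing out of /datasets, with plain subdirectories allowed for organization) can be checked mechanically. The following is a minimal sketch under those assumptions; `find_private_violations` is a hypothetical helper, not an existing NCSA or LSST command, and the violation messages are illustrative.

```python
import os


def find_private_violations(private_dir, datasets_root="/datasets"):
    """List entries under private_dir that break the PRIVATE caveats.

    Returns (path, reason) tuples for regular files stored in private/ and
    for symbolic links whose target resolves to somewhere inside datasets_root.
    """
    root = os.path.realpath(datasets_root)
    violations = []
    # followlinks=False: symlinked directories are reported but not descended.
    for dirpath, dirnames, filenames in os.walk(private_dir, followlinks=False):
        for name in dirnames + filenames:
            entry = os.path.join(dirpath, name)
            if os.path.islink(entry):
                target = os.path.realpath(entry)
                # A link must point out of /datasets, not back into it.
                if os.path.commonpath([target, root]) == root:
                    violations.append((entry, "symlink points inside datasets"))
            elif not os.path.isdir(entry):
                # Anything that is neither a link nor a directory is data.
                violations.append((entry, "regular file stored in private"))
    return violations
```

Plain subdirectories are accepted without comment because the caveats allow them for organization; only their contents are inspected.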

#### Activity

Jason Alt [X] (Inactive) added a comment -

> So if symb link e.g. ~/rerun to /ROOT/rerun/private/rhl before I start, I can say
> processXXX.py /ROOT --rerun private/rhl/foo/bar
> and all will be well?

Yes.

> I don't like the name "private" as it really seems to more mean "ephemeral".

I'm not committed to "private". I will say that ephemeral is a side-effect of the implementation.

> I'm less convinced of the proposal for public, as I'm not quite sure what you think will be there (I think it is much more likely that people will share "private" repos).

This is a RFC-95 ism. From the RFC:

> we should distinguish between "public" simulation runs that have a wide audience and "private" runs used only by a small group of people with a particular goal in mind ("private" != "secret").

I would say it is no longer 'owned' by an individual. "public" status requires an announcement of its availability and a level of management around it to guarantee it remains available including an announcement of its removal. "public" datasets may have others depending upon its presence while private datasets provide no assurance.

> if I want to use --rerun public/rhl/foo/bar) I need to create a ticket, but then the rest will be handled for me? Or do I need to file a new ticket for public/rhl/foo/goo as well?

You would need to create a ticket for every public repo creation / addition / modification / deletion. Recognize that this isn't to require my approval; it is for coordinating the naming scheme (if any), informing others of its availability, applying sharing and immutability features, etc. Also, NCSA is only providing the sharing / immutability management; you are responsible for the grunt work of populating and organizing the repo.

> If so, is there a way for me to promote a private to a public rerun and have the system handle it?

I have been wondering this as well. The question is really if the current tooling will allow this. I suspect it has something to do with butler repo configs. And what happens if you want to promote a private rerun that has dependent reruns?

Gregory Dubois-Felsmann added a comment -

How will this work for repos that are created by automated processes, like weekly or monthly QC runs?

Jason Alt [X] (Inactive) added a comment -

Gregory Dubois-Felsmann As a follow up, I think we should tackle your concern in another RFC once we understand the requirements.

Hsin-Fang Chiang added a comment - - edited

Reporting back, I just verified that this works on lsst-dev, using the shared stack on /software/lsstsw/ w_2016_50

    ln -s /scratch/hchiang2/hscRerun /datasets/hsc/repo/rerun/private/hchiang2
    processCcd.py /datasets/hsc/repo --rerun private/hchiang2/test1 --calib /datasets/hsc/repo/CALIB/20160419/ --id visit=903274 ccd=99

Then it writes the output to my own scratch folder.

Hsin-Fang Chiang added a comment -

The policy/documentation is now on https://developer.lsst.io/services/datasets.html


#### People

Assignee:
Jason Alt [X] (Inactive)
Reporter:
Jason Alt [X] (Inactive)
Watchers:
Greg Daues, Gregory Dubois-Felsmann, Hsin-Fang Chiang, Jason Alt [X] (Inactive), Jim Bosch, John Swinbank, Nate Pease, Paul Price, Robert Lupton