Details
- Type: RFC
- Status: Adopted
- Resolution: Unresolved
- Component/s: LSST
- Labels: None
Description
To include every package resolved via lsst_distrib in the product tree, each package's git repository needs to provide the following information:
- Short name of the package: the text displayed in the package's (yellow) box in the product tree (max 18 characters)
  - Suggestion: it could follow the namespace naming rules, but with a capital first letter
- Key string: 2 to 7 characters, unique, the shorter the better; it is used by the automatic process that builds the product tree
  - Examples: JCAL for jointcal, ASTMT for astro_metadata_translator, DAFB for daf_butler
- Reference person(s): someone who has the best knowledge of the software package and usually maintains it; up to 2 references can be given
  - Proposed format: github_username (Full Name) [github_username (Full Name)]
The proposal is to add this information in a file; my first thought is an info.yaml in the root folder of each repository.
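As an illustration of the constraints listed above, here is a minimal sketch of how one package's record could be checked once it has been loaded from such a file. The field names (short_name, key, references) and the function name are assumptions made for this example; the RFC does not fix a schema.

```python
import re

# Assumed layout of one reference entry: "github_username (Full Name)".
REFERENCE_RE = re.compile(r"^(?P<username>[A-Za-z0-9-]+) \((?P<full_name>[^()]+)\)$")


def validate_package_info(info, used_keys=()):
    """Return a list of problems found in one package's metadata record.

    `info` is a plain dict with hypothetical fields short_name, key and
    references; `used_keys` holds the keys already taken by other packages.
    """
    problems = []
    short_name = info.get("short_name", "")
    if not 1 <= len(short_name) <= 18:
        problems.append("short_name must be 1 to 18 characters")
    key = info.get("key", "")
    if not 2 <= len(key) <= 7:
        problems.append("key must be 2 to 7 characters")
    elif key in used_keys:
        problems.append(f"key {key!r} is already in use")
    references = info.get("references", [])
    if not 1 <= len(references) <= 2:
        problems.append("one or two reference persons are expected")
    for ref in references:
        if REFERENCE_RE.match(ref) is None:
            problems.append(f"reference {ref!r} is not 'github_username (Full Name)'")
    return problems


# Placeholder values; the username and full name are invented.
example = {
    "short_name": "Jointcal",
    "key": "JCAL",
    "references": ["someuser (Some Person)"],
}
print(validate_package_info(example, used_keys={"DAFB", "ASTMT"}))
```

An empty list means the record satisfies the length, uniqueness and format rules stated in the description.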
Activity
Some packages have grown to the point where no single person can really be considered an expert on the whole thing (afw, I'm looking at you). Could that field be multivalued?
Examples of names that are not short:
- meas_extensions_astrometryNet,
- meas_extensions_simpleShape
The reference person may have deep knowledge of some parts of the package, and know whom to refer to for the other parts. Two people could also be listed as references.
An example of a name that is not short is:
meas_extensions_astrometryNet
Why is that too long?
It does not fit in a product tree box.
At the same time, you may want to word the name differently.
From a logical point of view, the name of a product and the GitHub repository name are two different things.
I think it would help to better assess this RFC if we had more information on exactly how this information will be used. What will be reading these YAML (or whatever) files? What will it be doing with the results?
Other questions:
- Why do packages need both a short name and a key string?
- Why isn't the package short name the same as the repository name?
- Why use a GitHub, rather than an LSST, username?
- What's the distinct role of the “reference person” for the package that isn't covered by the responsible T/CAM, product owner or science/technical lead?
As specified in the first line of the RFC, this information is used to populate the product tree, in an automatic way.
The short name is used to be displayed in the product tree boxes.
The short key is used internally. It could be generated automatically, but I think it is preferable to have it chosen by whoever is setting up the new repository (the reference person(s) are already set up).
Can you be more clear on the third bullet?
The reference person could be contacted in case of problems and could take care of the issues assigned to that software package. Indeed, all activities need to be coordinated with the relevant T/CAM, product owner and science/technical lead, in the same way as they are now.
I'm sorry, can you please explain more why we need the short name and short key, and why the existing product names are not sufficient? I don't know what a "product tree box" is, but it seems like adding extra names for packages would be even more confusing.
Can you be more clear on the third bullet?
Third bullet was:
Why use a GitHub, rather than an LSST, username?
My point here is simply that we have an LSST identity service which follows us across the project: we have a single username that applies to project.lsst.org, Jira, Confluence, the Data Facility, etc. And then there's GitHub, which is anomalous.
One can imagine that, in years to come, we might migrate our code off GitHub. However, one assumes that the LSST project IDM will remain supported through the duration of the project. That seems to point to the latter being the more stable, more supported system.
Why would we prefer GitHub in this circumstance?
As specified in the first line of the RFC, this information is used to populate the product tree, in an automatic way... The short name is used to be displayed in the product tree... (etc)
This RFC seems to be oriented around narrow technical considerations about producing a particular arrangement of yellow boxes on a PDF.
I suggest that's not really the most important question. Why do we want to collect this information? Who does it need to be made available to? What will they do with it?
Once we've addressed those, then we can discuss questions of formatting.
John Parejko: I can derive the information from what is available in GitHub, and just leave open the possibility of providing inputs (a different name), if this is the outcome of the RFC.
John Swinbank: the main reason to provide this information is to have a complete product tree to add to our baselined documentation. This was done in 2017, and after the product tree review this summer (2018) the lower level product tree is missing. Adding this information will permit us to keep the complete product tree up to date each time it changes.
The user information can be provided in whichever way we prefer. Since we are getting information from GitHub, the GitHub username seems to me to make more sense, and GitHub is not going to be unsupported in the short or mid term. However, if we prefer, the LSST identity manager username can be used instead. It is just a matter of agreeing on it.
I'll amplify others who doubt the value of a new short name for each repository. If I understand you correctly, the issue is that our repo names are too long for the LaTeX that's currently being generated by https://github.com/lsst/LDM-294/blob/master/makeProductTree.py ? If that's the case, it might be better to have tools like makeProductTree.py (and others) deal with our actual names by wrapping text, and so on. Creating an extra layer of naming seems like it might obfuscate things and create misunderstandings in the long run.
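To make the wrapping suggestion concrete, here is a minimal sketch of how a tool could break a long package name at its underscores so that each line fits a box. The function name and the 18-character width (taken from the limit in the description) are assumptions; this is not how makeProductTree.py currently works.

```python
def wrap_package_name(name, width=18):
    """Split a package name at underscores into lines of about `width` characters.

    Each underscore stays attached to the component before it. A single
    component longer than `width` is left unbroken on its own line.
    """
    parts = name.split("_")
    tokens = [part + "_" for part in parts[:-1]] + [parts[-1]]
    lines, current = [], ""
    for token in tokens:
        if current and len(current) + len(token) > width:
            # The next component would overflow the box, so start a new line.
            lines.append(current)
            current = token
        else:
            current += token
    lines.append(current)
    return lines


for repo in ("jointcal", "meas_extensions_astrometryNet"):
    print(wrap_package_name(repo))
```

The resulting lines could then be joined with whatever line-break markup the diagram generator expects.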
For naming those who are responsible for any given repository, I'll point out that GitHub already has a CODEOWNERS file (https://help.github.com/articles/about-codeowners/). Though yes, it ties into GitHub identities rather than LSST IT's Kerberos, it is at least a standard that other tooling can leverage if we so decide in the future. There's still the larger question of whether we know exactly who these responsible people are, as John Swinbank mentions. A useful source of metadata that already exists here is the default assignee for Jira components that map to these repositories.
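For tooling that wants to read this information, a CODEOWNERS file is simple to parse: each non-comment line is a path pattern followed by one or more owners. The sketch below is a simplified reader written for illustration (it ignores escaped "#" characters in patterns and does not evaluate which pattern wins for a given path); the usernames are placeholders.

```python
def parse_codeowners(text):
    """Return a list of (pattern, owners) pairs from CODEOWNERS-style text."""
    rules = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


# Placeholder content; the owners are invented.
SAMPLE = """\
# Default owners for everything in the repository.
*       @example-user-a @example-user-b
/doc/   @example-user-c
"""

for pattern, owners in parse_codeowners(SAMPLE):
    print(pattern, owners)
```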
Lastly, I'll just state general interest in this space. For www.lsst.io I intend to index our GitHub projects (https://sqr-013.lsst.io for the very old concept). Now my interest is mostly in generating codemeta so that our software can be directly cited. My thinking has been that such a codemeta.json metadata file can be mostly automatically generated from existing metadata sources, though it may be useful to embed a stub codemeta.json file in repositories as well. So although this is highly tangential to the current RFC, I just want to point out that there are other folks thinking about adding metadata files to Git repositories, and that there may be useful standards we can build on.
Ok, we chatted offline about this and it seems that this metadata is necessary just to get the configuration management work off the ground, so it's not really my place to micromanage that. We think that integration with SQuaRE's project metadata initiative makes sense, but no need to tie those projects together at the outset (especially since the timeline for really starting work on www.lsst.io is a bit farther out).
So, here's why I think this is useful, and what that can tell us about implementation.
First of all, what I hope to gain from this is a clear mapping from package/repository to a responsible team. I use the word “team” rather than “individual” deliberately: most directly, I see this as a tool to help us understand who is on the hook for delivering what. The responsible team links directly to the budget and to the T/CAM, who has the responsibility for assessing, prioritising and scheduling issues. This directly forces us to resolve and document perennially vexed questions like “who owns obs_decam?, pex_config?, ... etc”: that is the primary value here.
In the past, the topic of “package experts” has come up (e.g. RFC-150). I personally still regard this as a distraction from our regular workflow, but — given the success of that RFC — I've no objection to including an expert, with the caveat that everything they do in that role has to be agreed with their T/CAM.
In general, we should expect external users of the codebase to install high level “meta-packages” (lsst_distrib, or its successors) and to expect the project to have a centralised and coherent approach to handling the issues they encounter. Except for a few of the real enthusiasts, I don't expect science users to be crawling through metadata on individual packages, and certainly not to be using that to develop “targeted” bug reports.
It's a nice side effect that we can build a diagram showing which packages relate to which high-level product, but I don't think that's of fundamental importance and it shouldn't drive the implementation. Simply “adding a complete product tree to our baselined documentation” isn't a fundamental good or something to strive for unless that, in turn, is enabling something else (if so: what?).
I quite strongly agree with comments to the effect that introducing two(!) new names for each package is unnecessary and likely to be counterproductive and confusing.
It's kind of ironic to say that the Jira component owner could be used to seed the package "reference person"; I was rather hoping that the opposite would be the case, as I think many packages lack proper Jira component owners (if they even have Jira components).
I was also hoping that the "short name" could be reused as a standardized namespace abbreviation in Python and C++, which is otherwise generated in an ad hoc fashion.
I think that adding the owner to the list of metadata collected is a good idea for the time being.
Adding a technical reference, a short name, and a key can be optional. If not provided, they can be derived with a Python function.
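One possible shape for such a function is sketched below. The derivation rules are assumptions chosen for illustration (capitalise the first letter for the short name; take the first three letters of the first component plus the initials of the rest for the key). They happen to reproduce the DAFB and ASTMT examples from the description but not JCAL, and they do not guarantee uniqueness or the 18-character limit.

```python
def derive_short_name(repo_name):
    """Use the repository name with a capital first letter."""
    return repo_name[:1].upper() + repo_name[1:]


def derive_key(repo_name):
    """Build a key of at most 7 characters from the repository name."""
    parts = [part for part in repo_name.split("_") if part]
    if not parts:
        raise ValueError("repository name has no usable components")
    if len(parts) == 1:
        key = parts[0][:4]
    else:
        key = parts[0][:3] + "".join(part[0] for part in parts[1:])
    return key[:7].upper()


def resolve_metadata(repo_name, provided=None):
    """Prefer values supplied by the package, falling back to derived ones."""
    provided = provided or {}
    return {
        "short_name": provided.get("short_name") or derive_short_name(repo_name),
        "key": provided.get("key") or derive_key(repo_name),
    }


print(resolve_metadata("daf_butler"))
print(resolve_metadata("jointcal", {"key": "JCAL"}))
```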
As Kian-Tat Lim pointed out, maintaining Jira components may be easier when this RFC is implemented.
Kian-Tat Lim exactly how do you see the "standardized namespace abbreviation" working? Would a developer who wants to abbreviate, say, lsst.daf.persistence be expected to look up the short name in the daf_persistence repository?
It seems to me that if developers have to look up a particular package's abbreviation, that means they're not familiar with the (an?) abbreviated form, and standardization provides no benefit in terms of reader friction.
Krzysztof Findeisen Yes, a developer would be expected to look it up if he/she were unfamiliar. At least there would be a suggestion of an appropriate namespace abbreviation that would be in the repository. Presumably the same abbreviation would be used throughout the package's code and in its documentation. Right now, there's not even a potential single source of truth for such an abbreviation. I'd be fine (and I think Gabriele Comoretto [X] could adapt) if this lived somewhere well-known in package/doc/ rather than package/info.yaml.
In my mind, standardization means that abbreviations for common dependencies can be learned and become familiar over time rather than potentially having to be relearned for each new package using the dependency because the depending package developer chose a different abbreviation. (The reduction in expenditure of creative energy by the depending package developer in coming up with new abbreviations is admittedly small and is traded off against the lookup cost.)
To me, this is not a major motivation for the RFC; it is a side-effect.
GitHub already has a semi-standard way of representing "code owners" and it seems it can also be broken down by directory:
https://blog.github.com/2017-07-06-introducing-code-owners/
I'd suggest trying to leverage that, if possible, because the pull requests seem to potentially have that integrated in as well.
IMO - A 7 character, unique, key will be insufficient to convey any useful information about a repo except for some kind of identity mapping - so I'd suggest expecting it to be a black box ID and just hashing the name at that point or using the github ID (https://api.github.com/users/lsst/repos) or something, which doesn't require the user to come up with a unique 7 character name.
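A minimal sketch of the hashing option, assuming SHA-1 and a 7-character truncation (both choices are illustrative, and the function name is hypothetical). A truncated digest can collide in principle, so a real tool would still need to check the generated keys for uniqueness.

```python
import hashlib


def opaque_key(repo_name, length=7):
    """Derive a fixed-length, opaque identifier from a repository name."""
    digest = hashlib.sha1(repo_name.encode("utf-8")).hexdigest()
    return digest[:length].upper()


for repo in ("afw", "daf_butler", "jointcal"):
    print(repo, opaque_key(repo))
```

The key is stable as long as the repository name does not change, and nobody has to invent it by hand.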
Using the code owner, as suggested by Jonathan Sick and Brian Van Klaveren, seems to me a good idea. I suspect that this is not the same owner that John Swinbank is referring to.
Actually, I think it was Jonathan Sick who mentioned code owners in this post earlier in the thread, and I think it's the same thing Brian Van Klaveren is linking to.
Yes, sorry.
I have been analyzing the inputs provided so far with Jonathan Sick.
It seems that a reasonable way forward can be to extract metadata from existing files, rather than adding extra files to the repositories.
First, we can extend the README files to include a standardized "Info" section that includes a listing of metadata. From the above comments, we can add these fields:
- wbsowner (required): the name of the T/CAM or the person responsible for the budget. This information is not readily available anywhere else.
- short_name (optional): used to display in the product tree. If not provided it will be derived programmatically from the existing information
- key (optional): used by the product tree tool to uniquely identify each product. If not provided it will be derived programmatically from the existing information
We can incorporate this information into the README template. By structuring this information, it can be parsed and extracted directly from the README file. It will also be easily readable by anyone browsing source repositories.
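As a sketch of what parsing such a section might look like: the layout assumed here (a heading line reading "Info", followed by "name: value" lines, optionally written as bullets) is an assumption for this example, since the README template has not been defined yet. The function name and the sample README are likewise hypothetical.

```python
import re

# Only the three fields proposed above are recognised.
FIELD_RE = re.compile(
    r"^[-*]?\s*(?P<name>wbsowner|short_name|key)\s*:\s*(?P<value>.+?)\s*$"
)


def parse_readme_info(readme_text):
    """Extract the metadata fields from the "Info" section of a README."""
    fields = {}
    in_info = False
    for line in readme_text.splitlines():
        stripped = line.strip()
        if not in_info:
            # Accept both a plain "Info" heading and a "# Info" style heading.
            in_info = stripped.lstrip("#").strip().lower() == "info"
            continue
        if not stripped or set(stripped) <= set("-="):
            continue  # blank line or heading underline
        match = FIELD_RE.match(stripped)
        if match is None:
            break  # the first line that is not a field ends the section
        fields[match["name"]] = match["value"]
    if "wbsowner" not in fields:
        raise ValueError("the README must provide wbsowner in its Info section")
    return fields


# Placeholder README; the owner name is invented.
README = """\
jointcal
========

Info
----
- wbsowner: Example Team
- short_name: Jointcal

Usage notes follow here.
"""

print(parse_readme_info(README))
```

Fields that are absent (here, key) are simply missing from the result, so the caller can derive them programmatically.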
Second, we can leverage the CODEOWNERS file already supported by GitHub to establish a "reference person" who on a practical basis is deeply involved in the development and maintenance of the code and can be assigned to review code changes. The person(s) named in the CODEOWNERS file can also be the default assignee for issues in the corresponding Jira component. See RFC-150.
So, if John Swinbank and the other commenters are OK with the proposal, we will set the RFC to adopted and open a couple of implementation issues:
- Extend README template to include this metadata in a standard and structured way.
- Update existing READMEs.
We will also invite everybody to add the code owner information in each repository in GitHub. See https://help.github.com/articles/about-codeowners/ . For the time being we can consider this information as not mandatory.
The due date is extended by a couple more days.
I still don't understand what the justification for the short_name and key are. If the goal is to identify our software in some managerial document, I would think it more important to use the actual name of the software, not some other (shortened) name.
The name information is usually available from the README. If not, the github package name can be considered the official name also. So there is no need to add extra metadata for it. If you feel that a short name is not relevant don't add it.
- short_name (optional): used to display in the product tree. If not provided it will be derived programmatically from the existing information
- key (optional): used by the product tree tool to uniquely identify each product. If not provided it will be derived programmatically from the existing information
I'd still prefer to omit these unless we really understand what they are for. In what circumstances is it not appropriate to use the product name as the short name? Whose responsibility is it to “feel that a short name is not relevant”?
establish a "reference person" who on a practical basis is deeply involved in the development and maintenance of the code and can be assigned to review code changes. The person(s) named in the CODEOWNERS file can also be the default assignee for issues in the corresponding Jira component. See RFC-150.
No, please don't do this. Specifically:
- There should be no implication that the person listed as an “owner” should be assigned to review changes. It's not fair on any individual to have them as the “default reviewer” for a huge product like pipe_tasks or afw.
- The conclusion re default assignees in RFC-150 seems to be “it would be best if the default assignee for new stories would actually be Unassigned, not TCAM, and not the component expert”.
The key I think we can definitely derive. The short_name is perhaps a misnomer: it's a display_name. It's used by my script to make the product tree, which I would still like to print out large for review purposes.
Agree not to have default assignee - but we did want to identify "someone" in general responsible for the packages.
Hi Wil O'Mullane — I'm not sure what value is gained by producing a poster-sized display of all our code repositories. This seems like it's going beyond the level of abstraction usefully captured by the product tree.
However, if we agree that this is the goal, then I suggest that “display names” for each repository can be more conveniently stored with (or generated by) the product tree generation scripts, rather than in the repositories themselves. Regular developers, who aren't involved with generating this sort of display, should never need to see them, define them, or use them.
Agree not to have default assignee - but we did want to identify "someone" in general responsible for the packages.
Yes, agreed. This is the conclusion of RFC-150 (which I disagree with, but I don't want to reopen that discussion). Identifying an “expert” who is not a default assignee or default reviewer is fine.
When I did the original product tree I thought we agreed tying in all repos was the correct thing to do; we can roll up and display at any level we like once we have a consistent set of relationships.
Ah yes, you mean put a lookup map (JSON) with the script generating the tree; that could work.
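A minimal sketch of that lookup-map approach, assuming a JSON object keyed by repository name and kept alongside the tree-generation script. The display names shown are invented for illustration, and the 18-character check reuses the box limit from the description.

```python
import json

# Hypothetical contents of a display-name map stored next to the script.
DISPLAY_NAMES_JSON = """
{
  "meas_extensions_astrometryNet": "AstrometryNet",
  "meas_extensions_simpleShape": "SimpleShape"
}
"""


def load_display_names(text, max_length=18):
    """Parse the map and reject names that would not fit in a box."""
    names = json.loads(text)
    too_long = {repo: name for repo, name in names.items() if len(name) > max_length}
    if too_long:
        raise ValueError(f"display names longer than {max_length} characters: {too_long}")
    return names


def display_name(repo_name, names):
    """Fall back to the repository name when no override is listed."""
    return names.get(repo_name, repo_name)


names = load_display_names(DISPLAY_NAMES_JSON)
for repo in ("afw", "meas_extensions_simpleShape"):
    print(repo, "->", display_name(repo, names))
```

With this arrangement only the packages whose names need an override appear in the map, and the repositories themselves carry no extra naming metadata.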
If I understand correctly from the last comments, the only information we need is:
- wbsowner
- expert
I had a look at the milestone project (in lsst-dm) but I did not find any JSON with project names.
John Swinbank can you indicate where I can find this information?
Gabriele Comoretto [X] — sorry I didn't see your comment earlier.
I think we discussed this at the DM-CCB yesterday. The only JSON file in https://github.com/lsst-dm/milestones is used to provide information about milestones. I'm not aware of any existing JSON file which defines project (product?) names, although I guess maybe https://github.com/lsst/repos/blob/master/etc/repos.yaml is close?
I think repos.yaml is used for build purposes. Adding information there for documentation purposes is not the right thing, in my opinion.
Since I will be parsing README files in any case, I think that this is the most suitable location for the (optional) display_name information, in addition to wbsowner and expert as concluded above.
As per DMCCB #3 discussion.
Why is this necessary? The product names are already relatively short and contain abbreviations themselves. I don't think we need more abbreviations.