LICENSE, COPYRIGHT, and GitHub's metadata detection
The purpose of this topic is to discuss the implementation of RFC-45 (license and copyright management in the Stack). Note that this post itself does not convey any change of policy compared to RFC-45.
Background
The intent of RFC-45 is to simplify licensing and copyright management in the Stack by consolidating copyright claims in a COPYRIGHT file, rather than in the preamble of each source file. A secondary goal of this work is to properly document the GPL-3.0 license of each Stack package through a LICENSE file.
GitHub is able to detect the licensing of a repository (https://help.github.com/articles/licensing-a-repository/) if a LICENSE file contains the appropriate legal text. GitHub uses the licensee (https://github.com/benbalter/licensee) Ruby gem to do this license detection. If it confidently detects a license, GitHub displays the license through its UI and API. Take SQuaRE's ltd-keeper (https://github.com/lsst-sqre/ltd-keeper) as an example. This metadata is generally quite useful for the community, and is an indication that a license is accurately documented in a repository.
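For anyone who wants to check the API side of this, here is a minimal Python sketch. It assumes GitHub's REST endpoint for a repository's license (`/repos/{owner}/{repo}/license`) and an `spdx_id` field in the response, as I understand the public API documentation. The helper names are made up for illustration, and the request is unauthenticated, so it is rate-limited.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def license_url(owner: str, repo: str) -> str:
    """Build the URL of the repository license endpoint."""
    return f"{API_ROOT}/repos/{owner}/{repo}/license"


def detected_spdx_id(payload: dict) -> Optional[str]:
    """Pull the SPDX identifier out of a decoded response, if one is present."""
    license_info = payload.get("license") or {}
    return license_info.get("spdx_id")


def fetch_detected_license(owner: str, repo: str) -> Optional[str]:
    """Ask GitHub which license it detected (needs network access)."""
    request = urllib.request.Request(
        license_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return detected_spdx_id(json.load(response))
```

Calling `fetch_detected_license("lsst-sqre", "ltd-keeper")` would query the example repository mentioned above. The function is only defined here, not called, so the snippet makes no request on import.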
The problem with our new approach
We expected that our new strategy would be compatible with GitHub's license detection since the new LICENSE file is the verbatim legal text of the GPL-3.0 license. It turns out this is not quite the case.
To see why, I've run licensee on the daf_butler (https://github.com/lsst/daf_butler) repository, which uses the RFC-45 LICENSE and COPYRIGHT strategy. This is the result:
License: Other
Matched files: ["LICENSE", "COPYRIGHT"]
LICENSE:
  Content hash: f9aaff304edbbe732f27b554dcb6abe49e0b4fa0
  Attribution: Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
  Confidence: 100.00%
  Matcher: Licensee::Matchers::Exact
  License: GNU General Public License v3.0
COPYRIGHT:
  Content hash: b55b124b3b2c9d14da5e071b1b080dfcd78e1250
  Attribution: Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,
  License: Other
As you can see, licensee is parsing both the COPYRIGHT and LICENSE files for license information. It detects the GPL-3.0 license in the LICENSE file with 100% confidence, but it cannot classify the COPYRIGHT file and labels it Other. Given this mixed result, licensee reports the overall license of daf_butler as Other, that is, effectively unknown.
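To make the "mixed result" reasoning concrete, here is a toy Python model of how per-file labels could combine into one repository-level label. This is an assumption that happens to be consistent with the licensee output in this post, not a port of licensee's actual logic. The function name and constants are hypothetical.

```python
from typing import Dict

NO_LICENSE = "No-license"  # label for a file that is only a copyright notice
OTHER = "Other"            # label used when no single license can be named


def overall_license(per_file: Dict[str, str]) -> str:
    """Combine per-file labels into one repository-level label (toy model)."""
    # Files that make no license claim do not take part in the decision.
    claims = {label for label in per_file.values() if label != NO_LICENSE}
    if not claims:
        return NO_LICENSE
    if len(claims) == 1:
        return next(iter(claims))
    # Two or more different labels: the result is ambiguous.
    return OTHER


print(overall_license({
    "LICENSE": "GNU General Public License v3.0",
    "COPYRIGHT": "Other",
}))
```

Under this model the daf_butler result above comes out as Other, because the COPYRIGHT file contributes a second, conflicting label. A COPYRIGHT file labelled No-license would drop out of the decision and leave GPL-3.0 as the only claim.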
If you look at the COPYRIGHT file, though, it's a little complex, with multiple lines:
Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,
through SLAC National Accelerator Laboratory.
Copyright 2018 Association of Universities for Research in Astronomy.
Copyright 2015, 2018 The Trustees of Princeton University
What I realized is that if the copyright attribution is a single line, licensee recognizes the copyright statement in the COPYRIGHT file, and realizes the COPYRIGHT file has no license claim. For example, a single-line COPYRIGHT file:
Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory
The result is:
License: GNU General Public License v3.0
Matched files: ["LICENSE", "COPYRIGHT"]
LICENSE:
  Content hash: f9aaff304edbbe732f27b554dcb6abe49e0b4fa0
  Attribution: Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
  Confidence: 100.00%
  Matcher: Licensee::Matchers::Exact
  License: GNU General Public License v3.0
COPYRIGHT:
  Content hash: da39a3ee5e6b4b0d3255bfef95601890afd80709
  Attribution: Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory
  Confidence: 100.00%
  Matcher: Licensee::Matchers::Copyright
  License: No-license
This is exactly what we want, except that it attributes copyright to only one institution rather than three.
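One way to picture why the line count matters is a matcher that only accepts a file whose entire content is a single copyright line. The Python sketch below is an illustration under that assumption. The pattern is hypothetical and is not copied from licensee, whose real matcher lives in lib/licensee/matchers/copyright.rb.

```python
import re

# Hypothetical pattern: the whole file must be one line starting with
# "Copyright". Without re.DOTALL, "." does not match a newline, so any
# second line makes fullmatch() fail.
SINGLE_NOTICE = re.compile(r"copyright\b.*", re.IGNORECASE)


def is_bare_copyright_notice(content: str) -> bool:
    """Return True if the content is exactly one copyright line."""
    return SINGLE_NOTICE.fullmatch(content.strip()) is not None


single_line = (
    "Copyright 2016-2017 The Board of Trustees of the Leland Stanford "
    "Junior University, through SLAC National Accelerator Laboratory"
)
multi_line = (
    "Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,\n"
    "through SLAC National Accelerator Laboratory.\n"
    "Copyright 2018 Association of Universities for Research in Astronomy.\n"
    "Copyright 2015, 2018 The Trustees of Princeton University\n"
)

print(is_bare_copyright_notice(single_line))  # True
print(is_bare_copyright_notice(multi_line))   # False
```

With a whole-file, single-line pattern like this, the multi-line file is never recognized as a bare copyright notice, so it falls through to ordinary license matching and ends up unclassified.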
The solution seems to be putting all copyright claims on one line, like this:
Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory. Copyright 2018 Association of Universities for Research in Astronomy. Copyright 2015, 2018 The Trustees of Princeton University
The detected license is:
License: GNU General Public License v3.0
Matched files: ["LICENSE", "COPYRIGHT"]
LICENSE:
  Content hash: f9aaff304edbbe732f27b554dcb6abe49e0b4fa0
  Attribution: Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
  Confidence: 100.00%
  Matcher: Licensee::Matchers::Exact
  License: GNU General Public License v3.0
COPYRIGHT:
  Content hash: da39a3ee5e6b4b0d3255bfef95601890afd80709
  Attribution: Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory. Copyright 2018 Association of Universities for Research in Astronomy. Copyright 2015, 2018 The Trustees of Princeton University
  Confidence: 100.00%
  Matcher: Licensee::Matchers::Copyright
  License: No-license
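If we did adopt the one-line workaround, converting existing COPYRIGHT files could be scripted. Here is a minimal Python sketch, assuming every claim already ends with its own punctuation so that joining lines with a space is enough. The function name `collapse_copyright` is hypothetical.

```python
def collapse_copyright(text: str) -> str:
    """Join the non-empty lines of a COPYRIGHT file into a single line."""
    stripped = (line.strip() for line in text.splitlines())
    return " ".join(line for line in stripped if line)


multi_line = """\
Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,
through SLAC National Accelerator Laboratory.
Copyright 2018 Association of Universities for Research in Astronomy.
Copyright 2015, 2018 The Trustees of Princeton University
"""

print(collapse_copyright(multi_line))
```

For the daf_butler file this prints the same single-line text used in the test above. Note that the sketch does not add separators, so a claim that lacks a trailing period would run straight into the next one.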
Relevant GitHub PR that implemented the original behavior: https://github.com/benbalter/licensee/pull/233
Relevant code: https://github.com/benbalter/licensee/blob/master/lib/licensee/matchers/copyright.rb
I’ve created Issue 285 in the upstream licensee repo to see if we can get it to support our original, multi-line COPYRIGHT file format.