  1. Data Management
  2. DM-13966

Research why license is not detected for daf_butler


    Details

      Description

      daf_butler is the first package to fully implement the RFC-45 strategy for managing license and copyright information. Despite this, GitHub is not detecting the GPL-3.0 license and displaying it through its API/UI. This means that something might be wrong with the RFC-45 strategy (ideally we want our licenses detected). This ticket will research why this is, and hopefully suggest a resolution.


            Activity

            jsick Jonathan Sick added a comment -

            I’ve created Issue 285 in the upstream licensee repo to see if we can get support for our COPYRIGHT files.

            jsick Jonathan Sick added a comment -

            LICENSE, COPYRIGHT, and GitHub's metadata detection

            The purpose of this topic is to discuss the implementation of RFC-45 (license and copyright management in the Stack). Note that this post itself does not convey any change of policy compared to RFC-45.

            Background

            The intent of RFC-45 is to simplify licensing and copyright management in the Stack by consolidating copyright claims in a COPYRIGHT file, rather than in the preambles of each source file. A secondary goal of this work is to properly document the GPL-3.0 license of each stack package through a LICENSE file.

            GitHub is able to detect the licensing of a repository (https://help.github.com/articles/licensing-a-repository/) if a LICENSE file contains the appropriate legal text. GitHub uses the licensee (https://github.com/benbalter/licensee) Ruby gem to do this license detection. If it confidently detects a license, GitHub displays the license through its UI and API. Take SQuaRE's ltd-keeper (https://github.com/lsst-sqre/ltd-keeper) as an example. This metadata is generally quite useful for the community, and is an indication that a license is accurately documented in a repository.

            The problem with our new approach

            We expected that our new strategy would be compatible with GitHub's license detection since the new LICENSE file is the verbatim legal text of the GPL-3.0 license. It turns out this is not quite the case.

            To see why, I've run licensee on the daf_butler (https://github.com/lsst/daf_butler) repository, which uses the RFC-45 LICENSE and COPYRIGHT strategy. This is the result:

            License: Other
            Matched files: ["LICENSE", "COPYRIGHT"]
            LICENSE:
             Content hash: f9aaff304edbbe732f27b554dcb6abe49e0b4fa0
             Attribution: Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
             Confidence: 100.00%
             Matcher: Licensee::Matchers::Exact
             License: GNU General Public License v3.0
            COPYRIGHT:
             Content hash: b55b124b3b2c9d14da5e071b1b080dfcd78e1250
             Attribution: Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,
             License: Other
            

            As you can see, licensee parses both the COPYRIGHT and LICENSE files for license information. It detects the GPL-3.0 license in the LICENSE file with 100% confidence, but it classifies the COPYRIGHT file as "Other". Given this mixed result, licensee reports the overall license of daf_butler as "Other" rather than GPL-3.0.

            If you look at the COPYRIGHT file, though, it's a little complex, with multiple lines:

            Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,
            through SLAC National Accelerator Laboratory.
            Copyright 2018 Association of Universities for Research in Astronomy.
            Copyright 2015, 2018 The Trustees of Princeton University
            

            What I realized is that if the copyright attribution is a single line, licensee recognizes the copyright statement in the COPYRIGHT file, and realizes the COPYRIGHT file has no license claim. For example, a single-line COPYRIGHT file:

            Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory
            

            The result is:

            License: GNU General Public License v3.0
            Matched files: ["LICENSE", "COPYRIGHT"]
            LICENSE:
             Content hash: f9aaff304edbbe732f27b554dcb6abe49e0b4fa0
             Attribution: Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
             Confidence: 100.00%
             Matcher: Licensee::Matchers::Exact
             License: GNU General Public License v3.0
            COPYRIGHT:
             Content hash: da39a3ee5e6b4b0d3255bfef95601890afd80709
             Attribution: Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory
             Confidence: 100.00%
             Matcher: Licensee::Matchers::Copyright
             License: No-license
            

            This is exactly what we want, except that we attribute copyright to only one institution rather than three.

            The solution seems to be putting all copyright claims on one line, like this:

            Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory. Copyright 2018 Association of Universities for Research in Astronomy. Copyright 2015, 2018 The Trustees of Princeton University
            

            The detected license is:

            License: GNU General Public License v3.0
            Matched files: ["LICENSE", "COPYRIGHT"]
            LICENSE:
             Content hash: f9aaff304edbbe732f27b554dcb6abe49e0b4fa0
             Attribution: Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
             Confidence: 100.00%
             Matcher: Licensee::Matchers::Exact
             License: GNU General Public License v3.0
            COPYRIGHT:
             Content hash: da39a3ee5e6b4b0d3255bfef95601890afd80709
             Attribution: Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University, through SLAC National Accelerator Laboratory. Copyright 2018 Association of Universities for Research in Astronomy. Copyright 2015, 2018 The Trustees of Princeton University
             Confidence: 100.00%
             Matcher: Licensee::Matchers::Copyright
             License: No-license
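
            One detail in the output above: in both single-line runs the COPYRIGHT content hash is da39a3ee5e6b4b0d3255bfef95601890afd80709, which is the SHA-1 digest of empty input. A plausible reading (an inference, not something confirmed in this ticket) is that licensee hashes whatever is left after it strips the copyright statement, and that nothing is left. The digest itself can be checked with a short Python snippet:

```python
import hashlib

# SHA-1 of zero bytes. This is a property of SHA-1 itself, independent of licensee.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()

# Hash reported for the single-line COPYRIGHT files in the output above.
REPORTED_HASH = "da39a3ee5e6b4b0d3255bfef95601890afd80709"

if __name__ == "__main__":
    print(EMPTY_SHA1 == REPORTED_HASH)
```

            By contrast, the multi-line COPYRIGHT file produced a different hash (b55b124b...), so some content evidently survived in that case.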
            

            Relevant GitHub PR that implemented the original behavior: https://github.com/benbalter/licensee/pull/233

            Relevant code: https://github.com/benbalter/licensee/blob/master/lib/licensee/matchers/copyright.rb
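
            The single-line behavior described above can be illustrated with a minimal Python sketch. The pattern and both function names below are hypothetical stand-ins chosen to mimic the observed results. They are not licensee's actual regex or API, which is written in Ruby and lives in the file linked above.

```python
import re

# Hypothetical single-line copyright pattern. NOT licensee's real regex;
# it only reproduces the behavior observed in this ticket.
SINGLE_LINE_COPYRIGHT = re.compile(r"\s*copyright\b.*", re.IGNORECASE)


def is_single_line_copyright(text: str) -> bool:
    """Return True if the whole file is one line beginning with 'Copyright'."""
    # Without re.DOTALL, '.' does not match a newline, so fullmatch fails
    # as soon as the stripped text spans more than one line.
    return SINGLE_LINE_COPYRIGHT.fullmatch(text.strip()) is not None


def join_copyright_lines(text: str) -> str:
    """Collapse all non-empty lines into one line separated by spaces."""
    return " ".join(line.strip() for line in text.splitlines() if line.strip())


MULTI_LINE = """\
Copyright 2016-2017 The Board of Trustees of the Leland Stanford Junior University,
through SLAC National Accelerator Laboratory.
Copyright 2018 Association of Universities for Research in Astronomy.
Copyright 2015, 2018 The Trustees of Princeton University
"""

if __name__ == "__main__":
    # The original multi-line file is rejected by the single-line pattern.
    print(is_single_line_copyright(MULTI_LINE))
    # Joining the claims onto one line makes the same pattern accept it.
    print(is_single_line_copyright(join_copyright_lines(MULTI_LINE)))
```

            Joining the lines reproduces the one-line COPYRIGHT workaround shown earlier. Whether the upstream matcher should accept continuation lines such as "through SLAC National Accelerator Laboratory." is the question raised in the licensee issue.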

            I’ve created an issue with the licensee project to see if we can have it support our original COPYRIGHT file format.

            jsick Jonathan Sick added a comment -

            Thanks to https://github.com/benbalter/licensee/pull/295 this issue appears to be resolved.


              People

              Assignee:
              jsick Jonathan Sick
              Reporter:
              jsick Jonathan Sick