  Data Management / DM-6655

Webpage of flags produced by various stack products


    Details

      Description

      SDSS has a handy webpage with descriptions of all of their bitmask flags:

      http://www.sdss.org/dr12/algorithms/bitmasks/#ListofBitmasks

      It would be exceptionally useful for LSST to produce a similar webpage. I could see it being auto-built from our current flag documentation, which would also help us identify places where our current docstrings are lacking (as many of them are).

            Activity

            Parejkoj John Parejko created issue -
            jsick Jonathan Sick made changes -
            Field Original Value New Value
            Issue Type Bug [ 1 ] Story [ 10001 ]
            jsick Jonathan Sick made changes -
            Labels pipelines-docs
            jsick Jonathan Sick made changes -
            Epic Link DM-6199 [ 24715 ]
            jsick Jonathan Sick made changes -
            Epic Link DM-6199 [ 24715 ] DM-5646 [ 23496 ]
            Parejkoj John Parejko made changes -
            Link This issue relates to DM-9050 [ DM-9050 ]
            gpdf Gregory Dubois-Felsmann added a comment -

            I'd like to see this evolve even in the direction of providing some microservices supporting this, so that we can elucidate flag bits in the Portal UI.

            jsick Jonathan Sick added a comment -

            Just to add some tangential context, I want to maintain a strong separation between documentation of data products produced by the Stack/Science Pipelines, and documentation of data products made available through a PDAC or Data Release. This way the Science Pipelines can remain a fairly generic open source project, whereas the data release documentation explains specifically what data we're shipping.

            In practice this implies two layers of documentation. pipelines.lsst.io documents the data that comes out of the Stack, and pdac/drN.lsst.io documents what data products are available to astronomers. The latter documentation would reference the former.

            Your idea of a data product microservice is an interesting one, Gregory Dubois-Felsmann. Roughly speaking, I think that the same documentation tools that introspect pipelines code to build the pipelines.lsst.io static website could also produce some JSON datasets. A web API server would take these JSON datasets, tagged for each version of the stack software involved in making a data release, and make that data available for querying by the portal UI.

            The same pattern could be extended more generally to make more and more of the science platform data available through a web API for querying. Just as HTML is currently an output format, we can also output well-structured JSON. I think this would be a great way to make "pipelines.lsst.io"-type software documentation accessible from data release/science platform contexts.
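            To make that concrete, here is a rough, hypothetical sketch of the kind of flag-description JSON such a documentation build could emit; the input file name, output layout, and field names are all made up, and no such service exists yet.

            # Hypothetical sketch: export flag descriptions from one catalog's schema
            # as JSON that a web API server could later serve to the Portal UI.
            import json

            import lsst.afw.table as afwTable

            catalog = afwTable.SourceCatalog.readFits("src.fits")  # made-up input file
            flags = [
                {"name": item.field.getName(), "doc": item.field.getDoc()}
                for item in catalog.schema
                if item.field.getTypeString() == "Flag"
            ]
            with open("src_flags.json", "w") as f:
                json.dump({"dataset": "src", "flags": flags}, f, indent=2)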

            jsick Jonathan Sick added a comment -

            John Parejko: could you point me to some code on GitHub that produces flags?

            Parejkoj John Parejko added a comment -

            Unfortunately, I don't really know of examples (which is partly why I wrote this ticket: I don't know where to find this information in the first place). Suggest asking Jim Bosch or Russell Owen?

            jbosch Jim Bosch added a comment -

            Here are some examples of adding Flag fields to a Schema in C++:

            https://github.com/lsst/meas_base/blob/master/src/PixelFlags.cc#L82

            and in Python:

            https://github.com/lsst/meas_base/blob/master/python/lsst/meas/base/classification.py#L70
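            For reference, a minimal sketch of what such a call looks like in Python follows; the field name and doc string are invented for illustration.

            # Minimal sketch of adding a Flag field to an afw.table Schema in Python;
            # the field name and doc string are made up.
            import lsst.afw.table as afwTable

            schema = afwTable.SourceTable.makeMinimalSchema()
            schema.addField(
                "example_MeasurementFailed_flag",  # hypothetical flag name
                type="Flag",
                doc="Set if the (hypothetical) example measurement failed on this source.",
            )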

            Because these kinds of calls are often buried several layers deep in convenience wrappers, I think it's going to be impossible to extract this information from a static view of the source code.

            All of these calls do ultimately go through a single function in afw, so we could catch them at runtime if we put some kind of hook there, but I worry there's not enough context there to turn those calls into documentation. There's also, of course, the problem of making sure all that code is actually run somehow when the doc hook might be active. Perhaps the best we could hope for from this kind of tooling would be something that just checks documentation coverage for flags, and I'm not sure even that's worth the effort.

            But I think the "list of flags" documentation, like any Schema documentation, is really more part of the "data product" documentation than "pipeline code documentation", and to the extent it's the latter, what people will want is the list of flags in the data products produced by the default configuration of measurement algorithms, Tasks, SuperTasks, and Pipelines that are defined by our codebase. We can easily run those in special contexts where we could extract their schemas and produce static documentation from them; I think it'd make sense to make that harness a part of a standard unit test harness for those components. That would just leave the problem of making sure we have good unit test coverage for those components at multiple levels, but at least that's not a new problem.

            zivezic Zeljko Ivezic added a comment -

            It seems that nothing happened with this ticket since May despite Priority = Major!

            The lack of documentation about flags, and the lack of robust and convenient tools to interpret them using the flags' English names (cf. the mostly nice SDSS flag tools), is now a major obstacle when trying to understand stack outputs (e.g. right now in the context of the stack crowded-field performance analysis by the DM SST).

            It is not obvious to me that this ticket is (only) J. Sick's problem!
            The Science Pipelines Crew, what say you? John Swinbank Robert Lupton

            Btw, if someone starts addressing this ticket, I recommend getting in touch with Chris Suberlak. He deciphered some flags by extracting them from the FITS file headers, but that method is not robust, and we most certainly don't want thousands of LSST users writing the code that he just showed to me!

            jbosch Jim Bosch added a comment - - edited

            This is all pretty easy to do right now if you use the DM code to read the catalogs - the documentation for the flags can be printed, and there is no need to worry about the relationship between flags and bits because the flag fields just appear as regular booleans.

            The problem comes when trying to read our FITS files with external code. There is a big gap in the FITS standard that makes it impossible to write those in a way that would let other tools read them with full functionality (no standard way to label bits in an array), and in the future we do not expect to have people read them nearly as often as other forms (e.g. SQL tables) anyway. While we're not all the way there yet, I don't think this is an unreasonable state of affairs.

            The downside of a simple webpage is that the set of columns is itself dynamic - it's much easier to get this documentation right if we focus on what's packaged with the data. The only times the set of columns is really going to be static is when it's frozen for a data release. I'm hoping Jonathan Sick can make web documentation flexible enough to capture that dynamism, and I'd be hesitant to try to document this on the web without that.
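            As a concrete illustration of "use the DM code to read the catalogs", something along these lines works with the stack API; the file name and the specific flag name below are placeholders.

            # Sketch of inspecting flag documentation through the stack API.
            import lsst.afw.table as afwTable

            src = afwTable.SourceCatalog.readFits("src.fits")  # placeholder file name

            # The schema carries the per-field documentation, so it can simply be printed:
            for item in src.schema:
                if item.field.getTypeString() == "Flag":
                    print(item.field.getName(), "--", item.field.getDoc())

            # Flag columns read back as ordinary booleans; no bit unpacking is needed:
            saturated = src["base_PixelFlags_flag_saturated"]  # a typical flag column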

            jsick Jonathan Sick added a comment -

            I haven't got to this yet because there are a number of prerequisite layers of work needed to make it maintainable/sustainable (the primary aim of my work). However, we could throw together Confluence pages with tables of flags with the understanding that these will be replaced with the real pipelines.lsst.io docs.

            John Swinbank, do you think anyone from Pipelines would be able to do this? I don't think I can prioritize this on the timescale needed to support Chris Suberlak's work.

            swinbank John Swinbank added a comment -

            I'm worried that there are a couple (at least) of separate requests getting conflated here, and that's causing us to misfire on understanding the work that needs to be done and what its priority is.

            First: when you interact with a catalogue through the regular stack API, you have access to the schema, which tells you what all the flags in that catalogue are and provides documentation for them. In other words, you can quite easily pull up a list of flags and other information about a data release which is stored on disk. (Note that at least some of today's confusion emerges from trying to do this not using the stack API but by accessing the persisted FITS files directly. That's much harder, not an interface that we encourage, and not — I think — necessary for what Krzysztof Suberlak was trying to achieve).

            Given the above capability, it's (relatively) easy to write an "afterburner" type script for a data release which dumps a web page of all the flag fields that it contains and the associated documentation. If this is really a pressing need, we could go ahead and do it, but I'm not sure that anybody really needs this at the moment. (Of course, when we are generating stable, documented data releases, the need for this is obvious... but currently, we aren't.)
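            A minimal version of that afterburner, assuming a single persisted catalogue as input and with made-up file names, might look like the sketch below.

            # Rough sketch of the "afterburner" idea: dump a catalogue's flag fields
            # and their documentation as a simple HTML table. File names are made up.
            import html

            import lsst.afw.table as afwTable

            catalog = afwTable.SourceCatalog.readFits("src.fits")
            rows = [
                "<tr><td>{}</td><td>{}</td></tr>".format(
                    html.escape(item.field.getName()), html.escape(item.field.getDoc())
                )
                for item in catalog.schema
                if item.field.getTypeString() == "Flag"
            ]
            with open("flags.html", "w") as f:
                f.write("<table>\n<tr><th>Flag</th><th>Description</th></tr>\n")
                f.write("\n".join(rows))
                f.write("\n</table>\n")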

            Of course, there's a caveat in the above: not all of the flags have useful documentation (and this, I think, is what winds up John Parejko). Maybe tabulating all of the flags would make that easier to spot, as he suggests. In that case, though, such a table is a means to an end, rather than an end in itself. In any event, badly documented flags count as bugs, and we should file tickets and fix them when we find them.

            Now, note that any particular "data release" (or other persisted repository of output data) is the result of executing a particular configuration of the pipelines. That configuration determines exactly which flag fields we actually store. Therefore, a more general — and harder to answer — question than "what flag fields exist in this data release?" is "what flag fields could exist in any conceivable data release?". Jim Bosch gives some thoughts above on how we might go about answering this, but, realistically, I'd question whether this is something worth prioritising.

            So that's the story if you're using our API. What if you want to load our FITS data into external tools? Note that this is something we are actually required to support (it's DMS-REQ-0078). Once again, Jim speaks to that above: providing a perfectly generic FITS export of this data is difficult or impossible, and it's of questionable usefulness. That may be a topic we have to return to later in construction, but, at the moment, it's not a priority or something I'd suggest we invest significant resources in.

            So where does that get us?

            • I hope Krzysztof Suberlak — and any other consumer of a data release — has everything he needs through supported stack APIs, so I don't believe there's actually a crisis here (if not, do shout).
            • Badly documented flags deserve bugs, and we'll fix them.
            • Dumping HTML catalogues of flags in a data release is certainly something we can do in principle, and will have to for supported data releases, but I question if it's useful in the general case.
            • Dumping HTML catalogues of all possible flags is something I'd be reluctant to sign up to unless somebody evinces a really compelling argument (and even then, it'd be serious work).
            Parejkoj John Parejko added a comment - - edited

            Being able to parse the output of src.schema is also part of the problem: our schemas contain a huge number of things, and interpreting that as a Python string dump is not ideal. A bare minimum webpage that contains all the flags+documentation, and preferably the other fields, produced by the default run of processCcd.py seems like it wouldn't be too hard to build. We could probably even do it using the output of a weekly validate_drp run on one of the validation datasets.

            The claim that we can't make that webpage because "set of columns is itself dynamic" seems to be dodging the issue, since we do have a relatively standard set of things that come out of processCcd, so one would think we could at least get most of the way there?

            Also, having that webpage makes it a lot easier to identify the places where our documentation is insufficient.

            ctslater Colin Slater added a comment -


            I would find it super useful myself to have an easily accessible web page documenting the entire schema (not exclusively flags). It's hard to help novice users when I can't point them to a general description of what is in the data files they just produced, and it's always a bunch of digging ("what was the last thing I processed, where is it, what visit number?") when I need the schema for reference. I second John Parejko's suggestion of using validate_drp or ci_hsc, since that likely covers 90% of the use cases.

            swinbank John Swinbank added a comment -

            The claim that we can't make that webpage because "set of columns is itself dynamic" seems to be dodging the issue, since we do have a relatively standard set of things that come out of processCcd, so one would think we could at least get most of the way there?

            For what it's worth, I don't think anybody tried to dodge the issue in this way — I think we're all agreed that dumping HTML corresponding to any particular execution of the pipeline is easy enough. If there's a consensus that it would really make life easier, let's make sure that gets put on the backlog: it's a more tightly constrained request than the current ticket, so it's more likely to get addressed in a meaningful way.

            jsick Jonathan Sick added a comment -

            Ok, let's do it. I can do the following to make progress in the near term:

            • Script to dump schemas from some standard process (that gets run manually, on demand) into CSV files.
            • Commit the CSV into the pipelines_docs repo and have one page per Butler dataset (again, this is updated manually) that includes a reStructuredText csv-table.

            What datasets should be covered? processCcd.py provides src. Any others? A deepCoadd_forced_src?

            I think I'd prefer to use this ticket for this near-term solution since the more durable documentation of datasets will probably require several tickets and it's not designed yet.
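            A rough sketch of the schema-to-CSV step, with made-up file names, is below; the resulting file could then be pulled into a page via the reStructuredText csv-table directive's :file: option.

            # Sketch of the proposed schema dump: one CSV per dataset listing the
            # field name, type, units, and doc string. File names are made up.
            import csv

            import lsst.afw.table as afwTable

            catalog = afwTable.SourceCatalog.readFits("src.fits")
            with open("src_schema.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["Name", "Type", "Units", "Doc"])
                for item in catalog.schema:
                    field = item.field
                    writer.writerow(
                        [field.getName(), field.getTypeString(), field.getUnits(), field.getDoc()]
                    )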

            jsick Jonathan Sick made changes -
            Epic Link DM-5646 [ 23496 ] DM-7500 [ 26629 ]
            mjuric Mario Juric made changes -
            Labels pipelines-docs dm-sst pipelines-docs
            Parejkoj John Parejko made changes -
            Link This issue relates to DM-6887 [ DM-6887 ]
            Parejkoj John Parejko made changes -
            Link This issue relates to DM-13139 [ DM-13139 ]
            swinbank John Swinbank made changes -
            Link This issue relates to DM-2297 [ DM-2297 ]
            jsick Jonathan Sick made changes -
            Epic Link DM-7500 [ 26629 ] DM-12790 [ 36408 ]
            jsick Jonathan Sick made changes -
            Epic Link DM-12790 [ 36408 ] DM-14522 [ 86285 ]
            lguy Leanne Guy added a comment -

            What is the status of this ticket? Colin Slater, Gregory Dubois-Felsmann: could we address this as part of the work on the Data Model (https://confluence.lsstcorp.org/display/DM/LSST+Data+Model)?

            lguy Leanne Guy made changes -
            Risk Score 0
            gpdf Gregory Dubois-Felsmann made changes -
            Remote Link This issue links to "Page (Confluence)" [ 18030 ]
            gpdf Gregory Dubois-Felsmann added a comment -

            I've taken an action to read through this ticket thoroughly from scratch and think about it in the context of the data model work.

            jsick Jonathan Sick added a comment - - edited

            It's been a while and I have some fresh perspective, especially having engineered task documentation.

            It sounds like there are now possibly two different things that we're talking about. I think the original request was to document Butler datasets, and so I'm going to stick with that scope. Documenting our databases and data products ("LSST Data Model") also needs to be done, but that's a different thing and needs a different ticket from what I can see.

            For Butler datasets, I now believe that I can create canonical documentation topics in pipelines.lsst.io for each dataset. These topics will be linked to the tasks that generate and transform them. I think that from the ground up we can document how each task modifies a table schema or modifies metadata, for example, and that information can flow into both the published documentation for a task and the canonical documentation for a Butler dataset.

            What we mentioned last November still stands, that we can't publish a table of dataset columns that's 100% relevant to any particular pipeline. But with the system I've started to build, we can certainly give users all the tools they need to identify what columns might be part of their datasets, and expose knowledge about the task that generated those columns and what those columns mean. Again, this strategy is particular to the pipelines.lsst.io documentation and Butler datasets.

            Parejkoj John Parejko added a comment -

            Since I originally filed this: I was specifically looking for a web page that documents the _flag fields that get set in our catalogs. We should be able to extract that from a "typical" run on some data (say, ci_hsc). More broadly, it would be very useful if that web page had descriptions of all of the fields our "typical" catalogs contain.

            It seems silly to me that a user has to go and read in a table and look at its schema to understand what sorts of fields the LSST software could produce.

            jsick Jonathan Sick added a comment -

            Got it. I think I have that covered.

            gpdf Gregory Dubois-Felsmann added a comment -

            It seems to me that we definitely need both Task-level documentation and LSST-data-model-level documentation on this. There's no guarantee that any packed flag words in the released data model will be the outputs of single algorithmic Tasks - we may well combine flags from multiple Tasks. (Of course, as a matter of good implementation practice the combiner itself might be a Task, but its selections of what to combine would be more likely to be configuration.)

            Ideally we could point back from the data model documentation, for most (perhaps all, depending on whether any combining is done) flags, to the Task documentation that defines its algorithmic meaning.

            jsick Jonathan Sick made changes -
            Epic Link DM-14522 [ 86285 ] DM-5646 [ 23496 ]
            gpdf Gregory Dubois-Felsmann added a comment -

            I'm taking another look at this; where do we stand on the production of flag values in the output of the SDM standardization? Has documentation of those flags been thought about as part of that work? Colin Slater, maybe?

            Parejkoj John Parejko made changes -
            Link This issue relates to DM-28280 [ DM-28280 ]
            lguy Leanne Guy made changes -
            Labels dm-sst pipelines-docs DM-SST dm-sst pipelines-docs
            lguy Leanne Guy made changes -
            Labels DM-SST dm-sst pipelines-docs DM-SST pipelines-docs
            jsick Jonathan Sick made changes -
            Epic Link DM-5646 [ 23496 ] DM-30576 [ 511565 ]
            frossie Frossie Economou made changes -
            Epic Link DM-30576 [ 511565 ] DM-30577 [ 511566 ]
            Parejkoj John Parejko made changes -
            Link This issue relates to DM-4201 [ DM-4201 ]
            frossie Frossie Economou made changes -
            Epic Link DM-30577 [ 511566 ] DM-37471 [ 2888985 ]
            Parejkoj John Parejko made changes -
            Link This issue relates to DM-38513 [ DM-38513 ]
            jbosch Jim Bosch added a comment -

            As I just did with DM-28280, I'm linking DM-37544 and DM-33034 as being relevant for how I see this happening. I won't repeat everything I said there, but basically I see us declaring important catalog dataset types in pipelines, then looking at the schema files for those catalogs to find flags when generating pipelines.lsst.io docs for packages with "leaf" pipelines like drp_pipe. That will reveal (as John Swinbank said ages ago) that a lot of the docs for those flags are not very good, but that's a separate problem; we at least already have places to put that documentation.

            It's worth noting, however, that we do not have a way to propagate schema information (let alone docs) from afw.table schema datasets to parquet. I think we could now save schema information for parquet files as initOutputs, thanks to Eli Rykoff's work on parquet formatters. But we need to come up with conventions for doing so, and figure out the relationship between that task-written schema information and the stuff in sdm_schemas. It may be that it will work better to write docs for standardized schemas directly in sdm_schemas YAML files, but I worry that that's too "far away" from the code that sets the flags to be maintainable.

            jbosch Jim Bosch made changes -
            Link This issue relates to DM-33034 [ DM-33034 ]
            jbosch Jim Bosch made changes -
            Link This issue relates to DM-37544 [ DM-37544 ]
            jbosch Jim Bosch made changes -
            Rank Ranked lower
            jbosch Jim Bosch made changes -
            Rank Ranked lower

              People

              Assignee:
              jsick Jonathan Sick
              Reporter:
              Parejkoj John Parejko
              Watchers:
              Colin Slater, Eric Bellm, Gregory Dubois-Felsmann, Hsin-Fang Chiang, Jim Bosch, John Parejko, John Swinbank, Jonathan Sick, Krzysztof Suberlak, Leanne Guy, Simon Krughoff, Zeljko Ivezic
              Votes:
              1


                  Jenkins

                  No builds found.