# Develop minimal DB-side access controls for shared NCSA registry

XMLWordPrintable

#### Details

• Type: Story
• Status: Done
• Resolution: Done
• Fix Version/s: None
• Component/s:
• Labels:
• Story Points:
2
• Team:
Data Release Production
• Urgent?:
No

#### Description

I'm going to put together a PostgreSQL-specific SQL script to set up access controls for the shared database at NCSA.  I'll try to keep these to the bare minimum necessary to avoid accidental modifications of important things.  I'll ask (at least) Christopher Stephens [X] for a review, and then we can decide how to put these in place and maintain them (probably on another ticket).  The answer may be as simple as just putting those SQL scripts in git somewhere and having an admin run them, and should not be any more complicated than having daf_butler run some snippets when it creates tables (via some kind of new configuration).

#### Activity

Hide
Jim Bosch added a comment -

Christopher Stephens [X], a question for you before I dive too deeply into this: I know we grant users permission to create tables in our shared repos (and we have to). But what does the ownership of those user-created tables look like afterwards? Is there a way we could set things up so users can still create those tables, but their ownership reverts to the same admin role that presumably owns the static tables after the dynamic tables are created?

Show
Jim Bosch added a comment - Christopher Stephens [X] , a question for you before I dive too deeply into this: I know we grant users permission to create tables in our shared repos (and we have to). But what does the ownership of those user-created tables look like afterwards? Is there a way we could set things up so users can still create those tables, but their ownership reverts to the same admin role that presumably owns the static tables after the dynamic tables are created?
Hide
Christopher Stephens [X] (Inactive) added a comment -

from the docs: {{}}

 CREATE TABLE will create a new, initially empty table in the current database. The table will be owned by the user issuing the command.

i think the only options are to "set role" to desired owner prior to create statement or issue and "alter table owner" after the create statement. if you wanted to get fancy, it looks like you could create a trigger https://blog.hagander.net/setting-owner-at-create-table-237/ but triggers are just asking for problems and should only be used when absolutely necessary imo.

Show
Christopher Stephens [X] (Inactive) added a comment - from the docs: {{}} CREATE TABLE will create a new, initially empty table in the current database. The table will be owned by the user issuing the command. i think the only options are to "set role" to desired owner prior to create statement or issue and "alter table owner" after the create statement. if you wanted to get fancy, it looks like you could create a trigger https://blog.hagander.net/setting-owner-at-create-table-237/  but triggers are just asking for problems and should only be used when absolutely necessary imo.
Hide
Jim Bosch added a comment -

Christopher Stephens [X], I've put together a pretty detailed proposal for what we might do here, and I'm afraid it's pretty dense. I'm hoping you can take a close enough look to tell me if anything here is a bad idea or otherwise inconsistent with how we'd like to do things.

Please see the PR description for some notes on how best to wade in.

Show
Jim Bosch added a comment - Christopher Stephens [X] , I've put together a pretty detailed proposal for what we might do here, and I'm afraid it's pretty dense. I'm hoping you can take a close enough look to tell me if anything here is a bad idea or otherwise inconsistent with how we'd like to do things. Please see the PR description for some notes on how best to wade in.
Hide
Christopher Stephens [X] (Inactive) added a comment -

I took a look at the dbaccess directory in https://github.com/lsst-dm/gen3_shared_repo_admin and attempted to get my head wrapped around how this would actually look in the database. On the surface, it seems like this should work logically. I'm not familiar with the requirements but none of the policies look very complicated and the expressions used to determine whether a row is visible or not look pretty straightforward depending on what some of the snippets stuff returns.

My biggest concern as is usually the case for me on this project, is performance. You have some policies using regular expressions which will likely render any indexing useless. That assumes the optimizer chooses to evaluate the expression at the proper time in the execution plan. Beyond that, a quick google for "postgresql row based security performance" shows policy evaluation in during SQL execution to be a significant problem in general. It looks like things continue to get better with each version but comments like "To summarize, RLS policies in combination with subqueries are unfortunately not each other friends performance-wise" don't bode well.

Each column is going to have any number of policies that will all have to be accounted for. That's on top of some fairly complicated, nested, dynamically generated SQL which has already presented some significant performance problems.

Performance will likely be fine when data volumes on tables with policies are low but I'm very skeptical about this scaling well. Unfortunately, I don't really have any concrete suggestions. You mentioned that this is to prevent accidental misuse and not intentional misuse. If that's the case, would it make more sense to move a lot of this into the python code since it's the only approved way to access any repo? I think that would help the optimizer and it would definitely help with troubleshooting when performance tanks.

Sorry to be so pessimistic but this is asking the database to get a lot right.

Show
Christopher Stephens [X] (Inactive) added a comment - I took a look at the dbaccess directory in https://github.com/lsst-dm/gen3_shared_repo_admin  and attempted to get my head wrapped around how this would actually look in the database. On the surface, it seems like this should work logically. I'm not familiar with the requirements but none of the policies look very complicated and the expressions used to determine whether a row is visible or not look pretty straightforward depending on what some of the snippets stuff returns. My biggest concern as is usually the case for me on this project, is performance. You have some policies using regular expressions which will likely render any indexing useless. That assumes the optimizer chooses to evaluate the expression at the proper time in the execution plan. Beyond that, a quick google for "postgresql row based security performance" shows policy evaluation in during SQL execution to be a significant problem in general. It looks like things continue to get better with each version but comments like "To summarize, RLS policies in combination with subqueries are unfortunately not each other friends performance-wise" don't bode well.  Each column is going to have any number of policies that will all have to be accounted for. That's on top of some fairly complicated, nested, dynamically generated SQL which has already presented some significant performance problems. Performance will likely be fine when data volumes on tables with policies are low but I'm very skeptical about this scaling well. Unfortunately, I don't really have any concrete suggestions. You mentioned that this is to prevent accidental misuse and not intentional misuse. If that's the case, would it make more sense to move a lot of this into the python code since it's the only approved way to access any repo? I think that would help the optimizer and it would definitely help with troubleshooting when performance tanks. Sorry to be so pessimistic but this is asking the database to get a lot right.
Hide
Jim Bosch added a comment - - edited

Finally getting back to this. I'm probably going to just close this as Won't Fix, but I'll drop some comments here on why I thought performance might not tank under this plan, on the off chance that changes how Christopher Stephens [X] thinks about it:

• To implement these kinds of checks in Python, I'd generally have to run exactly the same regular expression checks, on input rows, before inserting them into the DB. That's probably about the same speed as doing them in the DB, and if anything a bit slower, because Python is going to be a bit slower than PostgreSQL's C implementation even if most of the regex work is happening in C in both cases. That's in the single-user case, of course. Putting all of the regex-crunching on the server is admittedly something that doesn't scale as well.
• There are no row-level read constraints in these policies, just write constraints, and I think all of our INSERT, DELETE and UPDATE queries are pretty trivial - it's only the SELECTs that get huge and complicated (though maybe those DELETEs against FKs with ON DELETE CASCADE are a bit trickier). So if PostgreSQL is smart enough to realize that the row-level security is a no-op on those SELECT queries, it shouldn't affect those. I have no idea how likely that "if" is to be true.

Anyhow, the alternative to doing this in the DB is certainly doing it in Python, as you suggested, and the motivation for doing it in the DB is more about the fact that I don't want to implement my own rules-based system for describing database entity ownership and access controls when PostgreSQL already has one. We'll have to do something like that eventually for the client/server butler that the science platform will use, but I worry that the ownership/permission model for that will be sufficiently different from what we want now that I'll end up writing two such systems, and I'd prefer not to write any yet.

Show
Jim Bosch added a comment - - edited Finally getting back to this. I'm probably going to just close this as Won't Fix, but I'll drop some comments here on why I thought performance might not tank under this plan, on the off chance that changes how Christopher Stephens [X] thinks about it: To implement these kinds of checks in Python, I'd generally have to run exactly the same regular expression checks, on input rows, before inserting them into the DB. That's probably about the same speed as doing them in the DB, and if anything a bit slower, because Python is going to be a bit slower than PostgreSQL's C implementation even if most of the regex work is happening in C in both cases. That's in the single-user case, of course. Putting all of the regex-crunching on the server is admittedly something that doesn't scale as well. There are no row-level read constraints in these policies, just write constraints, and I think all of our INSERT, DELETE and UPDATE queries are pretty trivial - it's only the SELECTs that get huge and complicated (though maybe those DELETEs against FKs with ON DELETE CASCADE are a bit trickier). So if PostgreSQL is smart enough to realize that the row-level security is a no-op on those SELECT queries, it shouldn't affect those. I have no idea how likely that "if" is to be true. Anyhow, the alternative to doing this in the DB is certainly doing it in Python, as you suggested, and the motivation for doing it in the DB is more about the fact that I don't want to implement my own rules-based system for describing database entity ownership and access controls when PostgreSQL already has one. We'll have to do something like that eventually for the client/server butler that the science platform will use, but I worry that the ownership/permission model for that will be sufficiently different from what we want now that I'll end up writing two such systems, and I'd prefer not to write any yet.
Hide
Christopher Stephens [X] (Inactive) added a comment -

oh wow. i completely missed the fact that these policies are for writes only. that should have been obvious to me. i'd be far less concerned about write performance since there are fewer places postgres can apply policies in any given execution plan. i agree that the database is the proper place to enforce something like this and will likely perform better than any implementation in python.

sorry for not thinking about this more carefully Jim Bosch

we obviously won't know for certain how this will affect performance until we can test but i think its worth the effort assuming that the security requirements are accurate and unlikely to change.

Show
Christopher Stephens [X] (Inactive) added a comment - oh wow. i completely missed the fact that these policies are for writes only. that should have been obvious to me. i'd be far less concerned about write performance since there are fewer places postgres can apply policies in any given execution plan. i agree that the database is the proper place to enforce something like this and will likely perform better than any implementation in python. sorry for not thinking about this more carefully  Jim Bosch .  we obviously won't know for certain how this will affect performance until we can test but i think its worth the effort assuming that the security requirements are accurate and unlikely to change.
Hide
Jim Bosch added a comment -

Excellent! I'm going to merge these policies to the repo (since it's a brand-new repo to-be full of unofficial scripts, anyway), and call this Done, and come back to this on another ticket that involves actual code in daf_butler for emitting the grant and policy statements when we emit DDL.

Show
Jim Bosch added a comment - Excellent! I'm going to merge these policies to the repo (since it's a brand-new repo to-be full of unofficial scripts, anyway), and call this Done, and come back to this on another ticket that involves actual code in daf_butler for emitting the grant and policy statements when we emit DDL.

#### People

Assignee:
Jim Bosch
Reporter:
Jim Bosch
Reviewers:
Christopher Stephens [X] (Inactive)
Watchers:
Christopher Stephens [X] (Inactive), Jim Bosch