  Data Management / DM-3536

Czar Failover in xrdssi


    Details

    • Type: Epic
    • Status: To Do
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: xrootd
    • Labels: None
    • Epic Name: Add Support for czar failover in xrdssi
    • Story Points: 20
    • WBS: 02C.06.02.03
    • Team: Data Access and Database

      Description

      In production we will need to recover from czar failures by automatically failing over to a different czar. This epic involves designing and implementing the interfaces in the xrootd xrdssi framework that will be needed to support czar failover.

        Attachments

          Issue Links

            Activity

            abh Andy Hanushevsky added a comment -

            The following proposal is to allow the czar to fail over to another czar and recover any pending requests.

            In general, when a czar fails (i.e., loses its network connection to a worker), the worker can be configured to hold on to requests for a specific amount of time before they are cancelled. Note that explicit cancellation requests from the czar are always honored; cancellation deferral only applies to cancellations due to czar failure.
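
            A minimal sketch of how this cancellation deferral could look on the worker side; all names here are illustrative stand-ins rather than actual xrootd/SSI code, and the hold time is assumed to come from worker configuration:

                // Illustrative only: worker-side table of requests held after a czar failure.
                #include <chrono>
                #include <map>
                #include <string>

                using Clock = std::chrono::steady_clock;

                struct HeldRequest {
                    Clock::time_point holdUntil;       // deadline after which the request is cancelled
                    bool explicitlyCancelled = false;  // explicit czar cancels are always honored
                };

                class WorkerHoldTable {
                public:
                    explicit WorkerHoldTable(std::chrono::seconds holdTime) : holdTime_(holdTime) {}

                    // Czar connection lost: keep the request alive until the hold deadline.
                    void onCzarFailure(const std::string& queryId) {
                        held_[queryId].holdUntil = Clock::now() + holdTime_;
                    }

                    // A specific cancellation request from the czar is honored immediately.
                    void onExplicitCancel(const std::string& queryId) {
                        held_[queryId].explicitlyCancelled = true;
                    }

                    // Periodic sweep: returns true once the request should really be cancelled.
                    bool shouldCancel(const std::string& queryId) const {
                        auto it = held_.find(queryId);
                        if (it == held_.end()) return false;
                        return it->second.explicitlyCancelled || Clock::now() >= it->second.holdUntil;
                    }

                private:
                    std::chrono::seconds holdTime_;
                    std::map<std::string, HeldRequest> held_;
                };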

            A new czar must first recover all sessions at each worker (presumably the czar has recorded this information). To make this easier, a new field named rUser is added to the resource object. This field should contain the “query ID” when the resource is initially provisioned. This allows the SSI framework to keep track of sessions by usage (i.e. query identifier). It also allows messages to include the query ID so that one can easily backtrack to the query in question if a problem occurs.
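
            For illustration, a sketch of how a czar might fill in the proposed rUser field at provision time; the resource type below is a stand-in, not the actual SSI resource class:

                // Illustrative stand-in for the SSI resource object, showing the proposed
                // rUser field carrying the query ID when the resource is provisioned.
                #include <cstdint>
                #include <string>

                struct ResourceSketch {
                    std::string rName;   // resource being provisioned (e.g. a chunk/database path)
                    std::string rUser;   // proposed field: the czar's query ID, used to track the session
                };

                ResourceSketch makeResource(const std::string& chunkPath, std::uint64_t queryId) {
                    ResourceSketch res;
                    res.rName = chunkPath;
                    res.rUser = std::to_string(queryId);   // lets the framework and its messages carry the query ID
                    return res;
                }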

            An additional option will be added to Provision() that indicates that the czar wishes to recover a session rather than create a new one (currently that option is called “Recover”). The Provision() call must be directed to the actual endpoint holding the session to be recovered. The endpoint host name can always be obtained via the session object’s Location() method. So, recovering a session simply returns a session object attached to the previously created session with the same query ID. Alternatively, we could add a new method to the service object, rather than overloading an existing one, if that seems a better fit.
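
            A sketch of this recovery path, assuming a “Recover” flag on provisioning and a Location()-style endpoint lookup; the names and signatures are assumptions based on the proposal, not the shipped SSI API:

                // Hypothetical recovery provisioning: the new czar directs the call at the
                // endpoint that still holds the session and gets back a session object
                // attached to the previously created session with the same query ID.
                #include <map>
                #include <memory>
                #include <string>
                #include <utility>

                class SessionSketch {
                public:
                    SessionSketch(std::string endpoint, std::string queryId)
                        : endpoint_(std::move(endpoint)), queryId_(std::move(queryId)) {}
                    // Mirrors the role of the session object's Location() method: the host
                    // name a czar records so recovery can be directed at the right node.
                    const std::string& location() const { return endpoint_; }
                    const std::string& queryId()  const { return queryId_; }
                private:
                    std::string endpoint_;
                    std::string queryId_;
                };

                enum class ProvisionMode { Create, Recover };

                // heldSessions models the sessions a worker kept alive after a czar failure.
                // Recover returns the existing session for this query ID, or nullptr if it
                // is gone (e.g. the hold timeout expired); Create always makes a fresh one.
                std::shared_ptr<SessionSketch> provision(
                        std::map<std::string, std::shared_ptr<SessionSketch>>& heldSessions,
                        const std::string& endpoint, const std::string& queryId,
                        ProvisionMode mode) {
                    if (mode == ProvisionMode::Recover) {
                        auto it = heldSessions.find(queryId);
                        return (it == heldSessions.end()) ? nullptr : it->second;
                    }
                    auto session = std::make_shared<SessionSketch>(endpoint, queryId);
                    heldSessions[queryId] = session;
                    return session;
                }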

            The trick now is how to recover a pending request. In the czar’s case, there is a one-to-one correspondence between a session object and a request object, so request recovery is pretty simple. I will not go into what needs to happen when multiple requests are in progress within the context of a single session, as this is not the case for qserv (though I have a scheme for that).

            To recover a request, we introduce a new method in the session object called RecoverRequest(). It looks identical to ProcessRequest(). However, in this case, the passed request object is reattached to the currently existing request object at the worker node. After the request object is attached, everything continues from where it left off. This implies that responses may be posted by a worker to a disconnected request. The SSI framework simply holds on to the response until a reconnect happens.
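
            Sketched below is how the proposed RecoverRequest() might behave on the worker side; the request and response types are stand-ins, and the buffering of responses for a disconnected request is modeled with a simple queue:

                // Illustrative only: reattaching a czar-side request object to the request
                // still held at the worker, and delivering responses that were buffered
                // while the czar was disconnected.
                #include <deque>
                #include <string>

                class RequestSketch {
                public:
                    virtual ~RequestSketch() = default;
                    virtual void processResponse(const std::string& data) = 0;
                };

                class WorkerRequestState {
                public:
                    // Worker produced a response; with no czar attached it is simply held.
                    void postResponse(const std::string& data) {
                        if (attached_) attached_->processResponse(data);
                        else pending_.push_back(data);
                    }

                    // RecoverRequest(): reattach the (new) czar's request object and flush
                    // everything that accumulated while the czar was away; processing then
                    // continues from where it left off.
                    void recoverRequest(RequestSketch& req) {
                        attached_ = &req;
                        while (!pending_.empty()) {
                            attached_->processResponse(pending_.front());
                            pending_.pop_front();
                        }
                    }

                private:
                    RequestSketch* attached_ = nullptr;
                    std::deque<std::string> pending_;
                };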

            In each of the calls, an error is indicated if the session no longer exists or if the request no longer exists. This can happen if the hold timeout is exceeded or if some data has already been sent to the czar before the czar failed. The assumption here is that the czar will simply re-issue all non-recoverable requests.
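
            On the czar side, the policy this implies might look roughly like the following; the recovery call, its status codes, and the helper functions are all hypothetical stand-ins:

                // Sketch of the czar-side fallback: try to recover each recorded request;
                // anything the worker reports as gone is simply re-issued from scratch.
                #include <string>
                #include <vector>

                enum class RecoverStatus { Recovered, NoSuchSession, NoSuchRequest };

                // Stand-in for the real recovery path (Provision with Recover, then
                // RecoverRequest); here it always reports the session as gone.
                RecoverStatus tryRecover(const std::string& endpoint, const std::string& queryId) {
                    return RecoverStatus::NoSuchSession;
                }

                // Stand-in for rebuilding and resubmitting the query's request.
                void reissue(const std::string& queryId) {}

                struct RecordedWork { std::string endpoint; std::string queryId; };

                void recoverAll(const std::vector<RecordedWork>& recorded) {
                    for (const auto& work : recorded) {
                        if (tryRecover(work.endpoint, work.queryId) != RecoverStatus::Recovered) {
                            reissue(work.queryId);   // non-recoverable: start the request over
                        }
                    }
                }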

            Two interesting side-effects appear. First, it provides the opportunity to actually disconnect from running requests and to reconnect later, either using the same czar or some different czar. This opens up new strategies to load balance requests. Secondly, it would be possible to locate all of the workers handling a particular query ID using the standard look-up facilities. That said, this is not what I am proposing unless people feel this is something worth pursuing.

            I should also note that the XRootD framework is already capable of handling this kind of processing; these capabilities just need to be exposed through the SSI interface with some work in the SSI plugin.

            Please comment on what you think of this scheme.

            salnikov Andy Salnikov added a comment -

            Could "holding on" one request on worker side interfere in any way with other requests from other czars? I'm thinking about shared scans for example, it would be bad if killing one czar freezes all shared scans. Or does it needs some special coding in our plugins to avoid this?

            abh Andy Hanushevsky added a comment -

            Thank you for bringing that up. The way it is coded, I believe, is that results of a shared scan are transmitted as the scan proceeds. So, yes, holding the request would stall the shared scan. Either the results would need to be buffered inside the qserv worker, or the framework would need to be able to pause the request so that it stops and then resumes or restarts after the reconnect (a rather messy affair). Alternatively, one could indicate at the time the request is submitted whether or not it is recoverable. That would provide a mechanism to cancel shared scans but leave the ad hoc ones alone. That, of course, is not ideal either. In the end, it seems that qserv would need to write the shared scan results out elsewhere and transmit them back after the full scan is done. While you would get better streaming results and likely less memory contention, making it complete faster, it does add disk I/O. Let's just say shared scans are problematic. In fact, the likelihood that a shared scan becomes unrecoverable is rather high: since it is a long-running query, it will have transmitted some data back before it completes, automatically making it unrecoverable. The time window is big enough to raise the probability of a czar failure during the scan.

            jgates John Gates added a comment -

            If the data from a shared scan is being sent back to the czar that started the query, there doesn't seem to be much point in switching the running query to a new czar, as that czar will need to start from scratch. Might as well cancel the original query and start over, which isn't desirable. So, it's probably better to store the data in an additional location. Maybe keep a copy of the results on the worker until the scan is done, use a shared file system, or have a predefined fail-over czar that also gets results? idk.

            abh Andy Hanushevsky added a comment -

            Well, I have been thinking some more about this. If we assume that a czar failure is a rare event (which it had better be), then spending a huge amount of effort on fail-over rather than on improving reliability seems like misguided activity. So, if the czar fails less than 1% of its running time (which should be the target), I'd say just deep-six the queries and start over again. It's a lot simpler and less error prone.

            jbecla Jacek Becla added a comment -

            As long as we can restart the affected async queries automatically, rather than throwing the error message back at the user and expecting the user to resubmit, that might be ok.

            abh Andy Hanushevsky added a comment -

            Certainly, that we should be able to do! No reason to tell the user – it will just take a bit longer.


              People

              Assignee:
              abh Andy Hanushevsky
              Reporter:
              fritzm Fritz Mueller
              Watchers:
              Andy Hanushevsky, Andy Salnikov, Fritz Mueller, Jacek Becla, John Gates

                Dates

                Created:
                Updated:
