The following proposal allows the czar to fail over to another czar and recover any pending requests.
In general, when a czar fails (i.e. loses its connection to a worker), the worker can be configured to hold on to requests for a specific amount of time before they are cancelled. Note that explicit cancellation requests from the czar are always honored. Cancellation deferral only applies to cancellation due to czar failure.
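A minimal sketch of the deferral rule, to make the two cancellation paths concrete. All names here (HoldPolicy, HeldRequest, CancelCause) are hypothetical and are not part of the XRootD or SSI API; the hold time would come from worker configuration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical types for illustration only; not XRootD/SSI classes.
using Clock = std::chrono::steady_clock;

enum class CancelCause { ExplicitCancel, CzarLost };

struct HeldRequest {
    bool cancelled = false;       // request has been torn down
    bool disconnected = false;    // czar connection was lost
    Clock::time_point lostAt{};   // when the connection was lost
};

class HoldPolicy {
public:
    explicit HoldPolicy(std::chrono::seconds hold) : hold_(hold) {
        assert(hold.count() >= 0);
    }

    // An explicit cancel from the czar is honored at once.
    // A lost czar only starts the hold clock.
    void onCancel(HeldRequest& r, CancelCause cause, Clock::time_point now) const {
        if (cause == CancelCause::ExplicitCancel) {
            r.cancelled = true;
            return;
        }
        if (!r.disconnected) {
            r.disconnected = true;
            r.lostAt = now;
        }
    }

    // Called periodically: cancel once the hold time has elapsed.
    void onTick(HeldRequest& r, Clock::time_point now) const {
        if (r.disconnected && !r.cancelled && now - r.lostAt >= hold_) {
            r.cancelled = true;
        }
    }

    // A czar reattached; succeeds only if the request is still held.
    bool onReconnect(HeldRequest& r) const {
        if (r.cancelled) return false;
        r.disconnected = false;
        return true;
    }

private:
    std::chrono::seconds hold_;
};
```

The time is passed in explicitly rather than read from the clock so that the rule can be exercised without waiting for a real timeout.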
A new czar must first recover all sessions at each worker (presumably the czar has recorded this information). To make this easier, a new field named rUser is added to the resource object. This field should contain the “query ID” when the resource is initially provisioned. This allows the SSI framework to keep track of sessions by usage (i.e. query identifier). It also allows messages to include the query ID so that one can easily trace back to the query in question if a problem occurs.
An additional option will be added to Provision() that indicates that the czar wishes to recover a session rather than create a new one (currently that option is called “Recover”). The Provision() call must be directed to the actual endpoint holding the session to be recovered. The endpoint host name can always be obtained via the session object’s Location() method. So, recovering a session simply returns a session object attached to the previously created session with the same query ID. Alternatively, we could add a new method to the service object instead of overloading an existing method, if that seems to be a better fit.
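To illustrate the intended semantics, here is a small self-contained model of an endpoint that indexes sessions by the query ID carried in rUser. It does not use the real SSI classes or signatures; WorkerEndpoint, ProvisionOpt, Session and the provision() function are stand-ins I made up for this sketch.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real SSI resource/session types differ.
struct Resource {
    std::string rName;   // resource name
    std::string rUser;   // proposed field: carries the query ID
};

enum class ProvisionOpt { Create, Recover };

struct Session {
    std::string queryId;
    std::string location;   // endpoint host name holding the session
};

class WorkerEndpoint {
public:
    explicit WorkerEndpoint(std::string host) : host_(std::move(host)) {}

    // Create: register a new session under the query ID.
    // Recover: return the existing session, or nullptr if it is gone
    // (for example because the hold timeout expired).
    std::shared_ptr<Session> provision(const Resource& res, ProvisionOpt opt) {
        auto it = sessions_.find(res.rUser);
        if (opt == ProvisionOpt::Recover) {
            return it == sessions_.end() ? nullptr : it->second;
        }
        auto s = std::make_shared<Session>(Session{res.rUser, host_});
        sessions_[res.rUser] = s;
        return s;
    }

    // Models the worker dropping a session after the hold timeout.
    void expire(const std::string& queryId) { sessions_.erase(queryId); }

private:
    std::string host_;
    std::map<std::string, std::shared_ptr<Session>> sessions_;
};
```

In this model a recovering czar would call provision() with ProvisionOpt::Recover on the endpoint it recorded earlier, and treat a null result as "session no longer exists".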
The trick now is how to recover a pending request. In the czar’s case, there is a one-to-one correspondence between a session object and a request object. So, request recovery is pretty simple. I will not go into what needs to happen when multiple requests are in progress within the context of a single session as this is not the case for qServ (though I have a scheme for that).
To recover a request, we introduce a new method in the session object called RecoverRequest(). It looks identical to ProcessRequest(). However, in this case, the passed request object is reattached to the currently existing request object at the worker node. After the request object is attached, everything continues from where it left off. This implies that responses may be posted by a worker to a disconnected request. The SSI framework simply holds on to the response until a reconnect happens.
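The reattach-and-hold behavior can be sketched as follows. This is a toy model under the one-request-per-session assumption, not the SSI implementation; WorkerRequest, CzarRequest and the lower-case method names are hypothetical. It also models the rule that a request is not recoverable once some data has already reached a czar.

```cpp
#include <string>
#include <vector>

// Hypothetical czar-side request object: just collects responses.
struct CzarRequest {
    std::vector<std::string> received;
    void onResponse(const std::string& chunk) { received.push_back(chunk); }
};

// Hypothetical worker-side state for the single request of a session.
class WorkerRequest {
public:
    // Normal path, analogous to ProcessRequest().
    void attach(CzarRequest* c) { client_ = c; }

    // The czar went away; keep the request alive but detached.
    void disconnect() { client_ = nullptr; }

    // The worker posts a response; it is held while no czar is attached.
    void post(const std::string& chunk) {
        if (client_) {
            deliver(chunk);
        } else {
            held_.push_back(chunk);
        }
    }

    // Analogous to the proposed RecoverRequest(): reattach and flush
    // whatever was held. Fails if data already went to an earlier czar.
    bool recoverRequest(CzarRequest* c) {
        if (sentAny_) return false;
        client_ = c;
        for (const auto& chunk : held_) deliver(chunk);
        held_.clear();
        return true;
    }

private:
    void deliver(const std::string& chunk) {
        client_->onResponse(chunk);
        sentAny_ = true;
    }

    CzarRequest* client_ = nullptr;
    bool sentAny_ = false;
    std::vector<std::string> held_;
};
```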
In each of the calls, an error is indicated if the session no longer exists or if the request no longer exists. This can happen if the hold timeout is exceeded or if some data has already been sent to the czar before the czar failed. The assumption here is that the czar will simply re-issue all non-recoverable requests.
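On the czar side, the resulting fail-over loop might look like the sketch below: try to recover each recorded job, and re-issue the ones that cannot be recovered. PendingJob, Outcome and failOver are hypothetical names, and the two callbacks stand in for whatever the czar actually does to recover (session plus request) and to re-issue.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical record the czar keeps per outstanding request.
struct PendingJob {
    std::string worker;    // endpoint host name holding the session
    std::string queryId;   // value stored in rUser at provisioning time
};

enum class Outcome { Recovered, Reissued };

// tryRecover returns true only if both the session and the request were
// reattached; any error (timeout exceeded, data already sent) yields false.
std::vector<Outcome> failOver(
        const std::vector<PendingJob>& jobs,
        const std::function<bool(const PendingJob&)>& tryRecover,
        const std::function<void(const PendingJob&)>& reissue) {
    std::vector<Outcome> outcomes;
    outcomes.reserve(jobs.size());
    for (const auto& job : jobs) {
        if (tryRecover(job)) {
            outcomes.push_back(Outcome::Recovered);
        } else {
            reissue(job);   // non-recoverable: start the request over
            outcomes.push_back(Outcome::Reissued);
        }
    }
    return outcomes;
}
```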
Two interesting side effects appear. First, the scheme makes it possible to deliberately disconnect from running requests and reconnect later, using either the same czar or a different one. This opens up new strategies for load-balancing requests. Second, it would be possible to locate all of the workers handling a particular query ID using the standard look-up facilities. That said, this is not what I am proposing unless people feel this is something worth pursuing.
I should also note that the XRootD framework is already capable of handling this kind of processing; these capabilities just need to be exposed through the SSI interface, with some work in the SSI plugin.
Please comment on what you think of this scheme.