Andy Hanushevsky sent the following:
Here are the directives that you may want to try if overloaded servers are causing a problem:
cms.delay lookup 2 qdl 30
The fxhold simply says do not discard items in the cache until they are 7 days old. I suppose we should put on the docket the option to say "never".
The delay simply says that the cmsd will have a 30 second window (qdl) within which to establish resource location. If the location is not known when the client asks and can't be established within 178 milliseconds, then the client waits for 2 seconds and re-asks. If after 30 seconds there is no response, then the resource does not exist or is not available. We can tune the 178 millisecond window, but we need a better feel for how many requests will hit the cmsd at the same time before we change that value; otherwise we are just shooting in the dark. The 30 second window can be increased if need be for qserv, as we will never be looking for things that don't exist, but let's start off with 30 seconds. An explanation of how all of this works can be found at:
(the graphic seems to be missing though, sigh).
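To get a feel for the timing described above, here is a minimal Python sketch of what a client would experience. It is a simplified model, not the cmsd's actual algorithm: the function name and parameters are hypothetical, and it assumes that each re-ask happens one lookup delay after the previous answer and that the 30 second qdl window is measured from the first request. All times are in milliseconds.

```python
def resolve(time_to_locate_ms, fast_window_ms=178, lookup_delay_ms=2000, qdl_ms=30000):
    """Return (outcome, elapsed_ms, reasks) for one client lookup.

    time_to_locate_ms is how long after the first request some server
    reports having the resource, or None if no server ever responds.
    This is an illustrative model only (assumed retry timing).
    """
    # First ask: the location is known or established inside the fast window.
    if time_to_locate_ms is not None and time_to_locate_ms <= fast_window_ms:
        return ("found", time_to_locate_ms, 0)

    # Otherwise the client is told to wait and re-ask.
    elapsed = fast_window_ms
    reasks = 0
    while True:
        elapsed += lookup_delay_ms
        reasks += 1
        # A response only counts if it arrived inside the qdl window.
        if time_to_locate_ms is not None and time_to_locate_ms <= min(elapsed, qdl_ms):
            return ("found", elapsed, reasks)
        # Window exhausted: the resource is treated as missing or unavailable.
        if elapsed >= qdl_ms:
            return ("not found", elapsed, reasks)


if __name__ == "__main__":
    for t in (100, 1000, 29900, None):
        print(t, resolve(t))
```

Under these assumptions, a resource located 1 second after the first request costs the client one re-ask, while a resource that no server ever claims keeps the client waiting for the whole window before it is declared missing.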
The trace simply says to record what you are looking for and who responds.
All of these directives need only be specified for the redirector.
For the server (worker) cmsd's it would be good to nice them up as high as you are comfortable. That would allow the kernel to dispatch the cmsd as soon as it can to answer questions. The cmsd doesn't use much resource at all so there won't be any impact.
Now, that said, the issue here is that if there is a huge amount of I/O going on in a worker, things will not be dispatched all that quickly (at least that's what I've seen). It's also likely that the machine is being thrashed and productivity is plummeting. So, it would behoove us to better control the load so we don't get into a "non-responsive" situation, as at that point we can't tell if the node is broken or just slow. When you run 25 shared scans, what are the load figures from "uptime"? Anything above 1 after dividing by the number of cores is worrisome. If you hit 5 you are likely going to see large delays.
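The per-core check above can be done by hand from the "uptime" output, or with a small script run on a worker. The following Python sketch is only an illustration: the function names and labels are hypothetical, and it assumes the figure of 5 is also meant per core, which the text does not state outright. os.getloadavg() is only available on Unix-like systems.

```python
import os


def classify(load_per_core):
    """Label a per-core load figure using the rule of thumb above."""
    if load_per_core >= 5:
        return "large delays likely"
    if load_per_core > 1:
        return "worrisome"
    return "ok"


def report():
    """Print the 1, 5 and 15 minute load averages divided by the core count."""
    cores = os.cpu_count() or 1      # cpu_count() can return None
    averages = os.getloadavg()       # same three figures that uptime shows
    for minutes, load in zip((1, 5, 15), averages):
        per_core = load / cores
        print(f"{minutes:>2} min: load {load:.2f} / {cores} cores "
              f"= {per_core:.2f} ({classify(per_core)})")


if __name__ == "__main__":
    report()
```

For example, a load average of 12 on an 8-core worker gives 1.5 per core, which this sketch would flag as worrisome.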
A nice discussion can be found here: