Hi Ian,

> I think I asked this question once before, but in any case:

I also think that I have answered them before (I think even in mails to this mailing list, but I would have to check), and the answers are available in various summary talks etc...

> 1) Why does the RB die?

Until recently there were 3 main causes:

1. The MDS could not cope ... It would die, no jobs matched ... whole thing screwed. This will be properly solved with RGMA (we hope), but until then a kludge called the BDII was introduced. This means that the RB always has some information on which to match, even if it is out of date. This is a kludge, but it makes the whole thing stable. To quote from our Babar friends' tests, this meant that the efficiency of job completion went from ~70% to 99%.

2. There is a known bug in the STL implementation we are using that means that the postgresql database gets screwed up. The RB can limp on for a bit, but will then fall over. It will restart itself, but the db will trip it up again. This is load dependent, but it will go away with the change of compiler in the next release. Currently it means that from time to time we have to clear the database, losing all current jobs... which is a bit poor really.

3. There are about 1% of jobs that we still seem to lose, even taking the above into account. From the small amount of work that I have done on this, they all appear to have different causes (e.g. proxy expires, a CE becomes overloaded and the globus installation refuses connections...). This does require more study.

> 2) Is it reproducible (i.e. particular sequence of events), or just due to
> load?

Load can be high, but see above.

> 3) Is it fixable?
>
> 4) If so, who is working on it and when do they expect a release?

See Stephen Burke's comments.

> I'm asking because one research area I'm considering is alternative job
> matching mechanisms in a grid environment, so the knowing what causes the
> current RB problems would be really helpful.

What is your alternative?
You could always try plugging it in and seeing if it works. You could do that in the current release, but it should be even easier in the next release.

All the best,
david