Hi Ian,

>
>
> I think I asked this question once before, but in any case:
>
I think I have also answered them before (I even think in mails to this
mailing list, but would have to check), and the answers are available in
various summary talks etc...

> 1) Why does the RB die?

Until recently there were 3 main causes:
1. The MDS could not cope ... It would die, no jobs matched ... whole thing
screwed. This will be properly solved with RGMA (we hope), but until then
a kludge called the BDII was introduced. This means that the RB always has
some information on which to match, even if it is out of date. This is a
kludge but it makes the whole thing stable. To quote from our Babar
friends' tests, this meant that the efficiency of job completion went from
~70% to 99%.

2. There is a known bug in the STL implementation we are using that means
that the postgresql database gets screwed up. The RB can limp on for a bit,
but will then fall over. It will restart itself but the db will trip it up
again. This is load dependent but will go away with the change of compiler
in the next release. Currently it means that from time to time we have to
clear the database, losing all current jobs... which is a bit poor really.

3. There are about 1% of jobs that we still seem to lose even taking the
above into account. From the small amount of work that I have done on
this, they all appear to have different reasons (e.g. proxy expires, a CE
becomes overloaded and the globus installation refuses connections...)
This does require more study.
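
To illustrate the idea behind the kludge in point 1 (always having some
information to match on, even if it is out of date), here is a minimal
sketch in Python. It is not the real BDII or RB code: the names
CachedInfoIndex, query_live and match, and the "name"/"free_cpus" record
fields, are all made up for illustration.

```python
import time


class CachedInfoIndex:
    """Hypothetical cache in front of an unreliable information source."""

    def __init__(self, query_live):
        # query_live: assumed callable returning a list of resource records
        self._query_live = query_live
        self._snapshot = []        # last successfully fetched records
        self.fetched_at = None     # time of the last successful fetch

    def lookup(self):
        try:
            self._snapshot = self._query_live()
            self.fetched_at = time.time()
        except Exception:
            # Live source is down: keep serving the last good snapshot,
            # accepting that it may be out of date.
            pass
        return list(self._snapshot)


def match(job, resources):
    """Toy matchmaking: names of resources with enough free CPUs."""
    return [r["name"] for r in resources if r["free_cpus"] >= job["cpus"]]
```

The trade-off is the one described above: matching may use stale data, but
it no longer fails outright whenever the information source falls over.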


>
> 2) Is it reproducible (i.e. particular sequence of events), or just due to
> load?

Load can be high but see above.

>
> 3) Is it fixable?
>
> 4) If so, who is working on it and when do they expect a release?

See Stephen Burke's comments


>
> I'm asking because one research area I'm considering is alternative job
> matching mechanisms in a grid environment, so knowing what causes the
> current RB problems would be really helpful.

What is your alternative? You could always try plugging it in and seeing
if it works. You could do that in the current release, but it should be
even easier in the next release.


All the best,
david