[This first got rejected, because I sent it using another mail reader...]
On Tue, 23 Dec 2003, Bly, MJ (Martin) wrote:
> Reinstall RB - no.
> Reboot RB - no.
> Various scripts that prod things - no.
> Reboot BDII - no
> Restarting various RB services one-at-a-time - no.
> Neglect - yes!
>
> Indeed, it appears that neglect has its part to play here. Our RB is
> currently working fine having been left to its own devices. It mysteriously
> started working at about 02:18 this morning and has continued to function
> since.
>
> The problem seemed to be an unknown issue between the Workload Manager and
> the Job Controller - jobs would not be handed off to the Job Controller
> which seemed to be spending its time running in circles whinging about
> various things/jobs/events it considered `bad'.
Since LCG-1 came out, various such issues have been fixed by the WP1 folks.
On the Certification Testbed we have been hammering the LCG-2 RB pretty hard
and it did not break, so these nasty problems should soon be a thing of the
past.
> Having left it to its own devices, this situation appears to have cleared
> itself as the last of the things/jobs/events it considered `bad' was flushed
> (by some timeout?). Various other of the RB processes are still whining
> about a variety of things including `error recovering event store:
>
> /var/tmp/dg20logd_.NNNNNNN: ... error getting events jobid'
Cleanup recipe for that problem: just remove all the corresponding "*.ctl"
files and the complaints will soon stop. This is also fixed in LCG-2.
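
For what it is worth, here is a minimal sketch of that cleanup in Python. The
function name, the directory argument and the dry-run default are my own
illustration, not part of any LCG tool. It assumes the "*.ctl" files sit in a
single directory that you pass in yourself, and it only lists the matches
unless you explicitly ask it to delete them:

```python
import glob
import os


def remove_ctl_files(directory, pattern="*.ctl", dry_run=True):
    """Return the files in `directory` matching `pattern`.

    Hypothetical helper: with dry_run=True (the default) nothing is
    deleted, so the list can be checked first. With dry_run=False
    every matching file is removed.
    """
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    for path in matches:
        if not dry_run:
            os.remove(path)
    return matches
```

Call it once with the default dry_run=True, check that the list holds only the
control files belonging to the event stores being complained about, then call
it again with dry_run=False. The directory and any narrower pattern you pass
are assumptions to verify on your own RB.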
Cheers,
Maarten