Hi Winnie
One of the message brokers is in downtime, and there is a known issue with SAM Nagios: failover does not work for the message brokers. Here is the message from the SAM Nagios team:
"Problem is related to the fact that the WN probe picks a random MSG from BDII.
If it picks the one in downtime it hangs until the job is aborted. This is a known issue as this component does not implement failover."
There was a temporary workaround, but it didn't work for us. I have removed some old jobs manually. Sites are not penalized if there is an infrastructure issue across all NGIs. The message broker downtime will end today at 3 PM.
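
For illustration only, the missing failover described in the SAM Nagios message could look roughly like the sketch below. This is not the actual WN probe code: the function name connect_with_failover, the broker list, and the timeout value are all hypothetical, and a plain TCP connect stands in for the real messaging client. The idea is to try the brokers in random order with a short timeout, so that a broker in downtime is skipped instead of hanging the job until the walltime limit.

```python
import random
import socket


def connect_with_failover(brokers, timeout=5.0, connect=socket.create_connection):
    """Try each (host, port) broker in random order and return the first that answers.

    Hypothetical helper: `connect` defaults to a plain TCP connect and can be
    swapped for whatever client the probe really uses.
    """
    candidates = list(brokers)
    random.shuffle(candidates)  # still a random pick, but no longer the only pick
    errors = []
    for host, port in candidates:
        try:
            # The timeout keeps a broker in downtime from blocking the job
            conn = connect((host, port), timeout)
            return (host, port), conn
        except OSError as exc:  # covers refused connections and timeouts
            errors.append(((host, port), exc))
    raise ConnectionError(f"no message broker reachable: {errors}")
```

In the probe, the broker list would come from the BDII query rather than being hard-coded, and the caller would be responsible for closing the returned connection.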
Cheers
Kashif
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Winnie Lacesso
> Sent: 25 January 2016 08:41
> To: [log in to unmask]
> Subject: All weekend :( Re: CREAM-CEs, "No handlers could be found for
> logger "stomp.py""
>
> This is still happening to what look like all UK CREAM-CEs, all weekend =
> FAIL*; opssgm jobs are hanging with error
>
> > --bdii-uri lcgbdii.gridpp.rl.ac.uk:2170 No handlers could be found for
> > logger "stomp.py"
>
> then they hit the jobslot walltime limit, after using zero CPU time
>
> Resource_List.cput=00:20:00 Resource_List.neednodes=1
> Resource_List.nodect=1
> Resource_List.nodes=1 Resource_List.walltime=00:30:00
>
> and fail.
>
> Could someone *please* go stomp on the snoozing handlers, please, &
> wake them up..?
>
>
> * so it better not be a site has to "explain" the 3 days of site FAIL ops tests!