An update on this: thanks to the sage advice of Simon and Daniela I
managed to "resync" the WMSen and our CREAM with a slightly destructive
`UPDATE db_info SET creationTime=NOW();` on our creamdb.
The WMS now considers our database to be scratched, so it and the cream
seem to be talking once more - but currently running directly submitted
jobs don't seem to be affected.
Thanks again to Simon and Daniela.
Cheers!
Matt
Now if I can only figure out why the midmon sha-2 test is failing...
On 07/10/15 17:51, Doidge, Matthew wrote:
> Thanks Daniela.
> I'll take another look at our blah parser just in case - direct queries to the cream get job statuses so I have a feeling it's alright. But I don't trust my instincts.
>
> But from the mail you sent me I think I know what's up. One of the many things to go wrong was our cream CE lost connection to its back end disks, which completely fubared the cream's database. I naively restored the database from a backup image of our CE - and this has put our cream 3 weeks out of sync with all the WMSen. Maybe. I suspect I need to ask the cream devs how to fix this, preferably without any more downtime or job losses!
>
> Thanks again,
> Matt
> ________________________________________
> From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Daniela Bauer [[log in to unmask]]
> Sent: 07 October 2015 16:37
> To: [log in to unmask]
> Subject: Re: CREAM and WMSes no longer talking.
>
> Hi Matt,
>
> the database hack you are referring to (last seen in 2012!) showed symptoms for a single user only; if it is all users, that might be a different issue. Unfortunately the discussion from back then wasn't on any public mailing list so I can't include a link to it.
> Before we go down that rabbit hole, are you sure your blah parser is working as it should?
>
> Cheers,
> Daniela
>
> On 7 October 2015 at 16:26, Matt Doidge wrote:
> Hello,
> After one helluva week last week we have one bugbear left tormenting us here at Lancaster - our CREAM CE has stopped updating WMSen about job statuses. So we're failing a single ops test, and users like Sno+ are having trouble: despite the jobs running fine, the users never find out about it. The problem seems to be across WMSii (thanks to Matt M for confirming this for Sno+).
>
> I've had this problem before, but for the life of me I can't remember the solution - and I can't figure out the correct incantation to coax the fix from the Oracle Google (or my mail archives) either.
>
> Has anyone in the UK seen something like this before? I'd appreciate any help. I have the horrible feeling the fix is some kind of destructive backend database hack.
>
> Thanks in advance all,
> Matt
>
>
>
> --
> Sent from the pit of despair
>
> -----------------------------------------------------------
> [log in to unmask]<mailto:[log in to unmask]>
> HEP Group/Physics Dep
> Imperial College
> London, SW7 2BW
> Tel: +44-(0)20-75947810
> http://www.hep.ph.ic.ac.uk/~dbauer/
>