Hello,
>
> One of the issues is that it works very well when the cluster is small, but problems arise when the cluster becomes big or there is an exceptionally high number of running or queuing jobs. It is difficult to replicate this kind of problem, and it's intermittent too.
Kashif's onto something here: the sites having the most problems with
cream seem to be the sites with the biggest pool of workers behind their
CEs. I'm trying to figure out what's going on at Lancaster, things don't
seem "quite right", but I haven't had the time to sit down and give it a
good thorough debugging (in order to gain a meaningful amount of
information with which to open a support ticket). My gut feeling is that
the batch/blah/cream communication isn't quite up to scratch.
Finally, I'm pretty sure the only reason we're not restarting cream
regularly is that our cream sits on a fairly new 24GB RAM, 8 core,
mirrored disk box where the cream can hog all the resources it wants
(and even then we've seen load problems in the past).
Cheers,
Matt
>
> Kashif
>
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Jeremy Coles
> Sent: 21 September 2011 11:03
> To: [log in to unmask]
> Subject: Re: CREAM problems (not just SGE)
>
> Hi Ewan
>
> I understand. But if there is no point in you filing a ticket, then it becomes quite difficult for me (or others) to escalate the 'problems' or concerns subsequently expressed about the quality of the product. This latest thread started with me trying to understand if we had a case for arguing to keep the LCG-CEs a while longer (SGE issues aside) while problems with the CREAM CE are being addressed. The problems are not being addressed and there is no clear argument (other than those associated with SGE).
>
> Jeremy
>
>
>
> On 21 Sep 2011, at 10:20, <[log in to unmask]>
> wrote:
>
>>> -----Original Message-----
>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>> [log in to unmask]] On Behalf Of Stuart Purdie
>>>
>>>
>>> I have had a problem with very similar symptoms, but possibly a different
>>> cause - in the other case we noticed the blah npudir was stuffed, and that
>>> was the direct cause, but why that was the case is not clear.
>> I think that part of the problem we've had is that we're not
>> always diagnosing the problems all that much - we're just
>> kicking it until it starts working, and there's not a lot
>> of point in filing tickets for 'it broke, we kicked it, it
>> worked again'.
>>
>> Ewan