Hi Rod,
> Hi,
> I expect CondorG submitted pilots from ATLAS are part of the perceived
> problem.
I haven't looked into Condor-G submission myself, but isn't it possible
to put other.GlueCEStateStatus == "Production" (or the equivalent) in
the JDL, or whatever submit file is in use?
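
For what it's worth, here is a rough sketch (in Python) of the kind of
submitter-side check I mean. I'm assuming the pilot factory can get
LDIF-style text out of the info system; the function name
production_ces, the sample data and the CE host names are all made up
for illustration, and I'm not claiming this is how Condor-G or any
ATLAS tool actually works.

```python
def production_ces(ldif_text):
    """Return the GlueCEUniqueID of each entry whose GlueCEStateStatus is Production.

    Assumption: entries are separated by blank lines and use plain
    "attribute: value" lines, as in LDIF output from an info system query.
    """
    ces = []
    current_id = None
    status = None
    # The extra empty string makes sure the last entry is flushed too.
    for line in ldif_text.splitlines() + [""]:
        line = line.strip()
        if not line:
            # End of an entry: keep it only if it is open for business.
            if current_id is not None and status == "Production":
                ces.append(current_id)
            current_id, status = None, None
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not an "attribute: value" line
        value = value.strip()
        if key == "GlueCEUniqueID":
            current_id = value
        elif key == "GlueCEStateStatus":
            status = value
    return ces


# Hypothetical sample data, not taken from any real site.
sample_ldif = """\
GlueCEUniqueID: ce1.example.org:2119/jobmanager-condor-atlas
GlueCEStateStatus: Closed

GlueCEUniqueID: ce2.example.org:2119/jobmanager-pbs-atlas
GlueCEStateStatus: Production
"""

print(production_ces(sample_ldif))
```

A submitter that only sent pilots to the CEs returned here would leave
a "Closed" queue alone without the site having to touch its batch
system.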
> If you want to drain the site for downtime, then you should stop the
> queues in plenty of time beforehand, i.e. don't start jobs. It is not
> sufficient to say you are closed, lock the door. Torque qstop (or
> qdisable) will stop new job starts and submissions and this has some
> effect on the info, but it doesn't rely on the info to block submissions.
And if you don't think only about PBS/Torque, what about Condor? In my
experience, "condor_off -peaceful" simply doesn't work for grid jobs run
on multi-core nodes, maybe for obvious reasons. Also, why should I
shut down or disable (if that's what you are suggesting) the entire
batch system for something that has absolutely nothing to do with
Condor itself and everything to do with the grid middleware? In my case,
I need to rearrange the pool accounts here (and also roll out the prd
pool accounts), which concerns only the grid system and is totally
independent of Condor. So one should be able to carry on with the
rollover while keeping Condor alive. Our farm is used for other
purposes/jobs as well. I don't see any point in stopping others from
running jobs only because grid submissions need some attention.
Also, what exactly counts as "plenty of time beforehand"? Doesn't it
mean a longer "downtime"? If a "Closed" CEStateStatus is published by
the info system and the site is not in downtime, the SAM tests will show
errors.
> Info can be wrong - you publish
>
> GlueHostMainMemoryRAMSize: 1024
> GlueHostArchitectureSMPSize: 2
What does this GlueHostArchitectureSMPSize mean?
Cheers,
Santanu
>
>
>
> On Mon, 15 Sep 2008, Santanu Das wrote:
>
>> Hi Steve,
>>> On Mon, Sep 15, 2008 at 3:57 PM, Santanu Das
>>> <[log in to unmask]> wrote:
>>>
>>>> Greetings all!!
>>>>
>>>> What should a site do about jobs from VOs and/or users who don't
>>>> pay attention to the GlueCEStateStatus?
>>>>
>>>> Our site is logged "down" and all the queues have been published as
>>>> "Closed" since midnight, but jobs from several VOs keep coming in.
>>>> So, can I simply remove all the queued/running jobs?
>>>>
>>>
>>> It depends how ruthless you want to be.
>>>
>>> Deleting the jobs is okay, and then when the support requests come
>>> let them know what happened, but it's best if you contact the users
>>> in question and try to find out why they are not respecting your
>>> GlueCEStateStatus first.
>>>
>> That's what I did, but that forces the site to reschedule the work
>> plan, which sometimes brings other inconveniences. I put out
>> "down" and "Closed" last night for a day, assuming the site would be
>> free in the morning, which did not happen. Now, we are actually
>>
>> 1. down for nothing for the entire day,
>> 2. unable to do what I was supposed to do today,
>> 3. plus I need to log another downtime
>> 4. and reschedule my to-do list.
>>
>>
>> A couple of VOs (and their users) are just a pain in the neck, and it
>> is almost certain that they are not going to get back with anything.
>> So the whole thing ends up a complete mess, which causes huge
>> inconvenience to the site.
>>
>> Cheers,
>> Santanu
>>
>>
>