Thanks Jan,
> > Thanks Jan, I am looking for another free slot to get a new slc4 lcgCE 3.1
> > ready so that I can smoothly migrate the CE box to the new r3.1. Indeed, we
> > have never seen this load on the new r3.1 CE, but that could be because fewer
> > jobs pass through its GK to the central batch pool. I am adding an action for
> > this anyway.
>
> It looks like the problems at your site are caused by cms001 user.
> There's a possibility that the user has an error in his job management
> scripts. Consider adding his DN to ban_users.db.
That is another possibility, though I have not yet had time to profile the
script; I will check this later. I also thought the problem could be related
to the large number of pending jobs that cms001 has submitted through the same
CE to the backend batch system: all the job manager plugins keep querying the
job status, which could result in a severe load on the CE. The batch pool has
only around 1.8k job slots, while we have more than 4.5k pending jobs in the
cms queue:
$ qstat cms | grep stdin | awk '{print $3}' | sort | uniq -c
5650 cms001
2 cms011
2 cms024
1 cms027
1 cms033
8 cms034
9 cms038
160 cmsprd
$ qstat cms | grep stdin | awk '{print $3,$5}' | sort | uniq -c
1 cms001 H
4548 cms001 Q
1099 cms001 R
2 cms011 R
2 cms024 Q
1 cms027 Q
1 cms033 Q
8 cms034 Q
9 cms038 R
158 cmsprd R
1 cmsprd W
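
To keep an eye on just the queued jobs per user, the pipeline above could be wrapped in a small helper. This is only a rough sketch: count_by_state is a made-up name, and it assumes the same qstat column layout as above (owner in column 3, state in column 5), which may differ on other batch system versions.

```shell
# Hypothetical helper, not part of any LCG or batch system tooling.
# Assumption: input lines look like the qstat output above,
# i.e. $3 = job owner and $5 = job state (Q, R, H, W, ...).
# Reads qstat-style lines on stdin and prints "count user" for the
# given state, busiest user first.
count_by_state() {
    awk -v state="$1" '
        $5 == state { n[$3]++ }                  # tally jobs per owner in that state
        END { for (u in n) print n[u], u }
    ' | sort -rn
}

# Usage (same filter as above):
#   qstat cms | grep stdin | count_by_state Q
```

Run periodically, this should show whether the cms001 backlog is draining or still growing.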
Br,
J