JISCMail - LCG-ROLLOUT Archives




On Jul 21, 2008, at 10:48 AM, Arnau Bria wrote:

> Hi all,
>
> last week we had some problems with our CEs. They had a high load
> average (our record: 163).
>
>
> Our new one, with gLite 3.1, reached a load average of about 50:
>
> [root@ce05 ~]# uptime
> 12:17:44 up 2 days,  2:06,  3 users,  load average: 49.77, 22.96, 12.31
>
> As some users have endpoints hardcoded, we see thousands of queries to a
> CE from the same user, e.g. lhsgm003 ran 4830 jobs in our batch system last
> Friday, and we had 4948 queries to our CE from that user that day. We also
> had many queries from other users:
>
>   1304 atprd020
>    353 cmprd029
>     54 cms019
>     69 cms057
>     95 cms072
>     52 cms086
>     17 cms098
>     45 cms100
>     48 cms127
>    128 cms163
>     17 cms167
>      7 dteam004
>     24 dteam018
>    479 dteam020
>     41 lhcb016
>     25 lhcb050
>     11 lhcb080
>     19 lhcb089
>      6 lhcb104
>      5 lhprd011
>     13 lhprd025
>   1063 lhprd026
>    135 lhprd027
>   4948 lhsgm003
>     12 lhsgm004
>   1361 lhsgm006
>
> [...]
>
> So at any moment we could have about 150 globus-job-managers running at the
> same time. In our record case (ce07, with a load average of 163) we saw
> 2000 job-managers.
>
> So my question is: what could we do to prevent this problem? What
> could we do if we see it again? Ban some users?
>
>
> Cheers,
> Arnau
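
For reference, the per-user tally quoted above can be filtered to show only the heaviest submitters. The following is a minimal sketch, not the method used in the original message: it assumes the input is already in the "count user" form shown in the list, and the function names and the threshold of 1000 queries are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def parse_tally(text: str) -> Dict[str, int]:
    """Parse 'count user' lines (optionally '>'-quoted) into {user: count}."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        # Drop mail quoting and surrounding whitespace, then split on blanks.
        parts = line.lstrip("> ").split()
        if len(parts) == 2 and parts[0].isdigit():
            user = parts[1]
            counts[user] = counts.get(user, 0) + int(parts[0])
    return counts


def heavy_users(counts: Dict[str, int], threshold: int) -> List[Tuple[str, int]]:
    """Return (user, count) pairs at or above threshold, largest first."""
    selected = [(user, n) for user, n in counts.items() if n >= threshold]
    return sorted(selected, key=lambda pair: -pair[1])


# A few rows copied from the tally in the quoted message.
SAMPLE = """
>   1304 atprd020
>    479 dteam020
>   1063 lhprd026
>   4948 lhsgm003
>   1361 lhsgm006
"""

if __name__ == "__main__":
    # The threshold is an arbitrary illustrative value, not a recommendation.
    for user, n in heavy_users(parse_tally(SAMPLE), threshold=1000):
        print(f"{n:6d} {user}")
```

Run against the full list, this would make it easier to see which accounts dominate the query load before deciding on any action.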