For some reasons it looks like async requests are not managed.
Please open a ggus ticket (providing somewhere the glite-ce-cream.log*
files)
Cheers, Massimo
On 04/19/2012 09:53 AM, Mario Kadastik wrote:
> Ok, this has to be something internal to CREAM that creates those commands in the database because I just ran through CREAM logs looking for authorization requests to count how many external commands have been received and over the last MONTH we've only had about 100k job start requests:
>
> [root@europa cream]# grep AUTHORIZED glite-ce-cream.log*|sed 's+.*operation=\(.*\); REMOTE.*+\1+g' |sort|uniq -c
> 57064 getServiceInfo
> 26015 {http://www.gridsite.org/namespaces/delegation-2}getProxyReq
> 39903 {http://www.gridsite.org/namespaces/delegation-2}putProxy
> 48234 {http://www.gridsite.org/namespaces/delegation-2}renewProxyReq
> 30812 JobCancel
> 1129 JobInfo
> 149126 JobPurge
> 97879 JobRegister
> 100828 JobStart
> 337312 JobStatus
> 153393 QueryEvent
> 126239 setLease
>
> Looking at last 24h in the logs I see:
> [root@europa cream]# grep AUTHORIZED glite-ce-cream.log glite-ce-cream.log.[123]|sed 's+.*operation=\(.*\); REMOTE.*+\1+g' |sort|uniq -c
> 1374 getServiceInfo
> 175 {http://www.gridsite.org/namespaces/delegation-2}getProxyReq
> 1772 {http://www.gridsite.org/namespaces/delegation-2}putProxy
> 1996 {http://www.gridsite.org/namespaces/delegation-2}renewProxyReq
> 8586 JobCancel
> 75 JobInfo
> 140 JobPurge
> 2414 JobRegister
> 6185 JobStart
> 28597 JobStatus
> 13676 QueryEvent
> 539 setLease
>
> however since I wrote the e-mail less than 24h ago the pendind command list has increased substantially:
>
> Threshold for Number of pending commands: -1 => Detected value for Number of pending commands: 1008838
>
> so in the last 17h it's increased by almost 100k commands. Any ideas where to look to understand what is causing this?
>
> On 18.04.2012, at 18:18, Mario Kadastik wrote:
>
>> Hi,
>>
>> as far as I know we haven't had a huge amount of jobs come through lately as this node has been broken after the last high load on scheduling system. Everything seems to time out and it seems the source is MySQL that is using 150% cpu time or more (the VM has only 2 cores assigned to the CREAM CE so that would keep the load high). Looking at the CREAM database it looks like there's a huge load of commands waiting:
>>
>> mysql> use creamdb; select count(*), name from command group by name; Database changed
>> +----------+---------------------------+
>> | count(*) | name |
>> +----------+---------------------------+
>> | 60 | COPY_NEW_PROXY_TO_SANDBOX |
>> | 190 | DELETE_PROXY_FROM_SANDBOX |
>> | 3845 | JOB_CANCEL |
>> | 7731 | JOB_PURGE |
>> | 908382 | JOB_START |
>> | 495 | PROXY_RENEW |
>> | 4 | SET_JOB_STATUS |
>> +----------+---------------------------+
>> 7 rows in set (2.31 sec)
>>
>> And interestingly enough the creation time shows that it's mostly the past few days:
>>
>> mysql> select count(id) from command where creationTime<adddate(now(),-3);
>> +-----------+
>> | count(id) |
>> +-----------+
>> | 282660 |
>> +-----------+
>> 1 row in set (2.25 sec)
>>
>> mysql> select count(id) from command where creationTime<adddate(now(),-1);
>> +-----------+
>> | count(id) |
>> +-----------+
>> | 750625 |
>> +-----------+
>> 1 row in set (2.06 sec)
>>
>> Also the load monitor confirms that this is huge:
>> [root@europa ~]# glite_cream_load_monitor --show
>> Threshold for Load Average(1 min): 40 => Detected value for Load Average(1 min): 2.08
>> Threshold for Load Average(5 min): 40 => Detected value for Load Average(5 min): 1.94
>> Threshold for Load Average(15 min): 20 => Detected value for Load Average(15 min): 1.99
>> Threshold for Memory Usage: 95 => Detected value for Memory Usage: 76.97%
>> Threshold for Swap Usage: 95 => Detected value for Swap Usage: 0.00%
>> Threshold for Free FD: 500 => Detected value for Free FD: 341818
>> Threshold for tomcat FD: 800 => Detected value for Tomcat FD: 283
>> Threshold for FTP Connection: 30 => Detected value for FTP Connection: 1
>> Threshold for Number of active jobs: -1 => Detected value for Number of active jobs: 6287
>> Threshold for Number of pending commands: -1 => Detected value for Number of pending commands: 919830
>> Threshold for Disk Usage: 95% => Detected value for Partition / : 82%
>>
>> and active jobs 6287 seems weird considering:
>> Queue Memory CPU Time Walltime Node Run Que Lm State
>> ---------------- ------ -------- -------- ---- --- --- -- -----
>> short -- 01:00:00 02:00:00 -- 0 0 -- E R
>> long -- 48:00:00 72:00:00 -- 554 0 -- E R
>> ----- -----
>> 554 0
>>
>> Thanks for help on how to debug why such huge pending commands keep occurring on our CREAM's and how to remedy this.
>>
>> Mario Kadastik, PhD
>> Researcher
>>
>> ---
>> "Physics is like sex, sure it may have practical reasons, but that's not why we do it"
>> -- Richard P. Feynman
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not why we do it"
> -- Richard P. Feynman
|