Hi Lukasz,
this is a known problem; see the gLite WMS known-issues page:
http://glite.web.cern.ch/glite/packages/R3.1/deployment/glite-WMS/glite-WMS-known-issues.asp
Best regards,
Andreas
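
For reference, the cleanup workaround Lukasz mentions at the end of his mail (empty /var/glite/ice and restart the daemon) can be sketched as a small shell function. This is only a sketch: the init-script path and the /var/glite/ice directory are taken from the thread, while the function name, the backup naming scheme, and the move-aside-instead-of-delete step are our own additions.

```shell
# Sketch of the workaround: stop ICE, move the filelist directory aside
# (so the possibly corrupted data can still be inspected later),
# recreate it empty, and start ICE again.
ICE_INIT=/opt/glite/etc/init.d/glite-wms-ice   # path taken from the thread

cleanup_ice() {
    ice_dir=${1:-/var/glite/ice}
    backup="${ice_dir}.$(date +%Y%m%d%H%M%S).bak"

    # Stop the daemon first, if the init script is present on this host.
    [ -x "$ICE_INIT" ] && "$ICE_INIT" stop

    # Move the directory aside instead of deleting it outright.
    if [ -d "$ice_dir" ]; then
        mv "$ice_dir" "$backup"
        echo "backed up $ice_dir to $backup"
    fi

    # Recreate an empty directory for ICE to repopulate on startup.
    mkdir -p "$ice_dir"

    # Start the daemon again.
    [ -x "$ICE_INIT" ] && "$ICE_INIT" start

    return 0
}
```

On a production node one would presumably pause submissions first, then run e.g. `cleanup_ice /var/glite/ice`; keeping the timestamped backup allows the corrupted container files to be attached to a support ticket.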
On Wed, 13 May 2009, Lukasz Flis wrote:
> Hello,
>
> We have observed a problem on our WMS node rb.cyf-kr.edu.pl.
>
> The ICE component crashes every 3-4 days, and for some reason the
> service cannot be restarted successfully:
>
>
> Symptoms during job submission:
>
> + glite-wms-job-submit -a --vo voce -e
> https://rb1.cyf-kr.edu.pl:7443/glite_wms_wmproxy_server -o testjob.jid
> testjob.jdl
> Warning - --vo option ignored
> Connecting to the service
> https://rb1.cyf-kr.edu.pl:7443/glite_wms_wmproxy_server
> Warning - Unable to register the job to the service:
> https://rb1.cyf-kr.edu.pl:7443/glite_wms_wmproxy_server
> System load is too high:
> Threshold for ICE Input FileList jobs: 500 => Detected value for ICE Input
> FileList jobs /var/glite/ice/ice_fl : 595
> Method: jobRegister
> Error - Operation failed
> Unable to find any endpoint where to perform service request
> + set +x
>
> Job submission failed! Check the Resource Broker
>
> Restart:
> [root@rb1 glite]# /opt/glite/etc/init.d/glite-wms-ice restart
> stopping ICE... ok (was not running)
> starting ICE... failure
>
> [root@rb1 ice]# tail -n 120 ice_fl.log
> 11 May, 13:00:59 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 13:54:24 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 14:11:51 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 14:12:39 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 14:14:32 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 14:42:44 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 16:10:22 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 17:10:09 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 17:14:31 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 18:02:58 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 18:23:07 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 11 May, 19:34:46 - FileContainer::truncateFile(...): Asked a truncation
> at size: 371
> 12 May, 02:35:54 - FileContainer::checkStream(...): Wrong file status
> found, was: 't'. Going to recover.
> 12 May, 02:35:54 - FileContainer::checkStream(...): Current call stack:
> -> read_end( end ) -> checkStreamAndStamp( recover = 1 ) -> checkStream(
> recover = 1 ) -> readFileStatus( status = 1964200160 )
> 12 May, 02:35:54 - FileContainer::recover_data(...): Container modified,
> going to sync data. Old size = 110
> 12 May, 02:35:54 - FileContainer::recover_data(...): 110
> 12 May, 02:35:54 - FileContainer::checkConsistency(...): Called with
> allowable_size_offset = -1
> 12 May, 02:35:54 - FileContainer::checkConsistency(...): Reached the
> last element
> 12 May, 02:35:54 - FileContainer::checkConsistency(...):
> current_maximum_offset = 51193, max_reached_offset = 51193
> 12 May, 02:35:54 - FileContainer::checkConsistency(...): detected_size =
> 110, this->fc_size = 110
> 12 May, 02:35:54 - FileContainer::recover_data(...): Return status is "OK"
>
>
> In addition, ice_fl still contains a large number of entries.
>
>
> ICE cache reports no jobs at all:
>
> [root@rb1 ice]# dumpICECache glite_wms.conf
>
> Cream Job ID / Grid Job ID / Status
>
>
> Total number of job(s)=0
>
>
>
> Last lines of /var/log/glite/ice.log:
>
> 2009-05-13 10:35:01,760 WARN - creamJob::setJdl() - The user proxy file
> [/var/glite/SandboxDir/v6/https_3a_2f_2flb.grid.cyf-kr.edu.pl_3a9000_2fv6bf-eRGA
> uHQ3RvapE2opA/user.proxy] is not stat-able:No such file or directory. This
> could compromise the
> correct working of proxy renewal thread
> 2009-05-13 10:35:01,761 DEBUG - glite-wms-ice::main() - *** Unparsing
> request <[ Arguments = [ Force = false; ProxyFile =
> "/var/glite/spool/glite-renewd/f6bc5e4ed596367ac7eb13e5b779513e.0";
> SequenceCode =
> "UI=000000:NS=0000000005:WM=000002:BH=0000000000:JSS=000000:LM=000000:LRMS=00000
> 0:APP=000000:LBS=000000"; JobId =
> "https://lb.grid.cyf-kr.edu.pl:9000/v6bf-eRGAuHQ3RvapE2opA" ];
> Command = "Cancel"; Source = 2; Protocol = "1.0.0" ]>
> 2009-05-13 10:35:01,761 DEBUG - iceThreadPoolWorker::body() - Worker
> Thread ICE Requests Pool/0 started processing new request (Currently 1
> threads are running)
> 2009-05-13 10:35:01,761 INFO - iceCommandSubmit::execute() - This
> request is a Submission...
> 2009-05-13 10:35:01,762 DEBUG - glite-wms-ice::main() - *** Unparsing
> request <[ Arguments = [ Force = false; LogFile =
> "/var/glite/logmonitor/CondorG.log/CondorG.1241705792.log"; ProxyFile =
> "/var/glite/spool/glite-renewd/f6bc5e4ed596367ac7eb13e5b779513e.1";
> SequenceCode =
> "UI=000000:NS=0000000005:WM=000002:BH=0000000000:JSS=000000:LM=000000:LRMS=00000
> 0:APP=000000:LBS=000000"; JobId =
> "https://lb.grid.cyf-kr.edu.pl:9000/v89C7AgCWJ2T_CbHrKRuUA" ];
> Command = "Cancel"; Source = 2; Protocol = "1.0.0" ]>
>
>
> The only solution I know of is to clean up the /var/glite/ice directory
> and restart the daemon.
> Any suggestions on how to deal with this issue would be very welcome,
> as this is an important production machine.
>
>
> Best Regards
> --
> Lukasz Flis
>
--
Andreas Unterkircher
IT Department
Grid Deployment Group
CERN
CH-1211 Geneva 23