Hi
Did this happen after having updated to glite Update 44 ?
If so, this is a known issue (an incompatibility between patch #2845,
released with patch #2845, and ICE in production).
The details are in the attached mail
As far as I was told, the decision was then to wait for patch #2597 and
officially announce this problem (wasn't this done ?)
Cheers, Massimo
On Thu, 23 Apr 2009, Arnau Bria wrote:
> Hi,
>
> We are faced to a problem in the service glite-wms-ice that does not
> start. Although we stopped all processes gLite in the machine and
> killed the processes related to the user glite we can not start the
> service glite-wms-ice. In the logs we found the following entry:
>
> [...]
> 2009-04-23 18:54:26,576 DEBUG - iceThread::operator() - Thread ICE
> Proxy Renewer starting... 2009-04-23 18:54:26,576 INFO - proxyRenewal::body() - new iteration
> 2009-04-23 18:54:26,577 DEBUG - iceThread::operator() - Thread Job Killer starting...
> 2009-04-23 18:54:26,577 DEBUG - glite-wms-ice::main() - Max ICE memory threshold set to 4000000 kB
> 2009-04-23 18:54:26,583 INFO - glite-wms-ice::main() - *** Found 5 new request(s)
> 2009-04-23 18:54:26,583 DEBUG - glite-wms-ice::main() - *** Unparsing request <[ Arguments = [ JobAd = [ requirements = ( other.GlueCEStateFreeCPUs > 0 ) && !RegExp(".*sdj$",other.GlueCEUniqueID); RetryCount = 2; edg_jobid = "https://rb03.pic.es:9000/RKKS9ffb5S1ofBSfdXToRQ"; lrms_type = "pbs"; CeApplicationDir = "/software/dteam/slc4"; Arguments = "1240221533"; GlobusResourceContactString = "ppsce02.pic.es:8443/cream-pbs"; OutputSandboxPath = "/var/glite/SandboxDir/RK/https_3a_2f_2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ/output"; ce_id = "ppsce02.pic.es:8443/cream-pbs-pps"; MyProxyServer = "lcg2proxy.ific.uv.es"; AllowZippedISB = true; QueueName = "pps"; JobType = "normal"; Executable = "/bin/echo"; OutputSandboxDestURI = { "gsiftp://rb03.pic.es:2811/var/glite/SandboxDir/RK/https_3a_2f_2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ/output/test.out" }; CertificateSubject = "/DC=es/DC=irisgrid/O=pic/CN=fernando.lopez"; X509UserProxy = "/var/glite/SandboxDir/RK/https_3a_2!
f_!
> 2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ/user.proxy"; StdOutput = "test.out"; VOMS_FQAN = "/dteam/Role=NULL/Capability=NULL"; OutputSandbox = { "test.out" }; LB_sequence_code = "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000"; InputSandboxPath = "/var/glite/SandboxDir/RK/https_3a_2f_2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ/input"; VirtualOrganisation = "dteam"; rank = -other.GlueCEStateEstimatedResponseTime; Type = "job"; ShallowRetryCount = 10; WMPInputSandboxBaseURI = "gsiftp://rb03.pic.es:2811/var/glite/SandboxDir/RK/https_3a_2f_2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ"; DefaultRank = -other.GlueCEStateEstimatedResponseTime; ReallyRunningToken = "gsiftp://rb03.pic.es/var/glite/SandboxDir/RK/https_3a_2f_2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ/token.txt"; InputSandbox = { "gsiftp://rb03.pic.es:2811/var/glite/SandboxDir/RK/https_3a_2f_2frb03.pic.es_3a9000_2fRKKS9ffb5S1ofBSfdXToRQ/input/.BrokerIn!
f!
> o" } ] ]; Command = "Submit"; Source = 2; Protocol = "1.0.0" ]>
> 2009-04-23 18:54:26,586 DEBUG - glite-wms-ice::main() - *** Unparsing request <[ Arguments = [ Force = false; ProxyFile = "/var/glite/spool/glite-renewd/fa7f45c964b389e9c98af2779212a72e.0"; SequenceCode = "UI=000000:NS=0000000005:WM=000002:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000"; JobId = "https://rb03.pic.es:9000/RKKS9ffb5S1ofBSfdXToRQ" ]; Command = "Cancel"; Source = 2; Protocol = "1.0.0" ]>
> 2009-04-23 18:54:26,586 DEBUG - iceThreadPoolWorker::body() - Worker Thread ICE Requests Pool/0 started processing new request (Currently 1 threads are running)
> 2009-04-23 18:54:26,586 INFO - iceCommandSubmit::execute() - This request is a Submission...
> 2009-04-23 18:54:26,586 DEBUG - glite-wms-ice::main() - *** Unparsing request <[ Arguments = [ Force = false; LogFile = "/var/glite/logmonitor/CondorG.log/CondorG.1240232577.log"; ProxyFile = "/var/glite/spool/glite-renewd/fa7f45c964b389e9c98af2779212a72e.1"; SequenceCode = "UI=000000:NS=0000000005:WM=000002:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000"; JobId = "https://rb03.pic.es:9000/74f74rxlljQYimQ-20jt6g" ]; Command = "Cancel"; Source = 2; Protocol = "1.0.0" ]>
>
>
> The problem is that it's not really clear to us where the service gets
> the reference to the job that is mentioned in the log
> (https://rb03.pic.es:9000/RKKS9ffb5S1ofBSfdXToRQ) from.
>
> Any help is very appreciated.
>
> Thanks in advance,
> Arnau and Christian.
>
\\\|///
\\ ~ ~ //
(/ @ @ /)
-------oOOo-(_)-oOOo----------------------------------
Massimo Sgaravatto
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy
Tel: ++39 0498277047 Fax: ++39 0498277102
oooO E-mail: massimo.sgaravatto [at] pd.infn.it
( ) Oooo Home page: http://www.pd.infn.it/~sgaravat
--------\ (----( )----------------------------------
\_) ) /
(_/
From [log in to unmask] Tue Apr 21 14:11:46 2009
Date: Tue, 21 Apr 2009 14:11:31 +0200
From: Massimo Sgaravatto - INFN Padova <[log in to unmask]>
To: Ewan Roche <[log in to unmask]>
Cc: [log in to unmask], wms-operations <[log in to unmask]>, Alvise Dorigo <[log in to unmask]>, Moreno Marzolla <[log in to unmask]>, Antonio Retico <[log in to unmask]>, Antonio Retico <[log in to unmask]>, Oliver Keeble <[log in to unmask]>, Andreas Unterkircher <[log in to unmask]>
Subject: Re: ICE crashes on update 44
Resent-Date: Tue, 21 Apr 2009 14:11:43 +0200 (CEST)
Resent-From: <[log in to unmask]>
Hi
With update 44, the versions of CREAM (server and client) and ICE used in
the PPS pilot testbed were supposed to be released.
CREAM and CEmon client (patch #2845) to be installed on the UI and on the
WMS has indeed been updated, while the new ICE (v. 3.1.36-0, patch #2459),
supposed to be released at the same time, was not released with this
update 44.
I have the feeling that patch #2845 was certified considering only the UI
(and not also against the WMS in production, i.e. patch #2562), and and I
am afraid that there is a problem with having
updated on the WMS node just the cream-client-api-c and cemon-client-api-c
without having updated also ICE (this is because some changes in the
interaction with gsoap-plugin, but it should be investigated in more
detail if this is really the cause of the problem).
The dependency of patch #2459 on patch #2845 was indeed specified, but the
contrary (which makes sense only on the WMS node but not on the UI) was
not.
Please also note that I obsoleted patch #2459, just because for it the
certification process didn't start at all after about 1 month, while WMS
patch #2597 (which includes ICE v. 3.1.36-0, that is the one of patch
#2459) released for certification later, is already in certification
Andreas/Oliver/Antonio: what should we do ? Restore patch #2459 ? Waiting
for the certification of patch #2597 ?
Cheers, Massimo
On Tue, 21 Apr 2009, Ewan Roche wrote:
> Hi,
> before I submit a bug could I get some feedback on the following problem.
>
> After upgrading to the latest gLite release (44) it seems that the ICE daemon
> has been repeatedly crashing and unable to submit (any?) jobs. The is pretty
> instantaneous and the log file output from one entire daemon run is included
> below.
>
> The ice version is glite-wms-ice-3.1.31-1.slc4
>
> Any ideas?
>
> Thanks
>
> Ewan
>
> 2009-04-21 13:20:01,615 DEBUG - ICE VersionID is [20080714-17:00]
> ProcessID=[22991]
> 2009-04-21 13:20:01,615 INFO - glite-wms-ice::main() - Host certificate is
> [/home/glite/.certs/hostcert.pem]
> 2009-04-21 13:20:01,615 INFO - glite-wms-ice::main() - Initializing jobCache
> with persistency directory [/var/glite/ice/persist_dir]...
> 2009-04-21 13:20:01,720 DEBUG - jobDbManager::CTOR() - Created database for
> jobs caching in [/var/glite/ice/persist_dir]
> 2009-04-21 13:20:01,720 DEBUG - DNProxyManager::CTOR() - Populating DN ->
> Proxy cache by scannig the jobCache...
> 2009-04-21 13:20:01,721 DEBUG - iceThreadPool::iceThreadPool(ICE Requests
> Pool) - Creating 10 worker threads
> 2009-04-21 13:20:01,722 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=0) starting...
> 2009-04-21 13:20:01,722 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=1) starting...
> 2009-04-21 13:20:01,722 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=2) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=3) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=4) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=5) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=6) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=7) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=8) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThreadPool::iceThreadPool(ICE Internal
> Commands Pool) - Creating 5 worker threads
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Requests Pool, id=9) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Internal Commands Pool, id=0) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Internal Commands Pool, id=1) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Internal Commands Pool, id=2) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Internal Commands Pool, id=3) starting...
> 2009-04-21 13:20:01,723 DEBUG - iceThread::operator() - Thread
> iceThreadPoolWorker(pool=ICE Internal Commands Pool, id=4) starting...
> 2009-04-21 13:20:01,726 DEBUG - Ice::startListener() - Host DN is
> [/DC=ch/DC=cern/OU=computers/CN=wms220.cern.ch]
> 2009-04-21 13:20:01,726 DEBUG - subscriptionProxy::CTOR - Authenticating...
> 2009-04-21 13:20:01,726 DEBUG - Ice::startListener() - Creating a CEMon
> listener object: port=[7010] hostkey=[/home/glite/.certs/hostkey.pem]
> hostcert=[/home/glite/.certs/hostcert.pem]
> 2009-04-21 13:20:01,734 INFO - eventStatusListener::createObject() - Listener
> created!
> 2009-04-21 13:20:01,734 DEBUG - iceThread::operator() - Thread event status
> listener starting...
> 2009-04-21 13:20:01,734 INFO - eventStatusListener::body() - Waiting for job
> status notification
> 2009-04-21 13:20:01,734 DEBUG - iceThread::operator() - Thread subscription
> Updater starting...
> 2009-04-21 13:20:01,734 DEBUG - iceThread::operator() - Thread event status
> poller starting...
> 2009-04-21 13:20:01,734 DEBUG - iceThread::operator() - Thread ICE Lease
> Updater starting...
> 2009-04-21 13:20:01,734 INFO - leaseUpdater::body() - new iteration
> 2009-04-21 13:20:01,734 DEBUG - iceThread::operator() - Thread ICE Proxy
> Renewer starting...
> 2009-04-21 13:20:01,734 INFO - proxyRenewal::body() - new iteration
> 2009-04-21 13:20:01,735 DEBUG - iceThread::operator() - Thread Job Killer
> starting...
> 2009-04-21 13:20:01,735 DEBUG - glite-wms-ice::main() - Max ICE memory
> threshold set to 4000000 kB
> 2009-04-21 13:20:01,738 INFO - glite-wms-ice::main() - *** Found 5 new
> request(s)
> 2009-04-21 13:20:01,739 DEBUG - glite-wms-ice::main() - *** Unparsing request
> <[ Arguments = [ JobAd = [ requirements = ( ( other.GlueCEInfoHostName ==
> "creamce.gina.sara.nl" ) ) && !
> false; RetryCount = 0; edg_jobid =
> "https://wms220.cern.ch:9000/B_qrW9htCJU08uMTHVDPkw"; lrms_type = "torque";
> CeApplicationDir = "/data/software/ops"; Arguments = "-x testjob.sh"; Glo
> busResourceContactString = "creamce.gina.sara.nl:8443/cream-pbs";
> OutputSandboxPath =
> "/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/output";
> ce_id = "creamce.gina.sara.nl:8443/cream-pbs-express"; MyProxyServer =
> "myproxy.cern.ch"; AllowZippedISB = true; QueueName = "express"; JobType =
> "normal"; InputSandboxDestFileName = {
> "testjob.sh","testjob.tgz","same.conf" }; SignificantAttributes = {
> "Requirements","Rank","FuzzyRank" }; Executable = "/bin/sh";
> OutputSandboxDestURI = { "gsiftp://wms220.cern.ch:2811
> /var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/output/testjob.out","gsiftp://wms220.cern.ch:2811/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.c
> ern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/output/testjob-results.tgz" };
> CertificateSubject = "/DC=ch/DC=cern/OU=Organic
> Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"; X509UserPro
> xy =
> "/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/user.proxy";
> StdOutput = "testjob.out"; VOMS_FQAN = "/ops/Role=lcgadmin/Capability=NULL";
> OutputSandbox = { "testjob.out","testjob-results.tgz" }; LB_sequence_code =
> "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000";
> In
> putSandboxPath =
> "/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/input";
> VirtualOrganisation = "ops"; rank = -other.GlueCEStateEstimatedRespon
> seTime; Type = "job"; ShallowRetryCount = 1; StdError = "testjob.out";
> WMPInputSandboxBaseURI =
> "gsiftp://wms220.cern.ch:2811/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a900
> 0_2fB_5fqrW9htCJU08uMTHVDPkw"; DefaultRank =
> -other.GlueCEStateEstimatedResponseTime; ReallyRunningToken =
> "gsiftp://wms220.cern.ch/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.c
> h_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/token.txt"; ZippedISB = {
> "ISBfiles_jmii4McV9qTN5hHMf3wKSA_0.tar.gz" }; InputSandbox = {
> "gsiftp://wms220.cern.ch:2811/var/glite/SandboxDir/B_/https
> _3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/input/testjob.sh","gsiftp://wms220.cern.ch:2811/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08u
> MTHVDPkw/input/testjob.tgz","gsiftp://wms220.cern.ch:2811/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/input/same.conf","gsiftp://wms220.cern.
> ch:2811/var/glite/SandboxDir/B_/https_3a_2f_2fwms220.cern.ch_3a9000_2fB_5fqrW9htCJU08uMTHVDPkw/input/.BrokerInfo"
> } ] ]; Command = "Submit"; Source = 2; Protocol = "1.0.0" ]>
> 2009-04-21 13:20:01,741 DEBUG - iceThreadPoolWorker::body() - Worker Thread
> ICE Requests Pool/1 started processing new request (Currently 1 threads are
> running)
> 2009-04-21 13:20:01,741 INFO - iceCommandSubmit::execute() - This request is
> a Submission...
>
> Nothing more is heard until the daemon is restarted five minutes later.....
>
\\\|///
\\ ~ ~ //
(/ @ @ /)
-------oOOo-(_)-oOOo----------------------------------
Massimo Sgaravatto
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy
Tel: ++39 0498277047 Fax: ++39 0498277102
oooO E-mail: massimo.sgaravatto [at] pd.infn.it
( ) Oooo Home page: http://www.pd.infn.it/~sgaravat
--------\ (----( )----------------------------------
\_) ) /
(_/
|