Hi Maarten,

we ran into the same problem with the upgraded WMS again. I believe I now
understand the problem. In /var/local/condor/log/GridmanagerLog.glite many
restarts of the component were reported after such a crash:

07/31 16:40:04 [18921] gahp server not up yet, delaying ping
07/31 16:40:04 [18921] GAHP server pid = 19331
07/31 16:40:04 [18921] gahp->nordugrid_ldap_query returned -101 for resource korundi.grid.helsinki.fi
07/31 16:40:04 [18921] ERROR "nordugrid_ldap_query failed!" at line 211 in file nordugridresource.cpp

Searching the internet a bit, I found that you got trapped by a similar problem in May:

http://lindir.ics.muni.cz/pipermail/egee-jra1/2010-May/012580.html

When I remove the old nordugrid_gahp and use the one included in Condor 7.4,
things start to work again. Is there a ticket for the YAIM people on that?

Cheers,
Christoph

On 07/25/2010 06:52 PM, Maarten Litmaath wrote:
> Hallo Christoph,
>
>> It seems that the WMS recovered itself (being in
>> drain mode) over the weekend. The WMS is full of Condor jobs in state
>> "H" (hold). Do they do any harm? Some are weeks old already.
>>
> Normally held jobs do no harm, but the latest WMS version has an issue
> for which the admin may need to intervene occasionally:
>
> https://savannah.cern.ch/bugs/?69841
>
> A cleanup cron job for held jobs is included in this bug:
>
> https://savannah.cern.ch/bugs/?70401
>
> The grace period of 1 week should probably be lowered to 1 day,
> or even just a few hours...
>
>> Another question, perhaps someone knows the answer. Trying to get some
>> understanding of the flow of a job through the WMS, I tried to follow a
>> job that goes to a CREAM-CE. Are those jobs supposed to show up in the
>> list of jobs listed with condor_q?
>>
> No.
> On a WMS the jobs for CREAM are handled by ICE, while jobs sent to
> LCG-CE or ARC-CE instances are handled by Condor-G:
>
> https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEgLiteJobSubmissionSchema
>
> To see ICE details one can use /opt/glite/bin/queryDb on the WMS.
> The "-h" option shows how.
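
P.S. Until we pick up the cleanup cron job attached to bug #70401, a stopgap
along the same lines might look like the /etc/crontab entry below. This is only
a sketch, not the actual script from the bug: the hourly schedule, the one-day
grace period and the constraint expression are my assumptions.

```shell
# Hypothetical /etc/crontab entry (NOT the script from bug #70401):
# once per hour, remove Condor-G jobs that have been in the Held state
# (JobStatus == 5) for longer than 86400 seconds (1 day), based on the
# EnteredCurrentStatus job attribute.
0 * * * * root condor_rm -constraint 'JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 86400'
```

The grace period can be tightened to a few hours by lowering the 86400, as
Maarten suggests above.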