> As a workaround, can you try adding that host DN as a trusted retriever?
> 
> trusted_retrievers "/C=RS/O=AEGIS/....."


Hi Maarten,

After adding this to the configuration of the MyProxy server, we don't see any
errors in /var/log/messages on the MyProxy server, nor on the WMS (related to
glite-proxy-renewd).

However, jobs are still being aborted because of an expired proxy. Here is one
example of such a job:

[alex@ce sh5-see-ce]$ glite-wms-job-status
https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
Current Status:     Aborted 
Status Reason:      Job proxy is expired.
Destination:        ce-atlas.phy.bg.ac.yu:2119/jobmanager-pbs-see
Submitted:          Sat Jun 21 14:05:16 2008 CEST
*************************************************************


From its status, I see that:

- stateEnterTimes =   
      Submitted        : Sat Jun 21 14:05:16 2008 CEST
      Waiting          : Sat Jun 21 14:05:29 2008 CEST
      Ready            : Sat Jun 21 14:06:18 2008 CEST
      Scheduled        : Sat Jun 21 14:51:08 2008 CEST
      Running          :                ---
      Done             : Sun Jun 22 16:09:06 2008 CEST
      Cleared          :                ---
      Aborted          : Sun Jun 22 18:04:09 2008 CEST
      Cancelled        :                ---
      Unknown          :                ---


Around Sun Jun 22 16:09:06 2008 CEST and Sun Jun 22 18:04:09 2008 CEST there is
definitely nothing in /var/log/messages on either the MyProxy server or the
WMS. This is what I found in /var/log/glite that is related to this job:

[root@wms glite]# grep LPpwobsRBu3EYhzZQ_LKuw *.log
jobcontoller_events.log:21 Jun, 14:50:43 -V- JobControllerReal::submit(...):
Submitting job "https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw"
logmonitor_events.log:21 Jun, 14:50:47 -I- EventSubmit::finalProcess(...): Job
id = https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
logmonitor_events.log:21 Jun, 14:50:47 -I- SubmitReader::internalRead():
Reading condor submit file of job
https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
logmonitor_events.log:21 Jun, 14:51:08 -I- EventGlobusSubmit::process_event():
Job id = https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
logmonitor_events.log:21 Jun, 14:51:08 -I- SubmitReader::internalRead():
Reading condor submit file of job
https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
logmonitor_events.log:22 Jun, 16:09:06 -I- EventJobHeld::process_event(): Job
id = https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
logmonitor_events.log:22 Jun, 18:04:08 -I- EventAborted::process_event(): Job
id = https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
logmonitor_events.log:22 Jun, 18:04:09 -I- JobResubmitter::resubmit(...): Job
id = https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
workload_manager_events.log:21 Jun, 14:05:33 -I: [Info]
operator()(/home/glbuild/GLITE_3_1_0_continous/org.glite.wms.manager/src/server/dispatcher.cpp:470):
new jobsubmit for https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw
workload_manager_events.log:21 Jun, 14:06:16 -I: [Info]
checkRequirement(/home/glbuild/GLITE_3_1_0_continous/org.glite.wms.matchmaking/src/matchmakerISMImpl.cpp:79):
MM for job: https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw (1/6194 [0,
2.48] )
workload_manager_events.log:21 Jun, 14:06:19 -I: [Info]
do_transitions_for_submit(/home/glbuild/GLITE_3_1_0_continous/org.glite.wms.manager/src/server/dispatcher.cpp:283):
https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw delivered


So, for some reason the job was held at 16:09:06, and aborted at 18:04:08.

From jobcontoller_events.log I can see that the Condor ID of that job is 652005:

21 Jun, 14:50:42 -I- ControllerLoop::run(): Got new submit request...
21 Jun, 14:50:42 -I- SubmitAd::createFromAd(...): Creating job directory path.
21 Jun, 14:50:43 -M- JobControllerReal::submit(...): Classad file created...
21 Jun, 14:50:43 -V- JobControllerReal::submit(...): Submitting job
"https://wms.phy.bg.ac.yu:9000/LPpwobsRBu3EYhzZQ_LKuw"
21 Jun, 14:50:43 -M- JobControllerReal::submit(...): Submit file created...
21 Jun, 14:50:44 -V- JobControllerReal::submit(...): Job submitted to Condor
cluster: 652005

But the job is no longer in the Condor queue. Contrary to what the logs
suggest, the files associated with the job have not been removed:

[root@wms glite-renewd]# ll
/var/glite/SandboxDir/LP/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fLPpwobsRBu3EYhzZQ_5fLKuw/
total 24
drwxrwx---    2 aegis013 glite        4096 Jun 21 14:06 input
-rw-r--r--    1 glite    glite         799 Jun 21 14:05 JDLOriginal
-rw-r--r--    1 glite    glite        2055 Jun 21 14:05 JDLStarted
drwxrwx---    2 aegis013 glite        4096 Jun 21 14:05 output
drwxrwx---    2 aegis013 glite        4096 Jun 21 14:05 peek
-rw-r--r--    1 glite    glite           0 Jun 21 14:06 token.txt
lrwxrwxrwx    1 glite    glite          67 Jun 21 14:05 user.proxy ->
/var/glite/spool/glite-renewd/447177e6e930ecc86b52a6a9100ce494.2519


However, /var/glite/spool/glite-renewd/447177e6e930ecc86b52a6a9100ce494.2519
does not exist anymore.
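
To see whether other jobs are in the same state, the sandbox tree can be
scanned for user.proxy symlinks whose target is missing. This is only a minimal
sketch, assuming GNU find is available on the WMS; the function name
find_dangling_proxies is hypothetical, and the default path is simply the
sandbox root shown in the listing above.

```shell
# Hypothetical helper: list user.proxy symlinks whose target no longer exists.
# The default root is the sandbox directory seen in the listing above;
# pass a different directory as the first argument if needed.
find_dangling_proxies() {
    root="${1:-/var/glite/SandboxDir}"
    # -xtype l (GNU find) matches symlinks whose target cannot be resolved
    find "$root" -name user.proxy -xtype l 2>/dev/null
}
```

Running find_dangling_proxies with no arguments on the WMS would print one path
per affected sandbox; an empty result would mean every user.proxy link still
resolves.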

Any help in finding out what went wrong would be appreciated. This is
presumably a problem with glite-proxy-renewd, but I cannot find any traces that
suggest what is wrong...

Thanks, Antun