On Tue, 15 May 2007, Di Qing wrote:
Hi Di,
> If the condor instances for the jobs submitted by SAM portal are running
> on glite CE, when new jobs coming, WMS will bypass gatekeeper and
> directly submit jobs to the condor instance.
Yes, indeed condor is running for this user on the PPS site:
# ps -auxwww | grep ops001
ops001 12313 0.0 0.3 6128 1860 ? S May08 0:00 /opt/condor-c/sbin/condor_master -f -r 29971
ops001 12373 0.0 0.3 7588 1968 ? S May08 0:05 condor_schedd -f -n [log in to unmask]
I'm used to seeing globus-job-manager & perl monitoring + condor processes running together:
[root@ce01-pps root]# ps -auxwww | grep ops002
ops002 27570 0.0 0.5 6156 3024 ? S 11:21 0:00 /opt/condor-c/sbin/condor_master -f -r 709
ops002 27591 0.0 0.9 7612 4656 ? S 11:21 0:02 condor_schedd -f -n [log in to unmask]
ops002 21389 0.0 0.0 0 0 ? Z 12:22 0:00 [edg-gridftpd <defunct>]
ops002 21842 0.0 0.7 5520 3608 ? S 12:22 0:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
ops002 21901 0.0 0.5 4296 2724 ? S 12:22 0:00 perl /home/ops002/.globus/.gass_cache/local/md5/73/21f15efbe9035025f30a1f14d7ee2e/md5/ab/3002b5e9e7353a451b30cc7e95116b/data --dest-url=https://lxb2070.cern.ch:20002/tmp/condor_g_scratch.0x88a4fc0.32344/grid-monitor.ce01-pps.bgu.ac.il:2119.3778/grid-monitor-job-status
ops002 21902 0.0 0.9 6388 4780 ? S 12:22 0:00 perl /tmp/grid_manager_monitor_agent.ops002.21901.1000 --delete-self --maxtime=3600s
but then, when no new jobs are submitted for some time, only
the globus-job-manager & perl monitoring processes remain, for example (this is ops001 on the production site this time):
# ps -auxwww | grep ops001
ops001 11880 0.0 0.7 5520 3608 ? S 12:08 0:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
ops001 11940 0.0 0.5 4296 2724 ? S 12:08 0:00 perl /home/ops001/.globus/.gass_cache/local/md5/d4/6c264318c3d2440f247595f2f389c0/md5/41/9f9167aaec10ce804dbb17560ea632/data --dest-url=https://rb118.cern.ch:20004/tmp/condor_g_scratch.0xc041d10.26437/grid-monitor.cs-grid1.bgu.ac.il:2119.2244/grid-monitor-job-status
ops001 11941 0.0 0.9 6392 4780 ? S 12:08 0:00 perl /tmp/grid_manager_monitor_agent.ops001.11940.1000 --delete-self --maxtime=3600s
So this is the first time I see that only condor processes are running.
Also, all SAM tests submitted every hour from the same users (dteam, ops, alice, atlas) are always
authenticated by the gatekeeper (according to the gatekeeper & message logs),
meaning only the globus-job-manager & perl monitoring processes remain running.
So this is confusing, could someone shed some light on the reason
of these two different behaviours?
Thanks
Alex
> For the periodical log
> message in gatekeeper log or /var/log/message, I think it is that WMS
> tried to launch the condor instance, but failed, then it retried again
> and again.