Hi Alex,
>
> Yes, condor is indeed running for this user on the PPS site:
> # ps -auxwww | grep ops001
> ops001 12313 0.0 0.3 6128 1860 ? S May08 0:00 /opt/condor-c/sbin/condor_master -f -r 29971
> ops001 12373 0.0 0.3 7588 1968 ? S May08 0:05 condor_schedd -f -n [log in to unmask]
>
> I'm used to seeing globus-job-manager & perl monitoring + condor processes running together:
> [root@ce01-pps root]# ps -auxwww | grep ops002
> ops002 27570 0.0 0.5 6156 3024 ? S 11:21 0:00 /opt/condor-c/sbin/condor_master -f -r 709
> ops002 27591 0.0 0.9 7612 4656 ? S 11:21 0:02 condor_schedd -f -n [log in to unmask]
> ops002 21389 0.0 0.0 0 0 ? Z 12:22 0:00 [edg-gridftpd <defunct>]
> ops002 21842 0.0 0.7 5520 3608 ? S 12:22 0:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> ops002 21901 0.0 0.5 4296 2724 ? S 12:22 0:00 perl /home/ops002/.globus/.gass_cache/local/md5/73/21f15efbe9035025f30a1f14d7ee2e/md5/ab/3002b5e9e7353a451b30cc7e95116b/data --dest-url=https://lxb2070.cern.ch:20002/tmp/condor_g_scratch.0x88a4fc0.32344/grid-monitor.ce01-pps.bgu.ac.il:2119.3778/grid-monitor-job-status
> ops002 21902 0.0 0.9 6388 4780 ? S 12:22 0:00 perl /tmp/grid_manager_monitor_agent.ops002.21901.1000 --delete-self --maxtime=3600s
>
> but when no new jobs have been submitted for some time, only
> the globus-job-manager & perl monitoring processes remain, like (this is ops001 on the production site this time):
> # ps -auxwww | grep ops001
> ops001 11880 0.0 0.7 5520 3608 ? S 12:08 0:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> ops001 11940 0.0 0.5 4296 2724 ? S 12:08 0:00 perl /home/ops001/.globus/.gass_cache/local/md5/d4/6c264318c3d2440f247595f2f389c0/md5/41/9f9167aaec10ce804dbb17560ea632/data --dest-url=https://rb118.cern.ch:20004/tmp/condor_g_scratch.0xc041d10.26437/grid-monitor.cs-grid1.bgu.ac.il:2119.2244/grid-monitor-job-status
> ops001 11941 0.0 0.9 6392 4780 ? S 12:08 0:00 perl /tmp/grid_manager_monitor_agent.ops001.11940.1000 --delete-self --maxtime=3600s
>
> So this is the first time I have seen only condor processes running.
Usually the globus-job-manager & perl monitoring processes should always
be running together with the condor processes. If only the condor
processes are running, something is probably wrong; for example, the
monitoring may by chance have lost track of these condor processes. We
have also occasionally seen this happen.
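
A rough sketch of how one could check for this state from ps output,
in case it helps. The function name and the state labels are my own
illustration and are not part of condor, globus or the WMS; it assumes
the 11-column `ps aux` layout shown above and matches process names by
substring only.

```python
def classify_user_processes(ps_output, user):
    """Report which process groups a pool account has running.

    ps_output is the text printed by `ps -auxwww`; user is the account
    name in the first column (for example "ops001").
    """
    has_condor = has_jobmanager = has_monitor = False
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue
        command = fields[10]
        if "<defunct>" in command:
            continue  # zombies such as [edg-gridftpd <defunct>] do not count
        if "condor_master" in command or "condor_schedd" in command:
            has_condor = True
        elif "globus-job-manager" in command:
            has_jobmanager = True
        elif command.startswith("perl") and (
            "grid-monitor" in command or "grid_manager_monitor_agent" in command
        ):
            has_monitor = True

    if not (has_condor or has_jobmanager or has_monitor):
        return "none"
    if has_condor and has_jobmanager and has_monitor:
        return "all"
    if has_condor and not (has_jobmanager or has_monitor):
        return "condor-only"  # the suspicious state discussed here
    if has_jobmanager and has_monitor and not has_condor:
        return "monitor-only"
    return "partial"
```

Feeding it the saved output of the same ps command for each pool
account would flag the accounts that are in the "condor-only" state.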
> Also, all SAM tests submitted every hour by the same users (dteam, ops, alice, atlas) are always
> authenticated by the gatekeeper (according to the gatekeeper & message logs).
> Meaning only the globus-job-manager & perl monitoring processes remain running.
>
Because the globus-job-manager & perl monitoring processes for the condor
processes you mentioned above don't exist, the launcher would think the
condor processes are gone although they are not, so the WMS tries to
relaunch them through the fork jobmanager of the gatekeeper.
Di
> So this is confusing, could someone shed some light on the reason
> of these two different behaviours?
>
> Thanks
> Alex
>
>> As for the periodic log
>> messages in the gatekeeper log or /var/log/message, I think the WMS
>> tried to launch the condor instance but failed, and then retried again
>> and again.