JISCMail - LCG-ROLLOUT Archives

Hi Alex,

> 
>  Yes indeed the condor is running for this user on the PPS site:
> # ps -auxwww | grep ops001
> ops001   12313  0.0  0.3  6128 1860 ?        S    May08   0:00 /opt/condor-c/sbin/condor_master -f -r 29971
> ops001   12373  0.0  0.3  7588 1968 ?        S    May08   0:05 condor_schedd -f -n [log in to unmask]
> 
> I'm used to seeing globus-job-manager & perl monitoring + condor processes running together:
> [root@ce01-pps root]# ps -auxwww | grep ops002
> ops002   27570  0.0  0.5  6156 3024 ?        S    11:21   0:00 /opt/condor-c/sbin/condor_master -f -r 709
> ops002   27591  0.0  0.9  7612 4656 ?        S    11:21   0:02 condor_schedd -f -n [log in to unmask]
> ops002   21389  0.0  0.0     0    0 ?        Z    12:22   0:00 [edg-gridftpd <defunct>]
> ops002   21842  0.0  0.7  5520 3608 ?        S    12:22   0:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> ops002   21901  0.0  0.5  4296 2724 ?        S    12:22   0:00 perl /home/ops002/.globus/.gass_cache/local/md5/73/21f15efbe9035025f30a1f14d7ee2e/md5/ab/3002b5e9e7353a451b30cc7e95116b/data --dest-url=https://lxb2070.cern.ch:20002/tmp/condor_g_scratch.0x88a4fc0.32344/grid-monitor.ce01-pps.bgu.ac.il:2119.3778/grid-monitor-job-status
> ops002   21902  0.0  0.9  6388 4780 ?        S    12:22   0:00 perl /tmp/grid_manager_monitor_agent.ops002.21901.1000 --delete-self --maxtime=3600s
> 
> but when no new jobs are submitted for some time, only
> the globus-job-manager & perl monitoring processes remain, e.g. (this is ops001 on the production site this time):
> # ps -auxwww | grep ops001
> ops001   11880  0.0  0.7  5520 3608 ?        S    12:08   0:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> ops001   11940  0.0  0.5  4296 2724 ?        S    12:08   0:00 perl /home/ops001/.globus/.gass_cache/local/md5/d4/6c264318c3d2440f247595f2f389c0/md5/41/9f9167aaec10ce804dbb17560ea632/data --dest-url=https://rb118.cern.ch:20004/tmp/condor_g_scratch.0xc041d10.26437/grid-monitor.cs-grid1.bgu.ac.il:2119.2244/grid-monitor-job-status
> ops001   11941  0.0  0.9  6392 4780 ?        S    12:08   0:00 perl /tmp/grid_manager_monitor_agent.ops001.11940.1000 --delete-self --maxtime=3600s
> 
> So this is the first time I have seen only condor processes running.
Usually the globus-job-manager & perl monitoring + condor processes
should always be running together. If only condor processes are running,
something is probably wrong; for example, the monitoring may by chance
have lost track of these condor processes. We have occasionally seen this
happen as well.
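
A quick way to spot which accounts are in this state is to group the ps
output per user and check which kinds of process are present. The sketch
below is only an illustration: the function name classify, the labels it
returns and the regular expressions are my own assumptions, based solely on
the command lines quoted in this thread, and it expects "ps aux"-style
output with 10 columns before the command.

```python
import re
from collections import defaultdict

# Patterns are assumptions derived from the listings quoted in this thread.
CONDOR = re.compile(r"condor_master|condor_schedd")
JOBMANAGER = re.compile(r"globus-job-manager")
MONITOR = re.compile(r"grid_manager_monitor_agent|grid-monitor-job-status")


def classify(ps_output):
    """Map each user to 'both', 'condor-only' or 'jobmanager-only'.

    Users with no matching process are left out of the result.
    """
    seen = defaultdict(set)
    for line in ps_output.splitlines():
        # "ps aux" prints 10 columns followed by the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue
        user, command = fields[0], fields[10]
        if CONDOR.search(command):
            seen[user].add("condor")
        if JOBMANAGER.search(command) or MONITOR.search(command):
            seen[user].add("jobmanager")

    result = {}
    for user, kinds in seen.items():
        if kinds == {"condor", "jobmanager"}:
            result[user] = "both"
        elif kinds == {"condor"}:
            result[user] = "condor-only"
        else:
            result[user] = "jobmanager-only"
    return result
```

Feeding it the output of "ps auxwww" on the CE would flag an account such
as ops001 on the PPS site above as "condor-only", which is the state that
needs a closer look.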

> Also, all SAM tests submitted every hour from the same users (dteam,ops,alice,atlas) are always
> authenticated by the gatekeeper (according to the gatekeeper & messages logs).
> Meaning only the globus-job-manager & perl monitoring processes remain running.
> 
Because the globus-job-manager & perl monitoring for the condor processes
you mentioned above don't exist, the launcher would think the condor
processes are gone although they are not, and so the WMS tries to relaunch
them through the fork jobmanager of the gatekeeper.

Di

>  So this is confusing, could someone shed some light on the reason
> of these two different behaviours?
> 
>  Thanks
>  Alex
> 
>> As for the periodic log
>> message in the gatekeeper log or /var/log/messages, I think the WMS
>> tried to launch the condor instance but failed, and then retried again
>> and again.