Hi Graeme,
I've attached the output of ps auxwww for node epcf25. The output for
the other nodes is similar. There is a hanging process that has escaped
pbs_mom:
atlasprd 15908 0.0 0.0 29084 728 ? S Nov09 5:01 python
I should have trapped it :(
Thanks for the info and help,
Yves
On Thu, 15 Nov 2007, Graeme Stewart wrote:
> Hi Yves
>
> The cronus executor has been shut down. Production jobs you are
> seeing will be coming from the standard EGEE grid LEXOR executor.
>
> Have these jobs consumed CPU yet, or are they trying to get started?
>
> I agree this is a terrible waste of sites' resources, and that has
> been a big motivating factor in the decision to move ATLAS production
> to PanDA. Because PanDA stages input datasets on the site's SE and
> puts outputs onto the site's SE as well (it does all other data
> movements asynchronously using ATLAS DDM) we will not see these large
> data management timeouts which currently cripple ATLAS production in
> EGEE.
>
> If you send me the output from ps auxwww I'll try and see what the
> jobs are doing. It's possible you can kill them off - but please
> don't do it yet.
>
> Thanks
>
> Graeme
>
> PS. Yes, we also see inefficient atlasprd jobs at Glasgow.
>
>
> On 15 Nov 2007, at 12:47, Yves Coppens wrote:
>
> > Hello,
> >
> > While investigating why we were failing the Atlas test again, I found
> > (once more) that many prd atlas jobs are sleeping.
> >
> > [root@epcf25 root]# ps -ef | grep sleep
> > atlasprd 27603 23385 0 12:02 ? 00:00:00 sleep 9600
> > atlasprd 27604 23386 0 12:02 ? 00:00:00 sleep 9600
> > root 27712 8088 0 12:17 pts/0 00:00:00 grep sleep
> >
> > [root@epcf28 root]# ps -ef | grep sleep
> > atlasprd 19537 6438 0 10:11 ? 00:00:00 sleep 9600
> > atlasprd 19667 19352 0 11:42 ? 00:00:00 sleep 9600
> > root 19873 19716 0 12:18 pts/0 00:00:00 grep sleep
> > [root@epcf28 root]#
> >
> > and the same on three other worker nodes!
> >
> > I opened GGUS ticket 25848 about this back in August, but no one has
> > addressed it yet. Are they using CRONUS, and is it really that bad!?
> >
> > I do not think this has got anything to do with my failing Steve's test:
> > the failure is caused by a missing file which is actually available in the
> > Atlas software area on all my workers - I shall take this offline with
> > Frederic.
> >
> > Are VOs really claiming that pilot jobs are necessary because they allow
> > them to make more effective use of resources?
> >
> > We should definitely do wall time accounting rather than CPU time accounting.
> >
> > Have other sites seen this too?
> >
> > Yves
>
> --
> Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
> ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
>
atlasprd 15908 0.0 0.0 29084 728 ? S Nov09 5:01 python /home/atlasprd/globus-tmp.epcf25.11958.0/https_3a_2f_2falbalonga.cnaf.infn.it_3a9000_2fTOF6DlRkgWeZBpiR26jMpA/AtlasProduction/12.0.7.2/InstallArea/share/bin/gbb --mon-interval 900 --time-limit 259200 --cpu-limit 1 ./csc_atlasG4_runathena
atlasprd 22023 0.0 0.1 5328 1232 ? S 09:27 0:00 -sh
atlasprd 22029 0.0 0.1 5328 1228 ? S 09:27 0:00 -sh
atlasprd 22284 0.0 0.1 5500 1280 ? S 09:27 0:00 /bin/sh /var/spool/pbs/mom_priv/jobs/451236.epgc.SC
atlasprd 22283 0.0 0.1 5512 1280 ? S 09:27 0:00 /bin/sh /var/spool/pbs/mom_priv/jobs/451235.epgc.SC
atlasprd 22291 0.0 0.2 7188 2576 ? S 09:27 0:00 /usr/bin/perl -w /tmp/bootstrap.T22286 /home/atlasprd/ epgce1.ph.bham.ac.uk /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/64/c4895cd86fff1d2b637f8fb2d6cc74/data X509GPG:globus-cache-export.S15701.gpg /dev/null /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/6e/450dfccad4a5439e6ef51a6b2b5eac/data stdoutftp /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/03/4288415fcfd2d656d2ad8c67c41e28/data stderrftp /home/atlasprd/.lcgjm/globus-cache-export.S15701 https://epgce1.ph.bham.ac.uk:20105/13463/1194957688/ /home/atlasprd/ NONE https://wms002.cnaf.infn.it:20030/var/glite/jobcontrol/submit/BJ/JobWrapper.https_3a_2f_2fwms003.cnaf.infn.it_3a9000_2fBJ5Ku5OEDaga49t2S-gVJw.sh UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
atlasprd 22292 0.0 0.2 7196 2572 ? S 09:27 0:00 /usr/bin/perl -w /tmp/bootstrap.J22285 /home/atlasprd/ epgce1.ph.bham.ac.uk /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/b4/7f137b793ccada718dba27838d43e3/data X509GPG:globus-cache-export.j20275.gpg /dev/null /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/95/38c363066e85b5006bff8a6fb632e7/data stdoutftp /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/78/210b46995d252c7351bcecd6f4c501/data stderrftp /home/atlasprd/.lcgjm/globus-cache-export.j20275 https://epgce1.ph.bham.ac.uk:20035/17010/1194957829/ /home/atlasprd/ NONE https://wms002.cnaf.infn.it:20030/var/glite/jobcontrol/submit/Yz/JobWrapper.https_3a_2f_2fwms003.cnaf.infn.it_3a9000_2fYzaUb-HJfuTUYsxbUFnVkQ.sh UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000
atlasprd 22303 0.0 0.2 7452 2868 ? S 09:27 0:00 /usr/bin/perl -w /tmp/bootstrap.T22286 /home/atlasprd/ epgce1.ph.bham.ac.uk /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/64/c4895cd86fff1d2b637f8fb2d6cc74/data X509GPG:globus-cache-export.S15701.gpg /dev/null /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/6e/450dfccad4a5439e6ef51a6b2b5eac/data stdoutftp /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/03/4288415fcfd2d656d2ad8c67c41e28/data stderrftp /home/atlasprd/.lcgjm/globus-cache-export.S15701 https://epgce1.ph.bham.ac.uk:20105/13463/1194957688/ /home/atlasprd/ NONE https://wms002.cnaf.infn.it:20030/var/glite/jobcontrol/submit/BJ/JobWrapper.https_3a_2f_2fwms003.cnaf.infn.it_3a9000_2fBJ5Ku5OEDaga49t2S-gVJw.sh UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
atlasprd 22304 0.0 0.2 7460 2868 ? S 09:27 0:00 /usr/bin/perl -w /tmp/bootstrap.J22285 /home/atlasprd/ epgce1.ph.bham.ac.uk /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/b4/7f137b793ccada718dba27838d43e3/data X509GPG:globus-cache-export.j20275.gpg /dev/null /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/95/38c363066e85b5006bff8a6fb632e7/data stdoutftp /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/78/210b46995d252c7351bcecd6f4c501/data stderrftp /home/atlasprd/.lcgjm/globus-cache-export.j20275 https://epgce1.ph.bham.ac.uk:20035/17010/1194957829/ /home/atlasprd/ NONE https://wms002.cnaf.infn.it:20030/var/glite/jobcontrol/submit/Yz/JobWrapper.https_3a_2f_2fwms003.cnaf.infn.it_3a9000_2fYzaUb-HJfuTUYsxbUFnVkQ.sh UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000
atlasprd 22805 0.0 0.2 7196 2568 ? S 09:27 0:00 /usr/bin/perl -w /tmp/bootstrap.J22285 /home/atlasprd/ epgce1.ph.bham.ac.uk /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/b4/7f137b793ccada718dba27838d43e3/data X509GPG:globus-cache-export.j20275.gpg /dev/null /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/95/38c363066e85b5006bff8a6fb632e7/data stdoutftp /home/atlasprd/.globus/.gass_cache/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/78/210b46995d252c7351bcecd6f4c501/data stderrftp /home/atlasprd/.lcgjm/globus-cache-export.j20275 https://epgce1.ph.bham.ac.uk:20035/17010/1194957829/ /home/atlasprd/ NONE https://wms002.cnaf.infn.it:20030/var/glite/jobcontrol/submit/Yz/JobWrapper.https_3a_2f_2fwms003.cnaf.infn.it_3a9000_2fYzaUb-HJfuTUYsxbUFnVkQ.sh UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000
atlasprd 22847 0.0 0.2 7188 2572 ? S 09:27 0:00 /usr/bin/perl -w /tmp/bootstrap.T22286 /home/atlasprd/ epgce1.ph.bham.ac.uk /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/64/c4895cd86fff1d2b637f8fb2d6cc74/data X509GPG:globus-cache-export.S15701.gpg /dev/null /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/6e/450dfccad4a5439e6ef51a6b2b5eac/data stdoutftp /home/atlasprd/.globus/.gass_cache/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/03/4288415fcfd2d656d2ad8c67c41e28/data stderrftp /home/atlasprd/.lcgjm/globus-cache-export.S15701 https://epgce1.ph.bham.ac.uk:20105/13463/1194957688/ /home/atlasprd/ NONE https://wms002.cnaf.infn.it:20030/var/glite/jobcontrol/submit/BJ/JobWrapper.https_3a_2f_2fwms003.cnaf.infn.it_3a9000_2fBJ5Ku5OEDaga49t2S-gVJw.sh UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
atlasprd 23231 0.0 0.1 5272 1056 ? S 09:27 0:00 sh -c if [ -x ${LCG_LOCATION:-/opt/lcg}/libexec/jobwrapper ]; then ${LCG_LOCATION:-/opt/lcg}/libexec/jobwrapper /home/atlasprd/globus-tmp.epcf25.22292.0/globus-tmp.epcf25.22292.0/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/98/be3c2d5b58dd8be81616387f64f14a/data UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000; else /home/atlasprd/globus-tmp.epcf25.22292.0/globus-tmp.epcf25.22292.0/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/98/be3c2d5b58dd8be81616387f64f14a/data UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000; fi
atlasprd 23232 0.0 0.1 5280 1144 ? S 09:27 0:00 /bin/sh /opt/lcg/libexec/jobwrapper /home/atlasprd/globus-tmp.epcf25.22292.0/globus-tmp.epcf25.22292.0/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/98/be3c2d5b58dd8be81616387f64f14a/data UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000
atlasprd 23241 0.0 0.1 5272 1064 ? S 09:27 0:00 sh -c if [ -x ${LCG_LOCATION:-/opt/lcg}/libexec/jobwrapper ]; then ${LCG_LOCATION:-/opt/lcg}/libexec/jobwrapper /home/atlasprd/globus-tmp.epcf25.22291.0/globus-tmp.epcf25.22291.0/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/a6/93df720151804cc6f41c84098a5cf8/data UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000; else /home/atlasprd/globus-tmp.epcf25.22291.0/globus-tmp.epcf25.22291.0/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/a6/93df720151804cc6f41c84098a5cf8/data UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000; fi
atlasprd 23245 0.0 0.1 5276 1144 ? S 09:27 0:00 /bin/sh /opt/lcg/libexec/jobwrapper /home/atlasprd/globus-tmp.epcf25.22291.0/globus-tmp.epcf25.22291.0/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/a6/93df720151804cc6f41c84098a5cf8/data UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
atlasprd 23385 0.0 0.1 5284 1236 ? S 09:27 0:00 /bin/sh /home/atlasprd/globus-tmp.epcf25.22292.0/globus-tmp.epcf25.22292.0/local/md5/fd/1eddeed7c71681ed7aa9172489f90a/md5/98/be3c2d5b58dd8be81616387f64f14a/data UI=000000:NS=0000000004:WM=000015:BH=0000000000:JSS=000009:LM=000010:LRMS=000000:APP=000000:LBS=000000
atlasprd 23386 0.0 0.1 5292 1192 ? S 09:27 0:00 /bin/sh /home/atlasprd/globus-tmp.epcf25.22291.0/globus-tmp.epcf25.22291.0/local/md5/2b/cdb1fa77cbe87140695685bd950539/md5/a6/93df720151804cc6f41c84098a5cf8/data UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
atlasprd 27603 0.0 0.0 4924 564 ? S 12:02 0:00 sleep 9600
atlasprd 27604 0.0 0.0 4932 564 ? S 12:02 0:00 sleep 9600
root 28756 0.0 0.0 4780 672 pts/1 S 13:25 0:00 grep atlas
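
A listing like the one above can be scanned for processes that may have outlived their batch job. The following is only a rough sketch, not a tool used at the site: it assumes the usual procps column order for `ps auxwww` (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) and that the START column shows HH:MM for recently started processes and a date such as Nov09 for older ones. The names parse_ps_line, started_on_earlier_day, find_stale and SAMPLE are hypothetical and chosen for illustration.

```python
import re

# Three short lines taken from the listing above.
SAMPLE = [
    "atlasprd 15908 0.0 0.0 29084 728 ? S Nov09 5:01 python",
    "atlasprd 27603 0.0 0.0 4924 564 ? S 12:02 0:00 sleep 9600",
    "root 28756 0.0 0.0 4780 672 pts/1 S 13:25 0:00 grep atlas",
]


def parse_ps_line(line):
    """Split one `ps auxwww` line into the fields needed here.

    Assumes 10 whitespace-separated columns followed by the command,
    which may itself contain spaces. Returns None for header lines or
    lines that do not fit that layout.
    """
    parts = line.split(None, 10)
    if len(parts) < 11 or not parts[1].isdigit():
        return None
    return {
        "user": parts[0],
        "pid": int(parts[1]),
        "start": parts[8],
        "time": parts[9],
        "command": parts[10],
    }


def started_on_earlier_day(start):
    """True if START is not an HH:MM time (assumed to mean an older process)."""
    return re.fullmatch(r"\d{1,2}:\d{2}", start) is None


def find_stale(lines, user="atlasprd"):
    """Return parsed records for `user` whose START is a date, not a time."""
    stale = []
    for line in lines:
        rec = parse_ps_line(line)
        if rec and rec["user"] == user and started_on_earlier_day(rec["start"]):
            stale.append(rec)
    return stale


if __name__ == "__main__":
    for rec in find_stale(SAMPLE):
        print(rec["pid"], rec["start"], rec["time"], rec["command"])
```

This only flags candidates. Whether a flagged PID really escaped pbs_mom would still have to be checked against the jobs the batch system believes are running on the node.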