On Mon, 19 May 2003, Frederic Brochu wrote:
> Hello,
>
>
> I tried last Friday to submit jobs to the new EDG site in Glasgow,
> and although my jobs ran successfully, they were systematically declared as
> Failed, as you can see in the attached job logging.
>
> I followed the behaviour of my jobs over time with globus-job-run,
> and found that the beginning is fine (Job submitted, transferred to RB,
> matched, transferred to CE, even Scheduled) until the job starts.
> Within the minute the job starts, the dg-job-status goes to Done, in spite of
> the fact that the job is still alive and running.
> I would explain the Failed status by the fact that the job is still running
> and therefore its output is not available.
>
> Has anybody (in the CC list) already experienced this?
>
> Best regards,
> Frederic
>
>
Frederic,
This is most interesting - your comments from probing our frontend
gatekeeper explain some of our pbs_mom logs.
I can see 4 jobs for you on Friday evening. They all ran on node62 of our
cluster and I have attached the edited highlights from the pbs_mom log.
The 2nd job looks quite normal and the 4th looks like it timed out after 1
hour. The 1st and 3rd had a problem that I have seen before but could not
explain.
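
As a rough illustration, the run time of each job can be read off the
"Started" and "Terminated" lines of the attached log. The Python sketch below
assumes the semicolon-separated layout seen in those lines (the layout is
inferred from the attachment, not taken from PBS documentation); the name
job_durations is purely illustrative, and the sample lines are copied from the
attachment.

```python
from datetime import datetime

# A few lines copied from the attached pbs_mom log (jobs 89573 and 89614).
SAMPLE = """\
05/16/2003 17:26:36;0008; pbs_mom;Job;89573.masternode;Started, pid = 21375
05/16/2003 17:27:55;0008; pbs_mom;Job;89573.masternode;Terminated
05/16/2003 18:07:46;0008; pbs_mom;Job;89614.masternode;Started, pid = 21873
05/16/2003 19:07:53;0008; pbs_mom;Job;89614.masternode;kill_job
05/16/2003 19:07:53;0008; pbs_mom;Job;89614.masternode;Terminated
"""


def job_durations(log_text):
    """Return {job_id: timedelta} between each job's Started and Terminated lines."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        # Assumed layout: timestamp;code;daemon;class;id;message
        parts = line.split(";", 5)
        if len(parts) != 6 or parts[3] != "Job":
            continue  # skip "Fil" records and anything malformed
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        job_id = parts[4]
        message = parts[5].strip()
        if message.startswith("Started"):
            started[job_id] = when
        elif message == "Terminated" and job_id in started:
            durations[job_id] = when - started[job_id]
    return durations


for job_id, elapsed in job_durations(SAMPLE).items():
    print(job_id, elapsed)
```

On the sample this gives 0:01:19 for 89573 and 1:00:07 for 89614, which is
consistent with the 4th job being killed after about one hour.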
As I understand the scheme, the RB wraps the user's job in a script that
does
    globus-url-copy input sandbox and brokerinfo RB -> WN
    run user's job
    globus-url-copy output sandbox WN -> RB
and then ~globus-job-submits it to the CE's globus gatekeeper. The
gatekeeper in turn wraps the RB script in a few lines that
    set up globus environment on execution host
    execute received job from deep in user's ~/.globus/.gass_cache
and qsub's this globus gatekeeper wrapper to the pbs_server.
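
The layering described above can be modelled in a few lines of Python. This is
only a sketch of the ordering as I understand it, not the real RB or gatekeeper
code: the function names and the "./analysis.sh" job are hypothetical, and each
step is just a descriptive string. The final step, where pbs_mom copies the
wrapper's stdout/stderr back to the gatekeeper, is the one that fails in the
attached log.

```python
def rb_wrapper(user_job):
    """Steps performed by the script the RB wraps around the user's job."""
    return [
        ("RB wrapper", "globus-url-copy input sandbox and brokerinfo RB -> WN"),
        ("RB wrapper", f"run user's job: {user_job}"),
        ("RB wrapper", "globus-url-copy output sandbox WN -> RB"),
    ]


def gatekeeper_wrapper(received_job_steps):
    """The gatekeeper sets up the environment, then runs the received job."""
    return [
        ("gatekeeper wrapper", "set up globus environment on execution host"),
        *received_job_steps,
    ]


def pbs_job(wrapper_steps):
    """After the wrapper exits, pbs_mom stages its stdout/stderr back."""
    return [
        *wrapper_steps,
        ("pbs_mom", "copy wrapper stdout/stderr back to the gatekeeper"),
    ]


# Hypothetical user job, for illustration only.
steps = pbs_job(gatekeeper_wrapper(rb_wrapper("./analysis.sh")))
for number, (layer, action) in enumerate(steps, start=1):
    print(f"{number}. [{layer}] {action}")
```

The point of the model is that the failing copy comes last, after the RB
wrapper has already sent the output sandbox back to the RB.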
The attached log shows a failure to rcp/scp the stdout and stderr of the
"globus gatekeeper wrapper" back to the gatekeeper. The rcp would fail
due to firewall/libwrap restrictions, but it looks like the scp fails
because the ~/.globus/.gass_cache subdirectory is not there.
(It should really have $usecp directives and do "cp" directly,
but that is another story, I think.)
Your observations suggest it may not be there because someone thinks the
job is finished and has cleaned away the ~/.globus/.gass_cache
sub-directory used by the job.
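
To make the suspected sequence concrete, here is a small local model in
Python. It does not use PBS or Globus at all: temporary directories stand in
for the pbs spool and the .gass_cache sub-directory, the names stage_out and
simulate are made up for the illustration, and a plain file copy stands in for
scp. It only shows that removing the destination directory before the copy
produces the same "No such file or directory" outcome seen in the log.

```python
import shutil
import tempfile
from pathlib import Path


def stage_out(spool_file, destination):
    """Copy a spooled output file to its destination and report the outcome."""
    try:
        shutil.copy(spool_file, destination)
        return "copied"
    except FileNotFoundError:
        # The destination directory no longer exists.
        return "No such file or directory"


def simulate(cache_cleaned_early):
    """Model one job; all paths are temporary stand-ins, not real ones."""
    with tempfile.TemporaryDirectory() as root:
        spool = Path(root) / "spool"
        cache = Path(root) / "gass_cache" / "job"
        spool.mkdir()
        cache.mkdir(parents=True)
        stdout_file = spool / "job.OU"
        stdout_file.write_text("job stdout\n")
        if cache_cleaned_early:
            # Something decided the job was finished and removed its cache.
            shutil.rmtree(cache)
        return stage_out(stdout_file, cache / "data")


print("cache kept:   ", simulate(cache_cleaned_early=False))
print("cache removed:", simulate(cache_cleaned_early=True))
```

If this is what happens, the open question is which component removes the
cache directory while the job is still running.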
I have no idea what sort of error I am looking for - has anyone seen
anything similar ?
David Martin
Dept of Physics and Astronomy,
University of Glasgow,
Glasgow, G12 8QQ,
United Kingdom
tel: (0)141 330 4197 fax: (0)141 330 5881
email: [log in to unmask]
05/16/2003 17:12:29;0008; pbs_mom;Job;89572.masternode;Started, pid = 21122
05/16/2003 17:20:59;0080; pbs_mom;Job;89572.masternode;task 1 terminated
05/16/2003 17:20:59;0008; pbs_mom;Job;89572.masternode;Terminated
05/16/2003 17:20:59;0008; pbs_mom;Job;89572.masternode;kill_job
05/16/2003 17:21:04;0080; pbs_mom;Job;89572.masternode;Obit sent
05/16/2003 17:21:04;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89572.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/cd/76/b9/d57a71d25e400d53442b687783/data status=1, try=1
05/16/2003 17:21:35;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89572.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/cd/76/b9/d57a71d25e400d53442b687783/data status=1, try=2
05/16/2003 17:21:47;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89572.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/cd/76/b9/d57a71d25e400d53442b687783/data status=1, try=3
05/16/2003 17:22:18;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89572.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/cd/76/b9/d57a71d25e400d53442b687783/data status=1, try=4
05/16/2003 17:22:39;0004; pbs_mom;Fil;89572.maste.OU;Unable to copy file 89572.maste.OU to ce0-gla.scotgrid.ac.uk:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/cd/76/b9/d57a71d25e400d53442b687783/data
05/16/2003 17:22:39;0004; pbs_mom;Fil;89572.maste.OU;ce0-gla.scotgrid.ac.uk: Connection refused
05/16/2003 17:22:39;0004; pbs_mom;Fil;89572.maste.OU;d.ac.uk' (RSA1) to the list of known hosts.
05/16/2003 17:22:39;0004; pbs_mom;Fil;89572.maste.OU;scp: /home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/cd/76/b9/d57a71d25e400d53442b687783/data: No such file or directory
05/16/2003 17:22:39;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89572.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/8d/2f/85/678bffc54ab96cf9326dd3b786/data status=1, try=1
05/16/2003 17:23:10;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89572.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/8d/2f/85/678bffc54ab96cf9326dd3b786/data status=1, try=2
05/16/2003 17:23:21;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89572.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/8d/2f/85/678bffc54ab96cf9326dd3b786/data status=1, try=3
05/16/2003 17:23:52;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89572.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/8d/2f/85/678bffc54ab96cf9326dd3b786/data status=1, try=4
05/16/2003 17:24:13;0004; pbs_mom;Fil;89572.maste.ER;Unable to copy file 89572.maste.ER to ce0-gla.scotgrid.ac.uk:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/8d/2f/85/678bffc54ab96cf9326dd3b786/data
05/16/2003 17:24:13;0004; pbs_mom;Fil;89572.maste.ER;ce0-gla.scotgrid.ac.uk: Connection refused
05/16/2003 17:24:13;0004; pbs_mom;Fil;89572.maste.ER;d.ac.uk' (RSA1) to the list of known hosts.
05/16/2003 17:24:13;0004; pbs_mom;Fil;89572.maste.ER;scp: /home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/da/cf/da/48bd18a520b7df8626c5937e19/md5/8d/2f/85/678bffc54ab96cf9326dd3b786/data: No such file or directory
05/16/2003 17:26:36;0008; pbs_mom;Job;89573.masternode;Started, pid = 21375
05/16/2003 17:27:55;0080; pbs_mom;Job;89573.masternode;task 1 terminated
05/16/2003 17:27:55;0008; pbs_mom;Job;89573.masternode;Terminated
05/16/2003 17:27:55;0008; pbs_mom;Job;89573.masternode;kill_job
05/16/2003 17:28:01;0080; pbs_mom;Job;89573.masternode;Obit sent
05/16/2003 17:49:39;0008; pbs_mom;Job;89574.masternode;Started, pid = 21614
05/16/2003 17:56:10;0080; pbs_mom;Job;89574.masternode;task 1 terminated
05/16/2003 17:56:10;0008; pbs_mom;Job;89574.masternode;Terminated
05/16/2003 17:56:10;0008; pbs_mom;Job;89574.masternode;kill_job
05/16/2003 17:56:15;0080; pbs_mom;Job;89574.masternode;Obit sent
05/16/2003 17:56:16;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89574.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/ec/0f/5c/829cfe343dc04cc6038fde8f16/data status=1, try=1
05/16/2003 17:56:47;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89574.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/ec/0f/5c/829cfe343dc04cc6038fde8f16/data status=1, try=2
05/16/2003 17:56:58;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89574.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/ec/0f/5c/829cfe343dc04cc6038fde8f16/data status=1, try=3
05/16/2003 17:57:29;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89574.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/ec/0f/5c/829cfe343dc04cc6038fde8f16/data status=1, try=4
05/16/2003 17:57:51;0004; pbs_mom;Fil;89574.maste.OU;Unable to copy file 89574.maste.OU to ce0-gla.scotgrid.ac.uk:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/ec/0f/5c/829cfe343dc04cc6038fde8f16/data
05/16/2003 17:57:51;0004; pbs_mom;Fil;89574.maste.OU;ce0-gla.scotgrid.ac.uk: Connection refused
05/16/2003 17:57:51;0004; pbs_mom;Fil;89574.maste.OU;d.ac.uk' (RSA1) to the list of known hosts.
05/16/2003 17:57:51;0004; pbs_mom;Fil;89574.maste.OU;scp: /home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/ec/0f/5c/829cfe343dc04cc6038fde8f16/data: No such file or directory
05/16/2003 17:57:51;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89574.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/44/1e/1a/e436ddad6f48831bb8f2a64dba/data status=1, try=1
05/16/2003 17:58:22;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89574.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/44/1e/1a/e436ddad6f48831bb8f2a64dba/data status=1, try=2
05/16/2003 17:58:33;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89574.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/44/1e/1a/e436ddad6f48831bb8f2a64dba/data status=1, try=3
05/16/2003 17:59:04;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89574.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/44/1e/1a/e436ddad6f48831bb8f2a64dba/data status=1, try=4
05/16/2003 17:59:25;0004; pbs_mom;Fil;89574.maste.ER;Unable to copy file 89574.maste.ER to ce0-gla.scotgrid.ac.uk:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/44/1e/1a/e436ddad6f48831bb8f2a64dba/data
05/16/2003 17:59:25;0004; pbs_mom;Fil;89574.maste.ER;ce0-gla.scotgrid.ac.uk: Connection refused
05/16/2003 17:59:25;0004; pbs_mom;Fil;89574.maste.ER;d.ac.uk' (RSA1) to the list of known hosts.
05/16/2003 17:59:25;0004; pbs_mom;Fil;89574.maste.ER;scp: /home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/91/d5/13/9ba3012e4e231867b131b48cfe/md5/44/1e/1a/e436ddad6f48831bb8f2a64dba/data: No such file or directory
05/16/2003 18:07:46;0008; pbs_mom;Job;89614.masternode;Started, pid = 21873
05/16/2003 19:07:53;0008; pbs_mom;Job;89614.masternode;kill_job
05/16/2003 19:07:53;0080; pbs_mom;Job;89614.masternode;task 1 terminated
05/16/2003 19:07:53;0008; pbs_mom;Job;89614.masternode;Terminated
05/16/2003 19:07:53;0008; pbs_mom;Job;89614.masternode;kill_job
05/16/2003 19:07:55;0008; pbs_mom;Job;89614.masternode;kill_job
05/16/2003 19:07:58;0080; pbs_mom;Job;89614.masternode;Obit sent
05/16/2003 19:07:59;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89614.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/d2/c3/69/3da50f5f74240a5eef72ae73f8/data status=1, try=1
05/16/2003 19:08:30;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89614.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/d2/c3/69/3da50f5f74240a5eef72ae73f8/data status=1, try=2
05/16/2003 19:08:41;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89614.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/d2/c3/69/3da50f5f74240a5eef72ae73f8/data status=1, try=3
05/16/2003 19:09:12;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89614.maste.OU [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/d2/c3/69/3da50f5f74240a5eef72ae73f8/data status=1, try=4
05/16/2003 19:09:33;0004; pbs_mom;Fil;89614.maste.OU;Unable to copy file 89614.maste.OU to ce0-gla.scotgrid.ac.uk:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/d2/c3/69/3da50f5f74240a5eef72ae73f8/data
05/16/2003 19:09:33;0004; pbs_mom;Fil;89614.maste.OU;ce0-gla.scotgrid.ac.uk: Connection refused
05/16/2003 19:09:33;0004; pbs_mom;Fil;89614.maste.OU;d.ac.uk' (RSA1) to the list of known hosts.
05/16/2003 19:09:33;0004; pbs_mom;Fil;89614.maste.OU;scp: /home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/d2/c3/69/3da50f5f74240a5eef72ae73f8/data: No such file or directory
05/16/2003 19:09:33;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89614.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/f8/93/12/772169f78405cdbba49d2d8009/data status=1, try=1
05/16/2003 19:10:05;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89614.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/f8/93/12/772169f78405cdbba49d2d8009/data status=1, try=2
05/16/2003 19:10:16;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br /var/spool/pbs/spool/89614.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/f8/93/12/772169f78405cdbba49d2d8009/data status=1, try=3
05/16/2003 19:10:47;0080; pbs_mom;Fil;sys_copy;command: /usr/local/pbs/sbin/pbs_rcp -r /var/spool/pbs/spool/89614.maste.ER [log in to unmask]:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/f8/93/12/772169f78405cdbba49d2d8009/data status=1, try=4
05/16/2003 19:11:08;0004; pbs_mom;Fil;89614.maste.ER;Unable to copy file 89614.maste.ER to ce0-gla.scotgrid.ac.uk:/home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/f8/93/12/772169f78405cdbba49d2d8009/data
05/16/2003 19:11:08;0004; pbs_mom;Fil;89614.maste.ER;ce0-gla.scotgrid.ac.uk: Connection refused
05/16/2003 19:11:08;0004; pbs_mom;Fil;89614.maste.ER;d.ac.uk' (RSA1) to the list of known hosts.
05/16/2003 19:11:08;0004; pbs_mom;Fil;89614.maste.ER;scp: /home_scotgrid/a/atlas001/.globus/.gass_cache/local/md5/35/7f/59/706d770214e0e3085d01efcbbf/md5/f8/93/12/772169f78405cdbba49d2d8009/data: No such file or directory