On Fri, Aug 19, 2005 at 11:21:48AM +0100 or thereabouts, Steve Traylen wrote:
> On Thu, Aug 18, 2005 at 07:42:31PM +0100 or thereabouts, Dr D J Colling wrote:
> > Hi,
> >
> > A few weeks ago I was trying to do some CMS production, the last stage of
> > which was to copy the output to the storage element at RAL, and we had
> > lots of failures in the copy. In the end we cheated and forced all the
> > jobs to go to the RAL CE and copy to the RAL SE.
> >
> > This worked (as you would hope), but it didn't seem very Grid-like, so
> > last night and today I submitted lots (hundreds) of very short jobs that
> > just tried doing an lcg-cr to the RAL dCache. Most (a far greater
> > fraction than a few weeks ago) copied the files successfully. Those few
> > that failed did so for two reasons:
I've noticed the BDII is a little bit stressed out.
http://ganglia.gridpp.rl.ac.uk/?c=LCG%20Others&h=lcgbdii02.gridpp.rl.ac.uk&m=%5Bnone%5D&r=day&s=descending
and there is a spike in the return times.
http://goc.grid.sinica.edu.tw/gstat/RAL-LCG2/BDIINode_Perf_tim_.html
This corresponds roughly to when your lump of CMS jobs landed here.
http://ganglia.gridpp.rl.ac.uk/specials/pbs.php?h=OpenPBS%20server%2fcsflnx353.rl.ac.uk&m=%5Bnone%5D&r=day&s=descending
How many replications are we talking about per job here?
We can look into some load balancing or something if it is just a matter
of speed, but I expect it is blocking or something.... Looking.
Steve
> >
> > 1.
> >
> > SE type not found
> > lcg_cr: Invalid argument
> >
> > This was the one that I saw most of when trying to do the MC production.
> > However, there are far fewer of these now. This seemed to be for a whole
> > site rather than individual nodes.
> >
> > 2.
> > SE endpoint not found
> > SE endpoint not found
> > SE endpoint not found
> >
> > Usually repeated three times as shown.
>
> Hi Dave,
>
> I don't know the answers. GIS of course could help but basically they
> are all information failures of one sort or another.
>
> Steve
> >
> > Does anybody know what causes these two errors? How can I protect against
> > them? The first seemed to be for all nodes at a site so retrying would not
> > help whereas the second seemed to be transitory.
> >
> > Sorry if these are "Numpty" questions answered elsewhere ... if they are,
> > please could somebody point me to this information.
> >
> > All the best and thanks for your help,
> > david
> >
> > PS For Stephen Burke:
> > Numpty Dumpty didn't have a great fall ... he was hit by a car.
>
> --
> Steve Traylen
> http://www.gridpp.ac.uk/
--
Steve Traylen
http://www.gridpp.ac.uk/