JISCMail - TB-SUPPORT Archives

On Fri, Aug 19, 2005 at 11:21:48AM +0100 or thereabouts, Steve Traylen wrote:
> On Thu, Aug 18, 2005 at 07:42:31PM +0100 or thereabouts, Dr D J Colling wrote:
> > Hi,
> > 
> > A few weeks ago I was trying to do some CMS production, the last stage of 
> > which was to copy the output to the storage element at RAL... and we had 
> > lots of failures in the copy. In the end we cheated and forced all the 
> > jobs to go to the RAL CE and copy to the RAL SE.
> > 
> > This worked (as you would hope); however, it didn't seem to be very 
> > Grid-like, so last night and today I submitted lots (hundreds) of very 
> > short jobs that just tried doing an lcg-cr to the RAL dcache. Most (a far 
> > greater fraction than a few weeks ago) copied the files successfully. Those 
> > few that failed did so for two reasons:

I've noticed the BDII is a little bit stressed out.

http://ganglia.gridpp.rl.ac.uk/?c=LCG%20Others&h=lcgbdii02.gridpp.rl.ac.uk&m=%5Bnone%5D&r=day&s=descending

and there is a spike in the return times.

http://goc.grid.sinica.edu.tw/gstat/RAL-LCG2/BDIINode_Perf_tim_.html

Which corresponds roughly to when your lump of CMS jobs landed here.

http://ganglia.gridpp.rl.ac.uk/specials/pbs.php?h=OpenPBS%20server%2fcsflnx353.rl.ac.uk&m=%5Bnone%5D&r=day&s=descending

How many replications are we talking about per job here?

We can look into some load balancing or something if it is just a matter
of speed, but I expect it is blocking or something... Looking.

 Steve


> > 
> > 1. 
> > 
> > SE type not found
> > lcg_cr: Invalid argument 
> > 
> > This was the one that I saw most of when trying to do the MC production. 
> > However, there are far fewer of these now. This seemed to be for a whole 
> > site rather than individual nodes. 
> > 
> > 2.
> > SE endpoint not found
> > SE endpoint not found
> > SE endpoint not found
> > 
> > Usually repeated three times as shown.
> 
> Hi Dave,
> 
>  I don't know the answers. GIS of course could help but basically they
>  are all information failures of one sort or another.
> 
>  Steve
> > 
> > Does anybody know what causes these two errors? How can I protect against 
> > them? The first seemed to be for all nodes at a site so retrying would not 
> > help whereas the second seemed to be transitory.
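One possible way to guard against the second, apparently transitory, failure is to retry the copy only when the output contains "SE endpoint not found", and to give up straight away on "SE type not found", since retrying did not seem to help there. What follows is a minimal sketch under stated assumptions, not a tested production wrapper: the function names, retry count and delay are hypothetical, it assumes lcg-cr exits non-zero on failure, and the command is passed in as whatever lcg-cr invocation the job already uses. Only the two error strings come from the failures quoted above.

```python
import subprocess
import time

# Error strings taken from the failures reported in this thread.
PERMANENT = "SE type not found"      # hit whole sites; retrying did not help
TRANSIENT = "SE endpoint not found"  # appeared to be transitory


def classify(returncode, output):
    """Classify one copy attempt as 'ok', 'transient' or 'permanent'."""
    if returncode == 0:
        return "ok"
    if TRANSIENT in output:
        return "transient"
    # "SE type not found" and any unrecognised error: do not retry.
    return "permanent"


def run_command(cmd):
    """Run cmd (a list of arguments) and return (exit code, combined output)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    return proc.returncode, proc.stdout


def copy_with_retry(cmd, attempts=3, delay=60.0,
                    run=run_command, sleep=time.sleep):
    """Retry cmd while it fails transiently; return (outcome, attempts used)."""
    outcome = "permanent"
    for attempt in range(1, attempts + 1):
        returncode, output = run(cmd)
        outcome = classify(returncode, output)
        if outcome != "transient":
            return outcome, attempt
        if attempt < attempts:
            # Wait a little longer each time before trying again.
            sleep(delay * attempt)
    return outcome, attempts
```

A job would call copy_with_retry() with its existing lcg-cr argument list and treat anything other than "ok" as a failed copy. Spacing the retries out may also avoid adding to the load on the information system, though that is a guess rather than something measured.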
> > 
> > Sorry if these are "Numpty" questions answered elsewhere ... if they are 
> > please could somebody point me to this information.
> > 
> > All the best and thanks for your help,
> > david
> > 
> > PS For Stephen Burke:
> > Numpty Dumpty didn't have a great fall ... he was hit by a car.
> 
> -- 
> Steve Traylen
> http://www.gridpp.ac.uk/

-- 
Steve Traylen
http://www.gridpp.ac.uk/