On 02/06/11 20:46, Christopher J.Walker wrote:
[ ... ]
> The ATLAS SAM tests, Steve copies the results from the SAM testing
> framework - you need to look at that direct.
> https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=UKI&vo=atlas&order=SiteName&funct=ShowSensorTests
Ahhh, thanks a lot. I hadn't realised they were the same, or indeed thought
to look there. Very useful. I have just noticed some "Job was
aborted" errors at ECDF, which are similar to other errors I see
at other sites and here:
[ ... ]
> - Host = wms201.cern.ch
> - Reason = 7 authentication failed: GSS Major Status: Unexpected Gatekeeper or Service Name GSS Minor Status Error Chain: init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
[ ... ]
> - Timestamp = Thu Jun 2 21:50:00 2011 CEST
> - User = /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=proxy
That error I have also seen in SteveL's own tests and from
users, and it is intermittent. There is a page about it here:
https://wiki.egi.eu/wiki/Tools/Manuals/TS01
which I had a look at some time ago, and somehow I strongly
suspect proxy or CRL problems, which may be hitting us also with
the 'pheno' LFC. I remember seeing on a subregion blog (probably
the Scotgrid one) advice to refresh the CRLs hourly instead of
every six hours.
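As a rough way to check how fresh the CRLs on a node actually are, a
sketch along these lines might help. It assumes the CRLs sit as *.r0
files in the conventional /etc/grid-security/certificates directory, and
that a file's modification time reflects the last successful fetch
(which may not hold for every fetch-crl version). The function name
stale_crls and the one-hour threshold are only illustrative.

```python
import os
import time


def stale_crls(cert_dir, max_age_hours=1.0, now=None):
    """Return sorted names of CRL files (*.r0) older than max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for name in os.listdir(cert_dir):
        # Only CRL files are of interest; skip CA certificates etc.
        if not name.endswith(".r0"):
            continue
        path = os.path.join(cert_dir, name)
        if os.path.getmtime(path) < cutoff:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    # Assumption: the conventional grid trust-anchor directory.
    cert_dir = "/etc/grid-security/certificates"
    if os.path.isdir(cert_dir):
        for name in stale_crls(cert_dir, max_age_hours=1.0):
            print("stale CRL:", name)
```

Run from cron on a CE or WMS host, this would at least show whether the
refresh interval in use is being met in practice.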
> Steve also runs his own tests as an Atlas user - labeled "atlas tests" -
> clicking on the failed job links to a summary page which links to
> detailed job output.
Yes, I have seen those, but the list of jobs with the links is
way down the page and for quite a few months I had not noticed
it was there.
[ ... ]
> Yes, that's a wms problem. You see a vertical stripe of red at
> http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html if that's the case.
I think I am noticing that WMS issues are more intermittent than
that. Given that some of my 'pheno' users submit batches of
2,000-5,000 jobs I suspect that they generate lots of transient
load spikes on servers like WMSes, proxies, etc.; but sometimes
they have problems with batches of 50 jobs.