Hi Trevor.
A few comments on MapCenter.
- I see that you are monitoring all WNs currently in use. As today we
only have ~60 WNs in total this is OK, but do you plan to keep
monitoring them when they number in the thousands? Also, the list of
WNs is, at least for CERN, quite dynamic and it would probably be a
huge burden to keep track of all changes (right now you are missing
at least 15 CERN WNs which were added last week).
- Most of the nodes show problems which are there on one test and then
disappear on the next. A good example is the UI node in Tokyo,
dgui0.icepp.s.u-tokyo.ac.jp. There you can see that ssh is reported
going up and down once or twice every hour. Reporting all these very
transient problems will end up clogging the system and making it
very hard to spot whether a node has real problems. Is there any way to
improve the situation with a retry system? E.g.: if a test shows that a
service switched from "good" to "bad", then wait 5 seconds and retry the
test, confirming the state transition only if the second test also fails.
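To make the retry idea concrete, here is a minimal Python sketch. It is
only an illustration of the logic, not MapCenter code: the function name
confirmed_state and the check callable are hypothetical, and the 5 second
delay is just the value suggested above.

```python
import time
from typing import Callable


def confirmed_state(check: Callable[[], bool],
                    previous_ok: bool,
                    retry_delay: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return the state to record for a service after one test cycle.

    check       -- hypothetical callable running one test (True = pass)
    previous_ok -- state recorded at the previous cycle
    retry_delay -- seconds to wait before the confirming retry
    """
    if check():
        return True           # test passed: service is "good"
    if not previous_ok:
        return False          # was already "bad": nothing to confirm
    sleep(retry_delay)        # "good" -> "bad": wait, then retry once
    return check()            # recorded as "bad" only if the retry fails too
```

With this, a single failed test followed by a successful retry leaves the
recorded state at "good", so no alarm is added to the history.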
About CERN:
- the LCG-1 group of nodes is correct (besides the extra WNs I mentioned
above) but the Testbed group is not. Currently we have a single official
UI, adc0014.cern.ch, plus one UI we use for local tests,
testbed011.cern.ch. tbed0116 is OK as it is used for the GridIce
monitoring (is it supposed to run tomcat?) while tbed0117 is currently a
WN, not a CE.
- besides the production system, CERN has two Certification&Test
testbeds plus one Installation testbed. These systems are always in a
very dynamic state, so I wonder if it is really useful to monitor them,
but it is up to the C&T group to decide.
Thanks for the good and useful work. Ciao
Emanuele
Daniels, T (Trevor) wrote:
> The GOC report this morning takes a different form. I have recently
> concentrated on setting up the static version of MapCenter developed by
> Franck Bonnassieux at Lyon to correctly reflect the state of services which
> the LCG1 sites are currently providing, and I now want to test, with your
> help, whether this is providing useful information.
>
> In case you are not familiar with MapCenter this is what it does. A host
> may be tested in three ways: the usual ping, a port scan of specified ports,
> and a variety of other checks grouped under URLs (the last of these is not used by
> the LCG1 GOC at present). Each check results in a pass or a fail, and
> MapCenter presents the current status of all hosts in a variety of displays.
> The status of a host is represented by the results from all the tests: all
> tests passed is shown by a green 'OK'; all tests failed by a red 'X'; and a
> mixture of results by a brown dot. Individual hosts are grouped in sites,
> and the status of a site is represented by the best and worst state of the
> hosts in that site. Sites similarly are grouped into countries whose state
> is represented by the best and worst state of the sites in that country.
> Have a look at http://mapcentre.rl.ac.uk/fullview.html which shows this
> hierarchy of states. The nature and result of the individual tests are shown
> at the right of each host name. Any test can either pass (shown in green)
> or fail (shown in red), and the text shows either 'icmp' (ping test) or the
> name of the service (port scan test) which was scanned.
>
> Other views present this information in different ways, best explored by
> yourself, but one further view will be of interest to sysadmins. Click on
> the hostname in the view referenced above and you will see details of the
> tests performed on that host and below that a history showing all the recent
> changes of state of that host (called Alarms History). This history will
> show when tests failed and when they started working again (to a resolution
> of 10 minutes).
>
> Comments on the general usefulness and correctness of individual sites to
> the list please.
>
> OK, so what is MapCenter showing this morning:
>
> Prague: all services up
>
> IN2P3: not yet operational
> (when a test has never been passed since the last restart it is
> shown in purple, and the state by a blue '?')
>
> FZK: the SE mds (ldap) service is not responding (port 2135)
> the CE logd service is not responding (port 9002)
> (the LCFG and UI states are not of interest and these states are not
> propagated higher, shown by the '<-?' symbol at the right)
>
> Budapest: all services up (but see the history of some hosts, Gergo)
>
> INFN: the SE gsiftp and mds services are not responding (ports 2811, 2135)
>
> ICEPP: all services up
>
> SINP: all services currently down; history shows they come and go
>
> Krakow: all services up
>
> Barcelona: all services up
> (the LCFG server should not propagate up - a bug?)
>
> CERN: I've tried to divide the hosts into production and testbed;
> perhaps Emanuele could check this and let me know if it is right
> before I comment.
>
> Taiwan: all services up
>
> RAL: the SE gsiftp and mds services are not responding (ports 2811, 2135)
> (fluctuating history)
> the PROXY service is not responding (port 7512)
>
> BNL: not yet operational
>
> FNAL: all services up
>
> HTH
>
> Trevor
>
> Dr Trevor Daniels
> c/o CCLRC eSC Department Phone: (+44)|(0) 1235 778093
> Rutherford Appleton Laboratory Fax: (+44)|(0) 1235 446626
> Chilton, DIDCOT, Oxon, OX11 0QX, UK Email: [log in to unmask]
> The contents of this email are sent in confidence for the use of the
> intended recipient only. If you are not one of the intended recipients do
> not take action on it or show it to anyone else, but return this email to
> the sender and delete your copy of it.
--
/------------------- Emanuele Leonardi -------------------\
| eMail: [log in to unmask] - Tel.: +41-22-7674066 |
| IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
\---------------------------------------------------------/