It's us who need to do it... really, we're setting up new machines in a
different subnet.
On Monday 31 January 2005 13:55, Simon George wrote:
> Hi,
>
> sorry I missed the meeting, but one thing looks wrong to me in the
> minutes:
>
> Feedback from CIC-on-duty meeting:
> [...]
> · 2 sites want to change their host names! RHUL and Edinburgh
> would like to change their host names in an attempt to standardize their
> naming. In terms of the SE this is difficult. Newer VDT releases will
> support CNAMEs. RHUL want to change their subnet.
>
> I don't recall wanting to do this at RHUL although I can see how it could
> be useful. Could it have been somewhere else?
>
> Cheers,
> Simon
>
> ---------------------------------------------------------------------------
> Simon George, Dept of Physics, Royal Holloway College, University of London
> Email [log in to unmask] Tel. +44 1784 41 41 85 Fax. +44 1784 472794
>
> On Sun, 30 Jan 2005, Coles, J (Jeremy) wrote:
> > Dear All
> >
> > Please find below notes from the last TB-SUPPORT meeting. I would be
> > grateful if you could review them and share any feedback/comments with
> > this list and direct corrections/omissions to me. I will upload these
> > minutes to the GridPP web-site late next week. I also have notes from
> > previous meetings in October and November but they are not formatted; if
> > you have specific questions from those meetings please direct them to
> > me.
> >
> > Many thanks for your support.
> >
> > Kind regards,
> > Jeremy
> >
> > TB-SUPPORT meeting on Tuesday 25th January at 11:00am.
> >
> > Attendees
> >
> > Scot: Fraser (Glasgow); Steven (Edinburgh); Mark (Durham)
> > South: Lauri & Eve (Birmingham); Pete (Oxford); Chris (RAL-PPD)
> > London: Owen (Imperial); Ben (UCL)
> > North: No representative
> > Tier-1: Steve T
> > RAL: Jeremy & Stephen
> >
> >
> > Agenda
> > ******
> >
> > 1) Site status reports - information on LCG updates and upgrades to SL3
> >
> >
> > Birmingham - starting install on frontends.
> >
> > Imperial - the HEP farm and computing section are being reorganized into
> > one SGE farm. Will upgrade to 2.3.0 but not to SL3 at this time.
> >
> > LESC - trying an install on a basic 64-bit distribution (n.b. not
> > supported until LCG 2.5). Trying a workaround with an SL3 install on the
> > frontends and a tarball to the WNs. ST commented that there are
> > confusions with Java that make things more difficult in this environment.
> >
> > Brunel - hardware problems. WNs have to be on a private network. Want to
> > pursue a 2.2 installation. 30-60 WNs will be available once the problems
> > are resolved.
> >
> > QMUL - installing head nodes with YAIM. Stuck with apt-get. Trinity
> > College Dublin can supply a distribution of 2.2 for Fedora Core.
> >
> > RHUL - zero manpower.
> >
> > UCL - HEP: the current recorded downtime in the GOC database will expire;
> > expect to be down another week. CCC: planning to install 2.3 soon with an
> > upgrade to SL3.
> >
> > Edinburgh - a complete reconfiguration of all hardware is taking place.
> > SL3 and LCG 2.3 on the GridPP frontends; others to run 2.2. Halfway
> > through the installation. Also rebuilding the backend. dCache headnode -
> > test install of the headnode and pull machine. Waiting for the backends
> > to be rebuilt. 2.3 install by the end of next week.
> >
> > Durham - similar situation to Edinburgh. Putting SL3 on the GridPP
> > machines. Backends are part of an SGE farm. Will test installs after
> > GridPP12. Still suffering from firewall problems.
> >
> > ST: Can you test the UI from outside the network? The firewall is open,
> > but it performs stateful TCP checks that block ports - it considers
> > globus traffic to be illegal. Similar stateful firewall problems have
> > been seen at Oxford, Bristol and RAL (a sketch of such a test is given
> > below).
> > PG: I can circulate the filter rules used at Oxford as a similar package
> > is being used.
> > ST: A Wiki entry has recently been added on this topic and it will be
> > updated now.
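> >
> > A minimal sketch of such an external test, run from a host outside the
> > site firewall; the hostname is a placeholder and the ports are the usual
> > LCG service ports (2119 gatekeeper, 2135 GRIS, 2170 BDII, 2811 GridFTP):
> >
> >   for p in 2119 2135 2170 2811; do
> >     nc -z -w 5 ce.example.ac.uk $p || echo "port $p blocked"
> >   done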
> >
> > Aside: SGE - LESC have a job manager for SGE. Still trying to get the
> > compute element up and running. Will work on the information provider
> > shortly.
> >
> > Oxford - Running 2.2 on RedHat 7.3. Installed SL on the head nodes.
> > Planning an LCFGng upgrade to 2.3 on 7.3 as this is the quickest route
> > to upgrade.
> >
> >
> > Bristol - SL3 on headnodes but not WNs. Firewall not open at present.
> >
> > Cambridge - undergoing upgrades.
> >
> > RAL-PPD - wanted to move subnets at the same time, so set up an almost
> > independent installation. The GIIS is published through the old name. 7.3
> > services will be down to a single node by the end of the month. The SE
> > can't just be switched off though.
> >
> > Tier-1: SNs - BDII upgraded to SL3. Hardware upgraded from P600 to dual
> > 2.8 GHz machines; tests around the UK are looking better as a result of
> > this upgrade. A new UI - LCG UI 01 - is installed. The firewall should
> > open up today. New storage is available (dCache on SL3). The main farm
> > will migrate slowly; the aim is to have something by the end of the
> > month. dCache is up to 25TB. Four VOs are supported and there is now a
> > test link to the tape store (ADS). SB: jobs are not working - need to
> > investigate why.
> >
> > Feedback from CIC-on-duty meeting:
> >
> > * Birmingham's host certificates are about to expire -
> > replacements are in process and will go on today. The gatekeeper will
> > need to be restarted. A warning was received 3 weeks ago.
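> >
> > A minimal sketch for checking when a host certificate expires (the path
> > is the usual grid location but illustrative here; -checkend fails if
> > expiry falls within the given number of seconds, here three weeks):
> >
> >   openssl x509 -noout -enddate -in /etc/grid-security/hostcert.pem
> >   openssl x509 -checkend $((21*24*3600)) \
> >     -in /etc/grid-security/hostcert.pem \
> >     || echo "certificate expires within 3 weeks - request a replacement"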
> >
> > * Dublin - only accepting jobs from certain hosts - based
> > on a policy decision but not supported by the middleware.
> >
> > * Imperial had a false positive: since IC-LCG2 matched
> > PIC-LCG2, Imperial was inheriting PIC's (Barcelona) problems.
> >
> > * Concern at the speed of conversion to SL3. ST commented
> > that this was partly due to problems with the use of YAIM, and to PBS
> > server support (a separate server and gatekeeper was not previously
> > supported) having only just arrived.
> >
> > * 2 sites want to change their host names! RHUL and
> > Edinburgh would like to change their host names in an attempt to
> > standardize their naming. In terms of the SE this is difficult. Newer
> > VDT releases will support CNAMEs. RHUL want to change their subnet.
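> >
> > For reference, a quick way to see whether such an alias is in place (the
> > name is a placeholder); a CNAME lets the old SE name keep resolving after
> > a rename, but only if the middleware supports it:
> >
> >   host -t CNAME old-se.example.ac.uk
> >   dig +short CNAME old-se.example.ac.uk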
> >
> >
> > * SB: SEs at Cambridge and Durham not working because the
> > information provider entries are blank. It was noted that Cambridge is
> > offline at the moment so why are they still being picked up?
> >
> > 2) Accounting - reminder of logs that must be kept for 2004 and review
> > of sites currently publishing to the GOC
> >
> > * Gatekeeper logs
> > * PBS job manager logs
> > * System logs
> > * APEL configuration file
> >
> > DPK has asked that process accounting logs (pacct files) are also
> > kept because they are needed for security purposes.
> >
> > There is now an FAQ dealing with the accounting log files:
> > http://goc.grid-support.ac.uk/gridsite/accounting/faq.html
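> >
> > A minimal sketch for checking that the 2004 logs are still on disk; all
> > paths are assumptions for a typical LCG 2.x install and will vary by site
> > (the APEL configuration file lives wherever your site keeps it):
> >
> >   for f in /var/log/globus-gatekeeper.log* \
> >            /var/spool/pbs/server_logs/2004* \
> >            /var/log/messages* \
> >            /var/account/pacct*; do
> >     ls $f > /dev/null 2>&1 || echo "MISSING: $f"
> >   done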
> >
> >
> > 3) Security contact information - request for sites to provide updated
> > information
> >
> > The original request was for one contact who deals with Grid machine
> > security for the site, and also a contact for overall site computer
> > security.
> >
> >
> > Please could all sites confirm the GOCDB entries?
> >
> > SB: There has been a request from the CIC-on-duty (COD) people to
> > include a country code with the telephone numbers.
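> >
> > A trivial check for the requested format (the file name is illustrative;
> > this flags entries that lack a leading +country-code):
> >
> >   grep -vE '^\+[0-9]{1,3}' gocdb-phone-numbers.txt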
> >
> > 4) Feedback from the CERN Quattor course
> >
> > RAL - not using Quattor (using YAIM) at the moment. When some spare time
> > is found, Quattor will be configured. For other sites it is a question of
> > the effort required to set up the Quattor system vs the time required to
> > maintain nodes manually.
> >
> > Will upgrades be a problem for Kickstart installations?
> > ST: It depends on the severity of the upgrade. Reinstalling the WNs might
> > be required, and if there are configuration changes in the new package
> > then the Kickstart configuration will need to be changed, but it is
> > thought to be manageable. There is only one node with persistent data at
> > the Tier-1 - dCache.
> >
> > CB: I have managed to chain a YAIM setup to the end of Kickstart (see the
> > sketch below). Also, YAIM allows better error follow-through than LCFGng
> > (fewer layers) and so is much easier to debug.
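> >
> > A minimal sketch of what such chaining might look like in a Kickstart
> > %post section, assuming the LCG YAIM layout of the time (node type and
> > paths are illustrative, not CB's exact recipe):
> >
> >   %post
> >   # install the LCG node RPMs, then let YAIM configure the node
> >   /opt/lcg/yaim/scripts/install_node /root/site-info.def WN
> >   /opt/lcg/yaim/scripts/configure_node /root/site-info.def WN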
> >
> > OM: It is looking unlikely that Quattor will be recommended/used for any
> > of the London sites.
> >
> > ST: Quattor needs to be used for everything or nothing!
> >
> > 5) CIC-on-duty - what is it, who is it and why this matters to you
> >
> > ST: What is meant to happen?
> >
> > There are 4 CICs (Russia comes on line later). The CIC-on-duty (COD)
> > role rotates weekly through each member. The people on duty look at
> > Gstat test results and take a proactive approach to problem resolution.
> >
> > An interface on top of the GOC-DB defines the problem and sends it to the
> > site and the ROC. A ticket is created in Savannah and assigned to the
> > ROC. Footprints is the UK ROC helpdesk software. [log in to unmask]
> > (https://cic.in2p3.fr provides more information). Unfortunately at the
> > moment the procedure creates double entries in Footprints (two sources).
> > This will be addressed as Savannah is replaced with a tailoring of the
> > Global Grid User Support service (GGUS), www.ggus.org.
> >
> > The problem is directed to both the ROC and the site, because the site
> > deals more quickly with simple problems like restarting MDS, while the
> > ROC has some responsibility to check that problems are followed up and
> > resolved.
> >
> > The CIC-on-duty is reminded by Savannah after 3 days if an operations
> > problem remains unresolved, and another email is sent out at this point.
> > After another few days the CIC-on-duty will phone the site to raise the
> > problem. If things are still not resolved after 3 weeks, the issue is
> > raised to the GDA. At all stages the response (or lack of response) from
> > the site is logged.
> >
> > So, you will get a message from the CIC-on-duty and sometimes also from
> > the ROC. You should try to fix the problem as soon as possible. Then
> > either reply to the mail stating that the problem is fixed, or reply with
> > a reason for not being able to resolve it. You should mark your site as
> > offline if the problem cannot be resolved and is causing problems
> > elsewhere. Likewise, if you are modifying your site and know that errors
> > will result which will be picked up by the CIC, you should mark the site
> > as offline.
> >
> > Open problems are passed to the next CIC-on-duty at an "Operations
> > meeting" which takes place on Monday mornings. You can see minutes and
> > notes from these meetings here:
> > http://agenda.cern.ch/displayLevel.php?fid=258
> >
> >
> > The Tier-2 coordinators are ROC members for UKI and will be involved in
> > the support process - for instance by helping sites resolve problems
> > and/or following up on them. They are able to close open tickets and
> > will have tickets assigned to them for their Tier-2.
> >
> > Comment - there is currently a problem with the speed of response from
> > sites. We need to work at streamlining the process in the UK and this
> > requires us to fully understand and document the whole of the support
> > process (including users!).
> >
> > At the CIC on Monday 24th January:
> >
> > * There was a fairly vigorous discussion of release
> > procedures
> > * Issue around required bug fixes not being incorporated -
> > when a file is replaced in an RPM one cannot tell.
> > * Some frustration about problems arising at the same rate
> > as they are being fixed - only 40 sites out of 107 were green.
> > * It was mentioned that sites that don't upgrade to 2.3
> > will be tagged as failing but it is not clear what the timeline is for
> > this move.
> > ST remarked that LCG 2.3 was not installable at most UK sites
> > until mid-January due to a PBS issue with the middleware.
> >
> > * Comment about sites not understanding the procedures
> > related to the CIC-on-duty - we want to address that for the UK and
> > request that questions be forwarded to the T2 coordinators.
> >
> > For more information on the CIC the following portal is a useful
> > resource: http://cic.in2p3.fr/. You will need a valid certificate to
> > enter the technical areas of the site.
> >
> >
> > 6) Discussion of the SE issues and the operations response
> >
> > Birmingham and other sites have experienced problems with full SEs. The
> > EGEE operations response is to do nothing - the experiments will be
> > encouraged to "tidy up". Under no circumstances should data be deleted
> > unless it is first agreed with the VO/data owners. In moving to a
> > production service it is imperative that decisions are community based.
> > Sites are now providing a service to many remote users and, until such
> > time as sites are able to replicate data across the grid to other sites,
> > the only recourse if something needs to be done is via operations support
> > (GGUS or ROC).
> >
> > It was discussed that separate areas for each VO would allow better
> > administration and stop all VOs being affected by one filling the SE. It
> > was mentioned that Alessandra Forti has put quotas on VOs via the info
> > provider. This really needs further testing, as it is not clear whether
> > gridFTP may write beyond the imposed limits (i.e. what happens when a
> > limit is reached midway through a transfer? What happens to transfers
> > taking place in parallel?).
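> >
> > For monitoring, the published free space can be inspected directly in the
> > information system; a minimal sketch, with the BDII host as a placeholder
> > (the attribute names are from the Glue schema):
> >
> >   ldapsearch -x -H ldap://bdii.example.ac.uk:2170 -b "o=grid" \
> >     '(objectClass=GlueSA)' GlueSAStateAvailableSpace GlueSAStateUsedSpace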
> >
> > Finally: the option of volatile storage may reduce the problems being
> > experienced with data. It is not known whether SRMs will address this in
> > the short term. Does dCache allow a volatile declaration?
> >
> >
> >
> > 7) Issues to be tackled at GridPP12
> >
> > GridPP12 at Brunel next week has several sections devoted to identifying
> > the top issues in each area. Would sites please mail JC
> > ([log in to unmask]) with issues they would like to be considered for
> > prioritisation. The GridPP12 agenda is here:
> > http://www.gridpp.ac.uk/gridpp12/programme.html
> >
> >
> > 8) AOB
> >
> > OM mentioned that he has an iptables configuration for the ports - he
> > will share this with CB for the Bristol work.
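> >
> > A sketch of the kind of rules involved; the high range is the
> > conventional GLOBUS_TCP_PORT_RANGE and all of it is illustrative rather
> > than OM's actual configuration:
> >
> >   iptables -A INPUT -p tcp --dport 2119 -j ACCEPT        # gatekeeper
> >   iptables -A INPUT -p tcp --dport 2811 -j ACCEPT        # gridftp control
> >   iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT # globus data range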
--
------------------------------------------------------------------------------
| Dr. Alex Martin                                                            |
| e-Mail: [log in to unmask]          Queen Mary, University of London,      |
| Phone : +44-(0)20-7882-5033         Mile End Road,                         |
| Fax   : +44-(0)20-8981-9465         London, UK E1 4NS                      |
------------------------------------------------------------------------------