JISCMail - LCG-ROLLOUT Archives

Just to reply on the backup situation: we do have quite a lot of
redundancy.

 * We have the replay logs for the RLS application server, from which we
can recreate the state at any given point (down to single-operation
granularity).  We have these for the entire duration of production RLS
services at CERN (> 1.5 years).

 * We have standard Oracle backups of the entire database.

 * During the data challenges (and perhaps still) we took Oracle dumps
of the database every 30 minutes.
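As a rough illustration of the replay-log point above, here is a minimal
sketch of point-in-time reconstruction.  The LogEntry fields, the
"add"/"remove" operation names and the replay() function are all
hypothetical; the real RLS log format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical log record; not the real RLS replay-log format.
    timestamp: float  # when the operation was applied
    op: str           # "add" or "remove" (assumed operation names)
    lfn: str          # logical file name
    pfn: str          # physical replica location


def replay(entries, until):
    """Rebuild the LFN -> set-of-PFNs mapping from operations up to `until`."""
    catalogue = {}
    for entry in sorted(entries, key=lambda e: e.timestamp):
        if entry.timestamp > until:
            break  # stop at the requested point in time
        if entry.op == "add":
            catalogue.setdefault(entry.lfn, set()).add(entry.pfn)
        elif entry.op == "remove":
            replicas = catalogue.get(entry.lfn, set())
            replicas.discard(entry.pfn)
            if not replicas:
                # drop the LFN once its last replica is gone
                catalogue.pop(entry.lfn, None)
    return catalogue
```

Because every operation is applied in timestamp order, choosing `until`
just before or just after a given entry gives the single-operation
granularity mentioned above.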

AFAIK, we've never been asked by a user to do a recovery (at least
during the year I was a direct part of the service team).  Even during
the data challenges, other parts of the experiment software (e.g.
RefDB/PubDB for CMS) also stored the information.  This would enable
them to rerun the production RLS entry insertions from any given point
in time, if necessary, as a final level of redundancy.

Maria Girone, the RLS Service Manager, can give more information, I'm
sure.  Rest assured that since we tried to set up a 24x7 service, the
maintenance of the data integrity was something we spent a lot of time
working on.

Cheers,

James.
-----Original Message-----
From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
On Behalf Of Burke, S (Stephen)
Sent: Tuesday, January 18, 2005 12:58 PM
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] [ATLAS-LCG] Disk failure at Prague

LHC Computer Grid - Rollout [mailto:[log in to unmask]] On Behalf Of
Jules Wolfrat said:
> I accept your point, but you can't expect that sysadmins deal with
> this situation, they never can tell if a validated action is wanted or
> unwanted. And I wonder if you ever can do a restore of the RLS on
> request of a user because of the above because of the reasons
> mentioned before, the loss of changes between time of restore and time
> of backup.

I've tried to avoid being too explicit on a semi-public mailing list,
but I guess I have to be (no security through obscurity). LCG is living
on borrowed time when it comes to hackers: we have many security holes,
and the main thing protecting us is just that hackers haven't yet got
around to noticing us. Sooner or later they will, and we'll be in
trouble! Probably the biggest hole as things stand is the total lack of
security on the catalogues, which means that any hacker can do anything
they like with almost no effort. I think the minimum that can be done is
to keep catalogue backups for a reasonable length of time. I agree that
restoring would be quite tricky, but it wouldn't be that hard to take
the union of all the records in the current and backed-up states and
then go through and remove the ones which don't have a physical file at
the endpoint. Certainly it would be a lot better than finding that
everything has been corrupted and there is no way back ...
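
The union-and-prune restore described above could be sketched roughly
as follows.  This is only an illustration under stated assumptions: the
catalogue is modelled as a plain dict of logical file name to a set of
replica locations, and merge_and_prune() and the exists callback (which
stands in for asking the storage endpoint whether a physical file is
there) are hypothetical names, not part of any real catalogue API.

```python
def merge_and_prune(current, backup, exists):
    """Union two catalogue states, then keep only replicas that still exist.

    `current` and `backup` map an LFN to a set of PFNs.  `exists` is a
    caller-supplied check (hypothetical) that returns True when the
    physical file is present at the endpoint.
    """
    # Step 1: take the union of all records from both states.
    merged = {}
    for state in (current, backup):
        for lfn, pfns in state.items():
            merged.setdefault(lfn, set()).update(pfns)

    # Step 2: remove entries that have no physical file behind them.
    restored = {}
    for lfn, pfns in merged.items():
        alive = {pfn for pfn in pfns if exists(pfn)}
        if alive:
            restored[lfn] = alive
    return restored
```

Records deleted legitimately after the backup drop out again in the
second step, provided the physical file was removed as well, while
records deleted maliciously come back as long as the file is still on
disk.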

Stephen