Yes, I think this is your only option now. The developers are being less
than helpful.

As for tidying things up, I think it's going to be messy. We should be
able to use something like this to find the pnfsids that are not related
to any file in the namespace.

http://www.sysadmin.hep.ac.uk/svn/fabric-management/dcache/pnfs/remove-orphan-files.sh

Greig

On 07/12/07 11:21, Matt Doidge wrote:
> Hello,
>
> I've made extensive use of the single-user mode of postgres whilst
> attempting to repair our databases, and I couldn't find any glaring
> errors or problems looking into these things - but then again I wouldn't
> necessarily have done, as the postgres backend to pnfs has always been
> something opaque and mysterious to me.
>
> After much deliberation we've decided to go with the rollback option;
> our best "customers" would rather have us back in action and needing
> to be reseeded than sitting around being useless. I'm going to keep
> the bust database we have and see if something can be salvaged from
> it; maybe some entries can be translated over? Maybe I'm just being
> overly optimistic. I'll do the rollback after lunch, to give time for
> people to object if they think it's a very very bad idea or present me
> with other options, but right now I don't think we have any.
>
> One thing's for certain, I'll never blindly assume that my backups are
> working fine again.
>
> Thanks for the help Greig,
>
> Matt
>
> On 07/12/2007, Greig A Cowan wrote:
>> Matt,
>>
>> Have you considered the approach of running in single-user mode with just
>> one of the broken databases?
>>
>> http://www.postgresql.org/docs/8.1/static/app-postgres.html
>>
>> postgres -D /var/lib/pgsql/data/ data1
>>
>> This might give you some ability to try and fix things.
>>
>> As you will know, I've also submitted another ticket to dCache support.
>>
>> Greig
>>
>>
>>
>> On Fri, 7 Dec 2007, Matt Doidge wrote:
>>
>>> Hello,
>>>
>>> Just in case the worst happens and we can't salvage pnfs from our
>>> current postgres and have to use the 3.5 month old backup, would
>>> there be guidelines as to how to go about sorting out the horrid mess
>>> that would leave? There are guidelines to sort things out from the point
>>> of view of a pool snuffing it, but nothing for database failures
>>> (largely as such extreme database related errors shouldn't happen due
>>> to regular backups being in place).
>>>
>>> The other piece of advice I need is, at what point should I just give
>>> up and start dusting off the old back-up? We've been down since
>>> Tuesday. How much downtime is a 3.5 month data rollback worth? Maybe I
>>> should put this question to the major VOs we support (aka atlas)?
>>>
>>> As you can tell I'm a little confused and overwhelmed, and mightily frustrated.
>>>
>>> cheers,
>>> Matt
>>>
>>> On 06/12/2007, Matt Doidge wrote:
>>>> Hello,
>>>>
>>>> All my empty files are in place. Postgres is back up and running - I
>>>> can connect to it and poke around. However there could have been
>>>> data loss, and so when pnfs looks into postgres for its gubbins all it
>>>> is perhaps seeing is gobbledygook, and thus it is not able to initialise
>>>> properly. However that's just a theory that I so hope is wrong.
>>>>
>>>> I'll see if I can dig up a spare node to see if I can get it to work
>>>> elsewhere, but I'm not sure we've got any spare machines lying about
>>>> the place. It's worth a try; at the moment I'm just banging my head
>>>> against a wall, which isn't getting the job done sadly.
>>>>
>>>> If anyone knows of any postgres queries I could issue that would test
>>>> how postgres is looking to pnfs then that would be great.
>>>>
>>>> cheers,
>>>> Matt
>>>> On 06/12/2007, Greig Alan Cowan wrote:
>>>>> You've got an empty file corresponding to each database in this directory?
>>>>>
>>>>> /opt/pnfsdb/pnfs/databases
>>>>>
>>>>> Are you sure that postgres is back up and running? Can you really
>>>>> connect to it?
>>>>>
>>>>> I don't know what causes the enabled (x) output, but it probably implies
>>>>> a problem with postgres. I've only seen it once before.
>>>>>
>>>>> Greig
>>>>>
>>>>> On 06/12/07 08:35, Matt Doidge wrote:
>>>>>> Thanks for the reply Greig,
>>>>>>
>>>>>> The /opt/pnfsdb/pnfs/info files all seem present and correct. The
>>>>>> output of "mdb show" gives much the same as usual, except the status
>>>>>> column for each reads "enabled (x)". It should be noted that all the
>>>>>> bust databases are also all the larger ones; of the databases that
>>>>>> work, only babar has any significant amount of data.
>>>>>>
>>>>>> Posting to both users and support was a little cheeky, but I'm trying
>>>>>> to maximise coverage in the hopes of maximising my chances of finding
>>>>>> a solution that doesn't involve losing 3 months of data. Desperate
>>>>>> times call for desperate postings!
>>>>>>
>>>>>> cheers,
>>>>>> Matt
>>>>>>
>>>>>> On 06/12/2007, Greig A Cowan wrote:
>>>>>>> Hi Matt,
>>>>>>>
>>>>>>> You've probably checked this already, but what are the contents of
>>>>>>>
>>>>>>> /opt/pnfsdb/pnfs/info
>>>>>>>
>>>>>>> There should be a file for each database, with contents like:
>>>>>>>
>>>>>>> $ cat D-0000
>>>>>>> admin:0:r:enabled:/opt/pnfsdb/pnfs/databases/admin
>>>>>>>
>>>>>>> Also what does this command give you? It should be something like that
>>>>>>> below.
>>>>>>>
>>>>>>> $ /opt/pnfs/tools/mdb show
>>>>>>> ID   Name    Type   Status        Path
>>>>>>> ----------------------------------------------
>>>>>>> 0    admin   r      enabled (r)   /opt/pnfsdb/pnfs/databases/admin
>>>>>>> 1    data1   r      enabled (r)   /opt/pnfsdb/pnfs/databases/data1
>>>>>>> ...
>>>>>>>
>>>>>>> Also, note that you should typically post to just one of the
>>>>>>> user-forum or support@dcache (the developers get a bit narky when people
>>>>>>> post to both ;) )
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Greig
>>>>>>>
>> --
>> ========================================================================
>> Dr Greig A Cowan http://www.ph.ed.ac.uk/~gcowan1
>> School of Physics, University of Edinburgh, James Clerk Maxwell Building
>>
>> TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
>> ========================================================================
>>