We have just successfully restored the DPM functionality (upgraded to
1.6.3) following Sophie's recommendations:
1) Ditch the database
2) Restore the most recent backup prior to the apt auto-update
3) Run the latest YAIM 3.0.0-38
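For step 2, picking the right dump amounts to taking the newest one whose timestamp precedes the apt auto-update. A minimal sketch in Python, assuming the dumps are tracked as name/timestamp pairs; the file names, the timestamps, and the latest_backup_before helper are hypothetical illustrations, not part of DPM or YAIM:

```python
from datetime import datetime
from typing import Dict, Optional


def latest_backup_before(backups: Dict[str, datetime],
                         cutoff: datetime) -> Optional[str]:
    """Return the name of the newest backup taken strictly before `cutoff`.

    Returns None when no backup predates the cutoff.
    """
    # Keep only the dumps taken before the (assumed) time of the auto-update.
    candidates = {name: ts for name, ts in backups.items() if ts < cutoff}
    if not candidates:
        return None
    # Among those, the one with the latest timestamp is the one to restore.
    return max(candidates, key=candidates.get)


# Hypothetical dump names and times, for illustration only.
example_backups = {
    "dump-a.sql": datetime(2007, 3, 7, 2, 0),
    "dump-b.sql": datetime(2007, 3, 8, 2, 0),
    "dump-c.sql": datetime(2007, 3, 10, 2, 0),
}
# Assumed time of the apt auto-update.
example_cutoff = datetime(2007, 3, 9, 4, 0)

chosen = latest_backup_before(example_backups, example_cutoff)
```

The comparison is strict on purpose: a dump taken at or after the update may already contain the damage.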
As a side-note, at least one site I know of successfully carried out a
manual apt update and DPM upgrade by running YAIM 3.0.0-36.
cheers,
Gianfranco
Sophie Lemaitre wrote:
> Hello,
>
> We couldn't reproduce the database corruption:
> - with a 1.5.10 DPM under load
> - with APT auto-update to DPM 1.6.3
> - without restarting the daemons
>
> So, _*could someone provide*_ me ([log in to unmask]) _*a dump of
> his/her "corrupted" database*_ (cns_db + dpm_db), so that we can
> investigate the problem?
>
> Thanks a lot.
> Sophie
>
>> Could this be clarified? I had the apt auto-update on Friday and
>> dpm-qryconf went nuts as a result. However, copying files to and from
>> the DPM was not affected. Now, running the update script today (after
>> having stopped all the services) fails. This failure was reported at
>> the beginning of the thread as a sign of DB corruption.
>>
>> I have regular dumps of the DB, so restoring to working order should
>> not be a big deal (needless to say, however, it could have been
>> avoided, as commented in this very thread). It would be desirable to
>> know how far back I have to go in order to retrieve the latest
>> possible functional snapshot. Is the corruption likely to occur as
>> the new rpms are installed (perhaps triggered by an attempted
>> transfer after that), or is the action of running the script likely
>> to corrupt the lot?
>>
>> cheers,
>> Gianfranco
>>
>> Michel Jouvin wrote:
>>
>>> Sophie can confirm but I think there is no risk of corruption if running
>>> the new server on the old db: it will just fail. The problem is running
>>> the update script with the service (old or new) running.
>>>
>>> Michel
>>>
>>> --On Saturday 10 March 2007 19:03 +0100 Debreczeni Gergely
>>> <[log in to unmask]> wrote:
>>>
>>>
>>>
>>>> Hi !
>>>>
>>>> Just thinking out loud:
>>>>
>>>> The apt-autoupdate updated the rpms, but none of the daemons was
>>>> restarted (I've checked the .spec files).
>>>> So after the upgrade you had the new DPM libraries and files installed
>>>> but the old servers running. When we tested the upgrade script we
>>>> strictly followed the description: no hours passed between the rpm
>>>> upgrade and the database schema upgrade, and no data transfers took
>>>> place on the server in the meantime...
>>>>
>>>> So in your case what probably happened is that the old server tried to
>>>> load one of the new shared libraries during the night (because you had
>>>> an ongoing transfer), which is obviously a weird situation, and that
>>>> caused the DB corruption.
>>>>
>>>> If, as you proposed, the rpm postinstall script had stopped the service,
>>>> then you would have woken up in the morning with some crashed data
>>>> transfers... (I dunno which one is better :-))
>>>>
>>>> So, neither solution is perfect; personally
>>>> *I'm very much against apt-autoupdate*.
>>>> If I ran a production site, I would want to do the upgrade myself,
>>>> follow the output, and read the release notes carefully beforehand,
>>>> not only superficially...
>>>>
>>>> So, both sides need some improvement ;-)
>>>>
>>>> Best regards and good weekend,
>>>> Gergo
>>>>
>>>> PS: And of course, very probably, once the database is corrupted the
>>>> update script is not going to work...
>>>>
>>>>
>>>>
>>>> Adam Padee wrote:
>>>>
>>>>> Sophie Lemaitre wrote:
>>>>>
>>>>>> Wait, I agree only with the documentation change time and date.
>>>>>>
>>>>>> But, starting and stopping the services is done by YAIM as needed.
>>>>>> This is also explained in the Wiki documentation (since the beginning)
>>>>>> as well as in the release notes.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Well, you're right. Probably the best way to do it was to use YAIM. But,
>>>>> as I mentioned previously, my SE was upgraded by apt-autoupdate, which
>>>>> unfortunately doesn't run YAIM. When I woke up in the morning, my
>>>>> databases were already corrupt. So I had to deal with the problem
>>>>> manually. I don't mind updating things manually. But gLite adopted a
>>>>> continuous update model, which makes sense only with automatic update
>>>>> tools. I agree that some things cannot be done without manual
>>>>> intervention. But in such a case I would like to have it stated
>>>>> explicitly in the release notes that come to my mailbox. As updates are
>>>>> "continuous", I look at these notes only superficially, and unless I
>>>>> find something really serious, stated in capital letters, I let it go
>>>>> automatically. If I had to update all the nodes manually after every
>>>>> minor update, then the "continuous" update model = much more work than
>>>>> in the previous "release" model.
>>>>> In the update 16 release notes I see only "pay close attention to
>>>>> glite-CE and lcg-CE_torque". Nothing at all about reconfiguration of
>>>>> SE_dpm_mysql.
>>>>>
>>>>> I really don't like to repeat the discussion that already took
>>>>> place here in Sept'06 along with the openssh update. But I think that
>>>>> putting into the production repository packages that, without special
>>>>> treatment, may cause services to malfunction, when a lot of people use
>>>>> apt-autoupdate, is not a very good idea. I (partially) understand the
>>>>> openssh case, as it is an external package. But if the same thing
>>>>> happens with EGEE packages, which are not critical security updates, I
>>>>> begin to wonder what PPS is for.
>>>>>
>>>>>
>>>>>> We are always happy to answer all GGUS tickets we get, so please send a
>>>>>> mail if you are "fighting", or not sure in which order to do what.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> I appreciate that, and I'm really grateful for the help I already
>>>>> received from the DPM team (for example with my problem with dpm-drain
>>>>> in ver 1.5.6), but GGUS tickets have to travel a very long way before
>>>>> they reach your desk. Usually they are sorted by the TPM shift, sent to
>>>>> the ROC, analyzed by ROC 1st line support, sent back to GGUS, and then
>>>>> assigned to your group. At least this is what happened with my previous
>>>>> ticket concerning DPM. When the harm is already done, and my site does
>>>>> not work, I don't think that going through GGUS is the quickest way to
>>>>> solve the problem.
>>>>>
>>>>> Cheers,
>>>>> Adam
>>>>>
>>>>
>>>
>>>
>>>
>>> *************************************************************
>>> * Michel Jouvin Email : [log in to unmask] *
>>> * LAL / CNRS Tel : +33 1 64468932 *
>>> * B.P. 34 Fax : +33 1 69079404 *
>>> * 91898 Orsay Cedex *
>>> * France *
>>> *************************************************************
>>>
>>
>>
>>
--
Dr. Gianfranco Sciacca Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy Internal: 33044
University College London D15 - Physics Building
London WC1E 6BT