Matt Doidge wrote:
> Sorry I didn't make it to the meeting, I was in an LSF training session
> learning dark secrets all day. I have a bit of a headache from it.
>
> There were no particular network latency issues that I noticed over the
> past few days. We were quite busy, but there was no huge slowdown from
> it (and we've been equally busy before with no such problems). I
> suspected that this was a top BDII look-up problem. Does anyone know
> of any way one can guard against it (like maybe including a failover
> list of BDIIs in their LCG_GFAL_INFOSYS variable)?
>
I have:
gridenv_set "LCG_GFAL_INFOSYS" "lcg-bdii.gridpp.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170"
so it should fail over if the RAL BDII is not available. If the RAL BDII
gives the wrong answer on the other hand...
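
As a rough illustration of what a comma-separated failover list buys you,
here is a minimal Python sketch that splits an LCG_GFAL_INFOSYS-style value
into host/port pairs and returns the first endpoint that accepts a TCP
connection. The function names (parse_infosys, first_reachable) are
hypothetical, and this is not how GFAL itself selects a BDII; it only
checks that a port is reachable, so it cannot catch a BDII that responds
with wrong information.

```python
import socket


def parse_infosys(value, default_port=2170):
    """Split a comma-separated host[:port] list into (host, port) pairs.

    Assumption: entries look like "host:port"; a missing port falls back
    to default_port (2170 is the port used in the thread).
    """
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if sep:
            endpoints.append((host, int(port)))
        else:
            endpoints.append((item, default_port))
    return endpoints


def first_reachable(endpoints, timeout=5.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection, else None.

    The connect argument exists only so the probe can be replaced in tests.
    """
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Unreachable, refused, or timed out: try the next endpoint.
            continue
        conn.close()
        return (host, port)
    return None
```

On a worker node one could call first_reachable(parse_infosys(os.environ["LCG_GFAL_INFOSYS"]))
(after importing os) as a quick sanity check before digging into lcg-cr failures.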
And for the sake of completeness: further down the thread it seems to be
confirmed that "Invalid Argument" is due to the BDII. I was just
reporting what Matt had said; I haven't confirmed this myself.
Can someone file a GGUS ticket about this unhelpful error message,
please? I wasted a couple of hours of my time trying to debug a user's
problems with it - and I don't know how much of their time was wasted...
Chris
> Thanks,
> Matt
>
> On 12 January 2011 13:19, Sam Skipsey wrote:
>> Hi Matt,
>>
>> Your issue was brought up in the storage group meeting today.
>>
>> As far as Brian and I can remember, this error has come up
>> sort-of-generically at every grid site every so often. Chris Walker
>> and I associate it with issues with timeouts in talking to the BDII,
>> causing lcg tools to generate weird unhelpful errors. (It appears that
>> the "Invalid Argument" is a complaint about the BDII hostname
>> apparently not being valid, since it didn't get any response from it.)
>>
>> In any case, it appears that your issue has gone away in the last day,
>> so perhaps this is less timely now. Were you having any particular
>> network latency issues yesterday?
>>
>> Sam
>>
>> On 11 January 2011 14:44, Matt Doidge wrote:
>>> Heya guys,
>>> The new year is not being kind to us here at Lancaster, we're
>>> intermittently failing both atlas jobs and atlas sam tests when they
>>> try to put data into our SE with the rather useless error message
>>> "Invalid argument lcg_cr: Invalid argument". The fact that this is
>>> hitting atlas sam tests (which IIRC are performed from a central
>>> location) as well as jobs on both of our local clusters makes me
>>> believe that this is a problem with our SE (or possibly our SE's
>>> information publishing, as the troubleshooting guides suggest). It's
>>> also odd that this is only affecting atlas as far as I can see, but
>>> it's an intermittent failure and atlas do poke our SE a lot.
>>>
>>> We also see a less common, but probably related, error in the panda
>>> and hammercloud logs:
>>> /opt/tarball/lcg/bin/lcg-cr lcg_util-1.7.6-1 GFAL-client-1.11.8-2
>>> Using grid catalog type: lfc
>>> Using grid catalog : lfc-atlas.gridpp.rl.ac.uk
>>> Checksum type: None
>>> SE type: SRMv2
>>> Destination SURL : srm://fal-pygrid-30.lancs.ac.uk:8446/srm/managerv2?SFN
>>>
>>> Some weird truncation always seems to happen to the error though, so
>>> it's hard to figure out what's really happening, but it appears to be
>>> an error during an lcg-cr. Looking at our worker nodes:
>>> LCG_GFAL_INFOSYS=lcg-bdii.gridpp.ac.uk:2170 - not sure if this is
>>> relevant but I thought it might be.
>>>
>>> Has anyone seen this kind of error before for their SEs? I'd
>>> appreciate a hand with this one.
>>>
>>> Thanks in advance,
>>> Matt
>>>