It's fixed in the next release! And there's a workaround (which I
think Wahid pointed to in this thread), which just involves
configuring DPM to use the "legacy" database access mechanism, which
still works fine.
Sam
On 5 March 2014 08:57, Alessandra Forti <[log in to unmask]> wrote:
> This seems a bug, shouldn't that be fixed?
>
> I've started to systematically monitor the number of CLOSE_WAIT connections
> and so far they have overall increased.
>
> Wed Mar 5 08:40:01 GMT 2014: 1578
>
> at this pace even the 32k limit I set will be reached in about a month. Not
> sure how we have worked so far with a 1024 limit.
>
> We also have a huge number of TIME_WAIT connections, although those
> oscillate widely from a few hundred up to ~35k.
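A periodic count like the one above could be produced with a small sketch along these lines (hypothetical, not from the thread; assumes `netstat` is available and the line is run from cron):

```shell
#!/bin/sh
# Hypothetical monitoring sketch: count TCP connections per state and
# log them with a timestamp (e.g. from a cron entry every few minutes).
count_state() {
    # read `netstat -tan` output on stdin, count lines ending in $1
    grep -c "$1\$"
}

# Example (commented out so the sketch stays self-contained):
# echo "$(date): CLOSE_WAIT=$(netstat -tan | count_state CLOSE_WAIT)" \
#      "TIME_WAIT=$(netstat -tan | count_state TIME_WAIT)"
```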
>
> cheers
> alessandra
>
>
> On 04/03/2014 10:05, Sam Skipsey wrote:
>>
>> Hi Alessandra,
>>
>> If it's the MySQL deadlock issue, then it's highly load dependent: the
>> issue only arises because the current version of the dmlite mysql
>> adaptor assumes that it only needs to reserve one mysql connection per
>> client process it's talking to. The xrootd adaptor actually needs more
>> than one, so if the mysql connection pool is highly contended due to
>> high load, the xrootd adaptor can deadlock waiting for its additional
>> connection.
>>
>>
>> Sam
>>
>> On 3 March 2014 21:37, Alessandra Forti<[log in to unmask]> wrote:
>>>
>>> I applied the changes. xrdcp is now working, but I'd wait before claiming
>>> everything is fine, because restarting xrootd forcefully closed all the
>>> waiting connections. I'm not sure why this problem started now; we have
>>> had xrootd for a while.
>>>
>>> cheers
>>> alessandra
>>>
>>>
>>> On 03/03/2014 16:29, Wahid Bhimji wrote:
>>>>
>>>> Hi
>>>>
>>>> Might be worth trying to increase the limit. As Andy says in that thread,
>>>> it could be higher.
>>>>
>>>> I'm pretty sure Sam saw something with a load of CLOSE_WAIT connections...
>>>>
>>>> In case it isn't that and is related to the thread deadlock instead, the
>>>> workaround for that issue is below. But as I say, that had a different
>>>> error message.
>>>>
>>>> Wahid
>>>>
>>>> from previous email from David S:
>>>>
>>>> edit /etc/sysconfig/dpm:
>>>>
>>>> #DPM_USE_SYNCGET="yes"
>>>>
>>>> to
>>>>
>>>> export DPM_USE_SYNCGET="yes"
>>>>
>>>> and restart the dpm daemon. Then
>>>>
>>>> mv /etc/dmlite.conf.d/mysql.conf /etc/dmlite.conf.d/mysql.conf-
>>>>
>>>> edit /etc/dmlite.conf.d/adapter.conf change:
>>>>
>>>> LoadPlugin plugin_fs_pooldriver /usr/lib64/dmlite/plugin_adapter.so
>>>>
>>>> to
>>>>
>>>> LoadPlugin plugin_adapter_dpm /usr/lib64/dmlite/plugin_adapter.so
>>>>
>>>> and restart the xrootd service.
>>>>
>>>> This will move a portion of the load from direct database queries to the
>>>> dpm daemon, so it will be more heavily loaded, but it will avoid the
>>>> suspected problem in (2b). Rerunning yaim would undo this change, but if
>>>> this improves your situation we can find an acceptable way to make it
>>>> permanent (until we can fix the bug and make a release).
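David S's steps above could be scripted roughly as follows. This is an untested sketch of my own, not part of the original instructions, so back up each file before touching it:

```shell
#!/bin/sh
# Untested sketch of the workaround steps above; back up the files first.
apply_workaround() {
    set -e
    # 1. Make dpm serve gets synchronously.
    sed -i 's|^#DPM_USE_SYNCGET="yes"|export DPM_USE_SYNCGET="yes"|' \
        /etc/sysconfig/dpm
    service dpm restart

    # 2. Disable the direct mysql plugin for dmlite.
    mv /etc/dmlite.conf.d/mysql.conf /etc/dmlite.conf.d/mysql.conf-

    # 3. Route the adapter through the dpm daemon, not the pool driver.
    sed -i 's|plugin_fs_pooldriver|plugin_adapter_dpm|' \
        /etc/dmlite.conf.d/adapter.conf
    service xrootd restart
}
# Invoke apply_workaround manually once you have backups.
```

Remember that, as noted above, rerunning yaim would undo these edits.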
>>>>
>>>>
>>>> On 3 Mar 2014, at 16:09, Robert Frank<[log in to unmask]>
>>>> wrote:
>>>>
>>>>> Hi Wahid,
>>>>>
>>>>> su - dpmmgr -c "ulimit -n"
>>>>> 1024
>>>>>
>>>>> lsof -u dpmmgr | wc -l
>>>>> 66133
>>>>>
>>>>> Most of those listed "open files" are TCP connections in state
>>>>> "CLOSE_WAIT". This would indicate that xrootd doesn't close the
>>>>> connections properly. I wonder whether those count towards the limits
>>>>> at all.
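One way to see how those 66k entries break down (a sketch of mine; lsof's column layout can vary between versions, so treat the pattern as an assumption). Note that CLOSE_WAIT sockets are still open descriptors in the owning process, so they do count towards that process's `ulimit -n`, though `lsof -u` totals span all of the user's processes:

```shell
#!/bin/sh
# Sketch: summarise an `lsof -u dpmmgr` listing by TCP state, to see
# how many of the "open files" are CLOSE_WAIT sockets.
state_summary() {
    # pull out the trailing "(STATE)" column lsof prints for TCP sockets
    grep -o '([A-Z_]*)$' | sort | uniq -c | sort -rn
}
# usage: lsof -u dpmmgr | state_summary
```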
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Robert
>>>>>
>>>>> On 03/03/14 15:36, Wahid Bhimji wrote:
>>>>>>
>>>>>> Alessandra,
>>>>>>
>>>>>> You could try what Andy H suggests in that link, in fact, i.e. that you
>>>>>> have run out of file descriptors.
>>>>>>
>>>>>> I am currently using 751, which isn't so far from the default limit of
>>>>>> 1024, I think:
>>>>>>
>>>>>> [root@srm ~]# lsof -u dpmmgr | wc -l
>>>>>> 751
>>>>>>
>>>>>> [dpmmgr@srm ~]$ ulimit -n
>>>>>> 1024
>>>>>>
>>>>>> I better increase it myself in fact...
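Raising the limit persistently could look roughly like this. It's an assumption on my part that the DPM services pick up pam_limits; some init scripts set their own ulimit instead, in which case those need editing. The sketch stages the entries in a local file for review rather than touching /etc/security/limits.conf directly:

```shell
#!/bin/sh
# Sketch: stage higher open-file limits for the dpmmgr user, for review
# before appending to /etc/security/limits.conf and restarting the
# dpm/xrootd services. The 32768/65536 values are illustrative.
printf '%s\n' \
    'dpmmgr soft nofile 32768' \
    'dpmmgr hard nofile 65536' > limits-dpmmgr.txt
cat limits-dpmmgr.txt
```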
>>>>>>
>>>>>> There was some other similar problem Sam had, with threads blocking.
>>>>>> But I think that had a thread-related message.
>>>>>>
>>>>>> Anyway, if the above isn't it, then I will dig out the workaround for
>>>>>> that issue....
>>>>>>
>>>>>>
>>>>>> Wahid
>>>>>>
>>>>>> On 3 Mar 2014, at 15:10, Alessandra Forti<[log in to unmask]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> we seem to have a problem with the storage.
>>>>>>>
>>>>>>> Most of the jobs are failing to stage in their input; AFAICT the
>>>>>>> problem is xrootd, and it might be a problem with our configuration.
>>>>>>>
>>>>>>> A user also wrote to me that he cannot dq2-get, but I still have to
>>>>>>> investigate that, because lcg-cp works fine for me. The xrootd logs
>>>>>>> say the following.
>>>>>>>
>>>>>>> One job example
>>>>>>>
>>>>>>> http://panda.cern.ch/server/pandamon/query?job=2098194781
>>>>>>>
>>>>>>> Last server error 10000 ('') Error accessing path/file for
>>>>>>> root://bohr3226.tier2.hep.manchester.ac.uk//dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/mc12_8TeV/5d/bd/EVNT.01001789._000001.pool.root.1
>>>>>>>
>>>>>>> 03 Mar 14:15:05|xrdcpSiteMov| !!WARNING!!2990!! Command failed: source /cvmfs/atlas.cern.ch/repo/sw/local/xrootdsetup.sh; xrdcp root://bohr3226.tier2.hep.manchester.ac.uk//dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/mc12_8TeV/5d/bd/EVNT.01001789._000001.pool.root.1 /scratch/1223444.ce02.tier2.hep.manchester.ac.uk/condorg_vhDcxF8w/pilot3/Panda_Pilot_17101_1393855151/PandaJob_2098194781_1393855213/EVNT.01001789._000001.pool.root.1
>>>>>>> 03 Mar 14:15:05|futil.py | WARNING: Abnormal termination: ecode=256, ec=1, sig=-, len(etext)=1140
>>>>>>> 03 Mar 14:15:05|futil.py | WARNING: Error message: [1;34mCreated /home/prdatl012/home_cream_470305009/.asetup. Please look and (optional) edit it.
>>>>>>>
>>>>>>> If I try with lcg-cp, I can copy the file:
>>>>>>>
>>>>>>> [aforti@bohr2825 ~]$ lcg-cp --verbose
>>>>>>>
>>>>>>> srm://bohr3226.tier2.hep.manchester.ac.uk//dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/mc12_8TeV/5d/bd/EVNT.01001789._000001.pool.root.1
>>>>>>> ./lcgcp-test
>>>>>>> Using grid catalog type: UNKNOWN
>>>>>>> Using grid catalog : prod-lfc-atlas.cern.ch
>>>>>>> VO name: atlas
>>>>>>> Checksum type: None
>>>>>>> Trying SURL
>>>>>>>
>>>>>>> srm://bohr3226.tier2.hep.manchester.ac.uk//dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/mc12_8TeV/5d/bd/EVNT.01001789._000001.pool.root.1
>>>>>>> ...
>>>>>>> Source SE type: SRMv2
>>>>>>> Source SRM Request Token: 53897457-1d97-4f41-babb-3601da1bc326
>>>>>>> Source URL:
>>>>>>>
>>>>>>> srm://bohr3226.tier2.hep.manchester.ac.uk//dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/mc12_8TeV/5d/bd/EVNT.01001789._000001.pool.root.1
>>>>>>> File size: 143179334
>>>>>>> Source URL for copy:
>>>>>>>
>>>>>>> gsiftp://se06.tier2.hep.manchester.ac.uk/se06.tier2.hep.manchester.ac.uk:/raid/atlas/2014-01-06/EVNT.01001789._000001.pool.root.1.123581076.0
>>>>>>> Destination URL:file:/home/aforti/./lcgcp-test
>>>>>>> # streams: 1
>>>>>>> 131072000 bytes 63839.22 KB/sec avg 63839.22 KB/sec inst
>>>>>>> Transfer took 3010 ms
>>>>>>>
>>>>>>> If I try with xrdcp, it fails, and the log files are full of this
>>>>>>> error:
>>>>>>>
>>>>>>> 140303 08:16:33 14919 XrdAccept: Unable to perform accept; too many
>>>>>>> open files
>>>>>>>
>>>>>>> It looks to me like I should change something in the configuration.
>>>>>>> The only relevant thread I found is this one, but it's not about
>>>>>>> DPM-xrootd:
>>>>>>>
>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?A2=ind0804&L=XROOTD-L&D=0&P=5756
>>>>>>>
>>>>>>> This problem comes on top of the FAX problems we have, but it is more
>>>>>>> urgent, since it is blocking production.
>>>>>>>
>>>>>>> Thanks for any help.
>>>>>>>
>>>>>>> cheers
>>>>>>> alessandra
>>>>>>>
>>>>>>
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>