On 03/06/11 07:14, Vincenzo Vagnoni wrote:
> Hi Chris,
> you should have the following configuration
>
> STORM_CKSUM_ALGORITHM="Adler32"
> STORM_CKSUM_SUPPORT="false"
> GRIDFTP_WITH_DSI="yes"
>
> Do you?
>
I didn't. I do now, and it seems to have solved the problem.
Thanks very much for your help.
This ought to be added to the release notes. NB We don't see entries in
the storm-checksum.log file with this support enabled (which is exactly
what I'd expect).
Chris
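
For anyone wanting to check their own yaim settings against the three values quoted above, here is a minimal sketch. It assumes the settings are plain KEY="value" lines in a text file; the function names are hypothetical and not part of StoRM or yaim.

```python
import re

# Values taken from the configuration quoted in this thread.
EXPECTED = {
    "STORM_CKSUM_ALGORITHM": "Adler32",
    "STORM_CKSUM_SUPPORT": "false",
    "GRIDFTP_WITH_DSI": "yes",
}

# Matches KEY=value, KEY="value" or KEY='value' on a single line.
_ASSIGNMENT = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)=(["\']?)(.*?)\2\s*$')


def parse_assignments(text):
    """Return a dict of the shell-style assignments found in text."""
    values = {}
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = match.group(3)
    return values


def mismatches(text, expected=EXPECTED):
    """Map each variable that differs to a (found, wanted) pair."""
    found = parse_assignments(text)
    return {
        name: (found.get(name), wanted)
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }
```

An empty dict from mismatches() means all three variables are set as suggested; a missing variable is reported with None as the found value.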
> Cheers,
> Vincenzo
>
> On Jun 3, 2011, at 1:51 AM, Christopher J. Walker wrote:
>
>> On 02/06/11 22:45, Mario David wrote:
>>> hi
>>>
>>> in the log there is always 1 entry for each file even if it's calculated on the fly
>>
>> Unfortunately, this doesn't work for me. The problem I'm trying to fix
>> (that checksums take too long to calculate - so file transfers time out)
>> is still present.
>>
>> I can't have set it up correctly as I still see significant traffic to
>> the checksum server - whereas if the checksum was being calculated on
>> the fly in gridftp, there should only be a trivial amount of traffic to
>> the checksum server.
>>
>> Ideas?
>>
>>
>> Chris
>>
>>
>>
>>> Mario
>>>
>>> On Jun 2, 2011, at 9:34 PM, Christopher J.Walker wrote:
>>>
>>>> Mario David wrote:
>>>>> hi
>>>>>
>>>>> I have gftp servers on separate machines from the FE and BE
>>>>>
>>>>> in the gftp server do you have the yaim variable
>>>>> GRIDFTP_WITH_DSI="yes"
>>>>> that should do the trick
>>>>
>>>> I've tried to configure this - but there are still entries in the
>>>> storm-checksum log - so presumably I've failed.
>>>>
>>>> Any suggestions?
>>>>
>>>> Chris
>>>>
>>>>> cheers
>>>>> Mario
>>>>> On May 29, 2011, at 4:28 PM, Christopher J. Walker wrote:
>>>>>
>>>>>> On 29/05/11 13:47, Mario David wrote:
>>>>>>> hi Chris
>>>>>>>
>>>>>>> you should have in the gftpserver/cksum the following
>>>>>>>
>>>>>>> [root@gftp04 ~]# rpm -ql storm-globus-gridftp-gcc64dbg
>>>>>>> /etc/init.d/globus-gridftp
>>>>>>> /etc/logrotate.d/globus-gridftp
>>>>>>> /opt/storm/gridftp/etc/init.d/globus-gridftp
>>>>>>> /opt/storm/gridftp/etc/logrotate.d/globus-gridftp
>>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.a
>>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.so
>>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.so.0
>>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.so.0.0.0
>>>>>>> /opt/storm/gridftp/share/doc/storm-globus-gridftp-1.0.5/LICENSE
>>>>>>>
>>>>>> I do.
>>>>>> [root@se03 log]# rpm -ql storm-globus-gridftp-gcc64dbg
>>>>>> /etc/init.d/globus-gridftp
>>>>>> /etc/logrotate.d/globus-gridftp
>>>>>> /opt/storm/gridftp/etc/init.d/globus-gridftp
>>>>>> /opt/storm/gridftp/etc/logrotate.d/globus-gridftp
>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.a
>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.so
>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.so.0
>>>>>> /opt/storm/gridftp/lib64/libglobus_gridftp_server_StoRM_gcc64dbg.so.0.0.0
>>>>>> /opt/storm/gridftp/share/doc/storm-globus-gridftp-1.1.0/CREDITS
>>>>>> /opt/storm/gridftp/share/doc/storm-globus-gridftp-1.1.0/ChangeLog
>>>>>> /opt/storm/gridftp/share/doc/storm-globus-gridftp-1.1.0/LICENSE
>>>>>> /opt/storm/gridftp/share/doc/storm-globus-gridftp-1.1.0/README
>>>>>>
>>>>>>
>>>>>>> this has the proper DSI to calculate the checksum while the file is being transferred, on the fly.
>>>>>> Oh, that sounds more efficient - and would solve this problem.
>>>>>>
>>>>>>> or do you already have it?
>>>>>> I seem to have it, but it looks like it isn't being used. I currently run the GridFTP server on the same machine as our SE (or on a test SE).
>>>>>>
>>>>>> In the release notes for StoRM 1.6.2, it says:
>>>>>> "Due to a conflicts in some packages, the StoRM-GridFTP service
>>>>>> is not installable in the same node of storm-backend"
>>>>>>
>>>>>> Would installing a separate gridftp server fix my problem, or do I just need to turn it on somehow?
>>>>>>
>>>>>>> I have in etc/storm-checksum.ini
>>>>>>> # number of threads
>>>>>>> threads=4
>>>>>>>
>>>>>>> and it seems enough in both T2 atlas sites
>>>>>> I have 10 threads and a 10GigE card. However, our new disk servers have 4*1Gbit cards. If a 2 GB file is not cached in RAM, it will take around 20s to checksum - which seems to be too long.
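
The "around 20s" figure is consistent with a rough lower bound on the time to pull the file over the network. A minimal sketch, assuming the read is limited by a single saturated link, decimal units (1 GB = 10^9 bytes), and no protocol or disk overhead:

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on the time to move size_bytes over one link."""
    return size_bytes * 8 / link_bits_per_second


GB = 10**9    # bytes, decimal units (an assumption)
GBIT = 10**9  # bits per second

# One 1 Gbit/s card, as on the new disk servers mentioned above.
one_gbit = transfer_seconds(2 * GB, 1 * GBIT)
# The 10GigE card, for comparison.
ten_gbit = transfer_seconds(2 * GB, 10 * GBIT)

print(f"2 GB over 1 Gbit/s:  {one_gbit:.1f} s")
print(f"2 GB over 10 Gbit/s: {ten_gbit:.1f} s")
```

Under these assumptions the 1 Gbit/s case comes to 16 s before any overhead, so an uncached 2 GB file cannot be checksummed in under about 10 s over such a link.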
>>>>>>
>>>>>> Chris
>>>>>>> cheers
>>>>>>> Mario
>>>>>>>
>>>>>>> On May 29, 2011, at 12:19 AM, Christopher J. Walker wrote:
>>>>>>>
>>>>>>>> QMUL is seeing FTS failures of files transferred into the site.
>>>>>>>>
>>>>>>>> StoRM calculates checksums synchronously when srmPutDone is called[1]. I suspect that the problem occurs if the checksum takes too long to calculate (probably longer than about 10s).
>>>>>>>>
>>>>>>>> https://ggus.eu/ws/ticket_info.php?ticket=70925 and https://ggus.eu/ws/ticket_info.php?ticket=70672 have been filed against us.
>>>>>>>>
>>>>>>>> At QMUL, our gridftp and checksum server are on the same machine, so the file is usually in the pagecache and the checksum is calculated quickly. When the system is loaded, the file may not be in the pagecache and the checksum calculation takes longer.
>>>>>>>>
>>>>>>>> Today, there are 3 failures.
>>>>>>>>
>>>>>>>> https://lcgfts01.gridpp.rl.ac.uk:8443/glite-data-transfer-fts/cgi-bin/fts-mon-log.pl?t=GRPATLST1S-UKILT2QMULfailed/GRPATLST1S-UKILT2QMUL__2011-05-28-0924_u6keHj.log
>>>>>>>>
>>>>>>>>
>>>>>>>> The slowest checksum calculations took the following number of milliseconds:
>>>>>>>>
>>>>>>>> [root@se03 ~]# awk '/millis/ {print $9}' /opt/storm/checksum/var/log/storm-checksum.log | sort -n | tail -5
>>>>>>>> 8139
>>>>>>>> 10645
>>>>>>>> 17574
>>>>>>>> 18471
>>>>>>>> 19189
>>>>>>>>
>>>>>>>> There are 4 checksums that took over 10s to calculate. The one that succeeded was from a worker node.
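
The awk pipeline above can also be written in Python, which makes it easier to add thresholds or counts later. This is only a sketch: it assumes, as the awk command does, that the duration is the ninth whitespace-separated field of lines containing "millis"; the function name is hypothetical.

```python
def slowest_checksums(lines, count=5, field=9):
    """Return the `count` largest durations in ascending order.

    Mirrors: awk '/millis/ {print $9}' | sort -n | tail -5
    except that lines whose chosen field is not an integer are skipped.
    """
    durations = []
    for line in lines:
        if "millis" not in line:
            continue
        fields = line.split()  # splits on runs of whitespace, like awk
        if len(fields) < field:
            continue
        try:
            durations.append(int(fields[field - 1]))
        except ValueError:
            continue
    return sorted(durations)[-count:]


def count_over(lines, threshold_ms, field=9):
    """Count checksum durations above threshold_ms milliseconds."""
    everything = slowest_checksums(lines, count=10**9, field=field)
    return sum(1 for value in everything if value > threshold_ms)
```

Usage would be along the lines of opening /opt/storm/checksum/var/log/storm-checksum.log and passing the file object to slowest_checksums(); count_over(lines, 10000) gives the number of checksums slower than 10 s.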
>>>>>>>>
>>>>>>>> Looking at the logs for one of the failures:
>>>>>>>>
>>>>>>>> https://lcgfts01.gridpp.rl.ac.uk:8443/glite-data-transfer-fts/cgi-bin/fts-mon-log.pl?t=GRPATLST1S-UKILT2QMULfailed/GRPATLST1S-UKILT2QMUL__2011-05-28-0924_u6keHj.log
>>>>>>>>
>>>>>>>>
>>>>>>>> I see the following errors:
>>>>>>>>
>>>>>>>> 2011-05-28 10:26:18,848 [INFO ] - STATUS:BEGIN:DESTINATION - Finalization
>>>>>>>> 2011-05-28 10:26:18,848 [INFO ] - completing PrepareToPut [38924519-7abd-44d4-87b8-429f9e67d557] for SURL [srm://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/data10_7TeV/NTUP_SUSY/f275_m548_p305/data10_7TeV.00159224.physics_L1Calo.merge.NTUP_SUSY.f275_m548_p305_tid199739_00/NTUP_SUSY.199739._000106.root.1]
>>>>>>>> 2011-05-28 10:28:48,959 [WARN ] - SRM> method srm2__srmPutDone failed (ip = 0.0.0.0)
>>>>>>>> 2011-05-28 10:28:48,960 [WARN ] - Failed to contact remote SRM [httpg://se03.esc.qmul.ac.uk:8444/srm/managerv2] for completing the request [38924519-7abd-44d4-87b8-429f9e67d557]: service timeout during [srm2__srmPutDone]
>>>>>>>> 2011-05-28 10:28:48,960 [INFO ] - This call will be retried
>>>>>>>> 2011-05-28 10:28:53,180 [WARN ] - Failed to complete PrepareToPut [38924519-7abd-44d4-87b8-429f9e67d557]. Try to abort it
>>>>>>>> 2011-05-28 10:28:53,290 [INFO ] - Abort completed for request [38924519-7abd-44d4-87b8-429f9e67d557]
>>>>>>>> 2011-05-28 10:28:53,290 [ERROR] - Failed to complete PrepareToPut request [38924519-7abd-44d4-87b8-429f9e67d557] on remote SRM [httpg://se03.esc.qmul.ac.uk:8444/srm/managerv2]: [SRM_FAILURE] All file requests are failed.The PrepareToPut request has been successfully aborted
>>>>>>>> 2011-05-28 10:28:53,290 [ERROR] - DESTINATION failed during FINALIZATION phase. Error [GENERAL_FAILURE]:Failed to complete PrepareToPut request [38924519-7abd-44d4-87b8-429f9e67d557] on remote SRM [httpg://se03.esc.qmul.ac.uk:8444/srm/managerv2]: [SRM_FAILURE] All file requests are failed.The PrepareToPut request has been successfully aborted
>>>>>>>> 2011-05-28 10:28:53,569 [INFO ] - File [srm://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/data10_7TeV/NTUP_SUSY/f275_m548_p305/data10_7TeV.00159224.physics_L1Calo.merge.NTUP_SUSY.f275_m548_p305_tid199739_00/NTUP_SUSY.199739._000106.root.1] removed
>>>>>>>> 2011-05-28 10:28:53,569 [INFO ] - STATUS:END fail:DESTINATION - Finalization
>>>>>>>> 2011-05-28 10:28:53,569 [ERROR] - Final error on DESTINATION during FINALIZATION phase: [GENERAL_FAILURE] Failed to complete PrepareToPut request [38924519-7abd-44d4-87b8-429f9e67d557] on remote SRM [httpg://se03.esc.qmul.ac.uk:8444/srm/managerv2]: [SRM_FAILURE] All file requests are failed.The PrepareToPut request has been successfully aborted
>>>>>>>> 2011-05-28 10:28:53,569 [INFO ] - FINAL:DESTINATION: Failed to complete PrepareToPut request [38924519-7abd-44d4-87b8-429f9e67d557] on remote SRM [httpg://se03.esc.qmul.ac.uk:8444/srm/managerv2]: [SRM_FAILURE] All file requests are failed.The PrepareToPut request has been successfully aborted
>>>>>>>> 2011-05-28 10:28:53,569 [INFO ] - FINAL:fail [DESTINATION] - [FINALIZATION] - [GENERAL_FAILURE] : 'Failed to complete PrepareToPut request [38924519-7abd-44d4-87b8-429f9e67d557] on remote SRM [httpg://se03.esc.qmul.ac.uk:8444/srm/managerv2]: [SRM_FAILURE] All file requests are failed.The PrepareToPut request has been successfully aborted'
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In the storm-backend.log, I see:
>>>>>>>>
>>>>>>>> 10:26:20.053 - INFO [XML-RPC-22] - srmPutDone:<Grid User (VOMS) = DN:'OID.0.9.2342.19200300.100.1.25=ch, OID.0.9.2342.19200300.100.1.25=cern, OU=Organic Units, OU=Users, CN=ddmadmin, CN=531497, CN=Robot: ATLAS Data Management' FQANS:[/atlas/Role=production/Capability=NULL, /atlas/Role=NULL/Capability=NULL, /atlas/lcg1/Role=NULL/Capability=NULL]> Request for [token:38924519-7abd-44d4-87b8-429f9e67d557] for [SURL:srm://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/data10_7TeV/NTUP_SUSY/f275_m548_p305/data10_7TeV.00159224.physics_L1Calo.merge.NTUP_SUSY.f275_m548_p305_tid199739_00/NTUP_SUSY.199739._000106.root.1] successfully done with: [status:SRM_SUCCESS: ]
>>>>>>>> 10:26:20.053 - INFO [XML-RPC-22] - Executing PutDone for SURL: srm://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/data10_7TeV/NTUP_SUSY/f275_m548_p305/data10_7TeV.00159224.physics_L1Calo.merge.NTUP_SUSY.f275_m548_p305_tid199739_00/NTUP_SUSY.199739._000106.root.1
>>>>>>>> 10:26:30.705 - INFO [XML-RPC-22] - srmPutDone:<Grid User (VOMS) = DN:'OID.0.9.2342.19200300.100.1.25=ch, OID.0.9.2342.19200300.100.1.25=cern, OU=Organic Units, OU=Users, CN=ddmadmin, CN=531497, CN=Robot: ATLAS Data Management' FQANS:[/atlas/Role=production/Capability=NULL, /atlas/Role=NULL/Capability=NULL, /atlas/lcg1/Role=NULL/Capability=NULL]> Request for [token:38924519-7abd-44d4-87b8-429f9e67d557] for [SURL:'srm://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/data10_7TeV/NTUP_SUSY/f275_m548_p305/data10_7TeV.00159224.physics_L1Calo.merge.NTUP_SUSY.f275_m548_p305_tid199739_00/NTUP_SUSY.199739._000106.root.1'] successfully done with: [status:SRM_SUCCESS: All file requests are successfully completed]
>>>>>>>> 10:28:53.171 - WARN [XML-RPC-20] - srmPutDone:<Grid User (VOMS) = DN:'OID.0.9.2342.19200300.100.1.25=ch, OID.0.9.2342.19200300.100.1.25=cern, OU=Organic Units, OU=Users, CN=ddmadmin, CN=531497, CN=Robot: ATLAS Data Management' FQANS:[/atlas/Role=production/Capability=NULL, /atlas/Role=NULL/Capability=NULL, /atlas/lcg1/Role=NULL/Capability=NULL]> Request for [token:38924519-7abd-44d4-87b8-429f9e67d557] for [SURL:srm://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/data10_7TeV/NTUP_SUSY/f275_m548_p305/data10_7TeV.00159224.physics_L1Calo.merge.NTUP_SUSY.f275_m548_p305_tid199739_00/NTUP_SUSY.199739._000106.root.1] failed with: [status:SRM_DUPLICATION_ERROR: ]
>>>>>>>>
>>>>>>>>
>>>>>>>> so the srmPutDone did succeed the first time - and fails the second time it is called.
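
The timestamps in the two log excerpts above can be differenced to compare how long each side waited. A small sketch using only the times quoted in this message (the format strings are my reading of those excerpts, not a documented log format):

```python
from datetime import datetime


def seconds_between(start, end, fmt):
    """Elapsed seconds between two timestamps that share one format."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


FTS_FMT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2011-05-28 10:26:18,848
BACKEND_FMT = "%H:%M:%S.%f"       # e.g. 10:26:20.053

# FTS: "completing PrepareToPut" until "srm2__srmPutDone failed".
fts_wait = seconds_between(
    "2011-05-28 10:26:18,848", "2011-05-28 10:28:48,959", FTS_FMT
)

# Backend: "Executing PutDone" until the SRM_SUCCESS entry.
backend_putdone = seconds_between("10:26:20.053", "10:26:30.705", BACKEND_FMT)

print(f"FTS finalization until timeout:   {fts_wait:.3f} s")
print(f"Backend PutDone start to success: {backend_putdone:.3f} s")
```

For these entries the backend took about 10.7 s from starting PutDone to logging success, while FTS reported the service timeout about 150 s after it began finalization.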
>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at another site, RHUL, I see
>>>>>>>>
>>>>>>>> https://lcgfts01.gridpp.rl.ac.uk:8443/glite-data-transfer-fts/cgi-bin/fts-mon-log.pl?t=RALLCG2-UKILT2RHULcompleted/RALLCG2-UKILT2RHUL__2011-05-28-2242_qViWPW.log
>>>>>>>>
>>>>>>>> I see that the srmPutDone call completes immediately, but the checksum checking gets SRM_REQUEST_QUEUED.
>>>>>>>>
>>>>>>>> So there are a couple of questions.
>>>>>>>>
>>>>>>>> 1) Why is the srmPutDone call timing out? Should it, or should StoRM accept the request immediately and then queue the request for the checksum?
>>>>>>>>
>>>>>>>> 2) Is FTS correct in sending srmPutDone twice, or is StoRM correct in rejecting the second call?
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> [1]http://storm.forge.cnaf.infn.it/documentation/checksum
>>>>>>>>
>>>>
>>
>> _______________________________________________
>> Storm-users mailing list
>> https://iris.cnaf.infn.it/mailman/listinfo/storm-users
>
> ----------------------------------------------------------------------------
> Dr. Vincenzo Maria Vagnoni
> Istituto Nazionale di Fisica Nucleare, Sezione di Bologna
> via Irnerio, 46, I-40126 Bologna - ITALY
> Phone: +39-051-20-91071 +39-051-20-91026
> Mobile: +39-347-6056920
> ----------------------------------------------------------------------------
>