On 12 August 2011 11:47, Govind Songara <[log in to unmask]> wrote:
> Hi All,
>
> After implementing the suggestions here
> http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html
>
> Since yesterday, the number of TIME_WAIT connections has still been ramping
> up; it has reached 5k-6k a few times, and then SRM starts failing.
>
> I have been looking for ways to reduce these numbers and found some
> suggestions. Any comments on these parameters?
> a). Enable tcp_tw_recycle & tcp_tw_reuse
> http://techfreekz.in/2011/06/17/how-to-reduce-very-large-number-of-connections-in-time_wait-state-in-nginx/
> b). tune tcp_max_syn_backlog, tcp_max_tw_buckets & /netdev_max_backlog
> http://bashshell.net/script-of-the-week/script-of-the-week-tuning-tcp-packets/
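For reference, settings (a) and (b) could be applied via /etc/sysctl.conf; the values below are illustrative starting points only, not tested recommendations, and tcp_tw_recycle in particular is known to break clients behind NAT:

```
# /etc/sysctl.conf -- illustrative values only; tune for your workload.
# Allow reuse of sockets in TIME_WAIT for new outbound connections.
net.ipv4.tcp_tw_reuse = 1
# Fast recycling of TIME_WAIT sockets (unsafe with NAT'd clients).
net.ipv4.tcp_tw_recycle = 1
# Cap the total number of TIME_WAIT sockets held at once.
net.ipv4.tcp_max_tw_buckets = 360000
# Queue length for half-open (SYN_RECV) connections.
net.ipv4.tcp_max_syn_backlog = 4096
# Per-device input queue length.
net.core.netdev_max_backlog = 4096
```

Apply with `sysctl -p` and watch the TIME_WAIT count before changing anything else.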
>
>
> My next plan is to drop the DPM request tables, as suggested here:
> http://northgrid-tech.blogspot.com/2011/06/dpm-optimization.html
> 3) set innodb_log_file_size in my.cnf (didn't give much thought to this;
> Maarten's value of 50MB seemed good enough. The binary log files need to be
> removed to enable this and the database restarted, but check the docs: this
> might not be a valid strategy if you make heavier use of the binary logs.)
> 4) set innodb_flush_log_at_trx_commit = 2 in my.cnf (although this parameter
> seems less effective during reload, it might be useful in production; 2 is
> slightly safer than 0).
> 5) Use the script Jean-Philippe gave me to drop the requests tables before
> an upgrade.
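A sketch of how points 3) and 4) above might look in my.cnf, using the values quoted from the blog post; verify against your MySQL version's documentation, and remember the existing InnoDB log files must be removed and mysqld restarted for innodb_log_file_size to take effect:

```
[mysqld]
# Larger redo log reduces checkpoint pressure during heavy request churn.
# Changing this requires removing the old ib_logfile* and restarting.
innodb_log_file_size = 50M
# Flush the InnoDB log to disk once per second rather than at every
# commit; 2 can lose up to ~1s of transactions on an OS crash, but is
# slightly safer than 0.
innodb_flush_log_at_trx_commit = 2
```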
>
>
> I have a few questions:
> - How much downtime is needed?
> - In the worst case, if we need to reload the database (which takes 12
> hours), is there any faster approach to reload the data, such as
> mk-parallel-restore (http://www.maatkit.org/doc/mk-parallel-restore.html)?
Does it still take 12 hours to reload your database? It shouldn't, if
you've cleared out the old requests tables...
Sam
>
>
> Thanks
> Govind
>
>
>
>
>
> On Wed, Aug 10, 2011 at 3:58 PM, Sam Skipsey <[log in to unmask]>
> wrote:
>>
>> On 7 August 2011 22:27, Govind Songara <[log in to unmask]> wrote:
>> > Hi Sam,
>> >
>> > I did not track the last failed requests; today we had a lot of ATLAS
>> > failures.
>> >
>> > Here some of the details
>> > The number of connections was very large:
>> > [root@se2 ~]# netstat -tap|wc -l;
>> > 6991
>> > [root@se2 ~]# netstat -tap| grep TIME_W|wc -l;
>> > 6336
>> > [root@se2 ~]# netstat -tap| grep TIME_W| grep telelpath |wc -l
>> > 5133
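As an aside, the three grep|wc pipelines above can be collapsed into a single awk pass that summarises connections per state; the heredoc below is sample data standing in for real `netstat -tan` output:

```shell
# Summarise TCP connections by state in one pass.  netstat_sample stands
# in for real output; on the server you would instead run:
#   netstat -tan | awk '{print $6}' | sort | uniq -c | sort -rn
netstat_sample() {
cat <<'EOF'
tcp        0      0 se2:8446   host1:51234   TIME_WAIT
tcp        0      0 se2:8446   host2:51235   TIME_WAIT
tcp        0      0 se2:8446   host3:51236   ESTABLISHED
EOF
}

# Column 6 of netstat -tan output is the connection state.
netstat_sample | awk '{print $6}' | sort | uniq -c | sort -rn
```

This prints a count per state with the most common state first, which makes it easy to spot a TIME_WAIT build-up at a glance.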
>> >
>>
>> That seems a little high - what are your max_connections and
>> max_threads set to for DPM and MySQL?
>>
>> > Number of MySQL slow queries; maybe these queries are related to the
>> > slow FTS. Does any other site have such a high number?
>> > mysql> show global status;
>> > Slow_queries = 10544
>> > I think DB indexing will help to reduce these numbers.
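One way to see which statements are behind that Slow_queries counter before adding indexes is to enable the slow query log; a minimal my.cnf sketch for a MySQL 5.0/5.1-era server, as assumed here (the option was renamed slow_query_log in later versions):

```
[mysqld]
# Log statements taking longer than long_query_time seconds.
log-slow-queries = /var/log/mysql/slow.log
long_query_time = 2
# Optionally also log queries that use no index at all.
log-queries-not-using-indexes
```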
>> >
>>
>> It probably would, yes. On the other hand, what duration is that
>> statistic collected over?
>> Can you issue a FLUSH STATUS; and tell us what number you get
>> after a day or two?
>>
>> Sam
>>
>> > Can I directly apply Alessandra's MySQL tuning suggestions without
>> > dropping the tables and then restoring the database?
>> >
>> > http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html
>> > I want to avoid restoring the database; it takes more than 12 hours to
>> > restore the 4.4 GB SQL data file.
>> >
>> > I noticed that the ibdata1 file timestamp has not changed since 10th July:
>> > [root@se2 ~]# ls -l /local/mysql/
>> > total 4531644
>> > drwx------ 2 mysql mysql 4096 Apr 27 16:11 cns_db
>> > drwx------ 2 mysql mysql 4096 Feb 27 10:59 dpm_db
>> > -rw-rw---- 1 mysql mysql 10485760 Jul 10 12:11 ibdata1
>> > -rw-rw---- 1 mysql mysql 5242880 Jul 10 12:11 ib_logfile0
>> > -rw-rw---- 1 mysql mysql 5242880 Feb 23 16:09 ib_logfile1
>> > drwx------ 2 mysql mysql 4096 Feb 27 10:59 mysql
>> > -rw-rw---- 1 mysql mysql 1073741947 Jul 31 09:45 mysqld-bin.000102
>> > -rw-rw---- 1 mysql mysql 1073742277 Aug 2 12:12 mysqld-bin.000103
>> > -rw-rw---- 1 mysql mysql 1073741919 Aug 4 13:11 mysqld-bin.000104
>> > -rw-rw---- 1 mysql mysql 1073742007 Aug 7 08:47 mysqld-bin.000105
>> > -rw-rw---- 1 mysql mysql 319191342 Aug 7 22:23 mysqld-bin.000106
>> > -rw-rw---- 1 mysql mysql 100 Aug 7 08:47 mysqld-bin.index
>> >
>> > my.cnf
>> > [mysqld]
>> > myisam-recover=FORCE
>> > datadir=/local/mysql
>> > socket=/local/mysql/mysql.sock
>> > old_passwords=1
>> > set-variable=innodb_buffer_pool_size=2048M
>> > log-bin
>> > expire_logs_days=7
>> >
>> >
>> > Below are details of the error messages in the logs; some of them may be
>> > related to the failures.
>> > dpm
>> > 08/07 20:48:54.360 18681,5 dpm_serv: [134.219.225.71]
>> > (se2.ppgrid1.rhul.ac.uk): Could not establish an authenticated
>> > connection:
>> > server_establish_context_ext: Could not receive token; _Csec_recv_token:
>> > Connection closed; Csec_server_set_service_name: Could not set service
>> > name;
>> > Csec_get_peer_service_name: Could not Cgetnetaddress: BAD ERROR NUMBER:
>> > 0 !
>> >
>> > 08/07 20:48:54.360 18681,5 sendrep: DP002 - send error : Connection
>> > closed
>> > by remote end
>> >
>> > 08/07 21:44:19.733 18681,26 dpm_srv_proc_put:
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/user.zmarshal/user.zmarshal.mc11_7TeV.130670.simplifiedModel_4_450_300HITS.eXXX_s765.110807221450_sub025683541/user.zmarshal.004642._1288245570.log.tgz:
>> > DPM_FAILED (File exists)
>> > 08/07 21:44:19.734 18681,26 dpm_srv_proc_put: returns 0,
>> > status=DPM_FAILED
>> > (File exists)
>> >
>> > 08/07 21:44:19.779 18681,23 dpm_srv_proc_put:
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/user.zmarshal/user.zmarshal.mc11_7TeV.130670.simplifiedModel_4_450_300HITS.eXXX_s765.110807221450_sub025683541/user.zmarshal.004642._1288245570.log.tgz:
>> > DPM_FAILED (Internal error)
>> >
>> > 08/07 21:45:47.902 18681,26 dpm_srv_proc_put:
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8443/srm/managerv1?SFN=/dpm/ppgrid1.rhul.ac.uk/home/cms/store/unmerged/SAM/StageOutTest-11784-Sun-Aug--7-21_26_50-2011:
>> > DPM_FAILED (Permission denied)
>> >
>> >
>> > dpns
>> > 08/07 18:06:24.106 18077,6 Cns_serv: [134.219.225.71]
>> > (se2.ppgrid1.rhul.ac.uk): Could not establish an authenticated
>> > connection:
>> > server_establish_context_ext: Could not receive token; _Csec_recv_token:
>> > Connection closed; Csec_server_set_service_name: Could not set service
>> > name;
>> > Csec_get_peer_service_name: Could not Cgetnetaddress: BAD ERROR NUMBER:
>> > 0 !
>> > 08/07 18:06:24.107 18077,6 sendrep: NS002 - send error : Connection
>> > closed
>> > by remote end
>> >
>> > 08/07 18:07:13.499 18077,6 Cns_serv: [10.141.0.47] (node047.cm.cluster):
>> > Could not establish an authenticated connection:
>> > Csec_server_negociate_protocol: The client did not send an
>> > authentication
>> > negociation request; _Csec_recv_token: Error reading token header: Timed
>> > out
>> > !
>> >
>> >
>> > srmv2.2
>> > 08/07 21:12:53.171 18928,7 srmv2.2: SRM02 - soap_serve error :
>> > [::ffff:10.141.255.254] (newton.cm.cluster) : CGSI-gSOAP running on
>> > se2.ppgrid1.rhul.ac.uk reports Error reading token data header:
>> > Connection
>> > closed
>> >
>> > 08/07 21:55:37.751 18928,0 Mkdir: request by
>> > /O=GRID-FR/C=FR/O=CNRS/OU=CREATIS/CN=Sorina Camarasu from
>> > grid04.lal.in2p3.fr
>> > 08/07 21:55:37.797 18928,0 Mkdir: SRM98 - Mkdir
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/biomed
>> > 08/07 21:55:37.841 18928,0 Mkdir: returns 0,
>> > statusCode=SRM_AUTHORIZATION_FAILURE
>> >
>> >
>> > 08/07 21:56:17.995 18928,0 Rm: request by
>> > /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=steve lloyd from
>> > atlastest.ph.qmul.ac.uk
>> > 08/07 21:56:18.138 18928,0 Rm: returns 0, statusCode=SRM_FAILURE
>> >
>> > 08/07 04:12:22.784 18928,1 Ls: SRM98 - Ls
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk//dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasproddisk/panda/EVNT/2225f8f0-24c7-43a9-a0d2-d55f603fbfaf
>> > 08/07 04:12:22.784 18928,0
>> > 08/07 04:12:22.819 18928,2 Ls: returns 0, statusCode=SRM_FAILURE
>> >
>> >
>> >
>> > On Thu, Aug 4, 2011 at 4:56 PM, Sam Skipsey <[log in to unmask]>
>> > wrote:
>> >>
>> >> Hi Govind,
>> >>
>> >> So, you might want to have a poke at the /var/log/srmv2.2 logs, and
>> >> the /var/log/dpm , /var/log/dpns logs on the head node for the times
>> >> at which the transfers failed.
>> >> There may be useful error messages therein.
>> >>
>> >> Sam
>> >>
>> >> On 4 August 2011 15:23, Govind Songara <[log in to unmask]>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > For the last 2-3 weeks we have been having some intermittent failures
>> >> > of the Nagios tests, with errors like:
>> >> > CRITICAL: METRIC FAILED [org.sam.SRM-Put]: CRITICAL: File was NOT
>> >> > copied
>> >> > to
>> >> > SRM. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]]
>> >> > CRITICAL: File was NOT copied to SRM. [ErrDB:[('lcg_util', 'server',
>> >> > 'CRITICAL')]]
>> >> >
>> >> > There has been no high load on the SE and no segfaults in the last
>> >> > couple of days, but we are still intermittently failing the ATLAS SAM
>> >> > tests and Nagios tests.
>> >> > Can you please advise how to troubleshoot this problem at the SE -
>> >> > any pointers?
>> >> >
>> >> >
>> >> > Thanks
>> >> > Govind
>> >> >
>> >> >
>> >> >
>> >> > Nagios logs, with some timeout errors:
>> >> > node001: CRITICAL: File was NOT copied to SE se2.ppgrid1.rhul.ac.uk
>> >> > and
>> >> > registered in LFC lfc-biomed.in2p3.fr. [ErrDB:[('lcg_util_wn',
>> >> > 'server',
>> >> > 'CRITICAL')]] CLI
>> >> > CRITICAL: File was NOT copied to SE se2.ppgrid1.rhul.ac.uk and
>> >> > registered in
>> >> > LFC lfc-biomed.in2p3.fr. [ErrDB:[('lcg_util_wn', 'server',
>> >> > 'CRITICAL')]]
>> >> > CLI
>> >> > Testing from: node001
>> >> > DN: /O=GRID-FR/C=FR/O=CNRS/OU=CREATIS/CN=Sorina
>> >> > Camarasu/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=limited
>> >> > proxy
>> >> > VOMS FQANs: /biomed/Role=NULL/Capability=NULL,
>> >> > /biomed/lcg1/Role=NULL/Capability=NULL,
>> >> > /biomed/team/Role=NULL/Capability=NULL
>> >> > Check if we can write to LFC lfc-biomed.in2p3.fr
>> >> > lcg_util-1.7.6-2
>> >> > GFAL-client-1.11.8-3
>> >> > Copy file to SE and register in LFC.
>> >> > 2011-08-03T10:30:47Z
>> >> > LFC: lfc-biomed.in2p3.fr
>> >> > Using CLI:
>> >> > lcg-cr -v --srm-timeout 180 --connect-timeout 10
>> >> > --sendreceive-timeout
>> >> > 120
>> >> > --bdii-timeout 20 --vo biomed -d se2.ppgrid1.rhul.ac.uk -l
>> >> > lfn:/grid/biomed/SAM/sam-lcg-rm-cr-node001.110803103047.55148546
>> >> >
>> >> >
>> >> > /local/3146161.newton.cm.cluster/https_3a_2f_2fgrid40.lal.in2p3.fr_3a9000_2fZSMqKujKYkdeqUehGv_5fEEg/gridprobes/biomed/org.sam/WN/localhost.localdomain/testFile.txt
>> >> >
>> >> > Using grid catalog type: lfc
>> >> > Using grid catalog : lfc-biomed.in2p3.fr
>> >> > Checksum type: None
>> >> > SE type: SRMv2
>> >> > Destination SURL :
>> >> >
>> >> >
>> >> > srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/biomed/generated/2011-08-03/file86a21576-1b59-4251-b698-68c0eebb427d
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> > [SE][StatusOfPutRequest][ETIMEDOUT]
>> >> > httpg://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2: User timeout over
>> >> > lcg_cr: Connection timed out
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >
>> >
>
>