Hi Govind, the failures seem to have all been in analysis jobs.
Here is a plot of when the jobs failed:
[X]
I can see the FTS activity for RHUL looks like:
[X]
There is no apparent correlation in rate (during the first bout of failures there was not much activity, but there was for the second).
If you look at concurrent jobs:
[X]
There is a similar pattern.
Can I get access to the ganglia plots for your storage?
We might want to try to reduce the load. We might also want to look into job recovery as a solution for recovering from transient load issues.
The issue is that although the storage was under heavy load, it appears it was degraded, not broken.
We can reduce the FTS settings to smooth out data rates and loads if necessary.
Brian
________________________________
From: GRIDPP2: Deployment and support of SRM and local storage management [[log in to unmask]] on behalf of Govind Songara [[log in to unmask]]
Sent: 15 August 2011 21:16
To: [log in to unmask]
Subject: Re: SE SAM/nagios test failures
Hi All,
The SE is giving us a lot of trouble, and I could not reach a conclusion about what is going wrong. Sometimes there is a high number of connections in TIME_WAIT; at the same time the gridppnagios tests are passing but the ATLAS tests are failing.
Today we had 1000 job failures related to SE problems.
It has now been more than 1.5 months of SE trouble; please advise what I should do.
Thanks
Govind
On Fri, Aug 12, 2011 at 1:13 PM, Govind Songara <[log in to unmask]> wrote:
On Fri, Aug 12, 2011 at 1:01 PM, Sam Skipsey <[log in to unmask]> wrote:
On 12 August 2011 11:47, Govind Songara <[log in to unmask]> wrote:
> Hi All,
>
> After implementing the suggestions here
> http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html
>
> Since yesterday the number of TIME_WAIT connections has still ramped up, reaching
> 5k-6k a few times, after which the SRM starts failing.
>
> I am looking for ways to reduce these numbers and found some suggestions. Any
> comments on these parameters?
> a). Enable tcp_tw_recycle & tcp_tw_reuse
> http://techfreekz.in/2011/06/17/how-to-reduce-very-large-number-of-connections-in-time_wait-state-in-nginx/
> b). tune tcp_max_syn_backlog, tcp_max_tw_buckets & /netdev_max_backlog
> http://bashshell.net/script-of-the-week/script-of-the-week-tuning-tcp-packets/
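For reference, suggestions a) and b) above map onto kernel settings along these lines. This is only a sketch: the specific values shown are illustrative assumptions, not tested recommendations, and would need tuning for the site.

```shell
# /etc/sysctl.conf fragment -- a sketch of the TIME_WAIT tuning suggested above.
# Values are illustrative assumptions.

# a) Allow reuse of sockets in TIME_WAIT for new outbound connections:
net.ipv4.tcp_tw_reuse = 1
# Fast recycling of TIME_WAIT sockets (known to misbehave behind NAT; use with care):
net.ipv4.tcp_tw_recycle = 1

# b) Raise backlog limits so bursts of connections are not dropped:
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 2500
```

After editing the file, `sysctl -p` applies it; `sysctl -w key=value` tests a single setting without persisting it.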
>
>
> My next plan is to drop the DPM request tables, as suggested here:
> http://northgrid-tech.blogspot.com/2011/06/dpm-optimization.html
> 3) set innodb_log_file_size in my.cnf (didn't give much thought to this;
> Maarten's value of 50MB seemed good enough. The binary log files need to be
> removed to enable this and the database restarted, but check the docs: this
> might not be a valid strategy if you make heavier use of the binary logs.)
> 4) set innodb_flush_log_at_trx_commit = 2 in my.cnf (although this parameter
> seems less effective during reload, it might be useful in production; 2 is
> slightly safer than 0).
> 5) Use the script Jean-Philippe gave me to drop the requests tables before
> an upgrade.
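Points 3) and 4) together would look something like the following in my.cnf. The 50MB figure is the value quoted above; whether it suits this site is an assumption.

```ini
[mysqld]
# 3) Larger InnoDB redo log (the 50MB value mentioned above).
#    Stop mysqld and remove the old ib_logfile* files first, or mysqld
#    will refuse to start because of the size mismatch.
innodb_log_file_size = 50M

# 4) Flush the log buffer at each commit but fsync only about once per second;
#    2 is slightly safer than 0 (a mysqld crash loses nothing, an OS crash ~1s).
innodb_flush_log_at_trx_commit = 2
```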
>
>
> I have a few doubts:
> - How much downtime is needed?
> - In the worst case, if I need to reload the database (which takes 12 hours), is
> there any faster approach to reload the data, such as
> (http://www.maatkit.org/doc/mk-parallel-restore.html)
Does it still take 12 hours to reload your database? It shouldn't, if
you've cleared out the old requests tables...
I restored the data on a test machine with a fresh MySQL server install, which took around 12 hours.
I have not cleared the old request tables; I want to try that by planning a downtime.
Sam
>
>
> Thanks
> Govind
>
>
>
>
>
> On Wed, Aug 10, 2011 at 3:58 PM, Sam Skipsey <[log in to unmask]>
> wrote:
>>
>> On 7 August 2011 22:27, Govind Songara <[log in to unmask]> wrote:
>> > Hi Sam,
>> >
>> > I did not track the last failed requests; today we had a lot of ATLAS
>> > failures.
>> >
>> > Here some of the details
>> > The number of connections was too large:
>> > [root@se2 ~]# netstat -tap|wc -l;
>> > 6991
>> > [root@se2 ~]# netstat -tap| grep TIME_W|wc -l;
>> > 6336
>> > [root@se2 ~]# netstat -tap| grep TIME_W| grep telelpath |wc -l
>> > 5133
>> >
>>
>> That seems a little high - what are your max_connections and
>> max_threads set to for DPM and MySQL?
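On the MySQL side those limits can be read off directly; this is a sketch of the kind of check, to be run on the head node against the live server (where the DPM daemon thread counts are set depends on its own config and is not shown here):

```shell
# Current connection limit, and the high-water mark actually reached:
mysql -u root -p -e "SHOW VARIABLES LIKE 'max_connections';
                     SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
```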
>>
>> > There is a high number of MySQL slow queries; maybe these queries are related
>> > to the slow FTS transfers. Does any other site have such a high number?
>> > mysql> show global status;
>> > Slow_queries = 10544
>> > I think DB indexing will help to reduce these numbers.
>> >
>>
>> It probably would, yes. On the other hand, what duration is that
>> statistic collected over?
>> Can you issue a FLUSH STATUS; and tell us what number you get
>> after a day or two?
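The suggestion amounts to something like the following. A caveat treated here as an assumption: Slow_queries normally counts since server start, and whether FLUSH STATUS zeroes that global counter depends on the MySQL version, so check the docs for the version in use.

```shell
# Reset the status counters, then re-check after a day or two:
mysql -u root -p -e "FLUSH STATUS;"
# ...later, see how many slow queries accumulated in that window:
mysql -u root -p -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
```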
>>
>> Sam
>>
>> > Can I directly apply Alessandra's MySQL tuning suggestions without
>> > dropping the tables and then restoring the database?
>> >
>> > http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html
>> > I want to avoid restoring the database; it takes more than 12 hours to
>> > restore a 4.4 GB SQL data file.
>> >
>> > I noticed that the ibdata1 file timestamp has not changed since 10th July:
>> > [root@se2 ~]# ls -l /local/mysql/
>> > total 4531644
>> > drwx------ 2 mysql mysql 4096 Apr 27 16:11 cns_db
>> > drwx------ 2 mysql mysql 4096 Feb 27 10:59 dpm_db
>> > -rw-rw---- 1 mysql mysql 10485760 Jul 10 12:11 ibdata1
>> > -rw-rw---- 1 mysql mysql 5242880 Jul 10 12:11 ib_logfile0
>> > -rw-rw---- 1 mysql mysql 5242880 Feb 23 16:09 ib_logfile1
>> > drwx------ 2 mysql mysql 4096 Feb 27 10:59 mysql
>> > -rw-rw---- 1 mysql mysql 1073741947 Jul 31 09:45 mysqld-bin.000102
>> > -rw-rw---- 1 mysql mysql 1073742277 Aug 2 12:12 mysqld-bin.000103
>> > -rw-rw---- 1 mysql mysql 1073741919 Aug 4 13:11 mysqld-bin.000104
>> > -rw-rw---- 1 mysql mysql 1073742007 Aug 7 08:47 mysqld-bin.000105
>> > -rw-rw---- 1 mysql mysql 319191342 Aug 7 22:23 mysqld-bin.000106
>> > -rw-rw---- 1 mysql mysql 100 Aug 7 08:47 mysqld-bin.index
>> >
>> > my.cnf
>> > [mysqld]
>> > myisam-recover=FORCE
>> > datadir=/local/mysql
>> > socket=/local/mysql/mysql.sock
>> > old_passwords=1
>> > set-variable=innodb_buffer_pool_size=2048M
>> > log-bin
>> > expire_logs_days=7
>> >
>> >
>> > Below are details of the error messages in the logs; some of them may be
>> > related to the failures.
>> > dpm
>> > 08/07 20:48:54.360 18681,5 dpm_serv: [134.219.225.71]
>> > (se2.ppgrid1.rhul.ac.uk): Could not establish an authenticated
>> > connection:
>> > server_establish_context_ext: Could not receive token; _Csec_recv_token:
>> > Connection closed; Csec_server_set_service_name: Could not set service
>> > name;
>> > Csec_get_peer_service_name: Could not Cgetnetaddress: BAD ERROR NUMBER:
>> > 0 !
>> >
>> > 08/07 20:48:54.360 18681,5 sendrep: DP002 - send error : Connection
>> > closed
>> > by remote end
>> >
>> > 08/07 21:44:19.733 18681,26 dpm_srv_proc_put:
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/user.zmarshal/user.zmarshal.mc11_7TeV.130670.simplifiedModel_4_450_300HITS.eXXX_s765.110807221450_sub025683541/user.zmarshal.004642._1288245570.log.tgz:
>> > DPM_FAILED (File exists)
>> > 08/07 21:44:19.734 18681,26 dpm_srv_proc_put: returns 0,
>> > status=DPM_FAILED
>> > (File exists)
>> >
>> > 08/07 21:44:19.779 18681,23 dpm_srv_proc_put:
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/user.zmarshal/user.zmarshal.mc11_7TeV.130670.simplifiedModel_4_450_300HITS.eXXX_s765.110807221450_sub025683541/user.zmarshal.004642._1288245570.log.tgz:
>> > DPM_FAILED (Internal error)
>> >
>> > 08/07 21:45:47.902 18681,26 dpm_srv_proc_put:
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8443/srm/managerv1?SFN=/dpm/ppgrid1.rhul.ac.uk/home/cms/store/unmerged/SAM/StageOutTest-11784-Sun-Aug--7-21_26_50-2011:
>> > DPM_FAILED (Permission denied)
>> >
>> >
>> > dpns
>> > 08/07 18:06:24.106 18077,6 Cns_serv: [134.219.225.71]
>> > (se2.ppgrid1.rhul.ac.uk): Could not establish an authenticated
>> > connection:
>> > server_establish_context_ext: Could not receive token; _Csec_recv_token:
>> > Connection closed; Csec_server_set_service_name: Could not set service
>> > name;
>> > Csec_get_peer_service_name: Could not Cgetnetaddress: BAD ERROR NUMBER:
>> > 0 !
>> > 08/07 18:06:24.107 18077,6 sendrep: NS002 - send error : Connection
>> > closed
>> > by remote end
>> >
>> > 08/07 18:07:13.499 18077,6 Cns_serv: [10.141.0.47] (node047.cm.cluster):
>> > Could not establish an authenticated connection:
>> > Csec_server_negociate_protocol: The client did not send an
>> > authentication
>> > negociation request; _Csec_recv_token: Error reading token header: Timed
>> > out
>> > !
>> >
>> >
>> > srmv2.2
>> > 08/07 21:12:53.171 18928,7 srmv2.2: SRM02 - soap_serve error :
>> > [::ffff:10.141.255.254] (newton.cm.cluster) : CGSI-gSOAP running on
>> > se2.ppgrid1.rhul.ac.uk reports Error reading token data header:
>> > Connection
>> > closed
>> >
>> > 08/07 21:55:37.751 18928,0 Mkdir: request by
>> > /O=GRID-FR/C=FR/O=CNRS/OU=CREATIS/CN=Sorina Camarasu from
>> > grid04.lal.in2p3.fr
>> > 08/07 21:55:37.797 18928,0 Mkdir: SRM98 - Mkdir
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/biomed
>> > 08/07 21:55:37.841 18928,0 Mkdir: returns 0,
>> > statusCode=SRM_AUTHORIZATION_FAILURE
>> >
>> >
>> > 08/07 21:56:17.995 18928,0 Rm: request by
>> > /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=steve lloyd from
>> > atlastest.ph.qmul.ac.uk
>> > 08/07 21:56:18.138 18928,0 Rm: returns 0, statusCode=SRM_FAILURE
>> >
>> > 08/07 04:12:22.784 18928,1 Ls: SRM98 - Ls
>> >
>> > srm://se2.ppgrid1.rhul.ac.uk//dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasproddisk/panda/EVNT/2225f8f0-24c7-43a9-a0d2-d55f603fbfaf
>> > 08/07 04:12:22.784 18928,0
>> > 08/07 04:12:22.819 18928,2 Ls: returns 0, statusCode=SRM_FAILURE
>> >
>> >
>> >
>> > On Thu, Aug 4, 2011 at 4:56 PM, Sam Skipsey <[log in to unmask]>
>> > wrote:
>> >>
>> >> Hi Govind,
>> >>
>> >> So, you might want to have a poke at the /var/log/srmv2.2 logs, and
>> >> the /var/log/dpm , /var/log/dpns logs on the head node for the times
>> >> at which the transfers failed.
>> >> There may be useful error messages therein.
>> >>
>> >> Sam
>> >>
>> >> On 4 August 2011 15:23, Govind Songara <[log in to unmask]>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > For the last 2-3 weeks we have been having intermittent failures of the
>> >> > Nagios tests, with errors like:
>> >> > CRITICAL: METRIC FAILED [org.sam.SRM-Put]: CRITICAL: File was NOT
>> >> > copied
>> >> > to
>> >> > SRM. [ErrDB:[('lcg_util', 'server', 'CRITICAL')]]
>> >> > CRITICAL: File was NOT copied to SRM. [ErrDB:[('lcg_util', 'server',
>> >> > 'CRITICAL')]]
>> >> >
>> >> > There is no high load on the SE and no segfaults in the last couple of days,
>> >> > but we are still intermittently failing the ATLAS SAM and Nagios tests.
>> >> > Can you please advise how to troubleshoot this problem at the SE? Any
>> >> > pointers would be appreciated.
>> >> >
>> >> >
>> >> > Thanks
>> >> > Govind
>> >> >
>> >> >
>> >> >
>> >> > Nagios logs, with some of the timeout errors:
>> >> > node001: CRITICAL: File was NOT copied to SE se2.ppgrid1.rhul.ac.uk
>> >> > and
>> >> > registered in LFC lfc-biomed.in2p3.fr. [ErrDB:[('lcg_util_wn',
>> >> > 'server',
>> >> > 'CRITICAL')]] CLI
>> >> > CRITICAL: File was NOT copied to SE se2.ppgrid1.rhul.ac.uk and
>> >> > registered in
>> >> > LFC lfc-biomed.in2p3.fr. [ErrDB:[('lcg_util_wn', 'server',
>> >> > 'CRITICAL')]]
>> >> > CLI
>> >> > Testing from: node001
>> >> > DN: /O=GRID-FR/C=FR/O=CNRS/OU=CREATIS/CN=Sorina
>> >> > Camarasu/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=limited
>> >> > proxy
>> >> > VOMS FQANs: /biomed/Role=NULL/Capability=NULL,
>> >> > /biomed/lcg1/Role=NULL/Capability=NULL,
>> >> > /biomed/team/Role=NULL/Capability=NULL
>> >> > Check if we can write to LFC lfc-biomed.in2p3.fr
>> >> > lcg_util-1.7.6-2
>> >> > GFAL-client-1.11.8-3
>> >> > Copy file to SE and register in LFC.
>> >> > 2011-08-03T10:30:47Z
>> >> > LFC: lfc-biomed.in2p3.fr
>> >> > Using CLI:
>> >> > lcg-cr -v --srm-timeout 180 --connect-timeout 10
>> >> > --sendreceive-timeout
>> >> > 120
>> >> > --bdii-timeout 20 --vo biomed -d se2.ppgrid1.rhul.ac.uk -l
>> >> > lfn:/grid/biomed/SAM/sam-lcg-rm-cr-node001.110803103047.55148546
>> >> >
>> >> >
>> >> > /local/3146161.newton.cm.cluster/https_3a_2f_2fgrid40.lal.in2p3.fr_3a9000_2fZSMqKujKYkdeqUehGv_5fEEg/gridprobes/biomed/org.sam/WN/localhost.localdomain/testFile.txt
>> >> >
>> >> > Using grid catalog type: lfc
>> >> > Using grid catalog : lfc-biomed.in2p3.fr
>> >> > Checksum type: None
>> >> > SE type: SRMv2
>> >> > Destination SURL :
>> >> >
>> >> >
>> >> > srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/biomed/generated/2011-08-03/file86a21576-1b59-4251-b698-68c0eebb427d
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> >
>> >> > * Timed out after 30 sec while waiting for output from child.
>> >> > [SE][StatusOfPutRequest][ETIMEDOUT]
>> >> > httpg://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2: User timeout over
>> >> > lcg_cr: Connection timed out
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >
>> >
>
>