Hi
I moved the dump to /afs/hephy.at/user/w/walzel/public/
Regards
gerhard
On 10/13/05 1:48 PM, "Piotr Nyczyk" wrote:
> Hi guys,
>
> To help us investigate this, I have a small task for those who
> experience SFT jobs hanging on their WNs.
> It is usually one of the lcg-utils commands that hangs, and since we
> can't reproduce the problem on our machines, the only way forward seems
> to be getting a core dump of your hanging process...
>
> Could you please perform the following steps when you encounter an SFT
> job hanging in the batch system:
>
> 1. Locate the WN which is actually running the job
>
> 2. Log in as root on the WN
>
> 3. Use the command "ps axf" (the tree view of processes) to find the
> process executed by the SFT test job that actually hangs - it
> should be one of the lcg-utils commands, most probably "lcg-cr".
>
> 4. Make sure "gdb" is installed on your WN (on most machines "apt-get
> install gdb" should be enough) - you can remove gdb after finishing
> with this procedure.
>
> 5. Assuming that the hanging process is "lcg-cr", create a "live" core
> of the process using the following commands:
>
>     ps axf          # locate the hanging lcg-cr process and find its pid
>     which lcg-cr    # find where the "lcg-cr" binary is installed
>     gdb /<absolute_path>/lcg-cr <pid_of_hanging_lcg-cr_process>
>
> and then under gdb use the following commands:
>
>     gcore
>     quit
>
> (answer "y" to the question)
>
> This should generate a core dump of the running process without
> killing it. You can kill the process afterwards, once you are sure
> that the core was saved.
>
> 6. Please send the core file to me (it can be quite big so in that
> case compress it or put it somewhere on the web so my mail account is
> not overloaded)
>
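For anyone who prefers to script step 5, here is a minimal sketch. It assumes a gdb that supports the "-batch" and "-x" options and accepts a file name after "gcore"; the helper names write_gdb_script and dump_live_core and the core.<pid> file name are my own and not part of lcg-utils or SFT.

```shell
#!/bin/sh
# Sketch only: write_gdb_script and dump_live_core are hypothetical helpers.

# Print the gdb commands from step 5: dump a core, then detach so that
# the hanging process is left running.
write_gdb_script() {
    core_file=$1
    printf 'gcore %s\ndetach\nquit\n' "$core_file"
}

# Attach gdb to <pid> in batch mode and save the core as core.<pid>
# in the current directory (the file name is an assumption).
dump_live_core() {
    pid=$1
    binary=$2
    script=$(mktemp) || return 1
    write_gdb_script "core.$pid" > "$script"
    gdb -batch -x "$script" "$binary" "$pid"
    status=$?
    rm -f "$script"
    return $status
}
```

With the pid from the process listing further down in this thread, a call would look like: dump_live_core 8290 "$(which lcg-cr)". In batch mode gdb should not ask the "y" question.
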
> Thanks in advance for cooperation :-)
>
> Piotr
>
> On Oct 13, 2005, at 11:02 AM, Piotr Nyczyk wrote:
>
>> Yes, the problem seems to be similar to the one reported by Gerhard.
>> The "lcg-cr" command hangs in such a way that it is not killed by the
>> SFT job wrapper.
>> We will investigate it, but it will take some time.
>> Please be patient :)
>>
>> Piotr
>>
>> On Oct 12, 2005, at 3:59 PM, Filippidis christos wrote:
>>
>>
>>> I don't know if this helps:
>>>
>>> [root@arxiloxos6 root]# ps -ax
>>> PID TTY STAT TIME COMMAND
>>> 1 ? S 0:06 init
>>> 2 ? SW 0:00 [migration/0]
>>> 3 ? SW 0:00 [migration/1]
>>> 4 ? SW 0:00 [keventd]
>>> 5 ? SWN 0:00 [ksoftirqd/0]
>>> 6 ? SWN 0:00 [ksoftirqd/1]
>>> 9 ? SW 0:00 [bdflush]
>>> 7 ? SW 0:02 [kswapd]
>>> 8 ? SW 0:09 [kscand]
>>> 10 ? SW 0:03 [kupdated]
>>> 11 ? SW 0:00 [mdrecoveryd]
>>> 15 ? SW 0:14 [kjournald]
>>> 70 ? SW 0:00 [khubd]
>>> 1504 ? SW 0:00 [kjournald]
>>> 1878 ? S 0:05 syslogd -m 0
>>> 1882 ? S 0:00 klogd -x
>>> 1892 ? S 0:24 irqbalance
>>> 1909 ? S 0:00 portmap
>>> 1928 ? S 0:00 rpc.statd
>>> 1939 ? S 0:00 mdadm --monitor --scan -f
>>> 1986 ? SW 0:00 [rpciod]
>>> 1987 ? SW 0:00 [lockd]
>>> 2030 ? S 0:00 /usr/sbin/smartd
>>> 2039 ? S 0:00 /usr/sbin/sshd
>>> 2053 ? S 0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
>>> 2076 ? SL 0:09 ntpd -U ntp -p /var/run/ntpd.pid
>>> 2093 ? S 0:22 zhm arxiloxos6.inp.demokritos.gr
>>> 2113 ? S 0:09 sendmail: accepting connections
>>> 2122 ? S 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
>>> 2132 ? S 0:00 gpm -t imps2 -m /dev/mouse
>>> 2141 ? S 0:00 crond
>>> 2150 ? S 9:18 /usr/sbin/pbs_mom -p
>>> 2173 ? S 0:00 xfs -droppriv -daemon
>>> 2190 ? S 0:00 /usr/sbin/atd
>>> 2209 tty1 S 0:00 /sbin/mingetty tty1
>>> 2210 tty2 S 0:00 /sbin/mingetty tty2
>>> 2211 tty3 S 0:00 /sbin/mingetty tty3
>>> 2212 tty4 S 0:00 /sbin/mingetty tty4
>>> 2213 tty5 S 0:00 /sbin/mingetty tty5
>>> 2214 tty6 S 0:00 /sbin/mingetty tty6
>>> 2215 ? S 0:00 /usr/bin/gdm-binary -nodaemon
>>> 2373 ? S 0:00 /usr/bin/gdm-binary -nodaemon
>>> 2374 ? S 19:11 /usr/X11R6/bin/X :0 -auth /var/gdm/:0.Xauth vt7
>>> 2389 ? S 3:54 /usr/bin/gdmgreeter
>>> 6108 ? S 0:00 -sh
>>> 6310 ? S 0:00 /bin/sh /var/spool/pbs/mom_priv/jobs/3572.xg009..SC
>>> 6314 ? S 0:00 /usr/bin/perl -w /tmp/bootstrap.AC6311 /home/dteam002/ xg009.inp.demokritos.gr /home/dteam002/.globus/.gass_cache/local/md5/15/de2
>>> 6320 ? S 0:00 /usr/bin/perl -w /tmp/bootstrap.AC6311 /home/dteam002/ xg009.inp.demokritos.gr /home/dteam002/.globus/.gass_cache/local/md5/15/de2
>>> 6573 ? S 0:00 /usr/bin/perl -w /tmp/bootstrap.AC6311 /home/dteam002/ xg009.inp.demokritos.gr /home/dteam002/.globus/.gass_cache/local/md5/15/de2
>>> 6784 ? S 0:00 bash /home/dteam002/globus-tmp.arxiloxos6.6314.0/globus-tmp.arxiloxos6.6314.0/local/md5/15/de2f91fa165b6e72c05ba3b3fa20c0/md5/60/9
>>> 6852 ? S 0:00 bash /home/dteam002/globus-tmp.arxiloxos6.6314.0/globus-tmp.arxiloxos6.6314.0/local/md5/15/de2f91fa165b6e72c05ba3b3fa20c0/md5/60/9
>>> 6853 ? S 0:00 /bin/bash ./testJob.sh
>>> 6854 ? SN 0:01 python2 /opt/lcg/bin/lcg-mon-wn -j https://gdrb02.cern.ch:9000/-2gUNHn_skh6LaVH0T6r4Q -p /tmp/globus-tmp.arxiloxos6.6314.0 -l 3572
>>> 6855 ? S 0:00 perl -e ? while (1) {? $time_left = `grid-proxy-info -timeleft 2> /dev/null` || 0;? last if ($time_left <= 0);?
>>> 6873 ? S 0:00 perl ./run-test sft-lcg-rm
>>> 6898 ? S 0:00 /bin/bash tests/sft-lcg-rm
>>> 8269 ? S 0:00 perl ./run-test sft-lcg-rm-cr
>>> 8279 ? S 0:00 /bin/bash tests/sft-lcg-rm-cr
>>> 8290 ? S 0:00 lcg-cr -v --vo dteam -d xg006.inp.demokritos.gr -l lfn:sft-lcg-rm-cr-arxiloxos6.inp.demokritos.gr.0510112121 file:///home/dteam002
>>> 22905 ? S 0:00 sshd: root@pts/0
>>> 22907 pts/0 S 0:00 -bash
>>> 23060 pts/0 R 0:00 ps -ax
>>>
>>>
>>>
>>>
>>>> Filippidis christos wrote:
>>>>
>>>>
>>>>
>>>>> Hi again,
>>>>>
>>>>> I have an SFT job with a "problem" right now (actually it has been
>>>>> "running" since yesterday).
>>>>>
>>>>> The info I can get for this job is:
>>>>> (I don't know how to get more)
>>>>>
>>>>>
>>>>
>>>> Do not trust the output of "qstat" or "pbsnodes": PBS/Torque has bugs
>>>> and occasionally gets into a bad state. Log in on the WN and look
>>>> around with "ps" etc.
>>>>
>>>>
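As a rough sketch of "looking around with ps" on the WN, the snippet below lists running lcg-utils commands together with their elapsed time. It assumes a ps that supports the POSIX "-o pid=,etime=,args=" format; the function names filter_lcg and list_lcg_processes are made up for this example.

```shell
#!/bin/sh
# Sketch only: filter_lcg and list_lcg_processes are hypothetical helpers.

# Read "pid etime command args..." lines on stdin and keep those whose
# command name starts with "lcg-" (with or without a leading path).
filter_lcg() {
    awk '$3 ~ /(^|\/)lcg-/ { print $1, $2, $3 }'
}

# List lcg-* processes on this node with their elapsed run time.
list_lcg_processes() {
    ps -eo pid=,etime=,args= | filter_lcg
}
```

An lcg-cr that shows an elapsed time of hours is a good candidate for the hanging process. Wrappers such as "/bin/bash tests/sft-lcg-rm-cr" are skipped because only the command name itself is matched.
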
>>>>
>>>>> arxiloxos6.inp.demokritos.gr
>>>>>     state = free
>>>>>     np = 2
>>>>>     properties = lcgpro
>>>>>     ntype = cluster
>>>>>     jobs = 0/3572.xg009.inp.demokritos.gr
>>>>>     status = arch=linux,uname=Linux arxiloxos6.inp.demokritos.gr 2.4.21-32.0.1.EL.cernsmp #1 SMP Thu May 26 12:29:50 CEST 2005 i686,sessions=2389 6108,nsessions=2,nusers=2,idletime=5778321,totmem=1554432kb,availmem=1111516kb,physmem=510216kb,ncpus=2,loadave=0.00,rectime=1129124056
>>>>>
>>>>> [root@xg009 root]# qstat
>>>>> Job id           Name             User             Time Use S Queue
>>>>> ---------------- ---------------- ---------------- -------- - -----
>>>>> 3572.xg009       STDIN            dteam002         00:00:22 R dteam
>>>>>
>>>>> You can see it at
>>>>> https://lcg-sft.cern.ch:9443/sft/sitehistory.cgi?site=xg009.inp.demokritos.gr
>>>>> This causes many problems: today I have no new SFT jobs, probably
>>>>> because it seems that there is a dteam job still running.
>>>>>
>>>>> If I delete this job, I will have new SFT jobs until 18:00, and then
>>>>> the same thing will happen again.
>>>>>
>>>>>
>>>>> thanks
>>>>> xristos
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> I was trying to figure out why the test job could hang, but I must
>>>>>> admit that I was unable to reproduce the problem. Normally all tests
>>>>>> are killed automatically after 15 minutes by the SIGALRM signal
>>>>>> handler (the signal handler sends a KILL signal to the test process),
>>>>>> and when I try to simulate hanging tests everything works fine for me.
>>>>>>
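To illustrate the idea of killing a test after a time limit, here is a small shell sketch. Note that it is not the SFT wrapper code: the wrapper is described above as using a SIGALRM handler, while this sketch swaps in a background watchdog that sends KILL. The function name run_with_timeout is hypothetical.

```shell
#!/bin/sh
# Sketch only: run_with_timeout is a hypothetical helper, not SFT code.

# Run a command and send it SIGKILL if it is still alive after
# <limit> seconds. Returns the exit status of the command.
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    child=$!
    # Watchdog: sleeps for the limit, then kills the child if needed.
    ( sleep "$limit"; kill -KILL "$child" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog=$!
    wait "$child"
    status=$?
    # The command finished (or was killed): stop the watchdog.
    kill "$watchdog" 2>/dev/null
    wait "$watchdog" 2>/dev/null
    return $status
}
```

For the 15 minute limit mentioned above, a call would be: run_with_timeout 900 ./run-test sft-lcg-rm-cr. A status of 137 (128 + 9) means the command was killed by the watchdog. Only the direct child gets the KILL signal here, so a grandchild process could survive the timeout in this sketch.
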
>>>>>> Could you please check the list of running processes on the WN the
>>>>>> next time it happens? If possible, could you also note down the time
>>>>>> when the job actually started to execute and the time when you
>>>>>> checked the process table...
>>>>>> This is the most obvious way we can investigate what is happening.
>>>>>>
>>>>>> Piotr
>>>>>>
>>>>>> On Oct 12, 2005, at 1:00 PM, Gerhard Walzel wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Judit
>>>>>>> I have exactly the same problem at site Hephy-Vienna.
>>>>>>> Just starting at 0015 !
>>>>>>> In the last few days I have simply removed the job to enable
>>>>>>> SFT tests again...
>>>>>>> Gerhard
>>>>>>>
>>>>>>>
>>>>>>> On 10/12/05 11:59 AM, "NOVAK Judit" wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi Christos,
>>>>>>>>
>>>>>>>>
>>>>>>>> In the site history I can see two Job Submission failures,
>>>>>>>> both from last week. The last one ran into a timeout (while gstat
>>>>>>>> reports many free CPUs -- is everything OK with the batch system?).
>>>>>>>>
>>>>>>>>
>>>>>>>> Judit
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 11, Filippidis christos wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> hi to all,
>>>>>>>>>
>>>>>>>>> I have the following problem:
>>>>>>>>>
>>>>>>>>> Our site here at Demokritos is passing the SFT, but for the last
>>>>>>>>> week, every day when dteam002
>>>>>>>>> "/c=ch/o=cern/ou=grid/cn=judit novak 0973" sends an SFT at 18:00,
>>>>>>>>> the job either never ends or stops the next day, and the result
>>>>>>>>> is CT or JS.
>>>>>>>>>
>>>>>>>>> At the same time, when I send an SFT from this site:
>>>>>>>>> https://monitoring.egee.man.poznan.pl/
>>>>>>>>> everything is OK.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It is also strange that when Judit Novak sends an SFT at another
>>>>>>>>> time of day, for example in the morning, the SFT is successful.
>>>>>>>>>
>>>>>>>>> Do you have any ideas?
>>>>>>>>>
>>>>>>>>> thanks xristos
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Christos Filippidis
>>>>>>>>> NCSR DEMOKRITOS
>>>>>>>>> Institute of Nuclear Physics
>>>>>>>>> office block 6(ktirion 6)
>>>>>>>>> Gr-15310 Agia Paraskevi
>>>>>>>>> GREECE
>>>>>>>>> Tel:2106503425
>>>>>>>>>
>>>>>>>>> http://consult.cern.ch/xwho/people/117002
>>>>>>>>> http://www.inp.demokritos.gr/~filippidisx/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----------------------------------------------
>>>>>>>>>
>>>>>>>>> "Institute of Nuclear Physics NCSR Demokritos"
>>>>>>>>> http://www.inp.demokritos.gr/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>