Hi, you can find the core file at:
http://grid.inp.demokritos.gr/gdb/
Thanks,
xristos
> Hi
> I moved the dump to /afs/hephy.at/user/w/walzel/public/
> Regards
> gerhard
>
>
> On 10/13/05 1:48 PM, "Piotr Nyczyk" wrote:
>
>> Hi guys,
>>
>> To help us investigate the whole thing I have a small task for
>> those who experience SFT jobs hanging on their WNs.
>> In fact it is usually one of the lcg-utils commands that hangs, and
>> as we can't reproduce the problem on our machines the only way seems
>> to be getting a core dump of your hanging process...
>>
>> Could you please perform the following steps when you encounter an
>> SFT job hanging in the batch system:
>>
>> 1. Locate the WN which is actually running the job
>>
>> 2. Log in as root on the WN
>>
>> 3. Use the command "ps axf" (the tree view of processes) to find the
>> process executed by the SFT test job that actually hangs; it should
>> be one of the lcg-utils commands, most probably "lcg-cr".
>>
>> 4. Make sure "gdb" is installed on your WN (on most machines
>> "apt-get install gdb" should be enough); you can remove gdb after
>> finishing this procedure.
>>
>> 5. Assuming that the hanging process is "lcg-cr", create a "live" core
>> of the process using the following commands:
>>   ps axf         # locate the hanging lcg-cr process and note its PID
>>   which lcg-cr   # find where the "lcg-cr" binary is installed
>>   gdb /<absolute_path>/lcg-cr <pid_of_hanging_lcg-cr_process>
>> and then, under gdb, use the following commands:
>>   gcore
>>   quit
>> (answer "y" to the question)
>>
>> This should generate a core dump of the running process without
>> actually killing it. You can kill the process afterwards once you are
>> sure that the core was saved.
>>
>> 6. Please send the core file to me (it can be quite big, so in that
>> case compress it or put it somewhere on the web so that my mail
>> account is not overloaded).
>>
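For sites that would rather script steps 3 to 5 than type them, here is a minimal Python sketch. It is not part of the SFT or lcg-utils tooling. It assumes the process list comes from `ps -eo pid,args` and that the installed gdb supports the `-batch` and `-ex` options. The function names `find_pids` and `gcore_command` and the core file path are made up for illustration.

```python
def find_pids(ps_output, command_name):
    """Return PIDs whose executable name equals command_name.

    ps_output is expected to be the text printed by `ps -eo pid,args`:
    one header line, then "PID COMMAND ARGS..." rows.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        argv0 = fields[1].split()[0]
        # Compare only the basename so "/some/dir/lcg-cr" also matches.
        if argv0.rsplit("/", 1)[-1] == command_name:
            pids.append(int(fields[0]))
    return pids


def gcore_command(binary_path, pid, core_path):
    """Build a non-interactive gdb command line that attaches to pid,
    writes a core file with gcore, and exits (step 5, without the prompt)."""
    return ["gdb", "-batch", "-ex", "gcore " + core_path, binary_path, str(pid)]
```

Passing the returned list to `subprocess.run(...)` as root on the WN should write the core file and then detach, leaving the process running as in the interactive procedure. The binary path still has to be looked up with `which lcg-cr` as in step 5.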
>> Thanks in advance for cooperation :-)
>>
>> Piotr
>>
>> On Oct 13, 2005, at 11:02 AM, Piotr Nyczyk wrote:
>>
>>> Yes, the problem seems to be similar to the one reported by Gerhard.
>>> The "lcg-cr" command hangs in such a way that it is not killed by the
>>> SFT job wrapper. We will investigate it, but it will take some time.
>>> Be patient please :)
>>>
>>> Piotr
>>>
>>> On Oct 12, 2005, at 3:59 PM, Filippidis christos wrote:
>>>
>>>
>>>> I don't know if this helps:
>>>>
>>>> [root@arxiloxos6 root]# ps -ax
>>>> PID TTY STAT TIME COMMAND
>>>> 1 ? S 0:06 init
>>>> 2 ? SW 0:00 [migration/0]
>>>> 3 ? SW 0:00 [migration/1]
>>>> 4 ? SW 0:00 [keventd]
>>>> 5 ? SWN 0:00 [ksoftirqd/0]
>>>> 6 ? SWN 0:00 [ksoftirqd/1]
>>>> 9 ? SW 0:00 [bdflush]
>>>> 7 ? SW 0:02 [kswapd]
>>>> 8 ? SW 0:09 [kscand]
>>>> 10 ? SW 0:03 [kupdated]
>>>> 11 ? SW 0:00 [mdrecoveryd]
>>>> 15 ? SW 0:14 [kjournald]
>>>> 70 ? SW 0:00 [khubd]
>>>> 1504 ? SW 0:00 [kjournald]
>>>> 1878 ? S 0:05 syslogd -m 0
>>>> 1882 ? S 0:00 klogd -x
>>>> 1892 ? S 0:24 irqbalance
>>>> 1909 ? S 0:00 portmap
>>>> 1928 ? S 0:00 rpc.statd
>>>> 1939 ? S 0:00 mdadm --monitor --scan -f
>>>> 1986 ? SW 0:00 [rpciod]
>>>> 1987 ? SW 0:00 [lockd]
>>>> 2030 ? S 0:00 /usr/sbin/smartd
>>>> 2039 ? S 0:00 /usr/sbin/sshd
>>>> 2053 ? S 0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
>>>> 2076 ? SL 0:09 ntpd -U ntp -p /var/run/ntpd.pid
>>>> 2093 ? S 0:22 zhm arxiloxos6.inp.demokritos.gr
>>>> 2113 ? S 0:09 sendmail: accepting connections
>>>> 2122 ? S 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
>>>> 2132 ? S 0:00 gpm -t imps2 -m /dev/mouse
>>>> 2141 ? S 0:00 crond
>>>> 2150 ? S 9:18 /usr/sbin/pbs_mom -p
>>>> 2173 ? S 0:00 xfs -droppriv -daemon
>>>> 2190 ? S 0:00 /usr/sbin/atd
>>>> 2209 tty1 S 0:00 /sbin/mingetty tty1
>>>> 2210 tty2 S 0:00 /sbin/mingetty tty2
>>>> 2211 tty3 S 0:00 /sbin/mingetty tty3
>>>> 2212 tty4 S 0:00 /sbin/mingetty tty4
>>>> 2213 tty5 S 0:00 /sbin/mingetty tty5
>>>> 2214 tty6 S 0:00 /sbin/mingetty tty6
>>>> 2215 ? S 0:00 /usr/bin/gdm-binary -nodaemon
>>>> 2373 ? S 0:00 /usr/bin/gdm-binary -nodaemon
>>>> 2374 ? S 19:11 /usr/X11R6/bin/X :0 -auth /var/gdm/:0.Xauth vt7
>>>> 2389 ? S 3:54 /usr/bin/gdmgreeter
>>>> 6108 ? S 0:00 -sh
>>>> 6310 ? S 0:00 /bin/sh /var/spool/pbs/mom_priv/jobs/3572.xg009..SC
>>>> 6314 ? S 0:00 /usr/bin/perl -w /tmp/bootstrap.AC6311 /home/dteam002/ xg009.inp.demokritos.gr /home/dteam002/.globus/.gass_cache/local/md5/15/de2
>>>> 6320 ? S 0:00 /usr/bin/perl -w /tmp/bootstrap.AC6311 /home/dteam002/ xg009.inp.demokritos.gr /home/dteam002/.globus/.gass_cache/local/md5/15/de2
>>>> 6573 ? S 0:00 /usr/bin/perl -w /tmp/bootstrap.AC6311 /home/dteam002/ xg009.inp.demokritos.gr /home/dteam002/.globus/.gass_cache/local/md5/15/de2
>>>> 6784 ? S 0:00 bash /home/dteam002/globus-tmp.arxiloxos6.6314.0/globus-tmp.arxiloxos6.6314.0/local/md5/15/de2f91fa165b6e72c05ba3b3fa20c0/md5/60/9
>>>> 6852 ? S 0:00 bash /home/dteam002/globus-tmp.arxiloxos6.6314.0/globus-tmp.arxiloxos6.6314.0/local/md5/15/de2f91fa165b6e72c05ba3b3fa20c0/md5/60/9
>>>> 6853 ? S 0:00 /bin/bash ./testJob.sh
>>>> 6854 ? SN 0:01 python2 /opt/lcg/bin/lcg-mon-wn -j https://gdrb02.cern.ch:9000/-2gUNHn_skh6LaVH0T6r4Q -p /tmp/globus-tmp.arxiloxos6.6314.0 -l 3572
>>>> 6855 ? S 0:00 perl -e ? while (1) {? $time_left = `grid-proxy-info -timeleft 2> /dev/null` || 0;? last if ($time_left <= 0);?
>>>> 6873 ? S 0:00 perl ./run-test sft-lcg-rm
>>>> 6898 ? S 0:00 /bin/bash tests/sft-lcg-rm
>>>> 8269 ? S 0:00 perl ./run-test sft-lcg-rm-cr
>>>> 8279 ? S 0:00 /bin/bash tests/sft-lcg-rm-cr
>>>> 8290 ? S 0:00 lcg-cr -v --vo dteam -d xg006.inp.demokritos.gr -l lfn:sft-lcg-rm-cr-arxiloxos6.inp.demokritos.gr.0510112121 file:///home/dteam002
>>>> 22905 ? S 0:00 sshd: root@pts/0
>>>> 22907 pts/0 S 0:00 -bash
>>>> 23060 pts/0 R 0:00 ps -ax
>>>>
>>>>
>>>>
>>>>
>>>>> Filippidis christos wrote:
>>>>>
>>>>>
>>>>>
>>>>>> hi again,
>>>>>>
>>>>>> I have an SFT job with a "problem" right now (actually it has been
>>>>>> "running" since yesterday).
>>>>>>
>>>>>> The info I can get for this job is below
>>>>>> (I don't know how to get more):
>>>>>>
>>>>>>
>>>>>
>>>>> Do not trust the output of "qstat" or "pbsnodes": PBS/Torque has
>>>>> bugs and occasionally it will get into a bad state. Log in on the
>>>>> WN and look around with "ps" etc.
>>>>>
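One way to "look around with ps" is to flag processes that have been alive longer than the SFT test timeout. The sketch below is only an illustration: it assumes the input is the text printed by `ps -eo pid,etime,args` on the WN, uses the 15-minute limit mentioned elsewhere in this thread as its default, and the function names `etime_to_seconds` and `overdue` are made up.

```python
def etime_to_seconds(etime):
    """Convert a ps ETIME value of the form [[dd-]hh:]mm:ss to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def overdue(ps_output, limit_seconds=15 * 60):
    """Return (pid, age_in_seconds, command) for rows older than the limit.

    ps_output is expected to come from `ps -eo pid,etime,args`.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        age = etime_to_seconds(fields[1])
        if age > limit_seconds:
            rows.append((int(fields[0]), age, fields[2]))
    return rows
```

The result still has to be filtered by eye for the job's own processes, since daemons such as sshd are naturally older than 15 minutes.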
>>>>>
>>>>>
>>>>>> arxiloxos6.inp.demokritos.gr
>>>>>> state = free
>>>>>> np = 2
>>>>>> properties = lcgpro
>>>>>> ntype = cluster
>>>>>> jobs = 0/3572.xg009.inp.demokritos.gr
>>>>>> status = arch=linux,uname=Linux arxiloxos6.inp.demokritos.gr 2.4.21-32.0.1.EL.cernsmp #1 SMP Thu May 26 12:29:50 CEST 2005 i686,sessions=2389 6108,nsessions=2,nusers=2,idletime=5778321,totmem=1554432kb,availmem=1111516kb,physmem=510216kb,ncpus=2,loadave=0.00,rectime=1129124056
>>>>>>
>>>>>> [root@xg009 root]# qstat
>>>>>> Job id           Name             User             Time Use S Queue
>>>>>> ---------------- ---------------- ---------------- -------- - -----
>>>>>> 3572.xg009       STDIN            dteam002         00:00:22 R dteam
>>>>>>
>>>>>> You can see it at
>>>>>> https://lcg-sft.cern.ch:9443/sft/sitehistory.cgi?site=xg009.inp.demokritos.gr
>>>>>> This causes many problems: today I have no new SFT jobs, probably
>>>>>> because it seems that a dteam job is still running.
>>>>>>
>>>>>> If I delete this job I will get new SFT jobs until 18:00, and then
>>>>>> the same thing will happen again.
>>>>>>
>>>>>>
>>>>>> thanks
>>>>>> xristos
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> I was trying to figure out why the test job could hang, but I must
>>>>>>> admit that I was unable to reproduce the problem. Normally all
>>>>>>> tests are killed automatically after 15 minutes by the SIGALRM
>>>>>>> signal handler (the signal handler sends a KILL signal to the test
>>>>>>> process), and when I try to simulate hanging tests everything
>>>>>>> works fine for me.
>>>>>>>
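To make the timeout mechanism described above concrete, here is a minimal Python sketch of a SIGALRM watchdog. It is not the code of the real SFT wrapper (the process list in this thread shows that wrapper is a Perl script, `./run-test`), and the function name `run_with_alarm` is made up. One deliberate assumption: it kills the child's whole process group, not only the direct child.

```python
import os
import signal
import subprocess


def run_with_alarm(argv, timeout_seconds):
    """Run argv; if it is still running after timeout_seconds, SIGKILL it.

    Returns (exit_status, timed_out). Must be called from the main thread,
    because Python only allows signal handlers to be installed there.
    """
    # Put the child in its own session so that its process group id equals
    # its pid and the whole group can be killed at once.
    child = subprocess.Popen(argv, start_new_session=True)
    state = {"timed_out": False}

    def on_alarm(signum, frame):
        state["timed_out"] = True
        try:
            os.killpg(child.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group is already gone

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout_seconds)
    try:
        status = child.wait()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)
    return status, state["timed_out"]
```

Killing the group matters when the test is a shell script that starts other commands: a KILL sent only to the script does not by itself terminate a command the script has started. Whether that is what happens with the hanging lcg-cr is not established in this thread.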
>>>>>>> Could you please check the list of running processes on the WN
>>>>>>> the next time it happens? If possible, could you also note down
>>>>>>> the time when the job actually started to execute and the time
>>>>>>> when you checked the process table?
>>>>>>> This is the most obvious way we can investigate what is happening.
>>>>>>>
>>>>>>> Piotr
>>>>>>>
>>>>>>> On Oct 12, 2005, at 1:00 PM, Gerhard Walzel wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Judit,
>>>>>>>> I have exactly the same problem at site Hephy-Vienna,
>>>>>>>> just starting at 0015!
>>>>>>>> For the last few days I have simply removed the job to enable
>>>>>>>> SFT tests again...
>>>>>>>> Gerhard
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/12/05 11:59 AM, "NOVAK Judit" wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Christos,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the site history I can see two Job Submission failures,
>>>>>>>>> both from last week. The last one ran into a timeout (while gstat
>>>>>>>>> reports many free CPUs -- is everything OK with the batch system?).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Judit
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 11, Filippidis christos wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I have the following problem:
>>>>>>>>>>
>>>>>>>>>> Our site here at Demokritos is passing the SFT, but for the last
>>>>>>>>>> week, every day when dteam002
>>>>>>>>>> "/c=ch/o=cern/ou=grid/cn=judit novak 0973" sends an SFT at 18:00,
>>>>>>>>>> the job never ends, or it stops the next day and the result
>>>>>>>>>> is CT or JS.
>>>>>>>>>>
>>>>>>>>>> At the same time, when I send an SFT from this site:
>>>>>>>>>> https://monitoring.egee.man.poznan.pl/
>>>>>>>>>> everything is OK.
>>>>>>>>>>
>>>>>>>>>> It is also strange that when Judit Novak sends an SFT at another
>>>>>>>>>> time of day, for example in the morning, the SFT is successful.
>>>>>>>>>>
>>>>>>>>>> Do you have any ideas?
>>>>>>>>>>
>>>>>>>>>> Thanks, xristos
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Christos Filippidis
>>>>>>>>>> NCSR DEMOKRITOS
>>>>>>>>>> Institute of Nuclear Physics
>>>>>>>>>> office block 6(ktirion 6)
>>>>>>>>>> Gr-15310 Agia Paraskevi
>>>>>>>>>> GREECE
>>>>>>>>>> Tel:2106503425
>>>>>>>>>>
>>>>>>>>>> http://consult.cern.ch/xwho/people/117002
>>>>>>>>>> http://www.inp.demokritos.gr/~filippidisx/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ----------------------------------------------
>>>>>>>>>>
>>>>>>>>>> "Institute of Nuclear Physics NCSR Demokritos"
>>>>>>>>>> http://www.inp.demokritos.gr/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>
Christos Filippidis
NCSR DEMOKRITOS
Institute of Nuclear Physics
office block 6(ktirion 6)
Gr-15310 Agia Paraskevi
GREECE
Tel:2106503425
http://consult.cern.ch/xwho/people/117002
http://www.inp.demokritos.gr/~filippidisx/
----------------------------------------------
"Institute of Nuclear Physics NCSR Demokritos"
http://www.inp.demokritos.gr/