https://gus.fzk.de/ws/ticket_info.php?ticket=68730
On 03/17/2011 04:32 PM, Gonçalo Borges wrote:
Hi Dimitris...
Thanks for the feedback, but in the end I was able to solve the
problem, which seems to be different from the one you have.
I realized that there was a long queue of requests to be processed
by the workload-management system
(/var/glite/workload_manager/jobdir/old/). The WM was trying to
reprocess some old entries in the queue and failing. Immediately
after the failure, I saw logs like:
17 Mar, 14:53:02 -W: [Warning] get_catalog_url(dli_utils.cpp:89): No endpoints found
17 Mar, 14:53:02 -W: [Warning] resolve_filemapping_info(dli_utils.cpp:364): cannot get DataCatalogType or endpoint
17 Mar, 14:53:02 -I: [Info] checkRequirement(matchmakerISMImpl.cpp:222): MM for job: https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg (0/0 [0] )
17 Mar, 14:53:02 -I: [Info] postpone(submit_request.cpp:212): postponing https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg (BrokerHelper: no compatible resources)
and, when running with LogLevel 6:
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
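For anyone facing the same symptoms, here is a minimal Python sketch for pulling the postponed job IDs out of the WM log, so that their JDL can be inspected. It only assumes the "postponing <job id> (<reason>)" layout seen in the excerpt above; the function name is made up for illustration, and other WMS versions may log differently.

```python
import re

# Matches the "postponing <job id> (<reason>)" part of a WM log entry.
# The layout is taken from the excerpt above; it is an assumption that
# other entries of this kind look the same.
POSTPONED = re.compile(r"postponing\s+(https://\S+)\s+\(([^)]*)\)")


def postponed_jobs(log_text):
    """Return {job_id: reason} for every 'postponing' entry in log_text."""
    # Collapse all whitespace first, so entries that were wrapped over
    # several lines (as in this mail) are still matched.
    flat = " ".join(log_text.split())
    return {m.group(1): m.group(2) for m in POSTPONED.finditer(flat)}


sample = """17 Mar, 14:53:02 -I: [Info] postpone(submit_request.cpp:212):
postponing https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg
(BrokerHelper: no compatible resources)"""

for job_id, reason in postponed_jobs(sample).items():
    print(job_id, "->", reason)
```

Because the result is a dictionary keyed by job ID, a job that is postponed over and over shows up only once.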
I suspected some jobs with badly defined JDL, and indeed,
looking at the JDL of one of the job IDs referred to in the logs, I
saw things like:
DataRequirements = {
[
DataCatalogType = "DLI";
InputData = { "guid:3172069e-d20b-483e-afba-f7acc689ac85" }
] };
It seems the user was submitting jobs that requested a given file
without specifying the LFC where the file was registered. Because
of that, the WMS didn't know how to process the request, and
failed.
Basically, I had to delete the references to that kind of job in
/var/glite/workload_manager/jobdir/old/, and after that, the
daemons are working perfectly.
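A rough Python sketch of how such entries could be listed before deleting anything by hand. It rests on two assumptions that should be checked first: that the entries in the jobdir are plain-text files containing the job's JDL, and that the catalog endpoint is given through a DataCatalog attribute next to DataCatalogType. The function names are hypothetical, and the script only reports file names; it removes nothing.

```python
import re
from pathlib import Path

# Attribute names are matched case-insensitively. "DataCatalog" as the
# attribute holding the catalog endpoint is an assumption to verify
# against the JDL documentation of your release.
HAS_TYPE = re.compile(r"\bDataCatalogType\s*=", re.IGNORECASE)
HAS_ENDPOINT = re.compile(r"\bDataCatalog\s*=", re.IGNORECASE)


def lacks_catalog_endpoint(jdl_text):
    """True if the text sets DataCatalogType but never sets DataCatalog."""
    return bool(HAS_TYPE.search(jdl_text)) and not HAS_ENDPOINT.search(jdl_text)


def suspect_entries(jobdir):
    """List names of files in jobdir whose content lacks a catalog endpoint.

    Assumes the entries are readable text files; nothing is deleted.
    """
    hits = []
    for path in sorted(Path(jobdir).iterdir()):
        if path.is_file():
            text = path.read_text(errors="replace")
            if lacks_catalog_endpoint(text):
                hits.append(path.name)
    return hits


# The DataRequirements block quoted above, used as a quick check.
sample_jdl = """DataRequirements = {
[
DataCatalogType = "DLI";
InputData = { "guid:3172069e-d20b-483e-afba-f7acc689ac85" }
] };"""

print(lacks_catalog_endpoint(sample_jdl))
```

Calling suspect_entries("/var/glite/workload_manager/jobdir/old") would then give the candidate files to review, ideally with the WM stopped.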
I have contacted the user and asked him to correct the JDL, but the
service should also be protected against this kind of misuse.
Cheers
Goncalo
On 03/17/2011 02:03 PM, Dimitris Zilaskos wrote:
Hi,
Not a solution, but I had a similar experience, recorded at https://gus.fzk.de/ws/ticket_info.php?ticket=66943.
I have just installed the latest glite update and am hammering the
service again to see if I can reproduce the problem...
Cheers,
On 17/3/2011 3:50 PM, Gonçalo Borges wrote:
Hi All...
My WMS is not able to get glite-wms-wm running. I'm running
this service with LogLevel 6, and the final messages produced are:
(...)
17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129): w-dpm01.grid.sinica.edu.tw added to ISM
17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129): wipp-se.weizmann.ac.il added to ISM
17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129): wormhole.westgrid.ca added to ISM
17 Mar, 13:33:16 -D: [Debug] switch_active_side(ism.cpp:36): switched active side to ISM 0
17 Mar, 13:33:16 -I: [Info] main(main.cpp:421): spawning 5 worker threads...
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/EGWpbJHJy72KIukdmHsyRA
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/-S9-S0z62nRHkcfzU4gggw
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/i_yTAUVrzoOXm3BBQYSJ1g
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/-dXZ4gRb4mVXAbCM4_-GYQ
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/-hV1oym9F3g-6ZoZSlU7ig
17 Mar, 13:33:16 -I: [Info] main(main.cpp:429): scheduling dispatcher...
17 Mar, 13:33:16 -I: [Info] main(main.cpp:438): scheduling ISM purchaser(s)...
17 Mar, 13:33:16 -I: [Info] main(main.cpp:473): scheduling ISM updater...
17 Mar, 13:33:16 -D: [Debug] operator()(ism.cpp:142): ISM updater start
17 Mar, 13:33:16 -D: [Debug] operator()(ism.cpp:145): ISM updater end
17 Mar, 13:33:16 -I: [Info] main(main.cpp:498): WM startup completed...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
I've looked at
http://goc.grid.sinica.edu.tw/gocwiki/Jobs_sent_to_my_RB_stay_in_Waiting_state_forever
and restarted the daemon several times, without any success.
Can someone shed some light on the topic?
Cheers
Goncalo