https://gus.fzk.de/ws/ticket_info.php?ticket=68730

On 03/17/2011 04:32 PM, Gonçalo Borges wrote:
Hi Dimitris...

Thanks for the feedback, but in the end I was able to solve the problem, which seems to be different from the one you have.

I realized that there was a long queue of requests to be processed by the workload-management system (/var/glite/workload_manager/jobdir/old/). The WM was trying to reprocess some old entries in the queue and failing. Immediately after the failure, I saw logs like:

17 Mar, 14:53:02 -W: [Warning] get_catalog_url(dli_utils.cpp:89): No endpoints found
17 Mar, 14:53:02 -W: [Warning] resolve_filemapping_info(dli_utils.cpp:364): cannot get DataCatalogType or endpoint
17 Mar, 14:53:02 -I: [Info] checkRequirement(matchmakerISMImpl.cpp:222): MM for job: https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg (0/0 [0] )
17 Mar, 14:53:02 -I: [Info] postpone(submit_request.cpp:212): postponing https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg (BrokerHelper: no compatible resources)

and, when running with LogLevel 6:

17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
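To see which requests keep being postponed, the job IDs can be pulled out of the workload manager log. The following is only a minimal sketch: it assumes the "postponing <job id> (<reason>)" line format shown in the excerpt above, and the function name `postponed_jobs` is made up for illustration.

```python
import re

# Matches the "postponing <job id> (<reason>)" lines shown in the log excerpt.
POSTPONE_RE = re.compile(r"postponing (https://\S+) \((.*)\)\s*$")


def postponed_jobs(log_lines):
    """Return {job_id: reason} for every postponed request found in the log."""
    jobs = {}
    for line in log_lines:
        match = POSTPONE_RE.search(line)
        if match:
            jobs[match.group(1)] = match.group(2)
    return jobs


# Two lines copied from the log excerpt above.
sample = [
    "17 Mar, 14:53:02 -W: [Warning] get_catalog_url(dli_utils.cpp:89): No endpoints found",
    "17 Mar, 14:53:02 -I: [Info] postpone(submit_request.cpp:212): postponing "
    "https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg "
    "(BrokerHelper: no compatible resources)",
]

for job_id, reason in postponed_jobs(sample).items():
    print(job_id, "->", reason)
```

In practice the lines would be read from the workload manager log file rather than from a hard-coded list.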

I suspected that some jobs had a badly defined JDL and, indeed, looking at the JDL of one of the job IDs referred to in the logs, I saw things like:

  DataRequirements = {
   [
    DataCatalogType = "DLI";
    InputData = { "guid:3172069e-d20b-483e-afba-f7acc689ac85" }
   ] };

It seems the user was submitting jobs that requested a given file without referring to the LFC where the file was registered. Because of that, the WMS didn't know how to process the request, and failed.
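A check like this could be run on a JDL before submission. It is a rough sketch, not a JDL parser: it assumes the catalogue endpoint is given with a `DataCatalog` attribute next to `DataCatalogType`, that each `DataRequirements` entry is a flat `[ ... ]` record as in the fragment above, and the endpoint URL in the second sample is purely hypothetical.

```python
import re


def entries_missing_catalog(jdl_text):
    """Return the [ ... ] entries that set DataCatalogType but no DataCatalog."""
    # Assumption: entries are flat [ ... ] records without nested brackets.
    entries = re.findall(r"\[(.*?)\]", jdl_text, flags=re.DOTALL)
    missing = []
    for entry in entries:
        has_type = re.search(r"\bDataCatalogType\s*=", entry)
        # "DataCatalog" followed directly by "=" (so not "DataCatalogType").
        has_endpoint = re.search(r"\bDataCatalog\s*=", entry)
        if has_type and not has_endpoint:
            missing.append(entry.strip())
    return missing


# The fragment seen in the failing job.
bad_jdl = """
DataRequirements = {
 [
  DataCatalogType = "DLI";
  InputData = { "guid:3172069e-d20b-483e-afba-f7acc689ac85" }
 ] };
"""

# Same fragment with a hypothetical catalogue endpoint added.
good_jdl = """
DataRequirements = {
 [
  DataCatalogType = "DLI";
  DataCatalog = "http://lfc.example.org:8085/";
  InputData = { "guid:3172069e-d20b-483e-afba-f7acc689ac85" }
 ] };
"""

print(len(entries_missing_catalog(bad_jdl)), "entry without a catalogue endpoint")
print(len(entries_missing_catalog(good_jdl)), "entries without a catalogue endpoint")
```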

Basically, I had to delete the references to those kinds of jobs in /var/glite/workload_manager/jobdir/old/, and after that the daemons have been working perfectly.
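For anyone who prefers to set the offending entries aside rather than delete them, something along these lines may help. It is a sketch under assumptions: that the queue entries are plain-text files that contain the job ID, and that moving them out of the directory has the same effect as deleting them. The function name and the quarantine path are hypothetical, and it is probably safest to run it with the workload manager stopped.

```python
import shutil
from pathlib import Path


def quarantine_requests(jobdir, quarantine, job_id):
    """Move every regular file in jobdir that mentions job_id into quarantine.

    Returns the names of the files that were moved.
    """
    jobdir, quarantine = Path(jobdir), Path(quarantine)
    quarantine.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(jobdir.iterdir()):
        if not entry.is_file():
            continue
        # Assumption: the request files are text and contain the job ID.
        if job_id in entry.read_text(errors="replace"):
            shutil.move(str(entry), str(quarantine / entry.name))
            moved.append(entry.name)
    return moved


if __name__ == "__main__":
    # Hypothetical invocation; the quarantine directory is an arbitrary choice.
    names = quarantine_requests(
        "/var/glite/workload_manager/jobdir/old",
        "/var/tmp/wm-quarantine",
        "https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg",
    )
    print("moved:", names)
```

Keeping the files makes it possible to inspect the JDL later or to hand it to the developers.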

I have contacted the user and asked him to correct the JDL, but the service should also be protected against this kind of misuse.

Cheers
Goncalo

On 03/17/2011 02:03 PM, Dimitris Zilaskos wrote:
Hi,

Not a solution, but I had a similar experience, recorded at https://gus.fzk.de/ws/ticket_info.php?ticket=66943.

I have just installed the latest glite update and am hammering the service again to see if I can reproduce the problem...

Cheers,

On 17/3/2011 3:50 PM, Gonçalo Borges wrote:
Hi All...

My WMS is not able to get glite-wms-wm running. I'm running this service
with loglevel 6, and the final messages produced are:

(...)
17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129): w-dpm01.grid.sinica.edu.tw added to ISM
17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129): wipp-se.weizmann.ac.il added to ISM
17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129): wormhole.westgrid.ca added to ISM
17 Mar, 13:33:16 -D: [Debug] switch_active_side(ism.cpp:36): switched active side to ISM 0
17 Mar, 13:33:16 -I: [Info] main(main.cpp:421): spawning 5 worker threads...
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/EGWpbJHJy72KIukdmHsyRA
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/-S9-S0z62nRHkcfzU4gggw
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/i_yTAUVrzoOXm3BBQYSJ1g
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/-dXZ4gRb4mVXAbCM4_-GYQ
17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224): considering (re)submit of https://wms01.ncg.ingrid.pt:9000/-hV1oym9F3g-6ZoZSlU7ig
17 Mar, 13:33:16 -I: [Info] main(main.cpp:429): scheduling dispatcher...
17 Mar, 13:33:16 -I: [Info] main(main.cpp:438): scheduling ISM purchaser(s)...
17 Mar, 13:33:16 -I: [Info] main(main.cpp:473): scheduling ISM updater...
17 Mar, 13:33:16 -D: [Debug] operator()(ism.cpp:142): ISM updater start
17 Mar, 13:33:16 -D: [Debug] operator()(ism.cpp:145): ISM updater end
17 Mar, 13:33:16 -I: [Info] main(main.cpp:498): WM startup completed...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...
17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying to get data-location-interface information through SD...

I've looked at
http://goc.grid.sinica.edu.tw/gocwiki/Jobs_sent_to_my_RB_stay_in_Waiting_state_forever
and restarted the daemon several times, without any success.

Can someone shed some light on the topic?

Cheers
Goncalo