https://gus.fzk.de/ws/ticket_info.php?ticket=68730

On 03/17/2011 04:32 PM, Gonçalo Borges wrote:
> Hi Dimitris...
>
> Thanks for the feedback, but at the end, I was able to solve the 
> problem, which seems to be different from the one you have.
>
> I realized that there was a long queue of requests to be processed by 
> the workload-management system 
> (/var/glite/workload_manager/jobdir/old/). The WM was trying to 
> reprocess some old entries in the queue and failing. Immediately after 
> the failure, I saw logs like:
>
> 17 Mar, 14:53:02 -W: [Warning] get_catalog_url(dli_utils.cpp:89): No 
> endpoints found
> 17 Mar, 14:53:02 -W: [Warning] 
> resolve_filemapping_info(dli_utils.cpp:364): cannot get 
> DataCatalogType or endpoint
> 17 Mar, 14:53:02 -I: [Info] 
> checkRequirement(matchmakerISMImpl.cpp:222): MM for job: 
> https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg (0/0 [0] )
> 17 Mar, 14:53:02 -I: [Info] postpone(submit_request.cpp:212): 
> postponing https://wms01.ncg.ingrid.pt:9000/Es5PVTeUke9kHx4hG6qvDg 
> (BrokerHelper: no compatible resources)
>
> and, running with LogLevel 6:
>
> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying 
> to get data-location-interface information through SD...
> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying 
> to get data-location-interface information through SD...
> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying 
> to get data-location-interface information through SD...
> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying 
> to get data-location-interface information through SD...
>
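For anyone hitting the same symptom: the job ids that keep being postponed can be pulled out of the WM log mechanically. The following is only a minimal sketch, assuming the log entries look like the ones quoted above and that each entry sits on one line in the real log file (the mail wrapped them). The function name `postponed_jobs` is my own invention for illustration and is not part of gLite.

```python
import re
from collections import Counter

# Matches entries such as
#   postpone(submit_request.cpp:212): postponing <job id> (BrokerHelper: no compatible resources)
# Optional '*' characters around the job id are tolerated, since some mail
# clients add them as emphasis markers when a log is pasted into a message.
POSTPONE_RE = re.compile(
    r"postpone\(submit_request\.cpp:\d+\):\s*postponing\s+\*?(https://\S+?)\s*\*?\s*"
    r"\(BrokerHelper: no compatible resources\)"
)


def postponed_jobs(log_text: str) -> Counter:
    """Count how often each job id is postponed for lack of compatible resources."""
    # Collapse all whitespace so that entries wrapped over several lines still match.
    flattened = " ".join(log_text.split())
    return Counter(POSTPONE_RE.findall(flattened))
```

Calling `postponed_jobs(open(path).read()).most_common(10)` on the WM log (the path depends on the installation) would list the job ids postponed most often, which are the first candidates whose JDL is worth inspecting.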
> I suspected some jobs with badly defined JDL, and indeed, looking 
> at the JDL of one of the job IDs referred to in the logs, I saw things like:
>
>   DataRequirements = {
>    [
>     DataCatalogType = "DLI";
>     InputData = { "guid:3172069e-d20b-483e-afba-f7acc689ac85" }
>    ] };
>
> It seems the user was submitting jobs, requesting a given file, but 
> not referring to the LFC where the file was registered. Because of that, 
> the WMS didn't know how to process that request, and failed.
>
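A rough pre-submission check for this kind of JDL could look like the sketch below. It is plain text matching, not a real ClassAd parser, so brackets inside quoted strings would confuse it. It flags every innermost `[ ... ]` record that has `InputData` but no catalog endpoint. Note that the attribute name `DataCatalog` for the endpoint is my assumption from memory of the JDL attribute specification and should be checked against the documentation for your WMS release. The function name is hypothetical as well.

```python
import re
from typing import List

# Innermost [ ... ] records, i.e. records that contain no nested brackets.
INNER_RECORD_RE = re.compile(r"\[([^\[\]]*)\]")
# ClassAd attribute names are case-insensitive.
HAS_INPUT_RE = re.compile(r"\bInputData\s*=", re.IGNORECASE)
# Assumed endpoint attribute; "DataCatalogType =" does not match this pattern,
# because "Type" sits between the name and the "=".
HAS_CATALOG_RE = re.compile(r"\bDataCatalog\s*=", re.IGNORECASE)


def records_without_catalog(jdl_text: str) -> List[str]:
    """Return the bodies of records that request input data but name no catalog."""
    return [
        body.strip()
        for body in INNER_RECORD_RE.findall(jdl_text)
        if HAS_INPUT_RE.search(body) and not HAS_CATALOG_RE.search(body)
    ]
```

Run against the `DataRequirements` block quoted above, this returns one record. An empty list only means that every record names some catalog, not that the endpoint is reachable.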
> Basically I had to delete the references to those kinds of jobs in 
> /var/glite/workload_manager/jobdir/old/, and after that, the daemons 
> are working perfectly.
>
> I have contacted the user and asked him to correct the JDL, but the service 
> should also be protected against this kind of misuse.
>
> Cheers
> Goncalo
>
> On 03/17/2011 02:03 PM, Dimitris Zilaskos wrote:
>> Hi,
>>
>> Not a solution, but I had similar experience recorded at 
>> https://gus.fzk.de/ws/ticket_info.php?ticket=66943.
>>
>> I have just installed the latest glite update and am hammering the 
>> service again to see if I can reproduce the problem...
>>
>> Cheers,
>>
>> On 17/3/2011 3:50 PM, Gonçalo Borges wrote:
>>> Hi All...
>>>
>>> My WMS is not able to get glite-wms-wm running. I'm running this
>>> service with loglevel 6, and the last messages produced are:
>>>
>>> (...)
>>> 17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129):
>>> w-dpm01.grid.sinica.edu.tw added to ISM
>>> 17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129):
>>> wipp-se.weizmann.ac.il added to ISM
>>> 17 Mar, 13:33:16 -D: [Debug] populate_ism(ism-ii-purchaser.cpp:129):
>>> wormhole.westgrid.ca added to ISM
>>> 17 Mar, 13:33:16 -D: [Debug] switch_active_side(ism.cpp:36): switched
>>> active side to ISM 0
>>> 17 Mar, 13:33:16 -I: [Info] main(main.cpp:421): spawning 5 worker
>>> threads...
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224):
>>> considering (re)submit of
>>> https://wms01.ncg.ingrid.pt:9000/EGWpbJHJy72KIukdmHsyRA
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224):
>>> considering (re)submit of
>>> https://wms01.ncg.ingrid.pt:9000/-S9-S0z62nRHkcfzU4gggw
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224):
>>> considering (re)submit of
>>> https://wms01.ncg.ingrid.pt:9000/i_yTAUVrzoOXm3BBQYSJ1g
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224):
>>> considering (re)submit of
>>> https://wms01.ncg.ingrid.pt:9000/-dXZ4gRb4mVXAbCM4_-GYQ
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(submit_request.cpp:224):
>>> considering (re)submit of
>>> https://wms01.ncg.ingrid.pt:9000/-hV1oym9F3g-6ZoZSlU7ig
>>> 17 Mar, 13:33:16 -I: [Info] main(main.cpp:429): scheduling 
>>> dispatcher...
>>> 17 Mar, 13:33:16 -I: [Info] main(main.cpp:438): scheduling ISM
>>> purchaser(s)...
>>> 17 Mar, 13:33:16 -I: [Info] main(main.cpp:473): scheduling ISM 
>>> updater...
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(ism.cpp:142): ISM updater start
>>> 17 Mar, 13:33:16 -D: [Debug] operator()(ism.cpp:145): ISM updater end
>>> 17 Mar, 13:33:16 -I: [Info] main(main.cpp:498): WM startup completed...
>>> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying
>>> to get data-location-interface information through SD...
>>> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying
>>> to get data-location-interface information through SD...
>>> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying
>>> to get data-location-interface information through SD...
>>> 17 Mar, 13:33:17 -D: [Debug] get_catalog_url(dli_utils.cpp:62): trying
>>> to get data-location-interface information through SD...
>>>
>>> I've looked at
>>> http://goc.grid.sinica.edu.tw/gocwiki/Jobs_sent_to_my_RB_stay_in_Waiting_state_forever
>>> and restarted the daemon several times, without any success.
>>>
>>> Can someone shed some light on the topic?
>>>
>>> Cheers
>>> Goncalo
>>>
>>
>>
>