Hi Tiziana...
>
> is this cron script described as a workaround, or is the problem
> documented in any way? If not, we should request the documentation to
> be fixed accordingly. We can do this with a GGUS ticket (we don't need
> an official EGI requirement for this).
>
Indeed I know that condor_q sometimes shows a lot of held entries, and
normally I remove them by hand. They are typically caused by jobs which
entered a very strange state. For example, in one of these situations,
the held entries in condor_q were production jobs from an Auger user
which had been sent to a site with hardware problems. The jobs failed
but remained in condor_q forever.
I'm not a WMS expert, but these situations do happen from time to time,
and at least a clear guideline on how to deal with them would be
appreciated.
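For reference, the manual cleanup boils down to pulling the held job IDs
out of the condor_q output. A minimal sketch (the sample output and the
"glite" owner filter are illustrative only; on a real WMS you would pipe
/opt/condor-c/bin/condor_q -hold into the same filter instead):

```shell
# Sketch: extract held job IDs from condor_q -hold style output.
# The sample output below is hardcoded for illustration.
sample_output=' ID      OWNER   HELD_SINCE  HOLD_REASON
 12.0    glite   6/14 10:02  Globus error 10: data transfer failed
 15.0    glite   6/14 10:05  Globus error 10: data transfer failed
 17.0    dteam   6/14 10:07  via condor_hold'

# Same grep/awk filter as the cron script quoted later in this thread:
held_ids=$(printf '%s\n' "$sample_output" | grep glite | awk '{print $1}')
echo "$held_ids"    # prints 12.0 and 15.0, one per line
```

Each ID printed this way is what you would then feed to condor_rm.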
> As side issue, I know IGI has a geographically distributed pool of WMS
> instances controlled by WMSmonitor. Why do you think SNMP is not a
> good (interim) solution? probably it's just a matter of fixing the
> respective firewalls at the sites.
AFAIK, WMSMonitor uses SNMPv2, which does not support encryption. To set
it up in WAN mode, you have to exchange messages (with community strings
inside) over the internet, and you do not want to expose that kind of
unencrypted traffic, with explicit information about your services, to
the world. Therefore, it is only suitable for deployment over a LAN. If
this is being done in IGI (using SNMPv2), I would have to understand the
network topology used.
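To make the point concrete: in SNMPv2c the community string sits in
every packet as plain bytes. A hand-assembled GetRequest for sysDescr.0
(illustrative only; a real monitor would use net-snmp or similar) shows
it directly:

```shell
# Hand-assembled, BER-encoded SNMPv2c GetRequest for 1.3.6.1.2.1.1.1.0
# (sysDescr.0) with community "public". Illustrative only -- the point
# is that the community string appears in the packet completely
# unencrypted, so anyone sniffing the wire can read it.
packet_file=$(mktemp)
# SEQUENCE { version=1, community="public", GetRequest { req-id=1,
#            error-status=0, error-index=0, varbinds { sysDescr.0 } } }
printf '\060\046\002\001\001\004\006public\240\031\002\001\001\002\001\000\002\001\000\060\016\060\014\006\010\053\006\001\002\001\001\001\000\005\000' > "$packet_file"
# The community string is trivially visible in the raw bytes:
grep -a -q public "$packet_file" && echo "community string visible in cleartext"
```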
Cheers
Goncalo
> let me know
>
> Thanks Tiziana
>
> On 14/06/2011 12:55, Gonçalo Borges wrote:
>> After implementing the cron from Arnau and Maarten for the deletion of
>> the CONDOR CRAB, the number of requests registered in the
>>
>> /var/glite/jobcontrol/jobdir/
>>
>> decreased tremendously.
>>
>> Cheers
>> Goncalo
>>
>> On 06/14/2011 11:18 AM, Gonçalo Borges wrote:
>>> Hi Arnau...
>>>
>>>> Option 1: I had similar problems in our WMS. Under high load, it
>>>> stopped seeing BDII resources and jobs were not able to start. If
>>>> that is the case, you will find a descriptive message in
>>>> workload_manager_events.log. And, for solving it, we installed
>>>> google_perf_tools (you will find the recipe in the WMS known_issues).
>>>>
>>>
>>> We are using google_perf_tools already.
>>>
>>> grep libtcmalloc.so /opt/glite/etc/glite_wms.conf
>>> RuntimeMalloc = "/usr/lib/libtcmalloc.so";
>>>
>>>
>>>> Option 2: Have you recently upgraded the LB? If yes, ensure
>>>> glite-lb-authz.conf has the correct values.
>>>
>>> Nope.
>>>
>>>> You could also install WMSMonitor. It is a good tool for a quick check.
>>>>
>>>
>>> Yes, I know. Unfortunately, since we are operating two sites over a
>>> WAN, we thought of using that tool in WAN mode. Talking with Daniele,
>>> the WMSMonitor uses SNMP version 2, which is not the proper framework
>>> for that. We are expecting the EMI WMSMonitor release, which will use
>>> ActiveMQ.
>>>
>>>>
>>>> Maarten sent me this script, which must be run from cron:
>>>> # cat /usr/local/sbin/clean_condor_jobs.sh
>>>> #!/bin/bash
>>>>
>>>>
>>>> CONDOR_CRAP=`/opt/condor-c/bin/condor_q -hold | grep glite | awk '{print $1}'`
>>>>
>>>>
>>>> for JOB_ID in $CONDOR_CRAP
>>>> do
>>>> echo "Removing job: " $JOB_ID
>>>> /opt/condor-c/bin/condor_rm $JOB_ID
>>>> # sleep 2
>>>> /opt/condor-c/bin/condor_rm -forcex $JOB_ID
>>>> done
>>>>
>>>
>>> condor CRAP is a good name :-) Any special frequency to run it?
>>>
>>> Cheers
>>> Goncalo
>>>
>>>
>>
>>