Hi Gonçalo,
as mentioned in the thread, the new WMSMonitor version will work with
ActiveMQ (besides other improvements, such as use of the LB API, CREAM
monitoring, and asynchronous data collection).
We're currently testing the sensors against preview instances of the EMI
WMS/LB and the EGI testing ActiveMQ broker. We'll then make the sensors
available for installation on new emi-wms/lb nodes, for volunteer sites
or whoever is interested (a tar + a cron, triggering data production
every 15 minutes). Data will be published on our central server. The
wmsmonitor server rpm will then also be distributed.
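The "cron triggering 15-minute data production" could be as simple as a
crontab entry like the sketch below (the sensor script path is
hypothetical, invented for illustration; the real name will ship in the
tar):

```
# hypothetical path; the actual sensor script comes with the tar above
*/15 * * * * root /opt/wmsmonitor/sensors/collect_wms_data.sh
```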
If you're interested in the preview phase I will notify you.
Cheers
Danilo
On 06/14/2011 01:40 PM, Gonçalo Borges wrote:
> Hi Tiziana...
>
>>
>> is this cron script described as workaround or the problem documented
>> in any way? if not we should request documentation to be fixed
>> accordingly. We can do this with a GGUS ticket (we don't need an
>> official EGI requirement for this).
>>
>
> Indeed I know that condor_q sometimes has a lot of held entries, and
> normally I remove them by hand. They are usually caused by jobs that
> entered a very strange state. For example, in one of these situations,
> the held entries in condor_q were production jobs from an Auger user
> which had been sent to a site with hardware problems. The jobs failed
> but remained in condor_q forever.
>
> I'm not a WMS expert, but these situations do happen from time to
> time, and at least a clear guideline on how to deal with them would be
> appreciated.
>
>
>
>> As side issue, I know IGI has a geographically distributed pool of
>> WMS instances controlled by WMSmonitor. Why do you think SNMP is not
>> a good (interim) solution? probably it's just a matter of fixing the
>> respective firewalls at the sites.
>
> AFAIK, WMSMonitor uses SNMPv2, which does not support encryption. To
> set it up in WAN mode, you have to exchange messages (with community
> strings inside) over the Internet, and you do not want to expose that
> kind of unencrypted traffic, with explicit information about your
> services, to the world. Therefore, it is only suitable for deployment
> over a LAN. If this is being done in IGI (using SNMPv2), I would have
> to understand the network topology used.
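To illustrate the point (a minimal sketch, not WMSMonitor code): in
SNMPv2c the community string is carried verbatim, BER-encoded as an
OCTET STRING, at the start of every message, so anyone sniffing the path
can read it.

```shell
# Sketch of the start of an SNMPv2c message: version (INTEGER 1 = v2c)
# followed by the community string as a cleartext OCTET STRING.
COMMUNITY="public"
HDR=$(printf '\002\001\001\004\006%s' "$COMMUNITY")
# the community string sits right there in the raw bytes:
echo "$HDR" | grep -c "$COMMUNITY"
# prints 1
```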
>
> Cheers
> Goncalo
>
>
>> let me know
>>
>> Thanks Tiziana
>>
>> On 14/06/2011 12:55, Gonçalo Borges wrote:
>>> After implementing the cron from Arnau and Maarten for the deletion of
>>> the condor crap, the number of requests registered in
>>>
>>> /var/glite/jobcontrol/jobdir/
>>>
>>> decreased tremendously.
>>>
>>> Cheers
>>> Goncalo
>>>
>>> On 06/14/2011 11:18 AM, Gonçalo Borges wrote:
>>>> Hi Arnau...
>>>>
>>>>> Option 1: I had similar problems in our WMS. Under high load, it
>>>>> stopped seeing BDII resources and jobs were not able to start. If
>>>>> that's the case, you will find a descriptive message in
>>>>> workload_manager_events.log. And, to solve it, we installed
>>>>> google_perf_tools (you will find the recipe in the WMS known_issues).
>>>>>
>>>>
>>>> We are using google_perf_tools already.
>>>>
>>>> grep libtcmalloc.so /opt/glite/etc/glite_wms.conf
>>>> RuntimeMalloc = "/usr/lib/libtcmalloc.so";
>>>>
>>>>
>>>>> Options 2: Have you recently upgraded lb? If yes, ensure
>>>>> glite-lb-authz.conf has the correct values.
>>>>
>>>> Nope.
>>>>
>>>>> *You could also install WMSMonitor. Good tool for quick check.
>>>>>
>>>>
>>>> Yes I know. Unfortunately, since we are operating two sites over a
>>>> WAN, we thought of using that tool in WAN mode. Talking with Daniele,
>>>> I learned that WMSMonitor uses SNMP version 2, which is not the
>>>> proper framework for that. We are awaiting the EMI WMSMonitor
>>>> release, which will use ActiveMQ.
>>>>
>>>>>
>>>>>> ---*---
>>>>> Maarten sent me this script, which must run from cron:
>>>>> # cat /usr/local/sbin/clean_condor_jobs.sh
>>>>> #!/bin/bash
>>>>>
>>>>> CONDOR_CRAP=`/opt/condor-c/bin/condor_q -hold | grep glite | awk '{print $1}'`
>>>>>
>>>>> for JOB_ID in $CONDOR_CRAP
>>>>> do
>>>>>     echo "Removing job: " $JOB_ID
>>>>>     /opt/condor-c/bin/condor_rm $JOB_ID
>>>>>     # sleep 2
>>>>>     /opt/condor-c/bin/condor_rm -forcex $JOB_ID
>>>>> done
>>>>>
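One way to schedule the script above (a sketch; the half-hour frequency
and log path are my assumptions, not stated in the thread):

```
# hypothetical schedule; tune to how fast held jobs accumulate
*/30 * * * * root /usr/local/sbin/clean_condor_jobs.sh >> /var/log/clean_condor_jobs.log 2>&1
```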
>>>>
>>>> condor CRAP is a good name :-) Any special frequency to run it?
>>>>
>>>> Cheers
>>>> Goncalo
>>>>
>>>>
>>>
>>>
>
>