Emanouil Atanassov wrote:
> Hi,
>
> I think currently there are 2 condor processes per user per WMS.
You are right.
>
> The jobmanager processes that monitor the user's jobs are
> different for the different WMS (or RBs) and thus create some sort of
> scalability problem, because at some point the
> CE may become overloaded even if the jobs come from the same user.
Yes, there are certainly scalability problems with this. There are plans
to change the set of condor-c daemons from user based to VO based.
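In the meantime, a quick way to gauge how many of these per-user condor-c
daemons are left behind on a CE is to count condor_master/condor_schedd
processes per local account. A minimal sketch (the usernames and the
ps-style input here are only illustrative, modelled on the process listing
quoted further down in this thread):

```shell
#!/bin/sh
# Count leftover condor-c daemons (condor_master / condor_schedd)
# per local user, reading "ps -eo user,pid,args"-style lines on stdin.
count_condor_daemons() {
  awk '/condor_master|condor_schedd/ { count[$1]++ }
       END { for (u in count) print u, count[u] }' | sort
}

# On a real CE you would feed it live process data, e.g.:
#   ps -eo user,pid,args | count_condor_daemons
```

With the two condor processes shown in the listing below, this would
report a count of 2 for the local user eo002.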
Di
> Emanouil Atanassov
> [log in to unmask]
>
> Di Qing wrote:
>> Hi Vega,
>>
>>> ok, it's clear why the condor processes start, but:
>>> >> After the users' jobs finished, these process will not exit.
>>>
>>> ...can you explain this point better, please? Why should processes be
>>> kept running even after job completion? At the moment the only fact I
>>
>> The CE and the resources behind it act somewhat like a kind of condor
>> resource: your next jobs will be submitted to the same condor
>> processes. I am not sure whether they will keep running forever if you
>> don't touch them; JRA1 would need to confirm.
>>
>>> notice is that I have log files growing without control in the user's
>>> home... checking those log files I found that one is polling a
>>> nonexistent pbs job; well, it's the same entry as for the old LCG CE.
>>> In that case I'm sure it refers to pbs jobs, while here I'm not: it
>>> polls a pbs job which has never existed and simply continues...
>>
>> These log files should come from the globus-job-manager which launches
>> these condor processes.
>>
>>> Does it mean that X different users will leave 2 jobmanager
>>> processes (with their condor "children") PER job PER user?
>>
>> Currently there are 2 condor processes left per user.
>>
>> Cheers,
>>
>> Di
>>
>>
>>> Thanks and cheers
>>>
>>> Vega Forneris
>>>
>>> +-----------------------------------------------+
>>> ESA-ESRIN
>>> Unix Systems Administrator
>>> Via Galileo Galilei
>>> 00044 Frascati (Rm) - Italy
>>> Phone +39 06 94180581
>>> Mailto: [log in to unmask]
>>> +-----------------------------------------------+
>>> Vitrociset S.p.A.
>>> Unix System Administrator
>>> Via Tiburtina 1020
>>> 00100 Roma - Italy
>>> Phone +39 06 8820 4297 Mailto: [log in to unmask]
>>> +-----------------------------------------------+
>>>
>>> "I do not feel obliged to believe that the same God who has endowed
>>> us with sense, reason, and intellect has intended us to forgo their
>>> use."
>>> (Galileo Galilei)
>>>
>>>
>>>
>>> *Di Qing <[log in to unmask]>*
>>> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>>>
>>> 18/08/2006 14:35
>>> Please respond to
>>> LHC Computer Grid - Rollout <[log in to unmask]>
>>>
>>>
>>> To
>>> [log in to unmask]
>>> cc
>>> Subject
>>> Re: [LCG-ROLLOUT] - gLiteCE - questions
>>>
>>> Vega Forneris wrote:
>>> >
>>> > Hi *,
>>> >
>>> > I've just set up a little GRID for testing purposes (NOTE: this
>>> > implies that all machines involved are not used by anyone else =
>>> > low cpu load and IP traffic), but I still have a couple of
>>> > questions about gLiteCE + gLiteWMS:
>>> >
>>> > TEST:
>>> > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
>>> > Executable = "/bin/echo";
>>> > Arguments = "Hello World";
>>> > StdOutput = "message.txt";
>>> > StdError = "stderror";
>>> > OutputSandbox = {"message.txt","stderror"};
>>> >
>>> > "glite-job-list-match" and "glite-job-submit" work perfectly and the
>>> > job is successfully submitted, but this simple job takes 4-6 minutes
>>> > to retrieve the output... is that normal (first question!!!)? It's a
>>> > very long time considering the job complexity (= NULL!), the
>>> > distance (the systems are really close to each other) and the
>>> > servers' work load...
>>> >
>>> > Here is the output of the command "glite-job-status -v 3":
>>> > - stateEnterTimes = > Submitted : Fri Aug 18 11:24:27 2006 CEST
>>> > Waiting : Fri Aug 18 11:24:15 2006 CEST
>>> > Ready : Fri Aug 18 11:24:16 2006 CEST
>>> > Scheduled : ---
>>>
>>> There is no Scheduled event for the gLite CE to update, since the
>>> submission mechanism changed, as explained below.
>>>
>>> > Running : Fri Aug 18 11:27:08 2006 CEST
>>> > Done : Fri Aug 18 11:30:00 2006 CEST
>>> > Cleared : ---
>>> > Aborted : ---
>>> > Cancelled : ---
>>> > Unknown : ---
>>> >
>>> >
>>> > while the pbs logs show that the cluster processed this job in less
>>> > than 7 seconds after it reached the pbs_server, and the
>>> > authentication process took one second!
>>> >
>>> > ...the real problem is that even after retrieving the output, there
>>> > are still jobs running on the CE owned by the local user (in this
>>> > case: eo002)
>>> >
>>> > eo002 14086     1  0 11:25 ?  00:00:00 globus-job-manager -conf
>>> >   /opt/glite/etc/globus-job-manager.conf -type fork -rdn
>>> >   jobmanager-fork -machine-type unknown -publish-jobs
>>> > eo002 14199     1  0 11:25 ?  00:00:00
>>> >   /opt/condor-c/sbin/condor_master -f -r 680
>>> > eo002 14231 14199  0 11:25 ?  00:00:00 condor_schedd -f -n
>>> >   [log in to unmask]
>>> > eo002 14274     1  0 11:25 ?  00:00:00 globus-job-manager -conf
>>> >   /opt/glite/etc/globus-job-manager.conf -type fork -rdn
>>> >   jobmanager-fork -machine-type unknown -publish-jobs
>>> > eo002 14306     1  0 11:25 ?  00:00:00 perl
>>> >   /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data
>>> >   --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119.
>>> > eo002 14308 14306  0 11:25 ?  00:00:00 perl
>>> >   /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self
>>> >   --maxtime=3600s
>>>
>>> The gLite CE is quite different from the LCG CE: there are no job
>>> managers for the batch system. When a new user's job comes in through
>>> the WMS, two condor processes are launched through the fork job
>>> manager, as you saw above. The users' jobs are then actually submitted
>>> from the condor on the WMS to the condor on the CE. After the users'
>>> jobs finish, these processes do not exit. JRA1 is planning to move the
>>> condor processes from user based to VO based.
>>>
>>> Di
>>>
>>> >
>>> > and 2 gram_job log files grow continuously in the user's home....
>>> >
>>> > Crosschecking the "daemon_unique_name"
>>> > (=5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I found
>>> > that the following text is continuously appended to the
>>> > condorc-advertiser.813.2385.out log:
>>> >
>>> > Fri Aug 18 11:47:18 2006 Advertising
>>> > "[log in to unmask]"
>>> > Fri Aug 18 11:47:48 2006 #################
>>> > Fri Aug 18 11:47:48 2006 Rewriting
>>> > "[log in to unmask]"
>>> > Fri Aug 18 11:47:48 2006 Setting requirements true
>>> > MyType = "Machine"
>>> > TargetType = "Job"
>>> > Activity = "Idle"
>>> > Arch = "CondorC"
>>> > CONDORC_WANTJOB = TRUE
>>> > CondorCAd = 1
>>> > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
>>> > CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
>>> > DaemonStartTime = 1155893119
>>> > Machine = "grid-eo-engine03.esrin.esa.int"
>>> > MaxJobsRunning = 200
>>> > MonitorSelfAge = 1200
>>> > MonitorSelfCPUUsage = 0.012500
>>> > MonitorSelfImageSize = 7612.000000
>>> > MonitorSelfResidentSetSize = 4620
>>> > MonitorSelfTime = 1155894319
>>> > MyAddress = "<193.204.231.32:22864>"
>>> > Name =
>>> "[log in to unmask]"
>>> > NumUsers = 0
>>> > OpSys = "CondorC"
>>> > Requirements = TRUE
>>> > START = TRUE
>>> > ServerTime = 1155894460
>>> > SiteName = "grid-eo-engine03.esrin.esa.int"
>>> > StartdIpAddr = "<193.204.231.32:22864>"
>>> > State = "Unclaimed"
>>> > TotalFlockedJobs = 0
>>> > TotalHeldJobs = 0
>>> > TotalIdleJobs = 0
>>> > TotalJobAds = 0
>>> > TotalRemovedJobs = 0
>>> > TotalRunningJobs = 0
>>> > UpdateSequenceNumber = 1155894468
>>> > VirtualMemory = 0
>>> > WantAdRevaluate = True
>>> > WantResAd = TRUE
>>> > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
>>> >
>>> > I suppose that after the output retrieval the jobs should end,
>>> > shouldn't they?
>>> >
>>> > Thanks for the support and cheers
>>> >
>>> > Vega Forneris
>>> >
>>>
>>