Hi,
I wonder what the rationale is for moving to a VO-based watchdog? Will
it be able to handle the various VOMS groups / roles? What will happen
to users who switch VOs?
JT
Vega Forneris wrote:
>
> Hi Di,
>
> I can confirm that those logs have still been growing since this morning
> (timeframe = 11:25-16:45; largest logfile = 2.2 MB)... I hope they will
> stop (under check ;-) )!
>
> At the moment the processes running as eo002 (the local user who runs the
> job) are:
>
> (ps -efl --forest)
> 0 S eo002 14086 1 0 76 0 - 1476 schedu 11:25 ?
> 00:00:04 *globus-job-manager* -conf /opt/glite/etc/globus-job-man
> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
> -publish-jobs
> 0 S eo002 14199 1 0 75 0 - 1552 schedu 11:25 ?
> 00:00:02 /opt/condor-c/sbin/*condor_master* -f -r 680
> 0 S eo002 14231 14199 0 75 0 - 1902 schedu 11:25 ?
> 00:00:02 \_ *condor_schedd* -f -n 5670f86976d594674aa5ef1c9bc2b3
> [log in to unmask]
> 0 S eo002 13403 1 0 75 0 - 1398 schedu 16:31 ?
> 00:00:00 *globus-job-manager* -conf /opt/glite/etc/globus-job-man
> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
> -publish-jobs
> 0 S eo002 13422 1 0 75 0 - 1073 schedu 16:31 ?
> 00:00:00 *perl* /home/eo002/.globus/.gass_cache/local/md5/00/0bed
> d51244c38d2825c37fec701443/md5/8d/618299719439f70dfc2258222583a4/data
> --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor
> _g_scratch.0x8506178.1770/grid-monitor-job-status.grid-e
> 0 S eo002 13424 13422 0 75 0 - 1929 schedu 16:31 ?
> 00:00:00 \_ *perl* /tmp/grid_manager_monitor_agent.eo002.13422.1
> 000 --delete-self --maxtime=3600s
>
> Summarizing: 2 condor
> 2 jobmanager
> 2 perl
> ==========================
> 6 processes doing nothing... I wonder what happens
> on clusters heavily accessed by many different people
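That per-account overhead can be counted mechanically from a ps listing. A
sketch in shell (count_leftover is a hypothetical helper, not part of gLite;
the daemon names are taken from the listing above):

```shell
# Count leftover gLite CE daemons for one mapped pool account.
# Hypothetical helper; daemon names taken from the ps listing above.
count_leftover() {
    # $1 = local pool account; reads `ps -ef` output on stdin
    awk -v u="$1" \
        '$1 == u && /condor_master|condor_schedd|globus-job-manager|grid_manager_monitor|gass_cache/ { n++ }
         END { print n + 0 }'
}

# Usage: ps -ef | count_leftover eo002
```

On a CE shared by many mapped users this makes it easy to see how the
leftover-daemon count scales with the number of pool accounts touched.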
>
> I still don't know if it's the normal procedure, a bug, or I simply
> missed something, but in the first case (normal procedure): why waste
> resources when they can be shut down and restarted when/if needed? What
> benefits does this approach provide?
>
>
> P.S. about times: just to be sure, a simple Hello_World.jdl takes
> 4-6 minutes to be submitted to a WMS, run on the CE and provide output
> in a clean situation (no load on machines and/or network)... is that
> normal? Is there any way to speed the job up?
>
> Cheers
> Vega
>
>
>
> *Di Qing <[log in to unmask]>*
> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>
> 18/08/2006 16:14
> Please respond to
> LHC Computer Grid - Rollout <[log in to unmask]>
>
>
>
> To
> [log in to unmask]
> cc
>
> Subject
> Re: [LCG-ROLLOUT] - gLiteCE - questions
>
>
> Hi Vega,
>
> > ok, it's clear why condor processes start, but:
> > >> After the users' jobs finished, these process will not exit.
> >
> > ...can you explain this point better, please? Why should processes be
> > kept running even after job completion? At the moment the only fact I
>
> > The CE and the resources behind it act somewhat like a pool of Condor
> > resources; your next jobs will be submitted to the same Condor
> > processes. I am not sure whether they will keep running forever if you
> > don't touch them; JRA1 would need to confirm.
>
> > notice is that I have log files growing without control in the user's
> > home... checking those log files I found that it is polling a
> > nonexistent pbs job; well... it's the same entry as for the old LCG
> > CE... in that case I'm sure it refers to pbs jobs, while in this one
> > I'm not: it polls a pbs job which has never existed and simply
> > continues...
>
> > These log files should come from the globus-job-manager which launches
> > these Condor processes.
>
> > Does it mean that X different users will leave 2 jobmanager processes
> > (with their condor "children") PER job PER user?
>
> Currently there are 2 condor processes left per user.
>
> Cheers,
>
> Di
>
>
> > Thanks and cheers
> >
> > Vega Forneris
> >
> > +-----------------------------------------------+
> > ESA-ESRIN
> > Unix Systems Administrator
> > Via Galileo Galilei
> > 00044 Frascati (Rm) - Italy
> > Phone +39 06 94180581
> > Mailto: [log in to unmask]
> > +-----------------------------------------------+
> > Vitrociset S.p.A.
> > Unix System Administrator
> > Via Tiburtina 1020
> > 00100 Roma - Italy
> > Phone +39 06 8820 4297
> > Mailto: [log in to unmask]
> > +-----------------------------------------------+
> >
> > "I do not feel obliged to believe that the same God who has endowed us
> > with sense, reason, and intellect has intended us to forgo their use."
> > (Galileo Galilei)
> >
> >
> >
> > *Di Qing <[log in to unmask]>*
> > Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
> >
> > 18/08/2006 14:35
> > Please respond to
> > LHC Computer Grid - Rollout <[log in to unmask]>
> >
> >
> >
> > To
> > [log in to unmask]
> > cc
> >
> > Subject
> > Re: [LCG-ROLLOUT] - gLiteCE - questions
> >
> >
> > Vega Forneris wrote:
> > >
> > > Hi *,
> > >
> > > I've just set up a little grid for testing purposes (NOTE: this implies
> > > that all machines involved are not used by anyone else = low CPU load
> > > and IP traffic), but I still have a couple of questions about gLiteCE +
> > > gLiteWMS:
> > >
> > > TEST:
> > > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
> > > Executable = "/bin/echo";
> > > Arguments = "Hello World";
> > > StdOutput = "message.txt";
> > > StdError = "stderror";
> > > OutputSandbox = {"message.txt","stderror"};
> > >
> > > "glite-job-list-match" and "glite-job-submit" work perfectly and the
> > > job is successfully submitted, but this simple job takes 4-6 minutes
> > > to retrieve the output... is that normal (first question!!!)? It's a
> > > very long time considering the job complexity (= NULL!), the distances
> > > (the systems are really close to each other) and the servers' workload...
> > >
> > > Here is the output of the command "glite-job-status -v 3":
> > > - stateEnterTimes =
> > > Submitted : Fri Aug 18 11:24:27 2006 CEST
> > > Waiting : Fri Aug 18 11:24:15 2006 CEST
> > > Ready : Fri Aug 18 11:24:16 2006 CEST
> > > Scheduled : ---
> >
> > There is no Scheduled event for the gLite CE to update, since the
> > submission mechanism changed as explained below.
> >
> > > Running : Fri Aug 18 11:27:08 2006 CEST
> > > Done : Fri Aug 18 11:30:00 2006 CEST
> > > Cleared : ---
> > > Aborted : ---
> > > Cancelled : ---
> > > Unknown : ---
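The middleware overhead can be read straight off those timestamps. A sketch
(elapsed is an illustrative helper, not a gLite tool; it assumes GNU date,
which parses the timestamp format printed above):

```shell
# Seconds between two `glite-job-status` timestamps, as printed above
# (e.g. "Fri Aug 18 11:24:27 2006"). Illustrative helper; needs GNU date.
elapsed() {
    # $1 = earlier timestamp, $2 = later timestamp
    echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}

# Submitted -> Done for the job above (several minutes of middleware
# overhead around a batch job that pbs finished in seconds):
elapsed "Fri Aug 18 11:24:27 2006" "Fri Aug 18 11:30:00 2006"
```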
> > >
> > >
> > > while the pbs logs show that the cluster processed this job in less
> > > than 7 seconds after it reached the pbs_server, and the authentication
> > > process takes one second!
> > >
> > > ...the real problem is that even after retrieving the output, there
> > > are still processes running on the CE owned by the local user (in this
> > > case: eo002)
> > >
> > > eo002 14086 1 0 11:25 ? 00:00:00 globus-job-manager -conf
> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
> > > -machine-type unknown -publish-jobs
> > > eo002 14199 1 0 11:25 ? 00:00:00
> > > /opt/condor-c/sbin/condor_master -f -r 680
> > > eo002 14231 14199 0 11:25 ? 00:00:00 condor_schedd -f -n
> > > [log in to unmask]
> > > eo002 14274 1 0 11:25 ? 00:00:00 globus-job-manager -conf
> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
> > > -machine-type unknown -publish-jobs
> > > eo002 14306 1 0 11:25 ? 00:00:00 perl
> > > /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data
> > > --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119
> > > eo002 14308 14306 0 11:25 ? 00:00:00 perl
> > > /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self
> > > --maxtime=3600s
> >
> > The gLite CE is quite different from the LCG CE, and there are no job
> > managers for the batch system. When a new user's job comes in through
> > the WMS, two Condor processes are launched through the fork job manager,
> > as you saw above; the user's jobs are then actually submitted from the
> > Condor on the WMS to the Condor on the CE. After the user's jobs finish,
> > these processes do not exit. JRA1 is planning to move the Condor
> > processes from user-based to VO-based.
> >
> > Di
> >
> > >
> > > and 2 gram_job log files grow continuously in the user's home....
> > >
> > > Crosschecking the "daemon_unique_name"
> > > (=5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I found
> > > that the following text is continuously appended to the
> > > condorc-advertiser.813.2385.out log:
> > >
> > > Fri Aug 18 11:47:18 2006 Advertising
> > > "[log in to unmask]"
> > > Fri Aug 18 11:47:48 2006 #################
> > > Fri Aug 18 11:47:48 2006 Rewriting
> > > "[log in to unmask]"
> > > Fri Aug 18 11:47:48 2006 Setting requirements true
> > > MyType = "Machine"
> > > TargetType = "Job"
> > > Activity = "Idle"
> > > Arch = "CondorC"
> > > CONDORC_WANTJOB = TRUE
> > > CondorCAd = 1
> > > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> > > CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
> > > DaemonStartTime = 1155893119
> > > Machine = "grid-eo-engine03.esrin.esa.int"
> > > MaxJobsRunning = 200
> > > MonitorSelfAge = 1200
> > > MonitorSelfCPUUsage = 0.012500
> > > MonitorSelfImageSize = 7612.000000
> > > MonitorSelfResidentSetSize = 4620
> > > MonitorSelfTime = 1155894319
> > > MyAddress = "<193.204.231.32:22864>"
> > > Name =
> "[log in to unmask]"
> > > NumUsers = 0
> > > OpSys = "CondorC"
> > > Requirements = TRUE
> > > START = TRUE
> > > ServerTime = 1155894460
> > > SiteName = "grid-eo-engine03.esrin.esa.int"
> > > StartdIpAddr = "<193.204.231.32:22864>"
> > > State = "Unclaimed"
> > > TotalFlockedJobs = 0
> > > TotalHeldJobs = 0
> > > TotalIdleJobs = 0
> > > TotalJobAds = 0
> > > TotalRemovedJobs = 0
> > > TotalRunningJobs = 0
> > > UpdateSequenceNumber = 1155894468
> > > VirtualMemory = 0
> > > WantAdRevaluate = True
> > > WantResAd = TRUE
> > > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
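That kind of cross-check between the CE processes and the WMS advertiser log
can be scripted. A sketch (classad_value is a hypothetical helper that parses
the plain "Attribute = value" lines shown above; it is not a Condor tool):

```shell
# Print one attribute from ClassAd-style "Name = value" text on stdin,
# stripping surrounding quotes. Hypothetical helper for log cross-checks.
classad_value() {
    # $1 = attribute name, e.g. daemon_unique_name
    awk -F' = ' -v a="$1" '$1 == a { gsub(/"/, "", $2); print $2 }'
}

# Usage: classad_value daemon_unique_name < condorc-advertiser.813.2385.out
```

Comparing that value against the -n argument of the condor_schedd on the CE
confirms which schedd a given advertiser entry refers to.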
> > >
> > > I suppose that after the output is retrieved, these jobs should end, shouldn't they?
> > >
> > > Thanks for the support and cheers
> > >
> > > Vega Forneris
> > >
> > > +-----------------------------------------------+
> > > ESA-ESRIN
> > > Unix Systems Administrator
> > > Via Galileo Galilei
> > > 00044 Frascati (Rm) - Italy
> > > Phone +39 06 94180581
> > > Mailto: [log in to unmask]
> > > +-----------------------------------------------+
> > > Vitrociset S.p.A.
> > > Unix System Administrator
> > > Via Tiburtina 1020
> > > 00100 Roma - Italy
> > > Phone +39 06 8820 4297
> > > Mailto: [log in to unmask]
> > > +-----------------------------------------------+
> > >
> > > "I do not feel obliged to believe that the same God who has endowed us
> > > with sense, reason, and intellect has intended us to forgo their use."
> > > (Galileo Galilei)
> >
>