Hi,
I wonder what the rationale is for moving to a VO-based watchdog? Will
it be able to handle the various VOMS groups / roles? What will happen
to users who switch VOs?
JT
Vega Forneris wrote:
>
> Hi Di,
>
> I can confirm that those logs have still been growing since this morning
> (timeframe = 11:25-16:45; largest logfile = 2.2 MB)... I hope they will
> stop (under check ;-) )!
>
> At the moment the processes running as eo002 (the local user who runs the
> job) are:
>
> (ps -efl --forest)
> 0 S eo002 14086 1 0 76 0 - 1476 schedu 11:25 ?
> 00:00:04 *globus-job-manager* -conf /opt/glite/etc/globus-job-man
> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
> -publish-jobs
> 0 S eo002 14199 1 0 75 0 - 1552 schedu 11:25 ?
> 00:00:02 /opt/condor-c/sbin/*condor_master* -f -r 680
> 0 S eo002 14231 14199 0 75 0 - 1902 schedu 11:25 ?
> 00:00:02 \_ *condor_schedd* -f -n 5670f86976d594674aa5ef1c9bc2b3
> [log in to unmask]
> 0 S eo002 13403 1 0 75 0 - 1398 schedu 16:31 ?
> 00:00:00 *globus-job-manager* -conf /opt/glite/etc/globus-job-man
> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
> -publish-jobs
> 0 S eo002 13422 1 0 75 0 - 1073 schedu 16:31 ?
> 00:00:00 *perl* /home/eo002/.globus/.gass_cache/local/md5/00/0bed
> d51244c38d2825c37fec701443/md5/8d/618299719439f70dfc2258222583a4/data
> --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor
> _g_scratch.0x8506178.1770/grid-monitor-job-status.grid-e
> 0 S eo002 13424 13422 0 75 0 - 1929 schedu 16:31 ?
> 00:00:00 \_ *perl* /tmp/grid_manager_monitor_agent.eo002.13422.1
> 000 --delete-self --maxtime=3600s
>
> Summarizing: 2 condor
> 2 jobmanager
> 2 perl
> ==========================
> 6 processes doing nothing... I wonder what happens
> on clusters heavily accessed by many different people
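That per-account overhead can be counted mechanically from a ps listing. A
sketch in shell (count_leftover is a hypothetical helper, not part of gLite;
the daemon names are taken from the listing above):

```shell
# Count leftover gLite CE daemons for one mapped pool account.
# Hypothetical helper; daemon names taken from the ps listing above.
count_leftover() {
    # $1 = local pool account; reads `ps -ef` output on stdin
    awk -v u="$1" \
        '$1 == u && /condor_master|condor_schedd|globus-job-manager|grid_manager_monitor|gass_cache/ { n++ }
         END { print n + 0 }'
}

# Usage: ps -ef | count_leftover eo002
```

On a CE shared by many mapped users this makes it easy to see how the
leftover-daemon count scales with the number of pool accounts touched.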
>
> I still don't know if it's the normal procedure, a bug, or I simply
> missed something, but in the first case (normal procedure): why waste
> resources when they can be shut down and restarted when/if needed? What
> benefits does this approach provide?
>
>
> P.S. about times: just to be sure, a simple Hello_World.jdl takes
> 4-6 minutes to be submitted to a WMS, run on the CE and provide output
> in a clean situation (no load on machines and/or network)... is that
> normal? Is there any way to speed the job up?
>
> Cheers
> Vega
>
>
>
> *Di Qing <[log in to unmask]>*
> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>
> 18/08/2006 16:14
> Please respond to
> LHC Computer Grid - Rollout <[log in to unmask]>
>
>
>
> To
> [log in to unmask]
> cc
>
> Subject
> Re: [LCG-ROLLOUT] - gLiteCE - questions
>
>
> Hi Vega,
>
> > ok, it's clear why condor processes start, but:
> > >> After the users' jobs finished, these process will not exit.
> >
> > ...can you explain this point better, please? Why should processes be
> > kept running even after job completion? At the moment the only fact I
>
> > The CE and the resources behind it act somewhat like a pool of Condor
> > resources; your next jobs will be submitted to the same Condor
> > processes. I am not sure whether they will keep running forever if you
> > don't touch them; JRA1 would need to confirm.
>
> > notice is that I have log files growing without control in the user's
> > home... checking those log files I found that it is polling a
> > nonexistent pbs job; well... it's the same entry as for the old LCG
> > CE... in that case I'm sure it refers to pbs jobs, while in this one
> > I'm not: it polls a pbs job which has never existed and simply
> > continues...
>
> > These log files should come from the globus-job-manager which launches
> > these Condor processes.
>
> > Does it mean that X different users will leave 2 jobmanager processes
> > (with their condor "children") PER job PER user?
>
> Currently there are 2 condor processes left per user.
>
> Cheers,
>
> Di
>
>
> > Thanks and cheers
> >
> > Vega Forneris
> >
> > +-----------------------------------------------+
> > ESA-ESRIN
> > Unix Systems Administrator
> > Via Galileo Galilei
> > 00044 Frascati (Rm) - Italy
> > Phone +39 06 94180581
> > Mailto: [log in to unmask]
> > +-----------------------------------------------+
> > Vitrociset S.p.A.
> > Unix System Administrator
> > Via Tiburtina 1020
> > 00100 Roma - Italy
> > Phone +39 06 8820 4297
> > Mailto: [log in to unmask]
> > +-----------------------------------------------+
> >
> > "I do not feel obliged to believe that the same God who has endowed us
> > with sense, reason, and intellect has intended us to forgo their use."
> > (Galileo Galilei)
> >
> >
> >
> > *Di Qing <[log in to unmask]>*
> > Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
> >
> > 18/08/2006 14:35
> > Please respond to
> > LHC Computer Grid - Rollout <[log in to unmask]>
> >
> >
> >
> > To
> > [log in to unmask]
> > cc
> >
> > Subject
> > Re: [LCG-ROLLOUT] - gLiteCE - questions
> >
> >
> > Vega Forneris wrote:
> > >
> > > Hi *,
> > >
> > > I've just set up a little grid for testing purposes (NOTE: this implies
> > > that all machines involved are not used by anyone else = low CPU load
> > > and IP traffic), but I still have a couple of questions about gLiteCE +
> > > gLiteWMS:
> > >
> > > TEST:
> > > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
> > > Executable = "/bin/echo";
> > > Arguments = "Hello World";
> > > StdOutput = "message.txt";
> > > StdError = "stderror";
> > > OutputSandbox = {"message.txt","stderror"};
> > >
> > > "glite-job-list-match" and "glite-job-submit" work perfectly and the
> > > job is successfully submitted, but this simple job takes 4-6 minutes
> > > to retrieve the output... is that normal (first question!!!)? It's a
> > > very long time considering the job complexity (= NULL!), the distances
> > > (the systems are really close to each other) and the servers' workload...
> > >
> > > Here is the output of the command "glite-job-status -v 3":
> > > - stateEnterTimes =
> > > Submitted : Fri Aug 18 11:24:27 2006 CEST
> > > Waiting : Fri Aug 18 11:24:15 2006 CEST
> > > Ready : Fri Aug 18 11:24:16 2006 CEST
> > > Scheduled : ---
> >
> > There is no Scheduled event for the gLite CE to update, since the
> > submission mechanism changed as explained below.
> >
> > > Running : Fri Aug 18 11:27:08 2006 CEST
> > > Done : Fri Aug 18 11:30:00 2006 CEST
> > > Cleared : ---
> > > Aborted : ---
> > > Cancelled : ---
> > > Unknown : ---
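The middleware overhead can be read straight off those timestamps. A sketch
(elapsed is an illustrative helper, not a gLite tool; it assumes GNU date,
which parses the timestamp format printed above):

```shell
# Seconds between two `glite-job-status` timestamps, as printed above
# (e.g. "Fri Aug 18 11:24:27 2006"). Illustrative helper; needs GNU date.
elapsed() {
    # $1 = earlier timestamp, $2 = later timestamp
    echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}

# Submitted -> Done for the job above (several minutes of middleware
# overhead around a batch job that pbs finished in seconds):
elapsed "Fri Aug 18 11:24:27 2006" "Fri Aug 18 11:30:00 2006"
```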
> > >
> > >
> > > while the pbs logs show that the cluster processed this job in less
> > > than 7 seconds after it reached the pbs_server, and the authentication
> > > process takes one second!
> > >
> > > ...the real problem is that even after retrieving the output, there
> > > are still processes running on the CE owned by the local user (in this
> > > case: eo002)
> > >
> > > eo002 14086 1 0 11:25 ? 00:00:00 globus-job-manager -conf
> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
> > > -machine-type unknown -publish-jobs
> > > eo002 14199 1 0 11:25 ? 00:00:00
> > > /opt/condor-c/sbin/condor_master -f -r 680
> > > eo002 14231 14199 0 11:25 ? 00:00:00 condor_schedd -f -n
> > > [log in to unmask]
> > > eo002 14274 1 0 11:25 ? 00:00:00 globus-job-manager -conf
> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
> > > -machine-type unknown -publish-jobs
> > > eo002 14306 1 0 11:25 ? 00:00:00 perl
> > > /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data
> > > --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119
> > > eo002 14308 14306 0 11:25 ? 00:00:00 perl
> > > /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self
> > > --maxtime=3600s
> >
> > The gLite CE is quite different from the LCG CE, and there are no job
> > managers for the batch system. When a new user's job comes in through
> > the WMS, two Condor processes are launched through the fork job manager,
> > as you saw above; the user's jobs are then actually submitted from the
> > Condor on the WMS to the Condor on the CE. After the user's jobs finish,
> > these processes do not exit. JRA1 is planning to move the Condor
> > processes from user-based to VO-based.
> >
> > Di
> >
> > >
> > > and 2 gram_job log files grow continuously in the user's home....
> > >
> > > Crosschecking the "daemon_unique_name"
> > > (=5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I found
> > > that the following text is continuously appended to the
> > > condorc-advertiser.813.2385.out log:
> > >
> > > Fri Aug 18 11:47:18 2006 Advertising
> > > "[log in to unmask]"
> > > Fri Aug 18 11:47:48 2006 #################
> > > Fri Aug 18 11:47:48 2006 Rewriting
> > > "[log in to unmask]"
> > > Fri Aug 18 11:47:48 2006 Setting requirements true
> > > MyType = "Machine"
> > > TargetType = "Job"
> > > Activity = "Idle"
> > > Arch = "CondorC"
> > > CONDORC_WANTJOB = TRUE
> > > CondorCAd = 1
> > > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> > > CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
> > > DaemonStartTime = 1155893119
> > > Machine = "grid-eo-engine03.esrin.esa.int"
> > > MaxJobsRunning = 200
> > > MonitorSelfAge = 1200
> > > MonitorSelfCPUUsage = 0.012500
> > > MonitorSelfImageSize = 7612.000000
> > > MonitorSelfResidentSetSize = 4620
> > > MonitorSelfTime = 1155894319
> > > MyAddress = "<193.204.231.32:22864>"
> > > Name =
> "[log in to unmask]"
> > > NumUsers = 0
> > > OpSys = "CondorC"
> > > Requirements = TRUE
> > > START = TRUE
> > > ServerTime = 1155894460
> > > SiteName = "grid-eo-engine03.esrin.esa.int"
> > > StartdIpAddr = "<193.204.231.32:22864>"
> > > State = "Unclaimed"
> > > TotalFlockedJobs = 0
> > > TotalHeldJobs = 0
> > > TotalIdleJobs = 0
> > > TotalJobAds = 0
> > > TotalRemovedJobs = 0
> > > TotalRunningJobs = 0
> > > UpdateSequenceNumber = 1155894468
> > > VirtualMemory = 0
> > > WantAdRevaluate = True
> > > WantResAd = TRUE
> > > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
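That kind of cross-check between the CE processes and the WMS advertiser log
can be scripted. A sketch (classad_value is a hypothetical helper that parses
the plain "Attribute = value" lines shown above; it is not a Condor tool):

```shell
# Print one attribute from ClassAd-style "Name = value" text on stdin,
# stripping surrounding quotes. Hypothetical helper for log cross-checks.
classad_value() {
    # $1 = attribute name, e.g. daemon_unique_name
    awk -F' = ' -v a="$1" '$1 == a { gsub(/"/, "", $2); print $2 }'
}

# Usage: classad_value daemon_unique_name < condorc-advertiser.813.2385.out
```

Comparing that value against the -n argument of the condor_schedd on the CE
confirms which schedd a given advertiser entry refers to.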
> > >
> > > I suppose that after the output is retrieved, these jobs should end, shouldn't they?
> > >
> > > Thanks for the support and cheers
> > >
> > > Vega Forneris
> > >
> > > +-----------------------------------------------+
> > > ESA-ESRIN
> > > Unix Systems Administrator
> > > Via Galileo Galilei
> > > 00044 Frascati (Rm) - Italy
> > > Phone +39 06 94180581
> > > Mailto: [log in to unmask]
> > > +-----------------------------------------------+
> > > Vitrociset S.p.A.
> > > Unix System Administrator
> > > Via Tiburtina 1020
> > > 00100 Roma - Italy
> > > Phone +39 06 8820 4297
> > > Mailto: [log in to unmask]
> > > +-----------------------------------------------+
> > >
> > > "I do not feel obliged to believe that the same God who has endowed us
> > > with sense, reason, and intellect has intended us to forgo their use."
> > > (Galileo Galilei)
> >
>