Vega Forneris wrote:
> P.S. about times: just to be sure, a simple Hello_World.jdl takes
> 4-6 minutes to be submitted to a WMS, run on the CE and
> deliver its output in a clean situation (no load on machines and/or
> network)... is that normal? Is there any way to speed up the job?
I don't know if it is normal, but I also have to wait about 6 minutes
for my "hello world" job to complete, in a similar context (no load,
local network...). Most of the time is spent between the end of job
execution on the CE and the job being seen as done by the WMS.
I am also interested in any way to speed up the job, at least for
functional testing purposes (regardless of scalability concerns). Maybe
we can configure shorter polling periods for some components of the CE
or WMS? Does anyone know which configuration parameters could be
modified to achieve this?
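For reference, the first knob I would try is the Condor-G status-polling interval on the WMS; the parameter name below comes from the Condor manual, and whether the gLite WMS honours a local override of it is an assumption I have not verified:

```
# condor_config.local on the WMS host -- a sketch, untested:
# seconds between Condor-G status probes of submitted jobs
GRIDMANAGER_JOB_PROBE_INTERVAL = 30
```

After changing it, running condor_reconfig should make the daemons re-read their configuration.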
Thanks,
Sylvain
>
>
> Cheers
> Vega
>
> *Di Qing <[log in to unmask]>*
> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>
> 18/08/2006 16:14
> Please respond to
> LHC Computer Grid - Rollout <[log in to unmask]>
>
>
>
> To
> [log in to unmask]
> cc
>
> Subject
> Re: [LCG-ROLLOUT] - gLiteCE - questions
>
>
> Hi Vega,
>
> > ok, it's clear why condor processes start, but :
> > >> After the users' jobs finished, these process will not exit.
> >
> > ...can you explain this point better, please? Why should processes be
> > kept running even after job completion? At the moment the only fact I
>
> The CE and the resources behind it behave somewhat like Condor
> resources; your next jobs will be submitted to the same condor
> processes. I am not sure whether they will keep running forever if you
> don't touch them; JRA1 would need to confirm.
>
> > notice is that I have log files growing without control in the user's
> > home... checking those log files I found that it is polling a
> > non-existent PBS job; well... it's the same entry as for the old LCG
> > CE... in that case I'm sure it refers to PBS jobs, while here I'm not:
> > it polls a PBS job which has never existed and simply continues...
>
> These log files should come from the globus-job-manager which launches
> these condor processes.
>
> > Does it mean that X different users will leave 2 jobmanager processes
> > (with their condor "children") PER job PER user?
>
> Currently there are 2 condor processes left per user.
>
> Cheers,
>
> Di
>
>
> > Thanks and cheers
> >
> > Vega Forneris
> >
> > +-----------------------------------------------+
> > ESA-ESRIN
> > Unix Systems Administrator
> > Via Galileo Galilei
> > 00044 Frascati (Rm) - Italy
> > Phone +39 06 94180581
> > Mailto: [log in to unmask]
> > +-----------------------------------------------+
> > Vitrociset S.p.A.
> > Unix System Administrator
> > Via Tiburtina 1020
> > 00100 Roma - Italy
> > Phone +39 06 8820 4297
> > Mailto: [log in to unmask]
> > +-----------------------------------------------+
> >
> > "I do not feel obliged to believe that the same God who has endowed us
> > with sense, reason, and intellect has intended us to forgo their use."
> > (Galileo Galilei)
> >
> >
> >
> > *Di Qing <[log in to unmask]>*
> > Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
> >
> > 18/08/2006 14:35
> > Please respond to
> > LHC Computer Grid - Rollout <[log in to unmask]>
> >
> >
> >
> > To
> > [log in to unmask]
> > cc
> >
> > Subject
> > Re: [LCG-ROLLOUT] - gLiteCE - questions
> >
> >
> > Vega Forneris wrote:
> > >
> > > Hi *,
> > >
> > > I've just set up a little GRID for testing purposes (NOTE: this
> > > implies that all machines involved are not used by anyone else = low
> > > CPU load and IP traffic), but I still have a couple of questions
> > > about gLiteCE + gLiteWMS:
> > >
> > > TEST:
> > > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
> > > Executable = "/bin/echo";
> > > Arguments = "Hello World";
> > > StdOutput = "message.txt";
> > > StdError = "stderror";
> > > OutputSandbox = {"message.txt","stderror"};
> > >
> > > "glite-job-list-match" and "glite-job-submit" work perfectly and the
> > > job is successfully submitted, but this simple job takes 4-6 minutes
> > > to retrieve the output... is that normal (first question!!!)? It's a
> > > very long time considering the job complexity (= NULL!), the
> > > distances (the systems are really close to each other) and the
> > > servers' workload...
> > >
> > > Here is the output of the command "glite-job-status -v 3":
> > > - stateEnterTimes =
> > > Submitted : Fri Aug 18 11:24:27 2006 CEST
> > > Waiting : Fri Aug 18 11:24:15 2006 CEST
> > > Ready : Fri Aug 18 11:24:16 2006 CEST
> > > Scheduled : ---
> >
> > There is no Scheduled event for the gLite CE to update, since the
> > submission mechanism changed as explained below.
> >
> > > Running : Fri Aug 18 11:27:08 2006 CEST
> > > Done : Fri Aug 18 11:30:00 2006 CEST
> > > Cleared : ---
> > > Aborted : ---
> > > Cancelled : ---
> > > Unknown : ---
> > >
> > >
> > > while the PBS logs show that the cluster processed this job within 7
> > > seconds of it reaching the pbs_server, and the authentication step
> > > takes one second!
> > >
> > > ...the real problem is that even after retrieving the output, there
> > > are still processes running on the CE owned by the local user (in
> > > this case: eo002)
> > >
> > > eo002 14086     1  0 11:25 ?  00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > eo002 14199     1  0 11:25 ?  00:00:00 /opt/condor-c/sbin/condor_master -f -r 680
> > > eo002 14231 14199  0 11:25 ?  00:00:00 condor_schedd -f -n [log in to unmask]
> > > eo002 14274     1  0 11:25 ?  00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > eo002 14306     1  0 11:25 ?  00:00:00 perl /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119.
> > > eo002 14308 14306  0 11:25 ?  00:00:00 perl /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self --maxtime=3600s
> >
> > The gLite CE is quite different from the LCG CE. There are no job
> > managers for the batch system, so when a new user's job comes in
> > through the WMS, two condor processes are launched through the fork
> > job manager, as you saw above. Users' jobs are then actually submitted
> > from the condor on the WMS to the condor on the CE. After the users'
> > jobs finish, these processes do not exit. JRA1 is planning to move the
> > condor processes from user-based to VO-based.
> >
> > Di
> >
> > >
> > > and 2 gram_job log files grow continuously in the user's home....
> > >
> > > Cross-checking the "daemon_unique_name"
> > > (= 5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I
> > > found that in the condorc-advertiser.813.2385.out log the following
> > > text is continuously appended:
> > >
> > > Fri Aug 18 11:47:18 2006 Advertising
> > > "[log in to unmask]"
> > > Fri Aug 18 11:47:48 2006 #################
> > > Fri Aug 18 11:47:48 2006 Rewriting
> > > "[log in to unmask]"
> > > Fri Aug 18 11:47:48 2006 Setting requirements true
> > > MyType = "Machine"
> > > TargetType = "Job"
> > > Activity = "Idle"
> > > Arch = "CondorC"
> > > CONDORC_WANTJOB = TRUE
> > > CondorCAd = 1
> > > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> > > CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
> > > DaemonStartTime = 1155893119
> > > Machine = "grid-eo-engine03.esrin.esa.int"
> > > MaxJobsRunning = 200
> > > MonitorSelfAge = 1200
> > > MonitorSelfCPUUsage = 0.012500
> > > MonitorSelfImageSize = 7612.000000
> > > MonitorSelfResidentSetSize = 4620
> > > MonitorSelfTime = 1155894319
> > > MyAddress = "<193.204.231.32:22864>"
> > > Name =
> "[log in to unmask]"
> > > NumUsers = 0
> > > OpSys = "CondorC"
> > > Requirements = TRUE
> > > START = TRUE
> > > ServerTime = 1155894460
> > > SiteName = "grid-eo-engine03.esrin.esa.int"
> > > StartdIpAddr = "<193.204.231.32:22864>"
> > > State = "Unclaimed"
> > > TotalFlockedJobs = 0
> > > TotalHeldJobs = 0
> > > TotalIdleJobs = 0
> > > TotalJobAds = 0
> > > TotalRemovedJobs = 0
> > > TotalRunningJobs = 0
> > > UpdateSequenceNumber = 1155894468
> > > VirtualMemory = 0
> > > WantAdRevaluate = True
> > > WantResAd = TRUE
> > > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
> > >
> > > I suppose that after the output is retrieved, the jobs should end,
> > > shouldn't they?
> > >
> > > Thanks for the support and cheers
> > >
> > > Vega Forneris
> > >
> >
>