I don't know if there is anything ready for this, or whether it will be
able to handle VOMS proxies (it should be able to; otherwise it's a bug).
When users switch VOs, different condor processes should be used.
Di
Jeff Templon wrote:
> Hi,
>
> I wonder what the rationale is for moving to a VO-based watchdog? Will
> it be able to handle the various VOMS groups / roles? What will happen
> to users who switch VOs?
>
> JT
>
> Vega Forneris wrote:
>>
>> Hi Di,
>>
>> I can confirm that such logs have kept growing since this morning
>> (timeframe = 11:25-16:45; largest logfile = 2.2 MB)... I hope they will
>> stop (under check ;-) )!
>>
>> At the moment the processes running as eo002 (the local user who runs
>> the job) are:
>>
>> (ps -efl --forest)
>> 0 S eo002 14086 1 0 76 0 - 1476 schedu 11:25 ?
>> 00:00:04 globus-job-manager -conf /opt/glite/etc/globus-job-man
>> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
>> -publish-jobs
>> 0 S eo002 14199 1 0 75 0 - 1552 schedu 11:25 ?
>> 00:00:02 /opt/condor-c/sbin/condor_master -f -r 680
>> 0 S eo002 14231 14199 0 75 0 - 1902 schedu 11:25 ?
>> 00:00:02 \_ condor_schedd -f -n 5670f86976d594674aa5ef1c9bc2b3
>> [log in to unmask]
>> 0 S eo002 13403 1 0 75 0 - 1398 schedu 16:31 ?
>> 00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-man
>> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
>> -publish-jobs
>> 0 S eo002 13422 1 0 75 0 - 1073 schedu 16:31 ?
>> 00:00:00 perl /home/eo002/.globus/.gass_cache/local/md5/00/0bed
>> d51244c38d2825c37fec701443/md5/8d/618299719439f70dfc2258222583a4/data
>> --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor
>> _g_scratch.0x8506178.1770/grid-monitor-job-status.grid-e
>> 0 S eo002 13424 13422 0 75 0 - 1929 schedu 16:31 ?
>> 00:00:00 \_ perl /tmp/grid_manager_monitor_agent.eo002.13422.1
>> 000 --delete-self --maxtime=3600s
>>
>> Summarizing: 2 condor
>>              2 jobmanager
>>              2 perl
>> ==========================
>> 6 processes doing nothing... I wonder what happens
>> on clusters heavily accessed by different people
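To see how this scales with the number of users, the leftover daemons can be tallied per pool account from the `ps` output. A rough Python sketch (the condensed sample lines and daemon names are taken from the listing above; the exact field layout is an assumption):

```python
import re
from collections import Counter

# Sample lines condensed from the `ps -efl --forest` output above
# (user, pid, ppid, command); real `ps` output has more columns.
PS_LINES = """\
eo002 14086     1 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf
eo002 14199     1 /opt/condor-c/sbin/condor_master -f -r 680
eo002 14231 14199 condor_schedd -f -n 5670f86976d594674aa5ef1c9bc2b3
eo002 13403     1 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf
eo002 13422     1 perl /home/eo002/.globus/.gass_cache/.../data
eo002 13424 13422 perl /tmp/grid_manager_monitor_agent.eo002.13422.1000
""".splitlines()

# Daemon names seen in the listing above.
DAEMONS = ("globus-job-manager", "condor_master", "condor_schedd", "perl")

def tally(lines):
    """Count leftover CE daemons per (user, daemon-name) pair."""
    counts = Counter()
    for line in lines:
        user, _pid, _ppid, cmd = line.split(None, 3)
        for name in DAEMONS:
            # Match the daemon name at the start of the command or
            # after a path separator, so config-file arguments don't count.
            if re.search(r"(^|/)" + re.escape(name) + r"\b", cmd):
                counts[(user, name)] += 1
                break
    return counts
```

With these sample lines, `tally` confirms the count above: six processes for one user, so with N pool accounts active you would expect roughly 6×N idle daemons.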
>>
>> I still don't know whether this is the normal procedure, a bug, or I
>> simply missed something, but in the first case (normal procedure): why
>> waste resources when the processes could be shut down and restarted
>> when/if needed? What benefits does this approach provide?
>>
>>
>> P.S. about times: just to be sure, a simple Hello_World.jdl takes
>> 4-6 minutes to be submitted to a WMS, run on the CE and
>> return its output in a clean situation (no load on the machines and/or
>> network)... is that normal? Is there any way to speed up the job?
>>
>> Cheers
>> Vega
>>
>>
>>
>> Di Qing <[log in to unmask]>
>> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>>
>> 18/08/2006 16:14
>> Please respond to
>> LHC Computer Grid - Rollout <[log in to unmask]>
>>
>>
>>
>> To
>> [log in to unmask]
>> cc
>>
>> Subject
>> Re: [LCG-ROLLOUT] - gLiteCE - questions
>>
>>
>> Hi Vega,
>>
>> > ok, it's clear why condor processes start, but:
>> > >> After the users' jobs finished, these processes will not exit.
>> >
>> > ...can you explain this point better, please? Why should the processes
>> > be kept running even after job completion? At the moment the only fact I
>>
>> The CE and the resources behind it are somewhat like Condor
>> resources; your next jobs will be submitted to the same condor
>> processes. I am not sure whether they will keep running forever if you
>> don't touch them; JRA1 would need to confirm.
>>
>> > notice is that I have log files growing without control in the user's
>> > home... checking those log files I found that it is polling a
>> > nonexistent pbs job; well... it's the same entry as for the old LCG
>> > CE... in that case I'm sure it refers to pbs jobs, while here I'm not:
>> > it polls a pbs job which has never existed and simply continues...
>>
>> These log files should come from the globus-job-manager which launches
>> these condor processes.
>>
>> > Does it mean that X different users will leave 2 jobmanager processes
>> > (with their condor "children") PER job PER user?
>>
>> Currently there are 2 condor processes left per user.
>>
>> Cheers,
>>
>> Di
>>
>>
>> > Thanks and cheers
>> >
>> > Vega Forneris
>> >
>> > +-----------------------------------------------+
>> > ESA-ESRIN
>> > Unix Systems Administrator
>> > Via Galileo Galilei
>> > 00044 Frascati (Rm) - Italy
>> > Phone +39 06 94180581
>> > Mailto: [log in to unmask]
>> > +-----------------------------------------------+
>> > Vitrociset S.p.A.
>> > Unix System Administrator
>> > Via Tiburtina 1020
>> > 00100 Roma - Italy
>> > Phone +39 06 8820 4297
>> > Mailto: [log in to unmask]
>> > +-----------------------------------------------+
>> >
>> > "I do not feel obliged to believe that the same God who has endowed us
>> > with sense, reason, and intellect has intended us to forgo their use."
>> > (Galileo Galilei)
>> >
>> >
>> >
>> > Di Qing <[log in to unmask]>
>> > Sent by: LHC Computer Grid - Rollout
>> <[log in to unmask]>
>> >
>> > 18/08/2006 14:35
>> > Please respond to
>> > LHC Computer Grid - Rollout <[log in to unmask]>
>> >
>> >
>> > To
>> > [log in to unmask]
>> > cc
>> > Subject
>> > Re: [LCG-ROLLOUT] - gLiteCE - questions
>> >
>> >
>> > Vega Forneris wrote:
>> > >
>> > > Hi *,
>> > >
>> > > I've just set up a little GRID for testing purposes (NOTE: this
>> > > implies that all the machines involved are not used by anyone else =
>> > > low CPU load and IP traffic), but I still have a couple of questions
>> > > about gLiteCE + gLiteWMS:
>> > >
>> > > TEST:
>> > > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
>> > > Executable = "/bin/echo";
>> > > Arguments = "Hello World";
>> > > StdOutput = "message.txt";
>> > > StdError = "stderror";
>> > > OutputSandbox = {"message.txt","stderror"};
>> > >
>> > > "glite-job-list-match" and "glite-job-submit" work perfectly and the
>> > > job is successfully submitted, but this simple job takes 4-6 minutes
>> > > to return its output... is that normal (first question!!!)? It's a
>> > > very long time considering the job complexity (= NULL!), the
>> > > distances (the systems are really close to each other) and the
>> > > servers' workload...
>> > >
>> > > Here is the output of the command "glite-job-status -v 3":
>> > > - stateEnterTimes =
>> > > Submitted : Fri Aug 18 11:24:27 2006 CEST
>> > > Waiting : Fri Aug 18 11:24:15 2006 CEST
>> > > Ready : Fri Aug 18 11:24:16 2006 CEST
>> > > Scheduled : ---
>> >
>> > There is no Scheduled event for the gLite CE to update, since the
>> > submission mechanism changed as explained below.
>> >
>> > > Running : Fri Aug 18 11:27:08 2006 CEST
>> > > Done : Fri Aug 18 11:30:00 2006 CEST
>> > > Cleared : ---
>> > > Aborted : ---
>> > > Cancelled : ---
>> > > Unknown : ---
>> > >
>> > >
>> > > while the pbs logs show that the cluster processed this job in less
>> > > than 7 seconds after it reached the pbs_server, and the
>> > > authentication took one second!
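The stateEnterTimes above can be turned into per-stage latencies to see where the 4-6 minutes actually go. A small sketch (timestamps copied from the status output above; the time zone suffix is dropped for simplicity):

```python
from datetime import datetime

# Timestamps copied from the glite-job-status output above (all CEST).
events = {
    "Submitted": "Fri Aug 18 11:24:27 2006",
    "Running":   "Fri Aug 18 11:27:08 2006",
    "Done":      "Fri Aug 18 11:30:00 2006",
}

def parse(ts):
    """Parse the ctime-style timestamp used in the status output."""
    return datetime.strptime(ts, "%a %b %d %H:%M:%S %Y")

# Time spent before the job started, and while it was nominally running.
queue_delay = (parse(events["Running"]) - parse(events["Submitted"])).total_seconds()
run_time    = (parse(events["Done"]) - parse(events["Running"])).total_seconds()
```

This gives 161 s from Submitted to Running and another 172 s from Running to Done, so nearly all of the turnaround is submission and status-propagation overhead, which matches the pbs logs showing the job itself finished in under 7 seconds.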
>> > >
>> > > ...the real problem is that even after retrieving the output, there
>> > > are still processes running on the CE owned by the local user (in
>> > > this case: eo002)
>> > >
>> > > eo002 14086 1 0 11:25 ? 00:00:00 globus-job-manager -conf
>> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
>> > > -machine-type unknown -publish-jobs
>> > > eo002 14199 1 0 11:25 ? 00:00:00
>> > > /opt/condor-c/sbin/condor_master -f -r 680
>> > > eo002 14231 14199 0 11:25 ? 00:00:00 condor_schedd -f -n
>> > > [log in to unmask]
>> > > eo002 14274 1 0 11:25 ? 00:00:00 globus-job-manager -conf
>> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
>> > > -machine-type unknown -publish-jobs
>> > > eo002 14306 1 0 11:25 ? 00:00:00 perl
>> > > /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data
>> > > --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119.
>> > >
>> > > eo002 14308 14306 0 11:25 ? 00:00:00 perl
>> > > /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self
>> > > --maxtime=3600s
>> >
>> > The gLite CE is quite different from the LCG CE. There are no job
>> > managers for the batch system, so when a new user's job comes through
>> > the WMS, two condor processes are launched through the fork job
>> > manager, as you saw above. Users' jobs are then actually submitted to
>> > the condor on the CE from the condor on the WMS. After the users' jobs
>> > finish, these processes do not exit. JRA1 is planning to move the
>> > condor processes from user-based to VO-based.
>> >
>> > Di
>> >
>> > >
>> > > and 2 gram_job log files grow continuously in the user's home....
>> > >
>> > > Crosschecking the "daemon_unique_name"
>> > > (= 5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I
>> > > found that the following text is continuously appended to the
>> > > condorc-advertiser.813.2385.out log:
>> > >
>> > > Fri Aug 18 11:47:18 2006 Advertising
>> > > "[log in to unmask]"
>> > > Fri Aug 18 11:47:48 2006 #################
>> > > Fri Aug 18 11:47:48 2006 Rewriting
>> > > "[log in to unmask]"
>> > > Fri Aug 18 11:47:48 2006 Setting requirements true
>> > > MyType = "Machine"
>> > > TargetType = "Job"
>> > > Activity = "Idle"
>> > > Arch = "CondorC"
>> > > CONDORC_WANTJOB = TRUE
>> > > CondorCAd = 1
>> > > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
>> > > CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
>> > > DaemonStartTime = 1155893119
>> > > Machine = "grid-eo-engine03.esrin.esa.int"
>> > > MaxJobsRunning = 200
>> > > MonitorSelfAge = 1200
>> > > MonitorSelfCPUUsage = 0.012500
>> > > MonitorSelfImageSize = 7612.000000
>> > > MonitorSelfResidentSetSize = 4620
>> > > MonitorSelfTime = 1155894319
>> > > MyAddress = "<193.204.231.32:22864>"
>> > > Name =
>> "[log in to unmask]"
>> > > NumUsers = 0
>> > > OpSys = "CondorC"
>> > > Requirements = TRUE
>> > > START = TRUE
>> > > ServerTime = 1155894460
>> > > SiteName = "grid-eo-engine03.esrin.esa.int"
>> > > StartdIpAddr = "<193.204.231.32:22864>"
>> > > State = "Unclaimed"
>> > > TotalFlockedJobs = 0
>> > > TotalHeldJobs = 0
>> > > TotalIdleJobs = 0
>> > > TotalJobAds = 0
>> > > TotalRemovedJobs = 0
>> > > TotalRunningJobs = 0
>> > > UpdateSequenceNumber = 1155894468
>> > > VirtualMemory = 0
>> > > WantAdRevaluate = True
>> > > WantResAd = TRUE
>> > > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
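For spot-checking values such as TotalRunningJobs in dumps like this, the attribute lines can be read into a dict. A minimal sketch (the sample is a subset of the ClassAd above; the flat `Name = value` layout is an assumption based on this excerpt):

```python
# Subset of the advertiser-log ClassAd shown above.
SAMPLE = """\
MyType = "Machine"
TargetType = "Job"
MaxJobsRunning = 200
TotalRunningJobs = 0
State = "Unclaimed"
"""

def parse_classad(text):
    """Parse flat `Name = value` lines into a dict, typing simple values."""
    ad = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[key] = value[1:-1]          # quoted string attribute
        else:
            try:
                ad[key] = int(value)       # integer attribute
            except ValueError:
                ad[key] = value            # leave as-is (e.g. TRUE, floats)
    return ad
```

Here the interesting check is that TotalRunningJobs, TotalIdleJobs, etc. are all 0 while the daemons keep advertising, i.e. the schedd is idle.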
>> > >
>> > > I suppose that after the output is retrieved, the jobs should end,
>> > > shouldn't they?
>> > >
>> > > Thanks for the support and cheers
>> > >
>> > > Vega Forneris
>> > >
>> >
>>
|