Hi all,
We are seeing strange behaviour on our 4 CREAM CEs (still gLite 3.2).
All CEs use the same separate PBS server, all publish the same queues and
(should) have the same configuration: RPMs, scripts and the relevant config
files are identical (glite-CREAM-3.2.14-1.sl5.x86_64,
lcg-info-dynamic-pbs-1.0.13-1.noarch, lcg-info-dynamic-scheduler-generic-2.3.4-1.noarch,
lcg-info-dynamic-scheduler-pbs-2.0.1-1.noarch, lcg-info-dynamic-software-1.0.5-0.noarch,
lcg-info-provider-software-1.0.6-1.noarch),
but something must be different, because they behave differently.
When you query the CE infosys, two of them always publish 0 for FreeCPUs and 0
for FreeJobSlots, and the other two publish a 'correct' value for FreeCPUs and 0
for FreeJobSlots:
[arnaubria@ui01 ~]$ check_service_bdii.sh ce07|grep FreeCPU
GlueCEStateFreeCPUs: 0
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 0
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 0
[arnaubria@ui01 ~]$ check_service_bdii.sh ce10|grep FreeCPU
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 56
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 56
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 56
Output of running the plugins (as edguser) on each CE (I'll paste the info
for one queue only):
[ce07]
-sh-3.2$ /opt/glite/etc/gip/plugin/glite-info-dynamic-ce
[...]
dn: GlueCEUniqueID=ce07.pic.es:8443/cream-pbs-glong_sl5,mds-vo-name=resource,o=grid
GlueCEInfoLRMSVersion: 2.5.9
GlueCEInfoTotalCPUs: 2649
GlueCEPolicyAssignedJobSlots: 2649
GlueCEStateFreeCPUs: 18
GlueCEPolicyMaxCPUTime: 4800
GlueCEPolicyMaxWallClockTime: 5220
GlueCEStateStatus: Production
-sh-3.2$ /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
[...]
dn: GlueCEUniqueID=ce07.pic.es:8443/cream-pbs-glong_sl5,mds-vo-name=resource,o=grid
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 1781
GlueCEStateWaitingJobs: 1310
GlueCEStateTotalJobs: 3091
GlueCEStateEstimatedResponseTime: 21003
GlueCEStateWorstResponseTime: 410292000
[ce10]
-sh-3.2$ /opt/glite/etc/gip/plugin/glite-info-dynamic-ce
dn: GlueCEUniqueID=ce10.pic.es:8443/cream-pbs-glong_sl5,mds-vo-name=resource,o=grid
GlueCEInfoLRMSVersion: 2.5.9
GlueCEInfoTotalCPUs: 2649
GlueCEPolicyAssignedJobSlots: 2649
GlueCEStateFreeCPUs: 50
GlueCEPolicyMaxCPUTime: 4800
GlueCEPolicyMaxWallClockTime: 5220
GlueCEStateStatus: Production
-sh-3.2$ /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
[...]
dn: GlueCEUniqueID=ce10.pic.es:8443/cream-pbs-glong_sl5,mds-vo-name=resource,o=grid
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 1774
GlueCEStateWaitingJobs: 1332
GlueCEStateTotalJobs: 3106
GlueCEStateEstimatedResponseTime: 21109
GlueCEStateWorstResponseTime: 417182400
(The FreeCPUs values differ because the commands were executed at
different times.)
So, as you can see, FreeJobSlots and FreeCPUs are always 0 when asking
dynamic-scheduler-wrapper, but FreeCPUs is not 0 when asking
dynamic-ce. ***
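
To make the disagreement between the two plugins explicit, here is a minimal Python sketch (not part of the gLite tooling) that parses the LDIF-style output of both plugins and lists the attributes they both set with different values. The helper names parse_ldif and conflicts are my own, and the sample text is just a subset of the ce07 output pasted above.

```python
DN = "GlueCEUniqueID=ce07.pic.es:8443/cream-pbs-glong_sl5,mds-vo-name=resource,o=grid"

# Subset of the ce07 output shown above, one string per plugin.
DYNAMIC_CE_OUT = f"""dn: {DN}
GlueCEInfoTotalCPUs: 2649
GlueCEStateFreeCPUs: 18
GlueCEStateStatus: Production
"""

SCHEDULER_OUT = f"""dn: {DN}
GlueCEStateFreeJobSlots: 0
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 1781
"""


def parse_ldif(text):
    """Parse simple LDIF-style plugin output into {dn: {attr: value}}."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything without an "attr: value" shape.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "dn":
            current = entries.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return entries


def conflicts(a, b):
    """Return {(dn, attr): (value_a, value_b)} for attributes set differently."""
    out = {}
    for dn in a.keys() & b.keys():
        for attr in a[dn].keys() & b[dn].keys():
            if a[dn][attr] != b[dn][attr]:
                out[(dn, attr)] = (a[dn][attr], b[dn][attr])
    return out


if __name__ == "__main__":
    result = conflicts(parse_ldif(DYNAMIC_CE_OUT), parse_ldif(SCHEDULER_OUT))
    for (dn, attr), (ce_val, sched_val) in result.items():
        print(f"{attr}: dynamic-ce={ce_val} scheduler-wrapper={sched_val}")
```

Running the same comparison on each of the four CEs would show whether the plugins themselves disagree in the same way everywhere, which would point at the merge step rather than the plugins.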
I've noticed that, when querying the infosys, the position at which
FreeCPUs appears is not the same on each CE. On the one that publishes 0,
FreeCPUs appears right after Status; on the other, near the end:
# ce07.pic.es:8443/cream-pbs-glong_sl5, resource, grid
dn: GlueCEUniqueID=ce07.pic.es:8443/cream-pbs-glong_sl5,Mds-Vo-name=resource,o
=grid
GlueCEStateStatus: Production
GlueCEStateFreeCPUs: 0 <------------------------------------------------
[...]
# ce10.pic.es:8443/cream-pbs-glong_sl5, resource, grid
dn: GlueCEUniqueID=ce10.pic.es:8443/cream-pbs-glong_sl5,Mds-Vo-name=resource,o
=grid
GlueCEStateStatus: Production
GlueCEPolicyPriority: 1
[...]
GlueCEStateFreeJobSlots: 0
GlueCEInfoTotalCPUs: 2649
GlueCEPolicyAssignedJobSlots: 2649
GlueCEStateTotalJobs: 3139
GlueCEStateFreeCPUs: 29 <-------------------------------------------------
GlueCEStateEstimatedResponseTime: 21565
My first question is: which value should be used for publishing
GlueCEStateFreeCPUs? From bdii-update.log I understand that it must be
the one from dynamic-ce, as it is the last command run:
2012-02-14 11:43:58,811: [INFO] Running Plugins
2012-02-14 11:43:58,811: [DEBUG] Running /opt/glite/etc/gip/plugin/glite-info-dynamic-software-wrapper
2012-02-14 11:43:58,914: [DEBUG] Running /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
2012-02-14 11:44:03,325: [DEBUG] Running /opt/glite/etc/gip/plugin/glite-info-dynamic-ce
Is my assumption correct? If so, what could be causing the
differences between the two CEs?
*** About what FreeCPUs means: when querying dynamic-ce, the total number
of free CPUs is calculated by looking at the nodes that are 'free' and
comparing how many jobs each one is running with how many it could run
(the torque np parameter). But in our case, where maui takes between 5 and
10 minutes to schedule, it is normal to see some free CPUs simply
because maui is still scheduling or has not scheduled yet. Since we have
many jobs in the queue, there are no really free CPUs. So shouldn't
FreeJobSlots, which checks whether there are queued jobs (which is what
really tells you whether free slots are available), be the value used by
tools like lcg-info(sites)?
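
To illustrate the difference between the two counts, here is a rough Python sketch of the logic as I describe it above. It is not the actual plugin code: the Node fields, the node names and the "report 0 when anything is queued" rule are assumptions made only for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str     # hypothetical worker node name
    state: str    # torque node state, e.g. "free", "job-exclusive", "down"
    np: int       # torque np value: how many jobs the node could run
    running: int  # how many jobs the node is running right now


def free_cpus(nodes):
    """Count unused slots on nodes in state 'free' (the dynamic-ce style count)."""
    return sum(max(n.np - n.running, 0) for n in nodes if n.state == "free")


def free_job_slots(nodes, queued_jobs):
    """Assumed rule: with jobs queued, no slot is really free, so report 0."""
    return 0 if queued_jobs > 0 else free_cpus(nodes)


# Made-up snapshot: wn001 has 2 idle slots that maui has not filled yet.
NODES = [
    Node("wn001", "free", 8, 6),
    Node("wn002", "job-exclusive", 8, 8),
    Node("wn003", "down", 8, 0),
]

if __name__ == "__main__":
    print("FreeCPUs:", free_cpus(NODES))
    print("FreeJobSlots with 1310 queued:", free_job_slots(NODES, 1310))
```

With a scheduling cycle of several minutes, the first count can be non-zero while the queue is full, which is why the second one looks like the more honest number to me.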
Could anyone give a hand?
TIA,
Arnau