Hello,
I have upgraded the PragueLCG2 site to 2_6_0. We are using the pbspro LRMS
with the PBS server located on another host. I have made all the
modifications that were needed up to 2_4_0 (pcpus instead of np in the
lcg-info-dynamic-pbs script being the most notable change). Things seem to
work, except for random hiccups of the site BDII.
There are times when the published information is correct:
# ldapsearch -x -H ldap://golias25.farm.particle.cz:2170 -b "mds-vo-name=praguelcg2,o=grid"|grep -i totalcpu
GlueCEInfoTotalCPUs: 198
GlueCEInfoTotalCPUs: 198
GlueCEInfoTotalCPUs: 198
GlueCEInfoTotalCPUs: 198
After some time, however, the published information becomes wrong:
# ldapsearch -x -H ldap://golias25.farm.particle.cz:2170 -b "mds-vo-name=praguelcg2,o=grid"|grep -i totalcpu
GlueCEInfoTotalCPUs: 0
GlueCEInfoTotalCPUs: 0
GlueCEInfoTotalCPUs: 0
GlueCEInfoTotalCPUs: 0
GlueCEInfoTotalCPUs is not the only value that gets messed up: all other
information obtained from the LRMS (job and queue states, LRMS version,
etc.) is also published incorrectly, as if it were uninitialized.
After some time (I have not yet been able to learn any pattern from the
times when it is OK and when it is not) the published information is
correct again.
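
To look for a pattern, a polling loop along the following lines could log
the distinct GlueCEInfoTotalCPUs values together with a timestamp. This is
only a sketch: the URI and base DN are the ones from the queries above,
while the function names, the "no-answer" marker, the polling interval and
the log path are arbitrary choices made for illustration.

```shell
#!/bin/sh
# Sketch only: function names and the "no-answer" marker are illustrative.
BDII_URI="ldap://golias25.farm.particle.cz:2170"
BDII_BASE="mds-vo-name=praguelcg2,o=grid"

# Read ldapsearch (LDIF) output on stdin and print the distinct
# GlueCEInfoTotalCPUs values on a single line, separated by spaces.
distinct_totalcpus() {
    grep -i '^GlueCEInfoTotalCPUs:' | awk '{print $2}' | sort -u | paste -s -d ' ' -
}

# Query the site BDII once and print "<UTC timestamp> <distinct values>".
# If the query fails or returns no matching attribute, print "no-answer".
poll_once() {
    values=$(ldapsearch -x -H "$BDII_URI" -b "$BDII_BASE" 2>/dev/null | distinct_totalcpus)
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') ${values:-no-answer}"
}
```

It could be run as, for example,
"while :; do poll_once >> /tmp/bdii-totalcpus.log; sleep 60; done",
and the timestamps of the lines showing 0 then compared with the cron and
log times on the CE and the BDII host.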
When I run the /opt/lcg/libexec/lcg-info-dynamic-ce (or
lcg-info-dynamic-wrapper) script manually on the CE at a time when the
published information is incorrect, it returns correct values. So it seems
that the BDII/MDS is not able to get the correct values, even though the
scripts are returning them properly.
Does anyone have any idea what could be wrong, or in which direction I
should look?
Thanks a lot in advance,
--
Jiri Kosina
Institute of Physics, Academy of Sciences of the Czech Republic