Gianfranco,
The SFT test, in its current form, is very simple: it only checks whether your site has published data recently.
It doesn't check whether your site was able to synchronise with the GOC, or whether the publisher failed.
We are developing a better SFT test.
The new RPMs allow you to publish, in addition to the job records, a high-level aggregate of your site's
accounting database. If you successfully publish job records to us, this table is updated. If your publisher
fails, this table will not be updated.
Thus, by comparing this table with our own aggregation of your job records, we can determine whether the site has published and whether it is synchronised with the GOC.
We can also use it to spot possible problems at a site, e.g. in September 2006 you only have accounting data for 26 out of 30 days.
Was there downtime?
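A rough sketch of that kind of gap check, in Python. This is only an illustration and not the actual SFT or GOC code: `missing_days` is a made-up helper, and the list of dates is invented sample input standing in for the distinct EventDate values a site holds for one month.

```python
from datetime import date, timedelta

def missing_days(days_with_records, year, month):
    """Return the days of the given month that have no accounting records."""
    first = date(year, month, 1)
    # First day of the following month (handles the December rollover).
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    all_days = {first + timedelta(days=i) for i in range((next_first - first).days)}
    return sorted(all_days - set(days_with_records))

# Invented sample input: four days are left out purely for illustration.
present = [date(2006, 9, d) for d in range(1, 31) if d not in (9, 10, 23, 30)]
gaps = missing_days(present, 2006, 9)
print(len(present), "of 30 days have data; missing:", [d.isoformat() for d in gaps])
```

A non-empty result does not say why data is missing; it only flags days worth checking against known downtime.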
The GOC Aggregator runs 3 times a day. Once the data is in RGMA it will appear on the CESGA pages a few hours later.
The Synchronisation table for your site:-
rgma> select * from LcgRecordsSync_v1 where ExecutingSite like '%UCL-HEP%';
+-----------------+-------+--------------------+--------------------+-------------+------------+
| ExecutingSite   | Njobs | ElapsedTimeSeconds | BaseCpuTimeSeconds | RecordStart | RecordEnd  |
+-----------------+-------+--------------------+--------------------+-------------+------------+
| UKI-LT2-UCL-HEP | 696   | 1991192            | 1501541            | 2006-02-05  | 2006-02-28 |
| UKI-LT2-UCL-HEP | 1258  | 21813657           | 18676849           | 2006-03-01  | 2006-03-31 |
| UKI-LT2-UCL-HEP | 19763 | 18986344           | 16720062           | 2006-04-01  | 2006-04-30 |
| UKI-LT2-UCL-HEP | 3205  | 31275556           | 27581028           | 2006-05-01  | 2006-05-31 |
| UKI-LT2-UCL-HEP | 5571  | 40021491           | 34155368           | 2006-06-01  | 2006-06-30 |
| UKI-LT2-UCL-HEP | 6950  | 44385756           | 35415227           | 2006-07-01  | 2006-07-31 |
| UKI-LT2-UCL-HEP | 1670  | 32319092           | 21662309           | 2006-08-01  | 2006-08-31 |
| UKI-LT2-UCL-HEP | 3225  | 21172782           | 14315281           | 2006-09-01  | 2006-09-29 |
| UKI-LT2-UCL-HEP | 2315  | 14423968           | 8284929            | 2006-10-02  | 2006-10-20 |
+-----------------+-------+--------------------+--------------------+-------------+------------+
GOC Aggregation over the JobRecords:-
rgma> select ExecutingSite, count(*) as Njobs, Sum(ElapsedTimeSeconds), Sum(BaseCpuTimeSeconds), Min(EventDate), Max(Eventdate) from LcgRecords where ExecutingSite='UKI-LT2-UCL-HEP' group by ExecutingSite, Year(EventDate), Month(EventDate);
+-----------------+-------+-------------------------+-------------------------+----------------+----------------+
| ExecutingSite   | Njobs | Sum(ElapsedTimeSeconds) | Sum(BaseCpuTimeSeconds) | Min(EventDate) | Max(Eventdate) |
+-----------------+-------+-------------------------+-------------------------+----------------+----------------+
| UKI-LT2-UCL-HEP | 696   | 1991192                 | 1501541                 | 2006-02-05     | 2006-02-28     |
| UKI-LT2-UCL-HEP | 1258  | 21813657                | 18676849                | 2006-03-01     | 2006-03-31     |
| UKI-LT2-UCL-HEP | 19763 | 18986344                | 16720062                | 2006-04-01     | 2006-04-30     |
| UKI-LT2-UCL-HEP | 3205  | 31275556                | 27581028                | 2006-05-01     | 2006-05-31     |
| UKI-LT2-UCL-HEP | 5571  | 40021491                | 34155368                | 2006-06-01     | 2006-06-30     |
| UKI-LT2-UCL-HEP | 6950  | 44385756                | 35415227                | 2006-07-01     | 2006-07-31     |
| UKI-LT2-UCL-HEP | 1670  | 32319092                | 21662309                | 2006-08-01     | 2006-08-31     |
| UKI-LT2-UCL-HEP | 3225  | 21172782                | 14315281                | 2006-09-01     | 2006-09-29     |
| UKI-LT2-UCL-HEP | 2315  | 14423968                | 8284929                 | 2006-10-02     | 2006-10-20     |
+-----------------+-------+-------------------------+-------------------------+----------------+----------------+
This tells me that your site has synchronised its dataset with the GOC.
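The comparison behind that statement can be sketched in Python. This is only an illustration, not the test being developed: `out_of_sync` is a hypothetical helper, the tuple layout is an assumption, and the two sample rows are copied from the February and March lines of the tables above.

```python
# Each tuple: (ExecutingSite, Njobs, ElapsedTimeSeconds, BaseCpuTimeSeconds, start, end)
def out_of_sync(sync_rows, aggregated_rows):
    """Return the rows that appear in one result set but not in the other."""
    return sorted(set(sync_rows) ^ set(aggregated_rows))

# Rows as published by the site in the synchronisation table (first two months only).
site_summary = [
    ("UKI-LT2-UCL-HEP", 696, 1991192, 1501541, "2006-02-05", "2006-02-28"),
    ("UKI-LT2-UCL-HEP", 1258, 21813657, 18676849, "2006-03-01", "2006-03-31"),
]
# Rows produced by aggregating the job records held at the GOC.
goc_aggregate = list(site_summary)

differences = out_of_sync(site_summary, goc_aggregate)
print("synchronised" if not differences else differences)
```

An empty difference means the monthly totals agree on both sides; any row returned points at a month where the two datasets diverge.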
Dave
=========================================================
Dr Dave Kant
CCLRC eScience Department Phone: (+44)|(0) 1235 778178
Rutherford Appleton Laboratory Fax: (+44)|(0) 1235 446626
Chilton, Didcot, Oxon, OX11 0QX, UK Email: [log in to unmask]
==========================================================
-----Original Message-----
From: Testbed Support for GridPP member institutes
[mailto:[log in to unmask]]On Behalf Of Gianfranco Sciacca
Sent: 20 November 2006 15:55
To: [log in to unmask]
Subject: Another APEL failure after host cert update
Hi all,
I recall a similar thread about a week ago, but here the problem seems
different. APEL SFT errors started after having updated the MON and CE
host certificates on the 20th of October.
After running the apel-publisher by hand on the MON (trying first <all>
and subsequently <missing> records), the SFT error went away. However,
nothing has been published to the GOC RGMA since the 20th of October. I've
also updated to all the latest APEL RPMs from
<http://goc.grid-support.ac.uk/gridsite/accounting/rpm.html>
I was going to raise a GGUS ticket, but I thought I'd check here first
for in-house expertise. Below are logs of 1) the publisher cron job and
2) running the publisher by hand.
Thanks,
gianfranco
1)
BC_PROVIDER is not set, using default location
... /opt/glite/share/glite-security-trustmanager/bcprov-jdk14-122.jar
Mon Nov 20 02:45:02 UTC 2006: apel-publisher - Read-in configuration:
[logenabled, j] [DBUsername=accounting,
DBURL=jdbc:mysql://pc91.hep.ucl.ac.uk:3306/accounting, DBPassword=****,
site=UKI-LT2-UCL-HEP, republish=missing]
Mon Nov 20 02:45:02 UTC 2006: apel-publisher - ------ Starting the apel
application ------
Mon Nov 20 02:45:03 UTC 2006: apel-publisher - Optimising table:
EventRecords
Mon Nov 20 02:45:03 UTC 2006: apel-publisher - Optimising table: GkRecords
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table:
MessageRecords
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: SpecRecords
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: LcgRecords
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Checking Blahd table:
BlahdRecords
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - OK
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table:
BlahdRecords
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - **** Combining tables and
republishing in LcgRecords ****
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Checking valid CPU spec
data exists
Mon Nov 20 02:45:04 UTC 2006: apel-publisher - CPU spec values found
Mon Nov 20 02:45:09 UTC 2006: apel-publisher - Finding records from:
2006-11-14 14:28:58
Mon Nov 20 02:45:10 UTC 2006: apel-publisher - Record/s found: 44653
Mon Nov 20 02:45:10 UTC 2006: apel-publisher - Checking Archiver is Online
Mon Nov 20 02:45:33 UTC 2006: apel-publisher - Archiver Alive
Mon Nov 20 02:45:33 UTC 2006: apel-publisher - Archiver Count: Record/s
found: 31594
Mon Nov 20 02:45:33 UTC 2006: apel-publisher - WARNING - Detected
missing records, republishing data starting from: 2006-11-14 14:28:58
Mon Nov 20 02:45:49 UTC 2006: apel-publisher - Publishing data into rgma
Mon Nov 20 02:47:01 UTC 2006: apel-publisher - RGMAException: This
Exception is not handled errorCode = 0
Mon Nov 20 02:47:01 UTC 2006: apel-publisher - program aborted
org.glite.apel.core.ApelException: org.glite.rgma.RGMAException:
java.lang.NumberFormatException:For input string: ""
2)
[root@pc91 root]# /opt/glite/bin/apel-publisher -f
/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml
BC_PROVIDER is not set, using default location
... /opt/glite/share/glite-security-trustmanager/bcprov-jdk14-122.jar
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Read-in configuration:
[logenabled, j] [DBUsername=accounting,
DBURL=jdbc:mysql://pc91.hep.ucl.ac.uk:3306/accounting, DBPassword=****,
site=UKI-LT2-UCL-HEP, republish=missing]
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - ------ Starting the apel
application ------
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table:
EventRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: GkRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table:
MessageRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: SpecRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: LcgRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Checking Blahd table:
BlahdRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - OK
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table:
BlahdRecords
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - **** Combining tables and
republishing in LcgRecords ****
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Checking valid CPU spec
data exists
Mon Nov 20 12:25:11 UTC 2006: apel-publisher - CPU spec values found
Mon Nov 20 12:25:17 UTC 2006: apel-publisher -
====================================
Mon Nov 20 12:25:17 UTC 2006: apel-publisher - Synchronisation data
check
Mon Nov 20 12:25:17 UTC 2006: apel-publisher -
====================================
Mon Nov 20 12:25:17 UTC 2006: apel-publisher - Finding all records in
local database since the last successful publish timestamp : 2006-11-14
14:28:58
Mon Nov 20 12:25:18 UTC 2006: apel-publisher - Record/s found: 44653
Mon Nov 20 12:25:18 UTC 2006: apel-publisher - Checking Archiver is Online
Mon Nov 20 12:25:38 UTC 2006: apel-publisher - Archiver Alive
Mon Nov 20 12:25:38 UTC 2006: apel-publisher - Archiver Record Count:
Record/s found site UKI-LT2-UCL-HEP : 31594
Mon Nov 20 12:25:38 UTC 2006: apel-publisher - WARNING - Detected
missing records, republishing data starting from: 2006-11-14 14:28:58
Mon Nov 20 12:25:55 UTC 2006: apel-publisher - Publishing data into rgma
Mon Nov 20 12:25:55 UTC 2006: apel-publisher - RGMABufferFullException:
out-of-memory error on rgma server while publishing data
Mon Nov 20 12:25:55 UTC 2006: apel-publisher - Please wait, will retry
again in a 600000 milli seconds
Mon Nov 20 12:35:56 UTC 2006: apel-publisher - RGMABufferFullException:
out-of-memory error on rgma server while publishing data
Mon Nov 20 12:35:56 UTC 2006: apel-publisher - Please wait, will retry
again in a 600000 milli seconds
Mon Nov 20 12:46:52 UTC 2006: apel-publisher - RGMABufferFullException:
out-of-memory error on rgma server while publishing data
Mon Nov 20 12:46:52 UTC 2006: apel-publisher - Please wait, will retry
again in a 600000 milli seconds
Mon Nov 20 12:56:52 UTC 2006: apel-publisher - RGMABufferFullException:
out-of-memory error on rgma server while publishing data
Mon Nov 20 12:56:52 UTC 2006: apel-publisher - Please wait, will retry
again in a 600000 milli seconds
Mon Nov 20 13:07:33 UTC 2006: apel-publisher - Checking the record
counts for syncronisation
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Archiver Record Count: 44653
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Local database and GOC
managed to syncronise, updating RepublishInfo
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Rows deleted from
RepublishInfo: 1
Mon Nov 20 13:13:06 UTC 2006: apel-publisher -
====================================
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Completed
Synchronisation data check
Mon Nov 20 13:13:06 UTC 2006: apel-publisher -
====================================
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Publisher Mode = Apel
Publisher (Default)
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Building account records
for LCG CE
Mon Nov 20 13:13:06 UTC 2006: apel-publisher - LCG CE: Stitching
together all accounting records
Mon Nov 20 13:13:24 UTC 2006: apel-publisher - LCG CE: Stitching completed
Mon Nov 20 13:13:24 UTC 2006: apel-publisher - No accounting data to store
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Number of Joined
accounting records: 0
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Build complete
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Building account records
for data through the new Glite CE
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: Stitching
together all accounting records
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: Stitching completed
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: No accounting
data to store
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Number of Joined
accounting records: 0
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Build complete
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing data into rgma
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Records to GOC: 0
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing data into rgma
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Records to GOC: 0
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - **** Join processing
complete ****
Mon Nov 20 13:13:25 UTC 2006: apel-publisher -
====================================
Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Summary Data
Mon Nov 20 13:13:25 UTC 2006: apel-publisher -
====================================
Mon Nov 20 13:13:28 UTC 2006: apel-publisher - Publishing summary data
into rgma
Mon Nov 20 13:13:28 UTC 2006: apel-publisher - ------ Processing
finished ------
--
Dr. Gianfranco Sciacca Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy Internal: 33044
University College London D15 - Physics Building
London WC1E 6BT