Dave,
from the table you quote, it appears that publishing last succeeded
on the 20th of October. Is this correct? That's the date our
apel-publisher started throwing errors. The current error is:
Mon Nov 20 02:45:33 UTC 2006: apel-publisher - WARNING - Detected
missing records, republishing data starting from: 2006-11-14 14:28:58
Mon Nov 20 02:45:49 UTC 2006: apel-publisher - Publishing data into rgma
Mon Nov 20 02:47:01 UTC 2006: apel-publisher - RGMAException: This
Exception is not handled errorCode = 0
Mon Nov 20 02:47:01 UTC 2006: apel-publisher - program aborted
org.glite.apel.core.ApelException: org.glite.rgma.RGMAException:
java.lang.NumberFormatException:For input string: ""
Looking at
http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php?ExecutingSite=UKI-LT2-UCL-HEP
there are indeed no entries for November (see GGUS ID: 15792).
Am I looking at the wrong thing?
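
In case it helps, this is roughly how I am reading your synchronisation
table: a minimal Python sketch, with the RecordStart/RecordEnd pairs
copied by hand from the output quoted below. The function names are my
own (hypothetical, nothing from APEL or R-GMA), and it only looks at the
first and last date of each row, so it cannot say which days inside a
month are missing.

```python
import calendar
from datetime import date

# (RecordStart, RecordEnd) pairs copied by hand from the LcgRecordsSync_v1
# output for UKI-LT2-UCL-HEP quoted in this thread.
SYNC_ROWS = [
    ("2006-02-05", "2006-02-28"),
    ("2006-03-01", "2006-03-31"),
    ("2006-04-01", "2006-04-30"),
    ("2006-05-01", "2006-05-31"),
    ("2006-06-01", "2006-06-30"),
    ("2006-07-01", "2006-07-31"),
    ("2006-08-01", "2006-08-31"),
    ("2006-09-01", "2006-09-29"),
    ("2006-10-02", "2006-10-20"),
]


def last_record_end(rows):
    """Return the latest RecordEnd date found in the table (hypothetical helper)."""
    return max(date.fromisoformat(end) for _, end in rows)


def partial_months(rows):
    """Return the rows whose first/last dates do not span the whole calendar month."""
    partial = []
    for start, end in rows:
        first = date.fromisoformat(start)
        last = date.fromisoformat(end)
        days_in_month = calendar.monthrange(first.year, first.month)[1]
        if first.day != 1 or last.day != days_in_month:
            partial.append((start, end))
    return partial


if __name__ == "__main__":
    print("Latest RecordEnd:", last_record_end(SYNC_ROWS))
    for start, end in partial_months(SYNC_ROWS):
        print("Month not fully covered:", start, "to", end)
```

On the rows above, the latest RecordEnd is 2006-10-20, and the rows that
do not span a whole month are February, September and October.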
cheers,
gianfranco
On Mon, 20 Nov 2006, Kant, D (Dave) wrote:
> Gianfranco,
>
> The SFT test - in its current form - is very simple. It looks to see if your site published data recently.
> It doesn't check if your site was able to synchronise with the GOC. It doesn't check if the publisher failed.
>
> We are developing a better SFT test.
>
> The new RPMS allow you to publish - in addition to the job records - a high level aggregate of your site
> accounting database. If you successfully publish job records to us, this table is updated. If your publisher
> fails, this table will not be updated.
>
> Thus, we can determine if the site has published and if the site is synchronised with GOC.
>
> We can also use the table to determine possible problems at a site, e.g. in September 2006 you only have accounting data for 26 out of 30 days.
> Was there downtime?
>
> The GOC Aggregator runs 3 times a day. Once the data is in RGMA it will appear on the CESGA pages a few hours later.
>
> The Synchronisation table for your site:-
>
> rgma> select * from LcgRecordsSync_v1 where ExecutingSite like '%UCL-HEP%';
> +-----------------+-------+--------------------+--------------------+-------------+------------+
> | ExecutingSite   | Njobs | ElapsedTimeSeconds | BaseCpuTimeSeconds | RecordStart | RecordEnd  |
> +-----------------+-------+--------------------+--------------------+-------------+------------+
> | UKI-LT2-UCL-HEP | 696   | 1991192            | 1501541            | 2006-02-05  | 2006-02-28 |
> | UKI-LT2-UCL-HEP | 1258  | 21813657           | 18676849           | 2006-03-01  | 2006-03-31 |
> | UKI-LT2-UCL-HEP | 19763 | 18986344           | 16720062           | 2006-04-01  | 2006-04-30 |
> | UKI-LT2-UCL-HEP | 3205  | 31275556           | 27581028           | 2006-05-01  | 2006-05-31 |
> | UKI-LT2-UCL-HEP | 5571  | 40021491           | 34155368           | 2006-06-01  | 2006-06-30 |
> | UKI-LT2-UCL-HEP | 6950  | 44385756           | 35415227           | 2006-07-01  | 2006-07-31 |
> | UKI-LT2-UCL-HEP | 1670  | 32319092           | 21662309           | 2006-08-01  | 2006-08-31 |
> | UKI-LT2-UCL-HEP | 3225  | 21172782           | 14315281           | 2006-09-01  | 2006-09-29 |
> | UKI-LT2-UCL-HEP | 2315  | 14423968           | 8284929            | 2006-10-02  | 2006-10-20 |
> +-----------------+-------+--------------------+--------------------+-------------+------------+
>
> GOC Aggregation over the JobRecords:-
>
> rgma> select ExecutingSite, count(*) as Njobs, Sum(ElapsedTimeSeconds), Sum(BaseCpuTimeSeconds), Min(EventDate), Max(Eventdate) from LcgRecords where ExecutingSite='UKI-LT2-UCL-HEP' group by ExecutingSite, Year(EventDate), Month(EventDate);
> +-----------------+-------+-------------------------+-------------------------+----------------+----------------+
> | ExecutingSite   | Njobs | Sum(ElapsedTimeSeconds) | Sum(BaseCpuTimeSeconds) | Min(EventDate) | Max(Eventdate) |
> +-----------------+-------+-------------------------+-------------------------+----------------+----------------+
> | UKI-LT2-UCL-HEP | 696   | 1991192                 | 1501541                 | 2006-02-05     | 2006-02-28     |
> | UKI-LT2-UCL-HEP | 1258  | 21813657                | 18676849                | 2006-03-01     | 2006-03-31     |
> | UKI-LT2-UCL-HEP | 19763 | 18986344                | 16720062                | 2006-04-01     | 2006-04-30     |
> | UKI-LT2-UCL-HEP | 3205  | 31275556                | 27581028                | 2006-05-01     | 2006-05-31     |
> | UKI-LT2-UCL-HEP | 5571  | 40021491                | 34155368                | 2006-06-01     | 2006-06-30     |
> | UKI-LT2-UCL-HEP | 6950  | 44385756                | 35415227                | 2006-07-01     | 2006-07-31     |
> | UKI-LT2-UCL-HEP | 1670  | 32319092                | 21662309                | 2006-08-01     | 2006-08-31     |
> | UKI-LT2-UCL-HEP | 3225  | 21172782                | 14315281                | 2006-09-01     | 2006-09-29     |
> | UKI-LT2-UCL-HEP | 2315  | 14423968                | 8284929                 | 2006-10-02     | 2006-10-20     |
> +-----------------+-------+-------------------------+-------------------------+----------------+----------------+
>
> This tells me that your site has synchronised its dataset with the GOC.
>
> Dave
>
> =========================================================
> Dr Dave Kant
> CCLRC eScience Department Phone: (+44)|(0) 1235 778178
> Rutherford Appleton Laboratory Fax: (+44)|(0) 1235 446626
> Chilton, Didcot, Oxon, OX11 0QX, UK Email: [log in to unmask]
> ==========================================================
>
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]]On Behalf Of Gianfranco Sciacca
> Sent: 20 November 2006 15:55
> To: [log in to unmask]
> Subject: Another APEL failure after host cert update
>
>
> Hi all,
>
> I recall a similar thread about a week ago, but here the problem seems
> different. APEL SFT errors started after having updated the MON and CE
> host certificates on the 20th of October.
>
> After running the apel-publisher by hand on the MON (trying first <all>
> and subsequently <missing> records), the SFT error went away. However,
> nothing is yet published to the GOC RGMA since the 20th of October. I've
> also updated to all the latest APEL rpm's from
> <http://goc.grid-support.ac.uk/gridsite/accounting/rpm.html>
>
> I was going to raise a GGUS ticket, but I thought I'd check here first
> for in-house expertise. Below are logs of 1) the publisher cron job and
> 2) running the publisher by hand.
>
> Thanks,
> gianfranco
>
> 1)
> BC_PROVIDER is not set, using default location
> ... /opt/glite/share/glite-security-trustmanager/bcprov-jdk14-122.jar
> Mon Nov 20 02:45:02 UTC 2006: apel-publisher - Read-in configuration:
> [logenabled, j] [DBUsername=accounting,
> DBURL=jdbc:mysql://pc91.hep.ucl.ac.uk:3306/accounting, DBPassword=****,
> site=UKI-LT2-UCL-HEP, republish=missing]
> Mon Nov 20 02:45:02 UTC 2006: apel-publisher - ------ Starting the apel
> application ------
> Mon Nov 20 02:45:03 UTC 2006: apel-publisher - Optimising table:
> EventRecords
> Mon Nov 20 02:45:03 UTC 2006: apel-publisher - Optimising table: GkRecords
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table:
> MessageRecords
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: SpecRecords
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: LcgRecords
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Checking Blahd table:
> BlahdRecords
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - OK
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table:
> BlahdRecords
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - **** Combining tables and
> republishing in LcgRecords ****
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Checking valid CPU spec
> data exists
> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - CPU spec values found
> Mon Nov 20 02:45:09 UTC 2006: apel-publisher - Finding records from:
> 2006-11-14 14:28:58
> Mon Nov 20 02:45:10 UTC 2006: apel-publisher - Record/s found: 44653
> Mon Nov 20 02:45:10 UTC 2006: apel-publisher - Checking Archiver is Online
> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - Archiver Alive
> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - Archiver Count: Record/s
> found: 31594
> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - WARNING - Detected
> missing records, republishing data starting from: 2006-11-14 14:28:58
> Mon Nov 20 02:45:49 UTC 2006: apel-publisher - Publishing data into rgma
> Mon Nov 20 02:47:01 UTC 2006: apel-publisher - RGMAException: This
> Exception is not handled errorCode = 0
> Mon Nov 20 02:47:01 UTC 2006: apel-publisher - program aborted
> org.glite.apel.core.ApelException: org.glite.rgma.RGMAException:
> java.lang.NumberFormatException:For input string: ""
>
> 2)
> [root@pc91 root]# /opt/glite/bin/apel-publisher -f
> /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml
> BC_PROVIDER is not set, using default location
> ... /opt/glite/share/glite-security-trustmanager/bcprov-jdk14-122.jar
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Read-in configuration:
> [logenabled, j] [DBUsername=accounting,
> DBURL=jdbc:mysql://pc91.hep.ucl.ac.uk:3306/accounting, DBPassword=****,
> site=UKI-LT2-UCL-HEP, republish=missing]
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - ------ Starting the apel
> application ------
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table:
> EventRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: GkRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table:
> MessageRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: SpecRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: LcgRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Checking Blahd table:
> BlahdRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - OK
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table:
> BlahdRecords
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - **** Combining tables and
> republishing in LcgRecords ****
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Checking valid CPU spec
> data exists
> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - CPU spec values found
> Mon Nov 20 12:25:17 UTC 2006: apel-publisher -
> ====================================
> Mon Nov 20 12:25:17 UTC 2006: apel-publisher - Synchronisation data
> check
> Mon Nov 20 12:25:17 UTC 2006: apel-publisher -
> ====================================
> Mon Nov 20 12:25:17 UTC 2006: apel-publisher - Finding all records in
> local database since the last successful publish timestamp : 2006-11-14
> 14:28:58
> Mon Nov 20 12:25:18 UTC 2006: apel-publisher - Record/s found: 44653
> Mon Nov 20 12:25:18 UTC 2006: apel-publisher - Checking Archiver is Online
> Mon Nov 20 12:25:38 UTC 2006: apel-publisher - Archiver Alive
> Mon Nov 20 12:25:38 UTC 2006: apel-publisher - Archiver Record Count:
> Record/s found site UKI-LT2-UCL-HEP : 31594
> Mon Nov 20 12:25:38 UTC 2006: apel-publisher - WARNING - Detected
> missing records, republishing data starting from: 2006-11-14 14:28:58
> Mon Nov 20 12:25:55 UTC 2006: apel-publisher - Publishing data into rgma
> Mon Nov 20 12:25:55 UTC 2006: apel-publisher - RGMABufferFullException:
> out-of-memory error on rgma server while publishing data
> Mon Nov 20 12:25:55 UTC 2006: apel-publisher - Please wait, will retry
> again in a 600000 milli seconds
> Mon Nov 20 12:35:56 UTC 2006: apel-publisher - RGMABufferFullException:
> out-of-memory error on rgma server while publishing data
> Mon Nov 20 12:35:56 UTC 2006: apel-publisher - Please wait, will retry
> again in a 600000 milli seconds
> Mon Nov 20 12:46:52 UTC 2006: apel-publisher - RGMABufferFullException:
> out-of-memory error on rgma server while publishing data
> Mon Nov 20 12:46:52 UTC 2006: apel-publisher - Please wait, will retry
> again in a 600000 milli seconds
> Mon Nov 20 12:56:52 UTC 2006: apel-publisher - RGMABufferFullException:
> out-of-memory error on rgma server while publishing data
> Mon Nov 20 12:56:52 UTC 2006: apel-publisher - Please wait, will retry
> again in a 600000 milli seconds
> Mon Nov 20 13:07:33 UTC 2006: apel-publisher - Checking the record
> counts for syncronisation
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Archiver Record Count: 44653
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Local database and GOC
> managed to syncronise, updating RepublishInfo
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Rows deleted from
> RepublishInfo: 1
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher -
> ====================================
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Completed
> Synchronisation data check
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher -
> ====================================
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Publisher Mode = Apel
> Publisher (Default)
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Building account records
> for LCG CE
> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - LCG CE: Stitching
> together all accounting records
> Mon Nov 20 13:13:24 UTC 2006: apel-publisher - LCG CE: Stitching completed
> Mon Nov 20 13:13:24 UTC 2006: apel-publisher - No accounting data to store
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Number of Joined
> accounting records: 0
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Build complete
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Building account records
> for data through the new Glite CE
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: Stitching
> together all accounting records
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: Stitching completed
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: No accounting
> data to store
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Number of Joined
> accounting records: 0
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Build complete
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing data into rgma
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Records to GOC: 0
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing data into rgma
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Records to GOC: 0
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - **** Join processing
> complete ****
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher -
> ====================================
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Summary Data
> Mon Nov 20 13:13:25 UTC 2006: apel-publisher -
> ====================================
> Mon Nov 20 13:13:28 UTC 2006: apel-publisher - Publishing summary data
> into rgma
> Mon Nov 20 13:13:28 UTC 2006: apel-publisher - ------ Processing
> finished ------
>
>
>
>