OK Dave,

thanks for the clarification. I believe the problem is solved now. APEL 
was running fine on the CE, but the PBS accounting directory, despite 
appearing to be NFS mounted on the CE, was in fact not accessible, so 
APEL on the CE wasn't finding any records to publish.
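
A mount can look fine in the mount table and still be unreadable, so it 
helps to test the directory itself. The following is a minimal Python 
sketch of such a check, not part of APEL; the function name is 
hypothetical and the directory to pass in is whatever the log parser is 
configured to read.

```python
import os


def accounting_dir_status(path):
    """Return (ok, reason) for a directory expected to hold PBS accounting logs."""
    # A missing or unreachable mount point usually fails this first test.
    if not os.path.isdir(path):
        return False, "not a directory (or mount point missing)"
    try:
        # Listing forces a real read; an inaccessible mount may raise OSError here.
        entries = os.listdir(path)
    except OSError as exc:
        return False, f"cannot list directory: {exc}"
    if not entries:
        return False, "directory is empty, nothing for the parser to read"
    return True, f"{len(entries)} entries visible"
```

Calling accounting_dir_status() on the accounting directory before the 
parser runs would flag the "mounted but not accessible" case described 
above.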

I have now remounted the directory and re-run apel-pbs-log-parser on the 
CE and apel-publisher on the MON. The former claims to have inserted 
48438 records in the DB and the latter claims to have published 4206 
records to GOC. These should be appearing in the CESGA pages later on today.
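
One way to confirm that the new records arrived is to look at the latest 
RecordEnd in the LcgRecordsSync_v1 output quoted below. A minimal Python 
sketch, assuming the table text is pasted in exactly as the rgma client 
prints it (the function name is hypothetical, and leading ">" quote 
markers are tolerated):

```python
from datetime import date


def latest_record_end(table_text, column="RecordEnd"):
    """Return the most recent date found in `column` of an rgma-style ASCII table."""
    header = None
    idx = None
    latest = None
    for line in table_text.splitlines():
        # Drop mail quote markers and surrounding whitespace.
        line = line.lstrip("> ").strip()
        # Skip border lines ("+---+") and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            # The first row is the header; locate the wanted column.
            header = cells
            idx = header.index(column)
            continue
        value = date.fromisoformat(cells[idx])
        if latest is None or value > latest:
            latest = value
    return latest
```

Run against the table quoted below, this gives 2006-10-20; once the 
republished data has gone through, a November date would be expected 
instead.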

thanks and cheers,
gianfranco


Kant, D (Dave) wrote:
> Gianfranco,
>
> Looking at your data more carefully I now understand why there is no November data reported by CESGA.
>
> Your site last published data into RGMA on the 20th November.
> The dataset which you published was for jobs completed at your site between 5 February 2006 and 20 October 2006.
> That's why you see nothing in the CESGA pages for November. There is no data in your APEL database for jobs completed 
> in November. This is most likely because you don't have APEL running on the CE!
>
> RecordStart is the date of the first completed job 
> RecordEnd is the date of the last completed job
> MeasurementDate is the date that the tuple was published into RGMA
>
> rgma> select ExecutingSite,Njobs,RecordStart,RecordEnd,MeasurementDate from LcgRecordsSync_v1 where ExecutingSite like '%ucl-hep%';
> +-----------------+-------+-------------+------------+-----------------+
> | ExecutingSite   | Njobs | RecordStart | RecordEnd  | MeasurementDate |
> +-----------------+-------+-------------+------------+-----------------+
> | UKI-LT2-UCL-HEP |   696 | 2006-02-05  | 2006-02-28 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  1258 | 2006-03-01  | 2006-03-31 | 2006-11-20      |
> | UKI-LT2-UCL-HEP | 19763 | 2006-04-01  | 2006-04-30 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  3205 | 2006-05-01  | 2006-05-31 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  5571 | 2006-06-01  | 2006-06-30 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  6950 | 2006-07-01  | 2006-07-31 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  1670 | 2006-08-01  | 2006-08-31 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  3225 | 2006-09-01  | 2006-09-29 | 2006-11-20      |
> | UKI-LT2-UCL-HEP |  2315 | 2006-10-02  | 2006-10-20 | 2006-11-20      |
> +-----------------+-------+-------------+------------+-----------------+
>
> Dave
>
> =========================================================
> Dr Dave Kant
> CCLRC eScience Department                      Phone: (+44)|(0) 1235 778178
> Rutherford Appleton Laboratory                Fax:    (+44)|(0) 1235 446626
> Chilton, Didcot, Oxon, OX11 0QX, UK         Email:  [log in to unmask]
> ==========================================================
>
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]]On Behalf Of gianfranco sciacca
> Sent: 21 November 2006 00:05
> To: [log in to unmask]
> Subject: Re: Another APEL failure after host cert update
>
>
> Dave,
>
> from the table you quote, it appears to me that the last publishing 
> occurred on the 20th of October. Is this correct? That's the date our 
> apel-publisher started throwing errors. The current error is:
>
> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - WARNING - Detected 
> missing records, republishing data starting from: 2006-11-14 14:28:58
> Mon Nov 20 02:45:49 UTC 2006: apel-publisher - Publishing data into rgma
> Mon Nov 20 02:47:01 UTC 2006: apel-publisher - RGMAException: This 
> Exception is not handled errorCode = 0
> Mon Nov 20 02:47:01 UTC 2006: apel-publisher - program aborted
> org.glite.apel.core.ApelException: org.glite.rgma.RGMAException: 
> java.lang.NumberFormatException:For input string: ""
>
>
> Looking at 
> (http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php?ExecutingSite=UKI-LT2-UCL-HEP) 
> , there are indeed no entries for November. (see GGUS ID: 15792).
>
> Am I looking at the wrong thing?
>
> cheers,
> gianfranco
>
>
> On Mon, 20 Nov 2006, Kant, D (Dave) wrote:
>
>   
>> Gianfranco,
>>
>> The SFT test - in its current form - is very simple. It looks to see if your site published data recently.
>> It doesn't check if your site was able to synchronise with the GOC. It doesn't check if the publisher failed.
>>
>> We are developing a better SFT test.
>>
>> The new RPMS allow you to publish - in addition to the job records - a high level aggregate of your site 
>> accounting database. If you successfully publish job records to us, this table is updated. If your publisher
>> fails, this table will not be updated.
>>
>> Thus, we can determine if the site has published and if the site is synchronised with GOC.
>>
>> We can also use this to determine possible problems at a site, e.g. in September 2006 you only have accounting data for 26 out of 30 days. 
>> Was there downtime?
>>  
>> The GOC Aggregator runs 3 times a day. Once the data is in RGMA it will appear on the CESGA pages a few hours later.
>>
>> The Synchronisation table for your site:-
>>
>> rgma> select * from LcgRecordsSync_v1 where ExecutingSite like '%UCL-HEP%';
>> +-----------------+-------+--------------------+--------------------+-------------+------------+
>> | ExecutingSite   | Njobs | ElapsedTimeSeconds | BaseCpuTimeSeconds | RecordStart | RecordEnd  |
>> +-----------------+-------+--------------------+--------------------+-------------+------------+
>> | UKI-LT2-UCL-HEP |   696 |            1991192 |            1501541 | 2006-02-05  | 2006-02-28 |
>> | UKI-LT2-UCL-HEP |  1258 |           21813657 |           18676849 | 2006-03-01  | 2006-03-31 |
>> | UKI-LT2-UCL-HEP | 19763 |           18986344 |           16720062 | 2006-04-01  | 2006-04-30 |
>> | UKI-LT2-UCL-HEP |  3205 |           31275556 |           27581028 | 2006-05-01  | 2006-05-31 |
>> | UKI-LT2-UCL-HEP |  5571 |           40021491 |           34155368 | 2006-06-01  | 2006-06-30 |
>> | UKI-LT2-UCL-HEP |  6950 |           44385756 |           35415227 | 2006-07-01  | 2006-07-31 |
>> | UKI-LT2-UCL-HEP |  1670 |           32319092 |           21662309 | 2006-08-01  | 2006-08-31 |
>> | UKI-LT2-UCL-HEP |  3225 |           21172782 |           14315281 | 2006-09-01  | 2006-09-29 |
>> | UKI-LT2-UCL-HEP |  2315 |           14423968 |            8284929 | 2006-10-02  | 2006-10-20 |
>> +-----------------+-------+--------------------+--------------------+-------------+------------+
>>
>> GOC Aggregation over the JobRecords:-
>>
>> rgma> select ExecutingSite, count(*) as Njobs, Sum(ElapsedTimeSeconds), Sum(BaseCpuTimeSeconds), Min(EventDate), Max(Eventdate) from LcgRecords where ExecutingSite='UKI-LT2-UCL-HEP' group by ExecutingSite, Year(EventDate), Month(EventDate);
>> +-----------------+-------+-------------------------+-------------------------+----------------+----------------+
>> | ExecutingSite   | Njobs | Sum(ElapsedTimeSeconds) | Sum(BaseCpuTimeSeconds) | Min(EventDate) | Max(Eventdate) |
>> +-----------------+-------+-------------------------+-------------------------+----------------+----------------+
>> | UKI-LT2-UCL-HEP |   696 |                 1991192 |                 1501541 | 2006-02-05     | 2006-02-28     |
>> | UKI-LT2-UCL-HEP |  1258 |                21813657 |                18676849 | 2006-03-01     | 2006-03-31     |
>> | UKI-LT2-UCL-HEP | 19763 |                18986344 |                16720062 | 2006-04-01     | 2006-04-30     |
>> | UKI-LT2-UCL-HEP |  3205 |                31275556 |                27581028 | 2006-05-01     | 2006-05-31     |
>> | UKI-LT2-UCL-HEP |  5571 |                40021491 |                34155368 | 2006-06-01     | 2006-06-30     |
>> | UKI-LT2-UCL-HEP |  6950 |                44385756 |                35415227 | 2006-07-01     | 2006-07-31     |
>> | UKI-LT2-UCL-HEP |  1670 |                32319092 |                21662309 | 2006-08-01     | 2006-08-31     |
>> | UKI-LT2-UCL-HEP |  3225 |                21172782 |                14315281 | 2006-09-01     | 2006-09-29     |
>> | UKI-LT2-UCL-HEP |  2315 |                14423968 |                 8284929 | 2006-10-02     | 2006-10-20     |
>> +-----------------+-------+-------------------------+-------------------------+----------------+----------------+
>>
>>  This tells me that your site has synchronised its dataset with the GOC.
>>
>> Dave
>>
>> =========================================================
>> Dr Dave Kant
>> CCLRC eScience Department                      Phone: (+44)|(0) 1235 778178
>> Rutherford Appleton Laboratory                Fax:    (+44)|(0) 1235 446626
>> Chilton, Didcot, Oxon, OX11 0QX, UK         Email:  [log in to unmask]
>> ==========================================================
>>
>>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]]On Behalf Of Gianfranco Sciacca
>> Sent: 20 November 2006 15:55
>> To: [log in to unmask]
>> Subject: Another APEL failure after host cert update
>>
>>
>> Hi all,
>>
>> I recall a similar thread about a week ago, but here the problem seems 
>> different. APEL SFT errors started after having updated the MON and CE 
>> host certificates on the 20th of October.
>>
>> After running the apel-publisher by hand on the MON (trying first <all> 
>> and subsequently <missing> records), the SFT error went away. However, 
>> nothing is yet published to the GOC RGMA since the 20th of October. I've 
>> also updated to all the latest APEL rpm's from 
>> <http://goc.grid-support.ac.uk/gridsite/accounting/rpm.html>
>>
>> I was going to raise a GGUS ticket, but I thought I'd check here first 
>> for in-house expertise. Below are logs of 1) the publisher cron  job and 
>> 2) running the publisher by hand.
>>
>> Thanks,
>> gianfranco
>>
>> 1)
>> BC_PROVIDER is not set, using default location
>> ... /opt/glite/share/glite-security-trustmanager/bcprov-jdk14-122.jar
>> Mon Nov 20 02:45:02 UTC 2006: apel-publisher - Read-in configuration: 
>> [logenabled, j] [DBUsername=accounting, 
>> DBURL=jdbc:mysql://pc91.hep.ucl.ac.uk:3306/accounting, DBPassword=****, 
>> site=UKI-LT2-UCL-HEP, republish=missing]
>> Mon Nov 20 02:45:02 UTC 2006: apel-publisher - ------ Starting the apel 
>> application ------
>> Mon Nov 20 02:45:03 UTC 2006: apel-publisher - Optimising table: 
>> EventRecords
>> Mon Nov 20 02:45:03 UTC 2006: apel-publisher - Optimising table: GkRecords
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: 
>> MessageRecords
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: SpecRecords
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: LcgRecords
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Checking Blahd table: 
>> BlahdRecords
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher -  OK
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Optimising table: 
>> BlahdRecords
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - **** Combining tables and 
>> republishing in LcgRecords ****
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - Checking valid CPU spec 
>> data exists
>> Mon Nov 20 02:45:04 UTC 2006: apel-publisher - CPU spec values found
>> Mon Nov 20 02:45:09 UTC 2006: apel-publisher - Finding records from: 
>> 2006-11-14 14:28:58
>> Mon Nov 20 02:45:10 UTC 2006: apel-publisher - Record/s found: 44653
>> Mon Nov 20 02:45:10 UTC 2006: apel-publisher - Checking Archiver is Online
>> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - Archiver Alive
>> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - Archiver Count: Record/s 
>> found: 31594
>> Mon Nov 20 02:45:33 UTC 2006: apel-publisher - WARNING - Detected 
>> missing records, republishing data starting from: 2006-11-14 14:28:58
>> Mon Nov 20 02:45:49 UTC 2006: apel-publisher - Publishing data into rgma
>> Mon Nov 20 02:47:01 UTC 2006: apel-publisher - RGMAException: This 
>> Exception is not handled errorCode = 0
>> Mon Nov 20 02:47:01 UTC 2006: apel-publisher - program aborted
>> org.glite.apel.core.ApelException: org.glite.rgma.RGMAException: 
>> java.lang.NumberFormatException:For input string: ""
>>
>> 2)
>> [root@pc91 root]# /opt/glite/bin/apel-publisher -f 
>> /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml
>> BC_PROVIDER is not set, using default location
>> ... /opt/glite/share/glite-security-trustmanager/bcprov-jdk14-122.jar
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Read-in configuration: 
>> [logenabled, j] [DBUsername=accounting, 
>> DBURL=jdbc:mysql://pc91.hep.ucl.ac.uk:3306/accounting, DBPassword=****, 
>> site=UKI-LT2-UCL-HEP, republish=missing]
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - ------ Starting the apel 
>> application ------
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: 
>> EventRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: GkRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: 
>> MessageRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: SpecRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: LcgRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Checking Blahd table: 
>> BlahdRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher -  OK
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Optimising table: 
>> BlahdRecords
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - **** Combining tables and 
>> republishing in LcgRecords ****
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - Checking valid CPU spec 
>> data exists
>> Mon Nov 20 12:25:11 UTC 2006: apel-publisher - CPU spec values found
>> Mon Nov 20 12:25:17 UTC 2006: apel-publisher -  
>> ====================================
>> Mon Nov 20 12:25:17 UTC 2006: apel-publisher -     Synchronisation data 
>> check
>> Mon Nov 20 12:25:17 UTC 2006: apel-publisher -  
>> ====================================
>> Mon Nov 20 12:25:17 UTC 2006: apel-publisher - Finding all records in 
>> local database since the last successful publish timestamp : 2006-11-14 
>> 14:28:58
>> Mon Nov 20 12:25:18 UTC 2006: apel-publisher - Record/s found: 44653
>> Mon Nov 20 12:25:18 UTC 2006: apel-publisher - Checking Archiver is Online
>> Mon Nov 20 12:25:38 UTC 2006: apel-publisher - Archiver Alive
>> Mon Nov 20 12:25:38 UTC 2006: apel-publisher - Archiver Record Count: 
>> Record/s found site UKI-LT2-UCL-HEP : 31594
>> Mon Nov 20 12:25:38 UTC 2006: apel-publisher - WARNING - Detected 
>> missing records, republishing data starting from: 2006-11-14 14:28:58
>> Mon Nov 20 12:25:55 UTC 2006: apel-publisher - Publishing data into rgma
>> Mon Nov 20 12:25:55 UTC 2006: apel-publisher - RGMABufferFullException: 
>> out-of-memory error on rgma server while publishing data
>> Mon Nov 20 12:25:55 UTC 2006: apel-publisher - Please wait, will retry 
>> again in a 600000 milli seconds
>> Mon Nov 20 12:35:56 UTC 2006: apel-publisher - RGMABufferFullException: 
>> out-of-memory error on rgma server while publishing data
>> Mon Nov 20 12:35:56 UTC 2006: apel-publisher - Please wait, will retry 
>> again in a 600000 milli seconds
>> Mon Nov 20 12:46:52 UTC 2006: apel-publisher - RGMABufferFullException: 
>> out-of-memory error on rgma server while publishing data
>> Mon Nov 20 12:46:52 UTC 2006: apel-publisher - Please wait, will retry 
>> again in a 600000 milli seconds
>> Mon Nov 20 12:56:52 UTC 2006: apel-publisher - RGMABufferFullException: 
>> out-of-memory error on rgma server while publishing data
>> Mon Nov 20 12:56:52 UTC 2006: apel-publisher - Please wait, will retry 
>> again in a 600000 milli seconds
>> Mon Nov 20 13:07:33 UTC 2006: apel-publisher - Checking the record 
>> counts for syncronisation
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Archiver Record Count: 44653
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Local database and GOC 
>> managed to syncronise, updating RepublishInfo
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Rows deleted from 
>> RepublishInfo: 1
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher -  
>> ====================================
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher -  Completed 
>> Synchronisation data check
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher -  
>> ====================================
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher -  Publisher Mode = Apel 
>> Publisher (Default)
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - Building account records 
>> for LCG CE
>> Mon Nov 20 13:13:06 UTC 2006: apel-publisher - LCG CE: Stitching 
>> together all accounting records
>> Mon Nov 20 13:13:24 UTC 2006: apel-publisher - LCG CE: Stitching completed
>> Mon Nov 20 13:13:24 UTC 2006: apel-publisher - No accounting data to store
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Number of Joined 
>> accounting records: 0
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Build complete
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Building account records 
>> for data through the new Glite CE
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: Stitching 
>> together all accounting records
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: Stitching completed
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - GliteCE: No accounting 
>> data to store
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Number of Joined 
>> accounting records: 0
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Build complete
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing data into rgma
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Records to GOC:  0
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing data into rgma
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - Publishing Records to GOC: 0
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher - **** Join processing 
>> complete ****
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher -  
>> ====================================
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher -       Publishing Summary Data
>> Mon Nov 20 13:13:25 UTC 2006: apel-publisher -  
>> ====================================
>> Mon Nov 20 13:13:28 UTC 2006: apel-publisher - Publishing summary data 
>> into rgma
>> Mon Nov 20 13:13:28 UTC 2006: apel-publisher - ------ Processing 
>> finished ------
>>
>>
>>
>>
>>     


-- 
Dr. Gianfranco Sciacca			Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy		Internal: 33044
University College London		D15 - Physics Building
London WC1E 6BT