Hi Ilian,

Thanks for the help and information; I'm still having issues, though. I'm
doing the testing on behalf of our user.

Yes, I did see that the truncated octahedron is no longer supported. Are there
any other cell shapes or changes that might help?

The code modifications have improved things somewhat. I now see a series of
densvar estimates as I adjust the value as recommended - previously I would
get a single estimate and go no further.

On almost every attempt (with various choices of nodes and cores) I find that
once I reach one of the subsequent densvar values, the code fails in the same
way and does not produce an output file. This has also occurred when using
a FAT node which has 1 TB of memory (I used 27 of its 56 cores).
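
For what it's worth, applying the rule of thumb you quote below (~200k
particles with electrostatics per 1 GB RAM per core), our 717703 atoms spread
over 27 domains would average only about 26,600 atoms per domain, so a
balanced run should need well under 1 GB per core. I assume the vacuum-mapped
domains mean the real peak per-domain count (and hence the memory asked for)
is far higher than that average - please correct me if I've misread the rule.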

On two occasions I managed to get the code to run for a reasonable time; one
run lasted 50 minutes before it failed. That run used our standard nodes:

___________________________________________________________________
#PBS -l select=2:ncpus=8:mpiprocs=8
mpirun -np 16 -machinefile $PBS_NODEFILE $exe

It ran 808 MD steps and the standard output contained:
*** warning - node 1 mapped on vacuum (no particles) !!! ***
*** warning - node 2 mapped on vacuum (no particles) !!! ***
*** warning - node 3 mapped on vacuum (no particles) !!! ***
*** warning - node 12 mapped on vacuum (no particles) !!! ***
*** warning - node 13 mapped on vacuum (no particles) !!! ***
*** warning - node 14 mapped on vacuum (no particles) !!! ***
*** warning - node 15 mapped on vacuum (no particles) !!! ***

export_atomic_data allocation failure, node: 12
export_atomic_data allocation failure, node: 14
export_atomic_data allocation failure, node: 15
export_atomic_data allocation failure, node: 13
export_atomic_data allocation failure, node: 8
export_atomic_data allocation failure, node: 9
export_atomic_data allocation failure, node: 10
export_atomic_data allocation failure, node: 11
_________________________________________________

The other, slightly less successful, run used:

#PBS -l select=4:ncpus=4:mpiprocs=4
mpirun -np 16 -machinefile $PBS_NODEFILE $exe

*** warning - node 1 mapped on vacuum (no particles) !!! ***
*** warning - node 2 mapped on vacuum (no particles) !!! ***
*** warning - node 3 mapped on vacuum (no particles) !!! ***
*** warning - node 12 mapped on vacuum (no particles) !!! ***
*** warning - node 13 mapped on vacuum (no particles) !!! ***
*** warning - node 14 mapped on vacuum (no particles) !!! ***
*** warning - node 15 mapped on vacuum (no particles) !!! ***
________________________________________________________


I wasn't fully clear from your response on what the best choices of nodes
would be. You mentioned cubed values - do you mean, for example:

nodes 1, 8 cpus (or 1 cpu)?

nodes 2, 4 cpus per node (8 cores in total), OR nodes 2, 8 cpus per node?

nodes 4, 2 cpus per node, OR nodes 4, 8 cpus per node?

Most of these options have given me issues, though. I've sketched below the
PBS lines I would use for the cubed totals, in case that helps pin down what
you mean.
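
(These are just illustrative lines on my side, assuming our standard 24-core
nodes and the same mpirun invocation as my jobs above - please correct me if
this isn't what you meant by cubed values.)

___________________________________________________________________
# 8 cores in total (2^3), all on one node
#PBS -l select=1:ncpus=8:mpiprocs=8
mpirun -np 8 -machinefile $PBS_NODEFILE $exe

# 27 cores in total (3^3): request 2 nodes (28 slots) and launch 27 ranks
#PBS -l select=2:ncpus=14:mpiprocs=14
mpirun -np 27 -machinefile $PBS_NODEFILE $exe

# 64 cores in total (4^3) across 4 nodes
#PBS -l select=4:ncpus=16:mpiprocs=16
mpirun -np 64 -machinefile $PBS_NODEFILE $exe
_________________________________________________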


Please let me know your thoughts; I can provide more info if you need it. I
can also organise a login on our system for you to test on, if you wish.


Much appreciated,
Anton




On Mon, 20 May 2019 at 12:12, Ilian Todorov - UKRI STFC <
[log in to unmask]> wrote:

> Hi Anton,
>
>
>
> 1.  The truncated octahedral MD cell is not available in domain
> decomposition.
>
> 2.  Your nodes have sufficient memory for this system despite the
> imbalance.  Bear in mind that the memory requirements will decrease as you
> go up in number of cores, and so will the need for densvar (in the routines
> amended for you).  I’d say do the test on 5 nodes last, as it will be the
> most demanding one and you may need to play a bit with densvar.
>
> 3.  Given the symmetry of the system, the best numbers of cores to run on
> will be cubed integers (with the total fitting within the allocated
> nodes).  However, it is difficult to speculate whether any speed-up is to
> be gained or not when using all available cores on the nodes.
>
> Regards,
>
>
>
> Ilian
>
>
>
> *From:* DL_POLY Mailing List <[log in to unmask]> *On Behalf Of *Anton
> Lopis
> *Sent:* 18 May 2019 08:19
> *To:* [log in to unmask]
> *Subject:* Re: Empty OUTPUT for higher values of denvar
>
>
>
> Hi Ilian,
>
>
>
> Thanks so much for looking into this; it's appreciated. I didn't expect such a
> complicated issue or required fix. Yes, I'd like a copy of the modified
> routines please, so I can test them.
>
>
>
> I have some additional thoughts or options which might help, please let me
> know if any are useful.
>
>
>
> 1. Would it help for the system cell to be a truncated octahedral? I think
> a cube is currently being used, but I'll confirm next week.
>
>
>
> 2. In terms of node types and memory, what we have at CHPC is as follows. Most
> of our nodes have 128 GB and 24 cores each. A fair number are similar but
> with 65 GB (the less helpful ones). We have 5 FAT nodes, each with 1 TB and 56
> cores. A little memory per node is taken up by system processes, though.
>
>
>
> https://chpc.ac.za/index.php/resources/lengau-cluster
>
>
>
> 3. It is possible to run jobs but request fewer cores per node,  therefore
> increasing the memory available per core.
>
>
>
> These are my thoughts so far, please let me know if anything is helpful or
> you have any further ideas.
>
>
>
> Much appreciated,
>
> Anton
>
>
>
>
>
>
>
> On Fri, 17 May 2019, 16:10 Ilian Todorov - UKRI STFC <
> [log in to unmask]> wrote:
>
> Hi Anton,
>
>
>
> The problem your user experiences is related to DL_POLY_4 asking for
> too much memory due to the imbalance of work generated by having a ball of
> matter in a domain-decomposed MD box.  Getting a trace-back to the
> offending line and routine can be badly handled by the OS/MPI/Compiler
> combination.  In my case, this return was in config_module when the
> configurational arrays were allocated, but I still did not get MPI to return
> the allocation failure, which it obviously was, in a controlled manner.
>
>
>
> DL_POLY_3.09 may have worked fine on this model system as it had simpler
> memory estimates.  In DL_POLY_4 I introduced some more complex estimates of
> required/desired memory for configuration array sizes.  These fail for this
> model system and will need amending.
>
>
>
> So, I have spent some time thinking about what needs changing, where/why/how,
> and did some testing.  Sadly the fastest and simplest way was on my
> laptop.  Going beyond execution on my laptop has been hindered somewhat by
> being in Chile since the beginning of last week (today is the final day of a CCP5
> summer school in Antofagasta).
>
>
>
> There is no perfect solution for such imbalance but after some amends I
> have been able to run the system without electrostatics on 8 cores (~3.5GB
> RAM available per core).  The system seemed to run with electrostatics on
> but I could not wait for a timestep to get printed.  The run failed on 27
> cores (~1GB RAM available per core).  Obviously, the DL_POLY domain
> decomposition rule of thumb for memory requirements “~200k particles with
> electrostatics per 1GB RAM per core” is not going to hold in this case,
> especially given the densvar push to extend the configurational arrays’
> bounds.  At large core counts this should ease though.
>
>
>
> I’ll send you two routines with amends for you to test.  Let me know how
> they work.  Bear in mind the densvar effect has changed and I’d suggest
> using small densvar values to start with and then push them up if necessary
> when advised to do so by the output.
>
>
>
> Regards,
>
>
>
> Ilian
>
>
>
> *From:* DL_POLY Mailing List <[log in to unmask]> *On Behalf Of *Anton
> Lopis
> *Sent:* 14 May 2019 13:28
> *To:* [log in to unmask]
> *Subject:* Re: Empty OUTPUT for higher values of denvar
>
>
>
> Hi Ilian,
>
>
>
> Just a friendly follow-up message to make sure you received my recent
> email. I'd forgotten to mention that this issue also occurred with 4.07.
> Please update me if you can.
>
>
>
> Many thanks,
>
> Anton
>
>
>
> On Fri, 10 May 2019 at 11:33, Anton Lopis <[log in to unmask]> wrote:
>
> Hi Ilian,
>
>
>
> Yes, I see exactly the same thing with 4.09 as I saw with 4.08. 2.20 and
> 3.09 did not show this issue, as I mentioned in my previous email.
>
>
>
> Best regards,
>
> Anton
>
>
>
> On Thu, 9 May 2019 at 16:17, Ilian Todorov - UKRI STFC <
> [log in to unmask]> wrote:
>
> Hi Anton,
>
>
>
> 4.08 was superseded by 4.09 in September 2018.  Can you download it and
> verify whether the problem is still there for your system?
>
>
>
> Thanks,
>
>
>
> Ilian
>
>
>
> *From:* DL_POLY Mailing List <[log in to unmask]> *On Behalf Of *Anton
> Lopis
> *Sent:* 09 May 2019 13:21
> *To:* [log in to unmask]
> *Subject:* Re: Empty OUTPUT for higher values of denvar
>
>
>
> Hi Ilian,
>
>
>
> 1. I've confirmed that 2.20 and 3.09 will run with the inputs; they ran for a
> few minutes without failing (the calculation is too long to get any significant
> output, though 2.20 was writing it).
>
>
>
> 4.07 failed in the same way for 283 and stopped in the same way as 4.08.
>
>
>
> I ran each of these on 5 nodes of 24 cores each. I'd previously tested 4.08
> with 5, 10, 20 and 40 nodes, with the same results.
>
>
>
> Many thanks,
>
> Anton
>
>
>
>
>
>
>
>
>
> On Thu, 9 May 2019 at 13:05, Anton Lopis <[log in to unmask]> wrote:
>
> Hi Ilian,
>
>
>
> Thanks for the reply, somehow I didn't see it until I searched today.
> Sorry for the delay.
>
>
>
> I am sending you a link to the input files.
>
>
>
> 1. I believe this is true, I will check to confirm.
>
>
>
> 2. version:   4.08    /     march  2016
>
> compiler: gfortran v5.1.0
>  ****      MPI: v3.0
>  **** MPI libs: Open MPI v1.8.8, package: Open MPI root@cnode0 ****
>  **** MPI libs: 006 Distribution, ident: 1.8.8, repo rev: v1.8 ****
>  **** MPI libs: .7-20-g1d53995, Aug 05, 2015
>
>
>
> 3. I'm not sure if there's a specific "trace-back" method? The recommended
> value (283) or above is where it fails; 282 did proceed further to
> give output and stop on the warnings/error.
>
>
>
> Much appreciated.
>
> Anton
>
>
>
> On Tue, 7 May 2019 at 13:14, Ilian Todorov - UKRI STFC <
> [log in to unmask]> wrote:
>
> Hi Anton,
>
>
>
> 1)  Can you confirm that the jobs submitted to run with versions 4.07 and
> 4.08 are with the same input (processor count and CONTROL) as those that
> run successfully with versions 2.20 and 3.09?
>
> 2)  Can you confirm which version of 4.08 you have used for this?
>
> 3)  Have you tried to trace-back the problem when going to the
> recommended (still an estimate) densvar?
>
>
>
> Do send me a link with the input files so I could try and investigate.
>
>
>
> Regards,
>
>
>
> Ilian Todorov
>
>
>
> *From:* DL_POLY Mailing List <[log in to unmask]> *On Behalf Of *Anton
> Lopis
> *Sent:* 07 May 2019 11:16
> *To:* [log in to unmask]
> *Subject:* Empty OUTPUT for higher values of denvar
>
>
>
> Hi All,
>
>
>
> I'm trying to assist one of our users with scaling calculations and with
> moving from version 2 to version 4. Her inputs work on versions 2.20
> and 3.09 but fail on 4.07 and 4.08. She has used denvar=600, but I
> need to drop the value significantly in order to see any output.
>
>
>
> The code (4.08) recommends using 283; however, if I use 282 I get the final
> part of the output listed below. For 283 and above (I've tried 284 and others)
> the code starts running and dies without writing anything into the OUTPUT
> (the std error and std out seem unhelpful) - in this case it seems to take
> a few seconds longer before the job stops than when I use 282.
>
>
>
> I can provide more info if needed, including the input files (tarred and
> zipped via Google Drive, perhaps). The system contains 717703 atoms, using
> Buckingham, core and Coulombic interactions.
>
>
>
> Please let me know what you think or need to know from me.
>
> Much appreciated,
>
> Anton
>
>
>
>
> I/O read method: parallel by using MPI-I/O (assumed)
>  I/O readers (assumed)                  15
>  I/O read batch size (assumed)     2000000
>  I/O read buffer size (assumed)      20000
>  I/O parallel read error checking off (assumed)
>
>  I/O write method: parallel by using MPI-I/O (assumed)
>  I/O write type: data sorting on (assumed)
>  I/O writers (assumed)                  60
>  I/O write batch size (assumed)    2000000
>  I/O write buffer size (assumed)     20000
>  I/O parallel write error checking off (assumed)
>
>
>  node/domain decomposition (x,y,z):      4     5     6
>
>  pure cutoff driven limit on largest possible decomposition:117649
> nodes/domains (49,49,49)
>
>  pure cutoff driven limit on largest balanced decomposition: 13824
> nodes/domains (24,24,24)
>
>  cutoffs driven limit on largest possible decomposition:103823
> nodes/domains (47,47,47)
>
>  cutoffs driven limit on largest balanced decomposition: 12167
> nodes/domains (23,23,23)
>
>  link-cell decomposition 1 (x,y,z):     11     9     7
>
>  *** warning - next error due to maximum number of atoms per domain set to
> : 90308
>  ***           but maximum & minumum numbers of atoms per domain asked for
> : 90454 & 0
>  ***           estimated denvar value for passing this stage safely is :
> 283
>
>  DL_POLY_4 terminated due to error    45
>
>  error - too many atoms in CONFIG file or per domain
>
>
>
>
> --
>
> Anton Lopis
> CHPC
> 021 658 2746 (W)
> 072 461 3794 (Cell)
> 021 658 2746 (Fax)
>
>


-- 
Anton Lopis
CHPC
021 658 2746 (W)
072 461 3794 (Cell)
021 658 2746 (Fax)

########################################################################

To unsubscribe from the DLPOLY list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=DLPOLY&A=1