Hi Ilian,

Thanks for the help and the information. I'm still having issues, though. I'm doing the testing for our user. Yes, I did see that the truncated octahedron is no longer supported. Are there any other useful cell shapes or changes that might help?

The code modifications have improved things somewhat. I now see a series of densvar estimates as I adjust the value as recommended - previously I would get one estimate and got no further. On almost every attempt (with different choices of nodes and cores), once I reach one of the subsequent densvar values the code fails in the same way and does not produce an output file. This has also occurred when using a FAT node, which has 1 TB of memory (I used 27 of the 56 cores).

On two occasions I found I could get the code to run for a reasonable time; one run lasted 50 minutes before it failed. That one used our standard nodes:
___________________________________________________________________

#PBS -l select=2:ncpus=8:mpiprocs=8
mpirun -np 16 -machinefile $PBS_NODEFILE $exe

It ran 808 MD steps and the standard output had:

*** warning - node 1 mapped on vacuum (no particles) !!! ***
*** warning - node 2 mapped on vacuum (no particles) !!! ***
*** warning - node 3 mapped on vacuum (no particles) !!! ***
*** warning - node 12 mapped on vacuum (no particles) !!! ***
*** warning - node 13 mapped on vacuum (no particles) !!! ***
*** warning - node 14 mapped on vacuum (no particles) !!! ***
*** warning - node 15 mapped on vacuum (no particles) !!! ***
export_atomic_data allocation failure, node: 12
export_atomic_data allocation failure, node: 14
export_atomic_data allocation failure, node: 15
export_atomic_data allocation failure, node: 13
export_atomic_data allocation failure, node: 8
export_atomic_data allocation failure, node: 9
export_atomic_data allocation failure, node: 10
export_atomic_data allocation failure, node: 11
_________________________________________________

The other, slightly less successful, run:

#PBS -l select=4:ncpus=4:mpiprocs=4
mpirun -np 16 -machinefile $PBS_NODEFILE $exe

*** warning - node 1 mapped on vacuum (no particles) !!! ***
*** warning - node 2 mapped on vacuum (no particles) !!! ***
*** warning - node 3 mapped on vacuum (no particles) !!! ***
*** warning - node 12 mapped on vacuum (no particles) !!! ***
*** warning - node 13 mapped on vacuum (no particles) !!! ***
*** warning - node 14 mapped on vacuum (no particles) !!! ***
*** warning - node 15 mapped on vacuum (no particles) !!! ***
________________________________________________________

I wasn't fully clear from your response what the best choices of nodes to use would be. You mentioned cubed values - do you mean, for example:

nodes 1 cpu 8 (or 1)
nodes 2 cpu 4 per node (8 cores in total) OR nodes 2 cpu 8 per node?
nodes 4 cpu 2 per node OR nodes 4 cpu 8 per node?
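So that I am sure I've read your advice correctly, here is my guess at what such requests would look like, following the same pattern as the scripts above (the node and core counts are illustrative only, and I have not tested these):
___________________________________________________________________

# 8 MPI ranks in total (2 cubed), spread over 2 nodes
#PBS -l select=2:ncpus=4:mpiprocs=4
mpirun -np 8 -machinefile $PBS_NODEFILE $exe

# 27 MPI ranks in total (3 cubed), spread over 3 nodes
#PBS -l select=3:ncpus=9:mpiprocs=9
mpirun -np 27 -machinefile $PBS_NODEFILE $exe
___________________________________________________________________

Is that the right reading?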
Most of these options have given me issues, though. Please let me know your thoughts; I can provide more info if you need it. I can also organise a login on our system for you to test with, if you wish.

Much appreciated,
Anton

On Mon, 20 May 2019 at 12:12, Ilian Todorov - UKRI STFC <[log in to unmask]> wrote:

> Hi Anton,
>
> 1. The truncated octahedral MD cell is not available in domain decomposition.
>
> 2. Your nodes have sufficient memory for this system despite the imbalance. Bear in mind that the memory requirements will decrease as you go up in number of cores, and so will the need for densvar (in the routines amended for you). I'd say do the test on 5 nodes last, as it will be the most demanding one and you may need to play a bit with densvar.
>
> 3. Given the symmetry of the system, the best core counts to run on will be cubed integers (with the total fitting within the allocated nodes). However, it is difficult to speculate whether any speed-up is to be gained when using all available cores on the nodes.
>
> Regards,
>
> Ilian
>
> From: DL_POLY Mailing List <[log in to unmask]> On Behalf Of Anton Lopis
> Sent: 18 May 2019 08:19
> To: [log in to unmask]
> Subject: Re: Empty OUTPUT for higher values of denvar
>
> Hi Ilian,
>
> Thanks so much for looking into this - appreciated. I didn't expect such a complicated issue or such an involved fix. Yes, I'd like a copy of the modified routines, please, so I can test them.
>
> I have some additional thoughts and options which might help; please let me know if any are useful.
>
> 1. Would it help for the system cell to be a truncated octahedron? I think a cube is currently being used, but I'll confirm next week.
>
> 2. The node types and memory we have at CHPC are as follows. Most of our nodes have 128 GB and 24 cores each. A fair number are similar but with 65 GB (the less helpful ones). We have 5 FAT nodes with 1 TB each and 56 cores. A little memory per node is taken by system processes, though.
>
> https://chpc.ac.za/index.php/resources/lengau-cluster
>
> 3. It is possible to run jobs but request fewer cores per node, thereby increasing the memory available per core - for example, with a request like the sketch below.
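> An untested sketch (the node and rank counts are illustrative):
>
> #PBS -l select=9:ncpus=24:mpiprocs=3
> mpirun -np 27 -machinefile $PBS_NODEFILE $exe
>
> Each 24-core/128 GB node then carries only 3 MPI ranks, leaving roughly 40 GB per rank.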
> These are my thoughts so far; please let me know if anything is helpful or if you have any further ideas.
>
> Much appreciated,
> Anton
>
> On Fri, 17 May 2019, 16:10 Ilian Todorov - UKRI STFC <[log in to unmask]> wrote:
>
> Hi Anton,
>
> The problem your user experiences is related to DL_POLY_4 asking for too much memory, due to the imbalance of work generated by having a ball of matter in a domain-decomposed MD box. Getting a trace-back to the offending line and routine can be badly handled by the OS/MPI/compiler combination. In my case, the failure was in config_module, when the configurational arrays were allocated, but I still did not get MPI to report the allocation failure (which it obviously was) in a controlled manner.
>
> DL_POLY_3.09 may have worked fine on this model system as it had simpler memory estimates. In DL_POLY_4 I introduced some more complex estimates of the required/desired memory for the configurational array sizes. These fail for this model system and will need amending.
>
> So, I have spent some time thinking about what needs changing, where/why/how, and did some testing. Sadly, the fastest and simplest way was on my laptop. Going beyond execution on my laptop has been hindered somewhat by my being in Chile since the beginning of last week (today is the final day of a CCP5 summer school at Antofagasta).
>
> There is no perfect solution for such an imbalance, but after some amendments I have been able to run the system without electrostatics on 8 cores (~3.5 GB RAM available per core). The system seemed to run with electrostatics on, but I could not wait for a timestep to get printed. The run failed on 27 cores (~1 GB RAM available per core). Obviously, the DL_POLY domain decomposition rule of thumb for memory requirements - "~200k particles with electrostatics per 1 GB RAM per core" - is not going to hold in this case, especially given the densvar push to extend the configurational arrays' bounds. At large core counts this should ease, though.
>
> I'll send you two routines with amendments for you to test. Let me know how they work. Bear in mind that the densvar effect has changed, so I'd suggest starting with small densvar values and then pushing them up if necessary, when advised to by the output.
>
> Regards,
>
> Ilian
>
> From: DL_POLY Mailing List <[log in to unmask]> On Behalf Of Anton Lopis
> Sent: 14 May 2019 13:28
> To: [log in to unmask]
> Subject: Re: Empty OUTPUT for higher values of denvar
>
> Hi Ilian,
>
> Just a friendly follow-up message to make sure you received my recent email. I'd forgotten to mention that this issue also happened with 4.07. Please update me if you can.
>
> Many thanks,
> Anton
>
> On Fri, 10 May 2019 at 11:33, Anton Lopis <[log in to unmask]> wrote:
>
> Hi Ilian,
>
> Yes, I see exactly the same thing with 4.09 as I saw with 4.08. 2.20 and 3.09 did not show this issue, as I mentioned in my previous email.
>
> Best regards,
> Anton
>
> On Thu, 9 May 2019 at 16:17, Ilian Todorov - UKRI STFC <[log in to unmask]> wrote:
>
> Hi Anton,
>
> 4.08 was superseded by 4.09 in September 2018. Can you download it and verify that the problem is still there for your system?
>
> Thanks,
>
> Ilian
>
> From: DL_POLY Mailing List <[log in to unmask]> On Behalf Of Anton Lopis
> Sent: 09 May 2019 13:21
> To: [log in to unmask]
> Subject: Re: Empty OUTPUT for higher values of denvar
>
> Hi Ilian,
>
> 1. I've confirmed that 2.20 and 3.09 will run with the inputs; they ran a few minutes without failing (the calculation is too long to get any significant output, though 2.20 was writing it).
>
> 4.07 failed in the same way for 283 and stopped in the same way as 4.08.
>
> I ran each of these on 5 nodes comprising 24 cores each. I'd previously tested 4.08 with 5, 10, 20 and 40 nodes, with the same results.
>
> Many thanks,
> Anton
>
> On Thu, 9 May 2019 at 13:05, Anton Lopis <[log in to unmask]> wrote:
>
> Hi Ilian,
>
> Thanks for the reply; somehow I didn't see it until I searched today. Sorry for the delay.
>
> I am sending you a link to the input files.
>
> 1. I believe this is true; I will check to confirm.
>
> 2. version: 4.08 / March 2016
> compiler: gfortran v5.1.0
> MPI: v3.0
> MPI libs: Open MPI v1.8.8, package: Open MPI root@cnode0006 Distribution, ident: 1.8.8, repo rev: v1.8.7-20-g1d53995, Aug 05, 2015
>
> 3. I'm not sure if there's a specific "trace-back" method? The recommended value (283) and above are the ones where it fails; 282 did proceed further, giving output and stopping on the warnings/error.
>
> Much appreciated,
> Anton
>
> On Tue, 7 May 2019 at 13:14, Ilian Todorov - UKRI STFC <[log in to unmask]> wrote:
>
> Hi Anton,
>
> 1) Can you confirm that the jobs submitted to run with versions 4.07 and 4.08 use the same input (processor count and CONTROL) as those that run successfully with versions 2.20 and 3.09?
>
> 2) Can you confirm which version of 4.08 you have used for this?
>
> 3) Have you tried to trace back the problem when going to the recommended (still an estimate) densvar?
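> (As an aside: if a trace-back is hard to obtain, one untested suggestion for a gfortran build is to recompile with debugging options in the Makefile, e.g.
>
> FFLAGS = -O0 -g -fbacktrace -fcheck=bounds
>
> so that a failure prints a source-level backtrace and out-of-bounds array accesses are caught at the offending line.)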
> Do send me a link with the input files so I can try to investigate.
>
> Regards,
>
> Ilian Todorov
>
> From: DL_POLY Mailing List <[log in to unmask]> On Behalf Of Anton Lopis
> Sent: 07 May 2019 11:16
> To: [log in to unmask]
> Subject: Empty OUTPUT for higher values of denvar
>
> Hi All,
>
> I'm trying to assist one of our users with scaling calculations and with moving from version 2 to version 4. Her inputs work on versions 2.20 and 3.09, but fail on 4.07 and 4.08. She had used densvar=600, but I need to drop the value significantly in order to see any output.
>
> The code (4.08) recommends using 283; however, if I use 282 I get the final part of the output listed below. For 283 and above (I've tried 284 and others) the code starts running and dies without writing anything into the OUTPUT (the std error and std out seem unhelpful) - in this case it seems to take a few seconds longer before the job stops than when I use 282.
>
> I can provide more info if needed, including the input files (tarred and zipped, via Google Drive perhaps). The system contains 717703 atoms, using Buckingham, core and Coulombic interactions.
>
> Please let me know what you think or what you need to know from me.
>
> Much appreciated,
> Anton
>
> I/O read method: parallel by using MPI-I/O (assumed)
> I/O readers (assumed) 15
> I/O read batch size (assumed) 2000000
> I/O read buffer size (assumed) 20000
> I/O parallel read error checking off (assumed)
>
> I/O write method: parallel by using MPI-I/O (assumed)
> I/O write type: data sorting on (assumed)
> I/O writers (assumed) 60
> I/O write batch size (assumed) 2000000
> I/O write buffer size (assumed) 20000
> I/O parallel write error checking off (assumed)
>
> node/domain decomposition (x,y,z): 4 5 6
>
> pure cutoff driven limit on largest possible decomposition: 117649 nodes/domains (49,49,49)
>
> pure cutoff driven limit on largest balanced decomposition: 13824 nodes/domains (24,24,24)
>
> cutoffs driven limit on largest possible decomposition: 103823 nodes/domains (47,47,47)
>
> cutoffs driven limit on largest balanced decomposition: 12167 nodes/domains (23,23,23)
>
> link-cell decomposition 1 (x,y,z): 11 9 7
>
> *** warning - next error due to maximum number of atoms per domain set to : 90308
> *** but maximum & minumum numbers of atoms per domain asked for : 90454 & 0
> *** estimated denvar value for passing this stage safely is : 283
>
> DL_POLY_4 terminated due to error 45
>
> error - too many atoms in CONFIG file or per domain
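> P.S. For reference, densvar here is the CONTROL-file directive we have been adjusting; the relevant CONTROL line is just, e.g. (282 being the largest value that still produced OUTPUT for us):
>
> densvar 282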
--
Anton Lopis
CHPC
021 658 2746 (W)
072 461 3794 (Cell)
021 658 2746 (Fax)

########################################################################

To unsubscribe from the DLPOLY list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=DLPOLY&A=1