Several things are showing up in the (much larger) logs that raise questions for me:
On the WU Project: 2671 (Run 37, Clone 79, Gen 78):
[08:46:11] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=8Core.local
NNODES=4, MYRANK=1, HOSTNAME=8Core.local
NNODES=4, MYRANK=2, HOSTNAME=8Core.local
NNODES=4, MYRANK=3, HOSTNAME=8Core.local
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68
NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp
Making 1D domain decomposition 1 x 1 x 4
Warning: application requested read/write mode for file work/wudata_01.trr
Correct checkpointing is not guaranteed
Warning: application requested read/write mode for file work/wudata_01.xtc
Correct checkpointing is not guaranteed
Warning: application requested read/write mode for file work/wudata_01.edr
Correct checkpointing is not guaranteed
starting mdrun '22908 system in water'
19750002 steps, 39500.0 ps (continuing from step 19500002, 39000.0 ps).
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 19500002
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 19500002
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 19500002
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 19500002
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[08:46:42] Completed 0 out of 250000 steps (0%)
[08:46:46] CoreStatus = FF (255)
[08:46:46] Sending work to server
[08:46:46] Project: 2671 (Run 37, Clone 79, Gen 78)
[08:46:46] - Error: Could not get length of results file work/wuresults_01.dat
[08:46:46] - Error: Could not read unit 01 file. Removing from queue.
[08:46:46] Trying to send all finished work units
[08:46:46] + No unsent completed units remaining.
[08:46:46] - Preparing to get new work unit...
The client deleted the WU, downloaded it again, and crashed out the same way; repeat 2x.
On WU: Project: 2671 (Run 10, Clone 3, Gen 86)
[08:49:41] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=8Core.local
NNODES=4, MYRANK=1, HOSTNAME=8Core.local
NNODES=4, MYRANK=2, HOSTNAME=8Core.local
NNODES=4, MYRANK=3, HOSTNAME=8Core.local
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_04.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68
NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp
Making 1D domain decomposition 1 x 1 x 4
Warning: application requested read/write mode for file work/wudata_04.trr
Correct checkpointing is not guaranteed
Warning: application requested read/write mode for file work/wudata_04.xtc
Correct checkpointing is not guaranteed
Warning: application requested read/write mode for file work/wudata_04.edr
Correct checkpointing is not guaranteed
starting mdrun '22884 system in water'
21750000 steps, 43500.0 ps (continuing from step 21500000, 43000.0 ps).
[08:50:12] Completed 0 out of 250000 steps (0%)
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 21500000
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 21500000
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 4
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 21500000
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 2, will try to stop all the nodes
Halting parallel program mdrun on CPU 2 out of 4
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_2]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090605
Source code file: md.c, line: 2169
Fatal error:
NaN detected at step 21500000
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4
gcq#49: Thanx for Using GROMACS - Have a Nice Day
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
[08:50:16] CoreStatus = FF (255)
[08:50:16] Sending work to server
[08:50:16] Project: 2671 (Run 10, Clone 3, Gen 86)
[08:50:16] - Error: Could not get length of results file work/wuresults_04.dat
[08:50:16] - Error: Could not read unit 04 file. Removing from queue.
[08:50:16] Trying to send all finished work units
[08:50:16] + No unsent completed units remaining.
[08:50:16] - Preparing to get new work unit...
Rinse and repeat, 3x as well.
Then, on WU Project: 2671 (Run 41, Clone 59, Gen 95), it gets one that it can run, but with very verbose logs:
[08:53:12] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=8Core.local
NNODES=4, MYRANK=1, HOSTNAME=8Core.local
NNODES=4, MYRANK=2, HOSTNAME=8Core.local
NNODES=4, MYRANK=3, HOSTNAME=8Core.local
NODEID=0 argc=20
NODEID=1 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
Reading file work/wudata_07.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 68
NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp
Making 1D domain decomposition 1 x 1 x 4
Warning: application requested read/write mode for file work/wudata_07.trr
Correct checkpointing is not guaranteed
Warning: application requested read/write mode for file work/wudata_07.xtc
Correct checkpointing is not guaranteed
Warning: application requested read/write mode for file work/wudata_07.edr
Correct checkpointing is not guaranteed
starting mdrun '22866 system in water'
24000000 steps, 48000.0 ps (continuing from step 23750000, 47500.0 ps).
[08:53:21] Completed 0 out of 250000 steps (0%)
[08:59:51] Completed 2500 out of 250000 steps (1%)
[09:06:18] Completed 5000 out of 250000 steps (2%)
[09:12:46] Completed 7500 out of 250000 steps (3%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[09:13:22] fcCheckPointSave: saving tpr and cptfile hash:
[09:13:22] 0 2113020374 2068918989
[09:13:22] 1 2408439846 2318920377
[09:13:22] 2 1537993886 1129988290
[09:13:22] 3 4183718143 1583720376
[09:13:22] 4 4170727096 1353721852
[09:19:14] Completed 10000 out of 250000 steps (4%)
[09:25:42] Completed 12500 out of 250000 steps (5%)
[09:32:10] Completed 15000 out of 250000 steps (6%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[09:33:22] fcCheckPointSave: saving tpr and cptfile hash:
[09:33:22] 0 2113020374 1098334420
[09:33:22] 1 2408439846 937447099
[09:33:22] 2 1537993886 2631693828
[09:33:22] 3 4183718143 3816563843
[09:33:22] 4 4170727096 1859052616
[09:38:38] Completed 17500 out of 250000 steps (7%)
[09:45:07] Completed 20000 out of 250000 steps (8%)
[09:51:35] Completed 22500 out of 250000 steps (9%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[09:53:22] fcCheckPointSave: saving tpr and cptfile hash:
[09:53:22] 0 2113020374 2068982652
[09:53:22] 1 2408439846 534971218
[09:53:22] 2 1537993886 3971506044
[09:53:22] 3 4183718143 3580059372
[09:53:22] 4 4170727096 3432197256
[09:58:04] Completed 25000 out of 250000 steps (10%)
[10:04:32] Completed 27500 out of 250000 steps (11%)
[10:10:59] Completed 30000 out of 250000 steps (12%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[10:13:22] fcCheckPointSave: saving tpr and cptfile hash:
[10:13:22] 0 2113020374 2754091097
[10:13:22] 1 2408439846 2856738919
[10:13:22] 2 1537993886 3537454239
[10:13:22] 3 4183718143 3740550114
[10:13:22] 4 4170727096 278067141
[10:17:28] Completed 32500 out of 250000 steps (13%)
[10:23:57] Completed 35000 out of 250000 steps (14%)
[10:30:26] Completed 37500 out of 250000 steps (15%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[10:33:23] fcCheckPointSave: saving tpr and cptfile hash:
[10:33:23] 0 2113020374 4111435746
[10:33:23] 1 2408439846 307470833
[10:33:23] 2 1537993886 356647119
[10:33:23] 3 4183718143 1949743976
[10:33:23] 4 4170727096 2636963115
[10:36:54] Completed 40000 out of 250000 steps (16%)
[10:43:22] Completed 42500 out of 250000 steps (17%)
[10:49:50] Completed 45000 out of 250000 steps (18%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[10:53:22] fcCheckPointSave: saving tpr and cptfile hash:
[10:53:22] 0 2113020374 3759540007
[10:53:22] 1 2408439846 2769714553
[10:53:22] 2 1537993886 2683653438
[10:53:22] 3 4183718143 3392717342
[10:53:22] 4 4170727096 988234600
[10:56:18] Completed 47500 out of 250000 steps (19%)
[11:02:47] Completed 50000 out of 250000 steps (20%)
[11:09:15] Completed 52500 out of 250000 steps (21%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[11:13:21] fcCheckPointSave: saving tpr and cptfile hash:
[11:13:21] 0 2113020374 4209891572
[11:13:21] 1 2408439846 3618623598
[11:13:21] 2 1537993886 3892274183
[11:13:21] 3 4183718143 598154265
[11:13:21] 4 4170727096 298455184
[11:15:44] Completed 55000 out of 250000 steps (22%)
[11:22:13] Completed 57500 out of 250000 steps (23%)
[11:28:41] Completed 60000 out of 250000 steps (24%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[11:33:22] fcCheckPointSave: saving tpr and cptfile hash:
[11:33:22] 0 2113020374 4006563218
[11:33:22] 1 2408439846 2816648465
[11:33:22] 2 1537993886 2470483010
[11:33:22] 3 4183718143 461796351
[11:33:22] 4 4170727096 3957751879
[11:35:09] Completed 62500 out of 250000 steps (25%)
[11:41:38] Completed 65000 out of 250000 steps (26%)
[11:48:15] Completed 67500 out of 250000 steps (27%)
Warning: application requested read/write mode for file work/wudata_07.cpt
Correct checkpointing is not guaranteed
[11:53:23] fcCheckPointSave: saving tpr and cptfile hash:
[11:53:23] 0 2113020374 601019263
[11:53:23] 1 2408439846 3279998463
[11:53:23] 2 1537993886 4228915165
[11:53:23] 3 4183718143 3092282220
[11:53:23] 4 4170727096 1946760905
[11:54:52] Completed 70000 out of 250000 steps (28%)
[12:01:30] Completed 72500 out of 250000 steps (29%)
[12:08:08] Completed 75000 out of 250000 steps (30%)
These verbose logs are appearing on all the WUs run with the new A2 core (v2.10).
Any ideas whether these logs are something to be concerned about?
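In case it helps anyone triage a similarly long log, here is a rough Python sketch that counts the "NaN detected" errors per step and collects the CoreStatus codes and WU identifiers. The regular expressions are assumptions based only on the lines quoted above, and the function name summarize is made up for illustration; it is not part of the client or of GROMACS.

```python
import re
from collections import Counter

# These patterns are assumptions derived from the log lines quoted in this post.
WU_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
NAN_RE = re.compile(r"NaN detected at step (\d+)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize(log_text):
    """Return (nan_steps, core_statuses, work_units) found in log_text."""
    nan_steps = Counter()   # step number -> how many nodes reported NaN there
    core_statuses = []      # decimal CoreStatus values, in order of appearance
    work_units = []         # unique (project, run, clone, gen) tuples
    for line in log_text.splitlines():
        match = NAN_RE.search(line)
        if match:
            nan_steps[int(match.group(1))] += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            core_statuses.append(int(match.group(2)))
            continue
        match = WU_RE.search(line)
        if match:
            wu = tuple(int(g) for g in match.groups())
            if wu not in work_units:
                work_units.append(wu)
    return nan_steps, core_statuses, work_units
```

Feeding it the whole log file as one string (for example the contents of FAHlog.txt read with open(...).read()) should show at a glance which steps produced NaN errors and how many nodes reported each one.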