[11:46:09] Completed 245009 out of 249999 steps (98%)
[12:00:37] Completed 247509 out of 249999 steps (99%)
[12:15:02] Completed 249999 out of 249999 steps (100%)
[12:16:05]
[12:16:05] Finished Work Unit:
[12:16:05] - Reading up to 17603880 from "work/wudata_03.trr": Read 17603880
[12:16:05] trr file hash check passed.
[12:16:05] - Reading up to 4397324 from "work/wudata_03.xtc": Read 4397324
[12:16:05] xtc file hash check passed.
[12:16:05] edr file hash check passed.
[12:16:05] logfile size: 182428
[12:16:05] Leaving Run
[12:16:05] - Writing 22436816 bytes of core data to disk...
[12:16:06] ... Done.
[12:16:15] - Shutting down core
[17:11:08] - Autosending finished units...
[17:11:08] Trying to send all finished work units
[17:11:08] + No unsent completed units remaining.
[17:11:08] - Autosend completed
After the first time, I stopped the console and started it again, at which point it restarted the very same WU from 0%. It then occurred to me that I ought to upgrade the client to 6.24, so I did. When it finished the unit for the second time, it did mostly the same thing as the first time, except that it complained that the checksum of the edr file did not match:
[00:17:00] Completed 247509 out of 249999 steps (99%)
[00:31:38] Completed 249999 out of 249999 steps (100%)
[00:32:07] - Autosending finished units... [April 10 00:32:07 UTC]
[00:32:07] Trying to send all finished work units
[00:32:07] + No unsent completed units remaining.
[00:32:07] - Autosend completed
[00:32:40]
[00:32:40] Finished Work Unit:
[00:32:40] - Reading up to 17603880 from "work/wudata_03.trr": Read 17603880
[00:32:41] trr file hash check passed.
[00:32:41] - Reading up to 4400308 from "work/wudata_03.xtc": Read 4400308
[00:32:41] xtc file hash check passed.
[00:32:41] - Checksum of file (work/wudata_03.edr) read from disk doesn't match
[06:32:06] - Autosending finished units... [April 10 06:32:06 UTC]
[06:32:06] Trying to send all finished work units
[06:32:06] + No unsent completed units remaining.
[06:32:06] - Autosend completed
Then I waited a day before restarting the console, and threw out the work folder. It picked up a new WU (Project: 2675 (Run 1, Clone 175, Gen 100)), but the end was the same absent-minded wool-gathering:
Writing checkpoint, step 25249850 at Sun Apr 12 06:55:42 2009
[21:56:34] Completed 250000 out of 250000 steps (100%)
Writing checkpoint, step 25250000 at Sun Apr 12 06:56:35 2009
Writing final coordinates.
Average load imbalance: 5.3
Part of the total run time spent waiting due to load imbalance: 3.7
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0
Parallel run - timing based on wallclock.
               NODE (s)   Real (s)      (%)
       Time:  88258.000  88258.000    100.0
                       1d00h30:58
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     40.480      6.164      0.489     49.032
gcq#59582: Thanx for Using GROMACS - Have a Nice Day
[21:57:37]
[21:57:37] Finished Work Unit:
[21:57:37] - Reading up to 21144528 from "work/wudata_04.trr": Read 21144528
[21:57:37] trr file hash check passed.
[21:57:37] - Reading up to 4531572 from "work/wudata_04.xtc": Read 4531572
[21:57:37] xtc file hash check passed.
[21:57:37] edr file hash check passed.
[21:57:37] logfile size: 181688
[21:57:37] Leaving Run
[21:57:38] - Writing 26109532 bytes of core data to disk...
[21:57:39] ... Done.
[21:57:48] - Shutting down core
[03:24:25] - Autosending finished units... [April 12 03:24:25 UTC]
[03:24:25] Trying to send all finished work units
[03:24:25] + No unsent completed units remaining.
[03:24:25] - Autosend completed
[02:13:52] - Autosending finished units... [April 13 02:13:52 UTC]
[02:13:52] Trying to send all finished work units
[02:13:52] + No unsent completed units remaining.
[02:13:52] - Autosend completed
So neither replacing the client nor discarding the work folder fixes the problem: the core finishes the unit, but the client then reports that there are no completed units to send. What should I try next?
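In case it helps anyone compare the three runs, here is a minimal Python sketch that pulls the relevant facts out of the client log. It is only an illustration: the function name `summarize_finished_units` and the dictionary keys are my own invention, not part of the client, and it matches nothing but the message strings that appear in the logs above.

```python
import re

# Matches the "[HH:MM:SS] " prefix the client puts on each of its log lines.
STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s?(.*)$")

def summarize_finished_units(lines):
    """Return one dict per 'Finished Work Unit:' block found in the log lines."""
    units = []
    current = None
    for raw in lines:
        m = STAMP.match(raw.strip())
        if not m:
            continue  # GROMACS output and prose carry no client timestamp
        stamp, text = m.groups()
        if text.startswith("Finished Work Unit:"):
            current = {
                "finished_at": stamp,
                "hash_failures": [],
                "core_shut_down": False,
                "nothing_to_send": False,
            }
            units.append(current)
        elif current is None:
            continue  # ignore everything before the first finished unit
        elif "doesn't match" in text:
            current["hash_failures"].append(text.lstrip("- "))
        elif "Shutting down core" in text:
            current["core_shut_down"] = True
        elif "No unsent completed units remaining" in text:
            current["nothing_to_send"] = True
    return units
```

Feeding it the lines of the log (for example via `open(path).read().splitlines()`, with whatever path your log lives at) gives one entry per finished unit, so a unit that ends with `nothing_to_send` set is an instance of the symptom described here.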