smp client or core exhibiting dementia

Gemini Cricket
Posts: 11
Joined: Sat Dec 08, 2007 5:41 am
Location: Tsuchiura, Japan


Post by Gemini Cricket » Mon Apr 13, 2009 9:13 am

Either the client or the core on my machine seems to be forgetting what it's doing. It finishes a unit, states that it is closing down the cores, and then just sits there. Eventually, it tries to transmit all finished work units, but claims to find none. So the first time, processing Project: 2669 (Run 14, Clone 8, Gen 41), it looked like this:

[11:46:09] Completed 245009 out of 249999 steps (98%)
[12:00:37] Completed 247509 out of 249999 steps (99%)
[12:15:02] Completed 249999 out of 249999 steps (100%)
[12:16:05]
[12:16:05] Finished Work Unit:
[12:16:05] - Reading up to 17603880 from "work/wudata_03.trr": Read 17603880
[12:16:05] trr file hash check passed.
[12:16:05] - Reading up to 4397324 from "work/wudata_03.xtc": Read 4397324
[12:16:05] xtc file hash check passed.
[12:16:05] edr file hash check passed.
[12:16:05] logfile size: 182428
[12:16:05] Leaving Run
[12:16:05] - Writing 22436816 bytes of core data to disk...
[12:16:06] ... Done.
[12:16:15] - Shutting down core
[17:11:08] - Autosending finished units...
[17:11:08] Trying to send all finished work units
[17:11:08] + No unsent completed units remaining.
[17:11:08] - Autosend completed


After the first time, I stopped the console, and started it again, at which point it started the very same WU from 0%. Then it occurred to me that I ought to upgrade the client to 6.24, so I did that, but when it finished the unit for the second time, it did mostly the same thing as the first time, except that it complained the checksum failed to match:

[00:17:00] Completed 247509 out of 249999 steps (99%)
[00:31:38] Completed 249999 out of 249999 steps (100%)
[00:32:07] - Autosending finished units... [April 10 00:32:07 UTC]
[00:32:07] Trying to send all finished work units
[00:32:07] + No unsent completed units remaining.
[00:32:07] - Autosend completed
[00:32:40]
[00:32:40] Finished Work Unit:
[00:32:40] - Reading up to 17603880 from "work/wudata_03.trr": Read 17603880
[00:32:41] trr file hash check passed.
[00:32:41] - Reading up to 4400308 from "work/wudata_03.xtc": Read 4400308
[00:32:41] xtc file hash check passed.
[00:32:41] - Checksum of file (work/wudata_03.edr) read from disk doesn't match
[06:32:06] - Autosending finished units... [April 10 06:32:06 UTC]
[06:32:06] Trying to send all finished work units
[06:32:06] + No unsent completed units remaining.
[06:32:06] - Autosend completed


Then I waited a day before restarting the console, and threw out the work folder. It picked up a new WU (Project: 2675 (Run 1, Clone 175, Gen 100)), but the end was the same absent-minded wool-gathering:

Writing checkpoint, step 25249850 at Sun Apr 12 06:55:42 2009
[21:56:34] Completed 250000 out of 250000 steps (100%)

Writing checkpoint, step 25250000 at Sun Apr 12 06:56:35 2009

Writing final coordinates.

Average load imbalance: 5.3
Part of the total run time spent waiting due to load imbalance: 3.7
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: Z 0

Parallel run - timing based on wallclock.
               NODE (s)   Real (s)      (%)
       Time:  88258.000  88258.000    100.0
                       1d00h30:58
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     40.480      6.164      0.489     49.032

gcq#59582: Thanx for Using GROMACS - Have a Nice Day

[21:57:37]
[21:57:37] Finished Work Unit:
[21:57:37] - Reading up to 21144528 from "work/wudata_04.trr": Read 21144528
[21:57:37] trr file hash check passed.
[21:57:37] - Reading up to 4531572 from "work/wudata_04.xtc": Read 4531572
[21:57:37] xtc file hash check passed.
[21:57:37] edr file hash check passed.
[21:57:37] logfile size: 181688
[21:57:37] Leaving Run
[21:57:38] - Writing 26109532 bytes of core data to disk...
[21:57:39] ... Done.
[21:57:48] - Shutting down core
[03:24:25] - Autosending finished units... [April 12 03:24:25 UTC]
[03:24:25] Trying to send all finished work units
[03:24:25] + No unsent completed units remaining.
[03:24:25] - Autosend completed
[02:13:52] - Autosending finished units... [April 13 02:13:52 UTC]
[02:13:52] Trying to send all finished work units
[02:13:52] + No unsent completed units remaining.
[02:13:52] - Autosend completed



So replacing the client and discarding the work folder do not fix the problem. What do I need to try next?
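The symptom in the logs above is a long silence after "Shutting down core": nearly five hours pass before the next line, which is only the routine autosend. A minimal sketch of how one might spot that gap automatically is below. It assumes only the `[HH:MM:SS]` timestamp prefix visible in the excerpts; the function name and the 10-minute threshold are illustrative choices, not anything the client provides.

```python
import re
from typing import List, Optional

# Matches the "[HH:MM:SS]" prefix used in the FAH log excerpts above.
STAMP_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def _seconds(line: str) -> Optional[int]:
    """Return seconds since midnight for a timestamped line, else None."""
    m = STAMP_RE.match(line.lstrip())
    if not m:
        return None
    h, mi, s = (int(g) for g in m.groups())
    return h * 3600 + mi * 60 + s


def stalls_after_shutdown(log_text: str, threshold_s: int = 600) -> List[int]:
    """Return the gap (in seconds) after each "Shutting down core" line
    whenever the next timestamped line arrives more than threshold_s later.

    The log carries no date, so the gap is taken modulo 24 hours; a stall
    longer than a day would therefore be under-reported.
    """
    gaps = []
    shutdown_at = None
    for line in log_text.splitlines():
        t = _seconds(line)
        if t is None:
            continue  # untimestamped lines (e.g. GROMACS output) are skipped
        if shutdown_at is not None:
            gap = (t - shutdown_at) % 86400
            if gap > threshold_s:
                gaps.append(gap)
            shutdown_at = None
        if "Shutting down core" in line:
            shutdown_at = t
    return gaps
```

Fed the first excerpt, this reports a single gap of 17693 seconds (12:16:15 to 17:11:08). A long gap alone does not prove the result was lost, but combined with "No unsent completed units remaining" it matches what is described here.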

toTOW
Site Moderator
Posts: 8717
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: smp client or core exhibiting dementia

Post by toTOW » Mon Apr 13, 2009 10:07 am

What version of the core are you running? (Make sure you have the latest one: 2.06.)

When this occurs on Windows or Linux, here is what we do (it should work on OSX too):

- Stop the client and make sure all the FahCore processes have stopped cleanly (wait a bit if they haven't, or kill them).
- Run qfix (it will re-queue the results for upload). You can grab it here: http://linuxminded.nl/?target=software- ... s.plc#qfix
- Restart the client. It should upload the re-queued results.
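The first step above, confirming that no core process is still alive before running qfix, can be sketched as follows. This is only an illustration: the function names are hypothetical, and matching on the substring "fahcore" in the command name is an assumption based on the core file names mentioned in this thread (FAHCore_a1, FAHCore_a2).

```python
import subprocess
from typing import List


def fahcore_pids(ps_output: str) -> List[int]:
    """Extract PIDs of processes whose command contains "fahcore".

    Expects lines of the form "<pid> <command>", as produced by
    `ps -ax -o pid= -o comm=`. The match is case-insensitive.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and "fahcore" in parts[1].lower():
            pids.append(int(parts[0]))
    return pids


def running_fahcore_pids() -> List[int]:
    """Ask ps for all processes and return the PIDs of any FahCore ones."""
    result = subprocess.run(
        ["ps", "-ax", "-o", "pid=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return fahcore_pids(result.stdout)
```

If `running_fahcore_pids()` returns an empty list after the client has been stopped, it should be safe to move on to qfix; otherwise wait, or kill the listed PIDs as described above.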
Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.

FAH-Addict : latest news, tests and reviews about Folding@Home project.


susato
Site Moderator
Posts: 944
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: smp client or core exhibiting dementia

Post by susato » Mon Apr 13, 2009 12:51 pm

Use these directions for running qfix to repair your queue and re-upload the results. It's not as simple as uncle_fungus's brief description implies, but if you follow the recipe exactly, your client will recover its wits and send the work units home properly. :)

viewtopic.php?f=6&t=8200

Gemini Cricket
Posts: 11
Joined: Sat Dec 08, 2007 5:41 am
Location: Tsuchiura, Japan

Re: smp client or core exhibiting dementia

Post by Gemini Cricket » Tue Apr 14, 2009 5:22 am

Thank you, toTOW and susato. Without the thread you cited, susato, I wouldn't have figured out how to install and run qfix. It ran almost verbatim as shown in the example in that thread. The unit had expired, but at least now I am up and running again.

As for what version of the core I use, I've no idea. My FAH folder contains 2 separate cores, FAHCore_a1 and FAHCore_a2. Get Info doesn't tell me what version of the cores I have, though it does state that they were created and modified Oct. 30 2008. I generally let the client worry about the cores, so if there is anything I need to do, it has to be made explicit.

MtM
Posts: 3054
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: smp client or core exhibiting dementia

Post by MtM » Tue Apr 14, 2009 5:33 am

Add -verbosity 9 to your client parameters; it will make the client write more info to the log. If you already had it set, the version of the core used should be in the log. If it is not in fahlog, open the work folder and look for a file log_0x.txt, where x is the queue entry in use. One of the first lines in that file will tell you the core version used.

susato
Site Moderator
Posts: 944
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: smp client or core exhibiting dementia

Post by susato » Tue Apr 14, 2009 5:39 am

You can find your client and core versions in the log.

For the client, upon startup you'll see the client version in between the lines of stars in the intro, for instance:
# Mac OS X SMP Console Edition ################################################
###############################################################################

Folding@Home Client Version 6.20

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/susato/Library/InCrease/unit2


For the core, at the very start of each work unit you'll see lines like:
[21:17:40]
[21:17:40] *------------------------------*
[21:17:40] Folding@Home Gromacs SMP Core
[21:17:40] Version 2.01 (Mon Mar 30 18:46:18 PDT 2009)
[21:17:40]
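If scrolling through a long log is tedious, the lookup can be scripted. The sketch below is an illustration only: it assumes the banner text and "Version x.yy" line look exactly as in the excerpt above, and the function name is hypothetical. It returns the version from the most recent core banner in the text it is given.

```python
import re
from typing import Optional

# Banner line that precedes the core version in the log excerpt above.
CORE_BANNER = "Folding@Home Gromacs SMP Core"
VERSION_RE = re.compile(r"Version\s+(\d+\.\d+)")


def last_core_version(log_text: str) -> Optional[str]:
    """Return the core version from the most recent core banner, or None.

    Only the timestamped line directly after the banner is inspected, so
    the "Client Version" line in the startup header is never mistaken for
    the core version.
    """
    version = None
    expect_version = False
    for line in log_text.splitlines():
        if CORE_BANNER in line:
            expect_version = True
            continue
        if expect_version:
            m = VERSION_RE.search(line)
            if m:
                version = m.group(1)
            expect_version = False
    return version
```

For the excerpt above this yields "2.01". To use it on a real log, read the file's contents into a string and pass that in.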


If you trash your copy of FahCore_a2.exe, either during an a1 work unit, or between a2 work units, your client will download the newest version, which has better checkpointing management than earlier cores. Don't update in the middle of a work unit though, or the core-swap will trash the WU.

Gemini Cricket
Posts: 11
Joined: Sat Dec 08, 2007 5:41 am
Location: Tsuchiura, Japan

Re: smp client or core exhibiting dementia

Post by Gemini Cricket » Wed Apr 15, 2009 6:25 am

Thanks again, susato. I let the WU finish before discarding the cores, and ran into the exact same problem as before. But armed with qfix and the know-how to use it, I fixed the problem immediately and got that sucker submitted. Then I threw out the old cores, and am now folding with core version 2.06. Hopefully this will end the problems, but if not, I'll let you know.
