6.24.1 stuck in loop with errors [fixed]

cameraready
Posts: 1
Joined: Mon Jul 20, 2009 3:56 pm

6.24.1 stuck in loop with errors [fixed]

Post by cameraready » Mon Jul 20, 2009 4:11 pm

I'm running the 6.24.1 beta with the -smp and -local flags, and it seems to be stuck in a loop, throwing errors when starting a new unit. The system had been running for a few weeks without problems when we had a power outage. I didn't notice for a couple of days, so by the time I restarted it the current work unit had expired. Now whenever I start the client it gets stuck in an infinite loop, generating errors while trying to start a new WU. Here's a copy of the log.

Code: Select all

dellp3art:Folding@home joej$ ./fah6 -smp -local
Using local directory for configuration

Note: Please read the license agreement (fah6 -license). Further
use of this software requires that you have read and accepted this agreement.

Using local directory for work files
2 cores detected


--- Opening Log file [July 20 15:18:48 UTC]


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.24R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/joej/Library/Folding@home
Executable: ./fah6
Arguments: -smp -local

[15:18:48] - Ask before connecting: No
[15:18:48] - User name: cameraready (Team 50975)
[15:18:48] - User ID: 6BB006B52AF030AE
[15:18:48] - Machine ID: 1
[15:18:48]
[15:18:48] Could not open work queue, generating new queue...
[15:18:48] - Preparing to get new work unit...
[15:18:48] Cleaning up work directory
[15:18:48] + Attempting to get work packet
[15:18:48] - Connecting to assignment server
[15:18:51] - Successful: assigned to (171.64.65.56).
[15:18:51] + News From Folding@Home: Welcome to Folding@Home
[15:18:51] Loaded queue successfully.
[15:19:53] + Closed connections
[15:19:53]
[15:19:53] + Processing work unit
[15:19:53] At least 4 processors must be requested; read 1.
[15:19:53] Core required: FahCore_a2.exe
[15:19:53] Core found.
[15:19:53] Working on queue slot 01 [July 20 15:19:53 UTC]
[15:19:53] + Working ...
[15:19:53]
[15:19:53] *------------------------------*
[15:19:53] Folding@Home Gromacs SMP Core
[15:19:53] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[15:19:53]
[15:19:53] Preparing to commence simulation
[15:19:53] - Ensuring status. Please wait.
[15:20:02] - Looking at optimizations...
[15:20:02] - Working with standard loops on this execution.
[15:20:02] - Files status OK
[15:20:05] - Expanded 4825524 -> 24050909 (decompressed 498.4 percent)
[15:20:05] Called DecompressByteArray: compressed_data_size=4825524 data_size=24050909, decompressed_data_size=24050909 diff=0
[15:20:05] - Digital signature verified
[15:20:05]
[15:20:05] Project: 2677 (Run 14, Clone 45, Gen 25)
[15:20:05]
[15:20:06] Entering M.D.
NNODES=4, MYRANK=2, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=1, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=3, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=0, HOSTNAME=dellp3art.prentice2.local
[cli_0]: aborting job:
Fatal error in MPI_Bcast: Error message texts are not available
[0]0:Return code = 1
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[15:20:18] CoreStatus = 1 (1)
[15:20:18] Sending work to server
[15:20:18] Project: 2677 (Run 14, Clone 45, Gen 25)
[15:20:18] - Error: Could not get length of results file work/wuresults_01.dat
[15:20:18] - Error: Could not read unit 01 file. Removing from queue.
[15:20:18] - Preparing to get new work unit...
[15:20:18] Cleaning up work directory
[15:20:18] + Attempting to get work packet
[15:20:18] - Connecting to assignment server
[15:20:21] - Successful: assigned to (171.64.65.56).
[15:20:21] + News From Folding@Home: Welcome to Folding@Home
[15:20:21] Loaded queue successfully.
[15:21:48] + Closed connections
[15:21:53]
[15:21:53] + Processing work unit
[15:21:53] At least 4 processors must be requested; read 1.
[15:21:53] Core required: FahCore_a2.exe
[15:21:53] Core found.
[15:21:53] Working on queue slot 02 [July 20 15:21:53 UTC]
[15:21:53] + Working ...
[15:21:53]
[15:21:53] *------------------------------*
[15:21:53] Folding@Home Gromacs SMP Core
[15:21:53] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[15:21:53]
[15:21:53] Preparing to commence simulation
[15:21:53] - Ensuring status. Please wait.
[15:22:03] - Looking at optimizations...
[15:22:03] - Working with standard loops on this execution.
[15:22:03] - Files status OK
[15:22:05] - Expanded 4825524 -> 24050909 (decompressed 498.4 percent)
[15:22:06] Called DecompressByteArray: compressed_data_size=4825524 data_size=24050909, decompressed_data_size=24050909 diff=0
[15:22:06] - Digital signature verified
[15:22:06]
[15:22:06] Project: 2677 (Run 14, Clone 45, Gen 25)
[15:22:06]
[15:22:06] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=1, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=2, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=3, HOSTNAME=dellp3art.prentice2.local
[cli_0]: aborting job:
Fatal error in MPI_Bcast: Error message texts are not available
[0]0:Return code = 1
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[15:22:19] CoreStatus = 1 (1)
[15:22:19] Sending work to server
[15:22:19] Project: 2677 (Run 14, Clone 45, Gen 25)
[15:22:19] - Error: Could not get length of results file work/wuresults_02.dat
[15:22:19] - Error: Could not read unit 02 file. Removing from queue.
[15:22:19] - Preparing to get new work unit...
[15:22:19] Cleaning up work directory
[15:22:19] + Attempting to get work packet
[15:22:19] - Connecting to assignment server
[15:22:21] - Successful: assigned to (171.64.65.56).
[15:22:21] + News From Folding@Home: Welcome to Folding@Home
[15:22:22] Loaded queue successfully.
[15:23:26] + Closed connections
[15:23:31]
[15:23:31] + Processing work unit
[15:23:31] At least 4 processors must be requested; read 1.
[15:23:31] Core required: FahCore_a2.exe
[15:23:31] Core found.
[15:23:31] Working on queue slot 03 [July 20 15:23:31 UTC]
[15:23:31] + Working ...
[15:23:31]
[15:23:31] *------------------------------*
[15:23:31] Folding@Home Gromacs SMP Core
[15:23:31] Version 2.07 (Sun Apr 19 14:29:51 PDT 2009)
[15:23:31]
[15:23:31] Preparing to commence simulation
[15:23:31] - Ensuring status. Please wait.
[15:23:40] - Looking at optimizations...
[15:23:40] - Working with standard loops on this execution.
[15:23:40] - Files status OK
[15:23:43] - Expanded 4825524 -> 24050909 (decompressed 498.4 percent)
[15:23:43] Called DecompressByteArray: compressed_data_size=4825524 data_size=24050909, decompressed_data_size=24050909 diff=0
[15:23:44] - Digital signature verified
[15:23:44]
[15:23:44] Project: 2677 (Run 14, Clone 45, Gen 25)
[15:23:44]
[15:23:44] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=1, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=2, HOSTNAME=dellp3art.prentice2.local
NNODES=4, MYRANK=3, HOSTNAME=dellp3art.prentice2.local
[cli_0]: aborting job:
Fatal error in MPI_Bcast: Error message texts are not available
[0]0:Return code = 1
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Quit
[15:23:56] CoreStatus = 1 (1)
[15:23:56] Sending work to server
[15:23:56] Project: 2677 (Run 14, Clone 45, Gen 25)
[15:23:56] - Error: Could not get length of results file work/wuresults_03.dat
[15:23:56] - Error: Could not read unit 03 file. Removing from queue.
[15:23:56] - Preparing to get new work unit...
[15:23:56] Cleaning up work directory
[15:23:56] + Attempting to get work packet
[15:23:56] - Connecting to assignment server
[15:23:59] - Successful: assigned to (171.64.65.56).
[15:23:59] + News From Folding@Home: Welcome to Folding@Home
[15:23:59] Loaded queue successfully.


I tried deleting everything in the queue to see if that would help, but it didn't. The client downloads new work units, but it generates "Error: Could not get length of results file work/wuresults_03.dat" and "Error: Could not read unit 03 file. Removing from queue." every time.
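
In case it's useful to anyone, this is roughly what I ran to clear the queue (stop the client first; I'm going from memory, so treat the exact file names as my assumption):

Code: Select all

cd ~/Library/Folding@home   # launch directory from the log above
rm queue.dat                # the client's work queue file
rm -rf work/                # downloaded work unit data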

Fixed it. After some searching on these forums, I realized the hostname had changed when the system was restarted, which seemed to cause an issue with MPI. Once I corrected the hostname, it started working again. That's what I get for using a Windows DHCP server with a Mac. :roll:
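
For anyone who hits the same thing, checking and fixing the hostname from Terminal looks roughly like this (the name below is just my machine's, per the log; substitute your own):

Code: Select all

# Show what the system currently thinks its name is
hostname
scutil --get HostName

# Set it back to the expected name (mine was dellp3art.prentice2.local)
sudo scutil --set HostName dellp3art.prentice2.local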

bruce
Posts: 22470
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 6.24.1 stuck in loop with errors [fixed]

Post by bruce » Mon Jul 20, 2009 9:08 pm

For the MPI clients, I suggest you configure a fixed IP address within your network rather than using DHCP.
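
For example, on Mac OS X a fixed address can be set from Terminal with networksetup; the service name and addresses below are placeholders, so substitute values valid on your own network:

Code: Select all

# Find the exact name of your network service (e.g. "Ethernet")
networksetup -listallnetworkservices

# Assign a fixed address: service, IP, subnet mask, router
sudo networksetup -setmanual "Ethernet" 192.168.1.50 255.255.255.0 192.168.1.1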
