[Wrfems] EMS V3 run errors for historical simulation

Robert Rozumalski rozumal at ucar.edu
Tue Sep 22 08:20:49 MDT 2009



Steve,

V3 uses mpich2, which is very sensitive to the network configuration of
your cluster.

Can you run net_check as suggested by the error message and send me the
results?  It is a good starting point.
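
If net_check comes back clean, a couple of manual checks along the same
lines may still be worth doing.  A quick sketch, using the node names
from your error output:

   %  ssh bock hostname        # repeat for marzen, pilsner, and porter;
                               # each node should answer with its own
                               # name and without a password prompt,
                               # since the mpd ring needs passwordless SSH

   %  grep -E 'bock|marzen|pilsner|porter' /etc/hosts
                               # run this on every node; all nodes must
                               # agree on the name-to-IP mapping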

There may be other potential issues that we can resolve.  EMS V3.1 will
have a more robust mpich2 troubleshooting routine, although I suspect
there will still be issues to work out.

There should be an mpich2 installer's guide in the wrfems/docs directory.
It may provide some guidance.
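
If the guide does not get you there, you can also exercise the mpd ring
by hand with the standard mpich2 tools.  A rough sketch, assuming the
mpich2 bin directory is on your PATH and that you create an mpd.hosts
file listing your four node names, one per line:

   %  mpdboot -n 4 -f mpd.hosts   # start an mpd on all four nodes
   %  mpdtrace                    # should list porter, bock, pilsner,
                                  # and marzen
   %  mpdringtest 100             # time a message 100 trips around the ring
   %  mpdallexit                  # shut the ring back down

If mpdboot hangs or mpdtrace is missing a node, the problem is in the
network or SSH setup rather than in the WRF run itself.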

Bob

Stephen Keighton wrote:
> This is mainly for Bob since he's been helping me with the ems_prep 
> part of this historical run I'm trying (1969, with nnrp initialization 
> data sets), but I thought I'd include the whole list in case others 
> have run into this before.
>
> I've been successful with ems_prep, setting up 3 nests (75 km, 15 km, 
> and 3 km), but now that I'm trying the actual run I'm getting an error 
> that seems to be related to processing on the particular cluster we're 
> running on (4 workstations, each with 8 CPUs; in run_ncpus.conf we're 
> set up to use 29 of the processors, exactly the same as our 
> operational runs, which ran fine last night).
>
> Anyway, I first tried the full run with all 3 domains (ems_run 
> --domains 2,3) and got the error shown below right off the bat, just 
> after it successfully created the initial and boundary conditions (the 
> output below only includes section II).  I have tried the net_check 
> command several times and all of its output seems fine.  In case I was 
> asking our cluster to bite off more than it could chew right away, I 
> then tried a simple run of just the outer domain for 6 hrs (ems_run 
> --length 06h) and got the exact same error, so it's obviously not 
> getting very far.
>
> Any ideas for what I should try next would be greatly appreciated!!
>
> Steve @RNK
>
> -----------------------------------
>
> II.  Running ARW WRF while thinking happy thoughts
>
>          *  Starting Message Passing Daemon (MPD) ring for multi-node 
> execution
>
>
>          *  The WRF ARW core will be run on the following systems and 
> processors:
>
>             >  7  processors on porter
>             >  7  processors on bock
>             >  7  processors on pilsner
>             >  8  processors on marzen
>
>          *  Simulation output file frequency (minutes):
>
>               Domain       wrfout       sfcout
>                 01           180          30
>
>
>          *  Starting the Model Simulation with Enthusiasm!
>
>               You can sing along with the progress of the model while 
> watching:
>
>                 %  tail -f /wrf/wrfems/runs/arw_wrf_Camille/rsl.out.0000
>
>               Unless you have something better to do with your time.
>
>
>            EMS ERROR  : Your WRF model run (PID 621) returned an exit 
> status of 35072, which is never good.
>
>
>             >  Any available output files moved to 
> /wrf/wrfems/runs/arw_wrf_Camille/wrfprd
>
>
>            !  SUGGESTION: You might try running:
>
>                   % /wrf/wrfems/strc/net_check bock marzen pilsner porter
>
>               And check for conflicting IP addresses or SSH 
> configuration problems.
>
>
>            *  Bring down MPD ring on porter
>
>
>          Let's try this simulation again sometime soon. Success is 
> only a few keystrokes away.
>
>
>   WRF EMS Program ems_run failure (3) at Tue Sep 22 13:54:36 2009 UTC
>
>
>
>

-- 
Robert A. Rozumalski, PhD
NWS National SOO Science and Training Resource Coordinator

COMET/UCAR PO Box 3000   Phone:  303.497.8356
Boulder, CO 80307-3000




