[wrfems] MPICH timeout of metgrid...

Brett McDonald brett.mcdonald at noaa.gov
Tue Apr 10 09:52:37 MDT 2012


Hi All:

Twice a day I run a WRF NMM simulation that should go out 72 hours.  
Lately the run either crashes during the initial processing or only 
goes out some number of hours short of 72.  This happened again today.  
On the first attempt, the horizontal interpolation of the files to the 
domain failed even though all of the NAM files appeared to come in just 
fine.  In section IV of ems_autoruns.log, I got a "MPICH Timeout of 
metgrid after 299 seconds..." message, and later on the creation of the 
WRF initial and boundary conditions failed.

On the second attempt (see attached log file), the "MPICH Timeout..." 
message also appeared.  The model is continuing, but it states that it 
will only go out 27 hours instead of 72.

This seems to occur only on my daytime runs, when I'm doing other things 
on the machine.  Is the MPICH timeout occurring because that process 
takes longer to run when the machine is busier?  If so, can I increase 
the timeout limit for this portion of the processing?
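
Until the timeout question is settled, a shortened run can at least be 
caught automatically.  The following is a minimal sketch, not part of 
WRF EMS: it scans log text for the two messages seen in the attached 
log.  The function name check_autorun_log and the expected_hours 
parameter are hypothetical, and the patterns assume the log wording 
stays exactly as it appears in the attachment.

```python
import re

# Patterns copied from the wording in the attached ems_autoruns.log excerpt.
# Assumption: other EMS versions may word these messages differently.
TIMEOUT_RE = re.compile(r"MPICH Timeout of (\w+) after (\d+) seconds")
LENGTH_RE = re.compile(r"Simulation length will be (\d+) hours")


def check_autorun_log(text, expected_hours=72):
    """Return a list of problems found in the log text (empty if none)."""
    problems = []
    # Report every program that hit the MPICH timeout.
    for match in TIMEOUT_RE.finditer(text):
        problems.append(
            f"{match.group(1)} timed out after {match.group(2)} seconds"
        )
    # Report any simulation length shorter than the requested one.
    for match in LENGTH_RE.finditer(text):
        hours = int(match.group(1))
        if hours < expected_hours:
            problems.append(
                f"simulation length is {hours} hours, expected {expected_hours}"
            )
    return problems


# Two lines taken from the attached log, used here as sample input.
sample = """
         !  MPICH Timeout of metgrid after 299 seconds - continuing with some trepidation
         *  Simulation length will be 27 hours
"""

for problem in check_autorun_log(sample):
    print(problem)
```

In practice the text would come from reading ems_autoruns.log after each 
run.  This only detects the symptom; it does not change the timeout 
itself.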

Thanks,

Brett McDonald
RIW WY WFO - SOO

-------------- next part --------------

     AUTORUN: Domain  to be included in the simulation           : 1
     AUTORUN: Domains to be processed concurrently with autopost : 1

     WRF EMS Program ems_prep started on riw-lw-sac at Tue Apr 10 15:10:00 2012 UTC

                The WRF EMS Says: "Who's Awesome? You're Awesome!"

     I.  WRF EMS ems_prep Model Initialization Summary

           Initialization Start Time    : Tue Apr 10 18:00:00 2012 UTC
           Initialization End   Time    : Fri Apr 13 18:00:00 2012 UTC
           Boundary Condition Frequency : 180 Minutes
           Initialization Data Set      : namptile
           Boundary Condition Data Set  : namptile
           Static Surface Data Sets     : None
           Land Surface Data Sets       : None

    II.  Search out requested files for WRF model initialization


         *  Locating namptile files for model initial and boundary conditions


            Areal coverage of the NAM 218 Lambert Conformal personal tiles
              
                Corner Lat-Lon points of the domain:
              
                   46.58, -115.83             47.53, -102.70
                  *                            *
              
                                 * 42.98, -108.68
              
                  *                            *
                   38.12, -114.13             39.04, -102.06




            Initiating HTTP connection to soostrc.comet.ucar.edu

              Making request #1 of 3 for personal tile data

              -> Attempting to acquire 12041012.nam.t12z.awphys06.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys09.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys12.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys15.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys18.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys21.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys24.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys27.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys30.grb2.tm00  - Success (0.04 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys33.grb2.tm00  - Success (0.04 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys36.grb2.tm00  - Success (0.01 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys39.grb2.tm00  - Success (0.01 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys42.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys45.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys48.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys51.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys54.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys57.grb2.tm00  - Success (0.02 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys60.grb2.tm00  - Success (0.04 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys63.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys66.grb2.tm00  - Success (0.03 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys69.grb2.tm00  - Success (0.04 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys72.grb2.tm00  - Success (0.05 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys75.grb2.tm00  - Success (0.05 mb/s)
              -> Attempting to acquire 12041012.nam.t12z.awphys78.grb2.tm00  - Success (0.05 mb/s)

         *  All requested namptile files are available for model initialization


         Excellent! - Your master plan is working!


   III.  Create the WPS NMM intermediate format files

         *  Processing namptile files for use as model initial and boundary conditions - Fantastic!!

         NMM core intermediate file processing completed in 48.53 seconds


    IV.  Horizontal interpolation of the intermediate files to the computational domain

         !  MPICH Timeout of metgrid after 299 seconds - continuing with some trepidation

         *  Metgrid processed files are located in
            
            /data/wrfems/runs/wrfnmmriw04/wpsprd

         Horizontal interpolation to computational domain completed in 5 minutes


         AUTORUN: The ems_prep routine completed successfully - Moving forward


     WRF EMS Program ems_run started on riw-lw-sac at Tue Apr 10 15:31:23 2012 UTC

               The WRF EMS Says: "Who's Awesome? You're Awesome!"


     I.  Preforming configuration in preparation for your EMS experience

         *  You are running the WRF NMM core. Hey Ho! Let's go! - model'n!

         *  Simulation start and end times:

              Domain         Start                   End

                1     2012-04-10_18:00:00     2012-04-11_21:00:00      

         *  Simulation length will be 27 hours

         *  Doing MPI check before running WRF Model

         *  Large timestep to be used for this simulation is 8 seconds


    II.  Creating the initial and boundary condition files for the user domain(s)


         *  The WRF REAL program shall be run on the following systems and processors:

            2  processors on riw-lw-sac     (1 tiles per processor)

         *  Creating the WRF initial and boundary condition files

         *  WRF initial and boundary conditions successfully created in 2 minutes 16 seconds

         Moving on to bigger and better delusions of grandeur


   III.  Running NMM WRF while thinking happy thoughts


         *  The WRF NMM core shall be run on the following systems and processors:

            4  processors on warf1     (192.168.0.10) with 1 tiles per processor
            4  processors on warf2     (192.168.0.11) with 1 tiles per processor

         *  Run Output Frequency   Primary wrfout   Aux File 1
            ---------------------------------------------------
              Domain 01          :   1 hour            Off      

         !  EMS AUTOPOST: It is not recommended that you run the autopost routine on the
            same machine as the model (riw-lw-sac)


         *  Checking connection between riw-lw-sac and riw-lw-sac for the autopost  - Success


         *  WRF EMS Auto Post-Processing Routine Initiated on riw-lw-sac at Tue Apr 10 15:33:53 2012 UTC

              If you don't believe me then you can check it out for yourself:

                %  tail -f /data/wrfems/runs/wrfnmmriw04/log/ems_autopost.log

              Or you can just trust me.


         *  Runnning your simulation with enthusiasm!

              You can sing along to the progress of the simulation while watching:

                %  tail -f /data/wrfems/runs/wrfnmmriw04/rsl.out.0000

              Unless you have something better to do with your time


