[wrfems] Running slower on cluster vs single workstation

Mark Keehn mark.keehn at noaa.gov
Fri Feb 17 07:32:49 MST 2012


That was my experience at HGX, also.  We currently have two quad-core
workstations in our WRF cluster, but I plan to replace them with a single
multi-core workstation when the budget permits.

  A couple of years ago, I had up to six nodes in my cluster, but
observed significant slow-downs beyond three.  The improvement from the
third node was slight, so I reassigned it to other tasks and settled on
my two-node cluster.

Mark Keehn
ITO at HGX

On Thu, Feb 16, 2012 at 10:17 PM, Kurt Mayer <kurt.mayer at noaa.gov> wrote:

> Thanks for the tip. I made a 4 km, 12-hour, 300x300 domain. This time 3
> nodes (18 cores) was faster than 2 (12 cores): 18 cores finished the run
> in 109 minutes. I didn't do the full run with just 2 nodes, but judging
> by the timesteps I'm guessing it would have been around 150 minutes. So
> about 28% faster with 50% more cores. I guess that is what you would
> expect in a setup like this.
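>
> For the record, here is the arithmetic behind that estimate as a quick
> Python sketch (the 150-minute figure is my extrapolation from the
> timesteps, not a measured run):
>
>     # Rough speedup / parallel-efficiency check for the 300x300 test.
>     t12 = 150.0   # estimated minutes on 2 nodes (12 cores)
>     t18 = 109.0   # measured minutes on 3 nodes (18 cores)
>
>     speedup = t12 / t18                 # ~1.38x
>     saved = (t12 - t18) / t12           # ~0.27, i.e. roughly 28% faster
>     efficiency = speedup / (18 / 12)    # ~92% of the ideal 1.5x
>
>     print(f"{speedup:.2f}x speedup, {saved:.0%} faster, "
>           f"{efficiency:.0%} scaling efficiency")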
>
> On Wed, Feb 15, 2012 at 10:07 AM, Robert Rozumalski <
> Robert.Rozumalski at noaa.gov> wrote:
>
>>  On 2/14/12 9:37 PM, Kurt Mayer wrote:
>>
>> Some more data to chew on...
>>
>>  I have 2 x 6-core computers running in a cluster; each has a Phenom
>> II 1055T CPU. To experiment, I added another node with a 6-core Phenom
>> II 1090T CPU in it. The cluster with the third computer had slower
>> times. I used both DECOMP 0 & 1 ... no difference.
>>
>>  I am still on 3.1 with the old mpd process manager... so maybe that
>> is the problem. Also, the 3rd computer had CentOS 6.2 installed, while
>> the two older nodes had CentOS 5.7.
>>
>>
>> Good Morning Kurt,
>>
>> Thanks for the information. I don't think it's the EMS version you are
>> running or the Linux distributions. I believe part of the problem is
>> that the faster machine(s) must wait for the slower system(s), so the
>> net performance of the cluster is similar to that of a cluster made up
>> entirely of the slower machines.  Also, attacking the benchmark case
>> with 36 CPUs is probably overkill, as the ARW benchmark domain is only
>> 106x106, which leaves each CPU handling roughly a 17x17 grid-point
>> patch.  There is likely to be a lot of communication overhead.
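>>
>> To put rough numbers on that overhead, here is an illustrative Python
>> sketch (the 5-point halo width is an assumption for illustration; the
>> actual WRF halo exchanges and decomposition differ in detail):
>>
>>     # Surface-to-volume argument: small patches spend proportionally
>>     # more of each timestep exchanging halo rows than computing.
>>     def halo_fraction(nx, ny, cores_x, cores_y, halo=5):
>>         px, py = nx / cores_x, ny / cores_y    # patch dimensions
>>         interior = px * py                     # points of real work
>>         halo_pts = 2 * halo * (px + py)        # rough halo perimeter
>>         return halo_pts / interior
>>
>>     print(halo_fraction(106, 106, 6, 6))   # 36 cores: ~1.1, mostly halo
>>     print(halo_fraction(300, 300, 6, 3))   # 18 cores: ~0.3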
>>
>> I would try your tests again with a much more sizable domain, such as
>> 300x300 or larger.
>>
>> There is also a possibility that the motherboards used with the AMD
>> processors are the problem, but I'm just speculating.
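>>
>> One way to rule the hardware in or out would be a two-rank ping-pong
>> test to measure the node-to-node MPI latency directly. A minimal sketch
>> using mpi4py (assuming it is available; launch with
>> "mpiexec -n 2 python pingpong.py", placing one rank on each node):
>>
>>     # pingpong.py: time round trips of a small message between 2 ranks.
>>     import numpy as np
>>     from mpi4py import MPI
>>
>>     comm = MPI.COMM_WORLD
>>     rank = comm.Get_rank()
>>     buf = np.zeros(1024, dtype='b')   # 1 KB message
>>     reps = 1000
>>
>>     comm.Barrier()
>>     t0 = MPI.Wtime()
>>     for _ in range(reps):
>>         if rank == 0:
>>             comm.Send(buf, dest=1)
>>             comm.Recv(buf, source=1)
>>         else:
>>             comm.Recv(buf, source=0)
>>             comm.Send(buf, dest=0)
>>     if rank == 0:
>>         print("round trip: %.1f us" % ((MPI.Wtime() - t0) / reps * 1e6))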
>>
>> Bob
>>
>>
>>
>>
>> On Tue, Feb 14, 2012 at 3:30 PM, Robert Rozumalski <
>> Robert.Rozumalski at noaa.gov> wrote:
>>
>>>
>>> Hello Brian,
>>>
>>>
>>> I've finally had the opportunity to look at your problem and cluster
>>> configuration, and noticed that your benchmark results and experience
>>> bear a striking resemblance to mine.
>>>
>>> A while back I replaced one machine on my 3-machine cluster with a
>>> newer 6-core AMD system just like yours, only slightly faster.
>>>
>>> Here was my configuration:
>>>
>>>   (1)  2 x Six-Core  AMD Opteron(tm) 2435 @ 2.60 GHz
>>>   (2)  2 x Quad-Core Intel Xeon(R) W5590 @ 3.33 GHz
>>> ____________
>>> 28 total cores (16 Xeon & 12 Opteron)
>>>
>>>
>>> Here is your configuration:
>>>
>>>   (1)  2 x Six-Core AMD Opteron(tm) 2427 @ 2.20 GHz
>>>   (1)  2 x Six-Core Intel Xeon(R) X5690 @ 3.47 GHz
>>> ________________
>>> 24 total cores (12 Xeon & 12 Opteron)
>>>
>>>
>>> What I found was that when I added the new 2 x six-core AMD system to
>>> the cluster, my benchmark times were significantly slower than when I
>>> ran on only the 2 Intel systems, and about the same as when I ran on a
>>> single Intel machine.
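>>>
>>> One way to rationalize that result: with a static, uniform domain
>>> decomposition, every rank must finish its patch before the halo
>>> exchange can complete, so each timestep runs at the pace of the
>>> slowest core. A toy Python model (the relative per-core speeds are
>>> made-up illustrative numbers, not measurements):
>>>
>>>     # Throughput of a lock-stepped cluster is bounded by
>>>     # (number of cores) x (speed of the slowest core).
>>>     def throughput(core_speeds):
>>>         return len(core_speeds) * min(core_speeds)
>>>
>>>     xeon, opteron = 1.0, 0.6   # assumed relative per-core speeds
>>>     print(throughput([xeon] * 16))                   # 16.0, Xeons alone
>>>     print(throughput([xeon] * 16 + [opteron] * 12))  # 16.8, barely more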
>>>
>>> I could not find an explanation for this result.  I thought it might
>>> be due to over-decomposition of the domain, so I created a much larger
>>> domain, with similar results.  I ended up requesting a replacement for
>>> the AMD system, which is now used for testing.  The new all-Xeon
>>> cluster works well.
>>>
>>> BTW - more CPUs are not always better if your domain gets decomposed
>>> to the point where communication becomes a large bottleneck, and that
>>> is pretty easy to do.
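>>>
>>> As a rough illustration of how quickly that happens, here is a small
>>> Python sweep over core counts for a fixed 106x106 domain (assuming
>>> square-ish decompositions and an illustrative 5-row halo; the real
>>> WRF halo pattern is more involved):
>>>
>>>     import math
>>>
>>>     # Fraction of each patch that is halo overhead rather than work:
>>>     # it grows as sqrt(cores), doubling each time cores increase 4x.
>>>     for cores in (4, 9, 16, 36, 64):
>>>         patch = 106 / math.sqrt(cores)           # patch edge length
>>>         frac = 2 * 5 * (2 * patch) / patch**2    # halo / interior
>>>         print(cores, round(frac, 2))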
>>>
>>>
>>> Also, your timing on the 2 x Six-Core Xeon is similar to that on my
>>> system with a similar CPU.
>>>
>>> So, I have no solutions or explanations, but it may not be anything
>>> you are doing wrong.
>>>
>>> Bob
>>>
>>>
>>>
>>>
>>> On 2/9/12 2:04 PM, bkolts at firstenergycorp.com wrote:
>>>
>>>  Hi All,
>>>
>>> I'm trying to cluster 2 workstations to run WRF.  After setting this
>>> up, I've noticed a significant slowdown in run time.  When running the
>>> benchmark on the master workstation only (no cluster), the test took a
>>> little over 4 minutes to complete.  When running with the cluster, it
>>> took over 18 minutes.
>>>
>>> I've attached the results from the two benchmark tests.  The second
>>> machine in the cluster is slightly slower.  Could that be what is
>>> causing the slowdown?  Or have I configured things incorrectly?
>>>
>>> Thanks,
>>> Brian
>>>
>>> (See attached file: singleWorkstation_benchmark.info)
>>> (See attached file: cluster_benchmark.info)
>>>
>>> Brian Kolts
>>> Advanced Scientist
>>> Environmental Energy Delivery Services
>>> FirstEnergy Corp.
>>> 330.384.5474
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>> --
>> Robert A. Rozumalski, PhD
>> NWS National SOO Science and Training Resource Coordinator
>>
>> COMET/UCAR PO Box 3000   Phone:  303.497.8356
>> Boulder, CO 80307-3000
>>
>>
>>
>>
>
>


-- 
*Mark Keehn*
Information Technology Officer
National Weather Service Houston/Galveston, TX
(281)534-5625 x 244

*This E-mail was composed on the NOAA/Google Mail application!*

