[wrfems] Running slower on cluster vs single workstation

Matthew Foster matthew.foster at noaa.gov
Fri Feb 17 11:19:48 MST 2012


Kurt,

I think ssh is only used to spawn the processes on the compute nodes. The
actual MPI traffic is not encrypted, so I believe the overhead from SSH is
minimal.
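
To put rough numbers on it (a back-of-envelope Python sketch; the per-node
ssh cost and message counts below are illustrative assumptions, not
measurements):

    # One-time ssh launch cost vs. per-timestep MPI traffic.
    # All numbers are assumptions for the sake of argument.
    nodes         = 3
    ssh_launch_s  = 0.5       # assumed ssh session setup per node (one time)
    timesteps     = 15000     # assumed number of timesteps in a full run
    msgs_per_step = 8         # assumed halo-exchange messages per process/step

    launch_overhead = nodes * ssh_launch_s
    mpi_messages    = timesteps * msgs_per_step

    print(f"one-time ssh overhead: ~{launch_overhead:.1f} s for the whole run")
    print(f"MPI messages per process: ~{mpi_messages}, all over plain TCP")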

Matt


On Fri, Feb 17, 2012 at 10:10 AM, Kurt Mayer <kurt.mayer at noaa.gov> wrote:

> I always wondered if turning off ssh would get rid of some of the latency.
> You would think the extra encryption step would add to it. Has anyone tried
> that?
>
>
> On Fri, Feb 17, 2012 at 7:45 AM, Matthew Foster <matthew.foster at noaa.gov> wrote:
>
>> Kurt,
>>
>> Yes, that is exactly what I would expect.  Ethernet scales very poorly in
>> a WRF cluster environment, mainly because of its latency.  The WRF model is
>> not so much bandwidth-bound as it is latency-bound.  Three nodes is just
>> about the most you can do over Gb Ethernet before you reach zero return on
>> investment, and 10 GbE won't help much either.  Again, it's all about the
>> latency.
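>>
>> As a rough illustration of why latency rather than bandwidth is the
>> limiter (a sketch; the latencies, message count, and compute time per step
>> are assumed for illustration, not measured):
>>
>>     # Per-timestep communication cost on GbE vs. a low-latency fabric.
>>     # All inputs are illustrative assumptions.
>>     halo_msgs        = 8        # assumed messages per process per timestep
>>     gbe_latency      = 50e-6    # ~50 us per message over Gb Ethernet
>>     lowlat_latency   = 2e-6     # ~2 us over a low-latency interconnect
>>     compute_per_step = 500e-6   # assumed compute seconds per process/step
>>
>>     for name, lat in [("GbE", gbe_latency), ("low-latency", lowlat_latency)]:
>>         comm = halo_msgs * lat
>>         print(f"{name:12s}: {comm / compute_per_step:.0%} communication "
>>               f"overhead relative to compute")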
>>
>> I also found that mixing hardware led to more frustration than anything
>> else: the whole cluster slows to the lowest common denominator, because the
>> faster nodes end up waiting on results from the slower one(s).
>>
>> Matt
>>
>>
>> On Thu, Feb 16, 2012 at 10:17 PM, Kurt Mayer <kurt.mayer at noaa.gov> wrote:
>>
>>> Thanks for the tip. I made a 4-km, 12-hour, 300x300 domain. This time
>>> three nodes (18 cores) were faster than two (12 cores): the 18-core run
>>> took 109 minutes. I didn't do the full run with just two nodes, but judging
>>> by the timesteps I'm guessing it would have been around 150 minutes. So
>>> roughly 28% faster with 50% more cores, which I suppose is what you would
>>> expect in a setup like this.
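>>>
>>> For what it's worth, the quick arithmetic behind that figure (a sketch;
>>> the 150-minute two-node time is my estimate from the timesteps, not a
>>> completed run):
>>>
>>>     # Speedup and parallel efficiency going from 12 to 18 cores.
>>>     t12, t18 = 150.0, 109.0          # wall-clock minutes (t12 estimated)
>>>     cores12, cores18 = 12, 18
>>>
>>>     speedup    = t12 / t18           # ~1.38x
>>>     ideal      = cores18 / cores12   # 1.5x
>>>     efficiency = speedup / ideal     # ~0.92
>>>
>>>     print(f"speedup {speedup:.2f}x of an ideal {ideal:.1f}x "
>>>           f"(parallel efficiency {efficiency:.0%})")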
>>>
>>>
>>> On Wed, Feb 15, 2012 at 10:07 AM, Robert Rozumalski <
>>> Robert.Rozumalski at noaa.gov> wrote:
>>>
>>>>  On 2/14/12 9:37 PM, Kurt Mayer wrote:
>>>>
>>>> Some more data to chew on...
>>>>
>>>> I have two 6-core computers running in a cluster; each has a Phenom II X6
>>>> 1055T CPU. To experiment, I added another node with a 6-core Phenom II X6
>>>> 1090T CPU in it. The cluster with the third computer had slower times. I
>>>> used both DECOMP 0 & 1 ... no difference.
>>>>
>>>> I am still on 3.1 with the old mpd process manager... so maybe that is
>>>> the problem. Also, the third computer had CentOS 6.2 installed, while the
>>>> two older nodes had CentOS 5.7.
>>>>
>>>>
>>>> Good Morning Kurt,
>>>>
>>>> Thanks for the information. I don't think it's the EMS version you are
>>>> running or the Linux distributions. I believe part of the problem is that
>>>> the faster machine(s) must wait for the slower system(s), so the net
>>>> performance of the cluster is similar to that of a cluster consisting of
>>>> all slower machines.  Also, attacking the benchmark case with 36 CPUs is
>>>> probably overkill, as the ARW benchmark domain is only 106x106, which
>>>> results in each CPU handling roughly a 17 x 17 grid-point patch. There is
>>>> likely to be a lot of communication overhead.
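>>>>
>>>> To illustrate the decomposition arithmetic (a rough sketch; the square
>>>> 6 x 6 processor layout and the 3-point halo width are assumptions):
>>>>
>>>>     # Patch size and halo-to-interior ratio for 106x106 vs. 300x300
>>>>     # on 36 CPUs, assuming a square decomposition.
>>>>     import math
>>>>
>>>>     def patch_stats(n, ncpus, halo=3):
>>>>         p = int(math.sqrt(ncpus))      # processors per side (6)
>>>>         side = n / p                   # grid points per patch side
>>>>         interior = side * side
>>>>         halo_pts = 4 * halo * side     # points exchanged around the edge
>>>>         return side, halo_pts / interior
>>>>
>>>>     for n in (106, 300):
>>>>         side, ratio = patch_stats(n, 36)
>>>>         print(f"{n}x{n} on 36 CPUs: ~{int(side)} x {int(side)} patch, "
>>>>               f"halo/interior ~ {ratio:.0%}")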
>>>>
>>>> I would try your tests again with a much more sizable domain, such as
>>>> 300x300 or larger.
>>>>
>>>> There is also a possibility that the motherboards used with the AMD
>>>> processors are the problem, but I'm just speculating.
>>>>
>>>> Bob
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Feb 14, 2012 at 3:30 PM, Robert Rozumalski <
>>>> Robert.Rozumalski at noaa.gov> wrote:
>>>>
>>>>>
>>>>> Hello Brian,
>>>>>
>>>>>
>>>>> I've finally had the opportunity to look at your problem and cluster
>>>>> configuration and noticed that your benchmark results and experience
>>>>> bear a striking resemblance to mine.
>>>>>
>>>>> A while back I replaced one machine on my 3-machine cluster with a newer
>>>>> 6-core AMD system, just like yours, only slightly faster.
>>>>>
>>>>> Here was my configuration:
>>>>>
>>>>>   1 machine:   2 x six-core  AMD Opteron 2435  @ 2.60 GHz  = 12 cores
>>>>>   2 machines:  2 x four-core Intel Xeon W5590  @ 3.33 GHz  = 16 cores
>>>>> ____________
>>>>> 28 total cores (16 Xeon & 12 Opteron)
>>>>>
>>>>>
>>>>> Here is your configuration:
>>>>>
>>>>>   1 machine:   2 x six-core  AMD Opteron 2427  @ 2.20 GHz  = 12 cores
>>>>>   1 machine:   2 x six-core  Intel Xeon X5690  @ 3.47 GHz  = 12 cores
>>>>> ________________
>>>>> 24 total cores (12 Xeon & 12 Opteron)
>>>>>
>>>>>
>>>>> What I found was that when I added the new 2 x six-core AMD system to
>>>>> the cluster, my benchmark times were significantly slower than when I
>>>>> ran on only the two Intel systems, and about the same as when I ran on
>>>>> a single Intel machine.
>>>>>
>>>>> I could not find an explanation for this result.  I thought it might be
>>>>> due to over-decomposition of the domain, so I created a much larger
>>>>> domain, with similar results.  I ended up requesting a replacement for
>>>>> the AMD system, which is now used for testing.  The new all-Xeon cluster
>>>>> works well.
>>>>>
>>>>> BTW - more CPUs are not always better if your domain gets decomposed to
>>>>> the point where communication becomes a large bottleneck.  This is
>>>>> pretty easy to do.
>>>>>
>>>>>
>>>>> Also, your timing on the 2 x Six-Core Xeon is similar to that on my
>>>>> system with a similar CPU.
>>>>>
>>>>> So I have no solutions or explanations, but it may not be anything you
>>>>> are doing wrong.
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/9/12 2:04 PM, bkolts at firstenergycorp.com wrote:
>>>>>
>>>>>  Hi All,
>>>>>
>>>>> I'm trying to cluster two workstations to run WRF.  After setting this
>>>>> up I've noticed a significant slowdown in run time.  When running the
>>>>> benchmark on the master workstation only (no cluster), the benchmark test
>>>>> took a little over 4 minutes to complete.  When running with the cluster,
>>>>> it took over 18 minutes.
>>>>>
>>>>> I've attached the results from the two benchmark tests.  The second
>>>>> machine in the cluster is slightly slower.  Could that be what is causing
>>>>> the slowdown?  Or have I configured things incorrectly?
>>>>>
>>>>> Thanks,
>>>>> Brian
>>>>>
>>>>> (See attached file: singleWorkstation_benchmark.info) (See attached file: cluster_benchmark.info)
>>>>>
>>>>> Brian Kolts
>>>>> Advanced Scientist
>>>>> Environmental Energy Delivery Services
>>>>> FirstEnergy Corp.
>>>>> 330.384.5474
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Robert A. Rozumalski, PhD
>>>>> NWS National SOO Science and Training Resource Coordinator
>>>>>
>>>>> COMET/UCAR PO Box 3000   Phone:  303.497.8356
>>>>> Boulder, CO 80307-3000
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Robert A. Rozumalski, PhD
>>>> NWS National SOO Science and Training Resource Coordinator
>>>>
>>>> COMET/UCAR PO Box 3000   Phone:  303.497.8356
>>>> Boulder, CO 80307-3000
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>