[wrfems] Running slower on cluster vs single workstation
Kurt Mayer
kurt.mayer at noaa.gov
Fri Feb 17 09:10:19 MST 2012
I always wondered if turning off ssh would get rid of some of the latency.
You would think the extra step of encrypting would add to the latency.
Anyone try that?
On Fri, Feb 17, 2012 at 7:45 AM, Matthew Foster <matthew.foster at noaa.gov> wrote:
> Kurt,
>
> Yes, that is exactly what I would expect. Ethernet scales very poorly in
> a WRF cluster environment, mainly because of its latency. The WRF model is
> latency bound rather than bandwidth bound. Three nodes is just about the
> most you can run over Gb Ethernet before the return on investment drops to
> zero. 10 GbE won't help much either. Again, it's all about the latency.
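
The latency argument above can be illustrated with a simple cost model in which the time for one message is a fixed latency plus the payload divided by the bandwidth. This is only a minimal sketch: the 50 microsecond latency, the 4 KB message size, and the assumption that 10 GbE has the same latency as Gb Ethernet are illustrative guesses, not measurements from any of the clusters in this thread.

```python
# Simple cost model: time per message = latency + bytes / bandwidth.
# Every number below is an illustrative assumption, not a measurement.

def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time in seconds to send one message of n_bytes."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def latency_share(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the message time that is pure latency."""
    return latency_s / message_time(n_bytes, latency_s, bandwidth_bytes_per_s)


LATENCY = 50e-6      # assumed latency, taken to be the same for both links
GBE = 125e6          # 1 Gb/s expressed in bytes per second
TEN_GBE = 1250e6     # 10 Gb/s expressed in bytes per second
MSG_BYTES = 4096     # assumed size of one small boundary-exchange message

for name, bw in (("1 GbE", GBE), ("10 GbE", TEN_GBE)):
    t = message_time(MSG_BYTES, LATENCY, bw)
    share = latency_share(MSG_BYTES, LATENCY, bw)
    print(f"{name}: {t * 1e6:.1f} us per message, {share:.0%} of it latency")
```

Under these assumptions, ten times the bandwidth shortens each small message by well under a factor of two, because the fixed latency term dominates.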
>
> I also found that mixing hardware led to more frustration than anything.
> The entire cluster always slows to the lowest common denominator, because
> the faster nodes wait on results from the slower one(s).
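
The lowest-common-denominator effect can be sketched as follows, assuming every node must finish its share of a time step before any node can start the next one. The per-step times are hypothetical and the function names are made up for illustration.

```python
def step_time(node_seconds):
    """A synchronized step finishes only when the slowest node finishes."""
    return max(node_seconds)


def idle_fraction(node_seconds):
    """Fraction of each step that each node spends waiting on the slowest."""
    slowest = max(node_seconds)
    return [(slowest - t) / slowest for t in node_seconds]


# Hypothetical per-step compute times in seconds: two slower nodes, one faster.
mixed = [1.00, 1.00, 0.80]
print("step time:", step_time(mixed))
print("idle fraction per node:", idle_fraction(mixed))
```

In this toy case the faster node sits idle for about a fifth of every step, so the cluster runs no faster than three of the slower nodes would.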
>
> Matt
>
>
> On Thu, Feb 16, 2012 at 10:17 PM, Kurt Mayer <kurt.mayer at noaa.gov> wrote:
>
>> Thanks for the tip. I made a 4 km, 12-hour, 300x300 domain. This time 3
>> nodes (18 cores) were faster than 2 (12 cores). The 18-core run took 109
>> minutes. I didn't do the full run with just 2 nodes, but judging by the
>> timesteps I am guessing it would have been around 150 minutes. So about
>> 28% faster with 50% more cores. I guess that is what you would expect in a
>> setup like this.
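
The arithmetic behind those figures can be checked with a short sketch. The 109 minutes is the measured 18-core time quoted above; the 150 minutes is only the estimate for 12 cores, so the results are approximate. The function names are chosen here for illustration.

```python
def speedup(t_base, t_new):
    """How many times faster the new run is than the base run."""
    return t_base / t_new


def time_saved(t_base, t_new):
    """Fraction of the base wall-clock time that the new run saves."""
    return (t_base - t_new) / t_base


def scaling_efficiency(t_base, cores_base, t_new, cores_new):
    """Speedup divided by the ratio of core counts (1.0 = perfect scaling)."""
    return speedup(t_base, t_new) / (cores_new / cores_base)


T_12_CORES = 150.0   # minutes, estimated from the timesteps (not a full run)
T_18_CORES = 109.0   # minutes, measured

print(f"speedup:    {speedup(T_12_CORES, T_18_CORES):.2f}x")
print(f"time saved: {time_saved(T_12_CORES, T_18_CORES):.0%}")
print(f"efficiency: {scaling_efficiency(T_12_CORES, 12, T_18_CORES, 18):.0%}")
```

With the 150-minute estimate, the third node saves a little over a quarter of the wall-clock time, which is in line with the rough 28% figure.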
>>
>>
>> On Wed, Feb 15, 2012 at 10:07 AM, Robert Rozumalski <
>> Robert.Rozumalski at noaa.gov> wrote:
>>
>>> On 2/14/12 9:37 PM, Kurt Mayer wrote:
>>>
>>> Some more data to chew on...
>>>
>>> I have 2 x 6-core computers running in a cluster. Each has a Phenom II
>>> 1055T CPU. To experiment, I added another node with a 6-core Phenom II
>>> 1090T CPU. The cluster with the third computer had slower times.
>>> I used both DECOMP 0 & 1 ... no difference.
>>>
>>> I am still on 3.1 with the old version of mpd... so maybe that is the
>>> problem. Also, the 3rd computer had CentOS 6.2 installed, while the two
>>> older nodes had CentOS 5.7.
>>>
>>>
>>> Good Morning Kurt,
>>>
>>> Thanks for the information. I don't think it's the EMS version that you
>>> are running or the Linux distributions. I believe part of the problem is
>>> that the faster machine(s) must wait for the slower system(s), so the net
>>> performance of the cluster is similar to that of a cluster consisting of
>>> all slower machines. Also, attacking the benchmark case with 36 CPUs is
>>> probably overkill, as the ARW benchmark domain is only 106x106, which
>>> results in each CPU handling roughly a 17 x 17 grid point patch.
>>> There is likely to be a lot of communication overhead.
>>>
>>> I would try your tests again with a much more sizable domain, such as
>>> 300x300 or larger.
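
The patch-size estimate can be reproduced with a short sketch, assuming the domain is split evenly into square patches, one per CPU. The edge fraction (the share of a patch's points that lie on its outer ring) is used here only as a rough stand-in for communication overhead; it is an illustrative measure, not how WRF itself accounts for halo exchange.

```python
import math


def patch_side(nx, ny, n_procs):
    """Approximate side length of a square patch for an even split."""
    return math.sqrt(nx * ny / n_procs)


def edge_fraction(side):
    """Share of a side x side patch that lies on its one-point-wide outer ring."""
    return 1.0 - ((side - 2.0) / side) ** 2


for nx, ny in ((106, 106), (300, 300)):
    side = patch_side(nx, ny, 36)
    print(f"{nx}x{ny} on 36 CPUs: about {side:.1f} x {side:.1f} points per "
          f"patch, {edge_fraction(side):.0%} of them on the edge")
```

The larger domain gives each CPU far more interior points per edge point, which is why it is a fairer test of how the cluster scales.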
>>>
>>> There is also a possibility that the main boards used with AMD
>>> processors are the problem but I'm just speculating.
>>>
>>> Bob
>>>
>>>
>>>
>>>
>>> On Tue, Feb 14, 2012 at 3:30 PM, Robert Rozumalski <
>>> Robert.Rozumalski at noaa.gov> wrote:
>>>
>>>>
>>>> Hello Brian,
>>>>
>>>>
>>>> I've finally had the opportunity to look at your problem and cluster
>>>> configuration and noticed that your benchmark
>>>> results and experience bear a striking resemblance to mine.
>>>>
>>>> A while back I replaced one machine on my 3-machine cluster with a
>>>> newer 6-core AMD system, just like yours, only slightly faster.
>>>>
>>>> Here was my configuration:
>>>>
>>>> (1) 2 X Six-Core AMD Opteron(tm) Processor 2435 @ 2600MHz
>>>> (2) 2 X Four-Core INTEL Xeon(R) CPU W5590 @ 3.33GHz
>>>> ____________
>>>> 28 total Processors (16 Xeon & 12 Opteron)
>>>>
>>>>
>>>> Here is your configuration:
>>>>
>>>> (1) 2 X Six-Core AMD Opteron(tm) Processor 2427 @ 2200MHz
>>>> (1) 2 X Six-Core INTEL Xeon(R) CPU X5690 @ 3.47GHz
>>>> ________________
>>>> 24 Total Cores (12 Xeon & 12 Opteron)
>>>>
>>>>
>>>> What I found was that when I added the new 2 x six-Core AMD system to
>>>> the cluster, my benchmark times were
>>>> significantly slower than when I ran on only the 2 INTEL systems and
>>>> about the same as when I ran on a single
>>>> INTEL machine.
>>>>
>>>> I could not find an explanation for this result. I thought it might be
>>>> due to over-decomposition of the domain, so I created a much larger
>>>> domain, with similar results. I ended up requesting a replacement for
>>>> the AMD system, which is now used for testing. The new all-Xeon cluster
>>>> works well.
>>>>
>>>> BTW - More CPUs is not always better if your domain gets decomposed to
>>>> the point where communication becomes
>>>> a large bottleneck. This is pretty easy to do.
>>>>
>>>>
>>>> Also, your timing on the 2 x Six-Core Xeon is similar to that on my
>>>> system with a similar CPU.
>>>>
>>>> So, I have no solutions or explanations but it may not be anything you
>>>> are doing wrong.
>>>>
>>>> Bob
>>>>
>>>>
>>>>
>>>>
>>>> On 2/9/12 2:04 PM, bkolts at firstenergycorp.com wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I'm trying to cluster 2 workstations to run WRF. After setting this up,
>>>> I've noticed a significant slowdown in run time. When running on the
>>>> master workstation only (no cluster), the benchmark test took a little
>>>> over 4 minutes to complete. When running with the cluster, it took over
>>>> 18 minutes.
>>>>
>>>> I've attached the results from the two benchmark tests. The second machine
>>>> in the cluster is a slightly slower machine. Could it be that this is
>>>> causing the slowdown? Or have I configured things incorrectly?
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>> (See attached file: singleWorkstation_benchmark.info)
>>>> (See attached file: cluster_benchmark.info)
>>>>
>>>> Brian Kolts
>>>> Advanced Scientist
>>>> Environmental Energy Delivery Services
>>>> FirstEnergy Corp.
>>>> 330.384.5474
>>>>
>>>>
>>>> _______________________________________________
>>>> wrfems mailing list
>>>> wrfems at comet.ucar.edu
>>>>
>>>>
>>>>
>>>> --
>>>> Robert A. Rozumalski, PhD
>>>> NWS National SOO Science and Training Resource Coordinator
>>>>
>>>> COMET/UCAR PO Box 3000 Phone: 303.497.8356
>>>> Boulder, CO 80307-3000
>>>>
>>>
>>>
>>
>