Though we are just days away from the release of Intel’s Core i9 family based on Skylake-X, and a bit further away from the Xeon Scalable Processor launch using the same fundamental architecture, Intel is sharing a bit of information on how the insides of this processor tick. Literally. One of the most significant changes to the new processor design comes in the form of a new mesh interconnect architecture that handles the communications between the on-chip logical areas.
Since the days of Nehalem-EX, Intel has utilized a ring-bus architecture for its processor designs. The ring bus operated in a bi-directional, sequential manner, cycling through various stops. At each stop, the control logic would determine whether data was to be collected from or deposited with that module. These ring bus stops are located at the CPU cores / caches, the memory controllers, the PCI Express interface, the LLC slices, etc. This ring bus was fairly simple and easily expandable by adding more stops to the ring itself.
However, over several generations, the ring bus has become quite large and unwieldy. Compare the ring bus from Nehalem-EX above to the one for last year's Xeon E5 v4 platform.
The spike in core counts and other on-die modules caused the ring to balloon, eventually splitting into multiple rings and complicating the design. As you increase the number of stops on the ring bus, you also increase the latency of messaging and data transfer, which Intel compensated for by increasing the bandwidth and clock speed of the interface, at the expense of power and efficiency.
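To put rough numbers on that scaling, here is a quick back-of-the-envelope model. The stop counts and the one-cycle-per-hop assumption are mine for illustration, not Intel's published figures, but they show why adding stops to a ring keeps pushing average latency up.

```python
# Illustrative model only: average hop count between two stops on a
# bi-directional ring grows with the number of stops. Stop counts below
# are assumptions for the sketch, not Intel's actual configurations.

def avg_ring_hops(stops: int) -> float:
    """Average shortest-path hop count between two distinct stops
    on a bi-directional ring with the given number of stops."""
    total = 0
    for src in range(stops):
        for dst in range(stops):
            if src == dst:
                continue
            clockwise = (dst - src) % stops
            total += min(clockwise, stops - clockwise)
    return total / (stops * (stops - 1))

for stops in (8, 12, 24):  # hypothetical stop counts
    print(f"{stops:2d} stops -> {avg_ring_hops(stops):.2f} average hops")
```

The average hop count (and therefore latency at a fixed clock) grows roughly linearly with the number of stops, which is exactly the pressure that pushed Intel toward higher interface clocks and, eventually, multiple rings.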
For an on-die interconnect to remain relevant, it needs to scale flexibly in bandwidth, reduce latency, and remain energy efficient. With 28-core Xeon processors imminent, and new I/O capabilities arriving alongside them, the time for the ring bus in this space is over.
Starting with the HEDT and Xeon products released this year, Intel will be using a new on-chip design called a mesh that Intel promises will offer higher bandwidth, lower latency, and improved power efficiency. As the name implies, the mesh architecture is one in which each node relays messages through the network between source and destination. Though I cannot share many of the details on performance characteristics just yet, Intel did share the following diagram.
As Intel indicates in its blog on the mesh announcements, this generic diagram “shows a representation of the mesh architecture where cores, on-chip cache banks, memory controllers, and I/O controllers are organized in rows and columns, with wires and switches connecting them at each intersection to allow for turns. By providing a more direct path than the prior ring architectures and many more pathways to eliminate bottlenecks, the mesh can operate at a lower frequency and voltage and can still deliver very high bandwidth and low latency. This results in improved performance and greater energy efficiency similar to a well-designed highway system that lets traffic flow at the optimal speed without congestion.”
The bi-directional mesh design allows a many-core processor to offer lower node-to-node latency than the ring architecture could provide, and by adjusting the width of the interface, Intel can control bandwidth (and, by relation, frequency). Intel tells us that this can offer lower average latency without increasing power. Though it wasn't specifically mentioned in the blog, the safe assumption is that, because nothing is free, the more granular mesh network comes at a slight die-size cost.
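As a rough illustration of why a grid of rows and columns helps, the sketch below compares average hop counts for the same number of nodes arranged as a bi-directional ring versus a 2D mesh with simple dimension-ordered (X-then-Y) routing, where the hop count is just the Manhattan distance between grid positions. The 28-node count and 4x7 grid shape are assumptions for the example, not Intel's actual floorplan.

```python
# Toy comparison (assumptions, not Intel data): average hop count for N
# nodes as a bi-directional ring versus a rows x cols 2D mesh with
# dimension-ordered (X then Y) routing.

from itertools import product

def avg_ring_hops(n: int) -> float:
    """Average shortest-path hop count on a bi-directional ring of n stops."""
    total = sum(min((d - s) % n, (s - d) % n)
                for s in range(n) for d in range(n) if s != d)
    return total / (n * (n - 1))

def avg_mesh_hops(rows: int, cols: int) -> float:
    """Average hop count on a rows x cols mesh, i.e. the mean Manhattan
    distance between distinct grid positions."""
    nodes = list(product(range(rows), range(cols)))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1) in nodes for (r2, c2) in nodes
                if (r1, c1) != (r2, c2))
    n = len(nodes)
    return total / (n * (n - 1))

n = 28  # e.g. a 28-core part; the 4x7 grid shape below is assumed
print(f"ring ({n} stops) : {avg_ring_hops(n):.2f} average hops")
print(f"mesh (4x7 grid)  : {avg_mesh_hops(4, 7):.2f} average hops")
```

For 28 nodes the mesh cuts the average hop count roughly in half compared to a single ring, which is consistent with Intel's claim that the mesh can run at a lower frequency and voltage while still delivering lower average latency.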
Using a mesh architecture offers a couple of new capabilities and also requires a few changes to the cache design. By dividing up the I/O interfaces (think multiple PCI Express banks, or memory channels), Intel can provide better average access times to each core by intelligently placing those modules around the die. Intel will also be breaking up the LLC into different segments, each of which shares a "stop" on the network with a processor core. Rather than the previous ring bus design, where the entirety of the LLC was accessed through a single stop, the LLC will operate as a distributed system. However, Intel assures us that performance variability is not a concern:
Negligible latency differences in accessing different cache banks allows software to treat the distributed cache banks as one large unified last level cache. As a result, application developers do not have to worry about variable latency in accessing different cache banks, nor do they need to optimize or recompile code to get significant performance boosts out of their applications.
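To see how a distributed LLC can still look like one unified cache to software, here is a minimal sketch of hashing physical addresses across per-core cache banks. The hash function, slice count, and line size are illustrative assumptions on my part; Intel does not document its actual slice hash in the blog.

```python
# A minimal sketch, assuming a simple hash-based mapping of addresses to
# LLC banks. The constants and hash below are illustrative, not Intel's.

NUM_SLICES = 28        # hypothetical: one LLC bank per core
CACHE_LINE_BYTES = 64  # cache-line granularity

def llc_slice(phys_addr: int) -> int:
    """Map a physical address to an LLC bank by hashing its line address."""
    line = phys_addr // CACHE_LINE_BYTES
    # XOR-fold the line address; purely illustrative, not Intel's real hash.
    h = line ^ (line >> 7) ^ (line >> 15)
    return h % NUM_SLICES

# Consecutive cache lines scatter across banks, so a streaming access
# pattern does not pile up on a single bank (or mesh stop).
for addr in range(0, 8 * CACHE_LINE_BYTES, CACHE_LINE_BYTES):
    print(f"address {addr:#06x} -> LLC slice {llc_slice(addr)}")
```

Because every core reaches every bank over the mesh with roughly similar hop counts, software can treat the collection of banks as a single large LLC, which is the property Intel is emphasizing above.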
There is a lot to dissect when it comes to this new mesh architecture for Xeon Scalable and Core i9 processors, including its overall effect on the LLC cache performance and how it might affect system memory or PCI Express performance. In theory, the integration of a mesh network-style interface could drastically improve the average latency in all cases and increase maximum memory bandwidth by giving more cores access to the memory bus sooner. But, it is also possible this increases maximum latency in some fringe cases.
Further testing awaits for us to find out!
I hate waiting for testing results! 🙂
We will have some stuff for you to peruse on the 19th.
Looking forward to reading about how the cache structure affects performance. You guys always have great reviews!
“Definition of mesh”
“2 a : the fabric of a net”
We are one step away from Intel calling this an “EndlessMesh” Tm
Now, I will need to look at how the Q6600 interconnected its two die…
Sounds like they’re trying very hard not to use ‘net’ in the name. Which makes sense as the last time they used that things didn’t end well.
This sounds to me like Intel is doing something very similar to AMDs Infinity Fabric. Can’t wait to see the tests.
I wish I knew, actually. AMD has been very cagey when it comes to true details on its chip fabric. Hopefully with the coming launch of its enterprise EPYC brand they will be slightly more forthcoming.
The POWER9 has a Fabric that runs at 7 TB/s or 256 GB/s in SMT8 Mode.
Using CAPI 2.0 and NVLink 2.0 it can connect to external Accelerator Cards (FPGA, IBM’s CCA, etc.) and NVidia’s upcoming Volta GPU Cards at 25 GB/s.
Facebook and Google are abandoning Intel (and its pricing) for POWER9 servers like the Barreleye G2.
Because it’s OpenServer and OpenPOWER anyone (with $) can make the Chip and other components, which will drop the price well below U$7K (which is already cheaper than the top end Epyc 2P Solution).
So it’s ‘better’ than everything in every way (assuming you want the fastest) except for a large Base of Software (usually you will simply compile your own Source Code).
If you’re running a Web Server (anything Linux / nothing Windows) you have no worries. It’s cheaper than Intel’s and AMD’s top end Solutions (cheaper/faster/less W) – otherwise Facebook and Google wouldn’t be buying up all existing Stock and installing new Racks as time permits.
References:
https://en.wikipedia.org/wiki/POWER9
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf
http://www-355.ibm.com/systems/power/openpower/tgcmDocumentRepository.xhtml?aliasId=POWER9_LaGrange
http://www.pcworld.com/article/3110615/nvidias-nvlink-20-will-first-appear-in-power9-servers-next-year.html
If PCIe lanes are dependent on columns of cores, this *might* explain Intel’s annoying decision to limit PCIe lanes on lower HEDT SKUs.
The revenge of Thinking Machines ‘Connection Machine’ architecture! We can only hope Skylake-SP comes equipped with such gratuitously glorious Blinkenlichts as the CM series.
Knights Landing, which has up to 72 cores, uses a 2D mesh already.
I think the first time Intel played with a 2D mesh was on the 80-core https://en.wikipedia.org/wiki/Teraflops_Research_Chip (although those are not x86 cores, but much simpler ones, as the chip was really designed to test the interconnect, not the cores; see the wiki for more info). You could even argue that this chip uses a 3D mesh, as it has the cores on top of the memory and each core can route in 5 directions: N, S, E, W, and then down to the SRAM cell below it.
Interesting observation about the topology.
Correct, it's very similar.
Is there any word on a Jordan Creek style memory buffer to allow 12 channel or 6 channel lockstep modes, like the Ivy-Broadwell E7s have?
As far as I remember, Nehalem/Westmere still used a simple crossbar memory architecture, as they only scaled to 4 or 6 cores respectively. I am not 100% sure what Westmere-EX used, but I know Sandy Bridge brought the first ring bus to most of the chips.
Ah, yes, you are correct. That should be Nehalem-EX.