Intel revealed details of the Lunar Lake processor architecture at its Intel Tech Tour 2024 presentation, ahead of the company's main Computex 2024 announcement. The new processors bring significant improvements to every part of the design. Lunar Lake is being developed primarily for laptops, although many of the fundamental changes may carry over to Arrow Lake chips for desktop PCs.
Every component of the Lunar Lake architecture has been tuned to balance power and performance. The biggest gains come from the energy-efficient cores (E-cores): the new Skymont cores deliver a 38% increase in IPC (instructions per cycle) in integer workloads and a 68% increase in floating-point workloads. The Lion Cove P-cores gain 14% in IPC. The new integrated Xe2 graphics raise iGPU performance by 50%.
Lunar Lake features a new Intel neural processing unit (NPU) rated at 48 TOPS of AI performance. The platform as a whole offers more: a total of 120 TOPS once the CPU cores and the iGPU are counted.
Lunar Lake mobile processors are designed with energy efficiency as the top priority. This foundational architecture will also be used in future Intel products, such as Arrow Lake and Panther Lake.
Intel turned to its competitor TSMC and the advanced 3nm N3B process node to build the compute cores, integrated graphics, and NPU. The TSMC N6 process is used for the controller tile that contains the external I/O interfaces. The only element manufactured by Intel is the passive Foveros base tile, built on 22FFL. Intel says it chose TSMC because it offered the best process nodes available. However, the company designed the architecture so that it can easily be ported to other process nodes.
Lunar Lake SoC Structure
Intel's Lunar Lake processors will feature 4 P-cores and 4 E-cores. The chip consists of two logic tiles, a compute tile (TSMC N3B) and a platform controller tile (N6), plus a non-functional filler tile for structural rigidity, all placed on the Foveros 22FFL base tile. Intel placed two LPDDR5X-8500 memory stacks directly on the chip package, in 16GB or 32GB configurations. The memory communicates over four 16-bit channels at up to 8.5 GT/s per chip.
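The memory figures above translate into a theoretical peak bandwidth with simple arithmetic. The sketch below is only an illustration: it assumes the four 16-bit channels are per memory stack and that both stacks are accessed in parallel, neither of which the article states explicitly.

```python
# Peak LPDDR5X bandwidth estimated from the figures quoted in the article.
TRANSFER_RATE_GT_S = 8.5   # giga-transfers per second (from the article)
CHANNELS_PER_STACK = 4     # assumption: the four channels are per stack
CHANNEL_WIDTH_BITS = 16    # bits per channel (from the article)
STACKS = 2                 # memory stacks on the package (from the article)


def peak_bandwidth_gb_s(stacks: int) -> float:
    """Theoretical peak bandwidth in GB/s for the given number of stacks."""
    bus_width_bits = stacks * CHANNELS_PER_STACK * CHANNEL_WIDTH_BITS
    return TRANSFER_RATE_GT_S * bus_width_bits / 8  # 8 bits per byte


per_stack = peak_bandwidth_gb_s(1)
total = peak_bandwidth_gb_s(STACKS)
print(f"per stack: {per_stack:.0f} GB/s, both stacks: {total:.0f} GB/s")
```

Under these assumptions the estimate is 68 GB/s per stack and 136 GB/s for the package; real sustained bandwidth will be lower.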
The compute tile contains the CPU cores, the Xe2 GPU, and NPU 4.0. It is also equipped with a new 8MB "side cache" that is shared among all compute blocks to improve hit rates and reduce data movement. Technically, it does not meet the definition of an L4 cache, as it is common to all elements.
Moving the power delivery subsystem off the chip also saves energy. Overall, Intel claims a 40% reduction in power consumption compared to Meteor Lake.
Performance Cores
The P-cores of Lunar Lake deliver an average 14% increase in IPC. However, Intel made an unexpected optimization: it removed hyperthreading and all the logic blocks that supported it. Intel's architects concluded that hyperthreading, which increases IPC by about 30% in multi-threaded workloads, matters less in a hybrid design that hands multi-threaded work to the more energy-efficient E-cores. Intel cites an overall performance increase of 10% to 18% over Meteor Lake, depending on chip power.
Removing hyperthreading makes the core smaller, yielding 15% better performance per watt, 10% better performance per area, and 30% better performance per power per area. This is much more efficient than simply disabling hyperthreading and leaving the circuitry in place. The new approach also frees up space for other additions, such as more E-cores and GPU cores.
Intel is not abandoning hyperthreading completely: it still sees its value in P-core-only designs. Intel has therefore developed two versions of the Lion Cove core, one with hyperthreading and one without, so that the hyperthreaded version can be used in other products, such as the future Xeon 6.
Intel previously adjusted clock frequencies only in steps of 100 MHz, but they can now be adjusted in steps of 16.67 MHz for finer control over frequency and power. Intel says this yields gains of a few percent in energy efficiency or performance in some scenarios.
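To see why finer clock steps can be worth a few percent, consider a core whose power budget allows a frequency that falls between two coarse steps. The sketch below is illustrative only: the 3090 MHz ceiling is a hypothetical value, and treating 16.67 MHz as exactly 100/6 MHz is an assumption.

```python
from fractions import Fraction

COARSE_STEP_MHZ = Fraction(100)     # previous granularity (from the article)
FINE_STEP_MHZ = Fraction(100, 6)    # assumption: 16.67 MHz is exactly 100/6 MHz


def highest_allowed_clock(limit_mhz: Fraction, step_mhz: Fraction) -> Fraction:
    """Largest multiple of step_mhz that does not exceed limit_mhz."""
    return (limit_mhz // step_mhz) * step_mhz


limit = Fraction(3090)  # hypothetical power-limited frequency ceiling in MHz
coarse = highest_allowed_clock(limit, COARSE_STEP_MHZ)
fine = highest_allowed_clock(limit, FINE_STEP_MHZ)
gain_percent = float((fine - coarse) / coarse * 100)

print(f"100 MHz steps:   {float(coarse):.2f} MHz")
print(f"16.67 MHz steps: {float(fine):.2f} MHz (+{gain_percent:.2f}%)")
```

With this hypothetical ceiling the coarse grid settles at 3000 MHz while the fine grid reaches about 3083 MHz, a gain of a little under 3%. The benefit shrinks to zero when the ceiling happens to sit on a 100 MHz boundary.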
Intel has made the branch prediction block eight times larger than in the previous architecture while maintaining accuracy. The throughput of instruction cache requests to L2 has tripled, and instruction fetch throughput has doubled, from 64 to 128 bytes per cycle. Decode throughput has increased from 6 to 8 instructions per cycle, and the micro-op cache has grown along with its read throughput. The micro-op queue has also been enlarged from 144 to 192 entries.
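The before-and-after figures quoted for the front end are easier to compare as relative increases. A minimal sketch using only the numbers above (the dictionary labels are this example's own naming):

```python
# Lion Cove front-end changes as (before, after) pairs quoted in the article.
FRONT_END_CHANGES = {
    "instruction fetch (bytes/cycle)": (64, 128),
    "decode width (instructions/cycle)": (6, 8),
    "micro-op queue (entries)": (144, 192),
}


def percent_increase(before: int, after: int) -> float:
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100


increases = {
    name: percent_increase(before, after)
    for name, (before, after) in FRONT_END_CHANGES.items()
}
for name, value in increases.items():
    print(f"{name}: +{value:.0f}%")
```

Fetch bandwidth doubles, while decode width and queue depth each grow by a third. These are structural widths, not measured performance gains.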
The memory subsystem has a new L0 cache level. The architects completely reworked the data cache, adding a 192 KB level between the existing L1 and L2 caches; as a result, the former L1 has been renamed L0. This increases IPC and allows the L2 cache to grow without the added capacity increasing latency. As a result, the L2 cache increases to 2.5 MB on Lunar Lake and 3 MB on Arrow Lake.
Energy-Efficient Cores
The Lion Cove performance cores bring a large number of improvements, but Skymont promises even greater progress: a 38% increase in IPC in integer workloads and a 68% increase in floating-point workloads. This results in a doubling of single-threaded performance and up to 4 times higher performance in multi-threaded tasks. Intel has also doubled throughput in AVX and VNNI vector workloads.
Intel has improved the branch prediction mechanism, which now fetches 96 bytes of instructions in parallel to feed the decoders. Skymont cores can decode up to 9 instructions per cycle. The micro-op queue capacity has also been increased from 64 to 96 entries.
Intel set out to double vector performance by moving from two 128-bit FP and SIMD vector pipes to four in Skymont. Other improvements to the vector unit are aimed at reducing latency and adding support for floating-point rounding.
Previous E-core clusters had a shared 2MB L2 cache, which has now been increased to 4MB with double the L2 throughput. L1-to-L1 transfer throughput between cores has also been improved.
Interestingly, Intel provided a comparison between Skymont and the Raptor Lake P-core, which uses the Raptor Cove architecture. The company claims a 2% advantage for Skymont in both integer and floating-point performance.
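The move from two to four 128-bit vector units can be expressed as elements processed per cycle. The sketch below assumes every unit issues one full-width operation each cycle, which is a simplification rather than a documented Skymont property.

```python
UNIT_WIDTH_BITS = 128  # width of each FP/SIMD unit (from the article)


def elements_per_cycle(units: int, element_bits: int) -> int:
    """Elements processed per cycle if every unit issues one full-width op."""
    return units * UNIT_WIDTH_BITS // element_bits


for bits, label in ((32, "FP32"), (64, "FP64")):
    before = elements_per_cycle(2, bits)   # two units in the previous E-core
    after = elements_per_cycle(4, bits)    # four units in Skymont
    print(f"{label}: {before} -> {after} elements per cycle")
```

Under that assumption, peak FP32 width goes from 8 to 16 elements per cycle and FP64 from 4 to 8, matching the stated goal of doubling vector throughput.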
Intel Xe2 Integrated Graphics
The new Xe2 graphics processor provides up to 1.5 times the performance of Meteor Lake's Arc Graphics and AI performance of up to 67 TOPS. Intel has simplified the GPU naming and will call it simply Xe2 in all configurations, unlike the Xe-LP, Xe-HP, and Xe-HPG suffixes used in the previous generation.
The Intel Xe2 architecture will appear not only in Lunar Lake processors but also in future Battlemage gaming graphics cards. However, Lunar Lake uses lower-power transistors, while Battlemage will use faster transistors for maximum performance. This means that the performance of Lunar Lake cannot be directly extrapolated to Battlemage graphics cards.
The Xe2 architecture includes the second-generation Xe core, support for more data types, improved vector engines, larger ray tracing units, and more cache. The graphics processor is divided into second-generation Xe cores and render slices, along with fixed-function blocks for tasks such as geometry processing, texture sampling, and rasterization. These blocks are connected to a large cache with an I/O block that varies depending on the implementation. The design is modular, making it easy to scale up or down.
The second-generation Xe core can execute eight 512-bit multiplications per cycle in the XVE vector engines and eight 2048-bit operations per cycle in the XMX engines. Intel has also increased the SIMD engine width from 8 to 16 lanes, improving compatibility. The core has a shared 192 KB L1 cache.
The second-generation vector engine supports INT2, INT4, INT8, FP16, and BF16 instructions for AI operations. Intel's presentation also included a table of peak TOPS (ops/clock) calculations. The Meteor Lake graphics processor did not have an XMX engine, so laptops with Xe2 will see significant gains in AI workloads. The display engine has also received many speedups and improvements.
The Lunar Lake GPU is equipped with 8 second-generation Xe cores, 64 vector engines, two geometry pipelines, eight ray tracing units, and an 8MB L2 cache, among other components. Intel says that the iGPU provides 1.5 times the performance of Meteor Lake-U at the same power. As noted, the Lunar Lake graphics processor uses lower-power transistors for better efficiency.
The display engine supports resolutions up to 8K60 HDR, three 4K60 HDR displays, as well as 1080p360 and 1440p360. Outputs include HDMI 2.1, DisplayPort 2.1, and eDP 1.5. The media engine supports decoding and encoding at up to 8K60 10-bit HDR and handles all major media standards, including the new H.266/VVC codec, though the latter only for decoding.
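The XMX width and the Xe core count can be cross-checked against the quoted 67 TOPS. The sketch below is a rough consistency check, not Intel's own calculation: it assumes eight 2048-bit XMX engines per Xe core, INT8 operands, and that one multiply-accumulate counts as two operations. The resulting clock is implied by those assumptions and is not a figure from the article.

```python
XE_CORES = 8               # from the article
XMX_ENGINES_PER_CORE = 8   # assumption: "eight 2048-bit" means eight engines per core
XMX_WIDTH_BITS = 2048      # per engine per clock (from the article)
INT8_BITS = 8              # assumption: peak TOPS is quoted for INT8
OPS_PER_MAC = 2            # assumption: a multiply-accumulate counts as two ops
QUOTED_TOPS = 67           # GPU AI throughput (from the article)

# Number of INT8 values the whole GPU can process per clock.
int8_lanes = XE_CORES * XMX_ENGINES_PER_CORE * XMX_WIDTH_BITS // INT8_BITS
ops_per_clock = int8_lanes * OPS_PER_MAC

# Clock frequency that would be needed to reach the quoted throughput.
implied_clock_ghz = QUOTED_TOPS * 1e12 / ops_per_clock / 1e9

print(f"INT8 ops per clock: {ops_per_clock}")
print(f"implied GPU clock:  {implied_clock_ghz:.2f} GHz")
```

Under these assumptions the GPU performs 32768 INT8 operations per clock, so 67 TOPS corresponds to a clock of roughly 2 GHz, which is plausible for a mobile iGPU.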
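For a sense of scale, the raw pixel data rates of the display modes listed above can be estimated. The sketch below assumes 10 bits per color channel with three channels per pixel, standard pixel dimensions for each mode, and no blanking intervals or compression, so the results are rough lower bounds rather than actual link rates.

```python
BITS_PER_PIXEL = 30  # assumption: 10 bits per channel, three channels

# Assumed pixel dimensions for the modes named in the article.
MODES = {
    "8K60": (7680, 4320, 60),
    "4K60": (3840, 2160, 60),
    "1440p360": (2560, 1440, 360),
    "1080p360": (1920, 1080, 360),
}


def raw_gbit_s(width: int, height: int, refresh_hz: int) -> float:
    """Uncompressed pixel data rate in Gbit/s, ignoring blanking overhead."""
    return width * height * refresh_hz * BITS_PER_PIXEL / 1e9


for name, (width, height, refresh_hz) in MODES.items():
    print(f"{name}: {raw_gbit_s(width, height, refresh_hz):.2f} Gbit/s")
```

Under these assumptions 8K60 needs about 60 Gbit/s of raw pixel data, roughly four times a single 4K60 stream.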
NPU 4.0 and Controller
The new NPU, rated at 48 TOPS, surpasses some recently introduced counterparts from competitors. This dedicated block is primarily intended to offload AI tasks and save battery power. The graphics processor handles more demanding AI workloads at 67 TOPS, and the CPU contributes an additional 5 TOPS. Altogether, Lunar Lake achieves 120 TOPS.
Key architectural components include 12 enhanced SHAVE DSP units, six neural compute engines, MAC arrays, and a DMA engine. Memory bandwidth is twice that of the previous-generation NPU. The NPU also has access to the 8MB shared side cache on the compute tile. Overall, Intel claims a fourfold increase in peak performance at the same power compared to the previous generation.
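The platform total is simply the sum of the three engines. A short sketch using the figures quoted above shows how the 120 TOPS breaks down (the dictionary keys are this example's own labels):

```python
# Peak AI throughput per engine in TOPS, as quoted in the article.
TOPS_BY_ENGINE = {"NPU": 48, "GPU": 67, "CPU": 5}

total_tops = sum(TOPS_BY_ENGINE.values())
shares = {name: tops / total_tops * 100 for name, tops in TOPS_BY_ENGINE.items()}

print(f"platform total: {total_tops} TOPS")
for name, share in shares.items():
    print(f"{name}: {TOPS_BY_ENGINE[name]} TOPS ({share:.1f}%)")
```

The GPU accounts for just over half of the headline figure and the NPU for 40%. These are peak ratings for separate engines, so a single workload will not necessarily see the combined number.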
The controller tile contains all external I/O functions for the chip, including Wi-Fi 7 and Bluetooth 5.4, USB 3.0 and 2.0, Thunderbolt 4, and PCIe 4.0 and 5.0 interfaces. It also houses memory controllers.
Intel guarantees that all Lunar Lake laptops will have at least two Thunderbolt 4 connection ports, and some models will offer up to three. The interface also supports the new Thunderbolt Share feature. A CNVi module connected via the CNVi 3.0 interface is still required for Wi-Fi 7 and Bluetooth 5.4 functionality.
Source: Tom's Hardware