The power consumption problem of chips continues to increase
More data requires faster processing speeds, leading to a series of issues.
In terms of processing and storing data, power consumption is crucial, and many aspects of it are far from ideal. Issues related to power consumption, especially heat, now dominate chip and system design, and these problems are expanding and multiplying.
As transistor density increases, the heat generated by these tiny digital switches cannot be eliminated through traditional means. Although this problem seems controllable, it has created a cascade of new issues that the entire industry must address together, including EDA companies, process equipment manufacturers, wafer fabs, packaging houses, field monitoring and analysis service providers, material suppliers, research teams, and more.
Behind these activities, a continuous focus is on packing more transistors into a fixed area and the closely related, accelerating battle against leakage. FinFETs addressed the gate leakage issue at 16/14 nanometers, but the problem re-emerged just two nodes later. The 3-nanometer process introduced a gate-all-around field-effect transistor (i.e., nanosheet) structure, making design, metrology, inspection, and testing more challenging and costly. In 2-nanometer/18-angstrom technology, to ensure sufficient power delivery to the transistors and alleviate wiring congestion, power delivery will be flipped from the front to the back of the chip. At future nodes, the industry may once again change the transistor structure, adopting complementary field-effect transistors (CFETs). In this short time window, numerous process and structural changes are emerging, with each new node requiring the resolution of more issues.
For example, as high-density chips and packaging technologies develop, transient thermal gradient issues are increasingly gaining attention. These thermal gradients move unpredictably, sometimes quickly, sometimes slowly, and change with variations in workload. In the 40-nanometer process, with thicker dielectrics, substrates, and more relaxed spacing, these issues were only considered minor nuisances. However, in current cutting-edge process technologies, we need to take these issues more seriously.
Cadence Product Management Director Melika Roshandell stated: "Although the basic leakage has decreased compared to previous technologies, the overall power consumption is higher. So the heat issue will be more severe, because you are integrating more transistors into an integrated circuit while continuously improving performance. You want to adopt higher and higher frequencies, for which you need to increase voltage and power consumption. The total power consumption is now higher than the previous generation, so the heat issue will be more severe. Moreover, when using smaller nodes, the chip area is also shrinking. The reduction in area and the increase in total power consumption can sometimes exacerbate thermal issues, making it impossible for the chip to achieve its intended performance."
Heat is becoming a common nightmare for all hardware engineers and is causing vicious cycles that are difficult to solve and model in advance. Heat accelerates the breakdown of the dielectric films used to insulate signals (time-dependent dielectric breakdown, or TDDB) and increases mechanical stress, leading to warping. It also speeds up electromigration and other aging effects, potentially narrowing the data pathways. This, in turn, increases the heat generated by circuit resistance and the energy required to drive signals, until (if possible) the signals are rerouted.
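The electromigration-aging loop above is often summarized with Black's equation, which relates interconnect lifetime to current density and temperature. The sketch below is a minimal illustration of that temperature sensitivity; the parameter values (prefactor, current exponent, activation energy) are assumptions chosen for illustration, not data for any real process.

```python
import math

# Black's equation for electromigration mean time to failure (MTTF):
#   MTTF = A * J**(-n) * exp(Ea / (k * T))
# All parameter values below are illustrative assumptions, not data
# for any real process node.

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def em_mttf(j_amps_per_cm2, temp_kelvin, a=1e3, n=2.0, ea_ev=0.9):
    """Relative electromigration MTTF from Black's equation."""
    return a * j_amps_per_cm2 ** (-n) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_kelvin))

# A 20 K rise in junction temperature shortens projected lifetime:
base = em_mttf(1e6, 358.0)   # 85 C junction
hot  = em_mttf(1e6, 378.0)   # 105 C junction
print(f"lifetime ratio hot/base: {hot / base:.2f}")
```

Under these assumed parameters, a 20 K rise cuts the projected lifetime to roughly a fifth, which is why localized hot spots matter so much for reliability.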
Heat also affects the operating speed of memory, reducing overall system performance. Additionally, noise generated by heat impacts signal integrity, and the noise can be transient, making partitioning more difficult. All of these factors can shorten the lifespan of a chip, or parts of it. Pradeep Thiagarajan, Chief Product Manager of Analog and Mixed-Signal Verification Solutions at Siemens EDA, said: "Thermal degradation of transistors can easily lead to chip or IP damage. Fortunately, self-heat analysis can be performed for most devices by measuring the transient heating of each MOS device to evaluate the impact of local heating on the design, then loading the temperature-delta data and assessing the waveform impact. Now, facing ever-increasing data transfer rate requirements, innovation is needed in all aspects. Better modeling of all thermal interface materials increases the likelihood of addressing these impacts and making appropriate design adjustments to avoid short-term or long-term hardware failures. Ultimately, we need innovative thermal solutions, and we must also model correctly."
Power consumption issues abound
Many chip manufacturers are only beginning to confront these issues, as most chips are not developed using the most advanced processes. However, as chips increasingly are built from chiplets, everything needs to be characterized and must operate under conditions that planar chips developed at 40nm and above never encountered.
It is worth noting that increasing transistor density, whether on a single chip or in advanced packaging, is not necessarily the most effective way to improve performance. It does, however, increase power density and limit clock frequency. As a result, many significant advancements are not closely tied to the transistors themselves. These include hardware-software co-design, faster physical layers and interconnects, new insulating and electromigration-resistant materials, prefetching with higher accuracy and shorter recovery times, sparse algorithms, and new power delivery schemes.
Vincent Risson, Senior Principal CPU Architect at Arm, said: "Understanding the entire system stack is very important. Of course, the compute contributes significantly to power, but other parts of the system are also important. That's why we have different levels of cache, and the sizes of the caches vary. We increased the cache size in the previous generation because having local cache allows computation to run locally, saving downstream power. As we expand into 3D, we can envision using 3D stacked caches, which will help reduce data transfer and improve efficiency." The key is to improve efficiency at every stage of the design cycle, not just in hardware. Although the chip industry has focused on hardware-software co-design for decades, system companies have been the first to adopt this approach through customized microarchitectures, and mobile device makers also strive to significantly extend battery life for competitive advantage.
Risson said: "We make many adjustments to fully enhance performance, which is a key issue that CPUs are committed to solving. For example, we continuously improve all prefetch engines to increase accuracy and reduce downstream data traffic. As a result, we reduce interconnect traffic while maintaining better coverage."
This is just part of the puzzle; there are more issues to address. For instance, dielectric films gradually deteriorate over time. This can be accelerated by different workloads or operating conditions, especially inside packages filled with chiplets. Norman Chang, a fellow and chief technologist at Ansys' Electronics, Semiconductor, and Optics Division, said: "Because you have so many signals, with nets operating at different voltages, time-dependent dielectric breakdown (TDDB) becomes an issue. If a net is adjacent to another signal net with a different voltage, the dielectric material will see different voltage fields. Over time, time-dependent dielectric breakdown will occur. This is a new problem, and we need to find solutions for it."
Inconsistency issues
Thermal gradients are also a challenge, especially when they fluctuate and have significant differences between different workloads. This problem is particularly evident in 2.5D designs, which can lead to deformation. The same issue is expected to exist in 3D-ICs that will be released in the coming years. In both cases, heat may become trapped, leading to a snowball effect.
Chang said: "In 3D-ICs, power consumption is closely related to temperature. When the temperature rises, leakage power increases, and the thermal gradient distribution becomes the core of multiphysics interactions in 3D-ICs. Temperature affects power consumption and also affects resistance. When the temperature rises, resistance increases, which also affects the dielectric constant. This impacts signal integrity and power integrity, and it also affects stress. In 3D-ICs, when mixing digital and analog, the analog part is more sensitive to stress. You need to know the locations of thermal gradients and hot spots to keep analog components away from the hot spots. If analog components see thermal cycling, device aging will accelerate, transistor mismatches will start to appear, and the performance of analog circuits will decline much faster than that of digital logic."
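The temperature-leakage coupling described above can be sketched as a fixed-point iteration: junction temperature depends on total power, and leakage power depends on temperature. A minimal Python sketch, with all thermal numbers assumed purely for illustration:

```python
# Toy electro-thermal feedback loop: leakage power rises with temperature,
# and junction temperature rises with total power dissipated through a
# thermal resistance. All numbers are illustrative, not silicon data.

def leakage_w(temp_c, leak_25c=2.0, doubling_deg=25.0):
    """Leakage power, assumed to double every `doubling_deg` C above 25 C."""
    return leak_25c * 2.0 ** ((temp_c - 25.0) / doubling_deg)

def settle_temperature(p_dynamic_w, r_th_c_per_w, t_ambient_c=45.0, iters=100):
    """Fixed-point iteration: T = T_amb + R_th * (P_dyn + P_leak(T))."""
    t = t_ambient_c
    for _ in range(iters):
        t_next = t_ambient_c + r_th_c_per_w * (p_dynamic_w + leakage_w(t))
        if t_next > 150.0:          # loop gain > 1: thermal runaway
            return float("inf")
        t = t_next
    return t

print(f"good heatsink: {settle_temperature(20.0, 0.5):.1f} C")
print(f"poor heatsink: {settle_temperature(20.0, 3.0)} C")
```

With a low thermal resistance the loop settles to a stable temperature; past a critical loop gain it runs away, which is exactly the vicious cycle designers need to model in advance.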
This is just the beginning. Kenneth Larsen, Senior Director of Product Management at Synopsys, pointed out that arranging the elements of stacked chips incorrectly can lead to unexpected problems, such as thermal crosstalk, which may also reduce overall performance. "We have moved from monolithic design to chiplet-based design, which has reduced the distance between devices, allowing them to influence each other. When one device is stacked on top of another, how does the heat dissipate? This is a huge challenge. For 3D-ICs, the first question is whether a system with structural integrity can be built. At the same time, you also need to pay attention to other mechanical, thermal, and power consumption issues. There are too many problems to be solved."
In the past, the simplest way to deal with heat was to reduce the voltage. However, this approach has become less effective because, at very low voltage levels, minor anomalies can lead to problems. Roland Jancke, Head of Design Methods at the Adaptive Systems Engineering Department of Fraunhofer IIS, said: "For low-power technologies (such as near-threshold or sub-threshold devices) and high-power devices, noise is a key topic. This is a difficult issue to understand because it usually does not appear during simulation but is exposed in the real world. When noise issues arise in reality, you need to understand and deal with them." Take crosstalk coupling as an example: during the design phase, the noise it generates in the substrate is not easily noticed. Jancke said: "We started using substrate simulators to study crosstalk conditions within the substrate a few years ago. At that time, the focus was on individual devices and the devices around them. However, the crosstalk issues of input stages that are further apart, coupled through the substrate, are often overlooked."
Such issues can also lead to problems in DRAMs, especially when the bit cell density increases, making them more susceptible to noise. Onur Mutlu, a professor of computer science at ETH Zurich, said, "There is definitely thermal noise. Additionally, when you access a cell, electrical interference caused by wire switching or accessing the transistor itself generates noise within the structure. This activation behavior creates noise, leading to reliability issues. We call this cell-to-cell interference. The row hammer problem is an example, where activating one row interferes with adjacent rows. RowPress is another example, where keeping a row open for a long time affects the adjacent rows. As we reduce the size of each cell, narrow the cell spacing, and increase density, this cell interference phenomenon becomes more common. This could lead to silent data corruption, which may be exactly what happens in real-world scenarios."
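The row-disturbance mechanism described above can be caricatured with a per-row activation counter: each activation disturbs its physical neighbors, and a neighbor whose accumulated count crosses a threshold before the next refresh is at risk of bit flips. The threshold and trace below are invented for illustration, not a model of any real DRAM part:

```python
# Toy model of DRAM row-hammer disturbance: each activation of a row adds
# disturbance to its physical neighbors; a neighbor whose count crosses an
# (assumed) threshold within one refresh window may flip bits.

HAMMER_THRESHOLD = 50_000   # assumed activations-per-refresh-window limit
NUM_ROWS = 8

def disturbed_rows(activation_trace, threshold=HAMMER_THRESHOLD):
    """Return rows whose neighbor-disturbance count crosses the threshold."""
    disturb = [0] * NUM_ROWS
    for row in activation_trace:
        for neighbor in (row - 1, row + 1):
            if 0 <= neighbor < NUM_ROWS:
                disturb[neighbor] += 1
    return [r for r, count in enumerate(disturb) if count >= threshold]

# Hammering row 3 repeatedly within one refresh window endangers rows 2 and 4:
trace = [3] * 60_000
print(disturbed_rows(trace))   # -> [2, 4]
```

As cells shrink and the threshold drops with each density generation, the same access pattern that was once harmless crosses into the unsafe region, which is why mitigations now live in the DRAM itself.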
In terms of power consumption, there are always some unexpected issues. Barry Pangrle, power architect at Movellus, said, "Regardless of the clock frequency, we want to operate at the lowest voltage to use the least amount of energy. Although we can build a certain degree of model, unexpected situations will always arise. You can adjust the voltage and frequency of a chip in different environments to test its performance under different loads. You can use this data, and if you want to be more cautious, you can appropriately lower the settings and leave some margin. But people cannot do this for every chip. So, do you want to categorize the chips, such as 'chips of this category will operate at this clock and this voltage.' Additionally, the choice of granularity details will depend on the vendor selling the chip."
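The binning trade-off in the quote above can be sketched as follows; the guard band, bin edges, and per-die Vmin values are all made-up illustrations, not vendor data:

```python
# Per-chip vs per-bin voltage setting: measure each die's minimum stable
# voltage at a target clock, add a safety margin, and either program each
# die individually or snap dies into a few coarse voltage bins.

GUARD_BAND_V = 0.030                      # assumed fixed safety margin
BIN_EDGES_V = [0.55, 0.60, 0.65, 0.70]    # coarse bins a vendor might offer

def per_die_setting(measured_vmin):
    """Ideal but costly: program each die at its own Vmin + margin."""
    return measured_vmin + GUARD_BAND_V

def binned_setting(measured_vmin):
    """Cheaper: snap each die up to the lowest bin that still covers it."""
    need = measured_vmin + GUARD_BAND_V
    for edge in BIN_EDGES_V:
        if edge >= need:
            return edge
    raise ValueError("die fails even the highest voltage bin")

for v in [0.512, 0.548, 0.601, 0.634]:
    print(f"vmin={v:.3f}  per-die={per_die_setting(v):.3f}  bin={binned_setting(v):.3f}")
```

The gap between the per-die column and the bin column is wasted energy; how fine the bins get is exactly the granularity decision left to the vendor.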
Other issues
Power consumption also has financial aspects, from the resources required to create complex designs to the amount of electricity data centers consume. The higher the transistor density, the more energy is needed to power and cool the server racks. In various types of AI applications, the goal is to maximize transistor utilization, which in turn consumes more energy, generates more heat, and requires more cooling.
Noam Brousard, Vice President of Engineering Solutions at proteanTecs, said, "These applications require a lot of electricity, and the demand is increasing exponentially. Efficient power consumption will ultimately bring significant savings to data centers. This is the most important. In addition, we also need to pay attention to the environmental impact of the application and hope to extend the lifespan of electronic products."
The impact of power consumption is not limited to the chip itself. Roshandell from Cadence said, "In 2.5D designs, thermal stress can cause warping, which increases the risk of breaking the solder balls between the interposer and the PCB. Once a crack occurs, a short circuit will appear, causing the product to fail to work properly. Therefore, how to solve this problem and how to model it is crucial. It is necessary to consider this in advance at the earliest stage of the design and take corresponding measures."
In 3D-ICs, the problem becomes more complex. The importance of identifying issues early in the design cycle is once again emphasized, but in 3D-ICs there is a cumulative effect. Chang from Ansys said: "Compared with an SoC, dynamic switching power is very tricky in 3D-ICs. We must consider the physical architecture as early as possible, because if you have 15 chips in a 3D-IC, how do you distribute power among these 15 chips to accommodate the dynamic workload over time? At different moments, a chip may have different workloads, which may create hot spots. But if the top chip has a local hot spot and the bottom chip also has a local hot spot, when the two local hot spots align at a certain point in time, that hot spot becomes a global hot spot. If the other chips are not switching, the global hot spot may be 10 to 15 degrees Celsius hotter than the local hot spot. This catches 3D-IC circuit designers completely off guard, because when you simulate one chip of a 3D-IC, you may not be able to simulate the entire 3D-IC with a realistic workload." The issue is that there are many interdependent factors that need to be understood in context. Niels Faché, Vice President and General Manager of the Design and Simulation Product Group at Keysight Technologies, stated: "You cannot optimize these devices in isolation. You might focus on thermal objectives, such as maximum temperature or heat dissipation, but you need to understand these issues in the context of mechanical stress. You must build models of these individual physical effects. If their interactions are very tight, you need to perform co-simulation. For example, we use electro-thermal simulation. The current flowing through a transistor has an impact on heat. The heat then affects the electrical characteristics, which in turn changes the electrical behavior, and you need to model these interactions."
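The hot-spot alignment effect described above can be shown with two tiny per-die power maps: each die's local hot spot is modest on its own, but where the maps line up, the stack-level peak is their sum. The grids and power values are illustrative only:

```python
# Toy illustration of hot-spot alignment in a die stack: each die has its
# own local power map, and the stack's worst hot spot appears where the
# per-die maps line up. Grid sizes and powers are invented for illustration.

def stack_power_map(die_maps):
    """Element-wise sum of per-die power maps (same grid for each die)."""
    rows, cols = len(die_maps[0]), len(die_maps[0][0])
    return [[sum(die[r][c] for die in die_maps) for c in range(cols)]
            for r in range(rows)]

top =    [[1, 1, 1],
          [1, 5, 1],      # local hot spot at (1, 1)
          [1, 1, 1]]
bottom = [[1, 1, 1],
          [1, 5, 1],      # local hot spot aligned at the same (1, 1)
          [1, 1, 1]]

combined = stack_power_map([top, bottom])
peak = max(max(row) for row in combined)
print(f"per-die peak: 5, stacked peak: {peak}")   # aligned spots add up
```

Simulating either die alone shows only the modest per-die peak; only the combined map exposes the global hot spot, which is the argument for whole-stack simulation with realistic workloads.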
Solutions
There is no single, comprehensive solution to problems related to power consumption, but there are many solutions that can address part of the problem.
One approach to solving the problem, perhaps the simplest, is to limit over-engineering. Steven Woo, a researcher and distinguished inventor at Rambus, said: "Everything starts with focusing on the target application scenario and defining the functions needed to solve these scenarios. It may be tempting to add various functions to meet the needs of other potential markets and use cases, but this often leads to an increase in chip area, power consumption, and complexity, thereby affecting the performance of the chip's main application. We must scrutinize all functions rigorously to make a challenging judgment on whether they truly need to be integrated into the chip. Each new function will affect PPA (power, performance, and area), so always focusing on the target market and use case is the first step."
This will have a significant impact on overall power consumption, especially in the field of AI. Woo said: "There are many factors to consider in AI, especially for edge devices. Some choices include the chip's power supply method, heat dissipation limits, whether it needs to support training and/or inference, accuracy requirements, the environment in which the chip will be deployed, and the digital formats supported. Supporting a large set of functions means a larger area and power consumption, as well as added complexity in disabling functions when they are not in use. Since data transfer affects performance and consumes a large portion of the energy budget, designers need to fully understand how much data needs to be moved when developing architectures that can minimize edge data transfer."
Another method is to test the design with actual workloads. William Ruby, Senior Director of Product Management for Low Power Solutions at Synopsys, said: "Some customers are trying to have us run representative workloads, because we don't know what we don't know. This is like power coverage. What do we think is the sustained worst case? What do we think is a good idle load?" But what they don't know is how a new software update might change the entire activity profile. The hope is that such changes will be gradual and already budgeted for, rather than requiring pessimistically conservative margins. But how do you predict what changes a firmware update will bring?
Backside power supply is another option, especially at the most advanced nodes. "To some extent, you will encounter a problem of diminishing returns because you need to deal with the materials from the top layer to the bottom layer, and the top layer is often the power supply and ground wiring," Pangrle from Movellus said, "If you can achieve power supply from the back side without having to go through the 17 metal layers on top, then you don't need to go through many layers. Being able to bypass the entire metal stack and approach the transistor from the back side without worrying about going through all the vias is like the magic of manufacturing."
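A back-of-the-envelope IR-drop comparison illustrates why bypassing the metal stack helps. The per-layer resistances and load current below are assumptions for illustration, not process data:

```python
# Rough IR-drop comparison motivating backside power delivery: front-side
# power must thread a tall via/metal stack, while a backside network
# reaches the transistors through far fewer, shorter hops.
# Resistances and current are illustrative assumptions.

def ir_drop_mv(current_a, resistances_ohm):
    """Voltage drop across a series chain of via/metal resistances, in mV."""
    return current_a * sum(resistances_ohm) * 1e3

frontside_stack = [0.5] * 17    # assumed ~0.5 ohm per layer hop, 17 layers
backside_path   = [0.5] * 2     # assumed 2 hops through backside vias

i_load = 0.010                  # 10 mA local demand
print(f"front-side drop: {ir_drop_mv(i_load, frontside_stack):.1f} mV")
print(f"back-side drop:  {ir_drop_mv(i_load, backside_path):.1f} mV")
```

At sub-volt supplies, tens of millivolts of drop is a meaningful fraction of the voltage budget, so shortening the delivery path matters as much as widening it.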
Using sensors inside the chip and package to monitor power-related changes in behavior is another method. Brousard from proteanTecs said: "In real-world applications, there are many factors that can degrade performance, so we must build in voltage guard bands. We know there will be noise, excessive workloads, and chip aging. All these factors force us to apply a voltage greater than the best-case VDDmin."
Additionally, copper can be used to conduct heat to places where it can be dissipated. Larsen from Synopsys said: "You can take simple measures, such as optimizing the TSV layout in the stacked chip, or using thermal vias. This is very complex, but the EDA field has always dealt with exponential problems. This is what we need to solve. However, when you want to alleviate certain problems, you need to add something, which may eat into some of the value you expected to get, but it is necessary. For reliability, you may add redundancy, which may be TSVs in the stack or hybrid bonding."

Conclusion
Over the past few decades, power consumption has been an issue for leading chip manufacturers. Smartphones issue warnings of overheating during operation and shut down until they cool down. For the same reason, a server rack might shift load to another rack. However, as chips are increasingly broken down into various components and packaged together, and as industries such as automotive begin to develop chips at 5 nanometers and below, power consumption issues will emerge in more areas.
Architecture, layout and routing, signal integrity, heat generation, reliability, manufacturability, and aging are all closely tied to power consumption. As the chip industry continues to address unique markets in unique ways and with different functionalities, the entire industry needs to learn how to handle or resolve power-related effects. In the past, only the chip manufacturers with the highest production volumes cared about power consumption; what has changed is that fewer and fewer manufacturers can afford to ignore it.
More data requires faster processing speeds, leading to a series of issues.
In terms of processing and storing data, power consumption is crucial, and many aspects of it are far from ideal. Issues related to power consumption, especially heat, now dominate chip and system design, and these problems are expanding and multiplying.
As transistor density increases, the heat generated by these tiny digital switches cannot be eliminated through traditional means. Although this problem seems controllable, it has created a cascade of new issues that the entire industry must address together, including EDA companies, process equipment manufacturers, wafer fabs, packaging houses, field monitoring and analysis service providers, material suppliers, research teams, and more.
Behind these activities, a continuous focus is on integrating more transistors into a fixed area and the closely related, accelerating battle against power leakage. FinFETs addressed the leakage gate issue in 16/14 nanometer technology, but the problem re-emerged just two nodes later. In the 3-nanometer process, a distinctive all-around gate field-effect transistor (i.e., nanosheet) structure was introduced, making design, metrology, inspection, and testing more challenging and costly. In the 2-nanometer/18-angstrom technology, to ensure sufficient power delivery to the transistors and alleviate wiring issues, power delivery will be flipped from the front to the back of the chip. At higher technology levels, the industry may once again change the transistor structure to adopt a complementary field-effect transistor (CFET). In this short time window, numerous process and structural changes are emerging, with each new node requiring the resolution of more issues.
Advertisement
For example, as high-density chips and packaging technologies develop, transient thermal gradient issues are increasingly gaining attention. These thermal gradients move unpredictably, sometimes quickly, sometimes slowly, and change with variations in workload. In the 40-nanometer process, with thicker dielectrics, substrates, and more relaxed spacing, these issues were only considered minor nuisances. However, in current cutting-edge process technologies, we need to take these issues more seriously.
Cadence Product Management Director Melika Roshandell stated: "Although the basic leakage has decreased compared to previous technologies, the overall power consumption is higher. So, the heat issue will be more severe because you are integrating more transistors into an integrated circuit while continuously improving performance. You want to adopt higher and higher frequencies, for which you need to increase voltage and power consumption. The total power consumption is now higher than the previous generation, so the heat issue will be more severe. Moreover, when using smaller nodes, the chip area is also decreasing. The reduction in area and the increase in total power consumption can sometimes lead to exacerbated thermal issues, making it impossible for the chip to achieve
Heat is becoming a common nightmare for all hardware engineers and is causing some vicious cycles that are difficult to solve and model in advance:Heat accelerates the rupture of dielectric films used to protect signals (time-dependent dielectric breakdown, or TDDB) and increases mechanical stress, leading to warping. Heat causes a series of issues: it speeds up electromigration and other aging effects, potentially narrowing the data pathways. This further increases the heat generated by circuit resistance and the energy required to drive signals, until (if possible) the signals are rerouted.
Heat also affects the operating speed of memory, reducing the overall system performance. Additionally, noise generated by heat impacts signal integrity, and the noise can be transient, making partitioning more difficult. All these factors can shorten the lifespan of a chip, or even affect a part of it. Pradeep Thiagarajan, Chief Product Manager of Analog and Mixed-Signal Verification Solutions at Siemens EDA, said: "Thermal degradation of transistors can easily lead to chip or IP damage. Fortunately, most devices' self-heat analysis can be assessed by measuring the transient heating of each MOS device to evaluate the local heating's impact on the design, then loading temperature difference data and assessing waveform impact. Now, facing the increasing requirements for data transfer rates, innovation is needed in all aspects. Therefore, better modeling of all thermal interface materials increases the likelihood of addressing these impacts and making appropriate design adjustments to avoid short-term or long-term hardware failures. Ultimately, we need innovative thermal solutions, and we must also model correctly."
Power consumption issues abound
Many chip manufacturers are just beginning to address these issues, as most chips are not developed using the most advanced processes. However, as chips increasingly become composed of chiplets, everything needs to be characterized and operated under conditions not developed on 40nm or higher process planar chips.
It is worth noting that increasing transistor density, whether on a single chip or in advanced packaging, is not necessarily the most effective way to improve performance. However, it does increase power density and limits clock frequency. Therefore, many significant advancements are not closely related to the transistors themselves. These advancements include hardware-software co-design, faster physical layers and interconnects, new types of insulating and electronic migration materials, prefetch processing with higher precision and shorter recovery times, sparse algorithms, and new power delivery schemes.
Vincent Risson, Senior Principal CPU Architect at Arm, said: "Understanding the entire system stack is very important. Of course, the computer contributes significantly to power, but other parts of the system are also important. That's why we have different levels of cache, and the sizes of the caches vary. We increased the cache size in the previous generation because having local cache allows downstream power to treat computation as running locally. As we expand into 3D, we can envision using 3D stacked caches, which will help reduce data transfer and improve efficiency."The key is to improve efficiency at every stage of the design cycle, not just in hardware. Although the chip industry has been focusing on hardware for decades—software co-design, system companies have been the first to adopt this approach through customized microarchitectures, and mobile devices also strive to significantly extend battery life for competitive advantage.
Risson said: "We make many adjustments to fully enhance performance, which is a key issue that CPUs are committed to solving. For example, we continuously improve all prefetch engines to increase accuracy and reduce downstream data traffic. As a result, we reduce interconnect traffic while maintaining better coverage."
This is just part of the puzzle; we also need to address more issues. For instance, as time goes by, dielectric films will gradually deteriorate. This situation can be accelerated by different workloads or working conditions, especially inside the packaging filled with chip products. Norman Chang, a researcher and chief technology expert at Ansys' Electronics, Semiconductor, and Optics Division, said: "Due to the need to handle so many signals and operate on a polygon network at different voltages, time-dependent dielectric breakdown (TDDB) becomes an issue. If a network is adjacent to another signal network with a different voltage, the dielectric material will sense different voltage fields. Over time, time-dependent dielectric breakdown will occur. This is a new problem, and we need to find solutions for it."
Inconsistency issues
Thermal gradients are also a challenge, especially when they fluctuate and have significant differences between different workloads. This problem is particularly evident in 2.5D designs, which can lead to deformation. The same issue is expected to exist in 3D-ICs that will be released in the coming years. In both cases, heat may become trapped, leading to a snowball effect.
Zhang said: "In 3D-ICs, power consumption is closely related to temperature. When the temperature rises, the leakage power consumption will increase, and the thermal gradient distribution becomes the core of multi-physical interactions in 3D-ICs. Temperature affects power consumption and also affects resistance. When the temperature rises, resistance will also increase, which will also affect the dielectric constant. This will impact signal integrity and power integrity, and it will also affect stress. In 3D-ICs, when mixing digital and analog, the analog part is more sensitive to stress. You need to know the location of thermal gradients and hot spots to keep analog components away from hot spots. If you see thermal cycles of analog components, the aging speed of the device will accelerate, and you will start to see transistor mismatches, and the efficiency of analog circuits will quickly decline compared to digital logic."
This is just the beginning. Kenneth Larsen, Senior Director of Product Management at Synopsys, pointed out that arranging the positions of various elements in stacked chips incorrectly can lead to some unexpected problems, such as thermal crosstalk, which may also reduce overall performance. "We have moved from monolithic design to a design based on fragments, which has reduced the distance between devices, allowing them to influence each other. When one device is stacked on top of another, how does the heat dissipate? This is a huge challenge. For 3D-ICs, the first question is whether a system with structural integrity can be built. At the same time, you also need to pay attention to other mechanical, thermal, and power consumption issues—there are too many problems to be solved."
In the past, the simplest way to deal with heat was to reduce the voltage. However, this approach has become ineffective because, at very low voltage levels, minor anomalies can lead to problems. Roland Jancke, Head of Design Methods at the Adaptive Systems Engineering Department of Fraunhofer IIS, said: "For low-power technologies (such as critical or sub-critical devices) and high-power devices, noise is a key topic. This is a difficult issue to understand because it usually does not appear during simulation but is exposed in the real world. When noise issues arise in reality, you need to understand and deal with them."Taking crosstalk coupling as an example, during the design phase, the noise it generates in the substrate is not easily noticeable. Jancke said: "We started using substrate simulators to study crosstalk conditions within the substrate a few years ago. At that time, the focus was on individual devices and the devices around them. However, the crosstalk issues of input stages that are further apart through substrate coupling are often overlooked."
Such issues can also cause problems in DRAM, especially as bit-cell density increases and cells become more susceptible to noise. Onur Mutlu, professor of computer science at ETH Zurich, said: "There is definitely thermal noise. In addition, when you access a cell, the electrical disturbance caused by the wires switching, or by accessing the transistor itself, generates noise inside the structure. That activation creates noise, which leads to reliability problems. We call this cell-to-cell interference. The RowHammer problem is one example, where activating one row disturbs adjacent rows. RowPress is another, where keeping a row open for a long time affects the adjacent rows. As we shrink each cell, narrow the cell spacing, and increase density, this cell-to-cell interference becomes more common. It can lead to silent data corruption, and that may be exactly what happens in real systems."
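A common class of RowHammer mitigations tracks activation counts per row and proactively refreshes the physically adjacent victim rows once an aggressor crosses a threshold. The sketch below is a hedged, simplified model of that idea; the threshold and the assumption that victims are simply `row - 1` and `row + 1` are illustrative, not taken from any real DRAM datasheet.

```python
# Simplified counter-based RowHammer mitigation sketch (illustrative only).
from collections import Counter

class RowHammerGuard:
    def __init__(self, threshold=4096):
        self.threshold = threshold
        self.activations = Counter()
        self.refreshed = []  # victim rows we proactively refreshed

    def activate(self, row):
        self.activations[row] += 1
        if self.activations[row] >= self.threshold:
            # Refresh the physical neighbours before they lose charge.
            self.refreshed.extend([row - 1, row + 1])
            self.activations[row] = 0

guard = RowHammerGuard(threshold=3)
for _ in range(3):
    guard.activate(42)    # hammer row 42 repeatedly
print(guard.refreshed)     # -> [41, 43]
```

Real DRAM mitigations (such as target row refresh) are implemented inside the memory controller or the DRAM itself; this Python model only captures the counting-and-refresh logic.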
With power consumption there are always unexpected issues. Barry Pangrle, power architect at Movellus, said: "Whatever the clock frequency, we want to operate at the lowest voltage so we use the least energy. Although we can model this to some degree, surprises always arise. You can sweep a chip's voltage and frequency in different environments to test its behavior under different loads. You can use that data, and if you want to be cautious you can back off the settings and leave some margin. But people can't do that for every chip. So do you bin the chips, as in 'chips in this category will run at this clock and this voltage'? How fine-grained that binning gets depends on the vendor selling the chip."
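The per-chip binning Pangrle describes can be sketched as follows: characterize each die's minimum stable voltage, add a safety margin, and then assign the die to the lowest coarse-grained operating point that still covers it. The bin voltages and margin below are assumed numbers purely for illustration.

```python
# Sketch of voltage binning with a guard margin (all numbers illustrative).

BINS_V = [0.65, 0.70, 0.75, 0.80]  # available operating points, in volts

def assign_bin(vmin_measured, margin=0.03):
    """Pick the lowest bin that still covers measured Vmin plus a margin."""
    target = vmin_measured + margin
    for v in BINS_V:
        if v >= target:
            return v
    return None  # die fails even the highest available bin

for vmin in (0.61, 0.68, 0.79):
    print(vmin, "->", assign_bin(vmin))
```

Coarser bins waste margin on the best dies; finer bins cost more test time per part, which is the granularity trade-off the quote alludes to.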
Other issues
Power consumption has a financial dimension as well, from the resources required to create complex designs to the amount of electricity data centers consume. The higher the transistor density, the more energy is needed to power and cool the server racks. In AI applications of all types, the goal is to maximize transistor utilization, which in turn consumes more energy, generates more heat, and requires more cooling.
Noam Brousard, Vice President of Engineering Solutions at proteanTecs, said: "These applications require a lot of electricity, and the demand is growing exponentially. Efficient power consumption ultimately brings significant savings to data centers. That is the most important factor. Beyond that, we also need to consider the environmental impact of these applications, and we want to extend the lifespan of electronic products."
The impact of power consumption is not limited to the chip itself. Roshandell from Cadence said: "In 2.5D designs, thermal stress can cause warping, which increases the risk of cracking the solder balls between the interposer and the PCB. Once a crack occurs, the connection fails and the product stops working properly. So how to solve this problem, and how to model it, is crucial. It has to be considered at the earliest stage of the design, with corresponding measures taken in advance."
In 3D-ICs, the problem becomes more complex. Identifying issues early in the design cycle matters here, too, but in 3D-ICs there is a cumulative effect. Zhang from Ansys said: "Compared with an SoC, dynamic switching power is really tricky in 3D-ICs. We must consider the physical architecture as early as possible, because if you have 15 chips in a 3D-IC, how do you distribute power between those 15 chips to accommodate the dynamic workload and the time dimension? At different moments a chip may have different workloads, which may create hot spots. But if the top chip has a local hot spot and the bottom chip also has a local hot spot, and the two local hot spots align at a certain point in time, that hot spot becomes a global hot spot. If the other chips are not switching, the global hot spot may be 10 to 15 degrees Celsius hotter than the local hot spots. This catches 3D-IC designers completely off guard, because when you simulate one chip in a 3D-IC, you may not be able to simulate the entire 3D-IC with a realistic workload."

The underlying issue is that there are many interdependent factors, and they need to be understood in context. Niels Faché, Vice President and General Manager of the Design and Simulation Product Group at Keysight Technologies, said: "You cannot optimize these devices in isolation. You might focus on thermal objectives, such as maximum temperature or heat dissipation, but you need to understand them in the context of mechanical stress. You must build models of these individual physical effects, and if they are tightly coupled you need to run them as a co-simulation. For example, we use electro-thermal simulation. The current flowing through a transistor has an impact on heat, and the heat then affects the electrical characteristics, which in turn changes the electrical behavior. You need to model these interactions."
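The aligned-hot-spot effect Zhang describes can be illustrated with a toy superposition: each die in the stack has a modest local temperature rise, but when the peaks line up at the same location, the combined rise far exceeds either one alone. The grid, the fall-off function, and simple additive superposition are deliberate simplifications for illustration; real thermal coupling between dies is not simply additive.

```python
# Toy illustration of aligned hot spots in a two-die stack (not a real
# thermal model; superposition and fall-off shape are assumptions).

def hotspot(grid_n, cx, cy, peak):
    """Temperature-rise map with one peak that falls off with distance."""
    return [[peak / (1 + abs(x - cx) + abs(y - cy))
             for x in range(grid_n)] for y in range(grid_n)]

top = hotspot(8, cx=3, cy=3, peak=12.0)     # local hot spot on top die
bottom = hotspot(8, cx=3, cy=3, peak=10.0)  # aligned hot spot on bottom die

combined = [[t + b for t, b in zip(rt, rb)] for rt, rb in zip(top, bottom)]
peak_combined = max(max(row) for row in combined)
print(f"top peak 12.0, bottom peak 10.0, stacked peak {peak_combined:.1f}")
```

Shift one die's hot spot a few cells away and the stacked peak drops sharply, which is why floorplanning the stack to de-align hot spots is a standard mitigation.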
Solutions
There is no single, comprehensive solution to problems related to power consumption, but there are many solutions that can address part of the problem.
One approach, perhaps the simplest, is to limit over-engineering. Steven Woo, fellow and distinguished inventor at Rambus, said: "Everything starts with focusing on the target application and defining the capabilities needed to address it. It can be tempting to add features to serve other potential markets and use cases, but that often increases chip area, power, and complexity, hurting the chip's performance in its primary application. We have to scrutinize every feature rigorously and make the hard call on whether it truly needs to be in the chip. Each new feature affects PPA (power, performance, and area), so staying focused on the target market and use case is the first step."
This has a significant impact on overall power consumption, especially in AI. Woo said: "There are many factors to consider in AI, especially for edge devices. Some of the choices include how the chip is powered, its heat-dissipation limits, whether it needs to support training and/or inference, accuracy requirements, the environment in which the chip will be deployed, and the number formats supported. Supporting a large set of features means more area and power, and it adds the complexity of disabling those features when they are not used. Since data movement affects performance and consumes a large portion of the energy budget, designers need a thorough understanding of how much data has to be moved as they develop architectures that minimize data movement at the edge."
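Woo's point that data movement dominates the edge-AI energy budget can be made concrete with some rough bookkeeping. The per-operation energies below are order-of-magnitude assumptions in picojoules, chosen only to show the gap between keeping data on-chip and fetching it from off-chip DRAM; they are not measurements of any specific process.

```python
# Rough energy bookkeeping: compute vs. data movement (assumed costs, in pJ).

PJ_MAC = 1.0          # one multiply-accumulate
PJ_SRAM_BYTE = 5.0    # fetch one byte from on-chip SRAM
PJ_DRAM_BYTE = 100.0  # fetch one byte from off-chip DRAM

def layer_energy_pj(macs, bytes_sram, bytes_dram):
    return macs * PJ_MAC + bytes_sram * PJ_SRAM_BYTE + bytes_dram * PJ_DRAM_BYTE

# Identical compute, different data placement:
on_chip  = layer_energy_pj(macs=1_000_000, bytes_sram=500_000, bytes_dram=0)
off_chip = layer_energy_pj(macs=1_000_000, bytes_sram=0, bytes_dram=500_000)
print(f"on-chip: {on_chip/1e6:.1f} uJ, off-chip: {off_chip/1e6:.1f} uJ")
```

With these assumed costs, the off-chip variant spends most of its energy moving bytes rather than computing, which is why edge architectures work so hard to keep data local.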
Another approach is to test the design with actual workloads. William Ruby, Senior Director of Product Management for Low Power Solutions at Synopsys, said: "Some customers are asking us to run representative workloads, because we don't know what we don't know. It is like power coverage. What do we think is the sustained worst case? What do we think is a good idle load? What they don't know is how a new software update might change the entire activity profile. The hope is that the change will be gradual and that they have budgeted for it, rather than being pessimistically over-conservative. But how do you predict what changes a firmware update will bring?"
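The "power coverage" question Ruby raises amounts to asking how much headroom remains between the worst case observed across the workloads that were actually run and the power budget, since a future firmware update may push a new workload past it. The traces and the budget below are made-up numbers for illustration.

```python
# Sketch of workload power coverage (all traces and the budget are made up).

workload_traces_w = {
    "idle":         [1.2, 1.1, 1.3],
    "video":        [4.8, 5.1, 4.9],
    "ai_inference": [7.9, 8.4, 8.1],
}
BUDGET_W = 10.0

worst_case = max(max(trace) for trace in workload_traces_w.values())
idle_floor = min(min(trace) for trace in workload_traces_w.values())
headroom = BUDGET_W - worst_case
print(f"worst observed {worst_case} W, idle {idle_floor} W, "
      f"headroom {headroom:.1f} W")
```

The unresolved part of Ruby's question is exactly what this sketch cannot answer: whether tomorrow's firmware produces an activity profile outside the set of traces that were measured.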
Backside power delivery is another option, especially at the most advanced nodes. "To some extent you run into diminishing returns, because you have to work through the materials from the top layer down, and the top layers are often the power and ground routing," said Pangrle from Movellus. "If you can deliver power from the back side, you don't have to go down through the 17 metal layers on top. Being able to bypass the entire metal stack and reach the transistor from the back side, without worrying about going through all the vias, is like manufacturing magic."
Using sensors inside the chip and package to monitor power-related changes in behavior is another method. Brousard from proteanTecs said: "In real-world applications there are many factors that can degrade performance, so we must build in voltage guard bands. We know there will be noise, excessive workloads, and chip aging. All of these factors force us to apply a voltage greater than the best-case VDDmin."
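The guard band Brousard describes is, at its simplest, additive arithmetic: the applied supply must cover VDDmin plus allowances for noise and droop, workload transients, and end-of-life aging. The margin values below are assumptions for illustration only.

```python
# Minimal guard-band arithmetic (all margin values are assumed).

def supply_setpoint_mv(vddmin_mv, droop_mv=30, workload_mv=20, aging_mv=25):
    """Applied VDD = best-case VDDmin plus stacked safety margins."""
    return vddmin_mv + droop_mv + workload_mv + aging_mv

vdd = supply_setpoint_mv(vddmin_mv=550)
print(f"set VDD to {vdd} mV ({vdd - 550} mV guard band over VDDmin)")
```

In-field sensing aims to shrink those static allowances: if the chip can report its actual droop and aging, the margins can be tightened per part instead of being set to a pessimistic worst case at design time.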
Additionally, copper wires can be used to conduct heat to places where it can be dissipated. Larsen from Synopsys said: "You can take simple measures, such as optimizing the TSV layout in the stacked chip, or using thermal vias. This is very complex, but the EDA field has always dealt with exponential problems. This is what we need to solve. However, when you want to mitigate certain problems you have to add something, which may eat into some of the value you expected to get, but it is necessary. For reliability, you may add redundancy, which may be TSVs in the stack or hybrid bonding."

Conclusion
For the past few decades, power consumption has been an issue mainly for leading-edge chip makers. Smartphones warn of overheating during operation and shut down until they cool off. For the same reason, a server rack may shift its load to another rack. But as chips are increasingly decomposed into components and packaged together, and as industries such as automotive begin developing chips at 5 nanometers and below, power consumption issues will surface in many more places.
Architecture, layout and routing, signal integrity, heat, reliability, manufacturability, and aging are all closely tied to power consumption. As the chip industry continues to address distinct markets in distinct ways and with different features, the whole industry needs to learn how to manage or resolve power-related effects. In the past, only the highest-volume chip makers cared about power consumption. What has changed is that far fewer manufacturers can now afford to ignore power in their designs.