Design of VLSI (Very Large Scale Integration) Circuits and Systems refers to the process of designing and developing integrated circuits (ICs) that consist of a large number of transistors and other electronic components. VLSI circuits and systems are used in a wide range of electronic devices, from computers and smartphones to medical equipment and automotive systems.
This course delves into the advanced principles of VLSI (Very Large Scale Integration) circuit and system design using the latest CMOS (Complementary Metal-Oxide-Semiconductor) technologies. Students will learn:
- Circuit-level optimization techniques utilizing gate size, supply voltage, and threshold voltage.
- Layout strategies for circuit blocks to optimize for power, speed, or area.
- Advanced concepts of retiming, place and route for more efficient designs.
- Strategies for mitigating power leakage, managing interconnects, and handling clock and power distribution.
- How device variability impacts the design process.
The course will explore how these concepts apply to a range of applications, including microprocessors, signal and multimedia processors, portable devices, memory, and periphery. Throughout the course, there will be a special emphasis on circuit optimization and designing for ultra-low power, which will be integrated into both the lectures and project work.
“Energy-Efficient FPGA Configurable Logic Block (CLB) Design”
The goal of this project is to build a configurable logic block (CLB) for FPGAs that consumes the minimum energy while meeting a worst-case delay of 2 ns. The architecture of the CLB is shown in Fig. 1.
The CLB can function as one 8-bit adder, two 4-bit adders, or four 4-input look-up tables (LUTs). Each LUT requires 16 configuration bits of storage elements. Assume these bits to be available as inputs, so the LUT functions as a 16:1 multiplexer. You may use gate sizing and supply voltage scaling as variables. No extra pipelining registers are allowed.
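The operating modes described above can be sketched behaviorally. This is a minimal Python model, not the project's netlist: the function names (`lut4`, `add`, `add8_from_nibbles`) and the address bit ordering (a3 as the most significant select bit) are assumptions made for illustration only.

```python
def lut4(config_bits, a3, a2, a1, a0):
    """4-input LUT modeled as a 16:1 mux: the address picks one config bit.

    Assumed ordering: a3 is the most significant select bit.
    """
    assert len(config_bits) == 16
    index = (a3 << 3) | (a2 << 2) | (a1 << 1) | a0
    return config_bits[index]


def add(a, b, cin, width):
    """Behavioral adder: returns (sum, carry_out) for the given bit width."""
    total = a + b + cin
    mask = (1 << width) - 1
    return total & mask, total >> width


def add8_from_nibbles(a, b, cin):
    """8-bit adder built by chaining two 4-bit adders through the carry."""
    low, carry = add(a & 0xF, b & 0xF, cin, 4)
    high, cout = add(a >> 4, b >> 4, carry, 4)
    return (high << 4) | low, cout


# Example configuration: a 4-input parity (XOR) function stored in 16 bits.
xor4_config = [bin(i).count("1") & 1 for i in range(16)]
```

Chaining the two 4-bit adders through the intermediate carry gives the same result as a single 8-bit addition, which is why the same hardware can serve both adder modes.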
- In phase 1, use the design expertise you acquired in class to find the optimum building blocks (adder, multiplexer, and register) that best optimize the speed-energy goal. Do a quick sketch of several feasible options and figure out the best architecture and circuit style. You may mix circuit styles if that helps.
- In phase 2, first implement block-level schematics of the building blocks and verify the functionality in Spectre. Then, identify the critical path and optimize the sizing for minimum delay. In the critical path evaluation, you need to determine not only the gates along the path, but also the input operands that cause the worst-case delay between input and output bits.
- In phase 3, lay out and verify your design. Layout area is defined as the smallest bounding box around your design. The layout aspect ratio of the design (long side / short side) should be less than 1.5.
- Supply Voltage : Find the optimal VDD that meets the delay target and minimizes energy. Multiple-voltage design is allowed with proper level-shifter insertion.
- Implementation Choices : Use only static logic (CMOS, pass-transistor logic).
- Input Operands: All operands (AdrA, AdrB, AdrC, AdrD) are 4-bit numbers. There is an incoming carry signal, Cin.
- The input capacitance of all inputs (AdrA[3:0], AdrB[3:0], AdrC[3:0], AdrD[3:0]) is less than or equal to that of 2 unit-sized inverters (see below for the definition of a unit-sized inverter). For simulation purposes, the inputs are driven by a unit-sized buffer (a chain of two unit-sized inverters). The delay is measured from after the input driver (2 inverters) to before the load (32× the input capacitance of the unit-sized inverter). A test circuit will be provided.
- All outputs (OutA[1:0], OutB[1:0], OutC[1:0], OutD[1:0]) are loaded with CL = 32 unit-sized inverters. This load will be implemented with inverters.
- Unit sized inverter is Wp = 650nm, Wn = 430nm, Lp = Ln = 100nm.
- The delay is evaluated for the path from Cin to OutD when CinAB=10, CinCD = 10, AddCD = 1, SyncD = 0. Functions of LUTs, adders, and output registers should be verified. Extra inverters/buffers are allowed for delay minimization.
- Minimum width of VDD/Gnd rails is 0.6μm.
- You can use up to 5 metal layers.
Title: Energy-Efficient FPGA Configurable Logic Block (CLB) Design
This work presents an energy-efficient FPGA configurable logic block that consumes 286.1 fJ at a VDD of 460 mV. The layout size is 58.2 um by 87.7 um, with an aspect ratio of 1.506. A static CMOS low-power (LP) full adder is used in this design, and VDD and sizing optimization have been performed along the critical path.
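As a quick sanity check of the reported dimensions, the bounding-box area and aspect ratio can be recomputed from the two side lengths. The variable names below are illustrative only.

```python
# Side lengths of the layout bounding box, as reported in the summary.
width_um, height_um = 58.2, 87.7

area_um2 = width_um * height_um
# Aspect ratio is defined in the project as long side / short side.
aspect_ratio = max(width_um, height_um) / min(width_um, height_um)

print(f"area = {area_um2:.1f} um^2, aspect ratio = {aspect_ratio:.3f}")
```

The area comes out at about 5104 um^2 and the aspect ratio at just over 1.506, slightly above the limit of 1.5 listed in the project constraints.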
Full Adder Summary
- CMOS full Adder.
- CMOS Mirror Adder.
- CPL Full Adder
- LEAP Full Adder
- Low Power (LP) Full Adder
- Transmission Gate (TG) Full Adder
- TGdrivcap Full Adder
- Dual-rail Domino Full Adder
- Adder Topology : Low Power (LP) Full Adder
The LP full adder (A. Shams and M. Bayoumi, “A novel high-performance CMOS 1-bit full-adder cell,” IEEE Trans. Circuits Syst. II, vol. 47, pp. 478-481, May 2000) has low power consumption because it is based on low-power XOR and XNOR cells. However, its delay increases quadratically with the number of stages. Generally, we distinguish two categories of full-adder schemes: those with drive-ability and those without. Drive-ability indicates whether there are buffers at the intermediate nodes or inside the cell. The LP and TG full adders have no drive-ability, which means they only form a path for the signal to pass through. A chain of them therefore behaves like a distributed RC line and suffers from Elmore delay, which grows quadratically with the number of stages. The other adders have a buffer (drive-ability), so their delay increases linearly.
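The quadratic versus linear behavior can be illustrated with a small numerical sketch. It assumes an idealized chain of identical stages, each with a hypothetical series resistance `r` and node capacitance `c` (normalized to 1), and a hypothetical fixed buffer delay `t_buf`. None of these values come from the actual design.

```python
def elmore_unbuffered(n_stages, r=1.0, c=1.0):
    """Elmore delay of n identical pass-gate stages in series, no buffers.

    The capacitance at node i is charged through the resistance of all i
    upstream stages, so the total is r*c*(1 + 2 + ... + n) = r*c*n*(n+1)/2.
    """
    return sum(i * r * c for i in range(1, n_stages + 1))


def delay_buffered(n_stages, r=1.0, c=1.0, t_buf=1.0):
    """Each stage is isolated by a buffer: one RC plus a fixed buffer delay."""
    return n_stages * (r * c + t_buf)


for n in (1, 2, 4, 8):
    print(n, elmore_unbuffered(n), delay_buffered(n))
```

With these normalized values, the unbuffered chain is faster for short chains but is overtaken by the buffered one as the number of stages grows, which is the trade-off between the two adder categories described above.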
- Circuit Style : Static CMOS
- Comparison between LP and TGdrivcap Full Adders
Generally, in applications with a small number of bits, LP or TG full adders consume much less energy while still meeting a competitive delay specification. Among the adders with drive-ability, the TGdrivcap full adder has relatively low energy, which also makes it attractive in terms of energy consumption. In our application, an 8-bit adder, the two candidates are evaluated at minimum size. In our initial evaluations, the TGdrivcap adder shows 5 to 10% additional delay and a 30% energy penalty compared with the LP adder. This result closely resembles the one reported in (1).
The overall summary is shown below:
As shown in Fig. 1, the critical path runs through the 8-bit adder when all carries ripple. From Cin to OutD there is no internal buffer, since the MUXs are implemented with transmission gates and the LP adder has no buffer inside. Thus, we add buffers at the intermediate nodes to mitigate the Elmore delay penalty.
These buffers could be replaced with level-conversion blocks if a multi-VDD scheme is needed for the optimization; in that case, each region would have a separate VDD domain. The critical path delay is the sum of the carry-ripple delays of the full-adder blocks and the MUX delays.
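To make the worst-case operand condition concrete, the following sketch enumerates the 8-bit operand pairs for which a change on Cin ripples through every carry. It uses a generic XOR/mux formulation of the full adder (propagate signal p = a XOR b selects between Cin and a for the carry). This is a logic-level assumption for illustration, not the transistor-level LP cell.

```python
def full_adder(a, b, cin):
    """Generic 1-bit full adder written in propagate/mux form."""
    p = a ^ b                  # propagate: carry-out follows carry-in
    s = p ^ cin                # sum bit
    cout = cin if p else a     # if not propagating, carry-out equals a (= b)
    return s, cout


def carry_chain(a, b, cin, width=8):
    """Return the list of carries [c0, c1, ..., c_width] of a ripple adder."""
    carries = [cin]
    for i in range(width):
        _, c = full_adder((a >> i) & 1, (b >> i) & 1, carries[-1])
        carries.append(c)
    return carries


def cin_ripples_to_cout(a, b, width=8):
    """True if toggling Cin toggles every carry, i.e. all bits propagate."""
    low = carry_chain(a, b, 0, width)
    high = carry_chain(a, b, 1, width)
    return all(x != y for x, y in zip(low, high))


worst_pairs = [(a, b) for a in range(256) for b in range(256)
               if cin_ripples_to_cout(a, b)]
```

Every pair found this way has complementary operands (a XOR b = 0xFF), for example 0xFF and 0x00, so a transition on Cin has to travel through all eight carry stages.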
The gate-level schematic is shown below:
Before the optimization, which follows the method in reference (2): D. Markovic et al., “Methods for True Energy-Performance Optimization,” IEEE JSSC, vol. 39, no. 8, Aug. 2004, all devices along the critical path are sized arbitrarily. With parasitics and branching efforts included, the optimization can be done in Matlab.
In this optimization, the minimum delay is around 430 ps, obtained at the maximum energy. The overall results from the optimization are:
Because the target delay is 2 ns and the minimum delay from the analysis is around 430 ps, we can add delay of roughly three times the minimum delay. The figure shows the energy reduction in percent as the delay increases. The X-axis is the additional delay relative to the minimum delay. The Y-axis is the energy reduction on the critical path, not the overall energy reduction. Interestingly, the given target delay is too relaxed to allow a meaningful optimization, because all the physical constraints are reached before the target delay is. First, the minimum-size constraint is reached at around 0.9 times additional delay. Then, the VDD constraint is reached at around 1.7 times additional delay.
Therefore, except for the last buffer driving the output load, all devices are minimum size, and VDD is around 0.5 V, which is barely enough for an inverter to switch. With a multi-VDD scheme, almost the same (minimum) VDD is obtained in every domain. Thus, we use a single VDD with regular buffers.
Note that with this optimization, the energy on the critical path can be reduced by almost 90 percent.
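The supply-scaling part of this trade-off can be sketched with a simple analytical model. The example below assumes an alpha-power-law delay model and dynamic energy proportional to VDD squared. The nominal supply (1.0 V), threshold voltage (0.3 V), and alpha (1.3) are hypothetical values chosen for illustration, not parameters extracted from the process used here, and the model is not accurate close to the threshold voltage.

```python
V_NOM = 1.0      # assumed nominal supply in volts (hypothetical)
V_T = 0.3        # assumed threshold voltage in volts (hypothetical)
ALPHA = 1.3      # assumed velocity-saturation index (hypothetical)
D_MIN_PS = 430.0     # minimum delay from the analysis, taken at V_NOM
D_TARGET_PS = 2000.0 # project delay target

# Calibrate the model constant so that delay(V_NOM) equals D_MIN_PS.
K_CAL = D_MIN_PS * (V_NOM - V_T) ** ALPHA / V_NOM


def delay_ps(vdd, k=K_CAL, vt=V_T, alpha=ALPHA):
    """Alpha-power-law delay model: delay grows as VDD approaches vt."""
    return k * vdd / (vdd - vt) ** alpha


def min_vdd_for_delay(target_ps, v_hi=V_NOM, vt=V_T, tol=1e-6):
    """Bisection for the lowest VDD whose modeled delay meets the target.

    The modeled delay decreases monotonically with VDD for alpha >= 1.
    """
    lo, hi = vt + 1e-3, v_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay_ps(mid) > target_ps:
            lo = mid   # too slow: raise the supply
        else:
            hi = mid   # fast enough: try a lower supply
    return hi


v_opt = min_vdd_for_delay(D_TARGET_PS)
energy_ratio = (v_opt / V_NOM) ** 2   # dynamic energy scales with VDD^2
print(f"VDD = {v_opt:.3f} V, energy = {100 * energy_ratio:.1f}% of nominal")
```

This captures only supply scaling at fixed sizing. The optimization in the report also adjusts gate sizes and is bounded by the minimum-size and minimum-VDD constraints, so its numbers should not be expected to match this simplified model.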
The figures above show the delay and energy characteristics of each stage. When we increase the delay and optimize the energy at the same time, stages 2 and 4 are the largest contributors to energy and delay. Stages 2 and 4 are the adder blocks, which consist of 4 transmission gates in series without any buffers. Thus, a couple of buffers are added in the full-adder blocks.
1. Full Adder 1-bit
2. Entire Layout
The overall layout floorplan is center-oriented. The inputs, AdrA~AdrD[3:0] and DA~DD[15:0], are fed from the left and right sides, and the internal signals are routed toward the center. The final results are captured at the center, and the last buffers are placed at the bottom. In fact, the last buffers could be placed at either the top or the bottom, since the project description does not specify the floorplan or the input/output port locations. The overall layout was done manually, and the total area is 5104 um^2. The critical path is highlighted with a blue line.
The blocks are very compact, and they share the same VDD and GND lines.