Edward Chan, Huabo Chen and Chee Yee Chung, NVIDIA Corp.
This is a comparative study of the performance of 4-layer and 6-layer FCBGA packages designed to support a high speed DDR1 interface. Being able to quantify the performance impact in terms of MHz is critical to help decide the cost vs. performance trade-off. A detailed simulation approach, where we were able to translate power distribution network performance into timing impact, will be presented. The good correlation between simulations and measurements demonstrates our ability to predict performance and helps improve future decisions based on tradeoffs between cost and performance.
High-performance GPUs demand tremendous memory bandwidth to generate complex digital images at high frame rates. This memory I/O interface, known as the Frame Buffer, is typically a source-synchronous DDR wide bus. At data rates now exceeding 1 gigabit per second per signal line, these interfaces require well-designed power distribution networks to supply the necessary current and minimize simultaneous switching noise (SSN).
Flip chip ball grid array (FCBGA) packages have replaced wirebond packages as the demands of high speed signalling interfaces have exceeded the capabilities of standard wirebond packages. However, the improved performance comes at a significant monetary cost. Baseline 4-layer FCBGA packages are attractive in the consumer electronics market where cost is paramount. Thicker packages with additional buildup layers offer additional room for better-designed power distribution networks, placement of decoupling capacitors, and lower routing density to minimize crosstalk.
Figure 1. 4L and 6L FCBGA packages.
In this paper, we present a comparative study of the performance of 4-layer (4L) and 6-layer (6L) FCBGA packages designed to support a high speed DDR1 interface. Being able to quantify the performance impact in terms of MHz is critical to help decide the cost vs. performance tradeoff. A detailed simulation approach, where we were able to translate power distribution network performance into timing impact, will be presented. The good correlation between simulations and measurements demonstrates our ability to predict performance and helps improve future decisions based on tradeoffs between cost and performance.
4 and 6-layer FCBGA packages
4-layer FCBGA packages have 2 conductor layers above the core and 2 layers below as shown in Table 1. The typical stack-up of a 4-layer package usually has signal routing on the top layer, with the next layer being the reference plane for the routing. This pushes the power planes below the package core. With routing on the top layer, space is limited for package decoupling capacitor placement on top. Placement of capacitors on the BGA side is not cost effective. In this particular design, we were able to place 5 decoupling capacitors for the Frame Buffer VDDQ supply (FBVDDQ) on the 4-layer package connected to the flip chip bumps via a power plane below the package core.
Table 1. Package stackup and usage in 4L and 6L packages.
Table 2. Summary of differences between 4L and 6L packages
In the 6-layer package, on the other hand, we were able to connect the decoupling capacitors through wide power planes in the top 3 conductor layers of the package. Since most of the top layer is ground (GND), we could find room to place 13 capacitors for the FBVDDQ supply. The 6-layer design is clearly superior from an electrical perspective, allowing higher signalling speeds, but as in any consumer product, this advantage must be weighed against the dollar cost.
The flowchart describing the analysis methodology is shown in Figure 2. We begin by extracting the impedance profile of the power distribution networks (FBVDDQ, GND) of the two package designs, and of the single printed circuit board design that the packages are mounted on. We extracted the packages and the board separately to minimize the amount of computation required: discretizing the combined package and board with sufficient accuracy to resolve package-size features would require a large amount of computer resources. These separate board and package models are then combined in HSPICE. Past experience has shown that not much accuracy is lost when constructing composite package-and-board models this way.
Figure 2. Flowchart of the analysis methodology.
The commercial tool PowerSI from Sigrity, Inc. was used to convert the actual package layout files into simulation models from which we generated the frequency-dependent impedance profile. This impedance profile was then fed into Broadband SPICE, also from Sigrity, Inc., to produce a multiport SPICE network which can then be used for transient simulations. In the PowerSI simulation and extraction, all the flip chip bumps of a given net were connected together to create a single node. This was done because the eventual SPICE circuit will have only a single node connecting all the I/O driver subcircuits to the power distribution network. It is impractical to hook up individual SPICE subcircuits of I/O drivers to each individual bump. This approximation assumes that the spatial variation of the power supply voltage around the silicon die is small, which is true for a well-designed package that supplies power uniformly to all regions of the die.
The board where the package is mounted is analyzed in a similar way as the package. On both the package and the board, the BGA nodes of a given net are also grouped together into a single node. This allows a direct node-to-node connection between the package and board models.
Figure 3. Probing the impedance of the power distribution network as seen from the bumps.
Using a network analyzer and a pair of 50 ohm microwave probes, we measured the impedance profile of a package alone and of a package mounted on the board. We probed the impedance as seen at the bumps, as shown in Figure 3. Figure 4 shows the excellent correlation between the simulations and the measurements. For this correlation study, the simulation model did not have the bumps connected together because such a configuration cannot be measured directly. Instead, we simulated the impedance as seen at various bump pair locations around the package to get representative results.
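The text does not state how the network analyzer readings were converted to impedance. As an illustration only, the sketch below assumes the common two-port shunt-through arrangement, in which both 50 ohm probes land on the same power/ground node and the low impedance under test appears as a shunt element between the two ports. The function names and the sample impedance value are hypothetical.

```python
import numpy as np

def shunt_impedance_from_s21(s21, z0=50.0):
    """Impedance of a shunt element between two z0 ports, from measured S21.

    Assumes an ideal two-port shunt-through setup: Z = (z0 / 2) * S21 / (1 - S21).
    """
    s21 = np.asarray(s21, dtype=complex)
    return 0.5 * z0 * s21 / (1.0 - s21)

def s21_from_shunt_impedance(z, z0=50.0):
    """Forward model for the same setup: S21 = 2Z / (2Z + z0)."""
    z = np.asarray(z, dtype=complex)
    return 2.0 * z / (2.0 * z + z0)

if __name__ == "__main__":
    # Hypothetical PDN impedance at one frequency point (ohms).
    z_true = 0.1 + 0.5j
    s21 = s21_from_shunt_impedance(z_true)
    print("S21 magnitude (dB):", 20 * np.log10(abs(s21)))
    print("Recovered Z (ohms):", shunt_impedance_from_s21(s21))
```

For impedances far below 50 ohm, S21 is approximately 2Z/z0, which is why this arrangement is often preferred for low-impedance power distribution networks.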
Figure 4. Correlation between simulations (dotted line) and measurements (solid line) of the FBVDDQ power distribution network of 4L and 6L packages (a) package alone, and (b) package mounted on the corresponding board.
Clearly, the impedance of the power distribution network of the 6-layer package is lower, promising superior performance. The impedance profiles show that the decoupling scheme in the 4-layer package is inferior not just because of the smaller number of capacitors, but because the current path from the bumps to the capacitors goes through the package core twice. Core vias, 800 μm long, have significant inductance, on the order of 500 pH per power-ground pair at the minimum design rule spacing. The capacitors on the 6-layer package, on the other hand, are connected by the wide planes on Conductor1 and Conductor3 and hence provide a lower impedance path.
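To give a feel for the core-via penalty, the following back-of-envelope sketch uses the 500 pH per power-ground pair figure quoted above and the two core crossings in the 4-layer capacitor path. The number of via pairs in parallel is not given in the text, so `n_pairs` is a hypothetical parameter, and the pairs are assumed to be identical with no mutual coupling between them.

```python
import math

L_PAIR_H = 500e-12  # inductance per power-ground core via pair, from the text

def core_via_inductance(n_pairs, core_crossings=2, l_pair=L_PAIR_H):
    """Series inductance added by core vias on the bump-to-capacitor path.

    Assumes n_pairs identical, uncoupled via pairs in parallel at each crossing.
    """
    return core_crossings * l_pair / n_pairs

def inductive_reactance(l_henry, f_hz):
    """Magnitude of the reactance of an inductor: |X| = 2 * pi * f * L."""
    return 2.0 * math.pi * f_hz * l_henry

if __name__ == "__main__":
    for n in (1, 10, 50):  # hypothetical via-pair counts
        l_total = core_via_inductance(n)
        x = inductive_reactance(l_total, 250e6)
        print(f"{n:3d} pairs: {l_total * 1e12:7.1f} pH, {x * 1e3:8.2f} mohm at 250 MHz")
```

A single pair crossed twice contributes 1 nH, or roughly 1.6 ohm at 250 MHz, which shows why the count of parallel core vias matters so much for the 4-layer decoupling path.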
The differences between the packages are mitigated somewhat after mounting on the board, since the BGA ballmap and the board are identical for both packages. The board has additional decoupling capacitors for FBVDDQ on the underside directly below the package. Figure 5 shows the total power distribution network for the two packages as seen from unified nodes at the bumps, including on-die capacitance. The impedance is lower than the value seen from a single bump because this incorporates all the parallel paths from the bumps to the BGAs and finally to the voltage regulator and voltage source.
Figure 5. Total power distribution network impedance as seen from a unified node at the bumps, including estimated die capacitance.
With the simulation network ready as shown in Figure 6, we proceed to transient simulations. We chose to run the simulations in HSPICE instead of using IBIS models in order to improve accuracy, especially for simulations of push-out and push-in, and also to incorporate the effects of the predriver circuits. We have found IBIS to be insufficiently accurate for these types of analyses.
Figure 6. HSPICE simulation network incorporating power distribution models, trace models, I/O buffer subcircuits, terminations and loads.
The various signals for a DDR1 memory interface are included in the simulation – data (DQ), strobe (DQS), mask (DQM), clock (CLK), and command and address (CMD). One SPICE subcircuit of a driver with the appropriate multiplier is used for each particular type of signal. As mentioned above, all the I/O driver subcircuits share a common power supply node. The various signals transition with a fixed phase relationship to the clock edges. We took care to make sure that the drive strengths, trace routing and loading in the simulations match the measurements.
The GPU, which has 128 data bitlines, was programmed to drive DDR1 signals from a 2.5V FBVDDQ supply. We varied the pattern of the data lines and the frequency of the memory clock and captured the transient signal waveforms and noise on the power supply. The oscilloscope was triggered by the memory clock to eliminate sources of jitter such as the phase-locked loop (PLL) that would be included if we used the main crystal oscillator reference as the trigger. The memory clock pad on the die is usually surrounded by quieter pads such as command and calibration pads and hence experiences much less SSN compared to the data and strobe lines.
We measured the strobe signal (DQS) while running three different patterns on all 128 bits of the data lines (DQ): (i) 010101 (ii) 101010 (iii) 000000. Patterns (i) and (ii) are distinguishable because of their specific phase relationship to DQS. All the DQ, DQM and CMD lines are unterminated, with neither on-die nor on-board termination, whereas the DQS and CLK lines are terminated with 60 ohm resistors to mid-rail located near the DRAMs.
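The difference between the three patterns can be made concrete by counting how many data lines switch, and in which direction, at each bit boundary. The sketch below is purely illustrative: it assumes all 128 DQ lines carry the same pattern, as described above, and the helper names are hypothetical.

```python
N_DQ = 128  # number of data lines, from the text

def transitions(pattern):
    """Per-boundary change for one line: +1 rising, -1 falling, 0 no change."""
    return [int(b) - int(a) for a, b in zip(pattern, pattern[1:])]

def net_switching(pattern, n_lines=N_DQ):
    """Signed count of lines switching at each bit boundary.

    Positive values mean lines switching low to high (drawing current from
    the supply); negative values mean lines switching high to low.
    """
    return [t * n_lines for t in transitions(pattern)]

if __name__ == "__main__":
    for name, pattern in (("i", "010101"), ("ii", "101010"), ("iii", "000000")):
        print(f"pattern ({name}) {pattern}: {net_switching(pattern)}")
```

Patterns (i) and (ii) produce the same number of switching lines at every boundary but with opposite sign, while pattern (iii) produces none, which is what makes it a useful quiet reference.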
We soldered short wire stubs to the points of interest on the boards to allow reliable and repeatable measurements using high impedance probes (Tektronix P7240) connected to a digital oscilloscope (Tektronix TDS 7254). Since we were interested in the performance of the GPU drivers writing out data to the DRAMs, we probed points as close as possible to the termination resistors on the board to limit capturing mid-transmission-line ringing and reflection effects. The termination resistors help reduce Inter-Symbol Interference (ISI), thus allowing a cleaner measurement of SSN, which is the main thrust of this paper.
Since we were characterizing the GPU drivers, we were interested only in the writes to the DRAMs, but the inevitable read bursts interrupted the measurements. That is another reason why we chose to use deterministic clock-like patterns for this study rather than pseudo-random patterns, which require long, uninterrupted write streams and are influenced more heavily by the actual transmission line characteristics and terminations.
Figure 7. Effect of simultaneous switching noise on the DQS signal at 100 MHz with three different DQ patterns. (a) Measured 4L (b) Simulated 4L (c) Measured 6L (d) Simulated 6L.
Representative eye diagrams at 100 MHz (memory clock speed) for 4-layer and 6-layer packages are shown in Figure 7, where we overlaid the three DQS signals corresponding to the three different DQ patterns. Note that DQ is offset in phase from DQS by half a bit period during writes from the GPU to the DRAM. The dotted line (iii) shows DQS when DQ is constantly writing zeros, hence not inducing SSN. The solid line (i) is the case when DQ switches from low to high in the middle of the DQS pulse, drawing significant current from the shared FBVDDQ supply and hence causing the DQS signal to dip slightly as indicated in the figure. The dashed line (ii) shows the DQS signal rising slightly as the data lines switch from high to low, thus pushing up the FBVDDQ voltage. The 4-layer package shows more SSN effects than the 6-layer package. Despite our best efforts, the effects of stubs still show up in the rising and falling edges of the DQS signal, but they should not affect the measurement in the middle of the DQS pulse.
The simulated curves match the measurements quite well. The same SSN effect in the middle of the pulse is visible, and the SSN effect in the 4-layer package is more pronounced. The simulated magnitudes of the dips and humps are comparable to the measurements too.
Figure 8. Effect of simultaneous switching noise on the DQS signal at 250 MHz. (a) Measured 4L (b) Simulated 4L (c) Measured 6L (d) Simulated 6L.
At higher frequency, the difference between the 4-layer and 6-layer packages is more substantial, as is evident in Figure 8, which shows measurements and simulations at 250 MHz. Not only is the variation in amplitude more significant, but the pulse width and timing with respect to the clock have changed as well. The SSN effect is now a larger portion of the bit period. The simulation captures the differences in the performance of the 4-layer and 6-layer packages quite well, with timing push-out and push-in much more evident in the 4-layer package. The differences in signal amplitude are reasonably well modeled too. Evaluating the shapes of the DQS waveforms at various frequencies, we estimate that the benefit of going from a 4-layer design to a 6-layer design is about 50 MHz.
Measurement and simulation of the power supply noise as a function of frequency are shown in Figure 9. The measurement was made by monitoring the output of a single DQ driver as it was kept driving high (1) while the other data drivers were toggling 010101. We made the measurement directly under the GPU in order to minimize termination effects and crosstalk. The simulation tracks the measurement trend and, as expected, the power supply noise rises with frequency and the 6-layer package has lower noise.
Figure 9. Power supply noise as measured at the GPU BGA as a function of frequency for 4L and 6L packages.
Now that we have confidence in the methodology and in the accuracy of our simulation models, we can design for improved performance. In addition, we can also explore the impact of different routing topologies and loading on the signal lines. To examine the sensitivity of the I/O drivers to package power delivery impedance, we created simple equivalent models for the power distribution network, essentially a single inductor connected to an ideal power supply. Sweeping the value of the inductor, we obtained the jitter due to ISI and SSN as a function of power supply inductance. In this case, we ran pseudo-random patterns for the aggressor and victim data lines to generate ISI as well. Jitter rises sharply once the package inductance grows beyond a certain threshold. As shown in Figure 10, the results for the actual packages line up nicely with the sensitivity curve, allowing us to design packages with particular target impedances in mind for a given target frequency. The difference in jitter between the 4L and 6L packages corresponds to a difference in maximum operating frequency of about 50 MHz, similar to the earlier estimate from observing the form of the DQS signals.
Figure 10. Jitter due to ISI and SSN as a function of power supply inductance. The data points for the actual 4-layer and 6-layer packages are shown.
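The single-inductor equivalent model lends itself to a quick first-order estimate before running full transient simulations. The sketch below is not the HSPICE analysis described above: it only applies V = L * dI/dt to a swept inductance, so it yields a noise voltage that grows linearly with inductance and does not reproduce the jitter threshold behavior in Figure 10. The per-driver current step, the edge time, and the swept inductance values are all hypothetical placeholders.

```python
def supply_noise(l_henry, n_drivers, di_amp, dt_sec):
    """First-order SSN estimate: V = L * N * dI / dt.

    l_henry   -- equivalent power supply inductance (swept parameter)
    n_drivers -- number of drivers switching simultaneously
    di_amp    -- current step per driver (hypothetical value)
    dt_sec    -- edge time over which the current changes (hypothetical value)
    """
    return l_henry * n_drivers * di_amp / dt_sec

def sweep_inductance(inductances, n_drivers=128, di_amp=0.010, dt_sec=0.5e-9):
    """Return (inductance, estimated noise voltage) pairs for a sweep."""
    return [(l, supply_noise(l, n_drivers, di_amp, dt_sec)) for l in inductances]

if __name__ == "__main__":
    # Hypothetical sweep from 10 pH to 200 pH.
    for l, v in sweep_inductance([10e-12, 50e-12, 100e-12, 200e-12]):
        print(f"L = {l * 1e12:6.1f} pH -> estimated noise = {v * 1e3:7.1f} mV")
```

An estimate like this is useful mainly for choosing a sensible sweep range; the timing impact itself still has to come from transient simulation with the real driver subcircuits.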
We described a thorough methodology for simulating and calibrating a high-speed signalling interface. Applying it to the comparison of 4-layer and 6-layer FCBGA packages supporting DDR1 signalling, we showed that we can predict and understand the detailed performance of the interface accurately. This allows us to make well-quantified tradeoffs between cost and performance in future designs.