William H. M. Kamp, Ph.D,
High Performance Computing Research Lab,
Auckland University of Technology, New Zealand.
The SKA Mid-frequency Correlator dumps 20100 visibility products from its matrix-style cross-correlator function every 190 microseconds. It does this over 20 buses, each approximately 312 bits wide, with a total instantaneous bandwidth of over 500 GB/s. This burst of data must be serialised and processed by the long-term accumulator through an external DDR4 interface that limits the bandwidth. This is achieved with a debursting buffer that has 20 independent write ports and a single independent read port. The design requires the write ports to operate at 500 MHz, and the data crosses clock domains through the RAM into the 300 MHz clock domain of the DDR4 interface.
Conceptually the Debursting Buffer is constructed as per Figure 1. There are a number of independent write ports, each directed to their own block RAM. The read ports of all RAMs are driven by the same lower address bits, and the data is selected from the correct RAM using a multiplexer based on the upper address bits.
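The conceptual structure can be sketched as a small behavioural model. This is only an illustration, not the RTL of the design: the class name, the method names and the RAM depth are assumptions, and the two clock domains are not modelled.

```python
class DeburstingBuffer:
    """Behavioural model: one RAM per write port, one multiplexed read port."""

    def __init__(self, num_ports=20, depth=1024):
        self.num_ports = num_ports
        self.depth = depth  # words per RAM (assumed value, not from the design)
        self.rams = [[0] * depth for _ in range(num_ports)]

    def write(self, port, addr, data):
        # Each write port only ever touches its own RAM.
        self.rams[port][addr] = data

    def read(self, addr):
        lower = addr % self.depth    # lower bits, shared by every RAM read port
        upper = addr // self.depth   # upper bits, select which RAM is used
        candidates = [ram[lower] for ram in self.rams]  # all RAMs read in parallel
        return candidates[upper]     # the output multiplexer


buf = DeburstingBuffer(num_ports=4, depth=8)
buf.write(port=2, addr=5, data=0xABC)
word = buf.read(2 * 8 + 5)  # upper bits = 2, lower bits = 5
```

Every RAM is read on every access, and the multiplexer discards all but one result. That multiplexer is the structure optimised in the rest of this article.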
Multiplexer to OR Gate Initial Optimisation
The Stratix10 M20K blockRAM has a feature that, when enabled, forces the read data output to zero while the read-enable is low. The 20 read enables are derived from the upper address bits and are used to gate the outputs of their respective RAMs. Only one RAM read data output will be non-zero at any time, so the read data can be reduced by a bus of multi-input OR gates. Furthermore, this blockRAM feature internalises some logic (the AND gates from the multiplexer), reducing the associated LUT and routing resources.
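The equivalence between the multiplexer and the OR reduction can be checked with a short sketch. The function names and the example data are hypothetical, and `ram_output` models only the zero-when-disabled behaviour described above, not the M20K itself.

```python
def ram_output(ram, lower, read_enable):
    # Output is forced to zero when the read-enable is low.
    return ram[lower] if read_enable else 0

def read_mux(rams, upper, lower):
    # Reference behaviour: select one RAM output with a multiplexer.
    return rams[upper][lower]

def read_or(rams, upper, lower):
    # Each read enable is decoded from the upper address bits, so at most
    # one RAM drives a non-zero value and a plain OR recovers it.
    result = 0
    for index, ram in enumerate(rams):
        result |= ram_output(ram, lower, read_enable=(index == upper))
    return result

# Four small RAMs filled with distinct example words.
rams = [[(i << 8) | j for j in range(8)] for i in range(4)]
matches = all(
    read_mux(rams, upper, lower) == read_or(rams, upper, lower)
    for upper in range(4)
    for lower in range(8)
)
```

The comparison `index == upper` plays the role of the AND gates of the multiplexer, which the blockRAM feature absorbs.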
Hierarchical Wide OR Gate
The correlator design has 20 write interfaces, each 300 bits wide. Reducing these to a single read interface requires 300 OR gates, each with 20 inputs. These OR gates are too wide for a single LUT, so each one requires multiple levels of LUTs to implement.
The Stratix10 ALM has a fracturable LUT that can implement any one of
- one 6-input LUT,
- two 5-input LUTs that share at least one input,
- two independent 4-input LUTs, or
- independent 3-input and 5-input LUTs.
See Section 220.127.116.11 in the Stratix10 LAB user guide: UG-S10LAB.
All inputs to an OR gate are independent, so one obvious solution for a 20-input OR gate would implement something like Figure 3, with five 4-input LUTs and one 5-input LUT.
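The two-level structure can be written out as follows. This is a functional sketch with hypothetical function names. The resource totals at the end are simple arithmetic on the 20-port, 300-bit figures quoted above and count LUT functions, not ALMs.

```python
def or_lut(bits):
    """One LUT configured as an OR gate over its inputs."""
    return int(any(bits))

def hierarchical_or20(bits):
    if len(bits) != 20:
        raise ValueError("expected exactly 20 inputs")
    # First level: five 4-input LUTs, each covering four RAM outputs.
    first_level = [or_lut(bits[i:i + 4]) for i in range(0, 20, 4)]
    # Second level: one 5-input LUT combines the five partial results.
    return or_lut(first_level)

LUTS_PER_BIT = 5 + 1                     # five first-level LUTs plus one second-level LUT
BUS_WIDTH = 300                          # bits per interface, from the text
total_luts = LUTS_PER_BIT * BUS_WIDTH    # LUT functions for the whole bus
total_ram_outputs = 20 * BUS_WIDTH       # signals converging on the OR logic
```

For each output bit, signals from all 20 RAMs must be routed to a small group of LUTs placed close together.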
With hyper-pipelining (or manually), registers can be retimed into this hierarchical structure, between the two layers of lookup tables, to increase the operational clock frequency.
This hierarchical structure was synthesised and fitted to a Stratix10 device (along with the rest of the design), allowing Quartus to decide on the number of levels and width of the LUTs. Analysing the implementation in the Chip Planner showed some interesting results, in particular the routing congestion report shown in Figure 4.
Where routing congestion is high, the router cannot take the shortest path between the source and destination; it must direct signals around the congested area on a longer path. This both adds to the problem, by consuming even more routing resources, and causes a potential drop in operating frequency, as the valuable setup timing slack is consumed by the additional interconnect routing delay. Hyper-retiming can help relieve this symptom by retiming registers into the longer paths, recovering some of the operating frequency. Unfortunately, hyper-retiming occurs after placement and routing, so it cannot solve the root cause, which is the routing congestion.
The area of high routing congestion in Figure 4 corresponds to the area where the debursting buffer had been placed, as shown in Figure 5. This congestion is not a symptom of a full device, since there are plenty of unused resources. There must be a lot of signals trying to go to the same place within the debursting buffer.
Serialised Wide OR Gate
An alternative structure for building a wide OR gate is to chain smaller ones together as shown in Figure 6.
Similarly, pipeline registers can be retimed into this structure so that there is one at the output of each LUT. Remember that pushing a flop through a gate from output to input will result in a flop on all inputs, resulting in the full pipelining shown in Figure 7. The serialised structure has a greater latency (in clock cycles) than the hierarchical structure. In big-O notation, the serialised OR gate has a latency of O(N), whereas the hierarchical structure has a latency of only O(log(N)).
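The difference in latency can be made concrete by counting LUT stages, assuming one pipeline register at the output of every LUT. The helper names are hypothetical and the LUT width is a free parameter. The sketch assumes a uniform LUT width and does not claim to match the exact LUT sizes used in the figures.

```python
def tree_levels(num_inputs, lut_inputs):
    """Levels in a balanced OR tree built from LUTs of a fixed width."""
    levels = 0
    signals = num_inputs
    while signals > 1:
        signals = -(-signals // lut_inputs)  # ceiling division
        levels += 1
    return levels

def chain_stages(num_inputs, lut_inputs):
    """Stages in a serial OR chain.

    The first LUT absorbs lut_inputs signals; every later LUT takes the
    previous stage's output plus (lut_inputs - 1) new signals.
    """
    return -(-(num_inputs - 1) // (lut_inputs - 1))

# 20 RAM outputs per bit, as in the text.
tree_2, chain_2 = tree_levels(20, 2), chain_stages(20, 2)  # 2-input OR gates
tree_6, chain_6 = tree_levels(20, 6), chain_stages(20, 6)  # 6-input LUTs
```

With 2-input gates the chain needs 19 stages against 5 tree levels. With 6-input LUTs it needs 4 stages against 2 levels. The chain length grows linearly with the number of inputs, while the tree depth grows logarithmically.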
Notice the large number of flops on the inputs to the OR LUT closest to the output. These can additionally be retimed through the RAM blocks that feed the wide OR gate as shown in Figure 8.
Retiming all the additional flops through the RAM blocks, and merging common flops, will result in the read address being pipelined from one RAM to the next. With the release of Quartus 18.0, hyper-retiming has an option to enable retiming registers across RAM blocks like this (off by default). However, the hyper-retimer can only move flops and cannot change the routing, so the merging of common flops is an unavailable optimisation.
If you have been paying attention, the write address and write data ports will now have assumed the towers of flops that belonged to the serialised OR gate in Figure 7. If Quartus does the retiming, then to maintain the equivalence between the write and read timing it must implement all these flops. An engineer, however, can remove these flops by changing the access patterns to the RAM to ensure that no address conflicts occur between the read and write ports. Furthermore, Quartus cannot retime RAMs that have independent clock domains for the read and write ports. This is the case for the larger correlator design.
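A cycle-level sketch of the resulting read path is given below: the read address steps from one RAM to the next, and the partial OR result travels alongside it. This is a simplified model under stated assumptions, not the actual design: the class and method names are hypothetical, each stage is a 2-input OR, the RAM read latency is lumped into a single register per stage, and the write side and clock-domain crossing are omitted.

```python
class SerialisedReadPath:
    def __init__(self, rams):
        self.rams = rams
        self.depth = len(rams[0])
        n = len(rams)
        self.addr_regs = [None] * n  # read address travelling from RAM to RAM
        self.data_regs = [0] * n     # partial OR result travelling alongside it

    def clock(self, addr):
        """Advance one read-clock cycle; addr is a new read address or None."""
        new_addr = [addr] + self.addr_regs[:-1]
        new_data = []
        for i, ram in enumerate(self.rams):
            a = self.addr_regs[i]
            # Read enable decoded from the upper address bits at this stage.
            enabled = a is not None and a // self.depth == i
            ram_out = ram[a % self.depth] if enabled else 0
            previous = self.data_regs[i - 1] if i > 0 else 0
            new_data.append(previous | ram_out)
        self.addr_regs, self.data_regs = new_addr, new_data
        return self.data_regs[-1]


# Three small RAMs with distinct, non-zero example words.
rams = [[(i + 1) * 100 + j for j in range(4)] for i in range(3)]
path = SerialisedReadPath(rams)
addresses = list(range(12))
stimulus = addresses + [None] * len(rams)      # extra cycles flush the pipeline
outputs = [path.clock(a) for a in stimulus]
read_back = outputs[len(rams):]                # discard the pipeline fill cycles
expected = [rams[a // 4][a % 4] for a in addresses]
```

In this model each stage connects only to its neighbours, and one word is delivered per cycle once the pipeline has filled, at the cost of a latency equal to the number of RAMs.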
With all pipelining in place, the serialised debursting buffer implementation is synthesised and analysed again in the Chip Planner. This results in the much improved routing congestion shown in Figure 9. The highly congested area is now absent, with no obvious location of the previously prominent debursting buffer. Its location is shown in Figure 10 by the cells marked in red.
Comparing Figure 10 with the hierarchical implementation in Figure 5, notice that the serialised debursting buffer is less densely packed. This is likely the result of the pipelined structure, which allows the logic to cascade through the chip rather than being pulled to a central location. This gives a more easily routable design and, ultimately, a higher operating clock frequency. The cost is the increase in latency.
The wide and deep multiplexer at the heart of the debursting buffer module was restructured from a typical hierarchical implementation to a serialised implementation. This allowed Quartus the freedom to spread the debursting buffer out through the chip, avoiding the routing congestion experienced by the hierarchical implementation. This in turn reduced the interconnect timing delay, which, combined with hyper-retiming, improved timing closure (and compile time). As a result, the required throughput is achieved for the SKA Mid.CBF correlator design.