ISPD 2017 Contest: Clock-Aware FPGA Placement

The Xilinx UltraScale architecture introduces a new ASIC-like clocking architecture to the FPGA world. One main feature of this new architecture is the abundance of clocking resources. For example, the biggest proposed Virtex8 device can accommodate more than 600 clocking buffers in total. The other big change is the introduction of a mesh-like routing structure that carries clocks from the clock sources all the way to all loads. Such a routing structure allows the software tools to make smart choices about how clocks are placed and routed, in a way that has not been feasible in any other FPGA architecture.

Introduction to Clock Placement
The clock placement problem can be stated as the problem of assigning the clocking components of a design to compatible clocking resources on a device. In the simplest form, clocking components consist of clock sources and clock loads. Clock sources are components that generate clock signals and/or drive dedicated clock nets using dedicated clocking trees. Clock loads are sequential components that capture data with respect to the input clock signal.
Clock source placement is usually done early in the placement flow, along with general IO placement, and it depends heavily on the architectural rules imposed by the device. Clock load placement, specifically for non-IO clock loads, is handled by the general placement flow. This usually starts with a global placement of all placeable components, where an approximate location is found for each component. It is followed by a detailed placement, where a legal placement is created and each component is assigned to a physical site on the device.
At early stages of the global placement flow, the clock loads are partitioned based on their placement at that time. Clock load partitioning is driven by the architectural constraints on clocking. Without a correct clock load partitioning, the final placement solution could be illegal, i.e., no routing solution would be available.
The clock placement and partitioning approach explained above is independent of how the clocks are routed, but on its own it is not enough to create legal clocking solutions. An approach that combines clock partitioning with clock routing is needed to produce a legal clocking solution and to optimize the clocking network for better skew, hold requirements, and insertion delay.

Clocking Architecture
This section briefly describes the clocking architecture of Xilinx UltraScale devices. Each device is divided into clock regions. A clock region includes all synchronous elements, such as Configurable Logic Blocks (CLBs), I/O, high-speed transceivers (GT), DSP, and block RAM, in an area spanning one I/O bank, with a horizontal clock row (HROW) in its center. The figure below shows the clock region divisions for one of the UltraScale devices.


Figure 1: Clock region boundaries in an UltraScale device

This particular device is divided into a 4x5 rectangular grid of 20 clock regions. Note that some clock regions may contain an IO bank or a GT quad. Clock source buffers are located inside the IO and GT columns, so clocks can only be sourced from clock regions that contain such columns.
The clock routing structure consists of a two-layer network of routing tracks as detailed below:
  • A routing network consisting of 24 horizontal and 24 vertical tracks
    • There is a one-to-one bidirectional connection between the horizontal routing tracks and the vertical routing tracks in each clock region. For example, a clock using horizontal routing track 0 can switch to vertical routing track 0 at their intersection in one clock region and back to horizontal routing track 0 in another clock region.
    • There is no vertical routing track in IO or GT columns.
  • A distribution network, also consisting of 24 horizontal and 24 vertical tracks
    • There is a one-to-one unidirectional connection (from vertical to horizontal) between the vertical distribution tracks and the horizontal distribution tracks in each clock region. For example, a clock using vertical distribution track 0 can switch to horizontal distribution track 0 at any possible intersection.
    • There is no vertical distribution track in GT columns.
Notes:
  1. There is no path from distribution back to routing tracks. So once a clock is on the distribution network it can only go to the leaf level nodes.
  2. To move from the routing network (horizontal or vertical) to the distribution network, a clock needs to hop onto vertical distribution first. There is a one-to-one connection from every routing track (horizontal or vertical) to its corresponding vertical distribution track.
A clock can be distributed from its source in one of two ways. It can go onto routing tracks, which take the clock to a particular sub-region without reaching any loads, and then move onto the distribution tracks. This is used to move the root for all the loads to a location that is beneficial from a skew perspective. Alternatively, it can go straight onto the distribution tracks, either to reduce insertion delay or because having the root at that point is most beneficial for skew. Once on the distribution tracks, the clock travels vertically and taps off at various horizontal segments. Before driving a horizontal segment, it goes through a programmable delay and a clock enable circuit. From the horizontal distribution it can feed the leaf clocks.
Each clock segment can be driven at either end or by a driver within the segment. Each of those drivers must therefore be tri-statable. This allows the clock network to be segmented at each fabric sub-region boundary. Because a clock uses only the segments it needs, the remaining segments of a track can be reused by other clocks.
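
The connectivity rules of this section can be summarized as a small transition table. The Python sketch below is only an illustration, not part of the contest material: the names ALLOWED_NEXT and is_legal_hop_sequence are hypothetical, track indices are ignored, and the absence of vertical tracks in IO/GT columns is not modeled. Staying on the same track type is assumed to mean continuing into a neighboring clock region.

```python
# Allowed moves between clock track types, following the rules in this section.
# HR/VR = horizontal/vertical routing, VD/HD = vertical/horizontal distribution,
# LEAF = leaf-level clock loads.
# Assumption: a move to the same type (e.g. "HR" -> "HR") models continuing
# along that track type into a neighboring clock region.
ALLOWED_NEXT = {
    "HR": {"HR", "VR", "VD"},   # routing can switch direction or drop to VD
    "VR": {"VR", "HR", "VD"},
    "VD": {"VD", "HD"},         # VD -> HD is unidirectional
    "HD": {"HD", "LEAF"},       # no way back from HD
    "LEAF": set(),
}


def is_legal_hop_sequence(hops):
    """Return True if every consecutive pair of hops is an allowed move."""
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(hops, hops[1:]))
```

Under these assumptions, the sequence HR, VR, HR, VD, HD, LEAF is accepted, while HR, HD is rejected because it skips vertical distribution, and VD, HR is rejected because there is no path from distribution back to routing.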

Definition of Clock Placement Problem
Place all clock sources and clock loads and partition the clock loads into partitions containing one or more clock regions, such that
  • The number of global clocks in each clock region is at most 24.
  • Within each clock region, each half column has at most 12 clocks.
  • Each clock region has enough resources to accommodate all clock loads assigned to that region.
  • If needed, all loads of each clock should be constrained to a contiguous rectangular area consisting of one or more clock regions.
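
As a rough illustration of the first two constraints above, the following Python sketch counts the distinct clocks per clock region and per half column and reports any count above the limit. The input format (one tuple of clock name, clock region, and half-column identifier per placed load) and the function name find_clock_violations are assumptions made for this example; the contest does not prescribe them.

```python
from collections import defaultdict

MAX_CLOCKS_PER_REGION = 24       # limit stated in the problem definition
MAX_CLOCKS_PER_HALF_COLUMN = 12  # limit stated in the problem definition


def find_clock_violations(loads):
    """Report clock regions and half columns that use too many clocks.

    `loads` is a hypothetical list of (clock, region, half_column) tuples,
    one per placed clock load.
    """
    region_clocks = defaultdict(set)
    half_column_clocks = defaultdict(set)
    for clock, region, half_column in loads:
        region_clocks[region].add(clock)
        half_column_clocks[(region, half_column)].add(clock)

    violations = []
    for region, clocks in region_clocks.items():
        if len(clocks) > MAX_CLOCKS_PER_REGION:
            violations.append(("region", region, len(clocks)))
    for key, clocks in half_column_clocks.items():
        if len(clocks) > MAX_CLOCKS_PER_HALF_COLUMN:
            violations.append(("half_column", key, len(clocks)))
    return violations
```

An empty result means that both clock-count limits hold for the given placement. The resource-capacity and rectangular-area constraints are not checked by this sketch.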


Definition of Clock Routing Problem
This section is provided for information only; not adhering to the rules below does not affect the contest evaluation. However, the rules below are necessary for a fully legal solution to the generic clocking problem.

The clock routing structure consists of a two-layer network of routing tracks as detailed below:
  • A routing network: each clock region has 24 horizontal routing (HR) and 24 vertical routing (VR) tracks
  • A distribution network: each clock region has 24 horizontal distribution (HD) and 24 vertical distribution (VD) tracks
  • There is a one-to-one bidirectional connection between the HR and VR tracks in each clock region. For example, a clock using HR track 0 can switch to VR track 0 at their intersection in one clock region and back to HR track 0 in another clock region
  • There is a one-to-one unidirectional connection from VD to HD in each clock region. For example, a clock using VD track 0 can switch to HD track 0
  • Once on HD, a clock can only drive HD tracks in neighboring clock regions or clock loads in that region. There is no way back onto HR/VR/VD tracks
  • To move from HR/VR to the distribution network, a clock needs to hop onto VD first. There is a one-to-one connection from every routing track (horizontal or vertical) to its corresponding VD track
  • All tracks are segmented at clock region boundaries, therefore two clocks can use the same track provided that their loads are in non-intersecting rectangular clock region areas
  • Each clock net should use a single clock track and a single clock root
  • Each global clock buffer has a dedicated clock track that can only be driven by that clock buffer. The Y coordinate of the site where the clock buffer is placed determines the track number for that site. So for BUFGCE_XmYn the clock track number is n % 24
  • Within a clock region, global clock buffer locations can be changed without affecting design legality
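
The track-numbering rule and the track-sharing rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: only site names of the form BUFGCE_XmYn are parsed, clock regions are represented as hypothetical (column, row) grid coordinates, and the area of a clock is taken to be the bounding box of the clock regions that contain its loads. All function names are invented for this example.

```python
import re


def track_number(site_name):
    """Clock track for a buffer site named BUFGCE_XmYn, i.e. n % 24."""
    match = re.fullmatch(r"BUFGCE_X(\d+)Y(\d+)", site_name)
    if match is None:
        raise ValueError("not a BUFGCE site name: %r" % site_name)
    return int(match.group(2)) % 24


def bounding_box(regions):
    """Smallest rectangle of clock regions covering the given (col, row) pairs."""
    cols = [col for col, _ in regions]
    rows = [row for _, row in regions]
    return min(cols), min(rows), max(cols), max(rows)


def can_share_track(regions_a, regions_b):
    """True if the two clocks occupy non-intersecting rectangular areas."""
    a_lo_c, a_lo_r, a_hi_c, a_hi_r = bounding_box(regions_a)
    b_lo_c, b_lo_r, b_hi_c, b_hi_r = bounding_box(regions_b)
    overlap_cols = a_lo_c <= b_hi_c and b_lo_c <= a_hi_c
    overlap_rows = a_lo_r <= b_hi_r and b_lo_r <= a_hi_r
    return not (overlap_cols and overlap_rows)
```

With these definitions, a buffer placed at BUFGCE_X0Y25 maps to track 1. Two clocks whose buffers map to the same track number can both be legal only if can_share_track returns True for the clock regions of their loads.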


For more information, please see "UltraScale Architecture Clocking Resources" and "Leveraging the UltraScale ASIC-like Clocking Architecture".