RTL2GDSII: Using CCD Optimisation to Harden Processor Cores to 1GHz

Physical Implementation Aug 16, 2016 12:26:22 PM



INTRODUCTION

Many challenges arise when implementing high-performance IP cores.  One of these is how to effectively skew the clock tree to enable efficient slack borrowing from neighbouring stages in a register-based design. This blog looks at the use of Concurrent Clock and Data (CCD) optimisation technology in hardening a 28nm design project, and how it was used to significantly improve overall timing within this high-performance core.


Here's the History and Context...

The CCD flow is an addition to the ICC toolset, created to automate the use of useful skew within your design to achieve improved QoR. Prior to this addition, an engineer wanting to exploit useful skew had two main options.

  • Firstly, manually identify any possible clock tree points that could be skewed, create the necessary clock latency and CTS exception commands, then run multiple iterations in which the degree of skewing is increased or decreased until an optimal solution is reached.  This is obviously a time-consuming process, as identifying all possible skew candidates is difficult and full optimisation loops must be run.
  • Secondly, make use of the skew_opt command.  This offers a much higher degree of automation than the first option when it comes to identifying skew candidates.  It does not, however, identify possible QoR gains hidden in unoptimised timing paths.

The CCD flow is a massive improvement over both these methods.  Within a single command, ICC analyses the timing graph and, applying datapath optimisation and useful skew simultaneously, adjusts both for significant QoR gains.

 

The Problems with Manual Skew Balancing

In register-based designs, datapath depths and delays between register stages are not evenly balanced.  This has many possible causes, including RTL constraints, where a particular function must complete in a single cycle, or physical limitations, where elements within the datapath have larger than expected delays due to placement, obstructions or routing issues.

In order to meet design performance requirements, one of the processes available to physical implementation engineers is to make manual adjustments to clock tree skew.  This can be a tedious process, as it is an iterative one that relies on multiple passes, often making subtle increases or decreases for single endpoint groups to try and claw back a precious few percentage points of setup slack.

For example, consider the following simple register/slack arrangement, presented at the end of an early place_opt stage run:

 

Figure 1 : Simple datapath

 

                        regA     RAM      regB
Clock latency (ideal)   1.0ns    1.0ns    1.0ns
Slack to input          0ps      +200ps   -150ps

 

It's clear that by shortening the clock latency to the RAM by at least 150ps (and less than 200ps), the violation to regB can be resolved while ensuring the timing path into the RAM retains positive slack.

The engineer can trial the adjustment in ICC at the end of place_opt.  Note that the following TCL commands assume a default ideal clock latency of 1.0ns for all endpoints:

          set_clock_latency 0.85 [get_pins -of_objects RAM -filter "is_clock_pin == true"]
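To check whether the trial works, the two affected endpoints can be re-reported.  A minimal sketch is shown below; the pin names (RAM/D* for the RAM data inputs, regB/D for the violating register) are hypothetical:

          # Capture clock at the RAM now arrives 150ps earlier, so the
          # +200ps input slack should drop to roughly +50ps
          report_timing -to [get_pins RAM/D*] -max_paths 1

          # Launch clock from the RAM is also 150ps earlier, so the -150ps
          # violation to regB should now be roughly zero
          report_timing -from [get_cells RAM] -to [get_pins regB/D] -max_paths 1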

And if successful, that is, the timing slack to both the RAM and regB is positive, this can be translated into a directive for CTS to achieve a similar skewing.

          set_clock_tree_exceptions \
              -float_pin_max_delay_rise 0.15 \
              -float_pin_min_delay_rise 0.15 \
              -float_pins [get_pins -of_objects RAM -filter "is_clock_pin == true"]

The engineer would then repeat this process for any other cases similar to the one above, run the flow to at least the end of clock_opt, analyse the timing again, and feed back any necessary changes to place_opt, repeating until no more manual skewing is available.

The example given above is a very simple one, and in real designs a solution as neat as this rarely presents itself.  Here are some more complex examples:

No visible slack on input to RAM


Figure 2 : Simple datapath (2)

 

                        regA     RAM      regB
Clock latency (ideal)   1.0ns    1.0ns    1.0ns
Slack to input          0ps      +0ps     -150ps

 

In this case it's possible that there is positive slack available on the input path to the RAM, but the path has only been optimised as far as required, to 0ps.  The engineer could still make the latency adjustment, but a trial is required to see whether the overall effect of the adjustment is positive.
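One way to run such a trial, sketched below with the same hypothetical instance names as before, is to apply the latency, re-run optimisation so the now-tighter input path can be worked on, and then check the violation summary:

          # Pull the RAM clock 150ps earlier, then re-run optimisation so
          # the regA -> RAM path (previously optimised only to 0ps) is
          # attacked with its new, tighter requirement
          set_clock_latency 0.85 [get_pins -of_objects RAM -filter "is_clock_pin == true"]
          place_opt
          report_constraint -all_violators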

Complex fanin or fanout cones to and from the slack target

Figure 3 : Multi-fanin datapath

 

 

                        regA*    RAM      regB
Clock latency (ideal)   1.0ns    1.0ns    1.0ns
Slack to input          ?ps      +0ps     -150ps

 

In a ‘normal’ design, the RAM element probably has a fanin cone with multiple startpoints.  Some of these may have positive slack to borrow; others may not.  Analysing which of these is critical or under-optimised can be difficult and time-consuming.
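One way to begin that analysis is to enumerate the startpoints feeding the RAM and report the worst path from each.  A minimal sketch, again using the hypothetical RAM/D* pin pattern:

          # Collect the RAM's data input pins (placeholder pattern)
          set ram_data_pins [get_pins RAM/D*]

          # Enumerate every startpoint in the flat fanin cone of those pins
          set cone_starts [all_fanin -to $ram_data_pins -flat -startpoints_only]

          # Report the worst setup path from each startpoint into the RAM,
          # to see which branches of the cone have slack to borrow
          foreach_in_collection sp $cone_starts {
              report_timing -from $sp -to $ram_data_pins -max_paths 1
          }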

Slack available, but on a register several stages removed from the violator

 

Figure 4 : Multistage datapath

 

 

                        regZ     regX     regA     RAM      regB
Clock latency (ideal)   1.0ns    1.0ns    1.0ns    1.0ns    1.0ns
Slack to input          +500ps   +0ps     +0ps     +0ps     -150ps

 

A latency adjustment can be made here, but it must be made to every stage from regX and regA through to the RAM, leaving regZ at its nominal latency so that its +500ps of output slack absorbs the shift.  The analysis for this is very complex, as the fanin cone grows with each preceding stage that is analysed.
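Sketched in TCL, with the hypothetical instance names from the figure, the whole chain downstream of the slack holder is shifted by the same amount:

          # Pull the clock 150ps earlier at every stage from regX through
          # to the RAM; regZ keeps its nominal 1.0ns latency, so its
          # +500ps output slack absorbs the shift
          foreach inst {regX regA RAM} {
              set_clock_latency 0.85 [get_pins -of_objects [get_cells $inst] \
                  -filter "is_clock_pin == true"]
          }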

The whole process of latency adjustment is further complicated by what can actually be achieved at CTS in terms of skewing.  The engineer may require a finer level of adjustment than is possible, whether due to a limited CTS library or issues within the clock tree itself.

Finding what adjustment accuracy is achievable on a given design can itself require multiple passes of CTS and CTO (clock tree optimisation).

 

The skew_opt Solution

The skew_opt flow has been available for some years, and goes some way towards addressing the issues outlined above.

Personally, I've always had limited success with skew_opt solutions.  Particular points of concern are the extremely small values of some of the latency adjustments it creates (often smaller than the delay of the library's smallest buffer) and the extremely large size of the skew_opt.tcl file it generates.  In my experience, this added complexity causes overall CTS quality to degrade, especially in the critical clock skew category.

The CCD Solution

The CCD flow uses datapath optimisation and useful skew together to compute an optimal set of adjustments to the clock tree for timing improvement, and builds a clock tree based on that information.

Post-CTS, further timing improvements can be achieved by adjusting latencies on the existing clock tree.  Latency adjustments are performed through buffer addition, removal, re-parenting and sizing.  Multi-stage slack borrowing is also considered to further increase the timing improvement.

The CCD flow is enabled as a single option to clock_opt and is fine-tuned with a handful of CCD variables.

Example CCD Flow

The basic CCD flow is outlined below, with examples and notes where relevant.  Please also refer to the SolvNet presentation for more details:

https://solvnet.synopsys.com/retrieve/customer/application_notes/attached_files/040398/Concurrent_Clock_and_Data_Optimization_in_IC_Compiler.pdf

 

Step 1 : Configure standard CTS settings

          set_clock_tree_options …

          set_clock_tree_exceptions …

Despite the automatic nature of the clock skewing performed by CCD, if there are known problem timing points that should be skewed, it is still safe to apply these as float exceptions.  With the correct options, CCD will start with your existing floats and make any necessary skew adjustments from that starting point.

Step 2 : Configure CCD settings

          set_concurrent_clock_and_data_strategy …

          -adjust_boundary_registers true/false

When set to false, IO path timing is preserved and CCD will not optimize boundary registers.

          -ignore_boundary_timing true/false

When set to true, boundary timing is ignored.

          -add_to_existing_float_pins true/false

When set to true, any prior manual float pin exceptions are considered and CCD will make any further latency adjustments on top of the specified float.

          -ignore_path_groups <path_group_names>

Specifying groups here allows specific registers to be left untouched; for example, DDR path groups with specific timing arrangements.  A combined example of these options is shown below.
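Putting the options together, a typical Step 2 configuration might look like the following sketch (the values and the DDR path group name are illustrative, not a recommendation):

          # Preserve IO path timing, respect earlier manual floats, and
          # leave the DDR registers untouched
          set_concurrent_clock_and_data_strategy \
              -adjust_boundary_registers false \
              -add_to_existing_float_pins true \
              -ignore_path_groups {DDR_PATHS}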

Step 3 : Run CCD (cts)

          clock_opt -no_clock_route -only_cts -area_recovery -concurrent_clock_and_data

Step 4 : Run CCD (Optimization)

          clock_opt -no_clock_route -only_psyn -area_recovery -concurrent_clock_and_data

Step 5 : Route clocks

          route_zrt_group -all_clock_nets -reuse_existing_global_route true

 

CCD Observations and Tips...

The CCD flow does not, by default, make adjustments to boundary flops.  If boundary register skewing is not critical, then consider enabling boundary adjustment.

Critical timing paths to and from hard macros, like RAMs, should still be considered for manual adjustment, and the adjustments should be applied before place_opt (using set_clock_latency).  While CCD will adjust hard macro latency as required, if you have prior knowledge of a beneficial skew adjustment you should apply it.  This allows the place step to focus on other critical timing paths, rather than on those you already know will be addressed by the CCD flow.  In this case, make sure you enable the option:

          set_concurrent_clock_and_data_strategy -add_to_existing_float_pins true

Expect local skew and global skew reports to look poor; this is the point of the CCD flow.  Where a register can be skewed to improve timing, it will be, so the skew between these registers will be much larger than in a flow without CCD.
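For example, the skew figures in the standard CTS reports will widen considerably after CCD; judge the run on timing QoR instead:

          # Skew numbers here will look large by design after CCD
          report_clock_tree -summary

          # Timing QoR is the metric that should have improved
          report_qor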

The issue of manually adjusting latency over multiple register stages is addressed directly by the CCD flow as it performs multi-stage slack borrowing.

CCD has multiple methods available for making the necessary latency adjustments: buffer addition, but also resizing, re-parenting and buffer removal.  Repeated trials have shown a small overall decrease in area, of around 1-2%, when comparing post-place_opt area with post-CCD area.

Hold timing violations increased in both size and count (more on this in the Results section), so take care in high-density designs to ensure sufficient space is reserved for hold fixing.

 

Results

The CCD flow was used with great success in closing two high-performance, 1GHz processor cores.  Each used a different architecture, and there is no reason to suggest this method cannot be used to great effect on any register-based design.

The results shown below compare the QoR from two runs.  The first, a baseline run, used a simple clock_opt flow:

          clock_opt -only_cts -no_clock_route

          route_zrt_group -all_clock_nets

          clock_opt -no_clock_route -only_psyn -congestion -area_recovery

The second, the CCD run, used the CCD flow outlined above.

Due to the commercial sensitivity of the results, I have expressed the setup timing violations as a percentage of the clock period, with the sign inverted so that violations read as positive numbers (at 1GHz, for example, a -150ps violation is reported as 15% of the 1ns period).

The design in question was a high-performance 28nm core, comprising a mixture of hard and soft memories alongside standard cell logic.

 

                         Baseline run                      CCD run
Path Group      Setup max   Setup mean   Hold      Setup max   Setup mean   Hold
                (% period)  (% period)   count     (% period)  (% period)   count
REG2REG         36.09%      22.29%       8622      7.43%       3.09%        10318
RAM2REG         35.03%      21.05%       1136      9.55%       4.07%        1282
REG2RAM         41.40%      23.60%       9950      5.31%       1.06%        10299

Runtime (hrs)                            6:00                               14:30


CONCLUSION

The CCD flow provides a simple, effective method for skew balancing and slack borrowing in register-based designs.  It is an invaluable addition to any flow targeting high-performance IC designs.

The increase in runtime is significant, but without the CCD flow, timing closure could not have been achieved within the project timescales on this 28nm core.

 

More Information

Sondrel is an ARM Approved Design Partner.  We have experience across multiple projects in RTL2GDSII engineering support and IP hardening on ARM, ATOM, MIPS and a wide range of other IPs, meeting the highest performance standards.  You can download a free datasheet for more details on our core hardening capabilities, or contact us to discuss any engineering support that you may need for your projects.

IP Hardening Datasheet

 



 
