Some PCIe ECNs

SR-IOV defines mechanisms that allow an endpoint's resources to be shared among the virtual machines running on a system's CPU.

1, FLR: Function Level Reset Support, capability used in IOV
The integrated GbE controller supports FLR capability. FLR capability can be used in
conjunction with Intel® Virtualization Technology. FLR allows an operating system in a
Virtual Machine to have complete control over a device, including its initialization,
without interfering with the rest of the platform. The device provides a software
interface that enables the operating system to reset the entire device as if a PCI reset
was asserted.
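As a rough sketch of how software kicks off an FLR: to my understanding, it sets the Initiate Function Level Reset bit (bit 15) of the PCIe Device Control register. The config-space accessors and the `DEVCTL` offset below are illustrative stand-ins, not a real driver API.

```c
#include <stdint.h>

/* Hedged sketch of issuing an FLR. The Device Control register actually
 * lives inside the PCIe Capability structure, whose offset varies per
 * device; DEVCTL here is just an illustrative placeholder, and cfg_read/
 * cfg_write emulate config-space accesses on a plain array. */

#define DEVCTL      0x78            /* illustrative offset, not fixed */
#define DEVCTL_FLR  (1u << 15)      /* Initiate Function Level Reset bit */

static uint16_t cfg_space[128];     /* fake 256-byte config space */

static uint16_t cfg_read(unsigned off)          { return cfg_space[off / 2]; }
static void cfg_write(unsigned off, uint16_t v) { cfg_space[off / 2] = v; }

static void issue_flr(void)
{
    uint16_t ctl = cfg_read(DEVCTL);
    cfg_write(DEVCTL, ctl | DEVCTL_FLR);  /* reset just this function */
    /* real code would now wait for the function to complete the reset
     * before touching it again, then restore its configuration */
}
```

Because FLR resets only the one function, a hypervisor can hand the device to a VM knowing the VM can reset it without disturbing the rest of the platform.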

2, ARI: Alternative Routing-ID Interpretation, capability used in IOV
ARI extends the capabilities of a PCIe endpoint by increasing the number of available device functions from eight up to 256: the Device Number and Function Number fields of the Requester ID are merged into a single 8-bit Function Number. A system needing to support ARI requires all components in the PCIe path (CPU/root complex, PCIe switches, endpoint) to support ARI.

Consequently, the PCIe switch between the CPU and the endpoint needs to be able to decode and route packets accordingly. Without ARI, a virtualized system cannot take advantage of the additional functions enabled in the PCIe endpoint. In a virtualized system, 16 functions are typically available, with some endpoints implementing as many as 256.

 

For better power management:

3, Latency Tolerance Reporting (LTR)

The first ECNs added to the specification focused on tackling overall power management and reducing the active power consumption in a system. When implementing an overall power management strategy, designers typically shut down components when they are not in use. For example, a tablet that uses WiFi for connectivity consumes more power when the WiFi radio is on and connected to the network. However, if the tablet doesn't need to transmit or receive data for a period of time, the WiFi radio can be turned off to save battery life. The key to implementing this power-saving strategy is knowing how long it takes for the radio to wake back up. Without a mechanism to report how long to wait, the software has to guess how much latency is acceptable for the device; guessing incorrectly can result in performance issues or hardware failures. Consequently, platform power management is often too conservative or not implemented at all, resulting in devices that use more power than necessary.

Obviously, power management has to be done at the system level. This requires a mechanism to tune power management based on the actual device requirements and balance dynamic power usage versus performance. The solution is to have each device in the system report its latency requirements to the host. Devices that use PCIe for connectivity (PCIe endpoints) can use the Latency Tolerance Reporting (LTR) mechanism that has been incorporated into the PCIe specification. LTR enables a device to communicate the necessary information back to the host by using a PCIe message to report its required service latency. The LTR values are used by the platform (the tablet in our example) to implement an overall power management strategy that extends the battery life of the tablet while giving optimal performance.

4, Optimized Buffer Flush/Fill

Another ECN added to the PCIe specification to improve overall power management is Optimized Buffer Flush/Fill, or OBFF. As a system operates, its devices do not know the power state of the other resources in the system. Without coordination, each device goes in and out of its low-power states as necessary to execute its assigned tasks. This "asynchronous" behavior prevents optimal power management of the CPU, host memory sub-system, and other devices, because the intermittent traffic keeps the system permanently awake and unable to optimize power management across the system.

Figure 1: Asynchronous behavior prevents optimal power management

As part of a system-level power strategy, the idle time and low-power states of the devices must be optimized to enable them to stay in their low-power states longer. Basically, the host can broadcast a message informing all devices of the system power state. A device can use this information to accumulate a batch of requests and burst them all out at once when the system next wakes up. By doing this, the device is a good citizen and does not wake up a sleeping CPU and/or system memory sub-system. Waiting creates extended periods of system inactivity, which saves overall system power (as shown in Figure 2). In other words, the host uses the OBFF ECN to give devices a "hint" so they can optimize their behavior, improving power management at the system level.

Figure 2: Coordinated idle time extends system inactivity, reducing power consumption

5, L1 sub-states

The PCI-SIG and its members continue to make changes that improve the ability to implement power management strategies across a system. However, what about the power consumed while your tablet or Ultrabook is in the suspend state? Pulling your tablet or laptop out of your bag during a long flight, only to find that it consumed all of its battery power while in standby mode, is one of a business traveler's nightmares. This experience is a lesson in how non-optimized systems consume a surprising amount of power in the standby state. PCIe's L1 low-power state is just not enough, as the idle power consumed by PCIe-based devices does not meet the requirements of emerging thin-and-light form factors, which call for 8 to 10 hours of use time and a seemingly infinite amount of standby time. Of course, this has to be done with minimal added cost while maintaining backwards compatibility.

As shown in Figure 3, a PCIe link is a serial link that directly connects two components, such as a host and a device. Ignoring the state of the host or the device for this discussion, the PCIe link is defined to save power when the Link Training and Status State Machine (LTSSM) is in the L1 state. However, the PCIe interface has both analog and digital circuits, and the L1 state doesn't turn off all of the analog circuits in the PHY: the Receiver Electrical Idle detector and the transmit common-mode voltage driver continue drawing power. The result is that the link can consume 10 to 25 mW per lane while in standby, quietly draining the device's battery.

Figure 3: L1 sub-states ECN reduces the power consumed by the link

Designers using the current low-power states of the PCIe specification can use the L1 state to reduce power consumption. The traditional L1 state allows the reference clock to be disabled on entry to L1, controlled by a configuration bit written by software. However, the PCIe link still consumes too much power due to leakage, the transmit common-mode voltage circuit, and the Receiver Electrical Idle detector circuitry. The result for the end user is drained batteries and failure to meet government energy regulations. To avoid these issues, the PCIe link must reduce its idle power to approximately 10% of the active power, or into the range of tens of microwatts.

The PCI-SIG community has just approved an enhancement to the L1 state called L1 sub-states. The L1 sub-states ECN adds two "pseudo sub-states," called L1.1 and L1.2, to the LTSSM, which can be used to turn off additional analog circuits in the PHY. L1.1 maintains the common-mode voltage, while L1.2 allows all high-speed circuits to be turned off. Using L1.2 also requires the PCIe interface to support the LTR ECN. The logical view of the LTSSM with the new L1 sub-states is shown in Figure 4.

Figure 4: Relationship of logical L1.1 and L1.2 modes to L1 state specification

Designers need to be aware of a few challenges that implementing the new L1.1 and L1.2 low-power sub-states may present. For example, L1 sub-states may require additional pins if the reference clock generator is off-chip, and the ECN redefines the CLKREQ# signal to be bidirectional to allow handshaking with the system reference clock controller. Not all form factors support CLKREQ# (which is only defined in the mini-CEM card specification); form factors that do not have CLKREQ# defined will need to use an in-band mechanism when one becomes available. The L1 sub-states solution is out-of-band, since it doesn't use the differential signals of the PCIe link; discussions are under way to provide an in-band solution utilizing the existing differential signals. Implementing L1 sub-states also requires some silicon modifications to gate the power of the PCIe analog circuits and logic while retaining the port state. Of course, any modifications to support L1 sub-states must still support legacy L1 operation by default, and the new features are enabled via system firmware during the driver's discovery of the link capabilities.

Table 1 compares the low-power solutions available with the existing L1 state to those using L1 sub-states. The power savings are expected to scale linearly for multi-lane links, and implementing the L1 sub-states feature reduces power consumption at the cost of increased L1 exit latency. Implementing L1 sub-states is key to reducing power consumption for mobile designs using PCI Express.

Table 1: Comparison of proposed solutions

 

Base and Limit Registers

Once a function’s BARs are programmed, the function knows which address ranges it owns, which means it will claim any transaction it sees targeting an address range programmed into one of its BARs.

Each bridge (switch port or root complex port) needs to know what address ranges live beneath it, so it can determine which requests should be forwarded from its primary interface (upstream side) to its secondary interface (downstream side).

It is the Base and Limit registers in the type 1 header that are programmed with the range of addresses that live beneath the bridge. There are three sets of Base and Limit registers in each type 1 header:

  • P-MMIO (prefetchable memory; two pairs of registers, so 64-bit addressing can be supported; the base is always aligned on a 1MB boundary)
  • NP-MMIO (non-prefetchable memory; supports only 32-bit addressing)
  • IO (two pairs of 16-bit base/limit registers; the base is always aligned on a 4KB boundary)

For the unused Base and Limit registers, simply program the base register with a higher address than the limit to invalidate the pair. For example, if no downstream function needs IO space, the bridge immediately upstream would have its IO base register programmed to F0h and its IO limit register programmed to 00h. Since the base address is higher than the limit address, the bridge understands this is an invalid setting and takes it to mean that no functions downstream of it own IO address space.

 

 

BAR Programming

How does system software program the BARs?

A type 0 header has six BARs; a type 1 header has two. Not all BARs have to be implemented: the chip designer knows how many internal registers or how much RAM needs to be accessible to software, and therefore how many BARs to implement.

BAR[0]: 0 = memory request; 1 = IO request

BAR[2:1] (memory BARs): 00 = 32-bit addressing; 10 = 64-bit addressing

BAR[3] (memory BARs): 0 = non-prefetchable; 1 = prefetchable

Two things the chip designer needs to do:

1, Extra BARs are hard-coded to all 0s, notifying software that these BARs are not implemented.

2, The lower bits of each implemented BAR are hard-coded to values indicating the type and size of the address space being requested.

System software first writes all 1s to each BAR and reads the value back; from the bits that remain hard-coded, software learns the type and size of the request. Finally, software writes the allocated base address into the upper (writable) bits of the BAR.