ARC APEX

As I learned more about Synopsys ARC APEX, I decided to write down some notes.

Here is the ARC HS top level block diagram:

[Figure: ARC HS top-level block diagram]


And this is the 10-stage pipeline.

[Figure: ARC HS 10-stage pipeline]

It employs sophisticated branch-prediction logic, with very high prediction accuracy and early branch-resolution points, to minimize the average mispredict penalty.  The branch-prediction logic speculates the branch target address with a high probability of success, which minimizes pipeline stalls.


OpenCL

“Write it once, run on anything.”

Every vendor that provides OpenCL-compliant hardware also provides the tools that compile OpenCL code to run on that hardware.  This means you can write your OpenCL routines once and compile them for any compliant device, whether it’s a multi-core processor or a graphics card.

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.

Using GPUs to perform non-graphical routines is called general-purpose GPU computing, or GPGPU.  Before 2010, GPGPU computing was considered a novelty in the world of high-performance computing and not worthy of serious attention.  But today, engineers and academics are reaching the conclusion that CPU/GPU systems represent the future of supercomputing.

Now an important question arises: how can you program these new hybrid devices?  Traditional C/C++ targets only traditional CPUs.  Nvidia’s CUDA (Compute Unified Device Architecture) can be used to program Nvidia’s GPUs, but not CPUs.

The answer is OpenCL.  OpenCL routines can be executed on GPUs and CPUs from major manufacturers like AMD, Nvidia, and Intel, and will even run on Sony’s PS3.  OpenCL is non-proprietary: it’s based on a public standard, and you can freely download all the development tools you need.  When you code routines in OpenCL, you don’t have to worry about which company designed the processor or how many cores it contains.  Your code will compile and execute on AMD’s Fusion processors, Intel’s Core processors, Nvidia’s Fermi/Pascal processors, and IBM’s Cell Broadband Engine.

The OpenCL standard defines a set of data types, data structures, and functions that augment C and C++.  Its three main advantages are portability, standardized vector processing, and parallel programming.
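To give a flavor of what this looks like, here is a minimal vector-add kernel in OpenCL C.  The host-side boilerplate that creates buffers and enqueues the kernel is omitted, so treat this as a sketch rather than a complete program:

/* vadd.cl: element-wise vector addition in OpenCL C.
   The same kernel source compiles for any compliant device,
   whether a multi-core CPU or a GPU. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c,
                   const unsigned int n)
{
    size_t i = get_global_id(0);   /* one work-item per element */
    if (i < n)                     /* guard against rounded-up global size */
        c[i] = a[i] + b[i];
}

Each work-item handles one array element; the runtime decides how work-items map onto the cores or SIMD lanes of the target device, which is where the standardized vector processing and parallelism come from.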


Firmware/Device Driver/Software/OS

Firmware is software which is stored in non-volatile (or perhaps even read-only) memory. Because it is stored in such memory, firmware is available when the machine is turned on. The machine may almost immediately begin executing firmware when it is turned on, or some small boot program (itself firmware) can pull the bigger firmware from some electronic storage such as flash and put it into RAM, and then execute it.

Firmware can be a “full blown” operating system. For example, Tomato is a Linux-based firmware for wireless routers:

http://www.polarcloud.com/tomato

We can log into Tomato via ssh, and get a Linux prompt. So it is an advanced operating system, and it is firmware. But if the router had a hard disk in it, and if the same OS had to be loaded from that disk at startup, it could no longer be legitimately called firmware. Firmware has to be in electronic storage that is accessible to the processor immediately on power up, like flash memory or EPROM chips.

Firmware and software are the same thing; the only distinction is in how it’s stored. Software is typically stored on a mass-storage device (e.g., a disk drive) and loaded into volatile memory (e.g., DRAM) before being executed. It’s easy to change software: simply replace the file containing it with a different one.

Firmware is typically stored in nonvolatile memory (e.g., FLASH) connected more-or-less directly to the CPU. It’s harder to modify (hence the “firm”) and it may or may not be transferred to a different memory for execution.

Firmware is the software that runs on the device. A driver is the software that tells your operating system how to communicate with the device. Not all devices have firmware: only devices with some level of intelligence do.


The term “operating system” simply refers to a control program that has a certain degree of sophistication and completeness in managing the resources of the machine and providing reasonably high-level services to programs: features like file systems, network protocols, memory and process management, high-level access to devices, and perhaps some model of a user as well as security. Not all of these have to be present in an operating system; usually memory management, process management, and I/O are the key ones. If the control program allows other programs to execute, gives those programs an identity through which they are associated with their own resources, and provides services to them for managing the processor and memory and doing I/O, we may call that control program an operating system.


Some PCIe ECNs

SR-IOV (Single Root I/O Virtualization) defines mechanisms that allow a system’s endpoints to share their resources among multiple virtual machines running on its CPU.

1, FLR: Function Level Reset, a capability used in IOV

The integrated GbE controller, for example, supports the FLR capability. FLR can be used in conjunction with Intel® Virtualization Technology: it allows an operating system in a virtual machine to have complete control over a device, including its initialization, without interfering with the rest of the platform. The device provides a software interface that enables the operating system to reset the entire device as if a PCI reset were asserted.

2, ARI: Alternative Routing-ID Interpretation, a capability used in IOV
ARI extends the capabilities of a PCIe endpoint by increasing the number of available functions from eight up to 256, by reinterpreting the Device Number bits of the Requester ID as additional Function Number bits. A system that needs to support ARI requires all devices in the PCIe chain (CPU, PCIe switch, endpoint) to support ARI.

Consequently, the PCIe switch between the CPU and the endpoint needs to be able to decode and route packets accordingly. Without ARI, a virtualized system cannot take advantage of the additional functions enabled in the PCIe endpoint. In a virtualized system, 16 functions are typically available, with some endpoints implementing as many as 256.
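A small sketch of the two interpretations of the 16-bit Routing ID, assuming the conventional Bus/Device/Function layout (the helper names are mine, not from any spec or library):

#include <stdint.h>

/* Classic Routing ID: 8-bit Bus | 5-bit Device | 3-bit Function. */
static inline uint8_t rid_bus(uint16_t rid)      { return rid >> 8; }
static inline uint8_t rid_device(uint16_t rid)   { return (rid >> 3) & 0x1F; }
static inline uint8_t rid_function(uint16_t rid) { return rid & 0x07; }

/* With ARI, the Device bits are folded into the Function number,
   giving an 8-bit Bus | 8-bit Function layout: up to 256 functions. */
static inline uint8_t ari_function(uint16_t rid) { return rid & 0xFF; }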


For better power management:

3, Latency Tolerance Reporting (LTR)

The first ECNs added to the specification were focused on tackling overall power management and reducing the active power consumption of a system. When trying to implement an overall power-management strategy, designers typically shut down components when they are not in use. For example, a tablet that uses WiFi for connectivity consumes more power when the WiFi radio is on and connected to the network. However, if the tablet doesn’t need to transmit or receive data for a period of time, the WiFi radio can be turned off to save battery life. The key to implementing this power-saving strategy is knowing how long it takes for the radio to wake back up. Without a mechanism to learn how long to wait, the software has to guess how much latency is acceptable for the device, and guessing incorrectly can result in performance issues or hardware failures. Consequently, platform power management is often too conservative or not implemented at all, resulting in devices that use more power than necessary.

Obviously, power management has to be done at the system level. This requires a mechanism to tune the power management based on actual device requirements, trading dynamic power usage against performance. The solution is to have each device in the system report its latency requirements to the host. Devices that use PCIe for connectivity, i.e. PCIe endpoints, can use the Latency Tolerance Reporting (LTR) mechanism that has been incorporated into the PCIe specification. LTR enables a device to communicate the necessary information back to the host by using a PCIe message to report its required service latency. The LTR values are used by the platform (the tablet in our example) to implement an overall power-management strategy that extends the battery life of the tablet while giving optimal performance.
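As I understand the message format, the reported latency is a 10-bit value with a 3-bit scale (each scale step multiplies by 32 ns) plus a “requirement” bit. A hedged sketch of encoding a tolerance in nanoseconds into that field follows; verify the exact bit layout against the current PCIe base specification before relying on it:

#include <stdint.h>

/* Encode a latency tolerance, in nanoseconds, into a 16-bit LTR field:
   bits 9:0 = value, bits 12:10 = scale (x32 per step), bit 15 = valid.
   Layout is my reading of the spec; double-check before use. */
static uint16_t ltr_encode_ns(uint64_t ns)
{
    uint64_t value = ns;
    uint16_t scale = 0;

    while (value > 0x3FF && scale < 5) {  /* fit into the 10-bit value field */
        value >>= 5;                      /* each scale step multiplies by 32 */
        scale++;
    }
    if (value > 0x3FF)
        value = 0x3FF;                    /* saturate: tolerance too large */
    return (uint16_t)((1u << 15) | (scale << 10) | (value & 0x3FF));
}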

4, Optimized Buffer Flush/Fill

Another ECN added to the PCIe specification to improve overall power management is Optimized Buffer Flush/Fill, or OBFF. As a system operates, no individual device knows the power state of the other resources in the system. Without coordination, each device goes in and out of its low-power states as necessary to execute the tasks it is assigned.  This “asynchronous” behavior prevents optimal power management of the CPU, host memory subsystem, and other devices, because the intermittent traffic keeps the system permanently awake and unable to optimize power management across the system.

Figure 1: Asynchronous behavior prevents optimal power management

As part of a system-level power strategy, the idle time and low-power states of the devices must be optimized to let them stay in their low-power states longer. Basically, the host can inform all devices by broadcasting a message about the system power state. The devices can use this information to queue up requests, wait until the system wakes up, and burst out all of the requests at the same time. By doing this, a device is a good citizen and does not wake up a sleeping CPU and/or memory subsystem. Waiting creates extended periods of system inactivity, which saves overall system power (as shown in Figure 2). In other words, the host uses OBFF to give devices a “hint” so they can optimize their behavior, which improves power management at the system level.

Figure 2: Coordinated idle time extends system inactivity, reducing power consumption

5, L1 sub-states

PCI-SIG and its member companies continue to make changes that improve the ability to implement power-management strategies across a system. However, what about the power that is consumed while your tablet or Ultrabook is in the suspend state? Pulling your tablet or laptop out of your bag during a long flight, only to find that it consumed all of its battery power while in standby mode, is one of a business traveler’s nightmares. This experience is a lesson in how non-optimized systems consume a surprising amount of power while in the standby state. PCIe’s L1 low-power state is simply not enough: the idle power consumed by PCIe-based devices does not meet the requirements of emerging thin-and-light form factors, which call for 8 to 10 hours of use time and a seemingly infinite amount of standby time. Of course, this has to be achieved with minimal added cost while maintaining backward compatibility.

As shown in Figure 3, a PCIe link is a serial link that directly connects two components, such as a host and a device. Ignoring the state of the host or the device for this discussion, the PCIe link is defined to save power when the controlling link state machine (LTSSM) is in the L1 state. However, the PCIe interface has both analog and digital circuits, and the L1 state doesn’t turn off all the analog circuits in the PHY: the Receiver Electrical Idle detector and the transmit common-mode voltage driver continue drawing power. The result is that each lane of the link can consume 10 to 25 mW while in standby, quietly draining the device’s battery.

Figure 3: L1 sub-states ECN reduces the power consumed by the link

Designers using the current low-power states of the PCIe specification can use the L1 state to reduce power consumption. The traditional L1 state allows the reference clock to be disabled on entry to L1, controlled by a configuration bit written by software. However, the PCIe link still consumes too much power due to leakage, the transmit common-mode voltage circuit, and the Receiver Electrical Idle detector circuitry. The result for the end user is drained batteries and unmet government regulations. To avoid these issues, the PCIe link must reduce its idle power to approximately 10% of the active power, i.e. in the range of tens of microwatts.

The PCI-SIG community has recently approved an enhancement to the L1 state called L1 sub-states. The L1 sub-states ECN adds two “pseudo sub-states,” called L1.1 and L1.2, to the LTSSM, which can be used to turn off additional analog circuits in the PHY. L1.1 keeps the common-mode voltage maintained, while L1.2 allows all high-speed circuits to be turned off. To use L1.2, L1 sub-states also require the LTR ECN to be supported by the PCIe interface. The logical view of the LTSSM with the new L1 sub-states is shown in Figure 4.

Figure 4: Relationship of logical L1.1 and L1.2 modes to L1 state specification

Designers need to be aware of a few challenges that the new L1.1 and L1.2 low-power sub-states may present. For example, L1 sub-states may require additional pins if the reference clock generator is off-chip, and they redefine the CLKREQ# signal to be bidirectional to allow handshaking with the system reference clock controller. Not all form factors support CLKREQ# (which is only defined in the mini-CEM card specification); form factors that do not have CLKREQ# defined will need to use an in-band mechanism when it becomes available. The current L1 sub-state solution is an out-of-band solution, since it doesn’t use the differential signals of the PCIe link; there are ongoing discussions to provide an in-band solution using the existing differential signals. The implementation of L1 sub-states also requires some silicon modifications to gate the power of the PCIe analog circuits and logic while retaining the port state. Of course, any modifications to support L1 sub-states must still support default legacy L1 operation, and the new features are enabled via system firmware during the driver’s discovery of the link capabilities.

Table 1 shows the low-power solutions available with the existing L1 state compared to using L1 sub-states. The power savings are expected to scale linearly for multi-lane links, and implementing the L1 sub-states feature reduces power consumption at the cost of increased L1 exit latency. Implementing L1 sub-states is key to reducing power consumption for mobile designs using PCI Express.

Table 1: Comparison of proposed solutions


Base and Limit Registers

Once a function’s BARs are programmed, the function knows which address ranges it owns, which means the function will claim any transaction it sees that targets an address range it owns, i.e. an address range programmed into one of its BARs.

Each bridge (switch port or root complex port) needs to know what address ranges live beneath it, so it can determine which requests should be forwarded from its primary interface (upstream side) to its secondary interface (downstream side).

It is the Base and Limit registers in the Type 1 header that are programmed with the range of addresses that live beneath the bridge.  There are 3 sets of Base and Limit registers found in each Type 1 header:

  • P-MMIO (prefetchable memory; 2 register pairs, and the base is always aligned on a 1MB boundary)
  • NP-MMIO (non-prefetchable memory; can only support 32-bit addressing)
  • IO (2 pairs of 16-bit base/limit registers; the base is always aligned on a 4KB boundary)

For an unused set of base and limit registers, software programs the base register with a higher address than the limit to invalidate the pair.  For example, if nothing downstream needs IO space, the bridge immediately upstream of that function would have its IO base register programmed to F0h and its IO limit register programmed to 00h.  Since the base address is higher than the limit address, the bridge understands this is an invalid setting and takes it to mean that no functions downstream of it own IO address space.
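The forwarding decision this enables is tiny; a minimal sketch, assuming a flat base/limit comparison and ignoring the register-granularity details:

#include <stdbool.h>
#include <stdint.h>

/* A bridge forwards a memory request downstream only if its address
   falls inside the programmed window. An "invalid" window with
   base > limit naturally claims nothing. */
static bool bridge_claims(uint64_t addr, uint64_t base, uint64_t limit)
{
    return addr >= base && addr <= limit;
}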


BAR Programming

How does system software program the BARs?

A Type 0 header has 6 BARs; a Type 1 header has 2.  Not all BARs have to be implemented: the chip designer knows how many internal registers or RAMs need to be accessed by software, and therefore how many BARs to implement.

The lower bits of a BAR encode the type of the request:

  • BAR[0]: 0 = memory request; 1 = IO request
  • BAR[2:1] (memory BARs): 00 = 32-bit address; 10 = 64-bit address
  • BAR[3] (memory BARs): 0 = non-prefetchable; 1 = prefetchable

The chip designer needs to do two things:

1, Hard-code the extra (unimplemented) BARs to all 0s, notifying software that these BARs are not implemented.

2, Hard-code the lower bits of each implemented BAR to values indicating the type and size of the address space being requested.

System software first writes all 1s to each BAR and reads it back; from the returned value, software learns the type and size of the request.  Finally, software writes the assigned base address (the address range) into the upper bits of the BAR.
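A minimal sketch of that sizing sequence for a 32-bit BAR. cfg_read32()/cfg_write32() are hypothetical stand-ins for your platform’s configuration-space accessors, and a 64-bit BAR would additionally involve the next BAR for the upper dword:

#include <stdint.h>

extern uint32_t cfg_read32(unsigned bus, unsigned dev, unsigned fn, unsigned off);
extern void     cfg_write32(unsigned bus, unsigned dev, unsigned fn, unsigned off,
                            uint32_t val);

/* Returns the size of the region a 32-bit BAR requests (0 if unimplemented). */
static uint32_t bar_size(unsigned bus, unsigned dev, unsigned fn, unsigned bar_off)
{
    uint32_t orig = cfg_read32(bus, dev, fn, bar_off);

    cfg_write32(bus, dev, fn, bar_off, 0xFFFFFFFFu);   /* write all 1s */
    uint32_t val = cfg_read32(bus, dev, fn, bar_off);  /* read back */
    cfg_write32(bus, dev, fn, bar_off, orig);          /* restore */

    if (val == 0)
        return 0;                 /* hard-coded to all 0s: not implemented */

    if (val & 0x1)                /* IO BAR: low 2 bits are type bits */
        return ~(val & ~0x3u) + 1;
    else                          /* memory BAR: low 4 bits are type bits */
        return ~(val & ~0xFu) + 1;
}

The hard-wired low address bits read back as 0s, so inverting the read-back value (with the type bits masked off) and adding 1 yields the requested size.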


High Bandwidth Memory (HBM)

Summary:

1, On January 12th, 2016, HBM2 was accepted by JEDEC as JESD235a.

2, HBM2 specifies up to 8 dies per stack and doubles throughput, to 1 TB/s.

3, On January 19th, Samsung announced early mass production of HBM2.

4, Applications: virtual reality, data-center accelerators.

5, On June 24th, 2015, AMD announced the first HBM GPU: Fiji, which powers the Radeon R9 Fury X.

6, On April 5th, 2016, Nvidia announced the first HBM2 GPU: the Tesla P100 (16 nm FinFET).

[Figure: High Bandwidth Memory schematic]

Non Transparent Bridge (NTB)

How does a Transparent Bridge (TB) work?

A TB provides electrical isolation between PCI buses.  For a TB, the Configuration Space Registers (CSRs) with a “Type 1” header tell the CPU to keep enumerating beyond this bridge, since additional devices lie downstream.  Endpoint (EP) devices with a “Type 0” header in their CSRs tell the enumerator, such as the BIOS or CPU, that no more devices lie downstream.  These CSRs include the BARs used to request memory or IO apertures from the host.

What is an NTB?

In addition to electrical isolation, an NTB adds logical isolation by providing processor-domain partitioning and address translation between the memory-mapped spaces of those domains.  With an NTB, the devices on either side of the bridge are not visible from the other side, but a path is provided for data transfer and status exchange between the processor domains.

Address translation:

In the NTB environment, PCIe devices need to translate addresses that cross from one memory space to the other.  Each NTB port has two sets of BARs: one for the primary side and one for the secondary side.  The BARs define address-translating windows into the memory space on the other side of the NTB and allow transactions to be mapped to local memory or I/O.  Each BAR has a setup register, which defines the size and type of the window, and an address-translation register.  While TBs forward all CSR transactions based on bus number, NTBs only accept CSR transactions addressed to the device itself.

There are two translation techniques (a sketch of the first follows this list).

  1. Direct address translation: an offset is added to the address within the BAR window in which the transaction lands.
  2. Lookup-table based translation: local addresses are mapped to host bus addresses; the location of the index field within the address is programmable, to adjust the window size, and the index provides the upper bits of the new memory location.
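A minimal sketch of the direct-address technique, assuming the translated base comes from the BAR’s address-translation register (names are illustrative, not from any vendor’s datasheet):

#include <stdint.h>

/* A request hitting the NTB's BAR window has the window base stripped
   off and the translated base (from the BAR's address-translation
   register) added, producing an address in the other processor domain. */
static uint64_t ntb_translate(uint64_t addr, uint64_t bar_base, uint64_t xlat_base)
{
    return xlat_base + (addr - bar_base);  /* offset within window preserved */
}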

Inter-processor communication

The NTB also allows the hosts on each side of the bridge to exchange status information through scratchpad registers, doorbell registers, and heartbeat messages.

Scratchpad Registers: readable and writable from both sides of the NTB.

Doorbell Registers: used to send interrupts from one side of the NTB to the other.  They are software-controlled interrupt request registers, with associated masking registers for each interface on the NTB, and they can be accessed from both sides.

Heartbeat Messages: sent from the primary host to the secondary host to indicate that the primary is still alive.  The secondary host monitors the state of the primary host and takes appropriate action upon detecting a failure.

In summary, the NTB provides powerful features for anyone who wants to add dual-host, dual-fabric, fail-over, and load-sharing capability to a system; in other words, to build high-availability systems.


On Communication Between Host And FPGA Through PCIe

The FPGA-to-host direction is straightforward: the TLPs (MemWr) are formed in the transaction layer (TL) of the PCIe IP, and the application side is a DMA master (in the case of Altera, the Avalon-MM master).

In the other direction, host-to-FPGA, the FPGA’s active part is merely to issue MemRd requests.  The host provides the requested data in the payload of CplD TLPs, whose “tag” field matches the one in the original MemRd TLP.  There are several rules for these read requests and their completions, assuming the “relaxed ordering” bit is cleared for all TLPs involved (you don’t want to go looking for trouble):

  • The number of bytes a request TLP may ask for is the lower of the limits declared by the device and the host in their configuration registers; typically it ends up being 512 bytes.
  • The host can divide its response into an arbitrary number of completion TLPs, as long as none of them crosses the RCB (Read Completion Boundary, a 64-byte-aligned address); see the sketch after this list for the worst case this implies.
  • Completions that have the same “tag” field always arrive in rising address order.
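The RCB rule sets the worst case a receiver must plan for: one CplD per 64-byte-aligned block that the read touches.  A small sketch of that count (my own helper, useful for sizing receive buffers):

#include <stdint.h>

/* Worst-case number of completion TLPs for one MemRd: the host may
   split the response at every Read Completion Boundary it crosses. */
static unsigned max_completions(uint64_t addr, uint32_t len, uint32_t rcb)
{
    uint64_t first = addr / rcb;             /* first RCB block touched */
    uint64_t last  = (addr + len - 1) / rcb; /* last RCB block touched */
    return (unsigned)(last - first + 1);
}

For a 512-byte read starting on a 64-byte boundary, max_completions(addr, 512, 64) gives 8, which matches the credit arithmetic at the end of this section.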

Now, there are two cases here:

  1. Single read request in flight: since the completion TLPs arrive in rising address order, the incoming data can be stored in a FIFO or RAM.  If stored in RAM, a pointer can be set to the start address and then incremented as CplDs arrive.  The disadvantage of this method is that each read request can be sent only after all of the previous one’s completions have arrived.  No completion data flows during the gap until the new request is issued, which reduces bandwidth considerably.
  2. Multiple read requests in flight: to improve bandwidth utilization, multiple read requests must be issued, so that the host always has at least one read request handy when it finishes completing the others.  The FPGA can’t rely on CplDs arriving in rising address order anymore, and must store the incoming data in RAM before sending it to the data sink.  Q: “When completions from multiple read requests are arriving, how can you tell when data is ready for submission to the logic that uses it?”  One answer is simple per-request bookkeeping, as sketched after this list.
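A sketch of such bookkeeping, exploiting the fact that completions carrying the same tag arrive in rising address order, so one write pointer per tag into a reorder RAM is enough (the structure and names are mine, not from any vendor IP):

#include <stdint.h>

#define MAX_TAGS 32

struct read_ctx {
    uint32_t ram_base;   /* where this request's data lands in reorder RAM */
    uint32_t received;   /* payload bytes received so far for this tag */
    uint32_t expected;   /* total bytes requested */
};

static struct read_ctx ctx[MAX_TAGS];

/* Record the reorder-RAM destination when a MemRd with this tag is sent. */
static void on_read_issued(uint8_t tag, uint32_t ram_base, uint32_t len)
{
    ctx[tag].ram_base = ram_base;
    ctx[tag].received = 0;
    ctx[tag].expected = len;
}

/* For each incoming CplD: return the RAM address to write its payload at,
   and flag when the whole request has been satisfied. */
static uint32_t on_cpld(uint8_t tag, uint32_t payload_len, int *done)
{
    uint32_t wr_addr = ctx[tag].ram_base + ctx[tag].received;
    ctx[tag].received += payload_len;
    *done = (ctx[tag].received >= ctx[tag].expected);
    return wr_addr;
}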

For the multiple-reads-in-flight method, the next question is: “When is it safe to send another read request?”  Any endpoint on the PCIe bus must advertise infinite credits for completions.  But how much is “infinite”?  In practice, PCIe IP cores have different ways of informing the application logic about the amount of space allocated for incoming completions, through special wires or some other method, and this information needs to be hardcoded in the logic.  Most of the time it’s up to the application logic to track how many resources are left and decide whether it’s safe to send another read request.  A typical implementation calculates the maximum number of credits a read request’s response may consume and verifies it against a calculated number of credits left.  If there are enough credits, the read request is sent, and the response’s worst-case consumption is deducted from the credits left.  These credits are returned to the “credits left” estimate as the completions arrive.

If some limitations can be made, the above logic can be simplified.  For example, assume all read requests are limited to 512 bytes, and always start on 64-byte boundaries, the host will respond with 8 packets with 64 bytes of payload each.  Each of these CplDs consumes memory at the receiving side, which is equivalent to 1 header credit and 4 data credits, so the request’s completion may consume memory of up to 8 header credits and 32 data credits’ worth.  Suppose that the PCIe IP allocates 28 header credits and 112 data credits for completions.  the header credits limit us to 3 requests in flight(3×8 < 28), and so do the data credits(3×32 < 112).  This reduces the tracking logic to just knowing how many uncompleted requests are out.