On Communication Between Host And FPGA Through PCIe

FPGA-to-host is the straightforward direction: MemWr TLPs are formed in the Transaction Layer of the PCIe IP, and the application side acts as a DMA master (in Altera’s case, an Avalon-MM master).

In the other direction, host-to-FPGA, the FPGA’s active part is merely to issue MemRd requests.  The host provides the requested data in the payloads of CplD TLPs, whose “tag” fields match the one in the original MemRd TLP.  There are several rules governing these read requests and their completions.  Assume the “relaxed ordering” bit is cleared for all TLPs involved (you don’t want to invite trouble).

  • The number of bytes a request TLP may ask for is bounded by the lower of the limits declared by the device and the host in their configuration registers.  Typically, it ends up being 512 bytes.
  • The host may divide its response into an arbitrary number of completion TLPs, as long as none of them crosses an RCB (Read Completion Boundary, a 64-byte-aligned address).
  • Completions carrying the same “tag” field always arrive in rising address order.
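The RCB rule can be made concrete with a small sketch (purely illustrative, not any IP’s implementation): split one read completion into the maximal set of chunks that never cross a 64-byte line.  This maximal split is also the behavior assumed in the 512-byte credit example further down.

```python
RCB = 64  # Read Completion Boundary: completions must not cross 64-byte lines

def split_completion(addr, length):
    """Split one read completion into chunks that never cross an RCB line."""
    chunks = []
    while length > 0:
        # bytes remaining until the next 64-byte boundary
        chunk = min(length, RCB - (addr % RCB))
        chunks.append((addr, chunk))
        addr += chunk
        length -= chunk
    return chunks

# A 256-byte read starting 16 bytes past a boundary yields a short first
# chunk and then full 64-byte chunks (addresses shown in decimal):
print(split_completion(0x1010, 256))
# [(4112, 48), (4160, 64), (4224, 64), (4288, 64), (4352, 16)]
```

Note that the completer is free to send fewer, larger completions as well; this sketch shows only the worst case the receiver must be prepared for.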

Now, there are two cases here:

  1. Single read request in flight: since completion TLPs arrive in rising address order, the incoming data can be stored in a FIFO or RAM.  If stored in RAM, a pointer is set to the start address and incremented as CplDs arrive.  The disadvantage of this method is that each read request can be sent only after all of the previous one’s completions have arrived, so no completion data flows during the gap until the new request is issued.  This reduces bandwidth considerably.
  2. Multiple read requests in flight: to improve bandwidth utilization, multiple read requests must be issued, so that the host always has at least one read request on hand when it finishes completing the others.  The FPGA can no longer rely on CplDs arriving in rising address order, and must stage the incoming data in RAM before sending it to the data sink.  Q: “When completions from multiple read requests are arriving, how can you tell when data is ready for submission to the logic that uses it?”  The logic needs to detect when a contiguous region of the data has been filled.
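One possible answer to the question just posed, sketched in Python (names and structure are assumptions, not any particular IP’s design): stage every CplD payload at its request-relative address, and release data to the sink only when the region starting at the consumer pointer is contiguously filled.

```python
class ReadReassembler:
    """Reassemble CplD payloads from multiple in-flight MemRd requests.

    Completions for one tag arrive in rising address order, but tags
    interleave, so data is staged in RAM and handed to the data sink
    only once it is contiguous from the consumer's point of view.
    """
    def __init__(self, size):
        self.ram = bytearray(size)
        self.filled = [False] * size   # per-byte fill map (a simplification)
        self.rd_ptr = 0                # next byte the data sink may consume

    def on_cpld(self, addr, payload):
        """Store one completion's payload at its request-relative address."""
        self.ram[addr:addr + len(payload)] = payload
        for i in range(addr, addr + len(payload)):
            self.filled[i] = True

    def ready(self):
        """Return the contiguous bytes now safe to hand to the data sink."""
        start = self.rd_ptr
        while self.rd_ptr < len(self.ram) and self.filled[self.rd_ptr]:
            self.rd_ptr += 1
        return bytes(self.ram[start:self.rd_ptr])
```

If a later request’s completions land first, `ready()` returns nothing until the hole at the consumer pointer is filled; in RTL this would be a fill bitmap (or per-chunk counters) rather than a per-byte list.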

For the multiple-requests-in-flight method, the next question is “when is it safe to send another read request?”  Any EP on the PCIe bus must advertise infinite credits for completions, but how much is “infinite”?  PCIe IP cores have different ways of informing the application logic of the amount of space allocated for incoming completions, through dedicated wires or some other method, and this information needs to be hardcoded into the logic.  Most of the time it is up to the application logic to track how many resources are left and to decide whether it is safe to send another read request.  A typical implementation calculates the maximum number of credits a read request’s response may consume and checks it against the estimated number of credits left.  If there are enough credits, the read request is sent, and the response’s worst-case consumption is deducted from the credits left.  These credits are then returned to the “credits left” estimate as the completions arrive.

If some limitations can be imposed, the above logic can be simplified.  For example, if all read requests are limited to 512 bytes and always start on 64-byte boundaries, the host will respond with 8 packets of 64 bytes of payload each.  Each of these CplDs consumes memory at the receiving side equivalent to 1 header credit and 4 data credits, so a request’s completions may consume up to 8 header credits’ and 32 data credits’ worth of memory.  Suppose the PCIe IP allocates 28 header credits and 112 data credits for completions.  The header credits then limit us to 3 requests in flight (3×8 = 24 ≤ 28), and so do the data credits (3×32 = 96 ≤ 112).  This reduces the tracking logic to just counting how many uncompleted requests are outstanding.
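The simplified bookkeeping can be sketched as follows (constants taken from the example above; in real logic this would be a small counter in RTL):

```python
HDR_PER_REQ, DATA_PER_REQ = 8, 32   # worst case per 512-byte read
HDR_ALLOC, DATA_ALLOC = 28, 112     # what the PCIe IP advertises

# Both limits allow 3 requests in flight: 28 // 8 == 3 and 112 // 32 == 3
MAX_IN_FLIGHT = min(HDR_ALLOC // HDR_PER_REQ, DATA_ALLOC // DATA_PER_REQ)
print(MAX_IN_FLIGHT)  # 3

in_flight = 0

def can_issue_read():
    return in_flight < MAX_IN_FLIGHT

def on_read_issued():
    global in_flight
    in_flight += 1          # deduct the worst-case credit consumption

def on_last_cpld_for_tag():
    global in_flight
    in_flight -= 1          # all 8 completions arrived; credits returned
```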

PCIe IP Evaluation

How to evaluate a PCIe IP?

Lots of silicon-proven IPs are available.  Which one to pick?  The system performance target should be kept in mind from the initial architecture stages.  Various parameters need to be configured based on system requirements, and then thoroughly simulated to ensure the solution’s correctness.

Factors to consider:

1, Protocol stacks:

Identify the IP integration boundary; there are usually two kinds.  First, interfacing with the Physical Layer through the PIPE interface, meaning you must take care of your own Data Link and Transaction Layer logic.  Second, interfacing with the Transaction Layer through a standard interconnect such as AXI4 or AHB, or through your own interconnect.

2, BW consideration:

Roughly, G1 x1 available BW = 250 MB/s, G2 = 500 MB/s, G3 ~ 1 GB/s, and G4 ~ 2 GB/s.  You also need to allow roughly 20% tolerance for things like link protocol overhead, buffer efficiency, etc.  Link width requirements are typically derived from the application’s bandwidth needs.  Keep in mind that the physical layer is sometimes the most area-consuming part of the PCIe stack: unlike the logical protocol layers, which have common functionality for all link widths, physical layer size grows linearly with the number of lanes.  Since physical layer cores are independent of one another, some IP vendors provide port bifurcation capabilities, allowing the transceiver cores to be shared between several logical ports.  For instance, such an implementation may be statically configured as a single x8 port or as two x4 ports.
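These per-lane figures can be reproduced with a short calculation.  The encoding efficiencies (8b/10b for G1/G2, 128b/130b for G3/G4) are standard; the 20% margin is only the rule-of-thumb tolerance mentioned above.

```python
GT_PER_SEC = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0}   # line rate in GT/s

def lane_bw_mb(gen):
    """Per-lane bandwidth in MB/s after line-encoding overhead."""
    eff = 8 / 10 if gen <= 2 else 128 / 130
    return GT_PER_SEC[gen] * 1000 * eff / 8

def usable_bw_mb(gen, lanes, margin=0.20):
    """Bandwidth to plan for, after the ~20% protocol-overhead tolerance."""
    return lane_bw_mb(gen) * lanes * (1 - margin)

print(round(lane_bw_mb(1)))        # 250, matching the G1 x1 figure above
print(round(usable_bw_mb(3, 4)))   # 3151 MB/s to budget for a G3 x4 link
```

The jump from 500 MB/s (G2) to ~1 GB/s (G3) comes mostly from the switch to 128b/130b encoding, since the line rate itself only rose from 5 to 8 GT/s.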

3, Port type:

Does the SoC need an EP, an RC, or both?  For instance, non-transparent PCIe bridges act as an endpoint device on a primary PCIe hierarchy while controlling another PCIe tree of their own.  Such applications may want to retain flexibility in specifying the operating port type at device configuration time, and need special PCIe logic that supports both the root port and endpoint configuration models with their different configuration register sets.  This dual-mode support results in increased transaction logic and configuration register block size; however, the additional cell count is usually outweighed by the gain in solution flexibility.

4, Virtual Channels: 

Virtual Channels (VCs) are a mechanism for differentiated bandwidth allocation.  Virtual channels have dedicated physical resources (buffering, flow control management, etc.) across the hierarchy.  Transactions are associated with one of the supported VCs according to their Traffic Class (TC) attribute, through the TC-to-VC mapping specified in the configuration block of the PCIe device.  Transactions with a higher priority may therefore be mapped to a separate virtual channel, eliminating resource conflicts with low-priority traffic.

Multi-VC support usually leads to a notable logic area increase due to the additional buffering and separate logic mechanisms required per VC. To support independent queues for different virtual channels, separate logical queues are usually required. One possible multi-VC buffering scheme is implementing separate physical request queues for each VC to allow efficient arbitration between the VCs, while keeping a single physical data buffer with data blocks referenced by the header queues entries.

The vast majority of PCIe applications do not support multiple VCs; however, there is increasing interest in multi-VC configurations.

5, Performance measures:

  • Link width – Select the desired link width based on the target bandwidth.  While wide links support training to a lower link width while leaving the upper lanes idle, interoperability with the other devices planned to be attached to the application must also be taken into account, to ensure appropriate link width support and optimize lane utilization.
  • Replay buffer size – The replay buffer physically resides on the transmitter side.  Three factors determine its size: (1) the ACK latency of the receiver on the other side; (2) the delay introduced by the transmitter’s Physical Layer; (3) the receiver’s L0s exit latency to L0 (in other words, the replay buffer should be big enough to hold TLPs without stalling while the link returns from L0s to L0).  When the replay buffer is full, the TLP flow is suspended until sufficient space becomes available.  The replay buffer size largely depends on the TLP acknowledgment round trip time, which is the period from the moment of TLP transmission until ACK DLLP arrival and the completion of its processing by the TLP originator.  Some devices implement ACK DLLP coalescing (issuing a single DLLP to acknowledge several TLPs) by specifying an ACK factor parameter greater than one.  Higher ACK factors improve link utilization, but lengthen the TLP acknowledgment round trip, increasing the replay buffer size required for maximal bandwidth.  [Figure: acknowledgment round trip time for an ACK factor of four]
  • Request buffer size – PCIe is a flow-control-based (credit-based) protocol.  Receivers advertise the supported number of receive buffers, and transmitters are not allowed to send TLPs without ensuring that sufficient receive buffer space is available.  Receivers indicate additional buffer availability through the flow control update mechanism to allow constant data flow.  Receive buffers must be large enough to cover data transmission, processing, and the flow control update round trip, so that from the transmitter’s perspective buffer space is constantly available at the desired request rate.  Particular attention should be paid to the read requests’ receive queue depth.  For optimal performance, the application must be able to return a read request credit after forwarding the request to the application crossbar, without waiting for the read data to return.  Moreover, if the read request queue is not deep enough, non-posted header credit updates may reach the remote transmitter at an insufficient rate, limiting its ability to forward read requests at a rate that fully utilizes the read data bandwidth on the transmit link.  [Figure: flow control update round trip period from the remote transmitter’s standpoint]
  • Read data buffering – PCIe supports multiple, concurrent, outstanding read transactions uniquely identified across the hierarchy by RequestorID and transaction tags. Transaction initiators are required to allocate buffering resources for read data upon making the request and advertise infinite credits for completions. Read requests are withheld until sufficient data buffer resources have been reserved. Therefore, a typical system’s read data return latency must be considered to specify a sufficient number of outstanding reads. The number of reads in conjunction with the supported read request size should be able to compensate for data return latency to allow read data flow at the desired rate. Since PCIe allows a single read request to be completed by multiple completion TLPs, applications are encouraged to utilize large read requests regardless of the configured maximum payload size. Large reads allow achieving the bandwidth target with a smaller number of outstanding transactions, thus simplifying read context management and reducing read request traffic, which imposes an overhead for the data flow on the transmit link.
  • Maximum Payload Size (MPS) – PCIe supports several settings for the maximum data payload allowed in a TLP from a specific device.  The default maximum payload size is 128 bytes and can only be modified by PCIe-aware configuration software.  Additional maximum payload configurations range from 256 bytes to 4 KB.  Smaller payloads require smaller buffers and allow quick flow-control credit turnaround; on the other hand, they result in higher link overhead.  Large payloads require large replay and receive buffers and need to be supported across the entire system for optimal resource utilization.  Each application must weigh the optimal maximum payload parameter based on the above criteria.  Real-life implementations show that a maximum payload size of 512 bytes lies in the sweet spot, allowing high link utilization with reasonably small data buffers.  Larger payloads require significantly larger data buffers that do not justify the minor improvement in link utilization.
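A rough calculation illustrates why 512 bytes sits in the sweet spot.  The sketch below assumes about 24 bytes of per-TLP overhead (framing, sequence number, a 4DW header, LCRC); the exact figure varies by configuration, and the shape of the curve is the point, not the precise numbers.

```python
OVERHEAD = 24  # assumed bytes of non-payload per TLP

# Link efficiency for memory writes as a function of payload size:
for mps in (128, 256, 512, 1024, 2048, 4096):
    eff = mps / (mps + OVERHEAD)
    print(f"{mps:5d} B payload -> {eff:.1%} link efficiency")
```

Going from 512 B to 4 KB payloads buys roughly four points of efficiency while requiring eight times the buffering, which is the trade-off described above.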

6, Power Management:

The L0s power state allows automatic transmitter shutdown during idle states and provides short recovery times through a fast training sequence. The L1 link power state is applied when the device is placed in a low device power state by power management software. Only a limited number of transactions that are required for supporting the device’s return to an operational state are allowed in this state. PCI Express also specifies Active State Power Management (ASPM) capabilities that allow L1-level power savings, but are controlled through application-specific mechanisms. Another power management technique is dynamic lane downshifting. Wide links that do not require full bandwidth may retrain with a lower number of active lanes, shutting down unused lanes and resulting in significant power savings. The link protocol also introduces an additional means of power saving by providing a software-controlled capability of retraining the link to a lower speed when full bandwidth is not required.

7, Verification: 

PCI Express IP vendors usually provide an IP design that has undergone extensive verification.  However, this is not enough to ensure a lack of defects in chip-level functionality of the PCI Express partition.  PCIe IP parameterization needs to be verified in a chip-level environment to prove the correctness and consistency of the selected configuration.  Data flows and protocols on the IP interfaces to the chip interconnect must also be covered in chip-level verification.  In addition, particular attention should be paid to configuration sequences, system boot and address space allocation, possible transaction-ordering deadlocks, interrupt handling, error handling and reporting, power management, and other system-level scenarios.  This part of verification requires extensive knowledge of real-life software behavior that needs to be translated into simulation test cases.  A chip-level verification effort should minimize testing of internal IP features associated with the IP implementation, relying on the verification coverage of the IP provider.  For instance, the chip integrator may limit testing of the completion timeout mechanism to a single timeout value, to validate that the mechanism can be enabled in the chip and that associated error events are handled properly.  Testing all the possible timeout values, however, is not necessary, assuming that such testing falls under the responsibility of the IP provider.

  • VIP needed? PCIe verification IP solutions that complement design IP are available to support the PCIe verification effort. In addition to base PCIe device behavior modeling, verification IP implements protocol checkers that provide real-time indications of standard violations, when they occur. Some PCIe verification solutions also provide a test case suite that covers various PCI Express compliance checks and complements chip-level testing.
  • Configuration sequence – Chip verification should cover the flow that the application performs during boot and the initial configuration sequence.  These are critical scenarios that may have chip architecture impact; therefore, early detection of configuration problems is highly important.
  • Chip reset sequences – SoC applications usually implement several reset mechanisms, including software controlled mechanisms. These mechanisms are critical to chip functionality and, if defective, may lead to chip malfunction. The reset level of the PCIe partition should be determined for each reset sequence and then simulated to prove the ability to recover from the reset state and return to operational mode. Root port and endpoint applications should take PCIe-specific aspects of the link reset into account. For instance, the downstream PCIe device has to be reconfigured after link reset, which usually requires system software intervention.
  • PCIe reset sequences – The PCI Express protocol specifies a hot reset mechanism, where downstream components reset through link notification. Root ports should validate that this mechanism can be applied and the hot reset indication is properly detected by a remote device. Endpoint applications should determine the level of chip reset desired in case of hot reset detection on the link to validate proper functionality.
  • Performance – Performance checking is one of the important chip level tests. This test is supposed to prove initial assumptions taken during architecture phase, such as cores latencies, interconnect bandwidth, etc. Completing performance testing in early project stages allows advance detection of critical chip architecture flaws.
  • Error events – PCIe specifies various events and error conditions that may occur in the system. These events are registered in the configuration space of the relevant devices and are reported to the root complex through PCIe messages. The root complex collects error reports from the hierarchy and forwards them to the system in an application-specific manner. Error reporting sequences, as well as the ability to resolve the error and clear all the relevant status bits should be addressed by the chip-level testing.
  • INTx interrupts – The PCIe standard specifies two interrupt modes: the legacy INTx level-triggered mode and MSI mode.  In INTx mode, endpoints report an INTx interrupt to the host, based on the interrupt configuration.  The root complex merges interrupt reports from all the devices into four interrupt lines and forwards an interrupt indication to the system.  INTx interrupt resolution includes scanning all the devices to determine the interrupt source, clearing the interrupt trigger, and then validating interrupt deassertion.  In some cases endpoints cannot assume that the MSI interrupt mode is implemented by the host, and must provide legacy interrupt support.
  • MSI interrupts – MSI is the preferred PCIe mechanism for interrupt signaling that uses memory write transactions to communicate all the information on the interrupt source directly to the host. Unlike INTx, the MSI message includes all the relevant interrupt data. In MSI mode, the host is able to access the device causing the interrupt directly to clear the interrupt trigger. Chip-level simulation should validate that the MSI interrupt mode can be properly configured and that MSI messages are properly routed in the system.
  • Power management – Power management scenarios involve interaction between system software, SoC hardware, and PCI IP. The endpoints should validate the power management activation and deactivation scenarios, paying particular attention to the ability to generate a power management interrupt from a low power state. Root ports should validate that they are able to configure the power management state of downstream devices and process power management events during the low power state. The PCIe link power management is tightly coupled with the above scenarios. The system powerdown sequence, based on the PME_Turn_Off and PME_TO_Ack messages, should be also validated.
  • Random testing – Chip integrators should consider random testing that includes randomization for the following parameters: different data flows of random rates, address ranges and mapping, response time, and random error injection. Random testing attempts to cover areas that might be missed by the directed testing. Random testing may also discover system deadlocks due to the random pattern of data flows injected during the test.

PCI Express compliance checking is one of the major chip-level verification objectives.  Validating PCIe compliance includes a PCIe compliance checklist review by the chip integrator and use of third-party PCIe models and checkers that provide compliance coverage.  The chip integrator may also consider additional directed testing for PCIe compliance coverage to improve confidence in the eventual solution.

Random testing environments may consider implementing PCIe-specific coverage attributes to improve confidence in the PCIe solution.  The major risk from the integrator’s standpoint is wrong signal connections.  Interface toggle coverage may be useful to validate that all IP interfaces were toggled during testing, on the assumption that a wrong connection would surface as a functional problem.  Static configuration inputs may be covered only for desired values, limiting the coverage effort to a specific chip setup.

PCIe IP integrators may also consider reuse of IP verification environment components in chip level verification to improve verification quality and reduce the required effort. This includes reuse of IP assertions, integration of IP-specific offline and online checkers in the chip environment, and reuse of the IP test suite in chip-level verification.

8, IP Deliverables:

What deliverables can you get?  Soft IP/hard IP, testbench, test cases, synthesis scripts, STA scripts, documentation, reference designs, etc.


SoC Design Planning

The main effort should go into the integration aspects of the system and into system validation, rather than into developing individual IPs from scratch.

Soft IP : synthesizable RTL, process independent.

Hard IP: synthesized netlist with timing information, easy to integrate and has predictable performance.

IP deliverables include RTL, testbenches, synthesis scripts and documentation.

SoC validation checklist

                           B_1  B_2  B_3  B_4  B_5  B_6  B_7  B_8  B_9
Specification               x    x    x    x    x    x    x    x    x
RTL Design                  x    x    x    x    x    x    x    x    x
Verification – simulation   x    x    x    x    x    x    x    x    x
Verification – FPGA         x    x    x    x    x    x    x    x    x
Co-Verification – HW/SW     x    x    x    x    x    x    x    x    x
Driver Development          x    x    x    x    x    x    x    x    x
Application Testing         x    x    x    x    x    x    x    x    x
Configuration Sequence      x    x    x    x    x    x    x    x    x


NVMe

What is NVMe?

  • “NVMe was designed from the ground up to provide a very high performance, low latency interface for PCIe SSD devices.”
  • “The interface was also designed to be highly parallel and highly scalable. The scalability, parallelism and inherent efficiency of NVMe allow the interface to scale up and down in performance without losing any of the benefits. These features allow the interface to be highly adaptable to a wide variety of system configurations and designs from laptops to very high end, highly parallel servers.”

NVMe Controller registers usage:

  • Advertise capabilities of the controller
  • Enable/disable/reset controller
  • Setup addresses for admin queues
  • Doorbell registers

NVMe Terminology:

  • Submission Queue: a circular buffer with a fixed slot size that the host uses to submit commands for execution by the controller.
  • Completion Queue: a circular buffer with a fixed slot size used to post the status of completed commands.
  • Doorbell register: “rings the bell” to notify the SSD that data is ready and waiting; used instead of having the controller poll a register, which is better for power management.
  • Port: a queue set (one submission queue plus one completion queue).

NVMe Queue operation:

  • NVMe uses submission queues and completion queues for the management of memory for data transfer.
  • Admin queues are for Admin commands; IO queues are for data transfer.
  • Queues are managed on the host side, and the device driver is responsible for the communication between host and controller.
  • Addresses of the Admin queues are programmed into controller registers (i.e., the Admin Submission Queue Base register and the Admin Completion Queue Base register).  For example, send a MemWr32 TLP with Address field Base+28h to program the Admin Submission Queue Address.
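The register programming above can be sketched as follows.  This is a toy model, not a real driver: `mmio_write` and the register dictionary are stand-ins for the platform’s MMIO access (each write would become a MemWr TLP on the link), while the 0x28/0x30 offsets and the doorbell formula follow the NVMe controller register map.

```python
ASQ = 0x28   # Admin Submission Queue Base Address register offset
ACQ = 0x30   # Admin Completion Queue Base Address register offset
DSTRD = 0    # doorbell stride from CAP.DSTRD (0 means a 4-byte stride)

regs = {}    # toy register file standing in for the BAR0 window

def mmio_write(offset, value):
    """Stand-in for a real MMIO write to the controller's BAR0 window."""
    regs[offset] = value

def program_admin_queues(asq_base, acq_base):
    mmio_write(ASQ, asq_base)   # the MemWr32 to Base+28h mentioned above
    mmio_write(ACQ, acq_base)

def sq_tail_doorbell(qid):
    """Tail doorbell offset for submission queue `qid`; the matching
    completion queue head doorbells sit at the odd slots (2*qid + 1)."""
    return 0x1000 + (2 * qid) * (4 << DSTRD)

def ring(qid, new_tail):
    mmio_write(sq_tail_doorbell(qid), new_tail)

program_admin_queues(0x10000000, 0x10010000)  # example base addresses
ring(0, 1)   # notify the controller: one new command in admin SQ 0
```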

NVMe Overview:

[Figure: NVMe overview]

Queuing Interface:

[Figure: NVMe queuing interface]

 

Reference:

1, NVMe specification:

2, NVMe Boot Camp Feb 2015

PCIe Bus Enumeration

Summary:

  • Bus enumeration is done by accessing the Vendor ID and Device ID registers.
  • Two problems may occur during enumeration: a device may not be present, or it may be present but not yet ready to respond.
  • If the device is not present: in PCI, the configuration read request would time out and generate a Master Abort error condition.  Since no device was driving the bus and the signals were pulled up, the data would be read as FFFFh, which is an invalid Vendor ID.  Since this is not really an error condition, the Master Abort is not reported as an error during enumeration.  In PCIe, a configuration read request to a non-existent device results in the bridge above the targeted device returning a Completion without data, with a status of UR (Unsupported Request).  For backward compatibility with the legacy PCI enumeration model, the Root Complex returns all ones (FFFFh) to the CPU when this Completion is seen during enumeration.  Either way, reporting such an error during enumeration should be avoided to keep from hanging the machine.
  • If the device is not ready: for G1/G2, software should wait 100 ms after reset before initiating a Configuration Request.  For G3/G4, software should wait 100 ms after the LTSSM completes link training before generating the Configuration Request.  The reason for the longer delay is that the G3 equalization process during link training takes longer (~50 ms).
  • Determining whether a function is an EP or a bridge: Header Type register (DW3, byte 2).  Bit [7]: 1 = multi-function device, 0 = single-function device.  Bits [6:0]: 0 = EP, 1 = PCI-to-PCI bridge, 2 = CardBus bridge.  If a bridge is found, the enumeration software performs a series of configuration writes to set the bridge’s Primary Bus Number, Secondary Bus Number, and Subordinate Bus Number registers, for example PBN=0, Sec=1, Sub=255.  The bridge then knows that the bus directly attached downstream is bus 1 and that the largest bus number downstream of it is 255.
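The Header Type decoding and bus-number programming above can be sketched as:

```python
def decode_header_type(header_type):
    """Decode the Header Type register byte (config space DW3, byte 2)."""
    multi_function = bool(header_type & 0x80)
    layout = header_type & 0x7F   # 0 = EP, 1 = PCI-to-PCI bridge,
                                  # 2 = CardBus bridge
    return multi_function, layout

def bridge_bus_numbers(primary, secondary, subordinate):
    """Pack the Primary/Secondary/Subordinate bus numbers into the DW at
    config offset 18h, as the enumeration writes above would set them."""
    return primary | (secondary << 8) | (subordinate << 16)

print(decode_header_type(0x81))            # (True, 1): multi-function bridge
print(hex(bridge_bus_numbers(0, 1, 255)))  # 0xff0100: PBN=0, Sec=1, Sub=255
```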

Reference:

https://en.wikipedia.org/wiki/PCI_configuration_space#Bus_enumeration