ARC APEX

As I learned more about Synopsys ARC APEX, I decided to write down some notes.

Here is the ARC HS top level block diagram:

ARC_HS_bus

 

And this is the 10-stage pipeline.

ARC_HS_pipe

It employs sophisticated branch-prediction logic with very high prediction accuracy and early branch-resolution points to minimize the average mispredict penalty.  The branch-prediction logic speculates on the branch target address with a high probability of success, which minimizes pipeline stalls.

 

Non Transparent Bridge (NTB)

How does a Transparent Bridge (TB) work?

A TB provides electrical isolation between PCI buses.  For a TB, the Configuration Status Register (CSR) with a “type 1” header informs the CPU to keep enumerating beyond this bridge, since additional devices lie downstream.  EP devices with a “type 0” header in their CSRs inform the enumerator, such as the BIOS or CPU, that no more devices lie downstream.  These CSRs include BARs used to request memory or I/O apertures from the host.

What is an NTB?

In addition to electrical isolation, the NTB adds logical isolation by providing processor-domain partitioning and address translation between the memory-mapped spaces of those domains.  With an NTB, the devices on either side of the bridge are not visible from the other side, but a path is provided for data transfer and status exchange between the processor domains.

Address translation:

In the NTB environment, PCIe devices need to translate the addresses that cross from one memory space to the other.  Each NTB port has two sets of BARs, one for the primary side and the other for the secondary side.  BARs are used to define address-translating windows into the memory space on the other side of the NTB and allow transactions to be mapped to local memory or I/O.  Each BAR has a setup register, which defines the size and type of the window, and an address translation register.  While TBs forward all CSR transactions based on bus number, NTBs only accept CSR transactions addressed to the device itself.

There are two translation techniques.

  1. direct-address: add an offset to the BAR in which the transaction terminates.
  2. lookup-table-based: map local addresses to host bus addresses; the location of the index field within the address is programmable to adjust the window size.  The index provides the upper bits of the new memory location.
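Both techniques can be sketched in a few lines; the window sizes, base addresses, and index-field position below are hypothetical values chosen purely for illustration, not taken from any particular NTB device:

```python
# Minimal sketch of the two NTB address-translation techniques.

def direct_translate(addr, bar_base, xlat_base, window_size):
    """Direct-address translation: the BAR base is swapped for the
    translated base, i.e. a fixed offset is added to the incoming address."""
    assert bar_base <= addr < bar_base + window_size
    return xlat_base + (addr - bar_base)

def lut_translate(addr, lut, index_shift, index_bits):
    """Lookup-table translation: a programmable index field within the
    address selects an entry supplying the upper bits of the new
    location; the lower bits pass through unchanged."""
    index = (addr >> index_shift) & ((1 << index_bits) - 1)
    low_mask = (1 << index_shift) - 1
    return lut[index] | (addr & low_mask)

# Direct: a 1 MB window at BAR 0x8000_0000 mapped to host 0x2_0000_0000
print(hex(direct_translate(0x8000_1234, 0x8000_0000, 0x2_0000_0000, 1 << 20)))
# Lookup: 4-entry table of 1 MB (20-bit) sub-windows, index in bits [21:20]
lut = {0: 0x1_0000_0000, 1: 0x1_4000_0000, 2: 0x1_8000_0000, 3: 0x1_C000_0000}
print(hex(lut_translate((2 << 20) | 0x567, lut, 20, 2)))
```

Note how the lookup-table variant lets non-contiguous host regions appear contiguous on the local side, at the cost of table storage.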

Inter-processor communication

The NTB also allows the hosts on each side of the bridge to exchange status information through scratchpad registers, doorbell registers, and heartbeat messages.

Scratchpad Registers: readable and writable from both sides of the NTB.

Doorbell Registers: used to send interrupts.  These are software-controlled interrupt request registers with associated masking registers for each interface on the NTB.  These registers can be accessed from both sides.

Heartbeat Messages: sent from the primary host to the secondary host to indicate that it is still alive.  The secondary host monitors the state of the primary host and takes appropriate action upon detecting a failure.
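As a rough illustration of the doorbell semantics, here is a minimal software model; the register width and the write-1-to-clear acknowledge behavior are assumptions for the sketch, not taken from any particular NTB device:

```python
# Hypothetical model of one side's NTB doorbell bank: the peer sets
# pending bits, the local host masks and acknowledges them.

class DoorbellBank:
    def __init__(self, width=16):
        self.pending = 0   # software-controlled interrupt request bits
        self.mask = 0      # per-bit interrupt masking
        self.width = width

    def ring(self, bits):
        # written by the peer host across the NTB
        self.pending |= bits & ((1 << self.width) - 1)

    def clear(self, bits):
        # local host acknowledges (write-1-to-clear, an assumption here)
        self.pending &= ~bits

    def irq_asserted(self):
        # interrupt fires only for pending, unmasked bits
        return bool(self.pending & ~self.mask)

db = DoorbellBank()
db.mask = 0b0010            # mask doorbell bit 1
db.ring(0b0011)             # peer rings bits 0 and 1
print(db.irq_asserted())    # True: bit 0 is pending and unmasked
db.clear(0b0001)
print(db.irq_asserted())    # False: only the masked bit 1 remains
```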

In summary, the NTB provides powerful features for anyone who wants to add dual-host, dual-fabric, fail-over, and load-sharing capability to their system; in other words, high-availability systems.

 

 

On Communication Between Host And FPGA Through PCIe

The FPGA-to-host direction is straightforward: the TLPs (MemWr) are formed in the Transaction Layer (TL) of the PCIe IP, and the application side acts as a DMA master (in the case of Altera, it’s the Avalon-MM master).

In the other direction, host-to-FPGA, the FPGA’s active part is merely to issue MemRd requests.  The host provides the requested data in the payload of CplD TLPs.  The “tag” field of each CplD matches the one in the original MemRd TLP.  There are several rules for these read requests and their completions, assuming the “relaxed ordering” bit is cleared for all TLPs involved (you don’t want to get yourself into trouble):

  • The number of bytes a request TLP may ask for is the lower of the limits declared by the device and the host in their configuration registers.  Typically this ends up being 512 bytes.
  • The host can divide its response into an arbitrary number of completion TLPs, as long as none of them crosses an RCB (Read Completion Boundary, a 64-byte-aligned address).
  • Completions that have the same “tag” field always arrive in rising address order.
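A useful consequence of the RCB rule: the worst-case number of CplDs for a single read is simply the number of RCB-aligned chunks the request spans. A quick sketch:

```python
# Worst-case completion count for one MemRd: the host may split the
# completion only at RCB (64-byte) boundaries, so at worst we get one
# CplD per RCB-aligned chunk the request touches.

RCB = 64

def max_completions(addr, length):
    first_boundary = (addr // RCB) * RCB   # RCB chunk containing the start
    last_byte = addr + length - 1
    return (last_byte - first_boundary) // RCB + 1

print(max_completions(0x0, 512))    # aligned 512 B read -> up to 8 CplDs
print(max_completions(0x10, 512))   # unaligned read spans one extra chunk -> 9
```

This is why restricting requests to 64-byte-aligned starts (as done later in this section) makes the worst case easy to bound.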

Now, there are two cases here:

  1. Single read request in flight: Since the completion TLPs arrive in rising address order, the incoming data can be stored in a FIFO or RAM.  If stored in RAM, a pointer can be set to the start address and then incremented as CplDs arrive.  The disadvantage of this method is that each read request may be sent only after all of the previous request’s completions have arrived.  This means no completion data flows during the gap until the new request is issued, which reduces bandwidth considerably.
  2. Multiple read requests in flight: To improve bandwidth utilization, multiple read requests must be issued, so that the host has at least one read request handy when it finishes completing the others.  The FPGA can no longer rely on CplDs arriving in rising address order, and must store the incoming data in RAM before sending it to the data sink.  Q: “When the completions from multiple read requests are arriving, how can you tell when data is ready for submission to the logic that uses it?”  It may need to detect some data-transmission pattern….

For the multiple-reads-in-flight method, the next question is “When is it safe to send another read request?”  Any EP on the PCIe bus must advertise infinite credits for completions.  But how much is “infinite”?  PCIe IP cores usually have some way of informing the application logic about the amount of space allocated for incoming completions, through special wires or some other method, and this information needs to be hardcoded in the logic.  Most of the time it’s up to the application logic to track how many resources are left and decide whether it’s safe to send another read request.  A typical implementation calculates the maximum number of credits a read request’s response may consume and checks it against the calculated number of credits left.  If there are enough credits, the read request is sent, and the response’s worst-case consumption is deducted from the credits left.  These credits are returned to the “credits left” estimate as the completions arrive.

If some restrictions are acceptable, the above logic can be simplified.  For example, assume all read requests are limited to 512 bytes and always start on 64-byte boundaries; the host will then respond with 8 packets of 64 bytes of payload each.  Each of these CplDs consumes memory at the receiving side equivalent to 1 header credit and 4 data credits, so one request’s completions may consume up to 8 header credits’ and 32 data credits’ worth of memory.  Suppose the PCIe IP allocates 28 header credits and 112 data credits for completions.  The header credits limit us to 3 requests in flight (3×8 ≤ 28), and so do the data credits (3×32 ≤ 112).  This reduces the tracking logic to just counting how many uncompleted requests are outstanding.
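The credit bookkeeping described above can be sketched as follows. The pool sizes (28 header / 112 data credits) and the 1-header/4-data cost per 64-byte CplD mirror the worked example; in a real design they would come from the PCIe IP’s documentation:

```python
# Sketch of completion-credit tracking for multiple reads in flight.
# One data credit covers 16 bytes of payload; a 64-byte CplD therefore
# costs 1 header credit + 4 data credits (example values, per the text).

class CompletionCreditTracker:
    def __init__(self, hdr_credits, data_credits):
        self.hdr_left = hdr_credits
        self.data_left = data_credits

    @staticmethod
    def worst_case(read_bytes):
        # worst case: one CplD per 64-byte chunk of the request
        cplds = (read_bytes + 63) // 64
        return cplds, cplds * 4

    def try_issue(self, read_bytes):
        hdr, data = self.worst_case(read_bytes)
        if hdr <= self.hdr_left and data <= self.data_left:
            self.hdr_left -= hdr          # deduct worst-case consumption
            self.data_left -= data
            return True
        return False                      # not safe to send yet

    def on_completion(self, payload_bytes):
        # return credits as each CplD arrives
        self.hdr_left += 1
        self.data_left += (payload_bytes + 15) // 16

t = CompletionCreditTracker(hdr_credits=28, data_credits=112)
issued = 0
while t.try_issue(512):
    issued += 1
print(issued)   # 3: matches the 3-requests-in-flight limit derived above
```

With the fixed-size, aligned-request restriction, `try_issue` degenerates into a simple outstanding-request counter, which is exactly the simplification the text describes.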

PCIe IP Evaluation

How do you evaluate a PCIe IP?

Lots of silicon-proven IPs are available.  Which one should you pick?  The system performance target should be kept in mind from the initial architecture stages.  Various parameters need to be configured based on system requirements, and then thoroughly simulated to ensure the solution’s correctness.

Factors to consider:

1, Protocol stacks:

Identify the IP integration boundary; usually there are two kinds.  First, interfacing with the Physical Layer through the PIPE interface, meaning you have your own DL and TL logic to take care of.  Second, interfacing with the Transaction Layer through another standard such as AXI4 or AHB, or through your own interconnect.

2, BW consideration:

Roughly, Gen1 x1 available BW = 250 MB/s, Gen2 = 500 MB/s, Gen3 ≈ 1 GB/s, and Gen4 ≈ 2 GB/s.  You also need to allow about 20% tolerance for things like link-protocol overhead, buffer efficiency, etc.  Link-width requirements are typically derived from the application’s bandwidth needs.  Keep in mind that the physical layer is sometimes the most area-consuming part of the PCIe stack.  Unlike the logical protocol layers, which have common functionality for all link widths, physical layer size grows linearly with the number of lanes.  Since physical layer cores are independent of one another, some IP vendors provide port bifurcation capabilities, allowing the transceiver cores to be shared between several logical ports.  For instance, such an implementation may be statically configured as a single x8 port or as two x4 ports.
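A back-of-the-envelope sizing helper using the per-generation numbers and the ~20% overhead allowance above (the power-of-two lane stepping reflects the standard x1/x2/x4/x8/x16 link widths):

```python
# Rough usable-bandwidth estimation per PCIe generation and link width.

PER_LANE_MBPS = {1: 250, 2: 500, 3: 1000, 4: 2000}   # Gen -> MB/s per lane

def usable_bandwidth(gen, lanes, overhead=0.20):
    # apply the ~20% allowance for protocol overhead, buffer efficiency, etc.
    return PER_LANE_MBPS[gen] * lanes * (1 - overhead)

def lanes_needed(gen, target_mbps, overhead=0.20):
    per_lane = PER_LANE_MBPS[gen] * (1 - overhead)
    lanes = 1
    while per_lane * lanes < target_mbps:
        lanes *= 2                 # PCIe links come in x1/x2/x4/x8/x16
    return lanes

print(usable_bandwidth(3, 4))      # Gen3 x4 -> 3200.0 MB/s usable
print(lanes_needed(2, 1500))       # ~1.5 GB/s target on Gen2 -> x4
```

These are first-order numbers for architecture-stage sizing only; actual efficiency depends on payload size, request mix, and credit turnaround, as discussed below.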

3, Port type:

Does the SoC need EP, RC, or both?  For instance, non-transparent PCIe bridges act as an endpoint device on a primary PCIe hierarchy while controlling another PCIe tree of their own.  These applications may want to retain the flexibility of specifying the operating port type at device-configuration time.  Such applications need a special PCIe logic implementation that supports both the root-port and endpoint configuration models, with their different configuration register sets.  This dual-mode support results in increased transaction-logic and configuration-register block size; however, the additional cell count is usually outweighed by the gain in solution flexibility.

4, Virtual Channels: 

Virtual Channels (VCs) are a mechanism for differentiated bandwidth allocation.  Virtual channels have dedicated physical resources (buffering, flow-control management, etc.) across the hierarchy.  Transactions are associated with one of the supported VCs according to their Traffic Class (TC) attribute, through the TC-to-VC mapping specified in the configuration block of the PCIe device.  Therefore, transactions with a higher priority may be mapped to a separate virtual channel, eliminating resource conflicts with low-priority traffic.

Multi-VC support usually leads to a notable logic area increase due to the additional buffering and separate logic mechanisms required per VC. To support independent queues for different virtual channels, separate logical queues are usually required. One possible multi-VC buffering scheme is implementing separate physical request queues for each VC to allow efficient arbitration between the VCs, while keeping a single physical data buffer with data blocks referenced by the header queues entries.

The vast majority of PCIe applications do not support multiple VCs; however, there is increasing interest in multi-VC configurations.

5, Performance measures:

  • Link width – Select the desired link width based on the target bandwidth.  While wide links support training to a lower link width while leaving the upper lanes idle, interoperability with the other devices planned to be attached to the application must also be taken into account, to ensure appropriate link-width support and optimize lane utilization.
  • Replay buffer size – The replay buffer physically exists on the transmitter side.  Three factors determine its size: 1) ACK latency from the receiver on the other side; 2) delay caused by the transmitter’s Physical Layer; 3) the receiver’s L0s-to-L0 exit latency (in other words, the replay buffer should be big enough to hold TLPs without stalling while the link returns from L0s to L0).  When the replay buffer is full, the TLP flow is suspended until sufficient replay-buffer space becomes available.  Replay buffer size largely depends on the TLP acknowledgment round-trip time, which is the period from the moment of TLP transmission until ACK DLLP arrival and the completion of its processing by the TLP originator.  Some devices implement ACK DLLP coalescing (issuing a single DLLP to acknowledge several TLPs) by specifying an ACK factor parameter greater than one.  The following figure illustrates the acknowledgment round-trip time for an ACK factor of four.  Higher ACK factors improve link utilization, but lengthen the TLP acknowledgment round trip, increasing the replay buffer size required for maximal bandwidth.  ack_latency
  • Request buffer size – PCIe is a flow-control-based (credit-based) protocol.  Receivers advertise the supported number of receive buffers, and transmitters are not allowed to send TLPs without ensuring that sufficient receive-buffer space is available.  Receivers indicate additional buffer availability through the flow-control update mechanism to allow constant data flow.  Receive buffers must be large enough to cover data transmission, processing, and the flow-control update round trip, and to allow constant data-buffer availability from the transmitter’s perspective to support the desired request rate.  The following figure illustrates the flow-control update round-trip period from the remote transmitter’s standpoint.  Particular attention should be paid to the read requests’ receive-queue depth.  For optimal performance, the application must be able to return a read-request credit after forwarding the request to the application crossbar, without waiting for the read data to return.  Moreover, if the read-request queue is not deep enough, non-posted header credit updates may arrive at the remote transmitter at an insufficient rate, limiting its ability to forward read requests to the internal crossbar at a rate that results in optimal utilization of the read-data bandwidth on the transmit link.  FC update latency
  • Read data buffering – PCIe supports multiple, concurrent, outstanding read transactions uniquely identified across the hierarchy by RequestorID and transaction tags. Transaction initiators are required to allocate buffering resources for read data upon making the request and advertise infinite credits for completions. Read requests are withheld until sufficient data buffer resources have been reserved. Therefore, a typical system’s read data return latency must be considered to specify a sufficient number of outstanding reads. The number of reads in conjunction with the supported read request size should be able to compensate for data return latency to allow read data flow at the desired rate. Since PCIe allows a single read request to be completed by multiple completion TLPs, applications are encouraged to utilize large read requests regardless of the configured maximum payload size. Large reads allow achieving the bandwidth target with a smaller number of outstanding transactions, thus simplifying read context management and reducing read request traffic, which imposes an overhead for the data flow on the transmit link.
  • Maximum Payload Size (MPS) – PCIe supports several configurations for the maximum data payload allowed in a TLP by a specific device.  The default maximum payload size is 128 bytes and can only be modified by PCIe-aware configuration software.  Additional maximum-payload configurations range from 256 bytes to 4 KB.  Smaller payloads require smaller buffers and allow quick flow-control credit turnaround; on the other hand, they result in higher link overhead.  Large payloads require large replay and receive buffers and need to be supported across the entire system for optimal resource utilization.  Each application must choose the optimal maximum-payload parameter based on the above criteria.  Real-life implementations show that a maximum payload size of 512 bytes lies in the sweet spot, allowing high link utilization with reasonably small data buffers.  Larger payloads require significantly larger data buffers that do not justify the minor improvement in link utilization.
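Tying the read-data-buffering point to numbers: the outstanding-read count needed to hide completion latency is roughly bandwidth × latency ÷ read size. A sketch, where the latency and bandwidth figures are hypothetical example values rather than measurements:

```python
import math

# How many outstanding reads are needed to keep read data flowing at the
# desired rate, given the read-return round-trip latency?

def outstanding_reads(target_bw_mbps, round_trip_us, read_size_bytes):
    # data that must be "in flight" to cover the latency = bandwidth x latency
    in_flight_bytes = target_bw_mbps * 1e6 * round_trip_us * 1e-6
    return math.ceil(in_flight_bytes / read_size_bytes)

# e.g. 3200 MB/s target, 2 us read-return latency, 512 B reads
print(outstanding_reads(3200, 2, 512))    # 13 outstanding reads
# doubling the read size halves the required outstanding-read count
print(outstanding_reads(3200, 2, 1024))   # 7
```

This illustrates why the text encourages large read requests: fewer outstanding transactions are needed for the same bandwidth, simplifying read-context management.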

6, Power Management:

The L0s power state allows automatic transmitter shutdown during idle states and provides short recovery times through a fast training sequence. The L1 link power state is applied when the device is placed in a low device power state by power management software. Only a limited number of transactions that are required for supporting the device’s return to an operational state are allowed in this state. PCI Express also specifies Active State Power Management (ASPM) capabilities that allow L1-level power savings, but are controlled through application-specific mechanisms. Another power management technique is dynamic lane downshifting. Wide links that do not require full bandwidth may retrain with a lower number of active lanes, shutting down unused lanes and resulting in significant power savings. The link protocol also introduces an additional means of power saving by providing a software-controlled capability of retraining the link to a lower speed when full bandwidth is not required.

7, Verification: 

PCI Express IP vendors usually provide an IP design that has undergone extensive verification.  However, this is not enough to ensure a lack of defects in the chip-level functionality of the PCI Express partition.  The PCIe IP parameterization needs to be verified in a chip-level environment to prove the correctness and consistency of the selected configuration.  Data flows and protocols on the IP interfaces to the chip interconnect must also be covered in chip-level verification.  In addition, particular attention should be paid to configuration sequences, system boot and address-space allocation, possible transaction-ordering deadlocks, interrupt handling, error handling and reporting, power management, and other system-level scenarios.  This part of verification requires extensive knowledge of real-life software behavior that needs to be translated into simulation test cases.  A chip-level verification effort should minimize testing of internal IP features associated with the IP implementation, relying on the verification coverage of the IP provider.  For instance, the chip integrator may limit testing of the completion timeout mechanism to a single timeout value, to validate that the mechanism can be enabled in the chip and that associated error events are handled properly.  Testing all the possible timeout values, however, is not necessary, assuming that such testing falls under the responsibility of the IP provider.

  • VIP needed? PCIe verification IP solutions that complement design IP are available to support the PCIe verification effort. In addition to base PCIe device behavior modeling, verification IP implements protocol checkers that provide real-time indications of standard violations, when they occur. Some PCIe verification solutions also provide a test case suite that covers various PCI Express compliance checks and complements chip-level testing.
  • Configuration sequence – Chip verification should cover the flow that the application performs during boot and the initial configuration sequence.  These are critical scenarios that may impact chip architecture; therefore, early detection of configuration problems is highly important.
  • Chip reset sequences – SoC applications usually implement several reset mechanisms, including software controlled mechanisms. These mechanisms are critical to chip functionality and, if defective, may lead to chip malfunction. The reset level of the PCIe partition should be determined for each reset sequence and then simulated to prove the ability to recover from the reset state and return to operational mode. Root port and endpoint applications should take PCIe-specific aspects of the link reset into account. For instance, the downstream PCIe device has to be reconfigured after link reset, which usually requires system software intervention.
  • PCIe reset sequences – The PCI Express protocol specifies a hot reset mechanism, where downstream components reset through link notification. Root ports should validate that this mechanism can be applied and the hot reset indication is properly detected by a remote device. Endpoint applications should determine the level of chip reset desired in case of hot reset detection on the link to validate proper functionality.
  • Performance – Performance checking is one of the important chip-level tests.  This test is supposed to prove the initial assumptions made during the architecture phase, such as core latencies, interconnect bandwidth, etc.  Completing performance testing in early project stages allows advance detection of critical chip-architecture flaws.
  • Error events – PCIe specifies various events and error conditions that may occur in the system. These events are registered in the configuration space of the relevant devices and are reported to the root complex through PCIe messages. The root complex collects error reports from the hierarchy and forwards them to the system in an application-specific manner. Error reporting sequences, as well as the ability to resolve the error and clear all the relevant status bits should be addressed by the chip-level testing.
  • INTx interrupts – The PCIe standard specifies two interrupt modes: the legacy INTx level-interrupt mode and MSI mode.  In INTx mode, endpoints report an INTx interrupt to the host, based on the interrupt configuration.  The root complex merges the interrupt reports from all the devices into four interrupt lines and forwards an interrupt indication to the system.  INTx interrupt resolution includes scanning all the devices to determine the interrupt source, clearing the interrupt trigger, and then validating interrupt deassertion.  In some cases endpoints cannot assume that the MSI interrupt mode is implemented by the host, and must provide legacy interrupt support.
  • MSI interrupts – MSI is the preferred PCIe mechanism for interrupt signaling that uses memory write transactions to communicate all the information on the interrupt source directly to the host. Unlike INTx, the MSI message includes all the relevant interrupt data. In MSI mode, the host is able to access the device causing the interrupt directly to clear the interrupt trigger. Chip-level simulation should validate that the MSI interrupt mode can be properly configured and that MSI messages are properly routed in the system.
  • Power management – Power management scenarios involve interaction between system software, SoC hardware, and the PCIe IP.  Endpoints should validate the power-management activation and deactivation scenarios, paying particular attention to the ability to generate a power-management interrupt from a low-power state.  Root ports should validate that they are able to configure the power-management state of downstream devices and process power-management events during the low-power state.  PCIe link power management is tightly coupled with the above scenarios.  The system power-down sequence, based on the PME_Turn_Off and PME_TO_Ack messages, should also be validated.
  • Random testing – Chip integrators should consider random testing that includes randomization for the following parameters: different data flows of random rates, address ranges and mapping, response time, and random error injection. Random testing attempts to cover areas that might be missed by the directed testing. Random testing may also discover system deadlocks due to the random pattern of data flows injected during the test.

PCI Express compliance checking is one of the major chip-level verification objectives. Validating PCIe compliancy includes a PCIe compliance checklist review by the chip integrator and use of third-party PCIe models and checkers that provide compliance coverage. The chip integrator may also consider additional directed testing for PCIe compliance coverage to improve confidence in the eventual solution.

Random testing environments may consider implementing PCIe-specific coverage attributes to improve confidence in the PCIe solution.  The major risk from the integrator’s standpoint is wrong signal connections.  Interface toggle coverage may be useful to validate that all IP interfaces toggled during testing, on the assumption that a wrong connection would cause a functional problem.  Static configuration inputs may be covered only for the desired values, limiting the coverage effort to a specific chip setup.

PCIe IP integrators may also consider reuse of IP verification environment components in chip level verification to improve verification quality and reduce the required effort. This includes reuse of IP assertions, integration of IP-specific offline and online checkers in the chip environment, and reuse of the IP test suite in chip-level verification.

8, IP Deliverables:

What deliverables can you get?  Soft IP or hard IP, testbenches, test cases, synthesis scripts, STA scripts, documentation, reference designs, etc.

 

 

 

SoC Design Planning

The main effort should go into the integration aspects of the system and into system validation, rather than into developing individual IPs from scratch.

Soft IP: synthesizable RTL, process-independent.

Hard IP: a synthesized netlist with timing information; easy to integrate, with predictable performance.

IP deliverables include RTL, testbenches, synthesis scripts and documentation.

SoC validation checklist

                           B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9
Specification               x   x   x   x   x   x   x   x   x
RTL Design                  x   x   x   x   x   x   x   x   x
Verification – simulation   x   x   x   x   x   x   x   x   x
Verification – FPGA         x   x   x   x   x   x   x   x   x
Co-Verification – HW/SW     x   x   x   x   x   x   x   x   x
Driver Development          x   x   x   x   x   x   x   x   x
Application Testing         x   x   x   x   x   x   x   x   x
Configuration Sequence      x   x   x   x   x   x   x   x   x