1, What is RAM? What is ROM?
RAM stands for Random Access Memory. It is a kind of main memory used to store the information currently in use, which can be data being processed or program code. It is a read-write memory: data can be both stored (written) and accessed (read). However, RAM is volatile, that is, its contents are lost when the power supply is removed. It is called random access because any location can be read or written in roughly the same time, regardless of its physical position. It holds the instructions and data currently being used by the processor, and it improves system speed by moving data quickly between components.
ROM stands for Read Only Memory. It is also a kind of main memory, but it stores data permanently: it is non-volatile, so its contents are retained when the power supply is removed. As the name suggests, its data can be read any number of times but cannot be changed by a normal write.
RAM and ROM can both be pictured as a table: each cell holds one stored item, and the address lines supply the "identity number" that selects a cell. RAM can read and write the cells, while ROM can only read them.
2, Block RAM and distributed RAM
In an FPGA, distributed RAM (abbreviated DRAM in this article, not to be confused with dynamic RAM) is defined in contrast to Block RAM (BRAM). Physically, BRAM is a dedicated hardware resource in the FPGA, while DRAM is assembled from LUTs in the logic fabric, so it is really an extended use of the LUT.
2.1 BRAM
BRAM consists of a fixed number of fixed-size memory blocks. Using BRAM does not consume additional logic resources, and it is fast. However, BRAM is always consumed in whole multiples of the block size. For example, each BRAM in the Xilinx 7 series has a capacity of 36 Kbit and can be used as one 36 Kbit memory or split into two independent 18 Kbit memories. Conversely, two adjacent BRAMs can be combined into a larger memory (the built-in cascade forms a deeper 64K x 1 memory; two blocks used side by side give 72 Kbit of storage). Each block RAM has two sets of address buses, data buses, control signals and the other signals needed to access the memory, so it can be used as either a single port or a dual port memory. Note that BRAM access is always synchronous to a clock; asynchronous access is not supported.
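To make the "synchronous only" point concrete, here is a minimal sketch of a memory written in the clocked-read style that block RAM requires. The module name, port names and the 1024 x 18 size are assumptions chosen for illustration; whether a given tool actually maps this description to BRAM depends on the memory size and the tool settings.

```verilog
// Hypothetical example: one write port and one read port sharing a clock.
// The read is registered, so data appears one clock after raddr is presented.
module bram_style_sdp #(
    parameter ADDR_W = 10,   // 1024 entries (assumed size)
    parameter DATA_W = 18    // 1024 x 18 = 18 Kbit, half of a 36 Kbit block
)(
    input                   clk,
    input                   we,     // write enable
    input  [ADDR_W-1:0]     waddr,  // write address
    input  [DATA_W-1:0]     din,    // write data
    input  [ADDR_W-1:0]     raddr,  // read address
    output reg [DATA_W-1:0] dout    // registered read data
);
    reg [DATA_W-1:0] mem [0:(1<<ADDR_W)-1];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= din;
        dout <= mem[raddr];         // no combinational path from raddr to dout
    end
endmodule
```

The important detail is that `dout` is assigned inside the clocked block. A memory whose output is read with a continuous `assign` has an asynchronous read, which BRAM cannot provide.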
2.2 DRAM
Only the look-up tables in a SLICEM can be used as DRAM. Using look-up tables as memory provides on-chip storage and can improve resource utilization. Unlike BRAM, DRAM supports asynchronous reads. However, building a large memory from distributed RAM consumes many LUTs, leaving fewer look-up tables available for logic. Distributed RAM is therefore recommended only for small memories.
2.3 Suggestions for use
DRAM is built from individual LUTs scattered across the fabric, while BRAM is a block resource whose size and location are fixed. Even a very small memory mapped to BRAM consumes a whole block primitive after synthesis. BRAM is arranged in columns, so the routing between a user logic module and a BRAM can be long, and the added delay may degrade performance. If the design uses several BRAMs, plan the layout accordingly.
For larger memories, BRAM is recommended; DRAM suits scattered small memories. This is only a general principle: the right choice depends on how much of each resource is left in the whole design and on its performance requirements.
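When the general rule needs to be overridden, Vivado synthesis accepts a `ram_style` attribute on the memory array. The sketch below asks for distributed RAM on a small 32 x 8 memory; the module name, ports and size are assumptions for illustration, and the attribute is a request that the tool may still reject if the description cannot be mapped that way.

```verilog
// Hypothetical example: steer a small memory into LUT-based distributed RAM.
module small_lut_ram (
    input        clk,
    input        we,     // write enable
    input  [4:0] addr,   // shared read/write address
    input  [7:0] din,    // write data
    output [7:0] dout    // asynchronous read data
);
    // Replace "distributed" with "block" to request BRAM instead;
    // in that case the read would also have to be made synchronous.
    (* ram_style = "distributed" *) reg [7:0] mem [0:31];

    always @(posedge clk)
        if (we)
            mem[addr] <= din;

    assign dout = mem[addr];
endmodule
```

Leaving the attribute out lets the tool decide, which is usually a reasonable starting point; the attribute is mainly useful when the resource report shows that one kind of memory is running short.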
3, Explain distributed RAM in detail
The CLB (Configurable Logic Block) is the basic logic unit of the FPGA fabric and is composed of two slices. There are two types of slice, SLICEL and SLICEM:
SLICEL and SLICEM have roughly the same composition; only the LUT6 differs. Comparing the two LUT6 side by side:
In addition to the read address bus A1~A6, the LUT in SLICEM has WE (write enable), DI1~DI2 (data write ports) and WA1~WA8 (write address ports). The SLICEM LUT can therefore be written as well as read, which allows it to be used as distributed RAM or as a shift register.
The LUT in SLICEL has only address lines and an output, so it can only act as a ROM that implements the look-up table function.
Since there are four LUTs in each SLICEM, its resources can implement DRAM in the following configurations:
- Single port RAM
- Dual port RAM
- Simple dual port
- Four ports
(1) Single port RAM: synchronous write and asynchronous read. Read and write operations share one address bus.
(2) Dual port RAM: one port for synchronous write and asynchronous read; one port for asynchronous read only.
(3) Simple dual port: one port for synchronous write only (the write port has no data output); one port for asynchronous read.
(4) Four ports: one port for synchronous write and asynchronous read; three ports for asynchronous read only.
(5) Deeper implementation
In addition, deeper DRAM can be built from several LUT6s plus multiplexers.
A 128-deep DRAM can be built from two LUT6s and one 2:1 multiplexer (MUX2, an F7MUX in the slice). The two LUT6s hold the lower 64 and the upper 64 locations respectively, and the multiplexer selects between them using the top address bit, giving a 128 x 1 single port DRAM.
The same structure extends to a 256-deep DRAM.
A 256-deep single port RAM uses 4 LUT6 + 2 F7MUX + 1 F8MUX, which is exactly the full complement of these resources in one SLICEM, so the deepest DRAM a single SLICEM can implement is a 256 x 1 single port DRAM.
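The port configurations and the depth expansion described above can be sketched behaviourally. The two modules below are illustrations only: the module names are made up, the port names `a`, `dpra`, `spo` and `dpo` are simply labels chosen here, and nothing guarantees that a synthesis tool maps the second module onto exactly two LUT6s and one F7MUX.

```verilog
// (2) Dual port RAM, 64 x 1: port 1 writes synchronously and reads
// asynchronously; port 2 only reads asynchronously.
module dual_port_lutram (
    input        wclk,
    input        we,
    input  [5:0] a,      // port 1 address (write and read)
    input  [5:0] dpra,   // port 2 read address
    input        d,      // write data
    output       spo,    // port 1 read data
    output       dpo     // port 2 read data
);
    reg mem [0:63];

    always @(posedge wclk)
        if (we)
            mem[a] <= d;

    assign spo = mem[a];
    assign dpo = mem[dpra];
endmodule

// (5) 128 x 1 single port RAM modelled as two 64-deep banks and a 2:1 mux.
module ram128x1_two_banks (
    input        wclk,
    input        we,
    input  [6:0] addr,
    input        d,
    output       o
);
    reg  bank_lo [0:63];           // locations 0..63
    reg  bank_hi [0:63];           // locations 64..127
    wire sel = addr[6];            // top address bit picks the bank

    always @(posedge wclk) begin
        if (we && !sel) bank_lo[addr[5:0]] <= d;
        if (we &&  sel) bank_hi[addr[5:0]] <= d;
    end

    wire o_lo = bank_lo[addr[5:0]];
    wire o_hi = bank_hi[addr[5:0]];
    assign o  = sel ? o_hi : o_lo; // plays the role of the MUX2
endmodule
```

In practice a single `reg mem [0:127]` array is the more natural way to describe the second memory; the banked form is written out here only to make the role of the multiplexer visible.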
4, Implementation mode
DRAM can be implemented in several ways, each with advantages and disadvantages, so choose according to the requirements and the development environment.
4.1 Inference
Inference means that the synthesis tool (Xilinx Vivado in this article) automatically infers the DRAM structure from RTL code written in a suitable coding style.
Inference usually gives good results, so it is the recommended method unless the given use case is not supported or the result is not adequate in performance, area or power consumption. In those cases, try the other methods. When inferring RAM, Xilinx recommends using the HDL templates provided in the Vivado tool. An asynchronous reset adversely affects RAM inference and should be avoided.
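One common way to keep a reset without disturbing RAM inference is to leave the memory array itself unreset and apply a synchronous reset only to an output register. The following is a minimal sketch under that assumption; the module name, the active-high reset polarity and the 64 x 6 size are choices made for illustration, and the final mapping still depends on the tool.

```verilog
// Hypothetical example: the array has no reset; only the output register does.
module lutram_sync_reset_out (
    input            clk,
    input            rst,    // synchronous reset, active high (assumed polarity)
    input            we,     // write enable
    input      [5:0] addr,   // shared read/write address
    input      [5:0] d,      // write data
    output reg [5:0] q       // registered read data
);
    reg [5:0] mem [0:63];    // never assigned in a reset branch

    // Write port: a plain synchronous write with no reset term.
    always @(posedge clk)
        if (we)
            mem[addr] <= d;

    // Read register: the reset affects only this register, not the memory.
    always @(posedge clk)
        if (rst)
            q <= 6'd0;
        else
            q <= mem[addr];
endmodule
```

Writing a reset loop over `mem` would describe 384 individually resettable bits, which a LUT-based RAM cannot provide, so the tool would typically fall back to flip-flops.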
(advantages)
- Easy to port
- Easy to read and understand
- Self documenting
- Fast simulation
(disadvantages)
- Not every available RAM configuration can be reached
- The result may not be optimal
The following RTL implements a 64-deep, 6-bit-wide single port DRAM:
module RTL_DRAM(
    input        wclk,   //input clk
    input  [5:0] addr,   //input address
    input  [5:0] d,      //input data
    input        we,     //input write enable
    output [5:0] o       //output
);

reg [5:0] dram64x6 [63:0];   //64*6

always @(posedge wclk)
    if (we)
        dram64x6[addr] <= d;

assign o = dram64x6[addr];

endmodule
The write operation is synchronized with the clock, while the read operation is asynchronous.
The synthesis results on FPGA are as follows:
It consists of six 64-deep, 1-bit-wide RAMs placed side by side, one per data bit, and consumes 6 LUT6, which matches the theoretical figure.
4.2 Primitive
Primitives are the low-level design elements provided by Xilinx, comparable to the low-level library functions used in embedded development. For DRAM, Xilinx provides several primitives, such as RAM64X1S, RAM16X4S and RAM128X1D; for details, refer to UG799, Xilinx 7 Series FPGA and Zynq-7000 All Programmable SoC Libraries Guide for Schematic Designs. Note that each DRAM primitive has a fixed bit width, depth and port configuration. A memory whose depth and width do not match any primitive must be assembled from smaller primitives. For example, no primitive directly implements a 64-deep, 6-bit-wide single port DRAM, so it has to be built from six 64-deep, 1-bit-wide single port primitives (RAM64X1S).
(advantages)
- Full control over the implementation
- Access to all features of the underlying block
(disadvantages)
- Poor code portability
- Usage is tedious and the code is hard to understand
The following is the primitive implementation of a 64-deep, 6-bit-wide single port DRAM:
module PRIMATE_DRAM(
    input        wclk,   //input clk
    input  [5:0] addr,   //input address
    input  [5:0] d,      //input data
    input        we,     //input write enable
    output [5:0] o       //output
);

RAM64X1S #(
    .INIT(64'h0000000000000000)   // Initial contents of RAM
) RAM64X1S_inst0 (
    .A0   (addr[0]),   // Address[0] input bit
    .A1   (addr[1]),   // Address[1] input bit
    .A2   (addr[2]),   // Address[2] input bit
    .A3   (addr[3]),   // Address[3] input bit
    .A4   (addr[4]),   // Address[4] input bit
    .A5   (addr[5]),   // Address[5] input bit
    .D    (d[0]),      // 1-bit data input
    .O    (o[0]),      // 1-bit data output
    .WCLK (wclk),      // Write clock input
    .WE   (we)         // Write enable input
);

RAM64X1S #(
    .INIT(64'h0000000000000000)   // Initial contents of RAM
) RAM64X1S_inst1 (
    .A0   (addr[0]),   // Address[0] input bit
    .A1   (addr[1]),   // Address[1] input bit
    .A2   (addr[2]),   // Address[2] input bit
    .A3   (addr[3]),   // Address[3] input bit
    .A4   (addr[4]),   // Address[4] input bit
    .A5   (addr[5]),   // Address[5] input bit
    .D    (d[1]),      // 1-bit data input
    .O    (o[1]),      // 1-bit data output
    .WCLK (wclk),      // Write clock input
    .WE   (we)         // Write enable input
);

RAM64X1S #(
    .INIT(64'h0000000000000000)   // Initial contents of RAM
) RAM64X1S_inst2 (
    .A0   (addr[0]),   // Address[0] input bit
    .A1   (addr[1]),   // Address[1] input bit
    .A2   (addr[2]),   // Address[2] input bit
    .A3   (addr[3]),   // Address[3] input bit
    .A4   (addr[4]),   // Address[4] input bit
    .A5   (addr[5]),   // Address[5] input bit
    .D    (d[2]),      // 1-bit data input
    .O    (o[2]),      // 1-bit data output
    .WCLK (wclk),      // Write clock input
    .WE   (we)         // Write enable input
);

RAM64X1S #(
    .INIT(64'h0000000000000000)   // Initial contents of RAM
) RAM64X1S_inst3 (
    .A0   (addr[0]),   // Address[0] input bit
    .A1   (addr[1]),   // Address[1] input bit
    .A2   (addr[2]),   // Address[2] input bit
    .A3   (addr[3]),   // Address[3] input bit
    .A4   (addr[4]),   // Address[4] input bit
    .A5   (addr[5]),   // Address[5] input bit
    .D    (d[3]),      // 1-bit data input
    .O    (o[3]),      // 1-bit data output
    .WCLK (wclk),      // Write clock input
    .WE   (we)         // Write enable input
);

RAM64X1S #(
    .INIT(64'h0000000000000000)   // Initial contents of RAM
) RAM64X1S_inst4 (
    .A0   (addr[0]),   // Address[0] input bit
    .A1   (addr[1]),   // Address[1] input bit
    .A2   (addr[2]),   // Address[2] input bit
    .A3   (addr[3]),   // Address[3] input bit
    .A4   (addr[4]),   // Address[4] input bit
    .A5   (addr[5]),   // Address[5] input bit
    .D    (d[4]),      // 1-bit data input
    .O    (o[4]),      // 1-bit data output
    .WCLK (wclk),      // Write clock input
    .WE   (we)         // Write enable input
);

RAM64X1S #(
    .INIT(64'h0000000000000000)   // Initial contents of RAM
) RAM64X1S_inst5 (
    .A0   (addr[0]),   // Address[0] input bit
    .A1   (addr[1]),   // Address[1] input bit
    .A2   (addr[2]),   // Address[2] input bit
    .A3   (addr[3]),   // Address[3] input bit
    .A4   (addr[4]),   // Address[4] input bit
    .A5   (addr[5]),   // Address[5] input bit
    .D    (d[5]),      // 1-bit data input
    .O    (o[5]),      // 1-bit data output
    .WCLK (wclk),      // Write clock input
    .WE   (we)         // Write enable input
);

endmodule
The synthesis result is essentially the same as the inferred one.
Resource usage also matches the inferred result: six 64-deep, 1-bit-wide RAMs side by side, consuming 6 LUT6, which matches the theoretical figure.
As we can see, developing DRAM with primitives means instantiating the basic primitive many times. The generate syntax can reduce the repetition, but the method is still cumbersome.
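For comparison, here is how the generate syntax mentioned above could collapse the six instantiations into one loop. This is a sketch rather than the author's code: the module name `PRIMITIVE_DRAM_GEN` and the labels `g_bit` and `ram_bit` are made up, while the RAM64X1S ports and the INIT parameter are the ones used in the listing above. Simulating it requires the Xilinx primitive library, just like the original module.

```verilog
// Hypothetical generate-based variant: one RAM64X1S per data bit.
module PRIMITIVE_DRAM_GEN (
    input        wclk,   // write clock
    input  [5:0] addr,   // shared read/write address
    input  [5:0] d,      // write data
    input        we,     // write enable
    output [5:0] o       // asynchronous read data
);
    genvar i;
    generate
        for (i = 0; i < 6; i = i + 1) begin : g_bit
            RAM64X1S #(
                .INIT(64'h0000000000000000)   // initial contents of this bit plane
            ) ram_bit (
                .A0   (addr[0]),
                .A1   (addr[1]),
                .A2   (addr[2]),
                .A3   (addr[3]),
                .A4   (addr[4]),
                .A5   (addr[5]),
                .D    (d[i]),     // bit i of the write data
                .O    (o[i]),     // bit i of the read data
                .WCLK (wclk),
                .WE   (we)
            );
        end
    endgenerate
endmodule
```

The loop removes the copy-and-paste, but the width is still tied to one specific primitive, so the portability drawback listed above remains.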
4.3 IP
Xilinx also provides a DRAM IP core for developers. IP-based development is driven by a GUI and is simple to use. However, because the core is highly customized to the vendor tool, its portability is limited.
(advantages)
- Generally gives better-optimized results when multiple components are used
- Easy to specify and configure
(disadvantages)
- Poor code portability
- The IP cores have to be managed
The DRAM IP core is called Distributed Memory Generator and is relatively simple to use. Next, we use it to configure a 64-deep, 6-bit-wide single port DRAM. The configuration process is as follows:
(Page 1)
(page 2)
(page 3)
Next, following the instantiation template in the .veo file, write RTL that instantiates the IP core:
module IP_DRAM(
    input        wclk,   //input clk
    input  [5:0] addr,   //input address
    input  [5:0] d,      //input data
    input        we,     //input write enable
    output [5:0] o       //output
);

//Instantiated DRAM IP core
dram_64x6 dram_64x6_inst (
    .a   (addr),   // input wire [5 : 0] a
    .d   (d),      // input wire [5 : 0] d
    .clk (wclk),   // input wire clk
    .we  (we),     // input wire we
    .spo (o)       // output wire [5 : 0] spo
);

endmodule
The synthesis result shows the same resource consumption as the two methods above.
4.4 Simulation
We instantiate the three DRAM modules in one top-level file and write a testbench that first writes the data 0-63 to addresses 0-63 and then reads addresses 0-63 back, to check that the data read matches the data written.
Top level file:
module test(
    input        wclk,   //input clk
    input  [5:0] addr,   //input address
    input  [5:0] d,      //input data
    input        we,     //input write enable
    output [5:0] o_rtl,
    output [5:0] o_primate,
    output [5:0] o_ip
);

//Instantiated RTL type
RTL_DRAM RTL_DRAM_inst(
    .wclk (wclk     ),   //input clk
    .addr (addr     ),   //input address
    .d    (d        ),   //input data
    .we   (we       ),   //input write enable
    .o    (o_rtl    )    //output
);

//Instantiation primitive
PRIMATE_DRAM PRIMATE_DRAM_inst(
    .wclk (wclk     ),   //input clk
    .addr (addr     ),   //input address
    .d    (d        ),   //input data
    .we   (we       ),   //input write enable
    .o    (o_primate)    //output
);

//Instantiated IP type
IP_DRAM IP_DRAM_inst(
    .wclk (wclk     ),   //input clk
    .addr (addr     ),   //input address
    .d    (d        ),   //input data
    .we   (we       ),   //input write enable
    .o    (o_ip     )    //output
);

endmodule
Testbench:
`timescale 1ns / 1ns

module tb_test();

reg        wclk;   //input clk
reg  [5:0] addr;   //input address
reg  [5:0] d;      //input data
reg        we;     //input write enable
wire [5:0] o_rtl;
wire [5:0] o_primate;
wire [5:0] o_ip;

//Instantiation test module
test test_inst(
    .wclk      (wclk     ),   //input clk
    .addr      (addr     ),   //input address
    .d         (d        ),   //input data
    .we        (we       ),   //input write enable
    .o_rtl     (o_rtl    ),   //output
    .o_primate (o_primate),   //output
    .o_ip      (o_ip     )    //output
);

initial begin
    wclk = 0;
    we   = 1;
    d    = 0;
    addr = 0;
    wait (d == 6'd63);
    // Release we with a nonblocking assignment on the next rising edge, so the
    // last write (data 63 to address 63) is not subject to a simulation race.
    @(posedge wclk);
    we <= 0;
end

always #5 wclk = ~wclk;

always @(posedge wclk) begin
    d    <= d + 1;
    addr <= addr + 1;
end

endmodule
The simulation results are as follows and match expectations.
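Checking waveforms by eye works for 64 entries, but the same write-then-read test can also check itself. The sketch below is one possible self-checking version, not the testbench used for the results above: it bundles its own copy of a 64 x 6 memory (`dut_ram64x6`, a hypothetical name) so that it runs in any Verilog simulator without the Xilinx libraries, and all signal names are assumptions.

```verilog
`timescale 1ns / 1ns

// Stand-in device under test: synchronous write, asynchronous read, 64 x 6.
module dut_ram64x6 (
    input        wclk,
    input        we,
    input  [5:0] addr,
    input  [5:0] d,
    output [5:0] o
);
    reg [5:0] mem [0:63];
    always @(posedge wclk)
        if (we)
            mem[addr] <= d;
    assign o = mem[addr];
endmodule

module tb_selfcheck;
    reg        wclk = 1'b0;
    reg        we   = 1'b0;
    reg  [5:0] addr = 6'd0;
    reg  [5:0] d    = 6'd0;
    wire [5:0] o;
    integer    i;
    integer    errors;

    dut_ram64x6 dut (.wclk(wclk), .we(we), .addr(addr), .d(d), .o(o));

    always #5 wclk = ~wclk;

    initial begin
        errors = 0;

        // Write phase: change inputs on the falling edge so they are
        // stable when the rising edge performs the write.
        for (i = 0; i < 64; i = i + 1) begin
            @(negedge wclk);
            we   = 1'b1;
            addr = i[5:0];
            d    = i[5:0];
        end
        @(negedge wclk);      // the rising edge before this wrote address 63
        we = 1'b0;

        // Read phase: the read is asynchronous, so a short delay is enough.
        for (i = 0; i < 64; i = i + 1) begin
            addr = i[5:0];
            #1;
            if (o !== i[5:0]) begin
                errors = errors + 1;
                $display("Mismatch at address %0d: read %0d", i, o);
            end
        end

        if (errors == 0)
            $display("PASS");
        else
            $display("FAIL: %0d mismatches", errors);
        $finish;
    end
endmodule
```

To test the three implementations from this article instead, the stand-in module can be replaced by the `test` top level and the comparison repeated for each of its three outputs.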
5, Application
Distributed RAM fills the gap between storage elements (registers), which suit very small arrays, and BRAM, which suits larger arrays. It is recommended to infer memory from RTL wherever possible for maximum flexibility. Distributed RAM can also be created by primitive instantiation or with the IP core.
Generally speaking, distributed RAM should be used for all memories with a depth of 64 or less, unless the device is short of SLICEMs or logic resources, because distributed RAM is more efficient in terms of resources, performance and functions.
For depths greater than 64 but less than or equal to 128, the best resource depends on the following factors:
- Availability of spare block RAM. If none is available, use distributed RAM.
- Latency requirements. If an asynchronous read is required, distributed RAM must be used.
- Data width. If the width is greater than 16 bits, block RAM should be used if available.
- Performance requirements. Registered distributed RAM usually has a shorter clock-to-out time (Tco) and fewer placement restrictions than BRAM.