**VHDL 101**

**P16 in VHDL**

On Saturday, August 19, 2000, Dr. C. H. Ting gave a presentation on VHDL basics to the Silicon Valley Chapter of the Forth Interest Group. On a previous occasion Dr. Ting had presented his P8 and P16 chip designs for FPGA, including a schematic representation of the P16, and in past years he had explained the VLSI MuP21 design. The architecture of P16 is similar to MuP21 in that it is a MISC (Minimal Instruction Set Computer) design using the same registers and instruction set as P21, with two instructions from F21. On MuP21 four five-bit CPU instructions are packed into a twenty-bit data word; on P16 three of these instructions are packed into a 16-bit word. The P16 also provides addressing from the R register, like F21, and expands the A (addressing) register into an A register stack.

The architecture of p16 is a stack machine. There are many documents about the MuP21 and F21 VLSI stack machines at this web site. These machines have stacks composed of on-chip registers. There is a stack for nesting return addresses from subroutine calls labeled the return stack. There is a second stack for passing parameters called the data or parameter stack. The T register is the top of the data stack and below it is an array of registers called the n_stack in VHDL. The R register is the top of the return stack and below it is an array of registers called the r_stack.
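
As a rough illustration of the T register sitting on top of a register array, here is a minimal Python sketch. The depth of 8 comes from the `array(7 downto 0)` stack type in the VHDL listing; the class name `DataStack`, its methods, and the single-pointer handling are hypothetical simplifications and do not mirror the np/np1 pointer pair in the listing.

```python
MASK = (1 << 17) - 1  # assumed 17-bit cells (width downto 0, width = 16)

class DataStack:
    """T register on top of an 8-deep circular register array (n_stack)."""

    def __init__(self):
        self.t = 0
        self.n_stack = [0] * 8
        self.np = 0  # index of N, the cell just below T

    @property
    def n(self):
        """N is the register directly below T."""
        return self.n_stack[self.np]

    def push(self, value):
        # The old T moves down into the register array; value becomes T.
        self.np = (self.np + 1) % 8
        self.n_stack[self.np] = self.t
        self.t = value & MASK

    def pop(self):
        # T is returned and N moves up into T.
        value = self.t
        self.t = self.n_stack[self.np]
        self.np = (self.np - 1) % 8
        return value

stack = DataStack()
for value in (1, 2, 3):
    stack.push(value)
print(stack.t, stack.n)  # 3 2
```

Because the array is circular, pushing more than eight items silently overwrites the oldest entry in this sketch; no memory is touched at any point.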

The MuP21 can only address memory from the A register, while F21 and P16 also have two instructions to allow addressing of memory via the R register. On P16 the A register has been expanded to become a register stack.

The P16, as a stack machine, has a zero-operand architecture. This means that the T register (Top of the Data Stack) is the implied source and destination of most instructions. There are no register select bits in the opcodes, so they are very small. Since three opcodes may be stored in a single word of memory, when executing sequential stack-based instructions the CPU will run three times as fast as the memory clock. Since the stack accesses are in fact register accesses, MEMS analysis will show the ideal goal of zero MEMS on this type of code. Well-written code will keep data on the stack and minimize the number of memory references.
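
As a rough illustration of the packing, the sketch below unpacks three five-bit opcodes from one word in Python. The bit positions follow the slot mux in the VHDL listing later on this page (bits 15..11, 10..6 and 5..1 of the 17-bit I register), and the opcode values are copied from the listing's constants. The helper names `pack_slots` and `unpack_slots` are hypothetical.

```python
WIDTH = 16  # the 'width' generic in the listing; cells hold WIDTH + 1 bits

# Opcode values copied from the constants in the VHDL listing.
DUP, ADDD, NOP = 0b11010, 0b10111, 0b11110

def pack_slots(first, second, third):
    """Pack three 5-bit opcodes into one word (hypothetical helper)."""
    return ((first << (WIDTH - 5))
            | (second << (WIDTH - 10))
            | (third << (WIDTH - 15)))

def unpack_slots(word):
    """Return the opcodes the slot mux would select for slot = 1, 2, 3."""
    return [(word >> shift) & 0b11111
            for shift in (WIDTH - 5, WIDTH - 10, WIDTH - 15)]

word = pack_slots(DUP, ADDD, NOP)  # one memory fetch carries three opcodes
print([format(op, "05b") for op in unpack_slots(word)])
# ['11010', '10111', '11110']
```

None of the three opcodes carries an operand or register field, which is why they fit in five bits each.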

Dr. Ting began by explaining that VHDL was very involved but that he was only interested in the minimal effort needed here to provide one possible high-level description of the chip. He distributed a handout with some circuit diagrams and the full VHDL description of the chip. He then presented some introductory material about circuit design. He explained that the basic logic circuits could be built up from NAND gates, that with feedback a NAND gate becomes an oscillator, and that a couple of NAND gates with feedback make a flip/flop. The flip/flop is then like a bit of memory.
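
The NAND-gate points can be tried out in a few lines of Python. This is only a behavioral sketch under simple assumptions: gates are evaluated in discrete steps with no real propagation delay, the latch inputs are active-low, and the function names `oscillate` and `sr_latch` are hypothetical rather than anything from the handout.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1

def oscillate(steps, out=0):
    """A NAND gate with its output fed back to one input toggles each step."""
    trace = []
    for _ in range(steps):
        out = nand(1, out)  # other input held high
        trace.append(out)
    return trace

def sr_latch(set_n, reset_n, q=0, q_n=1):
    """Two cross-coupled NAND gates; iterate until the outputs settle."""
    for _ in range(4):
        new_q = nand(set_n, q_n)
        new_q_n = nand(reset_n, new_q)
        if (new_q, new_q_n) == (q, q_n):
            break
        q, q_n = new_q, new_q_n
    return q, q_n

print(oscillate(4))            # [1, 0, 1, 0]
q, q_n = sr_latch(0, 1)        # pulse set low: stores a 1
print(sr_latch(1, 1, q, q_n))  # (1, 0), the bit is remembered
```

With both inputs high the latch simply holds whatever state it was given, which is the "bit of memory" behavior.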

Dr. Ting explained the workflow of the tools. One first enters the VHDL code then compiles a working model that can be simulated and tested for design errors. The next step is to specify a target such as a specific FPGA device. One can then compile an object for a specific device and run a hardware simulation that will provide timing information for a real device. One can then run test programs in the simulation environment. Finally one can download the object image into hardware and run the new chip.

Now that he has a P16 working in VHDL, Dr. Ting is working on a P24 version for production in Taiwan. The notes reflect that the project is in the middle of that port: the file is named EM24.VHD but still reads cpu16, and that is what is currently working.

He reminded us that, in the notation of the diagrams, a logical unit could designate a single control bit or a number of bits in parallel on a bus. Dr. Ting said he had borrowed some material from WISC CPU 16 at Mountain View Press. In the diagrams that follow there are only MUXes and flip/flops, which may be one to "Width" bits wide. The data paths associated with each of the machine registers are first diagrammed and then specified in VHDL.

Having worked with Offete's P21 and with UltraTechnology's F21 microprocessor, having followed Ting's work on the schematic representation of his P16, and having been in one of John Rible's VLSI design classes, I was so familiar with the CPU design that the VHDL code was easy for me to read. Having written numerous simulators and emulators for the chips, I also noted the similarity between a high-level description of a simulator and a high-level description of the chip itself. I expect that to someone familiar with VHDL the design of the chip, being so simple, would be pretty obvious.

The first part of Dr. Ting's handout has some notes and the diagrams of the data paths on the chip from the schematic layout. These same diagrams are then specified in VHDL code.

The VHDL listing is remarkably simple. It is organized in five sections. The file begins with use statements that include library files that provide things like the ability to overload some operators so that one can increment signal levels, etc. Then there is an entity cpu16 declaration. The third section defines the contents of the architecture, archcpu16, for the cpu16 entity. The architecture section begins with signal declarations followed by constants designating each of the CPU instructions. After a BEGIN statement are the definitions of the functional blocks in the diagrams. This section ends with a description of each of the instructions. The final section specifies the synchronous functions that take place inside of every instruction.

**1. CPU16 Architecture**

This is what is working. All registers, stacks and data paths are expanded to 17 bits. As of 6/5/00, this architecture is ok. These data path diagrams should be read with the new EM24.VHD file. The name of the implementation is still cpu16 in the VHDL file. The instruction decoding logic simply applies the proper control signals to the following register clocking and mux select signals: alu_sel, reg_sel, tsel, tright, tleft, npop, npush, a_sel, aload, apop, apush, r_sel, r_load, rpop, rpush, p_sel, pload, m_sel, iload, reset. The synchronous program execution unit clocks the slot signal, which selects the proper 5-bit instruction in the I register to select the above signals. At the rising clock edge, the selected data are latched into the proper registers and stacks. All data signals must stabilize before the next rising clock edge strikes. The longest delay will be through the adders. The simulator says that the longest delay will be about 24 ns, so the architecture should be good to about 40 MIPS. The architecture is very simple and components are very similar to one another. It should be very easy to do a good layout, and the routing should not be difficult.
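
The step from a 24 ns critical path to "about 40 MIPS" is simple arithmetic, sketched here in Python. It assumes one instruction completes per clock cycle and that the clock period equals the longest combinational delay; the function name `max_clock_mhz` is hypothetical. The result, roughly 41.7 MHz, rounds down to the quoted 40 MIPS.

```python
def max_clock_mhz(longest_delay_ns):
    """Highest clock rate whose period still covers the longest delay."""
    return 1000.0 / longest_delay_ns  # 1000 ns per microsecond

delay_ns = 24.0  # longest path reported by the simulator (through the adders)
print(f"{max_clock_mhz(delay_ns):.1f} MHz")  # 41.7 MHz
```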

**The T and Data Stack Data Path**

The ALU mux, selected by alu_sel, chooses among not t, t xor n, t and n, and t+n to produce alu_out. The register mux, selected by reg_sel, chooses among n, a, r, and data to produce reg_out. A third mux, selected by tsel, passes either alu_out or reg_out to the T register as t_in. The T register is controlled by tright, tleft, clk, and clr and drives t. The t signal feeds n_stack, which is controlled by npop, npush, clk, and clr and whose output is n_out.

**The A Register and A Stack Data Path**

A mux selected by a_sel chooses among a_out, t, a+1, and a to produce a_in. The A register, controlled by aload, clk, and clr, latches a_in and drives a. The a signal feeds a_stack, which is controlled by apop, apush, clk, and clr and whose output is a_out.

**The Return Stack Data Path**

A mux selected by r_sel chooses among r_out, t, r+1, and p to produce r_in. The R register, controlled by rload, clk, and clr, latches r_in and drives r. The r signal feeds r_stack, which is controlled by rpop, rpush, clk, and clr and whose output is r_out.

**The Program Counter Data Path**

A mux selected by p_sel chooses among p, p&i(9,0), p+1, and r to produce p_in. The P register, controlled by pload, clk, and clr, latches p_in and drives p. A second mux, selected by m_sel, chooses between p and a to drive address.

**The Instruction Latch and Decoder Data Path**

The I register, controlled by iload, clk, and clr, latches data and drives i(19.0). A mux steered by slot chooses between fetch and the instruction fields of i to produce code(4.0). The sync block, with inputs reset, clk, and clr, generates slot.

Note that the T register has tleft and tright control lines. These allow the register to be shifted one bit to the left or one bit to the right and if both control signals are set the register will be loaded from memory or another register.
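
The tleft/tright behavior can be modeled in a few lines of Python. This sketch follows the T register update in the sync process of the VHDL listing, where a right shift keeps the carry bit (bit 16) and repeats bit 15, and a left shift fills bit 0 with zero. The function name `next_t` and the use of plain integers for the 17-bit register are assumptions made for illustration.

```python
WIDTH = 16
MASK = (1 << (WIDTH + 1)) - 1  # 17 bits: t(width downto 0)

def next_t(t, t_in, tright, tleft):
    """Value of T after the next rising clock edge."""
    if tright and tleft:
        return t_in & MASK                     # load from the t_in mux
    if tright:
        carry = t & (1 << WIDTH)               # bit 16 stays in place
        sign = t & (1 << (WIDTH - 1))          # bit 15 is repeated
        body = (t & ((1 << WIDTH) - 1)) >> 1   # bits 15..1 move to 14..0
        return carry | sign | body
    if tleft:
        return (t << 1) & MASK                 # bit 0 becomes 0
    return t                                   # neither set: T holds

print(hex(next_t(0x8000, 0, 1, 0)))  # 0xc000
print(hex(next_t(0x8000, 0, 0, 1)))  # 0x10000
```

Using the two shift controls together as a load strobe saves a separate tload signal.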

I have expanded the listing with some comments to capture some of the explanations that Dr. Ting gave while presenting a walk through of the code. VHDL comments are lines that begin with two dashes.

    -- CPU24.VHD 8/18/2000 Dr. C. H. Ting
    -- simple portable definition of P16 Minimal Instruction Set Computer Core
    -- in VHDL

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_arith.all;
    use ieee.std_logic_misc.all;
    use ieee.std_logic_unsigned.all;

    entity cpu16 is
      -- set bus width constant to 16
      generic(width: integer := 16);
      port(
        -- standard logic signals can be true, false, tri-stated, or unknown.
        clk, clr: in std_logic;
        write, read: out std_logic;
        -- address and data busses
        addr: out std_logic_vector(width downto 0);
        data: in std_logic_vector(width downto 0)
      );
    end cpu16;

    architecture archcpu16 of cpu16 is
      -- slot counts through the instructions packed in a word
      signal slot: integer range 0 to 3;
      -- define stack mechanism
      type stack is array(7 downto 0) of std_logic_vector(width downto 0);
      -- control signal definitions
      signal n_stack, r_stack: stack;
      -- two stack pointers are used with each stack to do the pre and post incrementing
      -- used in stack push and pops, hence np and np1, np1=np+1
      signal np, np1, rp, rp1: integer range 0 to 7;
      -- select signals
      signal t, n, r, a, i, p: std_logic_vector(width downto 0);
      signal t_in, n_in, r_in, a_in, i_in, p_in: std_logic_vector(width downto 0);
      signal reg_out, alu_out: std_logic_vector(width downto 0);
      signal code: std_logic_vector(4 downto 0);
      signal reg_sel, alu_sel: std_logic_vector(1 downto 0);
      signal npush, npop, rpush, rpop, rsel, tsel, tright, tleft,
             ainc, aload, pinc, pload, msel, psel, z, iload, reset: std_logic;

      -- define opcodes
      constant jmp : std_logic_vector(4 downto 0) :="00000";
      constant ret : std_logic_vector(4 downto 0) :="00001";
      constant jz  : std_logic_vector(4 downto 0) :="00010";
      constant jnc : std_logic_vector(4 downto 0) :="00011";
      constant call: std_logic_vector(4 downto 0) :="00100";
      constant ftch: std_logic_vector(4 downto 0) :="01000";
      constant ldp : std_logic_vector(4 downto 0) :="01001";
      constant lit : std_logic_vector(4 downto 0) :="01010";
      constant lp  : std_logic_vector(4 downto 0) :="01011";
      constant stp : std_logic_vector(4 downto 0) :="01101";
      constant st  : std_logic_vector(4 downto 0) :="01111";
      constant com : std_logic_vector(4 downto 0) :="10000";
      constant shl : std_logic_vector(4 downto 0) :="10001";
      constant shr : std_logic_vector(4 downto 0) :="10010";
      constant addc: std_logic_vector(4 downto 0) :="10011";
      constant xorr: std_logic_vector(4 downto 0) :="10100";
      constant andd: std_logic_vector(4 downto 0) :="10101";
      constant addd: std_logic_vector(4 downto 0) :="10111";
      constant pop : std_logic_vector(4 downto 0) :="11000";
      constant lda : std_logic_vector(4 downto 0) :="11001";
      constant dup : std_logic_vector(4 downto 0) :="11010";
      constant over: std_logic_vector(4 downto 0) :="11011";
      constant push: std_logic_vector(4 downto 0) :="11100";
      constant sta : std_logic_vector(4 downto 0) :="11101";
      constant nop : std_logic_vector(4 downto 0) :="11110";
      constant drop: std_logic_vector(4 downto 0) :="11111";

      -- instruction opcode meanings                               Forth equivalent
      -- jmp   unconditional jump with 10 bit on-page argument     (else)
      -- ret   subroutine return, pop R to P                       ;
      -- jz    jump if T=0                                         (if)
      -- jnc   jump if no carry                                    (-if)
      -- call  subroutine call with 10 bit on-page argument        :
      -- ftch  fetch contents of memory using R as pointer         R @ R> 1+ >R
      -- ldp   fetch using A and increment A                       A @ DUP 1+ A ! @
      -- lit   load immediate following cell (literal)             LIT
      -- ld    fetch using A                                       A @ @
      -- stp   store using A and increment A                       A @ DUP 1+ A ! !
      -- st    store using A                                       A @ !
      -- ???   store using R and increment R                       R ! R> 1+ >R
      -- com   invert T, one's complement, including carry         -1 XOR
      -- shl   shift T left                                        2*
      -- shr   shift T right (carry unchanged)                     2/
      -- addc  conditional non-destructive add of T and N
      --       if the least sig bit of T is true                   DUP 1 AND IF OVER + THEN
      -- xorr  exclusive-or T and N                                XOR
      -- andd  logical AND T and N                                 AND
      -- addd  add T to N                                          +
      -- pop   move, pop from T and push to R                      >R
      -- lda   move, pop from A and push to T                      A @
      -- dup   duplicate T to N                                    DUP
      -- over  duplicate N as new T                                OVER
      -- push  move, pop from R and push to T                      R>
      -- sta   move, pop from T and push to A                      A !
      -- nop   no operation, delay 1 cycle                         NOP
      -- drop  discard T                                           DROP
      --
      -- warning there may be errors in this opcode documentation
      -- I could not find the R!+ instruction (?)

    begin

      -- define the first mux in the diagram
      with alu_sel select
        alu_out <= (t xor n) when "01",
                   (t and n) when "10",
                   (t + n)   when "11",
                   (not t)   when others;

      -- define the second mux
      with reg_sel select
        reg_out <= a    when "01",
                   r    when "10",
                   data when "11",
                   n    when others;

      -- instruction latch mux
      with slot select
        code <= i(width-1 downto width-5)   when 1,
                i(width-6 downto width-10)  when 2,
                i(width-11 downto width-15) when 3,
                ftch                        when others;

      n <= n_stack(np);
      r <= r_stack(rp);
      r_in <= t when rsel='0' else p;
      -- combine lower ten bits in argument with upper bits from the program counter
      p_in <= (p(width downto width-5) & i(width-6 downto 0)) when psel='0' else r;
      addr <= p when msel='0' else a;
      t_in <= alu_out when tsel='0' else reg_out;

      -- zero flag
      z <= not(t(15) or t(14) or t(13) or t(12) or t(11) or t(10) or t(9) or t(8)
            or t(7) or t(6) or t(5) or t(4) or t(3) or t(2) or t(1) or t(0));

      decode: process(code,z,t)
      begin
        alu_sel<="00"; reg_sel<="00";
        npush<='0'; npop<='0'; rpush<='0'; rpop<='0';
        rsel<='0'; tsel<='0'; tright<='0'; tleft<='0';
        ainc<='0'; aload<='0'; pinc<='0'; pload<='0';
        msel<='0'; psel<='0'; write<='0'; read<='0';
        iload<='0'; reset<='0';

        -- specify each opcode
        case code is
          when ftch => iload<='1'; pinc<='1'; read<='1';
          when jmp  => pload<='1'; npush<='1'; rpush<='1'; reset<='1';
          when ret  => pload<='1'; rpop<='1'; psel<='1'; reset<='1';
          when jz   => pload<=z; reset<='1';
          when jnc  => pload<= not t(width); reset<='1';
          when call => pload<='1'; rpush<='1'; rsel<='1'; reset<='1';
          when ldp  => msel<='1'; ainc<='1'; tright<='1'; tleft<='1'; tsel<='1';
                       npush<='1'; reg_sel<="11"; read<='1';
          when lit  => pinc<='1'; tright<='1'; tleft<='1'; tsel<='1';
                       npush<='1'; reg_sel<="11"; read<='1';
          when lp   => msel<='1';      -- ld is typo?
          -- when ld => msel<='1';
                       tright<='1'; tleft<='1'; tsel<='1';
                       npush<='1'; reg_sel<="11"; read<='1';
          when stp  => msel<='1'; ainc<='1'; tright<='1'; tleft<='1'; tsel<='1';
                       npop<='1'; reg_sel<="00"; write<='1';
                       -- data <= t;
          when st   => msel<='1'; tright<='1'; tleft<='1'; tsel<='1';
                       npop<='1'; reg_sel<="00"; write<='1';
                       -- data <= t;
          when com  => tright<='1'; tleft<='1'; alu_sel<="00";
          when shr  => tright<='1'; tleft<='0'; alu_sel<="00";
          when shl  => tright<='0'; tleft<='1'; alu_sel<="00";
          when addc => if t(0)='1' then
                         tright<='1'; tleft<='1'; alu_sel<="11";
                       end if;
          when xorr => tright<='1'; tleft<='1'; alu_sel<="01"; npop<='1';
          when andd => tright<='1'; tleft<='1'; alu_sel<="10"; npop<='1';
          when addd => tright<='1'; tleft<='1'; alu_sel<="11"; npop<='1';
          when pop  => tright<='1'; tleft<='1'; tsel<='1'; reg_sel<="10";
                       rpop<='1'; npush<='1';
          when lda  => tright<='1'; tleft<='1'; tsel<='1'; reg_sel<="01"; npush<='1';
          when dup  => npush<='1';
          when over => tright<='1'; tleft<='1'; tsel<='1'; reg_sel<="00"; npush<='1';
          when push => tright<='1'; tleft<='1'; tsel<='1'; rpush<='1'; npop<='1';
          when sta  => tright<='1'; tleft<='1'; tsel<='1'; aload<='1'; npop<='1';
          when drop => tright<='1'; tleft<='1'; tsel<='1'; npop<='1';
          when others => null;
        end case;
      end process decode;

      -- specify synchronous processes
      sync: process(clk,clr)
      begin
        if clr='1' then
          slot <= 0;
          i <= (others => '0');
          np <= 0; np1 <= 1;
          rp <= 0; rp1 <= 1;
          t <= (others => '0');
          p <= (others => '0');
          a <= (others => '0');
          for ii in n_stack'range loop
            n_stack(ii) <= (others => '1');
            r_stack(ii) <= (others => '1');
          end loop;
        -- rising edge of clock
        elsif (clk'event and clk='1') then
          if reset='1' then
            slot <= 0;
          else
            slot <= slot+1;
          end if;
          if iload='1' then
            i <= data;
          end if;
          if aload='1' then
            a <= t;
          elsif ainc='1' then
            a <= a+1;
          end if;
          if pload='1' then
            p <= p_in;
          elsif pinc='1' then
            p <= p+1;
          end if;
          if npush='1' then
            n_stack(np) <= t;
            np <= np+1; np1 <= np1+1;
          elsif npop='1' then
            np <= np-1; np1 <= np1-1;
          end if;
          if rpush='1' then
            r_stack(rp) <= r_in;
            rp <= rp+1; rp1 <= rp1+1;
          elsif rpop='1' then
            rp <= rp-1; rp1 <= rp1-1;
          end if;
          if tright='1' then
            if tleft='1' then
              t <= t_in;
            else
              t <= t(width) & t(width-1) & t(width-1 downto 1);
            end if;
          elsif tleft='1' then
            t <= t(width-1 downto 0) & '0';
          end if;
        end if;
      end process sync;

    end archcpu16;
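
In the spirit of the simulator comparison made earlier, the ALU mux and the zero and carry tests from the listing can be mirrored in a short Python sketch. It assumes 17-bit cells with bit 16 acting as the carry (the bit that jnc tests) and assumes that the addition wraps at 17 bits. The function names `alu_out`, `zero_flag`, and `carry` are hypothetical stand-ins for the corresponding signals.

```python
WIDTH = 16
MASK = (1 << (WIDTH + 1)) - 1  # 17-bit cells: t(width downto 0)

def alu_out(alu_sel, t, n):
    """Mirror of the 'with alu_sel select' mux in the listing."""
    if alu_sel == 0b01:
        return (t ^ n) & MASK
    if alu_sel == 0b10:
        return (t & n) & MASK
    if alu_sel == 0b11:
        return (t + n) & MASK  # assumed to wrap at 17 bits
    return ~t & MASK           # "00" and others: not t

def zero_flag(t):
    """z is computed from t(15) down to t(0) only; the carry bit is ignored."""
    return int((t & 0xFFFF) == 0)

def carry(t):
    """Bit 16 of T, the bit that jnc tests."""
    return (t >> WIDTH) & 1

total = alu_out(0b11, 0xFFFF, 1)
print(hex(total), zero_flag(total), carry(total))  # 0x10000 1 1
```

Adding 1 to 0xFFFF leaves the low 16 bits zero and sets bit 16, so both the zero test and the carry test fire on the same result.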

P.S.

If anyone finds any errors in this listing when comparing it to the handout distributed by Dr. Ting please email me so that I can correct the page. This is certainly not the most efficient implementation of P16 possible. Dr. Ting made an effort to get everything to work in one cycle and to make the source as simple as he could.

Posted: 8/20/2000

Updated: 8/28/2000 with typo corrections from John Tasgal

Updated: 9/10/2000

page created by Jeff Fox