MuP21--A High Performance MISC Processor

Chen-hanson Ting and Charles H. Moore

Offete Enterprises, Inc. 3/17/95

1. MISC vs. RISC vs. CISC

The controversy between RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer) has been largely settled, and RISC has won. Most of the newer and more powerful processors developed recently are RISC processors, like SPARC, MIPS, Alpha from DEC, PA from HP, and PowerPC from IBM. However, CISC processors persist due to momentum, like the Intel x86 family, and in the microcontroller area where raw speed is not an important factor.

The basic principles behind the original RISC processors are valid, such as:

a. A simple instruction set is faster
b. Complicated memory accessing instructions are not necessary
c. A large register file facilitates software
d. Complicated functions are best handled in the compiler
e. A simpler processor is easier to design and to build

However, RISC is a good idea that fell into the wrong hands. The emphasis on simplicity is all but forgotten. The RISC processors we see now are more complicated than many of the CISC processors. The relentless push towards higher speed left behind a bloody trail. Some of the problems in the RISC architecture are quite evident:

a. RISC processors are inherently slow, because each instruction still needs many machine cycles to execute. Instruction pipelines are used to accelerate the execution. However, the pipeline must be flushed and refilled when a branch instruction is encountered.

b. Increasing speed in the RISC processor creates a large disparity between the processor and the slower memory. To increase the memory accessing speed, it is necessary to use cache memory to buffer instruction and data streams. The cache memory brings in a whole set of problems which complicate the system design and render the system more expensive.

c. RISC processors are very inefficient in handling subroutine calls and returns. An efficient subroutine mechanism is critical to the performance of a processor in supporting high level languages. Many RISC processors use a large register file, which is windowed to facilitate subroutine call and return. However, the register window must be big enough to handle a large set of input, output, and local parameters. The large register window wastes the most precious resource in the RISC processor. A large register file also slows down the computer system during a context switch, which must save the register file and later restore it.

Our opinion is that in RISC, reducing the size of the instruction set is effective in reducing the complexity of the processor and improving its performance. However, the principle of simplicity was not enforced strictly enough to realize its full benefit. In the MISC architecture, we would like to explore the power of simplicity to its limit, to see how far we can push CMOS technology in reducing the costs of building computer systems and increasing their performance. We would like to have answers to the following questions:

a. What is the minimum set of instructions in a microprocessor to make it useful in solving practical programming problems?

b. What will be the performance of a microprocessor with such a minimum set of instructions?

c. What facilities in a microprocessor are necessary to reduce the complexity and the system costs of a computer?

d. How can the current CMOS technology best be utilized to build such MISC processors?

2. The MISC Instruction Set

What is the minimum set of instructions in a practical microprocessor? The CISC processors generally have 100 or more instructions. The RISC processors have about 50 instructions. In our investigations, it was obvious that 16 instructions are not sufficient to support all the necessary functions required in a microprocessor. 50 instructions are too many. The minimum number of instructions is somewhere between 16 and 32. A convenient choice is to limit the number of instructions to 32 and implement a microprocessor with 5 bit instructions.

Here is the instruction set implemented in MuP21:


 Transfer Instructions:  JUMP, CALL, RET, JZ, JCZ
 Memory Instructions:    LOAD, STORE, LOADP, STOREP, LIT
 ALU Instructions:       COM, XOR, AND, ADD, SHL, SHR, ADDNZ
 Register Instructions:  LOADA, STOREA, DUP, DROP, OVER, NOP

So far, we have implemented only 24 instructions, leaving some room for future expansion. This MISC instruction set seems to be adequate for the applications we have coded, including quite elaborate operating systems and demonstration programs.

It is interesting that we have an ADD instruction but no subtraction, that we have XOR but no OR instruction, and that we have OVER but not SWAP. Obviously, subtraction can be synthesized by complement and addition. OR can be synthesized by complement, AND, and XOR. OVER and SWAP are very similar, in that they allow accessing the item below the top of the data stack. However, it is difficult to determine which is more fundamental in a stack machine.
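
The synthesis of subtraction and OR mentioned above can be sketched in Python on 20-bit words. This is only an illustration: the function names are ours, the carry bit is ignored, and the paper does not say which of the two OR identities shown here is actually used in MuP21 code.

```python
MASK = 0xFFFFF  # 20-bit data word; the carry bit is ignored in this sketch


def com(t):
    """One's complement of a 20-bit word (the COM instruction)."""
    return ~t & MASK


def add(a, b):
    """20-bit addition with the carry discarded (the ADD instruction)."""
    return (a + b) & MASK


def sub(a, b):
    """Subtraction from complement and addition: a - b = a + ~b + 1."""
    return add(add(a, com(b)), 1)


def or_demorgan(a, b):
    """OR from COM and AND (De Morgan): a | b = ~(~a & ~b)."""
    return com(com(a) & com(b))


def or_xor(a, b):
    """OR from XOR and AND: a | b = (a ^ b) ^ (a & b)."""
    return (a ^ b) ^ (a & b)
```

For example, sub(3, 5) yields FFFFE hex, which is -2 in 20-bit two's complement.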

3. MuP21 Architecture

MuP21 is the first member of a series of MISC microprocessors. The primary constraints on the design of this microprocessor were that it had to be housed in a 40 pin DIP package, and that the silicon die had to be less than 100 mils square. We determined that a 20 bit microprocessor could be implemented within these physical constraints. There would not be enough I/O pins to support a processor with wider data and address buses.

MuP21 must use DRAM as its primary memory, as DRAM offers the best bit density and the lowest cost per bit. However, it has to boot from ROM or other 8 bit memory devices, and it also has to address various I/O devices. Therefore, we need a memory coprocessor to handle the buses and to generate the proper control signals to the memory and I/O devices.

A unique feature of MuP21 is that it generates NTSC signals to drive a color TV monitor, because it is targeted at many applications which use the TV monitor as the principal display device. A video coprocessor was designed to run in parallel with the main processor to display video frames stored in the main DRAM memory.

The main CPU in MuP21 thus includes the following components:

a. A Return Stack to nest subroutine return addresses
b. A Data Stack to store parameters passing between subroutines
c. A T (Top) Register as the central holding register for operands
d. An ALU which takes operands from T and the top of Data Stack and returns the results of ALU operation to T Register
e. An A (Address) Register to hold a memory address for fetching or storing data from/to memory
f. A PC (Program Counter) Register to hold the address of the next instruction
g. An Instruction Latch which holds four 5-bit instructions to be executed in sequence

The memory and data buses are 20 bits wide. The instructions are 5 bits wide. Therefore, four instructions can be packed in each 20-bit word fetched from memory. This is a natural instruction pipeline. After 4 instructions are executed, the slower external memory is ready to supply the next set of 4 instructions. The processor can be four times faster than the memory. Fast cache memory and the associated control circuitry are not needed.
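
The packing of four 5-bit instructions into one 20-bit word can be sketched as follows. The function names are ours, and the slot order (slot 0 in the most significant bits) is an assumption made for illustration; the paper does not specify it.

```python
def unpack_word(word):
    """Split a 20-bit word into four 5-bit instruction slots.

    Slot 0 is taken from the most significant 5 bits (assumed order).
    """
    if not 0 <= word < (1 << 20):
        raise ValueError("word must fit in 20 bits")
    return [(word >> shift) & 0x1F for shift in (15, 10, 5, 0)]


def pack_word(slots):
    """Pack four 5-bit instruction codes into one 20-bit word."""
    if len(slots) != 4 or any(not 0 <= s < 32 for s in slots):
        raise ValueError("need four codes in the range 0..31")
    word = 0
    for code in slots:
        word = (word << 5) | code
    return word
```

One memory fetch thus delivers four instructions, which is why the memory only needs to keep up with one quarter of the instruction rate.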

The execution speed of MuP21 is very fast because of the simple instruction set and the dual stack architecture. The ALU instructions can be executed very fast because operands are taken from the T register and the top of the data stack, and the results are returned to the T register. There is no need to decode the source and destination registers. Actually, the ALU operates continuously. Once the data in the T register and the top of the data stack are stable, ALU results from COM (complement of T), SHL, SHR, XOR, AND, ADD, and conditional ADD are all generated at the same time. The ALU instruction only selects the proper result and gates it back into the T register. The operations of the MuP21 processor can thus be summarized in two steps:

a. Read a 20-bit word from memory and latch it into the instruction latch.
b. Execute the 5-bit instructions by latching proper results into the T register.
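
The idea of an ALU that produces every result continuously and lets the instruction merely select one can be modeled in a few lines of Python. This is a behavioral sketch only, with several assumptions: the names are ours, SHR is modeled as a logical shift, the carry bit is ignored, and the conditional ADD is left out because its exact behavior is not described here.

```python
MASK = 0xFFFFF  # 20-bit data word; carry bit ignored in this sketch


def alu_results(t, s):
    """Compute every ALU result from T and the top of the data stack (s).

    In hardware all of these are available at once; the instruction
    only chooses which one is gated back into T.
    """
    return {
        "COM": ~t & MASK,
        "SHL": (t << 1) & MASK,
        "SHR": t >> 1,          # assumed logical shift
        "XOR": t ^ s,
        "AND": t & s,
        "ADD": (t + s) & MASK,
    }


def execute_alu(op, t, s):
    """Select one precomputed result; this is the new value of T."""
    return alu_results(t, s)[op]
```

No register numbers appear anywhere in the model, which mirrors the point that a stack machine has nothing to decode beyond the operation itself.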

MuP21 is thus much faster than RISC machines, because a RISC processor must go through the following sequence to execute one instruction:

a. Read an instruction from memory and latch it.
b. Decode the instruction and select the operand registers.
c. Execute the instruction.
d. Store results back into the selected destination register.

A stack based processor is more advantageous than a register based processor because the source and destination registers are defined in hardware and no register decoding is necessary.

MuP21 executes instructions at a speed of 10 ns per instruction. The peak execution rate is thus 100 MIPS. It achieves this remarkable performance using only the now outdated 1.2 micron CMOS process, because of the simplicity of its architecture and the MISC instruction set. Accessing the slower DRAM memory derates its performance to about 80 MIPS.
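
The arithmetic behind these figures can be checked with a short calculation. The 100 MIPS peak follows directly from 10 ns per instruction. The 80 MIPS figure is reproduced here under our own assumption, not stated by the authors, that each word of four instructions costs one 50 ns DRAM RAS cycle (the cycle time given in the memory coprocessor section).

```python
INSTRUCTION_TIME_NS = 10    # from the text
INSTRUCTIONS_PER_WORD = 4   # four 5-bit instructions per 20-bit word
DRAM_CYCLE_NS = 50          # DRAM RAS cycle; its use here is an assumption


def mips(instructions, time_ns):
    """Millions of instructions per second for a count executed in time_ns."""
    return instructions * 1000 / time_ns


peak_mips = mips(1, INSTRUCTION_TIME_NS)
dram_bound_mips = mips(INSTRUCTIONS_PER_WORD, DRAM_CYCLE_NS)

print(peak_mips)        # 100.0
print(dram_bound_mips)  # 80.0
```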

4. Video Coprocessor

MuP21 has a video coprocessor which runs in parallel with the main CPU. The video coprocessor reads 20-bit words from the DRAM memory and interprets each 20-bit word as four 5-bit instructions, similar to the main CPU. However, the video coprocessor instructions change the output voltage at the VIDEO output pin to generate an NTSC color video signal suitable for displaying on a standard TV monitor.

The video coprocessor is synchronized to a 14.39 MHz external clock to maintain precise timing of the video output. Whenever it is ready to fetch a new word from the DRAM memory, it gets a word via the memory coprocessor without delay, because the video coprocessor has higher priority than the main CPU, and the memory coprocessor will grant its memory request as soon as possible. After the video coprocessor gets a word from DRAM, it will execute four instructions before fetching the next word. During this interval, the main CPU can request memory access from the memory coprocessor. Hence, when the video coprocessor is turned on, it consumes 25% of the memory bandwidth of MuP21.

The instruction set of the video coprocessor is as follows:

  Opcode  Hex   Name     Slot   Cycles
  B       00    Black    x      1
  S       17    Sync     x      1
  R       1F    Refresh  2      1
  K       13    Skip     0      1
  C       15    Burst    x      1
  P       0x    Pixel    x      1
  J       18    Jump     0      0

When the MSB in a 5-bit video instruction is set, the instruction causes a special action in the video signal generator. When the MSB in an instruction is clear, the other four bits specify the color of one pixel to be displayed on the monitor. The assignments of bits are:

  0 I G R B

where G, R, B stand for green, red and blue, and I stands for intensity.
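
The decoding rule above can be sketched in Python. The function name and the returned dictionary layout are ours, and we assume the bit pattern 0 I G R B is listed from the most significant bit to the least significant bit.

```python
def decode_video_instruction(code):
    """Classify a 5-bit video instruction.

    MSB set: a control instruction (sync, burst, refresh, and so on).
    MSB clear: a pixel whose low four bits are I, G, R, B (assumed
    to be ordered from bit 3 down to bit 0).
    """
    if not 0 <= code < 32:
        raise ValueError("video instruction must fit in 5 bits")
    if code & 0x10:
        return {"kind": "control", "code": code}
    return {
        "kind": "pixel",
        "intensity": bool(code & 0b1000),
        "green": bool(code & 0b0100),
        "red": bool(code & 0b0010),
        "blue": bool(code & 0b0001),
    }
```

With this reading, code 00 (Black in the table) is simply the pixel with all four color bits off, and hex 17 (Sync) decodes as a control instruction.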

A video frame is first constructed in DRAM memory from the video instructions. When the video coprocessor is turned on by setting the LSB in the Configuration Register, the video coprocessor fetches the instructions in sequence and executes them. The result is a continuous stream of analog signals at the VIDEO output pin. When this pin is connected to the input of a video monitor, color pictures will be shown on the monitor. The main processor can change the pixel instructions in the video frame to cause the picture to change dynamically.

Since the video frame is completely constructed in the DRAM memory, it is easy to produce video signals either in NTSC format or in PAL format. This feature makes MuP21 a very powerful and versatile device to produce TV images. It will thus find many applications where video output is needed.

5. Memory Coprocessor

The memory coprocessor in MuP21 is mostly hidden from the user. It performs the following tasks in the background:

a. It arbitrates DRAM access requests from the video coprocessor and the main CPU. The memory request from the video coprocessor has priority over that from the main CPU.

b. It generates the proper control signals to DRAM and SRAM memories, and also the I/O enable signal to I/O devices. A DRAM RAS cycle is 50 ns. SRAM and I/O have two access speeds: a slow cycle of 250 ns, and a fast cycle of 15 ns. The memory coprocessor allows MuP21 to use a variety of memory and I/O devices without additional interface circuitry.

c. It controls the address and data buses to the memory and I/O devices. When accessing DRAM memory, the 20-bit addresses are multiplexed over pins A0-A9, and the data bus consists of D0-D9 and AD10-AD19. When accessing SRAM memory during booting, the address bus consists of A0-A9 and AD10-AD19, while the 8-bit data bus is on D0-D7. When accessing I/O devices, the addresses are on A0-A9, and data are on D0-D9 and AD10-AD19.

Memory and I/O accesses are controlled by address lines and two bits in the Configuration Register. The memory map of the different memory and I/O devices is:

        Address (hex)       Device
        000000-0FFFFF       20-bit DRAM memory
        120000-1203FF       Slow 20-bit I/O devices
        140000              Configuration Register
        160000-1603FF       Fast 20-bit I/O devices
        180000-1BFFFF       Fast 8-bit SRAM memory
        1C0000-1FFFFF       Slow 8-bit SRAM memory

Internally, MuP21 maintains a 21-bit data/address bus. The MSB, bit 20, is the carry bit in ALU operations. It also selects DRAM memory when low, and SRAM or I/O when high. According to the memory map, MuP21 directly addresses only 256 KB of SRAM memory. However, bits 18-19 in the Configuration Register are forced onto the address bus when reading or writing SRAM. This paging mechanism allows MuP21 to access 1 MB of external SRAM memory.
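
The address decoding and SRAM paging described above can be sketched as follows. This is an illustration under stated assumptions: the region boundaries are read as six-digit hex addresses in the 21-bit space (120000, 140000, and so on), the function names are ours, and the paging function assumes that Configuration Register bits 18-19 simply replace address bits 18-19 on an SRAM access.

```python
# (first address, last address, device), all within the 21-bit space
REGIONS = [
    (0x000000, 0x0FFFFF, "DRAM"),
    (0x120000, 0x1203FF, "slow I/O"),
    (0x140000, 0x140000, "configuration register"),
    (0x160000, 0x1603FF, "fast I/O"),
    (0x180000, 0x1BFFFF, "fast SRAM"),
    (0x1C0000, 0x1FFFFF, "slow SRAM"),
]


def decode_address(addr):
    """Return the device selected by a 21-bit address, or None if unmapped."""
    for first, last, device in REGIONS:
        if first <= addr <= last:
            return device
    return None


def sram_external_address(addr, config):
    """External 20-bit SRAM address after paging (assumed mechanism).

    The low 18 bits come from the CPU address (256 KB window); bits
    18-19 come from the Configuration Register, giving 4 pages = 1 MB.
    """
    page = (config >> 18) & 0b11
    return (page << 18) | (addr & 0x3FFFF)
```

Bit 20 alone separates DRAM (000000-0FFFFF) from everything else, which matches the statement that the MSB selects DRAM when low.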

6. Applications

MuP21 is a very powerful microprocessor because it is fast, and it has a fairly large addressing space. It also uses very little power. It is therefore suitable for a wide variety of applications in which high speed, low power consumption, and large addressing space are important factors in the design. Here is a list of potential applications for MuP21:

Advanced video games
TV signage
Video test pattern generators
CAD systems
Telephone switching system
Handheld computers
High speed communications systems
Intelligent hard disk controllers
Robotic controllers

7. Conclusion

MuP21 is the first member of a family of microprocessors based on the MISC principles. It proves that there is still room to improve on the RISC architecture. By insisting on the minimum set of instructions, microprocessors can be further simplified and their performance improved. We were amazed that MuP21 can run at a peak speed of 100 MIPS, using the currently outdated 1.2 micron CMOS process. With the more advanced 0.8 micron process, MuP21 can be made to run at a 200 MIPS rate. Moving on to 0.5 micron, the speed can be increased further to 300 MIPS without much effort.

MuP21 is a 20-bit microprocessor, constrained by the 40-pin DIP package. Using packages with more pins, the design can be easily expanded to 32 bits and beyond. A wider data/address bus will improve the throughput and also allow a greater addressable memory space for applications dealing with massive amounts of data. This is another direction in which to evolve the MISC architecture.

With a simpler and more efficient architecture, the MISC processors can be built with smaller silicon dies and thus the yield will be much higher than that of the more complicated RISC and CISC processors. The MISC processors will also consume much less power when running at equivalent speed. MISC processors will be much cheaper than RISC and CISC processors, and can compete effectively against them on the basis of a favorable price/performance ratio.