F21 and F*F

**F21 and F*F**

By Jeff Fox

Ultra Technology Inc.
2512 10th St.
Berkeley, CA 94710

Paper submitted for presentation
at the 1993 FORML Conference

Abstract
F21 has been under development by Charles Moore at Computer Cowboys and Jeff Fox at Ultra Technology for more than two years. This paper describes in more detail the ideas behind F21 and the status of the current hardware and software design. Both the hardware and the software have become more lean.

Artificial Intelligence is Embarrassingly Parallel

The term embarrassingly parallel is widely used today by computer scientists to describe their applications that port most naturally to parallel architecture. Almost exclusively these computer scientists are doing floating point number crunching on gigantic computer models of complex problems like fluid turbulence. These operations first require domain decomposition of the data in the model to fit portions of the model into each node of the parallel machine. Since most of the algorithms used are based on operations with neighboring data it is important to let each node contain as much memory as possible. This is because there is an inefficiency introduced by the requirement to duplicate data on the boundaries of the portions of the data that is stored in each node. The boundary of the data will have one less dimension than the data stored in each node. If the data in a node represents a three dimensional model then its two dimensional surface will require duplicates of some of its neighbor nodes data to make the computations on this border data. The ratio of the duplicated data is 2N*X^(N-1)/X^N where N is the number of dimensions and X is the units of computation per dimension. of the data partitioned for a single node. Another factor that has influenced parallel system design is the non-parallelizable portion of application programs. If something is 90% parallelizable then 10% must always be computed sequentially. Adding a higher degree of parallelism does not improve the performance of this non-parallelizable portion of an application. The only way to speed this up is to have a faster node. These two factors have led designers away from massively parallel processors (MPP) with smaller and slower processors towards fewer faster processors with more memory per node.

In recent years workstation technology has seen a faster decrease in the ratio of price to performance than have super computers. In fact, while super computers have become faster in the past few years they have had no decrease in their price to performance ratio. (*1) Researchers now demonstrate that clusters of workstations can provide the computing power of super computers at one tenth the cost and no high maintenance fees. Typically these computer scientists are using workstations with 64 megabytes of memory, 50 MIP, 50 MFLOP, machines costing $25K each and connected on a sub 1 megabyte per second eithernet network.

Artificial Intelligence

Artificial Intelligence software has traditionally be written in LISP and been run on very large sequential computers. Today "C" and "C++" have become more popular for producing work outside of the research lab. The computing requirements and performance for academic AI software has kept much of it out of the marketplace. FORTRAN still dominates the parallel processing field. And although "C" offers some improvement in the efficiency of AI platforms Forth has been used by some researchers who wanted to push the price to performance ratio of their research. Neural network simulations, natural language processing, and expert systems are all well known in the Forth community. It is expert systems that have had the most success in entering the workplace despite the fact that many are very expensive. Expert systems have demonstrated that they can provide Ph.D. specialist level of advice to a human through some form of dialog about some limited domain of expertise. In fact many experts are disturbed to find that most of the expertise that took them years to learn can be encoded in a couple of hundred simple logic statements. Philip J. Koopman Jr.'s Stack Computers, the New Wave lists expert systems are one of the best matches to the stack machine architecture. (*2) In an article in AI Expert Robert Trelease compared the performance of a tower of Hanoi benchmark on a number of expert systems and machines and the results were striking. (*3) The slowest entry was the Digital VAX 11/780 running the industry standard OPS5 expert environment at 60 seconds. The fastest was FORPS on a NOVIX where the expert system converts the English language rules into native forth code for the NOVIX. The NOVIX time of .006 seconds is four orders of magnitude faster than the VAX. If you combine this with the difference in cost you get more than six orders of magnitude difference in the price to performance ratio.
AI research has remained very fragmented after decades of progress. Game playing, natural language, expert systems, neural networks, machine vision, and speech processing have all made significant advances, but things like common sense, consciousness, or self awareness have remained very elusive. Much of the problem has been the attempts to do these things on sequential computers.

Parallel Intelligence

If you study organic architecture you find incredible parallelism! The granule cells are the most numerous cells in the human brain, and it is estimated that there are 3 x 10^10 granule cells in the cerebellum alone. Granule cells posses from one to seven dendrites, the average being four. (*4) Although we understand the organic architecture to some extent, it is very difficult to produce a non-organic device with this many computing elements.
It is well understood that different parts of the human brain are specialized to perform different functions and that processing speeds are very slow, but the degree of parallelism is very very high. Given the high degree of functional behavior possible with very few interconnected neurons in more primitive organisms the degree of parallelism in the human brain gives a new meaning to MPP.

F21 Parallel Multimedia Forth Engine

The latest forth engine microprocessor developed by Charles Moore is the MuP21. The MuP21 is designed to run the OKAD VLSI software that developed the MuP21. (a sort of hardware metacompiling)
The MuP21 is essentially a complete workstation for VLSI CAD on a single chip. Add a keyboard, some memory, a monitor, some mass storage and software and you have a VLSI CAD workstation.
Add a network interface and a few other things and you get F21 the next chip to be produced by Charles Moore. The F21 very much resembles a conventional networked workstation in functionality but with orders of magnitude less complexity and cost.
Parallel computing systems today as noted before are often clusters of workstations. Unused computing cycles can be gathered from idle workstations already in use, or dedicated workstations can be clustered to form less expensive super computers. As noted before today people are using workstations with 64 megabytes of memory, 50 MIP, 50 MFLOP, machines costing $25K each and connected on a sub 1 megabyte per second eithernet network. If we are discussing systems with dedicated workstations then a similar system configured with F21 would contain 2.5 megabytes of memory, 200 MIP, and connected on a 10 megabyte per second network and cost under $200 per node.

Figure 1.
                     Conventional Workstations    F21 System
                
Megabytes per node             64                  2.5
MIPs per node                  50                  200
Cost per node ($)              25,000            under 200
Megabytes per second network   under 1              10

Just as it is possible to use a cluster of conventional workstations as a super computer it is possible to do the same thing with the F21. The F21 was designed for a tradeoff between minimal cost per node and maximum performance per node. The price performance advantage of the F21 over conventional workstations on some applications is five hundred to one. Where required the F21 could be connected to an inexpensive DSP chip for floating point to increase the floating point performance per node.
As shown before, one advantage of doing AI in forth on a forth engine compared to doing AI in LISP on conventional architecture is several orders of magnitude improvement in performance. One advantage of using a networked minimalist forth engine like the F21 rather than conventional workstations on parallel problems is several orders of magnitude reduction of cost. When these advantages are combined the results should be a price performance breakthrough for parallel AI.

Agents and Subsumption

The cofounder of the Artificial Intelligence Laboratory at MIT, Marvin Minsky refers to agents of the mind. (*5) Minsky's The Society of Mind describes how the human mind is a collection of agents that act very much like expert systems with limited domains of knowledge. Agents know nothing about other agents to which they are not directly connected. And all they know about the agents to which they are connected are the messages sent by them. In a way these agents function something like expert systems. These agents function like neural networks, and for the most part most of them have no model of the real world like classical AI programs. Classical AI programs have relied on various schemes for knowledge representation and modeling of the real world, but much attention has been focused in recent years on the work of another MIT scientist Rodney Brooks. (*6) Brooks' subsumption architecture approach demonstrates that robots can deal with real world problems with little more than a few fairly primitive reflexes. Because the subsumption architecture does not require vast intellectual knowledge representation or vast computing resources it is considered by many a divergence from classical AI. I see Minsky's view of mind as the interaction of many disconnected agents and Brook's idea that reflexes are more important than planning and modeling as very similar. It should be noted that the small insect like robots that Brooks created are in fact controlled by parallel processing. Brooks' somewhat famous robot Attila has six legs, 60 sensors, 23 motors, and 11 computers. This robot though only the size of a shoe box and controlled by very primitive micro controllers has force sensing whiskers, a gyroscope, a pitch-and-roll sensor, a near-infrared range finder, and a small camera for machine vision. People who see Attila are often disturbed by what appears to be a large living mechanical insect.
AI has been kept in the lab by the excessive costs of LISP and mainframe computers. And although a market does exist for commercial expert systems built on this type of platform it is limited by expense. The subsumption architecture approach to intelligent machines may make them more prolific.
I have written interactive speech conversant programs on early microcomputers. These programs though simple could carry on a spoken conversation with a human on a very limited subject matter. I have written programs on early microcomputers that combined simple image analysis routines that functioned like neural networks with a formal expert system. This program could recognize humans and pets and trigger various behaviors like dialing a phone and transmitting an interesting image to me in another town. Modest programs like this can be easily connected through expert agents to form a program that exhibits more advanced behavior.
Minsky argues that one agent people don't seem to have is an agent to watch the inner workings of the other agents. The idea of agents evolved only after people understood how neurons and neural networks work, and began to examine how consciousness might reside on this sort of architecture. He likes to say that he thinks humans have evolved about 400 different agents, maybe one more agent every million years. He loves to explain things like why pain hurts, or why mice don't like music in terms of the interaction of these agents.

The Age of Computer Agents

Computer visionary Alan Kay often states that the age of desktop metaphor and graphic user interface is passing and the age of intelligent agent interfaces is dawning. He uses the term agent to refer to an entity with some degree of intelligence that will perform some work for a user. Typically these agents appear as talking heads on a monitor and interact with the user like a human servant. Kay argues that with the development of widespread connectivity and distributed data that the desktop metaphor that is so popular today will become useless. You can make sense of the visual image of the icons that represent the files or folders on one disk drive, but when you are connected to the internet the visual representation of the files available would be of little use to a human. Instead you would like something that could go out and "look" at the millions of files and only show you things of interest to you. To do this these agents need a number of things. The searching of millions of files is the easy part, the hard part is to know what is of interest, and to communicate this to the human. I say communicate because what is needed is more than just a report of information. What is desired is something resembling interaction with a human.

Number Please

For a number of years the production of videos depicting a vision of life with intelligent agent machines has become a billion dollar industry. The first was Apple's presentation of the knowledge navigator. Apple, HP, AT&T and other companies clearly plan to produce these types of machines by the turn of the century. PDAs like Apple's Newton is an example of the first step in that direction. Add color, talking heads, a video camera with image recognition, microphone and speaker and interactive conversant software, wireless communications and networking, video phone and video conferencing, and lots of memory and processing power to a Newton and your there!

The Agent Chip F21

The F21 was designed as an agent chip. The reason F21 can digitize a video image, digitize an audio signal, generate video, generate audio, and overlay computer generated video on top of an external video source is so that F21 can generate conversant talking heads on a video monitor. The reason that F21 has a network interface is so that F21 can provide scalable memory and processing power to handle whatever degree of AI is desired. F21 is an ideal processor for expert systems and parallel subsumption architecture, and F21 will have almost all the hardware you need to create the intelligent agents that people expect to see on the computers forecast now for the year 2000.

Parallel Multimedia

Multimedia is not a precise term. It means different things to different people. Today it is widely used to refer to a computer system that includes CD-ROM based audio and video in addition to the computer generated video. Computer based training on interactive video is commonly offered on this type of media. F21 should probably be called a multimedia chip because it has video in, video out, mixed video, audio in, and audio out capabilities directly from the chip. What do you call a system with hundreds of video in, video out, audio in, and audio out connections? Parallel multimedia is what I call it.

F21 evolves from F20

F21 is leaner than the design of F20 two years ago. Every part of the chip has been streamlined and simplified. As MuP21 evolved many things changed in the CPU core. It gained an extra bit and changed from MuP20 to MuP21. The extra bit gives you that needed carry bit for 20 bit math, and a way to have more addressing space. Conditional branching instructions changed from skips to jumps. The multiply step instruction changed. SWAP went away. NOP went away, then it came back. Instruction bits moved around. Data and address bits changed polarity. Memory maps changed. Memory busses changed. And the number of instructions dropped from 33 to 25. This made for a somewhat difficult to hit moving target environment.
The P20 software simulator became S21 and simulated both MuP21 and F21 moving targets. It was always easier to modify the software simulator than it was for Chuck to make the changes to the hardware simulation in OKAD. The first MuP20 eForth had been created with MASM and loaded into the P20 simulator in 1991. Later that year eForth development shifted to metacompilation. Metacompilation proved to be a much better approach to hit a moving target like MuP21. The MuP21 design eventually stabilized as the chip became functional. Larger stacks were added to the MuP21 design, and the analog processors, network processor and parallel port that distinguish the F21 were added to the MuP21 core. After the instruction set stabilized F21 code development began. The first version of F21 eForth was simple to implement with larger on chip stacks. After the first DTC model was improved and streamlined a STC model was coded. The first STC eForth proved to be slower than the previous DTC version. This was because off page calls required two cells of memory while threaded addresses would always fit into one cell. Since off page memory reference are the slowest operations the replacing of off page references with longer code sequences caused and increase in the size of the program, and therefore an increase in the number of off page references. The next version was STC with native instructions like DUP compiled inline with several NOP instructions to fill a cell rather than a subroutine call to the high level DUP word used by the interpreter. The next version packed these primitives tightly with NOPs only being used where required for address alignment. Tail recursion optimization was added to convert a call to a jump rather than follow it with a subroutine return. Macros were added to implement inline code for @, !, and SWAP and other words, and these macros were modified to insert different code depending on context. F21 eForth source code would compile down to almost 1k word of code, but would not quite fit on one 1k page. More code optimizations were added. Loop unrolling can be used to linearize inlined code, and use of the A register as a fast constant where possible was added. The optimizing cross compiler for F21 is called X21. X21 can produce code for MuP21 as well as F21, but the smaller stacks on MuP21 require the programmer to manage the resources very carefully. The very simple benchmark in figure 2 compiled with the X21 optimizing compiler uses up most of the stack registers available on MuP21.
The Mentink benchmark was posted to comp.lang.forth on the internet and people posted the results from various systems and compilers. I tested the simulated MuP21 and F21 and show the results in figure 4. At the time I posted several notices about how the benchmark could be very deceptive because it was so trivial that many optimizing compilers would actually remove the guts of the program since it did nothing. I stated that anyone posting results should also state if possible any such details. With this in mind notice that some results that were posted indicated that indeed some compilers had removed some of the code.

Bernie Mentink posted the following simple benchmark to the internet:

DECIMAL
: INNER 1000 0 DO 34 DROP LOOP :
: BENCH 1000 0 DO INNER LOOP 7 EMIT ;

Some people removed the 7 EMIT and simply used the loops with some timer function built into the forth they were using and reported the result. This will make a big difference on a fast processor since the BELL may be quite long!

Both the MuP21 and F21 have DRAM and fast SRAM address spaces, so the performance on the benchmark is shown for both types of memory. Also the results for various eForth models and fforth a native code optimizing compiler are given where appropriate.

The code produced by X21 or fforth is show in figure 2.

Figure 2.
 address opcode label comment
 001000  5715C INNER n push n push         1000 0 DO
 001001  003E8       1000
 001002  00000       0
 001003  57D58 IN1   n drop n pop          34 DROP
 001004  00022       34
 001005  00001       1
 001006  BE37B       + pop over over    INC loop counter
 001007  A080A       xor  T0  IN2       COMPARE
 001008  FF39E       drop push push nop
 00100A  FFFFE IN2   drop drop drop ;       FINISH
 00100B  5715C BENCH n push n push           1000 0 DO
 00100C  003E8       1000
 00100D  00000       0
 00100E  20000 BE1   call INNER
 00100F  563DE       n pop nop nop
 001010  00001       1
 001011  BE37B       + pop over  over  INC counter
 001012  A0815       xor T0  BE2 COMPARE
 001013  FF39E       drop push push nop
 001014  00003       jmp  BE1                LOOP
 001015  FFFBA BE2   drop drop drop n      FINISH
 001016  00007       7
 001017  57020       n push  ; long call and return
 001018  03987       lit'  EMIT

Figure 3 shows the code with the inner loop is unrolled and inlined. (**1).

Figure 3.
 address  opcode label  comment
 001000  57BDE INNER  n nop nop nop
 001001  00001        1
 001002  57D5F        n drop n  drop 1st and 2nd
 001003  00022
 001004  00022
 001005  57D5F        n drop n  drop  3rd and 4th
 0010//  ////         unrolled inner loop 5th-48th
 001096  57D5F        n drop n  drop 49th and 50th
 001006  00022
 001007  00022
 001099  88C02        2* C0  001002 loop 20times=1000
 00109A  F8400        drop ;

This INNER executes in 140 microseconds on the MuP21 in SRAM. This gives 140 milliseconds for BENCH. The fastest code is produced by using the A register as a fast constant (**2).

Figure 4. Forth Version Model Memory Options Processor MuP21 F21 MuP21 eForth 2.02 DRAM DTC 3.3 2.6 F21 eForth STC 2.02 DRAM STC calls 1.5 F21 eForth DTC 2.02 DRAM DTC 1.2 F21 eForth STC 2.03 DRAM STC NATIVE PRIMITIVES 0.7 F21 eForth STC 2.04 DRAM STC packed primitives 0.65 fforth DRAM STC optimized 0.45 0.25 fforth SRAM 0.3 0.17 fforth **1 DRAM STC +INLINED 0.19 0.137 fforth **1 SRAM STC +INLINED 0.14 0.077 fforth **2 DRAM STC INLINE W/REG 0.027 0.020 fforth **2 SRAM STC INLINE W/REG 0.020 0.010 **1 inlined inner loop optimization. **2 inlined inner loop optimization with register optimization.

The results of the Mentink benchmark that were posted to the internet and the results from the S21 simulations of MuP21 and F21 in DRAM and SRAM are all combined and sorted by execution speed the result is shown in figure 5.

You will note that the slowest entry for the MuP21 and F21 are running MuP21 eForth in DRAM. The speed is similar to a 386 running FPC and about half as fast as a 386 running TCOM compiled code. The code compiled by fforth or X21 is about as fast as the 50 MHz R4000 in the SGI Crimson when running in DRAM and about as fast as the 100 MHz R4000 in the SGI Crimson when running in SRAM. Inlined, unrolled and register optimized code will run an order of magnitude faster than the SGI Crimson. It is also 300 times faster than the 486 running a forth in "C" or 260 times faster than the F21 running the code under MuP21 eForth.

Figure 5.
           MENTINK BENCHMARK RESULTS FROM INTERNET
                       SORTED BY SPEED

  PROCESSOR  CLOCK  FORTH-COMPILER     OPTION   SECONDS

  VAX 6620           FIG 8080 (EMULATION)         15.32
  H8           10    eForth                       15.0
  68000         7                                  7.6
  80386        33    HS/FORTH                      6.1
  MuP21       100    MuP21 eForth   2.02 DRAM      3.3
  80486        33    F83 in "C"                    3.0
  80386        33    FPC                           2.8
  68040        25    Yerk                          2.7
  F21         200    MuP21 eForth   2.02 DRAM      2.6
  80386        33    TCOM (FPC)                    1.7
  80386        33    HS/FORTH w/optimization       1.6 no 34 DROP
  80386        33    TCOM (FPC)                    0.99 no 7 EMIT
  HP-PA              Forth in "C"                  0.75
  68030        25                                  0.75
  F21         200    F21 eForth STC 2.03 DRAM      0.7
  R3000        33    RISC pFORTH INDIGO            0.66
  F21         200    F21 eForth STC 2.04 DRAM      0.65
  MuP21       100    fforth              DRAM      0.45
  68040        25                                  0.35
  R3000        66    RISC pFORTH INDIGO            0.33
  MuP21       100    fforth              SRAM      0.3
  F21         200    fforth              DRAM      0.25
  R4000        50    RISC pFORTH   CRIMSON         0.24
  80486        33    ForthCMP                      0.21 no 34 DROP
  MuP21       100    fforth **1          DRAM      0.19
  F21         200    fforth              SRAM      0.17
  MuP21       100    fforth **1          SRAM      0.14
  F21         200    fforth **1          DRAM      0.137
  R4000       100    RISC pFORTH   CRIMSON         0.12
  F21         200    fforth **1          SRAM      0.077
  MuP21       100    fforth **2          DRAM      0.027
  MuP21       100    fforth **2          SRAM      0.02
  F21         200    fforth **2          DRAM      0.02
  F21         200    fforth **2          SRAM      0.01

F*F evolves from Forth Linda

Forth Linda was developed to provide an extension to Forth for parallel programming. The advantage of the Linda approach is that the number of processors need not be known at compile time, and the load balancing at runtime is automatic. Linda also provided mechanisms for compiling on a network of heterogeneous machines. Linda implements a tuple space where remote execution is done with active tuples and data is passed in passive tuples. (*7) The F*F parallelizing optimizing compiler for F21 will use the active tuple mechanism of Linda for remote execution, but uses a simpler mechanism for shared data.
The network interface on the F21 has been reduced to have only two functions: broadcast an interrupt or a data transfer to one or more nodes on the network. With the addition of some network support code these two functions map directly to the parallelizing words in F*F.
Since the network interface on the F21 has a memory broadcast capability, a single broadcast on the F21 network interface can write to the memory of a single node, a group of nodes, or all the nodes on the network. So when global variables or data structures are updated it is done in an atomic manner on the network. Fetches from global variables and data structures are just local memory fetches so they are very fast. F*F uses the words G! and G!! to perform a global stores to global data structures. The F*F word RX( queues up the words enclosed in parenthesis for remote execution on a different processor.
Some primitive network boot and control words will be provided in the system, but the three words RX(, G!, and G!! are all that are needed by a F*F forth programmer to convert forth programs to run in parallel. F*F does not make the global data structures or parallelizing transparent. Instead it tries to make the words that specify parallel execution as simple and direct as possible.

REFERENCES

1. Bell, Gordon. "Tracking the Teraflops" Distributed Computing for Aerosciences Applications Workshop at NASA Ames Research Center October 1993
2. Koopman, P. "Stack Computers, the New Wave"
3. Trelease, R. "The Forth Wave in AI." AI Expert (October 1987)
4. Albus, J. "The Cerebellum: A Substrate for List- Processing in the Brain" "Cybernetics, Artificial Intelligence, and Ecology" Spartan Books 1972
5. Minsky, M. "The Society of Mind" Simon & Schuster, Inc. 1985
6. Jones, J. and Flynn, A. "Mobile Robots" A K Peters, Wellesly, Massachusetts 1993
7. Fox, J. "Debugging eForth-Linda on a Simulated Network of Forth Computers" "1991 FORML Conference Proceedings", Forth Interest Group 1992

OTHER SUGGESTED REFERENCES

Minsky, M, and Papert, S. "Perceptrons" MIT Press 1969
Bjornson, R. and Carriero, N. and Gelernter, D, "The Implementation and Performance of Hypercube Linda" 1988
Carriero, N. and Gelernter, D. "How to Write Parallel Programs: A Guide to the Perplexed" Yale University Department of Computer Science RR-628, May 1988
Gelernter, D. "Getting the Job Done" Byte November 1988
Mattson, T. "An Introduction to Linda" Scientific Computing Associates, Inc. 1989
Hendtlass, T. "Embedded Node Collectives" "1991 FORML Conference Proceedings", Forth Interest Group 1992