YOUR FEEDBACK
Gregor Rosenauer wrote: well, not what's your take on this? Did I miss a second page of this article or...

SYS-CON.TV
TOP MICROSOFT .NET LINKS


Design of the Portable.NET Interpreter
Design of the Portable.NET Interpreter

Portable.NET is an implementation of the Common Language Infrastructure (CLI). Its primary design goal is portability to as many platforms as possible, which it achieves through the use of interpretation rather than just-in-time compilation.

The bytecode format of the CLI presents some challenges to efficient interpreter implementation. Rather than directly interpret, we translate the bytecode into a simpler abstract machine, the Converted Virtual Machine (CVM). This machine is then interpreted using a high-performance engine.

Traditionally, abstract machines have used the same bytecode representation "on the wire" as for execution. Our work shows that there are definite performance advantages to using different bytecode representations internally and externally.

Introduction
The Common Language Infrastructure (CLI) is a set of specifications that describe a bytecode-based development and runtime environment. Portable.NET is an implementation of the CLI, whose primary design goal is portability to as many platforms as possible.

Portable.NET consists of three major components to support the CLI: a bytecode-based Common Language Runtime (CLR), a C# compiler, and a C# base class library. This article will concentrate on the runtime engine.

The runtime engine achieves portability primarily through the use of interpretation rather than just-in-time compilation. However, the bytecode format of the CLI presents some challenges to efficient interpreter implementation. This article discusses how we have overcome these challenges to build a high-performance interpreter for Common Intermediate Language (CIL) programs.

We compare the performance of Portable.NET with that of Mono to demonstrate how our approach differs from direct polymorphic interpretation and full just-in-time compilation.

Polymorphic CIL
The CIL instruction set contains instructions to perform arithmetic, logical operations, branching, method calls, object accesses, and pointer manipulation.

A unique feature of CIL, compared to other similar abstract machines, is that its instructions are polymorphic. The add instruction can be used on integers, longs, and floating-point values, for example. Other virtual machines, such as the Java Virtual Machine, use separate instructions for each type.

The polymorphic nature makes direct interpretation very inefficient, as demonstrated by the fragment from Mono's interpreter shown in Listing 1. (All of the code for this article can be downloaded from www.sys-con.com/dotnet/sourcec.cfm.) The types of all stack values must be tracked explicitly, leading to significant runtime overhead.

The challenge for Portable.NET was finding a way to implement a high-performance engine without writing a full just-in-time compiler.

The approach we took was very similar to a JIT: the CIL bytecode is translated into instructions for a simpler abstract machine, dubbed CVM (for "Converted Virtual Machine"). The CVM instructions are then interpreted using a high-performance interpreter, written in C.

As each method is entered, the following process occurs:

  1. Look for a cached CVM version of the method, and use it if found.
  2. Perform bytecode verification and convert the CIL into CVM.
  3. Record the CVM version in the cache.
  4. Jump into the interpreter to execute the CVM code.
Eventually the application's working set of methods ends up in the CVM method cache, and execution proceeds quickly.

Instead of a single add instruction, the CVM instruction set has several: iadd, ladd, fadd, etc. The conversion process chooses the most appropriate variant based on the operand types reported by the bytecode verifier.

Listing 2 shows a simplified form of the CVM interpreter code for the converted instructions. The interpreter executes more efficiently because it can assume that the values on the stack are of the correct type (bytecode verification having already been performed).

Items on the CVM execution stack are a uniform size of one word: 64-bit and larger types straddle multiple words. The CVM conversion process takes care of laying out the stack according to the types of local variables and stack items.

This isn't necessarily a new approach ­ it is normally known as "threaded interpretation" in the Forth community.

The complete list of CVM instructions is given in Listing 3.

Ramping Up
Conventional wisdom says that one should write a handcrafted assembly code loop to get a fast interpreter. However, there are some simple tricks that can be used to speed up an interpreter, even in C code.

  1. Register variables
  2. Computed gotos
  3. Direct threading
  4. CPU-specific unrolling
C compilers aren't terribly good at determining which values are most used in switch-loop interpreter code. The compiler invariably guesses wrong, favoring temporaries over important variables like the program counter and stack pointer. So it is necessary to "help" the compiler a little.

The gcc compiler can bind variables to explicit registers, as follows:

register unsigned char *pc __asm__ ("esi");

We placed the program counter, the top-of-stack pointer, and the frame pointer into x86 registers. This produced a significant improvement in performance compared to straight C code, for such a small change.

The next step was to change from switch statements to using computed gotos. This is normally referred to as token threading. The break at the end of each case is replaced with a goto statement:

goto *main_label_table[*pc];

The main_label_table contains pointers to each of the cases in the switch statement, allowing the interpreter to jump directly to the next case, avoiding the overhead of jumping back to the top of the switch-loop. More information on computed gotos can be found in the gcc documentation.

The result of these two changes (use of explicit registers and token threading) was an interpreter that was so close to a handcrafted assembly loop that there was little point writing one by hand.

The third step involved a change in representation. The switch loop and token-threaded versions select instruction handlers based on CVM bytecode. Instead of storing the single-byte opcodes, we can store the actual addresses of the opcode handlers in the CVM instruction stream. This is known as direct threading.

goto **((void **)pc);

Direct threading increases the size of the CVM code by a factor of 4, because instructions are now pointers to handlers rather than bytecodes. But it avoids the overhead of looking up values in main_label_table. On RISC platforms, this can give a significant increase in engine performance, but on x86 it isn't too impressive. If memory is an issue, token threading gives better results.

Unrolling
Direct threading really shines when combined with some native JIT techniques. We implemented a "mini JIT" that converted simple CVM instruction sequences into x86 machine code on the fly. We call this an unroller because it essentially unrolls the interpreter loop into straight-line machine code.

The unroller uses simple register allocation techniques on the basic blocks of a method. Complex instructions, especially those involving method calls, are not unrolled. It isn't possible for unrolling to achieve the same performance as a JIT, but it can get very close.

The primary advantage of the unroller compared to a JIT is that it is vastly simpler to implement. Portable.NET's x86 unroller took about two weeks to write, and we expect that other CPUs would require a similar amount of effort.

Anything that is too complicated to convert is replaced with a jump back into the interpreter core. This allows unrollers to be developed in stages, replacing one instruction at a time and then retesting. This made development a lot easier than the "all or nothing" approach required for a JIT.

PInvoke
The "platform invoke" (or PInvoke) feature is a very powerful mechanism that CLI programs can use to call legacy native code.

When Portable.NET encounters a PInvoke method reference, it compiles a small CVM stub that performs any necessary parameter marshaling and then calls the underlying native function. Upon return, the CVM stub demarshals the return value.

A similar process is used for "internalcall" methods within the runtime engine that implement built-in features for the C# class library.

Using CVM to perform marshaling operations simplifies native function invocations quite considerably. Only a small amount of platform-specific code is needed to perform the native call, for which we use the standard "libffi" library.

Inlining
Method call overhead is an issue for all interpreter-based abstract machines because method calls are more complicated than the equivalent native code.

CVM addresses this by selectively inlining some of the more commonly used methods in the C# class library. The method call is replaced with a special-purpose opcode during code conversion.

The major groups of inlineable methods within Portable.NET are currently the string, monitor, and 2D array operations.

Inlining common methods can have a dramatic impact on performance. The PNetMark "Float" benchmark improved by a factor of 12 when 2D array operations were inlined. Such operations are normally very expensive in CLRs because a method must be called for every element get or set operation.

Alternate Back Ends
The construction of the interpreter itself was made as modular as possible. The interface between the metadata handling and the execution was kept separate using a standard interface: the coder API.

Coders are an interface between the CIL front end and the execution engine. The design allows for the current CVM back end to be replaced by a fully fledged JIT without any modification to the other components in the system. The same front end could be used with the CVM engine, a native JIT, or even a polymorphic interpreter.

Performance Summary
Table 1 compares the CVM interpreter variants with the Mint polymorphic interpreter and the Mono x86 JIT. All tests were done on an 866 MHz Pentium III, running Red Hat Linux 7.1. Version 0.17 of Mono and version 0.5.0 of Portable.NET were used for these comparisons.

As can be seen, the simple techniques described in this article produce very good results, with the unrolled version achieving 71% overall compared to Mono's JIT.

Future Work
Portable.NET's interpreter remains a work in progress. More optimizations are possible by introducing new CVM instructions for special cases.

We are also investigating selective inlining as an alternative to writing a handcrafted unroller for each CPU. The authors of that paper also reported performance of up to 70% of optimized C code using simple techniques. Their engine, for Objective Caml, has a single line of platform-specific code, to perform a flush of the CPU's instruction cache. Selective inlining doesn't work very well for x86, but it should do well for RISC CPUs like the PowerPC.

In the near future, we will investigate fully fledged JIT coders for Portable.NET, as well as front ends for other instruction sets such as the Java Virtual Machine.

Conclusion
Using a variety of well-known, yet simple, techniques, Portable.NET is able to achieve adequate performance for most application-oriented tasks.

At the same time, the code is highly portable. Ports to new platforms take a matter of days, sometimes hours (e.g., the author ported the code to Mac OSX in a single day).

Acknowledgments
Portable.NET would not have been possible without the generous assistance of volunteers from the DotGNU community.

References

  • Southern Storm Software: www.southern-storm.com.au/portable_net.html
  • Common Language Infrastructure, Partitions I to IV, ECMA 335: ftp://ftp.ecma.ch/ecma-st/Ecma-335-part-i-iv.pdf
  • The Mono Project: www.go-mono.com
  • Lindholm, T. and Yellin. F, (1999). The Java Virtual Machine Specification, Second Edition. Addison-Wesley.
  • A Portable Forth Engine: www.complang.tuwien.ac.at/forth/threaded-code
  • gcc documentation: http://gcc.gnu.org/onlinedocs/gcc-3.2.1/ gcc/Labels-as-Values.html
  • Puimarta, I. and Riccardi, F. "Optimizing direct threaded code by selective inlining," SIGPLAN '98, pages 291-300, ACM Press.
  • DotGNU: www.dotgnu.org

    Copyright ©2003 Rhys Weatherley and Gopal V.  Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    About Rhys Weatherley
    Rhys Weatherley is a member of the DotGNU Steering Committee and the primary author of Portable.NET. He graduated from the University of Queensland in 1990 and has worked at a number of universities and companies in Australia and the U.S. Gopal V is a final-year computer science student in India and has been involved with DotGNU since August 2001. He's the pnetlib lead developer and has worked on the compiler and runtime as well. Gopal's current pet project is the Java language front-end for the compiler.gopalv82@symonds.net

    About Gopal V
    Gopal V is a final-year computer science student in India and has been involved with DotGNU since August 2001. He's the pnetlib lead developer and has worked on the compiler and runtime as well. Gopal's current pet project is the Java language front-end for the compiler.

  • MICROSOFT .NET LATEST STORIES
    At the end of this month, at its Professional Developers Conference, Microsoft intends to unveil something that CEO Steve Ballmer yesterday referred to as "Windows Cloud." Ballmer explained that "Just like Windows Server looked a lot like Windows but with new properties, new characteri...
    Come see a no-slides, code-only presentation that starts with a blank directory and builds a data-driven, AJAX enabled, ASP.NET web application from scratch that implements common AJAX patterns with the rich set of AJAX Control Toolkit, accesses data with LINQ, and implements standards...
    GigaSpaces Technologies and GoGrid have announced the availability of the GigaSpaces eXtreme Application Platform (XAP) on GoGrid's enterprise-grade cloud computing service for Windows and Linux. The two companies’ joint offering enables enterprises to migrate existing and new Java, ...
    Many of today's (and tomorrow’s) development projects lend themselves nicely to RIA application patterns. Silverlight offers a compelling RIA development experience that works on Linux, the Mac and windows as well as all major browsers. With HD video, vector based graphics and a rich...
    With all of the hype surrounding Cloud computing, Microsoft's upcoming Cloud OS and current efforts around Live Mesh, I thought I would take a trip on the WABAC machine to look at where it all started. Back when I was in junior high school, the best type of connectivity that I could ho...
    SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS
    SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
    Click to Add our RSS Feeds to the Service of Your Choice:
    Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
    myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
    Publish Your Article! Please send it to editorial(at)sys-con.com!

    Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


    SYS-CON FEATURED WHITEPAPERS

    ADS BY GOOGLE
    BREAKING NEWS FROM THE WIRES
    Slalom Consulting, ranked one of the best consulting firms to work for in the United States by Consu...