|
YOUR FEEDBACK
|
TOP MICROSOFT .NET LINKS Interpretation Design of the Portable.NET Interpreter
Design of the Portable.NET Interpreter
By: Rhys Weatherley; Gopal V
Feb. 26, 2003 12:00 AM
Portable.NET is an implementation of the Common Language Infrastructure (CLI). Its primary design goal is portability to as many platforms as possible, which it achieves through the use of interpretation rather than just-in-time compilation. The bytecode format of the CLI presents some challenges to efficient interpreter implementation. Rather than directly interpret, we translate the bytecode into a simpler abstract machine, the Converted Virtual Machine (CVM). This machine is then interpreted using a high-performance engine. Traditionally, abstract machines have used the same bytecode representation "on the wire" as for execution. Our work shows that there are definite performance advantages to using different bytecode representations internally and externally.
Introduction Portable.NET consists of three major components to support the CLI: a bytecode-based Common Language Runtime (CLR), a C# compiler, and a C# base class library. This article will concentrate on the runtime engine. The runtime engine achieves portability primarily through the use of interpretation rather than just-in-time compilation. However, the bytecode format of the CLI presents some challenges to efficient interpreter implementation. This article discusses how we have overcome these challenges to build a high-performance interpreter for Common Intermediate Language (CIL) programs. We compare the performance of Portable.NET with that of Mono to demonstrate how our approach differs from direct polymorphic interpretation and full just-in-time compilation.
Polymorphic CIL A unique feature of CIL, compared to other similar abstract machines, is that its instructions are polymorphic. The add instruction can be used on integers, longs, and floating-point values, for example. Other virtual machines, such as the Java Virtual Machine, use separate instructions for each type. The polymorphic nature makes direct interpretation very inefficient, as demonstrated by the fragment from Mono's interpreter shown in Listing 1. (All of the code for this article can be downloaded from www.sys-con.com/dotnet/sourcec.cfm.) The types of all stack values must be tracked explicitly, leading to significant runtime overhead. The challenge for Portable.NET was finding a way to implement a high-performance engine without writing a full just-in-time compiler. The approach we took was very similar to a JIT: the CIL bytecode is translated into instructions for a simpler abstract machine, dubbed CVM (for "Converted Virtual Machine"). The CVM instructions are then interpreted using a high-performance interpreter, written in C. As each method is entered, the following process occurs:
Instead of a single add instruction, the CVM instruction set has several: iadd, ladd, fadd, etc. The conversion process chooses the most appropriate variant based on the operand types reported by the bytecode verifier. Listing 2 shows a simplified form of the CVM interpreter code for the converted instructions. The interpreter executes more efficiently because it can assume that the values on the stack are of the correct type (bytecode verification having already been performed). Items on the CVM execution stack are a uniform size of one word: 64-bit and larger types straddle multiple words. The CVM conversion process takes care of laying out the stack according to the types of local variables and stack items. This isn't necessarily a new approach it is normally known as "threaded interpretation" in the Forth community. The complete list of CVM instructions is given in Listing 3.
Ramping Up
The gcc compiler can bind variables to explicit registers, as follows: register unsigned char *pc __asm__ ("esi"); We placed the program counter, the top-of-stack pointer, and the frame pointer into x86 registers. This produced a significant improvement in performance compared to straight C code, for such a small change. The next step was to change from switch statements to using computed gotos. This is normally referred to as token threading. The break at the end of each case is replaced with a goto statement: goto *main_label_table[*pc]; The main_label_table contains pointers to each of the cases in the switch statement, allowing the interpreter to jump directly to the next case, avoiding the overhead of jumping back to the top of the switch-loop. More information on computed gotos can be found in the gcc documentation. The result of these two changes (use of explicit registers and token threading) was an interpreter that was so close to a handcrafted assembly loop that there was little point writing one by hand. The third step involved a change in representation. The switch loop and token-threaded versions select instruction handlers based on CVM bytecode. Instead of storing the single-byte opcodes, we can store the actual addresses of the opcode handlers in the CVM instruction stream. This is known as direct threading. goto **((void **)pc); Direct threading increases the size of the CVM code by a factor of 4, because instructions are now pointers to handlers rather than bytecodes. But it avoids the overhead of looking up values in main_label_table. On RISC platforms, this can give a significant increase in engine performance, but on x86 it isn't too impressive. If memory is an issue, token threading gives better results.
Unrolling The unroller uses simple register allocation techniques on the basic blocks of a method. Complex instructions, especially those involving method calls, are not unrolled. It isn't possible for unrolling to achieve the same performance as a JIT, but it can get very close. The primary advantage of the unroller compared to a JIT is that it is vastly simpler to implement. Portable.NET's x86 unroller took about two weeks to write, and we expect that other CPUs would require a similar amount of effort. Anything that is too complicated to convert is replaced with a jump back into the interpreter core. This allows unrollers to be developed in stages, replacing one instruction at a time and then retesting. This made development a lot easier than the "all or nothing" approach required for a JIT.
PInvoke When Portable.NET encounters a PInvoke method reference, it compiles a small CVM stub that performs any necessary parameter marshaling and then calls the underlying native function. Upon return, the CVM stub demarshals the return value. A similar process is used for "internalcall" methods within the runtime engine that implement built-in features for the C# class library. Using CVM to perform marshaling operations simplifies native function invocations quite considerably. Only a small amount of platform-specific code is needed to perform the native call, for which we use the standard "libffi" library.
Inlining CVM addresses this by selectively inlining some of the more commonly used methods in the C# class library. The method call is replaced with a special-purpose opcode during code conversion. The major groups of inlineable methods within Portable.NET are currently the string, monitor, and 2D array operations. Inlining common methods can have a dramatic impact on performance. The PNetMark "Float" benchmark improved by a factor of 12 when 2D array operations were inlined. Such operations are normally very expensive in CLRs because a method must be called for every element get or set operation.
Alternate Back Ends Coders are an interface between the CIL front end and the execution engine. The design allows for the current CVM back end to be replaced by a fully fledged JIT without any modification to the other components in the system. The same front end could be used with the CVM engine, a native JIT, or even a polymorphic interpreter.
Performance Summary As can be seen, the simple techniques described in this article produce very good results, with the unrolled version achieving 71% overall compared to Mono's JIT.
Future Work We are also investigating selective inlining as an alternative to writing a handcrafted unroller for each CPU. The authors of that paper also reported performance of up to 70% of optimized C code using simple techniques. Their engine, for Objective Caml, has a single line of platform-specific code, to perform a flush of the CPU's instruction cache. Selective inlining doesn't work very well for x86, but it should do well for RISC CPUs like the PowerPC. In the near future, we will investigate fully fledged JIT coders for Portable.NET, as well as front ends for other instruction sets such as the Java Virtual Machine.
Conclusion At the same time, the code is highly portable. Ports to new platforms take a matter of days, sometimes hours (e.g., the author ported the code to Mac OSX in a single day).
Acknowledgments
References Copyright ©2003 Rhys Weatherley and Gopal V. Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. MICROSOFT .NET LATEST STORIES
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
|
SYS-CON FEATURED WHITEPAPERS MOST READ THIS WEEK BREAKING NEWS FROM THE WIRES
|
|||||||||||||||||||||||||||||||||||