How To Create An Interpreter Of Bytecodes In Fifteen Minutes
Virtual computer programming languages in recent decades have become very widespread. A lot of time has passed since the presentation of the Java Virtual Machine in the second half of the 90s, and it can be said with confidence that bytecode interpreters are not the future, but the present.
But this technique, in my opinion, is almost universal, and an understanding of the basic principles of interpreter development will be useful not only to the creator of the next contender for the title “Language of the Year” according to TIOBE but to any programmer in general.
In a word, if you are interested in knowing how numbers add up to our favorite programming languages, what the developers of virtual machines are still arguing about and how painlessly to match strings and regular expressions.
Prehistory
One of the self-written systems of the Business Intelligence department of our company has an interface in the form of a simple query language. In the first version of the system, this language was interpreted on the fly, without compilation, directly from the input string with the query. The second version of the parser will work with intermediate byte-code, which will allow to separate the language of requests from their execution and greatly simplify the code.
In the process of working on the second version of the system, I had a vacation, during which for an hour or two I was distracted every day from family matters to study materials on the architecture and performance of bytecode interpreters.
The first one presents five small (up to a hundred lines of simple C code) virtual machines, each of which reveals a certain aspect of the development of such interpreters.
Where did the bytecodes come from in programming languages?
Virtual machines, a wide variety of virtual instruction sets over the past several decades have been invented a great many. Wikipedia claims that the first programming languages were compiled into various simplified intermediate representations as early as the 1960s. Some of these first byte codes were converted to machine codes and executed by real processors, others were interpreted by virtual processors on the fly.
The popularity of virtual instruction sets as an intermediate code representation is explained by three reasons:
- Programs in the form of bytecodes are easily transferred to new platforms.
- Bytecode interpreters are faster than syntax tree interpreters.
- You can develop a simple virtual machine in just a couple of hours.
Let’s make some simplest C virtual machines and use these examples to highlight the main technical aspects of the implementation of virtual machines.
Full sample codes are posted on GitHub. Examples can be collected with any relatively fresh GCC:
gcc interpreter-basic-switch.c -o interpreter ./interpreter
All examples have the same structure: first comes the code of the virtual machine itself, after – the main function with assertions that check the code operation. I tried to clearly comment on opcodes and key interpreter locations. I hope the article will be clear even to people who do not write daily in C.
The world’s easiest bytecode interpreter
As I said before, the simplest interpreter is very easy to make. Comments – right after the listing, but let’s start directly with the code:
struct { uint8_t *ip; uint64_t accumulator; } vm; typedef enum { /* increment the register */ OP_INC, /* decrement the register */ OP_DEC, /* stop execution */ OP_DONE } opcode; typedef enum interpret_result { SUCCESS, ERROR_UNKNOWN_OPCODE, } interpret_result; void vm_reset(void) { puts("Reset vm state"); vm = (typeof(vm)) { NULL }; } interpret_result vm_interpret(uint8_t *bytecode) { vm_reset(); puts("Start interpreting"); vm.ip = bytecode; for (;;) { uint8_t instruction = *vm.ip++; switch (instruction) { case OP_INC: { vm.accumulator++; break; } case OP_DEC: { vm.accumulator--; break; } case OP_DONE: { return SUCCESS; } default: return ERROR_UNKNOWN_OPCODE; } } return SUCCESS; }
There are less than one hundred lines, but all the characteristic attributes of a virtual machine are represented. The machine has a single register (vm.accumulator), three operations (register increment, register decrement, and program execution completion) and a pointer to the current instruction (vm.ip).
- Each operation (eng. Operation code, or opcode) is encoded in one byte, and dispatching is performed using a normal switch in the vm_interpret function. The branches in the switch contain the logic of operations, that is, they change the state of the register or complete the execution of the program.
The operations are transferred to the vm_interpret function as an array of bytes – byte-code (bytecode) – and are sequentially executed until the virtual machine shutdown operation (OP_DONE) is encountered in the byte-code.
A key aspect of a virtual machine is semantics, that is, a set of operations that are possible on it. In this case, there are only two operations, and they change the value of a single register.
- Some researchers (Virtual-machine Abstraction and Optimization Techniques, 2009) propose to divide virtual machines into high-level and low-level ones in the proximity of the semantics of a virtual machine to the semantics of the physical machine on which the bytecode will be executed.
In the extreme case, the bytecode of low-level virtual machines can completely repeat the machine code of a physical machine with simulated RAM, a full set of registers, instructions for working with the stack, and so on. The Bochs virtual machine, for example, repeats the set of instructions for the x86 architecture.
And vice versa: the operations of high-level virtual machines closely reflect the semantics of a specialized programming language compiled into byte-code. This is how SQLite, Gawk, and numerous versions of Prolog work, for example.
- An intermediate position is occupied by interpreters of general-purpose programming languages, having elements of both high and low levels. In the most popular Java Virtual Machine there are both low-level instructions for working with the stack, and built-in support for object-oriented programming with automatic memory allocation.
The code above is more likely to be the most primitive of low-level virtual machines: each virtual instruction is a wrapper over one or two physical instructions, while the virtual register fully corresponds to one register of the “iron” processor.
Arguments of instructions in bytecode
We can say that the only case in our example of a virtual machine is both the argument and the return value of all the instructions being executed. However, we may need the ability to pass arguments to instructions. One way is to directly put them into bytecode.
We extend the example by inserting instructions (OP_ADDI, OP_SUBI), which take an argument in the form of a byte, immediately following the opcode:
struct { uint8_t *ip; uint64_t accumulator; } vm; typedef enum { /* increment the register */ OP_INC, /* decrement the register */ OP_DEC, /* add the immediate argument to the register */ OP_ADDI, /* subtract the immediate argument from the register */ OP_SUBI, /* stop execution */ OP_DONE } opcode; typedef enum interpret_result { SUCCESS, ERROR_UNKNOWN_OPCODE, } interpret_result; void vm_reset(void) { puts("Reset vm state"); vm = (typeof(vm)) { NULL }; } interpret_result vm_interpret(uint8_t *bytecode) { vm_reset(); puts("Start interpreting"); vm.ip = bytecode; for (;;) { uint8_t instruction = *vm.ip++; switch (instruction) { case OP_INC: { vm.accumulator++; break; } case OP_DEC: { vm.accumulator--; break; } case OP_ADDI: { /* get the argument */ uint8_t arg = *vm.ip++; vm.accumulator += arg; break; } case OP_SUBI: { /* get the argument */ uint8_t arg = *vm.ip++; vm.accumulator -= arg; break; } case OP_DONE: { return SUCCESS; } default: return ERROR_UNKNOWN_OPCODE; } } return SUCCESS; }
New instructions (see the vm_interpret function) read their argument from bytecode and add it to the register/subtract it from the register.
Such an argument is called an immediate argument (the immediate argument) since it is located directly in the array of opcodes. The main limitation of our implementation is that the argument is a single byte and can only take 256 values.
In our virtual machine, the range of possible values of the arguments of instructions does not play a big role. But if the virtual machine will be used as an interpreter of a real language, it makes sense to complicate the bytecode by adding a table of constants separate from the opcode array and instructions with a direct argument corresponding to the address of the present argument in the table of constants.
Stack machine
Instructions in our simple virtual machine always work with one register and cannot transfer data to each other in any way. In addition, the argument of the instruction can only be immediate, but, say, the operation of addition or multiplication takes two arguments.
Simply speaking, we have no way to evaluate complex expressions. To solve this problem, you need a stack machine, that is, a virtual machine with a built-in stack:
#define STACK_MAX 256 struct { uint8_t *ip; /* Fixed-size stack */ uint64_t stack[STACK_MAX]; uint64_t *stack_top; /* A single register containing the result */ uint64_t result; } vm; typedef enum { /* push the immediate argument onto the stack */ OP_PUSHI, /* pop 2 values from the stack, add and push the result onto the stack */ OP_ADD, /* pop 2 values from the stack, subtract and push the result onto the stack */ OP_SUB, /* pop 2 values from the stack, divide and push the result onto the stack */ OP_DIV, /* pop 2 values from the stack, multiply and push the result onto the stack */ OP_MUL, /* pop the top of the stack and set it as execution result */ OP_POP_RES, /* stop execution */ OP_DONE, } opcode; typedef enum interpret_result { SUCCESS, ERROR_DIVISION_BY_ZERO, ERROR_UNKNOWN_OPCODE, } interpret_result; void vm_reset(void) { puts("Reset vm state"); vm = (typeof(vm)) { NULL }; vm.stack_top = vm.stack; } void vm_stack_push(uint64_t value) { *vm.stack_top = value; vm.stack_top++; } uint64_t vm_stack_pop(void) { vm.stack_top--; return *vm.stack_top; } interpret_result vm_interpret(uint8_t *bytecode) { vm_reset(); puts("Start interpreting"); vm.ip = bytecode; for (;;) { uint8_t instruction = *vm.ip++; switch (instruction) { case OP_PUSHI: { /* get the argument, push it onto stack */ uint8_t arg = *vm.ip++; vm_stack_push(arg); break; } case OP_ADD: { /* Pop 2 values, add 'em, push the result back to the stack */ uint64_t arg_right = vm_stack_pop(); uint64_t arg_left = vm_stack_pop(); uint64_t res = arg_left + arg_right; vm_stack_push(res); break; } case OP_SUB: { /* Pop 2 values, subtract 'em, push the result back to the stack */ uint64_t arg_right = vm_stack_pop(); uint64_t arg_left = vm_stack_pop(); uint64_t res = arg_left - arg_right; vm_stack_push(res); break; } case OP_DIV: { /* Pop 2 values, divide 'em, push the result back to the stack */ uint64_t arg_right = vm_stack_pop(); /* Don't forget to handle the div by zero error */ if (arg_right == 0) return ERROR_DIVISION_BY_ZERO; uint64_t arg_left = vm_stack_pop(); uint64_t res = arg_left / arg_right; vm_stack_push(res); break; } case OP_MUL: { /* Pop 2 values, multiply 'em, push the result back to the stack */ uint64_t arg_right = vm_stack_pop(); uint64_t arg_left = vm_stack_pop(); uint64_t res = arg_left * arg_right; vm_stack_push(res); break; } case OP_POP_RES: { /* Pop the top of the stack, set it as a result value */ uint64_t res = vm_stack_pop(); vm.result = res; break; } case OP_DONE: { return SUCCESS; } default: return ERROR_UNKNOWN_OPCODE; } } return SUCCESS; }
In this example, there are already more operations, and almost all of them work only with the stack. OP_PUSHI pushes its immediate argument onto the stack. The instructions OP_ADD, OP_SUB, OP_DIV, OP_MUL extract a pair of values from the stack, calculate the result and push it back onto the stack. OP_POP_RES removes a value from the stack and places it in the result register, intended for the results of the virtual machine.
For the operation of division (OP_DIV), the error of division by zero is trapped, which stops the operation of the virtual machine.
The capabilities of such a machine are much wider than the previous one with a single register and allow, for example, to calculate complex arithmetic expressions. Another (and important!) Advantage is the simplicity of compiling programming languages into a byte-code stack machine.
Register machine
Thanks to its simplicity, stack virtual machines are the most widely used among developers of programming languages; the same JVM and Python VMs use them.
However, such machines have drawbacks: they have to add special instructions for working with the stack, when calculating expressions all arguments repeatedly pass through a single data structure, a lot of unnecessary instructions appear in the stack code.
Meanwhile, the implementation of each extra instruction entails the cost of scheduling, that is, decoding the opcode and the transition to the body of instructions.
An alternative to stack machines is register-based virtual machines. They have a more complicated byte-code: each instruction clearly encoded the number of registers-arguments and the number of the register result. Accordingly, instead of a stack, an extended set of registers is used as a repository of intermediate values.
#define REGISTER_NUM 16 struct { uint16_t *ip; /* Register array */ uint64_t reg[REGISTER_NUM]; /* A single register containing the result */ uint64_t result; } vm; typedef enum { /* Load an immediate value into r0 */ OP_LOADI, /* Add values in r0,r1 registers and put them into r2 */ OP_ADD, /* Subtract values in r0,r1 registers and put them into r2 */ OP_SUB, /* Divide values in r0,r1 registers and put them into r2 */ OP_DIV, /* Multiply values in r0,r1 registers and put them into r2 */ OP_MUL, /* Move a value from r0 register into the result register */ OP_MOV_RES, /* stop execution */ OP_DONE, } opcode; typedef enum interpret_result { SUCCESS, ERROR_DIVISION_BY_ZERO, ERROR_UNKNOWN_OPCODE, } interpret_result; void vm_reset(void) { puts("Reset vm state"); vm = (typeof(vm)) { NULL }; } void decode(uint16_t instruction, uint8_t *op, uint8_t *reg0, uint8_t *reg1, uint8_t *reg2, uint8_t *imm) { *op = (instruction & 0xF000) >> 12; *reg0 = (instruction & 0x0F00) >> 8; *reg1 = (instruction & 0x00F0) >> 4; *reg2 = (instruction & 0x000F); *imm = (instruction & 0x00FF); } interpret_result vm_interpret(uint16_t *bytecode) { vm_reset(); puts("Start interpreting"); vm.ip = bytecode; uint8_t op, r0, r1, r2, immediate; for (;;) { /* fetch the instruction */ uint16_t instruction = *vm.ip++; /* decode it */ decode(instruction, &op, &r0, &r1, &r2, &immediate); /* dispatch */ switch (op) { case OP_LOADI: { vm.reg[r0] = immediate; break; } case OP_ADD: { vm.reg[r2] = vm.reg[r0] + vm.reg[r1]; break; } case OP_SUB: { vm.reg[r2] = vm.reg[r0] - vm.reg[r1]; break; } case OP_DIV: { /* Don't forget to handle the div by zero error */ if (vm.reg[r1] == 0) return ERROR_DIVISION_BY_ZERO; vm.reg[r2] = vm.reg[r0] / vm.reg[r1]; break; } case OP_MUL: { vm.reg[r2] = vm.reg[r0] * vm.reg[r1]; break; } case OP_MOV_RES: { vm.result = vm.reg[r0]; break; } case OP_DONE: { return SUCCESS; } default: return ERROR_UNKNOWN_OPCODE; } } return SUCCESS; }
The example shows a register machine with 16 registers. The instructions take 16 bits each and are encoded in three ways:
- 4 bits per opcode + 4 bits per register name + 8 bits per argument.
- 4 bits for the operation code + three times 4 bits for the names of the registers.
- 4 bits per opcode + 4 bits for a single register name + 8 unused bits.
Our small virtual machine has very few operations, so four bits (or 16 possible operations) per opcode are enough. The operation determines exactly what the remaining bits of the instruction represent.
The first type of coding (4 + 4 + 8) is needed to load data into the registers with the OP_LOADI operation. The second type (4 + 4 + 4 + 4) is used for arithmetic operations, which should know where to take a couple of arguments and where to add the result of the calculation. And finally, the last view (4 + 4 + 8 unnecessary bits) is used for instructions with a single register as an argument, in our case it is OP_MOV_RES.
For encoding and decoding instructions, special logic is now needed (the decode function). On the other hand, the logic of instructions due to the explicit indication of the location of the arguments becomes easier – operations with the stack disappear.
The key features are: there are fewer instructions in the bytecode of the register machines, separate instructions are wider, compilation into such bytecode is more complicated – the compiler has to decide how to use the available registers.
It should be noted that in practice, register virtual machines usually have a stack where, for example, function arguments are placed; registers are used to evaluate individual expressions. Even if there is no explicit stack, then the array is used to build the stack, which plays the same role as the RAM in physical machines.
The comparison of the stack and register machines
There is an interesting study (Virtual machine showdown: Stack versus registers, 2008) that has had a great influence on all subsequent developments in the field of virtual machines for programming languages. Its authors proposed a method of live translation from the stack code of the standard JVM to the registration code and compared the performance.
The method is non-trivial: the code is first translated and then optimized in a rather complicated way. But a subsequent comparison of the performance of the same program showed that the additional processor cycles spent on decoding instructions were fully compensated by a decrease in the total number of instructions. In general, in short, the register machine turned out to be more efficient than the stack one.
- As mentioned above, this efficiency has a very tangible price: the compiler must allocate the registers itself and additionally it is desirable to have a developed optimizer.
The debate about which architecture is better is still not over. If we talk about Java compilers, the Dalvik VM bytecode, which until recently worked in every Android device, was registered; but the titular JVM kept the stack instruction set. The Lua virtual machine uses the register machine, but the Python VM is still a stack. And so on.
Bytecode in regular expression interpreters
Finally, to distract from low-level virtual machines, let’s look at a specialized interpreter that checks strings for regular expression match:
typedef enum { /* match a single char to an immediate argument from the string and advance ip and cp, or * abort*/ OP_CHAR, /* jump to and match either left expression or the right one, abort if nothing matches*/ OP_OR, /* do an absolute jump to an offset in the immediate argument */ OP_JUMP, /* stop execution and report a successful match */ OP_MATCH, } opcode; typedef enum match_result { MATCH_OK, MATCH_FAIL, MATCH_ERROR, } match_result; match_result vm_match_recur(uint8_t *bytecode, uint8_t *ip, char *sp) { for (;;) { uint8_t instruction = *ip++; switch (instruction) { case OP_CHAR:{ char cur_c = *sp; char arg_c = (char)*ip ; /* no match? FAILed to match */ if (arg_c != cur_c) return MATCH_FAIL; /* advance both current instruction and character pointers */ ip++; sp++; continue; } case OP_JUMP:{ /* read the offset and jump to the instruction */ uint8_t offset = *ip; ip = bytecode + offset; continue; } case OP_OR:{ /* get both branch offsets */ uint8_t left_offset = *ip++; uint8_t right_offset = *ip; /* check if following the first offset get a match */ uint8_t *left_ip = bytecode + left_offset; if (vm_match_recur(bytecode, left_ip, sp) == MATCH_OK) return MATCH_OK; /* no match? Check the second branch */ ip = bytecode + right_offset; continue; } case OP_MATCH:{ /* success */ return MATCH_OK; } } return MATCH_ERROR; } } match_result vm_match(uint8_t *bytecode, char *str) { printf("Start matching a string: %s\n", str); return vm_match_recur(bytecode, bytecode, str); }
The main instruction is OP_CHAR. It takes its immediate argument and compares it with the current character in the string (char * sp). In case of coincidence of the expected and current characters in the line, the transition to the next instruction and the next character takes place.
The machine also understands the transition operation (OP_JUMP), which takes a single immediate argument. Argument means absolute offset in bytecode, from where the calculation should be continued.
The last important operation is OP_OR. It takes two offsets, trying to apply the code first on the first one, then, in case of an error, the second one. It does this with the help of a recursive call, that is, the instruction makes a crawl into the depth of the tree of all possible variants of a regular expression.
Surprisingly, four opcodes and seventy lines of code are enough to express regular expressions of the form “abc”, “a? Bc”, “(ab | bc) d”, “a * bc”. In this virtual machine, there is not even an explicit state, since all that is needed – pointers to the beginning of the instruction stream, the current instruction, and the current symbol – is passed in by the arguments of the recursive function.
If you are interested in the details of the work of regular expression engines, for a start you can familiarize yourself with the series of articles by Russ Cox (Google Co2, author Russ Cox), the author of the regular expression engine from Google RE2.
Conclusion
For general purpose programming languages, as a rule, two architectures are used: stack and register.
In the stack model, the main data structure and the way to transfer arguments between instructions is the stack. In the register model, a set of registers is used to evaluate expressions, but an explicit or implicit stack is still used to store the function arguments.
- The presence of an explicit stack and a set of registers brings such machines to low-level and even physical ones. The abundance of low-level instructions in such a bytecode means that a significant amount of physical processor resources is spent on decoding and dispatching virtual instructions.
On the other hand, high-level instructions play an important role in popular virtual machines. In Java, for example, these are instructions for polymorphic function calls, allocation of objects, and garbage collection.
Purely high-level virtual machines — for example, interpreters of byte-codes of languages with advanced and far from iron semantics — spend most of their time not in the controller or decoder, but in the instructions bodies and, accordingly, are relatively effective.
Practical recommendations:
- If you need to execute any byte-code and do it in a reasonable time, then try to operate with instructions that are closest to your task; the higher the semantic level, the better. This will reduce the cost of scheduling and simplify code generation.
- If greater flexibility and heterogeneous semantics were required, then you should at least try to isolate the common denominator in the byte-code so that the resulting instructions are conditionally average.
- If in the future it may be necessary to calculate any expressions, make a stack machine, this will reduce a headache when compiling bytecode.
- If no expressions are foreseen, then make a trivial register machine, which will avoid the cost of the stack and simplify the instructions themselves.