Today I start designing and writing the expression evaluator for my assembler, Odin. The expression evaluator must handle symbols, 16-bit values, characters, the $ pseudo-symbol, binary operators (such as +, <<, % etc), and unary operators. It must also handle operator precedence rules and parentheses.
The good news for me is that I’ve done this before in C++ for my Spectrum emulator’s built-in assembler. The bad news is that I’ve never done it in Z80 machine code.
The expression evaluator can be split into 3 distinct parts:
- The syntax checker.
- The conversion to reverse polish notation.
- The stack-based evaluator.
I will go over them separately in the following sections.
The Syntax Checker
The job of the syntax checker is to go through the token input stream (i.e. the various parts of the expression) and confirm it’s a valid expression. For example, this is a valid expression:
(4 + 5) * 6
But this one is not:
This is solved using a state machine described by this diagram:
You start at node 0, and for each token you read in from the input, you travel along a line to a new node and push the token on a FIFO queue. If you cannot find a valid line to leave your current node OR after reading the whole token stream you do not end on the END node, it’s a syntax error.
While processing the parentheses, a variable (named x in the above diagram) keeps track of the parentheses depth. Every time an open parentheses is found, it’s incremented. Every time a close parenthese is found, it’s decremented. If the value x falls below 0, a syntax error occurs.
A little explanation is required for the line leaving node 1 and entering the END node. If an expression terminator occurs (like a new line or end-of-file state) then we’ve completed the expression. But in Z80 assembly, a close parentheses is a terminating character too, if and only if x is equal to 0 (i.e. before this parentheses all previous parentheses are balanced). This is to handle expressions like (IX+(4*5)) where the (4*5) is the expression, but the final parentheses is not.
This should not be hard at all to implement in Z80.
Conversion to Reverse Polish Notation
The problem with normal expressions, like 4+5*6 is that there’s ambiguity and complexity in parsing when you want operator precedence. If this expression was read strictly left to right, the answer would be 54. However, we’ve known from school that you handle the multiplies first before the additions. The accepted result should be 34.
There is a notation called Reverse Polish Notation (or RPN) that allows us to describe an expression that has no ambiguity and no need for operator precedence, where you can read it strictly from left to right. If we can get the expression in RPN form, calculation of the final expression is much easier.
If we were to show 4+5*6 without any operator precedence, the RPN form would be:
4 5 + 6 *
With RPN, you read it from left to right and every time you see a value, you push it on to a stack. Everytime you see an operator, you apply it to the values on the stack. If it’s a unary operator, you apply it to the top value on the stack. If it’s a binary operator, you apply it to the top two values on the stack. In the above example, 4 and 5 will be pushed on to the stack, and the + operator will pop those values off it and push the result back on. This means that 9 will be on top the stack. Then, 6 will be pushed, and finally * is applied to 9 and 6. The final value on the stack if 54.
For the expression 4+5*6 with full operator precedence rules, the resulting RPN form is:
4 5 6 * +
Here, all the numbers are push on the stack first. The operator * is then applied to the top two values, i.e. 5 and 6 and pushes 30 back on it. Then the + operator is applied to 4 and 30, and the final answer is 34 as expected.
So how do we convert an operator precedence-riddled expression into a more pure and simple RPN form? There is a really cool algorithm that allows you to do that that is designed by Mr. Goto-Is-Bad himself, Edsger Dijkstra. He designed an algorithm that takes an expression and an operator precedence table and outputs a RPN expression. This algorithm is called the Shunting Yard Algorithm. It’s a wonderfully concise algorithm that not only handles operator precedence but handles whether the operator is left or right-associated. One little detail is that to handle unary operators, the operator precedence is set to the highest (in this algorithm level 0), and given right-associativity.
I was going to describe the algorithm here but honestly, I wouldn’t do a better job than the Wikipedia link above.
The Stack-Based Evaluator
The final part is taking that RPN form and doing the actual calculation. The solution is to run the RPN through a stack-based calculator and the final result will be left on the top of the stack. This is guaranteed due to the syntax checker in the first step. Every number in the RPN form is pushed to a stack, and every operator pops its parameters off this stack and pushes the result. Odin will calculate this using 32-bit maths and then truncate to 16-bit, handling signed values if the expression evaluator is required to return them. For example, an expression following a JR opcode will require a signed value in the range of -128 to 127. An expression in the LD BC,(nnnn) opcode will require an unsigned value in the range of 0 to 65535. By using 32-bits, Odin can detect overflows and out of range values.
I hope this article has given some insight in how an expression is evaluated in assemblers and compilers. Most of the assemblers on the original Spectrum used very simple expression evaluators and some did not handle operator precedence. I wanted something more for Odin.
Wish me luck in writing this in Z80 machine code!