01 - Chunks of Bytecode
In the first chapter, We’ll start by defining the structure of our bytecode, which are the basic instructions we’ll compile source code into and then feed to the virtual machine to be executed. We’ll also write a debugging utility to display bytecode in a human-readable format.
Before continuing, read the start of Chunks of Bytecode, and keep the book handy so you understand what we’re writing.
Getting Started
Now it’s time to write some Zig! Be warned that progress will be slow in this first section as I take my time explaining key Zig features and syntax. Once that’s out of the way, you’ll be comfortable writing Zig and we can speed things up.
Let’s first inspect a “hello, world!” program:
const std = @import("std");
pub fn main() !void {
std.debug.print("Hello, {s}!\n", .{"Interpreters"});
}
First we import the standard library. Zig’s standard library is much fuller than C’s and includes common utilities like a logging interface, dynamic arrays and much more. Zig’s imports are also namespaced, which means we can access the standard library through std.
whereas C only has one global namespace.
The function signature also carries some important information. First, functions must be marked pub
for public if they are used outside the same file. The return type !void
can be read as “any error or void
”. The exclamation mark (!
) is used to denote that an error may be returned. We’ll learn about Zig’s interesting error system soon.
Inside main
, we call std.debug.print
, which prints formatted text a la printf
to the standard error stream. The “{s}” is a placeholder for a string value which is provided in the second argument.
Use zig build run
to make sure everything’s correct:
$ zig build run
Hello, Interpreters!
We won’t be needing an equivalent to common.h
. Zig’s standard library has everything we’ll need.
Chunks of Instructions
The first thing to do is write the structure to store our bytecode, called a “chunk”. We need 3 basic functions:
- Initialize an empty chunkS.
- Add a bytecode instruction.
- Free the chunk’s memory after use.
This is a perfect use case for a dynamic/growable array. I recommend reading through this part of the book to see how a dynamic array is implemented as it’s very much worth understanding how this structure works. Thankfully, we’re spared from writing it ourselves as Zig’s standard library has done the work for us in the form of std.ArrayList
and its relatives.
Let’s wrap an arraylist inside a Zig struct
to store our bytecode.
// Add after main():
const Chunk = struct {
code: std.ArrayList(u8),
pub const OpCode = enum(u8) {
@"return",
};
};
Zig type declarations take the same form as imports with const Name = type;
Our struct has one field: code
of type std.ArrayList(u8).
ArrayList
is a generic type, meaning you can specify the type of the elements it stores. We’re using u8
(an unsigned 8-bit integer, aka a byte) as the element type rather than OpCode
because as we’ll see soon, the code stream can contain bytes other than opcodes.
OpCode
has been declared inside Chunk
, meaning it will be referred to as Chunk.OpCode
. For now we have just one instruction: return
. Because return
is a reserved keyword in Zig, we need to use the @""
syntax so Zig reads it as an identifier.
For initializing a chunk, we’ll use a constant value rather than a function. We’ll call this value empty
for an empty chunk.
// add inside Chunk after OpCode:
pub const empty = Chunk{
.code = .empty,
};
ArrayList
uses the same pattern to initialize an empty arraylist, so we can use the .empty
value (shorthand for ArrayListUmanaged.empty
) as the starting value for an empty chunk.
Adding instructions and freeing the chunk both involve memory management, so it’s time to cover one of Zig’s more prominent features.
The Allocator Interface
In C, managing memory is quite straightforward. You are given malloc()
to obtain a block of memory and free()
to give it back once you’re done with it. How malloc()
and free()
actually obtain and manage these blocks is not our concern. This is very straightforward, but presents 2 problems:
The user has no choice over the memory management strategy which
malloc
andfree
use. Some use cases may benefit from a specialized allocation strategy, but code which usesmalloc
/free
provide no control over this.Memory allocations are hidden, meaning libraries can allocate memory with no indication to the caller and often fail to properly handling allocation failure or allow the caller to handle it themselves.
Zig addresses these problems with the Allocator
interface. This is basically a struct with a set of function pointers for allocating and freeing memory. This object must be passed to any function which performs memory allocation so that the caller can choose which allocation functions the function will use.
Functions which allocate always may fail, and these errors can be propagated to the caller or handled within the function somehow.
You’ll be seeing a lot of Allocator
from here on out, so it’s important to understand why it’s there and what it’s used for.
Let’s add an alias to Allocator
for our convenience.
const std = @import("std");
const Allocator = std.mem.Allocator; // add this line
With that out of the way, let’s create a function to append a byte of code to the chunk.
// add *inside* Chunk after `empty`:
pub fn append(chunk: *Chunk, gpa: Allocator, byte: u8) Allocator.Error!void {
try chunk.code.append(gpa, byte);
}
This is a member function declared inside the struct Chunk
. It takes a pointer to a Chunk
as the first parameter and Zig allows member functions with this signature to be called like a method: e.g chunk.deinit(gpa);
, where chunk
is an existing Chunk
instance which is passed as a pointer in the first argument.
For now it’s just a wrapper around ArrayList
‘s own append
function, but we’ll add a little more functionality nearer the end of the chapter.
It also takes the Allocator
parameter by the name of gpa
which stands for “General Purpose Allocator”. This is shorter than “allocator” and also distinguishes it from an arena allocator, which is used differently.
ArrayList
also has a .deinit()
function for freeing the contents, but since the chunk will be allocating other data apart from the bytecode, it makes sense to write our own member function to free an entire Chunk
.
// Add inside Chunk after `empty`:
// `deinit` is the commonly used name for a freeing function.
pub fn deinit(chunk: *Chunk, gpa: Allocator) void {
chunk.code.deinit(gpa);
}
You may have noticed the Allocator.Error
in the function signature for the append
, along with the try
keyword when calling the append function. This makes now a good time to talk about how error handling is done in Zig.
Errors, Try and Catch
Errors in Zig are defined in types called error sets which are similar to enums. The Allocator.Error
error set contains just one error and looks like this:
// (in std.mem.Allocator)
pub const Error = error {
OutOfMemory,
};
Types which look like Error!SomethingElse
are called error unions, and either contain an error value in the Error
set, or a payload value of type SomethingElse
.
Functions which return an error union can use the try
keyword to unwrap the value of an error union or return it’s error in case of failure. A more flexible way to handle errors is using catch
which allows you to provide an alternative expression which will be evaluated if an error union contains an error.
// Returns in case of an error.
const unwrapped_value = try mayFail();
// Use a default value in case of an error.
const unwrapped_value = mayFail() catch default_value;
// `catch` can capture the value of the error to handle different error cases.
mayFail() catch |err| switch (err) {
error.Foo => return err,
else => @panic("TODO: handle all the errors!"),
}
While the keywords may look familiar, understand that this is not the same as try/catch
in languages with exceptions. Zig does not have exceptions and errors are ordinary values.
There’s a bit more to Zig’s errors than that but this is enough to work with for now. You can always refer to the docs for more detail.
Disassembling Chunks
It’s finally time to put our new type to use in main
!
pub fn main() !void {
var debug_allocator = std.heap.DebugAllocator(.{}).init;
defer _ = debug_allocator.deinit(); // check for memory leaks
const gpa = debug_allocator.allocator();
var chunk = Chunk.empty;
// Try removing this line and running the program. Does Zig catch the leak?
defer chunk.deinit(gpa);
try chunk.append(gpa, @intFromEnum(Chunk.OpCode.@"return"));
}
Since we need an Allocator
to allocate the array, we need to pick a concrete implementation to use. I’ve chosen the standard library’s DebugAllocator
, which can detect and report memory leaks in case we forget to free things. The .allocator();
member function returns the Allocator
interface object which we can then pass to our functions.
This code includes a new keyword: defer
which is used both times we deinit
something. Putting defer
in front of a statement causes the statement to run when the scope exits. This makes freeing resources like memory or files easier to read and write as they only need to be written once just after the resource is created.
Deferred statements are executed in reverse order at the end of the scope, so the chunk will be freed before the allocator is deinitialized, which is what we want. Check out the docs on defer to learn more.
If we try zig build run
now, the code will hopefully run without errors, but nothing much appears to happen. Now we want to be able to display the bytecode in a chunk somehow. This means writing the bytecode disassembler, which will print the code in a human-readable format. Let’s write a dissasembler function to pass our test chunk into.
// add after `Chunk`
pub fn disassembleChunk(
chunk: *const Chunk,
name: []const u8,
writer: *Io.Writer,
) Io.Writer.Error!void {
try writer.print("== {s} ==\n", .{name});
var offset: usize = 0;
while (offset < chunk.code.items.len) {
offset = try disassembleInstruction(chunk, offset, writer);
}
}
This simply loops over the instructions in bytecode, keeping track of the current offset with a usize
- a pointer-sized integer.
This code introduces us to a few more Zig features: strings, the Io.Writer
interface, and while
loops. I’ll assume the while loop is obvious enough and focus on the other two.
Io.Writer
The Io.Writer
interface is similar to the Allocator
one, except this is for outputting bytes instead of allocating them. This means that the caller can choose how and where the output is written to, e.g stdout, stderr, a text file or even a network socket! We’ll make this choice in main
but for now just add an alias for Io
at the top of the file.
const std = @import("std");
const Allocator = std.mem.Allocator;
const Io = std.Io; // new line
Strings and Slices
Zig does not have a dedicated string type. Instead, strings are commonly stored in a type such as []const u8
. The empty square brackets indicate a slice, which is a primitive type for an array with a runtime-known size. The const
means our function can’t modify the contents of the slice and u8
is the element type, which is just a byte here.
The slice is a very useful tool which is included in many other languages. They are used in Zig in favour of C’s null-terminated char *
pointers because of their better ergonomics and for being less error-prone. Read the docs to get a clearer understanding of what slices are or what they’re used for.
Our chunk.code
arraylist exposes it’s elements through a slice called items
. We check against this slice’s length for our while loop.
We’re using a while loop instead of a simple for
because later instructions will have different sizes, so the offset may jump by more than 1 byte each iteration.
Let’s move on to writing the core disassembling logic.
// add after disassembleChunk:
pub fn disassembleInstruction(
chunk: *const Chunk,
offset: usize,
writer: *Io.Writer,
) Io.Writer.Error!usize {
try writer.print("{d:04} ", .{offset});
const instruction: Chunk.OpCode = @enumFromInt(chunk.code.items[offset]);
switch (instruction) {
.@"return" => {
try writer.print("{s}\n", .{@tagName(instruction)});
return offset + 1;
},
}
}
We first print the instruction’s offset, padded to 4 characters, then use a switch
statement to inspect the instruction. We only have one instruction for now so we just print it’s name and return the next offset.
Note that we don’t need a default case for the switch
since Zig can tell that we’ve covered all possibilities. We also won’t be writing a simpleInstruction
function like the book does.
Finally, we’ll apply our disassembler to our little test chunk in main
by first creating a buffered writer to stdout and then calling disassembleChunk
.
try chunk.append(gpa, @intFromEnum(Chunk.OpCode.@"return"));
// add at the end of main:
const stdout_file = std.fs.File.stdout(); // Grab the stdout file handle
var buffer: [1024]u8 = undefined; // Make an intermediate buffer
var stdout_writer: std.fs.File.Writer = stdout_file.writer(&buffer); // Create a writer
const writer: *Io.Writer = &stdout_writer.interface; // Obtain the writer's interface
try disassembleChunk(&chunk, "test chunk", writer);
try writer.flush(); // Don't forget to flush! ;)
The call to disassembleChunk
is straightforward but the meat of this code comes in how we create the Writer
interface to stdout.
File.Writer, Io.Writer and Buffered IO
It’s important to draw the distinction between the File.Writer
and Io.Writer
. The latter is an interface similar to Allocator
while the first is an implementation of that interface. File.Writer
contains a field called interface
which is an Io.Writer
, a pointer to which can be passed to functions which write.
Creating a writer in Zig also means creating a buffer. The purpose of this buffer is to store output temporarily before it get’s passed to the writer’s underlying implementation. This is faster than writing output piecemeal as it reduces the number of calls to the implementation. Buffered IO is a common technique used in almost every language but is usually hidden behind the scenes. I’ve chosen to create a 1KiB buffer and have left it uninitialized with undefined
.
The buffer is stored in the Io.Writer
interface rather than the implementation which is why it is passed as a pointer, since the interface stores state like how full the buffer is.
A call to flush
is required at the end to write out any left over data in the buffer.
Now zig build run
should yield some satisfying output:
$ zig build run
== test chunk ==
0000 return
Constants and Values
Now that we’ve set up the machinery to create and display bytecode chunks, it’s time to add another instruction. The constant
instruction will be used to load a constant value (e.g a literal number in the source code). This also means that constant values will need to be stored in a chunk somehow.
We’ll start for defining a type for values. For now this is just a 64-bit floating-point number, but this will be a more complex type when we add support for different types in the language.
// add after Chunk:
pub const Value = f64;
The Constant Pool
Constant values will be stored in a growable array (called the “constant pool”) just like our bytecode. In the book this involves rewriting a dynamic array, but we already know how to use Zig’s arraylist.
const Chunk = struct {
code: std.ArrayList(OpCode),
constants: std.ArrayList(Value), // new line!
// ...
pub const empty = Chunk{
.code = .empty,
.constants = .empty // new line!
};
We also need to free this arraylist when the chunk is freed:
pub fn deinit(chunk: *Chunk, gpa: Allocator) void {
chunk.code.deinit(gpa);
chunk.constants.deinit(gpa); // new line
}
Constants will be loaded when the instruction to load one is encountered. The next byte after this instruction will be the index of the value in the constant pool. Since we’re only using a single byte to store this index, we’re limited to 2^8 = 256
constants per chunk.
We could append directly to the constants
array to add each constant value but we’ll also have to check if we have too many constants and also save the index of this new constant every time we add one. Let’s write a function inside Chunk
which solves both of those problems.
// add inside Chunk after append:
const AddConstantError = Allocator.Error || error{TooManyConstants};
// Returns error.TooManyConstants if the chunk already has 256 constants.
pub fn addConstant(chunk: *Chunk, gpa: Allocator, value: Value) AddConstantError!u8 {
if (chunk.constants.items.len > std.math.maxInt(u8)) {
return error.TooManyConstants;
}
// We've already made sure the length can fit in a u8.
const new_index: u8 = @intCast(chunk.constants.items.len);
try chunk.constants.append(gpa, value);
return new_index;
}
We first check to make sure the new index will fit in a single byte then we store the new index, add the value to the constant pool and return the index.
This is a good showcase of the aforementioned error system. We’ve created a new error type, which is the union between the Allocator.Error
error set and a new error set containing TooManyConstants
. We combine these sets together with ||
and use it in our function signature.
This way the caller will know why the function failed and can report the reason for the error.
Constant Instructions
We need another variant in our OpCode
enum to represent a “load constant” instruction.
pub const OpCode = enum(u8) {
constant, // new line
@"return",
};
We put this new instruction to use in our test chunk.
// inside main:
defer chunk.deinit(gpa);
// new code start
const constant = try chunk.addConstant(gpa, 1.2);
try chunk.append(gpa, @intFromEnum(Chunk.OpCode.constant));
try chunk.append(gpa, constant);
// new code ends
try chunk.code.append(gpa, @intFromEnum(Chunk.OpCode.@"return"));
All that’s left is to disassemble these constant instructions. Let’s add another case inside disassembleInstruction
.
switch (instruction) {
.@"return" => {
try writer.print("{s}\n", .{@tagName(instruction)});
return offset + 1;
},
// new case:
.constant => {
const constant = chunk.code.items[offset + 1];
const value = chunk.constants.items[constant];
try writer.print(
"{s:<16} {d:04} '{d}'\n",
.{ @tagName(instruction), constant, value },
);
return offset + 2;
},
}
I’ve forgone writing a printValue
function. I’ll do that later when we upgrade our Value
type to support more than 1 type, which will give us a chance to look at Zig’s formatting module! For now, check out the new disassembly, now twice as big!
$ zig build run
== test chunk ==
0000 return
0001 constant 0000 '1.2 '
Line Information
If we want to be able to give helpful error messages when a Lox program produces a runtime error, we need to know the line numbers associated with each instruction. The approach we use or this is very simple, just another arraylist. Each element in the lines
arraylist will give the line number of the corresponding byte in code
. We’ll need to add this field to Chunk
.
const Chunk = struct {
code: std.ArrayList(u8),
lines: std.ArrayList(u32), // new line
constants: std.ArrayList(Value),
// ...
pub const empty = Chunk{
.code = .empty,
.lines = .empty, // new line
.constants = .empty,
};
// ...
pub fn deinit(chunk: *Chunk, gpa: Allocator) void {
chunk.code.deinit(gpa);
chunk.lines.deinit(gpa); // new line
chunk.constants.deinit(gpa);
}
// ...
};
We’ll require a line number to be passed whenever a byte is written to the chunk.
pub fn append(chunk: *Chunk, gpa: Allocator, byte: u8, line: u32) Allocator.Error!void {
try chunk.code.append(gpa, byte);
try chunk.lines.append(gpa, line); // new line (and parameter)
}
Disassembling Line Information
Ok, now we can add a line number to the instructions in our little bytecode program.
// in main, modify 3 lines:
try chunk.append(gpa, @intFromEnum(Chunk.OpCode.constant), 123);
try chunk.append(gpa, constant, 123);
try chunk.append(gpa, @intFromEnum(Chunk.OpCode.@"return"), 123);
Of course, this information isn’t useful apart from displaying it, so let’s do just that in disassembleInstruction
.
// in disassembleInstruction:
try writer.print("{d:04} ", .{offset});
// new code starts
if (offset > 0 and chunk.lines.items[offset] == chunk.lines.items[offset - 1]) {
try writer.print(" | ", .{});
} else {
try writer.print("{d:>4} ", .{chunk.lines.items[offset]});
}
// new code ends
const instruction: Chunk.OpCode = @enumFromInt(chunk.code.items[offset]);
We’re doing a bit of fancy formatting by replacing the line number with a line if it’s the same as the line number of the previous instruction, but hopefully this code will make sense to you by now.
Try zig build run
-ning and, if all is well, you’ll see a beautifully formatted disassembly.
$ zig build run
== test chunk ==
0000 123 constant 0000 '1.2'
0002 | return
That’s all the code we’re writing in this chapter! There is however one last change we should make before moving on. I’ve been adding all the code directly in main.zig
, but we should separate things up before it gets too cluttered.
Organizing Files
Starting with the Value
type which, while pretty unimpressive right now, will grow into a more complicated type with it’s own related functionality. Give it it’s own file at src/values.zig
and cut and paste the single line.
// file: src/values.zig
pub const Value = f64;
Don’t forget to mark the declaration with pub
so that we can use it from other files.
Now the Chunk
struct, which will give me the opportunity to show you one last Zig feature. I promise it’s worth it!
Files are Structs
In Zig, a file is implicitly a struct. What does this mean? Well, you can imagine the top level of a .zig
file as being implicitly wrapped in struct { ... }
. This means that fields can be declared at the top level and the struct defined by the file can be instantiated and used as data.
This is useful for us because we can put our Chunk
type into it’s own file, declaring the fields and declarations at the top level and import it to use within main
. Create a new file in src/Chunk.zig
and copy the contents of the Chunk
struct inside the curly braces into the file.
// file: src/Chunk.zig
const Chunk = @This();
const std = @import("std");
const Allocator = std.mem.Allocator;
const values = @import("values.zig"); // new!
const Value = values.Value; // new!
code: std.ArrayList(u8),
lines: std.ArrayList(u32),
constants: std.ArrayList(Value),
pub const OpCode = enum(u8) {
constant,
@"return",
};
// <other declarations>
}
The necessary imports also have to be copied over, including a new import of Value
. There’s also one additional line: const Chunk = @This()
. Since we’re putting the type in it’s own file, it doesn’t have an identifier to be referred to by. We create this identifier by using the @This()
builtin to capture the surrounding type, which is the file in this case.
The reason we can’t do this for Value
is that it is not and won’t be a struct
.
Finally, the disassembly code. We’ll move this into it’s own file called src/debug.zig
. Create this file and copy the two functions along with their required imports.
// file: src/debug.zig
const std = @import("std");
const Io = std.Io;
const Chunk = @import("Chunk.zig")
// Remember to make these public with `pub`!
pub fn disassembleChunk(...) {
// ...
}
pub fn disassembleInstruction(...) {
// ...
}
Disassembling depends on both Io
and Chunk
which we import. It’s important to mark the functions pub
so that they can be used from other files.
Finally, add imports to these new files in main.zig
, which should now include only the main
function.
const std = @import("std");
const Allocator = std.mem.Allocator;
const Io = std.Io;
const Chunk = @import("Chunk.zig");
const debug = @import("debug.zig");
const disassembleChunk = debug.disassembleChunk;
pub fn main() !void {
// ...
}
Check that everything still works with zig build run
and when you’re ready, move on to the next chapter where we’ll build our own tiny virtual machine to feed some bytecode to.
$ tree src/
src/
├── Chunk.zig
├── debug.zig
├── main.zig
└── values.zig
$ zig build run
== test chunk ==
0000 123 constant 0000 '1.2'
0002 | return