thasso.xyz

Setting up an x86 CPU in 64-bit mode

2024-07-13T00:00:00+00:00

People say there are things that are complex and there are things that are just complicated. Complexity is considered interesting, complicatedness is considered harmful. The process of setting up an x86_64 CPU is mostly complicated.

I’ll describe one way to go from a boot sector loaded by the BIOS with the CPU in 16-bit real mode to the CPU set up in 64-bit long mode. The setup is pretty bare-bones and there’s tons more to do.

To follow along, you need the Intel 64 and IA-32 Architectures Software Developer’s Manual, an assembler (I used nasm), and QEMU. If you don’t have an x86_64 CPU, you should still be able to run everything I describe by emulating an x86 CPU in QEMU. I assume you know x86 assembly and the syntax that nasm uses. I like the nasm tutorial by Ray Toal for getting started.

I was surprised by how readable some of the Intel manual is. The initial chapters in volume 1 do a really good job at providing an overview of the system and explaining the terms used throughout the other volumes. But volume 3: System Programming Guide is most relevant to this discussion. There is an overview of all the operating modes in volume 3, section 2.2 Modes of Operation. The path we’re taking is highlighted in red.

For everything up to 32-bit mode, take a look at “Writing a Simple Operating System – from Scratch”. It’s unfinished but still very good.

Starting Point: BIOS

After a reset, the x86 CPU is in “real mode”. That mode has a default operand size of 16 bits. You get a 20-bit address space and thus the ability to address 1MB of memory by using segmentation. Real mode is pretty much a backward compatibility mode for the Intel 8086 chip from 1978.
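To make the segmentation arithmetic concrete, here is a minimal sketch in C. The function name is made up for this example. In real mode, the CPU shifts the 16-bit segment left by four bits and adds the 16-bit offset to form the address.

```c
#include <stdint.h>

/* Address that a real mode segment:offset pair refers to.
   The segment is multiplied by 16 (shifted left by 4), then the offset
   is added. The function name is hypothetical. */
static uint32_t real_mode_address(uint16_t segment, uint16_t offset) {
    return ((uint32_t)segment << 4) + offset;
}
```

For example, real_mode_address(0x0000, 0x7c00) and real_mode_address(0x07c0, 0x0000) both give 0x7c00, so different segment:offset pairs can name the same byte.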

After the BIOS has done its work, the first code that runs is the code in the boot sector. The BIOS searches the system for a disk whose first sector ends in the magic number 0xaa55 (i.e., the byte 0x55 followed by the byte 0xaa). It loads that “boot sector” into memory at address 0x7c00.
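The signature check itself is tiny. As a rough sketch in C (the function name and the SECTOR_SIZE macro are my own choices for illustration), this is the test a sector has to pass to count as a boot sector:

```c
#include <stdint.h>

#define SECTOR_SIZE 512

/* Returns 1 if the sector ends in the byte 0x55 followed by the byte 0xaa.
   Read as a little-endian 16-bit word, these two bytes are 0xaa55. */
static int has_boot_signature(const uint8_t sector[SECTOR_SIZE]) {
    return sector[SECTOR_SIZE - 2] == 0x55 && sector[SECTOR_SIZE - 1] == 0xaa;
}
```

This can be handy for checking a built image on the host before booting it.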

So the BIOS gives us 512 bytes to work with. We need to use these bytes in order to bootstrap the rest of the bootloader. One can fit a surprising amount of stuff in 512 bytes, but it’s easiest to just load some more data from disk first. Fortunately, routines defined by the BIOS remain available to us as long as we’re in real mode.

Boot Sector Setup

Let’s set up a simple boot sector. It will just print a message to the screen using BIOS routines and then hang. This way, we know that the tooling works.

This is the assembly we need:

;; src/boot_sector.s

    section .boot_sector
    global __start

    [bits 16]

__start:
    mov bx, hello_msg
    call print_string

end:
    hlt
    jmp end

;; Uses the BIOS to print a null-terminated string. The address of the
;; string is found in the bx register.
print_string:
    pusha
    mov ah, 0x0e ; BIOS "display character" function

print_string_loop:
    cmp byte [bx], 0
    je print_string_return

    mov al, [bx]
    int 0x10 ; BIOS video services

    inc bx
    jmp print_string_loop

print_string_return:
    popa
    ret

hello_msg: db "Hello, world!", 0

Plus this Makefile:

# Makefile

.PHONY: all clean boot

NASM := nasm -f elf64

BUILD_DIR := build
SRC_DIR := src

SRC := $(wildcard $(SRC_DIR)/*.s)
OBJS := $(patsubst $(SRC_DIR)/%.s, $(BUILD_DIR)/%.o, $(SRC))
BOOT_IMAGE := $(BUILD_DIR)/boot_image

all: $(BOOT_IMAGE)

boot: $(BOOT_IMAGE)
	qemu-system-x86_64 -no-reboot -drive file=$<,format=raw,index=0,media=disk

$(BOOT_IMAGE): $(BUILD_DIR)/linked.o
	objcopy -O binary $< $@

$(BUILD_DIR)/linked.o: $(OBJS)
	ld -T linker.ld -o $@ $^

$(BUILD_DIR)/%.o: $(SRC_DIR)/%.s
	@mkdir -p $(dir $@)
	$(NASM) $< -o $@

clean:
	$(RM) -r $(BUILD_DIR)

The linker script linker.ld is important because it makes sure that the code in our boot sector is relocated to the right address in the final image. Specifically, the BIOS loads the boot sector to address 0x7c00 in memory, so that’s the base address to relocate the boot sector to. In addition, the linker adds the magic number at the end of the boot sector. Other guides I’ve seen handle both the offset and the magic number inside the boot sector assembly source file by using features of the assembler, but that’s somewhat hackish.

# linker.ld

MEMORY
{
    boot_sector (rwx) : ORIGIN = 0x7c00, LENGTH = 512
}

ENTRY(__start)
SECTIONS
{
    .boot_sector : { *(.boot_sector); } > boot_sector
    .bootsign (0x7c00 + 510) :
    {
        BYTE(0x55)
        BYTE(0xaa)
    } > boot_sector
}

Running make boot should open a QEMU window that displays the “Hello, world!” message.

Stage 1 – Loading Stage 2 From Disk

We can split the bootloader into two stages. Stage 1 is the code in the boot sector, which is everything that the BIOS loads for us. The sole purpose of stage 1 is to load stage 2 into memory, and it does this by using BIOS-provided routines.

In stage 2, we’ll switch from 16-bit real mode to 32-bit protected mode. In protected mode, we can’t use BIOS routines anymore. Without BIOS routines, loading sectors from a disk would become much more involved. So we’ll load a number of sectors from disk into memory and hope for the best. Of course, this is an unsafe technique, but it works for now.

This is how one can access the disk using the BIOS. There’s an osdev.org page on this.

;; src/boot_sector.s

;; ...

__start:
    ;; ...

    mov si, disk_address_packet
    mov ah, 0x42 ; BIOS "extended read" function
    mov dl, 0x80 ; Drive number
    int 0x13 ; BIOS disk services
    jc error_reading_disk

ignore_disk_read_error:
    SND_STAGE_ADDR equ (BOOT_LOAD_ADDR + SECTOR_SIZE)
    jmp 0:SND_STAGE_ADDR

error_reading_disk:
    ;; We accept reading fewer sectors than requested
    cmp word [dap_sectors_num], READ_SECTORS_NUM
    jle ignore_disk_read_error

    mov bx, error_reading_disk_msg
    call print_string

    end:
    ;; ...

And at the end of boot_sector.s put this data:

;; src/boot_sector.s

;; ...

    align 4
disk_address_packet:
    db 0x10 ; Size of packet
    db 0 ; Reserved, always 0
dap_sectors_num:
    dw READ_SECTORS_NUM ; Number of sectors to read; the BIOS sets it to the number actually read
    dd (BOOT_LOAD_ADDR + SECTOR_SIZE) ; Destination address
    dq 1 ; Sector to start at (0 is the boot sector)

READ_SECTORS_NUM equ 64
BOOT_LOAD_ADDR equ 0x7c00
SECTOR_SIZE equ 512

hello_msg: db "Hello, world!", 13, 10, 0
error_reading_disk_msg: db "Error: failed to read disk with 0x13/ah=0x42", 13, 10, 0
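The layout of the disk address packet may be easier to see as a C struct. This is only a sketch: the struct, its field names, and make_dap are made up for illustration. The 32-bit destination in the assembly above is really an offset followed by a segment, so dd 0x7e00 means segment 0, offset 0x7e00.

```c
#include <stdint.h>

/* Disk address packet for int 0x13 with ah=0x42, field by field. */
struct disk_address_packet {
    uint8_t  size;         /* Size of the packet: 16 bytes */
    uint8_t  reserved;     /* Always 0 */
    uint16_t sectors_num;  /* Number of sectors to transfer */
    uint16_t dest_offset;  /* Destination buffer: offset part */
    uint16_t dest_segment; /* Destination buffer: segment part */
    uint64_t start_sector; /* First sector to read (0 is the boot sector) */
};

/* The fields are naturally aligned, so no padding is expected. */
_Static_assert(sizeof(struct disk_address_packet) == 16,
               "disk address packet must be 16 bytes");

/* Fills in a packet. The function name is hypothetical. */
static struct disk_address_packet make_dap(uint16_t sectors_num,
                                           uint16_t segment, uint16_t offset,
                                           uint64_t start_sector) {
    struct disk_address_packet dap;
    dap.size = sizeof dap;
    dap.reserved = 0;
    dap.sectors_num = sectors_num;
    dap.dest_offset = offset;
    dap.dest_segment = segment;
    dap.start_sector = start_sector;
    return dap;
}
```

The packet used by stage 1 corresponds to make_dap(64, 0x0000, 0x7e00, 1).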

Lastly we need a stage 2 to jump to and we need to update the linker script. The Makefile remains unchanged.

;; src/stage2.s

    section .stage2

    [bits 16]

    mov bx, stage2_msg
    call print_string

end:
    hlt
    jmp end

    print_string:
        ;; ...

stage2_msg: db "Hello from stage 2", 13, 10, 0

I just copied the print_string function so we can test if the jump works. Because this specific function only works with BIOS in real mode, it won’t be of any use to stage 2 once we have switched to protected mode.

Finally the linker script:

# linker.ld

MEMORY
{
    boot_sector (rwx) : ORIGIN = 0x7c00, LENGTH = 512
    stage2 (rwx) : ORIGIN = 0x7e00, LENGTH = 32768 # 512 * 64
}

ENTRY(__start)
SECTIONS
{
    .boot_sector : { *(.boot_sector); } > boot_sector
    .bootsign (0x7c00 + 510) :
    {
        BYTE(0x55)
        BYTE(0xaa)
    } > boot_sector
    .stage2 : { *(.stage2); } > stage2
}

32-bit Protected Mode

Next, we’ll switch the CPU from real mode (16-bit) to protected mode (32-bit). In protected mode, segmentation is used by default to implement memory protection. Before switching to protected mode, you need to define a Global Descriptor Table (GDT) that contains segment descriptors for all the segments you want to define. Usually, paging is used in favor of segmentation. In fact, in 64-bit long mode, you need to use paging. But for the initial switch to protected mode, segmentation is required.

The Intel manual describes the “flat model” as a very simple segmentation model that can be implemented in the GDT. The “flat model” comprises a code segment and a data segment. Both of these segments are mapped to the entire linear address space (their base addresses and limits are identical). Using the simplest of all models is fine, since we just want to get to long mode and abandon segmentation in favor of paging.

The GDT is defined as a contiguous structure in memory. You fill a chunk of memory with the right data and give the CPU the address and the length of the memory chunk. The format of the GDT structure is described in the Intel manual.

[Figure from section “3.4.5 Segment Descriptors” of the Intel manual]

The GDT is just an array of segment descriptors with a “null descriptor” at the start that’s used to catch invalid translations. The fields in the segment descriptor are described in detail in section “3.4.5 Segment Descriptors” of volume 3 of the Intel manual.

We define the GDT like this:

;; include/gdt32.s

    ;; Base address of GDT should be aligned on an eight-byte boundary
    align 8

gdt32_start:
    ;; 8-byte null descriptor (index 0).
    ;; Used to catch translations with a null selector.
    dd 0x0
    dd 0x0

gdt32_code_segment:
    ;; 8-byte code segment descriptor (index 1).
    ;; First 16 bits of segment limit
    dw 0xffff
    ;; First 24 bits of segment base address
    dw 0x0000
    db 0x00
    ;; 0-3: segment type that specifies an execute/read code segment
    ;;   4: descriptor type flag indicating that this is a code/data segment
    ;; 5-6: Descriptor privilege level 0 (most privileged)
    ;;   7: Segment present flag set indicating that the segment is present
    db 10011010b
    ;; 0-3: last 4 bits of segment limit
    ;;   4: unused (available for use by system software)
    ;;   5: 64-bit code segment flag indicates that the segment doesn't contain 64-bit code
    ;;   6: default operation size of 32 bits
    ;;   7: granularity of 4 kilobyte units
    db 11001111b
    ;; Last 8 bits of segment base address
    db 0x00

gdt32_data_segment:
    ;; Only differences are explained ...
    dw 0xffff
    dw 0x0000
    db 0x00
    ;; 0-3: segment type that specifies a read/write data segment
    db 10010010b
    db 11001111b
    db 0x00

gdt32_end:

;; Value for GDTR register that describes the above GDT
gdt32_pseudo_descriptor:
    ;; A limit value of 0 results in one valid byte. So, the limit value of our
    ;; GDT is its length in bytes minus 1.
    dw gdt32_end - gdt32_start - 1
    ;; Start address of the GDT
    dd gdt32_start

CODE_SEG32 equ gdt32_code_segment - gdt32_start
DATA_SEG32 equ gdt32_data_segment - gdt32_start
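Since the base address and the limit are scattered across the descriptor, it can help to see the packing spelled out. The following C sketch assembles a descriptor from its parts. The function names are hypothetical, and the access and flags parameters are simply the two bit groups described in the comments above (the access byte, and the upper four bits of the byte that also holds the top of the limit).

```c
#include <stdint.h>

/* Packs a segment descriptor into the 8-byte layout the CPU expects. */
static uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                uint8_t access, uint8_t flags) {
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xffff);            /* bits  0-15: limit bits 0-15   */
    d |= (uint64_t)(base & 0xffffff) << 16;     /* bits 16-39: base bits 0-23    */
    d |= (uint64_t)access << 40;                /* bits 40-47: type, S, DPL, P   */
    d |= (uint64_t)((limit >> 16) & 0xf) << 48; /* bits 48-51: limit bits 16-19  */
    d |= (uint64_t)(flags & 0xf) << 52;         /* bits 52-55: AVL, L, D/B, G    */
    d |= (uint64_t)(base >> 24) << 56;          /* bits 56-63: base bits 24-31   */
    return d;
}

/* Selector for a GDT entry with privilege level 0: the index times 8. */
static uint16_t make_selector(unsigned index) {
    return (uint16_t)(index << 3);
}
```

With base 0, limit 0xfffff, access byte 0x9a, and flags 0xc, this yields 0x00cf9a000000ffff, which is the code segment descriptor above read as one little-endian quadword. make_selector(1) and make_selector(2) give 8 and 16, the values of CODE_SEG32 and DATA_SEG32.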

Switching to protected mode is very easy now. We load the GDT pseudo-descriptor into the GDTR register so that the base address and length of our GDT are known to the system. Then we set the PE flag in cr0, which enables protected mode. Lastly, we do a far jump to flush the instruction pipeline.

;; src/stage2.s

    section .stage2

    [bits 16]

;; ...

    ;; Load GDT and switch to protected mode

    cli ; Can't have interrupts during the switch
    lgdt [gdt32_pseudo_descriptor]

    ;; Setting cr0.PE (bit 0) enables protected mode
    mov eax, cr0
    or eax, 1
    mov cr0, eax

    ;; The far jump into the code segment from the new GDT flushes
    ;; the CPU pipeline removing any 16-bit decoded instructions
    ;; and updates the cs register with the new code segment.
    jmp CODE_SEG32:start_prot_mode


    [bits 32]
start_prot_mode:
    ;; Old segments are now meaningless
    mov ax, DATA_SEG32
    mov ds, ax
    mov ss, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

;; ...

%include "include/gdt32.s"

Interrupts are disabled during the switch. After the entire setup is complete, interrupts can be enabled again. This would require extra setup work.

Now that we’re in protected mode, we can’t use the BIOS routines anymore. To print text, we can write straight to the VGA buffer instead.

;; src/stage2.s

;; ...

;; Writes a null-terminated string straight to the VGA buffer.
;; The address of the string is found in the ebx register.
print_string32:
    pusha

    VGA_BUF equ 0xb8000
    WB_COLOR equ 0xf

    mov edx, VGA_BUF

print_string32_loop:
    cmp byte [ebx], 0
    je print_string32_return

    mov al, [ebx]
    mov ah, WB_COLOR
    mov [edx], ax

    add ebx, 1              ; Next character
    add edx, 2              ; Next VGA buffer cell
    jmp print_string32_loop

print_string32_return:
    popa
    ret

It’s best to print something so that we know the switch worked. To do that, add a string with the message and a call to print_string32 to the code. The print_string32 function is super basic, so the message always shows up in the top left corner of the display.
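Each cell of the VGA text buffer is two bytes: the character in the low byte and the color attribute in the high byte. Here is a small C sketch of that encoding. The function names are made up, and the 80 columns are an assumption about the standard text mode.

```c
#include <stdint.h>

#define VGA_COLUMNS 80 /* Assumes the standard 80-column text mode */

/* One cell of the text buffer: character in the low byte,
   color attribute in the high byte. */
static uint16_t vga_cell(char c, uint8_t attr) {
    return (uint16_t)(((uint16_t)attr << 8) | (uint8_t)c);
}

/* Byte offset of the cell at (row, col) from the start of the buffer. */
static uint32_t vga_offset(unsigned row, unsigned col) {
    return (row * VGA_COLUMNS + col) * 2;
}
```

vga_cell('H', 0x0f) is 0x0f48, which is the value of ax that print_string32 stores for a white-on-black 'H'. Starting a message at vga_offset(1, 0), that is 160 bytes into the buffer, would put it on the second line.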

64-bit Long Mode

For this part, refer to “10.8.5 Initializing IA-32e Mode”. Note that Intel calls the 64-bit mode “IA-32e” while AMD refers to it as “long mode” in the AMD64 manual.

Before switching to long mode, the CPU must be in protected mode and paging must be enabled. We have protected mode now, but we are missing paging.

I love paging. It’s just very cool. But I’d do a poor job at explaining the concept itself. Philipp Oppermann’s Introduction to Paging from the “Writing an OS in Rust” blog was really useful for me personally. OSTEP also talks about paging starting chapter 18, although it doesn’t go into the specifics of paging on x86 like Philipp Oppermann’s post does.

In long mode with Physical Address Extension enabled (PAE; we’ll enable it below), a four-level page table is used. The code below builds such a page table at a given address.

;; src/stage2.s

;; Builds a 4 level page table starting at the address that's passed in ebx.
build_page_table:
    pusha

    PAGE64_PAGE_SIZE equ 0x1000
    PAGE64_TAB_SIZE equ 0x1000
    PAGE64_TAB_ENT_NUM equ 512

    ;; Initialize all four tables to 0. If the present flag is cleared, all other bits in any
    ;; entry are ignored. So by filling all entries with zeros, they are all "not present".
    ;; Each repetition zeros four bytes at once. That's why a number of repetitions equal to
    ;; the size of a single page table is enough to zero all four tables.
    mov ecx, PAGE64_TAB_SIZE ; ecx stores the number of repetitions
    mov edi, ebx             ; edi stores the base address
    xor eax, eax             ; eax stores the value
    rep stosd

    ;; Link first entry in PML4 table to the PDP table
    mov edi, ebx
    lea eax, [edi + (PAGE64_TAB_SIZE | 11b)] ; Set read/write and present flags
    mov dword [edi], eax

    ;; Link first entry in PDP table to the PD table
    add edi, PAGE64_TAB_SIZE
    add eax, PAGE64_TAB_SIZE
    mov dword [edi], eax

    ;; Link the first entry in the PD table to the page table
    add edi, PAGE64_TAB_SIZE
    add eax, PAGE64_TAB_SIZE
    mov dword [edi], eax

    ;; Fill the single page table on the lowest layer of the four level page
    ;; table. Its 512 entries identity-map the first 2 MiB of memory.
    add edi, PAGE64_TAB_SIZE
    mov ebx, 11b
    mov ecx, PAGE64_TAB_ENT_NUM
set_page_table_entry:
    mov dword [edi], ebx
    add ebx, PAGE64_PAGE_SIZE
    add edi, 8
    loop set_page_table_entry

    popa
    ret
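To check what this table structure actually maps, it can be simulated on the host. The C sketch below builds the same four tables in an ordinary array and walks them the way the MMU would. All names are hypothetical, and the array only stands in for the physical memory starting at the base address.

```c
#include <stdint.h>
#include <string.h>

#define TAB_ENT_NUM 512
#define PAGE_SIZE 0x1000u
#define FLAGS 0x3u /* present (bit 0) and read/write (bit 1) */

/* The four tables, laid out back to back as if starting at `base`. */
typedef struct {
    uint64_t base;
    uint64_t tables[4][TAB_ENT_NUM];
} page_tables;

/* Index into the table at a level (3 = PML4, 2 = PDP, 1 = PD, 0 = PT).
   Each level consumes 9 bits of the address above the 12-bit page offset. */
static unsigned table_index(uint64_t vaddr, int level) {
    return (unsigned)((vaddr >> (12 + 9 * level)) & 0x1ff);
}

/* Same structure as build_page_table: one entry per upper level,
   then 512 identity-mapped pages in the lowest table. */
static void build_tables(page_tables *pt, uint64_t base) {
    memset(pt, 0, sizeof *pt);
    pt->base = base;
    for (int i = 0; i < 3; i++)
        pt->tables[i][0] = (base + (uint64_t)(i + 1) * PAGE_SIZE) | FLAGS;
    for (int i = 0; i < TAB_ENT_NUM; i++)
        pt->tables[3][i] = ((uint64_t)i * PAGE_SIZE) | FLAGS;
}

/* Walks the tables. Returns 1 and stores the physical address on success,
   0 if an entry on the way is not present. */
static int translate(const page_tables *pt, uint64_t vaddr, uint64_t *paddr) {
    uint64_t addr = pt->base;
    for (int level = 3; level >= 0; level--) {
        uint64_t t = (addr - pt->base) / PAGE_SIZE;
        if (addr < pt->base || t >= 4)
            return 0; /* Points outside the simulated tables */
        uint64_t entry = pt->tables[t][table_index(vaddr, level)];
        if (!(entry & 1))
            return 0; /* Present flag is cleared */
        addr = entry & ~(uint64_t)0xfff;
    }
    *paddr = addr | (vaddr & 0xfff);
    return 1;
}
```

After build_tables(&pt, 0x1000), translating 0xb8000 (the VGA buffer) gives 0xb8000 back, while 0x200000 fails: only the first 2 MiB are mapped, which is 512 pages of 4 KiB each.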

Paging supersedes segmentation for managing virtual address spaces, permissions, etc. A Global Descriptor Table with segment descriptors is still needed though, and the segment descriptors must be modified slightly to enable long mode-specific features.

This is another GDT that also implements the flat model. It’s almost identical to the GDT for protected mode. Just two bits were changed.

;; include/gdt64.s

    align 16
gdt64_start:
    ;; 8-byte null descriptor (index 0).
    dd 0x0
    dd 0x0

gdt64_code_segment:
    dw 0xffff
    dw 0x0000
    db 0x00
    db 10011010b
    ;;   5: 64-bit code segment flag indicates that this segment contains 64-bit code
    ;;   6: must be zero if L bit (bit 5) is set
    db 10101111b
    db 0x00

gdt64_data_segment:
    dw 0xffff
    dw 0x0000
    db 0x00
    ;; 0-3: segment type that specifies a read/write data segment
    db 10010010b
    db 10101111b
    db 0x00

gdt64_end:

gdt64_pseudo_descriptor:
    dw gdt64_end - gdt64_start - 1
    dd gdt64_start

CODE_SEG64 equ gdt64_code_segment - gdt64_start
DATA_SEG64 equ gdt64_data_segment - gdt64_start

With the page table and the GDT in place, the switch from protected mode to long mode can be performed.

;; src/stage2.s

;; ...

start_prot_mode:
    ;; ...

    ;; Build 4 level page table and switch to long mode
    mov ebx, 0x1000
    call build_page_table
    mov cr3, ebx            ; MMU finds the PML4 table in cr3

    ;; Enable Physical Address Extension (PAE). This is needed to allow the switch
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax

    ;; The EFER (Extended Feature Enable Register) MSR (Model-Specific Register) contains fields
    ;; related to IA-32e mode operation. Bit 8 of this MSR is the LME (long mode enable) flag
    ;; that enables IA-32e operation.
    mov ecx, 0xc0000080
    rdmsr
    or eax, 1 << 8
    wrmsr

    ;; Enable paging (PG flag in cr0, bit 31)
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax

    mov ebx, comp_mode_msg
    call print_string32

    ;; New GDT has the 64-bit segment flag set. This makes the CPU switch from
    ;; IA-32e compatibility mode to 64-bit mode.
    lgdt [gdt64_pseudo_descriptor]

    jmp CODE_SEG64:start_long_mode

    ;; ...

    [bits 64]

start_long_mode:
    hlt
    jmp start_long_mode

    ;; ...

%include "include/gdt64.s"

    ;; ...

comp_mode_msg: db "Entered 64-bit compatibility mode", 0

Again, the “success message” should show up in the top left corner. Write a small VGA driver if this annoys you.
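As a compact summary of the bits involved, here is a C sketch of the preconditions for the moment paging is turned on. The macro and function names are made up. The bit positions are the ones used in the assembly above.

```c
#include <stdint.h>

#define CR0_PE   (1u << 0)   /* cr0: protected mode enable */
#define CR0_PG   (1u << 31)  /* cr0: paging */
#define CR4_PAE  (1u << 5)   /* cr4: Physical Address Extension */
#define EFER_MSR 0xc0000080u /* Address of the EFER MSR */
#define EFER_LME (1u << 8)   /* EFER: long mode enable */

/* Returns 1 if setting cr0.PG in this state would take the CPU into
   IA-32e mode: protected mode on, PAE on, and LME set. */
static int ready_for_long_mode(uint32_t cr0, uint32_t cr4, uint32_t efer_low) {
    return (cr0 & CR0_PE) && (cr4 & CR4_PAE) && (efer_low & EFER_LME);
}
```

The order in the assembly matters for this reason: cr3, PAE, and LME are all set up first, and the PG flag is set last.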

Using C

C code can easily be integrated into this setup. For example, this might become an OS kernel.

/* src/kernel.c */

#define VGA_COLUMNS_NUM 80
#define VGA_ROWS_NUM 25

#define ARRAY_SIZE(arr) ((int)sizeof(arr) / (int)sizeof((arr)[0]))

void _start_kernel(void) {
	volatile char *vga_buf = (char *)0xb8000;
	const char msg[] = "Hello from C";
	int i;

	for (i = 0; i < VGA_COLUMNS_NUM * VGA_ROWS_NUM * 2; i++)
		vga_buf[i] = '\0';

	for (i = 0; i < ARRAY_SIZE(msg) - 1; i++) {
		vga_buf[i * 2] = msg[i];
		vga_buf[i * 2 + 1] = 0x07; /* Light gray on black */
	}
}

Update src/stage2.s:

;; src/stage2.s

    ;; ...

    [bits 64]

start_long_mode:
    mov ebx, long_mode_msg
    call print_string64

    extern _start_kernel
    call _start_kernel

end64:
    hlt
    jmp end64

    ;; ...

The linker script:

# linker.ld

MEMORY
{
    boot_sector (rwx) : ORIGIN = 0x7c00, LENGTH = 512
    stage2 (rwx) : ORIGIN = 0x7e00, LENGTH = 512
    kernel (rwx) : ORIGIN = 0x8000, LENGTH = 0x10000
}

ENTRY(__start)
SECTIONS
{
    .boot_sector : { *(.boot_sector); } > boot_sector
    .bootsign (0x7c00 + 510) :
    {
        BYTE(0x55)
        BYTE(0xaa)
    } > boot_sector
    .stage2 : { *(.stage2); } > stage2
    .text : { *(.text); } > kernel
    .data : { *(.data); } > kernel
    .rodata : { *(.rodata); } > kernel
    .bss :
    {
        *(.bss)
        *(COMMON)
    } > kernel
}

Lastly, the Makefile needs to change. Here, I only included the lines that have changed.

# Makefile

# ...

CC := gcc
CFLAGS := -std=c99 -ffreestanding -m64 -mno-red-zone -fno-builtin -nostdinc -Wall -Wextra

# ...

SRC := $(wildcard $(SRC_DIR)/*)
OBJS := $(patsubst $(SRC_DIR)/%, $(BUILD_DIR)/%.o, $(SRC))

# ...

$(BUILD_DIR)/%.s.o: $(SRC_DIR)/%.s
	@mkdir -p $(dir $@)
	$(NASM) $< -o $@

$(BUILD_DIR)/%.c.o: $(SRC_DIR)/%.c
	@mkdir -p $(dir $@)
	$(CC) $(CFLAGS) -c $< -o $@

# ...

Cool if you actually followed along this far. The code is on GitHub.

Between Curious and Frantic

2024-04-23T00:00:00+00:00

I’ve been upset with my output lately. Just overall. For a few months, I felt I did a good (at least decent) job posting inspired content on this site. I was also learning steadily and making good progress on my programming. But in recent months, less so. I tried to focus on specific projects, but I couldn’t stick to any of them for long. As such, I didn’t produce anything particularly interesting. In short, I went from curious exploring to frantic jumping between projects.

Maybe the reason for this is that school has been coming to an end, which means I have had to spend more time studying for exams recently. It’s reasonable to think that the slow mounting daunt of final exams would increase pressure and dilute focus. But really, I was working diligently for school before, and I managed to stress myself with exams more than enough in the past. The amount of free time I have also hasn’t changed for the worse. So I deem this an internal issue.

I think the problem is very clear: I piled up more and more tasks to work on at the same time. In effect, this hinders my progress since I can’t really focus on any single one of them, and thus I can’t do good work. To progress, I need to decide. I have to do one thing at a time.

Some things I just have to reject for now. That’s OK. I have to learn things step-by-step, one after the other. I have time for that. In fact, I can only maximize my skills if I can focus. But I can’t focus if I hastily attempt to master all skills at once. I want to be good at a broad range of tasks. But to be really good, I have to go deep as well. And from my recent experience, it seems obvious to me that I need a different approach. Now is the time to pick out a small set of challenges and really dive in. I have to fundamentally say no to all other enterprises that might distract me.

It’s also important that I let things pass. Over and over again, I have accumulated endless piles of bookmarks and notes on what to read and what to write. It’s great that I have this much to think about, but it’s also distracting. All of these reminders that I constantly jot down for myself–any systems that I try to develop to organize them–I’ll have to give up on that. So far, I’ve been doing this collecting out of fear of missing out on some valuable information. But building up such a backlog is not productive. Even worse, it simply diverts focus.

My strategy is this: don’t fix which topics to work on, but fix everything else. Then, don’t make any changes to the fixed set. I’m not going to rule out whole topics by themselves. Rather, I want to set myself a framework of conditions that make it easy to stay focused on the work that I want to do.

For example, no more text editor switching. Emacs is about good enough for all I need. And there certainly are no alternatives that would boost my productivity enough to make up for the distraction that the switch would incur.

Second, no more Rust (I know, I’m deeply torn myself). I want to improve my foundational knowledge. For 90% of that, I don’t need to sharpen my understanding of any fancy new languages. Instead, I’ll just stick to C and focus on content instead of language.

Third, no more HN (this will probably turn out as less HN in practice). Sure, HN is great for discovering things and also pretty good entertainment, but, really, it’s a distraction most of the time. If I need to find good material on a topic, I can just use HN to search for it directly (I already do that since Google search doesn’t work anymore). If I want to explore, then sure, I can use HN deliberately. But exposing myself to the plethora of fascinating projects that HN has to offer too often isn’t healthy (for months now, breakfast and HN have been coupled tightly).

I’m also deleting some of the stuff on this website that I don’t think is actually very interesting.

That’s all I have for now. I’ll probably think of more rules in the future. In that case, I’ll update this page. I’m putting this out here to hold myself accountable and to get better at getting things out there (something I generally want to improve at). Have fun exploring.

Parsing Expressions by Recursive Descent in Haskell

2023-10-31T00:00:00+00:00

Parsing numerical expressions by recursive descent is a joy in Haskell! It is incredibly concise and elegant, yet very simple.

What we want to parse are binary expressions like 7 + 42 * 9, 2 * 3 / 4 * 5, or 8 * (10 - 6). As always, when parsing such expressions, we have to be aware of the associativity of the operators involved and of their different levels of precedence. In this case it’s simple: +, -, *, and / all associate to the left, and * and / have higher precedence than + and -.

This means that we want to turn the above expressions into the following ASTs.¹

7 + 42 * 9 ⇒ 7 + (42 * 9). * has higher precedence than +, so although they both associate to the left, * binds tighter than +.

2 * 3 / 4 * 5 ⇒ ((2 * 3) / 4) * 5. * and / have the same precedence and associate to the left.

8 * (10 - 6). Parentheses have the highest precedence.

The following grammar encodes the precedence and associativity constraints above. It is also not left-recursive, and can be used in a recursive descent parser.²

Instead of using algorithms like Shunting Yard or precedence climbing, the precedence of the operators is encoded directly in the various production rules. This is the simplest approach to take, but it works well in the implementation. Nora Sandler presents this method and explains how to get there on her blog. I recommend reading this article by Theodore Norvell if you want to learn more about parsing expressions. It explains both the Shunting Yard algorithm and precedence climbing.

How would this grammar parse an expression like 7 + 42 * 9? It starts at 7, goes down the leftmost derivation of both expr and term, and then chooses the num alternative in factor. Next, + is consumed by the optionally repeated part of expr, and we go down another term, with 42 * 9 as the rest of the input. The recursion mechanism at work here defers the partial tree consisting of (+ 7 ) that we have parsed so far. Starting at 42, term now goes down the leftmost factor again. This factor becomes another num, consuming 42 from the input. Now * is consumed by the optionally repeated part of term, and then factor consumes the last numeric literal 9. In total, the second term in the expr production rule produces the tree (* 42 9). Now that the end of the input has been reached, this tree is used to complete the first partial tree. This way we get (+ 7 (* 42 9)) as the result.
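The walk-through above can be mirrored in a hand-written recursive descent routine. The following is a sketch in C rather than Haskell, with one function per production rule. To stay short, it evaluates the expression directly instead of building an AST. The rules in the comments assume the usual expr/term/factor shape described above, all names are made up for the example, and integer overflow is not checked.

```c
#include <ctype.h>

/* Cursor into the input plus an error flag. */
typedef struct {
    const char *s;
    int ok;
} parser;

static long parse_expr(parser *p);

static void skip_spaces(parser *p) {
    while (isspace((unsigned char)*p->s))
        p->s++;
}

/* factor = "(" expr ")" | num */
static long parse_factor(parser *p) {
    long value = 0;
    skip_spaces(p);
    if (*p->s == '(') {
        p->s++;
        value = parse_expr(p);
        skip_spaces(p);
        if (*p->s == ')')
            p->s++;
        else
            p->ok = 0; /* Missing closing parenthesis */
        return value;
    }
    if (!isdigit((unsigned char)*p->s)) {
        p->ok = 0; /* Expected a number */
        return 0;
    }
    while (isdigit((unsigned char)*p->s))
        value = value * 10 + (*p->s++ - '0');
    return value;
}

/* term = factor { ("*" | "/") factor } */
static long parse_term(parser *p) {
    long lhs = parse_factor(p);
    for (;;) {
        skip_spaces(p);
        char op = *p->s;
        if (op != '*' && op != '/')
            return lhs;
        p->s++;
        long rhs = parse_factor(p);
        if (op == '/' && rhs == 0) {
            p->ok = 0; /* Division by zero */
            return 0;
        }
        /* Folding into lhs makes the operators left-associative */
        lhs = (op == '*') ? lhs * rhs : lhs / rhs;
    }
}

/* expr = term { ("+" | "-") term } */
static long parse_expr(parser *p) {
    long lhs = parse_term(p);
    for (;;) {
        skip_spaces(p);
        char op = *p->s;
        if (op != '+' && op != '-')
            return lhs;
        p->s++;
        long rhs = parse_term(p);
        lhs = (op == '+') ? lhs + rhs : lhs - rhs;
    }
}

/* Evaluates the whole input. Returns 1 on success, 0 on a parse error. */
static int eval(const char *input, long *result) {
    parser p = { input, 1 };
    long value = parse_expr(&p);
    skip_spaces(&p);
    if (!p.ok || *p.s != '\0')
        return 0;
    *result = value;
    return 1;
}
```

With integer division, eval gives 385 for 7 + 42 * 9 and 5 for 2 * 3 / 4 * 5. The second result only comes out this way because the operators are grouped to the left, as ((2 * 3) / 4) * 5.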

Implementation

We’ll use the Megaparsec library of parser combinators for our implementation. The Megaparsec tutorial is quite thorough, and I recommend you give it a read if you want to use Megaparsec.

First off, let’s define a representation of the ASTs we wish to create:

-- Expr.hs

data Expr
  = Add Expr Expr  -- +
  | Sub Expr Expr  -- -
  | Mul Expr Expr  -- *
  | Div Expr Expr  -- /
  | Num Int
  deriving (Show, Eq)

The first Expr represents the left-hand side of the binary expressions, and the second Expr represents the right-hand side.

Next, we’ll need to define some helpers to start parsing. Here we mostly use the combinators found in Control.Applicative and in Megaparsec’s Lexer module.

-- Expr.hs

import Data.Void
import Control.Applicative hiding (many)
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer as L

data Expr = -- [...]

type Parser = Parsec Void String

spaceConsumer :: Parser ()
spaceConsumer = L.space space1 empty empty

pSymbol :: String -> Parser String
pSymbol = L.symbol spaceConsumer

pLexeme :: Parser a -> Parser a
pLexeme = L.lexeme spaceConsumer

pNum :: Parser Expr
pNum = Num <$> pLexeme L.decimal

pSymbol and pLexeme consume all white space after whatever they parse. They don’t consume initial white space, so be careful about that. Now we can already parse numbers.

λ :l Expr
[1 of 2] Compiling Main             ( Expr.hs, interpreted )
Ok, one module loaded.
λ parseTest (pNum <* eof) "7"
Num 7
λ parseTest (pNum <* eof) "43587"
Num 43587
λ parseTest (pNum <* eof) "blah"
1:1:
  |
1 | blah
  | ^
unexpected 'b'
expecting integer
λ parseTest (pNum <* eof) "92 * 4"
1:4:
  |
1 | 92 * 4
  |    ^
unexpected '*'
expecting end of input

As you can see, different numbers are all parsed correctly and invalid inputs are rejected with nice error messages generated by Megaparsec.

Let’s now start implementing the parser. We’ll build it up from the bottom, starting with factor.

-- Expr.hs
-- [...]

inParens :: Parser a -> Parser a
inParens = between (pSymbol "(") (pSymbol ")")

pFactor :: Parser Expr
pFactor = inParens pExpr <|> pNum

pExpr :: Parser Expr
pExpr = undefined

How do we define pExpr? It should parse a single term, and then go on to parse any number (zero or more) of plus or minus characters, each followed by another term. term has the same shape as expr, so once we know how to implement expr, we can also implement term. Parsing the first term is simple:

-- Expr.hs
-- [...]

pTerm :: Parser Expr
pTerm =  -- [...]

pExpr :: Parser Expr
pExpr = do
  lhs <- pTerm
  -- ...

A parser that parses a + or a - and then parses another term might look like this: ((pSymbol "+" $> Add) <|> (pSymbol "-" $> Sub)) <*> pTerm. It discards the symbol it parsed and instead returns the value constructor of the expression that belongs to that symbol. Then it applies that value constructor to the expression parsed by the pTerm on the right. But there is a problem here! The term that the value constructor is applied to first is the right-hand side of the binary expression. But the first parameter of the value constructor is defined to be the left-hand side. We need to flip the parameters of the value constructor.

-- Expr.hs

import Data.Functor (($>))

-- [...]

pExpr :: Parser Expr
pExpr = do
  -- lhs :: Expr
  lhs <- pTerm
  -- rhs :: Expr -> Expr
  rhs <- flip <$> pOperator <*> pTerm
  pure $ rhs lhs
  where
    pOperator = (pSymbol "+" $> Add) <|> (pSymbol "-" $> Sub)

Let’s try it out again.

λ :l Expr
[1 of 2] Compiling Main             ( Expr.hs, interpreted )
Ok, one module loaded.
λ parseTest (pExpr <* eof) "92 * 4"
1:7:
  |
1 | 92 * 4
  |       ^
unexpected end of input
expecting '+', '-', or digit

It doesn’t work yet, because we’re missing the zero or more repetitions part. For this, many can be used, which will run the given parser zero or more times and return a list of all results. In our case, it returns a list of Expr -> Expr. A left fold can be used to apply the functions in this list one after another, starting with lhs. This will build the desired left-associative tree of expressions.

-- Expr.hs
-- [...]

pTerm :: Parser Expr
pTerm = do
  lhs <- pFactor
  rhs <- many $ flip <$> pOperator <*> pFactor
  pure $ foldl (\expr f -> f expr) lhs rhs
  where
    pOperator = (pSymbol "*" $> Mul) <|> (pSymbol "/" $> Div)

pExpr :: Parser Expr
pExpr = do
  lhs <- pTerm
  rhs <- many $ flip <$> pOperator <*> pTerm
  pure $ foldl (\expr f -> f expr) lhs rhs
  where
    pOperator = (pSymbol "+" $> Add) <|> (pSymbol "-" $> Sub)

Now it works! I formatted the GHCi output a bit so it’s easy to recognize that the trees in the output match those from the beginning of this post.

λ :l Expr
[1 of 2] Compiling Main             ( Expr.hs, interpreted )
Ok, one module loaded.
λ parseTest (pExpr <* eof) "92 * 4"
Mul (Num 92) (Num 4)
λ parseTest (pExpr <* eof) "7 + 42 * 9"
Add
	(Num 7)
	(Mul
		(Num 42)
		(Num 9))
λ parseTest (pExpr <* eof) "2 * 3 / 4 * 5"
Mul
	(Div
		(Mul
			(Num 2)
			(Num 3))
		(Num 4))
	(Num 5)
λ parseTest (pExpr <* eof) "8 * (10 - 6)"
Mul
	(Num 8)
	(Sub
		(Num 10)
		(Num 6))

Conclusion

pTerm and pExpr are very similar and can easily be abstracted into a function that parses any left-associative binary expression. Then, the production rule for any level of precedence can be implemented in a single line. Unary operators can also be added by extending pFactor.

The code for this post can be found here. It includes such a generic function for parsing expressions.

¹ I used Quiver to create the diagrams. It has an option to embed diagrams as iframes, but I decided not to, because I like how reliable and simple plain images are.
² The curly braces denote zero or more repetitions of what’s inside them. A character in quotes refers to that literal character. The num production rule/token is not included in the grammar. It refers to a numeric literal.

C VLAs are cool!

2023-10-07T00:00:00+00:00

Can you use a class in C?

2023-08-11T00:00:00+00:00

Recently, I’ve been working on a C debugger. This requires reading and processing the DWARF debugging information that’s part of the binary. Since this is a rather complex task, I figured I might use a library that exports a nice interface to the debugging information.

One such library that I found early on was libelfin. It wasn’t perfect from the start because it is a bit dated now, only supporting DWARF 4 and missing features from the newer DWARF 5 standard, but I thought that I could work around this. The bigger problem was that libelfin is written in C++ while most of the debugger is written in C.

It is pretty easy to call code written in C from C++ since most C code falls within the subset of C that C++ supports. The problem with calling C++ code from C is that there are many features in C++ that C is missing. This means that the C++ interface must be simplified for C to be able to understand it.

Handling objects

The most important concept in C++ that C is missing is true object orientation. That is, in C you don’t get a this pointer for free; you need to handle it manually.

Let’s start with a simple example. Say we have a class that represents a rational number $r = p / q$ where $q \neq 0$. The declaration, without any of the operations we need, might look something like this.

// rational.h

class Rational {
public:
  int _numer;
  int _denom;

  Rational(int numer, int denom)
    : _numer{numer}, _denom{denom} {}
};

This is how we might use it in C++. Running this program prints 5 / 3.

// main.cc
#include <iostream>
#include "rational.h"

auto main() -> int {
  auto r = Rational(5, 3);
  std::cout << r._numer << " / " << r._denom << std::endl;
  return 0;
}

How do you write this as a C program using the Rational class? After all, there is no such thing as a class in C. To solve this issue we can rely on one of the primitives that most systems languages have in common by virtue of running on the same type of computer: the pointer. We will allocate an instance of our class on the heap and then give the C program a pointer to that instance. This way we can keep track of the object to manipulate it. It’s also possible to use handles for this, but they are just pointers with extra steps and a bit of overkill for us at this point.

The following is what we might want.

// main.c
#include <stdio.h>
#include "rational.h"

int main(void) {
  void *r = make_rational(5, 3);
  printf("%d / %d\n", get_numer(r), get_denom(r));
  del_rational(&r);
  return 0;
}

We need to extend our interface with all the new functions to construct, access and manually delete instances of Rational.

// rational.h
class Rational { /* ... */ };

void *make_rational(int numer, int denom);
int get_numer(const void *r);
int get_denom(const void *r);
void del_rational(void **rp);

// rational.cc
#include "rational.h"
#include <cstdlib>

void *make_rational(int numer, int denom) {
  // Allocate an instance on the heap.
  Rational *r = static_cast<Rational*>(malloc(sizeof(Rational)));
  r->_numer = numer;
  r->_denom = denom;
  return r;
}

int get_numer(const void *r) {
  // Cast to access members.
  const Rational *_r = static_cast<const Rational*>(r);
  return _r->_numer;
}

int get_denom(const void *r) {
  const Rational *_r = static_cast<const Rational*>(r);
  return _r->_denom;
}

void del_rational(void **rp) {
  Rational *_r = static_cast<Rational*>(*rp);
  // Delete the instance on the heap.
  free(_r);

  // Delete the dangling pointer too.
  *rp = nullptr;
}

The trick is to allocate instances on the heap and then pass them around as void pointers. We use C’s malloc instead of the new operator because new relies on the C++ standard library, which the C compiler does not link by default, so using it here raises a linker error (more on that below). A good way to improve type safety is to typedef an opaque type to represent the class on the C side, as suggested in this reply. This is the approach that we’ll be using later on, so keep on reading. Alternatively, if you have control over all of the C++ code (i.e. you don’t just wrap a library) you could follow this Stack Overflow answer too.

Now, ignoring how incredibly unsafe all of this is, there is a bigger problem we must face: this is not even close to compiling! The reason for this is that when we #include "rational.h" into main.c, we essentially copy all the contents of rational.h into the C source file. This means that we suddenly present the C compiler with a class declaration and other things that it doesn’t understand because they are part of a totally different language.

We can use the C preprocessor to help us here. Using the __cplusplus macro, we can check whether to include the C++ parts in the interface. This way they’re hidden from the C compiler but available to the C++ compiler.

// rational.h
#ifdef __cplusplus
class Rational {
public:
  int _numer;
  int _denom;

  Rational(int numer, int denom)
    : _numer{numer}, _denom{denom} {}
};
#endif  // __cplusplus

// ...

Building the program with the two different compilers could look like this: g++ -c rational.cc && gcc main.c rational.o.

Great, it compiles! But uhh … now the linker signals an error. There are two problems left to fix. Firstly, C++ uses a different ABI than C, which means that the calling convention can be different. Additionally, C++ compilers mangle the names of identifiers in the source code differently than C compilers do, so the linker can’t find them. Fortunately, C is the lingua franca of computer programming, so C++ compilers can adapt their behavior in both of these aspects to that of C compilers. To do so, we just prefix all C++ declarations that should be used by C code with extern "C".

This is very simple to do in the rational.cc source file, but requires some extra smartness in rational.h. Again, extern "C" is only a C++ feature, so it cannot be part of the header when the C compiler is looking at it. The solution to this is to use the __cplusplus macro once more.

// rational.h
#ifdef __cplusplus
class Rational { /* ... */ };
#endif  // __cplusplus


#ifdef __cplusplus
extern "C" {
#endif  // __cplusplus

void *make_rational(int numer, int denom);
int get_numer(const void *r);
int get_denom(const void *r);
void del_rational(void **rp);

#ifdef __cplusplus
}  // extern "C"
#endif  // __cplusplus

This wraps all of the function declarations in an extern "C" block when the C++ compiler is looking at it. After making those changes to rational.h and rational.cc we get the following output.

g++ -c rational.cc
gcc main.c rational.o
./a.out
5 / 3

We successfully created a class in C++ that we can now use in C!

Now that we have covered how to use the preprocessor to change the content of a file based on the compiler that’s looking at it, we can make the API a bit safer, too. To do that we create an opaque type that acts as a proxy for the Rational class on the C side. Because this type is only declared, the C compiler will ensure that the pointers passed around in the interface are all of the same type (i.e. Rational). However, it won’t let you dereference the pointers because the type is never really defined.

#ifdef __cplusplus

class Rational {
	// ...
};

#else

// Opaque type as a C proxy for the class.
typedef struct Rational Rational;

#endif // __cplusplus

In addition to that, we now replace all void * with Rational *. This will allow you to remove some of the static_casts from the beginning.
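Put together, the typed interface might look like the following sketch. It condenses rational.h and rational.cc into a single C++ file so that it can be compiled on its own; in the real project the two parts would stay in their separate files. It keeps malloc and free as in the code above. The null checks and the assert are my additions and not part of the original code.

```cpp
// Sketch: rational.h and rational.cc condensed into one C++ file.
#include <cassert>
#include <cstdlib>

// ---- what would be rational.h ----
#ifdef __cplusplus
class Rational {
public:
  int _numer;
  int _denom;

  Rational(int numer, int denom)
    : _numer{numer}, _denom{denom} {}
};
#else
// Opaque type as a C proxy for the class (only seen by a C compiler).
typedef struct Rational Rational;
#endif  // __cplusplus

#ifdef __cplusplus
extern "C" {
#endif  // __cplusplus

Rational *make_rational(int numer, int denom);
int get_numer(const Rational *r);
int get_denom(const Rational *r);
void del_rational(Rational **rp);

#ifdef __cplusplus
}  // extern "C"
#endif  // __cplusplus

// ---- what would be rational.cc ----
extern "C" Rational *make_rational(int numer, int denom) {
  // malloc returns void *, so this one cast remains.
  Rational *r = static_cast<Rational *>(std::malloc(sizeof(Rational)));
  if (r == nullptr) {
    return nullptr;  // Allocation failed (check added for this sketch).
  }
  r->_numer = numer;
  r->_denom = denom;
  return r;
}

extern "C" int get_numer(const Rational *r) {
  return r->_numer;  // No cast needed anymore.
}

extern "C" int get_denom(const Rational *r) {
  return r->_denom;
}

extern "C" void del_rational(Rational **rp) {
  assert(rp != nullptr);  // Added for this sketch.
  std::free(*rp);
  // Clear the caller's pointer so it doesn't dangle.
  *rp = nullptr;
}
```

On the C side, main.c only has to change void *r to Rational *r. Passing any other pointer type to these functions is now diagnosed by the C compiler.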

Linking the C++ standard library

Above, we used malloc and a cast to allocate the instance of Rational to prevent a linker error later on. If we had used new and delete instead (which is the proper C++ way), we would have gotten linker errors like this one:

rational.cc:(.text+0x15): undefined reference to `operator new(unsigned long)'

Usually in a C++ program, this issue doesn’t arise because new and delete are provided in the C++ standard library. The problem is that we used a C compiler to build the executable, which doesn’t link the C++ standard library by default. The solution is to pass the linker flag -lstdc++ to the compiler explicitly.

With new we can also use normal C++ constructors, making everything more concise and safe:

// rational.cc
#include "rational.h"

extern "C" Rational *make_rational(int numer, int denom) {
  // Now we're using the constructor.
  Rational *r = new Rational(numer, denom);
  return r;
}

// ...

extern "C" void del_rational(Rational **rp) {
  delete *rp;
  *rp = nullptr;
}

Handling exceptions

Exceptions are another feature of C++ that C doesn’t have. If the C++ code we wrapped throws an exception, the whole program will crash without doing any cleanup. This can be addressed in multiple ways, one of which is to pass -fno-exceptions to the C++ compiler to abort if a library throws an exception and to reject code that uses exceptions. The more realistic and safe approach is to carefully catch all exceptions at the language boundary.

If you take another look at the definition of rational numbers above, you’ll notice that we don’t actually ensure that $q \neq 0$. This will become problematic if we try to implement rational number arithmetic for our class. We’ll address this by throwing an exception in the constructor if the denominator is 0.

// rational.h
#ifdef __cplusplus

#include <stdexcept>

class Rational {
public:
  int _numer;
  int _denom;

  Rational(int numer, int denom) {
    this->_numer = numer;
    if (denom == 0) {
      throw std::domain_error("denominator is 0");
    } else {
      this->_denom = denom;
    }
  }
};
#endif  // __cplusplus

// ...

Since we know now that the constructor might throw, we catch all exceptions in the wrapper and return a nullptr in case of an exception. In general, it’s often a good idea to catch anything and return a generic error value such as null. In addition to that, you could add infinitely more complex error-handling schemes at the language boundary.

// rational.cc
#include "rational.h"

extern "C" Rational *make_rational(int numer, int denom) {
  try {
    // Allocate an instance on the heap.
    Rational *r = new Rational(numer, denom);
    return r;
  } catch (...) {
    return nullptr;
  }
}

In such a simple case it’s also feasible to check if the denominator is 0 in make_rational but that doesn’t apply to more realistic examples.
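As one example of a slightly richer scheme, the wrapper could remember why the last call failed so that the C side can ask for a message. The following is only a sketch under my own assumptions: rational_last_error, set_error and the per-thread last_error buffer are hypothetical names that do not appear in the code for this post. The message is stored in a fixed-size buffer so that recording the error cannot itself throw.

```cpp
// Sketch of a boundary that also reports why a call failed.
#include <cassert>
#include <cstring>
#include <exception>
#include <stdexcept>

class Rational {
public:
  int _numer;
  int _denom;

  Rational(int numer, int denom) : _numer{numer}, _denom{denom} {
    if (denom == 0) {
      throw std::domain_error("denominator is 0");
    }
  }
};

// Hypothetical per-thread buffer holding the most recent error message.
static thread_local char last_error[128] = "";

// Copy msg into the buffer, truncating if needed. Never throws.
static void set_error(const char *msg) noexcept {
  std::strncpy(last_error, msg, sizeof(last_error) - 1);
  last_error[sizeof(last_error) - 1] = '\0';
}

// Hypothetical accessor for the C side; returns "" if the last call succeeded.
extern "C" const char *rational_last_error(void) {
  return last_error;
}

extern "C" Rational *make_rational(int numer, int denom) {
  try {
    set_error("");  // Reset the error state for this call.
    return new Rational(numer, denom);
  } catch (const std::exception &e) {
    set_error(e.what());
  } catch (...) {
    set_error("unknown error");
  }
  return nullptr;
}

extern "C" void del_rational(Rational **rp) {
  assert(rp != nullptr);
  delete *rp;
  *rp = nullptr;
}
```

A C caller would check the result of make_rational for NULL and, in that case, print rational_last_error(). The design mirrors errno: the return value signals that something went wrong and a separate call explains what.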

You can find all the code for this post on my GitHub.