
Lexing Numbers in Multiple Bases


Your first language #

So, you’ve written and published your first language, how exciting! You built a lexer, a parser, maybe some intermediate representation, and a backend. But let’s take a quick look at what seems to be a very simple step in the lexer: tokenizing numbers. You probably used a library or, in Rust, a crate like Logos, and tokenized numbers using a regex like \d+. Or, if you’re feeling adventurous, you wrote your own lexer and implemented something like:

fn tokenize_number(&mut self) -> Result<Token, Error> {
    // Let's ignore floating-point numbers for now
    let num = self.take_while(|c| c.is_ascii_digit());
    Ok(Token::Integer(num.parse::<i64>()?))
}
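
For comparison, the Logos route looks roughly like this (a sketch based on the Logos derive API; double-check the attribute syntax against the version of the crate you're using):

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\r\n]+")] // skip whitespace between tokens
enum Token {
    // \d+ only matches plain decimal digits; no 0x/0b/0o prefixes yet
    #[regex(r"\d+", |lex| lex.slice().parse::<i64>().ok())]
    Integer(i64),
}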

But then someone files an issue on GitHub asking when your language will support:

var a = 0xFFF
var b = 0b1001
var c = 0o777

And you draw a blank… What are those? They’re numbers in different bases: hexadecimal (0x), binary (0b), and octal (0o).

Why support different bases #

You’re probably thinking: “Humans use decimal numbers, why would I let users write other bases?” Great question! Certain problem domains are much more naturally expressed in hexadecimal or binary. For example:

  • Bitmasks and flags: It’s way easier to write 0b10101 than to calculate the decimal equivalent, \(21\).
  • Colors: Hexadecimal literals like 0xFFF500 directly map to RGB values.
  • Unix file permissions: 0o755 is more intuitive than the decimal \(493\).
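
Those decimal equivalents are easy to verify. Rust itself accepts all three literal forms, so a few assertions double as a sanity check:

fn main() {
    assert_eq!(0b10101, 21);
    assert_eq!(0o755, 493);
    assert_eq!(0xFFF500, 16_774_400); // the RGB triple (255, 245, 0)
}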

Forcing users to wrap everything in a standard-library function or custom macro adds boilerplate. By lexing and parsing non-decimal literals at compile time, you convert them into integer values once, so there’s no runtime overhead to parse 0xFF again.

Moreover, handling these variants in your lexer lets you enforce digit-validity rules early:

  • Reject 0xG1 as an invalid hexadecimal literal.
  • Reject 0b102 as an invalid binary literal.
  • Reject 0o799 as an invalid octal literal.
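
The standard library already gives a feel for this: i64::from_str_radix, which we’ll lean on later, rejects exactly these bodies once the prefix is stripped:

fn main() {
    // from_str_radix expects only the digits, without the 0x/0b/0o prefix
    assert!(i64::from_str_radix("G1", 16).is_err()); // 'G' is not a hex digit
    assert!(i64::from_str_radix("102", 2).is_err()); // '2' is not a binary digit
    assert!(i64::from_str_radix("799", 8).is_err()); // '9' is not an octal digit
}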

Since C popularized 0x for hexadecimal (and leading-zero octals), programmers have grown accustomed to multiple bases. If your lexer only recognized decimal, anyone trying to express bit patterns, masks, RGB values, or certain constants in hex, octal, or binary would get frustrated.

Let’s implement it in our lexer #

Suppose our lexer is defined like this:

use std::borrow::Cow;
use std::ops::Range;

pub struct Lexer<'src> {
    /// The source text being lexed
    src: Cow<'src, str>,
    /// Byte range of the not-yet-consumed part of `src`
    remaining: Range<usize>,
}

We’ll assume Lexer has methods like advance (to consume and return the next character), match_next (to check for and consume an expected prefix), and consume_if (to consume a single character if it satisfies a predicate).
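
If you’re following along at home, one plausible shape for those helpers is sketched below, under the assumption that remaining indexes into src; the Error::UnexpectedEof variant is hypothetical:

impl<'src> Lexer<'src> {
    /// Consume and return the next character, erroring at end of input.
    fn advance(&mut self) -> Result<char, Error> {
        let c = self.src[self.remaining.clone()]
            .chars()
            .next()
            .ok_or(Error::UnexpectedEof)?; // hypothetical error variant
        self.remaining.start += c.len_utf8();
        Ok(c)
    }

    /// If the remaining input starts with `expected`, consume it and return true.
    fn match_next(&mut self, expected: &str) -> bool {
        if self.src[self.remaining.clone()].starts_with(expected) {
            self.remaining.start += expected.len();
            true
        } else {
            false
        }
    }

    /// Consume one character if `pred` holds for it, reporting whether it did.
    fn consume_if(&mut self, pred: impl Fn(char) -> bool) -> bool {
        match self.src[self.remaining.clone()].chars().next() {
            Some(c) if pred(c) => {
                self.remaining.start += c.len_utf8();
                true
            }
            _ => false,
        }
    }
}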

A common pattern is that hexadecimal, binary, and octal literals all start with a leading 0. So we can check for that and then inspect the second character:

/// Lexes a single token, returning an error if no token can be built
fn next_token(&mut self) -> Result<Token, Error> {
    Ok(match self.advance()? {
        // The prefixed arms must come before the plain-digit arm;
        // otherwise 0x… would match as the decimal 0 followed by junk
        '0' if self.match_next("x") => { /* hex path */ },
        '0' if self.match_next("o") => { /* octal path */ },
        '0' if self.match_next("b") => { /* binary path */ },
        c @ '0'..='9' => { /* decimal path */ },
        // …other token rules…
    })
}

We can add two helper functions to handle the heavy lifting. Note that try_parse_int takes the digit body by reference rather than borrowing the whole lexer, which keeps the borrow checker happy when we call it right after capture_digits:

fn capture_digits<'l>(
    lexer: &'l mut Lexer<'_>,
    allowed: impl Fn(char) -> bool,
) -> &'l str {
    let start = lexer.remaining.start;
    while lexer.consume_if(|c| allowed(c)) {}
    // The captured digits span from the old start to the new start of `remaining`
    lexer
        .src
        .get(start..lexer.remaining.start)
        .expect("Lexer: out of bounds")
}

fn try_parse_int(body: &str, radix: u32) -> Result<Token, Error> {
    // from_str_radix fails on an empty body or on any digit invalid for the radix
    Ok(Token::Integer(i64::from_str_radix(body, radix)?))
}

Here, we let from_str_radix do the work of converting the string into an integer. If it fails (e.g., invalid digit), it returns an error that we can propagate or map to a custom error type. Now our match arms become:

fn next_token(&mut self) -> Result<Token, Error> {
    Ok(match self.advance()? {
        '0' if self.match_next("x") => {
            // Capture characters 0–9, a–f, A–F
            let body = capture_digits(self, |c| matches!(c, '0'..='9' | 'a'..='f' | 'A'..='F'));
            try_parse_int(body, 16)?
        },
        '0' if self.match_next("o") => {
            // Capture digits 0–7
            let body = capture_digits(self, |c| matches!(c, '0'..='7'));
            try_parse_int(body, 8)?
        },
        '0' if self.match_next("b") => {
            // Capture digits 0–1
            let body = capture_digits(self, |c| matches!(c, '0'..='1'));
            try_parse_int(body, 2)?
        },
        c @ '0'..='9' => {
            // Capture digits 0–9 for decimal; `advance` already consumed the
            // first digit, so glue it back onto the captured body
            let body = capture_digits(self, |c| c.is_ascii_digit());
            let literal = format!("{c}{body}");
            try_parse_int(&literal, 10)?
            // You could also extend this to handle floating-point numbers here
        },
        // …other token rules…
    })
}

Our lexer now correctly tokenizes hexadecimal, octal, binary, and decimal numbers.
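To see the whole thing end to end, a quick test could look like the following (assuming a Lexer::new constructor and a PartialEq Token, neither of which is shown above):

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lexes_integers_in_all_four_bases() {
        // Lexer::new is assumed; adapt to however your lexer is constructed
        for (src, expected) in [
            ("0xFFF", 0xFFF),
            ("0o777", 0o777),
            ("0b1001", 0b1001),
            ("4095", 4095),
        ] {
            let mut lexer = Lexer::new(src);
            assert_eq!(lexer.next_token().unwrap(), Token::Integer(expected));
        }
    }

    #[test]
    fn rejects_an_invalid_hex_body() {
        // 'G' is not a hex digit, so no body is captured and parsing fails
        assert!(Lexer::new("0xG1").next_token().is_err());
    }
}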

Conclusion #

Lexing numbers may look trivial at first glance, but there’s more to it than meets the eye. By handling multiple bases in your lexer, you give developers a more ergonomic experience: no extra macros or library calls, no runtime parsing overhead, and compile-time validation of digit rules. Plus, it aligns with the conventions programmers have come to expect since the days of C.

Author
Vigintillion
I’m Yarne, and I go by the pseudonyms Vig and Vigintillion online. I’m a bachelor’s student at the Catholic University of Leuven, passionate about building systems software and exploring language design and compiler architecture. I actively contribute to open-source projects and am proficient in numerous languages. Outside of coding, I enjoy gaming, listening to music, and playing table tennis and padel.