Yeehaw! The parser is done for now! That's a pretty big milestone for the early development of this project; it unblocks all of the stuff that needs it, which is like, everything else. The other half of the v0.1 release is the Test Client, which means now I get to try out the real fun stuff - compiling this Rust code to WASM and linking that into a VSCode extension. More on that coming soon!
Let's talk about the parser, though. Parsing is Cool and Fun. It's also an extremely dense topic (take a look at any Wikipedia article on the subject if you don't believe me), so I'd like to provide an entry-level rundown of how it works and how I personally approach it.
Background
Here's a sample of the language - the entire standard library, in fact:
/**
* A `Ref` is a reference to another existing resource.
*
* This is most useful in conjunction with `List`s, as a `List<Ref<T>>`.
* That allows you to remove an item from the list by `DELETE`ing its `Ref`,
* rather than deleting the entire item itself.
*/
resource Ref<T> {
    embed T
    /// GET is not mandatory for this type
    GET -> #405;
    /// Removes the referenced item
    DELETE;
}
/**
* This is a paginated list.
*/
resource List<T> {
    embed T[]
    links {
        /**
         * References the next page of the list, if such a page exists.
         *
         * If empty, this is the final page of the list.
         */
        next? -> @self,
    }
    /// Renders the current page of the list.
    GET;
    /// Creates a new item within the list.
    POST T -> #201 T;
}
/**
* This is essentially a placeholder for anything that's not data.
*
* For example, this is appropriate to allow users to upload and download images.
*/
resource Media {
    GET -> @media;
    PUT @media;
}
/// A simple POST-only link with no request body.
resource Action {
    POST -> #204;
}
This language describes families of HTTP endpoints, called Resource Classes and denoted here with the `resource` keyword. Within the Resource Classes we specify the data they embed with the `embed` keyword, their HTTP methods, and how they link to other resources with the `links` keyword. Line comments `//` and block comments `/* */` work like they do in most other languages.
Hopefully this makes enough sense at a glance - I still need to come up with some better documentation and examples for how this might be used in the real world. For now, don't worry too much about the specifics of the language; just take it at face value that it's a good idea to come up with our own language for this purpose, and let's walk through that process together!
So You Want To Design A Language
Languages are defined by their grammar, so let's start there. Formally, a grammar is a set of rules about replacement. Grammars define syntactic variables (or just "variables" for short) which can be replaced by a string (sequence) consisting of symbols (individual units), which themselves might also be variables.
A grammar looks something like the below, where anything on the left side of the `=` sign can be replaced by whatever's on the right. The unquoted words are variables. The quoted bits are literals - a literal is a symbol that represents the quoted contents. The `|` operator is called alternation - `x | y` can evaluate to `x` or `y`. In this format, rules always end in a semicolon.
verb = "throw" | "eat" ;
subject = "I" | "You" ;
object = "the apple" | "the football" ;
sentence = subject verb object ;
A language is the set of strings that can be generated by the grammar. Generally, there's one top-level variable - here it's `sentence` - that defines the entire language. The language generated by `sentence` consists of these strings:
- I eat the apple
- I eat the football
- You eat the apple
- You eat the football
- I throw the apple
- I throw the football
- You throw the apple
- You throw the football
That format above is pretty similar to something called EBNF (Extended Backus-Naur Form), which is very popular for describing computer languages. The main difference is that in standard EBNF you're supposed to use commas to concatenate symbols together, which I don't like, and also that we're making assumptions about the whitespace between the symbols in the language. For our purposes, variables are a single word like `foo` or `bar_baz`, and each symbol can be separated by any amount of whitespace - you'll see what I mean later.
It's important to note another assumption we've made, though - that each syntactic variable is defined only once, and that it's the only thing on the left side of the `=` in each rule. I talk more about that in the footnotes, but let's carry on for now.
Recursion and Parsing
Righto, so languages can also be recursive - this is when it starts to get interesting. Consider this grammar:
digit = '1' | '2' | '3' ;
group = '(' sentence ')' ;
sentence = digit digit | digit group digit ;
In plain English, this grammar says: a `sentence` can be two digits, or: one digit, followed by another `sentence` in parentheses, followed by another digit. This grammar is recursive through `group` and `sentence` - there's an infinite loop back and forth between those rules. The language described by `sentence` is now infinitely big! Check out some of the possibilities here:
- 1 2
- 1 ( 1 2 ) 3
- 1 ( 2 ( 3 3 ) 2 ) 1
So how do we interpret this? Our previous grammar was nice and easy, but now there's an infinite set of possibilities. If you're a computer, all you know is a sequence of characters, but you need to understand the structure of the text as a whole. That's what parsing is all about - taking raw text, applying grammar rules to it, and deriving a syntax tree from it.
Take that last example, `1 ( 2 ( 3 3 ) 2 ) 1`. The syntax tree for that sentence looks like this:
sentence
┣ digit: "1"
┃
┣ group
┃ ┣ "("
┃ ┣ sentence
┃ ┃ ┣ digit: "2"
┃ ┃ ┃
┃ ┃ ┣ group
┃ ┃ ┃ ┣ "("
┃ ┃ ┃ ┣ sentence
┃ ┃ ┃ ┃ ┣ digit: "3"
┃ ┃ ┃ ┃ ┗ digit: "3"
┃ ┃ ┃ ┗ ")"
┃ ┃ ┃
┃ ┃ ┗ digit: "2"
┃ ┗ ")"
┃
┗ digit: "1"
We've just done manual parsing! If you've ever had to diagram a sentence in school, you've done it before too.
If we can make a data structure that represents the above in memory, then the computer fully understands the text. Suppose this were a programming language - rather than having to try to figure out what the text means one character at a time, it can step through each node in the syntax tree and say, "ah, this is just a simple sentence, two digits" or "nope, this sentence has a group in the middle of it, I have to do something else first". Of course, real grammars are much bigger and more complicated than this, but that concept you've just grasped is the core idea underlying all computer languages.
One quick note: Above, we have a Concrete Syntax Tree (CST) - it includes things like the parentheses in `group`. When writing a computer language interpreter, however, we don't need that information (we already know it's a group, therefore it must have parentheses), and we have a bit of wiggle room as far as what goes where. For that reason, the representations that most parsers derive are called Abstract Syntax Trees or ASTs, because they don't strictly represent the grammar; rather, just the important parts.
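To make that distinction concrete, here's a rough sketch (in Rust, purely for illustration - these names aren't from the Veneto codebase) of an AST for the toy digit grammar above. Notice that the parentheses are gone; the nesting of the data structure carries that information by itself.
// A possible AST for the digit/group grammar (illustrative only).
enum Sentence {
    // sentence = digit digit
    Pair(u8, u8),
    // sentence = digit group digit, where group = '(' sentence ')'
    Nested(u8, Box<Sentence>, u8),
}

// "1 ( 2 ( 3 3 ) 2 ) 1" would become:
// Sentence::Nested(1, Box::new(Sentence::Nested(2, Box::new(Sentence::Pair(3, 3)), 2)), 1)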
Optionals and repetition
The only pieces we're missing are optional symbols (sometimes this might be here, sometimes it might not!) and repetition (sometimes there can be more than one of this symbol!). In EBNF and many pseudo-EBNF formats like the one I created, optionals are denoted with `[square-brackets]` and repetition is denoted with `{curly-braces}`. They can also be combined `[{ like this }]` to indicate that it's both - there can be zero, one, or multiple of that symbol.
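For example, a hypothetical rule for a parenthesized, comma-separated argument list (not part of any grammar in this post, just an illustration) could be written like this - the whole list of arguments is optional, and each additional argument needs a comma in front of it:
arg_list = '(' [ argument [{ ',' argument }] ] ')' ;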
Congratulations, you now understand formal language theory!
This concept of using recursive grammars to describe patterns is actually thousands of years old; at least as old as the ancient Indian grammarian Pāṇini. In 1956, Noam Chomsky published a paper defining different classes of grammars and how they can be processed, which is the foundation of computer language theory today. It's an incredibly rich topic that plenty of brilliant people have devoted their entire careers to studying. It even has applications far beyond human and computer languages, including things like some important work studying geological formations and genome sequencing. But now, if I've made enough sense up to this point, you have a strong enough grasp of it to implement these ideas on your own. Let's get to it!
How do I teach the computer to do it?
There are lots of parser generators out there that will do this work for you if you give them a well-defined grammar, but this is a blog post about writing parsers. I decided to write this one by hand as a learning experience, and also so that it can be easily customized to fit whatever needs we end up having.
Before we dive into our implementation, though, I need to describe the techniques we will use.
Lexing
Again, all the computer knows is a sequence of individual characters, one at a time. For us as language implementers, though, it's more useful to reason about entire words, rather than also having to piece them together from characters at every step of the way. In theory, you could define what words look like in your grammar itself, but in practice most parser implementations like to "chunk" characters together to form tokens.
This process is called lexing, and different languages go about it in different ways. I'll gloss over most of the details here, but in Veneto, there are a few important properties we want our lexer to understand:
- Each token is the longest possible sequence of characters of the same "type" - these types are:
  - `Number`, which is a sequence of digits
  - `Punctuation`, which are non-alphanumeric characters - these must match a predefined list of operators or an error is returned
  - `Word`s, which start with a letter and then consist of alphanumeric characters or an underscore - these can either match a predefined set of `Keyword`s or they can be used as an `Identifier` instead
- At the end of the file, I chose to have a special `EOF` token type. Instead, the entire thing could be wrapped in an `Option`, or be `null` in languages that have it, but I found it more convenient and readable to have a proper EOF token.
- Whitespace can be completely ignored wherever it doesn't separate a token (I don't like whitespace-sensitive languages - looking at you, Python).
- Comments are also (mostly) ignored, and can basically be treated as whitespace in that regard.
- String literals are a thing, and in general it's easier to treat them as a single token, especially if you need to do things like escape quotation marks.
With that out of the way, our parser also doesn't have to worry about whitespace or comments, which cleans everything else up a lot. If you're curious, the full description of how Veneto's lexer works can be found here.
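To make those categories concrete, here's a rough sketch of what the token types could look like as a Rust enum. This is a simplification for illustration - it's not the exact definition from the Veneto codebase.
// A simplified sketch of the token categories described above (illustrative only).
pub enum TokenKind {
    Number(String),        // a run of digits
    Punctuation(String),   // must match the predefined list of operators
    Keyword(String),       // a Word that matches a reserved keyword
    Identifier(String),    // any other Word
    StringLiteral(String), // quoted text, kept as a single token
    EOF,                   // a dedicated end-of-file token
}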
Token Streams & Recursive Descent
The most popular way to interface with a lexer is as a token stream, an object which emits tokens. Ours exposes a simple interface with these two methods:
- `peek`, which returns the current token without moving on to the next one
- `next`, which returns the current token and moves on to the next one
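Conceptually, the contract boils down to something like this - a toy, self-contained sketch rather than the real TokenStream, which wraps the lexer and returns ParseResults so that lexing errors can propagate:
// A toy token stream over an already-lexed Vec, just to show the peek/next contract.
#[derive(Clone, Debug, PartialEq)]
enum Tok {
    Identifier(String),
    Punctuation(char),
    Eof,
}

struct Tokens {
    items: Vec<Tok>,
    pos: usize,
}

impl Tokens {
    /// Return the current token without consuming it.
    fn peek(&self) -> Tok {
        self.items.get(self.pos).cloned().unwrap_or(Tok::Eof)
    }

    /// Return the current token and advance past it.
    fn next(&mut self) -> Tok {
        let tok = self.peek();
        self.pos += 1;
        tok
    }
}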
The basic idea of our parser is that we have parsing functions that correspond to rules in the grammar, and these functions accept the token stream as an argument. When invoked, these functions start parsing a clause at the stream's current position, and return an AST node of the appropriate type. Since we're just passing around the same token stream, each part of the parser can just grab tokens however it needs to, without having to care about what's happening anywhere else.
This concept is called recursive descent, and it's my favorite. There's a huge variety of parsing techniques out there, but in my opinion, recursive descent with a token stream is the easiest to read and write. I go into more detail in the footnotes about why this works, but if you're with me so far, let's keep going.
How we use this technique
With these tools in our belt, we have everything we need to parse a clause in our grammar. We start right after the `=` sign of our rule, and handle each symbol we run into, from left to right. If the symbol is a literal, that's easy - we just check for a token that matches it. If it's a variable, on the other hand, we can call another function to parse the corresponding clause.
There are a few grammatical circumstances that we have to handle:
- If the symbol is not optional, we can just expect it to be there, and raise an error if we don't find it.
- If it is optional, then we have to use the `peek` operation to see if it's there, and then handle it accordingly.
- To handle repetition, we start a loop, handling all of the optional and non-optional symbols like we would otherwise, and then peek for the next non-repeated symbol so that we know when to terminate the loop.
- To handle alternation, we can just `peek` for each possibility accordingly. Generally, we want to try the possibilities in order, from left to right - this is called leftmost derivation.
So, with that in mind, there are two sets of tools I'll introduce to help us out.
Peeking for tokens
With this method, a common pattern arises very quickly: sometimes you need to peek a token, and advance the stream only if it matches a certain variant you're expecting. To help with this, I introduced a set of helper operations that I call "peeking for" a token:
- `peek_for_identifier` peeks at the current token in the stream; if it's an `Identifier`, then it returns the string containing its contents and advances the stream - otherwise, it doesn't advance the stream and returns `None`.
- `peek_for_punctuation` peeks for a specific kind of `Punctuation`; if the current token matches it, it advances the stream and returns `true` - otherwise, it doesn't advance the stream and returns `false`. We return a boolean here because, unlike `Identifier`, which can have any value the user wants, here we are looking for a specific predefined type of punctuation.
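Building on the toy stream sketched earlier (again, these are simplified stand-ins - the real helpers live on Veneto's TokenStream and return a ParseResult so lexer errors can bubble up), the "peek for" pattern looks roughly like this:
// Extending the toy stream: "peeking for" a token only advances the position
// when the current token matches what we asked for.
impl Tokens {
    /// If the current token is an identifier, consume it and return its contents.
    fn peek_for_identifier(&mut self) -> Option<String> {
        if let Tok::Identifier(name) = self.peek() {
            self.pos += 1;
            Some(name)
        } else {
            None
        }
    }

    /// If the current token is the given punctuation, consume it and return true.
    fn peek_for_punctuation(&mut self, expected: char) -> bool {
        if self.peek() == Tok::Punctuation(expected) {
            self.pos += 1;
            true
        } else {
            false
        }
    }
}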
Traits: Expecting and Peeking clauses
Here's where we properly dive into the specifics of our parser. The parser is written in Rust, but hopefully it's understandable even if you're not familiar.
Our parsing functions are defined as associated functions on our AST nodes, which are just data structures that we create. There are two flavors of those functions, each with their own trait:
pub trait Expectable where Self: std::marker::Sized {
    fn parse_expect(stream: &mut TokenStream) -> ParseResult<Self>;
}

pub trait Peekable where Self: std::marker::Sized {
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>>;
}
- `Expectable` provides the `parse_expect` function, which expects there to be a valid representation of the AST node at the current position in the `stream`, and returns an error if it finds anything it doesn't expect.
- `Peekable` provides the `parse_peek` function, which peeks the next token to see if this type of AST node could be present. If it finds the token it's looking for, it carries on parsing; if it doesn't, it returns `None` without advancing the token stream. Not every AST node is peekable, but this is useful in certain situations - we'll get more into that later.
Get on with it, already!
Yes, yes, right! Let's finally write some actual parsing code. We'll focus on one particular aspect of the language as an example.
The language
Let's consider the `links` statement. It's in the sample at the very top of this article, but here's a better example: Imagine you're building the next Amazon - each item available for sale has an API resource, and that API output contains links to other resources and endpoints. In Veneto, you might represent that like this:
links {
    seller -> Merchant,
    addToCart? -> Action,
    reviews -> List<Review>,
}
- The word on the left of the `->` is the relation ("rel") of the link - in essence, its name.
- The stuff to the right of the `->` is the type of resource that the link "points to".
- The `?` means it's optional - in this case, maybe `addToCart` isn't always available because the item could be out of stock.
- The Veneto standard library contains definitions for `Action` and `List`, while `Merchant` and `Review` are things that you would create, but that's not super important right now - we just know that it's a reference to another resource class.
The grammar
The Veneto grammar for this statement looks like this:
rc_links = 'links' links_block ;
links_block = '{' links_list '}' ;
links_list = [ link [',' [links_list]] ] ;
link = Identifier ['?'] '->' rc_reference ;
There's a lot going on there! Let's discuss these rules in plain English:
- A `links_block` is a `links_list` enclosed in braces.
- A `links_list` can be:
  - Nothing, or:
  - There might be a first `link`;
  - if there is a first `link`, it might be followed by a comma;
  - if there is a comma, it might be followed by another `links_list`!
- A `link` has an Identifier (which is a type of token), then maybe a `?`, then a `->`, then an `rc_reference`, which is defined elsewhere.
Let's start with the `link` rule. Everything in the `links_list` rule is optional, which means that we need to be able to see if the `link` inside of it is present or not, so it's a good idea to go ahead and make it `Peekable`.
Peeking a link
Here's our AST node:
pub struct Link {
    pub rel: String,
    pub optional: bool,
    pub typ: RCReference,
    // 👆 unfortunately, `type` is a reserved keyword in Rust, so we have to call it something else
}
And the grammar for this clause, as we mentioned above:
link = Identifier ['?'] '->' rc_reference ;
We use the grammar to parse the input into a `Link` like this:
- The `Identifier` in the grammar becomes the `Link`'s `rel`
- `optional` becomes `true` if the `?` is present
- We parse the `rc_reference` clause separately, which becomes the `typ`
And here's the implementation! We just follow the grammar, from left to right:
impl Peekable for Link {
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>> {
        let Some(rel) = stream.peek_for_identifier()? else {
            return Ok(None)
        };

        let optional = stream.peek_for_punctuation(Punctuation::Optional)?;

        stream.next()?.expect_punctuation(Punctuation::Arrow)?;

        Ok(Some(Link {
            rel,
            optional,
            typ: RCReference::parse_expect(stream)?,
        }))
    }
}
- Have a peek at the current token - if it's an identifier, we capture its contents as `rel` and continue. Otherwise, there's no `Link` here - return `None`.
- Check if the `Optional` operator (`?`) is present - if it is, we mark it as `true` in the AST node. Since `peek_for_punctuation` returns a boolean for us, all we need to do is call it.
- Regardless, we expect there to be an `Arrow` (`->`) next. `expect_punctuation` is a helper function that converts the token to the appropriate error if it's not there - this might be rendered to the user like "expected '->' at line 12, column 5".
- Finally, we use `parse_expect` to obtain an `RCReference`. This happens inside of the implicit return expression that creates the `Link` AST node.
Expecting a LinksBlock
Time to move on to the `LinksBlock`! First, though, a quick detour:
Comma-separated lists
Let's talk about this comma-separated list idea. This is actually a super common pattern, used all over the place in our grammar - it doesn't make very much sense to implement it every time we need it. Instead, we can make one function that does this for us everywhere. This is a recursive grammar, and we're writing a recursive descent parser for it, so let's whip up a recursive function to do the job!
The function I made here is generic over any `Peekable` type, so it will work for our `Link`s. It takes a reference to a `Vec` (the growable list type in Rust) where we're putting the things we find, and a reference to the stream.
To recap, here's the original grammar for our `links_list`:
links_list = [ link [',' [links_list]] ] ;
Our function is generic, though, so here's the same grammar replaced with a generic `T` to make it a bit easier to follow:
T_list = [ T [',' [T_list]] ] ;
And here's the function!
fn parse_list_into<T: Peekable>(vec: &mut Vec<T>, stream: &mut TokenStream) -> ParseResult<()> {
    if let Some(node) = T::parse_peek(stream)? {
        vec.push(node);
        if stream.peek_for_punctuation(Punctuation::Comma)? {
            parse_list_into(vec, stream)?;
        }
    }
    Ok(())
}
- The whole thing is optional, so first we peek for a `T`. If we find it, then we add it to the `vec`, and we continue.
- Next, there might be a comma, so we peek for it.
- If there's a comma, there might be another list, so we recurse!
Actually implementing LinksBlock
Here's the AST node; this one is super simple, it's just a `Vec` of `Link`s:
pub type LinksBlock = Vec<Link>;
And the grammar again:
links_block = '{' links_list '}' ;
And here's the implementation! Factoring out the list part means this one is a piece of cake.
impl Expectable for LinksBlock {
    fn parse_expect(stream: &mut TokenStream) -> ParseResult<Self> {
        stream.next()?.expect_punctuation(Punctuation::BraceOpen)?;
        let mut links = LinksBlock::new();
        parse_list_into(&mut links, stream)?;
        stream.next()?.expect_punctuation(Punctuation::BraceClose)?;
        Ok(links)
    }
}
Tying it all together
Those two parse nodes are the easiest to understand at a granular level, but if you understood that, you can understand the entire parser! Now, let's demonstrate how it fits into the larger picture.
This `LinksBlock` node that we've implemented belongs to a Resource Class - a complete example in Veneto could look like this:
resource Item {
    data {
        name: string,
        price: decimal,
    }
    links {
        seller -> Merchant,
        addToCart? -> Action,
        reviews -> List<Review>,
    }
}
The `Peekable` implementation for resource classes is a lot to get into, since there are many other components in play, but essentially it peeks for the `resource` keyword and then carries on parsing the rest of the clause, all the way until the closing brace.
Here is where we plug in all of the pieces: resource classes are one potential part of a `Document`, which is the top-level rule of the Veneto grammar. Its definition is pretty simple - just a repeated alternation of all of its parts:
document = [{
    use |
    type_alias |
    interface_decl |
    resource_class |
    rc_entry_directive
}] ;
So here's the AST node, which is the root of the tree:
pub struct Document {
    pub imports: Vec<UseTree>,
    pub types: HashMap<String, Type>,
    pub interfaces: HashMap<String, InterfaceExpression>,
    pub resource_classes: Vec<ResourceClass>, // 👈 the thing we just talked about!
    pub entry_points: Vec<EntryPoint>,
}
Finally, to wrap it all up, there's a special `parse` method just for the top level. If you speak Rust, notice that the `stream` parameter is moved and not borrowed - we consume the whole thing for this operation:
impl Document {
    pub fn parse(mut stream: TokenStream) -> ParseResult<Self> {
        let mut doc = Self {
            imports: Vec::new(),
            types: HashMap::new(),
            interfaces: HashMap::new(),
            resource_classes: Vec::new(),
            entry_points: Vec::new(),
        };

        loop {
            if stream.peek_for_keyword(Keyword::Use)? {
                // ... (not relevant)
            }
            else if stream.peek_for_keyword(Keyword::Type)? {
                // ... (not relevant)
            }
            else if stream.peek_for_keyword(Keyword::Interface)? {
                // ... (not relevant)
            }
            else if let Some(rc) = ResourceClass::parse_peek(&mut stream)? {
                // 👋 Here!
                doc.resource_classes.push(rc);
            }
            else if stream.peek_for_keyword(Keyword::Entry)? {
                // ... (not relevant)
            }
            else {
                let next = stream.next()?;
                if next.kind == TokenKind::EOF {
                    return Ok(doc)
                } else {
                    return Err(next.as_err_unexpected())
                }
            }
        }
    }
}
We simply loop, peeking for each possible component of the `Document`. If it's a `ResourceClass`, we add that to our list! If we fall through to the final `else`, we haven't matched any of those components, so we have a simple choice: if it's the end of the file, successfully return the parsed `Document`; otherwise, return an "unexpected" error. And that's it! That's the whole parser!
To use this, we just create a `TokenStream` and pass it along to that `parse` method. Below, you can see how the standard library is loaded. Since it's pretty small, we use the handy `include_str!` macro to bundle the file into the compiled library as a string, and create a `Chars` iterator from the string since that's what we need to create a `TokenStream`.
pub fn parse_std() -> ParseResult<Document> {
    let stream = TokenStream::new( include_str!("std.veneto").chars() );
    Document::parse(stream)
}
For userland files, rather than the standard library, we will have to create a `Chars` iterator from an opened file instead. This isn't quite as straightforward as it sounds, but there are some crates that may help us; we'll burn that bridge when we come to it.
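The most direct version - just a sketch, and not necessarily how the project will end up doing it - is to buffer the whole file into an owned String and lex from its `chars()`, assuming the parsed `Document` owns its data rather than borrowing from the source text. Streaming straight from disk is where it gets trickier.
use std::{fs, io, path::Path};

// Hypothetical helper: read the file fully into memory, then parse it.
pub fn parse_file(path: &Path) -> io::Result<ParseResult<Document>> {
    let source = fs::read_to_string(path)?;
    let stream = TokenStream::new(source.chars());
    Ok(Document::parse(stream))
}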
Two more fun things
While developing this parser, I made use of two Rust features that made the codebase much nicer, and I couldn't resist sharing them here.
Combining Peekable and Expectable
You may have noticed that `Peekable` and `Expectable` have a lot in common: in general, if you can peek for something, you should be able to expect it too. That would just mean peeking for it and then throwing the appropriate error if you don't find it.
The most obvious idea is to just throw an error if `parse_peek` returns `None`, but there's a problem: as we discussed earlier, errors in our parser have to be constructed from the token they're thrown at, so that the error can have useful information like the location (line & column number) describing where it occurs. What we want is a solution that does that for us while keeping good encapsulation and reusing as much code as possible for the sake of easy maintenance.
Not everything is quite as straightforward, but lots of clauses in Veneto follow a particular pattern: they start with a specific kind of punctuation, like an opening brace, as we saw with `LinksBlock` above. To make such a clause `Peekable`, you can peek for that token, return `None` if it's not there, and otherwise carry on with the parsing. `Expectable` is the same deal, except that you're expecting that token to be there.
The solution I found comes courtesy of one of my favorite features in Rust: blanket implementations. I think this is best explained by example. I've created a new trait called `Finishable`, which provides an initial token type (a constant), and a function called `parse_finish` which finishes parsing the clause after that initial token is consumed. It looks like this:
pub trait Finishable where Self: std::marker::Sized {
    const INITIAL_TOKEN: Terminal;

    fn parse_finish(stream: &mut TokenStream) -> ParseResult<Self>;
}
`Terminal` basically just means "`Punctuation` or `Keyword`", and it has its own `peek_for_terminal` and `expect_terminal` methods.
Here's where the magic happens. In Rust, trait implementations are generic, so we can create a blanket implementation for all types that satisfy criteria we provide. Here, we automatically implement `Expectable` for everything that has a `Finishable` implementation:
impl<T> Expectable for T where T: Finishable {
    fn parse_expect(stream: &mut TokenStream) -> ParseResult<Self> {
        stream.next()?.expect_terminal(Self::INITIAL_TOKEN)?;
        T::parse_finish(stream)
    }
}
It's super simple! We just expect the `INITIAL_TOKEN`, then finish the parsing with `parse_finish` and return the result. `Peekable` follows a similar pattern:
impl<T> Peekable for T where T: Finishable {
    fn parse_peek(stream: &mut TokenStream) -> ParseResult<Option<Self>> {
        if stream.peek_for_terminal(Self::INITIAL_TOKEN)? {
            Ok(Some(T::parse_finish(stream)?))
        } else {
            Ok(None)
        }
    }
}
Now we have two trait implementations for the price of one! It's super useful - some clauses in Veneto need to be peeked for or expected in different places depending on context. Now we have two different functions to do that for us, each with a name that clearly conveys the intent behind it - this kinda stuff goes a long way towards a readable and maintainable code base in the long term.
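For example, the `LinksBlock` from earlier fits this pattern: instead of the hand-written `Expectable` impl we saw above, it could implement `Finishable` and get both traits from the blanket impls. This is just a sketch - in particular, the `Terminal::Punctuation` variant name is my guess at how the constant would be spelled:
impl Finishable for LinksBlock {
    // The clause is identified by its opening brace...
    const INITIAL_TOKEN: Terminal = Terminal::Punctuation(Punctuation::BraceOpen);

    // ...and everything after that brace parses exactly like before.
    fn parse_finish(stream: &mut TokenStream) -> ParseResult<Self> {
        let mut links = LinksBlock::new();
        parse_list_into(&mut links, stream)?;
        stream.next()?.expect_punctuation(Punctuation::BraceClose)?;
        Ok(links)
    }
}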
Blanket implementations were a true stroke of brilliance on the part of the language designers who created them, and it's one of the many reasons I prefer the trait system to traditional object-oriented programming.
I was promised macros!
Right! So, earlier we talked about `peek_for_punctuation` - the method that peeks a token, checks if it is the type of punctuation you asked for, and advances the token stream only if that is the case. It's a really common pattern in the code. Very often, you need to branch off into one of several different clauses, depending on whether a certain punctuation token is present, but you need to continue as normal if none of them are there. One option is to just use a massive `if`/`else` block for this - here's a simple example, similar to what we did above with our `LinksBlock`:
if stream.peek_for_punctuation(Punctuation::BraceClose)? {
    // Advance to the next token, and return the thing we've parsed
    return Ok(Some(parsedthing))
} else if stream.peek_for_punctuation(Punctuation::Comma)? {
    // Advance to the next token, and continue the loop
    continue;
} else {
    // DO NOT advance to the next token; instead, expect a "Foo" clause to start right here
    foos.push(Foo::parse_expect(stream)?)
}
Eughh! That gets messy. It's really not a big deal, but it inspired me to finally get over my fear of Rust macros. I haven't found a super nice beginner-friendly macros tutorial yet, and I plan to contribute one to the official docs, but I was able to fight my way through it and come up with this. It's called `peek_match!`, and it emulates the syntax of normal `match` expressions, basically expanding them into the `if`/`else` blocks like you saw above:
macro_rules! peek_match {
    ($stream:ident.$fn:ident { $first:expr => $first_then:expr, $($other:expr => $other_then:expr),+ , _ => $else_then:expr }) => {
        if $stream.$fn($first)? { $first_then }
        $(
            else if $stream.$fn($other)? { $other_then }
        )+
        else { $else_then }
    }
}
So instead, we can write stuff like this:
peek_match!(stream.peek_for_punctuation {
    Punctuation::BraceClose => return Ok(Some(parsedthing)),
    Punctuation::Comma => continue,
    _ => foos.push( Foo::parse_expect(stream)? )
})
I hope that makes sense! I'm proud of it anyway - I find it a lot nicer to look at. I wish I could go into more detail, but this post is way too long already. If you want another post explaining it, give me a shout at jake@jakeconley.com and I'd be happy to write one!
Thanks for reading!
This was a lot to get into, and there's so much detail I wanted to include in this post that I couldn't quite fit. I hope you learned something or at least enjoyed reading this. As always, questions and comments are welcome at jake@jakeconley.com.
For more updates on the Veneto project, keep checking back here or give the GitHub organization a follow, and of course reach out if you're interested in contributing.
I plan to put out another devlog soon talking about the lessons I've learned so far in this project. I also might set up a mailing list or something for this blog if anyone expresses interest. Cheers!
Footnotes
Here's a few miscellaneous technical observations that I felt the article would be incomplete without.
Symbols and Context
You may have noticed that there are two kinds of symbols - those that can be replaced by something else, like variables, and those that cannot, like literals. Symbols that can be replaced are called nonterminal symbols, which is pretty much a synonym for syntactic variables; the other ones are called terminal symbols, the idea being that you can stop looking for other replacements once you've found them.
In theory, you could define a grammar like this:
subject = "I" | "You" ;
subject verb = "We eat" | "She throws" ;
verb = "eat" | "throw" ;
object = "the apple" | "the football" ;
sentence = subject verb object ;
See how that second line defines two nonterminal symbols, and that `subject` is defined in two places? This is called context sensitivity, and it's a pain to deal with - now when you're looking at the rule `sentence` you have to look in multiple places and think about what that rule means in the context of the other stuff you're analyzing.
Instead, for computer languages, we prefer to design a grammar where each nonterminal symbol is defined exactly once, and it's the only thing on the left side of that `=` sign. These are called context-free grammars, and they're all we want to work with as language designers.
How do we know this will work?
Great question! It actually depends on the grammar you're parsing. There's a name for this: LL(k) grammars can be parsed by looking at only k tokens at a time; we only need one, so our grammar is LL(1). We like that because it's easy.
In theory, you can prove that a grammar is LL(1) by demonstrating certain properties about the grammar's recursion. The way I see it, as far as actually knowing that your grammar is LL(1), you have three options:
- Mathematically prove it yourself
- Rely on something else to mathematically prove it (there are lots of tools out there that will do this for you if you feed them a standard form like EBNF)
- Empirically demonstrate it with rigorous and thorough testing
For now, I'm going with option #3. I've never proven that Veneto's grammar is LL(1) - the draft isn't written in standard EBNF, and it's constantly changing at this time. I've written a solid bunch of tests so far, and I'm always adding more. Furthermore, I've written the entire parser this way and I have yet to run into any dealbreaking issues, so I'm fairly confident that the grammar is LL(1) in practice.
If you're drafting your own grammar but have the luxury of making changes as you go, the general idea is that you need to know where your clauses start and end. There's a bit of wiggle room depending on context, but if you know what tokens start or end a certain clause, then you can always just look for that token and you'll know what to do. I had to go back and revise the grammar a few times for this reason, but it was always easy after that.
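As a made-up illustration (these rules aren't from the Veneto grammar): a rule like
field = Identifier ':' type_expr ';' | Identifier '=' literal ';' ;
can't be handled by peeking a single token, because both alternatives start with the same Identifier. Splitting out the part that comes after the identifier fixes it:
field = Identifier field_value ';' ;
field_value = ':' type_expr | '=' literal ;
Now the parser consumes the Identifier first and decides between the alternatives by peeking for a single ':' or '=' token.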
If you're drafting your own grammar but you need to get it right the first time, make sure you can prove that it's LL(1) 😉