cephei8's site

Don't be afraid of goyacc: making a user-facing DSL for querying data

One of the core features of my Greener project is a user-facing DSL for querying the database of test results.
For example, status = "fail" AND name = "login_test" fetches results of the "login_test" test that failed.

Surprisingly, goyacc, Go's port of the classic YACC parser generator, turned out to be a relatively simple and straightforward way to parse user queries before converting them into SQL.

Note: this post covers a very simplified approach for educational purposes.

Goal

The goal is to turn this:

status = "fail" AND name = "login_test"

Into this:

SELECT * FROM testcases WHERE status = 'fail' AND name = 'login_test'

The workflow consists of:

  1. Tokenizing the query string
  2. Parsing the tokens into an abstract syntax tree (AST)
  3. Generating SQL from the AST

However, the problem will be approached in a slightly different order:

  1. Define AST
  2. Write goyacc grammar
  3. Implement lexer (tokenizer)
  4. Generate parser using goyacc
  5. Generate SQL from AST

Implementation

Define AST

The AST represents the structure of the parsed query.
For our simple DSL we need just a few types:

type Query struct {
	Condition Condition
}

type Condition any

type FieldCondition struct {
	Field string
	Value string
}

type AndCondition struct {
	Left  Condition
	Right Condition
}

FieldCondition represents field = "value".
AndCondition combines two conditions with AND.
Condition is either a FieldCondition or an AndCondition (unfortunately, Go doesn't have sum types; we could make it an interface, but for the sake of this example any works fine).

For example, status = "fail" AND name = "login_test" becomes:

           AndCondition
              /     \
  FieldCondition    FieldCondition
  (status="fail")   (name="login_test")
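
As a sanity check, this tree can be built by hand. The snippet below repeats the type definitions so it compiles on its own:

```go
package main

import "fmt"

type Condition any

type FieldCondition struct {
	Field string
	Value string
}

type AndCondition struct {
	Left  Condition
	Right Condition
}

func main() {
	// The AST for: status = "fail" AND name = "login_test"
	ast := &AndCondition{
		Left:  &FieldCondition{Field: "status", Value: "fail"},
		Right: &FieldCondition{Field: "name", Value: "login_test"},
	}
	fmt.Printf("%+v\n", ast.Left)  // &{Field:status Value:fail}
	fmt.Printf("%+v\n", ast.Right) // &{Field:name Value:login_test}
}
```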

Write goyacc grammar

The grammar defines what valid queries look like:

%{
package main
%}

%union {
    str       string
    condition Condition
}

%token <str> tokIDENT tokSTRING
%token tokAND tokEQ

%type <condition> query condition term

%left tokAND

%%

query:
    condition
    {
        yylex.(*lexer).result = &Query{Condition: $1}
    }

condition:
    condition tokAND term
    {
        $$ = &AndCondition{Left: $1, Right: $3}
    }
    | term
    {
        $$ = $1
    }

term:
    tokIDENT tokEQ tokSTRING
    {
        $$ = &FieldCondition{Field: $1, Value: $3}
    }

%%

To break it down:

  - The %{ ... %} block at the top is copied verbatim into the generated Go file, so it holds the package clause (and any imports the actions need).
  - %union lists the value types that tokens and rules can carry; goyacc turns it into the yySymType struct that the lexer fills in.
  - %token declares the terminal tokens; the <str> annotation means tokIDENT and tokSTRING carry a string value in the union.
  - %type declares which union field holds the value of each non-terminal; here query, condition, and term all produce a Condition.
  - %left tokAND makes AND left-associative, so a AND b AND c groups as (a AND b) AND c.
  - The rules between the two %% markers describe valid input. In the actions, $$ is the value of the rule itself, while $1, $3, and so on refer to the values of its components.
  - yyParse only returns an int status, so the top-level query rule stores the finished AST on the lexer via yylex.(*lexer).result.

Implement lexer (tokenizer)

The lexer breaks input into tokens.
It must implement the following interface:

type yyLexer interface {
	Lex(lval *yySymType) int
	Error(s string)
}

A simple lexer can look like the following (it needs fmt and strings from the standard library):

import (
	"fmt"
	"strings"
)

type lexer struct {
	input  string
	pos    int
	result *Query
	err    error
}

func newLexer(input string) *lexer {
	return &lexer{input: input}
}

func (l *lexer) Lex(lval *yySymType) int {
	for l.pos < len(l.input) && (l.input[l.pos] == ' ' || l.input[l.pos] == '\t') {
		l.pos++
	}

	if l.pos >= len(l.input) {
		return 0 // EOF
	}

	if l.input[l.pos] == '=' {
		l.pos++
		return tokEQ
	}

	if l.input[l.pos] == '"' {
		l.pos++ // skip opening quote
		start := l.pos
		for l.pos < len(l.input) && l.input[l.pos] != '"' {
			l.pos++
		}
		lval.str = l.input[start:l.pos]
		l.pos++ // skip closing quote
		return tokSTRING
	}

	isAlpha := func(c byte) bool {
		return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
	}

	if isAlpha(l.input[l.pos]) {
		start := l.pos
		for l.pos < len(l.input) && (isAlpha(l.input[l.pos]) || l.input[l.pos] == '_') {
			l.pos++
		}
		word := l.input[start:l.pos]

		if strings.ToUpper(word) == "AND" {
			return tokAND
		}
		lval.str = word
		return tokIDENT
	}

	// Report unexpected characters instead of silently treating them as EOF.
	l.Error(fmt.Sprintf("unexpected character %q at position %d", l.input[l.pos], l.pos))
	return 0
}

func (l *lexer) Error(s string) {
	l.err = fmt.Errorf("parse error: %s", s)
}

The lexer recognizes the tokens declared in the grammar.
If a token carries a value (a string, in our case), it is stored in lval.str.

Generate parser using goyacc

Install goyacc if needed, then run it to generate the parser (the grammar above is saved as query.y):

go install golang.org/x/tools/cmd/goyacc@latest
goyacc -o parser.go query.y

For convenience, a Parse function can be defined to call yyParse from the generated parser:

func Parse(input string) (*Query, error) {
	l := newLexer(input)
	yyParse(l)
	if l.err != nil {
		return nil, l.err
	}
	return l.result, nil
}

Generate SQL from AST

Finally, traverse the AST to build SQL:

func BuildSQL(q *Query, table string) (string, []any) {
	whereClause, args := toSQL(q.Condition)
	return fmt.Sprintf("SELECT * FROM %s WHERE %s", table, whereClause), args
}

func toSQL(c Condition) (string, []any) {
	switch v := c.(type) {
	case *FieldCondition:
		// Field names end up directly in the SQL string; this is safe only
		// because the lexer restricts identifiers to letters and underscores.
		// Values are passed as placeholder arguments and never interpolated.
		return fmt.Sprintf("%s = ?", v.Field), []any{v.Value}
	case *AndCondition:
		leftSQL, leftArgs := toSQL(v.Left)
		rightSQL, rightArgs := toSQL(v.Right)
		return fmt.Sprintf("(%s AND %s)", leftSQL, rightSQL), append(leftArgs, rightArgs...)
	default:
		return "", nil
	}
}

Putting it all together:

query, _ := Parse(`status = "fail" AND name = "login_test"`)
sql, args := BuildSQL(query, "testcases")
// sql:  "SELECT * FROM testcases WHERE (status = ? AND name = ?)"
// args: ["fail", "login_test"]
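
Since the generated parser is not included in this post, the SQL-building half can be verified on its own by constructing the AST by hand:

```go
package main

import "fmt"

type Condition any

type Query struct {
	Condition Condition
}

type FieldCondition struct {
	Field string
	Value string
}

type AndCondition struct {
	Left  Condition
	Right Condition
}

func toSQL(c Condition) (string, []any) {
	switch v := c.(type) {
	case *FieldCondition:
		return fmt.Sprintf("%s = ?", v.Field), []any{v.Value}
	case *AndCondition:
		leftSQL, leftArgs := toSQL(v.Left)
		rightSQL, rightArgs := toSQL(v.Right)
		return fmt.Sprintf("(%s AND %s)", leftSQL, rightSQL), append(leftArgs, rightArgs...)
	default:
		return "", nil
	}
}

func BuildSQL(q *Query, table string) (string, []any) {
	whereClause, args := toSQL(q.Condition)
	return fmt.Sprintf("SELECT * FROM %s WHERE %s", table, whereClause), args
}

func main() {
	// Hand-built AST for: status = "fail" AND name = "login_test"
	q := &Query{Condition: &AndCondition{
		Left:  &FieldCondition{Field: "status", Value: "fail"},
		Right: &FieldCondition{Field: "name", Value: "login_test"},
	}}
	sql, args := BuildSQL(q, "testcases")
	fmt.Println(sql)  // SELECT * FROM testcases WHERE (status = ? AND name = ?)
	fmt.Println(args) // [fail login_test]
}
```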

Conclusion

Goyacc is likely sufficient for typical parsing needs. It is relatively easy to use and, importantly, comes as a "standard" tool in the Go ecosystem.

In a few hundred lines of code it is possible to implement a simple DSL for querying database data.
Although for such a simple task it is likely overkill, the solution scales well to more complicated queries (see the Greener codebase for a more complex DSL implementation).

#Greener #Go