
This post marks the completion of my participation in Google Summer of Code 2023 with the PostgreSQL organisation. I worked on pg_chameleon, a database replication tool, to rewrite its SQL parsing library. My mentors were Federico, the original author of this library, and Andreas, a PostgreSQL contributor.

Previously, this library was built with regular expressions that were hard to understand and, consequently, hard to modify. It had to be made simpler so that new contributions could be added without so much difficulty. I did most of my research into this before and during the community bonding period, talked to Federico, the library's maintainer, about a solution, and we settled on replacing the regular expressions with a Python library called parsy.

The first half went smoothly, but in the second half life got in the way for both the mentors and me, and we couldn't make much progress. After the initial goal was reached in the first half, we discussed the scope of what was left to be done at the earliest date we could, but I couldn't make time for it later. I'm relatively free now and can make time for the rest; my goal is to add all of that in the next couple of weeks. I made a pull request with those changes in the week after the official completion of GSoC.

The first part was to port all existing parser functionality to parsy. This was the bulk of the project, and where we solved the pressing problem of too much complexity. It was completed on, in fact slightly before, the date we had allocated for these changes. It was made easier by the research and prototyping we did early on, in the community bonding period and the first week. This is the diff for the changes I made during the first half of GSoC: 50ed941…bdc376a.

I started by adding a parser for the CREATE TABLE statement in parsy and replacing the corresponding regular expressions with it. This was the first prototype, because we wanted some additional validation that this tool was powerful enough for our purpose: achieving simplicity while still getting things done for us.
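
To give a flavour of the result, here is a minimal, heavily simplified sketch of such a parser. This is not the actual pg_chameleon code: the grammar is toy-sized, and ci_string (a helper that also appears in the real parser further below) is given one possible definition.

from parsy import regex, seq, string, whitespace

def ci_string(s):
    # Case-insensitive keyword matcher; one possible definition of the
    # helper used in the real parser (expects s in upper case).
    return string(s, transform=lambda t: t.upper())

identifier = regex(r"[A-Za-z_]\w*")
padding = whitespace.optional()

# A much-simplified column definition: just "name type"
column_def = seq(
    name=identifier << whitespace,
    type=regex(r"[A-Za-z]+(?:\(\d+\))?"),  # e.g. int, varchar(255)
)

create_table = seq(
    command=seq(ci_string("CREATE"), whitespace, ci_string("TABLE")).result("CREATE TABLE"),
    name=whitespace >> identifier << padding,
    columns=string("(") >> column_def.sep_by(string(",") << padding) << string(")"),
)

print(create_table.parse("CREATE TABLE users (id int, email varchar(255))"))
# -> {'command': 'CREATE TABLE', 'name': 'users',
#     'columns': [{'name': 'id', 'type': 'int'},
#                 {'name': 'email', 'type': 'varchar(255)'}]}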

I also had to make some changes to how the tool is packaged, because it was written when Python 3.5 was still supported. That is no longer the case, and we decided to support 3.7 and up, because that was the oldest Python version compatible with parsy.
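
Concretely, this mostly amounts to declaring the supported interpreter range in the package metadata, along these lines (a sketch; the actual packaging files in pg_chameleon differ):

# setup.py (sketch)
from setuptools import setup

setup(
    name="pg_chameleon",
    python_requires=">=3.7",     # oldest Python compatible with parsy
    install_requires=["parsy"],  # the new parsing dependency
    # ... other metadata ...
)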

Later on, I converted the rest of the regular expressions to use parsy as well. This was a big change, and it was a relief when it was done and working correctly. I was a bit cautious with my changes, so I also added some automated tests with pytest to make sure there were no regressions. Regressions were possible because with parsy, the order in which we combine the sub-parsers matters: if we combine sub-parsers with the OR combinator (| or alt(...) in parsy), for example, the statement is parsed with the earliest sub-parser that matches successfully. This big change converted a lot of Python if-else / find-replace logic into declarative parsers. It also made each parser more complete and accurate, because we could now match the entire SQL statement rather than checking for patterns here and there and extracting smaller patterns step by step. Take, for example, the different subcommands (ADD COLUMN, DROP COLUMN, ADD INDEX, …) that are available after ALTER TABLE: having separate end-to-end parsers for these subcommands makes it easier to add parsers for other subcommands later (e.g. for adding / altering an index, which I'll get to below).
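
As a toy illustration of that ordering caveat (not code from the actual project):

from parsy import alt, string

# alt tries its alternatives left to right and commits to the first match,
# so a shorter prefix listed first can shadow a longer alternative.
bad = alt(string("ADD"), string("ADD COLUMN"))
good = alt(string("ADD COLUMN"), string("ADD"))

print(bad.parse_partial("ADD COLUMN"))   # ('ADD', ' COLUMN') -- stopped too early
print(good.parse_partial("ADD COLUMN"))  # ('ADD COLUMN', '')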

ALTER TABLE and CREATE TABLE statements benefitted the most from this rewriting, because they are so complex and can entail so many different operations, and also because SQL is very lenient and offers many different ways to do the same thing. Previously, they were handled with multiple regular expressions and functions with if-else chains. The ALTER TABLE statement, for example, was handled by over 100 lines of function code, complex regular expressions, and naive find-and-replace for normalisation. This could break when, for instance, strings inside the statement contained whitespace. There were also some redundant find-replaces that were hard to spot because of the complexity. All of this could now be split into declarative parser chunks that are easy to read on their own and can be combined in simple yet powerful ways to give us the complete parser. To complete the example, this is the code of the final parser for the ALTER TABLE statement:

alter_table_statement = seq(
    # normalise the leading keywords into a single canonical command string
    command=seq(ci_string("ALTER"), whitespace, ci_string("TABLE")).result("ALTER TABLE"),
    __space=whitespace,
    __schema=seq(identifier, string(".")).optional(),  # optional "schema." prefix
    name=identifier,
    # one or more comma-separated subcommands, each with its own parser
    alter_cmd=optional_space_around(
        alt(
            alter_table_ignored,
            alter_table_drop,
            alter_table_add,
            alter_table_change,
            alter_table_modify,
        )
    ).sep_by(string(",")).map(lambda ls: [x for x in ls if x]),  # remove None values
)

Each of alter_table_ignored, alter_table_drop, and so on is its own little parser that is easy to read through. The result of parsing a valid statement is a Python dict with the provided kwargs, with the keys that have the __ prefix conveniently discarded. Even though some convention and prior knowledge is required to understand it, the result is much simpler and more readable, at a reasonable cost for the abstractions we chose.
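
For illustration, running the parser above on a valid statement yields a dict along these lines (the exact contents of alter_cmd depend on the sub-parsers, so they are only sketched here):

result = alter_table_statement.parse("ALTER TABLE myschema.users DROP COLUMN age")
# -> roughly {'command': 'ALTER TABLE', 'name': 'users',
#             'alter_cmd': [...output of alter_table_drop...]}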

The second part, adding support for some newer commands, is yet to be done; I'll be doing that over the next couple of weeks. This includes adding support for parsing the CREATE INDEX statement and its variants in the ALTER TABLE form. We will also add support for parsing a functional index so that it can be logged and the user alerted. The same goes for other special-case indices that should be handled manually, such as fulltext and spatial indices. Other cases to be handled are partial indices and the type of the index used.

Adding parsing capabilities for all of this is much easier and simpler because of the changes we made earlier. Regular expressions would need a lot of forethought, verification, and probably automated tests to make sure that they capture the input correctly. All of that is very tedious, and the resulting code is far from obvious to a reader. In comparison, parsers written with parsy can be understood with familiarity with only a few primitives and basic operators.

2 Sep 2023 Update I created a pull request with the changes from part 2, and it is now being reviewed: Pull Request. I liked the composability here, as compared to regular expressions. While building the parser for ALTER TABLE ... ADD INDEX statements, I could re-use the index definition parser I had written for CREATE TABLE. Then, when I added support for parsing a bigger set of index definitions, those readily started working with CREATE TABLE too. This way we can re-use features while also keeping parts of our code consistent. The trade-off is that bugs can pop up in other locations when the behaviour of a shared utility changes. I think a good next step would be to add more tests for the parser so that we can rely on it in all these cases.
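
A simplified sketch of that kind of reuse (illustrative only, not the actual pg_chameleon grammar):

from parsy import regex, seq, string, whitespace

def ci_string(s):
    # case-insensitive keyword matcher, as in the earlier sketch
    return string(s, transform=lambda t: t.upper())

identifier = regex(r"[A-Za-z_]\w*")
padding = whitespace.optional()

# One shared parser for a (simplified) index definition: INDEX name (col, ...)
index_definition = ci_string("INDEX") >> whitespace >> seq(
    name=identifier << padding,
    columns=string("(") >> identifier.sep_by(string(",") << padding) << string(")"),
)

# Re-used for the ALTER TABLE ... ADD INDEX subcommand...
add_index = seq(
    command=ci_string("ADD") << whitespace,
    index=index_definition,
)

# ...and it could equally sit alongside column definitions in the
# CREATE TABLE item list, e.g. table_item = column_definition | index_definition

print(add_index.parse("ADD INDEX idx_email (email)"))
# -> {'command': 'ADD', 'index': {'name': 'idx_email', 'columns': ['email']}}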

5 Sep 2023 Update I have finally, successfully completed Google Summer of Code! 🎉