For easy access
- GitHub Repo: the4thdoctor/pg_chameleon
- GitHub Diff: 50ed941…bdc376a
- Branch with my changes: gsoc-2023 (This can be installed and run manually)
- Changes after GSoC: Pull Request
This post marks the completion of my participation in Google Summer of Code 2023 with the PostgreSQL organisation. I worked on pg_chameleon, a database replication tool, rewriting its SQL parsing library. My mentors were Federico, the original author of the library, and Andreas, a PostgreSQL contributor.
Previously, the library was built with regular expressions that were hard to understand and, consequently, hard to modify. It had to be made simpler so that new contributions could be added without so much difficulty. I did most of my research into this before and during the community bonding period, talked to Federico about a solution, and we settled on replacing the regular expressions with a Python library called parsy.
The first half went smoothly, but life got in the way in the second half, for both my mentors and me, and we couldn't make much progress. After the initial goal was reached in the first half, we discussed the scope of what was left as early as we could, but I couldn't make time for it until later. I'm relatively free now and can make time for the rest; my goal is to add all of it in the next couple of weeks. I ended up making a pull request with those changes in the week after the official completion of GSoC.
The first part was to port all existing parser functionality to parsy. This was the bulk of the project, and where we tackled the pressing problem of excessive complexity. It was completed on time, in fact slightly ahead of the time we had allocated for these changes, which was made easier by the research and prototyping we did early on, during the community bonding period and the first week. This is the diff for the changes I made during the first half of GSoC: 50ed941…bdc376a.
I started by adding a parser for the `CREATE TABLE` statement in parsy and replacing the corresponding regular expressions with it. This was the first prototype, because we wanted some additional validation that the tool was powerful enough for our purpose: achieving simplicity while still getting the job done.
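To give a flavour of what this looks like, here is a toy sketch of a parsy-based `CREATE TABLE` parser. The grammar and names are my own simplification for illustration, not pg_chameleon's actual parser, which handles far more of MySQL's syntax.

```python
# Illustrative toy only: a simplified CREATE TABLE parser built with parsy.
import re

from parsy import regex, seq, string

ws = regex(r"\s*")                              # optional whitespace
identifier = regex(r"[A-Za-z_][A-Za-z0-9_]*")

# One column definition: a name followed by a (simplified) data type.
column_def = seq(
    name=identifier << regex(r"\s+"),
    data_type=identifier,
).combine_dict(dict)

create_table = seq(
    __create=regex(r"CREATE\s+TABLE\s+", flags=re.IGNORECASE),
    table=identifier << ws,
    columns=string("(") >> column_def.sep_by(ws >> string(",") >> ws) << string(")"),
).combine_dict(lambda table, columns: {"name": table, "columns": columns})

print(create_table.parse("CREATE TABLE films (code char, title varchar)"))
# {'name': 'films', 'columns': [{'name': 'code', 'data_type': 'char'},
#                               {'name': 'title', 'data_type': 'varchar'}]}
```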
I also had to make some changes to how the tool is packaged, because it was written when Python 3.5 was still supported. That is no longer the case and we decided to support 3.7 and up, because that was the oldest Python version compatible with parsy.
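For concreteness, this is roughly what such a change looks like in a setuptools-based `setup.py`. The project's actual packaging files may be organised differently, and the names and version pin here are illustrative.

```python
# Hypothetical excerpt of a setuptools-based setup.py; pg_chameleon's real
# packaging metadata may differ.
from setuptools import find_packages, setup

setup(
    name="pg_chameleon",
    packages=find_packages(),
    python_requires=">=3.7",      # Python 3.5/3.6 support dropped
    install_requires=[
        "parsy",                  # the new parsing dependency
        # ... existing dependencies stay as they are ...
    ],
)
```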
Later on, I added more changes to convert the rest of the regular expressions to parsy as well. This was a big change, and I felt relieved when it was done and working correctly. I was a bit cautious with my changes, so I also added some automated tests with pytest to make sure that there was no regression. Regression was possible because, with parsy, the order in which we combine the sub-parsers matters: if we combine sub-parsers with the OR combinator (`|` or `alt(...)` in parsy), the statement is parsed with the earliest sub-parser that matches successfully.
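Here is a small sketch of that pitfall, with hypothetical sub-parsers rather than code from the project: when the more general alternative is listed first, it wins even though a more specific one should have matched.

```python
# Ordering matters with the OR combinator: the first alternative that matches
# wins, so more specific sub-parsers must come before more general ones.
import re

from parsy import regex

add_index = regex(r"ADD\s+INDEX", flags=re.IGNORECASE).result("add index")
add_column = regex(r"ADD(\s+COLUMN)?", flags=re.IGNORECASE).result("add column")

right = add_index | add_column      # specific first
wrong = add_column | add_index      # general first

print(right.parse_partial("ADD INDEX idx_a (a)")[0])   # 'add index'
print(wrong.parse_partial("ADD INDEX idx_a (a)")[0])   # 'add column' -- oops
```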
This big change converted a lot of Python if-else / find-replace logic into declarative parsers. It also made each parser more complete and accurate, because we could now match against the entire SQL statement rather than checking for patterns here and there and extracting smaller pieces step by step.
Take, for example, the different subcommands (`ADD COLUMN`, `DROP COLUMN`, `ADD INDEX`, …) that are available after `ALTER TABLE`: having a separate end-to-end parser for each of them makes it easier for us to add parsers for other subcommands later (e.g. for adding or altering an index, which I'll get to below).
Alter table and create table statements benefitted the most from this rewrite, because they are so complex and can entail so many different operations, and because SQL is very lenient and offers many different ways to do the same thing. Previously, they were handled with multiple regular expressions and functions with if-else chains. The alter table statement, for example, was handled by over 100 lines of function code, complex regular expressions, and naive find-and-replace for normalization. This could break, for instance, when strings inside the statement contained whitespace. There were also some redundant find-replaces that were hard to spot because of the complexity. All of this could now be split into declarative parser chunks that were easy to read on their own and could be combined in simple, albeit powerful, ways to give us the complete parser. To complete the example, the final alter table parser is exactly such a combination of small sub-parsers; a rough sketch of its shape follows.
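The names in this sketch follow the sub-parsers mentioned below (`alter_table_ignored`, `alter_table_drop`, …), but the grammar is heavily simplified; this is an illustration, not the project's actual code.

```python
# Illustrative sketch of an ALTER TABLE parser assembled from small
# sub-parsers and the OR combinator; the real grammar is far richer.
import re

from parsy import regex, seq

ws = regex(r"\s+")
identifier = regex(r"[A-Za-z_][A-Za-z0-9_]*")

def keyword(words):
    """Case-insensitive keyword(s) such as 'ADD COLUMN', with flexible spacing."""
    return regex(r"\s+".join(words.split()), flags=re.IGNORECASE)

# Each subcommand gets its own small, end-to-end parser.
alter_table_drop = seq(
    __kw=keyword("DROP COLUMN") << ws,
    name=identifier,
).combine_dict(lambda name: {"command": "drop column", "name": name})

alter_table_add = seq(
    __kw=keyword("ADD COLUMN") << ws,
    name=identifier << ws,
    col_type=identifier,
).combine_dict(lambda name, col_type: {"command": "add column",
                                       "name": name, "type": col_type})

# Clauses we recognise only to discard them.
alter_table_ignored = regex(r"(ALGORITHM|LOCK)[^,]*",
                            flags=re.IGNORECASE).result({"command": "ignored"})

# The complete parser is just the OR combination of the pieces (order matters).
alter_table = seq(
    __alter=keyword("ALTER TABLE") << ws,
    table=identifier << ws,
    change=alter_table_drop | alter_table_add | alter_table_ignored,
).combine_dict(lambda table, change: {"table": table, **change})

print(alter_table.parse("ALTER TABLE foo DROP COLUMN bar"))
# {'table': 'foo', 'command': 'drop column', 'name': 'bar'}
```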
Each of `alter_table_ignored`, `alter_table_drop`, and so on, were their own little parsers that were easy to read through. The result of parsing a valid statement is a Python `dict` built from the provided kwargs, with keys carrying the `__` prefix conveniently discarded. Even though some convention and prior knowledge is required to understand it, the result is something much simpler and more readable, at a reasonable cost for the abstractions we chose.
The second part, which was to add support for some newer commands, is yet to be done. I'll be doing that over the next couple of weeks.
This includes adding support for parsing the `CREATE INDEX` statement and its variants in the `ALTER TABLE` form. We will also add support for parsing a functional index so that it can be logged and the user alerted. The same goes for other special-case indices that should be handled manually, such as fulltext and spatial indices. Other cases to be handled are partial indices and the type of index used.
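As a rough illustration of what this could look like (a simplified sketch with invented names, not the planned implementation), a `CREATE INDEX` parser can tag the special index flavours so that the tool can log them and alert the user:

```python
# Illustrative sketch: a simplified CREATE INDEX parser that tags special
# index flavours (FULLTEXT / SPATIAL / UNIQUE) so they can be logged and
# surfaced to the user instead of being replicated silently.
import re

from parsy import regex, seq

ws = regex(r"\s+")
identifier = regex(r"[A-Za-z_][A-Za-z0-9_]*")

index_kind = (
    regex(r"FULLTEXT\s+", flags=re.IGNORECASE).result("fulltext")
    | regex(r"SPATIAL\s+", flags=re.IGNORECASE).result("spatial")
    | regex(r"UNIQUE\s+", flags=re.IGNORECASE).result("unique")
).optional()

create_index = seq(
    __create=regex(r"CREATE\s+", flags=re.IGNORECASE),
    kind=index_kind,
    __index=regex(r"INDEX\s+", flags=re.IGNORECASE),
    name=identifier << ws,
    __on=regex(r"ON\s+", flags=re.IGNORECASE),
    table=identifier << ws,
    columns=regex(r"\(.*\)"),
).combine_dict(dict)

print(create_index.parse("CREATE FULLTEXT INDEX ft_idx ON posts (body)"))
# {'kind': 'fulltext', 'name': 'ft_idx', 'table': 'posts', 'columns': '(body)'}
```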
Adding parsing capabilities for all of this is much easier and simpler because of the changes we made earlier. Regular expressions would need a lot of forethought, verification, and probably automated tests to make sure that they capture the input correctly. All of this is very tedious, and the resulting code is far from obvious to a reader. In comparison, the parsers written with parsy can be understood with familiarity with just a few primitives and basic operators.
2 Sep 2023 Update: I created a pull request with the changes from part 2, and they are now being reviewed. Pull Request.
I liked the idea of composability here, as compared to regular expressions. While building the parser for `ALTER TABLE ... ADD INDEX` statements, I could re-use the index definition parser I had written for `CREATE TABLE`. Then, when I added features for parsing a bigger set of index definitions, those readily started working with `CREATE TABLE` too. This way we can re-use features while also keeping parts of our code consistent. The trade-off is that bugs could pop up in other locations, because the behaviour of the shared utilities changes. I think a good next step would be to add more tests for the parser so that we can be sure of it in all these cases.
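A toy sketch of that composability, with hypothetical names rather than the project's code: one index-definition sub-parser plugged into both the `ALTER TABLE ... ADD INDEX` parser and the `CREATE TABLE` element list, so improvements to it benefit both.

```python
# Illustrative sketch: a single index-definition sub-parser reused by both the
# ALTER TABLE ... ADD INDEX parser and the CREATE TABLE element parser.
import re

from parsy import regex, seq

identifier = regex(r"[A-Za-z_][A-Za-z0-9_]*")

index_definition = seq(
    __kw=regex(r"INDEX\s+", flags=re.IGNORECASE),
    name=identifier << regex(r"\s*"),
    columns=regex(r"\(.*?\)"),
).combine_dict(dict)

# Reused in two places: improvements to index_definition show up in both.
alter_add_index = regex(r"ADD\s+", flags=re.IGNORECASE) >> index_definition
table_element = index_definition  # | column_definition | ...  (other elements omitted)

print(alter_add_index.parse("ADD INDEX idx_a (a)"))
# {'name': 'idx_a', 'columns': '(a)'}
```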
5 Sep 2023 Update: I have finally, successfully completed Google Summer of Code! 🎉