Comment backend migration

The system I used for my blog comments basically downloaded them from a queue to a directory on my laptop. Each comment submitted to my blog became a file in this directory, named after the comment ID (a random UUID) and formatted as a JSON object.
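To make that concrete, here is a made-up example of what a single comment file might have looked like (the UUID and field names are invented for illustration, not my actual schema):

```
$ cat comments/9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d.json
{
  "id": "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
  "author": "Jane",
  "text": "Nice post!",
  "submitted_at": "2024-05-01T12:34:56Z"
}
```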

This setup initially made it really easy to iterate. I quickly created a script to approve and reject comments, and another one that used the approval and rejection decisions I had made so far to train a spam filter.
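Recording a decision boiled down to writing it back into the comment's JSON object. A minimal sketch of what that could look like with jq ($id and the status key are my stand-ins here, not the actual tooling):

```
# Mark a comment as approved by setting a key in its JSON object.
# jq cannot edit files in place, so write to a temp file and move it back.
jq '.status = "approved"' "comments/$id.json" > tmp.json \
  && mv tmp.json "comments/$id.json"
```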

Everything related to my blog comments would read from and write to this directory. I could easily use grep to find comments that contained a certain word, or wc -l to count how many comments I had, and jq worked for rudimentary filtering and analysis.
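For example (the status field is again an assumption for illustration):

```
# Find comments that mention a word.
grep -l 'clickhouse' comments/*.json

# Count the comments, one file each.
ls comments | wc -l

# Rudimentary filtering: print the text of approved comments.
jq -r 'select(.status == "approved") | .text' comments/*.json
```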

I basically had a database, except it was a directory on my laptop. After a while, this became really inefficient. There were no indexes, so I had to perform a lot of disk operations to do anything. All the tools had to be careful not to step on each other's toes, since they all stored their data in keys of the same JSON objects. Nothing could be done in parallel, because the tools would end up overwriting each other's data. The inefficient storage format also meant that the small number of comments I had took up a lot of space.

The straw that broke the camel's back was when I wanted to try out some data analysis by producing text embeddings of the comments. Operations like finding a random approved or rejected comment, or finding the most recent one, became really slow because every query meant reading all the files and parsing them as JSON.
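To make the cost concrete: even "the most recent comment" had no cheap answer, because there was no index to consult. Assuming a submitted_at field, it amounts to a full scan like this:

```
# Parse every file, slurp into one array, sort in memory, take the last.
jq -s 'sort_by(.submitted_at) | last' comments/*.json
```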

At this point, I really wanted to move the comments to a proper database, so I wrote a small script to migrate them to a table in my ClickHouse database. Updating all the tools to query and update the database instead of the directory was not too difficult, and it came with a nice performance boost.
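The sketch below shows roughly what such a migration could look like; the schema and column names are guesses for illustration, not my real table:

```
# Hypothetical schema; the actual columns almost certainly differ.
clickhouse-client --query "
    CREATE TABLE comments (
        id UUID,
        author String,
        text String,
        status String,
        submitted_at DateTime
    ) ENGINE = MergeTree ORDER BY submitted_at"

# Compact each comment file to one JSON object per line and bulk-insert.
jq -c . comments/*.json \
  | clickhouse-client --query "INSERT INTO comments FORMAT JSONEachRow"
```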

As of now, the comments table takes up a whopping 4 megabytes of disk space. It is very compact thanks to ClickHouse's built-in compression, and much easier to query and update than the directory ever was.
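If you want to check a table's footprint yourself, ClickHouse exposes it in system.parts:

```
clickhouse-client --query "
    SELECT formatReadableSize(sum(bytes_on_disk))
    FROM system.parts
    WHERE table = 'comments' AND active"
```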

For now, the part of the system that fetches the comments from the queue still ends up writing them to a single JSON file, and I occasionally run an import with clickhouse-client to load those into the DB. I might get rid of the queue and write directly to the DB in the future, but for now this setup is good enough.
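Assuming the dump is newline-delimited JSON (the file name here is a placeholder), that import is a one-liner:

```
# Load the queue dump; expects one JSON object per line.
clickhouse-client --query "INSERT INTO comments FORMAT JSONEachRow" \
  < queue-dump.json
```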