02 August 2023
Comment backend migration
The system I used for my blog comments basically downloaded them from a queue into a directory on my laptop. Each comment submitted to my blog became a file in this directory. Every file was named after the comment ID, which was a random UUID, and each file was formatted as a JSON object.
This setup initially made it really easy to iterate. I quickly created a script to approve and reject comments, and another one that used the approval and rejection decisions I had made so far to train a spam filter.
Everything related to my blog comments would read from and write to this directory. I could easily use grep to find comments that contained a certain word, or wc -l to count the number of comments I had, and jq could be used to perform rudimentary filtering and analysis.
I basically had a database, but it was a directory on my laptop. After a while, this became really inefficient. There were no indexes, and I had to perform a lot of disk operations to do anything. All the tools had to be careful not to step on each other's toes, as they all used keys of the JSON objects to store their data. Nothing could be done in parallel, because the tools would end up overwriting each other's data. The inefficient storage format also meant that the small number of comments I had took up a lot of space.
The straw that broke the camel's back was when I wanted to try out some data analysis by producing text embeddings of the comments. Operations like finding a random approved or rejected comment, or finding the most recent comment, became really slow because they meant reading a lot of files and parsing them as JSON.
At this point, I really wanted to move the comments to a proper database, so I wrote a small script to migrate the comments to a table on my ClickHouse database. Updating all the tools to query and update the database instead of the directory was not too difficult, and it came with a nice performance boost.
As of now, the comments table is taking a whopping 4 megabytes of disk space on the database. It is very compact thanks to the built-in compression, and it is much easier to query and update.
For now, the part of the system that fetches the comments from the queue still ends up writing them to a single JSON file. I occasionally run an import script with clickhouse client to import those into the DB. I might get rid of the queue and directly write to the DB in the future, but for now this setup is good enough.
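For reference, the import boils down to something like the sketch below. This is not the actual script: it assumes the clickhouse-driver package, a comments.json file with one JSON object per line, and a made-up comments(id, created_at, body, status) table, so the names and columns are placeholders.
import json
from clickhouse_driver import Client

client = Client("localhost")

rows = []
with open("comments.json") as f:  # hypothetical file name
    for line in f:
        c = json.loads(line)
        rows.append((c["id"], c["created_at"], c["body"], c["status"]))

# Bulk-insert everything in one round trip.
client.execute(
    "INSERT INTO comments (id, created_at, body, status) VALUES",
    rows,
)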
03 August 2023
Misc notes
- I read about the Frank-Wolfe optimization algorithm. I wonder if it's easy to implement.
Delete old Cloudflare Pages deployments
Every deploy from the Wrangler CLI was adding to an ever-growing pile of Cloudflare Pages deployments. This was getting out of hand quickly, especially with the quick write-deploy-test loops I like for fast feedback.
I wanted to delete the old deployments that were no longer needed, but the Cloudflare UI is not very good at doing this bulk operation.
To perform this "delete automation", I ended up writing a small script in Python, using the Cloudflare API. The relevant APIs were:
- Pages Project > Get projects
- Pages Deployment > Get deployments
- Pages Deployment > Delete deployment
Here's the code I ended up writing:
import requests

auth_key = "66264d58db29a0f05b9e647bdf36be3c"
auth_email = "leo@example.com"
account_id = "6791375f8ebd84997b966bc0ebc51c10"

headers = {
    "X-Auth-Key": auth_key,
    "X-Auth-Email": auth_email,
}

# List all Pages projects and remember each project's canonical deployment,
# which we never want to delete.
url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/pages/projects"
name_id = {}
r = requests.get(url, headers=headers)
for proj in r.json()["result"]:
    name = proj["name"]
    _id = proj["canonical_deployment"]["id"]
    name_id[name] = _id

# Let the user pick a project by name.
for n in name_id:
    print("-", n)
while True:
    target_project = input("Pick a name: ")
    if target_project in name_id:
        break

# Delete every deployment of the chosen project except the canonical one.
url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/pages/projects/{target_project}/deployments"
r = requests.get(url, headers=headers)
for depl in r.json()["result"]:
    _id = depl["id"]
    if _id == name_id[target_project]:
        continue
    # Remove deployment
    print(f"Deleting {_id}...")
    url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/pages/projects/{target_project}/deployments/{_id}"
    requests.delete(url, headers=headers)
Handling pagination is left as an exercise for the reader. Me? I just run the script multiple times.
04 August 2023
Cron expression evaluator
I needed to support "cron expressions" for a project. The aim was to allow the user to quickly enter familiar cron expressions to configure scheduling for some tasks.
I followed the POSIX crontab specification in order to maximize familiarity for the user. I will reproduce the relevant section below.
In the POSIX locale, the user or application shall ensure that a crontab entry is a text file consisting of lines of six fields each. The fields shall be separated by <blank> characters. The first five fields shall be integer patterns that specify the following:
- Minute [0,59]
- Hour [0,23]
- Day of the month [1,31]
- Month of the year [1,12]
- Day of the week ([0,6] with 0=Sunday)
Each of these patterns can be either an <asterisk> (meaning all valid values), an element, or a list of elements separated by <comma> characters.
An element shall be either a number or two numbers separated by a <hyphen-minus> (meaning an inclusive range).
The specification of days can be made by two fields (day of the month and day of the week).
If month, day of month, and day of week are all <asterisk> characters, every day shall be matched.
If either the month or day of month is specified as an element or list, but the day of week is an <asterisk>, the month and day of month fields shall specify the days that match.
If both month and day of month are specified as an <asterisk>, but day of week is an element or list, then only the specified days of the week match.
Finally, if either the month or day of month is specified as an element or list, and the day of week is also specified as an element or list, then any day matching either the month and day of month, or the day of week, shall be matched.
You can find the full spec page here.
I tackled this problem by first making a "slow" evaluation function that follows these rules as written. This ensures that we have a solid base that is likely to be correct.
This slow evaluation function was made to work on a single field (such as minutes, hours, etc.): it reads the field and turns it into a 64-bit bitmap of the values that match it.
Using 64-bit integers fits this problem really well, because all the units we care about (minute, hour, day, month, day of week) have fewer than 64 possible values. Afterwards, the five 64-bit integers are stored and used to quickly evaluate a given DateTime against a cron expression. This means the "compiled" version of a cron expression only takes 40 bytes of memory, which is not too bad.
Even in C#, iterating over every minute of a year and matching each one against a compiled cron expression like this takes only about 3 ms. This is more than fast enough for my purposes.
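To illustrate the idea, here is a minimal sketch in Python rather than C#. It is not the actual implementation, and it skips the day-of-month/day-of-week OR rule quoted above; it only shows the bitmap compilation and lookups.
from datetime import datetime

def compile_field(field: str, lo: int, hi: int) -> int:
    # Turn one cron field into a bitmap: bit n is set iff value n matches.
    bits = 0
    if field == "*":
        for v in range(lo, hi + 1):
            bits |= 1 << v
        return bits
    for element in field.split(","):
        if "-" in element:
            start, end = map(int, element.split("-"))
            for v in range(start, end + 1):
                bits |= 1 << v
        else:
            bits |= 1 << int(element)
    return bits

def compile_expr(expr: str):
    minute, hour, dom, month, dow = expr.split()
    return (
        compile_field(minute, 0, 59),
        compile_field(hour, 0, 23),
        compile_field(dom, 1, 31),
        compile_field(month, 1, 12),
        compile_field(dow, 0, 6),
    )

def matches(compiled, t: datetime) -> bool:
    minute, hour, dom, month, dow = compiled
    return bool(
        minute >> t.minute & 1
        and hour >> t.hour & 1
        and dom >> t.day & 1
        and month >> t.month & 1
        and dow >> (t.isoweekday() % 7) & 1  # POSIX uses 0 = Sunday
    )

compiled = compile_expr("30 6 * * 1-5")                 # 06:30 on weekdays
print(matches(compiled, datetime(2023, 8, 7, 6, 30)))   # True (a Monday)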
07 August 2023
Rewrite of my HTML minifier
I have a custom HTML minifier that I use with my website. Recently, with the growing number of pages on my website, running the minifier on all the pages became a bit slow.
I didn't want to bother with rewriting the minifier in another language like C++, but I also didn't want to interrupt my edit-build-preview cycle by waiting for a slow minifier to finish. So I profiled the minifier script with Python's built-in cProfile module, and found some hotspots.
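If you haven't used cProfile before, the whole profiling step is only a few lines. This is the generic recipe rather than my actual setup; minify and main are placeholders, not the real module or entry point.
import cProfile
import pstats

from minify import main  # placeholder for the minifier's entry point

# Profile a full run and dump the raw stats to a file...
cProfile.run("main()", "minify.prof")
# ...then print the 20 entries with the highest cumulative time.
pstats.Stats("minify.prof").sort_stats("cumulative").print_stats(20)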
The first thing that caught my attention was the function that minified the CSS using scss, the SASS compiler. This function was mainly called with two kinds of input:
- The main CSS of the website.
- Individual style attributes of HTML tags.
Both of those are repeated a lot within the same page, and across pages. So I cached the results in RAM using the functools.cache decorator. This was essentially a one-line change, and it ended up giving a decent speedup.
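The change looked roughly like the following sketch. It assumes the CSS is compiled with libsass's sass.compile; the actual function in my minifier is not shown here, and the caching decorator is the point rather than the exact call.
import functools
import sass  # libsass binding; assumed here for illustration

@functools.cache
def minify_css(css: str) -> str:
    # Repeated inputs (the main stylesheet, identical style attributes)
    # hit the in-memory cache instead of re-running the compiler.
    return sass.compile(string=css, output_style="compressed")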
The other optimization was a lot more involved. In order to iterate the HTML elements and output a minified version, I was parsing each HTML file using BeautifulSoup with the html5lib parser. This turned out to be very slow, and accounted for most of the time spent in the minifier.
I tried switching the parser from html5lib to lxml, but that didnât help too much. Even with the faster parser, I was constructing a DOM tree in memory and doing a ton of unnecessary work due to the use of BeautifulSoup.
I ripped all of that out, and rewrote most of the minifier using the built-in html.parser module. This module allows you to parse HTML documents in a low-level way. Basically, you feed the parser a stream of text, and it runs callback methods whenever a tag starts or ends (and for the data in between).
Aside from the current tag, nothing else is stored in memory, and no DOM tree is created. The callbacks provide enough information to do the minification correctly.
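As a rough illustration of the shape this takes, here is a heavily simplified sketch. It is not the actual minifier, and it glosses over attribute escaping, the contents of pre, script and style tags, and plenty of other cases.
from html.parser import HTMLParser

class MinifyingParser(HTMLParser):
    # Emit output directly from the callbacks; no DOM tree is ever built.
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f' {name}="{value}"' if value is not None else f" {name}"
            for name, value in attrs
        )
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse runs of whitespace, keeping a single space at boundaries.
        collapsed = " ".join(data.split())
        if data[:1].isspace() and collapsed:
            collapsed = " " + collapsed
        if data[-1:].isspace():
            collapsed += " "
        self.parts.append(collapsed)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

parser = MinifyingParser()
parser.feed("<p>  Hello,   <b>world</b>!  </p>")
parser.close()
print("".join(parser.parts))  # <p> Hello, <b>world</b>! </p>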
This change gave a huge speedup. The minifier used to run on multiple processes and take around two minutes on all cores. Now it runs on a single thread and completes in two seconds (for around 450 pages).
I think if I add so many pages to my website that the minifier becomes slow again, I'll just enable multiprocessing and get a speedup that way. But for now, it doesn't look like I will need to optimize this minifier any further.