What is Selectolax?

Selectolax is a fast HTML parser that uses CSS selectors to locate and extract data from HTML.

It uses the Modest and Lexbor libraries to parse HTML, but this is an implementation detail and you don’t need to know anything about them to use Selectolax.

In this document, I will try to illustrate the main features of Selectolax using examples.

Installing Selectolax

Selectolax is on PyPI, so you can install it with pip:

$ python3 -m pip install selectolax

or, equivalently:

$ pip3 install selectolax

It can work inside a virtual environment or globally.

Quick start

In [2]:

import selectolax.parser  # import the submodule explicitly; HTMLParser lives here

example_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

tree = selectolax.parser.HTMLParser(example_html)

tree.css_first('title').text()

Out [2]:

"The Dormouse's story"

Examples

Let’s delve deeper into the features of Selectolax by looking at some examples with real HTML.

In the examples, we will use requests to fetch HTML from the web, selectolax to parse it, and pandas to display the results as tables.

Leo’s blog title

Let’s get some HTML from my blog as a quick example. To turn the HTML from a string into a tree, you can use the selectolax.parser.HTMLParser class.

You can pass this class a string, or just pass bytes and let it figure out the encoding.

In [3]:

import requests

html = requests.get('https://www.gkbrk.com').text
tree = selectolax.parser.HTMLParser(html)

Let’s start with a simple example, and extract the title of the page.

In [4]:

title = tree.css_first('title').text()
title

Out [4]:

'Gokberk Yaltirakli'

Leo’s blog posts

That wasn’t too hard; that is indeed the title of my website. Let’s try something a little more complicated.

My home page has a list of recent posts, and it’s not too far-fetched to imagine that we might want to extract the title, URL, and date of each post.

Let’s try that.

In [5]:

import pandas as pd

# Each recent post is an <li> in the <ul> directly under <article>.
articles = tree.css('article > ul > li')

titles = []
dates = []

for article in articles:
    title = article.css_first('b').text()
    url = article.css_first('a').attrs['href']  # extracted, but not shown in the table below
    date = article.css_first('time').text()

    titles.append(title)
    dates.append(date)

pd.DataFrame({'title': titles, 'date': dates}).head()

Out [5]:




  
    
      
	title	date
0	Shell scripts as a poor man's AppImage	2023-04-28
1	Earthquake data for Turkey	2022-11-26
2	A Brief Overview of Mastodon	2022-11-19
3	Memorable Unique Identifiers (MUIDs)	2022-10-10
4	Status Update, May 2022	2022-05-15

Pretty easy, right? Let’s step it up a notch and try a website that I don’t control.

Threads from encode.su

In [6]:

html = requests.get("https://encode.su/forums/2-Data-Compression").text
tree = selectolax.parser.HTMLParser(html)

In [7]:

threads = tree.css('ol > li.threadbit')

titles = []
dates = []

for thread in threads:
    title = thread.css_first('a.title').text()
    date = thread.css('dl.threadlastpost > dd')[1].text().strip()

    titles.append(title)
    dates.append(date)

pd.DataFrame({'title': titles, 'date': dates}).head()

Out [7]:




  
    
      
	title	date
0	GDC Competition: Discussions	Today, 00:53
1	List of Asymmetric Numeral Systems implementat...	25th October 2023, 06:01
2	New saca and bwt library (libsais)	12th July 2023, 22:21
3	RAZOR - strong LZ-based archiver	18th April 2023, 18:11
4	Is encode.su community interested in a new fre...	Yesterday, 22:33

Hacker News

How about something more familiar, like Hacker News?

In [8]:

html = requests.get("https://news.ycombinator.com/").text
tree = selectolax.parser.HTMLParser(html)

In [9]:

titles = []

for post in tree.css('td.title > span.titleline'):
    title = post.text()
    titles.append(title)

pd.DataFrame({'title': titles}).head()

Out [9]:




  
    
      
	title
0	Not a real engineer (2019) (twitchard.github.io)
1	Open-source drawing tool – Excalidraw (github....
2	Tiny volumetric display (mitxela.com)
3	Scientists discover retinal cells that help st...
4	Roar of cicadas was so loud, it was picked up ...

Node

The Node class represents a node in the HTML tree. Each element and text node in the tree is represented by a Node object.

In [10]:

tree = selectolax.parser.HTMLParser(example_html)

node = tree.css_first('a.sister')

type(node)

Out [10]:

selectolax.parser.Node

Text

If you have a text node, you can get its text with the .text_content property. If the node is not a text node, this returns None.

You can also call .text() on any node, and it will return the text content of the node and all its children.

In [18]:

node.text()

Out [18]:

'Elsie'

Attributes

The .attributes property returns a node’s attributes as a dictionary, and .attrs offers dictionary-style access to the same values.

In [11]:

node.attributes

Out [11]:

{'href': 'http://example.com/elsie', 'class': 'sister', 'id': 'link1'}

In [12]:

node.attributes['href']
node.attrs.get('href')

Out [12]:

'http://example.com/elsie'

Out [12]:

'http://example.com/elsie'

Useful links

Selectolax API reference
- Everything you need to know about the API.
- // TODO: Everything on the API reference should be in this document.
bs4 documentation
- The bs4 documentation is a great resource for learning about HTML parsing in general, and it’s also a great reference for CSS selectors.
- It’s also a great example of how to write documentation. When I expand this document, I might look at it for inspiration.
selectolax repository
