Giving search engines a fair access to data



Search engines are difficult to create. They are even harder to improve to a point where you get good-enough results to keep regular users. This is why it’s so rare to see decent search engines that aren’t front-ends to the Bing or Google APIs.

This doesn’t mean there are none though. There are a small number of search engines with their own crawlers and search logic. More and more of them appear over time, but most of them cannot improve to the point of catching on. This is because of a common resource they lack: Data.

I am not talking about the slimy, personal kind of data that Google and friends like so much. What are people searching for right now? How many of those do I have good results for? How do people form their queries? Those are all difficult to answer, and to improve on, if people aren’t using your search engine. But no one will use your search engine unless you improve those. Great then, we are in a chicken-and-egg situation with no escape in sight.

The data problem

Before tackling the problem, let’s explore what the problem is in the first place. The first problem is the number of humans testing the result quality. In almost all cases, the creator(s) will be testing the results. Friends and family will try it a few times before going back to their default search engine. Social media and Hacker News will provide a swarm of clicks that only last for a few hours. This is not data, at least not enough data.

The second problem is a little trickier. Most people from our already small set of users will not provide particularly valuable data. Let’s break our users into two segments: the creators, and the people testing it out.

The creators are programmers who research very specific pieces of information all day. While this makes them very good at using search engines, it makes them very bad at testing the results. A programmer knows the exact query that will bring them results before typing it. This query is usually so good that even a bad algorithm will find the results they are looking for.

The people testing it out have a different problem. When put on the spot for testing a search engine, it is not easy to come up with queries for difficult questions. But those are the exact situations that need the help of a good search engine. You will only see these queries once people see you as reliable and pick you as their default engine.

The current situation

We can separate the current search ecosystem into three distinct groups.

Google gets almost all the search traffic. They have more than enough data, both personal and aggregated, to serve all their needs. Their monopoly on search and their hostility toward the open web make this undesirable. A good solution would either decrease the amount of data they get, or give more data to their competitors.

DuckDuckGo and other API front-ends get a small chunk of search traffic. They are a good compromise between keeping the web open, and having a good search experience as a user. Most of these engines stay as API wrappers forever, so the data they get doesn’t improve them much.

Independent search engines have to make do with scraps. This makes it hard for them to become popular or earn money to support themselves.

How to improve the situation

In this post, I will propose different ways to improve this situation. Each has different trade-offs between user convenience and usefulness to search engines. The best option would be to use and promote independent search engines. But for a lot of people, it is hard to commit to a sub-par experience even if it is the better long-term option. One can look at how people handle environmental issues to see a prime example of this effect.

Feeding data to search engines

With the first option, you keep using your favourite search engine.

An automated process will send a sample of the queries to different search engines. This way, the engines can get their hands on organic usage data before finding full-time users.

This approach makes automated requests without user interaction. Depending on their systems, this might mess up their data collection or make it difficult to serve real users. To be considerate to the service operators, we should make our requests with a User-Agent header that explains what is happening. This header allows them to log our requests, handle them more cheaply, and filter them out of the data for their real users.
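
To make this concrete, here is a rough sketch of what the feeding step could look like. The engine URLs, the User-Agent string, and the sampling rate are all made up for illustration; a real setup would use whatever identification the engine operators ask for.

import random
import urllib.parse
import urllib.request

# Hypothetical query endpoints of the engines we want to feed data to.
ENGINES = [
    "https://search-a.example/search?q={}",
    "https://search-b.example/?query={}",
]

# A User-Agent that explains the automated traffic and links to a page
# describing the project, so operators can log, throttle, or filter it.
USER_AGENT = "SearchQueryFeeder/0.1 (+https://example.com/query-feeder)"

def feed_query(query, sample_rate=0.1):
    """Forward the query to each engine with probability sample_rate."""
    for template in ENGINES:
        if random.random() > sample_rate:
            continue
        url = template.format(urllib.parse.quote_plus(query))
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            urllib.request.urlopen(req, timeout=5).read()
        except Exception:
            # Feeding is best-effort; it should never break anything.
            pass

feed_query("how to fix a leaking tap")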

Redirecting to different search engines

Another approach is to have each search go to a random search engine. Compared to the previous approach, this one is more beneficial to search engines and more inconvenient for the user. The user won’t be able to reproduce searches, as the same query will end up going to different search providers. Similarly, a smaller search engine might give unsatisfactory results, forcing the user to perform the same query multiple times.

This approach can be combined with the previous one as well. By putting a few “good” engines on the random redirect list and feeding data automatically to the rest of them, the downsides could be mitigated.
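
A minimal sketch of that combination, again with made-up engine lists, could look like this: the query is redirected to a random engine from a short list of trusted ones, while the smaller engines receive it quietly in the background.

import random
import threading
import urllib.parse
import urllib.request

USER_AGENT = "SearchQueryFeeder/0.1 (+https://example.com/query-feeder)"

# Engines considered good enough to put in front of the user directly.
REDIRECT_ENGINES = ["https://good-a.example/search?q={}",
                    "https://good-b.example/search?q={}"]

# Smaller engines that only receive the query in the background.
FEED_ENGINES = ["https://tiny-engine.example/?q={}"]

def feed(template, query):
    url = template.format(urllib.parse.quote_plus(query))
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        urllib.request.urlopen(req, timeout=5).read()
    except Exception:
        pass  # best-effort only

def handle_search(query):
    # The user is sent to a random engine from the "good" list...
    target = random.choice(REDIRECT_ENGINES).format(urllib.parse.quote_plus(query))
    # ...while the rest are fed in background threads, so the redirect
    # is not delayed by slow or unreachable engines.
    for template in FEED_ENGINES:
        threading.Thread(target=feed, args=(template, query), daemon=True).start()
    return target

print(handle_search("weather tomorrow"))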

Embedding the results of multiple engines

There are already meta-search engines, like Searx, that satisfy some of these requirements. The problem with them, though, is that each data source they add clutters the main results and slows down the search. I think if Searx added the option of sending data to small search engines in the background, without slowing down the main UI, it would be a really good solution to this.

One could use iframes to do this as well, but since browsers are no longer really “User Agents”, they let websites control whether they can be embedded.
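
Concretely, a site only needs to send an X-Frame-Options header or a Content-Security-Policy frame-ancestors directive to refuse being framed. A quick check like the sketch below (the URL is just a placeholder) shows which of these headers a site sends.

import urllib.request

def embedding_headers(url):
    """Return the response headers a browser consults before framing a page."""
    resp = urllib.request.urlopen(url, timeout=5)
    return {
        "X-Frame-Options": resp.headers.get("X-Frame-Options"),
        "Content-Security-Policy": resp.headers.get("Content-Security-Policy"),
    }

# A site that sends "X-Frame-Options: DENY" or a Content-Security-Policy with
# "frame-ancestors 'none'" cannot be embedded in an iframe by other pages.
print(embedding_headers("https://www.example.com/"))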

Centralized vs. shared

Another trade-off to consider is where the automated query submission should happen. If you choose a centralized approach, you end up trusting a third party with your search queries. If you instead handle this yourself, without a centralized third party, you are now sending all your queries to all the other engines in an identifiable way.

There are a few ways to work around this. One of them is to have small public instances like the Fediverse. Everyone would pick who to trust with their queries, and even on small instances the queries would be mixed enough to protect identities. Another approach would be to keep the queries saved locally, and submit them using random proxies.
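
The local-queue variant could be as simple as the sketch below. The queue file, the proxy addresses, and the engine URL template are all placeholders; the point is only that queries are stored first and submitted later, shuffled and spread across proxies.

import json
import random
import urllib.parse
import urllib.request

QUEUE_FILE = "pending_queries.jsonl"   # hypothetical local queue
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]

def save_query(query):
    """Append the query to a local file instead of sending it right away."""
    with open(QUEUE_FILE, "a") as f:
        f.write(json.dumps({"q": query}) + "\n")

def flush_queue(engine_template):
    """Later, submit the saved queries through randomly chosen proxies."""
    with open(QUEUE_FILE) as f:
        queries = [json.loads(line)["q"] for line in f if line.strip()]
    random.shuffle(queries)  # break the link between submission order and identity
    for query in queries:
        proxy = random.choice(PROXIES)
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        url = engine_template.format(urllib.parse.quote_plus(query))
        try:
            opener.open(url, timeout=5).read()
        except Exception:
            pass  # best-effort

save_query("public library opening hours")
flush_queue("https://tiny-engine.example/?q={}")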

Implementation

If no solutions satisfying this need appear, I am planning to implement this myself in the future. I just wanted to write this down and put it on the internet in case other people are planning similar things. I already have the random search engine redirect working, but in my opinion the most important piece is the automatic data feeding.

The way I will most likely implement this is either as a web endpoint that can be added to browsers as a search engine, hosted locally or on a server, or as a browser extension.
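
As a sketch of the web-endpoint variant (the port and the default engine URL are made up), a tiny local server could accept the query, feed it as described above, and bounce the browser to a default engine. A browser can then be pointed at something like http://localhost:8532/search?q=%s as a custom search engine.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs, quote_plus

DEFAULT_ENGINE = "https://search-a.example/search?q={}"  # hypothetical

class SearchProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        query = params.get("q", [""])[0]
        # This is where the query would be sampled and fed to other
        # engines (see the earlier sketches) before redirecting.
        self.send_response(302)
        self.send_header("Location", DEFAULT_ENGINE.format(quote_plus(query)))
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8532), SearchProxy).serve_forever()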


Citation

If you find this work useful, please cite it as:
@article{yaltirakli202102searchenginedata,
  title   = "Giving search engines a fair access to data",
  author  = "Yaltirakli, Gokberk",
  journal = "gkbrk.com",
  year    = "2021",
  url     = "https://www.gkbrk.com/2021/02/search-engine-data/"
}

Comments

Comment by Ahmet
2021-03-03 at 11:33

Useful information

Comment by Guest
2021-03-01 at 16:36

The solution would be to generate site indexes on every site, have site directories, and to disallow all search engines. Just to do what gopher did. Search engines from today are single points of failures regarding censorship from within and outside, and the bigger these get, the more so. Search engines are also the reason why shady SEO industries exist, the source of lots of network traffic, and the reason for loads of information rubbish/link SPAM on the web.

Comment by Veraaaaaa
2021-02-28 at 21:48

I found this article very useful as an social media artist.😁
