# Search systems are struggling with qualitative searches
December 14, 2022
I'm sure you've seen a thousand of these articles before. Here's how they go:
- Title that includes exactly what you search for (best, favorite, etc.).
- Date in parentheses after the title line.
- A long and winding introduction setting up context.
- Whitespace after every sentence, or two-sentence paragraphs.
The funniest part to me is the lead-in. They know exactly what I'm looking for: the title even says it explicitly. I just need the answer. To quote Jerry Maguire, please just show me the money. But the answer is just as meandering as the lead, buried somewhere in a wall of double-spaced text.
This field is known broadly as content marketing. And it's easy to feel like it's overcrowding the web. In some ways it is. But I recently noticed that I find meaningful differences in quality with different types of search behavior. My main search queries revolve around programming bugs and architecture, food recipes and reviews, and travel.
Programming results have stayed consistently high. If anything, they've gotten better over time with the introduction of large language models and better semantic modeling of queries. Restaurant reviews are also solid after Google's acquisition of Zagat. I tend to find what I'm looking for quickly.
Travel guides are a whole different story. They always follow the content-SEO recipe and drown any signal in noise. It's often not even obvious whether there's a paid company placement beyond the site's own self-referencing links somewhere in the content body. I usually give up after scanning for a minute.
## SEO is as old as information retrieval
In some ways this day was inevitable. Once a ranking algorithm is widely deployed, you necessarily open yourself up to adversaries who engineer content to rank well against the metric. But the fight to the top of search results usually happened in the background.
A decade ago, the main techniques were adding header tags to give pages more structure and polluting `<meta>` tags to increase the recall of pages across different queries. There was also an industry of link farms that operated on the outskirts of what people would actually access. The original PageRank paper intended to boost websites referenced by legitimate sources, but people quickly began engineering those connections via link farms.
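The link-farm attack falls straight out of PageRank's own mechanics. A toy power-iteration sketch makes it visible; the graph and all page names here are invented for illustration:

```python
# Toy power-iteration PageRank over a hypothetical four-page graph. Pages
# "B" and "C" act as a tiny link farm: they exist only to funnel rank to "A".
GRAPH = {
    "A": ["D"],
    "B": ["A"],  # farm page
    "C": ["A"],  # farm page
    "D": ["A", "B"],
}

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Each page splits its rank evenly across its outgoing links.
            incoming = sum(
                ranks[src] / len(outs)
                for src, outs in graph.items()
                if page in outs
            )
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks

ranks = pagerank(GRAPH)
# "A" ends up ranked highest largely because the farm pages point at it.
```

The farm pages have no readers and no content, yet they still transfer rank; that gap between "referenced by legitimate sources" and "referenced at all" is exactly what link farms exploited.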
Content marketing only started getting really popular around 2012. It seemed like a win-win: write great content, get rewarded with revenue. But that incentive, now coupled with generative language models and over a decade of observed SEO tactics, creates the perfect storm for bad content.[^1]
## Qualitative queries are struggling
How quantifiable is a good search result? Most engines use implicit signals to determine whether a user is happy with a page: clickthrough data, whether the user returns to the results list, and how long they stayed on the result page. If they do click back, did they stay long enough for a full read, or just long enough to realize the page wasn't for them? A metric like NDCG aggregates judgments across queries for an overall sense of system performance.
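NDCG itself is simple to compute. A minimal sketch, with made-up relevance judgments on a 0–3 scale:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance counts more near the top."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalized DCG: 1.0 means this ranking was ideal for these judgments."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance judgments (0-3) in the order the engine returned results.
# The most relevant page (3) was ranked second, so the score falls below 1.0.
imperfect = ndcg([1, 3, 0, 2])
perfect = ndcg([3, 2, 1, 0])
```

The log discount is the key design choice: a great page buried at rank 10 contributes far less than the same page at rank 1, which matches how little users scroll.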
The best search results that I see today are either incredibly niche (a blog post on a particular oddity of Spark) or carry some amount of crowd-sourced explicit signal (restaurant reviews, Hacker News comment threads, Reddit posts, Stack Overflow questions). I suspect the former is simply a query space that isn't worthwhile for content marketing. The latter is an interesting example of one way around content-marketing pollution: moderated communities with active upvoting of results seem to be enough to power through most of it.[^2] There's some collective wisdom at play in these systems. Groupthink social media has many bad emergent properties, but crowd-sourced feedback still seems like the best tool we have for assessing good content.
## Next generations of search should be more opinionated
Naively, it seems like this content-marketing structure would be easy enough to capture with a discriminative language model: if an article looks too "spammy," demote it in the results. But this still leads to a cat-and-mouse game between search engines and content writers. To reduce the amount of SEO noise, maybe we need to rethink the approach of a search engine that catalogues the entire web.
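As a sketch of what "capture with a discriminative model" could mean, here is a toy naive-Bayes-style spamminess score. Everything here is invented for illustration: the training snippets, the phrasing, the two-corpus setup. A real system would train a proper classifier on far more data:

```python
import math
from collections import Counter

# Invented toy corpora: SEO-template filler vs. direct answers.
SPAMMY = [
    "in this ultimate guide we will explore everything you need to know",
    "before we dive in let us set the stage with some context",
]
DIRECT = [
    "the answer is to set the timeout flag to thirty seconds",
    "use the second method because it avoids the extra copy",
]

def word_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def spam_score(text, spam_counts, ham_counts):
    """Naive-Bayes-style log-odds that `text` reads like SEO filler."""
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    score = 0.0
    for word in text.split():
        # Laplace smoothing so unseen words don't zero out the estimate.
        p_spam = (spam_counts[word] + 1) / (spam_total + 1)
        p_ham = (ham_counts[word] + 1) / (ham_total + 1)
        score += math.log(p_spam / p_ham)
    return score

spam_counts = word_counts(SPAMMY)
ham_counts = word_counts(DIRECT)
# A template-style lead-in scores higher than a terse, direct answer.
```

The cat-and-mouse problem shows up immediately: once writers know which phrases move the score, they rewrite around them, and the model has to be retrained.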
This starts with rewarding pithy content. There must be some information-gain signal in how quickly authors get to the point of the article, where the body actually answers the question framed in the header. The more targeted the question, the more targeted the answer should be.
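One hypothetical way to operationalize that signal: measure how far into the body the reader gets before the title's terms start being answered. The function, the half-the-terms threshold, and the sample texts below are all invented for illustration:

```python
def pithiness(title, body):
    """Fraction of the body read before half of the title's terms have
    appeared (lower = pithier). A hypothetical heuristic, not a real signal."""
    terms = set(title.lower().split())
    words = body.lower().split()
    seen = set()
    for i, word in enumerate(words):
        if word in terms:
            seen.add(word)
        if len(seen) >= max(1, len(terms) // 2):
            return (i + 1) / len(words)
    return 1.0  # never got to the point

direct = "best ramen in tokyo is at a tiny shop in shinjuku"
meandering = ("when i first moved to japan i had no idea what to expect "
              "from the food scene but best ramen tokyo awaits")
# The direct answer scores far lower (better) than the meandering lead-in.
```

A lexical-overlap heuristic like this is trivially gameable on its own, but it shows the shape of the signal: position of the answer relative to the question, not just presence of keywords.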
Websites should specialize in meaningful content. How often can one writer or a group of writers actually develop meaningful content? How broad can their area of expertise be? Should there be an upper limit on how often a website can publish before it's demoted?
Banning corporate domains is another approach. Unless you're explicitly searching for the company in question, ban corporate blogs. Highlight crowd-sourced community posts with their own upvote systems, or individual experiences shared on personal blogs.
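A curated-rules reranker along these lines could be sketched as follows. The domain lists, scores, and URLs are placeholders, not any real engine's data:

```python
from urllib.parse import urlparse

# Hypothetical user-curated rules: ban corporate blogs, boost communities.
BOOST_DOMAINS = {"news.ycombinator.com", "stackoverflow.com", "reddit.com"}
BAN_DOMAINS = {"blog.bigcorp.example"}  # placeholder corporate domain

def rerank(results, query):
    """Apply curated rules on top of an existing engine's (url, score) list."""
    reranked = []
    for url, score in results:
        host = urlparse(url).netloc
        if host in BAN_DOMAINS and host not in query:
            continue  # drop corporate blogs unless explicitly searched for
        if host in BOOST_DOMAINS:
            score *= 2  # favor crowd-sourced, upvoted communities
        reranked.append((url, score))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)

hits = [
    ("https://blog.bigcorp.example/10-best-travel-tips", 0.9),
    ("https://news.ycombinator.com/item?id=1", 0.6),
    ("https://example-personal-blog.net/kyoto-notes", 0.7),
]
reranked = rerank(hits, "best travel tips")
```

The point is that the rules sit with the user, not the engine: swapping in a different ban list changes the results without retraining anything.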
With renewed interest in approaches that don't try to boil the ocean of search results, I think there's an opening for more targeted search engines, or perhaps a platform that lets users curate rules to easily build their own. Twitter has recently been the most vocal platform about moving content control into the hands of users. Why should search engines be any different?
[^1]: It's not solely an ML problem, though. This bad content pre-dates generative models. Cheap content farms have been manually writing it for years, and it seems to be working at propelling these sites to the top of common searches.
[^2]: There's still the problem of astroturfed campaigns. But at least those don't fall into this SEO content-template trap.