Many SEOs have been quick to dismiss the Yandex source code leak. Is there something they’re missing? Or are SEOs underestimating what the leak could help them learn and understand about SEO?
Here’s a little backstory:
Towards the end of January 2023, it was reported that a hacker had got hold of around 45GB of Yandex source code, including a list of ranking factors and their coefficients (weights).
That’s the equivalent of finding out Google’s search algorithms. There was a lot of hype around it, and a big part of the SEO community has been working around the clock trying to decode the material.
However, there were a few doubting Thomases, quick to dismiss the leak with arguments such as:
- Yandex isn’t Google
- We can’t ascertain if the leak is real
- What’s this obsession with ranking factors?
- That’s just a copy. Yandex scraped Google.
- The leak is just a tiny fraction of Yandex’s source code. It doesn’t say anything about how Yandex ranks websites.
- There’s nothing new here.
- The code repo is outdated
Does this scream ignorance, or are they right?
The leak may not be comprehensive, but it’s still helpful. Even if the code is dated, it reveals how search engines have evolved.
Most of us have never encountered better insights into how modern search engines work. Much of what we know is pure speculation.
Our take: The reaction we see is mostly based on the fear of the unknown, being wrong, having less room for interpretation, and wasting time and effort.
Being cautious is alright, but dismissing the leak outright screams ignorance.
Don’t get left behind – let’s dive in and explore.
The Most Common Objections to the Yandex Source Code Leak
Some SEOs have been quick to overlook the potential of this leak, with some interesting objections. Let’s examine these arguments and see if they hold up.
Objection 1: Yandex isn’t Google
Yandex and Google are indeed two very different search engines. But you’ll find a few overlaps when you compare their search results.
Let’s run a few search queries and compare the results. For example, search for “the best credit cards” on Yandex and Google.
Here are the top results from the two search engines, side by side:
|Best Credit Cards|
|Position 1||Best Credit Cards Singapore 2023 | Apply now! – MoneySmart||Best Credit Cards Singapore 2023 | Apply now! – Money Smart|
|Position 2||Best Credit Cards in Singapore 2023 – Value Champion||5 Best Credit Card Plans in Singapore for All Needs (2021) – Bestinsingapore|
|Position 3||Best Credit Cards Promotions in Singapore (March 2023) – Sing Saver||Best credit cards in Singapore for 2023 | Finder Singapore – finder.com|
|Position 4||Best Credit Cards in Singapore 2023 – Seedly||Compare the Best Credit Cards in Singapore  – Finty|
|Position 5||Best credit card sign-up bonuses in Singapore (March 2023) – Suite Smile||5 Best Credit Cards in Singapore for Overall Spending (2023) – Instant loan|
|Position 6||Compare the Best Credit Cards in Singapore  – Finty||Credit Cards in Singapore: February 2023 Deals | SingSaver|
|Position 7||Compare Credit Cards Singapore – DBS Bank||The 5 BEST Credit Cards In Singapore 2021 – YouTube|
|Position 8||Apply for a Credit Card by Trust | Trust Bank Singapore||Best Credit Cards for Online Shopping & Mobile Payments – Value Champion|
As you can see, half of the results appear in both lists.
One of the ten results holds the same position in each.
Now let’s do the same with other keywords and see how they stack up:
|Keywords||The number of similar results in the top 10||The number of results with the same position|
|The best credit cards in Singapore||5/10||1/10|
|The best hotels in NYC||6/10||0/10|
|The Best CRM Software||2/10||0/10|
|How to Delete a Branch in Git||3/10||1/10|
|How to Potty-train a Puppy||1/10||1/10|
|3 Bedroom Apartment in Moscow||5/10||0/10|
|Common Cold Symptoms||2/10||0/10|
You could argue which results are better, but the overlaps tell us something interesting. It’s a sign that similar ranking factors exist in both search engines and that they’re not entirely different.
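If you want to repeat this comparison for your own keywords, the counting is easy to script. The snippet below is a minimal sketch rather than the method used to build the table above: the function name `compare_serps` and the placeholder domains are assumptions made for illustration, and you would replace the two lists with the results you actually collect.

```python
def compare_serps(serp_a, serp_b):
    """Return (shared, same_position) for two ranked lists of results.

    shared        -> how many results appear in both lists, at any position
    same_position -> how many results sit at the same rank in both lists
    """
    shared = len(set(serp_a) & set(serp_b))
    same_position = sum(1 for a, b in zip(serp_a, serp_b) if a == b)
    return shared, same_position


# Hypothetical, shortened result lists (placeholder domains, not real data).
engine_a = ["site-a.example", "site-b.example", "site-c.example", "site-d.example"]
engine_b = ["site-a.example", "site-x.example", "site-y.example", "site-b.example"]

shared, same_position = compare_serps(engine_a, engine_b)
print(f"{shared}/{len(engine_a)} shared, {same_position}/{len(engine_a)} in the same position")
```

Comparing by domain rather than by full URL or title is a design choice. It is more forgiving when the two engines surface different pages from the same site.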
So, the fact that Yandex isn’t Google doesn’t mean the leak is irrelevant.
Objection 2: We Can’t Ascertain if the Leak is Real
Yandex officially confirmed the leak, so there is no doubt it happened (source).
But if you don’t believe that, look at the code’s repo. You can compare it to other projects and see how the structure, formatting, and syntax match what you’d expect from professional code.
Objection 3: What’s This Obsession With Ranking Factors?
The obsession with ranking factors is understandable when you consider how much time and money goes into SEO. It’s only natural to want an edge over your competition.
Knowing the ranking factors can help you optimise your website better. It gives you an understanding of how search engines work and enables you to tailor your content accordingly.
Objection 4: That’s Just a Copy. Yandex Scraped Google
Some of the ranking factors discovered from the Yandex leak match those used by Google. But that doesn’t mean Yandex has copied their algorithm.
The two search engines use different methods to calculate content relevance. Yandex has its own unique approach, which you can see in the code repo.
Yandex may have taken some of the best practices from Google, but there are still plenty of unique selling points in the code.
Objection 5: The Leak is Just a Tiny Fraction of Yandex’s Source Code
That may be true, but it still gives us an insight into how Yandex works. The source code is a big part of the puzzle; understanding it can provide valuable information.
The argument made by most SEOs is that only one repository was leaked and that such a giant search engine can’t be boiled down to a single code repo.
Well, Google famously keeps most of its code in a single repository, and it’s still the most powerful search engine in the world.
Objection 6: There’s Nothing New Here
It’s true that most of the ranking factors uncovered from the Yandex source code leak aren’t groundbreaking. They’re things we already suspected and have been talking about for years.
But that misses the point:
What we claim to already know has been pure conjecture.
We know about SEO from our experiences, experiments, theoretical studies, anecdotes, etc.
We’ve never seen these ranking signals in the source code until now. This is the first time professionals can confirm these theories and have real evidence to back them up.
SEO Highlights of the Yandex Source Code
A few SEOs took it upon themselves to study the source code and break down what they found.
Here are some of the highlights:
#1. List of Yandex Ranking Factors by Martin MacDonald
Martin MacDonald, author and founder of Web Marketing School, compiled a list of Yandex ranking factors from the source code leak.
He discovered that there are far more than 1922 individual ranking factors, starting at PageRank (PR) and moving on to text/content-based elements, meta tags, link structure, and more.
Ben Wills went through the code and calculated the actual number. It turns out Yandex has 17854 ranking factors.
#2. 19% of the Ranking Factors Focus on User Signals, 6% on Content Relevance, and 6% on Links (By Malte Landwehr)
Malte Landwehr, head of SEO at Idealo, thoroughly analysed the source code and extracted some valuable information.
He found out that 19% of Yandex’s ranking factors focus on user signals (e.g., bounce rate), 6% on content relevance (e.g., keyword density), and 6% on links (e.g., inbound link quality).
Malte’s findings seem to confirm what SEMrush reported when they published their ranking factor study that showed that the traffic to a website had the highest ranking coefficient. The SEO community quickly bashed them, but Malte’s findings agree with their claim.
#3. There Were About 40 Quality-Related Ranking Factors in the Code (Malte Landwehr)
From his analysis, Malte Landwehr also found out that the code had about 40 quality-related ranking factors.
These ranking factors fall into three groups:
- Site quality: Yandex pays attention to site details. They look at the average content freshness, the average text quality, and the historical performance of your content (10+ factors). They then categorise the hosting site as low, acceptable, good, or excellent quality. Their YMYL rules are host-specific, not document-specific. In other words, Yandex looks at your website’s content holistically rather than on a page-by-page basis.
- Page quality: Yandex also looks at the quality of the page itself. They check whether embedded or linked content returns a 404 status code, and they’ll mark your page as low quality if the content isn’t found. Broken video files are the worst; Yandex will mark your page as low quality if one is detected.
- Text quality: Yandex also looks at the text on a page. First, they’ll look at the natural occurrence of verbs, pronouns, adjectives, nouns, adverbs, and other parts of speech. They also employ various methods to detect automatically generated content and plagiarised content.
Ranking Factors Are Query-specific
It has long been argued that ranking factors are increasingly category-specific.
This has been true for Google and other search engines, but Yandex takes it further.
Not only do they look at the category or keyword, but they’ll also look at the query itself.
Their source code includes static, dynamic (query-specific), and user-specific ranking factors.
Static factors apply to the website, dynamic factors apply to the query, and user factors are connected to the user’s language, search history, location, and other data.
The 17854 Ranking Factors
Martin MacDonald, Ben Wills, and Malte Landwehr all agree that Yandex has an impressive number of ranking factors.
Combined, they calculated that there are 17854 individual ranking factors.
These ranking factors are built around different modalities. However, of these, only 1922 aren’t deprecated.
In the same way humans are bad at understanding the impact of compound interest, it’s incredibly hard to estimate the outcome of these algorithms. Add gradient and binary, query-specific, and user-specific ranking factors to the mix, and you get an algorithmic nightmare.
Reverse engineering becomes next to impossible. The fact that there are so many moving parts, not to forget the web ecosystem, makes Yandex’s algorithm a huge conundrum. It also makes it encouraging because it shows that the search engine giants are considering different aspects of a website to determine its ranking rather than focusing on just one or two facets.
Yandex Seems to Follow Similar Information Retrieval Best Practices to Google
While their algorithm is incredibly complex and hard to reverse-engineer, there are similarities with Google’s best practices, such as the inverted index or embeddings.
Yandex also uses different models, like the machine-learning algorithm MatrixNet, to determine their rank coefficients. Remember, MatrixNet was the model in use before CatBoost replaced it in 2017.
Knowing how and where MatrixNet is used in their algorithm will give you an idea of how modern search engines go about adjusting and fine-tuning their ranking models.
So, Are SEOs Underestimating the Yandex Leak?
To understand the true implications of Yandex’s algorithmic leak, SEOs need to start thinking like researchers.
Imagine if researchers had the complete DNA sequence of cancer in mice. Using the same reasoning SEOs use to dismiss the Yandex leak, would they argue that mice aren’t humans and the DNA sequences are useless?
Of course not.
It’s time for SEOs to step up and realise that the Yandex leak is more than just a set of ranking factors. It’s an opportunity to learn about search engine algorithms from the inside out.
10 Things We Learn From the Yandex Source Leak
In summary, here are ten things to learn from Yandex’s leaked ranking factors:
#1. MatrixNet
MatrixNet was first announced in 2009. CatBoost would supersede it in 2017.
The leaked ranking factors still mention MatrixNet, which lends weight to the claim that this is an outdated repository.
Originally, MatrixNet was introduced as a new core algorithm for Yandex’s SERP. It considered thousands of ranking factors, assigning weights based on the search query, the user’s location, and perceived search intent.
Launched six years before Google’s RankBrain, Yandex’s MatrixNet was considered one of the most advanced search algorithms.
Other algorithms have been built upon MatrixNet. In 2016, Yandex launched the Palekh algorithm that used deep neural networks to generate more accurate results, while the Pinet algorithm focused on cutting back on false-positive results.
The Palekh algorithm could process 150 web pages at a time, making it one of the most powerful versions released up to that point. In 2017, Yandex released an even more advanced version called the Korolyov update, which processed 200,000 pages at once and even went as far as considering the page’s depth.
#2. URL & Page-level Factors
Yandex considers many URL and page-level factors when ranking webpages. These include:
- The presence of numbers in the URL
- The presence and number of trailing slashes (are you using them excessively?)
- The presence and number of capital letters in the URL
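To check your own URLs against the three signals above, a short script is enough. This is a minimal sketch under stated assumptions: the function name `url_features` and the feature names are ours, not identifiers from the leaked code, and we only inspect the path portion of the URL. The leak does not tell us exactly how Yandex measures these signals.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Extract simple URL-level signals from the path of a URL.

    The feature names are illustrative and do not come from the Yandex code.
    """
    path = urlsplit(url).path
    return {
        # Does the path contain any digit?
        "has_digits": any(ch.isdigit() for ch in path),
        # How many consecutive slashes does the path end with?
        "trailing_slashes": len(path) - len(path.rstrip("/")),
        # How many upper-case letters appear in the path?
        "capital_letters": sum(1 for ch in path if ch.isupper()),
    }


# Hypothetical URL used purely for illustration.
print(url_features("https://example.com/Blog/Top-10-Cards//"))
```

Running this across a crawl export makes it easy to spot outliers, such as sections of a site that mix letter cases or stack trailing slashes.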
Yandex also considers the age of the page and the date of the last update. We all know that Google values fresh content, and Yandex is no different, particularly for news-related search queries.
The leak also shows that Yandex used timestamps not for ranking but for reordering, though they no longer do so.
The deprecated factors also show that keywords in the URL were once used. That factor is no longer active, but it still gives you an idea of how pages were ranked.
#3. Crawl Depth
Google is on record saying that crawl depth isn’t explicitly a ranking factor. However, Yandex has an active piece of code in its algorithm that considers a page’s crawl depth.
By crawl depth, we mean the number of clicks it takes a user to get to a specific page from the homepage.
URLs that are easily reachable from the homepage will rank higher than those requiring more clicks. That’s because Yandex believes that pages closer to the homepage are likely to be more important and relevant to the user.
It mirrors John Mueller’s statement that Google gives a little more weight to pages closer to the homepage.
The leaked code also has a specific token for weighting orphan pages, i.e., pages that no other page on the website links to.
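Crawl depth and orphan pages are both easy to measure on your own site once you have a map of its internal links. The sketch below assumes you already hold that map as a dictionary of page to outgoing links (for example, from a crawler export). The function names and the sample site are hypothetical, and this is a standard breadth-first search, not Yandex’s implementation.

```python
from collections import deque


def click_depths(links, homepage):
    """Breadth-first search: minimum number of clicks from the homepage to each page."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def orphan_pages(links, homepage):
    """Pages that exist on the site but receive no internal links."""
    linked_to = {target for targets in links.values() for target in targets}
    return sorted(page for page in links if page != homepage and page not in linked_to)


# Hypothetical internal link map: page -> pages it links to.
site = {
    "/": ["/cards", "/loans"],
    "/cards": ["/cards/cashback"],
    "/loans": [],
    "/cards/cashback": ["/"],
    "/old-promo": ["/"],
}

print(click_depths(site, "/"))
print(orphan_pages(site, "/"))
```

In this sample, `/cards/cashback` sits two clicks from the homepage, and `/old-promo` is an orphan because nothing links to it. Orphans never appear in the depth map at all, since they can’t be reached by clicking.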
#4. Click and CTR
Yandex wrote a blog post in 2011 discussing how they use clicks and click-through rates as ranking factors.
They also talked about how SEOs might be tempted to use this ranking factor to manipulate their rankings.
The specific click factors highlighted in the leak give us an insight into the following:
- The ratio of clicks a link receives relative to all the clicks in the search result
- The same as the above, but broken down by region
- How often users click through to pages from the search results
From the leak, we can see that Yandex considers click data when ranking pages in its search engine.
The more clicks a page receives, the higher it tends to rank. It may be an indirect ranking factor, but it does have an impact on rankings.
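The click ratios described above are simple arithmetic, and you can approximate them from your own search console exports. The following is a rough sketch, not the formula from the leak: the function names `click_share` and `ctr`, and the sample numbers, are assumptions for illustration only.

```python
def click_share(clicks_by_url):
    """Share of all clicks on a results page that each URL receives."""
    total = sum(clicks_by_url.values())
    if total == 0:
        return {url: 0.0 for url in clicks_by_url}
    return {url: clicks / total for url, clicks in clicks_by_url.items()}


def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions (0.0 when never shown)."""
    return clicks / impressions if impressions else 0.0


# Hypothetical click counts for one query, broken down by region.
clicks_by_region = {
    "region-1": {"site-a.example": 60, "site-b.example": 30, "site-c.example": 10},
    "region-2": {"site-a.example": 5, "site-b.example": 15, "site-c.example": 0},
}

for region, clicks in clicks_by_region.items():
    print(region, click_share(clicks))

print(ctr(clicks=60, impressions=400))
```

Splitting the same calculation by region shows how a page can dominate clicks in one market and barely register in another.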
#5. Click Manipulation
Click manipulation has been a topic of interest in SEO circles for years. The practice, which some SEOs loosely call “click-jacking,” involves artificially inflating clicks on a link to boost its rankings.
It looks like Yandex is aware of this and is actively trying to prevent it from happening.
They have a filter (the PF Filter) that actively scans and identifies suspicious click patterns.
It appears that if a link has an unnatural pattern of clicks, it will be penalised in the rankings.
#6. User Behaviour
The user behaviour section of the leak is particularly interesting.
Unscrupulous SEOs have been trying to game the system for years, from link buying to keyword stuffing.
But Yandex is cracking down on all these practices and actively trying to reward sites that genuinely provide a great user experience.
Yandex uses the PF Filter, the same filter it uses for click manipulation, to identify sites deliberately trying to manipulate user behaviour.
It looks at the time spent on a page, the number of pages visited, and other metrics to decide whether a page provides real value.
#7. Dwell Time
Dwell time is the amount of time a user spends on a page.
One of a group of 102 ranking factors carries the tag “TG_USERFEAT_SEARCH_DWELL_TIME.”
The factors also reference the device, user duration, and average dwell time.
About 39 of these factors have been deprecated, but dwell time remains a ranking factor in the algorithm.
The term Dwell Time was first used by Bing (in their 2011 blog post).
However, Google has said they don’t use dwell time or similar interaction signals as a ranking factor.
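If you want to track dwell time for your own pages, an average per device is a reasonable place to start, since the leaked factors reference both. The sketch below is only an illustration under our own assumptions: the visit records, field order, and function name are hypothetical, and it says nothing about how Yandex computes the value behind its tag.

```python
from collections import defaultdict
from statistics import mean


def average_dwell_by_device(visits):
    """Average dwell time in seconds per device.

    Each visit is a tuple: (device, arrived_at, left_at), with times in seconds.
    """
    dwell_times = defaultdict(list)
    for device, arrived_at, left_at in visits:
        dwell_times[device].append(left_at - arrived_at)
    return {device: mean(times) for device, times in dwell_times.items()}


# Hypothetical visit log: (device, arrival time, departure time).
visits = [
    ("mobile", 0, 30),
    ("mobile", 100, 190),
    ("desktop", 50, 170),
]

print(average_dwell_by_device(visits))
```

A mean is sensitive to a handful of very long sessions, so a median may be a fairer summary on real analytics data.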
#8. YMYL
YMYL, or Your Money or Your Life, is a term used to describe websites containing information related to money, health, and safety.
The leak touches on specific ranking factors for medical, financial, and legal websites.
This is nothing new: in 2019, during the Yandex Webmaster conference, they announced the Proxima Search Quality Metric.
So, How Should you Go About Exploring the Yandex Leak?
Thinking about Yandex ranking factors as the basis for SEO test hypotheses is the best way to go about this leak.
While you can’t isolate individual ranking factors, especially those with low coefficients, you can understand the overall trends in their algorithm and try to apply them to your own website.
Sure, it won’t be a perfect science, but at least you’ll have something to work with when testing new SEO strategies and tactics. Test, measure, and adjust until you find a winning formula.
For example, we never look at link age when analysing link profiles, but Yandex does. Therefore, it makes a lot of sense for us to start looking at link age and use it as a factor when making decisions about links.
Just because Yandex has 17854 ranking factors doesn’t mean you must go through them all. Look at the bigger picture and find patterns.
Even if search engines were to change and adopt a ChatGPT-like model, wouldn’t you still have liked to know what the winning formula was all these years?
It is clear that Yandex has gone beyond the basic run-of-the-mill SEO tactics and is leveraging its wealth of data to reward websites that offer a great user experience.
The leak shines some light on the inner workings of Yandex’s algorithm, and it appears that SEOs may have overlooked some important ranking factors.