How To Crawl A Website Without Getting Blocked By Robots.txt And Other Firewalls

Part of SEO 101 is optimizing your website’s crawl budget. That means ensuring that bots, especially Googlebot, can crawl your website and its pages easily.

It’s a topic many of you are already familiar with, but many of us still get it wrong.

If crawlers have been blocked from your website by a misconfigured robots.txt file, your firewall, or other defenses, this post is for you.

One small change to your robots.txt file can significantly impact your website’s crawl budget and ability to be indexed by search engines.

What’s a Robots.txt File?

Robots.txt is a text file web admins create to control how robots crawl their websites and which pages they can access. 

For instance, if you have areas of your website that are not open to public access or are still under construction, then you can use the robots.txt file to block bots from accessing those pages.

You may also want to block robots from accessing your private photos, expired offers, duplicate content, or other pages that are not relevant to your website’s SEO strategy.

Some SEOs use it to solve issues with duplicate content, but in most cases, using a noindex meta tag is more effective. When a robots.txt file is set up wrong, it can also block Googlebot and other bots from accessing your website altogether.

When to Use a Robots.txt File

As we’ve already mentioned, SEOs use robots.txt files to stop search engines from indexing certain pages on their website.

If you have no problem with search engines crawling and indexing your entire site, then there’s no reason to use the robots.txt file.

However, if you have any issues preventing specific pages from being indexed, then a robots.txt file is a quick and easy way to solve the problem. 

Some issues that can be addressed with a robots.txt file include:

  • Duplicate content on your website or blog: If certain URL parameters are causing page duplication, then you can use a robots.txt file to block bots from crawling that part of your site.
  • Keep Parts of a Site Private: Think admin pages, password-protected pages, or expired offers; these pages don’t need to be crawled and indexed. A robots.txt file can block bots from accessing those pages.
  • Prevent Crawling of Specific Files: If you have images, scripts, or other files on your website that Google or any other search engine shouldn’t crawl, then a robots.txt file can keep crawlers away from them (though, as we’ll see below, it doesn’t guarantee they stay out of the index).
  • Blocking a URL: To block a specific page from being crawled, add the URL to your robots.txt file and save it.
  • Managing Crawl Traffic and Preventing Certain Media Files from Being Crawled: If you want to manage your crawl traffic, there’s no better tool than a robots.txt file. That will allow you to block specific media files from being crawled by search engines and prevent other bots from accessing those files as well. 
  • Running Paid Ads That Require Specific Instructions: Sometimes, you may have paid ads or other media files that need special handling. You can use a robots.txt file to block these specific URLs and prevent bots from crawling them altogether.

If you have no area on your website that you wish to control and want to allow access to all areas of your site, then you don’t need a robots.txt file.

Google’s guidelines for robots.txt files are also clear: you shouldn’t use robots.txt to block search engines from indexing web pages. Instead, use a noindex meta tag.
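For reference, a noindex directive is a single meta tag placed in the <head> of the page you want kept out of search results:

<meta name="robots" content="noindex">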

The reason for this is two-fold. First, if other pages link to a page that you’ve blocked with robots.txt, then the page will still be indexed by virtue of showing up on those third-party pages. 

Second, a robots.txt file that is set up incorrectly can block your entire website from search engines.

How to Get Started with Robots.txt Files

Before you start using a robots.txt file, you must ensure you don’t already have one in place. 

All you have to do is add “/robots.txt” to the end of your domain name. For example, if your domain name is abcdef.com, you’d check abcdef.com/robots.txt.

If you have one, you’ll see a plain text file with a list of rules and instructions for search engines. If you don’t have a robots.txt file, you’ll get a 404 Not Found error.

For example, if we add the extension to upwork.com, we’ll see Upwork’s live robots.txt file: a plain text list of user-agent and disallow rules.
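If you’d rather check from the command line, here’s a minimal Python sketch (standard library only; the domain is a placeholder) that fetches a site’s robots.txt and reports whether one exists:

from urllib.request import urlopen
from urllib.error import HTTPError

url = "https://www.example.com/robots.txt"  # swap in your own domain
try:
    with urlopen(url) as resp:
        print(resp.read().decode("utf-8"))  # prints the existing rules
except HTTPError as err:
    if err.code == 404:
        print("No robots.txt file found.")
    else:
        raise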

Next, Check If Your Robots.txt File Is Blocking Important Files

To do this, you’ll need to use a tool like Screaming Frog or the robots.txt testing tool in Google Search Console.

Head to your Google Search Console to see if robots.txt is blocking any important file or page.

After logging into Google Search Console, follow this link https://www.google.com/webmasters/tools/robots-testing-tool, and select the website you want to crawl. 

Google will reveal if there’s any file preventing it from reaching certain parts of your site. 

If you see any important files being blocked by your robots.txt file, then this is a good indication that you need to update your file and allow Google to access those areas.
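You can run a similar spot check yourself with Python’s built-in urllib.robotparser. This is a minimal sketch; the domain and paths are placeholders for your own site’s important URLs:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()

important_pages = ["/", "/products/", "/blog/"]
for path in important_pages:
    allowed = rp.can_fetch("Googlebot", "https://www.example.com" + path)
    print(path, "allowed" if allowed else "BLOCKED")

Any page that prints “BLOCKED” here is one Googlebot will skip under your current rules.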

It’s worth noting that you might not need a robots.txt file at all. 

If you don’t have any areas on your website that you want to control or change, then it’s best to leave your site unblocked and avoid using a robots.txt file. 

However, if you consider some specific pages or files private, sensitive, or irrelevant to your audience, then you can use a robots.txt file to block them from being indexed by search engines.

How to Set Up a Robots.txt File

Setting up a robots.txt file is fairly simple, but you must follow Google’s guidelines to avoid blocking your entire site.

Note that there’s no standard robots.txt file you can download and use as a template. The rules you use on your file will depend on your crawling needs. 

Also, with robots.txt, you can stop Google from crawling certain pages on your website without touching code or adding a noindex tag.

There are many benefits to this approach. 

Every robots.txt file results in one of the following outcomes:

  • Full Allow: Robots.txt will allow all crawlers (including Google) full access to your website. 
  • Full Disallow: Robots.txt will prevent all crawlers from accessing your website. No content will be crawled and indexed by search engines.
  • Conditional Disallow: Here, your robots.txt file will provide instructions to search engines on which pages and folders you want to be crawled and which ones you don’t. For example, you might want search engines to crawl all your blog posts except for the category page for your “SEO tools” category. Or maybe you want crawlers to index the product pages on your ecommerce site but not their reviews.

The process of setting up a robots.txt file is fairly simple. 

First, create a plain text file and add it to your website’s root directory (i.e., the top-level folder where your homepage lives). The file must be named “robots.txt” so crawlers can find it at yourdomain.com/robots.txt.

The file is built from two elements:

“User-agent”: the specific crawler a rule applies to, such as Googlebot.

“Disallow”: the specific pages or folders you want that crawler to skip.

For example, if you want Googlebot to avoid indexing your “sales” page, you would add the following line to your robots.txt file:

User-agent: Googlebot

Disallow: /sales/

Note that you can also target specific file types or multiple matching URLs in your robots.txt file by using the wildcard character (*), which matches any sequence of characters:

User-agent: Googlebot

Disallow: /*sales/*

These two lines are seen as a single entry in the final robots.txt file. And yes, there’s no limit to the number of lines you can add.
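Before uploading a new file, you can test rules like these locally with Python’s urllib.robotparser. Note that it implements the original robots exclusion standard and ignores Google-style wildcards, so this sketch tests the plain /sales/ rule from above:

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow: /sales/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://www.example.com/sales/deals"))  # False: blocked
print(rp.can_fetch("Bingbot", "https://www.example.com/sales/deals"))    # True: no rule applies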

How to Block URLs in Robots.txt

If you want to block an individual page, then the process is simple: just add the file path in your robots.txt file.

For example, if you wanted to block http://example.com/test-page/, you would simply write this in your text editor:

User-agent: *

Disallow: /test-page/

For the user-agent, you can specify the bot you want to block, such as Googlebot or Bingbot. To apply the rule to all bots, use an asterisk (*), as in the example above.

The second line, Disallow, tells the crawler which pages and folders to avoid.

To block an entire site, use a forward slash (/) to signify the root directory of your website. For example, if you want to block https://example.com/ from being crawled and indexed by search engines, then add this line:

User-agent: *

Disallow: /

Adding /bad-directory/ to “Disallow” blocks the entire directory and all of its content.

Say you want to block https://example.com/subdirectory/bad-page/, then you will have to add this code:

User-agent: *

Disallow: /subdirectory/bad-page/

(Using Disallow: /subdirectory/ instead would block the entire subdirectory and everything in it, not just the one page.)

After making these changes, remember to upload your robots.txt file to your website’s root directory.

After deciding on your user agent and “Disallow” selection, your entries may look something like this:

User-agent: *

Disallow: /secret.html

Disallow: /bad-directory/

How to Save Your Robots.txt

#1. Save your file as plain text.

To do this, open your text editor (such as Notepad or Notepad++).

Then paste your robots.txt rules into the editor.

Next, go to File > Save As and change the “Save as type” dropdown from .txt to “All Files.” Save the file as “robots.txt,” then upload it to your website’s root directory.
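If you prefer to script this step, here’s a minimal Python sketch that writes example rules to a correctly named plain-text file (the rules themselves are placeholders):

rules = [
    "User-agent: *",
    "Disallow: /secret.html",
    "Disallow: /bad-directory/",
]

# The file must be named exactly "robots.txt", saved as plain UTF-8 text.
with open("robots.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(rules) + "\n")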

Congrats! You’ve completed the first step in blocking URLs with robots.txt.

#2: Save the File to the Highest-level Directory

That means you should place the file in your website’s root directory, the topmost level of your site, not in a subdirectory.

By default, most content management systems (such as WordPress) will save this file to their “root” directory. However, if you’re using a custom platform or web server, you’ll need to double-check that your robots.txt file is saved in the correct location.

Check if the text file is saved as “robots.txt” on your website’s root directory.

#3: Check if Your Code Follows the Right Structure

You also want to go through your file and see if it contains the right code structure.

Each entry in your robots.txt file should appear on a separate line in your text editor.

Each entry must be written using the same format.

For example, this is the proper way to write a line of code in robots.txt:

User-agent: *

Disallow: /secret.html

#4: Make Sure Each URL You Want to Disallow or Allow Is Written on Its Own Line

Each URL you want to disallow or allow needs to be written on its own line.

If your robots.txt file contains multiple URLs not separated by a newline (i.e., “enter” key), it will cause errors.

Search engine crawlers will be confused and won’t know which URLs you want to allow or disallow.

For example, if your robots.txt file contains multiple URLs separated by commas (or any other character), then this will create errors:

User-agent: *

Disallow: /secret.html, /bad-page/

The same goes for multiple URLs separated by spaces or any other character on a single line.

To avoid this problem, ensure each URL is written on its own line in your text editor.

So, it becomes:

User-agent: *

Disallow: /secret.html

Disallow: /bad-page/
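A quick way to catch this mistake before uploading is a small lint pass. This Python sketch (an illustration, not a full validator) flags Disallow lines that cram several paths together:

def find_bad_disallow_lines(robots_text):
    problems = []
    for number, line in enumerate(robots_text.splitlines(), start=1):
        if line.strip().lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if "," in value or " " in value:
                problems.append((number, line.strip()))
    return problems

bad_file = "User-agent: *\nDisallow: /secret.html, /bad-page/"
print(find_bad_disallow_lines(bad_file))
# [(2, 'Disallow: /secret.html, /bad-page/')]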

#5: Use Lowercase to Save Your File

File names are case-sensitive on most web servers, so it’s important to save your robots.txt file in lowercase.

For example, if you’re saving a file named “Robots.txt” instead of “robots.txt,” then this will cause errors.

#6: Create Separate Files for Each Subdomain

If your website has multiple subdomains, then you will need to create separate robots.txt files for each subdomain (e.g., “www,” “blog,” “dev”).

Each subdomain should have its own list of allowed and disallowed URLs.

For example, www.example.com would include a list of allowed and disallowed URLs, while blog.example.com would have its own separate list.

#7: Add a Comment Tag to Your Robots.txt File

Start a new line with the pound sign (#) to leave a comment.

Search engines ignore comments, but they’re useful notes for anyone (including future you) editing the file later.

Here’s an example of a comment code:

# Files

User-Agent: *

Disallow: /

How to Test Whether Your Robots.txt File is Working

Test your results using Google Search Console’s robots.txt tester.

Open the tool and follow the steps below to ensure that your robots.txt file is working properly:

1.) Type in the URL of the page you want to test.

2.) Google also allows you to select the user agent you want to simulate from the dropdown menu. That’s important because different crawlers may be subject to different rules in your file.

3.) Click “Test.”

The result will read “Accepted” or “Blocked.” If a URL you intended to block shows “Blocked,” your robots.txt file is working properly.

If you see an error message, your file has a problem. 

4.) Remember, you can always edit and test your robots.txt file again.

The changes you make in the GSC’s tester will not change the file on your website, as it only serves as a simulation. 

If you’d like to save the changes you’ve made, you’ll need to copy the edited code and save it to your website’s robots.txt file.

Common Mistakes People Make with Their Robots.txt Files

People make plenty of mistakes when creating their robots.txt files.

Here are the most common ones:

1.) Not Saving Their Robots.txt File in the Root Folder

Search engine bots will only discover your robots.txt file if it’s saved in the root directory of your website. 

For example, if your website’s address is www.example.com, your robots.txt file should be saved as www.example.com/robots.txt and not www.example.com/Folder/robots.txt.

2.) Poor Use of Wildcards

Robots.txt supports two wildcards: the asterisk (*) and the dollar sign ($). 

The asterisk wildcard can match any character, while the dollar sign indicates the end of a URL. 

For example, if you want to block all files that end with .jpg, then you would use the following code:

User-Agent: *

Disallow: /*.jpg$

It’s sensible to restrict the use of wildcards to ensure you’re not accidentally blocking important pages on your website.

A poorly placed asterisk could also block your entire site from being crawled, so be careful.
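If you want to see exactly what a wildcard pattern matches before deploying it, you can approximate Google-style matching with a small Python helper. This is an illustrative sketch, not Google’s exact implementation:

import re

def robots_pattern_to_regex(pattern):
    # Escape regex metacharacters, then restore the two robots.txt wildcards:
    # '*' matches any sequence of characters; a trailing '$' anchors the end.
    escaped = re.escape(pattern).replace(r"\*", ".*")
    if escaped.endswith(r"\$"):
        escaped = escaped[:-2] + "$"
    return re.compile(escaped)

rule = robots_pattern_to_regex("/*.jpg$")
print(bool(rule.match("/images/photo.jpg")))        # True: blocked
print(bool(rule.match("/images/photo.jpg?w=200")))  # False: not blocked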

3.) Not Using the Correct Syntax

The robots.txt standard is very strict about the syntax that’s used. 

The file will not work properly if even one character is out of place. 

For example, the following code will not work because there’s a space between the User and the Agent:

User- Agent: *

disallow: /

To fix this, you would need to remove the space so that it looks like this:

User-Agent: *

Disallow: /

4.) Noindex in Robots.txt

Adding a noindex directive to your robots.txt file is a common mistake people make. 

This directive tells search engines not to index a page, which means the page won’t appear in search results. 

The noindex directive should only be used in the meta tags of your website’s pages and not in the robots.txt file; Google stopped supporting noindex in robots.txt altogether in 2019.

5.) Blocking Important Pages

Another common mistake is accidentally blocking important pages on your website, such as your home or contact page. 

To avoid this, double-check the code in your robots.txt file to ensure you’re not inadvertently blocking any critical pages.

6.) Blocked Scripts and Stylesheets

It might seem logical to block crawlers from accessing your website’s CSS and JavaScript files. But search engine crawlers need those files to render your pages correctly.

If your pages don’t load properly because the CSS and JavaScript files are blocked, then your pages will likely not be indexed by search engines.

7.) No Sitemap URL

There’s more to SEO than just creating a robots.txt file. 

If you want your website to be crawled and indexed correctly, you’ll also need to submit a sitemap to search engines. 

A sitemap is a file that lists all the pages on your website, helping search engines find and index them.

To submit a sitemap, you’ll need to add the following line of code to your robots.txt file:

Sitemap: https://www.example.com/sitemap.xml

Replace the example URL with your own sitemap URL, and include it on its own line.

For example, the following code is also valid:

User-agent: *

Disallow: /cgi-bin/

Sitemap: https://www.example.com/sitemap.xml
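You can confirm the sitemap line is being picked up with urllib.robotparser, whose site_maps() method (Python 3.8+) returns any Sitemap URLs it finds; the domain below is a placeholder:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()
print(rp.site_maps())  # e.g. ['https://www.example.com/sitemap.xml'], or None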

8.) Not Updating Your Robots.txt File

Your website will inevitably change over time. You might add new pages, remove old ones, or change the structure of your site. 

Whenever you make changes to your website, you should also update your robots.txt file accordingly. 

If you don’t, search engine crawlers might index the wrong pages or miss important ones altogether.

9.) Access to Development Sites

The last thing you want to do is to block search engine crawlers from accessing your live site. At the same time, you don’t want them to access your pages while they’re still under development.

It’s best practice to disallow access to your development site entirely. You can do this by adding the following line of code to your robots.txt file:

User-agent: *

Disallow: /

That will block all search engine crawlers from accessing any pages on your site. 

Of course, you’ll need to remove this line of code once your site is ready to go live.
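Rather than remembering to edit the file by hand, many teams serve robots.txt dynamically based on the environment. Here’s one possible pattern, sketched with Flask and a hypothetical APP_ENV variable:

import os

from flask import Flask, Response

app = Flask(__name__)

@app.route("/robots.txt")
def robots_txt():
    # Block everything unless we're explicitly in production.
    if os.environ.get("APP_ENV") != "production":
        body = "User-agent: *\nDisallow: /\n"
    else:
        body = "User-agent: *\nDisallow: /cgi-bin/\n"
    return Response(body, mimetype="text/plain")

With this in place, staging and development deployments block all crawlers automatically, and only production serves your real rules.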

How to Recover from a Bad Robots.txt File

If you’ve made one of the above mistakes, don’t worry; it’s not the end of the world. You can quickly fix most of these errors by editing your robots.txt file and resubmitting it to search engines.

To edit your robots.txt file, open it in a text editor and make the necessary changes. Once you’ve saved your changes, you can upload the file to your server and resubmit it to search engines.

It’s also a good idea to check your website’s log files to see if any search engine crawlers have been blocked. If they have, edit your robots.txt file to restore access.
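If your server uses the common combined log format, a quick scan like this Python sketch can surface crawler requests that failed (the log path is a placeholder):

blocked_hits = []
with open("/var/log/nginx/access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        parts = line.split('"')
        # In combined log format, the status code follows the quoted request field.
        status = parts[2].split()[0] if len(parts) > 2 else ""
        if status.startswith(("4", "5")):
            blocked_hits.append(line.strip())

for hit in blocked_hits[:20]:
    print(hit)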

Conclusion

A robots.txt file is a critical part of any website’s SEO. 

Of course, there are pages or sections of your site that you might not want search engines to index. 

But if you accidentally block access to important pages or files, it will hurt your website’s SEO.

To avoid this, double-check the code in your robots.txt file before submitting it to search engines.

About the Author

Tom Koh

Tom is the CEO and Principal Consultant of MediaOne, a leading digital marketing agency. He has consulted for MNCs like Canon, Maybank, Capitaland, SingTel, ST Engineering, WWF, Cambridge University, as well as Government organisations like Enterprise Singapore, Ministry of Law, National Galleries, NTUC, e2i, SingHealth. His articles are published and referenced in CNA, Straits Times, MoneyFM, Financial Times, Yahoo! Finance, Hubspot, Zendesk, CIO Advisor.
