
How To Crawl A Website Without Getting Blocked By Robots.txt And Other Firewalls

Part of SEO 101 is optimizing your website for crawl budget. That means doing things to ensure that bots, especially Googlebot, can crawl your website and its pages easily.

It’s a topic many of you are already familiar with, but many of us still get it wrong.

If you’ve been blocked by firewalls, other bots, or even your own firewall from accessing your website, this post is for you.

One small change to your robots.txt file can significantly impact your website’s crawl budget and ability to be indexed by search engines.

What’s a Robots.txt File?

Robots.txt is a text file web admins create to control how robots crawl their websites and which pages they can access. 

For instance, if you have areas of your website that are not open to public access or are still under construction, then you can use the robots.txt file to block bots from accessing those pages.

You may also want to block robots from accessing your private photos, expired offers, duplicate content, or other pages that are not relevant to your website’s SEO strategy.

Some SEOs use it to solve issues with duplicate content, but in most cases, using a noindex meta tag is more effective. When a robots.txt file is set up incorrectly, it can also block Googlebot and other bots from accessing your website altogether.

When to Use a Robots.txt File

As we’ve already mentioned, SEOs use robots.txt files to stop search engines from crawling certain pages on their website.

If you have no problem with search engines crawling and indexing your entire site, then there’s no reason to use the robots.txt file.

However, if you have any issues preventing specific pages from being indexed, then a robots.txt file is a quick and easy way to solve the problem. 

Some issues that can be addressed with a robots.txt file include:

  • Duplicate content on your website or blog: If certain URL parameters are causing page duplication, then you can use a robots.txt file to block bots from crawling that part of your site.
  • Keep Parts of a Site Private: Think admin pages, password-protected pages, or expired offers; these pages don’t need to be crawled and indexed. A robots.txt file can block bots from accessing those pages.
  • Prevent Crawling of Specific Files: If you have images, scripts, and other files on your website that Google or any other search engine shouldn’t index, then a robots.txt file will keep them out of the index.
  • Blocking a URL: To block a specific page from being crawled, add the URL to your robots.txt file and save it.
  • Managing Crawl Traffic and Preventing Certain Media Files from Being Crawled: If you want to manage your crawl traffic, there’s no better tool than a robots.txt file. That will allow you to block specific media files from being crawled by search engines and prevent other bots from accessing those files as well. 
  • Paid Ads that Require Specific Instructions: Sometimes, you may have paid ads or other media files that need special instructions. You can use a robots.txt file to block these specific URLs and prevent bots from crawling them altogether.

If you have no area on your website that you wish to control and want to allow access to all areas of your site, then you don’t need a robots.txt file.

Google’s guidelines for robots.txt files are also clear: you shouldn’t use robots.txt to block search engines from indexing web pages. Instead, use a noindex meta tag.

The reason for this is two-fold. First, if other pages link to a page that you’ve blocked with robots.txt, search engines can still index its URL based on those links, even though they never crawl the page itself.

Second, because it controls crawler access to whole sections of a site, a robots.txt file can also lead to your entire website getting blocked from search engines if set incorrectly.

How to Get Started with Robots.txt Files

Before you start using a robots.txt file, you must ensure you don’t already have one in place. 

All you have to do is add “/robots.txt” to the end of your domain name. For example, if your domain name is abcdef.com, visit abcdef.com/robots.txt.

If you have one, you’ll see a plain text file with a list of rules and instructions for search engines. If you don’t have a robots.txt file, you’ll be directed to a 404 page not found error.
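The same check can be scripted. The snippet below is a minimal sketch, assuming the site is served over HTTPS; the function name robots_txt_status and its opener parameter are hypothetical helpers introduced here for illustration, not part of any SEO tool.

```python
import urllib.error
import urllib.request


def robots_txt_status(domain, opener=urllib.request.urlopen):
    """Return the HTTP status code for https://<domain>/robots.txt."""
    url = f"https://{domain}/robots.txt"
    try:
        with opener(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error responses such as 404 are raised as HTTPError; report the code.
        return err.code
```

Calling robots_txt_status("abcdef.com") should return 200 when a file is being served and 404 when there is none. The opener parameter only exists so the function can be exercised without a live network request.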

For example, adding /robots.txt to upwork.com displays that site’s robots.txt file.

Next, Check if Your Robots.txt File Is Blocking Important Files

To do this, you’ll need to use a tool like Screaming Frog or fetch as Googlebot in Google Search Console.

Head to your Google Search Console to see if robots.txt is blocking any important file or page.

After logging into Google Search Console, follow this link https://www.google.com/webmasters/tools/robots-testing-tool, and select the website you want to crawl. 

Google will reveal if there’s any file preventing it from reaching certain parts of your site. 

If you see any important files being blocked by your robots.txt file, then this is a good indication that you need to update your file and allow Google to access those areas.

It’s worth noting that you might not need a robots.txt file at all. 

If you don’t have any areas on your website that you want to control or change, then it’s best to leave your site unblocked and avoid using a robots.txt file. 

However, if you consider some specific pages or files private, sensitive, or irrelevant to your audience, then you can use a robots.txt file to block them from being indexed by search engines.

How to Set Up a Robots.txt File

Setting up a robots.txt file is fairly simple, but you must follow Google’s guidelines to avoid blocking your entire site.

Note that there’s no standard robots.txt file you can download and use as a template. The rules you use on your file will depend on your crawling needs. 

Also, with robots.txt, you can keep Google from crawling certain pages on your website without editing the pages themselves or adding a noindex tag to each one.

Every robots.txt file results in one of the following outcomes:

  • Full Allow: Robots.txt will allow all crawlers (including Google) full access to your website. 
  • Full Disallow: Robots.txt will prevent all crawlers from accessing your website. No content will be crawled and indexed by search engines.
  • Conditional Disallow: Here, your robots.txt file will provide instructions to search engines on which pages and folders you want to be crawled and which ones you don’t. For example, you might want search engines to crawl all your blog posts except for the category page for your “SEO tools” category. Or maybe you want crawlers to index the product pages on your ecommerce site but not their reviews.
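The three outcomes above can be sketched with Python’s standard-library robots.txt parser. This is an illustration under assumptions: the example.com URLs and the /category/seo-tools/ path are made up, and urllib.robotparser stands in for a crawler, so real search engine bots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value blocks nothing.
FULL_ALLOW = """\
User-agent: *
Disallow:
"""

# A single slash blocks every path on the site.
FULL_DISALLOW = """\
User-agent: *
Disallow: /
"""

# Only the listed folder is blocked; everything else stays crawlable.
CONDITIONAL = """\
User-agent: *
Disallow: /category/seo-tools/
"""


def is_allowed(robots_text, url, agent="Googlebot"):
    """Report whether the given rules let the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)
```

With the conditional rules, is_allowed(CONDITIONAL, "https://example.com/blog/my-post/") is true, while the same call for "https://example.com/category/seo-tools/" is false.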

The process of setting up a robots.txt file is fairly simple. 

First, create a plain text file named “robots.txt” and add it to your website’s root directory (i.e., the same level as your home page). The name is not optional: crawlers request the file at a fixed location, your domain followed by /robots.txt, and will not find it under any other name.

 The process involves two elements:

“user-agent”: The User-Agent refers to the particular crawler you want to be blocked, such as Googlebot.

“disallow”: The Disallow refers to the specific pages or folders you want to be blocked.

For example, if you want Googlebot to avoid indexing your “sales” page, you would add the following line to your robots.txt file:

User-agent: Googlebot

Disallow: /sales/

Note that you can also target specific file types or cover multiple URLs with one rule by using a wildcard (*), which matches any sequence of characters. For example, this rule blocks every URL whose path contains “sales/”:

User-agent: Googlebot

Disallow: /*sales/*

These two lines are seen as a single entry in the final robots.txt file. And yes, there’s no limit to the number of lines you can add.

How to Block URLs in Robots.txt

If you want to block an individual page, then the process is simple: just add the file path in your robots.txt file.

For example, if you wanted to block http://example.com/test-page/, you would simply write this in your text editor:

User-agent: *

Disallow: /test-page/

For the user-agent, you can specify the bot you want to be blocked, such as Googlebot or Bingbot. If you don’t want to single out a specific bot, use an asterisk (*) to apply the rule to all bots.

The second line, Disallow, tells the crawler which pages and folders to avoid.

To block an entire site, use a forward slash (/) to signify the root directory of your website. For example, if you want to block https://example.com/ from being crawled and indexed by search engines, then add these lines:

User-agent: *

Disallow: /

Adding /bad-directory/ to “disallow” blocks the entire directory and all of its content.

Say you want to block only https://example.com/subdirectory/bad-page/ and leave the rest of the subdirectory crawlable. In that case, list the full path to the page:

User-agent: *

Disallow: /subdirectory/bad-page/
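How much a rule blocks depends on how much of the path it spells out, because a Disallow path is matched as a prefix. The sketch below illustrates this with Python’s urllib.robotparser; the blocked helper and the good-page URL are hypothetical names used only for this example.

```python
from urllib.robotparser import RobotFileParser


def blocked(disallow_path, url):
    """Return True if a single Disallow rule for all bots blocks the URL."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("*", url)


bad_page = "https://example.com/subdirectory/bad-page/"
good_page = "https://example.com/subdirectory/good-page/"

# A directory rule blocks every page beneath it.
directory_rule_hits = (blocked("/subdirectory/", bad_page),
                       blocked("/subdirectory/", good_page))

# A full page path blocks that page only.
page_rule_hits = (blocked("/subdirectory/bad-page/", bad_page),
                  blocked("/subdirectory/bad-page/", good_page))
```

Here the directory rule blocks both pages, while the page-level rule blocks bad-page and leaves good-page crawlable.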

After making these changes, remember to upload your robots.txt file to your website’s root directory.

After deciding on your user agent and “disallow” selection, your entry may look something like this:

User-agent: *

Disallow: /secret.html

Disallow: /bad-directory/

How to Save Your Robots.txt

#1. Save your file as plain text.

To do this, open your text editor (such as Notepad or Notepad++).

Then paste your robots.txt rules into the editor.

Next, go to File > Save As and change the “file type” dropdown from .txt to All Files. Save the file as “robots.txt” so you can upload it to your website’s root directory.

Congrats! You’ve completed the first step in blocking URLs with robots.txt.

#2: Save the File to the Highest-level Directory

That means you should place the file in your website’s highest-level directory (i.e., not in a subdirectory). The file should be saved in the root directory, the topmost level of your website.

By default, most content management systems (such as WordPress) will save this file to their “root” directory. However, if you’re using a custom platform or web server, you’ll need to double-check that your robots.txt file is saved in the correct location.

Check if the text file is saved as “robots.txt” on your website’s root directory.

#3: Check if Your Code Follows the Right Structure

You also want to go through your file and see if it contains the right code structure.

Each entry in your robots.txt file should appear on a separate line in your text editor (notepad or notepad++).

Each entry must be written using the same format.

For example, this is the proper way to write an entry in robots.txt:

User-agent: *

Disallow: /secret.html

#4: Make Sure Each URL You Want to Disallow or Allow Is Written on Its Own Line

Each URL you want to disallow or allow needs to be written on its own line.

If your robots.txt file contains multiple URLs not separated by a newline (i.e., the “enter” key), it will cause errors.

Search engine crawlers will be confused and won’t know which URLs you want to allow or disallow.

For example, if your robots.txt file contains multiple URLs separated by commas (or any other character), then this will create errors:

User-agent: *

Disallow: /secret.html, /bad-page/

Similarly, if additional URLs are placed on new lines without their own Disallow directive, crawlers will not recognize those lines as rules.

To avoid this problem, give each URL its own line and its own Disallow directive in your text editor.

So, it becomes:

User-agent: *

Disallow: /secret.html

Disallow: /bad-page/
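A quick script can catch this kind of mistake before the file is uploaded. The following is a rough heuristic sketch, not a full validator: find_rule_problems is a hypothetical helper, and it assumes that a comma or space inside an Allow or Disallow value means several paths were squeezed onto one line.

```python
def find_rule_problems(robots_text):
    """Return (line_number, message) pairs for rules that list paths incorrectly."""
    problems = []
    for number, raw in enumerate(robots_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            # A bare path on its own line has no directive in front of it.
            problems.append((number, "no directive; expected 'Name: value'"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        if name.lower() in ("allow", "disallow") and ("," in value or " " in value):
            problems.append((number, "more than one path on a single line"))
    return problems
```

Running it on the comma-separated example flags line 2, while the corrected version with one Disallow line per URL returns an empty list.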

#5: Use a Lowercase File Name

File names are case-sensitive, so it’s important to save your robots.txt file in lowercase.

For example, if you save the file as “Robots.txt” instead of “robots.txt,” crawlers may not find it.

#6: Create Separate Files for Each Subdomain

If your website has multiple subdomains, then you will need to create separate robots.txt files for each subdomain (e.g., “www,” “blog,” “dev”).

Each subdomain should include its own list of URLs that are allowed and disallowed.

For example, www.example.com would include a list of allowed and disallowed URLs, while blog.example.com would have its own separate list.
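To see which robots.txt file governs a given page, keep the scheme and host of the page URL and replace the path with /robots.txt. Here is a small sketch of that idea; robots_url_for is a hypothetical helper and the page URLs are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the robots.txt URL that applies to the given page."""
    parts = urlsplit(page_url)
    # Same scheme and host, fixed path, no query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


www_rules = robots_url_for("https://www.example.com/shop/item?id=1")
blog_rules = robots_url_for("https://blog.example.com/2023/post/")
```

The two results differ (https://www.example.com/robots.txt and https://blog.example.com/robots.txt), which is why each subdomain needs its own file.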

#7: Add a Comment Tag to Your Robots.txt File

Start a new line with the pound sign (#) to leave a comment.

Remember that search engines ignore comments. However, it’s a good idea to leave a comment at the top of your robots.txt file to explain its rules in case you want to make changes in the future.

Here’s an example of a comment code:

# Files

User-Agent: *

Disallow: /

How to Test Whether Your Robots.txt File is Working

Test your results using Google Search Console’s robots.txt tester.

Open the tester and follow the instructions below to ensure that your robots.txt file is working properly:

1.) Type in the URL of the page you want to check.

2.) Google also allows you to select the user agent you want to simulate from the dropdown menu. That’s important because a robots.txt file can set different rules for different crawlers.

3.) Click “Test.”

The test button should read “Accepted” or “Blocked” to indicate whether or not the URL is blocked. If a URL you meant to block reads “Blocked,” your robots.txt file is working properly.

If you see an error message, your file has a problem. 

4.) Remember, you can always edit and test your robots.txt file again.

The changes you make in the GSC’s tester will not change the file on your website, as it only serves as a simulation. 

If you’d like to save the changes you’ve made, you’ll need to copy the edited code and save it to your website’s robots.txt file.
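You can also run a rough check locally before uploading anything. The sketch below uses Python’s urllib.robotparser as a stand-in for a crawler, which is an assumption: it matches plain path prefixes and does not interpret the * and $ wildcards, so its verdicts can differ from Google’s. The failed_expectations helper and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def failed_expectations(robots_text, expectations, agent="Googlebot"):
    """Return the URLs whose blocked/allowed state differs from what was expected.

    expectations maps each URL to True (should be blocked) or False (should be allowed).
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    failures = []
    for url, should_be_blocked in expectations.items():
        is_blocked = not parser.can_fetch(agent, url)
        if is_blocked != should_be_blocked:
            failures.append(url)
    return failures


rules = "User-agent: *\nDisallow: /test-page/\n"
failures = failed_expectations(rules, {
    "https://example.com/test-page/": True,   # meant to be blocked
    "https://example.com/": False,            # home page must stay crawlable
})
```

An empty failures list means every URL behaved as intended; any URL that appears in the list deserves a second look at the rules.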

Common Mistakes People Make with Their Robots.txt Files

There are many common mistakes people make when they’re creating their robots.txt files. 

Here are some of the most common mistakes:

1.) Not Saving Their Robots.txt File in the Root Folder

Search engine bots will only discover your robots.txt file if it’s saved in the root directory of your website. 

For example, if your website’s address is www.example.com, your robots.txt file should be saved as www.example.com/robots.txt and not www.example.com/Folder/robots.txt.

2.) Poor Use of Wildcards

Robots.txt supports two wildcards: the asterisk (*) and the dollar sign ($). 

The asterisk wildcard matches any sequence of characters, while the dollar sign indicates the end of a URL.

For example, if you want to block all files that end with .jpg, then you would use the following code:

User-Agent: *

Disallow: /*.jpg$

It’s sensible to restrict the use of wildcards to ensure you’re not accidentally blocking important pages on your website.

A poorly placed asterisk could also block your entire site from being crawled, so be careful.
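To get a feel for what a wildcard rule will and won’t match, it can help to translate the pattern into a regular expression. This is a simplified sketch of the two wildcards described above, not the matching code any search engine actually runs; pattern_to_regex and path_matches are hypothetical helpers.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """Return True if the pattern matches the path from its first character."""
    return pattern_to_regex(pattern).match(path) is not None
```

With this sketch, path_matches("/*.jpg$", "/images/photo.jpg") is true, but the same pattern does not match "/images/photo.jpg?size=large" because of the end anchor. A bare "/*" matches every path, which is how a stray asterisk can end up blocking a whole site.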

3.) Not Using the Correct Syntax

The robots.txt standard expects each directive to be written precisely.

If a directive name is misspelled or malformed, crawlers will typically ignore that line, and the rule you intended will not take effect.

For example, the following code will not work because there’s a space between the User and the Agent:

User- Agent: *

disallow: /

To fix this, you would need to remove the space so that it looks like this:

User-Agent: *

disallow: /
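A short script can flag directive names that crawlers are unlikely to recognize. The sketch below is illustrative only: the KNOWN_DIRECTIVES set is an assumption limited to the directives used in this post, and unrecognized_directives is a hypothetical helper. Names are compared in lowercase, so “User-Agent” and “disallow” both pass.

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap"}


def unrecognized_directives(robots_text):
    """Return (line_number, name) pairs for directive names outside the known set."""
    unknown = []
    for number, raw in enumerate(robots_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line or ":" not in line:
            continue
        name = line.split(":", 1)[0].strip().lower()
        if name not in KNOWN_DIRECTIVES:
            unknown.append((number, name))
    return unknown
```

The broken example above is reported as line 1 with the name “user- agent”, while the corrected version produces no findings.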

4.) Noindex in Robots.txt

Adding a noindex directive to your robots.txt file is a common mistake people make.

The directive is meant to tell search engines not to index a page so that it won’t appear in search results, but Google does not support it in robots.txt.

The noindex directive should only be used in the meta tags of your website’s pages and not in the robots.txt file.

5.) Blocking Important Pages

Another common mistake is accidentally blocking important pages on your website, such as your home or contact page. 

To avoid this, double-check the code in your robots.txt file to ensure you’re not inadvertently blocking any critical pages.

6.) Blocked Scripts and Stylesheets

It might seem logical to block crawlers from accessing your website’s CSS and JavaScript files. But let’s not forget that search engine crawlers also need those files to render your pages correctly.

If your pages can’t be rendered properly because the CSS and JavaScript files are blocked, search engines may misjudge their content, and the pages may not be indexed as you expect.

7.) No Sitemap URL

There’s more to SEO than just creating a robots.txt file. 

If you want your website to be crawled and indexed correctly, you’ll also need to submit a sitemap to search engines. 

A sitemap is a file that contains a list of all the pages on your website. That should also help search engines find and index your website’s pages.

To point crawlers to your sitemap, add the following line of code to your robots.txt file:

Sitemap: https://www.example.com/sitemap.xml

Replace “www.example.com” with your own domain name.

Include the URL of your sitemap on its own line.

For example, the following code is also valid:

User-agent: *

Disallow: /cgi-bin/

Sitemap: https://www.example.com/sitemap.xml
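To confirm that the Sitemap line is picked up alongside the crawl rules, you can parse the file with Python’s standard library. This is a minimal sketch, assuming Python 3.8 or later (where RobotFileParser.site_maps() is available) and reusing the example rules above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
```

Here sitemaps holds the single URL from the Sitemap line, and the Disallow rule for /cgi-bin/ still applies as usual.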

8.) Not Updating Your Robots.txt File

Your website will inevitably change over time. You might add new pages, remove old ones, or change the structure of your site. 

Whenever you make changes to your website, you should also update your robots.txt file accordingly. 

If you don’t, search engine crawlers might index the wrong pages or miss important ones altogether.

9.) Access to Development Sites

The last thing you want to do is to block search engine crawlers from accessing your live site. At the same time, you don’t want them to access your pages while they’re still under development.

It’s best practice to disallow access to your development site entirely. You can do this by adding the following lines of code to the development site’s robots.txt file:

User-agent: *

Disallow: /

That will block all search engine crawlers from accessing any pages on that site.

Of course, you’ll need to remove these lines once your site is ready to go live.

How to Recover from a Bad Robots.txt File

If you’ve made one of the above mistakes, don’t worry; it’s not the end of the world. You can quickly fix most of these errors by editing your robots.txt file and resubmitting it to search engines.

To edit your robots.txt file, open it in a text editor and make the necessary changes. Once you’ve saved your changes, you can upload the file to your server and resubmit it to search engines.

It’s also a good idea to check your website’s log files to see if any search engine crawlers have been blocked. If they have, edit your robots.txt file to allow access.

Conclusion

A robots.txt file is a critical part of any website’s SEO. 

Of course, there are pages or sections of your site that you might not want search engines to index. 

But if you accidentally block access to important pages or files, it will hurt your website’s SEO.

To avoid this, double-check the code in your robots.txt file before submitting it to search engines.
