Disallow Robots Using Robots.txt

I develop customer websites on a publicly accessible web server so that my customers may check the progress of their website at any given time. I could use .htaccess to require a username and password for each site, but then I'd constantly be reminding customers what their password is. My big concern is preventing search engines from finding their way to my development server. Luckily I can add a robots.txt file to my development server websites that tells search engines to keep out.
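
For reference, the password route would look something like this (a minimal sketch; the AuthUserFile path and realm name are placeholders):

AuthType Basic
AuthName "Development Preview"
AuthUserFile /home/dev/.htpasswd
Require valid-user

You'd create the credentials file once with htpasswd -c /home/dev/.htpasswd client. But that brings back the password-reminder problem described above, so on to robots.txt.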

The Robots.txt

User-agent: *
Disallow: /

The directive above tells search engine spiders not to crawl any pages or files on the website. Say, however, that you simply want to keep search engines out of the folder that contains your administrative control panel. You'd code:

User-agent: *
Disallow: /administration/

Or if you wanted to allow every spider except Google's Googlebot, you'd code:

User-agent: Googlebot
Disallow: /
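
Because "allow" is the default, every other spider can still crawl the site. If you want to make that explicit, you can give the remaining spiders an empty Disallow value, which means "allow everything":

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: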

What would you prevent the search engines from seeing?

Discussion

  1. Evan Riley

    Sweet! Now I can block all the back-up’d pronz :p

  2. Nice article David, but we must be extra careful not to block Google from legit content!

  3. There is one small problem with this. robots.txt is often a hotlist for hackers. If someone wants to hack your site, robots.txt shows them all the best spots.

  4. When I was getting my blog ready, Google somehow found my public testing server (which was hosted in my basement no less) and decided to index a couple of my test posts. I have used Perishable Press’ User Agent Blacklist since then and for the most part bots haven’t broken through.
    (Sorry if this posted twice, I submitted the first time but it seems that wp-post timed out.)

  5. Binny V A: Good point. I believe a nice .htaccess hack would take care of that.

  6. Mike

    As I understand it, the robots.txt file doesn’t actually prevent search engines from indexing you, it just tells them not to. If they want to index you they can just ignore the file.
    http://www.robotstxt.org/faq/blockjustbad.html
    Mike

  7. Binny is correct… but that shouldn’t stop you from using robots.txt!
    I block all my CMS folders and private stuff using robots.txt.

    Hey David, how can you protect the file using .htaccess? I believe that would block the search bots too, right?

  8. digital

    Maybe this would help:

      <Files robots.txt>
          Order allow,deny
          Deny from all
      </Files>
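
    That’s Apache 2.2 syntax; on Apache 2.4 (assuming mod_authz_core) the equivalent would be:

      <Files robots.txt>
          Require all denied
      </Files>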
    
  9. Dave

    Aren’t you relying on spider programs actually following the rules? What prevents someone from writing a spider that ignores your robots.txt file and indexes your entire site?

  10. It appears some are thinking this post is about how to secure your site …

  11. Mike V - RaleighNC

    @Binny VA, there’s no reason you can’t sprinkle in a few random directories to mislead your hacker audience. A little light security by obscurity.

    User-agent: *
    Disallow: /adm/
    Disallow: /administration/
    Disallow: /admin/
    Disallow: /adminportal/
    Disallow: /drupal/
    Disallow: /joomla/

  12. Thank you, I’ve been looking for this info for 3 hours now.

  13. Ryan’s correct, this post isn’t about securing your site.

  14. I had never considered doing a .htaccess hack to exclude items. What a marvelous idea!

    Thanks for that. I am going to work on implementing it ASAP.

  15. If you block the robots.txt file using .htaccess rules, neither hackers nor search bots would be able to read it. That makes it pointless – it would be easier to simply delete robots.txt in that case.

  16. Hi, I need to know how to block robots.txt using .htaccess; I mean, I want to block robots.txt from spiders, but via .htaccess.

  17. One should be careful not to rely on robots.txt as a means of security. It offers no security.
    The robots.txt file is only a directive to cooperating web crawlers/bots on what to index. The file(s) you’re trying to restrict the access of are still publicly accessible.

    If you don’t want a part of your website to be publicly accessible then password protect it.

  18. It was really helpful… thanks

  19. I want to disallow Google from my site. Please, can you write the code?

    Or is this one enough:

    User-agent: *
    Disallow: /

  20. someome

    Yes that will be sufficient to block all cooperative crawlers (Google included).

  21. This is really helpful for me. Is it at all harmful to my Alexa rank and PR?

  22. This was helpful. I want search engines not to crawl my WordPress archive pages, admin pages, and upload pages… how can I do that?
    Thanks.
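
    A sketch of what that robots.txt might look like, assuming the default wp-admin and wp-content/uploads paths (archive URLs depend on your permalink structure, so the date rule below is only an example):

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-content/uploads/
    # date-based archives; adjust to match your permalink structure
    Disallow: /2013/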

  23. Thanks for this. I’ve been looking for this kind of solution for my directory site. It will help me save bandwidth.

  24. Hello,

    Recently I moved to a new host and got a temporary URL for testing purposes. I did a search for my site name and Google is showing both the temp URL and the original URL. I want to remove the temp URL completely. This is my temporary URL: digitaladvices.com.cp-21.webhostbox.net. How can I block this URL using a robots.txt file, or any other possible way?
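
    Since crawlers request robots.txt separately for each hostname, one approach is to serve a disallow-all robots.txt on the temporary host only. A sketch, assuming Apache with mod_rewrite and a hypothetical robots-blocked.txt file containing the disallow-all rules from this article:

    RewriteEngine On
    RewriteCond %{HTTP_HOST} \.webhostbox\.net$ [NC]
    RewriteRule ^robots\.txt$ /robots-blocked.txt [L]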

  25. Hi David, thanks for sharing this… I don’t want a Disallow in my robots.txt… any idea?

  26. Got quick result. Thanks David!

  27. dave

    MJ12Bot ignores the robots.txt and is near impossible to stop. I have spent the last two weeks blocking hundreds of IPs associated with MJ12Bot, even adding MJ12Bot to .htaccess, yet it still is not blocked.

    They are using some other method of getting around blocks and are raping and pillaging everything they see. They are a bunch of dicks who simply want to trawl what content they can; for what purpose, who knows – maybe they work for government spies.

  28. Thank you! Just what I need to block robots from my client demo folder.

  29. Hi. Thanks for the article re “noindex” etc.

    We removed a pdf from our server, that had been indexed by Google. The “view as HTML” link still appears in a Google search result, and still shows an HTML rendering of the PDF – how can we get this listing removed from searches?

  30. Michael

    Use Google’s Webmaster Tools to remove already-indexed pages from Google’s index.

  31. Mukesh Saxena

    Hi

    Please tell me the meaning of Disallow: /collections/*+* ?

  32. How do I block Googlebot from a folder using .htaccess?

  33. I am still confused about why my blog URL is blocked by Google.

  34. This is a nice idea, but dangerous in practice, because people frequently copy the robots.txt file when deploying the site, a mistake that can cost thousands of dollars in lost business. I’ve seen this happen several times.

    Instead, a good way to protect a development server would be to set a simple password with .htpasswd (if you have Unix/Linux). Set the userid and the password both to ‘test’ so anybody can remember them. The search engines never access password-protected pages. If you deploy the .htpasswd file by accident to the live server, the mistake will be immediately obvious and you’ll fix it, unlike the robots.txt mistake, which will silently kill all your search traffic.

    I strongly recommend you revise this article to save the readers from the risk of a bad mistake.

    • Well… I’m confused now… :(

  35. Disallow: /*?ref=:memetinaminakoyum.co

    Allow: /*?ref=:ankaramasozleri.org

  36. I want to block the bad bots that are crawling my website. How do I identify the bad bots?

  37. I think that, in general, putting the admin tool in robots.txt is simply an invitation for people to try to hack into it. If it’s not linked anywhere, the search engines shouldn’t be able to find it. The more paranoid among us run it from a subdomain… Oh, and use STRONG passwords…

  38. Doesn’t anyone else think it’s pretty ridiculous that 7 years later we’re still trying to block spiders? How many FREAKING crawlers are there now? A GAGILLION!

    Thanks for the article btw. Blocking bots on 60+ sites right MEOW!

  39. Can anybody tell me how to disallow URLs in WordPress that start with “?”, for example ?cat=15, etc.? Thanks!
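
    For crawlers that support wildcards in robots.txt (Google and Bing do; wildcards are an extension to the original standard), a pattern like this should match query-string URLs such as /?cat=15:

    User-agent: *
    Disallow: /*?cat=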

  40. Hi David,

    The robots.txt file doesn’t prevent search engines from indexing a URL, it just prevents them from crawling it.

    If a URL is blocked for crawling by search engines via robots.txt, but they’ve found the URL (via a rogue internal or external link to the development URL or even the live site XML sitemap – yep, it happens), the URL will be indexed.

    The contents of the URL will not however be displayed in the search results as the search engine is unable to crawl the URL to gather this information.

    The best way to prevent a page from appearing in the index, besides securing access to the dev area as mentioned, is to add a ‘meta robots’ tag with the value ‘noindex’ in the head (see the snippet below).

    It is important, however, not to combine the robots.txt crawl block with the noindex tag, as search engines that are blocked from crawling will never read the noindex tag to honour it ;-)

    Hope this helps,

    Andrew
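
    The tag Andrew describes would look like this, placed in the page’s <head>:

    <meta name="robots" content="noindex">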
