
Welcome to the David Walsh Blog. I'm a MooTools, Dojo, jQuery, CSS, and PHP Web Developer located in Madison, Wisconsin, United States. Please contact me if I can make your experience on my website better.

Disallow Robots Using Robots.txt


I develop customer websites on a publicly accessible web server so that my customers can check the progress of their website at any time. I could use .htaccess to require a username and password for each site, but then I'd constantly be reminding customers what their password is. My big concern is keeping search engines from finding their way onto my development server. Luckily, I can add a robots.txt file to each development site that tells search engines not to index it.

The Robots.txt

User-agent: *
Disallow: /

The directive above tells search engines not to index any page or file on the website. Say, however, that you only want to keep search engines out of the folder that contains your administrative control panel. You'd code:

User-agent: *
Disallow: /administration/
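
To see how a rule-abiding crawler would read the two rule sets above, here is a minimal Python sketch built on the standard library's urllib.robotparser. The helper name is_allowed, the ExampleBot user agent, and the sample paths are made up for illustration, and real search engines use their own parsers, which may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from the article, as in-memory strings.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

BLOCK_ADMIN_ONLY = """\
User-agent: *
Disallow: /administration/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a rule-abiding crawler may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise both rule sets.
    for path in ("/", "/index.html", "/administration/login.php"):
        print(
            path,
            is_allowed(BLOCK_EVERYTHING, "ExampleBot", path),
            is_allowed(BLOCK_ADMIN_ONLY, "ExampleBot", path),
        )
```

Rules are matched as path prefixes, so Disallow: / covers the whole site while Disallow: /administration/ covers only URLs that begin with that folder.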

Or, if you wanted to allow in every spider except Google's Googlebot, you'd code:

User-agent: Googlebot
Disallow: /
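
The User-agent value has to match the name the crawler announces, so it is worth testing a crawler-specific rule before relying on it. The sketch below, again using Python's urllib.robotparser, lists which crawlers a rule set turns away. The blocked_agents helper, the CRAWLERS list, and the deliberately misspelled token are assumptions for illustration. This parser compares names case-insensitively, and other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt: str, agents: list[str], path: str = "/") -> list[str]:
    """Return the agents from `agents` that may NOT fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Hypothetical set of crawler names to test against.
CRAWLERS = ["Googlebot", "Bingbot", "ExampleBot"]

# Correctly spelled token: only Googlebot is turned away.
GOOGLEBOT_ONLY = "User-agent: Googlebot\nDisallow: /\n"

# Misspelled token (illustrative): it matches no crawler in the list.
MISSPELLED = "User-agent: Googelbot\nDisallow: /\n"

if __name__ == "__main__":
    print(blocked_agents(GOOGLEBOT_ONLY, CRAWLERS))
    print(blocked_agents(MISSPELLED, CRAWLERS))
```

A misspelled name fails silently: the file is still valid, it just applies to a crawler that does not exist.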

What would you prevent the search engines from seeing?

Discussion

  1. evan riley
    July 3, 2009 @ 4:54 pm

    Sweet! Now I can block all the back-up’d pronz :p

  2. July 3, 2009 @ 9:52 pm

    Nice article David, but we must be extra careful not to block Google from legit content!

  3. July 3, 2009 @ 11:18 pm

    There is one small problem with this: robots.txt is often a hotlist for hackers. If someone wants to hack your site, robots.txt shows them all the best spots.

  4. July 3, 2009 @ 11:37 pm

    When I was getting my blog ready, Google somehow found my public testing server (which was hosted in my basement no less) and decided to index a couple of my test posts. I have used Perishable Press’ User Agent Blacklist since then and for the most part bots haven’t broken through.
    (Sorry if this posted twice, I submitted the first time but it seems that wp-post timed out.)

  5. July 4, 2009 @ 8:12 am

    Binny V A: Good point. I believe a nice .htaccess hack would take care of that.

  6. mike
    July 4, 2009 @ 9:23 am

    As I understand it, the robots.txt file doesn’t actually prevent search engines from indexing you, it just tells them not to. If they want to index you they can just ignore the file.
    http://www.robotstxt.org/faq/blockjustbad.html
    Mike
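
Mike's point can be sketched in a few lines of Python: the robots.txt check happens inside the crawler, so it only works if the crawler chooses to run it. The function names, the ExampleBot user agent, and the sample paths below are hypothetical, and no network requests are made.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# Hypothetical paths a crawler has discovered on the site.
SITE_PATHS = ["/", "/about.html", "/administration/"]


def polite_crawl(paths, robots_txt, user_agent="ExampleBot"):
    """Keep only the paths that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if parser.can_fetch(user_agent, p)]


def impolite_crawl(paths, robots_txt, user_agent="ExampleBot"):
    """Ignore robots.txt entirely; the file itself cannot stop this."""
    return list(paths)


if __name__ == "__main__":
    print(polite_crawl(SITE_PATHS, ROBOTS_TXT))
    print(impolite_crawl(SITE_PATHS, ROBOTS_TXT))
```

Anything that must stay private still needs server-side protection, such as the .htaccess password mentioned in the article.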

  7. July 4, 2009 @ 7:00 pm

    Binny is correct… but that shouldn’t stop you from using robots.txt!
    I block all my CMS folders and private stuff using robots.txt.

    Hey David, how can you protect the file using .htaccess? I believe this will block the search bots too, right?

  8. digital
    July 5, 2009 @ 6:57 am

    Maybe this would help:

    <Files robots.txt>
    order allow,deny
    deny from all
    </Files>

  9. dave
    July 5, 2009 @ 4:27 pm

    Aren’t you relying on spider programs actually following the rules? What prevents someone from writing a spider which ignores your robots.txt file and actually indexing your entire site?

  10. July 5, 2009 @ 8:56 pm

    It appears some are thinking this post is about how to secure your site …

  11. mike v - raleighnc
    July 8, 2009 @ 7:45 am

    @Binny VA, there’s no reason you can’t sprinkle in a few random directories to mislead your hacker audience. A little light security by obscurity.

    User-agent: *
    Disallow: /adm/
    Disallow: /administration/
    Disallow: /admin/
    Disallow: /adminportal/
    Disallow: /drupal/
    Disallow: /joomla/

  12. cd
    July 24, 2009 @ 12:17 pm

    Robots don’t index filesystems… they index links. There is no point in blocking anything that isn’t in an HREF or SRC attribute on your site.

    Do you people link to your admin/control panel from the front-end of the site? Didn’t think so. Not to mention, you just add another robots file in the admin/control panel directory, you don’t disallow it in the public one.
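
As a rough illustration of cd's point, the sketch below collects what a link-following crawler would find on a single page: the values of href and src attributes. It uses Python's standard html.parser. The LinkCollector class, the collect_links helper, and the sample markup are invented for this example, and URLs a crawler learns about from other sources are outside its scope.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href and src attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def collect_links(html: str) -> list[str]:
    """Return the href/src values in `html`, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical front page that never links to the admin folder.
SAMPLE_PAGE = """
<a href="/about.html">About</a>
<img src="/images/logo.png" alt="Logo">
<script src="/js/site.js"></script>
"""

if __name__ == "__main__":
    print(collect_links(SAMPLE_PAGE))
```

Because /administration/ never appears in the page's markup, a crawler that only follows this page's links has no way to reach it.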

  13. September 15, 2009 @ 7:56 pm

    Thank you, I’ve been looking for this info for 3 hours now.

  14. February 21, 2010 @ 6:00 am

    Ryan’s correct, this post isn’t about securing your site.

  15. June 27, 2010 @ 8:50 am

    I had never considered doing a .htaccess hack to exclude items. What a marvelous idea!

    Thanks for that. I am going to work on implementing it ASAP.
