Search Engine Bots Crawling Problem, Website Not Accessible

SofieHofmann.com was down for more than six (6) hours today. I was not sure what was happening, since the other sites hosted on the same server were, and still are, fine, online, and viewable, unlike SofieHofmann.com. So I contacted the Dreamhost support team. I guess people were still sleeping in the United States when I wrote in, so the reply came a little late. Nevertheless, thanks to the Dreamhost support team, I was able to fix the problem.

What happened? Well, it so happened that both Yahoo and Google were crawling SofieHofmann.com at the same time, as Patrick from the support team pointed out; I had never realized it. He was right, too, because I checked the raw.log file myself. He asked me to look under “Goodies” and try to block or limit some of the search engine bots accessing the site.

Well, I already had a robots.txt file, but I had not updated it in quite some time. When I checked it, the list of folders inside was outdated. It had never occurred to me that the file was stale; I forgot to update it when I revised SofieHofmann.com.

So I uploaded the updated robots.txt file and used the Goodies as well. As soon as I had uploaded the robots.txt file and filled out the settings under “Goodies”, where “Block Spiders” is located, SofieHofmann.com was fine again.

What did I do? I disallowed the search engine bots from accessing the images folder and other folders by listing the folder names in the robots.txt file. Then, in the Dreamhost Control Panel, under the “Goodies > Block Spiders” section, I did not check which search engines to block; instead, I specified which directories to block from every spider. I also specified which file extensions to block from every spider. Yes, from every spider.

I do not really mind if the images at the site are not crawled or indexed by the search engine bots. It is enough for me that they crawl the pages, the articles, and the blog entries.

How do you disallow search engine bots from accessing some of the folders?

I just created, or rather updated, the text file called robots.txt and wrote the following:

User-agent: *
Disallow: /faq/
Disallow: /cgi-bin/
Disallow: /images/

In your case, if you would like to block more folders, just add more Disallow lines. I specified some more folders myself; I just did not write everything in the example above.
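You can also block by file type instead of by folder, provided the bot understands wildcard patterns; Google and Yahoo do, though wildcards are not part of the original robots exclusion standard, so other bots may ignore such lines. A minimal sketch, assuming the site’s pictures are .jpg, .gif, and .png files:

# Block common image types sitewide (wildcard lines; honored by Google and Yahoo)
User-agent: *
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.png$

The asterisk in the pattern matches any string of characters, and the dollar sign anchors the pattern to the end of the URL.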

Then I uploaded the file to the root folder. The asterisk (*) in the User-agent line means any spider, regardless of whether it is Yahoo, MSN, Google, or anything else. I wanted to keep all the search engine bots out of the images folder and the other listed folders, no exceptions.
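Conversely, a record can target a single bot by naming it in the User-agent line. For example, Googlebot-Image is the user-agent Google documents for its image crawler; a record like the sketch below keeps just that bot out of the entire site and leaves the others alone:

# Applies only to Google's image crawler; bots with no matching record are unrestricted
User-agent: Googlebot-Image
Disallow: /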

When two big search engines crawl a site at the same time, they can use up all the site’s connections and drive up memory usage. For me, keeping the search engine bots away from the images folders, where I really do have a lot of pictures, is not a problem at all.
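If you would rather slow the bots down than shut them out, there is also the Crawl-delay directive, which asks a bot to wait a given number of seconds between requests. As far as I know, Yahoo’s Slurp and MSN’s bot honor it, while Googlebot ignores it (Google’s crawl rate is adjusted through its own webmaster tools instead). A small sketch, with the ten-second value just an illustration:

# Ask Yahoo's crawler to pause 10 seconds between fetches
User-agent: Slurp
Crawl-delay: 10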

The problem was so simple, and it was such a discomfort not to be able to access the site. I did not even realize the solution was just as simple. If you have the same problem, well, check your robots.txt file.