Site and robots

Your Salesforce Site is for Customers, not for bots!

You have built a good-looking Salesforce site to run your business, and you want it to be cost-effective.
The platform imposes limits on CPU time, bandwidth and page views per day. If you reach those limits, you will probably have to pay to raise them (for instance, by moving from Enterprise Edition to Unlimited Edition).
Resource consumption should be useful, which means spent on the visitors you actually expect. Did you know that most visits do not come from real humans?
The web is not a beautiful place with only friendly people. The underground web is based on machines. These "bots" (robots) download pages for good or bad reasons, and each page they fetch from your site consumes part of your available resources. The issue is the ratio between humans and bots: if a website has no protection in place, it will get more traffic from bots than from humans.
How do you optimize resources? The first step is to prevent crawling by bad bots. Of course, you will have to decide which bots are "good" and which are "bad". For instance, Google, Bing, Baidu and a few others crawl the web so that you appear in search results. Don't block them, as they bring you real visitors. On the other hand, some bots crawl the web to collect content that will be sold: they consume your resources and you get no money from them, so stop them. Other bots harvest email addresses from your pages or probe for security issues (such as a form that is not protected by a captcha). You absolutely need to block them.
The quick win is that Salesforce gives you a simple, standard way to tell bots they are not welcome: a standard file called "robots.txt" (the file is shared by all your Salesforce sites). You just define a list of bots and the access rights associated with them. Keep in mind that the worst bots do not read this file, so it will not block them.
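To make the "list and associated rights" idea concrete, here is a minimal sketch in Python that uses the standard library's robots.txt parser to show how a well-behaved crawler would read such a file. The three bot names are taken from the list in this article; the helper name is_allowed and the shortened file content are assumptions made for illustration only. Several User-agent lines followed by a single Disallow line form one group, so the rule applies to every bot named in it, while bots that are not listed stay allowed.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative robots.txt: three names from the article's list
# sharing a single "Disallow: /" rule (no blank lines inside the group).
ROBOTS_TXT = """\
User-agent: AhrefsBot
User-agent: MJ12bot
User-agent: SemrushBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if a bot that honors robots.txt may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("AhrefsBot", "MJ12bot", "Googlebot"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
        print(agent, status)
```

This only models bots that choose to honor the file. A bot that ignores robots.txt is not affected by any of these rules.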
By default, Salesforce blocks all bots on non-production orgs (Developer Edition, etc.). You absolutely need to define a robots.txt for your production org. The syntax is quite simple, but the content is not easy to define: how can you know which robots to put in the file? The following content is a Visualforce page that you add to your org; then point to this Visualforce page in your Salesforce site configuration, and voilà! Taking 5 minutes to do this can save you a lot of money.
<apex:page contentType="text/plain" showHeader="false">
User-agent: 008
user-agent: AhrefsBot
User-agent: aipbot
User-agent: Alexibot
User-agent: AlvinetSpider
User-agent: Amfibibot
User-agent: Antenne Hatena
User-agent: antibot
User-agent: ApocalXExplorerBot
User-agent: asterias
User-agent: BackDoorBot/1.0
User-agent: BecomeBot
User-agent: Biglotron
User-agent: BizInformation
User-agent: Black Hole
User-agent: BLEXBot
User-agent: BlowFish/1.0
User-agent: BotALot
User-agent: BruinBot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: CatchBot
User-agent: ccubee
User-agent: ccubee/3.5
User-agent: Cegbfeieh
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: Combine
User-agent: ConveraCrawler
User-agent: ConveraMultiMediaCrawler
User-agent: CoolBot
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DimensioNet
User-agent: discobot
User-agent: DISCo Pump 3.1
User-agent: DittoSpyder
User-agent: dotbot
User-agent: Drecombot
User-agent: DTAAgent
User-agent: e-SocietyRobot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: envolk
User-agent: EroCrawler
User-agent: EverbeeCrawler
User-agent: ExtractorPro
User-agent: Flamingo_SearchEngine
User-agent: Foobot
User-Agent: FDSE
User-agent: g2Crawler
User-agent: genieBot
User-agent: gsa-crawler
User-agent: Harvest/1.5
User-agent: hloader
User-agent: HooWWWer
User-agent: httplib
User-agent: HTTrack
User-agent: HTTrack 3.0
User-agent: humanlinks
User-agent: Igentia
User-agent: InfoNaviRobot
User-agent: Ipselonbot
User-agent: IRLbot
User-agent: JennyBot
User-agent: JikeSpider
User-agent: Jyxobot
User-agent: KavamRingCrawler
User-agent: Kenjin Spider
User-Agent: larbin
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkScan/8.1a Unix
User-agent: linksmanager
User-agent: LinkWalker
User-Agent: lmspider
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: minibot(NaverRobot)/1.0
User-agent: Mister PiX
User-Agent: MJ12bot
User-agent: MLBot
User-agent: moget
User-agent: moget/2.1
User-agent: MS Search 4.0 Robot
User-agent: MS Search 5.0 Robot
User-Agent: MSIECrawler
User-Agent: MyFamilyBot
User-agent: Naverbot
User-agent: NetAnts
User-agent: NetAttache
User-agent: NetMechanic
User-Agent: NetResearchServer
User-agent: NextGenSearchBot
User-agent: NICErsPRO
User-agent: noxtrumbot
User-agent: NPBot
User-agent: Nutch
User-agent: NutchCVS
User-agent: Offline Explorer
User-Agent: OmniExplorer_Bot
User-agent: Openfind
User-agent: OpenindexSpider
User-Agent: OpenIntelligenceData
User-agent: PhpDig
User-agent: pompos
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: QuepasaCreep
User-agent: QueryN Metasearch
User-agent: Radian6
User-agent: R6_FeedFetcher
User-agent: R6_CommentReader
User-agent: RepoMonkey
User-agent: RMA
User-agent: RufusBot
User-Agent: SBIder
User-Agent: schibstedsokbot
User-Agent: ScSpider
User-agent: SearchmetricsBot
User-Agent: semanticdiscovery
User-agent: SemrushBot
User-agent: Shim-Crawler
User-Agent: ShopWiki
User-agent: SightupBot
User-Agent: silk
user-agent: sistrix
user-agent: sitebot
User-agent: SiteSnagger
User-agent: SiteSucker
User-agent: Slurp
User-agent: Sogou web spider
User-agent: sosospider
User-agent: SpankBot
User-agent: spanner
User-agent: Speedy
User-agent: Sproose
User-agent: Steeler
User-agent: suggybot
User-agent: SuperBot
User-agent: SuperBot/2.6
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Tarantula
User-agent: Teleport
User-agent: Telesoft
User-agent: The Intraformant
User-agent: TheNomad
User-agent: Theophrastus
User-agent: TightTwatBot
User-agent: Titan
User-agent: toCrawl/UrlDispatcher
User-agent: TosCrawler
User-agent: TridentSpider
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: turnitinbot
User-agent: twiceler
User-agent: Ultraseek
User-agent: UrlPouls
User-agent: URLy Warning
User-agent: Vagabondo
User-agent: VCI
User-agent: Verticrawlbot
User-agent: voyager
User-agent: voyager/1.0
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: webcopy
User-agent: WebEnhancer
User-agent: WebIndexer
User-agent: WebmasterWorldForumBot
User-agent: webmirror
User-agent: WebReaper
User-agent: WebSauger
User-agent: website extractor
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebStripper/2.02
User-agent: WebZip
User-agent: Wget
User-agent: WikioFeedBot
User-agent: WinHTTrack
User-agent: WWW-Collector-E
User-agent: Xenu Link Sleuth/1.3.8
User-agent: xirq
User-agent: yacy
User-agent: YRSPider
User-agent: ZeBot
User-agent: Zeus
User-agent: Zookabot
Disallow: /
</apex:page>
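
If you prefer to maintain the bot names as a plain list and regenerate the page from it, the following Python sketch shows one possible approach. The function name build_robots_page is hypothetical and not part of any Salesforce tooling; it simply assembles the same markup as the page above (opening tag, one User-agent line per bot, a single Disallow rule, closing tag) and skips blank or duplicate names.

```python
def build_robots_page(bot_names):
    """Build Visualforce markup for a robots.txt page that blocks bot_names."""
    seen = set()
    lines = ['<apex:page contentType="text/plain" showHeader="false">']
    for name in bot_names:
        cleaned = name.strip()
        key = cleaned.lower()
        if not key or key in seen:
            continue  # skip blank entries and case-insensitive duplicates
        seen.add(key)
        lines.append(f"User-agent: {cleaned}")
    # One rule for the whole group: nothing on the site may be crawled.
    lines.append("Disallow: /")
    lines.append("</apex:page>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_robots_page(["AhrefsBot", "MJ12bot", "SemrushBot"]))
```

The output can be pasted into a new Visualforce page. No blank lines are emitted between the User-agent lines, so the whole list stays in one group with the Disallow rule.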