163 加中网 – Canadian Manitoba Chinese Portal | Winnipeg Chinese Forum

Views: 674 | Replies: 8

[Chat] [Help] How can I programmatically stop "spiders" from crawling the forum?



1#
Posted 2003-1-2 10:20:24
Every day spiders come to the forum to harvest email addresses or whatever else; even the shortest visits run all night and day, consuming a lot of resources and sometimes leaving the forum unable to respond to other requests for minutes at a time. This is a major reason the forum has become slow.
Could anyone explain, in theory, how to program against these automated harvesters that scan for email addresses and the like?


2#
Posted 2003-1-2 19:44:56


[This post was last edited by undertaker at 2003/01/02 09:56am]

Robots.txt Tutorial
Search engines will look in your root domain for a special file named "robots.txt" (http://www.mydomain.com/robots.txt). The file tells the robot (spider) which files it may spider (download). This system is called the Robots Exclusion Standard.
The format of the robots.txt file is simple. It consists of records, and each record consists of two fields: a User-agent line and one or more Disallow: lines. The format is:
<Field> ":" <value>

The robots.txt file should be created with Unix line endings! Most good text editors have a Unix mode, or your FTP client should do the conversion for you. Do not attempt to create a robots.txt file with an HTML editor unless it specifically has a plain-text mode.
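One way to guarantee Unix line endings regardless of the platform you edit on is to write the file from a small script. A minimal Python sketch (the rules shown are this tutorial's own examples):

import io

# newline="\n" tells Python not to translate "\n" into the platform default,
# so the file gets Unix (LF) line endings even on Windows.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /images/",
    "",  # trailing newline
]
with open("robots.txt", "w", newline="\n") as f:
    f.write("\n".join(rules))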
User-agent
The User-agent line specifies the robot. For example:
User-agent: googlebot
You may also use the wildcard character "*" to specify all robots:
User-agent: *
You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
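If you want to pull those names out automatically, a minimal Python sketch like the following will do it, assuming an Apache-style combined log in a file named access.log (both the file name and the format are assumptions, not details from this thread):

import re

# Combined log format: IP ... [date] "GET /path HTTP/1.x" status bytes "referer" "agent"
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

agents = set()
with open("access.log") as log:  # hypothetical log file
    for line in log:
        m = LOG_LINE.match(line)
        if m and m.group(2).startswith("/robots.txt"):
            agents.add((m.group(1), m.group(3)))  # (IP, user agent)

for ip, agent in sorted(agents):
    print(ip, agent)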
Disallow:
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders that they may not download email.htm:
Disallow: email.htm
You may also specify directories:
Disallow: /cgi-bin/
Which would block spiders from your cgi-bin directory.
There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow both /bob.html and /bob/index.html (the file bob and all files in the bob directory will not be indexed).
If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent directive for the record to be correct. A completely empty robots.txt file is treated the same as if it were not present.
White Space & Comments
Any line in the robots.txt that begins with # is considered to be a comment only. The standard allows for comments at the end of directive lines, but this is really bad style:
Disallow: bob #comment
Some spiders will not interpret the above line correctly and will instead attempt to disallow "bob#comment". The moral is to place comments on lines by themselves.
White space at the beginning of a line is allowed, but not recommended.
    Disallow: bob #comment
Examples
The following allows all robots to visit all files because the wildcard "*" specifies all robots.
User-agent: *
Disallow:

This one keeps all robots out.
User-agent: *
Disallow: /

The next one bars all robots from the cgi-bin and images directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
This one bans Roverdog from all files on the server:
User-agent: Roverdog
Disallow: /
This one keeps googlebot from getting at the cheese.htm file:
User-agent: googlebot
Disallow: cheese.htm
For more complex examples, try retrieving the robots.txt files from big sites like CNN or LookSmart.
Extensions to the Standard
Although extensions to the standard have been proposed, such as an Allow line or robot version control, none has been formally endorsed by the Robots Exclusion Standard working group.
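As a sanity check on a finished robots.txt, Python's standard library includes a parser for this exact standard. This sketch, using the tutorial's placeholder URLs, shows how a compliant spider decides whether it may fetch a page:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.mydomain.com/robots.txt")
rp.read()  # fetch and parse the file

# A well-behaved spider asks before every download:
print(rp.can_fetch("googlebot", "http://www.mydomain.com/cheese.htm"))
print(rp.can_fetch("*", "http://www.mydomain.com/index.html"))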


3#
(OP) Posted 2003-1-3 23:30:14


Thanks, undertaker. I'll dig into this.


4#
Posted 2003-1-3 23:59:51


http://www.searchengineworld.com/misc/features.htm
Have a look here; it might help...


5#
Posted 2003-1-4 00:06:10


Spider Identification   
Q: Why do I get visited by AOL spiders every day?
A: What you may be referring to is AOL's proxy cache spider, which retrieves pages from the internet for AOL users. When an AOL user clicks a link:
- the page is requested from the AOL proxy cache server;
- if the page isn't in the cache, AOL sends out a spider to download it;
- the page is put into the cache;
- and finally the page is returned to the user.
In your website logs you will often see things like "spider-xyz.aol.com", and that's what they are doing. They will also change agent names to reflect the requesting agent. There are times when you can get hit rapid-fire by 10-15 agent names when they reload the cache; it seems to get stuck in some mode from time to time. AOL webmaster FAQ (IE only): http://webmaster.aol.com/
   
Q: How do I see who owns an IP address?
A: Start by doing a trace route. Windows: go Start - Run - tracert xyz.xyz.xyz.xyz. On Linux the command is typically traceroute xyz.xyz.xyz.xyz. If the host doesn't resolve to a host name, at least you will see the route taken; often looking at the very last hop is enough to see who owns that box.
   
Q: How do I see who owns a hostname?
A: By doing a whois to look up the domain information: http://www.networksolutions.com/cgi-bin/whois/whois
   
Q: How can I tell who owns a specific IP?
A: Enter the IP address into an ARIN whois: http://www.arin.net/whois/index.html For Europe, use RIPE: http://www.ripe.net/db/whois.html
   
Q: How can I "spot" spiders?
A: Look at the users that request your robots.txt file. If you don't have a robots.txt file, check your error logs for failed requests for it. Then look for spider or crawler patterns. Spiders will often (a log-scanning sketch follows below):
- not use a referring URL;
- grab many pages a minute;
- not respect robots.txt (rogue spiders);
- have unique header information (compare a browser's request headers to a spider's some time);
- not grab graphics, js files, or external css files;
- request many pages at the same time;
- request pages at a timed interval, such as one request per minute.
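Here is a rough sketch of two of those heuristics: flag IPs that made many requests and never sent a referring URL. The access.log file name, the Apache combined log format, and the threshold of 60 requests are all assumptions, not details from this thread:

import re
from collections import defaultdict

LOG_LINE = re.compile(r'^(\S+) .*?"(?:GET|HEAD) \S+[^"]*" \d+ \S+ "([^"]*)"')

hits = defaultdict(int)         # total requests per IP
no_referrer = defaultdict(int)  # referrer-less requests per IP

with open("access.log") as log:  # hypothetical log file
    for line in log:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, referrer = m.groups()
        hits[ip] += 1
        if referrer in ("", "-"):
            no_referrer[ip] += 1

for ip, count in sorted(hits.items(), key=lambda kv: -kv[1]):
    if count > 60 and no_referrer[ip] == count:  # assumed threshold
        print("possible spider:", ip, count, "requests, no referrer")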
   
Q: Who is Exodus.net?
A: They are a very large net access provider and host. They supply services to large sites like Yahoo, Inktomi, and Google. You will find many search engine spiders that run through Exodus.net.  
   
Q: How do I track down a spider owner?
A: The recipe (a reverse-DNS helper sketch follows below):
1. Find the IP address.
2. Do a trace route back to the host.
3. Look a few hops back from the end of the trace (the last places the trace went through) and check the last host you can find. This is often the tricky step: deciding which was the last real host. Start at the bottom and work up; usually you'll see whether a host has 2-3 boxes and can guess the real host name.
4. Take the host name and try it in a browser with the standard incantations of www.host.com or .net. Often that is all you need.
5. Look up the host on an InterNIC whois. Often that leads you straight to the owner/domain.
6. If that doesn't work, try looking up the IP owner via an ARIN whois: http://www.arin.net/whois/index.html
You can identify about 50% of them with this system. Most often you'll run into 'joe user' running a spider, and those are hard to pin down. If the spider was abusive, keep your logs and contact the admin of the host. Most of the better ISPs will take a moment to look into it; it may be someone who is routinely abusive, and they need more information to identify them.
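Steps 1-3 can often be shortened with a reverse-DNS lookup before tracing the route. A minimal sketch using Python's standard socket module:

import socket

def host_for_ip(ip):
    # Reverse-resolve an IP address; returns None when no PTR record exists.
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)
        return host
    except socket.herror:
        return None

print(host_for_ip("209.185.141.185"))  # an IP quoted elsewhere in this FAQ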
   
Q: Is Mercator from Altavista?
A: It was formerly run from Compaq/dec.com. Word has it that the new Scooter is actually the old Mercator; the old Mercator from dec.com appears to have been retired. Yes, it is AV.
   
Q: What is Teleport Pro/1.29 or SiteReaper?
A: Just spider programs people can run under Windows. It's joe-user running a local spider on your site.
   
Q: I have detected what I think is Inktomi running a spider with a stock agent name.
A: j6000.inktomi.com - 209.185.141.185 -- Mozilla/4.72 [en] (X11; U; NetBSD 1.4.2 i386; Nav) -- index.html. Yes, it really is them. You must be careful to cloak on the IP address and not the agent name.
   
Q: Why are spider IPs in such demand?
A: A cloaking script that relies on IPs is far more effective than one keyed on the user-agent string.
   
Q: How do I make a spider come back?
A: It is a waiting game. You can also try resubmitting your pages.  
   
Q: I've been hit hard by a spider I never submitted to. How do they find me?
A: Several ways:
- they share whois info with registrars;
- they follow referral strings and you "leaked" one;
- someone else submitted your URL;
- they followed some directory listing;
- they followed a link from another web site.
   
Q: Why do I get hits from Inktomi in Japan or Korea?
A: They provide search services for Goo in Japan and for Korean search engines.
   
Q: Is this an Inktomi spider: inktomi1-car.server.ntl.com?
A: No. It is an ISP user. Inktomi also produces proxy cache servers for big ISPs; one of their largest customers is NTL.com in England, and NTL named their cache spiders "inktomi". If we ever find out differently, we are all in deep trouble.
   
Q: How do I block a spider?
A: Use the robots.txt exclusion protocol. http://www.searchengineworld.com/robots/robots_tutorial.htm  
   
Q: Who is Robozilla?
A: netscape.com - Robozilla/1.0 The link check spider for the Open Directory Project. http://www.webmasterworld.com/forum11/147.htm  
   
Q: Who is morgue.yahoo.com?
A: Link check spider from Yahoo. Just checks to see if your link/url is still alive.  
   
Q: Who is crawler0-complaints-to-admin.webresearch.pa-x.dec.com?
A: That is the domain name of a Mercator spider run from dec.com. The domain name tells you that if there is a problem with the spider, you should contact admin@webresearch.pa-x.dec.com to complain about it.
   
Q: What is DirectHit's spider?
A: IP ranges 64.14.65.27-30 and 209.67.252.211-216, with agent Mozilla/4.06 [en] (Direct Hit Grabber).
   
Q: Who is 216.32.64.10?
A: A waste of bandwidth: http://www.cyveillance.com/response1.html  
   
Q: What is Babelfish?
A: The Altavista site Translator. http://world.altavista.com/  


6#
(OP) Posted 2003-1-5 14:50:14


It looks like I'll have to write a robots.txt after all. That handles the spiders that play by the rules, but will it do anything against the email harvesters that don't? I'm skeptical.
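For what it's worth, one programming-level defense that works even against spiders that ignore robots.txt is to never emit addresses as plain text. A minimal sketch, with a helper name made up for the example: each character becomes a decimal HTML entity, which browsers render normally but naive regex-based harvesters typically miss:

def obfuscate_email(address):
    # Encode every character as a decimal HTML entity, e.g. "a" -> "&#97;".
    return "".join("&#%d;" % ord(c) for c in address)

print(obfuscate_email("user@example.com"))
# prints &#117;&#115;&#101;&#114;&#64;... which a browser renders as user@example.com

Another option is throttling at the application layer: count requests per IP over a short window and refuse service past a threshold, since harvesters request pages far faster than human readers.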
