[Help] How can I programmatically keep "spiders" from crawling my forum?
Spider Identification
Q: Why do I get visited by AOL spiders everyday?
A: You are probably seeing AOL's proxy cache spider, which retrieves pages from the internet on behalf of AOL users. When an AOL user clicks a link:
- the request goes to the AOL proxy cache server;
- if the page isn't in the cache, AOL sends out a spider to download the page;
- the spider puts the page into the cache;
- finally, the page is returned to the user.
In your website logs you will often see host names like "spider-xyz.aol.com"; that is what they are doing. They also change agent names to reflect the requesting agent. At times you can get hit rapid-fire by 10-15 agent names when they reload the cache; it seems to get stuck in some mode from time to time. AOL webmaster FAQ (ie only): http://webmaster.aol.com/
Q: How do I see who owns an IP address?
A: Start by doing a trace route. Windows: Start - Run - tracert xyz.xyz.xyz.xyz. Linux (the command name varies): traceroute xyz.xyz.xyz.xyz. If the IP doesn't resolve to a host name, at least you will see the route taken. Often it is enough to look at the very last hop to see who owns that box.
Q: How do I see who owns a hostname?
A: Do a Whois lookup on the domain: http://www.networksolutions.com/cgi-bin/whois/whois
Q: How can I tell who owns a specific IP?
A: Enter the IP address into an ARIN Whois: http://www.arin.net/whois/index.html For Europe, use RIPE: http://www.ripe.net/db/whois.html
Q: How can I "spot" spiders?
A: Look at the clients that request your robots.txt file. If you don't have a robots.txt file, check your error logs for failed requests for it. Then look for spider or crawler patterns. Spiders will often:
- send no referring URL;
- grab many pages a minute;
- ignore robots.txt (rogue spiders);
- send distinctive header information (compare a browser's request headers to a spider's some time);
- skip graphics, JS files, and external CSS files;
- request many pages at the same time;
- request pages at a timed interval, such as one request per minute.
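Some of these patterns can be checked mechanically. The following is a minimal Python sketch, not a finished detector: it assumes your log has already been parsed into (client, unix_time, path, referrer) tuples, and the function name `spider_signals`, the asset suffix list, and the 30-pages-per-minute threshold are all illustrative assumptions. It does not cover the header or timed-interval patterns.

```python
from collections import Counter, defaultdict

# Assumed list of "asset" file types; adjust to what your site serves.
ASSET_SUFFIXES = (".gif", ".jpg", ".png", ".js", ".css")


def is_asset(path):
    """True for graphics, JS and CSS requests."""
    return path.lower().endswith(ASSET_SUFFIXES)


def spider_signals(requests, max_pages_per_minute=30):
    """Group requests by client and return {client: set of signal names}.

    requests: iterable of (client, unix_time, path, referrer) tuples.
    max_pages_per_minute is an arbitrary, assumed threshold.
    """
    by_client = defaultdict(list)
    for client, timestamp, path, referrer in requests:
        by_client[client].append((timestamp, path, referrer))

    report = {}
    for client, hits in by_client.items():
        signals = set()
        paths = [path for _, path, _ in hits]
        # Page requests: everything except assets and robots.txt itself.
        pages = [(ts, p) for ts, p, _ in hits
                 if not is_asset(p) and p != "/robots.txt"]

        if "/robots.txt" in paths:
            signals.add("requested robots.txt")
        if all(not referrer for _, _, referrer in hits):
            signals.add("no referrer")
        if pages and not any(is_asset(p) for p in paths):
            signals.add("no assets fetched")

        # Count page requests per clock minute and compare the busiest one.
        per_minute = Counter(int(ts) // 60 for ts, _ in pages)
        if per_minute and max(per_minute.values()) > max_pages_per_minute:
            signals.add("high request rate")

        report[client] = signals
    return report
```

No single signal proves anything; a client that trips several of them at once is the one worth tracing.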
Q: Who is Exodus.net?
A: They are a very large net access provider and host. They supply services to large sites like Yahoo, Inktomi, and Google. You will find many search engine spiders that run through Exodus.net.
Q: How do I track down a spider owner?
A:
1) Find the IP address.
2) Do a trace route back to the host.
3) Look at the last few hops before the end of the trace and pick out the last host you can identify. Deciding which was the last real host is often the tricky step. Start at the bottom and work up. Usually you'll see that a host has 2-3 boxes and can work out the real host name by guessing.
4) Take the host name and try finding it in the browser with the standard variations, such as www.host.com or www.host.net. Often that is all you need.
5) Look up the host on an InterNIC Whois. Often that leads you straight to the owner/domain.
6) If that doesn't work, try looking up the IP owner via an ARIN Whois: http://www.arin.net/whois/index.html
You can identify about 50% of them with this system. Most often you'll run into 'joe user' running a spider; in those cases it is hard to know just who or what it was. If the spider was abusive, keep your logs and contact the admin of the host. Most of the better ISPs will take a moment to look into it; it may be someone who is routinely abusive, and they need more information to identify them.
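The host-name guessing in steps 3 to 5 can be partly scripted. Here is a minimal Python sketch under stated assumptions: `reverse_lookup` and `parent_domains` are hypothetical helper names, the reverse lookup uses the standard library's `socket.gethostbyaddr` and needs working DNS, and the trace route and Whois steps are still done by hand.

```python
import socket


def reverse_lookup(ip):
    """Return the host name registered for ip, or None if it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # Covers failed lookups (socket.herror / socket.gaierror).
        return None


def parent_domains(hostname):
    """List progressively shorter parent domains, longest first.

    Stops at two labels (for example "dec.com"), since a bare
    top-level domain is no use for a Whois lookup.
    """
    labels = hostname.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(1, len(labels) - 1)]


def browser_guesses(hostname):
    """Candidate URLs to try by hand for each parent domain."""
    return ["http://www." + domain + "/" for domain in parent_domains(hostname)]
```

For a host such as crawler0-complaints-to-admin.webresearch.pa-x.dec.com, `parent_domains` yields webresearch.pa-x.dec.com, pa-x.dec.com and dec.com, which are the names you would then try in the browser or feed to a Whois.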
Q: Is Mercator from Altavista?
A: Yes, it is AV. It was formerly run from Compaq/Dec.com. Word has it that the new Scooter is actually the old Mercator; the old Mercator from dec.com appears to have been retired.
Q: What is Teleport Pro/1.29 or SiteReaper?
A: These are just spider programs that people can run under Windows. It's joe-user running a local spider against your site.
Q: I have detected what I think is Inktomi running a spider with a stock agent name?
A: j6000.inktomi.com - 209.185.141.185 -- Mozilla/4.72 [en] (X11; U; NetBSD 1.4.2 i386; Nav) -- index.html Yes, it really is them. With spiders like this, be careful to cloak by IP address and not by agent name.
Q: Why are spider IPs in such demand?
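A: A cloaking script that relies on IPs will be far more effective than a UA-based script, because an agent name can be changed or faked at will while the source IP address is much harder to disguise.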
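To see why, compare the two checks side by side. This is a rough Python sketch, not a real cloaking script: the table `KNOWN_SPIDER_NETWORKS` is a hypothetical list holding only the single Inktomi address quoted in this FAQ (a real list has to be collected from your own logs), and the agent markers are just strings that appear in agent and host names mentioned here.

```python
import ipaddress

# Hypothetical table; only the one address quoted in this FAQ is listed.
KNOWN_SPIDER_NETWORKS = {
    "inktomi": [ipaddress.ip_network("209.185.141.185/32")],
}

# Assumed substrings that give away a self-identifying spider.
AGENT_MARKERS = ("spider", "crawler", "robozilla", "grabber")


def identify_by_ip(ip, networks=KNOWN_SPIDER_NETWORKS):
    """Return the owner name if ip falls in a listed network, else None."""
    address = ipaddress.ip_address(ip)
    for owner, nets in networks.items():
        if any(address in net for net in nets):
            return owner
    return None


def looks_like_spider_agent(user_agent):
    """Naive UA check: only catches spiders that announce themselves."""
    agent = user_agent.lower()
    return any(marker in agent for marker in AGENT_MARKERS)


# The Inktomi hit quoted above, using a stock browser agent name.
ip = "209.185.141.185"
agent = "Mozilla/4.72 [en] (X11; U; NetBSD 1.4.2 i386; Nav)"
print(identify_by_ip(ip), looks_like_spider_agent(agent))
```

The UA check misses this visitor entirely because the agent name looks like an ordinary browser, while the IP check still recognizes it.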
Q: How do I make a spider come back?
A: It is a waiting game. You can also try resubmitting your pages.
Q: I've been hit hard by a spider, I never submitted. How do they find me?
A: Several ways:
- they share whois info with registrars;
- they follow referral strings and you "leaked" one;
- someone else submitted your URL;
- they followed some directory listing;
- they followed a link from another web site.
Q: Why do I get hits from Inktomi in Japan or Korea?
A: They provide search services for Goo Japan and for Korean search engines.
Q: Is this an Inktomi spider: inktomi1-car.server.ntl.com?
A: No. It is an ISP user. Inktomi also produces proxy cache servers for big ISPs. One of their largest customers is NTL.com in England. NTL named their cache spiders "inktomi". If we ever find out different - we are all in deep trouble.
Q: How do I block a spider?
A: Use the robots.txt exclusion protocol. http://www.searchengineworld.com/robots/robots_tutorial.htm
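As a rough illustration, the Python sketch below checks a small robots.txt with the standard library's `urllib.robotparser` before you publish it. The agent name `BadSpider` and the `/forum/private/` path are made-up examples, not names from any real spider or site. Keep in mind that rogue spiders don't respect robots.txt, so this only stops well-behaved ones.

```python
from urllib.robotparser import RobotFileParser

# Example rules: shut one (hypothetical) spider out completely and
# keep every other spider out of one (hypothetical) directory.
ROBOTS_TXT = """\
User-agent: BadSpider
Disallow: /

User-agent: *
Disallow: /forum/private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if the rules above let this agent fetch this path."""
    return parser.can_fetch(agent, path)


for agent, path in [
    ("BadSpider", "/forum/index.html"),
    ("OtherBot", "/forum/index.html"),
    ("OtherBot", "/forum/private/members.html"),
]:
    print(agent, path, allowed(agent, path))
```

The text in `ROBOTS_TXT` is what would go into the robots.txt file at the root of the site.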
Q: Who is Robozilla?
A: netscape.com - Robozilla/1.0 The link check spider for the Open Directory Project. http://www.webmasterworld.com/forum11/147.htm
Q: Who is morgue.yahoo.com?
A: Link check spider from Yahoo. Just checks to see if your link/url is still alive.
Q: Who is crawler0-complaints-to-admin.webresearch.pa-x.dec.com?
A: That is the domain name of a Mercator spider run from Dec.com. The domain name itself tells you that, if there is a problem with the spider, you should contact admin@webresearch.pa-x.dec.com to complain about it.
Q: What is DirectHit's spider?
A: IPs: 64.14.65.27 - 30 and 209.67.252.211 - 216. Agent: Mozilla/4.06 [en] (Direct Hit Grabber)
Q: Who is 216.32.64.10?
A: A waste of bandwidth: http://www.cyveillance.com/response1.html
Q: What is Babelfish?
A: The Altavista site Translator. http://world.altavista.com/