It was thus said that the Great Natalie Pendragon once stated:
> For the GUS crawl at least, the crawler doesn't identify itself _to_
> crawled sites, but it does obey blocks of rules in robots.txt files
> according to user-agent. So it works without needing a user-agent
> header.
>
> It obeys user-agent of `*`, `indexer`, and `gus` in order of
> increasing importance.
>
> There's been some talk of the generic sorts of user-agents in the
> past, which I think is a really nice idea. If `indexer` is a
> user-agent that both sites and crawlers had some sort of informal
> consensus on, then sites wouldn't need to worry about keeping up with
> any new indexers popping up.
>
> Some other generic user-agent ideas, iirc, were `archiver` and
> `proxy`.
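For anyone following along, I take that to mean a site can publish something
like the following, and GUS applies whichever matching group is most specific
(a sketch; the paths are just placeholders, and exactly how overlapping groups
combine is up to GUS):

	User-agent: *
	Disallow: /private/

	User-agent: indexer
	Disallow: /not-for-search/

	User-agent: gus
	Disallow: /gus-only/
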
That's a decent idea, but that still doesn't help when I want to block a
particular bot for "misbehaving" (in some nebulous way). For example,
there's this one bot, "The Knowledge AI", which sends requests like

	/%22http:/wesiseli.com/magician/%22 [1]

(and yes, that's an actual example, pulled straight off the log file). It's
not quite annoying enough to block yet [2], but at least I have some chance
of blocking it via robots.txt (which it does request).
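The best I can do in the meantime is a group keyed on its full user-agent
string, something like this (whether it actually matches itself against that
string is an assumption on my part; I haven't verified it):

	User-agent: The Knowledge AI
	Disallow: /
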
-spc
[1] I can't quite figure out why it includes the quotes as part of the
link. *All* the links on my websites look like:
	<a href="http://example.com/">
and for the most part, it can parse those links correctly. And it's
not limited to just *one* bot; several of them exhibit the same
behavior.
[2] Although it's nearly impossible to find out anything about it, as the
user-agent string is literally "The Knowledge AI", so it might be
worth blocking it just out of spite.