Crawler fetching weird urls


Crawler fetching weird urls

Jeff Van Boxtel
I am experiencing a problem where my fetcher is trying to grab lots of
URLs that don't exist. For example, it will try to fetch:
 
fetching http://www.ourhost.com/project_files/PROJECTS/000260/WP/
0L19MM14.doc/0k07mm10.doc/%200L19MM14.doc/0i13mm4.doc/0I29MM3.PDF/%200L19MM14.doc/
 
No such URL exists, and I can't figure out where the crawler is getting
these strange URLs. I don't think any of my pages link to anything like
this. I have also seen other (less bizarre) URLs that don't seem to
exist and aren't linked from anywhere on our site. Is it possible that
the crawldb is getting corrupted? Is there a way to see where the
crawldb got these URLs from? And if the URLs result in a 404, is there a
way to have them removed from the crawldb?

Re: Crawler fetching weird urls

Martin Kuen
hi,

The "readdb" and "readlinkdb" commands could be interesting for you:
http://wiki.apache.org/nutch/08CommandLineOptions

If you want to see the in/outlinks of a given page (readlinkdb), you
must first invoke the "invertlinks" command.
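
For example (just a sketch; the "crawl/crawldb", "crawl/linkdb" and
"crawl/segments" paths and the example URL are placeholders for your
own setup):

  # crawldb statistics and the entry for a single URL
  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -url http://www.ourhost.com/somepage.html

  # build the linkdb from your segments, then look up the inlinks of a URL
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch readlinkdb crawl/linkdb -url http://www.ourhost.com/somepage.html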

Unfortunately, I don't know how to remove an individual URL from a
crawldb ... sorry.


Hope it helps,

Martin



RE: Crawler fetching weird urls

Howie Wang
I tend to get this problem when the parse-js plugin is enabled.
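
If that's what is happening here, one workaround (just a sketch, not
tested against your config, and the extension list is only illustrative)
is to reject URLs where a document extension shows up in the middle of
the path, by adding a rule to conf/regex-urlfilter.txt somewhere above
the final catch-all "+." rule:

  # reject URLs containing a .doc/.pdf component followed by more path
  -\.(doc|DOC|pdf|PDF)/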

Howie


Re: Crawler fetching weird urls

Doğacan Güney-3
On 9/12/07, Howie Wang <[hidden email]> wrote:
> I tend to get this problem when parse-js plugin is enabled.

Try using the urlfilter-validator plugin. It should filter such URLs
(but it also filters some file:// URLs, etc. Please see the discussion
in NUTCH-546.)
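
To pick it up, urlfilter-validator has to be listed in the
plugin.includes property of conf/nutch-site.xml. Roughly something like
this (the rest of the value is just the usual plugin list and may differ
in your install; parse-js is left out here, per Howie's observation):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|validator)|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>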



--
Doğacan Güney