Re: Why Crawl failed to fetch so many pages?


Nutch开发邮件
Please modify the rule below in the crawl URL filter configuration (crawl-urlfilter.txt):

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

because the link
http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
contains '?' and '=', so this rule causes it to be skipped. Change it to:

  # skip URLs containing certain characters as probable queries, etc.
  -[@]
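
If you want to check locally which URLs each rule would skip, here is a minimal plain-Java regex sketch (my own illustration, not the Nutch filter code itself) comparing the default character class with the relaxed one:

  import java.util.regex.Pattern;

  // Quick check of which crawl-urlfilter character classes would skip the URL.
  // Plain java.util.regex, not the Nutch URL filter classes.
  public class UrlFilterCheck {
      public static void main(String[] args) {
          String url =
              "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt";

          Pattern defaultRule = Pattern.compile("[?*!@=]"); // default: skip probable queries
          Pattern relaxedRule = Pattern.compile("[@]");     // relaxed rule suggested above

          System.out.println("skipped by -[?*!@=]: " + defaultRule.matcher(url).find()); // true
          System.out.println("skipped by -[@]    : " + relaxedRule.matcher(url).find()); // false
      }
  }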



2005/4/14, Andy Liu <[hidden email]>:

>
> By default, Nutch only crawls the first 100 outlinks on a page. Maybe
> that's your problem?
>
> On 4/14/05, Matthias Jaekle <[hidden email]> wrote:
> > > try
> > > +^http://news.buaa.edu.cn/*
> > This should not be the reason.
> > Your regex matches URLs starting with:
> > http://news.buaa.edu.cn
> > http://news.buaa.edu.cn/
> > http://news.buaa.edu.cn//
> > http://news.buaa.edu.cn/// ...
> >
> > The only thing I would try is to escape some characters to make the rule
> > more precise. An unescaped dot matches any character. Better:
> > +^http:\/\/news\.buaa\.edu\.cn
> >
> > Did you run enough rounds to reach the wanted depth?
> > With each round you only fetch the links that are already known.
> >
> > Matthias
> >
> > --
> > http://www.eventax.com - eventax GmbH
> > http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
> >
>
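
To make the escaping point above concrete: with unescaped dots, the prefix rule also matches host names it was not meant to, since '.' stands for any character. A small plain-Java sketch (my own illustration; the look-alike host below is made up):

  import java.util.regex.Pattern;

  // Compare the unescaped and the escaped prefix rules quoted above.
  public class PrefixRuleCheck {
      public static void main(String[] args) {
          Pattern loose  = Pattern.compile("^http://news.buaa.edu.cn/*");      // '.' matches anything
          Pattern strict = Pattern.compile("^http://news\\.buaa\\.edu\\.cn");  // literal dots

          String intended  = "http://news.buaa.edu.cn/dispnews.php?type=1";
          String lookalike = "http://newsxbuaaxeduxcn/other"; // hypothetical host, for illustration

          System.out.println(loose.matcher(intended).find());   // true
          System.out.println(loose.matcher(lookalike).find());  // true  - unescaped dots matched 'x'
          System.out.println(strict.matcher(intended).find());  // true
          System.out.println(strict.matcher(lookalike).find()); // false
      }
  }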



--
TEL 0512-68251233-6966
MSN:[hidden email]
Mail:[hidden email]
QQ:58624951
BenQ.com
268 Shishan Road, New District,
Suzhou, China