updatedb says URL normalizing and filtering are set to false

updatedb says URL normalizing and filtering are set to false

Edward Quick

When I run updatedb, it reports that URL normalizing and filtering are set to false. I thought they were already active. If not, could someone tell me how to switch them on, please?

Thanks,
Ed.

$ bin/nutch updatedb crawl/crawldb crawl/segments/20080926135817
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080926135817]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done


Re: updatedb says URL normalizing and filtering are set to false

Doğacan Güney-3
On Fri, Sep 26, 2008 at 5:04 PM, Edward Quick <[hidden email]> wrote:
>
> When I run the updatedb, it states URL normalizing and filtering are set to false. I think they are already active though? If not, could someone tell me how I switch those on please?
>

You don't normally need to filter or normalize during updatedb, since all
URLs should already have been filtered and normalized by other jobs by that
point. Still, you can switch both on by passing -normalize -filter to
updatedb.
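
For illustration, here is a minimal sketch of that invocation, reusing the crawldb and segment paths from the original post. Placing the flags after the segment path is an assumption on my part, and the CRAWLDB, SEGMENT and CMD variable names are only there for the example. The script prints the command rather than running it, so it can be checked first.

```shell
# Paths taken from the original post; adjust them to your own crawl layout.
CRAWLDB=crawl/crawldb
SEGMENT=crawl/segments/20080926135817

# Assemble the updatedb call with both optional flags appended
# (flag position after the segment path is an assumption).
CMD="bin/nutch updatedb $CRAWLDB $SEGMENT -normalize -filter"

# Print the command for review before running it from the Nutch directory.
echo "$CMD"
```

With both flags passed, the "URL normalizing" and "URL filtering" lines in the updatedb log should report true instead of false.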

--
Doğacan Güney

RE: updatedb says URL normalizing and filtering are set to false

Edward Quick



> On Sun, 28 Sep 2008 23:06:40 +0300, Doğacan Güney wrote:
>
> You don't normally need filter/normalize during updatedb, since all
> urls should already be filtered and normalized by other jobs at that
> point. Still, you can switch them on by passing -normalize -filter to
> updatedb.

Thanks. That is still useful to know, in case I want to fix the list after the crawl is done.

Ed.

