when to use cmd "parse" to parse a segment's pages

Feng Ji
Hi,

I am following the Nutch 0.8 tutorial. The crawl steps are "inject,
generate, fetch, updatedb".

But there is also a command in nutch/bin called "parse", which parses a
segment's pages. I wonder whether I should run it before "updatedb" in the
steps above.

Currently I do not run the "parse" command, yet "updatedb" still sees the
parsed outlinks. It seems some kind of parsing is already done in the
"fetch" step.

Thanks,

Michael

Re: when to use cmd "parse" to parse a segment's pages

Zaheed Haque
Hi

That is because the parse option is set to true in your nutch-site.xml. Set
it to false if you want to parse manually, or override the config by passing
the -noParsing option to fetch.
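
A minimal sketch of the two variants, written as a dry run that only prints
the commands instead of executing them. The crawl_steps helper and the
crawl/crawldb, crawl/segments and urls paths are hypothetical; the subcommand
names follow the Nutch 0.8 tutorial, and the property behind the "parse
option" is assumed to be fetcher.parse.

```shell
#!/bin/sh
# Dry-run sketch: prints the crawl steps rather than running Nutch.
# All paths below are illustrative placeholders, not required names.
crawl_steps() {
  parse_in_fetch="$1"   # "true": fetch parses; "false": parse separately
  segment="$2"          # segment directory produced by the generate step

  echo "bin/nutch inject crawl/crawldb urls"
  echo "bin/nutch generate crawl/crawldb crawl/segments"
  if [ "$parse_in_fetch" = "true" ]; then
    # Parsing happens inside fetch, so no separate parse step is needed.
    echo "bin/nutch fetch $segment"
  else
    # Fetch without parsing, then parse the segment before updating the db.
    echo "bin/nutch fetch $segment -noParsing"
    echo "bin/nutch parse $segment"
  fi
  echo "bin/nutch updatedb crawl/crawldb $segment"
}
```

Calling crawl_steps false with a segment path lists the manual-parse
sequence, where "parse" sits between "fetch" and "updatedb"; calling it with
true lists the default sequence from the tutorial.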

Cheers

On 8/30/06, Feng Ji <[hidden email]> wrote:

> Hi,
>
> I am following the Nutch 0.8 tutorial. The crawl steps are "inject,
> generate, fetch, updatedb".
>
> But there is also a command in nutch/bin called "parse", which parses a
> segment's pages. I wonder whether I should run it before "updatedb" in the
> steps above.
>
> Currently I do not run the "parse" command, yet "updatedb" still sees the
> parsed outlinks. It seems some kind of parsing is already done in the
> "fetch" step.
>
> Thanks,
>
> Michael

Re: when to use cmd "parse" to parse a segment's pages

Feng Ji
Exactly!

That solves my puzzle.

Thanks,

Michael


On 8/30/06, Zaheed Haque <[hidden email]> wrote:

> Hi
>
> That is because the parse option is set to true in your nutch-site.xml. Set
> it to false if you want to parse manually, or override the config by passing
> the -noParsing option to fetch.
>
> Cheers