Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Tranquil
Hi,

Is there a way to tell Nutch not to parse the text of the pages it fetches,
and just extract the links from them?
I know there is a "-noParsing" option, but I still need to download some
content types using the parse-XXX plugins, so I'm not sure it will work if I
use that option.

Thank you,

--
Eyal Edri

Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

joel gump
Maybe you could try HTML::LinkExtractor:

http://search.cpan.org/~podmaster/HTML-LinkExtractor-0.13


Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Dennis Kubes-2
The -noParsing option still downloads and stores the content. It
simply does not parse it.

Dennis Kubes


Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Andrzej Białecki-2
eyal edri wrote:
> Hi,
>
> Is there a way to tell nutch not to parse the pages it fetches? meaning just
> to extract the links from it?

Extracting links requires that a page is downloaded first (otherwise
where would you extract the links from?) and parsed (otherwise how would
you extract links from an unintelligible byte[]?).


> I know there is a "-no parsing" attribute,but still i need to d/l some
> contentTypes using the parse-XXX plugins.. so i'm not sure it will work if i
> use the option.

No download -> no parsing -> no outlinks.
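
To illustrate that chain outside of Nutch, here is a minimal standalone
sketch in Python. It is not Nutch code, and the names LinkOnlyParser and
extract_outlinks are hypothetical, chosen only for this example. It assumes
the page has already been downloaded as raw bytes and is UTF-8 HTML. The
markup still has to be parsed to find the links, but the text content can
simply be ignored:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkOnlyParser(HTMLParser):
    """Collects href targets of <a> tags and ignores all text content."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the fetched page.
                self.outlinks.append(urljoin(self.base_url, value))

    # handle_data is deliberately not overridden, so page text is discarded.


def extract_outlinks(raw_bytes, base_url, encoding="utf-8"):
    """Decode already-downloaded bytes and return the links in the markup."""
    parser = LinkOnlyParser(base_url)
    # The bytes must be decoded and parsed before any link can be found.
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.outlinks


# Example input standing in for a fetched page (assumed, not real crawl data).
page = (
    b'<html><body><p>Some text</p>'
    b'<a href="/docs">Docs</a> <a href="http://example.org/x">X</a>'
    b'</body></html>'
)
print(extract_outlinks(page, "http://example.com/index.html"))
```

With no downloaded bytes there is nothing to feed to the parser, so the
list of outlinks stays empty.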


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Is there a way to tell nutch fetcher not to parse for text in the page? (i.e. just links)

Tranquil
I understand,
but I was referring to the parsing of the text and the collection of
keywords for indexing/querying. That is the part I would like to skip,
with the overall aim of speeding up the fetcher and updatedb.

Is there a way to do that (maybe by removing several plugins)?




--
Eyal Edri