anchor text in content field

anchor text in content field

alxsss
Hello,

Is there a way to configure Nutch not to put anchor text in the content field?

Thanks.
Alex.

Re: anchor text in content field

lewis john mcgibbney
Hi Alex,

On Thu, Jun 12, 2014 at 7:34 PM, <[hidden email]> wrote:

> Is there a way to configure nutch not to put anchors in content field?

Which version of Nutch are you referring to?
Can you please provide us with an example?
Thank you
Lewis

RE: anchor text in content field

Markus Jelsma-2
In reply to this post by alxsss
Hi,

If you use TikaParser, you need a customized ContentHandler that remembers its state (being inside an anchor) and ignores characters based on that state in characters(char ch[], int start, int length). The easiest approach would be to modify parse-tika's DOMBuilder for that.
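
A minimal sketch of that idea, assuming a plain SAX handler from the JDK rather than parse-tika's actual DOMBuilder (which is not shown in this thread). The class name AnchorSkippingHandler and the extract helper are hypothetical, made up for illustration. The handler keeps a depth counter for open anchor elements and drops any characters reported while that counter is above zero.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text but skips characters inside <a> elements. */
class AnchorSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Depth counter instead of a boolean, so nested or malformed anchors do not confuse the state.
    private int anchorDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isAnchor(localName, qName)) {
            anchorDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isAnchor(localName, qName) && anchorDepth > 0) {
            anchorDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is reported while we are outside every anchor.
        if (anchorDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // localName is empty when the parser is not namespace-aware, so fall back to qName.
    private static boolean isAnchor(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "a".equalsIgnoreCase(name);
    }

    String getText() {
        return text.toString();
    }

    /** Convenience helper for trying the handler on a well-formed XHTML fragment. */
    static String extract(String xhtml) {
        try {
            AnchorSkippingHandler handler = new AnchorSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling AnchorSkippingHandler.extract on a fragment such as `<p>Lopez story <a href="x">Obama speech</a> end</p>` keeps the surrounding text and drops the link text. In Nutch itself the same state tracking would have to live wherever parse-tika receives its SAX events, and how that is wired up depends on the Nutch version.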

The use case puzzles me: why would one indiscriminately omit anchors from text? The average Wikipedia page would be much less useful if it were parsed like that.

Cheers


Re: anchor text in content field

alxsss

Hi,


I went ahead and modified DOMBuilder.

The use case is that some silly newspapers put links to all of today's articles at the end of each article. Let's say there were 3 articles today: two of them are about Obama and one is about J. Lopez. At the end of the article about J. Lopez there are two links whose anchor text is the titles of the articles about Obama.

If you index these articles, then a search for J. Lopez will return articles about Obama.

So, what would be a good solution to this issue?


Thanks.
Alex.


