problem parsing HTML

problem parsing HTML

Ian Holsman (Lists)
Hi.

I'm trying to figure out how Nutch actually extracts the links from
a piece of HTML.

I'm confused about what parts TagSoup, NekoHTML, and parse-html
play in all this.

From what I can see, the regular expression it uses to extract the
link is slightly off, but I'm not sure where it actually does this.

The fragment in question is this:

<a href="#|"
onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
":NewsMaker: National, Political, World, Breaking News and More :" +
nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");
s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);
s_gs(s_account2);return false;' id="newsmaker80631.pre"><img border="0"
src="http://cdn.XXXX.XXXX.com/ch_news/backbtn" width="25" height="21"
alt="Prev"/></a>

and it is attempting to find ;s_account2=(t[0].indexOf(



TIA
Ian

--
Ian Holsman
[hidden email]

Re: problem parsing HTML

Dennis Kubes
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks()
which is called from org.apache.nutch.parse.html.HtmlParser.  Running
some simple tests on your fragment below, I get no outlink for it.
What version of Nutch are you running?
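
To illustrate why a DOM-based walk would not be expected to pick up anything from the onclick script, here is a minimal sketch. It is not Nutch's code: the class AnchorHrefSketch and the helper extractHrefs are hypothetical names, and the JDK's XML parser stands in for NekoHTML/TagSoup, which only works here because the fragment happens to be well-formed XML. The onclick value in the example is shortened from the original fragment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnchorHrefSketch {

    // Hypothetical helper: parse well-formed markup into a DOM tree and
    // collect the href attribute of every <a> element found in it.
    static List<String> extractHrefs(String markup) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        List<String> hrefs = new ArrayList<>();
        collect(doc.getDocumentElement(), hrefs);
        return hrefs;
    }

    // Depth-first walk. Only the href attribute of <a> elements is read;
    // other attributes (onclick, id, src, ...) are never inspected.
    private static void collect(Node node, List<String> hrefs) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("a".equalsIgnoreCase(el.getTagName()) && el.hasAttribute("href")) {
                hrefs.add(el.getAttribute("href"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), hrefs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the fragment from the original post.
        String fragment =
            "<a href=\"#|\" onclick='s_account2=(t[0].indexOf(\"aolsvc\")==-1"
            + "?t[0]:t[1]);return false;' id=\"newsmaker80631.pre\">"
            + "<img border=\"0\" src=\"http://cdn.XXXX.XXXX.com/ch_news/backbtn\""
            + " alt=\"Prev\"/></a>";
        System.out.println(extractHrefs(fragment));
    }
}
```

In a walk like this the only candidate is the literal href value; the script text is just an attribute string that is never looked at. Whether a fragment-only href such as this one is later kept or discarded is not something the sketch models.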

Dennis Kubes


Re: problem parsing HTML

Ian Holsman (Lists)
Hi Dennis,
Thanks for the fast response.

I'm running the SVN head.
I'll try narrowing it down a bit further.
What led me to suspect the link extraction was looking at what the
fetcher was fetching. It could be that we have some bad HTML on our
servers, but it's a standard header area.

regards
Ian

On 13/04/2007, at 11:17 AM, Dennis Kubes wrote:

> It happens in
> org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is
> called from org.apache.nutch.parse.html.HtmlParser.  Running some
> simple tests on your fragment below, I get no outlink for it.
> What version of Nutch are you running?
>
> Dennis Kubes