Nutch Improvement - HTML Parser

11 messages
Nutch Improvement - HTML Parser

Fuad Efendi
I am using http://htmlparser.sourceforge.net for my data-mining engine.
It has a lightweight 'lexer' package, and I don't need to perform ANY
HTML/XML error checking; it is a lightweight, low-level 'lexer' rather than
a real parser: no DOM, no SAX. We do not need to build a DOM to
extract Outlink[] and to extract plain text.
What about licensing?

We could develop our own low-level HTML (InputSource) processing engine from
scratch; we only need Outlink[] and PlainText.
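A from-scratch version of this idea might look like the following sketch (class and method names are invented for illustration; this is not the htmlparser.sourceforge.net API): a single pass over the markup that collects href targets and the text between tags, with no DOM, no SAX, and no error recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteExtractor {
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']?([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

    // Collect the href of every <a> start tag, in document order.
    public static List<String> outlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher tag = ANCHOR.matcher(html);
        while (tag.find()) {
            Matcher href = HREF.matcher(tag.group());
            if (href.find()) links.add(href.group(1));
        }
        return links;
    }

    // Strip script/style content, then all remaining tags; collapse whitespace.
    public static String plainText(String html) {
        String s = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        s = s.replaceAll("<[^>]*>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

Regex-based scanning like this will misread tags inside comments or attribute values; a real implementation would use a character-level lexer, but the point stands: no tree is ever built.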


Re: Nutch Improvement - HTML Parser

Lord Elwin
So the tool you use can extract outlinks and plain text without accessing a DOM?



--
"The Final Combat" drew rave reviews and kept TVB's ratings high, yet even so
TVB still did not make real use of him. Stephen Chow was no ordinary talent:
once his comedic gift had shown itself, he would not accept being left out in
the cold, so he moved into the film industry and displayed his flair on the
big screen. TVB had found its thousand-li horse and then lost it, and could
only regret it.

RE: Nutch Improvement - HTML Parser

Fuad Efendi
It's not a tool.
IT IS the stupidity of Nutch: it uses a DOM just to extract plain text and
Outlink[]...
It's very easy to design a specific routine to 'parse' byte[]; we could
improve everything 100 times... at least!
So now I understand exactly what open-source Apache is, especially after
looking at the comments in the MAIN method of Tomcat ;)))
Regards,
Fuad




Re: Nutch Improvement - HTML Parser

Jérôme Charron
> It's not a tool,
> IT IS stupidness of Nutch, it uses DOM just to extract plain text and
> Outlink[]...
> It's very easy to design specific routine to 'parse' byte[], we can
> improve
> everything 100 times... At Least!

Yes, sure. I think everybody has already done such things at school...
Building a DOM provides:
1. better parsing of malformed HTML documents (and there are a lot of
malformed docs on the web);
2. the ability to extract meta-information such as a Creative Commons
license;
3. a high degree of extensibility (the HtmlParser extension point), so that
specific information can be extracted without parsing the document many times
(for instance, extracting Technorati-like tags) just by providing a
simple plugin.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
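Point 2 of the list above can be sketched with the JDK's own XML classes, assuming the input is well-formed XHTML (real crawled HTML would first need a tag-soup-tolerant front end); the class name and structure here are illustrative, not Nutch code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomExtract {
    // Parse once into a DOM, then look for <link rel="license" href="...">.
    // Returns null for malformed input (exactly the case where a
    // tag-soup-repairing parser in front would help).
    public static String licenseOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList links = doc.getElementsByTagName("link");
            for (int i = 0; i < links.getLength(); i++) {
                Element e = (Element) links.item(i);
                if ("license".equalsIgnoreCase(e.getAttribute("rel"))) {
                    return e.getAttribute("href");
                }
            }
            return null;
        } catch (Exception ex) {
            return null;
        }
    }
}
```

The tree is built once; any number of small extractors (license, language, tags) can then walk the same Document without re-parsing.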

Re: Nutch Improvement - HTML Parser

Ragy Eleish-2
First, using the word 'stupid' without understanding all the pros and cons is
not helpful in the least. In addition to the benefits Jérôme wrote about,
using a DOM allows you to use XSLT templates to extract information in a more
declarative, not to mention standard, way.

--Ragy
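As a rough illustration of that declarative route (the stylesheet and names are invented here; well-formed XHTML input is assumed), the JDK's built-in XSLT engine can pull a value out of a page with the selection logic living entirely in the stylesheet:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltExtract {
    // The "what to extract" is declared in XSLT; the Java side only runs it.
    private static final String TITLE_XSL =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='/'><xsl:value-of select='//title'/></xsl:template>"
            + "</xsl:stylesheet>";

    public static String title(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(TITLE_XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Changing what gets extracted means editing one XPath expression, not rewriting scanner code.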


RE: Nutch Improvement - HTML Parser

Fuad Efendi
In reply to this post by Jérôme Charron
But we do not need 'better parsing of malformed HTML'; we only need to
extract plain text... Yes, meta-information such as Creative Commons XML
embedded in HTML comments is important too, and the plugin technique does
that job very well.

I am only trying to focus on specific tasks, such as removal of repeated
tokens (menu items, options, ...), automatic web-tree building using anchors
and some statistics, calculating a rank for repeated tokens, and indexing
only specific sentences with a low rank. I simply ignore DOM/SAX; I don't
need it.
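One way to make the repeated-token idea concrete (names and threshold are illustrative, not existing Nutch code) is a cross-page frequency filter: a line of extracted text that shows up on most pages of a site is almost certainly a menu, header, or footer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BoilerplateFilter {
    // Count, per distinct line, how many pages it occurs on; lines present on
    // more than maxFraction of the pages are ranked as boilerplate.
    public static Set<String> boilerplate(List<List<String>> pages, double maxFraction) {
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            for (String line : new HashSet<>(page)) {  // count once per page
                pageCount.merge(line, 1, Integer::sum);
            }
        }
        Set<String> noise = new HashSet<>();
        for (Map.Entry<String, Integer> e : pageCount.entrySet()) {
            if (e.getValue() > maxFraction * pages.size()) noise.add(e.getKey());
        }
        return noise;
    }

    // Index-worthy remainder of one page: everything not ranked as boilerplate.
    public static List<String> keep(List<String> page, Set<String> noise) {
        List<String> kept = new ArrayList<>();
        for (String line : page) {
            if (!noise.contains(line)) kept.add(line);
        }
        return kept;
    }
}
```

The same counting could be done on token n-grams instead of whole lines, at the cost of more memory.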




Re: Nutch Improvement - HTML Parser

luti
Google and the other big search engines do not extract only plain text.
For example, when you search Google for 'anything', Google will rank pages
higher where 'anything' appears inside a heading (<h1> to <h6>) or inside
<b></b>.
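A toy version of that weighting, with invented weights (certainly not Google's actual ones) and assuming well-formed, non-nested markup: each occurrence of the term counts once, multiplied by the weight of the tag it appears in.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBoost {
    // Illustrative weights; anything not listed scores 1.0 per occurrence.
    private static final Map<String, Double> WEIGHTS =
            Map.of("h1", 4.0, "h2", 3.0, "h3", 2.5, "b", 2.0);

    public static double score(String html, String term) {
        double total = 0;
        String rest = html;
        for (Map.Entry<String, Double> w : WEIGHTS.entrySet()) {
            Pattern p = Pattern.compile(
                    "(?is)<" + w.getKey() + "\\b[^>]*>(.*?)</" + w.getKey() + "\\s*>");
            Matcher m = p.matcher(rest);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                total += w.getValue() * count(m.group(1), term);
                m.appendReplacement(sb, " ");  // consume the boosted region
            }
            m.appendTail(sb);
            rest = sb.toString();
        }
        // Whatever text is left scores at the base weight of 1.0.
        return total + count(rest.replaceAll("<[^>]*>", " "), term);
    }

    private static int count(String text, String term) {
        int n = 0, i = 0;
        while ((i = text.indexOf(term, i)) >= 0) { n++; i += term.length(); }
        return n;
    }
}
```

Note that a lexer-only pipeline can do this too: the scanner just has to remember which tag it is currently inside when it emits text.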




RE: Nutch Improvement - HTML Parser

Fuad Efendi
Not sure about Google and the others... Yes, my suggestion was not to extract
only the full plain text (including OPTION groups, repeated menus, footers,
headers, ...).

HTML is not XML, and putting Creative Commons XML inside HTML comments ;-) is
a very good example of that!

And somebody suggests constructing a DOM! What for? To construct IE's buttons
and dropdowns, and to format text for presentation?

Another suggests using XSLT to try to find the creatively commented DOM
element.







Re: Nutch Improvement - HTML Parser

Howie Wang
In reply to this post by luti
I wouldn't go so far as to call it stupid, but I wouldn't mind
having an HTML parser not built on DOM. Meta info can still
be extracted without a full DOM parse. Boosting phrases within
certain tags (H1, H2, ...) would be nice, but it won't necessarily
be useful for everyone, and we aren't doing it right now anyway.

If you feel strongly about it, why don't you write another
parse filter, something like parse-html-lite? People can then
choose which to use.

By the way, how are you doing stuff like removing repeated
tokens? It's a problem that I'm interested in also.

Howie




RE: Nutch Improvement - HTML Parser

Fuad Efendi
In reply to this post by Jérôme Charron
Let's do this: create (or use existing) low-level processing. I mean using
StartTag and EndTag (which may not match in malformed HTML) and looking at
what is inside.

In this case performance will improve, and so will functionality, because we
are not building a DOM and we are not trying to find and fix HTML errors. Of
course our Tag class will have Attributes, and we will have StartTag, EndTag,
etc. I call it low-level 'parsing'. Are we using a DOM to parse RTF, PDF,
XLS, or TXT? Even inside the existing parser we use Perl5 regular expressions
to check some metadata right before parsing.
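That Tag/StartTag/EndTag model might be sketched as below (all names are invented for illustration): one pass, three token kinds, attributes on start tags, and no attempt to check that tags balance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    public static abstract class Token {}
    public static final class Text extends Token {
        public final String text;
        Text(String t) { text = t; }
    }
    public static final class EndTag extends Token {
        public final String name;
        EndTag(String n) { name = n; }
    }
    public static final class StartTag extends Token {
        public final String name;
        public final Map<String, String> attributes = new LinkedHashMap<>();
        StartTag(String n) { name = n; }
    }

    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][\\w-]*)([^>]*)>");
    private static final Pattern ATTR = Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    // Emit Text/StartTag/EndTag tokens in document order; no balancing checks.
    public static List<Token> tokenize(String html) {
        List<Token> out = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) out.add(new Text(html.substring(last, m.start())));
            if (m.group(1).isEmpty()) {
                StartTag t = new StartTag(m.group(2).toLowerCase());
                Matcher a = ATTR.matcher(m.group(3));
                while (a.find()) t.attributes.put(a.group(1).toLowerCase(), a.group(2));
                out.add(t);
            } else {
                out.add(new EndTag(m.group(2).toLowerCase()));
            }
            last = m.end();
        }
        if (last < html.length()) out.add(new Text(html.substring(last)));
        return out;
    }
}
```

Outlink extraction and plain-text extraction then become trivial consumers of this token stream, and mismatched tags simply pass through.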




RE: Nutch Improvement - HTML Parser

Gal Nitzan
You can always implement your own parser.


