does anyone know of a 'smart' categorizing text pattern finder?

Vladimir Olenin

Hi,

I wonder if anyone here knows of a 'smart' text pattern finder, ideally written in Java. The library I'm looking for should be able to 'guess' the category of a particular piece of text on a page, most probably by finding similarities between the bulk of the pages and a set of templates.

E.g., many forums are powered by phpBB, which structures 99% of the pages (except for some title pages & user profile pages) in a very similar fashion (the page is broken into blocks, each block is broken into further blocks, etc.). By comparing many pages with each other (e.g., from the same domain root: forum.springframework.org) it should be possible to detect the common parts ('template decorations') and the page-specific parts (actual content, like 'user name' and 'posting body'). After that it should further be possible, by comparing the 'template decorations' parts to a set of templates, to 'guess' the nature of each 'page-specific' block (e.g., 'Vladimir Olenin' in the left side column will be marked as 'name', while whatever is adjacent to this column is the post body).
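
The first step described above (separating 'template decorations' from page-specific content by comparing many pages) could be sketched roughly as follows. This is only a minimal illustration, not an existing library: it assumes each page has already been split into a list of text blocks by some HTML parser, and the class name `TemplateBlockSplitter`, the method names, the `minShare` threshold and the sample blocks are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TemplateBlockSplitter {

    /**
     * Counts, for every distinct block text, the number of pages it occurs on.
     * A block repeated within one page is counted only once for that page.
     */
    public static Map<String, Integer> pageFrequency(List<List<String>> pages) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> page : pages) {
            for (String block : new HashSet<>(page)) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Returns the blocks of one page that look page specific, i.e. those that
     * occur on fewer than minShare (0..1) of all sampled pages. Blocks at or
     * above that share are treated as template decoration and dropped.
     */
    public static List<String> pageSpecificBlocks(List<String> page,
                                                  List<List<String>> pages,
                                                  double minShare) {
        Map<String, Integer> counts = pageFrequency(pages);
        List<String> specific = new ArrayList<>();
        for (String block : page) {
            double share = counts.getOrDefault(block, 0) / (double) pages.size();
            if (share < minShare) {
                specific.add(block);
            }
        }
        return specific;
    }

    public static void main(String[] args) {
        // Hypothetical, already-extracted text blocks of three forum pages.
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Forum Index", "Author", "Message", "alice", "first post", "Powered by phpBB"),
            Arrays.asList("Forum Index", "Author", "Message", "bob", "second post", "Powered by phpBB"),
            Arrays.asList("Forum Index", "Author", "Message", "carol", "third post", "Powered by phpBB"));

        // Blocks present on at least 80% of the pages count as template.
        System.out.println(pageSpecificBlocks(pages.get(0), pages, 0.8));
    }
}
```

Exact string comparison is the simplest possible notion of 'same block'. Real pages would probably need some normalisation (whitespace, dates, counters) or a comparison of the tag path leading to the block instead of its text, and the labelling step ('name' vs. 'post body') is not covered here at all.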

So, I wonder if anyone knows of a package capable of such things. The primary goal, though, is simpler: to be able to parse out just the posters' names from message boards. Though the 'block category' can sometimes be derived from the CSS class name of the tags around the text, very often that is not the case.
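
For the easy case, where the CSS class does give the category away, a rough sketch using only `java.util.regex` might look like the one below. The markup (`<span class="name">...</span>`), the class name `name` and the helper `PosterNameExtractor` are assumptions for illustration only. A regular expression like this handles flat, well-formed spans and nothing more, so a real HTML parser would be the safer choice for actual pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PosterNameExtractor {

    // Matches a <span> with a double-quoted class attribute and plain text
    // content (no nested tags). Group 1 is the class list, group 2 the text.
    private static final Pattern SPAN = Pattern.compile(
        "<span\\b[^>]*\\bclass=\"([^\"]*)\"[^>]*>([^<]*)</span>",
        Pattern.CASE_INSENSITIVE);

    /** Returns the text of every matching span that carries the given CSS class. */
    public static List<String> extract(String html, String cssClass) {
        List<String> names = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            // A class attribute may hold several space-separated class names.
            for (String c : m.group(1).trim().split("\\s+")) {
                if (c.equals(cssClass)) {
                    names.add(m.group(2).trim());
                    break;
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical fragment of a message-board page.
        String html = "<td><span class=\"name\">alice</span></td>"
                    + "<td><span class=\"postbody\">Hello</span></td>"
                    + "<td><span class=\"name gen\">bob</span></td>";
        System.out.println(extract(html, "name"));
    }
}
```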

Might Nutch have similar functionality built into their crawler?

Thanks.

Vlad

Re: does anyone know of a 'smart' categorizing text pattern finder?

Otis Gospodnetic-2
Look at LingPipe from Alias-i.com.  Look at Named Entity extraction and its classifiers.

Otis


----- Original Message ----
From: Vladimir Olenin <[hidden email]>
To: [hidden email]
Sent: Monday, September 25, 2006 9:49:31 PM
Subject: does anyone know of a 'smart' categorizing text pattern finder?






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: does anyone know of a 'smart' categorizing text pattern finder?

Grant Ingersoll
You might also look at some other NLP tools, such as OpenNLP, which
you can train for your collection. Or, if you are interested in
buying, there are many products on the market that do similar things.


On Sep 26, 2006, at 9:36 AM, Otis Gospodnetic wrote:

> Look at LingPipe from Alias-i.com.  Look at Named Entity extraction  
> and its classifiers.
>
> Otis

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886





Re: does anyone know of a 'smart' categorizing text pattern finder?

Bob Carpenter
In reply to this post by Vladimir Olenin
CONTENTS DELETED
The author has deleted this message.

Re: does anyone know of a 'smart' categorizing text pattern finder?

Erik Hatcher

On Nov 21, 2006, at 5:46 PM, Bob Carpenter wrote:
> LingPipe in Action.

Now that's a book I'd love to own!






Re: does anyone know of a 'smart' categorizing text pattern finder?

Jin Yiqing
Does this book really exist? I googled and didn't find any information about
it :)

2006/11/22, Erik Hatcher <[hidden email]>:

>
>
> On Nov 21, 2006, at 5:46 PM, Bob Carpenter wrote:
> > LingPipe in Action.
>
> Now that's a book I'd love to own!

Re: does anyone know of a 'smart' categorizing text pattern finder?

Erik Hatcher

On Nov 24, 2006, at 3:22 AM, Jin Yiqing wrote:
> Does this book really exist? I googled and didn't find any
> information about
> it :)


No, I'm sure Bob meant to say "Lucene in Action", to which he
contributed a wonderful case study on bits of LingPipe.

        Erik


