Re: Information extraction


Re: Information extraction

Jack.Tang
Hi Cuong.

I am going to build a private book search engine, and I am facing the same problem.
Could you describe more about the information you want to extract and
the website?

Regards
/Jack

On 7/26/05, Cuong Hoang <[hidden email]> wrote:

> Hi all,
>
>
>
> Does anyone have experience with designing web information extraction systems
> such as shopbots/pricebots? I'm currently doing research on this topic and want
> to integrate Nutch. A few guidelines from anyone who has designed this type
> of system would be really helpful to me.
>
>
>
> Regards,
>
>
>
> Cuong Hoang
>
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

RE: Information extraction

climbingrose
Hi Jack,

I've been doing research over the last few days, and I think that, once
successfully implemented, an information extraction system should be able to
extract information from various sources. I've started reading about
patterns, context-free grammars, and ontologies, which I think will be the
core of such a system. I intend to index computer shops.

Regards,

Cuong Hoang


Re: Information extraction

Erik Hatcher
Further on the information extraction idea, consider what the SIMILE  
team at MIT are doing... http://simile.mit.edu

The lower-case semantic web is gaining a lot of momentum these days,  
and I'm a strong proponent and student of it at the moment.  Scraping  
rich information from a site is certainly reasonably pragmatic, but  
it is also highly fragile.  SIMILE's Piggy Bank has a scraper  
facility.  In a more ideal world, computer shops, book stores,
libraries, and anyone with data to share would publish it in a  
reusable and structured way (RDF seems to me to be the best way to do  
this).  Merging a full-text search engine with structured  
information, though, is yet another tricky thing that I am myself  
working with at the moment.

I'd love to have more discussions along these lines.

     Erik



Re: Information extraction

chrislusf
My approach to tackling structured information is to use DBSight, which
creates Lucene indexes on data retrieved from any database.

As Erik mentioned, scraping is highly fragile. By going directly to the
database, we get more reliable, up-to-date, and flexible access to the data.
On the other hand, you will need database access, and this approach is
quite different from Nutch.

Alternatively, Nutch/Lucene could provide a simple XML analyzer that
consumes one specific XML format, with a pluggable XSL stylesheet
transforming any other XML structure into that format.
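
A minimal sketch of the XSL idea above, using only the JDK's built-in
javax.xml.transform API. The shop feed, the stylesheet, and the target
<docs>/<doc>/<field> layout are all hypothetical, made up for illustration;
neither Nutch nor Lucene defines this format, and the indexing step itself
is not shown.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XslNormalizer {
    // Hypothetical source feed from one shop (assumption, not a real format).
    static final String SHOP_XML =
        "<catalog><item><name>USB drive</name><cost>19.95</cost></item></catalog>";

    // Hypothetical per-source stylesheet that maps the feed above onto one
    // common field-based layout that a single analyzer could consume.
    static final String SHOP_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/catalog'>"
      + "<docs><xsl:apply-templates select='item'/></docs>"
      + "</xsl:template>"
      + "<xsl:template match='item'><doc>"
      + "<field name='title'><xsl:value-of select='name'/></field>"
      + "<field name='price'><xsl:value-of select='cost'/></field>"
      + "</doc></xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given stylesheet to the given XML and returns the result as a string. */
    static String normalize(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize(SHOP_XML, SHOP_XSL));
    }
}
```

With this split, supporting a new data source would only mean writing a new
stylesheet; the code that reads the common format stays unchanged.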

--
Chris Lu
---------------------
Full-Text Search on Any Database
http://www.dbsight.net



RE: Information extraction

Nick Lothian
A couple of useful papers:

(Quite well known)
Mining the peanut gallery: Opinion Extraction and Semantic
Classification of Product Reviews
http://www2003.org/cdrom/papers/refereed/p451/package/p451-dave.html 


(I'd never seen this before - about using Hidden Markov Models)
Information extraction from HTML product catalogues: coupling
quantitative and knowledge-based approaches
http://rainbow.vse.cz/dags05.pdf


Re: Information extraction

Jack.Tang
Hi Cuong

Thanks for the demo.
I agree that
    Information Extraction = Segmentation + Classification + Clustering +
Association.

I am going to extend HtmlParseFilter and do text mining on
parse.getText(). Is that a good approach?
Thoughts?
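
As a rough illustration of what text mining on the parsed page text could
look like, here is a plain-Java sketch covering three of the four steps
(segmentation, classification, association; clustering is left out). It
works on an ordinary String, such as the one parse.getText() would return.
The Nutch plugin wiring is deliberately not shown, because the exact
HtmlParseFilter signature depends on the Nutch version. The class name, the
price pattern, and the sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceExtractor {
    // Assumed price shape: "$", optional space, digits with optional
    // thousands separators, and exactly two decimals.
    private static final Pattern PRICE =
        Pattern.compile("\\$\\s?(\\d+(?:,\\d{3})*\\.\\d{2})");

    /** Segmentation: split the page text into trimmed, non-empty lines. */
    static List<String> segment(String text) {
        List<String> segments = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                segments.add(trimmed);
            }
        }
        return segments;
    }

    /**
     * Classification and association: a segment containing a price is
     * classified as a price line; any other segment is treated as a
     * candidate product name. Each price is associated with the nearest
     * preceding candidate name.
     */
    static Map<String, String> extract(String text) {
        Map<String, String> priceByName = new LinkedHashMap<>();
        String lastName = null;
        for (String segment : segment(text)) {
            Matcher m = PRICE.matcher(segment);
            if (m.find()) {
                if (lastName != null) {
                    priceByName.put(lastName, m.group(1));
                    lastName = null; // a name is paired with at most one price
                }
            } else {
                lastName = segment;
            }
        }
        return priceByName;
    }

    public static void main(String[] args) {
        // Hypothetical page text, standing in for the parser's output.
        String pageText =
            "Lucene in Action\n  Price: $44.95\n\nSome Other Book\nOur price $1,299.00\n";
        System.out.println(extract(pageText));
    }
}
```

A hand-written pattern like this is as fragile as any scraper: it only
works while names and prices appear on adjacent lines, which is the kind of
layout assumption a learned model (for example an HMM) is meant to relax.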

I'd also like to share a resource I am reading now:
http://www.ke.informatik.tu-darmstadt.de/lehre/ss05/web-mining/wm-ie.pdf

Regards
/Jack

On 7/26/05, Cuong Hoang <[hidden email]> wrote:

> Jack,
>
> So far, I found two demos online:
>
> http://eso.vse.cz/~labsky/cgi-bin/client/
> http://iit.demokritos.gr/skel/crossmarc/
>
> On these websites, there are several documents that may be useful. I don't
> think they will release the source code.
>
>
> Regards,
>
> Cuong Hoang
> -----Original Message-----
> From: Jack Tang [mailto:[hidden email]]
> Sent: Tuesday, 26 July 2005 8:29 PM
> To: [hidden email]
> Subject: Re: Information extraction
>
> Hi Matthias.
>
> The website is interesting, but is any documentation about the
> implementation available?
>
> Cuong.
> I notice a lot of papers mention that HMMs are great for information
> extraction, but I cannot find a single open-source demo :(
> What are your thoughts?
>
>
> Regards
> /Jack
>
>
> On 7/26/05, Matthias Jaekle <[hidden email]> wrote:
> > In the list of public nutch servers you find the following, which might
> > be interesting:
> > http://www.betherebesquare.com/
> >
> > Matthias
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars