use Lucene to index sentences

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

use Lucene to index sentences

AJ Chen-2
I'll appreciate any advice on whether Lucene is appropriate for index/search
sentences.  I have millions of documents broken down into millions of
sentences. Each sentence does not exist as a document.  All these sentences
are in a small number of big files. How can I use Lucene to index/search the
sentences? Search will return which sentence matches the query.  If Lucene
does not do it, any better approach besides using mysql database?

Thanks,
AJ
Reply | Threaded
Open this post in threaded view
|

Re: use Lucene to index sentences

Marc Hadfield
Hi AJ -

Depending on your need, you could create a lucene document for each
sentence (in which case searching and returning sentences is trivial),
or create a lucene document for each of your documents, with embedded
sentence start/stop markers (as a special symbol).  or, instead of a
special symbol, you can increase the token count after each
end-of-sentence so that there is a large gap inbetween sentences -- this
will give higher scores to intra-sentence matches.

if you insert special sentence marker symbols, then you could use a span
search to guarantee that a phrase happens inside a sentence.  when a
match occurs, you can use the document's termpositionvector object to
re-create the original sentence, or alternatively, use the embedded
sentence number in lucene (perhaps symbols like "__sentence_start" and
"__sentence_num_20") to grab the original sentence from a file
containing the full text with sentence markers (perhaps xml tags:
"<sentence num=20>").

I use the techniques such as the above for a very large lucene index of
documents with embedded sentence markers.  There are various trade-offs
in terms of index size (how much info to keep in index), expected query
performance, and so on.

---marc hadfield



AJ Chen wrote:

>I'll appreciate any advice on whether Lucene is appropriate for index/search
>sentences.  I have millions of documents broken down into millions of
>sentences. Each sentence does not exist as a document.  All these sentences
>are in a small number of big files. How can I use Lucene to index/search the
>sentences? Search will return which sentence matches the query.  If Lucene
>does not do it, any better approach besides using mysql database?
>
>Thanks,
>AJ
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: use Lucene to index sentences

AJ Chen-2
Hi Marc,
Thanks for your suggestions. Marking sentences in documents and using span
query is a good approach.  How do you compare its performance to a database
approach? For example, sentences can be stored in mysql, one sentence per
row, and they can be searched by mysql's full text search feature. Using
database, it will be also easy to tell which document the matched sentence
belongs to.

AJ

On 2/6/06, Marc Hadfield <[hidden email]> wrote:

>
> Hi AJ -
>
> Depending on your need, you could create a lucene document for each
> sentence (in which case searching and returning sentences is trivial),
> or create a lucene document for each of your documents, with embedded
> sentence start/stop markers (as a special symbol).  or, instead of a
> special symbol, you can increase the token count after each
> end-of-sentence so that there is a large gap inbetween sentences -- this
> will give higher scores to intra-sentence matches.
>
> if you insert special sentence marker symbols, then you could use a span
> search to guarantee that a phrase happens inside a sentence.  when a
> match occurs, you can use the document's termpositionvector object to
> re-create the original sentence, or alternatively, use the embedded
> sentence number in lucene (perhaps symbols like "__sentence_start" and
> "__sentence_num_20") to grab the original sentence from a file
> containing the full text with sentence markers (perhaps xml tags:
> "<sentence num=20>").
>
> I use the techniques such as the above for a very large lucene index of
> documents with embedded sentence markers.  There are various trade-offs
> in terms of index size (how much info to keep in index), expected query
> performance, and so on.
>
> ---marc hadfield
>
>
>
> AJ Chen wrote:
>
> >I'll appreciate any advice on whether Lucene is appropriate for
> index/search
> >sentences.  I have millions of documents broken down into millions of
> >sentences. Each sentence does not exist as a document.  All these
> sentences
> >are in a small number of big files. How can I use Lucene to index/search
> the
> >sentences? Search will return which sentence matches the query.  If
> Lucene
> >does not do it, any better approach besides using mysql database?
> >
> >Thanks,
> >AJ
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: use Lucene to index sentences

Marc Hadfield

Hi AJ -
Performance would depend on the kind of queries you are going to perform
against sentences.  If you are going to be querying for phrases
(multi-token), want to make use of stemming, or any kind of term
expansion (wildcare, synonyms, etc), I imagine lucene would be much
superior, but I don't have experience with mysql full text search.
---marc

AJ Chen wrote:

>Hi Marc,
>Thanks for your suggestions. Marking sentences in documents and using span
>query is a good approach.  How do you compare its performance to a database
>approach? For example, sentences can be stored in mysql, one sentence per
>row, and they can be searched by mysql's full text search feature. Using
>database, it will be also easy to tell which document the matched sentence
>belongs to.
>
>AJ
>
>On 2/6/06, Marc Hadfield <[hidden email]> wrote:
>  
>
>>Hi AJ -
>>
>>Depending on your need, you could create a lucene document for each
>>sentence (in which case searching and returning sentences is trivial),
>>or create a lucene document for each of your documents, with embedded
>>sentence start/stop markers (as a special symbol).  or, instead of a
>>special symbol, you can increase the token count after each
>>end-of-sentence so that there is a large gap inbetween sentences -- this
>>will give higher scores to intra-sentence matches.
>>
>>if you insert special sentence marker symbols, then you could use a span
>>search to guarantee that a phrase happens inside a sentence.  when a
>>match occurs, you can use the document's termpositionvector object to
>>re-create the original sentence, or alternatively, use the embedded
>>sentence number in lucene (perhaps symbols like "__sentence_start" and
>>"__sentence_num_20") to grab the original sentence from a file
>>containing the full text with sentence markers (perhaps xml tags:
>>"<sentence num=20>").
>>
>>I use the techniques such as the above for a very large lucene index of
>>documents with embedded sentence markers.  There are various trade-offs
>>in terms of index size (how much info to keep in index), expected query
>>performance, and so on.
>>
>>---marc hadfield
>>
>>
>>
>>AJ Chen wrote:
>>
>>    
>>
>>>I'll appreciate any advice on whether Lucene is appropriate for
>>>      
>>>
>>index/search
>>    
>>
>>>sentences.  I have millions of documents broken down into millions of
>>>sentences. Each sentence does not exist as a document.  All these
>>>      
>>>
>>sentences
>>    
>>
>>>are in a small number of big files. How can I use Lucene to index/search
>>>      
>>>
>>the
>>    
>>
>>>sentences? Search will return which sentence matches the query.  If
>>>      
>>>
>>Lucene
>>    
>>
>>>does not do it, any better approach besides using mysql database?
>>>
>>>Thanks,
>>>AJ
>>>
>>>
>>>
>>>      
>>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: [hidden email]
>>For additional commands, e-mail: [hidden email]
>>
>>
>>    
>>
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]