Can I use Lucene to solve this problem?

Josh Rehman
My organization is looking to solve a difficult problem, and I believe that
Lucene is a close fit (although perhaps it is not). However I'm not sure
exactly how to approach this problem.

The problem is this: given a small set of fixed noun phrases and a much
larger set of human-generated short sentences, determine whether the
sentences refer to those noun phrases. For example, perhaps I have these
noun phrases:

   1. Bright yellow book
   2. Large bulbous balloon
   3. Green plaid shirt with stripes
   4. Dark yellow book

And these sentences:

   1. Yesterday I put on my green plaid shirt.
   2. Next week I'll sell my balloon.
   3. Just finished my bright book.
   4. Wondering at how lovely my baloon is [Note the misspelling]

Given that list of sentences, I will generate (sentence, noun phrase)
ordered pairs like this:
1,3
2,2
3,1
4,2

Or even an ordered pair of (sentence, [noun phrases]). E.g. 3,[1,4] (because
there might be an ambiguous reference to "Book")
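
As a point of reference for the output format described above, here is a minimal plain-Java sketch (no Lucene) that scores each sentence by how many distinct tokens it shares with each noun phrase and returns the best-scoring phrase numbers. The class and method names (`PhraseMatcher`, `tokens`, `bestPhrases`) are hypothetical and not taken from the gist, and the tokenization rule is an assumption. It is only a baseline for comparison: it has no stop-word handling, so a sentence containing "with" would match phrase 3, and it cannot see through the misspelled "baloon".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PhraseMatcher {
    // Lower-case the text and split on anything that is not a letter or digit.
    static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Returns the 1-based numbers of the noun phrases that share the most
    // distinct tokens with the sentence; empty if no phrase shares a token.
    static List<Integer> bestPhrases(String sentence, List<String> phrases) {
        Set<String> sentenceTokens = tokens(sentence);
        int best = 0;
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < phrases.size(); i++) {
            Set<String> shared = tokens(phrases.get(i));
            shared.retainAll(sentenceTokens);
            int score = shared.size();
            if (score == 0) {
                continue;
            }
            if (score > best) {
                best = score;
                result.clear();
            }
            if (score == best) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "Bright yellow book",
                "Large bulbous balloon",
                "Green plaid shirt with stripes",
                "Dark yellow book");
        List<String> sentences = Arrays.asList(
                "Yesterday I put on my green plaid shirt.",
                "Next week I'll sell my balloon.",
                "Just finished my bright book.",
                "Wondering at how lovely my baloon is");
        // Print (sentence, [noun phrases]) pairs, one per line.
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println((i + 1) + "," + bestPhrases(sentences.get(i), phrases));
        }
    }
}
```

On the four example sentences this baseline pairs sentences 1 to 3 with phrases 3, 2 and 1, and returns an empty list for sentence 4 because of the misspelling. A sentence that mentions only "book" comes back with both phrases 1 and 4.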

The "shape" of this problem looks a lot like what Lucene does, but frankly I
don't have a lot of experience with textual indexing and search. I've
installed Lucene and managed to index and search my data structures; however,
with the StandardAnalyzer I'm getting a lot of false positives.

Here is the code I have so far (I've elided the parsing code which is not
very interesting):
  https://gist.github.com/1150723

Really appreciate any and all guidance. Thanks.
Re: Can I use Lucene to solve this problem?

Ian Lea
Certainly sounds doable in Lucene.  Is it basically working apart from
the false positives?  Can you give some examples of the false positives?

I'd be tempted to look at span queries, which let you say that
"Yesterday I put on my green plaid shirt" is a better match against
"Green plaid shirt with stripes" than "a plaid shirt that is green"
would be, if that is what you want. See
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ for
good info on span queries.
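
To make the ordering and proximity idea concrete, here is a small plain-Java sketch. It does not use Lucene's span query classes; it only mimics the kind of check an ordered, "sloppy" proximity match performs. The names `OrderedProximity`, `minGaps` and `matches` are hypothetical, and the definition of slop used here (the number of extra tokens inside the matched window) is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

class OrderedProximity {
    // Smallest number of extra tokens inside a window in which all terms
    // occur in the given order, or -1 if they never occur in that order.
    // Assumes terms is non-empty.
    static int minGaps(List<String> tokens, List<String> terms) {
        int best = -1;
        for (int start = 0; start < tokens.size(); start++) {
            if (!tokens.get(start).equals(terms.get(0))) {
                continue;
            }
            int pos = start;
            int matched = 1;
            // Greedily take the earliest later occurrence of each next term.
            while (matched < terms.size()) {
                pos++;
                while (pos < tokens.size() && !tokens.get(pos).equals(terms.get(matched))) {
                    pos++;
                }
                if (pos >= tokens.size()) {
                    break;
                }
                matched++;
            }
            if (matched == terms.size()) {
                int gaps = (pos - start + 1) - terms.size();
                if (best < 0 || gaps < best) {
                    best = gaps;
                }
            }
        }
        return best;
    }

    // True if the terms occur in order with at most `slop` extra tokens.
    static boolean matches(List<String> tokens, List<String> terms, int slop) {
        int gaps = minGaps(tokens, terms);
        return gaps >= 0 && gaps <= slop;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("green", "plaid", "shirt");
        List<String> a = Arrays.asList("yesterday", "i", "put", "on", "my", "green", "plaid", "shirt");
        List<String> b = Arrays.asList("a", "plaid", "shirt", "that", "is", "green");
        System.out.println(matches(a, terms, 1));
        System.out.println(matches(b, terms, 1));
    }
}
```

With the terms in the order "green", "plaid", "shirt", the first sentence matches with no gaps, while "a plaid shirt that is green" does not match at all because the terms never appear in that order.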

As for misspellings, that is a separate issue.  Google "lucene
spellcheck".  Or look at synonyms if you've got a list of alternatives.
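
One common way to catch a misspelling such as "baloon" is to compare tokens by edit distance and accept near misses. The following is a plain-Java sketch of Levenshtein distance, not Lucene's spellchecker; the class name `EditDistance` is hypothetical, and whether a distance of 1 should count as a match is an assumption that would need tuning on real data.

```java
class EditDistance {
    // Classic dynamic-programming Levenshtein distance: the minimum number
    // of single-character insertions, deletions and substitutions needed
    // to turn a into b. Uses two rolling rows instead of a full matrix.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "baloon" is one inserted letter away from "balloon".
        System.out.println(levenshtein("baloon", "balloon"));
    }
}
```

A matcher could treat two tokens as equal when their distance is at most 1, at the cost of some extra false positives on short words.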


--
Ian.


On Wed, Aug 17, 2011 at 4:03 AM, Josh Rehman <[hidden email]> wrote:

> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Can I use Lucene to solve this problem?

Federico Fissore
In reply to this post by Josh Rehman
On 17/08/2011 05:03, Josh Rehman wrote:
> My organization is looking to solve a difficult problem, and I believe that
> Lucene is a close fit (although perhaps it is not). However I'm not sure
> exactly how to approach this problem.
>
[...]


Maybe use semantic vectors? [0]

We've played around with it for a while but never had the time to put it in
production. Basically, you search the vector index for each of your
sentences and get back a set of vectors (the noun phrases). The hard
part, imho, is finding a threshold (if one exists) that says, e.g., that
(1,1) are too distant while (1,3) are close enough.
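
The threshold question can be illustrated with cosine similarity, a common way to compare such vectors. This is a plain-Java sketch that does not use the Semantic Vectors API; the class name `CosineThreshold`, the vectors and the threshold value 0.7 are all made up for illustration, and a real threshold would have to be found empirically.

```java
class CosineThreshold {
    // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A (sentence, noun phrase) pair is accepted when the similarity of
    // their vectors reaches the threshold.
    static boolean closeEnough(double[] sentence, double[] phrase, double threshold) {
        return cosine(sentence, phrase) >= threshold;
    }

    public static void main(String[] args) {
        double[] sentence = {1.0, 1.0};   // made-up sentence vector
        double[] phraseA = {1.0, 0.0};    // made-up noun phrase vector
        double[] phraseB = {-1.0, 1.0};   // made-up noun phrase vector
        System.out.println(closeEnough(sentence, phraseA, 0.7));
        System.out.println(closeEnough(sentence, phraseB, 0.7));
    }
}
```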

fede

[0] https://code.google.com/p/semanticvectors/

Re: Can I use Lucene to solve this problem?

Alexander Aristov
In reply to this post by Ian Lea
Hi

Look at the Apache Mahout project (based on Hadoop). It seems you need
machine learning algorithms.

Best Regards
Alexander Aristov


On 17 August 2011 20:39, Ian Lea <[hidden email]> wrote:

> [...]