Next Word - Any Suggestions?

Christopher Ball-3
I am about to implement a custom query that is sort of a mash-up of Facets,
Highlighting, and SpanQuery, but thought I'd see if anyone has done
anything similar.

In simple terms, I need to facet on the next word given a target word.

For example, suppose my index had only the following 5 documents (each
consisting of a single sentence):

Doc 1 - The quick brown fox jumped over the fence.
Doc 2 - The sly fox skipped over the fence.
Doc 3 - The fat fox skipped his afternoon class.
Doc 4 - A brown duck and red fox, crashed the party.
Doc 5 - Charles Brown! Fox! Crashed my damn car.

The query should give the frequency of each distinct term that immediately
follows the word "fox":

skipped - 2
crashed - 2
jumped - 1

Long-term, I would like to do the opposite as well: the frequency of each
distinct term that immediately precedes the word "fox":

brown - 2
sly - 1
fat - 1
red - 1

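As a reference point for the expected numbers, here is a minimal brute-force
sketch in plain Python. It is not Lucene or Solr code: `tokenize` and
`neighbor_counts` are hypothetical helper names, and the tokenizer (lowercase,
strip punctuation) is an assumption about how the field would be analyzed.

```python
import re
from collections import Counter

DOCS = [
    "The quick brown fox jumped over the fence.",
    "The sly fox skipped over the fence.",
    "The fat fox skipped his afternoon class.",
    "A brown duck and red fox, crashed the party.",
    "Charles Brown! Fox! Crashed my damn car.",
]

def tokenize(text):
    # Assumed analysis: lowercase, keep only runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def neighbor_counts(docs, target, offset=1):
    """Count the terms `offset` positions away from each occurrence of `target`.

    offset=1 gives the next word, offset=-1 the previous word.
    """
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            j = i + offset
            if tok == target and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

print(neighbor_counts(DOCS, "fox", 1).most_common())
# [('skipped', 2), ('crashed', 2), ('jumped', 1)]
print(neighbor_counts(DOCS, "fox", -1).most_common())
# [('brown', 2), ('sly', 1), ('fat', 1), ('red', 1)]
```

This counts every occurrence of the target. Here each document contains "fox"
once, so per-occurrence and per-document counts are the same; on real data
the two would differ and one would have to be chosen.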
 

My guess is that either the FastVectorHighlighter or SpanQuery would be a
reasonable starting point. I was hoping to take advantage of term vectors, as
I am storing termVectors, termPositions, and termOffsets for the field in
question.

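If positions are stored with the term vector, one way to picture the lookup is
to invert a document's term-to-positions map into position-to-term and read
off the neighbor of each "fox" position. The sketch below is plain Python, not
the Lucene API: the dict is a hypothetical stand-in for one document's term
vector, `neighbors_from_term_vector` is an illustrative name, and it assumes
at most one term per position.

```python
from collections import Counter

def neighbors_from_term_vector(term_vector, target, offset=1):
    """term_vector maps each term to the list of positions where it occurs."""
    # Invert term -> positions into position -> term.
    by_position = {
        pos: term
        for term, positions in term_vector.items()
        for pos in positions
    }
    counts = Counter()
    for pos in term_vector.get(target, []):
        neighbor = by_position.get(pos + offset)
        # A missing neighbor means a document boundary or a position gap.
        if neighbor is not None:
            counts[neighbor] += 1
    return counts

# Stand-in term vector for Doc 4: "A brown duck and red fox, crashed the party."
doc4 = {
    "a": [0], "brown": [1], "duck": [2], "and": [3], "red": [4],
    "fox": [5], "crashed": [6], "the": [7], "party": [8],
}

print(neighbors_from_term_vector(doc4, "fox", 1))   # Counter({'crashed': 1})
print(neighbors_from_term_vector(doc4, "fox", -1))  # Counter({'red': 1})
```

Summing these per-document counters over the matching documents would give
the facet counts. If the analysis chain leaves position gaps (for example
where stop words were removed), the neighboring position may be empty and the
occurrence is skipped here.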
 

Grateful for any thoughts . . . reference implementations . . . words of
encouragement . . . free beer - whatever you can offer.

 

Gracias,

Christopher

Re: Next Word - Any Suggestions?

Sean O'Connor
Hi Christopher,
     I am working my way through trying to implement SpanQueries in Solr
(svn trunk). Given my lack of progress, I am skeptical that I can help
much, but I would be happy to try.

     I imagine you have already found (either before your message or
after posting it) Grant's overview of Lucene, SpanQuery, and
WindowTermVectorMapper:
  http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/

     I'd be interested in hearing about your progress.
Good luck,

Sean



On 10/26/2010 08:26 AM, Christopher Ball wrote:

> Am about to implement a custom query that is sort of mash-up of Facets,
> Highlighting, and SpanQuery - but thought I'd see if anyone has done
> anything similar.

Re: Next Word - Any Suggestions?

Sean O'Connor
Hi Christopher,
     One option comes to mind: shingles?

     I have not done anything with them yet, but they are on my radar for
about a month from now. Speaking unencumbered by experience or
substantial understanding, my guess is that shingles would work well for
you if you can select shingles with something like a terms prefix.

     AFAIU: Shingling[1] basically takes a number of adjacent terms/words
and combines them into a single token. You could set the (max) shingle
size to 2, and then find some way to use the TermsComponent on the
shingled field with a prefix, potentially:
http://wiki.apache.org/solr/TermsComponent
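
The shingle-plus-prefix idea can be sketched in plain Python to see what it
would return for the five example documents. This is not Solr code:
`bigram_shingles` and `prefix_counts` are hypothetical names, the tokenizer is
an assumed analysis chain, and joining the two words with a single space is an
assumption about how the shingled tokens would look in the index.

```python
import re
from collections import Counter

DOCS = [
    "The quick brown fox jumped over the fence.",
    "The sly fox skipped over the fence.",
    "The fat fox skipped his afternoon class.",
    "A brown duck and red fox, crashed the party.",
    "Charles Brown! Fox! Crashed my damn car.",
]

def bigram_shingles(text):
    # Assumed analysis: lowercase, strip punctuation, join adjacent word pairs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def prefix_counts(docs, prefix):
    """Count, per shingle starting with `prefix`, the documents containing it."""
    counts = Counter()
    for doc in docs:
        # set() so that each shingle is counted once per document.
        for shingle in set(bigram_shingles(doc)):
            if shingle.startswith(prefix):
                counts[shingle] += 1
    return counts

# Note the trailing space, so that "fox " does not also match e.g. "foxes ...".
for shingle, count in prefix_counts(DOCS, "fox ").most_common():
    print(shingle, "-", count)
```

With the prefix "fox " this yields "fox skipped" and "fox crashed" with a count
of 2 each and "fox jumped" with 1. The previous-word case does not map onto a
prefix lookup directly, since the target is then the second word of the
shingle.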

     I'm interested in what you find out, so please post back if you
find something outside the mailing list.
Thanks,

Sean


[1] see something like:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=%28shingle%29,
but the Solr 1.4 Enterprise Search Server book is well worth the money,
and I believe there is an ebook version for $10-20.
