get the position of matched word in the response

eli chen
Hi, I'm new to Solr, so please be patient.
How can I get the position of a matched word in the results?

And no, I'm not talking about highlighting the words. I'm talking about getting
the position of the word in the content.

I have a field, content, which I query with q=content:"some_word"

The content field is not stored, but it is
Indexed + Tokenized + Multivalued + TermVector Stored + Store Offset With
TermVector + Store Position With TermVector

Thanks for the help
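
Since the field above already stores term vectors with positions and offsets, one thing that may be worth trying is Solr's Term Vector Component, which can return that per-term data for each matching document. The sketch below only builds the request URL. The host, the core name `books`, and the `/tvrh` handler path are assumptions (the handler has to be configured in solrconfig.xml), and nobody in this thread has confirmed it fits the use case.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, core, field, word):
    """Build a Term Vector Component request for one field and one word."""
    params = {
        "q": f'{field}:"{word}"',   # same query form as in the post
        "fl": "id",                 # the field itself is not stored
        "tv": "true",               # enable the Term Vector Component
        "tv.fl": field,             # field whose term vectors are returned
        "tv.positions": "true",     # word positions of each term
        "tv.offsets": "true",       # character offsets of each term
    }
    # Assumption: a handler named /tvrh is defined for this core.
    return f"{base_url}/{core}/tvrh?{urlencode(params)}"


url = term_vector_url("http://localhost:8983/solr", "books", "content", "hello")
print(url)
```

The response lists every term of the field for each matching document, so the client would still have to pick out the entry for the searched word.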

Re: get the position of matched word in the response

Erick Erickson
Eli:

What problem are you trying to solve? There’s no really convenient way to do this that I know of, although it could be done, probably with some Lucene-level code.

This may be an XY problem, where you're asking how to do X (find the position of the matched word) because you think it’ll help solve some problem Y. What’s “Y”? Perhaps there’s an easier way to solve that problem if we knew what it was….

Best,
Erick

In reply to eli chen's message of Aug 4, 2019, 6:55 AM.

Re: get the position of matched word in the response

eli chen
Every content field is actually the content of a book.
So let's say someone searches for the word "hello" and I find this word in the
book "the story jungle" at position 199 (counting by word, not by character).

Now I can look in my database and check the OCR data for this word in this book
(and show a highlight on the page image, etc.).

My DB is roughly like this (simplified):

book      word     ocr
------    ------   -------
th....    199      1,1,1,1

That's the reason I need the offset of the word.

And by the way, the content field is just one big text_general field.

Thanks again

In reply to Erick Erickson's message of Sun, Aug 4, 2019 at 14:30.
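
A minimal sketch of the database lookup described in this message, assuming an in-memory SQLite table. The table name `word_ocr`, the column names, and the reading of the `ocr` value as four comma-separated integers are assumptions made for illustration. The sample row comes from the post.

```python
import sqlite3

# In-memory stand-in for the real database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_ocr ("
    " book TEXT, word_pos INTEGER, ocr TEXT,"
    " PRIMARY KEY (book, word_pos))"
)
conn.execute(
    "INSERT INTO word_ocr VALUES (?, ?, ?)",
    ("the story jungle", 199, "1,1,1,1"),
)


def ocr_for(book, word_pos):
    """Return the OCR values for one word position, or None if unknown."""
    row = conn.execute(
        "SELECT ocr FROM word_ocr WHERE book = ? AND word_pos = ?",
        (book, word_pos),
    ).fetchone()
    if row is None:
        return None
    return tuple(int(v) for v in row[0].split(","))


print(ocr_for("the story jungle", 199))  # (1, 1, 1, 1)
```

With this shape, the only thing the search layer has to supply is the (book, word position) pair.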

Re: get the position of matched word in the response

Erick Erickson
One approach: Payloads. You can store, with each word, an arbitrary amount of data. Of course the index is bigger….

Most of the examples use a single float, which could be all you need. You can store an arbitrary binary blob and encode/decode it however you want. Conceivably you could store the coordinates of the word, along with the position and not need to consult the DB at all.
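
To make the "arbitrary binary blob" idea concrete, here is a rough sketch of one way to pack a word position and an OCR bounding box into a few bytes and read them back. The byte layout is hypothetical and chosen only for illustration. It says nothing about how Solr itself stores or returns payloads.

```python
import struct

# Hypothetical layout, big-endian with no padding:
#   word position as unsigned 32-bit, then x, y, width, height
#   as unsigned 16-bit integers.
PAYLOAD_FORMAT = ">IHHHH"


def encode_payload(position, box):
    """Pack a word position and its OCR box into a binary blob."""
    x, y, width, height = box
    return struct.pack(PAYLOAD_FORMAT, position, x, y, width, height)


def decode_payload(blob):
    """Inverse of encode_payload."""
    position, x, y, width, height = struct.unpack(PAYLOAD_FORMAT, blob)
    return position, (x, y, width, height)


blob = encode_payload(199, (120, 340, 58, 22))
print(len(blob))             # 12
print(decode_payload(blob))  # (199, (120, 340, 58, 22))
```

At 12 bytes per token the index growth mentioned above becomes easy to estimate: a book of 100,000 words would carry roughly 1.2 MB of payload data.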

That said, be prepared to spend some time on this, it’s not necessarily an easy problem to solve. How many positions are you going to return? All of them in the document? How are you going to handle phrase queries? Highlight any individual word matches or only highlight the occurrences of all the words in the phrase together? For that matter, you’ll have to write some code to actually return the payloads with the results...

HTH,
Erick

In reply to eli chen's message of Aug 4, 2019, 7:45 AM.

Re: get the position of matched word in the response

Alexandre Rafalovitch
In reply to this post by eli chen
What happens if they search for "hello monkey" and match against
"hello my monkeys"? What should it return? Why does your database not
contain "hello" instead of 199?

I am asking because if your clients are truly searching for just one
word, then Solr may be overkill for you. Perhaps you are looking
for just "indexOf" within a string with a parallel offset->OCR data
structure. So, there is a hidden question in there of "why did you
choose Solr".
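
A rough sketch of the "indexOf with a parallel structure" idea, assuming whitespace tokenization and one OCR box per word. The sample text and the box values are made up for illustration.

```python
def word_positions(text, word):
    """Return the word-level positions at which `word` occurs in `text`."""
    tokens = text.lower().split()
    target = word.lower()
    return [i for i, tok in enumerate(tokens) if tok == target]


text = "hello my monkey says hello"

# Parallel structure: ocr_boxes[i] belongs to the i-th word of `text`.
ocr_boxes = [
    (10, 5, 40, 12),
    (55, 5, 20, 12),
    (80, 5, 60, 12),
    (145, 5, 35, 12),
    (185, 5, 40, 12),
]

hits = word_positions(text, "hello")
boxes = [ocr_boxes[i] for i in hits]
print(hits)   # [0, 4]
print(boxes)  # [(10, 5, 40, 12), (185, 5, 40, 12)]
```

This is enough for single-word lookups in one book. It has no notion of stemming, ranking, or phrases, which is where a search engine starts to earn its keep.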

Then, there is the point that Solr searches words/numbers/geospatial data
but returns documents. So, sometimes, you need to understand what a
"document" is for your business case, and transform your content to fit
that. E.g. if you are really just searching for one word, then maybe
you index your whole book as a bunch of documents, each containing a
word, its OCR offset information, and its book id. And if it is a couple
of words, maybe you have a secondary field with the context of that
sentence (in index-only form).
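
The one-document-per-word layout could be sketched like this. All field names (`id`, `book_id`, `word`, `word_pos`, `ocr_box`) are hypothetical, and a matching schema would be needed before a JSON array like this could be posted to Solr's update handler.

```python
import json


def word_documents(book_id, words_with_boxes):
    """Turn a book's (word, ocr_box) sequence into one document per word."""
    docs = []
    for pos, (word, box) in enumerate(words_with_boxes):
        docs.append({
            "id": f"{book_id}-{pos}",                 # unique per word occurrence
            "book_id": book_id,
            "word": word,
            "word_pos": pos,
            "ocr_box": ",".join(str(v) for v in box),
        })
    return docs


docs = word_documents(
    "book-1",
    [("hello", (1, 1, 1, 1)), ("my", (2, 1, 1, 1)), ("monkey", (3, 1, 1, 1))],
)
print(json.dumps(docs, indent=2))
```

A query for a word then returns documents that already carry the position and OCR data, at the cost of a much larger document count.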

Don't be afraid to abandon your first schema. Your business
requirement is different enough.

Regards,
   Alex.


In reply to eli chen's message of Sun, 4 Aug 2019 at 07:46.

Re: get the position of matched word in the response

eli chen
Thanks.
Of course they search for phrases.
If they searched for "hello monkey" and Solr found "hello my monkey", I
want to get the positions of "hello" and "monkey" (the words they actually
typed in the search).

By the way, thank you all, but I found
https://github.com/dbmdz/solr-ocrhighlighting, which I think can help me a
lot. I'll also check the payload thing (I'm new to Solr).

In reply to Alexandre Rafalovitch's message of Sun, Aug 4, 2019 at 15:40.

Re: get the position of matched word in the response

Erick Erickson
I think you’re missing a nuance.

It’s always a little confusing when people use quotes when talking about
searching because in Solr double quotes are a very specific form of query, i.e. a
phrase query, which means the words must appear within some distance of
each other (i.e. the ‘slop’).

In Solr, a phrase can allow some intervening words. For instance, if you specify
"hello monkey", it would _not_ match ‘hello my monkey’ because
the word ‘my’ is in the middle. If you specify "hello monkey"~2,
then you’d match text containing ‘hello my monkey’ or ‘hello my little monkey’,
but not ‘hello my little green monkey’, since there are more than 2 words
between ‘hello’ and ‘monkey’ in the example that doesn’t match.
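
The in-order examples above can be checked with a toy model. This is a deliberately simplified sketch, not Lucene's actual slop algorithm: it handles only two terms appearing in query order, and treats slop as the maximum number of words allowed between them.

```python
def matches_in_order(text, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` words between."""
    tokens = text.lower().split()
    firsts = [i for i, tok in enumerate(tokens) if tok == first]
    seconds = [i for i, tok in enumerate(tokens) if tok == second]
    # s - f - 1 is the number of words sitting between the two terms.
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


for text in ("hello my monkey",
             "hello my little monkey",
             "hello my little green monkey"):
    print(text,
          matches_in_order(text, "hello", "monkey", slop=0),
          matches_in_order(text, "hello", "monkey", slop=2))
```

With slop 0 none of the three texts match. With slop 2 the first two match and the third does not.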

Contrast that with boolean searches: e.g. (hello AND monkey) would match all
of the examples. As long as both words appear anywhere in the
field, you’d get a match no matter how many intervening words there are.

And I haven’t even talked about order. "hello monkey" would not match
‘monkey hello’ unless you specified a slop of at least 2, because swapping
two adjacent words counts as two position moves (it’s a long story).

So just having the positions doesn’t really solve the problem. If you
searched for the phrase "hello monkey" _as a phrase_, and the text
contained ‘hello monkey, it is a long story why you would want
to say hello to a monkey’, how would you want to highlight? The
intent is only for the first two words to be highlighted, but just having
the positions of all the ‘hello’ and ‘monkey’ tokens in the text would
lead you to highlight all 4 tokens…
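
A small sketch of that difference, using a simple regex tokenizer as a stand-in for the real analysis chain: collecting every position of the individual terms yields four hits, while requiring the terms to be adjacent and in order leaves only the first pair.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real analyzer."""
    return re.findall(r"\w+", text.lower())


def term_positions(tokens, terms):
    """Positions of every token that equals any of the given terms."""
    return [i for i, tok in enumerate(tokens) if tok in terms]


def exact_phrase_positions(tokens, phrase):
    """Positions of each occurrence of `phrase` as adjacent, ordered tokens."""
    n = len(phrase)
    return [
        list(range(i, i + n))
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    ]


tokens = tokenize(
    "hello monkey, it is a long story why you would want "
    "to say hello to a monkey"
)
print(term_positions(tokens, {"hello", "monkey"}))            # [0, 1, 13, 16]
print(exact_phrase_positions(tokens, ["hello", "monkey"]))    # [[0, 1]]
```

Highlighting from the first list would mark all four tokens. Highlighting from the second marks only the phrase the user searched for.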

FWIW,
Erick

In reply to eli chen's message of Aug 4, 2019, 9:52 AM.