Question about BytesRef and BinaryDocValues

Question about BytesRef and BinaryDocValues

Kevin Manuel
Hi,

I'm using Lucene version 4.3.1 and I've implemented a custom score query.
I'm trying to read the value for a field from the field cache. It's a text
field, so I'm using getTerms, which returns a BinaryDocValues object.

However, when I get the BytesRef object for a document and convert it to a
string using utf8ToString, I think the characters after a whitespace are
not being returned in the string. For instance, if the field has 'hey tom',
the string only returns 'hey'.

I tried this with version 4.10.0 too and I see the same thing. I was
wondering if there's something wrong with the way I'm accessing it or if it
was an issue in these versions.

Thanks,
Kevin

Re: Question about BytesRef and BinaryDocValues

Vadim Gindin
Hi Kevin!

I think your field is "analyzed", so the field value is split into
two terms, "hey" and "tom", and a docvalue is written for each of them.

Regards
Vadim Gindin


Re: Question about BytesRef and BinaryDocValues

Kevin Manuel
Hi Vadim,

Thank you so much for your reply. I think you were right.

So if a field is 'analyzed' how can I get both terms 'hey' and 'tom'?

Thanks,
Kevin

Re: Question about BytesRef and BinaryDocValues

Vadim Gindin
Kevin, the sequence is the following: get the terms for the field, get the
postings for a term, and then get the payload from the postings. Read a little
about the inverted index structure and it will become clearer.
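
To make the terms-then-postings lookup described in this message concrete, here is a minimal toy sketch in plain Java. It is not Lucene's API: the class ToyInvertedIndex, its methods, and the whitespace "analysis" are all made up for illustration, and payloads are left out. It only shows why an analyzed value such as 'hey tom' is stored as separate terms, and how the terms of one document can be recovered by walking the postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a hypothetical class, NOT part of Lucene.
class ToyInvertedIndex {
    // term -> (docId -> positions of that term inside the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // "Analyze" the field value by splitting on whitespace (an assumption
    // standing in for a real analyzer), then record every token.
    void addDocument(int docId, String fieldValue) {
        String trimmed = fieldValue.trim();
        if (trimmed.isEmpty()) {
            return; // nothing to index
        }
        String[] tokens = trimmed.split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Walk every term, keep those whose postings contain docId,
    // and return them ordered by their position in the document.
    List<String> termsForDocument(int docId) {
        Map<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> entry : postings.entrySet()) {
            List<Integer> positions = entry.getValue().get(docId);
            if (positions != null) {
                for (int pos : positions) {
                    byPosition.put(pos, entry.getKey());
                }
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "hey tom");
        index.addDocument(1, "hey there");
        System.out.println(index.termsForDocument(0)); // prints [hey, tom]
    }
}
```

The point of the sketch is that the index is keyed by term, not by document, so there is no single stored 'hey tom' string to read back for an analyzed field.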

Your Query creates a Weight, which must create a Scorer in its
scorer(context) method. The scheme could be the following:

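// Inside your Weight subclass. "field" is the field name and "t" is the
// Term being scored; both are members of your own class.
public Scorer scorer(LeafReaderContext context) throws IOException {
    // Terms of the field in this segment; null if the field is absent.
    Terms fieldTerms = context.reader().terms(field);
    if (fieldTerms == null) {
        return null;
    }
    TermsEnum te = fieldTerms.iterator();
    // Position the enum on the term, if this segment contains it.
    if (te.seekExact(t.bytes())) {
        // Request everything, including positions and payloads.
        PostingsEnum postingsEnum = te.postings(null, PostingsEnum.ALL);
        // CustomFieldScorer is your own Scorer implementation.
        return new CustomFieldScorer(postingsEnum);
    }
    return null;
}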

After that, you read the payload in CustomFieldScorer.score() in
the following way:

// postingsEnum must already be positioned on the current document.
postingsEnum.nextPosition();
BytesRef payload = postingsEnum.getPayload();


Regards,

Vadim Gindin

