Recommendation for doing a search plus collecting extra information?

Trejkaz
Hi all.

I have a situation where I want to look up some DocValues for each hit
in the search.

I have a few ways I could go about this:

    1. Use search() as normal and then iterate the hits at the end to
collect the values (easiest?)

    2. Use TopScoreDocCollector, TopFieldCollector, etc. as-is and
add my own collector to run alongside them. (Only complication seems
to be that these are no longer convenient to use, because it appears
that you now have to use a CollectorManager?)

    3. Try to extend TopScoreDocCollector, TopFieldCollector, etc. to
return subclasses of TopDocs which already have the information in
them.

    4. Forget about all these pre-existing collectors and write my own
collector that implements search from scratch and just collects only
the information we actually want. (In this particular case, we don't
care about docId, because aside from fetching the stable ID, there is
nothing we use this for up-front when doing a search. Removing it from
the API would be beneficial for us because it would stop people being
tempted to use the doc ID and therefore introduce bugs.)
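
Option 1 boils down to resolving each hit's top-level doc ID to a segment and a segment-local doc ID, then reading that segment's value. A minimal plain-Java model of that resolution follows. It uses no Lucene types: `PerHitLookup`, `docBases` and `valuesPerSegment` are hypothetical stand-ins for the reader's leaves and their doc values, and the sketch assumes there are no empty segments.

```java
import java.util.Arrays;

/**
 * Plain-Java model of option 1: after the search, resolve each hit's
 * top-level doc ID to a segment and read that segment's value.
 * No Lucene types are used; both arrays are hypothetical stand-ins.
 */
final class PerHitLookup {
    private final int[] docBases;            // first top-level doc ID of each segment, ascending
    private final long[][] valuesPerSegment; // stand-in for each segment's doc values

    PerHitLookup(int[] docBases, long[][] valuesPerSegment) {
        this.docBases = docBases;
        this.valuesPerSegment = valuesPerSegment;
    }

    /** Index of the segment containing docId (assumes no empty segments). */
    int segmentOf(int docId) {
        int idx = Arrays.binarySearch(docBases, docId);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // owning segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Value for one top-level doc ID. */
    long valueFor(int docId) {
        int seg = segmentOf(docId);
        return valuesPerSegment[seg][docId - docBases[seg]];
    }

    /** Values for the hits, in the order the hits were given. */
    long[] valuesFor(int[] hitDocIds) {
        long[] out = new long[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            out[i] = valueFor(hitDocIds[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        PerHitLookup lookup = new PerHitLookup(
            new int[] {0, 3, 5},
            new long[][] {{10, 11, 12}, {20, 21}, {30, 31, 32}});
        // prints [32, 10, 21]
        System.out.println(Arrays.toString(lookup.valuesFor(new int[] {7, 0, 4})));
    }
}
```

In Lucene 5.x the same resolution is available through ReaderUtil.subIndex and each leaf's docBase. The lookup has to go through the same reader the search ran against.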

The value we want to fetch is essentially our stable replacement for
docId, so I figure other people's applications would have gone through
this already. What did everyone else do?

TX

P.S. My original workaround was to delay it until someone asks for the
hit, but if you don't get it from the exact same reader you did the
search with, you will get the wrong value sometimes. And of course, we
can't keep the reader around forever, because we have no idea when the
caller will stop using the search results object.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Recommendation for doing a search plus collecting extra information?

Erick Erickson
This may be an "XY" problem: you're asking how to do X, thinking
it will solve Y, without telling us what Y is.

What do you want to _do_ with the DV values you look up for each hit?


Best,
Erick

Re: Recommendation for doing a search plus collecting extra information?

Trejkaz
On Thu, Oct 8, 2015 at 1:16 PM, Erick Erickson <[hidden email]> wrote:
> This may be an "XY" problem, you're asking how to do X thinking
> it will solve Y without telling us what Y is.
>
> What do you want to _do_ with the DV values you look up for each hit?

Keep them around as the ID to use to look up information later. i.e.,
what we used to do with the doc ID before Lucene decided the doc ID
wouldn't be stable.

e.g., the search happens at some point, and then later you want to
render a row of a table, so you want to fetch the document. But you
can't use the doc ID to do that, so we use another ID which we map
back to the doc ID once we have a reader for that operation.

TX

Re: Recommendation for doing a search plus collecting extra information?

Erick Erickson
First off, the internal Lucene doc ID has never been stable as long as any
segment merging of whatever style was going on; I don't quite know
where you're getting that idea.

It sounds like what you're really looking for is to export complete result
sets to "do something with them later". That's what the export capability
was built for (Solr 4.10 and later). See:
https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
Just make your Solr ID (<uniqueKey> or whatever) a DV field and
export.

Best,
Erick

Re: Recommendation for doing a search plus collecting extra information?

Trejkaz
On Thu, Oct 8, 2015 at 1:48 PM, Erick Erickson <[hidden email]> wrote:

> First off, the internal Lucene doc ID has never been stable as long as any
> segment merging of whatever style was going on, don't quite know
> where you're getting that idea.
>
> It sounds like what you're really looking for is to export complete result
> sets to "do something with them later". That's what the export capability
> was built for (Solr 4.10 and later). See:
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> Just make your Solr ID (<uniqueKey> or whatever) a DV field and
> export..

We don't use Solr and aren't particularly planning to start doing so.

TX

Re: Recommendation for doing a search plus collecting extra information?

Erick Erickson
Oops, wrong list. Then I'm clueless.

Re: Recommendation for doing a search plus collecting extra information?

Alan Woodward-2
Hi Trejkaz,

You can still use a standard collector if you don’t need to worry about multi-threaded search.  It sounds as though what you want to do is implement your own Collector that will read and record docvalues hits, and use MultiCollector to wrap it and a standard TopDocsCollector together.
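
The shape of that suggestion can be modelled in plain Java, so that it runs without Lucene on the classpath: one wrapper forwards every hit both to a ranking collector and to a recorder. All the names here (`HitSink`, `FanOut`, `ValueRecorder`, `BestHit`) are hypothetical stand-ins for Lucene's Collector, MultiCollector and a custom collector. This is a sketch of the idea, not Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToLongFunction;

/** Hypothetical stand-in for a collector: receives every matching doc. */
interface HitSink {
    void collect(int doc, float score);
}

/** Forwards each hit to several sinks, the way a multi-collector wrapper would. */
final class FanOut implements HitSink {
    private final HitSink[] sinks;

    FanOut(HitSink... sinks) {
        this.sinks = sinks;
    }

    @Override
    public void collect(int doc, float score) {
        for (HitSink sink : sinks) {
            sink.collect(doc, score);
        }
    }
}

/** Records the extra value of every hit, in the order hits arrive. */
final class ValueRecorder implements HitSink {
    private final IntToLongFunction docValues; // stand-in for a doc values lookup
    final List<Long> recorded = new ArrayList<>();

    ValueRecorder(IntToLongFunction docValues) {
        this.docValues = docValues;
    }

    @Override
    public void collect(int doc, float score) {
        recorded.add(docValues.applyAsLong(doc));
    }
}

/** Tracks only the best-scoring doc; a stand-in for a real top-N collector. */
final class BestHit implements HitSink {
    int bestDoc = -1;
    float bestScore = Float.NEGATIVE_INFINITY;

    @Override
    public void collect(int doc, float score) {
        if (score > bestScore) {
            bestScore = score;
            bestDoc = doc;
        }
    }
}

final class FanOutDemo {
    public static void main(String[] args) {
        ValueRecorder recorder = new ValueRecorder(doc -> 1000L + doc);
        BestHit best = new BestHit();
        HitSink both = new FanOut(best, recorder);
        both.collect(2, 0.5f);
        both.collect(5, 1.5f);
        System.out.println(best.bestDoc + " " + recorder.recorded);
    }
}
```

Note that the recorder sees every matching document, not only the ones that end up in the top N, so its storage grows with the total hit count.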

Alan Woodward
www.flax.co.uk


Re: Recommendation for doing a search plus collecting extra information?

Trejkaz
On Mon, Oct 12, 2015 at 6:32 AM, Alan Woodward <[hidden email]> wrote:
> Hi Trejkaz,
>
> You can still use a standard collector if you don’t need to worry about multi-threaded search.  It sounds
> as though what you want to do is implement your own Collector that will read and record docvalues hits,
> and use MultiCollector to wrap it and a standard TopDocsCollector together.

I guess the benefit of doing it directly at the Collector is that the
results will come in doc ID order, so any I/O I'm doing would be local
to the previous I/O? Which makes sense, and fetching the values seems
easy enough, but then the order I get the results is not the order
they will come back in the search, so I have to find a fairly
efficient way to map int->int so that I can look them up later.

What would seem ideal here is extending ScoreDoc to put my new int in
that, so that it's stored along with the same object that gets sorted
and ultimately ends up in the array (plus the extra storage
requirement would be as low as possible), but the ScoreDoc is
created by HitQueue#getSentinelObject() and there is no way to get a
different subclass of HitQueue in TopScoreDocCollector. So I think
this route would require reimplementing pretty much all of
TopScoreDocCollector. I guess it isn't very large, but I worry about
future API changes when messing with internal stuff.
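
To make the trade-off concrete, here is a plain-Java sketch of a top-N queue whose entries carry the extra int, so that no separate doc-to-ID map is needed. It is a simplified model, not a reimplementation of TopScoreDocCollector: `IdHit` and `TopNWithId` are hypothetical names, it allocates one object per competitive hit, and it breaks score ties in favour of the lower doc ID.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** A hit that carries its stable ID alongside doc and score (hypothetical type). */
final class IdHit {
    final int doc;
    final float score;
    final int stableId;

    IdHit(int doc, float score, int stableId) {
        this.doc = doc;
        this.score = score;
        this.stableId = stableId;
    }
}

/** Keeps the N best hits; the weakest retained hit sits at the head of the heap. */
final class TopNWithId {
    // "Worst first": lower score first, and on equal scores the higher doc ID first.
    private static final Comparator<IdHit> WORST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    private final int n;
    private final PriorityQueue<IdHit> heap;

    /** n must be at least 1. */
    TopNWithId(int n) {
        this.n = n;
        this.heap = new PriorityQueue<>(n, WORST_FIRST);
    }

    void collect(int doc, float score, int stableId) {
        IdHit hit = new IdHit(doc, score, stableId);
        if (heap.size() < n) {
            heap.add(hit);
        } else if (WORST_FIRST.compare(hit, heap.peek()) > 0) {
            heap.poll(); // evict the weakest retained hit
            heap.add(hit);
        }
    }

    /** Retained hits, best first. */
    List<IdHit> topHits() {
        List<IdHit> out = new ArrayList<>(heap);
        out.sort(WORST_FIRST.reversed());
        return out;
    }

    public static void main(String[] args) {
        TopNWithId top = new TopNWithId(2);
        top.collect(0, 1.0f, 100);
        top.collect(1, 3.0f, 101);
        top.collect(2, 2.0f, 102);
        for (IdHit hit : top.topHits()) {
            System.out.println(hit.doc + " -> " + hit.stableId);
        }
    }
}
```

Only the retained hits hold an ID, so the extra storage is bounded by N rather than by the number of matching documents.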

TX

RE: Recommendation for doing a search plus collecting extra information?

Uwe Schindler
Hi,

it may sound a bit stupid, but you can do the following:

If you search for a docvalues (previously fieldcache) field in lucene, the returned TopFieldDocs also contains the field values that were sorted against. The ScoreDoc instances in this collection are actually FieldDoc instances (cast them down): https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/FieldDoc.html

So my suggestion would be: sort primarily against score (SortField.FIELD_SCORE), but add a secondary sort field with the docvalues field you want to be part of your results. The results will be primarily sorted against the score, so you should still get the results in the right order, but the docvalues field will be part of your TopFieldDocs (https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/TopFieldDocs.html) results once you downcast each ScoreDoc to FieldDoc (the sort field values are saved in an Object[] array). Choose the second FieldDoc field and cast it to your data type.
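
A plain-Java model of what this buys you, runnable without Lucene: `SortedHit` is a hypothetical stand-in for FieldDoc, whose fields array holds one entry per sort field. In Lucene 5.x the sort itself would be built along the lines of new Sort(SortField.FIELD_SCORE, new SortField("id", SortField.Type.LONG)), where the field name "id" is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for FieldDoc: fields[0] is the score, fields[1] the ID. */
final class SortedHit {
    final int doc;
    final Object[] fields;

    SortedHit(int doc, float score, long id) {
        this.doc = doc;
        this.fields = new Object[] {score, id}; // boxed as Float and Long
    }
}

final class ScoreThenIdSort {
    /** Primary key: score descending. Secondary key: ID ascending (only breaks ties). */
    static final Comparator<SortedHit> SCORE_THEN_ID = (a, b) -> {
        int byScore = Float.compare((Float) b.fields[0], (Float) a.fields[0]);
        return byScore != 0 ? byScore : Long.compare((Long) a.fields[1], (Long) b.fields[1]);
    };

    /** Returns the IDs of the hits in result order, read back from fields[1]. */
    static List<Long> idsInResultOrder(List<SortedHit> hits) {
        List<SortedHit> sorted = new ArrayList<>(hits);
        sorted.sort(SCORE_THEN_ID);
        List<Long> ids = new ArrayList<>();
        for (SortedHit hit : sorted) {
            ids.add((Long) hit.fields[1]); // choose the second field and cast it
        }
        return ids;
    }

    public static void main(String[] args) {
        List<SortedHit> hits = new ArrayList<>();
        hits.add(new SortedHit(0, 1.0f, 500L));
        hits.add(new SortedHit(1, 2.0f, 300L));
        hits.add(new SortedHit(2, 2.0f, 200L));
        System.out.println(idsInResultOrder(hits)); // prints [200, 300, 500]
    }
}
```

The one visible difference from a plain score sort is the order of ties: hits with equal scores come back ordered by the ID field rather than by doc ID.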

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]

RE: Recommendation for doing a search plus collecting extra information?

Uwe Schindler
> If you search for a docvalues (previously fieldcache) field in lucene, the

I meant "sort" not "search" :-)

Uwe


Re: Recommendation for doing a search plus collecting extra information?

Trejkaz
On Mon, Oct 12, 2015 at 3:28 PM, Uwe Schindler <[hidden email]> wrote:

> Hi,
>
> it may sound a bit stupid, but you can do the following:
>
> If you search for a docvalues (previously fieldcache) field in lucene, the returned TopFieldDocs also contains the field values
> that were sorted against. The ScoreDoc instances in this collection are actually FieldDoc instances (cast them down):
> https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/FieldDoc.html
>
> So my suggestion would be: sort primarily against score (SortField.FIELD_SCORE), but add a secondary sort field with the docvalues
> field you want to be part of your results. The results will be primarily sorted against the score so you should still get the results
> in the right order, but you can have the docvalues field as part of your TopFieldDocs
> (https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/TopFieldDocs.html) collections after downcasting
> the ScoreDoc to FieldDoc (the sorted fields are saved as Object[] instances). Choose the second FieldDoc field and cast
> it to your data type.

Well, this solution was working fine for a long time, but now we have
some users crying about the additional memory usage.

We're using this sort field:

    private static final SortedNumericSortField ID_SORT_FIELD =
        new SortedNumericSortField(LuceneFields.ID.getName(),
                                   SortField.Type.LONG);

Since the field isn't one of the Sorted* doc values types, I think I can now change it to just:

    private static final SortField ID_SORT_FIELD =
        new SortField(LuceneFields.ID.getName(), SortField.Type.LONG);

Either way it ends up creating a LongComparator, though, which seems
to be what is being complained about. The memory usage of
LongComparator seems totally fine to me and it's using what seems to
be the minimum storage for what it's doing, so it's not like it can be
improved, but maybe there is a way to make a comparator that doesn't
have to store a copy of the data in memory?

TX

Re: Recommendation for doing a search plus collecting extra information?

Trejkaz
I did some experiments.

As it turns out, changing SortedNumericSortField to SortField had no
effect on the timings at all.
However, changing the SortField.Type from LONG to INT makes queries
come back 3 times faster.
(20ms vs. 6.5ms comparing the fastest runs for each.)

Why would using int be 3 times faster, and not 2?

(And repeating from the last mail, is there a way to use less memory?)

TX

Re: Recommendation for doing a search plus collecting extra information?

Trejkaz
On Mon, Oct 12, 2015 at 4:32 AM, Alan Woodward <[hidden email]> wrote:
> Hi Trejkaz,
>
> You can still use a standard collector if you don’t need to worry about multi-threaded search.
> It sounds as though what you want to do is implement your own Collector that will read and
> record docvalues hits, and use MultiCollector to wrap it and a standard TopDocsCollector together.

This is what I'm currently trying out, but I'm hitting exactly the
problem I predicted. To use the values, I have to put them into some
kind of storage.

I can put them into an int[] but then it's the worst case memory usage
for queries returning a small number of hits.

Or I can put them into something like a fastutil Int2IntOpenHashMap,
which reduces the memory usage for small queries, while also making
large queries much slower.

Neither of these is really appealing right now.

Two ideas but I can't figure out if they'll work:

1. The doc IDs are visited in order, at least within each segment. Is
there a structure in Lucene itself somewhere which can store that off
quickly and efficiently?

2. Am I allowed to just hold onto the NumericDocValues for each leaf
for a long period of time, or is there an implementation of them
which breaks that? I figure it's already sitting around, so that
should be zero additional storage?
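
For idea 1, one structure that fits "keys arrive in increasing order" is a pair of growable parallel arrays with binary-search lookup. Memory grows with the number of hits rather than with the size of the index, and there is no hashing. The following is a plain-Java sketch, not something Lucene provides under this name: `OrderedIntMap` is hypothetical, and it assumes keys are appended in strictly increasing order (for example top-level doc IDs, with segments visited in order).

```java
import java.util.Arrays;

/** Append-only int -> int map for keys added in strictly increasing order. */
final class OrderedIntMap {
    private int[] keys = new int[16];
    private int[] values = new int[16];
    private int size;

    /** Appends a mapping; key must be greater than every key added so far. */
    void append(int key, int value) {
        if (size > 0 && key <= keys[size - 1]) {
            throw new IllegalArgumentException("keys must be strictly increasing");
        }
        if (size == keys.length) {
            // Double both arrays together so they stay parallel.
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    /** Returns the value for key, or missing if the key was never added. */
    int get(int key, int missing) {
        // Only the first `size` slots are populated, so search just that range.
        int idx = Arrays.binarySearch(keys, 0, size, key);
        return idx >= 0 ? values[idx] : missing;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        OrderedIntMap map = new OrderedIntMap();
        map.append(3, 700);
        map.append(8, 701);
        map.append(42, 702);
        System.out.println(map.get(8, -1)); // prints 701
        System.out.println(map.get(9, -1)); // prints -1
    }
}
```

Lookups cost O(log n) instead of a hash probe, and the arrays can over-allocate by up to 2x. For a compressed variant, Lucene's org.apache.lucene.util.packed package (e.g. PackedLongValues with a monotonic builder) may be worth checking.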

TX
