Storing & using feature vectors

kkrugler
Hi all,

[I posted on the Lucene list two days ago, but didn’t see any response - checking here for completeness]
 
I’ve been looking at directly storing feature vectors and providing scoring/filtering support.

This is for vectors consisting of (typically 300 - 2048) floats or doubles.

It’s following the same pattern as geospatial support - so a new field type and query/parser, plus plumbing to hook it into Solr.

Before I go much further, is there anything like this already done, or in the works?

Thanks,

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

Re: Storing & using feature vectors

Doug Turnbull
This is a pretty big hole in Lucene-based search right now that many
practitioners have struggled with.

I know a couple of people who have worked on solutions. And I've used a
couple of hacks:

- You can hack together something that does cosine similarity using term
frequencies (via DelimitedTermFreqFilterFactory) and query boosts. Basically
the term frequency becomes a feature weight on the document, and the boost
becomes the query weight. If you massage things correctly with the
similarity, the resulting boolean similarity is a dot product...

- Erik Hatcher has done some great work with payloads which you might want
to check out. See the delimited payload filter factory and the payload score
function queries.

- Simon Hughes's Activate talk (slides/video not yet posted) covers this
topic in some depth

- Rene Kriegler's Haystack Talk discusses encoding Inception model
vectorizations of images:
https://opensourceconnections.com/events/haystack-single/haystack-relevance-scoring/
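
The term-frequency hack in the first bullet can be illustrated outside Solr with a minimal Python sketch. It assumes the similarity has been configured so that each matching term contributes exactly tf × boost, which is the "massaging" mentioned above and is not shown here. The `f<i>` term naming and the `SCALE` quantization factor are hypothetical choices for illustration; the `term|tf` token layout follows the delimited format that DelimitedTermFreqFilterFactory reads by default, where frequencies must be positive integers.

```python
from typing import Dict, List

SCALE = 100  # hypothetical quantization factor: weight 0.37 -> tf 37


def encode_doc_vector(vec: List[float], scale: int = SCALE) -> str:
    """Encode non-negative feature weights as 'term|tf' tokens.

    Feature i becomes the term 'f<i>'. Weights that quantize to zero are
    dropped, because a term frequency must be a positive integer.
    """
    tokens = []
    for i, weight in enumerate(vec):
        tf = round(weight * scale)
        if tf > 0:
            tokens.append(f"f{i}|{tf}")
    return " ".join(tokens)


def parse_tfs(field_value: str) -> Dict[str, int]:
    """Recover per-term frequencies from the encoded field value."""
    tfs = {}
    for token in field_value.split():
        term, tf = token.split("|")
        tfs[term] = int(tf)
    return tfs


def score(field_value: str, query_vec: List[float], scale: int = SCALE) -> float:
    """Sum tf * boost over matching terms, rescaled to the original units."""
    tfs = parse_tfs(field_value)
    total = 0.0
    for i, boost in enumerate(query_vec):
        total += tfs.get(f"f{i}", 0) * boost
    return total / scale


doc = [0.5, 0.0, 0.25]
query = [2.0, 3.0, 4.0]
encoded = encode_doc_vector(doc)
exact = sum(d * q for d, q in zip(doc, query))
print(encoded, score(encoded, query), exact)
```

The sum is a dot product, so it only equals cosine similarity when both vectors are normalized to unit length beforehand. The encoding also handles only non-negative weights, since a term frequency cannot be negative.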

If this is hugely important to you, I might also suggest looking at Vespa,
which makes tensors a first-class citizen and makes matrix math pretty
seamless: http://vespa.ai

Hope that helps
-Doug

On Fri, Oct 19, 2018 at 12:50 PM Ken Krugler <[hidden email]>
wrote:

> Hi all,
>
> [I posted on the Lucene list two days ago, but didn’t see any response -
> checking here for completeness]
>
> I’ve been looking at directly storing feature vectors and providing
> scoring/filtering support.
>
> This is for vectors consisting of (typically 300 - 2048) floats or doubles.
>
> It’s following the same pattern as geospatial support - so a new field
> type and query/parser, plus plumbing to hook it into Solr.
>
> Before I go much further, is there anything like this already done, or in
> the works?
>
> Thanks,
>
> — Ken

--
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug

Re: Storing & using feature vectors

kkrugler
Hi Doug,

Many thanks for the tons of useful information!

Some comments/questions inline below.

— Ken

> On Oct 19, 2018, at 10:46 AM, Doug Turnbull <[hidden email]> wrote:
>
> This is a pretty big hole in Lucene-based search right now that many
> practitioners have struggled with
>
> I know a couple of people who have worked on solutions. And I've used a
> couple of hacks:
>
> - You can hack together something that does cosine similarity using term
> frequencies (via DelimitedTermFreqFilterFactory) and query boosts. Basically
> the term frequency becomes a feature weight on the document, and the boost
> becomes the query weight. If you massage things correctly with the
> similarity, the resulting boolean similarity is a dot product…

I’ve done a quick test of that approach, though not as elegantly. I just constructed a string of “terms” (feature indices) that generated an approximation to the target vector. DelimitedTermFreqFilterFactory is much better :)

The problem I ran into was that some features have negative weights, and it wasn’t obvious whether it would work to put the negative weights in a second field and apply negative boosting to it (which doesn’t seem to be really supported in Solr?).

Is there some hack to work around that?
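
For reference, the arithmetic behind the two-field idea can be checked with a minimal Python sketch. It splits each signed vector into a non-negative "positive" part and a non-negative "negative" part, then rebuilds the signed dot product from four non-negative dot products. The function names are hypothetical, and the sketch says nothing about whether Solr can express the subtraction; that remains the open question above.

```python
from typing import List, Tuple


def split_signed(vec: List[float]) -> Tuple[List[float], List[float]]:
    """Split a signed vector v into (pos, neg), both non-negative, with v = pos - neg."""
    pos = [max(x, 0.0) for x in vec]
    neg = [max(-x, 0.0) for x in vec]
    return pos, neg


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def signed_dot_from_parts(doc: List[float], query: List[float]) -> float:
    """Rebuild the signed dot product from four non-negative dot products."""
    d_pos, d_neg = split_signed(doc)
    q_pos, q_neg = split_signed(query)
    # Features where query and document agree in sign raise the score.
    same_sign = dot(q_pos, d_pos) + dot(q_neg, d_neg)
    # Features where they disagree in sign lower it.
    opposite_sign = dot(q_pos, d_neg) + dot(q_neg, d_pos)
    return same_sign - opposite_sign


doc = [0.5, -0.25, 0.0, 1.0]
query = [1.0, 2.0, -3.0, -0.5]
print(signed_dot_from_parts(doc, query), dot(doc, query))
```

So two fields per document are enough on the index side, but the query side needs both an additive clause (same-sign matches) and a subtractive one (opposite-sign matches), which is where negative boosting would come in.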

> - Erik Hatcher has done some great work with payloads which you might want
> to check out. See the delimited payload filter factory, and payload score
> function queries

Thanks, I’d poked at payloads a bit. From what I could tell, there isn't a way to use payloads with negative feature values, or to filter results, but maybe I didn’t dig deep enough.

> - Simon Hughes Activate Talk (slides/video not yet posted) covers this
> topic in some depth

OK, that looks great - https://activate2018.sched.com/event/FkM3 and https://github.com/DiceTechJobs/VectorsInSearch

Seems like the planets are aligning for this kind of thing.

> - Rene Kriegler's Haystack Talk discusses encoding Inception model
> vectorizations of images:
> https://opensourceconnections.com/events/haystack-single/haystack-relevance-scoring/

Good stuff, thanks!

I’d be curious what his querqy <https://github.com/renekrie/querqy> configuration looked like for the “summing up fieldweights only (ignore df; use cross-field tf)” row in his results table on slide 36.

The use of LSHs (what he describes in this talk as “random projection forest”) is something I’d suggested to the client, to mitigate the need for true feature vector support.

Using an initial LSH-based query to get candidates, and then re-ranking based on the actual feature vector, is something I was expecting Rene to discuss but he didn’t seem to mention it in his talk.
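
A minimal Python sketch of that two-phase idea follows. It uses random-hyperplane signatures, a simpler LSH variant than the random projection forest from the talk, and it is not taken from the talk or from any Solr API. All names and parameters (`make_hyperplanes`, `max_hamming`, the 16-bit signature length) are hypothetical. In a real index the signatures would presumably be computed at index time and stored as terms; here they are recomputed per query to keep the example short.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def make_hyperplanes(dim: int, n_bits: int, seed: int = 42) -> List[Vector]:
    """Draw n_bits random hyperplane normals; the seed only makes runs repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec: Vector, planes: List[Vector]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if dot(vec, p) >= 0 else 0 for p in planes)


def hamming(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    return sum(x != y for x, y in zip(a, b))


def search(query: Vector, docs: Dict[str, Vector], planes: List[Vector],
           max_hamming: int, top_k: int) -> List[str]:
    """Phase 1: keep docs whose signature is close to the query's.
    Phase 2: re-rank those candidates by exact cosine similarity."""
    q_sig = signature(query, planes)
    candidates = [doc_id for doc_id, vec in docs.items()
                  if hamming(signature(vec, planes), q_sig) <= max_hamming]
    ranked = sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:top_k]


docs = {
    "a": [1.0, 0.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0, 0.0],
    "c": [-1.0, 0.0, 0.0, 0.0],
}
planes = make_hyperplanes(dim=4, n_bits=16)
results = search([1.0, 0.0, 0.0, 0.0], docs, planes, max_hamming=4, top_k=3)
print(results)
```

The candidate phase trades recall for speed: a tighter `max_hamming` scans fewer documents in phase 2 but may drop true neighbors, so the threshold and signature length would need tuning against real data.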

> If this is hugely important to you, I might also suggest looking at Vespa,
> which makes tensors a first-class citizen and makes matrix math pretty
> seamless: http://vespa.ai

Interesting, though my client is pretty much locked into using Solr.



