Can lucene documents have several thousand attributes each?

Leighton Hargreaves
Hello Lucene project.

I'm in the process of evaluating lucene for a project where we will need to search a large set of 3D objects by various attributes.  In many ways, lucene's functionality seems perfect.

But one thing I'm not sure of: we need to find the set of objects that are within a given distance of any given object.

One solution would be to add a numeric field to each 3D object, for each other 3D object, with a name such as 'distance_to_<other_object_id_1>'.  This would allow us to find objects within a given distance of a given object with a query like 'distance_to_<object_id>:[* TO <max_distance>]'.

But this would mean each 3D object would have several thousand attributes, one for every other 3D object.  Would this be a prohibitively expensive way to do it?
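
To put rough numbers on the per-object field approach, here is a small arithmetic sketch. The figure of 5000 objects is only an assumption standing in for "several thousand", and the function name is made up for illustration; it counts values and says nothing about how Lucene itself would perform.

```python
def storage_cost(num_objects: int) -> dict:
    """Count what the 'distance_to_<id>' scheme has to store."""
    # Every object carries one distance field for every other object.
    fields_per_document = num_objects - 1
    # Each distance is stored twice (A->B on A, and B->A on B).
    total_field_values = num_objects * fields_per_document
    # One field name per target object: distance_to_<id>.
    distinct_field_names = num_objects
    # Number of unordered pairs, i.e. distances that are actually distinct.
    unordered_pairs = num_objects * (num_objects - 1) // 2
    return {
        "fields_per_document": fields_per_document,
        "total_field_values": total_field_values,
        "distinct_field_names": distinct_field_names,
        "unordered_pairs": unordered_pairs,
    }


if __name__ == "__main__":
    for key, value in storage_cost(5000).items():
        print(f"{key}: {value:,}")
```

The total grows quadratically with the number of objects, and half of the stored values are mirror images of the other half.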

Another solution would be to handle the spatial aspect within my own software, i.e. filter lucene's results according to distance.  But I worry that this would hurt performance, because the set of results returned to my code would be large before my own software filters it.

I apologise if the question is confusing or badly explained; I'm just asking in case it turns out to be a standard class of problem with good existing solutions.

Regards,

Leighton Hargreaves

Re: Can lucene documents have several thousand attributes each?

Marc Hadfield-2
Hello Leighton --

Use Lucene Spatial; it's built into Lucene and provides distance/shape
functionality and queries.

Simply google "lucene spatial" for examples, such as:
http://stackoverflow.com/questions/13628602/how-to-use-lucene-4-0-spatial-api
http://lucene.apache.org/core/4_0_0/spatial/index.html
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/spatial/src/test/org/apache/lucene/spatial/SpatialExample.java?view=markup

-- Marc Hadfield

Vital AI



Re: Can lucene documents have several thousand attributes each?

david.w.smiley@gmail.com
In reply to this post by Leighton Hargreaves
Hi Leighton,

I’m assuming the reason you’re suggesting going about it this way, instead
of using the Lucene/Solr spatial feature, is that it’s not a 2D distance?
Solr actually supports n-dimensional Euclidean distance calculation with
this function query (aka ValueSource):

dist(2, x,y,z,0,0,0): Euclidean distance between (0,0,0) and (x,y,z) for
each document
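
For readers who want to see what that computes, here is a plain Python sketch of the same arithmetic. It is not Solr code: the argument layout (power first, then the coordinates of one point followed by the coordinates of the other) simply mirrors the call above, and treating other powers as the general p-norm is my assumption.

```python
def dist(power, *coords):
    """p-norm distance between two points passed as one flat argument list."""
    if power < 1:
        raise ValueError("this sketch only handles power >= 1")
    if len(coords) % 2 != 0:
        raise ValueError("need the same number of coordinates for both points")
    half = len(coords) // 2
    point_a, point_b = coords[:half], coords[half:]
    # Sum |a_i - b_i|^power over all dimensions, then take the power-th root.
    total = sum(abs(a - b) ** power for a, b in zip(point_a, point_b))
    return total ** (1.0 / power)


if __name__ == "__main__":
    # A document whose fields are x=3, y=4, z=0, measured from the origin.
    print(dist(2, 3, 4, 0, 0, 0, 0))
```

With power 2 this is the usual Euclidean distance in however many dimensions are supplied.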


Re: Can lucene documents have several thousand attributes each?

Ted Dunning
Also, you can use 2D projections with AND to limit the number of documents
you need to compute distances on.
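
One way to read that suggestion, sketched in plain Python rather than as Lucene queries: a point can only be within radius r in 3D if it is also within r in each 2D projection, so ANDing two projected checks gives a cheap candidate set, and the exact distance is computed only for the candidates. The choice of the xy and xz planes, the function name, and the sample points are all assumptions made for illustration.

```python
import math


def within_distance(points, center, radius):
    """Return (ids within radius, number of candidates that needed an exact check)."""
    cx, cy, cz = center
    candidates = []
    for object_id, (x, y, z) in points.items():
        # Cheap necessary conditions: projected distance never exceeds 3D distance.
        in_xy = math.hypot(x - cx, y - cy) <= radius
        in_xz = math.hypot(x - cx, z - cz) <= radius
        if in_xy and in_xz:
            candidates.append((object_id, (x, y, z)))
    # Exact 3D distance only for the survivors of the prefilter.
    hits = sorted(oid for oid, p in candidates if math.dist(p, center) <= radius)
    return hits, len(candidates)


if __name__ == "__main__":
    sample = {
        "a": (1.0, 0.0, 0.0),
        "b": (0.0, 0.0, 5.0),   # fails the xz projection
        "c": (3.0, 3.0, 0.0),   # fails the xy projection
        "d": (1.0, 1.0, 1.0),
        "e": (1.5, 1.2, 1.2),   # passes both projections, fails the exact check
    }
    print(within_distance(sample, (0.0, 0.0, 0.0), 2.0))
```

The prefilter is only a necessary condition, which is why the exact check is still needed for point "e".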


RE: Can lucene documents have several thousand attributes each?

Leighton Hargreaves
Thanks for the responses, I didn't even realise there was a spatial feature.  The distances I need to search for, though, are the minimum distances between arbitrarily complex 3D geometry (the geometry itself wouldn't be represented in lucene, only metadata about it).  So I want to calculate these minimum distances within my own geometry engine, and then pass the calculated distances into lucene/solr.  

So really my question is: what is the best way to represent values which relate to 2 documents, so that I can search for documents 'in relation to' another document?  (In this case the relation is an externally-calculated distance.)


 
Re: Can lucene documents have several thousand attributes each?

Marc C Hadfield
You may be able to leverage faceting for more complex cases
(http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html).
However, it sounds like you could just create a set of Lucene documents
with 3 main fields:
object-id-1, distance, object-id-2
and then query this as needed with constraints on the distance.  You would
be "joining" this index to another index (your object index) by object-id.
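
As a rough illustration of that layout, here is an in-memory Python sketch. It does not use the Lucene API: the two "indexes" are a plain list and dict, the field names follow the three fields listed above, and the object ids, metadata and distances are invented sample data.

```python
# Hypothetical object index: one entry per 3D object, keyed by object id.
objects = {
    "obj1": {"name": "bracket"},
    "obj2": {"name": "flange"},
    "obj3": {"name": "housing"},
}

# Hypothetical pair index: one small document per unordered pair of objects,
# holding the distance computed by the external geometry engine.
pair_docs = [
    {"object_id_1": "obj1", "distance": 0.5, "object_id_2": "obj2"},
    {"object_id_1": "obj1", "distance": 4.0, "object_id_2": "obj3"},
    {"object_id_1": "obj2", "distance": 1.5, "object_id_2": "obj3"},
]


def neighbours(pair_docs, objects, object_id, max_distance):
    """Find pair documents that mention object_id within max_distance,
    then 'join' to the object index by the other object's id."""
    hits = []
    for doc in pair_docs:
        if doc["distance"] > max_distance:
            continue
        if doc["object_id_1"] == object_id:
            other = doc["object_id_2"]
        elif doc["object_id_2"] == object_id:
            other = doc["object_id_1"]
        else:
            continue
        hits.append((doc["distance"], other, objects[other]))
    # Nearest first; the id breaks ties so the order is deterministic.
    hits.sort(key=lambda hit: (hit[0], hit[1]))
    return hits


if __name__ == "__main__":
    for distance, other_id, metadata in neighbours(pair_docs, objects, "obj2", 2.0):
        print(other_id, distance, metadata["name"])
```

Storing each pair once means the lookup has to check both id fields, which is the trade-off for halving the number of pair documents.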



Re: Can lucene documents have several thousand attributes each?

Mark Bennett
In reply to this post by Leighton Hargreaves
Another feature that might be useful, and that might not be obvious at first, is that document tokens can have custom payloads, so you could encode arbitrary binary metadata in them.

Then at search time, maybe override the Similarity class to leverage those payloads.

Non-trivial, but likely do-able.
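
A minimal sketch of the encoding side of that idea, in plain Python and without the Lucene API. It assumes each object gets one token per neighbouring object id, with the externally computed distance packed into a 4-byte big-endian float as the payload; the function names are hypothetical, and the search-time step is a simple filter standing in for whatever a custom Similarity would do with the decoded value.

```python
import struct


def encode_payload(distance: float) -> bytes:
    """Pack a distance into 4 bytes (big-endian IEEE 754 float)."""
    return struct.pack(">f", distance)


def decode_payload(payload: bytes) -> float:
    """Recover the distance from a 4-byte payload."""
    return struct.unpack(">f", payload)[0]


def neighbour_tokens(distances: dict) -> list:
    """Build (token, payload) pairs: token = other object's id, payload = distance."""
    return [(other_id, encode_payload(d)) for other_id, d in sorted(distances.items())]


def within(tokens: list, max_distance: float) -> list:
    """Stand-in for the search-time step: decode each payload and compare."""
    return [oid for oid, payload in tokens if decode_payload(payload) <= max_distance]


if __name__ == "__main__":
    tokens = neighbour_tokens({"obj2": 0.5, "obj3": 4.0})
    print(within(tokens, 2.0))
```

A 4-byte float loses precision compared with a Python float, so values that matter to the last digit would need a wider encoding.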

--
Mark Bennett / LucidWorks: Search & Big Data / [hidden email]
Office: 408-898-4201 / Telecommute: 408-733-0387 / Cell: 408-829-6513

RE: Can lucene documents have several thousand attributes each?

Leighton Hargreaves
In reply to this post by Marc C Hadfield
Aha!  Yes, I think a separate index, with a 'join', is probably the best solution.  It makes a lot of sense.
One more question:

Is it possible to create a single lucene query which would refer to this separate 'join' index, and to my main index?  I don't want to have to execute multiple queries and merge the results, as this would be inefficient for pagination etc.

Thanks for all the insights...

