Solr faceting vs. Lucene faceting


Otis Gospodnetić
Hi,


Are there plans to switch Solr to Lucene's faceting?

At this point, does Solr's faceting have some advantages over Lucene's? Or vice versa?

Thanks,
Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html

Re: Solr faceting vs. Lucene faceting

Yonik Seeley-4
On Sun, Dec 9, 2012 at 5:55 PM, Otis Gospodnetic
<[hidden email]> wrote:
> Are there plans to switch Solr to Lucene's faceting?

Nope.  There is no one best algorithm - different approaches work best
in different circumstances.
We've added faceting implementations to Solr over time, and we'll
undoubtedly add more in the future.

Would it make sense to add Lucene's faceting as an *additional* Solr
faceting method?  Maybe?
I don't really know, though - I haven't done the work to evaluate how
well it would fit in with Solr's architecture.
Patches welcome and all that... ;-)

-Yonik
http://lucidworks.com

Re: Solr faceting vs. Lucene faceting

Otis Gospodnetić
Thanks Yonik.

Would it also make sense to add Solr's faceting method to Lucene's faceting module?

Thanks,
Otis




Re: Solr faceting vs. Lucene faceting

Shai Erera
Hi

The faceting module in Lucene is very generic and extensible in many ways. From the little I've read about Solr facets, I think that all of its features can be implemented on top of Lucene facets - some directly with the code that exists today, some by writing a few extension points. I don't remember a feature that would require low-level changes to the code, but if there is one, I promise to do the work! :)

I wrote a few blog posts on Lucene facets recently (http://shaierera.blogspot.com) and intend to write a few more, delving in depth into some features such as complements, sampling, partitions and associations, as well as some other advanced/expert stuff. Mike and I also opened a few issues to bring the code more in line with Lucene 4.x capabilities, e.g. exploring DocValues, PackedInts and more.

Yonik, unlike Solr facets (which manage everything in the search index), the Lucene module comes with a sidecar taxonomy index, so e.g. when Solr replicates shards, it will need to replicate one additional index's files. That's the big difference; the rest of the differences are minor, I think. And of course, Solr has a much higher-level API than Lucene, so we'll need to translate those APIs to the facets module.

If you do want to at least explore this direction, I'd be willing to help. But my understanding and knowledge of Solr is not deep enough to try that on my own!

Shai


Re: Solr faceting vs. Lucene faceting

David Smiley
Shai Erera wrote:
> Yonik, unlike Solr facets (which manage everything in the search index),
> the Lucene module comes with a sidecar taxonomy index, so e.g. when Solr
> replicates shards, it will need to replicate one additional index's files.
> That's the big difference; the rest of the differences are minor, I think.
> And of course, Solr has a much higher-level API than Lucene, so we'll need
> to translate those APIs to the facets module.
Shai,
RE: Sidecar index -- that's a huge difference and a shortcoming, no? Do you somehow take care to avoid a stale view of the sidecar index during a commit?

On the upside, if this does proper hierarchical faceting, then a Solr adapter for it would be awesome.

~ David

Re: Solr faceting vs. Lucene faceting

Shai Erera

You're right, the sidecar index does bring some challenges into the picture, but we've been using it like that for many years, in distributed mode too, and so far it hasn't been an issue. I opened LUCENE-3786 to create a SearcherTaxoManager which lets you manage IndexSearcher and TaxonomyReader pairs, like SearcherManager does. I am thinking that maybe this object will also manage the commits to both indexes.

Keeping them in sync is a delicate matter, but certainly doable - even more so now that IndexWriter lets you commit just commitData.
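
(For illustration, a rough sketch of how such a pair manager could be used - the class and method names below mirror SearcherManager and are only a guess at how LUCENE-3786 might turn out, not a final API:)

    // Assumed names, modeled on SearcherManager; treat as a sketch of LUCENE-3786.
    SearcherTaxonomyManager mgr =
        new SearcherTaxonomyManager(indexWriter, true, null, taxoWriter);

    SearcherTaxonomyManager.SearcherAndTaxonomy pair = mgr.acquire();
    try {
      IndexSearcher searcher = pair.searcher;          // view of the search index
      TaxonomyReader taxoReader = pair.taxonomyReader; // matching view of the taxonomy
      // ... run the faceted search against this consistent pair ...
    } finally {
      mgr.release(pair);
    }

    // After indexing more documents/categories, refresh both views together:
    mgr.maybeRefresh();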

The taxonomy manages the global ordinals for categories. The first version of it used some files (maybe a B-Tree, I don't recall), but moving to a Lucene index was a huge gain. The code became very simple, and we could enjoy Lucene's robustness and commit semantics.

The global ordinals are a huge benefit IMO, as they let you do all the work on integers rather than strings, and allow you to do faceting both off-disk and in-memory. They are also NRT-friendly (and DirectoryTaxonomyReader is now NRT too!).

I'm not too familiar with Solr adapters... will Solr NRT, SolrCloud, etc. work with any adapter, even one that carries along a sidecar index/data structure? I'm mostly worried about replication, because distributed indexing should not be affected by the existence of the taxonomy index.

Shai

Re: Solr faceting vs. Lucene faceting

Yonik Seeley-4
On Tue, Dec 11, 2012 at 2:06 AM, Shai Erera <[hidden email]> wrote:
> The taxonomy manages the global ordinals for categories.

I wonder if there's a way to do global ordinals w/ a codec instead of
a sidecar index?

 -Yonik
http://lucidworks.com


Re: Solr faceting vs. Lucene faceting

Robert Muir
On Tue, Dec 11, 2012 at 11:02 AM, Yonik Seeley <[hidden email]> wrote:
> On Tue, Dec 11, 2012 at 2:06 AM, Shai Erera <[hidden email]> wrote:
>> The taxonomy manages the global ordinals for categories.
>
> I wonder if there's a way to do global ordinals w/ a codec instead of
> a sidecar index?
>

I'm not sure how this would work, since codec is a per-segment thing.


Re: Solr faceting vs. Lucene faceting

Tommaso Teofili



While I generally like the idea (having just one index to manage, rather than keeping a pair in sync, would be nice), wouldn't Solr be locked to a certain Codec implementation if it were done like that? Or am I wrong?

Regards,
Tommaso
 


Re: Solr faceting vs. Lucene faceting

Lukáš Vlček
In reply to this post by Shai Erera
Hi Shai,

thanks for your blog, I am looking forward to your future posts!

Just two questions: you mentioned that you have been running this in production in distributed mode. If I understand correctly, the idea is that there is only a single taxonomy index, even when distributed mode means the data indices are partitioned/sharded (thus the ordinals are global), and the taxonomy index itself is not partitioned/sharded. Am I correct?

Also, what seems to be an interesting implication of this implementation is that the taxonomy index never cares about deleted documents (categories that become obsolete). In practice this is probably not a big deal, because the taxonomy index is small, but I can imagine it might be problematic in some situations (for example, imagine that the categories were based on a highly granular timestamp: that could create a lot of categories over a short period of time, and those would be kept "forever" while the taxonomy keeps growing...).
(^^ I am just trying to understand how it works.)

Regards,
Lukas

Re: Solr faceting vs. Lucene faceting

Shai Erera
There are two ways you can work with the taxonomy index in a distributed environment (at least, these are the things that we've tried):
(1) replicate the taxonomy to all shards, so that each shard sees the entire global taxonomy;
(2) each shard maintains its own taxonomy.

(1) only makes sense when the shards are built by a side process, e.g. MapReduce, and then copied to their respective nodes.
If you index like that, then your distributed faceted search (correcting the counts of categories) is done on ordinals rather than strings.

(2) is the one that makes sense to most people, and is also NRT (where #1 isn't!). Each shard maintains its local search +
taxonomy indexes. In that mode, the counts correction cannot be done on ordinals, and has to be done on strings.

When you're doing distributed faceted search, you cannot just ask each shard to return the top-10 categories for the "Author" dimension/facet,
because then you might (1) miss a category that should have been in the total top-10 and (2) return a category in the top-10 with
incorrect counts. What you can do is ask for C*10, where C is an over-counting factor. You'd still be hoping for the best though, b/c
theoretically you could still miss a category or have an incorrect count for one.

The difference between the two approaches is how big C can be. In the first approach, since all you transmit on the wire are
integers, and the merge is done on integers, you can set C much higher than in the second approach. In practice though, since
more and more applications are interested in real-time search, we keep a local taxonomy index per each shard and do the merge
on the strings.
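
(As a rough illustration of that over-fetch-and-merge idea - this is just the shape of the logic, not the module's actual distributed API; a "label" stands for an ordinal in approach (1) or a category string in approach (2):)

    import java.util.*;
    import java.util.stream.Collectors;

    class TopKMerger {
      // Each shard contributes its local top C*K (label -> count) map;
      // the coordinator sums counts per label and keeps the global top K.
      static <L> List<Map.Entry<L, Long>> mergeTopK(List<Map<L, Long>> perShardTopCK, int k) {
        Map<L, Long> merged = new HashMap<>();
        for (Map<L, Long> shard : perShardTopCK) {
          shard.forEach((label, count) -> merged.merge(label, count, Long::sum));
        }
        return merged.entrySet().stream()
            .sorted(Map.Entry.<L, Long>comparingByValue().reversed())
            .limit(k)
            .collect(Collectors.toList());
      }
    }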

Also, when you're doing really large-scale faceting, exact counts for categories may not be so useful. How is Science Fiction (123,367,129)
different from Drama (145,465,987)!? To the user these are just categories that are associated with more documents than anyone
can digest anyway :).

For that, we do sampling and display percentages, which are more consumable by users, and then you don't need to worry about exact counts.

I think that I wrote a bit too long an answer to your question :).

Regarding not deleting categories, we've thought about it in the past and I'm not sure it's a problem. I mean, in theory, yes, you could
end up w/ a taxonomy index that has many unused categories. But:

* Whenever we were dealing with timestamp-based applications at large scale, they always created shards per time period (e.g. per day / hour),
  and when the taxonomy index is local to the shard, it's gone completely when the shard is gone.

* You can always re-map the ordinals to new ones by running a side process which checks which of the categories are unused, adds
  those that are in use to a new taxonomy index and rewrites the payload postings of the search index. It sounds expensive, but we've
  never had to do it yet, so I don't know how expensive it really is.

At the end of the day, the facets module lets you build the faceted search that best suits your needs. It can work entirely off-disk,
it can be loaded in-memory (similar to FieldCache; Mike and I are working on some improvements there - you're welcome to join!), it
can support exact counts or sampling, aggregation methods other than just counting, and more.

The sidecar taxonomy index is not as bad as it sounds. As I said, many IBM products have been working with it for many years, at small
and large scale.

I think that Solr could benefit from this module too, and I hope that I don't sound too biased :).
Having Solr reuse Lucene modules is important, IMO.

Shai



Re: Solr faceting vs. Lucene faceting

Robert Muir
Even as a first step, it would be nice to have Lucene's faceting exposed to
Solr in a way that only works with a single node.

Because it supports NRT and doesn't need to build up massive top-level
data structures and so on, many people that currently need multiple
nodes might be able to work just fine with a single node.


Re: Solr faceting vs. Lucene faceting

Shai Erera
As I said, if someone volunteers to do some work on the Solr side, I will gladly participate in that effort.
I just don't even know where to start w/ Solr :).

One thing that would be really great is if we could build an adapter (I think someone mentioned that word here)
which supports basic facet capabilities, so that we can at least benchmark Solr's current
implementation vs. the implementation with the module. I'm talking about something very basic, à la the tests Mike and I
run on the module (counting 1-2 facets, simple hierarchy, simple queries).

Then we can at least tell if moving Solr to the module makes sense, before we continue to develop all of current
Solr's functionality on top of the module.

Shai



Re: Solr faceting vs. Lucene faceting

Adrien Grand
Hi Shai,

On Thu, Dec 13, 2012 at 12:21 PM, Shai Erera <[hidden email]> wrote:
> As I said, if someone volunteers to do some work on the Solr side, I will
> gladly participate in that effort.
> I just don't even know where to start w/ Solr :).

The entry point for Solr facets is
org.apache.solr.request.SimpleFacets.getFacetCounts (called from
FacetComponent).

> One thing that would be really great is if we can build an adapter (I think
> someone mentioned that word here)
> which supports basic facets capabilities, so that we can at least benchmark
> Solr's current
> implementation vs the implementation w/ the module.

Comparing both impls would be great, but an adapter might be hard to
write given how Lucene faceting differs from Solr faceting: the Lucene
module requires users to decide at indexing time what and how to facet,
whereas Solr does everything at search time (there is even an issue
open in order to be able to compute facet counts based on arbitrary
functions [1]) using FieldCache and UninvertedField (meaning that you
can compute facets on any field that is indexed). So Lucene faceting
would probably require an additional field property in the schema to
let Solr know that it should add category paths to documents? (Please
correct me if anything I wrote here is wrong.)
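
(For reference, a minimal sketch of Solr's search-time faceting via SolrJ - the field names "title" and "author" are made up, but the facet parameters are the standard ones:)

    import org.apache.solr.client.solrj.SolrQuery;

    // Plain search-time faceting: nothing facet-specific was done at index time.
    SolrQuery query = new SolrQuery("title:lucene");
    query.setFacet(true);            // facet=true
    query.addFacetField("author");   // facet.field=author
    query.setFacetLimit(10);         // facet.limit=10
    query.setFacetMinCount(1);       // facet.mincount=1
    // QueryResponse rsp = server.query(query);
    // rsp.getFacetField("author").getValues() -> top author values with counts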

I have a few questions regarding the faceting module:
 - do you have any rough idea of how speed and memory usage vary
depending on the number of docs to collect, distinct field values, etc.?
 - TaxonomyReader seems to use ints as ordinals for category paths;
does that mean that the faceting module can't handle paths that have
more than 2B distinct values? Is it fixable? (Or maybe it doesn't make
sense to handle such large numbers of distinct values?)

 [1] https://issues.apache.org/jira/browse/SOLR-1581

--
Adrien


Re: Solr faceting vs. Lucene faceting

Jack Krupansky-2
"the lucene module requires users to decide at indexing time what and how to
facet whereas Solr does everything at searching time"

It would be nice to have some confirmation/clarification of that - are
Lucene facets "static" in some/any sense? What decisions does an app
developer need to make up front that can only be changed with a full
reindex of the data?

I'm trying to get a handle on whether Lucene facets are a guru-level feature
or something that an average Lucene user can trivially master with, say, 5
minutes of reading. Or is it the kind of feature that is mainly of interest
to the developers of higher-level search platforms such as Solr and
ElasticSearch, as opposed to the users of those platforms?

-- Jack Krupansky


Re: Solr faceting vs. Lucene faceting

Shai Erera
Hi Jack,

> Are Lucene facets "static" in some/any sense?

Lucene facets are not static in any way. The taxonomy is built on-the-fly, as documents are added to the index. You could say that it's 'discovered' as you add documents.
The facets come with a rich userguide: http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
I also wrote a few posts on it: http://shaierera.blogspot.com

> What decisions does an app developer need to make upfront

Well, as an app developer, you currently need to decide up front what facets your documents will have. A document doesn't have to contain all the facets, but you cannot say "hey, I added an Author field, now I want to facet on it". The reason is that in order to facet on it, the values that you put under Author need to be added to the taxonomy and resolved to an ordinal. Then those ordinals are written to the search index, in a way that enables very fast and efficient aggregations.

Also, if you're going to do more than just counting (see my first post - an intro to facets), you're going to need to index the facets in a special way (I intend to write a blog post about that too, w/ example code).
But I guess that's expected, right? You cannot add a 'price' field to the index as String values and suddenly expect to be able to do efficient range queries on it.
As an app developer, you'll recognize that when writing your app and add the field as a numeric field.

> and can only be changed with a full reindex of the data?

As with regular Lucene fields, if you suddenly decide to make a change to your taxonomy, e.g. that category A/C now needs to be under A/B/C, then yes, you will need to re-index the documents that were previously associated w/ A/C. But now that we're making progress w/ field-level updates (see LUCENE-4258), perhaps in the future you won't need to do so.

> I'm trying to get a handle on whether Lucene Facets is a guru-level feature...

Absolutely not! Lucene facets allow you to do very complicated things, but they also let you get a faceted index up and running in, I'd say, less than 5 minutes.
Look at this post (http://shaierera.blogspot.com/2012/11/lucene-facets-part-2.html). You can copy-paste the code (over current trunk) and get an impression of what it's like to index facets w/ Lucene.
Also, Mike McCandless and I are working on lots of simplifications now, including some specialized code paths for common use cases. You can follow LUCENE-4619.
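
(To give a flavor of that starter-level flow, a minimal index-and-count sketch. The class names here follow a later revision of the facet module's API (FacetsConfig/FacetField); the 4.0-era classes are named differently, so treat this purely as an illustration of the flow, not exact code for any one release:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.facet.*;
    import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
    import org.apache.lucene.facet.taxonomy.TaxonomyReader;
    import org.apache.lucene.facet.taxonomy.directory.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    Directory indexDir = new RAMDirectory(), taxoDir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(new StandardAnalyzer()));
    DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir); // sidecar taxonomy index
    FacetsConfig config = new FacetsConfig();

    Document doc = new Document();
    doc.add(new FacetField("Author", "Shai"));         // facets are declared at index time
    writer.addDocument(config.build(taxoWriter, doc)); // categories are resolved to ordinals here
    writer.close();
    taxoWriter.close();

    // Count the top Authors for a query:
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));
    TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    System.out.println(facets.getTopChildren(10, "Author"));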

> is it the kind of feature that is mainly of interest to the developers of higher-level search platforms such as Solr and ElasticSearch as opposed to the users of those platforms

Again, absolutely not! Well, it's true that in order to get the real value out of faceted search, you need at least a user interface that shows the returned facets, weights, etc.
But there's nothing in the module that restricts you from working with it as-is.

Hope I answered all your questions.

Shai



Re: Solr faceting vs. Lucene faceting

Shai Erera
In reply to this post by Adrien Grand
Hi Adrien,

> the lucene module requires users to decide at indexing time what and how to facet
> whereas Solr does everything at searching time

True, that's one difference between the two implementations today, even though I think that we can create a specialized path (under LUCENE-4619) for really simple, non-hierarchical cases.
I don't know if and how Solr can handle a field value like Sport/Basketball/NBA/... -- i.e., how is the hierarchy broken up?

I imagine that there's no magic done here. Assuming that Solr can handle it (and I think I read somewhere that it does handle hierarchical facets?), you've got to specify somewhere that this field's values should be broken on '/' and that you'd like to facet on it? Or at least you need to say "create me a hierarchy from it"?

But I think that in Lucene we can add a FlatFacetsField, so that you initialize it like new FlatFacetsField("Author", "Shai") and it creates the implicit hierarchy Author/Shai.
Or, we can add a FieldType.facet(), and if the field is a StringField (i.e. indexed, not tokenized), then we create the implicit hierarchy fName/fValue?
Just throwing out an idea... that's basically the purpose of LUCENE-4619: come up with an even simpler starter-level API for really simple cases.

Making a decision at search time that you'd like to facet on a field ... well, I think that not doing that is what allows us to do efficient faceted search, off-disk or in-memory, to support really large indexes and taxonomies, and to be NRT.

From the little I know and read, this is one drawback of Solr facets? But if not, don't be too harsh in your reply, I'm not trying to pass any judgement here :).

> - do you have any rough idea of how speed and memory usage vary
> depending on the number of docs to collect, distinct field values,
> etc. ?

As the tests show (I think on LUCENE-4602, but I'm starting to lose track of all the new issues :)), when you load the facets info into memory, performance improves. Still, I think that if you're going to count facets on millions of documents, it's not going to be efficient no matter where they are. Loading them into memory will speed things up of course, but also consume more RAM.
That's why we can sample facets, to get the approximate top-K very fast, and then, per your decision, either do a 2nd pass to correct the approximate weights or return them as-is, e.g. as percentages.

> TaxonomyReader seems to use ints as ordinals for category paths,
> does it mean that the faceting module can't handle paths that have
> more than 2B distinct values? Is it fixable? (Or maybe it doesn't make
> sense to handle such large numbers of distinct values?)

That's right, it's a limitation, but I haven't seen a taxonomy that is that big. I've worked w/ several teams which had really huge taxonomies, I'm talking on the order of 10M nodes, but that doesn't even scratch the MAX_INT limit, right?

I guess that we can change the taxonomy to support long ordinals, but I think that managing a taxonomy that size is going to pose plenty of other limitations first. Probably much sooner than you'd hit the MAX_INT limit :).

I.e., today we count the facets in memory, in one contiguous array of integers. If it's too large, you can choose to partition the ordinal space into smaller sets.
But even if a partition is of size 1M, or 10M, I don't think that counting 200+ partitions makes sense (comparable to, e.g., reading 200 posting lists).
So I think that if anyone would want to really manage taxonomies of that size, we'd need to discuss and maybe get back to the drawing board :).
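
(Roughly, the in-memory counting has this shape - a sketch only, with the per-document ordinal lookup left abstract; getOrdinals() is not a real API call:)

    // One contiguous int[] with a slot per category ordinal in the taxonomy.
    int[] counts = new int[taxonomySize];
    for (int doc : matchingDocs) {
      for (int ord : getOrdinals(doc)) { // however ordinals are read back (payloads / DocValues)
        counts[ord]++;
      }
    }
    // Top-K is a pass over counts[]; only the winning ordinals are resolved
    // back to category paths via the taxonomy at the end.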

Shai




Re: Solr faceting vs. Lucene faceting

Adrien Grand
Hi Shai,

Thanks for your answers!

On Thu, Dec 13, 2012 at 5:05 PM, Shai Erera <[hidden email]> wrote:
>> the lucene module requires users to decide at indexing time what and how
>> to facet
>> whereas Solr does everything at searching time
>
> True, that's one difference between the two implementations today, even
> though I think that we can create a specialized path (under LUCENE-4619) for
> really simple, non-hierarchical cases.
> I don't know if and how Solr can handle a field value
> Sport/Basketball/NBA/... -- i.e., how is the hierarchy broken?

Solr doesn't break hierarchies up. Its closest concept is pivot faceting
(https://issues.apache.org/jira/browse/SOLR-2894), available since 4.0,
which allows you to compute hierarchical facets on the fly. For
example, you can compute brand counts per category (if both brand and
category are indexed).
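
(For illustration, a sketch of a pivot facet request via SolrJ - the field names "cat" and "brand" are made up; facet.pivot is the actual parameter from SOLR-2894:)

    import org.apache.solr.client.solrj.SolrQuery;

    // Compute brand counts per category on the fly, purely at search time;
    // no hierarchy is declared at index time.
    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    query.set("facet.pivot", "cat,brand");   // hierarchy: cat -> brand
    query.set("facet.pivot.mincount", "1");
    // Response buckets nest: cat=Electronics (n) -> brand=Apple (n1), brand=Dell (n2), ...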

> Making a decision at search time that you'd like to facet on a field ...
> well I think that not doing that is what allows us to do efficient faceted
> search, off-disk or in-memory, support really large indexes and taxonomies
> and be NRT.

Maybe it would be less efficient (or not?), but I think this kind of
flexibility can be great for some applications (I'm thinking of
analytics right now, but there are probably many other use-cases). To
me the main issues with Solr faceting right now are that it consumes a
lot of memory and is not NRT-friendly because of uninversion time. But
I think this can be fixed by using doc values (because they can be
stored on disk and don't need to be uninverted) instead of the field
cache. I would really love for the faceting module to become flexible
enough to handle both index-time and search-time facets, so
that Solr could become a consumer of this API instead of implementing
its own faceting logic.

> So I think that if anyone would want to really manage taxonomies of that
> size, we'd need to discuss and maybe get back to the drawing board :).

One use-case I'm thinking of is finding the top terms of documents
that match an arbitrary query. This can be very useful to help you
better understand your data, but in this case the number of distinct
values is the size of your term dictionary.

--
Adrien


Re: Solr faceting vs. Lucene faceting

Jack Krupansky-2
In reply to this post by Shai Erera
Thanks. Now back to thinking about Lucene vs. Solr facets in Solr.

-- Jack Krupansky
 

Re: Solr faceting vs. Lucene faceting

David Smiley
In reply to this post by Adrien Grand
I second this use-case. This is my only concern with Solr faceting -- using Solr's UnInvertedField on the search index to discover frequently used words doesn't scale well. Shai, do you think this would scale? FWIW, one of my indexes with only 300k docs has ~3.1M terms -- not a lot, but it's a number to consider.

~ David
