Can Solr solve this simple problem?

Alexandr Bocharov
Hi everyone :)
Our company is very interested in using the Solr engine for searching people.
I have 3 questions below about Solr's extended capabilities, but first I'd
like to present the problem.

Let's say we have ~100 million users with many characteristics, some of
which are described below.
We want to search users by any set of these characteristics (of course we
would use index clustering, replication, and query distribution).

   - country - text (alpha-iso-3 country code)
   - language - text (alpha-3 language code)
   - has_photo - boolean
   - has_video - boolean
   - lastvisit - date
   - gender - int
   - age - int
   - latitude - float
   - longitude - float
   - height - int
   - updated - date
   - 100+ other boolean fields to store and search by, each indicating
   whether the profile has some property or not

*Prototype SQL query looks like this:*

SELECT user_id
FROM users
WHERE
    country = 'USA'
    AND language = 'SPA'
    AND gender = 1
    AND age BETWEEN 30 AND 40
    AND latitude BETWEEN 39.0 AND 41.0
    AND longitude BETWEEN 73.0 AND 75.0
    AND height BETWEEN 170 AND 180
    AND has_photo = 1
    AND has_video = 0
    AND (bool_field1 = 1 OR bool_field2 = 1)
    AND (bool_fieldN = 0 OR bool_fieldM = 1 OR bool_fieldK = 0)
    ...
ORDER BY
    IF(has_photo = 1, 100, 0) +
    IF(language = 'FRA', 50, 0) +
    IF(has_description = 1, 150, 0) +
    IF(has_video = 1, 50, 0) +
    IF(lastvisit > NOW() - interval 1 month, 200, 0) DESC,
    IF(age > 35, 20, 0) +
    IF(gender = 2, 30, 0) +
    IF(bool_field1 = 1, 50, 0) +
    IF(bool_fieldN = 0, 100, 0) ASC
LIMIT 200;

So, these are my 3 questions:
1. Does Solr support searching across any number of fields of different
types, as in the WHERE condition?
2. Does Solr support sorting that depends on other fields (like the sums
in ORDER BY)? In other words, does it provide some kind of function that
can be used to sort the results from question 1?
3. Does Solr support real-time index updates, or updates every N minutes?

What advice can you give on implementing this search scheme with Solr?

Best regards.

Re: Can Solr solve this simple problem?

Jan Høydahl / Cominvent
> Hi everyone :)

Hi :)

> So, these are my 3 questions:
> 1. Does Solr support searching across any number of fields of different
> types, as in the WHERE condition?

Yes. As long as these are not full-text fields, you should use filter queries for them, e.g.
&q=*:*
&fq=country:USA
&fq=language:SPA
&fq=age:[30 TO 40]
&fq=(bool_field1:1 OR bool_field2:1)

I use several separate "fq" parameters instead of one long one because each filter is cached on its own, so individual filters can be reused across queries.
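
To make the filter-query approach concrete, here is a minimal Python sketch that builds such a request URL from the conditions in the prototype SQL. The host, port, and path in `SOLR_SELECT` and the `build_filter_url` helper are assumptions for illustration only; field names follow the list in the original post, and whether booleans are written as `1` or `true` depends on your schema.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust to your own installation.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_filter_url(filters, rows=200):
    """Build a select URL with one fq parameter per filter clause."""
    params = [("q", "*:*"), ("rows", str(rows)), ("fl", "user_id")]
    # One fq per clause, so that each filter can be cached and reused on its own.
    params += [("fq", clause) for clause in filters]
    return SOLR_SELECT + "?" + urlencode(params)

# The WHERE conditions from the SQL prototype, one clause each.
filters = [
    "country:USA",
    "language:SPA",
    "gender:1",
    "age:[30 TO 40]",
    "latitude:[39.0 TO 41.0]",
    "longitude:[73.0 TO 75.0]",
    "height:[170 TO 180]",
    "has_photo:1",
    "has_video:0",
    "(bool_field1:1 OR bool_field2:1)",
]
url = build_filter_url(filters)
print(url)
```

Passing a list of pairs to `urlencode` keeps the repeated `fq` keys and takes care of escaping the brackets, colons, and spaces.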

> 2. Does Solr support sorting that depends on other fields (like the sums
> in ORDER BY)? In other words, does it provide some kind of function that
> can be used to sort the results from question 1?

Yes. In the trunk version you can sort by a function, which can do sums and all sorts of crazy things:
&sort=sum(product(has_photo,10),if(exists(query($agequery)),50,0)) asc&agequery=age:[53 TO *]
See http://wiki.apache.org/solr/FunctionQuery for more functions
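
As a rough illustration of how the ORDER BY sums from the original post might map onto a function sort, the Python sketch below generates a `sort` parameter using the `if(exists(query($param)),weight,0)` pattern from the example above. The `build_function_sort` helper and the `w0`, `w1`, ... parameter names are hypothetical, and the clause syntax (for example `has_photo:1`) assumes the field conventions used in this thread.

```python
from urllib.parse import urlencode

def build_function_sort(rules, direction="desc"):
    """Turn (solr_query, weight) pairs into a sort parameter plus helper parameters.

    Each rule contributes `weight` when the document matches the query,
    mirroring IF(condition, weight, 0) in the SQL prototype.
    """
    if not rules:
        raise ValueError("at least one rule is required")
    terms, params = [], []
    for i, (query, weight) in enumerate(rules):
        name = f"w{i}"  # hypothetical parameter name, dereferenced as $w0, $w1, ...
        terms.append(f"if(exists(query(${name})),{weight},0)")
        params.append((name, query))
    # A single term is used as-is; several terms are added up with sum().
    expr = terms[0] if len(terms) == 1 else "sum(" + ",".join(terms) + ")"
    return [("sort", f"{expr} {direction}")] + params

# First ORDER BY expression from the SQL prototype.
rules = [
    ("has_photo:1", 100),
    ("language:FRA", 50),
    ("has_description:1", 150),
    ("has_video:1", 50),
    ("lastvisit:[NOW-1MONTH TO *]", 200),
]
sort_params = build_function_sort(rules)
print(urlencode([("q", "*:*")] + sort_params))
```

The second ORDER BY expression could be generated the same way and appended to the `sort` value as a further comma-separated clause, provided the helper parameter names do not collide.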

But you could also do much of this through boost queries
&sort=score desc
&bq=language:FRA^50
&bq=age:[53 TO *]^20
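
Boost queries (`bq`) are a parameter of the DisMax family of query parsers, so a request along these lines would also need to select one of them. Below is a minimal Python sketch, assuming the edismax parser is available in your Solr version; the `boost_params` helper is hypothetical. Note that boosts influence the relevance score rather than adding exact constants, so the ordering may not match the SQL sums precisely.

```python
from urllib.parse import urlencode

def boost_params(boosts):
    """Build parameters that sort by score, with one bq per boosted clause."""
    # defType=edismax is an assumption; bq is not read by the default parser.
    params = [("q", "*:*"), ("defType", "edismax"), ("sort", "score desc")]
    params += [("bq", f"{clause}^{weight}") for clause, weight in boosts]
    return params

params = boost_params([("language:FRA", 50), ("age:[53 TO *]", 20)])
print(urlencode(params))
```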

> 3. Does Solr provide realtime index updating or updating every N minutes?

Sure, there is near-real-time (NRT) indexing in trunk (coming in 4.0).
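
One way to approximate "updating every N minutes" from the client side is the `commitWithin` update parameter, which asks Solr to commit added documents within a given number of milliseconds. The Python sketch below only builds the HTTP request and does not send it. The URL in `SOLR_UPDATE`, the assumption that this handler accepts a JSON array of documents, and the `build_update_request` helper all depend on your Solr version and configuration, so treat them as placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical update endpoint; the exact path depends on the Solr version.
SOLR_UPDATE = "http://localhost:8983/solr/update"

def build_update_request(docs, commit_within_ms=60000):
    """Build (but do not send) a JSON update request with commitWithin."""
    url = SOLR_UPDATE + "?" + urlencode({"commitWithin": commit_within_ms})
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"})

# Example document using field names from the original post.
changed_users = [
    {"user_id": 42, "has_photo": True, "lastvisit": "2012-04-17T10:00:00Z"},
]
req = build_update_request(changed_users)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`.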

Jan
Reply | Threaded
Open this post in threaded view
|

Re: Can Solr solve this simple problem?

Tomás Fernández Löbbe
I'm wondering if Solr is the best tool for this kind of usage. Solr is a
"text search engine", so even if it supports all those features, it is
designed for text search, which doesn't seem to be what you need. What are
the reasons for moving from a DB implementation to Solr?

Don't misunderstand me, I really like Solr and it is a really cool hammer,
but not everything is a nail. I would really like to hear more opinions on
this.


Tomás



Re: Can Solr solve this simple problem?

Yonik Seeley-2-2
2012/4/16 Tomás Fernández Löbbe <[hidden email]>:
> I'm wondering if Solr is the best tool for this kind of usage. Solr is a
> "text search engine"

Well, Lucene is a "full-text search library", but Solr has always been far more.
Dating back to its first use at CNET, it was used as a browse engine
(faceted search), sometimes without much of a full-text aspect at all.
And we're moving more and more into the NoSQL realm (durability,
realtime-get, and coming real soon - optimistic locking).

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: Can Solr solve this simple problem?

Alexandr Bocharov
In reply to this post by Jan Høydahl / Cominvent
Thanks for your reply :)
I have some new questions now:
1. How stable is the trunk version? Has anyone used it in production on any
kind of high-load project?
2. Does version 3.6 support near-real-time index updates?
3. How does Solr store its index? Is it all in memory for each shard, or on
disk with in-memory caching for frequently asked queries?
4. Is the best practice for index updating to do delta imports every 5
minutes, for example, and a full index rebuild once a day? Would a full
rebuild take a long time for ~100 million users?
5. Do sharding and replication have native support in Solr, so that all I
need to care about is the config file for the nodes? Are there any
limitations on using this kind of sorting with sharding?

The reasons why we want to move away from our DB search scheme (data is
sharded into small tables on several servers and managed in code) are:
1. the response time of our search isn't what we need (3-5 s now in
production, we want <1 s)
2. a growing amount of data
3. we want to automatically cluster any amount of data and search it,
without having to care about how the data is stored and whether it is
durable or not

That's why we are also looking at other solutions that autoshard huge
amounts of data and can handle these types of queries and sorting (we are
thinking about MySQL Cluster, but it's not stable yet, or Oracle Cluster).
If anyone can give advice on such technology, I'll be glad to hear it.


Re: Can Solr solve this simple problem?

Jan Høydahl / Cominvent
Hi,

You have many basic questions about search. Can I recommend one of the books? http://lucene.apache.org/solr/books.html
Also, you'll find a lot of answers on the Solr WIKI: http://wiki.apache.org/solr/ if you're not aware of it.

I think Solr may solve your performance problems well.
Whether it's the right tool for the job depends on several factors.
Also, sometimes it is useful to step back and think afresh. Perhaps you implemented things the way you did for technical reasons driven by your DB's capabilities.
When re-implementing on top of Solr, perhaps there are better ways to do what you REALLY wanted instead of limiting yourself to the ORDER BY syntax etc.
One of Solr's strengths is relevancy and FunctionQueries, and it can do amazing things :)

Further answers below.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 07:20, Alexandr Bocharov wrote:

> Thanks for your reply :)
> I have some new questions now:
> 1. How stable is the trunk version? Has anyone used it in production on any
> kind of high-load project?

It's stable and used in production in many places. An alpha or beta release is expected soon.

> 2. Does version 3.6 support near-real-time index updates?

No.

> 3. How does Solr store its index? Is it all in memory for each shard, or on
> disk with in-memory caching for frequently asked queries?

On disk, but with many caching optimizations.

> 4. Is the best practice for index updating to do delta imports every 5
> minutes, for example, and a full index rebuild once a day? Would a full
> rebuild take a long time for ~100 million users?

You can do deltas only, as often as you choose. Solr will handle the backend details.

> 5. Do sharding and replication have native support in Solr, so that all I
> need to care about is the config file for the nodes? Are there any
> limitations on using this kind of sorting with sharding?

Yes, sharding and replication are natively supported. See the Wiki.

> The reasons why we want to move away from our DB search scheme (data is
> sharded into small tables on several servers and managed in code) are:
> 1. the response time of our search isn't what we need (3-5 s now in
> production, we want <1 s)
> 2. a growing amount of data
> 3. we want to automatically cluster any amount of data and search it,
> without having to care about how the data is stored and whether it is
> durable or not
>
> That's why we are also looking at other solutions that autoshard huge
> amounts of data and can handle these types of queries and sorting (we are
> thinking about MySQL Cluster, but it's not stable yet, or Oracle Cluster).
> If anyone can give advice on such technology, I'll be glad to hear it.

What do you expect from "Autosharding"?


Re: Can Solr solve this simple problem?

Alexandr Bocharov
Thanks for your replies, you're a real expert :)
I've read the basic Solr documentation; I've only known Solr for about 2
days.
The documentation looks huge at first sight :). My company and I are
deciding whether to use Solr or another solution.
Maybe you're right about re-implementing our sorting functions as something
new.

1. If the index is stored on disk, how is good performance achieved? The
index changes frequently (~50,000 - 100,000 records are updated every 10
minutes), so maybe caching won't be effective.
2. What can you say about Solr's semantic search capabilities? Are there any
examples of it in production?
3. Can you please give some examples of projects/sites using Solr 4.0 in
production?



Re: Can Solr solve this simple problem?

Jan Høydahl / Cominvent
1. Just trust that Lucene will perform :)
   Incremental updates are actually stored in separate new index segments with their own caches, so all the old existing data is left untouched with its caches in place.

2. Please explain what you expect from "semantic search", which is an overloaded term.

3. On http://wiki.apache.org/solr/PublicServers the only one explicitly saying so is Jeeran. I'm sure others can fill in with more examples.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
