How to index multiple sites with option of combining results in search

How to index multiple sites with option of combining results in search

Dietrich-5
I am planning to index 275+ different sites with Solr, each of which
might have anywhere up to 200,000 documents. When performing searches,
I need to be able to search against any combination of sites.
Does anybody have suggestions for the best practice in a scenario
like this, considering both indexing and querying performance?
Should I put everything into one index and filter at query time, or
create a separate index for each site and combine results at query
time?
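[A sketch of the single-index option: store every document with a field identifying its source site, then restrict searches with a Solr filter query (fq), which Solr caches separately from the main query. The field name `site` and the localhost URL are assumptions for illustration, not anything from the thread.]

```python
from urllib.parse import urlencode

def build_site_query(base_url, text_query, sites):
    """Build a Solr select URL that restricts results to the given
    sites via a filter query (fq). Filter queries are cached by Solr
    independently of the main query, so repeated site combinations
    are cheap. The 'site' field name is a hypothetical schema field."""
    params = [("q", text_query)]
    if sites:
        # One cached filter per combination: site:(a OR b OR c)
        fq = "site:(%s)" % " OR ".join(sites)
        params.append(("fq", fq))
    return base_url + "/select?" + urlencode(params)

url = build_site_query("http://localhost:8983/solr", "ipod",
                       ["siteA", "siteB"])
```

A per-site filter cached as a single fq entry keeps the combination logic out of the main relevance query, so scoring is unaffected by which sites are selected.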

Re: How to index multiple sites with option of combining results in search

Otis Gospodnetic-2
Sounds like SOLR-303 is a must for you.  Have you looked at Nutch?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: How to index multiple sites with option of combining results in search

Dietrich-5
On Tue, Mar 25, 2008 at 6:12 PM, Otis Gospodnetic
<[hidden email]> wrote:
> Sounds like SOLR-303 is a must for you.
Why? I see the benefits of using a distributed architecture in
general, but why do you recommend it specifically for this scenario?
> Have you looked at Nutch?
I don't want to (or need to) use a crawler. I am using a crawler-based
system now, and it does not offer the flexibility I need when it comes
to custom schemas and faceting.


Re: How to index multiple sites with option of combining results in search

Otis Gospodnetic-2
In reply to this post by Dietrich-5
Dietrich,

I pointed to SOLR-303 because 275 * 200,000 looks like too big a number for a single machine to handle.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: How to index multiple sites with option of combining results in search

Dietrich-5
I understand that, and that makes sense. But, coming back to the
original question:
>  >  When performing searches,
>  >  I need to be able to search against any combination of sites.
>  >  Does anybody have suggestions what the best practice for a scenario
>  >  like that would be, considering  both indexing and querying
>  >  performance? Put everything into one index and filter when performing
>  >  the queries, or creating a separate index for each one and combining
>  >  results when performing the query?

Are there any established best practices for that?

-ds


Re: How to index multiple sites with option of combining results in search

Otis Gospodnetic-2
In reply to this post by Dietrich-5
Dietrich,

I don't think there are established practices in the open (yet).  You could design your application with a site(s)->shard mapping and then, knowing which sites are involved in the query, search only the relevant shards.  This will be efficient, but it would require careful management on your part.

Putting everything in a single index would just not work with "normal" machines, I think.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
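[The site(s)->shard mapping could be sketched along these lines, using the `shards` parameter from Solr distributed search (SOLR-303). The hostnames, the mapping itself, and the `site` field are hypothetical; in practice the mapping would live in configuration or a database.]

```python
from urllib.parse import urlencode

# Hypothetical site -> shard assignment. Several small sites can
# share one shard; a very large site could get a shard of its own.
SITE_TO_SHARD = {
    "siteA": "shard1.example.com:8983/solr",
    "siteB": "shard1.example.com:8983/solr",
    "siteC": "shard2.example.com:8983/solr",
}

def build_sharded_query(base_url, text_query, sites):
    """Query only the shards that hold the requested sites, via the
    'shards' parameter of Solr distributed search (SOLR-303)."""
    shards = sorted({SITE_TO_SHARD[s] for s in sites})
    params = [
        ("q", text_query),
        ("shards", ",".join(shards)),
        # Still filter by site, since a shard may host several sites.
        ("fq", "site:(%s)" % " OR ".join(sites)),
    ]
    return base_url + "/select?" + urlencode(params)
```

The point of the mapping is that a query touching only siteA and siteB never contacts shard2 at all, which is where the efficiency (and the management burden of keeping the mapping current) comes from.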


Re: How to index multiple sites with option of combining results in search

Dietrich-5
Makes sense, but probably overkill for my requirements. I wasn't
really talking about 275 * 200,000; more likely the total would be
something like four million documents. I was under the assumption
that a single machine, or a simple distributed index, should be able
to handle that. Is that wrong?

-ds


Re: How to index multiple sites with option of combining results in search

Otis Gospodnetic-2
In reply to this post by Dietrich-5
Ah, that's a very different number.  Yes, assuming your docs are web pages, a single reasonably equipped machine should be able to handle that and a few dozen QPS.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


RE: How to index multiple sites with option of combining results in search

Lance Norskog-2
In fact, 55M records work fine in Solr, assuming they are small
records. The problem is that the index files wind up in the tens of
gigabytes. The logistics of doing backups, snapping to query servers,
etc. are what make such an index unwieldy, and why multiple shards
are useful.

Lance
