Bulk Indexing

Bulk Indexing

Sohail Aboobaker
Hi,

We have created a search service that is responsible for providing an
interface between Solr and the rest of our application. It basically takes
one document at a time and updates or adds it to the appropriate index.

Now, in the application, we have processes that add products (our documents
are based on products) in bulk using a bulk data load process. At this
point, we use the same search service to add the documents in a loop. There
can be up to 20,000 documents in one load.

In a recent solr-user discussion, this sounded like a no-no strategy with
red flags all around it.

What are the alternatives?

Thanks,

Regards,
Sohail Aboobaker.

RE: Bulk Indexing

Zhang, Lisheng
Hi,

Previously I asked a similar question, and I have not fully implemented my
solution yet.

My plan is:
1) use Solr only for search, not for indexing
2) have a separate Java process to index (calling the Lucene API directly;
   maybe it can call the Solr API, I need to check more details).

As other people pointed out earlier, the problem with the above plan is
that Solr does not know when to reload its IndexSearcher (namely the
underlying IndexReader) after indexing is done, since the indexer and Solr
are two separate processes.

My plan is to have Solr not cache any IndexReader (each time a search is
performed, just create a new IndexSearcher), because:

1) our app is made of many Lucene index folders (in Solr language, many
   cores), so caching an IndexSearcher per core would be too expensive.
2) in my experience, search without caching is still quite fast (maybe
   partially because our indexed data is not large, per folder).

This is just my plan (not fully implemented yet).
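
A rough sketch of step 2, assuming a plain Lucene 3.x IndexWriter running
in a separate JVM; the index path, analyzer, and field names below are
placeholders, not our actual setup:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ExternalIndexer {
        public static void main(String[] args) throws Exception {
            // One indexed data folder (one "core"); the path is a placeholder.
            FSDirectory dir = FSDirectory.open(new File("/data/index/core1"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(dir, cfg);

            Document doc = new Document();
            doc.add(new Field("id", "42",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);

            // commit() makes the new segment visible to any IndexReader
            // opened afterwards -- e.g. the per-search IndexSearcher the
            // search process would create under this plan.
            writer.commit();
            writer.close();
        }
    }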

Best regards, Lisheng


Re: Bulk Indexing

Alexandre Rafalovitch
In reply to this post by Sohail Aboobaker
Haven't tried this, but:
1) I think SOLR 4 supports on-the-fly core attach/detach/swap. Can
somebody confirm this?
2) If 1) is true, run everything as two cores.
3) One core is live in production.
4) The second core is detached from SOLR and attached to something like
SolrJ, which I believe can index without going over the network.
5) Once SolrJ finishes the bulk indexing, swap the cores around (see the
sketch below).

Or, if you are not live yet, just use SolrJ to run the indexing and then
attach the finished core to SOLR.
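
A sketch of the swap in step 5, assuming SolrJ's CoreAdminRequest.swapCore
helper (HttpSolrServer is the 4.x class name; 3.x has CommonsHttpSolrServer
instead); "live" and "build" are placeholder core names:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class CoreSwap {
        public static void main(String[] args) throws Exception {
            // CoreAdmin requests go to the server root, not to a core URL.
            SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

            // Once the "build" core has finished bulk indexing and committed,
            // exchange it with the "live" core in one operation.
            CoreAdminRequest.swapCore("live", "build", admin);
        }
    }

After the swap, searches hit the freshly built index, and the old live core
becomes the target for the next bulk build.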

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)



Re: Bulk Indexing

Sohail Aboobaker
We will be using a Solr 3.x version. I was wondering if we need to worry
about this at all, as we have only 10K index entries at a time. That sounds
like a very low number, and we have only one document type at this point.

Should we worry about using SolrJ directly for indexing and searching at
this low volume, with a simple schema?

RE: Bulk Indexing

Lan
In reply to this post by Zhang, Lisheng
I assume you're indexing on the same server that is used to execute search
queries. Adding 20K documents in bulk could cause the Solr server to 'stop
the world', where the server stops responding to queries.

My suggestions are:
- Set up master/slave replication to insulate your clients from 'stop the
  world' events during indexing.
- Update in batches, with a single commit at the end of the batch (see the
  sketch below).
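
A minimal sketch of the batching approach, assuming SolrJ (HttpSolrServer
is the 4.x class; use CommonsHttpSolrServer on 3.x); the master URL, core
name, and batch size are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Index against the master; the slaves keep serving queries.
            SolrServer master =
                new HttpSolrServer("http://master:8983/solr/products");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 20000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", i);
                batch.add(doc);
                if (batch.size() == 1000) {   // send a chunk per request
                    master.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                master.add(batch);
            }
            master.commit();                  // one commit for the whole load
        }
    }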

Re: Bulk Indexing

Mikhail Khludnev
Lan,

I assume that some particular server can freeze on such a bulk load, but
the overall message doesn't seem absolutely correct to me. Solr has a lot
of mechanisms for surviving such cases.
Bulk indexing is absolutely right, if you submit a single request with a
long iterator of SolrInputDocuments. The indexing thread then occupies a
single CPU core, keeping the others ready for searches. Such indexing
occupies ramBufferSizeMB of heap; after that limit is exceeded, a new
segment is flushed to disk, which requires some I/O and can impact
searchers. (A misconfigured merge policy can ruin everything, of course.)
Commits should be driven by business considerations, not performance ones.
A commit leads to creating a new searcher and warming it, and these actions
can be memory- and CPU-expensive (almost entirely single-threaded activity).
I did some experiments on a 40M-document index on a desktop box. Constantly
adding 1K docs/sec, with autocommit firing more than once per minute,
doesn't have a significant impact on search latency.
Generally, yes: a master-slave scheme gives more performance, for sure.




--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <[hidden email]>

Re: Bulk Indexing

Sohail Aboobaker
We have autocommit on. After validating each record, we send it to the
search service, and we keep doing this in a loop. Mikhail / Lan, are you
suggesting that instead of sending documents one at a time in a loop, we
should collect them in an array and do a commit at the end? Is that better
than doing it in a loop with autocommit?

Also, where can I find some reference material on master/slave
configuration?

Thanks.

Re: Bulk Indexing

Mikhail Khludnev
Usually, collecting the whole array hurts the client's JVM, while sending
doc-by-doc bloats the server with a huge number of small requests. You just
need to rewrite your code from an eager loop into a pulling iterator, so
that all docs can be submitted via a single HTTP request:
http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
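
A rough sketch of that pulling-iterator approach, using the
SolrServer.add(Iterator<SolrInputDocument>) method described on the wiki
page above; hasMoreRows() and fetchNextRow() are hypothetical stand-ins
for your real data source:

    import java.util.Iterator;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class StreamingIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/products");

            // The iterator is pulled lazily while the single update request
            // streams; only one document is built in memory at a time.
            Iterator<SolrInputDocument> docs =
                    new Iterator<SolrInputDocument>() {
                public boolean hasNext() {
                    return hasMoreRows();               // placeholder
                }
                public SolrInputDocument next() {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", fetchNextRow()); // placeholder
                    return doc;
                }
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };

            server.add(docs);  // one streaming HTTP request for all docs
            server.commit();   // a single commit when the load is done
        }

        // Hypothetical data-source hooks; replace with your bulk-load source.
        static boolean hasMoreRows() { return false; }
        static String fetchNextRow() { return null; }
    }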
Then, if you aren't happy with low utilization from using a single thread,
post your problem and numbers here again.

http://wiki.apache.org/solr/SolrReplication
http://lucidworks.lucidimagination.com/display/solr/Index+Replication




--
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <[hidden email]>