Posting Concurrently to Solr

Posting Concurrently to Solr

Abhishek Srivastava
Hello Everyone,

If I have a large data set that needs to be indexed, what strategy can I use to build the index quickly?

1. Split the input into multiple XML files, then open several shells and post each file separately. Will this work, and will it build the index faster than posting one large XML file?

2. Skip the XML files altogether: write the extraction logic in an ETL tool and let the tool send the update commands to Solr. I would run the ETL tool multi-threaded, with each thread extracting data from the backend and sending it to Solr for indexing.

3. Use the multi-core feature, populate each core separately, then merge the cores.

Are there any other approaches?


Re: Posting Concurrently to Solr

Vijayant Kumar
Why not use the DataImportHandler (DIH)?

http://wiki.apache.org/solr/DataImportHandler
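
For a scheduled load, a DIH import is normally started with an HTTP request carrying a `command` parameter. The sketch below only builds such a URL. The base URL and the `/dataimport` handler path are assumptions taken from the usual wiki-style setup, and `dih_command_url` is a hypothetical helper name, so adjust both to whatever your solrconfig.xml registers.

```python
from urllib.parse import urlencode


def dih_command_url(solr_base, command, handler="/dataimport", **params):
    """Build a DataImportHandler command URL.

    `handler` is an assumed path; use the one registered in solrconfig.xml.
    Extra keyword arguments (for example clean="true") become query parameters.
    """
    query = {"command": command}
    query.update(params)
    return solr_base.rstrip("/") + handler + "?" + urlencode(query)


if __name__ == "__main__":
    # A cron job could fetch this URL once a day to start a full import.
    print(dih_command_url("http://localhost:8983/solr", "full-import", clean="true"))
```

Requesting the same handler with `command=status` reports the progress of a running import.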


Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211

Re: Posting Concurrently to Solr

Jan Høydahl / Cominvent
In reply to this post by Abhishek Srivastava
You did not say how frequently you need to update the index, whether this is a batch operation, or whether you also have real-time requirements after the initial load.

Your ETL could use SolrJ and the StreamingUpdateSolrServer for high throughput.
You could try multiple threads pushing in parallel if your bottleneck is on the client side.
If that's not enough, you can split your index into multiple cores/shards to get more parallel indexing power.
You don't need to merge them at the end; you can query across them using the shards parameter.
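
As a rough illustration of the multi-threaded idea, here is a minimal Python sketch that uses only the standard library rather than SolrJ. It batches documents into Solr `<add>` XML messages and posts the batches from a thread pool; a second helper builds a query URL with the `shards` parameter. The update URL, batch size, worker count, and all function names are assumptions for illustration, not part of any Solr client API.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape, quoteattr

# Assumed endpoint; adjust host, port, and core name to your setup.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(docs):
    """Render a list of dicts as one Solr <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def chunked(docs, size):
    """Yield consecutive batches of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def http_post(xml, url=UPDATE_URL):
    """Send one XML message to Solr and return the HTTP status code."""
    request = Request(url, data=xml.encode("utf-8"),
                      headers={"Content-Type": "text/xml; charset=utf-8"})
    with urlopen(request) as response:
        return response.status


def post_concurrently(docs, batch_size=500, workers=4, send=http_post):
    """Post batches in parallel; `send` can be swapped out for testing."""
    batches = [build_add_xml(batch) for batch in chunked(docs, batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches))


def sharded_query_url(base, query, shards):
    """Build a select URL that fans the query out via the shards parameter."""
    params = {"q": query, "shards": ",".join(shards)}
    return base + "/select?" + urlencode(params)
```

Documents posted this way become searchable only after a commit, so a single `<commit/>` message sent once all batches have finished is usually cheaper than committing per batch. Whether more threads help depends on where the bottleneck is, as noted above.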

For very large batch indexing jobs, you can look at a MapReduce strategy: http://wiki.apache.org/solr/HadoopIndexing

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

Re: Posting Concurrently to Solr

Abhishek Srivastava
In reply to this post by Abhishek Srivastava
I will run the index update once a day.

Regards,
Abhishek
