
Index speed in Solr


Index speed in Solr

neosky
It takes me 50 hours to index 9 GB of files in total (about 2,000,000 documents) with an n-gram filter (min=6, max=10). The tokens reaching the n-gram filter are long (not words; up to 300,000 bytes including whitespace). I split the data into 4 files and used post.sh to post them all at the same time. I also tried writing a Lucene indexer myself (single thread); the time was almost the same. What is the general bottleneck for indexing in Solr? Doesn't Solr handle index update requests concurrently?
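
The analysis chain described is roughly the following (a sketch only; the exact schema wasn't posted, so the fieldType name and the choice of tokenizer are assumptions):

<!-- schema.xml sketch of the analysis described above. The fieldType name
     and KeywordTokenizerFactory (whole value = one long token) are
     assumptions; only the n-gram sizes come from the post. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="6" maxGramSize="10"/>
  </analyzer>
</fieldType>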

1.
Posting file /ngram_678910/file1.xml to http://localhost:8988/solr/update
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 51 3005M    0     0   51 1557M      0  18902 46:19:14 23:59:46 22:19:28     0
2.
Posting file /ngram_678910/file2.xml to http://localhost:8988/solr/update
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 62 2623M    0     0   62 1632M      0  19839 38:31:16 23:58:01 14:33:15 76629
3.
Posting file /ngram_678910/file3.xml to http://localhost:8988/solr/update
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 65 2667M    0     0   65 1737M      0  21113 36:48:23 23:58:06 12:50:17 25537
4.
Posting file /ngram_678910/file4.xml to http://localhost:8988/solr/update
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 58 2766M    0     0   58 1625M      0  19752 40:47:34 23:58:28 16:49:06 81435

Re: Index speed in Solr

Erick Erickson
Hard to say. Here's the basic approach I'd use to try to narrow it down:
1> Take out the n-grams. What does that do to your speed?
2> Are you committing very often? Lengthen the commit interval if so
     (see the autoCommit sketch after this list).
3> Posting is probably not the most performant thing in the world.
     Consider using SolrJ (see the sketch further down).
4> What do your documents look like? Are they structured docs
     (Word, PDF, etc.)? If so, try offloading that parsing to client machines.
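
For 2>, that's the autoCommit block in solrconfig.xml. A sketch with
illustrative values (nothing magic about these numbers; tune for your hardware):

<!-- solrconfig.xml sketch: commit rarely during a bulk load -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs> <!-- commit at most every 100k docs -->
    <maxTime>600000</maxTime> <!-- ...or every 10 minutes (milliseconds) -->
  </autoCommit>
</updateHandler>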

Basically, you haven't given enough information to make much
of a guess here...

50 hours is a really long time for 2M docs though, so something
doesn't seem right unless the docs are really unusual.

If you need to offload the structured docs, here's a way to
get started:

http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
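
Roughly, the SolrJ loop looks like this (an untested sketch; the field names
and readSequences() are placeholders for your actual data source):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Same core that post.sh was hitting. HttpSolrServer is SolrJ 3.6+;
        // earlier 3.x uses CommonsHttpSolrServer with the same constructor.
        SolrServer server = new HttpSolrServer("http://localhost:8988/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        long id = 0;
        for (String sequence : readSequences()) {  // placeholder data source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Long.toString(id++));
            doc.addField("sequence", sequence);    // the long, n-grammed field
            batch.add(doc);
            if (batch.size() == 1000) {            // send batches, not one doc per request
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) server.add(batch);
        server.commit();                           // one commit at the very end
    }

    // Placeholder: replace with code that streams documents from your real input.
    private static Iterable<String> readSequences() {
        return new ArrayList<String>();
    }
}

Running a few of these loops in parallel threads, sharing one SolrServer
instance (it's thread-safe), is the usual way to actually exercise Solr's
concurrent indexing.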

Best
Erick


Re: Index speed in Solr

David Smiley

On Apr 23, 2012, at 9:27 AM, Erick Erickson wrote:

> 50 hours is a really long time for 2M docs though, so something
> doesn't seem right unless the docs are really unusual.

Don't forget he's n-gramming ;-)  There's not much more demanding you could ask of text analysis, short of throwing shingling in there too for good measure[*]. With minGramSize=6 and maxGramSize=10, each position in a 300,000-byte token emits five grams, so a single document can expand to well over a million tokens.

Neosky, you should consider using Solr trunk, which has dramatic multithreaded indexing performance improvements if your hardware is capable.  If you try trunk, use a large ramBufferSizeMB (say 2 GB worth); if you stick with Solr 3.x, use 1 GB.  Finally, increasing your mergeFactor will increase indexing performance at the expense of search speed; you could run an optimize at the very end with maxSegments=10 or so to compensate.
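
For reference, in Solr 3.x both knobs live in solrconfig.xml (a sketch; the
values are the ones suggested above, not universal recommendations):

<!-- solrconfig.xml (Solr 3.x) sketch with the values suggested above -->
<indexDefaults>
  <ramBufferSizeMB>1024</ramBufferSizeMB>  <!-- buffer more in RAM before flushing segments -->
  <mergeFactor>20</mergeFactor>            <!-- faster indexing, slower searches until optimized -->
</indexDefaults>

The end-of-run optimize can then be issued with something like:

curl 'http://localhost:8988/solr/update?optimize=true&maxSegments=10'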

~ David Smiley
[*] that was a joke

Re: Index speed in Solr

neosky
Thanks for your suggestions. I will try them later and report back if possible.
For now, my approach is to remove some of the n-grams.
Thanks again!