How many <doc></doc> in the XML source file before indexing?


Bruno Mannina
Hi All,

Just a quick question about the maximum number of <doc> elements, as in

<add>
<doc></doc>
</add>

that I can put in a single XML source file before indexing: only one, 10,
100, 1000, unlimited...?

I have to index 80M docs, so I can't create one XML file per doc.

thanks,
Bruno





Re: How many <doc></doc> in the XML source file before indexing?

Paul Libbrecht-4
Bruno,
see solrconfig.xml; it has all sorts of settings for this kind of thing.

paul


In reply to Bruno Mannina's message of 24 May 2012, 09:49.


Re: How many <doc></doc> in the XML source file before indexing?

Bruno Mannina
Sorry, I just found this: http://wiki.apache.org/solr/UpdateXmlMessages

I will also look there for the maximum number of <doc></doc>.

In reply to Paul Libbrecht's message of 24 May 2012, 09:51.


Re: How many <doc></doc> in the XML source file before indexing?

Bruno Mannina
I can't find an answer about the maximum number of <doc></doc>.

Can someone tell me if there is no limit?

In reply to Bruno Mannina's message of 24 May 2012, 09:55.


Re: How many <doc></doc> in the XML source file before indexing?

Michael Kuhlmann-4
There is no hard limit on the maximum number of documents per update.

It only depends on memory. The smaller each document, and the more
memory Solr can acquire, the more documents you can send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, then you won't gain much if you send 200 documents
instead, or even 1000. The number of requests doesn't count that much.

And, if the update fails for some reason, then the whole request will be
ignored. If you had sent 1000 documents in an update, and one of them
had a field missing, for example, then it's hard to find out which one.
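
As a rough illustration of the batching idea, here is a minimal Python sketch that groups documents into <add> messages of a fixed size. The helper names (batches, to_add_xml), the field names, and the batch size of 100 are assumptions made up for this example; only the <add>/<doc>/<field name="..."> layout comes from the update XML format.

```python
from itertools import islice
import xml.etree.ElementTree as ET


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_add_xml(batch):
    """Serialize one batch of dicts as a single <add> message."""
    add = ET.Element("add")
    for doc in batch:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical documents; real ones would come from the source files.
docs = ({"id": i, "title": f"doc {i}"} for i in range(250))
messages = [to_add_xml(b) for b in batches(docs, 100)]
```

Each message could then be sent as its own update, so a bad document would only affect one small batch instead of a whole file.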

Greetings,
Michael

In reply to Bruno Mannina's message of 24 May 2012, 10:58.


Re: How many <doc></doc> in the XML source file before indexing?

Michael Kuhlmann-4
"pish it too jard" - sounds funny. :)

I meant "push it too hard".

In reply to Michael Kuhlmann's message of 24 May 2012, 11:46.


Re: How many <doc></doc> in the XML source file before indexing?

Bruno Mannina
In reply to this post by Michael Kuhlmann-4
In fact it's not for an update but only for the initial indexing.

I mean, I will receive the full database, around 80M docs, in a few
XML files (one per country in the world).
From these 80M docs I will generate the right XML format for each doc. (I
don't need all the fields from the source.)

Currently, for my test (12,000 docs), I generate one file per doc, and
there is no problem.
But with 80M docs I can't generate one file per doc.

That is why I asked about the maximum number of <doc> elements in one <add> file.

For the initial load, if a country file fails, no problem: I will check it
and regenerate it.

Is it bad to create a file with 5M <doc> elements?




Re: How many <doc></doc> in the XML source file before indexing?

Michael Kuhlmann-4
Just try it!

Maybe you're lucky, and it works with 80M docs. If each document takes
100 bytes, then it only needs 8 GB of memory for indexing.
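
A quick back-of-the-envelope check of that estimate, as a sketch: 8 GB for 80M documents corresponds to about 100 bytes per document, while 100 KB per document would come to about 8 TB. Uniform document size and decimal units (1 GB = 10^9 bytes) are assumptions here, and the function name is made up for the example.

```python
DOCS = 80_000_000  # 80M documents, the figure from the thread


def total_bytes(bytes_per_doc):
    """Total raw size if every document has the same size."""
    return DOCS * bytes_per_doc


gb_at_100_bytes = total_bytes(100) / 10**9       # decimal gigabytes
tb_at_100_kb = total_bytes(100 * 1000) / 10**12  # decimal terabytes
print(gb_at_100_bytes, tb_at_100_kb)
```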

However, I doubt it. I haven't looked too deeply into the UpdateHandler
yet, but I think it first needs to parse the complete XML file before it
starts to index.

But the worst thing that can happen is an OOM exception. And if you
need to split the XML files, then you can split them into smaller chunks as well.

Just a note: in Solr, you're always updating, even on the first
indexing run. There's no difference between updates and inserts.

Greetings,
Michael

In reply to Bruno Mannina's message of 24 May 2012, 12:37.


Re: How many <doc></doc> in the XML source file before indexing?

Bruno Mannina
Hmm... OK, I will run the test as soon as I receive the database.

Thanks a lot!

In reply to Michael Kuhlmann's message of 24 May 2012, 13:29.


Re: How many <doc></doc> in the XML source file before indexing?

Yonik Seeley-2-2
In reply to this post by Michael Kuhlmann-4
On Thu, May 24, 2012 at 7:29 AM, Michael Kuhlmann wrote:
> However, I doubt it. I've not been too deeply into the UpdateHandler yet,
> but I think it first needs to parse the complete XML file before it starts
> to index.

Solr's update handlers all stream (XML, JSON, CSV), reading and
indexing a document at a time from the input.
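
To make the streaming idea concrete, here is a small Python sketch that reads an <add> file one <doc> at a time instead of loading it whole. This is not Solr's own code, only a client-side illustration of the same principle; the function name iter_docs and the sample input are assumptions for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_docs(stream):
    """Yield one {field name: text} dict per <doc> as it is parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "doc":
            # Note: a repeated field name would overwrite the earlier value.
            yield {f.get("name"): f.text for f in elem.findall("field")}
            elem.clear()  # drop the parsed fields to keep memory use low


# Tiny in-memory sample; a real run would pass an open file instead.
sample = io.BytesIO(
    b"<add>"
    b"<doc><field name='id'>1</field></doc>"
    b"<doc><field name='id'>2</field></doc>"
    b"</add>"
)
ids = [doc["id"] for doc in iter_docs(sample)]
```

Because each document is handled and released as soon as its closing tag is seen, the size of the file matters much less than the size of a single document.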

-Yonik
http://lucidimagination.com