DocSet: BitDocSet or HashDocSet ?

ttj
Hi all,

  In my code, I'd like to keep a subset of my 14M docs, around
100k docs in size.

 In your opinion, which is the best option in terms of speed and memory usage?

 My rough reasoning is that the BitDocSet should be the fastest for
lookup, but takes ~ 14M * sizeof(int) in memory, whereas
 the HashDocSet takes just ~ 100k * sizeof(int), but is a bit slower for lookup.

 The doc of HashDocSet says "It can be a better choice if there are few
docs in the set". What does 'few' mean in this context?

 Cheers !

 Jerome.


--
Jerome Eteve.

Chat with me live at http://www.eteve.net


Re: DocSet: BitDocSet or HashDocSet ?

Noble Paul നോബിള്‍  नोब्ळ्
BitDocSet does not take ~ 14M * sizeof(int) in memory.

It uses one bit per document, so it takes at most
14M/8 bytes in memory ~= 1.75MB
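
To make the arithmetic concrete, here is a rough sketch in plain Java. It does not use Solr's classes: the class and method names are made up for illustration, and the hash table sizing (4-byte int slots, power-of-two capacity, load factor 0.75) is an assumption rather than a statement about how HashDocSet is actually laid out.

```java
public class DocSetMemoryEstimate {
    // Bit set: one bit per document in the index, rounded up to whole 64-bit words.
    public static long bitSetBytes(long maxDoc) {
        long words = (maxDoc + 63) / 64;
        return words * 8;
    }

    // Hash set (assumed layout): one 4-byte int slot per table entry, with the
    // table rounded up to the next power of two >= size / loadFactor.
    public static long hashSetBytes(int size, double loadFactor) {
        long needed = (long) Math.ceil(size / loadFactor);
        long slots = Long.highestOneBit(needed);
        if (slots < needed) {
            slots <<= 1;
        }
        return slots * 4;
    }

    public static void main(String[] args) {
        System.out.println("bit set, 14M docs in index: " + bitSetBytes(14_000_000L) + " bytes");
        System.out.println("hash set, 100k docs in set: " + hashSetBytes(100_000, 0.75) + " bytes");
    }
}
```

Under these assumptions the two structures end up within a factor of two of each other for a 100k subset of 14M docs, so memory alone does not clearly favour the hash-based set at that size.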




--
--Noble Paul

Re: DocSet: BitDocSet or HashDocSet ?

hossman
In reply to this post by ttj

:  The doc of HashDocSet says "It can be a better choice if there are few
: docs in the set". What does 'few' mean in this context?

it's relative to the total size of your index.  if you have a million docs,
but you are dealing with DocSets that are only going to contain 10 docs,
then a HashDocSet is probably going to be better on both memory
requirements and lookup speed.

exactly where the sweet spot is as far as size and speed go is somewhat hard
to pin down.

if i recall correctly from the way yonik implemented OpenBitSet, the size
isn't purely a factor of set size either ... a BitDocSet containing a
thousand docs that are very "near" each other in the id space (ie: from a
uniqueKey:[x TO y] type filter, or even a date based filter where docs
are generally indexed chronologically) might be more compact and faster
than a HashDocSet of the same thousand docs -- but a thousand docs
scattered around the id space with lots of big gaps in the middle might
be much bigger than an equivalent HashDocSet.

it's one of those things you have to experiment with.
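
As a starting point for that kind of experiment, the sketch below puts the two lookup styles side by side in plain Java. It is only a stand-in: java.util.BitSet plays the role of the bit-based set, and IntHashSet is a hypothetical open-addressing table written for this example, not Solr's HashDocSet. Real measurements would need Solr's own classes and a realistic distribution of doc ids.

```java
import java.util.Arrays;
import java.util.BitSet;

public class DocSetLookupSketch {
    /** Minimal open-addressing set of non-negative ints; -1 marks an empty slot. */
    public static final class IntHashSet {
        private final int[] table;
        private final int mask;

        public IntHashSet(int[] ids) {
            // Power-of-two capacity of at least twice the number of ids,
            // so the table always keeps some empty slots.
            int capacity = 1;
            while (capacity < ids.length * 2) {
                capacity <<= 1;
            }
            table = new int[capacity];
            Arrays.fill(table, -1);
            mask = capacity - 1;
            for (int id : ids) {
                add(id);
            }
        }

        private void add(int id) {
            int slot = hash(id) & mask;
            // Linear probing: step forward until an empty slot or the same id is found.
            while (table[slot] != -1 && table[slot] != id) {
                slot = (slot + 1) & mask;
            }
            table[slot] = id;
        }

        public boolean contains(int id) {
            int slot = hash(id) & mask;
            while (table[slot] != -1) {
                if (table[slot] == id) {
                    return true;
                }
                slot = (slot + 1) & mask;
            }
            return false;
        }

        private static int hash(int id) {
            int h = id * 0x9E3779B9;
            return h ^ (h >>> 16);
        }
    }

    public static void main(String[] args) {
        int maxDoc = 14_000_000;
        int[] ids = {3, 70_000, 13_999_999};

        // Bit-based set: membership is a single indexed bit test.
        BitSet bits = new BitSet(maxDoc);
        for (int id : ids) {
            bits.set(id);
        }

        // Hash-based set: membership is a hash plus a short probe sequence.
        IntHashSet hashed = new IntHashSet(ids);

        for (int probe : new int[] {3, 4, 70_000}) {
            System.out.println(probe + ": bitset=" + bits.get(probe)
                    + " hash=" + hashed.contains(probe));
        }
    }
}
```

Timing the two contains paths over sets of different sizes and different id distributions (clustered versus scattered) is one way to look for the crossover point on a given index.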


-Hoss


Re: DocSet: BitDocSet or HashDocSet ?

Mike Klaas
In reply to this post by ttj

> The doc of HashDocSet says "It can be a better choice if there are few
> docs in the set". What does 'few' mean in this context?

Solr, by default, ships with a configuration that creates filters as
HashDocSet if the size of the set is < 3000, and as BitDocSet otherwise.
This parameter is tunable in solrconfig.xml.  You might find it helps
to increase this slightly with 14M docs, say to 5000-6000.  In my
testing, any higher than this is a net loss.
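
The size-based rule described above can be sketched as follows. The names DocSetChooser, Kind and hashMaxSize are hypothetical and only illustrate the decision; this is not Solr's code, and the strict "<" comparison simply follows the wording of this message.

```java
public class DocSetChooser {
    public enum Kind { HASH, BIT }

    // Use the hash-based set for small results, the bit-based set otherwise.
    // hashMaxSize stands for the tunable threshold (3000 by default per this thread).
    public static Kind choose(int setSize, int hashMaxSize) {
        return setSize < hashMaxSize ? Kind.HASH : Kind.BIT;
    }

    public static void main(String[] args) {
        // A 100k-doc subset is far above either threshold discussed here.
        System.out.println("100k docs, threshold 3000: " + choose(100_000, 3000));
        System.out.println("100k docs, threshold 6000: " + choose(100_000, 6000));
        System.out.println("10 docs, threshold 3000: " + choose(10, 3000));
    }
}
```

With either threshold, a 100k-doc subset lands on the bit-based side of the rule.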

-Mike