index size

index size

Kevin Lewandowski
Are there any tips on reducing the index size or what factors most
impact index size?

My index has 2.7 million documents and is 200 gigabytes and growing.
Most documents are around 2-3kb and there are about 30 indexed fields.

thanks,
Kevin

Re: index size

Yonik Seeley-2
On 8/17/07, Kevin Lewandowski <[hidden email]> wrote:
> Are there any tips on reducing the index size or what factors most
> impact index size?
>
> My index has 2.7 million documents and is 200 gigabytes and growing.
> Most documents are around 2-3kb and there are about 30 indexed fields.

Wow, that's pretty big for the document count!
- make sure that you only store fields you need to retrieve... if you
only need to search on the fields, make them indexed-only.
- unique terms take up more space... if you have date or time fields,
try reducing the time resolution
- if any stored fields are very large, perhaps try compression
- application-specific compression... for example, if you have a lot
of URL values starting with the same thing, change "http://" to a
single unique character.

-Yonik
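
A minimal sketch of two of the tips above (coarser time resolution and
application-specific URL compression), applied before documents are sent
to the indexer. The prefix table, the placeholder characters, and the
function names are all assumptions made for illustration; nothing here
is a Solr API, and the placeholder characters are assumed never to occur
in the real data.

```python
from datetime import datetime

# Hypothetical prefix table: each common URL prefix is replaced by one
# character assumed never to appear in the raw values. Longer prefixes
# are listed first so they win over their shorter counterparts.
PREFIXES = {"http://www.": "\x01", "http://": "\x02"}


def shorten_url(url: str) -> str:
    """Replace a known prefix with its one-character placeholder."""
    for prefix, token in PREFIXES.items():
        if url.startswith(prefix):
            return token + url[len(prefix):]
    return url


def expand_url(value: str) -> str:
    """Undo shorten_url when showing the value to a user."""
    for prefix, token in PREFIXES.items():
        if value.startswith(token):
            return prefix + value[len(token):]
    return value


def round_to_day(ts: datetime) -> datetime:
    """Drop the time of day so all timestamps on one day share a term."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    short = shorten_url("http://www.example.com/page")
    print(len(short), expand_url(short))
    print(round_to_day(datetime(2007, 8, 17, 14, 3, 59)))
```

Rounding only helps if day-level precision is enough for the queries
being run; the same idea works for hour or minute resolution.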

Re: index size

hossman

: - make sure that you only store fields you need to retrieve... if you
: only need to search on the fields, make them indexed-only.

and omitNorms on any fields where you don't need length normalization or
field boosts (e.g. date fields, numeric fields, boolean flag fields,
etc...)

-Hoss
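
A rough back-of-the-envelope sketch of what norms cost for the index
described in this thread. It assumes one byte per document for every
indexed field that keeps norms, which is how Lucene stored norms at the
time; the function name is made up for illustration.

```python
# Figures from the original post.
NUM_DOCS = 2_700_000
NUM_INDEXED_FIELDS = 30

# Assumption: one norm byte per document per indexed field with norms.
BYTES_PER_NORM = 1


def norms_bytes(num_docs: int, fields_with_norms: int,
                bytes_per_norm: int = BYTES_PER_NORM) -> int:
    """Estimate the space taken by norms across the whole index."""
    return num_docs * fields_with_norms * bytes_per_norm


if __name__ == "__main__":
    total = norms_bytes(NUM_DOCS, NUM_INDEXED_FIELDS)
    print(f"norms for all fields: {total / 1e6:.0f} MB")
```

Under that assumption, omitting norms is worth doing but is unlikely on
its own to account for an index of 200 gigabytes.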


Re: index size

Mike Klaas
In reply to this post by Kevin Lewandowski

On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote:

> Are there any tips on reducing the index size or what factors most
> impact index size?
>
> My index has 2.7 million documents and is 200 gigabytes and growing.
> Most documents are around 2-3kb and there are about 30 indexed fields.

An "ls -sh" will tell you roughly where the space is being
occupied.  There is something strange going on: 2.5kB * 2.7m is only
about 6.75GB, and I have trouble imagining where the 30-fold index size
expansion is coming from.

-Mike
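
The arithmetic above, plus a rough Python stand-in for the "ls -sh"
check, can be sketched as follows. The function names are hypothetical,
and the per-extension grouping is only a convenience: it reports the
apparent file sizes, not the allocated blocks that "ls -s" shows.

```python
import os
from collections import defaultdict


def raw_size_gb(num_docs: int, avg_doc_kb: float) -> float:
    """Raw document volume in GB, using decimal units (1 kB = 1000 B)."""
    return num_docs * avg_doc_kb * 1000 / 1e9


def size_by_extension(index_dir: str) -> dict:
    """Sum file sizes in index_dir, grouped by file extension."""
    totals = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            # Files without an extension are reported under their own name.
            ext = os.path.splitext(entry.name)[1] or entry.name
            totals[ext] += entry.stat().st_size
    return dict(totals)


if __name__ == "__main__":
    raw = raw_size_gb(2_700_000, 2.5)
    print(f"raw documents: {raw:.2f} GB")
    print(f"expansion at 200 GB: {200 / raw:.1f}x")
    # The current directory is only a placeholder for the index directory.
    for ext, size in sorted(size_by_extension(".").items(),
                            key=lambda kv: kv[1], reverse=True):
        print(f"{ext:>12} {size / 1e6:10.1f} MB")
```

If most of the space turns out to be in the files holding stored field
data, that points at over-storing rather than over-indexing.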

Re: index size

Kevin Lewandowski
Late reply on this, but I just wanted to say thanks for the
suggestions. I went through my whole schema and found I was storing
things that didn't need to be stored and indexing a lot of things that
didn't need to be indexed. I just completed a full reindex and it's a
much more reasonable size now.

Kevin


Re: index size

Ravish Bhagdev
Hi All,

I'm facing a similar problem.  I want to index the entire document as a
field, but I also want to be able to retrieve snippets (like the ones
Google/Nutch show below the links on the results page).

To achieve this I have to set the document field to "stored", right?
When I do this my index becomes huge (10 GB), because I have 10K
docs and each is very lengthy HTML.  Is there any better solution?
Why is the index created by Nutch so small in comparison (about 27 MB),
even though it still returns snippets?

Ravish


Re: index size

Kevin Lewandowski
> To achieve this I have to keep the document field to "stored" right?

Yes, the field needs to be stored to return snippets.


> When I do this my index becomes huge 10 GB index, cause I have 10K
> docs but each is very lengthy HTML.  Is there any better solution?
> Why is index created by nutch so small in comparison (about 27 mb
> approx) but it still returns snippets!

Are you storing the complete HTML? If so, I think you should strip out
the HTML markup and then index the document.
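
One way to strip the markup before indexing, sketched with Python's
standard html.parser module. This is only an illustration of the idea,
not what Nutch or Solr does internally; the class and function names are
made up, and real-world HTML may need a more forgiving parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text content of html with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = ("<html><head><style>p {color: red}</style></head>"
            "<body><p>Hello <b>world</b> &amp; friends</p></body></html>")
    print(strip_html(page))
```

Storing only the stripped text keeps snippets possible while leaving the
tags, scripts, and styles out of the stored field.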



