which one will save hard disk space?

nick19701
 <field name="signature" type="string" indexed="false" stored="true" compressed="true"/>
 <field name="signature" type="string" indexed="true" stored="true" compressed="true"/>

I don't need to search the "signature" field. But my intuition tells me that
if I index this field, I will use less hard disk space since a lot of docs may have the same signature.

Am I right?

Re: which one will save hard disk space?

Mike Klaas
On 3/26/07, nick19701 <[hidden email]> wrote:

>
>  <field name="signature" type="string" indexed="false" stored="true"
> compressed="true"/>
>  <field name="signature" type="string" indexed="true" stored="true"
> compressed="true"/>
>
> I don't need to search the "signature" field. But my intuition tells me that
> if I index this field, I will use less hard disk space since a lot of docs
> may have the same signature.
>
> Am I right?

Storing and indexing are completely disjoint: indexing is a lossy
operation, so if you want to be able to retrieve the original contents,
they must be stored separately (i.e., the first option uses the least
space).

-Mike
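
[Editorial illustration] The point above can be sketched with a toy model in which stored values and the inverted index are two separate structures. This is not Lucene's actual on-disk format; `build_field` and `rough_size` are hypothetical helpers, the size measure is a crude character count, and `compressed="true"` is ignored.

```python
from collections import defaultdict


def build_field(values, indexed, stored):
    """Toy model of one field; not Lucene's real storage layout."""
    stored_values = {}            # doc id -> original value (only if stored)
    postings = defaultdict(list)  # term -> doc ids (only if indexed)
    for doc_id, value in enumerate(values):
        if stored:
            stored_values[doc_id] = value
        if indexed:
            postings[value].append(doc_id)
    return stored_values, dict(postings)


def rough_size(stored_values, postings):
    """Characters kept, plus one unit per posting entry (a crude estimate)."""
    stored_chars = sum(len(v) for v in stored_values.values())
    term_chars = sum(len(t) for t in postings)
    posting_entries = sum(len(ids) for ids in postings.values())
    return stored_chars + term_chars + posting_entries


# Two documents share one signature, a third has a different one.
signatures = ["sig-A" * 20, "sig-A" * 20, "sig-B" * 20]
option_1 = build_field(signatures, indexed=False, stored=True)
option_2 = build_field(signatures, indexed=True, stored=True)
```

In this model the stored copy is the same size in both options; indexing only adds the term dictionary and postings on top of it, which is why the first option comes out smaller.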

Re: which one will save hard disk space?

nick19701
Mike Klaas wrote:
> Storing and indexing are completely disjoint: indexing is a lossy
> operation, so if you want to be able to retrieve the original contents,
> they must be stored separately (i.e., the first option uses the least
> space).
>
> -Mike

But here the "signature" field has field type "string". When you index it,
you put the whole string somewhere and give it an id, for example, 323454.

In a doc, you only need to reference this id 323454 if the doc happens to contain
the same signature value.

Now suppose I have a lot of docs with same signature and signature
is a very long string. It seems to me indexing the signature will save me
hard disk space.

In short, what I mean is that if you index a "string" field, you can retrieve it
without loss. So you don't need to store it separately. What do you think?

Re: which one will save hard disk space?

Mike Klaas
On 3/26/07, nick19701 <[hidden email]> wrote:

> But here the "signature" field has field type "string". When you index it,
> you put the whole string somewhere and give it an id, for example, 323454.
>
> In a doc, you only need to reference this id 323454 if the doc happens to
> contain the same signature value.
>
> Now suppose I have a lot of docs with same signature and signature
> is a very long string. It seems to me indexing the signature will save me
> hard disk space.
>
> In short, what I mean is that if you index a "string" field, you can
> retrieve it without loss. So you don't need to store it separately. What do
> you think?

In theory that might be true, but Lucene is not implemented that way,
I'm afraid.  If you know in advance that this is your situation, it is
probably easier to implement this outside of Lucene and "store" the id
in your external index.

-Mike
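
[Editorial illustration] One way to read the suggestion above is to keep each distinct signature once in a table outside Lucene and put only a small id on each document. A minimal sketch, assuming an in-memory table; `SignatureTable`, `intern`, `lookup`, and the `signature_id` field name are all hypothetical, and a real setup would need the table persisted somewhere.

```python
class SignatureTable:
    """Hypothetical external store: each distinct signature is kept once."""

    def __init__(self):
        self._id_by_signature = {}   # signature text -> id
        self._signature_by_id = []   # id -> signature text

    def intern(self, signature):
        """Return the id for a signature, assigning a new id if unseen."""
        sig_id = self._id_by_signature.get(signature)
        if sig_id is None:
            sig_id = len(self._signature_by_id)
            self._id_by_signature[signature] = sig_id
            self._signature_by_id.append(signature)
        return sig_id

    def lookup(self, sig_id):
        """Return the original signature text for an id."""
        return self._signature_by_id[sig_id]


table = SignatureTable()
long_signature = "x" * 1000

# Each document carries only the small id, not the long string.
docs = [{"id": n, "signature_id": table.intern(long_signature)} for n in range(3)]
```

The long string is held once no matter how many documents share it, and `table.lookup(doc["signature_id"])` recovers it at display time.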

Re: which one will save hard disk space?

Chris Hostetter-3
In reply to this post by nick19701

: Now suppose I have a lot of docs with same signature and signature
: is a very long string. It seems to me indexing the signature will save me
: hard disk space.

That's true, and if you were using Lucene directly you could do this and
then use the StringIndex FieldCache to look up the value for each doc, but
Solr doesn't have any special optimization like that at the moment.
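
[Editorial illustration] The idea behind a StringIndex-style cache can be sketched as two arrays: the distinct terms, and one small integer per document pointing into them. This toy version builds both from a plain list of values rather than from an actual index, assumes one untokenized value per document, and is not Lucene's API; `build_string_index` and `value_for_doc` are hypothetical names.

```python
def build_string_index(doc_values):
    """Build (lookup, order) for a single-valued, untokenized field.

    lookup: sorted list of the distinct terms
    order:  order[doc_id] is the position in lookup of that doc's term
    """
    lookup = sorted(set(doc_values))
    position = {term: i for i, term in enumerate(lookup)}
    order = [position[value] for value in doc_values]
    return lookup, order


def value_for_doc(lookup, order, doc_id):
    """Recover a document's original value from the two arrays."""
    return lookup[order[doc_id]]


lookup, order = build_string_index(["sig-B", "sig-A", "sig-B"])
```

Each distinct string is held once in `lookup`, so many documents sharing a long signature cost one integer each.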

If you don't store it, none of the standard request handlers will retrieve
it when generating results, but you could write a custom request handler
to do that if you wished (it could even be done fairly programmatically: look
for any fields with type "string" which are indexed but not stored and
return them).
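
[Editorial illustration] The field-selection step described above can be sketched against a schema fragment. This only picks out the candidate fields; it is not a Solr request handler, and it assumes every field spells out `indexed` and `stored` explicitly rather than inheriting defaults from its field type. The fragment and the function name `indexed_only_string_fields` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, in the style of the fields quoted in this thread.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="signature" type="string" indexed="true" stored="false"/>
  <field name="body" type="text" indexed="true" stored="false"/>
</fields>
"""


def indexed_only_string_fields(schema_xml):
    """Names of fields with type "string" that are indexed but not stored."""
    root = ET.fromstring(schema_xml)
    return [
        field.get("name")
        for field in root.iter("field")
        if field.get("type") == "string"
        and field.get("indexed") == "true"
        and field.get("stored") == "false"
    ]


candidates = indexed_only_string_fields(SCHEMA_FRAGMENT)
```

Here only the `signature` field qualifies: `id` is stored, and `body` is not of type "string".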



-Hoss