[jira] [Resolved] (LUCENE-830) norms file can become unexpectedly enormous

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-830.

       Resolution: Fixed
    Fix Version/s: 4.0

As of 4.0, when norms are missing we drop norms for the entire field, whereas previously we invented a fake norm for documents that were missing the field or omitted norms for it.

Also, as of 4.0 you can write a custom norm provider and a custom similarity, so if you really want to, it is possible (in theory!) to have a sparse norms data structure.
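
To make the "sparse norms" idea concrete, here is a minimal sketch of what such a structure could look like. It is not Lucene code and does not use any Lucene API; the class name SparseNormsSketch and its methods are hypothetical. It assumes one such object per field, holding a norm only for documents that actually have the field, and returning a caller-chosen default for all others.

```java
import java.util.Arrays;

// Hypothetical illustration only: not part of Lucene and not tied to its norms API.
final class SparseNormsSketch {
    private final int[] docIds;      // sorted ascending: the docs that have this field
    private final byte[] norms;      // norms[i] is the norm for docIds[i]
    private final byte defaultNorm;  // returned for docs that do not have the field

    SparseNormsSketch(int[] sortedDocIds, byte[] norms, byte defaultNorm) {
        if (sortedDocIds.length != norms.length) {
            throw new IllegalArgumentException("docIds and norms must have the same length");
        }
        this.docIds = sortedDocIds.clone();
        this.norms = norms.clone();
        this.defaultNorm = defaultNorm;
    }

    // O(log n) lookup, versus a plain array index for dense norms.
    byte get(int docId) {
        int i = Arrays.binarySearch(docIds, docId);
        return i >= 0 ? norms[i] : defaultNorm;
    }

    // Number of documents for which a norm is actually stored.
    int storedEntries() {
        return docIds.length;
    }
}
```

Under these assumptions each stored entry costs about 5 bytes (a 4-byte doc id plus a 1-byte norm) instead of 1 byte per document in the segment, so this layout only saves space when the field appears in well under a fifth of the documents, and every lookup pays for a binary search.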

> norms file can become unexpectedly enormous
> -------------------------------------------
>                 Key: LUCENE-830
>                 URL: https://issues.apache.org/jira/browse/LUCENE-830
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
> Spinoff from this user thread:
>    http://www.gossamer-threads.com/lists/lucene/java-user/46754
> Norms are not stored sparsely, so even if a doc doesn't have field X
> we still use up 1 byte in the norms file (and in memory when that
> field is searched) for that segment.  I think this is done for
> performance at search time?
> For indexes that have a large number of documents where each document
> can have wildly varying fields, the norms for each segment will use
> (# documents) times (# fields seen in that segment) bytes.  When
> optimize merges all segments, both factors grow, so the norms file for
> the single merged segment will require far more storage than the sum
> of all previous segments' norms files.
> I think it's uncommon to have a huge number of distinct fields (?) so
> we would need a solution that doesn't hurt the more common case where
> most documents have the same fields.  Maybe something analogous to how
> bitvectors are now optionally stored sparsely?
> One simple workaround is to disable norms.
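
The growth described in the issue can be checked with a little arithmetic. The sketch below assumes the layout stated above (1 byte per document per field seen in the segment) and uses made-up numbers: 10 segments of 1,000 documents each, every segment seeing 100 fields that no other segment uses. The class and method names are hypothetical.

```java
// Hypothetical illustration of dense norms storage; the numbers are invented.
final class NormsSizeSketch {
    // Dense norms: one byte per document for every distinct field in the segment.
    static long denseNormsBytes(long numDocs, long numDistinctFields) {
        return numDocs * numDistinctFields;
    }

    public static void main(String[] args) {
        int segments = 10;
        long docsPerSegment = 1_000;
        long fieldsPerSegment = 100; // assumed disjoint across segments (worst case)

        // Before optimize: each segment only pays for its own docs and fields.
        long beforeMerge = segments * denseNormsBytes(docsPerSegment, fieldsPerSegment);

        // After optimize: one segment holds all docs and all distinct fields.
        long afterMerge = denseNormsBytes(segments * docsPerSegment,
                                          segments * fieldsPerSegment);

        System.out.println("before merge: " + beforeMerge + " bytes");
        System.out.println("after merge:  " + afterMerge + " bytes");
    }
}
```

With these assumed numbers the norms total 1,000,000 bytes before the merge and 10,000,000 bytes after it. In this worst case, where no field is shared between segments, the blow-up factor equals the number of segments merged; when most documents share the same fields, the field count barely grows and the merged size stays close to the sum.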
