[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510092#comment-16510092 ]

David Smiley commented on LUCENE-8041:

bq. I do wonder if an intermediate step would be to have a map + a Iterable<String> so we don't need to do this sorting over and over again?

Absolutely; there's no need to re-sort.  I'm working with [~[hidden email]] who came up with this tidy solution.  In BlockTreeTermsReader, the "fields" field becomes a HashMap.  Then, in the constructor after it's populated, there's a few lines to build up the list:
ArrayList<String> fieldsNames = new ArrayList<String>(fields.keySet());
fieldsNamesIterable = Collections.unmodifiableCollection(fieldsNames);
{{fieldsNamesIterable}} is a new declared field.
Very similar logic goes in PerFieldsPostingsFormat.

I think it's possible to avoid the sort() in BlockTreeTermsReader if you know you're reading data written pre-sorted.

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> --------------------------------------------------------
>                 Key: LUCENE-8041
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8041
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8041.patch
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) of TreeMap definitely shows up in a profiler; sometimes 20% of search time, if I recall.  There are many Field implementations that are impacted... in part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be TermsProducer and not have an iterator of fields. If DocValuesProducer doesn't have this then why should the terms index part of our API have it?  If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap in many cases if we can assume when we initialize these internal maps that we consume them in sorted order to begin with.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]