How does wildcard search works under the hood?

Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

How does wildcard search works under the hood?

alexpusch
I understand the concept of an inverted index. Terms are mapped the the docs
containing them. But when the query contains a wildcard (*, ?) we cannot
simply use an index entry.

How does wildcard searches work under the hood? Is there some other index
helping Lucene to find all the relevant terms, and then the inverted index
is used with all of them?

I'm hoping to get a better understanding of this issue to be able to reason
poor performance of the application I work on

Thanks, Alex.



--
Sent from: http://lucene.472066.n3.nabble.com/Lucene-General-f642108.html
Reply | Threaded
Open this post in threaded view
|

Re: How does wildcard search works under the hood?

Paul Masurel
As you said the terms are mapped to a list of documents containing them.
To do so, they are stored in a finite state transducer. For simplification,
you can imagine it as a trie.

When you search for a term, Lucene does a lookup into the dictionary, gets
a position
at which it can read documents in another file.

When you search for a document containing a ?, Lucene can use the fst to
identify all of the terms
matching the pattern. Like for a trie, the fst is especially potent at a
pattern like someprefix* and not as great for patterns like *somesuffix as
it will have to go through the full trie.

Implementation-wise this is done by transforming the pattern into a
deterministic finite state automaton,
and finding the terms that match the pattern in the dictionary.

https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/core/src/java/org/apache/lucene/search/WildcardQuery.java

https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java


On Sat, Sep 9, 2017 at 8:21 AM, alexpusch <[hidden email]> wrote:

> I understand the concept of an inverted index. Terms are mapped the the
> docs
> containing them. But when the query contains a wildcard (*, ?) we cannot
> simply use an index entry.
>
> How does wildcard searches work under the hood? Is there some other index
> helping Lucene to find all the relevant terms, and then the inverted index
> is used with all of them?
>
> I'm hoping to get a better understanding of this issue to be able to reason
> poor performance of the application I work on
>
> Thanks, Alex.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-General-f642108.html
>
Reply | Threaded
Open this post in threaded view
|

Re: How does wildcard search works under the hood?

alexpusch
I see, seems that in my case the fst have to work particularly hard since the
number of unique terms is huge, and I use * in both parts of the query

Thanks for the detailed answer!



--
Sent from: http://lucene.472066.n3.nabble.com/Lucene-General-f642108.html
Reply | Threaded
Open this post in threaded view
|

Re: How does wildcard search works under the hood?

Khurram Shehzad
Hi,

Please someone tell me how to unsubscribe from


[hidden email]


I tried to email on [hidden email] several time but no use.


Regards,