[jira] Created: (SOLR-2150) Anti-phrasing feature

[jira] Created: (SOLR-2150) Anti-phrasing feature

JIRA jira@apache.org
Anti-phrasing feature
---------------------

                 Key: SOLR-2150
                 URL: https://issues.apache.org/jira/browse/SOLR-2150
             Project: Solr
          Issue Type: New Feature
          Components: SearchComponents - other
            Reporter: Jan Høydahl


Add an anti-phrasing feature to Solr.

Definition: Identifying word sequences in queries that do not contribute essentially to the query's meaning, such as "Where can I find" or "Where is."
(Source: http://www.google.com/search?q=define%3Aanti+phrasing)

For general-purpose search services (web, intranet, or shopping search), some users phrase their query as a question, such as "how much is an ipod nano". One straightforward way to reduce the number of zero-hit queries in such environments is anti-phrasing: a dictionary of common sentence prefixes is consulted, and any matching prefix is stripped from the incoming query before it is passed on to search.

This can be implemented as a Search Component in Solr. The dictionary can be language independent. We can encourage users to submit their tested anti-phrasing dictionaries for various languages and include those. The dictionary can be a set of simple .txt files, loaded into memory at startup in an efficient data structure such as a b-tree or finite state automaton, to avoid redundancy and ensure quick matching. To detect an anti-phrase in the incoming query, first look up the full query phrase; if there is no match, remove a word from the end and look up again, repeating until there is a match or no words remain. Example for the query "Who is Einstein?", where "Who is" is defined as an anti-phrase:
1. Lookup "Who is Einstein"
2. Lookup "Who is" (match), remove this prefix
3. Issue the query "Einstein" to search
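
The lookup procedure above could be sketched roughly as follows. This is only an illustration in Python, not the proposed Solr Search Component: the function names (`load_anti_phrases`, `strip_anti_phrase`), the use of a plain set of token tuples instead of a b-tree or automaton, the case-insensitive matching, and the choice to leave a query untouched when it consists only of an anti-phrase are all assumptions made for the example.

```python
import re


def load_anti_phrases(lines):
    """Build a set of lowercase token tuples from dictionary lines
    (one anti-phrase per line, as in a simple .txt file)."""
    phrases = set()
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        if tokens:
            phrases.add(tuple(tokens))
    return phrases


def strip_anti_phrase(query, phrases):
    """Strip the longest matching anti-phrase prefix from the query."""
    words = re.findall(r"\w+", query)      # drops punctuation such as "?"
    keys = [w.lower() for w in words]      # assumption: case-insensitive match
    # Start with the full query, then remove one word from the end per lookup.
    for end in range(len(keys), 0, -1):
        if tuple(keys[:end]) in phrases:
            rest = words[end:]
            # Assumption: if nothing is left, keep the original query
            # rather than searching for an empty string.
            return " ".join(rest) if rest else query
    return query


if __name__ == "__main__":
    phrases = load_anti_phrases(["who is", "where can i find"])
    print(strip_anti_phrase("Who is Einstein?", phrases))
```

A real implementation would load the phrases once at startup and would probably use a trie or automaton so that each query is matched in a single pass instead of one lookup per candidate prefix.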

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Commented: (SOLR-2150) Anti-phrasing feature

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919934#action_12919934 ]

Hoss Man commented on SOLR-2150:
--------------------------------

One approach that might be worth considering is to generalize the problem beyond just anti-phrasing, and allow the dictionaries to contain optional mappings between the patterns and filter queries that should be applied to the query in place of the full phrase.

In the product space this could let people set up mappings like...

{noformat}
* Printers => {!field f=category}printers
{noformat}
so a request like...
{noformat}
q=HP+Printers
{noformat}
 would become equivalent to
{noformat}
q=HP&fq={!field f=category}printers
{noformat}

...following on from the "Who is" example: if the data set is a collection of people, then "Who is" could be mapped to nothing (so it's just stripped away, without a filter query being added), but if the data set is a general collection of information (e.g. Wikipedia), then "Who is" could be mapped to something like "doc_type:person".
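
A rough sketch of what such a mapping might look like, written in Python purely for illustration. The rule table, the `rewrite` function, and the decision to match a pattern anywhere in the query (rather than only as a prefix) are assumptions for this example; no such configuration format exists in Solr as far as this thread states.

```python
# Hypothetical rule table: pattern (lowercase tokens) -> filter query.
# A value of None means "strip the pattern and add no filter".
RULES = {
    ("printers",): "{!field f=category}printers",
    ("who", "is"): None,
}


def rewrite(query, rules):
    """Return (remaining query, list of filter queries) for a raw query."""
    words = query.split()
    keys = [w.lower() for w in words]
    kept, filters = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate pattern starting at position i first.
        for end in range(len(words), i, -1):
            key = tuple(keys[i:end])
            if key in rules:
                if rules[key] is not None:
                    filters.append(rules[key])
                i = end  # skip over the matched pattern
                break
        else:
            # No pattern starts here; keep the word in the main query.
            kept.append(words[i])
            i += 1
    return " ".join(kept), filters


if __name__ == "__main__":
    q, fq = rewrite("HP Printers", RULES)
    print("q =", q)
    print("fq =", fq)
```

With these rules, "HP Printers" yields the main query "HP" plus one filter query, while "Who is Einstein" yields "Einstein" with no filter, which corresponds to the "mapped to nothing" case.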


[jira] Commented: (SOLR-2150) Anti-phrasing feature

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919942#action_12919942 ]

Jan Høydahl commented on SOLR-2150:
-----------------------------------

What you describe is also a useful feature. I think of it as even more generic: a place to configure detection of various patterns and apply some action to the query based on the match, whether that is fetching a weather forecast from an API, performing a calculation, or rewriting the query to apply a filter. I think it deserves its own feature request; one could then decide later, in the design phase, whether the same code base could power parts of both.
