When search term has two stopwords ('and' and 'a') together, it doesn't work

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri
OR

 <!--this is the default search handler with highlighting and grouped results -->
    <requestHandler name="/search" class="solr.SearchHandler">

        <lst name="defaults">
            <str name="q.op">OR</str>
            <str name="echoParams">explicit</str>
            <str name="defType">edismax</str>
            <str name="q.alt">*:*</str>
            <str name="df">name</str>
            <str name="qf">
                     .......
           </str>


> On 8 Nov 2019, at 16:43, David Hastings <[hidden email]> wrote:
>
> is your default operator OR?
> change it to AND
>
>
> On Fri, Nov 8, 2019 at 11:30 AM Guilherme Viteri <[hidden email]> wrote:
>
>> Hi Walter and Paras
>>
>> I reindexed after removing all references to StopFilterFactory and went from
>> 121 results to nearly 20K, because the search term q="Lymphoid and a non-Lymphoid
>> cell" now matches entities such as "IFT A" or "Lamin A". So I don't think
>> removing it completely is the way to go in our scenario, but I
>> appreciate the suggestion...
>>
>> Yes, the response is using fl=*
>> I am trying some combinations at the moment, but no success yet.
>>
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, although the results are reasonably meaningful.
>>
>> I am sorry, but I didn't understand what exactly you want me to do with
>> the lst (??) and qf and bf.
>>
>> Thanks, everyone, for your input
>>
>>
>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]> wrote:
>>>
>>> Hi Guilherme
>>>
>>> "By accident, I ended up querying using the default handler (/select)
>>> and it worked."
>>>
>>> You've just found the culprit. Thanks for providing the material I
>>> requested. Your analysis chain is working as expected. I don't see any
>>> issue in either the stop filter or your boosts. I also use a boost of 50
>>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>>> iphone), but I take Walter's suggestion and will try to optimize my
>>> weights. I agree that we did not research this value of 50 much either
>>> (we never faced performance or relevance issues).
>>>
>>> The major difference between the two handlers is edismax. I'm pretty sure
>>> that your problem lies in the parsing of queries (you can confirm that from
>>> the parsedquery key in the debug section of both JSON responses). I hope you
>>> provided the response with fl=*. Replace q with q.alt in your /search handler
>>> query and I think you should start getting responses. That's because q.alt uses
>>> the standard parser. If you want to keep using edismax, I suggest you test
>>> the responses while removing some combination of the lst entries (qf, bf) to
>>> find what is preventing the documents from coming up. I'm out of office today;
>>> otherwise I would have analyzed the field values of the document from the
>>> /select request and compared them with qf/bq in the /search handler in
>>> solrconfig.xml. Do this and you will most likely find something.
>>>
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email]> wrote:
>>> I normally use a weight of 8 for the most important field, like title.
>>> Other fields might get a 4 or 2.
>>>
>>> I add “pf” (phrase fields) with the weights doubled, so that phrase matches
>>> have a higher weight.
>>>
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two early
>>> web search engines. With different relevance algorithms and totally
>>> different evaluation and tuning systems, they settled on weights of 8 and
>>> 7.5 for HTML titles. With two radically different systems arriving at
>>> nearly the same number, I decided that it was a property of the documents,
>>> not of the search engines.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]> wrote:
>>>>
>>>> Hi Wunder,
>>>>
>>>> My indexer takes quite a few hours to run. I am shortening it to
>>>> run faster, but I also need to make sure it gives what we are expecting.
>>>> This implementation has been in place for more than 4 years and is heavily used.
>>>>
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>>>>> of configuring Solr.
>>>> I inherited that implementation and I am really keen to improve it.
>>>> What would you recommend?
>>>>
>>>> Cheers
>>>> Guilherme
>>>>
>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]> wrote:
>>>>>
>>>>> Thanks for posting the files. Looking at schema.xml, I see that you
>>>>> are still using StopFilterFactory. The first advice we gave you was to
>>>>> remove that.
>>>>>
>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>
>>>>> You will continue to have problems matching stopwords until you do
>>>>> that.
>>>>>
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>>>>> of configuring Solr.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> [hidden email]
>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>
>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]> wrote:
>>>>>>
>>>>>> Hi Paras, everyone
>>>>>>
>>>>>> Thank you again for your input and suggestions. I am sorry to hear you
>>>>>> had trouble with the attachments. I will host them somewhere and share the
>>>>>> links.
>>>>>> I don't tweak my index. I get the data from the graph database,
>>>>>> create the documents as they are, and save them to Solr.
>>>>>>
>>>>>> So, I am sending the new analysis screen, queried the way you
>>>>>> suggested, along with the results with params and the Solr query URL.
>>>>>>
>>>>>> While running the queries you asked for, I found something
>>>>>> really weird (at least to me). By accident, I ended up querying using
>>>>>> the default handler (/select) and it worked. If I use the one I must
>>>>>> use (/search), it sadly doesn't work. I am posting both results and I will
>>>>>> also post the handlers.
>>>>>>
>>>>>> Here is the link with all the files mentioned above:
>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>> If the link doesn't work: www dot dropbox dot com slash sh slash
>>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <[hidden email]> wrote:
>>>>>>>
>>>>>>> Hi Guilherme.
>>>>>>>
>>>>>>> "I am sending the analysis result and the JSON result as requested."
>>>>>>>
>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>>>>>>> quality, though).
>>>>>>>
>>>>>>> From the analysis screen, the analysis is working as expected. One
>>>>>>> reason I can initially think of for query="lymphoid and *a* non-lymphoid
>>>>>>> cell" not matching a document containing "Lymphoid and a non-Lymphoid cell"
>>>>>>> is that the stopword "a" is probably still present after analysis in either
>>>>>>> the query or the index. Did you change your index-time analysis after
>>>>>>> indexing?
>>>>>>>
>>>>>>> Do two things:
>>>>>>>
>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>>>>> providing the link here.
>>>>>>> 2. Give the same JSON output as you sent before, but this time with
>>>>>>> *"echoParams=all"*. Also, post the exact Solr query URL.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <[hidden email]> wrote:
>>>>>>>
>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some
>>>>>>>> such. The Apache server is fairly aggressive about stripping attachments
>>>>>>>> though, so it’s also possible they didn’t make it through.
>>>>>>>>
>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Erick.
>>>>>>>>>
>>>>>>>>>> First, your index and query analysis chains are considerably different;
>>>>>>>>>> this can easily be a source of problems. In particular, using two different
>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>>>>>>>> of the length filter is suspicious, especially since your problem statement
>>>>>>>>>> is about the addition of a single-letter term and the min length allowed on
>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>>>>>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>>>>>>>> interactions.
>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>
>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>>>>> Used by custom code?
>>>>>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>>>>>>
>>>>>>>>>> What does “species=” do? That’s not Solr syntax, so it’s likely that
>>>>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>>>>> typo.
>>>>>>>>> This is part of the application. Species will be used later on in Solr
>>>>>>>>> to filter the results. That's not Solr; those are my app's params.
>>>>>>>>>
>>>>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>>>>>>>> that part.
>>>>>>>>> The two JSON files I sent were produced with debugQuery=on, and the
>>>>>>>>> explain tag is present.
>>>>>>>>> I will try searching the way you mentioned.
>>>>>>>>>
>>>>>>>>> Thanks for your input
>>>>>>>>>
>>>>>>>>> Guilherme
>>>>>>>>>
>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> Fwd to another server
>>>>>>>>>>
>>>>>>>>>> First, your index and query analysis chains are considerably different;
>>>>>>>>>> this can easily be a source of problems. In particular, using two different
>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>>>>>>>> of the length filter is suspicious, especially since your problem statement
>>>>>>>>>> is about the addition of a single-letter term and the min length allowed on
>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>>>>>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>>>>>>>> interactions.
>>>>>>>>>>
>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>>>>> Used by custom code?
>>>>>>>>>>
>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>
>>>>>>>>>> What does “species=” do? That’s not Solr syntax, so it’s likely that
>>>>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>>>>> typo.
>>>>>>>>>>
>>>>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>>>>>>>> that part.
>>>>>>>>>>
>>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>>>>>>>> expect” is answered by looking at the “&debug=query” output and the
>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>>>>>>>>> at _both_ the query and index output. Also, a very important (and confusing)
>>>>>>>>>> point about the analysis page is that it _assumes_ that what you
>>>>>>>>>> put in the text boxes has made it through the query parser intact and is
>>>>>>>>>> analyzed by the field selected. Consider the search “q=field:word1 word2”.
>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>>>>>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>>>>>>>> “field:word1 default_search_field:word2”. This is where “&debug=query”
>>>>>>>>>> helps.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Erick
>>>>>>>>>>
>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>
>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>
>>>>>>>>>>> I think the OP's concern is getting different results when adding a
>>>>>>>>>>> stopword. I think he's using the filter factory correctly: the query chain
>>>>>>>>>>> includes the filter as well, so it should remove "a" while querying.
>>>>>>>>>>>
>>>>>>>>>>> *@Guilherme*, please post the results for both queries, the document in
>>>>>>>>>>> the results that you are concerned about, and the full output of the
>>>>>>>>>>> analysis screen (for both query and index).
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No.
>>>>>>>>>>>>
>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>>>>>>>>>> config.
>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>
>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>
>>>>>>>>>>>> wunder
>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>> [hidden email]
>>>>>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <[hidden email]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK. I am kind of lost now.
>>>>>>>>>>>>> If I open up the console > Analysis and run it, this is the final
>>>>>>>>>>>>> result.
>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in
>>>>>>>>>>>>> schema.xml and, during the index phase, replaceAll("in stopwords.txt"," "),
>>>>>>>>>>>>> then add to Solr. Is that correct?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No,
>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> is still using stopwords and should be removed. That is my opinion, of
>>>>>>>>>>>>>> course, and your use case may be different, but I generally axe any
>>>>>>>>>>>>>> reference to them at all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> Haven't I done this here?
>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>       <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <[hidden email]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to stop words
>>>>>>>>>>>>>>>> and never use them, then re-index your data and try it again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <[hidden email]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field). However, the
>>>>>>>>>>>>>>>>> search term contains 'and' and 'a', and it doesn't return any records.
>>>>>>>>>>>>>>>>> If I remove 'a', then it works.
>>>>>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>>>>>> Search term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>>>>       <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Running Solr 6.6.2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>> --
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>
>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>
>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Erick Erickson
In reply to this post by Guilherme Viteri
Look at the “mm” parameter and try setting it to 100%. That’s not entirely likely to do what you want either, since virtually every doc will have “a” in it, but at least you’d get docs that have both terms.
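
As a rough sketch of the mm suggestion, the parameter could be added to the defaults of the /search handler quoted earlier in the thread. Only the mm line is new here; the qf list is elided in the original post, so it and the other defaults are assumed to stay as they are.

```xml
<requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="q.op">OR</str>
        <!-- mm (minimum should match): 100% requires every optional clause
             that survives analysis to match -->
        <str name="mm">100%</str>
        <!-- remaining defaults (echoParams, q.alt, df, qf, ...) unchanged -->
    </lst>
</requestHandler>
```

For a quick test without touching solrconfig.xml, the same value can be passed on a single request by appending mm=100%25 (the URL-encoded form of 100%) to the query URL, and the effect checked in the parsed query with debug=query.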

You may also be able to search for things like “Lamin A” _only as a phrase_ and have some luck. But this is a gnarly problem in general. Some people have been able to substitute synonyms and/or shingles to make this work at the expense of a larger index.
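
To make the shingle idea concrete, here is a minimal sketch of a field type that indexes word pairs alongside single words. The field type name text_shingle and the shingle sizes are hypothetical, not something taken from the schema in this thread; the name field would also need to be copied into a field of this type.

```xml
<!-- Hypothetical field type: the name and settings are illustrative only -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Emits word pairs such as "lamin a" in addition to the single words -->
        <filter class="solr.ShingleFilterFactory" minShingleSize="2"
                maxShingleSize="2" outputUnigrams="true"/>
    </analyzer>
</fieldType>
```

With this, “lamin a” becomes a single indexed token, which is where the larger index comes from. Query-time shingles only form when the parser hands more than one word to the analyzer at once, for example inside a quoted phrase, so this is worth verifying on the analysis page and with debug=query.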

This is a generic problem with context. “Lamin A” is really a “concept”, not just two words that happen to be near each other. Searching as a phrase is an OOB-but-naive way to try to make it more likely that the ranked results refer to the _concept_ of “Lamin A”. The assumption here is “if these two words appear next to each other, they’re more likely to be what I want”. I say “naive” because “Lamins: A new approach to...” would _also_ be found for a naive phrase search. (I have no idea whether such a title makes sense or not, but you figured that out already)...
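
One way to lean on phrase matching without making it a hard requirement is edismax's pf parameter, which only boosts documents where the query terms also match as a phrase. The sketch below uses the name field from this thread and the weights Walter mentioned (8, doubled for pf); the real qf list is elided in the original post, so these values are assumptions.

```xml
<lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: fields searched term by term (weight is illustrative) -->
    <str name="qf">name^8</str>
    <!-- pf: extra boost when the whole query also matches as a phrase -->
    <str name="pf">name^16</str>
    <!-- ps: phrase slop for pf; 0 requires the terms to be adjacent -->
    <str name="ps">0</str>
</lst>
```

Because pf affects ranking only, documents such as “Lamins: A new approach to...” would still match; they would simply tend to rank below exact phrase matches.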

To do this well you’d have to dive into NLP/machine learning.

I truly wish we could have the DWIM search algorithm (Do What I Mean)….

> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]> wrote:
>
> Hi Walter and Paras
>
> I reindexed after removing all references to StopFilterFactory and went from 121 results to nearly 20K, because the search term q="Lymphoid and a non-Lymphoid cell" now matches entities such as "IFT A" or "Lamin A". So I don't think removing it completely is the way to go in our scenario, but I appreciate the suggestion…
>
> Yes, the response is using fl=*
> I am trying some combinations at the moment, but no success yet.
>
> defType=edismax
> q.alt=Lymphoid and a non-Lymphoid cell
> Number of results=1599
> Quite a considerable increase, although the results are reasonably meaningful.
>
> I am sorry, but I didn't understand what exactly you want me to do with the lst (??) and qf and bf.
>
> Thanks, everyone, for your input
>
>
>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]> wrote:
>>
>> Hi Guilherme
>>
>> By accident, I ended up querying the using the default handler (/select) and it worked.
>>
>> You've just found the culprit. Thanks for giving the material I requested. Your analysis chain is working as expected. I don't see any issue in either StopWordFilter or your boosts. I also use a boost of 50 when boosting contextual suggestions (boosting "gold iphone" on a page of iphone) but I take Walter's suggestion and would try to optimize my weights. I agree that this 50 thing was not researched much about by us as well (we never faced performance or relevance issues).  
>>
>> See the major difference in both the handlers - edismax. I'm pretty sure that your problem lies in the parsing of queries (you can confirm that from parsedquery key in debug of both JSON responses). I hope you have provided the response with fl=*. Replace q with q.alt in your /search handler query and I think you should start getting responses. That's because q.alt uses standard parser. If you want to keep using edisMax, I suggest you to test the responses removing some combination of lst (qf, bf) and find what's restricting the documents to come up. I'm out of office today - would have certainly tried analyzing the field values of the document in /select request and compare it with qf/bq in solrconfig.xml /search. Do this for me and you'd certainly find something.  
>>
>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email] <mailto:[hidden email]>> wrote:
>> I normally use a weight of 8 for the most important field, like title. Other fields might get a 4 or 2.
>>
>> I add a “pf” field with the weights doubled, so that phrase matches have a higher weight.
>>
>> The weight of 8 comes from experience at Infoseek and Inktomi, two early web search engines. With different relevance algorithms and totally different evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML titles. With the the two radically different system getting the same number, I decided that was a property of the documents, not of the search engines.
>>
>> wunder
>> Walter Underwood
>> [hidden email] <mailto:[hidden email]>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>
>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>> wrote:
>>>
>>> Hi Wunder,
>>>
>>> My indexer takes quite a few hours to be executed I am shortening it to run faster, but I also need to make sure it gives what we are expecting. This implementation's been there for >4y, and massively used.
>>>
>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
>>> I've inherited that implementation and I am really keen to adequate it, what would you recommend ?
>>>
>>> Cheers
>>> Guilherme
>>>
>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email] <mailto:[hidden email]>> wrote:
>>>>
>>>> Thanks for posting the files. Looking at schema.xml, I see that you still are using StopFilterFactory. The first advice we gave you was to remove that.
>>>>
>>>> Remove StopFilterFactory everywhere and reindex.
>>>>
>>>> You will continue to have problems matching stopwords until you do that.
>>>>
>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> [hidden email] <mailto:[hidden email]>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>>>
>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>
>>>>> Hi Paras, everyone
>>>>>
>>>>> Thank you again for your inputs and suggestions. I sorry to hear you had trouble with the attachments I will host it somewhere and share the links.
>>>>> I don't tweak my index, I get the data from the graph database, create a document as they are and save to solr.
>>>>>
>>>>> So, I am sending the new analysis screen querying the way you suggested. Also the results with params and solr query url.
>>>>>
>>>>> During the process of querying what you asked I found something really weird (at least for me). By accident, I ended up querying the using the default handler (/select) and it worked. Then If I use the one I must use, then sadly doesn't work. I am posting both results and I will also post the handlers as well.
>>>>>
>>>>> Here is the link with all the files mentioned before
>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>>
>>>>> If the link doesn't work www dot dropbox dot com slash sh slash fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>
>>>>> Thanks
>>>>>
>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>
>>>>>> Hi Guilherme.
>>>>>>
>>>>>> I am sending the analysis result and the json result as requested.
>>>>>>
>>>>>>
>>>>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>>>>> though).
>>>>>>
>>>>>> From the analysis screen, the analysis is working as expected. One of the
>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>>>>> think of is: the stopword "a" is probably present in post-analysis either
>>>>>> of query or index. Did you tweak your index time analysis after indexing?
>>>>>>
>>>>>> Do two things:
>>>>>>
>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>>>> providing the link here.
>>>>>> 2. Give the same JSON output as you have sent but this time with
>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>
>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>>>>>> Apache server is fairly aggressive about stripping attachments though, so
>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>
>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>>
>>>>>>>> Thanks Erick.
>>>>>>>>
>>>>>>>>> First, your index and analysis chains are considerably different, this
>>>>>>> can easily be a source of problems. In particular, using two different
>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>>>>> of the length filter is suspicious, especially since your problem statement
>>>>>>> is about the addition of a single letter term and the min length allowed on
>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>>>>> interactions.
>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>
>>>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>> Used by custom code?
>>>>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>>>>>
>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>> typo.
>>>>>>>> This is part of the application. Species will be used later on in Solr to filter the results. That's not Solr syntax; those are my app params.
>>>>>>>>
>>>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>>>>> that part.
>>>>>>>> The two JSON files I sent were produced with debugQuery=on, and the explain tag is present.
>>>>>>>> I will try searching the way you mentioned.
>>>>>>>>
>>>>>>>> Thanks for your input
>>>>>>>>
>>>>>>>> Guilherme
>>>>>>>>
>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <[hidden email] <mailto:[hidden email]>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Fwd to another server
>>>>>>>>>
>>>>>>>>> First, your index and analysis chains are considerably different, this
>>>>>>> can easily be a source of problems. In particular, using two different
>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>>>>> of the length filter is suspicious, especially since your problem statement
>>>>>>> is about the addition of a single letter term and the min length allowed on
>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>>>>> interactions.
>>>>>>>>>
>>>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>> Used by custom code?
>>>>>>>>>
>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>
>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>> typo.
>>>>>>>>>
>>>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>>>>> that part.
>>>>>>>>>
>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>>>>> expect” is answered by looking at the “&debug=query” output and the
>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>>>>>> at _both_ the query and index output. Also, and very important about the
>>>>>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>>>>>> put in the text boxes have made it through the query parser intact and is
>>>>>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>>>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>>>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>>>>>> helps.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Erick
>>>>>>>>>
>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <[hidden email] <mailto:[hidden email]>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Walter,
>>>>>>>>>>
>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>> will
>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>>>>>>> think he's using the filter factory correctly - the query chain
>>>>>>> includes
>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>
>>>>>>>>>> *@Guilherme*, please post results for both the query, the document in
>>>>>>>>>> result you are concerned about and post full result of analysis screen
>>>>>>> (for
>>>>>>>>>> both query and index).
>>>>>>>>>>
>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <[hidden email] <mailto:[hidden email]>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> No.
>>>>>>>>>>>
>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>
>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>>>>> schema.xml.
>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>>>>> config.
>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>
>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>
>>>>>>>>>>> wunder
>>>>>>>>>>> Walter Underwood
>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>>
>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Ok. I am kind of lost now.
>>>>>>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>>>>>> result.
>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>
>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>>>>>>> then
>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>
>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>
>>>>>>>>>>>>> no,
>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>
>>>>>>>>>>>>> is still using stopwords and should be removed, in my opinion of
>>>>>>> course,
>>>>>>>>>>>>> based on your use case may be different, but i generally axe any
>>>>>>>>>>> reference
>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <[hidden email] <mailto:[hidden email]>
>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>       <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The first thing you should do is remove any reference to stop
>>>>>>> words
>>>>>>>>>>> and
>>>>>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field), however
>>>>>>> this
>>>>>>>>>>> term
>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i
>>>>>>> remove
>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>> <field name="name"                          type="text_field"
>>>>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>>>>>>> required="true"
>>>>>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>>>       <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>       <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>>>       <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> are
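[Editorial sketch] One rough way to reason about the two analyzer chains in the schema above is to approximate them in plain Python. This is only a sketch under assumptions, not Solr's actual analysis: a regex split stands in for StandardTokenizerFactory, ClassicFilterFactory is skipped, the stop list is just the few entries visible in the post, and all function names are hypothetical. Under those assumptions, the single-letter "a" is already dropped by the length filter (min="2") before the stop filter runs, and both chains reduce the query to terms that are present in the indexed name.

```python
import re

# Subset of stopwords.txt shown in the post; the real file is longer.
STOPWORDS = {"a", "an", "and", "are"}

def index_tokens(text):
    # Rough stand-in for StandardTokenizerFactory: split on non-alphanumerics.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def query_tokens(text):
    # PatternTokenizerFactory: the pattern from the post acts as the delimiter.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    cleaned = []
    for t in tokens:
        # The three PatternReplaceFilterFactory steps, in schema order.
        t = re.sub(r"^[/._:]+", "", t)
        t = re.sub(r"[/._:]+$", "", t)
        t = re.sub(r"[_]", " ", t)
        cleaned.append(t)
    return cleaned

def shared_filters(tokens):
    tokens = [t for t in tokens if 2 <= len(t) <= 20]  # LengthFilterFactory
    tokens = [t.lower() for t in tokens]               # LowerCaseFilterFactory
    return [t for t in tokens if t not in STOPWORDS]   # StopFilterFactory

doc = "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"
query = "lymphoid and a non-lymphoid cell"

indexed = shared_filters(index_tokens(doc))
queried = shared_filters(query_tokens(query))
print(indexed)
print(queried)
```

If the real analysis screen agrees with this approximation, the field analysis itself is probably not what makes the query with "a" fail, and the difference is more likely in how the request handler parses the query.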
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>
>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>
>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> IMPORTANT:
>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> --
>> Regards,
>>
>> Paras Lehana [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>>
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>>
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  8173
>>
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
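
[Editorial sketch] The advice quoted above (compare the parsed query of the working /select handler with the failing /search handler, using debug=query and echoParams=all) can be scripted. The snippet below only builds the two request URLs so they can be pasted into a browser or fetched with any HTTP client. The host, port, and core name are assumptions for illustration, not taken from the thread; the parameter names (q, fl, echoParams, debug, wt) are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own.
SOLR_BASE = "http://localhost:8983/solr/mycore"

def debug_url(handler, query):
    """Build a request URL that echoes all params and returns the parsed query."""
    params = {
        "q": query,
        "fl": "*",
        "echoParams": "all",  # show handler defaults as well as explicit params
        "debug": "query",     # parsed query only, without the score explanations
        "wt": "json",
    }
    return f"{SOLR_BASE}{handler}?{urlencode(params)}"

query = "lymphoid and a non-lymphoid cell"
select_url = debug_url("/select", query)
search_url = debug_url("/search", query)
print(select_url)
print(search_url)
```

Comparing the parsedquery entry in the debug section of the two responses should show how each handler interprets the same q.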


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood
In reply to this post by David Hastings
But when you change it to AND, a single misspelling means zero results. That is usually not helpful.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)
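
[Editorial sketch] Walter's point about AND can be illustrated with a toy matcher. This is not how Solr parses or scores queries (it ignores mm, analysis, and ranking); the documents, the misspelled term, and the function name are made up for illustration. It only shows that when every term is required, one misspelled term is enough to exclude every document, while OR still returns documents matching the remaining terms.

```python
def matches(doc_terms, query_terms, operator):
    """Toy boolean match: AND needs every query term, OR needs at least one."""
    hits = [term in doc_terms for term in query_terms]
    return all(hits) if operator == "AND" else any(hits)

# Hypothetical, already-analyzed documents.
docs = {
    "doc1": {"immunoregulatory", "interactions", "lymphoid", "non", "cell"},
    "doc2": {"lamin", "cell"},
}

# "lymphiod" is a deliberate misspelling of "lymphoid".
query = ["lymphiod", "cell"]

and_hits = [name for name, terms in docs.items() if matches(terms, query, "AND")]
or_hits = [name for name, terms in docs.items() if matches(terms, query, "OR")]
print(and_hits)
print(or_hits)
```

With OR the result set is larger and relies on ranking to put the best match first; with AND the user gets nothing back until the typo is fixed.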

> On Nov 8, 2019, at 10:43 AM, David Hastings <[hidden email]> wrote:
>
> is your default operator OR?
> change it to AND
>
>
> On Fri, Nov 8, 2019 at 11:30 AM Guilherme Viteri <[hidden email]> wrote:
>
>> HI Walter and Paras
>>
>> I indexed it removing all the references to StopWordFilter and I went from
>> 121 results to near 20K as the search term q="Lymphoid and a non-Lymphoid
>> cell" is matching entities such as "IFT A" or  "Lamin A". So I don't think
>> removing it completely is the way to go from the scenario we have, but I
>> appreciate the suggestion...
>>
>> Yes the response is using fl=*
>> I am trying some combinations at the moment, but yet no success.
>>
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, even though reasonable meaningful results.
>>
>> I am sorry but I didn't understand what do you want me to do exactly with
>> the lst (??) and qf and bf.
>>
>> Thanks everyone with their inputs
>>
>>
>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
>> wrote:
>>>
>>> Hi Guilherme
>>>
>>> By accident, I ended up querying using the default handler (/select) and it worked.
>>>
>>> You've just found the culprit. Thanks for giving the material I requested. Your analysis chain is working as expected. I don't see any issue in either StopWordFilter or your boosts. I also use a boost of 50 when boosting contextual suggestions (boosting "gold iphone" on a page of iphone), but I take Walter's suggestion and will try to optimize my weights. I agree that we never researched this value of 50 much either (we never faced performance or relevance issues).
>>>
>>> See the major difference between the two handlers: edismax. I'm pretty sure that your problem lies in the parsing of queries (you can confirm that from the parsedquery key in the debug section of both JSON responses). I hope you have provided the response with fl=*. Replace q with q.alt in your /search handler query and I think you should start getting responses. That's because q.alt uses the standard parser. If you want to keep using edismax, I suggest you test the responses after removing some combination of the lst entries (qf, bf) and find what's keeping the documents from coming up. I'm out of office today, otherwise I would certainly have tried analyzing the field values of the document in the /select request and comparing them with qf/bq in the solrconfig.xml /search handler. Do this for me and you'd certainly find something.
>>>
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>> I normally use a weight of 8 for the most important field, like title. Other fields might get a 4 or 2.
>>>
>>> I add a “pf” parameter with the weights doubled, so that phrase matches have a higher weight.
>>>
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two early web search engines. With different relevance algorithms and totally different evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML titles. With the two radically different systems getting the same number, I decided that was a property of the documents, not of the search engines.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email] <mailto:[hidden email]>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my
>> blog)
>>>
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>>>
>>>> Hi Wunder,
>>>>
>>>> My indexer takes quite a few hours to run. I am shortening it to run faster, but I also need to make sure it gives what we are expecting. This implementation has been there for >4y and is heavily used.
>>>>
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
>>>> I've inherited that implementation and I am really keen to adjust it. What would you recommend?
>>>> Cheers
>>>> Guilherme
>>>>
>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>>>>
>>>>> Thanks for posting the files. Looking at schema.xml, I see that you
>> still are using StopFilterFactory. The first advice we gave you was to
>> remove that.
>>>>>
>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>
>>>>> You will continue to have problems matching stopwords until you do
>> that.
>>>>>
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> [hidden email] <mailto:[hidden email]>
>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>> (my blog)
>>>>>
>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>>>>>
>>>>>> Hi Paras, everyone
>>>>>>
>>>>>> Thank you again for your input and suggestions. I am sorry to hear you had trouble with the attachments; I will host them somewhere and share the links.
>>>>>> I don't tweak my index: I get the data from the graph database, create the documents as they are, and save them to Solr.
>>>>>>
>>>>>> So, I am sending the new analysis screen, queried the way you suggested, along with the results with params and the Solr query URL.
>>>>>>
>>>>>> During the process of querying what you asked, I found something really weird (at least for me). By accident, I ended up querying using the default handler (/select) and it worked. If I use the one I must use (/search), it sadly doesn't work. I am posting both results and I will also post the handlers.
>>>>>>
>>>>>> Here is the link with all the files mentioned before
>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>>>>>>
>>>>>>> Hi Guilherme.
>>>>>>>
>>>>>>> I am sending the analysis result and the json result as requested.
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>> quality
>>>>>>> though).
>>>>>>>
>>>>>>> From the analysis screen, the analysis is working as expected. One
>> of the
>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
>> initially
>>>>>>> think of is: the stopword "a" is probably present in post-analysis
>> either
>>>>>>> of query or index. Did you tweak your index time analysis after
>> indexing?
>>>>>>>
>>>>>>> Do two things:
>>>>>>>
>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>>>>> providing the link here.
>>>>>>> 2. Give the same JSON output as you have sent but this time with
>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>>>>>>
>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some
>> such. The
>>>>>>>> Apache server is fairly aggressive about stripping attachments
>> though, so
>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>
>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Erick.
>>>>>>>>>
>>>>>>>>>> First, your index and analysis chains are considerably different,
>> this
>>>>>>>> can easily be a source of problems. In particular, using two
>> different
>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this
>> unless
>>>>>>>> you’re totally sure you understand the consequences. Additionally,
>> your use
>>>>>>>> of the length filter is suspicious, especially since your problem
>> statement
>>>>>>>> is about the addition of a single letter term and the min length
>> allowed on
>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
>> ’a’ is
>>>>>>>> filtered out in both cases, but maybe you’ve found something odd
>> about the
>>>>>>>> interactions.
>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>
>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
>> typos?
>>>>>>>> Used by custom code?
>>>>>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>>>>>>
>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
>> that
>>>>>>>> all the params with an equal-sign are totally ignored unless it’s
>> just a
>>>>>>>> typo.
>>>>>>>>> This is part of the application. Species will be used later on in Solr to filter the results. That's not Solr syntax; those are my app params.
>>>>>>>>>
>>>>>>>>>> Third, the easiest way to see what’s happening under the covers
>> is to
>>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore
>> all the
>>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to
>> skip
>>>>>>>> that part.
>>>>>>>>> The two JSON files I sent were produced with debugQuery=on, and the explain tag is present.
>>>>>>>>> I will try searching the way you mentioned.
>>>>>>>>>
>>>>>>>>> Thanks for your input
>>>>>>>>>
>>>>>>>>> Guilherme
>>>>>>>>>
>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <[hidden email]
>> <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Fwd to another server
>>>>>>>>>>
>>>>>>>>>> First, your index and analysis chains are considerably different,
>> this
>>>>>>>> can easily be a source of problems. In particular, using two
>> different
>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this
>> unless
>>>>>>>> you’re totally sure you understand the consequences. Additionally,
>> your use
>>>>>>>> of the length filter is suspicious, especially since your problem
>> statement
>>>>>>>> is about the addition of a single letter term and the min length
>> allowed on
>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
>> ’a’ is
>>>>>>>> filtered out in both cases, but maybe you’ve found something odd
>> about the
>>>>>>>> interactions.
>>>>>>>>>>
>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
>> typos?
>>>>>>>> Used by custom code?
>>>>>>>>>>
>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>
>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
>> that
>>>>>>>> all the params with an equal-sign are totally ignored unless it’s
>> just a
>>>>>>>> typo.
>>>>>>>>>>
>>>>>>>>>> Third, the easiest way to see what’s happening under the covers
>> is to
>>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore
>> all the
>>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to
>> skip
>>>>>>>> that part.
>>>>>>>>>>
>>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>>>>>> expect” is answered by looking at the “&debug=query” output and the
>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure
>> to look
>>>>>>>> at _both_ the query and index output. Also, and very important
>> about the
>>>>>>>> analysis page (and this is confusing) is that this _assumes_ that
>> what you
>>>>>>>> put in the text boxes have made it through the query parser intact
>> and is
>>>>>>>> analyzed by the field selected. Consider the search "q=field:word1
>> word2".
>>>>>>>> Now you type “word1 word2” into the analysis text box and it looks
>> like
>>>>>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>>>>>> "field:word1 default_search_field:word2”. This is where
>> “&debug=query”
>>>>>>>> helps.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Erick
>>>>>>>>>>
>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>> [hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>
>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those
>> words
>>>>>>>> will
>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think the OP's concern is different results when adding a
>> stopword. I
>>>>>>>>>>> think he's using the filter factory correctly - the query chain
>>>>>>>> includes
>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>
>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
>> document in
>>>>>>>>>>> result you are concerned about and post full result of analysis
>> screen
>>>>>>>> (for
>>>>>>>>>>> both query and index).
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>> [hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No.
>>>>>>>>>>>>
>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>> Those words
>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis
>> chain in
>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the
>> new
>>>>>>>> config.
>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>
>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will
>> not be
>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>
>>>>>>>>>>>> wunder
>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> http://observer.wunderwood.org/ <
>> http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>> [hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok. I am kind a lost now.
>>>>>>>>>>>>> If I open up the console > analysis and perform it, that's the
>> final
>>>>>>>>>>>> result.
>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
>> stopwords.txt"," ")
>>>>>>>> then
>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>> [hidden email]>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>> ignoreCase="true"
>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> is still using stopwords and should be removed, in my opinion
>> of
>>>>>>>> course,
>>>>>>>>>>>>>> based on your use case may be different, but i generally axe
>> any
>>>>>>>>>>>> reference
>>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>       <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>> ignoreCase="true"
>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>> [hidden email]>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to
>> stop
>>>>>>>> words
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field),
>> however
>>>>>>>> this
>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records.
>> If i
>>>>>>>> remove
>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>> <field name="name"
>> type="text_field"
>>>>>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>>>>>>>> required="true"
>>>>>>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>>>>       <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>   <analyzer type="index">
>>>>>>>>>>>>>>>>>       <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>>   <analyzer type="query">
>>>>>>>>>>>>>>>>>       <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>       <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>   </analyzer>
>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
>> StopAnalyzer
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>
>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>
>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood
In reply to this post by Erick Erickson
I always enable phrase searching in edismax for exactly this reason.

Something like:

       <str name="qf">title^8 keywords^4 text</str>
       <str name="pf">title^16 keywords^8 text^2</str>
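
For context, here is a minimal sketch of where those two lines might sit inside an edismax handler in solrconfig.xml. The handler name, the field names (title, keywords, text) and the ps value are assumptions for illustration only, not taken from anyone's actual config; swap in your own fields and tune the boosts against your data.

```xml
<!-- Illustrative sketch only: handler name, fields and boosts are placeholders. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="q.op">OR</str>
    <!-- qf: fields searched term by term, with per-field boosts -->
    <str name="qf">title^8 keywords^4 text</str>
    <!-- pf: same fields with doubled boosts, so documents that contain
         the whole query as a phrase score higher than scattered matches -->
    <str name="pf">title^16 keywords^8 text^2</str>
    <!-- ps: phrase slop applied to pf; a small value is an assumption here
         and may need adjusting when tokens are dropped during analysis -->
    <str name="ps">1</str>
  </lst>
</requestHandler>
```

Note that pf only affects ranking. It does not change which documents match, so it should be safe to try alongside whatever qf and mm settings are already in place.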

To deal with concepts in queries, a classifier and/or named entity extractor can be helpful. If you have a list of concepts (“controlled vocabulary”) that includes “Lamin A”, and that shows up in a query, that term can be queried against the field matching that vocabulary.

This is how LinkedIn separates people, companies, and places, for example.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]> wrote:
>
> Look at the “mm” parameter, try setting it to 100%. Although that’s not entirely likely to do what you want either since virtually every doc will have “a” in it. But at least you’d get docs that have both terms.
>
> You may also be able to search for things like “Lamin A” _only as a phrase_ and have some luck. But this is a gnarly problem in general. Some people have been able to substitute synonyms and/or shingles to make this work at the expense of a larger index.
>
> This is a generic problem with context. “Lamin A” is really a “concept”, not just two words that happen to be near each other. Searching as a phrase is an OOB-but-naive way to try to make it more likely that the ranked results refer to the _concept_ of “Lamin A”. The assumption here is “if these two words appear next to each other, they’re more likely to be what I want”. I say “naive” because “Lamins: A new approach to...” would _also_ be found for a naive phrase search. (I have no idea whether such a title makes sense or not, but you figured that out already)...
>
> To do this well you’d have to dive in to NLP/Machine learning.
>
> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
>
>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]> wrote:
>>
>> HI Walter and Paras
>>
>> I indexed it removing all the references to StopWordFilter and I went from 121 results to near 20K as the search term q="Lymphoid and a non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A". So I don't think removing it completely is the way to go from the scenario we have, but I appreciate the suggestion…
>>
>> Yes the response is using fl=*
>> I am trying some combinations at the moment, but yet no success.
>>
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, even though reasonable meaningful results.
>>
>> I am sorry but I didn't understand what do you want me to do exactly with the lst (??) and qf and bf.
>>
>> Thanks everyone with their inputs
>>
>>
>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]> wrote:
>>>
>>> Hi Guilherme
>>>
>>> By accident, I ended up querying using the default handler (/select) and it worked.
>>>
>>> You've just found the culprit. Thanks for giving the material I requested. Your analysis chain is working as expected. I don't see any issue in either StopWordFilter or your boosts. I also use a boost of 50 when boosting contextual suggestions (boosting "gold iphone" on a page of iphone) but I take Walter's suggestion and would try to optimize my weights. I agree that this 50 thing was not researched much about by us as well (we never faced performance or relevance issues).  
>>>
>>> See the major difference in both the handlers - edismax. I'm pretty sure that your problem lies in the parsing of queries (you can confirm that from parsedquery key in debug of both JSON responses). I hope you have provided the response with fl=*. Replace q with q.alt in your /search handler query and I think you should start getting responses. That's because q.alt uses standard parser. If you want to keep using edisMax, I suggest you to test the responses removing some combination of lst (qf, bf) and find what's restricting the documents to come up. I'm out of office today - would have certainly tried analyzing the field values of the document in /select request and compare it with qf/bq in solrconfig.xml /search. Do this for me and you'd certainly find something.  
>>>
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email] <mailto:[hidden email]>> wrote:
>>> I normally use a weight of 8 for the most important field, like title. Other fields might get a 4 or 2.
>>>
>>> I add a “pf” field with the weights doubled, so that phrase matches have a higher weight.
>>>
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two early web search engines. With different relevance algorithms and totally different evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML titles. With the two radically different systems getting the same number, I decided that was a property of the documents, not of the search engines.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email] <mailto:[hidden email]>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>>
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>> wrote:
>>>>
>>>> Hi Wunder,
>>>>
>>>> My indexer takes quite a few hours to be executed I am shortening it to run faster, but I also need to make sure it gives what we are expecting. This implementation's been there for >4y, and massively used.
>>>>
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
>>>> I've inherited that implementation and I am really keen to adequate it, what would you recommend ?
>>>>
>>>> Cheers
>>>> Guilherme
>>>>
>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>
>>>>> Thanks for posting the files. Looking at schema.xml, I see that you still are using StopFilterFactory. The first advice we gave you was to remove that.
>>>>>
>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>
>>>>> You will continue to have problems matching stopwords until you do that.
>>>>>
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I don’t think I’ve ever used a weight higher than 16 in a dozen years of configuring Solr.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> [hidden email] <mailto:[hidden email]>
>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>>>>
>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>
>>>>>> Hi Paras, everyone
>>>>>>
>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear you had trouble with the attachments; I will host them somewhere and share the links.
>>>>>> I don't tweak my index, I get the data from the graph database, create a document as they are and save to solr.
>>>>>>
>>>>>> So, I am sending the new analysis screen querying the way you suggested. Also the results with params and solr query url.
>>>>>>
>>>>>> During the process of querying what you asked I found something really weird (at least for me). By accident, I ended up querying using the default handler (/select) and it worked. If I use the one I must use, it sadly doesn't work. I am posting both results and I will also post the handlers as well.
>>>>>>
>>>>>> Here is the link with all the files mentioned before
>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>
>>>>>>> Hi Guilherme.
>>>>>>>
>>>>>>> I am sending the analysis result and the json result as requested.
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>>>>>> though).
>>>>>>>
>>>>>>> From the analysis screen, the analysis is working as expected. One of the
>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>>>>>> think of is: the stopword "a" is probably present in post-analysis either
>>>>>>> of query or index. Did you tweak your index time analysis after indexing?
>>>>>>>
>>>>>>> Do two things:
>>>>>>>
>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image
>>>>>>> and providing the link here.
>>>>>>> 2. Give the same JSON output as you have sent but this time with
>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>
>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>>>>>>> Apache server is fairly aggressive about stripping attachments though, so
>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>
>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks Erick.
>>>>>>>>>
>>>>>>>>>> First, your index and analysis chains are considerably different, this
>>>>>>>> can easily be a source of problems. In particular, using two different
>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>>>>>> of the length filter is suspicious, especially since your problem statement
>>>>>>>> is about the addition of a single letter term and the min length allowed on
>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>>>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>>>>>> interactions.
>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>
>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>>> Used by custom code?
>>>>>>>>> This is the URL in my application, not Solr params. That's the query string.
>>>>>>>>>
>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>>> typo.
>>>>>>>>> This is part of the application. Species will be used later on in Solr
>>>>>>>> to filter the results. That's not Solr; those are my app params.
>>>>>>>>>
>>>>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>>>>>> that part.
>>>>>>>>> The two json files i've sent, they are debugQuery=on and the explain tag
>>>>>>>> is present.
>>>>>>>>> I will try the searching the way you mentioned.
>>>>>>>>>
>>>>>>>>> Thanks for your inputs
>>>>>>>>>
>>>>>>>>> Guilherme
>>>>>>>>>
>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <[hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Fwd to another server
>>>>>>>>>>
>>>>>>>>>> First, your index and analysis chains are considerably different, this
>>>>>>>> can easily be a source of problems. In particular, using two different
>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>>>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>>>>>> of the length filter is suspicious, especially since your problem statement
>>>>>>>> is about the addition of a single letter term and the min length allowed on
>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>>>>>>> filtered out in both cases, but maybe you’ve found something odd about the
>>>>>>>> interactions.
>>>>>>>>>>
>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>>>>>>> Used by custom code?
>>>>>>>>>>
>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>
>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>>>>>>> all the params with an equal-sign are totally ignored unless it’s just a
>>>>>>>> typo.
>>>>>>>>>>
>>>>>>>>>> Third, the easiest way to see what’s happening under the covers is to
>>>>>>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>>>>>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>>>>>>> that part.
>>>>>>>>>>
>>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
>>>>>>>> expect” is answered by looking at the “&debug=query” output and the
>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>>>>>>> at _both_ the query and index output. Also, and very important about the
>>>>>>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>>>>>>> put in the text boxes have made it through the query parser intact and is
>>>>>>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>>>>>>> Now you type “word1 word2” into the analysis text box and it looks like
>>>>>>>> what you expect. That’s misleading because the query is _parsed_ as
>>>>>>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>>>>>>> helps.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Erick
>>>>>>>>>>
>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <[hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>
>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>> will
>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>>>>>>>> think he's using the filter factory correctly - the query chain
>>>>>>>> includes
>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>
>>>>>>>>>>> *@Guilherme*, please post results for both the query, the document in
>>>>>>>>>>> result you are concerned about and post full result of analysis screen
>>>>>>>> (for
>>>>>>>>>>> both query and index).
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <[hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> No.
>>>>>>>>>>>>
>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>>>>>>> config.
>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>
>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>
>>>>>>>>>>>> wunder
>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <[hidden email] <mailto:[hidden email]>>
>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok. I am kind a lost now.
>>>>>>>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>>>>>>>> result.
>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>>>>>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>>>>>>>> then
>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> is still using stopwords and should be removed, in my opinion of
>>>>>>>> course,
>>>>>>>>>>>>>> based on your use case may be different, but i generally axe any
>>>>>>>>>>>> reference
>>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <[hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>  <analyzer type="index">
>>>>>>>>>>>>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>      <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>  </analyzer>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to stop
>>>>>>>> words
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field), however
>>>>>>>> this
>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i
>>>>>>>> remove
>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  <analyzer type="query">
>>>>>>>>>>>>>>>>>      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>  </analyzer>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>  <analyzer type="index">
>>>>>>>>>>>>>>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>  </analyzer>
>>>>>>>>>>>>>>>>>  <analyzer type="query">
>>>>>>>>>>>>>>>>>      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>  </analyzer>
>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>
>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>
>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

David Hastings
the pf and qf fields are REALLY nice for this
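
For anyone following along, here is a minimal sketch of what that could look like in the /search handler discussed in this thread. The field name "name" is taken from the posted schema; the boosts of 8 and 16 just follow Walter's doubling convention and are assumptions, not tuned values.

```xml
<!-- Sketch only: boosts are illustrative assumptions, not tuned values. -->
<requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="q.op">OR</str>
        <!-- qf: fields searched term by term, with per-field boosts -->
        <str name="qf">name^8</str>
        <!-- pf: extra boost when the query terms also match as a phrase -->
        <str name="pf">name^16</str>
    </lst>
</requestHandler>
```

Note that pf only affects ranking: it boosts documents where the query terms occur together as a phrase, but it does not restrict which documents match.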

On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <[hidden email]>
wrote:

> I always enable phrase searching in edismax for exactly this reason.
>
> Something like:
>
>        <str name="qf">title^8 keywords^4 text</str>
>        <str name="pf">title^16 keywords^8 text^2</str>
>
> To deal with concepts in queries, a classifier and/or named entity
> extractor can be helpful. If you have a list of concepts (“controlled
> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
> term can be queried against the field matching that vocabulary.
>
> This is how LinkedIn separates people, companies, and places, for example.
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]>
> wrote:
> >
> > Look at the “mm” parameter, try setting it to 100%. Although that’s not
> entirely likely to do what you want either since virtually every doc will
> have “a” in it. But at least you’d get docs that have both terms.
> >
> > you may also be able to search for things like “Lamin A” _only as a
> phrase_ and have some luck. But this is a gnarly problem in general. Some
> people have been able to substitute synonyms and/or shingles to make this
> work at the expense of a larger index.
> >
> > This is a generic problem with context. “Lamin A” is really a “concept”,
> not just two words that happen to be near each other. Searching as a phrase
> is an OOB-but-naive way to try to make it more likely that the ranked
> results refer to the _concept_ of “Lamin A”. The assumption here is “if
> these two words appear next to each other, they’re more likely to be what I
> want”. I say “naive” because “Lamins: A new approach to...” would _also_ be
> found for a naive phrase search. (I have no idea whether such a title makes
> sense or not, but you figured that out already)...
> >
> > To do this well you’d have to dive in to NLP/Machine learning.
> >
> > I truly wish we could have the DWIM search algorithm (Do What I Mean)….
> >
> >> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
> wrote:
> >>
> >> HI Walter and Paras
> >>
> >> I indexed it removing all the references to StopWordFilter and I went
> from 121 results to near 20K as the search term q="Lymphoid and a
> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A". So I
> don't think removing it completely is the way to go from the scenario we
> have, but I appreciate the suggestion…
> >>
> >> Yes the response is using fl=*
> >> I am trying some combinations at the moment, but yet no success.
> >>
> >> defType=edismax
> >> q.alt=Lymphoid and a non-Lymphoid cell
> >> Number of results=1599
> >> Quite a considerable increase, even though reasonable meaningful
> results.
> >>
> >> I am sorry but I didn't understand what do you want me to do exactly
> with the lst (??) and qf and bf.
> >>
> >> Thanks everyone with their inputs
> >>
> >>
> >>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
> wrote:
> >>>
> >>> Hi Guilherme
> >>>
> >>> By accident, I ended up querying using the default handler
> (/select) and it worked.
> >>>
> >>> You've just found the culprit. Thanks for giving the material I
> requested. Your analysis chain is working as expected. I don't see any
> issue in either StopWordFilter or your boosts. I also use a boost of 50
> when boosting contextual suggestions (boosting "gold iphone" on a page of
> iphone) but I take Walter's suggestion and would try to optimize my
> weights. I agree that this 50 thing was not researched much about by us as
> well (we never faced performance or relevance issues).
> >>>
> >>> See the major difference in both the handlers - edismax. I'm pretty
> sure that your problem lies in the parsing of queries (you can confirm that
> from parsedquery key in debug of both JSON responses). I hope you have
> provided the response with fl=*. Replace q with q.alt in your /search
> handler query and I think you should start getting responses. That's
> because q.alt uses standard parser. If you want to keep using edisMax, I
> suggest you to test the responses removing some combination of lst (qf, bf)
> and find what's restricting the documents to come up. I'm out of office
> today - would have certainly tried analyzing the field values of the
> document in /select request and compare it with qf/bq in solrconfig.xml
> /search. Do this for me and you'd certainly find something.
> >>>
> >>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email]
> <mailto:[hidden email]>> wrote:
> >>> I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
> >>>
> >>> I add a “pf” field with the weights doubled, so that phrase matches
> have a higher weight.
> >>>
> >>> The weight of 8 comes from experience at Infoseek and Inktomi, two
> early web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the two radically different systems getting
> the same number, I decided that was a property of the documents, not of the
> search engines.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> [hidden email] <mailto:[hidden email]>
> >>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> (my blog)
> >>>
> >>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
> <mailto:[hidden email]>> wrote:
> >>>>
> >>>> Hi Wunder,
> >>>>
> >>>> My indexer takes quite a few hours to be executed I am shortening it
> to run faster, but I also need to make sure it gives what we are expecting.
> This implementation's been there for >4y, and massively used.
> >>>>
> >>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
> of configuring Solr.
> >>>> I've inherited that implementation and I am really keen to adjust
> it. What would you recommend?
> >>>>
> >>>> Cheers
> >>>> Guilherme
> >>>>
> >>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
> <mailto:[hidden email]>> wrote:
> >>>>>
> >>>>> Thanks for posting the files. Looking at schema.xml, I see that you
> still are using StopFilterFactory. The first advice we gave you was to
> remove that.
> >>>>>
> >>>>> Remove StopFilterFactory everywhere and reindex.
> >>>>>
> >>>>> You will continue to have problems matching stopwords until you do
> that.
> >>>>>
> >>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
> of configuring Solr.
> >>>>>
> >>>>> wunder
> >>>>> Walter Underwood
> >>>>> [hidden email] <mailto:[hidden email]>
> >>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> (my blog)
> >>>>>
> >>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
> <mailto:[hidden email]>> wrote:
> >>>>>>
> >>>>>> Hi Paras, everyone
> >>>>>>
> >>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
> you had trouble with the attachments I will host it somewhere and share the
> links.
> >>>>>> I don't tweak my index, I get the data from the graph database,
> create a document as they are and save to solr.
> >>>>>>
> >>>>>> So, I am sending the new analysis screen querying the way you
> suggested. Also the results with params and solr query url.
> >>>>>>
> >>>>>> During the process of querying what you asked I found something
> >>>>>> really weird (at least for me). By accident, I ended up querying using
> the default handler (/select) and it worked. Then If I use the one I must
> use, then sadly doesn't work. I am posting both results and I will also
> post the handlers as well.
> >>>>>>
> >>>>>> Here is the link with all the files mentioned before
> >>>>>>
> >>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <[hidden email]
> <mailto:[hidden email]>> wrote:
> >>>>>>>
> >>>>>>> Hi Guilherme.
> >>>>>>>
> >>>>>>> I am sending the analysis result and the json result as requested.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
> quality
> >>>>>>> though).
> >>>>>>>
> >>>>>>> From the analysis screen, the analysis is working as expected. One
> of the
> >>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> >>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
> initially
> >>>>>>> think of is: the stopword "a" is probably present in post-analysis
> either
> >>>>>>> of query or index. Did you tweak your index time analysis after
> indexing?
> >>>>>>>
> >>>>>>> Do two things:
> >>>>>>>
> >>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
> >>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
> >>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
> >>>>>>> providing the link here.
> >>>>>>> 2. Give the same JSON output as you have sent but this time with
> >>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
> [hidden email] <mailto:[hidden email]>> wrote:
> >>>>>>>
> >>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some
> such. The
> >>>>>>>> Apache server is fairly aggressive about stripping attachments
> though, so
> >>>>>>>> it’s also possible they didn’t make it through.
> >>>>>>>>
> >>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[hidden email]
> <mailto:[hidden email]>> wrote:
> >>>>>>>>>
> >>>>>>>>> Thanks Erick.
> >>>>>>>>>
> >>>>>>>>>> First, your index and analysis chains are considerably
> different, this
> >>>>>>>> can easily be a source of problems. In particular, using two
> different
> >>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> this unless
> >>>>>>>> you’re totally sure you understand the consequences.
> Additionally, your use
> >>>>>>>> of the length filter is suspicious, especially since your problem
> statement
> >>>>>>>> is about the addition of a single letter term and the min length
> allowed on
> >>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> ’a’ is
> >>>>>>>> filtered out in both cases, but maybe you’ve found something odd
> about the
> >>>>>>>> interactions.
> >>>>>>>>> I will investigate the min length and post the results later.
> >>>>>>>>>
> >>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> typos?
> >>>>>>>> Used by custom code?
> >>>>>>>>> This is the URL in my application, not Solr params. That's the
> query string.
> >>>>>>>>>
> >>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
> that
> >>>>>>>> all the params with an equal-sign are totally ignored unless it’s
> just a
> >>>>>>>> typo.
> >>>>>>>>> This is part of the application. Species will be used later on
> in solr
> >>>>>>>> to filter out the result. That's not Solr. Those are my app params.
> >>>>>>>>>
> >>>>>>>>>> Third, the easiest way to see what’s happening under the covers
> is to
> >>>>>>>> add “&debug=true” to the query and look at the parsed query.
> Ignore all the
> >>>>>>>> relevance calculations for the nonce, or specify “&debug=query”
> to skip
> >>>>>>>> that part.
> >>>>>>>>> The two JSON files I've sent were produced with debugQuery=on, and the
> >>>>>>>>> explain tag is present.
> >>>>>>>>> I will try searching the way you mentioned.
> >>>>>>>>>
> >>>>>>>>> Thank for your inputs
> >>>>>>>>>
> >>>>>>>>> Guilherme
> >>>>>>>>>
> >>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
> [hidden email] <mailto:[hidden email]>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Fwd to another server
> >>>>>>>>>>
> >>>>>>>>>> First, your index and analysis chains are considerably
> different, this
> >>>>>>>> can easily be a source of problems. In particular, using two
> different
> >>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> this unless
> >>>>>>>> you’re totally sure you understand the consequences.
> Additionally, your use
> >>>>>>>> of the length filter is suspicious, especially since your problem
> statement
> >>>>>>>> is about the addition of a single letter term and the min length
> allowed on
> >>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> ’a’ is
> >>>>>>>> filtered out in both cases, but maybe you’ve found something odd
> about the
> >>>>>>>> interactions.
> >>>>>>>>>>
> >>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> typos?
> >>>>>>>> Used by custom code?
> >>>>>>>>>>
> >>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>
> >>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely
> that
> >>>>>>>> all the params with an equal-sign are totally ignored unless it’s
> just a
> >>>>>>>> typo.
> >>>>>>>>>>
> >>>>>>>>>> Third, the easiest way to see what’s happening under the covers
> is to
> >>>>>>>> add “&debug=true” to the query and look at the parsed query.
> Ignore all the
> >>>>>>>> relevance calculations for the nonce, or specify “&debug=query”
> to skip
> >>>>>>>> that part.
> >>>>>>>>>>
> >>>>>>>>>> 90% + of the time, the question “why didn’t this query do what I
> >>>>>>>> expect” is answered by looking at the “&debug=query” output and
> the
> >>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
> sure to look
> >>>>>>>> at _both_ the query and index output. Also, and very important
> about the
> >>>>>>>> analysis page (and this is confusing) is that this _assumes_ that
> what you
> >>>>>>>> put in the text boxes have made it through the query parser
> intact and is
> >>>>>>>> analyzed by the field selected. Consider the search
> "q=field:word1 word2".
> >>>>>>>> Now you type “word1 word2” into the analysis text box and it
> looks like
> >>>>>>>> what you expect. That’s misleading because the query is _parsed_
> as
> >>>>>>>> "field:word1 default_search_field:word2”. This is where
> “&debug=query”
> >>>>>>>> helps.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Erick
> >>>>>>>>>>
> >>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> [hidden email] <mailto:[hidden email]>>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Walter,
> >>>>>>>>>>>
> >>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> Those words
> >>>>>>>> will
> >>>>>>>>>>>> not be in the index, so they can never match a query.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I think the OP's concern is different results when adding a
> stopword. I
> >>>>>>>>>>> think he's using the filter factory correctly - the query chain
> >>>>>>>> includes
> >>>>>>>>>>> the filter as well so it should remove "a" while querying.
> >>>>>>>>>>>
> >>>>>>>>>>> *@Guilherme*, please post results for both the query, the
> document in
> >>>>>>>>>>> result you are concerned about and post full result of
> analysis screen
> >>>>>>>> (for
> >>>>>>>>>>> both query and index).
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
> [hidden email] <mailto:[hidden email]>>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> No.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> Those words
> >>>>>>>>>>>> will not be in the index, so they can never match a query.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis
> chain in
> >>>>>>>>>>>> schema.xml.
> >>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read
> the new
> >>>>>>>> config.
> >>>>>>>>>>>> 3. Reindex all of the documents.
> >>>>>>>>>>>>
> >>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will
> not be
> >>>>>>>>>>>> removed and they will be searchable.
> >>>>>>>>>>>>
> >>>>>>>>>>>> wunder
> >>>>>>>>>>>> Walter Underwood
> >>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
> >>>>>>>>>>>> http://observer.wunderwood.org/ <
> http://observer.wunderwood.org/>  (my blog)
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
> [hidden email] <mailto:[hidden email]>>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Ok. I am kind of lost now.
> >>>>>>>>>>>>> If I open up the console > analysis and perform it, that's
> the final
> >>>>>>>>>>>> result.
> >>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in
> the
> >>>>>>>>>>>> schema.xml and during index phase replaceAll("in
> stopwords.txt"," ")
> >>>>>>>> then
> >>>>>>>>>>>> add to solr. Is that correct ?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks David
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> >>>>>>>> [hidden email] <mailto:[hidden email]
> >
> >>>>>>>>>>>> <mailto:[hidden email] <mailto:
> [hidden email]>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Fwd to another server
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> no,
> >>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> >>>>>>>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> is still using stopwords and should be removed, in my
> opinion of
> >>>>>>>> course,
> >>>>>>>>>>>>>> based on your use case may be different, but i generally
> axe any
> >>>>>>>>>>>> reference
> >>>>>>>>>>>>>> to them at all
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
> [hidden email] <mailto:[hidden email]>
> >>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>> Haven't I done this here ?
> >>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>>>>>>>>>  <analyzer type="index">
> >>>>>>>>>>>>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>>>>      <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>>>>>>> max="20"/>
> >>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> >>>>>>>>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>>>>>>>  </analyzer>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> >>>>>>>> [hidden email] <mailto:[hidden email]
> >
> >>>>>>>>>>>> <mailto:[hidden email] <mailto:
> [hidden email]>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Fwd to another server
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The first thing you should do is remove any reference to
> stop
> >>>>>>>> words
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> never use them, then re-index your data and try it again.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> >>>>>>>> [hidden email] <mailto:[hidden email]>
> >>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field),
> however
> >>>>>>>> this
> >>>>>>>>>>>> term
> >>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
> records. If i
> >>>>>>>> remove
> >>>>>>>>>>>>>>> 'a'
> >>>>>>>>>>>>>>>>> then it works.
> >>>>>>>>>>>>>>>>> e.g
> >>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>>>>>>>>>>>>>> doesn't work:
> >>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
> >>>>>>>>>>>>>>>>> works:
> >>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> interested in the first result
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> schema.xml
> >>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>  <analyzer type="query">
> >>>>>>>>>>>>>>>>>      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>  </analyzer>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
> >>>>>>>>>>>>>>>>>  <analyzer type="index">
> >>>>>>>>>>>>>>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>  </analyzer>
> >>>>>>>>>>>>>>>>>  <analyzer type="query">
> >>>>>>>>>>>>>>>>>      <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>  </analyzer>
> >>>>>>>>>>>>>>>>> </fieldType>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> stopwords.txt
> >>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
> >>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>>> c
> >>>>>>>>>>>>>>>>> ....
> >>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Running SolR 6.6.2.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>> Guilherme
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >
>
>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood
If we had IDF for phrases, they would be super effective. The 2X weight is a hack that mostly works.

Infoseek had phrase IDF and it was a killer algorithm for relevance.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Nov 8, 2019, at 11:08 AM, David Hastings <[hidden email]> wrote:
>
> the pf and qf fields are REALLY nice for this
>
> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <[hidden email]>
> wrote:
>
>> I always enable phrase searching in edismax for exactly this reason.
>>
>> Something like:
>>
>>       <str name="qf">title^8 keywords^4 text</str>
>>       <str name="pf">title^16 keywords^8 text^2</str>
>>
>> To deal with concepts in queries, a classifier and/or named entity
>> extractor can be helpful. If you have a list of concepts (“controlled
>> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
>> term can be queried against the field matching that vocabulary.
>>
>> This is how LinkedIn separates people, companies, and places, for example.
>>
>> wunder
>> Walter Underwood
>> [hidden email]
>> http://observer.wunderwood.org/  (my blog)


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

David Hastings
I use 3-word shingles with stopwords for my MLT ML trainer. That worked
pretty well for this kind of problem, but for a full index the size became
prohibitive.
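
A shingled field of this kind could look roughly like the sketch below. The field type name, the destination field name_shingles, and the shingle sizes are assumptions for illustration; the actual setup described above is not shown in this thread. The chain deliberately has no StopFilterFactory or LengthFilterFactory, so single-letter tokens such as "a" survive and "Lamin A" is indexed as the single term "lamin a".

```xml
<!-- Sketch only: names and sizes are hypothetical -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit 2- and 3-word shingles only; drop the single-word tokens -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="3"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>

<!-- Hypothetical shingled copy of the existing "name" field -->
<field name="name_shingles" type="text_shingle" indexed="true" stored="false"/>
<copyField source="name" dest="name_shingles"/>
```

Every token position produces additional shingle terms, so the term dictionary grows quickly. That is the index-size cost mentioned above, and it is one reason to limit shingling to a short field such as name.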

On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
wrote:

> If we had IDF for phrases, they would be super effective. The 2X weight is
> a hack that mostly works.
>
> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 8, 2019, at 11:08 AM, David Hastings <
> [hidden email]> wrote:
> >
> > the pf and qf fields are REALLY nice for this
> >
> > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <[hidden email]>
> > wrote:
> >
> >> I always enable phrase searching in edismax for exactly this reason.
> >>
> >> Something like:
> >>
> >>       <str name="qf”>title^8 keywords^4 text</str>
> >>       <str name="pf”>title^16 keywords^8 text^2</str>
> >>
> >> To deal with concepts in queries, a classifier and/or named entity
> >> extractor can be helpful. If you have a list of concepts (“controlled
> >> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
> >> term can be queried against the field matching that vocabulary.
> >>
> >> This is how LinkedIn separates people, companies, and places, for
> example.
> >>
> >> wunder
> >> Walter Underwood
> >> [hidden email]
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]>
> >> wrote:
> >>>
> >>> Look at the “mm” parameter, try setting it to 100%. Although that’t not
> >> entirely likely to do what you want either since virtually every doc
> will
> >> have “a” in it. But at least you’d get docs that have both terms.
> >>>
> >>> you may also be able to search for things like “Lamin A” _only as a
> >> phrase_ and have some luck. But this is a gnarly problem in general.
> Some
> >> people have been able to substitute synonyms and/or shingles to make
> this
> >> work at the expense of a larger index.
> >>>
> >>> This is a generic problem with context. “Lamin A” is really a
> “concept”,
> >> not just two words that happen to be near each other. Searching as a
> phrase
> >> is an OOB-but-naive way to try to make it more likely that the ranked
> >> results refer to the _concept_ of “Lamin A”. The assumption here is “if
> >> these two words appear next to each other, they’re more likely to be
> what I
> >> want”. I say “naive” because “Lamins: A new approach to...” would
> _also_ be
> >> found for a naive phrase search. (I have no idea whether such a title
> makes
> >> sense or not, but you figured that out already)...
> >>>
> >>> To do this well you’d have to dive in to NLP/Machine learning.
> >>>
> >>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
> >>>
> >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
> >> wrote:
> >>>>
> >>>> HI Walter and Paras
> >>>>
> >>>> I indexed it removing all the references to StopWordFilter and I went
> >> from 121 results to near 20K as the search term q="Lymphoid and a
> >> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
> So I
> >> don't think removing it completely is the way to go from the scenario we
> >> have, but I appreciate the suggestion…
> >>>>
> >>>> Yes the response is using fl=*
> >>>> I am trying some combinations at the moment, but yet no success.
> >>>>
> >>>> defType=edismax
> >>>> q.alt=Lymphoid and a non-Lymphoid cell
> >>>> Number of results=1599
> >>>> Quite a considerable increase, even though reasonable meaningful
> >> results.
> >>>>
> >>>> I am sorry but I didn't understand what do you want me to do exactly
> >> with the lst (??) and qf and bf.
> >>>>
> >>>> Thanks everyone with their inputs
> >>>>
> >>>>
> >>>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
> >> wrote:
> >>>>>
> >>>>> Hi Guilherme
> >>>>>
> >>>>> By accident, I ended up querying the using the default handler
> >> (/select) and it worked.
> >>>>>
> >>>>> You've just found the culprit. Thanks for giving the material I
> >> requested. Your analysis chain is working as expected. I don't see any
> >> issue in either StopWordFilter or your boosts. I also use a boost of 50
> >> when boosting contextual suggestions (boosting "gold iphone" on a page
> of
> >> iphone) but I take Walter's suggestion and would try to optimize my
> >> weights. I agree that this 50 thing was not researched much about by us
> as
> >> well (we never faced performance or relevance issues).
> >>>>>
> >>>>> See the major difference in both the handlers - edismax. I'm pretty
> >> sure that your problem lies in the parsing of queries (you can confirm
> that
> >> from parsedquery key in debug of both JSON responses). I hope you have
> >> provided the response with fl=*. Replace q with q.alt in your /search
> >> handler query and I think you should start getting responses. That's
> >> because q.alt uses standard parser. If you want to keep using edisMax, I
> >> suggest you to test the responses removing some combination of lst (qf,
> bf)
> >> and find what's restricting the documents to come up. I'm out of office
> >> today - would have certainly tried analyzing the field values of the
> >> document in /select request and compare it with qf/bq in solrconfig.xml
> >> /search. Do this for me and you'd certainly find something.
> >>>>>
> >>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>>>> I normally use a weight of 8 for the most important field, like
> title.
> >> Other fields might get a 4 or 2.
> >>>>>
> >>>>> I add a “pf” field with the weights doubled, so that phrase matches
> >> have a higher weight.
> >>>>>
> >>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
> >> early web search engines. With different relevance algorithms and
> totally
> >> different evaluation and tuning systems, they settled on weights of 8
> and
> >> 7.5 for HTML titles. With the two radically different systems getting
> >> the same number, I decided that was a property of the documents, not of
> the
> >> search engines.
> >>>>>
> >>>>> wunder
> >>>>> Walter Underwood
> >>>>> [hidden email] <mailto:[hidden email]>
> >>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> >> (my blog)
> >>>>>
> >>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>>>>>
> >>>>>> Hi Wunder,
> >>>>>>
> >>>>>> My indexer takes quite a few hours to be executed I am shortening it
> >> to run faster, but I also need to make sure it gives what we are
> expecting.
> >> This implementation's been there for >4y, and massively used.
> >>>>>>
> >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> >> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> years
> >> of configuring Solr.
> >>>>>> I've inherited that implementation and I am really keen to adequate
> >> it, what would you recommend ?
> >>>>>>
> >>>>>> Cheers
> >>>>>> Guilherme
> >>>>>>
> >>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>>>>>>
> >>>>>>> Thanks for posting the files. Looking at schema.xml, I see that you
> >> still are using StopFilterFactory. The first advice we gave you was to
> >> remove that.
> >>>>>>>
> >>>>>>> Remove StopFilterFactory everywhere and reindex.
> >>>>>>>
> >>>>>>> You will continue to have problems matching stopwords until you do
> >> that.
> >>>>>>>
> >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> >> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> years
> >> of configuring Solr.
> >>>>>>>
> >>>>>>> wunder
> >>>>>>> Walter Underwood
> >>>>>>> [hidden email] <mailto:[hidden email]>
> >>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> >> (my blog)
> >>>>>>>
> >>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>>>>>>>
> >>>>>>>> Hi Paras, everyone
> >>>>>>>>
> >>>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
> >> you had trouble with the attachments I will host it somewhere and share
> the
> >> links.
> >>>>>>>> I don't tweak my index, I get the data from the graph database,
> >> create a document as they are and save to solr.
> >>>>>>>>
> >>>>>>>> So, I am sending the new analysis screen querying the way you
> >> suggested. Also the results with params and solr query url.
> >>>>>>>>
> >>>>>>>> During the process of querying what you asked I found something
> >> really weird (at least for me). By accident, I ended up querying the
> using
> >> the default handler (/select) and it worked. Then If I use the one I
> must
> >> use, then sadly doesn't work. I am posting both results and I will also
> >> post the handlers as well.
> >>>>>>>>
> >>>>>>>> Here is the link with all the files mentioned before
> >>>>>>>>
> >>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
> >> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
> [hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Guilherme.
> >>>>>>>>>
> >>>>>>>>> I am sending they analysis result and the json result as
> requested.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
> >> quality
> >>>>>>>>> though).
> >>>>>>>>>
> >>>>>>>>> From the analysis screen, the analysis is working as expected.
> One
> >> of the
> >>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
> matching
> >>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
> >> initially
> >>>>>>>>> think of is: the stopword "a" is probably present in
> post-analysis
> >> either
> >>>>>>>>> of query or index. Did you tweak your index time analysis after
> >> indexing?
> >>>>>>>>>
> >>>>>>>>> Do two things:
> >>>>>>>>>
> >>>>>>>>> 1. Post the analysis screen for and index=*"Immunoregulatory
> >>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
> >>>>>>>>> "query=*"lymphoid
> >>>>>>>>> and a non-lymphoid cell"*. Try hosting the image and providing
> the
> >> link
> >>>>>>>>> here.
> >>>>>>>>> 2. Give the same JSON output as you have sent but this time with
> >>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
> >> [hidden email] <mailto:[hidden email]>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or some
> >> such. The
> >>>>>>>>>> Apache server is fairly aggressive about stripping attachments
> >> though, so
> >>>>>>>>>> it’s also possible they didn’t make it through.
> >>>>>>>>>>
> >>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
> [hidden email]
> >> <mailto:[hidden email]>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks Erick.
> >>>>>>>>>>>
> >>>>>>>>>>>> First, your index and analysis chains are considerably
> >> different, this
> >>>>>>>>>> can easily be a source of problems. In particular, using two
> >> different
> >>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> >> this unless
> >>>>>>>>>> you’re totally sure you understand the consequences.
> >> Additionally, your use
> >>>>>>>>>> of the length filter is suspicious, especially since your
> problem
> >> statement
> >>>>>>>>>> is about the addition of a single letter term and the min length
> >> allowed on
> >>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> >> ’a’ is
> >>>>>>>>>> filtered out in both cases, but maybe you’ve found something odd
> >> about the
> >>>>>>>>>> interactions.
> >>>>>>>>>>> I will investigate the min length and post the results later.
> >>>>>>>>>>>
> >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> >> typos?
> >>>>>>>>>> Used by custom code?
> >>>>>>>>>>> This the url in my application, not solr params. That's the
> >> query string.
> >>>>>>>>>>>
> >>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
> likely
> >> that
> >>>>>>>>>> all the params with an equal-sign are totally ignored unless
> it’s
> >> just a
> >>>>>>>>>> typo.
> >>>>>>>>>>> This is part of the application. Species will be used later on
> >> in solr
> >>>>>>>>>> to filter out the result. That's not solr. That my app params.
> >>>>>>>>>>>
> >>>>>>>>>>>> Third, the easiest way to see what’s happening under the
> covers
> >> is to
> >>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
> >> Ignore all the
> >>>>>>>>>> relevance calculations for the nonce, or specify “&debug=query”
> >> to skip
> >>>>>>>>>> that part.
> >>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the
> >> explain tag
> >>>>>>>>>> is present.
> >>>>>>>>>>> I will try the searching the way you mentioned.
> >>>>>>>>>>>
> >>>>>>>>>>> Thank for your inputs
> >>>>>>>>>>>
> >>>>>>>>>>> Guilherme
> >>>>>>>>>>>
> >>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
> >> [hidden email] <mailto:[hidden email]>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Fwd to another server
> >>>>>>>>>>>>
> >>>>>>>>>>>> First, your index and analysis chains are considerably
> >> different, this
> >>>>>>>>>> can easily be a source of problems. In particular, using two
> >> different
> >>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> >> this unless
> >>>>>>>>>> you’re totally sure you understand the consequences.
> >> Additionally, your use
> >>>>>>>>>> of the length filter is suspicious, especially since your
> problem
> >> statement
> >>>>>>>>>> is about the addition of a single letter term and the min length
> >> allowed on
> >>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that the
> >> ’a’ is
> >>>>>>>>>> filtered out in both cases, but maybe you’ve found something odd
> >> about the
> >>>>>>>>>> interactions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal signs
> >> typos?
> >>>>>>>>>> Used by custom code?
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>
> >>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
> likely
> >> that
> >>>>>>>>>> all the params with an equal-sign are totally ignored unless
> it’s
> >> just a
> >>>>>>>>>> typo.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Third, the easiest way to see what’s happening under the
> covers
> >> is to
> >>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
> >> Ignore all the
> >>>>>>>>>> relevance calculations for the nonce, or specify “&debug=query”
> >> to skip
> >>>>>>>>>> that part.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
> what I
> >>>>>>>>>> expect” is answered by looking at the “&debug=query” output and
> >> the
> >>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
> >> sure to look
> >>>>>>>>>> at _both_ the query and index output. Also, and very important
> >> about the
> >>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
> that
> >> what you
> >>>>>>>>>> put in the text boxes have made it through the query parser
> >> intact and is
> >>>>>>>>>> analyzed by the field selected. Consider the search
> >> "q=field:word1 word2".
> >>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
> >> looks like
> >>>>>>>>>> what you expect. That’s misleading because the query is _parsed_
> >> as
> >>>>>>>>>> "field:word1 default_search_field:word2”. This is where
> >> “&debug=query”
> >>>>>>>>>> helps.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Erick
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> >> [hidden email] <mailto:[hidden email]>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Walter,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> >> Those words
> >>>>>>>>>> will
> >>>>>>>>>>>>>> not be in the index, so they can never match a query.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think the OP's concern is different results when adding a
> >> stopword. I
> >>>>>>>>>>>>> think he's using the filter factory correctly - the query
> chain
> >>>>>>>>>> includes
> >>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
> >> document in
> >>>>>>>>>>>>> result you are concerned about and post full result of
> >> analysis screen
> >>>>>>>>>> (for
> >>>>>>>>>>>>> both query and index).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
> >> [hidden email] <mailto:[hidden email]>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> No.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> >> Those words
> >>>>>>>>>>>>>> will not be in the index, so they can never match a query.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every analysis
> >> chain in
> >>>>>>>>>>>>>> schema.xml.
> >>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to read
> >> the new
> >>>>>>>>>> config.
> >>>>>>>>>>>>>> 3. Reindex all of the documents.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords will
> >> not be
> >>>>>>>>>>>>>> removed and they will be searchable.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> wunder
> >>>>>>>>>>>>>> Walter Underwood
> >>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
> >>>>>>>>>>>>>> http://observer.wunderwood.org/ <
> >> http://observer.wunderwood.org/>  (my blog)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
> >> [hidden email] <mailto:[hidden email]>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Ok. I am kind a lost now.
> >>>>>>>>>>>>>>> If I open up the console > analysis and perform it, that's
> >> the final
> >>>>>>>>>>>>>> result.
> >>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in
> >> the
> >>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
> >> stopwords.txt"," ")
> >>>>>>>>>> then
> >>>>>>>>>>>>>> add to solr. Is that correct ?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks David
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> >>>>>>>>>> [hidden email] <mailto:
> [hidden email]
> >>>
> >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
> >> [hidden email]>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Fwd to another server
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> no,
> >>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
> >> opinion of
> >>>>>>>>>> course,
> >>>>>>>>>>>>>>>> based on your use case may be different, but i generally
> >> axe any
> >>>>>>>>>>>>>> reference
> >>>>>>>>>>>>>>>> to them at all
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
> >> [hidden email] <mailto:[hidden email]>
> >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>>>> Haven't I done this here ?
> >>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
> >>>>>>>>>>>>>>>>> <analyzer type="index">
> >>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> >>>>>>>>>> [hidden email] <mailto:
> [hidden email]
> >>>
> >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
> >> [hidden email]>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Fwd to another server
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference to
> >> stop
> >>>>>>>>>> words
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
> again.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> >>>>>>>>>> [hidden email] <mailto:[hidden email]>
> >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I am performing a search to match a name (text_field),
> >> however
> >>>>>>>>>> this
> >>>>>>>>>>>>>> term
> >>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
> >> records. If i
> >>>>>>>>>> remove
> >>>>>>>>>>>>>>>>> 'a'
> >>>>>>>>>>>>>>>>>>> then it works.
> >>>>>>>>>>>>>>>>>>> e.g
> >>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>>>>>>>>>>>>>>>> doesn't work:
> >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
> >>>>>>>>>>>>>>>>>>> works:
> >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> interested in the first result
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> schema.xml
> >>>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
> >>>>>>>>>>>>>>>>>>> <analyzer type="index">
> >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>>>>>>>>>>> </analyzer>
> >>>>>>>>>>>>>>>>>>> </fieldType>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> stopwords.txt
> >>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
> >> StopAnalyzer
> >>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>> b
> >>>>>>>>>>>>>>>>>>> c
> >>>>>>>>>>>>>>>>>>> ....
> >>>>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>> Guilherme
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Paras Lehana* [65871]
> >>>>>>>>>>>>> Development Engineer, Auto-Suggest,
> >>>>>>>>>>>>> IndiaMART Intermesh Ltd.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> >>>>>>>>>>>>> Noida, UP, IN - 201303
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Mob.: +91-9560911996
> >>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> IMPORTANT:
> >>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana
Hi

> So I don't think removing it completely is the way to go from the scenario
> we have


Removing stopwords is a separate question. I'd like to find the cause on
the assumption that you keep using stopwords; in some cases they really
are necessary.


> Quite a considerable increase


If q.alt is giving you responses, it's confirmed that your stopwords filter
is working as expected. The problem definitely lies in the configuration of
edismax.
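
To check this from the parsed queries, here is a minimal Python sketch (not
something posted in this thread) that builds the two request URLs to compare.
The host, port and core name are placeholders I made up; q, q.alt,
debug=query, wt and rows are the standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical base URL and core name: substitute your own.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"


def debug_query_url(handler, text, use_q_alt=False):
    """Build a request URL that asks Solr to echo the parsed query.

    With use_q_alt=True the text is sent as q.alt and q is omitted;
    otherwise it is sent as q. rows=0 keeps the response small.
    """
    param = "q.alt" if use_q_alt else "q"
    params = [(param, text), ("debug", "query"), ("wt", "json"), ("rows", "0")]
    return f"{SOLR_CORE_URL}{handler}?{urlencode(params)}"


if __name__ == "__main__":
    term = "Lymphoid and a non-Lymphoid cell"
    print(debug_query_url("/search", term))
    print(debug_query_url("/search", term, use_q_alt=True))
```

Fetch both URLs and compare the parsedquery entry under debug in the two
responses.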



> I am sorry but I didn't understand what do you want me to do exactly with
> the lst (??) and qf and bf.


Which combinations did you try? I was referring to the field-level boosting
you have applied in the edismax config.

*Let me explain again:* In your solrconfig.xml, look at your /search
request handler. It has many qf entries and some bq boosts. Remove all of
them, check the response again (this time with q), then add them back one
by one and watch for the point where numFound changes drastically.
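
The procedure above could be sketched roughly like this in Python. It is only
an illustration under some assumptions: the qf and bq entries have already
been removed from the /search defaults in solrconfig.xml, so they can be
supplied per request instead, and the field names and boost clauses in the
example are made up, not taken from your config.

```python
from urllib.parse import urlencode


def incremental_param_sets(query, qf_fields, bq_clauses):
    """Yield (label, params) pairs: a baseline without qf/bq first,
    then one extra qf field or bq clause per step."""
    active_qf, active_bq = [], []

    def build():
        params = [("q", query), ("defType", "edismax"), ("rows", "0")]
        if active_qf:
            params.append(("qf", " ".join(active_qf)))
        # bq may be repeated, one parameter per boost clause.
        params.extend(("bq", clause) for clause in active_bq)
        return params

    yield "baseline", build()
    for field in qf_fields:
        active_qf.append(field)
        yield f"+qf {field}", build()
    for clause in bq_clauses:
        active_bq.append(clause)
        yield f"+bq {clause}", build()


def biggest_jump(labelled_counts):
    """Given [(label, numFound), ...] in step order, return the label of the
    step whose numFound differs most from the step before it."""
    best_label, best_delta = None, -1
    for (_, prev), (label, curr) in zip(labelled_counts, labelled_counts[1:]):
        delta = abs(curr - prev)
        if delta > best_delta:
            best_label, best_delta = label, delta
    return best_label


if __name__ == "__main__":
    # Hypothetical field names and boost clause, for illustration only.
    steps = incremental_param_sets(
        "Lymphoid and a non-Lymphoid cell",
        ["name^8", "summary^4"],
        ["type:Pathway^2"],
    )
    for label, params in steps:
        print(label, "->", "/search?" + urlencode(params))
```

Run each generated request, note numFound for every step, and pass the
(label, numFound) pairs to biggest_jump to see which entry changes the result
count the most.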

On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]>
wrote:

> I use 3 word shingles with stopwords for my MLT ML trainer that worked
> pretty well for such a solution, but for a full index the size became
> prohibitive
>
> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
> wrote:
>
> > If we had IDF for phrases, they would be super effective. The 2X weight
> is
> > a hack that mostly works.
> >
> > Infoseek had phrase IDF and it was a killer algorithm for relevance.
> >
> > wunder
> > Walter Underwood
> > [hidden email]
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Nov 8, 2019, at 11:08 AM, David Hastings <
> > [hidden email]> wrote:
> > >
> > > the pf and qf fields are REALLY nice for this
> > >
> > > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
> [hidden email]>
> > > wrote:
> > >
> > >> I always enable phrase searching in edismax for exactly this reason.
> > >>
> > >> Something like:
> > >>
> > >>       <str name="qf">title^8 keywords^4 text</str>
> > >>       <str name="pf">title^16 keywords^8 text^2</str>
> > >>
> > >> To deal with concepts in queries, a classifier and/or named entity
> > >> extractor can be helpful. If you have a list of concepts (“controlled
> > >> vocabulary”) that includes “Lamin A”, and that shows up in a query,
> that
> > >> term can be queried against the field matching that vocabulary.
> > >>
> > >> This is how LinkedIn separates people, companies, and places, for
> > example.
> > >>
> > >> wunder
> > >> Walter Underwood
> > >> [hidden email]
> > >> http://observer.wunderwood.org/  (my blog)
> > >>
> > >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]
> >
> > >> wrote:
> > >>>
> > >>> Look at the “mm” parameter, try setting it to 100%. Although that’s
> not
> > >> entirely likely to do what you want either since virtually every doc
> > will
> > >> have “a” in it. But at least you’d get docs that have both terms.
> > >>>
> > >>> you may also be able to search for things like “Lamin A” _only as a
> > >> phrase_ and have some luck. But this is a gnarly problem in general.
> > Some
> > >> people have been able to substitute synonyms and/or shingles to make
> > this
> > >> work at the expense of a larger index.
> > >>>
> > >>> This is a generic problem with context. “Lamin A” is really a
> > “concept”,
> > >> not just two words that happen to be near each other. Searching as a
> > phrase
> > >> is an OOB-but-naive way to try to make it more likely that the ranked
> > >> results refer to the _concept_ of “Lamin A”. The assumption here is
> “if
> > >> these two words appear next to each other, they’re more likely to be
> > what I
> > >> want”. I say “naive” because “Lamins: A new approach to...” would
> > _also_ be
> > >> found for a naive phrase search. (I have no idea whether such a title
> > makes
> > >> sense or not, but you figured that out already)...
> > >>>
> > >>> To do this well you’d have to dive in to NLP/Machine learning.
> > >>>
> > >>> I truly wish we could have the DWIM search algorithm (Do What I
> Mean)….
> > >>>
> > >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
> > >> wrote:
> > >>>>
> > >>>> HI Walter and Paras
> > >>>>
> > >>>> I indexed it removing all the references to StopWordFilter and I
> went
> > >> from 121 results to near 20K as the search term q="Lymphoid and a
> > >> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
> > So I
> > >> don't think removing it completely is the way to go from the scenario
> we
> > >> have, but I appreciate the suggestion…
> > >>>>
> > >>>> Yes the response is using fl=*
> > >>>> I am trying some combinations at the moment, but yet no success.
> > >>>>
> > >>>> defType=edismax
> > >>>> q.alt=Lymphoid and a non-Lymphoid cell
> > >>>> Number of results=1599
> > >>>> Quite a considerable increase, even though reasonable meaningful
> > >> results.
> > >>>>
> > >>>> I am sorry but I didn't understand what do you want me to do exactly
> > >> with the lst (??) and qf and bf.
> > >>>>
> > >>>> Thanks everyone with their inputs
> > >>>>
> > >>>>
> > >>>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
> > >> wrote:
> > >>>>>
> > >>>>> Hi Guilherme
> > >>>>>
> > >>>>> By accident, I ended up querying the using the default handler
> > >> (/select) and it worked.
> > >>>>>
> > >>>>> You've just found the culprit. Thanks for giving the material I
> > >> requested. Your analysis chain is working as expected. I don't see any
> > >> issue in either StopWordFilter or your boosts. I also use a boost of
> 50
> > >> when boosting contextual suggestions (boosting "gold iphone" on a page
> > of
> > >> iphone) but I take Walter's suggestion and would try to optimize my
> > >> weights. I agree that this 50 thing was not researched much about by
> us
> > as
> > >> well (we never faced performance or relevance issues).
> > >>>>>
> > >>>>> See the major difference in both the handlers - edismax. I'm pretty
> > >> sure that your problem lies in the parsing of queries (you can confirm
> > that
> > >> from parsedquery key in debug of both JSON responses). I hope you have
> > >> provided the response with fl=*. Replace q with q.alt in your /search
> > >> handler query and I think you should start getting responses. That's
> > >> because q.alt uses standard parser. If you want to keep using
> edisMax, I
> > >> suggest you to test the responses removing some combination of lst
> (qf,
> > bf)
> > >> and find what's restricting the documents to come up. I'm out of
> office
> > >> today - would have certainly tried analyzing the field values of the
> > >> document in /select request and compare it with qf/bq in
> solrconfig.xml
> > >> /search. Do this for me and you'd certainly find something.
> > >>>>>
> > >>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
> [hidden email]
> > >> <mailto:[hidden email]>> wrote:
> > >>>>> I normally use a weight of 8 for the most important field, like
> > title.
> > >> Other fields might get a 4 or 2.
> > >>>>>
> > >>>>> I add a “pf” field with the weights doubled, so that phrase matches
> > >> have a higher weight.
> > >>>>>
> > >>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
> > >> early web search engines. With different relevance algorithms and
> > totally
> > >> different evaluation and tuning systems, they settled on weights of 8
> > and
> > >> 7.5 for HTML titles. With the two radically different systems
> getting
> > >> the same number, I decided that was a property of the documents, not
> of
> > the
> > >> search engines.
> > >>>>>
> > >>>>> wunder
> > >>>>> Walter Underwood
> > >>>>> [hidden email] <mailto:[hidden email]>
> > >>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> > >> (my blog)
> > >>>>>
> > >>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
> > >> <mailto:[hidden email]>> wrote:
> > >>>>>>
> > >>>>>> Hi Wunder,
> > >>>>>>
> > >>>>>> My indexer takes quite a few hours to be executed I am shortening
> it
> > >> to run faster, but I also need to make sure it gives what we are
> > expecting.
> > >> This implementation's been there for >4y, and massively used.
> > >>>>>>
> > >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
> extremely
> > >> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> > years
> > >> of configuring Solr.
> > >>>>>> I've inherited that implementation and I am really keen to adjust it. What would you recommend?
> > >>>>>>
> > >>>>>> Cheers
> > >>>>>> Guilherme
> > >>>>>>
> > >>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
> > >> <mailto:[hidden email]>> wrote:
> > >>>>>>>
> > >>>>>>> Thanks for posting the files. Looking at schema.xml, I see that
> you
> > >> still are using StopFilterFactory. The first advice we gave you was to
> > >> remove that.
> > >>>>>>>
> > >>>>>>> Remove StopFilterFactory everywhere and reindex.
> > >>>>>>>
> > >>>>>>> You will continue to have problems matching stopwords until you
> do
> > >> that.
> > >>>>>>>
> > >>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
> extremely
> > >> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
> > years
> > >> of configuring Solr.
> > >>>>>>>
> > >>>>>>> wunder
> > >>>>>>> Walter Underwood
> > >>>>>>> [hidden email] <mailto:[hidden email]>
> > >>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
> >
> > >> (my blog)
> > >>>>>>>
> > >>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
> > >> <mailto:[hidden email]>> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Paras, everyone
> > >>>>>>>>
> > >>>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
> > >> you had trouble with the attachments I will host it somewhere and
> share
> > the
> > >> links.
> > >>>>>>>> I don't tweak my index, I get the data from the graph database,
> > >> create a document as they are and save to solr.
> > >>>>>>>>
> > >>>>>>>> So, I am sending the new analysis screen querying the way you
> > >> suggested. Also the results with params and solr query url.
> > >>>>>>>>
> > >>>>>>>> During the process of querying what you asked I found something
> > >> really weird (at least for me). By accident, I ended up querying using
> > >> the default handler (/select) and it worked. Then, if I use the one I must
> > >> use, it sadly doesn't work. I am posting both results and I will
> also
> > >> post the handlers as well.
> > >>>>>>>>
> > >>>>>>>> Here is the link with all the files mentioned before
> > >>>>>>>>
> > >>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> > >>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
> > >> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> > >>>>>>>>
> > >>>>>>>> Thanks
> > >>>>>>>>
> > >>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
> > [hidden email]
> > >> <mailto:[hidden email]>> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Guilherme.
> > >>>>>>>>>
> > >>>>>>>>> I am sending the analysis result and the json result as
> > requested.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
> > >> quality
> > >>>>>>>>> though).
> > >>>>>>>>>
> > >>>>>>>>> From the analysis screen, the analysis is working as expected.
> > One
> > >> of the
> > >>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
> > matching
> > >>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
> > >> initially
> > >>>>>>>>> think of is: the stopword "a" is probably present in
> > post-analysis
> > >> either
> > >>>>>>>>> of query or index. Did you tweak your index time analysis after
> > >> indexing?
> > >>>>>>>>>
> > >>>>>>>>> Do two things:
> > >>>>>>>>>
> > >>>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
> > >>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
> > >>>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the
> > >>>>>>>>> image and providing the link here.
> > >>>>>>>>> 2. Give the same JSON output as you have sent but this time
> with
> > >>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
> > >> [hidden email] <mailto:[hidden email]>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or
> some
> > >> such. The
> > >>>>>>>>>> Apache server is fairly aggressive about stripping attachments
> > >> though, so
> > >>>>>>>>>> it’s also possible they didn’t make it through.
> > >>>>>>>>>>
> > >>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
> > [hidden email]
> > >> <mailto:[hidden email]>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks Erick.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> First, your index and analysis chains are considerably
> > >> different, this
> > >>>>>>>>>> can easily be a source of problems. In particular, using two
> > >> different
> > >>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> > >> this unless
> > >>>>>>>>>> you’re totally sure you understand the consequences.
> > >> Additionally, your use
> > >>>>>>>>>> of the length filter is suspicious, especially since your
> > problem
> > >> statement
> > >>>>>>>>>> is about the addition of a single letter term and the min
> length
> > >> allowed on
> > >>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
> the
> > >> ’a’ is
> > >>>>>>>>>> filtered out in both cases, but maybe you’ve found something
> odd
> > >> about the
> > >>>>>>>>>> interactions.
> > >>>>>>>>>>> I will investigate the min length and post the results later.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
> signs
> > >> typos?
> > >>>>>>>>>> Used by custom code?
> > >>>>>>>>>>> This is the URL in my application, not solr params. That's the
> > >> query string.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
> > likely
> > >> that
> > >>>>>>>>>> all the params with an equal-sign are totally ignored unless
> > it’s
> > >> just a
> > >>>>>>>>>> typo.
> > >>>>>>>>>>> This is part of the application. Species will be used later
> on
> > >> in solr
> > >>>>>>>>>> to filter out the result. That's not solr. Those are my app params.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Third, the easiest way to see what’s happening under the
> > covers
> > >> is to
> > >>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
> > >> Ignore all the
> > >>>>>>>>>> relevance calculations for the nonce, or specify
> “&debug=query”
> > >> to skip
> > >>>>>>>>>> that part.
> > >>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the
> > >> explain tag
> > >>>>>>>>>> is present.
> > >>>>>>>>>>> I will try searching the way you mentioned.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for your inputs
> > >>>>>>>>>>>
> > >>>>>>>>>>> Guilherme
> > >>>>>>>>>>>
> > >>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
> > >> [hidden email] <mailto:[hidden email]>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Fwd to another server
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> First, your index and analysis chains are considerably
> > >> different, this
> > >>>>>>>>>> can easily be a source of problems. In particular, using two
> > >> different
> > >>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
> > >> this unless
> > >>>>>>>>>> you’re totally sure you understand the consequences.
> > >> Additionally, your use
> > >>>>>>>>>> of the length filter is suspicious, especially since your
> > problem
> > >> statement
> > >>>>>>>>>> is about the addition of a single letter term and the min
> length
> > >> allowed on
> > >>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
> the
> > >> ’a’ is
> > >>>>>>>>>> filtered out in both cases, but maybe you’ve found something
> odd
> > >> about the
> > >>>>>>>>>> interactions.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
> signs
> > >> typos?
> > >>>>>>>>>> Used by custom code?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
> > likely
> > >> that
> > >>>>>>>>>> all the params with an equal-sign are totally ignored unless
> > it’s
> > >> just a
> > >>>>>>>>>> typo.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Third, the easiest way to see what’s happening under the
> > covers
> > >> is to
> > >>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
> > >> Ignore all the
> > >>>>>>>>>> relevance calculations for the nonce, or specify
> “&debug=query”
> > >> to skip
> > >>>>>>>>>> that part.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
> > what I
> > >>>>>>>>>> expect” is answered by looking at the “&debug=query” output
> and
> > >> the
> > >>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
> > >> sure to look
> > >>>>>>>>>> at _both_ the query and index output. Also, and very important
> > >> about the
> > >>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
> > that
> > >> what you
> > >>>>>>>>>> put in the text boxes have made it through the query parser
> > >> intact and is
> > >>>>>>>>>> analyzed by the field selected. Consider the search
> > >> "q=field:word1 word2".
> > >>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
> > >> looks like
> > >>>>>>>>>> what you expect. That’s misleading because the query is
> _parsed_
> > >> as
> > >>>>>>>>>> "field:word1 default_search_field:word2”. This is where
> > >> “&debug=query”
> > >>>>>>>>>> helps.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Erick
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> > >> [hidden email] <mailto:[hidden email]>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Walter,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> > >> Those words
> > >>>>>>>>>> will
> > >>>>>>>>>>>>>> not be in the index, so they can never match a query.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I think the OP's concern is different results when adding a
> > >> stopword. I
> > >>>>>>>>>>>>> think he's using the filter factory correctly - the query
> > chain
> > >>>>>>>>>> includes
> > >>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
> > >> document in
> > >>>>>>>>>>>>> result you are concerned about and post full result of
> > >> analysis screen
> > >>>>>>>>>> (for
> > >>>>>>>>>>>>> both query and index).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
> > >> [hidden email] <mailto:[hidden email]>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> No.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
> > >> Those words
> > >>>>>>>>>>>>>> will not be in the index, so they can never match a query.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every
> analysis
> > >> chain in
> > >>>>>>>>>>>>>> schema.xml.
> > >>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to
> read
> > >> the new
> > >>>>>>>>>> config.
> > >>>>>>>>>>>>>> 3. Reindex all of the documents.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords
> will
> > >> not be
> > >>>>>>>>>>>>>> removed and they will be searchable.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wunder
> > >>>>>>>>>>>>>> Walter Underwood
> > >>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
> > >>>>>>>>>>>>>> http://observer.wunderwood.org/ <
> > >> http://observer.wunderwood.org/>  (my blog)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
> > >> [hidden email] <mailto:[hidden email]>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Ok. I am kind of lost now.
> > >>>>>>>>>>>>>>> If I open up the console > analysis and perform it,
> that's
> > >> the final
> > >>>>>>>>>>>>>> result.
> > >>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt>
> in
> > >> the
> > >>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
> > >> stopwords.txt"," ")
> > >>>>>>>>>> then
> > >>>>>>>>>>>>>> add to solr. Is that correct ?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks David
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> > >>>>>>>>>> [hidden email] <mailto:
> > [hidden email]
> > >>>
> > >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
> > >> [hidden email]>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Fwd to another server
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> no,
> > >>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >> ignoreCase="true"
> > >>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
> > >> opinion of
> > >>>>>>>>>> course,
> > >>>>>>>>>>>>>>>> based on your use case may be different, but i generally
> > >> axe any
> > >>>>>>>>>>>>>> reference
> > >>>>>>>>>>>>>>>> to them at all
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
> > >> [hidden email] <mailto:[hidden email]>
> > >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
> > wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks.
> > >>>>>>>>>>>>>>>>> Haven't I done this here ?
> > >>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> > >>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> > >>>>>>>>>>>>>>>>> <analyzer type="index">
> > >>>>>>>>>>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >> ignoreCase="true"
> > >>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> > >>>>>>>>>> [hidden email] <mailto:
> > [hidden email]
> > >>>
> > >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
> > >> [hidden email]>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Fwd to another server
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference
> to
> > >> stop
> > >>>>>>>>>> words
> > >>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
> > again.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> > >>>>>>>>>> [hidden email] <mailto:[hidden email]>
> > >>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I am performing a search to match a name
> (text_field),
> > >> however
> > >>>>>>>>>> this
> > >>>>>>>>>>>>>> term
> > >>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
> > >> records. If i
> > >>>>>>>>>> remove
> > >>>>>>>>>>>>>>>>> 'a'
> > >>>>>>>>>>>>>>>>>>> then it works.
> > >>>>>>>>>>>>>>>>>>> e.g
> > >>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> > >>>>>>>>>>>>>>>>>>> doesn't work:
> > >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
> > >>>>>>>>>>>>>>>>>>> works:
> > >>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> > >>>>>>>>>>>>>>>>>>> interested in the first result
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> schema.xml
> > >>>>>>>>>>>>>>>>>>> <field name="name"
> > >> type="text_field"
> > >>>>>>>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
> > >>>>>>>>>> required="true"
> > >>>>>>>>>>>>>>>>>>> multiValued="false"/>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> > >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >>>>>>>>>>>>>> ignoreCase="true"
> > >>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> > >>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> > >>>>>>>>>>>>>>>>>>> <analyzer type="index">
> > >>>>>>>>>>>>>>>>>>>     <tokenizer
> class="solr.StandardTokenizerFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >>>>>>>>>>>>>> ignoreCase="true"
> > >>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>>> <analyzer type="query">
> > >>>>>>>>>>>>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
> > >>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2"
> > >>>>>>>>>>>>>>>>> max="20"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> > >>>>>>>>>>>>>>>>>>>     <filter class="solr.StopFilterFactory"
> > >>>>>>>>>>>>>> ignoreCase="true"
> > >>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
> > >>>>>>>>>>>>>>>>>>> </analyzer>
> > >>>>>>>>>>>>>>>>>>> </fieldType>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> stopwords.txt
> > >>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
> > >> StopAnalyzer
> > >>>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>> b
> > >>>>>>>>>>>>>>>>>>> c
> > >>>>>>>>>>>>>>>>>>> ....
> > >>>>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> are
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>>>>>>> Guilherme
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Paras Lehana* [65871]
> > >>>>>>>>>>>>> Development Engineer, Auto-Suggest,
> > >>>>>>>>>>>>> IndiaMART Intermesh Ltd.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > >>>>>>>>>>>>> Noida, UP, IN - 201303
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Mob.: +91-9560911996
> > >>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> IMPORTANT:
> > >>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> --
> > >>>>>>>>> Regards,
> > >>>>>>>>>
> > >>>>>>>>> *Paras Lehana* [65871]
> > >>>>>>>>> Development Engineer, Auto-Suggest,
> > >>>>>>>>> IndiaMART Intermesh Ltd.
> > >>>>>>>>>
> > >>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > >>>>>>>>> Noida, UP, IN - 201303
> > >>>>>>>>>
> > >>>>>>>>> Mob.: +91-9560911996
> > >>>>>>>>> Work: 01203916600 | Extn:  *8173*
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> IMPORTANT:
> > >>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> --
> > >>>>> Regards,
> > >>>>>
> > >>>>> Paras Lehana [65871]
> > >>>>> Development Engineer, Auto-Suggest,
> > >>>>> IndiaMART Intermesh Ltd.
> > >>>>>
> > >>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > >>>>> Noida, UP, IN - 201303
> > >>>>>
> > >>>>> Mob.: +91-9560911996 <tel:+91-9560911996>
> > >>>>> Work: 01203916600 | Extn:  8173
> > >>>>>
> > >>>>> IMPORTANT:
> > >>>>> NEVER share your IndiaMART OTP/ Password with anyone.
> > >>>
> > >>
> > >>
> >
> >
>


--
--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

--
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri
Thanks
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
Yes. It has always made sense the way we've been using them.

> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
I see.
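
One edismax setting that may be worth ruling out at this point, offered as an assumption rather than a confirmed diagnosis: on Solr releases before 7.0, the edismax `lowercaseOperators` parameter defaults to true, so a lowercase "and" in the user query is treated like the boolean operator AND. If that applies here, the terms on either side of it (including "a") become required clauses. Fields whose analysis drops "a" contribute nothing for that clause, while any qf field whose analysis keeps "a" (possibly dbId or stId, depending on their types) would still have to match it. A minimal sketch of switching this off in the /search defaults, with the rest of the handler assumed unchanged:

```xml
<requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <!-- Assumption: treat lowercase "and"/"or" as plain words, not operators -->
        <str name="lowercaseOperators">false</str>
        <!-- ... existing q.op, q.alt, df, qf, etc. stay as they are ... -->
    </lst>
</requestHandler>
```

Comparing the parsedquery output from debug=query before and after the change should show whether "and" was being read as an operator.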

> *Let me explain again:* In your solrconfig.xml, look at your /search
OK, using q now, I removed all qf, performed the search, and got 23 results, with the one I really want at the top.
As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I don't get anything (which makes sense). However, if I query name_exact, I get the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I still don't get any results.

In summary:
- without qf - 23 results
- dbId - 0 results
- name_exact - 16 results
- name - 23 results
- dbId^1.0 name_exact^10.0 - 0 results
- 0 results if any other field, such as stId or dbId (the key), is added on top of the name fields (name, name_exact, etc.).

Definitely lost here! :-/
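
The step-by-step check Paras describes can be made repeatable with a scratch handler that starts from a single qf field and returns the parsed query with every response. This is only a sketch: the handler name /search_bisect is hypothetical, and the fields named in the comment are the ones mentioned in this thread, not the full production qf.

```xml
<!-- Hypothetical scratch handler for bisecting qf; not part of the original config -->
<requestHandler name="/search_bisect" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="q.op">OR</str>
        <str name="echoParams">all</str>
        <!-- Return the parsed query without the scoring explanations -->
        <str name="debug">query</str>
        <!-- Start with one field, then add name_exact, stId, dbId one at a time -->
        <str name="qf">name</str>
    </lst>
</requestHandler>
```

After each change, compare numFound and parsedquery for q=lymphoid and a non-lymphoid cell. The field whose addition drops numFound to 0 is the one to inspect in schema.xml. Since qf can also be passed as a request parameter, the same comparison can be done per request without reloading the core.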


> On 11 Nov 2019, at 07:59, Paras Lehana <[hidden email]> wrote:
>
> Hi
>
> So I don't think removing it completely is the way to go from the scenario
>> we have
>
>
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
>
>
> Quite a considerable increase
>
>
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
>
>
>
>> I am sorry but I didn't understand what do you want me to do exactly with
>> the lst (??) and qf and bf.
>
>
> What combinations did you try? I was referring to the field-level boosting
> you have applied in edismax config.
>
> *Let me explain again:* In your solrconfig.xml, look at your /search
> request handler. There are many qf and some bq boosts. I want you to remove
> all of these, check response again (with q now) and keep on adding them
> again (one by one) while looking for when the numFound drastically changes.
>
> On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]>
> wrote:
>
>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>> pretty well for such a solution, but for a full index the size became
>> prohibitive
>>
>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
>> wrote:
>>
>>> If we had IDF for phrases, they would be super effective. The 2X weight
>> is
>>> a hack that mostly works.
>>>
>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>> [hidden email]> wrote:
>>>>
>>>> the pf and qf fields are REALLY nice for this
>>>>
>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>> [hidden email]>
>>>> wrote:
>>>>
>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>>
>>>>> Something like:
>>>>>
>>>>>      <str name="qf">title^8 keywords^4 text</str>
>>>>>      <str name="pf">title^16 keywords^8 text^2</str>
>>>>>
>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>> that
>>>>> term can be queried against the field matching that vocabulary.
>>>>>
>>>>> This is how LinkedIn separates people, companies, and places, for
>>> example.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> [hidden email]
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>
>>>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]
>>>
>>>>> wrote:
>>>>>>
>>>>>> Look at the “mm” parameter, try setting it to 100%. Although that’s
>> not
>>>>> entirely likely to do what you want either since virtually every doc
>>> will
>>>>> have “a” in it. But at least you’d get docs that have both terms.
>>>>>>
>>>>>> you may also be able to search for things like “Lamin A” _only as a
>>>>> phrase_ and have some luck. But this is a gnarly problem in general.
>>> Some
>>>>> people have been able to substitute synonyms and/or shingles to make
>>> this
>>>>> work at the expense of a larger index.
>>>>>>
>>>>>> This is a generic problem with context. “Lamin A” is really a
>>> “concept”,
>>>>> not just two words that happen to be near each other. Searching as a
>>> phrase
>>>>> is an OOB-but-naive way to try to make it more likely that the ranked
>>>>> results refer to the _concept_ of “Lamin A”. The assumption here is
>> “if
>>>>> these two words appear next to each other, they’re more likely to be
>>> what I
>>>>> want”. I say “naive” because “Lamins: A new approach to...” would
>>> _also_ be
>>>>> found for a naive phrase search. (I have no idea whether such a title
>>> makes
>>>>> sense or not, but you figured that out already)...
>>>>>>
>>>>>> To do this well you’d have to dive in to NLP/Machine learning.
>>>>>>
>>>>>> I truly wish we could have the DWIM search algorithm (Do What I
>> Mean)….
>>>>>>
>>>>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
>>>>> wrote:
>>>>>>>
>>>>>>> Hi Walter and Paras
>>>>>>>
>>>>>>> I indexed it removing all the references to StopWordFilter and I
>> went
>>>>> from 121 results to near 20K as the search term q="Lymphoid and a
>>>>> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
>>> So I
>>>>> don't think removing it completely is the way to go from the scenario
>> we
>>>>> have, but I appreciate the suggestion…
>>>>>>>
>>>>>>> Yes the response is using fl=*
>>>>>>> I am trying some combinations at the moment, but no success yet.
>>>>>>>
>>>>>>> defType=edismax
>>>>>>> q.alt=Lymphoid and a non-Lymphoid cell
>>>>>>> Number of results=1599
>>>>>>> Quite a considerable increase, even though reasonable meaningful
>>>>> results.
>>>>>>>
>>>>>>> I am sorry but I didn't understand what do you want me to do exactly
>>>>> with the lst (??) and qf and bf.
>>>>>>>
>>>>>>> Thanks everyone with their inputs
>>>>>>>
>>>>>>>
>>>>>>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Guilherme
>>>>>>>>
>>>>>>>> By accident, I ended up querying using the default handler
>>>>> (/select) and it worked.
>>>>>>>>
>>>>>>>> You've just found the culprit. Thanks for giving the material I
>>>>> requested. Your analysis chain is working as expected. I don't see any
>>>>> issue in either StopWordFilter or your boosts. I also use a boost of
>> 50
>>>>> when boosting contextual suggestions (boosting "gold iphone" on a page
>>> of
>>>>> iphone) but I take Walter's suggestion and would try to optimize my
>>>>> weights. I agree that this 50 thing was not researched much about by
>> us
>>> as
>>>>> well (we never faced performance or relevance issues).
>>>>>>>>
>>>>>>>> See the major difference in both the handlers - edismax. I'm pretty
>>>>> sure that your problem lies in the parsing of queries (you can confirm
>>> that
>>>>> from parsedquery key in debug of both JSON responses). I hope you have
>>>>> provided the response with fl=*. Replace q with q.alt in your /search
>>>>> handler query and I think you should start getting responses. That's
>>>>> because q.alt uses standard parser. If you want to keep using
>> edisMax, I
>>>>> suggest you to test the responses removing some combination of lst
>> (qf,
>>> bf)
>>>>> and find what's restricting the documents to come up. I'm out of
>> office
>>>>> today - would have certainly tried analyzing the field values of the
>>>>> document in /select request and compare it with qf/bq in
>> solrconfig.xml
>>>>> /search. Do this for me and you'd certainly find something.
>>>>>>>>
>>>>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>> [hidden email]
>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>> I normally use a weight of 8 for the most important field, like
>>> title.
>>>>> Other fields might get a 4 or 2.
>>>>>>>>
>>>>>>>> I add a “pf” field with the weights doubled, so that phrase matches
>>>>> have a higher weight.
>>>>>>>>
>>>>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>>>> early web search engines. With different relevance algorithms and
>>> totally
>>>>> different evaluation and tuning systems, they settled on weights of 8
>>> and
>>>>> 7.5 for HTML titles. With the two radically different systems
>> getting
>>>>> the same number, I decided that was a property of the documents, not
>> of
>>> the
>>>>> search engines.
>>>>>>>>
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>
>>>>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wunder,
>>>>>>>>>
>>>>>>>>> My indexer takes quite a few hours to be executed I am shortening
>> it
>>>>> to run faster, but I also need to make sure it gives what we are
>>> expecting.
>>>>> This implementation's been there for >4y, and massively used.
>>>>>>>>>
>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>> extremely
>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>> years
>>>>> of configuring Solr.
>>>>>>>>> I've inherited that implementation and I am really keen to
>> adjust
>>>>> it, what would you recommend ?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Guilherme
>>>>>>>>>
>>>>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that
>> you
>>>>> still are using StopFilterFactory. The first advice we gave you was to
>>>>> remove that.
>>>>>>>>>>
>>>>>>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>>>>>>
>>>>>>>>>> You will continue to have problems matching stopwords until you
>> do
>>>>> that.
>>>>>>>>>>
>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>> extremely
>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>> years
>>>>> of configuring Solr.
>>>>>>>>>>
>>>>>>>>>> wunder
>>>>>>>>>> Walter Underwood
>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>>>
>>>>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Paras, everyone
>>>>>>>>>>>
>>>>>>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
>>>>> you had trouble with the attachments I will host it somewhere and
>> share
>>> the
>>>>> links.
>>>>>>>>>>> I don't tweak my index, I get the data from the graph database,
>>>>> create a document as they are and save to solr.
>>>>>>>>>>>
>>>>>>>>>>> So, I am sending the new analysis screen querying the way you
>>>>> suggested. Also the results with params and solr query url.
>>>>>>>>>>>
>>>>>>>>>>> During the process of querying what you asked I found something
>>>>> really weird (at least for me). By accident, I ended up querying
>>> using
>>>>> the default handler (/select) and it worked. Then If I use the one I
>>> must
>>>>> use, then sadly doesn't work. I am posting both results and I will
>> also
>>>>> post the handlers as well.
>>>>>>>>>>>
>>>>>>>>>>> Here is the link with all the files mentioned before
>>>>>>>>>>>
>>>>>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
>>> [hidden email]
>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Guilherme.
>>>>>>>>>>>>
>>>>>>>>>>>> I am sending the analysis result and the json result as
>>> requested.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>>>>> quality
>>>>>>>>>>>> though).
>>>>>>>>>>>>
>>>>>>>>>>>> From the analysis screen, the analysis is working as expected.
>>> One
>>>>> of the
>>>>>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
>>> matching
>>>>>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
>>>>> initially
>>>>>>>>>>>> think of is: the stopword "a" is probably present in
>>> post-analysis
>>>>> either
>>>>>>>>>>>> of query or index. Did you tweak your index time analysis after
>>>>> indexing?
>>>>>>>>>>>>
>>>>>>>>>>>> Do two things:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the
>>>>>>>>>>>> image and providing the link here.
>>>>>>>>>>>> 2. Give the same JSON output as you have sent but this time
>> with
>>>>>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>>>> [hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or
>> some
>>>>> such. The
>>>>>>>>>>>>> Apache server is fairly aggressive about stripping attachments
>>>>> though, so
>>>>>>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>>> [hidden email]
>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Erick.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>> different, this
>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>> different
>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>> this unless
>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>> Additionally, your use
>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>> problem
>>>>> statement
>>>>>>>>>>>>> is about the addition of a single letter term and the min
>> length
>>>>> allowed on
>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>> the
>>>>> ’a’ is
>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>> odd
>>>>> about the
>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>> signs
>>>>> typos?
>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>> This is the URL in my application, not Solr params. That's the
>>>>> query string.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>> likely
>>>>> that
>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>> it’s
>>>>> just a
>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>> This is part of the application. Species will be used later
>> on
>>>>> in solr
>>>>>>>>>>>>> to filter out the result. That's not Solr. Those are my app params.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>> covers
>>>>> is to
>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>> Ignore all the
>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>> “&debug=query”
>>>>> to skip
>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>> The two JSON files I've sent were produced with debugQuery=on and the
>>>>> explain tag
>>>>>>>>>>>>> is present.
>>>>>>>>>>>>>> I will try searching the way you mentioned.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your input
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>> different, this
>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>> different
>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>> this unless
>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>> Additionally, your use
>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>> problem
>>>>> statement
>>>>>>>>>>>>> is about the addition of a single letter term and the min
>> length
>>>>> allowed on
>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>> the
>>>>> ’a’ is
>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>> odd
>>>>> about the
>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>> signs
>>>>> typos?
>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>> likely
>>>>> that
>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>> it’s
>>>>> just a
>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>> covers
>>>>> is to
>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>> Ignore all the
>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>> “&debug=query”
>>>>> to skip
>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
>>> what I
>>>>>>>>>>>>> expect” is answered by looking at the “&debug=query” output
>> and
>>>>> the
>>>>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
>>>>> sure to look
>>>>>>>>>>>>> at _both_ the query and index output. Also, and very important
>>>>> about the
>>>>>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
>>> that
>>>>> what you
>>>>>>>>>>>>> put in the text boxes have made it through the query parser
>>>>> intact and is
>>>>>>>>>>>>> analyzed by the field selected. Consider the search
>>>>> "q=field:word1 word2".
>>>>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
>>>>> looks like
>>>>>>>>>>>>> what you expect. That’s misleading because the query is
>> _parsed_
>>>>> as
>>>>>>>>>>>>> "field:word1 default_search_field:word2”. This is where
>>>>> “&debug=query”
>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Erick
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>> Those words
>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think the OP's concern is different results when adding a
>>>>> stopword. I
>>>>>>>>>>>>>>>> think he's using the filter factory correctly - the query
>>> chain
>>>>>>>>>>>>> includes
>>>>>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
>>>>> document in
>>>>>>>>>>>>>>>> result you are concerned about and post full result of
>>>>> analysis screen
>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>> both query and index).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>> Those words
>>>>>>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every
>> analysis
>>>>> chain in
>>>>>>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to
>> read
>>>>> the new
>>>>>>>>>>>>> config.
>>>>>>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords
>> will
>>>>> not be
>>>>>>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wunder
>>>>>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>> http://observer.wunderwood.org/ (my blog)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ok. I am kind of lost now.
>>>>>>>>>>>>>>>>>> If I open up the console > analysis and perform it,
>> that's
>>>>> the final
>>>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt>
>> in
>>>>> the
>>>>>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
>>>>> stopwords.txt"," ")
>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>>>>>>>> [hidden email] <mailto:
>>> [hidden email]
>>>>>>
>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>>>>> [hidden email]>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory"
>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
>>>>> opinion of
>>>>>>>>>>>>> course,
>>>>>>>>>>>>>>>>>>> based on your use case may be different, but i generally
>>>>> axe any
>>>>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory"
>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>>>>>>>> [hidden email] <mailto:
>>> [hidden email]
>>>>>>
>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>>>>> [hidden email]>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference
>> to
>>>>> stop
>>>>>>>>>>>>> words
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
>>> again.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am performing a search to match a name
>> (text_field),
>>>>> however
>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
>>>>> records. If i
>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>>>>>>> <field name="name"
>>>>> type="text_field"
>>>>>>>>>>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>>>>>>>>>>>>> required="true"
>>>>>>>>>>>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory"
>>>>>>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>>    <tokenizer
>> class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory"
>>>>>>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>    <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>    <filter class="solr.StopFilterFactory"
>>>>>>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
>>>>> StopAnalyzer
>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> --
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.
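
The advice quoted above (avoid different tokenizers at index and query time, and be wary of the length filter) could be sketched roughly as follows. This is only a sketch under assumptions: it reuses the `text_field` definition posted in the thread, declares a single `<analyzer>` with no `type` attribute so that the same chain is applied at index and query time, and assumes the special handling of `/._:` in the original query chain is not needed. Any change of this kind would require a full reindex.

```xml
<fieldType name="text_field" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <!-- Assumption: one analyzer with no type attribute, so the same
       chain runs at index time and at query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <!-- min="2" already discards one-letter tokens such as "a",
         before the stop filter is reached. -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

With identical chains, the Analysis screen in the admin UI should show the same tokens on the index and query sides for the same input, which makes any remaining mismatch easier to attribute to the query parser.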



Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri
What I can't understand is this:
If I search for the exact term "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell", it doesn't work, but if I search for "Immunoregulatory interactions between a Lymphoid and non-Lymphoid cell" (without the second 'a'), then it works.

> On 11 Nov 2019, at 12:24, Guilherme Viteri <[hidden email]> wrote:
>
> Thanks
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
> Yes. It has always made sense the way we've been using them.
>
>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>> is working as expected. The problem definitely lies in the configuration of
>> edismax.
> I see.
>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
> Ok, using q now, I removed all qf, performed the search and got 23 results, with the one I really want at the top.
> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I don't get anything (which makes sense). However, if I query name_exact, I get the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I still don't get any results.
>
> In summary
> - without qf - 23 results
> - dbId - 0 results
> - name_exact - 16 results
> - name - 23 results
> - dbId^1.0 name_exact^10.0 - 0 results
> - 0 results if any other field (stId, or dbId, the key) is added on top of the name fields (name, name_exact, etc.).
>
> Definitely lost here! :-/
>
>
>> On 11 Nov 2019, at 07:59, Paras Lehana <[hidden email]> wrote:
>>
>> Hi
>>
>> So I don't think removing it completely is the way to go from the scenario
>>> we have
>>
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>>
>> Quite a considerable increase
>>
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>> is working as expected. The problem definitely lies in the configuration of
>> edismax.
>>
>>
>>
>>> I am sorry but I didn't understand what do you want me to do exactly with
>>> the lst (??) and qf and bf.
>>
>>
>> What combinations did you try? I was referring to the field-level boosting
>> you have applied in edismax config.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>> request handler. There are many qf and some bq boosts. I want you to remove
>> all of these, check response again (with q now) and keep on adding them
>> again (one by one) while looking for when the numFound drastically changes.
>>
>> On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]>
>> wrote:
>>
>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>> pretty well for such a solution, but for a full index the size became
>>> prohibitive
>>>
>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
>>> wrote:
>>>
>>>> If we had IDF for phrases, they would be super effective. The 2X weight
>>> is
>>>> a hack that mostly works.
>>>>
>>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> [hidden email]
>>>> http://observer.wunderwood.org/  (my blog)
>>>>
>>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>>> [hidden email]> wrote:
>>>>>
>>>>> the pf and qf fields are REALLY nice for this
>>>>>
>>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>>> [hidden email]>
>>>>> wrote:
>>>>>
>>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>>>
>>>>>> Something like:
>>>>>>
>>>>>>     <str name="qf">title^8 keywords^4 text</str>
>>>>>>     <str name="pf">title^16 keywords^8 text^2</str>
>>>>>>
>>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>>> that
>>>>>> term can be queried against the field matching that vocabulary.
>>>>>>
>>>>>> This is how LinkedIn separates people, companies, and places, for
>>>> example.
>>>>>>
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> [hidden email]
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>
>>>>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]
>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>> Look at the “mm” parameter, try setting it to 100%. Although that’s
>>> not
>>>>>> entirely likely to do what you want either since virtually every doc
>>>> will
>>>>>> have “a” in it. But at least you’d get docs that have both terms.
>>>>>>>
>>>>>>> you may also be able to search for things like “Lamin A” _only as a
>>>>>> phrase_ and have some luck. But this is a gnarly problem in general.
>>>> Some
>>>>>> people have been able to substitute synonyms and/or shingles to make
>>>> this
>>>>>> work at the expense of a larger index.
>>>>>>>
>>>>>>> This is a generic problem with context. “Lamin A” is really a
>>>> “concept”,
>>>>>> not just two words that happen to be near each other. Searching as a
>>>> phrase
>>>>>> is an OOB-but-naive way to try to make it more likely that the ranked
>>>>>> results refer to the _concept_ of “Lamin A”. The assumption here is
>>> “if
>>>>>> these two words appear next to each other, they’re more likely to be
>>>> what I
>>>>>> want”. I say “naive” because “Lamins: A new approach to...” would
>>>> _also_ be
>>>>>> found for a naive phrase search. (I have no idea whether such a title
>>>> makes
>>>>>> sense or not, but you figured that out already)...
>>>>>>>
>>>>>>> To do this well you’d have to dive in to NLP/Machine learning.
>>>>>>>
>>>>>>> I truly wish we could have the DWIM search algorithm (Do What I
>>> Mean)….
>>>>>>>
>>>>>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> HI Walter and Paras
>>>>>>>>
>>>>>>>> I indexed it removing all the references to StopWordFilter and I
>>> went
>>>>>> from 121 results to near 20K as the search term q="Lymphoid and a
>>>>>> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
>>>> So I
>>>>>> don't think removing it completely is the way to go from the scenario
>>> we
>>>>>> have, but I appreciate the suggestion…
>>>>>>>>
>>>>>>>> Yes the response is using fl=*
>>>>>>>> I am trying some combinations at the moment, but yet no success.
>>>>>>>>
>>>>>>>> defType=edismax
>>>>>>>> q.alt=Lymphoid and a non-Lymphoid cell
>>>>>>>> Number of results=1599
>>>>>>>> Quite a considerable increase, even though reasonable meaningful
>>>>>> results.
>>>>>>>>
>>>>>>>> I am sorry but I didn't understand what do you want me to do exactly
>>>>>> with the lst (??) and qf and bf.
>>>>>>>>
>>>>>>>> Thanks everyone with their inputs
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Guilherme
>>>>>>>>>
>>>>>>>>> By accident, I ended up querying the using the default handler
>>>>>> (/select) and it worked.
>>>>>>>>>
>>>>>>>>> You've just found the culprit. Thanks for giving the material I
>>>>>> requested. Your analysis chain is working as expected. I don't see any
>>>>>> issue in either StopWordFilter or your boosts. I also use a boost of
>>> 50
>>>>>> when boosting contextual suggestions (boosting "gold iphone" on a page
>>>> of
>>>>>> iphone) but I take Walter's suggestion and would try to optimize my
>>>>>> weights. I agree that this 50 thing was not researched much about by
>>> us
>>>> as
>>>>>> well (we never faced performance or relevance issues).
>>>>>>>>>
>>>>>>>>> See the major difference in both the handlers - edismax. I'm pretty
>>>>>> sure that your problem lies in the parsing of queries (you can confirm
>>>> that
>>>>>> from parsedquery key in debug of both JSON responses). I hope you have
>>>>>> provided the response with fl=*. Replace q with q.alt in your /search
>>>>>> handler query and I think you should start getting responses. That's
>>>>>> because q.alt uses standard parser. If you want to keep using
>>> edisMax, I
>>>>>> suggest you to test the responses removing some combination of lst
>>> (qf,
>>>> bf)
>>>>>> and find what's restricting the documents to come up. I'm out of
>>> office
>>>>>> today - would have certainly tried analyzing the field values of the
>>>>>> document in /select request and compare it with qf/bq in
>>> solrconfig.xml
>>>>>> /search. Do this for me and you'd certainly find something.
>>>>>>>>>
>>>>>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>>> [hidden email]
>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>> I normally use a weight of 8 for the most important field, like
>>>> title.
>>>>>> Other fields might get a 4 or 2.
>>>>>>>>>
>>>>>>>>> I add a “pf” field with the weights doubled, so that phrase matches
>>>>>> have a higher weight.
>>>>>>>>>
>>>>>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>>>>> early web search engines. With different relevance algorithms and
>>>> totally
>>>>>> different evaluation and tuning systems, they settled on weights of 8
>>>> and
>>>>>> 7.5 for HTML titles. With the two radically different systems
>>> getting
>>>>>> the same number, I decided that was a property of the documents, not
>>> of
>>>> the
>>>>>> search engines.
>>>>>>>>>
>>>>>>>>> wunder
>>>>>>>>> Walter Underwood
>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>>>>> (my blog)
>>>>>>>>>
>>>>>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Wunder,
>>>>>>>>>>
>>>>>>>>>> My indexer takes quite a few hours to be executed I am shortening
>>> it
>>>>>> to run faster, but I also need to make sure it gives what we are
>>>> expecting.
>>>>>> This implementation's been there for >4y, and massively used.
>>>>>>>>>>
>>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>>> extremely
>>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>>> years
>>>>>> of configuring Solr.
>>>>>>>>>> I've inherited that implementation and I am really keen to
>>> adequate
>>>>>> it, what would you recommend ?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Guilherme
>>>>>>>>>>
>>>>>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that
>>> you
>>>>>> still are using StopFilterFactory. The first advice we gave you was to
>>>>>> remove that.
>>>>>>>>>>>
>>>>>>>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>>>>>>>
>>>>>>>>>>> You will continue to have problems matching stopwords until you
>>> do
>>>>>> that.
>>>>>>>>>>>
>>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>>> extremely
>>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>>> years
>>>>>> of configuring Solr.
>>>>>>>>>>>
>>>>>>>>>>> wunder
>>>>>>>>>>> Walter Underwood
>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
>>>>
>>>>>> (my blog)
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Paras, everyone
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you again for your inputs and suggestions. I sorry to hear
>>>>>> you had trouble with the attachments I will host it somewhere and
>>> share
>>>> the
>>>>>> links.
>>>>>>>>>>>> I don't tweak my index, I get the data from the graph database,
>>>>>> create a document as they are and save to solr.
>>>>>>>>>>>>
>>>>>>>>>>>> So, I am sending the new analysis screen querying the way you
>>>>>> suggested. Also the results with params and solr query url.
>>>>>>>>>>>>
>>>>>>>>>>>> During the process of querying what you asked I found something
>>>>>> really weird (at least for me). By accident, I ended up querying the
>>>> using
>>>>>> the default handler (/select) and it worked. Then If I use the one I
>>>> must
>>>>>> use, then sadly doesn't work. I am posting both results and I will
>>> also
>>>>>> post the handlers as well.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is the link with all the files mentioned before
>>>>>>>>>>>>
>>>>>>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
>>>> [hidden email]
>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Guilherme.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am sending they analysis result and the json result as
>>>> requested.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>>>>>> quality
>>>>>>>>>>>>> though).
>>>>>>>>>>>>>
>>>>>>>>>>>>> From the analysis screen, the analysis is working as expected.
>>>> One
>>>>>> of the
>>>>>>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
>>>> matching
>>>>>>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
>>>>>> initially
>>>>>>>>>>>>> think of is: the stopword "a" is probably present in
>>>> post-analysis
>>>>>> either
>>>>>>>>>>>>> of query or index. Did you tweak your index time analysis after
>>>>>> indexing?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do two things:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Post the analysis screen for and index=*"Immunoregulatory
>>>>>>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>>>>>>>> "query=*"lymphoid
>>>>>>>>>>>>> and a non-lymphoid cell"*. Try hosting the image and providing
>>>> the
>>>>>> link
>>>>>>>>>>>>> here.
>>>>>>>>>>>>> 2. Give the same JSON output as you have sent but this time
>>> with
>>>>>>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>>>>> [hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or
>>> some
>>>>>> such. The
>>>>>>>>>>>>>> Apache server is fairly aggressive about stripping attachments
>>>>>> though, so
>>>>>>>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>>>> [hidden email]
>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks Erick.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>>> different, this
>>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>>> different
>>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>>> this unless
>>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>>> Additionally, your use
>>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>>> problem
>>>>>> statement
>>>>>>>>>>>>>> is about the addition of a single letter term and the min
>>> length
>>>>>> allowed on
>>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>>> the
>>>>>> ’a’ is
>>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>>> odd
>>>>>> about the
>>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>>> signs
>>>>>> typos?
>>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>>> This the url in my application, not solr params. That's the
>>>>>> query string.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>>> likely
>>>>>> that
>>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>>> it’s
>>>>>> just a
>>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>>> This is part of the application. Species will be used later
>>> on
>>>>>> in solr
>>>>>>>>>>>>>> to filter out the result. That's not solr. That my app params.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>>> covers
>>>>>> is to
>>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>>> Ignore all the
>>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>>> “&debug=query”
>>>>>> to skip
>>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the
>>>>>> explain tag
>>>>>>>>>>>>>> is present.
>>>>>>>>>>>>>>> I will try the searching the way you mentioned.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank for your inputs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>>> different, this
>>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>>> different
>>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>>> this unless
>>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>>> Additionally, your use
>>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>>> problem
>>>>>> statement
>>>>>>>>>>>>>> is about the addition of a single letter term and the min
>>> length
>>>>>> allowed on
>>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>>> the
>>>>>> ’a’ is
>>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>>> odd
>>>>>> about the
>>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>>> signs
>>>>>> typos?
>>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>>> likely
>>>>>> that
>>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>>> it’s
>>>>>> just a
>>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>>> covers
>>>>>> is to
>>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>>> Ignore all the
>>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>>> “&debug=query”
>>>>>> to skip
>>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
>>>> what I
>>>>>>>>>>>>>> expect” is answered by looking at the “&debug=query” output
>>> and
>>>>>> the
>>>>>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
>>>>>> sure to look
>>>>>>>>>>>>>> at _both_ the query and index output. Also, and very important
>>>>>> about the
>>>>>>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
>>>> that
>>>>>> what you
>>>>>>>>>>>>>> put in the text boxes have made it through the query parser
>>>>>> intact and is
>>>>>>>>>>>>>> analyzed by the field selected. Consider the search
>>>>>> "q=field:word1 word2".
>>>>>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
>>>>>> looks like
>>>>>>>>>>>>>> what you expect. That’s misleading because the query is
>>> _parsed_
>>>>>> as
>>>>>>>>>>>>>> "field:word1 default_search_field:word2”. This is where
>>>>>> “&debug=query”
>>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Erick
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>>> Those words
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think the OP's concern is different results when adding a
>>>>>> stopword. I
>>>>>>>>>>>>>>>>> think he's using the filter factory correctly - the query
>>>> chain
>>>>>>>>>>>>>> includes
>>>>>>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
>>>>>> document in
>>>>>>>>>>>>>>>>> result you are concerned about and post full result of
>>>>>> analysis screen
>>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>>> both query and index).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>>> Those words
>>>>>>>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every
>>> analysis
>>>>>> chain in
>>>>>>>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to
>>> read
>>>>>> the new
>>>>>>>>>>>>>> config.
>>>>>>>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords
>>> will
>>>>>> not be
>>>>>>>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> wunder
>>>>>>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>>> http://observer.wunderwood.org/ <
>>>>>> http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ok. I am kind a lost now.
>>>>>>>>>>>>>>>>>>> If I open up the console > analysis and perform it,
>>> that's
>>>>>> the final
>>>>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt>
>>> in
>>>>>> the
>>>>>>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
>>>>>> stopwords.txt"," ")
>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>>>>>>>>> [hidden email] <mailto:
>>>> [hidden email]
>>>>>>>
>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>>>>>> [hidden email]>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory"
>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
>>>>>> opinion of
>>>>>>>>>>>>>> course,
>>>>>>>>>>>>>>>>>>>> based on your use case may be different, but i generally
>>>>>> axe any
>>>>>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory"
>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>>>>>>>>> [hidden email] <mailto:
>>>> [hidden email]
>>>>>>>
>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>>>>>> [hidden email]>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference
>>> to
>>>>>> stop
>>>>>>>>>>>>>> words
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
>>>> again.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I am performing a search to match a name
>>> (text_field),
>>>>>> however
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
>>>>>> records. If i
>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>>>>>>>> <field name="name"
>>>>>> type="text_field"
>>>>>>>>>>>>>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>>>>>>>>>>>>>> required="true"
>>>>>>>>>>>>>>>>>>>>>>> multiValued="false"/>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory"
>>>>>>>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>>>>>>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer
>>> class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory"
>>>>>>>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>>>>>>>>>>>>>>>> pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2"
>>>>>>>>>>>>>>>>>>>>> max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory"
>>>>>>>>>>>>>>>>>> ignoreCase="true"
>>>>>>>>>>>>>>>>>>>>>>> words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
>>>>>> StopAnalyzer
>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana
Hey Guilherme,

I was a bit busy for the past few days and couldn't read your mail. So, did
you find anything? Anyway, as I had expected, the culprit is definitely
among the qf fields. Do the documents in question contain dbId? I suggest you
cross-check the fields in your document against the qf fields that affect the
result.
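
One way to run that cross-check is to cut qf in the /search handler down to a single text field and then add the other fields back one at a time, watching numFound after each change. A minimal sketch, assuming the handler defaults posted earlier in this thread; the single-field qf is only a starting point for the test, not a recommended final configuration:

```xml
<!-- Sketch only: /search handler with qf reduced to one field for testing -->
<requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="q.op">OR</str>
        <str name="echoParams">explicit</str>
        <str name="defType">edismax</str>
        <str name="q.alt">*:*</str>
        <str name="df">name</str>
        <!-- Start with the analyzed text field only, then re-add
             name_exact, stId and dbId one by one -->
        <str name="qf">name</str>
    </lst>
</requestHandler>
```

Because qf sits under defaults, the same experiment can also be run without editing solrconfig.xml by passing qf as a request parameter. Adding debug=query to each request shows how every added field changes the parsed query.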

On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <[hidden email]> wrote:

> What I can't understand is:
> If I search for the exact term "Immunoregulatory interactions between a
> Lymphoid *and a* non-Lymphoid cell", it doesn't work, but if I search for
> "Immunoregulatory interactions between a Lymphoid *and* non-Lymphoid
> cell", then it works.
>
> On 11 Nov 2019, at 12:24, Guilherme Viteri <[hidden email]> wrote:
>
> Thanks
>
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
>
> Yes. The way we've been using them has always made sense.
>
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
>
> I see.
>
> *Let me explain again:* In your solrconfig.xml, look at your /search
>
> Ok, using q now, I removed all qf, performed the search and got 23
> results, with the one I really want at the top.
> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I
> don't get anything (which makes sense). However, if I query name_exact, I get
> the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0
> I still don't get any results.
>
> In summary
> - without qf - 23 results
> - dbId - 0 results
> - name_exact - 16 results
> - name - 23 results
> - dbId^1.0 name_exact^10.0 - 0 results
> - 0 results if any other, stId, dbId (key) is added on top of the
> name(name_exact, etc).
>
> Definitely lost here! :-/
>
>
> On 11 Nov 2019, at 07:59, Paras Lehana <[hidden email]> wrote:
>
> Hi
>
> So I don't think removing it completely is the way to go from the scenario
> we have
>
>
>
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
>
>
> Quite a considerable increase
>
>
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
>
>
>
> I am sorry but I didn't understand what do you want me to do exactly with
> the lst (??) and qf and bf.
>
>
>
> What combinations did you try? I was referring to the field-level boosting
> you have applied in edismax config.
>
> *Let me explain again:* In your solrconfig.xml, look at your /search
> request handler. There are many qf and some bq boosts. I want you to remove
> all of these, check response again (with q now) and keep on adding them
> again (one by one) while looking for when the numFound drastically changes.
>
> On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]>
> wrote:
>
> I use 3 word shingles with stopwords for my MLT ML trainer that worked
> pretty well for such a solution, but for a full index the size became
> prohibitive
>
> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
> wrote:
>
> If we had IDF for phrases, they would be super effective. The 2X weight is
> a hack that mostly works.
>
> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 8, 2019, at 11:08 AM, David Hastings <
>
> [hidden email]> wrote:
>
>
> the pf and qf fields are REALLY nice for this
>
> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>
> [hidden email]>
>
> wrote:
>
> I always enable phrase searching in edismax for exactly this reason.
>
> Something like:
>
>     <str name="qf">title^8 keywords^4 text</str>
>     <str name="pf">title^16 keywords^8 text^2</str>
>
> To deal with concepts in queries, a classifier and/or named entity
> extractor can be helpful. If you have a list of concepts (“controlled
> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
> term can be queried against the field matching that vocabulary.
>
> This is how LinkedIn separates people, companies, and places, for example.
>
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]
>
>
> wrote:
>
>
> Look at the “mm” parameter, try setting it to 100%. Although that’s not
> entirely likely to do what you want either since virtually every doc will
> have “a” in it. But at least you’d get docs that have both terms.
>
> You may also be able to search for things like “Lamin A” _only as a
> phrase_ and have some luck. But this is a gnarly problem in general. Some
> people have been able to substitute synonyms and/or shingles to make this
> work at the expense of a larger index.
>
> This is a generic problem with context. “Lamin A” is really a “concept”,
> not just two words that happen to be near each other. Searching as a phrase
> is an OOB-but-naive way to try to make it more likely that the ranked
> results refer to the _concept_ of “Lamin A”. The assumption here is “if
> these two words appear next to each other, they’re more likely to be what I
> want”. I say “naive” because “Lamins: A new approach to...” would _also_ be
> found for a naive phrase search. (I have no idea whether such a title makes
> sense or not, but you figured that out already)...
>
> To do this well you’d have to dive in to NLP/Machine learning.
>
> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
>
>
> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
>
> wrote:
>
>
> HI Walter and Paras
>
> I indexed it removing all the references to StopWordFilter and I
>
> went
>
> from 121 results to near 20K as the search term q="Lymphoid and a
> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
>
> So I
>
> don't think removing it completely is the way to go from the scenario
>
> we
>
> have, but I appreciate the suggestion…
>
>
> Yes the response is using fl=*
> I am trying some combinations at the moment, but yet no success.
>
> defType=edismax
> q.alt=Lymphoid and a non-Lymphoid cell
> Number of results=1599
> Quite a considerable increase, even though reasonable meaningful
>
> results.
>
>
> I am sorry but I didn't understand what do you want me to do exactly
>
> with the lst (??) and qf and bf.
>
>
> Thanks everyone with their inputs
>
>
> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
>
> wrote:
>
>
> Hi Guilherme
>
> By accident, I ended up querying the using the default handler
>
> (/select) and it worked.
>
>
> You've just found the culprit. Thanks for giving the material I
>
> requested. Your analysis chain is working as expected. I don't see any
> issue in either StopWordFilter or your boosts. I also use a boost of
>
> 50
>
> when boosting contextual suggestions (boosting "gold iphone" on a page
>
> of
>
> iphone) but I take Walter's suggestion and would try to optimize my
> weights. I agree that this 50 thing was not researched much about by
>
> us
>
> as
>
> well (we never faced performance or relevance issues).
>
>
> See the major difference in both the handlers - edismax. I'm pretty
>
> sure that your problem lies in the parsing of queries (you can confirm
>
> that
>
> from parsedquery key in debug of both JSON responses). I hope you have
> provided the response with fl=*. Replace q with q.alt in your /search
> handler query and I think you should start getting responses. That's
> because q.alt uses standard parser. If you want to keep using
>
> edisMax, I
>
> suggest you to test the responses removing some combination of lst
>
> (qf,
>
> bf)
>
> and find what's restricting the documents to come up. I'm out of
>
> office
>
> today - would have certainly tried analyzing the field values of the
> document in /select request and compare it with qf/bq in
>
> solrconfig.xml
>
> /search. Do this for me and you'd certainly find something.
>
>
> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>
> [hidden email]
>
> <mailto:[hidden email]>> wrote:
>
> I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
>
> I add a “pf” field with the weights doubled, so that phrase matches
> have a higher weight.
>
> The weight of 8 comes from experience at Infoseek and Inktomi, two
> early web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the two radically different systems getting
> the same number, I decided that was a property of the documents, not of the
> search engines.
>
>
> wunder
> Walter Underwood
> [hidden email] <mailto:[hidden email]>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>
> (my blog)
>
>
> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
>
> <mailto:[hidden email]>> wrote:
>
>
> Hi Wunder,
>
> My indexer takes quite a few hours to be executed I am shortening
>
> it
>
> to run faster, but I also need to make sure it gives what we are
>
> expecting.
>
> This implementation's been there for >4y, and massively used.
>
>
> In your edismax handlers, weights of 20, 50, and 100 are
>
> extremely
>
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>
> years
>
> of configuring Solr.
>
> I've inherited that implementation and I am really keen to
>
> adequate
>
> it, what would you recommend ?
>
>
> Cheers
> Guilherme
>
> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
>
> <mailto:[hidden email]>> wrote:
>
>
> Thanks for posting the files. Looking at schema.xml, I see that
>
> you
>
> still are using StopFilterFactory. The first advice we gave you was to
> remove that.
>
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you
>
> do
>
> that.
>
>
> In your edismax handlers, weights of 20, 50, and 100 are
>
> extremely
>
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>
> years
>
> of configuring Solr.
>
>
> wunder
> Walter Underwood
> [hidden email] <mailto:[hidden email]>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
>
>
> (my blog)
>
>
> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
>
> <mailto:[hidden email]>> wrote:
>
>
> Hi Paras, everyone
>
> Thank you again for your inputs and suggestions. I sorry to hear
>
> you had trouble with the attachments I will host it somewhere and
>
> share
>
> the
>
> links.
>
> I don't tweak my index, I get the data from the graph database,
>
> create a document as they are and save to solr.
>
>
> So, I am sending the new analysis screen querying the way you
>
> suggested. Also the results with params and solr query url.
>
>
> During the process of querying what you asked I found something
>
> really weird (at least for me). By accident, I ended up querying the
>
> using
>
> the default handler (/select) and it worked. Then If I use the one I
>
> must
>
> use, then sadly doesn't work. I am posting both results and I will
>
> also
>
> post the handlers as well.
>
>
> Here is the link with all the files mentioned before
>
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>
> If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>
>
> Thanks
>
> On 7 Nov 2019, at 05:23, Paras Lehana <
>
> [hidden email]
>
> <mailto:[hidden email]>> wrote:
>
>
> Hi Guilherme.
>
> I am sending they analysis result and the json result as
>
> requested.
>
>
>
> Thanks for the effort. Luckily, I can see your attachments (low
>
> quality
>
> though).
>
> From the analysis screen, the analysis is working as expected.
>
> One
>
> of the
>
> reasons for query="lymphoid and *a* non-lymphoid cell" not
>
> matching
>
> document containing "Lymphoid and a non-Lymphoid cell" I can
>
> initially
>
> think of is: the stopword "a" is probably present in
>
> post-analysis
>
> either
>
> of query or index. Did you tweak your index time analysis after
>
> indexing?
>
>
> Do two things:
>
> 1. Post the analysis screen for index=*"Immunoregulatory
> interactions between a Lymphoid and a non-Lymphoid cell"* and
> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
> providing the link here.
> 2. Give the same JSON output as you have sent but this time with
> *"echoParams=all"*. Also, post the exact Solr query URL.
>
>
>
> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>
> [hidden email] <mailto:[hidden email]>> wrote:
>
>
> I don’t see the attachments, maybe I deleted old e-mails or
>
> some
>
> such. The
>
> Apache server is fairly aggressive about stripping attachments
>
> though, so
>
> it’s also possible they didn’t make it through.
>
> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>
> [hidden email]
>
> <mailto:[hidden email]>> wrote:
>
>
> Thanks Erick.
>
> First, your index and analysis chains are considerably
>
> different, this
>
> can easily be a source of problems. In particular, using two
>
> different
>
> tokenizers is a huge red flag. I _strongly_ recommend against
>
> this unless
>
> you’re totally sure you understand the consequences.
>
> Additionally, your use
>
> of the length filter is suspicious, especially since your
>
> problem
>
> statement
>
> is about the addition of a single letter term and the min
>
> length
>
> allowed on
>
> that filter is 2. That said, it’s reasonable to suppose that
>
> the
>
> ’a’ is
>
> filtered out in both cases, but maybe you’ve found something
>
> odd
>
> about the
>
> interactions.
>
> I will investigate the min length and post the results later.
>
> Second, I have no idea what this will do. Are the equal
>
> signs
>
> typos?
>
> Used by custom code?
>
> This the url in my application, not solr params. That's the
>
> query string.
>
>
> What does “species=“ do? That’s not Solr syntax, so it’s
>
> likely
>
> that
>
> all the params with an equal-sign are totally ignored unless
>
> it’s
>
> just a
>
> typo.
>
> This is part of the application. Species will be used later
>
> on
>
> in solr
>
> to filter out the result. That's not solr. That my app params.
>
>
> Third, the easiest way to see what’s happening under the
>
> covers
>
> is to
>
> add “&debug=true” to the query and look at the parsed query.
>
> Ignore all the
>
> relevance calculations for the nonce, or specify
>
> “&debug=query”
>
> to skip
>
> that part.
>
> The two json files i've sent, they are debugQuery=on and the
>
> explain tag
>
> is present.
>
> I will try the searching the way you mentioned.
>
> Thank for your inputs
>
> Guilherme
>
> On 6 Nov 2019, at 14:14, Erick Erickson <
>
> [hidden email] <mailto:[hidden email]>>
>
> wrote:
>
>
> Fwd to another server
>
> First, your index and analysis chains are considerably different, this
> can easily be a source of problems. In particular, using two different
> tokenizers is a huge red flag. I _strongly_ recommend against this unless
> you’re totally sure you understand the consequences. Additionally, your use
> of the length filter is suspicious, especially since your problem statement
> is about the addition of a single letter term and the min length allowed on
> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> filtered out in both cases, but maybe you’ve found something odd about the
> interactions.
>
> Second, I have no idea what this will do. Are the equal signs typos?
> Used by custom code?
>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> all the params with an equal-sign are totally ignored unless it’s just a
> typo.
>
> Third, the easiest way to see what’s happening under the covers is to
> add “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip
> that part.
>
> 90% + of the time, the question “why didn’t this query do what I
> expect” is answered by looking at the “&debug=query” output and the
> analysis page in the admin UI. NOTE: for the analysis page be sure to look
> at _both_ the query and index output. Also, and very important about the
> analysis page (and this is confusing) is that this _assumes_ that what you
> put in the text boxes have made it through the query parser intact and is
> analyzed by the field selected. Consider the search "q=field:word1 word2".
> Now you type “word1 word2” into the analysis text box and it looks like
> what you expect. That’s misleading because the query is _parsed_ as
> "field:word1 default_search_field:word2”. This is where “&debug=query”
> helps.
>
>
> Best,
> Erick
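
As an illustration of the "&debug=query" advice above, here is a minimal Python sketch that pulls the parsed query out of a Solr JSON response. The helper name is made up, and the sample response is a hand-written stand-in that reuses the "field:word1 default_search_field:word2" example from the message; it is not real output from the poster's index.

```python
import json

def parsed_query(response_text):
    """Return the 'parsedquery' entry from the debug section of a Solr JSON response.

    Hypothetical helper; returns None when the response carries no debug output.
    """
    body = json.loads(response_text)
    return body.get("debug", {}).get("parsedquery")

# Hand-written stand-in for a response fetched with &debug=query&wt=json.
sample = json.dumps({
    "response": {"numFound": 0, "docs": []},
    "debug": {
        "rawquerystring": "field:word1 word2",
        "parsedquery": "field:word1 default_search_field:word2",
    },
})

print(parsed_query(sample))
```

Comparing this value between the /select and /search handlers is the quickest way to see where the two parsers diverge.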
>
> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>
> [hidden email] <mailto:[hidden email]>>
>
> wrote:
>
>
> Hi Walter,
>
> The solr.StopFilter removes all tokens that are stopwords.
>
> Those words
>
> will
>
> not be in the index, so they can never match a query.
>
>
>
> I think the OP's concern is different results when adding a
>
> stopword. I
>
> think he's using the filter factory correctly - the query
>
> chain
>
> includes
>
> the filter as well so it should remove "a" while querying.
>
> *@Guilherme*, please post results for both the query, the
>
> document in
>
> result you are concerned about and post full result of
>
> analysis screen
>
> (for
>
> both query and index).
>
> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>
> [hidden email] <mailto:[hidden email]>>
>
> wrote:
>
>
> No.
>
> The solr.StopFilter removes all tokens that are stopwords. Those words
> will not be in the index, so they can never match a query.
>
> 1. Remove the lines with solr.StopFilter from every analysis chain in
> schema.xml.
> 2. Reload the collection, restart Solr, or whatever to read the new
> config.
> 3. Reindex all of the documents.
>
> When indexed with the new analysis chain, the stopwords will not be
> removed and they will be searchable.
>
> wunder
> Walter Underwood
> [hidden email] <mailto:[hidden email]>
> http://observer.wunderwood.org/ <
>
> http://observer.wunderwood.org/>  (my blog)
>
>
> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>
> [hidden email] <mailto:[hidden email]>>
>
> wrote:
>
>
> Ok. I am kind a lost now.
> If I open up the console > analysis and perform it,
>
> that's
>
> the final
>
> result.
>
> <Screenshot 2019-11-05 at 14.54.16.png>
>
> Your suggestion is: get rid of the <filter stopword.txt>
>
> in
>
> the
>
> schema.xml and during index phase replaceAll("in
>
> stopwords.txt"," ")
>
> then
>
> add to solr. Is that correct ?
>
>
> Thanks David
>
> On 5 Nov 2019, at 14:48, David Hastings <
>
> [hidden email] <mailto:
>
> [hidden email]
>
>
> <mailto:[hidden email] <mailto:
>
> [hidden email]>>> wrote:
>
>
> Fwd to another server
>
> no,
>   <filter class="solr.StopFilterFactory"
>
> ignoreCase="true"
>
> words="stopwords.txt"/>
>
> is still using stopwords and should be removed, in my
>
> opinion of
>
> course,
>
> based on your use case may be different, but i generally
>
> axe any
>
> reference
>
> to them at all
>
> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>
> [hidden email] <mailto:[hidden email]>
>
> <mailto:[hidden email] <mailto:[hidden email]>>>
>
> wrote:
>
>
> Thanks.
> Haven't I done this here ?
> <fieldType name="text_field" class="solr.TextField"
> positionIncrementGap="100" omitNorms="false" >
> <analyzer type="index">
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.ClassicFilterFactory"/>
>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> </analyzer>
>
>
> On 5 Nov 2019, at 14:15, David Hastings <
>
> [hidden email] <mailto:
>
> [hidden email]
>
>
> <mailto:[hidden email] <mailto:
>
> [hidden email]>>>
>
> wrote:
>
>
> Fwd to another server
>
> The first thing you should do is remove any reference
>
> to
>
> stop
>
> words
>
> and
>
> never use them, then re-index your data and try it
>
> again.
>
>
> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>
> [hidden email] <mailto:[hidden email]>
>
> <mailto:[hidden email] <mailto:[hidden email]>>>
>
> wrote:
>
>
> Hi,
>
> I am performing a search to match a name
>
> (text_field),
>
> however
>
> this
>
> term
>
> contains 'and' and 'a' and it doesn't return any
>
> records. If i
>
> remove
>
> 'a'
>
> then it works.
> e.g
> Search Term: lymphoid and a non-lymphoid cell
> doesn't work:
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
> Search term: lymphoid and non-lymphoid cell
> works:
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
> interested in the first result
>
> schema.xml
> <field name="name" type="text_field" indexed="true" stored="true"
>        omitNorms="false" required="true" multiValued="false"/>
>
> <analyzer type="query">
>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> </analyzer>
>
> <fieldType name="text_field" class="solr.TextField"
>            positionIncrementGap="100" omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ClassicFilterFactory"/>
>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>   </analyzer>
> </fieldType>
>
> stopwords.txt
> #Standard english stop words taken from Lucene's StopAnalyzer
>
> a
> b
> c
> ....
> an
> and
> are
>
> Running SolR 6.6.2.
>
> Is there anything I could do to prevent this ?
>
> Thanks
> Guilherme
>
>
>
>
>
>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> --
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.
>

--
--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

--
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.
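
For readers following along, the query-side analyzer quoted in this thread can be approximated in a few lines of Python. This is only a sketch: it imitates the pattern tokenizer and the pattern-replace, length, lowercase and stop filters with plain regular expressions, the function name is made up, and the stopword set holds only the entries visible in the quoted stopwords.txt (the rest is elided there). Under those assumptions both search terms reduce to the same tokens for the name field, which is consistent with the view that the analysis chain itself is not the culprit.

```python
import re

# Only the entries visible in the quoted stopwords.txt; the real file is longer.
STOPWORDS = {"a", "b", "c", "an", "and", "are"}

def analyze_query(text, min_len=2, max_len=20):
    """Rough imitation of the text_field query analyzer (hypothetical helper)."""
    # PatternTokenizerFactory: split on anything outside [a-zA-Z0-9/._:]
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # Three PatternReplaceFilterFactory steps: trim leading and trailing
    # "/._:" characters, then turn underscores into spaces.
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [t.replace("_", " ") for t in tokens]
    # LengthFilterFactory min=2 max=20: this alone already drops "a"
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    # LowerCaseFilterFactory, then StopFilterFactory
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

with_a = analyze_query("Lymphoid and a non-Lymphoid cell")
without_a = analyze_query("Lymphoid and non-Lymphoid cell")
print(with_a)     # ['lymphoid', 'non', 'lymphoid', 'cell']
print(without_a)  # ['lymphoid', 'non', 'lymphoid', 'cell']
```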

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri
Hi Paras
No worries.
No I didn’t find anything. This is annoying now...
Yes! They do contain dbId. Absolutely all my docs contain dbId, and it is actually my key, if you check the schema.xml again.

Cheers
Guilherme
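
One possible explanation for the zero results once dbId joins qf, offered only as a hypothesis and not as a diagnosis of this index: if the term "a" ends up as a required clause (for example through mm=100%, or because edismax reads the lowercase "and" as an operator), then a field whose analyzer keeps "a" turns it into a clause that no document can satisfy, while a field that drops it contributes nothing. The toy model below is not Solr code. The two analyzers, the document contents and the dbId value are all invented for illustration.

```python
STOPWORDS = {"a", "an", "and", "are"}  # excerpt of the thread's stopwords.txt

def analyze_text(term):
    """Toy text_field analyzer: drops one-letter tokens and stopwords."""
    t = term.lower()
    return [] if len(t) < 2 or t in STOPWORDS else [t]

def analyze_key(term):
    """Toy key-field analyzer: keeps every term verbatim (assumed behaviour)."""
    return [term]

def required_clauses(terms, analyzers):
    """For each required term, collect the (field, token) alternatives.

    A term that every field drops yields no clause at all.
    """
    clauses = []
    for term in terms:
        alternatives = [(field, tok)
                        for field, analyze in analyzers.items()
                        for tok in analyze(term)]
        if alternatives:
            clauses.append(alternatives)
    return clauses

def matches(doc, clauses):
    """A document matches when every clause has at least one satisfied alternative."""
    return all(any(tok in doc.get(field, set()) for field, tok in alts)
               for alts in clauses)

# Invented document: indexed name tokens plus a made-up key value.
doc = {"name": {"immunoregulatory", "interactions", "lymphoid", "non", "cell"},
       "dbId": {"12345"}}
terms = ["lymphoid", "a"]

name_only = required_clauses(terms, {"name": analyze_text})
name_and_key = required_clauses(terms, {"name": analyze_text, "dbId": analyze_key})
print(matches(doc, name_only))     # True: "a" vanished, only "lymphoid" is required
print(matches(doc, name_and_key))  # False: the clause dbId:a can never match
```

If the parsedquery from debug output shows a required clause on dbId or stId for the term "a", that would support this reading; if not, the cause lies elsewhere.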

> On 15 Nov 2019, at 05:37, Paras Lehana <[hidden email]> wrote:
>
> 
> Hey Guilherme,
>
> I was a bit busy for the past few days and couldn't read your mail. So, did you find anything? Anyway, as I had expected, the culprit is definitely among the qf fields. Do the documents in question contain dbId? I suggest you cross-check the fields in your document against those in qf that affect the result.
>
>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <[hidden email]> wrote:
>> What I can't understand is:
>> I search for the exact term "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell" and it doesn't work, but if I search for "Immunoregulatory interactions between a Lymphoid and non-Lymphoid cell" then it works.
>>
>>> On 11 Nov 2019, at 12:24, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Thanks
>>>> Removing stopwords is another story. I'm curious to find the reason
>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>> really necessary.
>>> Yes. The way we have been using them has always made sense.
>>>
>>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>>> is working as expected. The problem definitely lies in the configuration of
>>>> edismax.
>>> I see.
>>>
>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>> Ok, using q now, I removed all qf, performed the search and got 23 results, with the one I really want at the top.
>>> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I don't get anything (which makes sense). However, if I query name_exact, I get the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I still don't get any results.
>>>
>>> In summary
>>> - without qf - 23 results
>>> - dbId - 0 results
>>> - name_exact - 16 results
>>> - name - 23 results
>>> - dbId^1.0
>>>  name_exact^10.0 - 0 results
>>> - 0 results if any other, stId, dbId (key) is added on top of the name(name_exact, etc).
>>>
>>> Definitely lost here! :-/
>>>
>>>
>>>> On 11 Nov 2019, at 07:59, Paras Lehana <[hidden email]> wrote:
>>>>
>>>> Hi
>>>>
>>>> So I don't think removing it completely is the way to go from the scenario
>>>>> we have
>>>>
>>>>
>>>> Removing stopwords is another story. I'm curious to find the reason
>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>> really necessary.
>>>>
>>>>
>>>> Quite a considerable increase
>>>>
>>>>
>>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>>> is working as expected. The problem definitely lies in the configuration of
>>>> edismax.
>>>>
>>>>
>>>>
>>>>> I am sorry but I didn't understand what do you want me to do exactly with
>>>>> the lst (??) and qf and bf.
>>>>
>>>>
>>>> What combinations did you try? I was referring to the field-level boosting
>>>> you have applied in edismax config.
>>>>
>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>>> request handler. There are many qf and some bq boosts. I want you to remove
>>>> all of these, check response again (with q now) and keep on adding them
>>>> again (one by one) while looking for when the numFound drastically changes.
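
The add-one-field-at-a-time procedure described above can be scripted rather than done by hand. The sketch below only builds the request URLs; the core URL and the boosts are hypothetical, the field names are the ones mentioned in this thread, and it assumes qf can be overridden per request (that is, it sits under "defaults" and not "invariants" in the handler). Any bq or bf left in the handler defaults would still apply.

```python
from urllib.parse import urlencode

def qf_steps(base_url, query, qf_entries):
    """Build one request URL per step, adding one qf entry at a time."""
    steps = []
    for i in range(1, len(qf_entries) + 1):
        qf = " ".join(qf_entries[:i])
        params = {
            "q": query,
            "defType": "edismax",
            "qf": qf,          # request-time qf, overriding the handler default
            "rows": 0,         # only numFound matters for this check
            "debug": "query",  # include the parsed query in the response
            "wt": "json",
        }
        steps.append((qf, base_url + "?" + urlencode(params)))
    return steps

# Hypothetical core URL and boosts.
BASE = "http://localhost:8983/solr/mycore/search"
FIELDS = ["name^10.0", "name_exact^10.0", "dbId^1.0", "stId^1.0"]

for qf, url in qf_steps(BASE, "lymphoid and a non-lymphoid cell", FIELDS):
    print(qf)
    print("  " + url)
```

Fetching each URL in order and noting where numFound drops to zero points at the field responsible.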
>>>>
>>>> On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]>
>>>> wrote:
>>>>
>>>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>>>> pretty well for such a solution, but for a full index the size became
>>>>> prohibitive
>>>>>
>>>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> If we had IDF for phrases, they would be super effective. The 2X weight
>>>>> is
>>>>>> a hack that mostly works.
>>>>>>
>>>>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>>>>
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> [hidden email]
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>
>>>>>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>>>>>> [hidden email]> wrote:
>>>>>>>
>>>>>>> the pf and qf fields are REALLY nice for this
>>>>>>>
>>>>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>>>>> [hidden email]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>>>>>
>>>>>>>> Something like:
>>>>>>>>
>>>>>>>>     <str name="qf">title^8 keywords^4 text</str>
>>>>>>>>     <str name="pf">title^16 keywords^8 text^2</str>
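
The pf line above is the qf line with every weight doubled (a field without an explicit boost counts as 1). A small Python sketch of that rule, with a made-up helper name, in case the two strings need to be kept in sync:

```python
def double_weights(qf):
    """Derive a pf string from a qf string by doubling each boost.

    Hypothetical helper; an entry without '^boost' is treated as boost 1.
    """
    out = []
    for entry in qf.split():
        field, _, boost = entry.partition("^")
        weight = float(boost) if boost else 1.0
        out.append(f"{field}^{weight * 2:g}")
    return " ".join(out)

print(double_weights("title^8 keywords^4 text"))  # title^16 keywords^8 text^2
```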
>>>>>>>>
>>>>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>>>>> that
>>>>>>>> term can be queried against the field matching that vocabulary.
>>>>>>>>
>>>>>>>> This is how LinkedIn separates people, companies, and places, for
>>>>>> example.
>>>>>>>>
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> [hidden email]
>>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>>>
>>>>>>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]
>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Look at the “mm” parameter, try setting it to 100%. Although that’s
>>>>> not
>>>>>>>> entirely likely to do what you want either since virtually every doc
>>>>>> will
>>>>>>>> have “a” in it. But at least you’d get docs that have both terms.
>>>>>>>>>
>>>>>>>>> you may also be able to search for things like “Lamin A” _only as a
>>>>>>>> phrase_ and have some luck. But this is a gnarly problem in general.
>>>>>> Some
>>>>>>>> people have been able to substitute synonyms and/or shingles to make
>>>>>> this
>>>>>>>> work at the expense of a larger index.
>>>>>>>>>
>>>>>>>>> This is a generic problem with context. “Lamin A” is really a
>>>>>> “concept”,
>>>>>>>> not just two words that happen to be near each other. Searching as a
>>>>>> phrase
>>>>>>>> is an OOB-but-naive way to try to make it more likely that the ranked
>>>>>>>> results refer to the _concept_ of “Lamin A”. The assumption here is
>>>>> “if
>>>>>>>> these two words appear next to each other, they’re more likely to be
>>>>>> what I
>>>>>>>> want”. I say “naive” because “Lamins: A new approach to...” would
>>>>>> _also_ be
>>>>>>>> found for a naive phrase search. (I have no idea whether such a title
>>>>>> makes
>>>>>>>> sense or not, but you figured that out already)...
>>>>>>>>>
>>>>>>>>> To do this well you’d have to dive in to NLP/Machine learning.
>>>>>>>>>
>>>>>>>>> I truly wish we could have the DWIM search algorithm (Do What I
>>>>> Mean)….
>>>>>>>>>
>>>>>>>>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> HI Walter and Paras
>>>>>>>>>>
>>>>>>>>>> I indexed it removing all the references to StopWordFilter and I
>>>>> went
>>>>>>>> from 121 results to near 20K as the search term q="Lymphoid and a
>>>>>>>> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
>>>>>> So I
>>>>>>>> don't think removing it completely is the way to go from the scenario
>>>>> we
>>>>>>>> have, but I appreciate the suggestion…
>>>>>>>>>>
>>>>>>>>>> Yes the response is using fl=*
>>>>>>>>>> I am trying some combinations at the moment, but yet no success.
>>>>>>>>>>
>>>>>>>>>> defType=edismax
>>>>>>>>>> q.alt=Lymphoid and a non-Lymphoid cell
>>>>>>>>>> Number of results=1599
>>>>>>>>>> Quite a considerable increase, even though reasonable meaningful
>>>>>>>> results.
>>>>>>>>>>
>>>>>>>>>> I am sorry but I didn't understand what do you want me to do exactly
>>>>>>>> with the lst (??) and qf and bf.
>>>>>>>>>>
>>>>>>>>>> Thanks everyone with their inputs
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Guilherme
>>>>>>>>>>>
>>>>>>>>>>> By accident, I ended up querying using the default handler
>>>>>>>> (/select) and it worked.
>>>>>>>>>>>
>>>>>>>>>>> You've just found the culprit. Thanks for giving the material I
>>>>>>>> requested. Your analysis chain is working as expected. I don't see any
>>>>>>>> issue in either StopWordFilter or your boosts. I also use a boost of
>>>>> 50
>>>>>>>> when boosting contextual suggestions (boosting "gold iphone" on a page
>>>>>> of
>>>>>>>> iphone) but I take Walter's suggestion and would try to optimize my
>>>>>>>> weights. I agree that this 50 thing was not researched much about by
>>>>> us
>>>>>> as
>>>>>>>> well (we never faced performance or relevance issues).
>>>>>>>>>>>
>>>>>>>>>>> See the major difference in both the handlers - edismax. I'm pretty
>>>>>>>> sure that your problem lies in the parsing of queries (you can confirm
>>>>>> that
>>>>>>>> from parsedquery key in debug of both JSON responses). I hope you have
>>>>>>>> provided the response with fl=*. Replace q with q.alt in your /search
>>>>>>>> handler query and I think you should start getting responses. That's
>>>>>>>> because q.alt uses standard parser. If you want to keep using
>>>>> edisMax, I
>>>>>>>> suggest you to test the responses removing some combination of lst
>>>>> (qf,
>>>>>> bf)
>>>>>>>> and find what's restricting the documents to come up. I'm out of
>>>>> office
>>>>>>>> today - would have certainly tried analyzing the field values of the
>>>>>>>> document in /select request and compare it with qf/bq in
>>>>> solrconfig.xml
>>>>>>>> /search. Do this for me and you'd certainly find something.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>>>>> [hidden email]
>>>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>> I normally use a weight of 8 for the most important field, like
>>>>>> title.
>>>>>>>> Other fields might get a 4 or 2.
>>>>>>>>>>>
>>>>>>>>>>> I add a “pf” field with the weights doubled, so that phrase matches
>>>>>>>> have a higher weight.
>>>>>>>>>>>
>>>>>>>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>>>>>>> early web search engines. With different relevance algorithms and
>>>>>> totally
>>>>>>>> different evaluation and tuning systems, they settled on weights of 8
>>>>>> and
>>>>>>>> 7.5 for HTML titles. With the two radically different systems
>>>>> getting
>>>>>>>> the same number, I decided that was a property of the documents, not
>>>>> of
>>>>>> the
>>>>>>>> search engines.
>>>>>>>>>>>
>>>>>>>>>>> wunder
>>>>>>>>>>> Walter Underwood
>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>>>>>>> (my blog)
>>>>>>>>>>>
>>>>>>>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
>>>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Wunder,
>>>>>>>>>>>>
>>>>>>>>>>>> My indexer takes quite a few hours to be executed. I am shortening
>>>>> it
>>>>>>>> to run faster, but I also need to make sure it gives what we are
>>>>>> expecting.
>>>>>>>> This implementation's been there for >4y, and massively used.
>>>>>>>>>>>>
>>>>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>>>>> extremely
>>>>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>>>>> years
>>>>>>>> of configuring Solr.
>>>>>>>>>>>> I've inherited that implementation and I am really keen to
>>>>> adjust
>>>>>>>> it, what would you recommend ?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
>>>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that
>>>>> you
>>>>>>>> still are using StopFilterFactory. The first advice we gave you was to
>>>>>>>> remove that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You will continue to have problems matching stopwords until you
>>>>> do
>>>>>>>> that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are
>>>>> extremely
>>>>>>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>>>>> years
>>>>>>>> of configuring Solr.
>>>>>>>>>>>>>
>>>>>>>>>>>>> wunder
>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
>>>>>>
>>>>>>>> (my blog)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
>>>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Paras, everyone
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you again for your inputs and suggestions. I am sorry to hear
>>>>>>>> you had trouble with the attachments I will host it somewhere and
>>>>> share
>>>>>> the
>>>>>>>> links.
>>>>>>>>>>>>>> I don't tweak my index, I get the data from the graph database,
>>>>>>>> create a document as they are and save to solr.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, I am sending the new analysis screen querying the way you
>>>>>>>> suggested. Also the results with params and solr query url.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> During the process of querying what you asked I found something
>>>>>>>> really weird (at least for me). By accident, I ended up querying
>>>>>> using
>>>>>>>> the default handler (/select) and it worked. Then If I use the one I
>>>>>> must
>>>>>>>> use, then sadly doesn't work. I am posting both results and I will
>>>>> also
>>>>>>>> post the handlers as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is the link with all the files mentioned before
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>>>>>>>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
>>>>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana <
>>>>>> [hidden email]
>>>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Guilherme.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am sending the analysis result and the json result as
>>>>>> requested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the effort. Luckily, I can see your attachments (low
>>>>>>>> quality
>>>>>>>>>>>>>>> though).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From the analysis screen, the analysis is working as expected.
>>>>>> One
>>>>>>>> of the
>>>>>>>>>>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not
>>>>>> matching
>>>>>>>>>>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can
>>>>>>>> initially
>>>>>>>>>>>>>>> think of is: the stopword "a" is probably present in
>>>>>> post-analysis
>>>>>>>> either
>>>>>>>>>>>>>>> of query or index. Did you tweak your index time analysis after
>>>>>>>> indexing?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do two things:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>>>>>>>>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>>>>>>>>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>>>>>>>>>>>>> providing the link here.
>>>>>>>>>>>>>>> 2. Give the same JSON output as you have sent but this time with
>>>>>>>>>>>>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>>>>>>> [hidden email] <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don’t see the attachments, maybe I deleted old e-mails or
>>>>> some
>>>>>>>> such. The
>>>>>>>>>>>>>>>> Apache server is fairly aggressive about stripping attachments
>>>>>>>> though, so
>>>>>>>>>>>>>>>> it’s also possible they didn’t make it through.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>>>>>> [hidden email]
>>>>>>>> <mailto:[hidden email]>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Erick.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>>>>> different, this
>>>>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>>>>> different
>>>>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>>>>> this unless
>>>>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>>>>> Additionally, your use
>>>>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>>>>> problem
>>>>>>>> statement
>>>>>>>>>>>>>>>> is about the addition of a single letter term and the min
>>>>> length
>>>>>>>> allowed on
>>>>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>>>>> the
>>>>>>>> ’a’ is
>>>>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>>>>> odd
>>>>>>>> about the
>>>>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>>>>> I will investigate the min length and post the results later.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>>>>> signs
>>>>>>>> typos?
>>>>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>>>>> This the url in my application, not solr params. That's the
>>>>>>>> query string.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>>>>> likely
>>>>>>>> that
>>>>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>>>>> it’s
>>>>>>>> just a
>>>>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>>>>> This is part of the application. Species will be used later
>>>>> on
>>>>>>>> in solr
>>>>>>>>>>>>>>>> to filter out the result. That's not solr. That my app params.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>>>>> covers
>>>>>>>> is to
>>>>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>>>>> Ignore all the
>>>>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>>>>> “&debug=query”
>>>>>>>> to skip
>>>>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>>>>> The two json files i've sent, they are debugQuery=on and the
>>>>>>>> explain tag
>>>>>>>>>>>>>>>> is present.
>>>>>>>>>>>>>>>>> I will try the searching the way you mentioned.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank for your inputs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First, your index and analysis chains are considerably
>>>>>>>> different, this
>>>>>>>>>>>>>>>> can easily be a source of problems. In particular, using two
>>>>>>>> different
>>>>>>>>>>>>>>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>>>>>>> this unless
>>>>>>>>>>>>>>>> you’re totally sure you understand the consequences.
>>>>>>>> Additionally, your use
>>>>>>>>>>>>>>>> of the length filter is suspicious, especially since your
>>>>>> problem
>>>>>>>> statement
>>>>>>>>>>>>>>>> is about the addition of a single letter term and the min
>>>>> length
>>>>>>>> allowed on
>>>>>>>>>>>>>>>> that filter is 2. That said, it’s reasonable to suppose that
>>>>> the
>>>>>>>> ’a’ is
>>>>>>>>>>>>>>>> filtered out in both cases, but maybe you’ve found something
>>>>> odd
>>>>>>>> about the
>>>>>>>>>>>>>>>> interactions.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Second, I have no idea what this will do. Are the equal
>>>>> signs
>>>>>>>> typos?
>>>>>>>>>>>>>>>> Used by custom code?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>>>>> likely
>>>>>>>> that
>>>>>>>>>>>>>>>> all the params with an equal-sign are totally ignored unless
>>>>>> it’s
>>>>>>>> just a
>>>>>>>>>>>>>>>> typo.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Third, the easiest way to see what’s happening under the
>>>>>> covers
>>>>>>>> is to
>>>>>>>>>>>>>>>> add “&debug=true” to the query and look at the parsed query.
>>>>>>>> Ignore all the
>>>>>>>>>>>>>>>> relevance calculations for the nonce, or specify
>>>>> “&debug=query”
>>>>>>>> to skip
>>>>>>>>>>>>>>>> that part.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 90% + of the time, the question “why didn’t this query do
>>>>>> what I
>>>>>>>>>>>>>>>> expect” is answered by looking at the “&debug=query” output
>>>>> and
>>>>>>>> the
>>>>>>>>>>>>>>>> analysis page in the admin UI. NOTE: for the analysis page be
>>>>>>>> sure to look
>>>>>>>>>>>>>>>> at _both_ the query and index output. Also, and very important
>>>>>>>> about the
>>>>>>>>>>>>>>>> analysis page (and this is confusing) is that this _assumes_
>>>>>> that
>>>>>>>> what you
>>>>>>>>>>>>>>>> put in the text boxes have made it through the query parser
>>>>>>>> intact and is
>>>>>>>>>>>>>>>> analyzed by the field selected. Consider the search
>>>>>>>> "q=field:word1 word2".
>>>>>>>>>>>>>>>> Now you type “word1 word2” into the analysis text box and it
>>>>>>>> looks like
>>>>>>>>>>>>>>>> what you expect. That’s misleading because the query is
>>>>> _parsed_
>>>>>>>> as
>>>>>>>>>>>>>>>> "field:word1 default_search_field:word2”. This is where
>>>>>>>> “&debug=query”
>>>>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Erick
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Walter,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>>>>> Those words
>>>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>>>>>>> not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think the OP's concern is different results when adding a
>>>>>>>> stopword. I
>>>>>>>>>>>>>>>>>>> think he's using the filter factory correctly - the query
>>>>>> chain
>>>>>>>>>>>>>>>> includes
>>>>>>>>>>>>>>>>>>> the filter as well so it should remove "a" while querying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *@Guilherme*, please post results for both the query, the
>>>>>>>> document in
>>>>>>>>>>>>>>>>>>> result you are concerned about and post full result of
>>>>>>>> analysis screen
>>>>>>>>>>>>>>>> (for
>>>>>>>>>>>>>>>>>>> both query and index).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The solr.StopFilter removes all tokens that are stopwords.
>>>>>>>> Those words
>>>>>>>>>>>>>>>>>>>> will not be in the index, so they can never match a query.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. Remove the lines with solr.StopFilter from every
>>>>> analysis
>>>>>>>> chain in
>>>>>>>>>>>>>>>>>>>> schema.xml.
>>>>>>>>>>>>>>>>>>>> 2. Reload the collection, restart Solr, or whatever to
>>>>> read
>>>>>>>> the new
>>>>>>>>>>>>>>>> config.
>>>>>>>>>>>>>>>>>>>> 3. Reindex all of the documents.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> When indexed with the new analysis chain, the stopwords
>>>>> will
>>>>>>>> not be
>>>>>>>>>>>>>>>>>>>> removed and they will be searchable.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wunder
>>>>>>>>>>>>>>>>>>>> Walter Underwood
>>>>>>>>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>>>>> http://observer.wunderwood.org/ <
>>>>>>>> http://observer.wunderwood.org/>  (my blog)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>>>>>>> [hidden email] <mailto:[hidden email]>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Ok. I am kind a lost now.
>>>>>>>>>>>>>>>>>>>>> If I open up the console > analysis and perform it,
>>>>> that's
>>>>>>>> the final
>>>>>>>>>>>>>>>>>>>> result.
>>>>>>>>>>>>>>>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Your suggestion is: get rid of the <filter stopword.txt>
>>>>> in
>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> schema.xml and during index phase replaceAll("in
>>>>>>>> stopwords.txt"," ")
>>>>>>>>>>>>>>>> then
>>>>>>>>>>>>>>>>>>>> add to solr. Is that correct ?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks David
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>>>>>>>>>>>>>>>> [hidden email] <mailto:
>>>>>> [hidden email]
>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>>>>>>>> [hidden email]>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> no,
>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> is still using stopwords and should be removed, in my
>>>>>>>> opinion of
>>>>>>>>>>>>>>>> course,
>>>>>>>>>>>>>>>>>>>>>> based on your use case may be different, but i generally
>>>>>>>> axe any
>>>>>>>>>>>>>>>>>>>> reference
>>>>>>>>>>>>>>>>>>>>>> to them at all
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>> Haven't I done this here ?
>>>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
>>>>>>>>>>>>>>>> [hidden email] <mailto:
>>>>>> [hidden email]
>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:
>>>>>>>> [hidden email]>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Fwd to another server
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The first thing you should do is remove any reference
>>>>> to
>>>>>>>> stop
>>>>>>>>>>>>>>>> words
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> never use them, then re-index your data and try it
>>>>>> again.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>>>>>>>>>>>>>>> [hidden email] <mailto:[hidden email]>
>>>>>>>>>>>>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I am performing a search to match a name
>>>>> (text_field),
>>>>>>>> however
>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>> term
>>>>>>>>>>>>>>>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any
>>>>>>>> records. If i
>>>>>>>>>>>>>>>> remove
>>>>>>>>>>>>>>>>>>>>>>> 'a'
>>>>>>>>>>>>>>>>>>>>>>>>> then it works.
>>>>>>>>>>>>>>>>>>>>>>>>> e.g
>>>>>>>>>>>>>>>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>>>>> doesn't work:
>>>>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>>>>>>>>>>>>>>>>>> works:
>>>>>>>>>>>>>>>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>>>>>>>>>>>>>>>>>> interested in the first result
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> schema.xml
>>>>>>>>>>>>>>>>>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false" >
>>>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="index">
>>>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>>> <analyzer type="query">
>>>>>>>>>>>>>>>>>>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>>>>>>>>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>>>>>>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>>>>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> stopwords.txt
>>>>>>>>>>>>>>>>>>>>>>>>> #Standard english stop words taken from Lucene's
>>>>>>>> StopAnalyzer
>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>> b
>>>>>>>>>>>>>>>>>>>>>>>>> c
>>>>>>>>>>>>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Running SolR 6.6.2.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>> Guilherme
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *Paras Lehana* [65871]
>>>>>>>>>>>>>>>>>>> Development Engineer, Auto-Suggest,
>>>>>>>>>>>>>>>>>>> IndiaMART Intermesh Ltd.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>>>>>>>>>>>>>>>>>> Noida, UP, IN - 201303
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Mob.: +91-9560911996
>>>>>>>>>>>>>>>>>>> Work: 01203916600 | Extn:  *8173*
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> IMPORTANT:
>>>>>>>>>>>>>>>>>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
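As a minimal sketch of the debugging step suggested in the quoted thread (compare how /select and /search parse the same query by adding debug=query), the snippet below only builds the two request URLs and sends nothing. The host, the core name "reactome", and the helper name debug_url are assumptions for illustration; debug=query and echoParams=all are the parameters mentioned in the thread.

```python
from urllib.parse import urlencode

# Hypothetical base URL and core name; adjust to the real deployment.
SOLR_BASE = "http://localhost:8983/solr/reactome"


def debug_url(handler: str, query: str) -> str:
    """Build a Solr request URL that asks for query-parsing debug output."""
    params = {
        "q": query,
        "debug": "query",     # parsed query only, without the score explanations
        "echoParams": "all",  # echo handler defaults as well as request params
        "wt": "json",
    }
    return f"{SOLR_BASE}{handler}?{urlencode(params)}"


if __name__ == "__main__":
    term = "lymphoid and a non-lymphoid cell"
    # Build one URL per handler so the two responses can be compared.
    for handler in ("/select", "/search"):
        print(debug_url(handler, term))
```

Fetching both URLs and comparing the parsedquery entry under debug in each response should show where the two handlers diverge.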

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana
Hi Guilherme,

Have you tried reindexing the documents and comparing the results? No issues
if you cannot do that - let's try something else. I was going through the
whole mail and your files. You had said:

> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), then I
> don't get anything (which makes sense).


Why did you think that not getting anything when you add dbId made sense?
Asking because I may be missing something here.

Also, what is the purpose of so many qf fields? Going through your documents and
config files, I found that your dbId values are strings of digits, and I don't
think you want to match your query terms against dbId, right?
Do you want to boost the score by the values in dbId?

Your qf of dbId^100 boosts a document's score by 100x when terms in q match its
dbId. Since your terms don't match the dbId value of any document, the score
from that field is 0, and 100x or 1x of 0 is still 0.
I still need to see how this scoring gets added up in the edismax parser, but do
reevaluate the usage of these qfs. Same goes for the other qf boosts. :)
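
A toy illustration of the arithmetic above, not Solr's actual scoring code: each field's raw match score is multiplied by its qf boost, and a dismax-style combination takes the best field plus a tie-breaker fraction of the others. The raw scores, the tie value, and the function name are made up for illustration.

```python
def dismax_term_score(raw_scores, boosts, tie=0.0):
    """Toy dismax-style score for one query term across several fields.

    raw_scores: field -> unboosted match score (0.0 when the term is absent)
    boosts:     field -> qf boost for that field
    """
    boosted = [raw_scores.get(field, 0.0) * boost for field, boost in boosts.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # Best field wins; the remaining fields contribute only through the tie-breaker.
    return best + tie * (sum(boosted) - best)


if __name__ == "__main__":
    # Hypothetical numbers: the term matches in name but never in dbId.
    raw = {"name": 1.5, "dbId": 0.0}
    print(dismax_term_score(raw, {"name": 10.0, "dbId": 100.0}))
    print(dismax_term_score(raw, {"name": 10.0, "dbId": 1.0}))
```

In this model a field that never matches adds nothing to the score, whatever its boost. The model covers scoring only and says nothing about which documents match.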


On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri <[hidden email]> wrote:

> Hi Paras
> No worries.
> No I didn’t find anything. This is annoying now...
> Yes! They do contain dbId. Absolutely all my docs contain dbId and it is
> actually my key, if you check the schema.xml again
>
> Cheers
> Guilherme
>
> On 15 Nov 2019, at 05:37, Paras Lehana <[hidden email]> wrote:
>
>
> Hey Guilherme,
>
> I was a bit busy for the past few days and couldn't read your mail. So,
> did you find anything? Anyways, as I had expected, the culprit is
> definitely among the qfs. Do the documents in concern contain dbId? I
> suggest you cross-check the fields in your document with those impacting
> the result in qf.
>
> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <[hidden email]> wrote:
>
>> What I can't understand is:
>> I search for the exact term - "Immunoregulatory interactions between a
>> Lymphoid *and a* non-Lymphoid cell" and If i search "I search for the
>> exact term - Immunoregulatory interactions between a Lymphoid *and *non-Lymphoid
>> cell" then it works
>>
>> On 11 Nov 2019, at 12:24, Guilherme Viteri <[hidden email]> wrote:
>>
>> Thanks
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>> Yes. It always make sense the way we've been using.
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords
>> filter
>> is working as expected. The problem definitely lies in the configuration
>> of
>> edismax.
>>
>> I see.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>
>> Ok, using q now, removed all qf, performed the search and I got 23
>> results, and the one I really want, on the top.
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then
>> I don't get anything (which make sense). However if I query name_exact, I
>> get the 23 results again, and unfortunately if I query stId^1.0
>> name_exact^10.0 I still don't get any results.
>>
>> In summary
>> - without qf - 23 results
>> - dbId - 0 results
>> - name_exact - 16 results
>> - name - 23 results
>> - dbId^1.0
>>  name_exact^10.0 - 0 results
>> - 0 results if any other, stId, dbId (key) is added on top of the
>> name(name_exact, etc).
>>
>> Definitely lost here! :-/
>>
>>
>> On 11 Nov 2019, at 07:59, Paras Lehana <[hidden email]>
>> wrote:
>>
>> Hi
>>
>> So I don't think removing it completely is the way to go from the scenario
>>
>> we have
>>
>>
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>>
>> Quite a considerable increase
>>
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords
>> filter
>> is working as expected. The problem definitely lies in the configuration
>> of
>> edismax.
>>
>>
>>
>> I am sorry but I didn't understand what do you want me to do exactly with
>> the lst (??) and qf and bf.
>>
>>
>>
>> What combinations did you try? I was referring to the field-level boosting
>> you have applied in edismax config.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>> request handler. There are many qf and some bq boosts. I want you to
>> remove
>> all of these, check response again (with q now) and keep on adding them
>> again (one by one) while looking for when the numFound drastically
>> changes.
>>
>> On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]
>> >
>> wrote:
>>
>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>> pretty well for such a solution, but for a full index the size became
>> prohibitive
>>
>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]>
>> wrote:
>>
>> If we had IDF for phrases, they would be super effective. The 2X weight
>>
>> is
>>
>> a hack that mostly works.
>>
>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>
>> wunder
>> Walter Underwood
>> [hidden email]
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>
>> [hidden email]> wrote:
>>
>>
>> the pf and qf fields are REALLY nice for this
>>
>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>>
>> [hidden email]>
>>
>> wrote:
>>
>> I always enable phrase searching in edismax for exactly this reason.
>>
>> Something like:
>>
>>     <str name="qf”>title^8 keywords^4 text</str>
>>     <str name="pf”>title^16 keywords^8 text^2</str>
>>
>> To deal with concepts in queries, a classifier and/or named entity
>> extractor can be helpful. If you have a list of concepts (“controlled
>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>>
>> that
>>
>> term can be queried against the field matching that vocabulary.
>>
>> This is how LinkedIn separates people, companies, and places, for
>>
>> example.
>>
>>
>> wunder
>> Walter Underwood
>> [hidden email]
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]
>>
>>
>> wrote:
>>
>>
>> Look at the “mm” parameter, try setting it to 100%. Although that’t
>>
>> not
>>
>> entirely likely to do what you want either since virtually every doc
>>
>> will
>>
>> have “a” in it. But at least you’d get docs that have both terms.
>>
>>
>> you may also be able to search for things like “Lamin A” _only as a
>>
>> phrase_ and have some luck. But this is a gnarly problem in general.
>>
>> Some
>>
>> people have been able to substitute synonyms and/or shingles to make
>>
>> this
>>
>> work at the expense of a larger index.
>>
>>
>> This is a generic problem with context. “Lamin A” is really a
>>
>> “concept”,
>>
>> not just two words that happen to be near each other. Searching as a
>>
>> phrase
>>
>> is an OOB-but-naive way to try to make it more likely that the ranked
>> results refer to the _concept_ of “Lamin A”. The assumption here is
>>
>> “if
>>
>> these two words appear next to each other, they’re more likely to be
>>
>> what I
>>
>> want”. I say “naive” because “Lamins: A new approach to...” would
>>
>> _also_ be
>>
>> found for a naive phrase search. (I have no idea whether such a title
>>
>> makes
>>
>> sense or not, but you figured that out already)...
>>
>>
>> To do this well you’d have to dive in to NLP/Machine learning.
>>
>> I truly wish we could have the DWIM search algorithm (Do What I
>>
>> Mean)….
>>
>>
>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]>
>>
>> wrote:
>>
>>
>> HI Walter and Paras
>>
>> I indexed it removing all the references to StopWordFilter and I
>>
>> went
>>
>> from 121 results to near 20K as the search term q="Lymphoid and a
>> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
>>
>> So I
>>
>> don't think removing it completely is the way to go from the scenario
>>
>> we
>>
>> have, but I appreciate the suggestion…
>>
>>
>> Yes the response is using fl=*
>> I am trying some combinations at the moment, but yet no success.
>>
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, even though reasonable meaningful
>>
>> results.
>>
>>
>> I am sorry but I didn't understand what do you want me to do exactly
>>
>> with the lst (??) and qf and bf.
>>
>>
>> Thanks everyone with their inputs
>>
>>
>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]>
>>
>> wrote:
>>
>>
>> Hi Guilherme
>>
>> By accident, I ended up querying the using the default handler
>>
>> (/select) and it worked.
>>
>>
>> You've just found the culprit. Thanks for giving the material I
>>
>> requested. Your analysis chain is working as expected. I don't see any
>> issue in either StopWordFilter or your boosts. I also use a boost of
>>
>> 50
>>
>> when boosting contextual suggestions (boosting "gold iphone" on a page
>>
>> of
>>
>> iphone) but I take Walter's suggestion and would try to optimize my
>> weights. I agree that this 50 thing was not researched much about by
>>
>> us
>>
>> as
>>
>> well (we never faced performance or relevance issues).
>>
>>
>> See the major difference in both the handlers - edismax. I'm pretty
>>
>> sure that your problem lies in the parsing of queries (you can confirm
>>
>> that
>>
>> from parsedquery key in debug of both JSON responses). I hope you have
>> provided the response with fl=*. Replace q with q.alt in your /search
>> handler query and I think you should start getting responses. That's
>> because q.alt uses standard parser. If you want to keep using
>>
>> edisMax, I
>>
>> suggest you to test the responses removing some combination of lst
>>
>> (qf,
>>
>> bf)
>>
>> and find what's restricting the documents to come up. I'm out of
>>
>> office
>>
>> today - would have certainly tried analyzing the field values of the
>> document in /select request and compare it with qf/bq in
>>
>> solrconfig.xml
>>
>> /search. Do this for me and you'd certainly find something.
>>
>>
>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>>
>> [hidden email]
>>
>> <mailto:[hidden email]>> wrote:
>>
>> I normally use a weight of 8 for the most important field, like
>>
>> title.
>>
>> Other fields might get a 4 or 2.
>>
>>
>> I add a “pf” field with the weights doubled, so that phrase matches
>>
>> have a higher weight.
>>
>>
>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>
>> early web search engines. With different relevance algorithms and
>>
>> totally
>>
>> different evaluation and tuning systems, they settled on weights of 8
>>
>> and
>>
>> 7.5 for HTML titles. With the the two radically different system
>>
>> getting
>>
>> the same number, I decided that was a property of the documents, not
>>
>> of
>>
>> the
>>
>> search engines.
>>
>>
>> wunder
>> Walter Underwood
>> [hidden email] <mailto:[hidden email]>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>
>> (my blog)
>>
>>
>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]
>>
>> <mailto:[hidden email]>> wrote:
>>
>>
>> Hi Wunder,
>>
>> My indexer takes quite a few hours to be executed I am shortening
>>
>> it
>>
>> to run faster, but I also need to make sure it gives what we are
>>
>> expecting.
>>
>> This implementation's been there for >4y, and massively used.
>>
>>
>> In your edismax handlers, weights of 20, 50, and 100 are
>>
>> extremely
>>
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>
>> years
>>
>> of configuring Solr.
>>
>> I've inherited that implementation and I am really keen to
>>
>> adequate
>>
>> it, what would you recommend ?
>>
>>
>> Cheers
>> Guilherme
>>
>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]
>>
>> <mailto:[hidden email]>> wrote:
>>
>>
>> Thanks for posting the files. Looking at schema.xml, I see that
>>
>> you
>>
>> still are using StopFilterFactory. The first advice we gave you was to
>> remove that.
>>
>>
>> Remove StopFilterFactory everywhere and reindex.
>>
>> You will continue to have problems matching stopwords until you
>>
>> do
>>
>> that.
>>
>>
>> In your edismax handlers, weights of 20, 50, and 100 are
>>
>> extremely
>>
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>
>> years
>>
>> of configuring Solr.
>>
>>
>> wunder
>> Walter Underwood
>> [hidden email] <mailto:[hidden email]>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
>>
>>
>> (my blog)
>>
>>
>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]
>>
>> <mailto:[hidden email]>> wrote:
>>
>>
>> Hi Paras, everyone
>>
>> Thank you again for your inputs and suggestions. I sorry to hear
>>
>> you had trouble with the attachments I will host it somewhere and
>>
>> share
>>
>> the
>>
>> links.
>>
>> I don't tweak my index, I get the data from the graph database,
>>
>> create a document as they are and save to solr.
>>
>>
>> So, I am sending the new analysis screen querying the way you
>>
>> suggested. Also the results with params and solr query url.
>>
>>
>> During the process of querying what you asked I found something
>>
>> really weird (at least for me). By accident, I ended up querying the
>>
>> using
>>
>> the default handler (/select) and it worked. Then If I use the one I
>>
>> must
>>
>> use, then sadly doesn't work. I am posting both results and I will
>>
>> also
>>
>> post the handlers as well.
>>
>>
>> Here is the link with all the files mentioned before
>>
>>
>>
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>> <
>>
>>
>>
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>> >
>>
>> <
>>
>>
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>
>> <
>>
>>
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>
>>
>> If the link doesn't work www dot dropbox dot com slash sh slash
>>
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>
>>
>> Thanks
>>
>> On 7 Nov 2019, at 05:23, Paras Lehana <
>>
>> [hidden email]
>>
>> <mailto:[hidden email]>> wrote:
>>
>>
>> Hi Guilherme.
>>
>> I am sending they analysis result and the json result as
>>
>> requested.
>>
>>
>>
>> Thanks for the effort. Luckily, I can see your attachments (low
>>
>> quality
>>
>> though).
>>
>> From the analysis screen, the analysis is working as expected.
>>
>> One
>>
>> of the
>>
>> reasons for query="lymphoid and *a* non-lymphoid cell" not
>>
>> matching
>>
>> document containing "Lymphoid and a non-Lymphoid cell" I can
>>
>> initially
>>
>> think of is: the stopword "a" is probably present in
>>
>> post-analysis
>>
>> either
>>
>> of query or index. Did you tweak your index time analysis after
>>
>> indexing?
>>
>>
>> Do two things:
>>
>> 1. Post the analysis screen for and index=*"Immunoregulatory
>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>> "query=*"lymphoid
>> and a non-lymphoid cell"*. Try hosting the image and providing
>>
>> the
>>
>> link
>>
>> here.
>> 2. Give the same JSON output as you have sent but this time
>>
>> with
>>
>> *"echoParams=all"*. Also, post the exact Solr query url.
>>
>>
>>
>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>
>> [hidden email] <mailto:[hidden email]>> wrote:
>>
>>
>> I don’t see the attachments, maybe I deleted old e-mails or
>>
>> some
>>
>> such. The
>>
>> Apache server is fairly aggressive about stripping attachments
>>
>> though, so
>>
>> it’s also possible they didn’t make it through.
>>
>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>>
>> [hidden email]
>>
>> <mailto:[hidden email]>> wrote:
>>
>>
>> Thanks Erick.
>>
>> First, your index and analysis chains are considerably
>>
>> different, this
>>
>> can easily be a source of problems. In particular, using two
>>
>> different
>>
>> tokenizers is a huge red flag. I _strongly_ recommend against
>>
>> this unless
>>
>> you’re totally sure you understand the consequences.
>>
>> Additionally, your use
>>
>> of the length filter is suspicious, especially since your
>>
>> problem
>>
>> statement
>>
>> is about the addition of a single letter term and the min
>>
>> length
>>
>> allowed on
>>
>> that filter is 2. That said, it’s reasonable to suppose that
>>
>> the
>>
>> ’a’ is
>>
>> filtered out in both cases, but maybe you’ve found something
>>
>> odd
>>
>> about the
>>
>> interactions.
>>
>> I will investigate the min length and post the results later.
>>
>> Second, I have no idea what this will do. Are the equal
>>
>> signs
>>
>> typos?
>>
>> Used by custom code?
>>
>> This the url in my application, not solr params. That's the
>>
>> query string.
>>
>>
>> What does “species=“ do? That’s not Solr syntax, so it’s
>>
>> likely
>>
>> that
>>
>> all the params with an equal-sign are totally ignored unless
>>
>> it’s
>>
>> just a
>>
>> typo.
>>
>> This is part of the application. Species will be used later
>>
>> on
>>
>> in solr
>>
>> to filter out the result. That's not solr. That my app params.
>>
>>
>> Third, the easiest way to see what’s happening under the
>>
>> covers
>>
>> is to
>>
>> add “&debug=true” to the query and look at the parsed query.
>>
>> Ignore all the
>>
>> relevance calculations for the nonce, or specify
>>
>> “&debug=query”
>>
>> to skip
>>
>> that part.
>>
>> The two json files i've sent, they are debugQuery=on and the
>>
>> explain tag
>>
>> is present.
>>
>> I will try the searching the way you mentioned.
>>
>> Thank for your inputs
>>
>> Guilherme
>>
>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>
>> [hidden email] <mailto:[hidden email]>>
>>
>> wrote:
>>
>>
>> Fwd to another server
>>
>> First, your index and analysis chains are considerably
>>
>> different, this
>>
>> can easily be a source of problems. In particular, using two
>>
>> different
>>
>> tokenizers is a huge red flag. I _strongly_ recommend against
>>
>> this unless
>>
>> you’re totally sure you understand the consequences.
>>
>> Additionally, your use
>>
>> of the length filter is suspicious, especially since your
>>
>> problem
>>
>> statement
>>
>> is about the addition of a single letter term and the min
>>
>> length
>>
>> allowed on
>>
>> that filter is 2. That said, it’s reasonable to suppose that
>>
>> the
>>
>> ’a’ is
>>
>> filtered out in both cases, but maybe you’ve found something
>>
>> odd
>>
>> about the
>>
>> interactions.
>>
>>
>> Second, I have no idea what this will do. Are the equal
>>
>> signs
>>
>> typos?
>>
>> Used by custom code?
>>
>>
>>
>>
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> <
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>>
>>
>> What does “species=“ do? That’s not Solr syntax, so it’s
>>
>> likely
>>
>> that
>>
>> all the params with an equal-sign are totally ignored unless
>>
>> it’s
>>
>> just a
>>
>> typo.
>>
>>
>> Third, the easiest way to see what’s happening under the
>>
>> covers
>>
>> is to
>>
>> add “&debug=true” to the query and look at the parsed query.
>>
>> Ignore all the
>>
>> relevance calculations for the nonce, or specify
>>
>> “&debug=query”
>>
>> to skip
>>
>> that part.
>>
>>
>> 90% + of the time, the question “why didn’t this query do
>>
>> what I
>>
>> expect” is answered by looking at the “&debug=query” output
>>
>> and
>>
>> the
>>
>> analysis page in the admin UI. NOTE: for the analysis page be
>>
>> sure to look
>>
>> at _both_ the query and index output. Also, and very important
>>
>> about the
>>
>> analysis page (and this is confusing) is that this _assumes_
>>
>> that
>>
>> what you
>>
>> put in the text boxes have made it through the query parser
>>
>> intact and is
>>
>> analyzed by the field selected. Consider the search
>>
>> "q=field:word1 word2".
>>
>> Now you type “word1 word2” into the analysis text box and it
>>
>> looks like
>>
>> what you expect. That’s misleading because the query is
>>
>> _parsed_
>>
>> as
>>
>> "field:word1 default_search_field:word2”. This is where
>>
>> “&debug=query”
>>
>> helps.
>>
>>
>> Best,
>> Erick
>>
>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>
>> [hidden email] <mailto:[hidden email]>>
>>
>> wrote:
>>
>>
>> Hi Walter,
>>
>> The solr.StopFilter removes all tokens that are stopwords.
>>
>> Those words
>>
>> will
>>
>> not be in the index, so they can never match a query.
>>
>>
>>
>> I think the OP's concern is different results when adding a
>>
>> stopword. I
>>
>> think he's using the filter factory correctly - the query
>>
>> chain
>>
>> includes
>>
>> the filter as well so it should remove "a" while querying.
>>
>> *@Guilherme*, please post results for both the query, the
>>
>> document in
>>
>> result you are concerned about and post full result of
>>
>> analysis screen
>>
>> (for
>>
>> both query and index).
>>
>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>
>> [hidden email] <mailto:[hidden email]>>
>>
>> wrote:
>>
>>
>> No.
>>
>> The solr.StopFilter removes all tokens that are stopwords.
>>
>> Those words
>>
>> will not be in the index, so they can never match a query.
>>
>> 1. Remove the lines with solr.StopFilter from every
>>
>> analysis
>>
>> chain in
>>
>> schema.xml.
>> 2. Reload the collection, restart Solr, or whatever to
>>
>> read
>>
>> the new
>>
>> config.
>>
>> 3. Reindex all of the documents.
>>
>> When indexed with the new analysis chain, the stopwords
>>
>> will
>>
>> not be
>>
>> removed and they will be searchable.
>>
>> wunder
>> Walter Underwood
>> [hidden email] <mailto:[hidden email]>
>> http://observer.wunderwood.org/ <
>>
>> http://observer.wunderwood.org/>  (my blog)
>>
>>
>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>
>> [hidden email] <mailto:[hidden email]>>
>>
>> wrote:
>>
>>
>> Ok. I am kind a lost now.
>> If I open up the console > analysis and perform it,
>>
>> that's
>>
>> the final
>>
>> result.
>>
>> <Screenshot 2019-11-05 at 14.54.16.png>
>>
>> Your suggestion is: get rid of the <filter stopword.txt>
>>
>> in
>>
>> the
>>
>> schema.xml and during index phase replaceAll("in
>>
>> stopwords.txt"," ")
>>
>> then
>>
>> add to solr. Is that correct ?
>>
>>
>> Thanks David
>>
>> On 5 Nov 2019, at 14:48, David Hastings <
>>
>> [hidden email] <mailto:
>>
>> [hidden email]
>>
>>
>> <mailto:[hidden email] <mailto:
>>
>> [hidden email]>>> wrote:
>>
>>
>> Fwd to another server
>>
>> no,
>>   <filter class="solr.StopFilterFactory"
>>
>> ignoreCase="true"
>>
>> words="stopwords.txt"/>
>>
>> is still using stopwords and should be removed, in my
>>
>> opinion of
>>
>> course,
>>
>> based on your use case may be different, but i generally
>>
>> axe any
>>
>> reference
>>
>> to them at all
>>
>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>
>> [hidden email] <mailto:[hidden email]>
>>
>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>
>> wrote:
>>
>>
>> Thanks.
>> Haven't I done this here ?
>> <fieldType name="text_field" class="solr.TextField"
>> positionIncrementGap="100" omitNorms="false" >
>> <analyzer type="index">
>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>   <filter class="solr.ClassicFilterFactory"/>
>>   <filter class="solr.LengthFilterFactory" min="2"
>>
>> max="20"/>
>>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory"
>>
>> ignoreCase="true"
>>
>> words="stopwords.txt"/>
>> </analyzer>
>>
>>
>> On 5 Nov 2019, at 14:15, David Hastings <
>>
>> [hidden email] <mailto:
>>
>> [hidden email]
>>
>>
>> <mailto:[hidden email] <mailto:
>>
>> [hidden email]>>>
>>
>> wrote:
>>
>>
>> Fwd to another server
>>
>> The first thing you should do is remove any reference
>>
>> to
>>
>> stop
>>
>> words
>>
>> and
>>
>> never use them, then re-index your data and try it
>>
>> again.
>>
>>
>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>
>> [hidden email] <mailto:[hidden email]>
>>
>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>
>> wrote:
>>
>>
>> Hi,
>>
>> I am performing a search to match a name
>>
>> (text_field),
>>
>> however
>>
>> this
>>
>> term
>>
>> contains 'and' and 'a' and it doesn't return any
>>
>> records. If i
>>
>> remove
>>
>> 'a'
>>
>> then it works.
>> e.g
>> Search Term: lymphoid and a non-lymphoid cell
>> doesn't work:
>>
>>
>>
>>
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> <
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>>
>> <
>>
>>
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> <
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>>
>>
>> <
>>
>>
>>
>>
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> <
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>>
>>
>>
>> Search term: lymphoid and non-lymphoid cell
>> works:
>>
>>
>>
>>
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> <
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>>
>> <
>>
>>
>>
>>
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> <
>>
>>
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>>
>>
>> interested in the first result
>>
>> schema.xml
>> <field name="name"
>>
>> type="text_field"
>>
>> indexed="true"  stored="true"   omitNorms="false"
>>
>> required="true"
>>
>> multiValued="false"/>
>>
>> <analyzer type="query">
>>   <tokenizer class="solr.PatternTokenizerFactory"
>> pattern="[^a-zA-Z0-9/._:]"/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="^[/._:]+" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[/._:]+$" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[_]" replacement=" "/>
>>   <filter class="solr.LengthFilterFactory" min="2"
>>
>> max="20"/>
>>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory"
>>
>> ignoreCase="true"
>>
>> words="stopwords.txt"/>
>> </analyzer>
>>
>> <fieldType name="text_field" class="solr.TextField"
>> positionIncrementGap="100" omitNorms="false" >
>> <analyzer type="index">
>>   <tokenizer
>>
>> class="solr.StandardTokenizerFactory"/>
>>
>>   <filter class="solr.ClassicFilterFactory"/>
>>   <filter class="solr.LengthFilterFactory" min="2"
>>
>> max="20"/>
>>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory"
>>
>> ignoreCase="true"
>>
>> words="stopwords.txt"/>
>> </analyzer>
>> <analyzer type="query">
>>   <tokenizer class="solr.PatternTokenizerFactory"
>> pattern="[^a-zA-Z0-9/._:]"/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="^[/._:]+" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[/._:]+$" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[_]" replacement=" "/>
>>   <filter class="solr.LengthFilterFactory" min="2"
>>
>> max="20"/>
>>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory"
>>
>> ignoreCase="true"
>>
>> words="stopwords.txt"/>
>> </analyzer>
>> </fieldType>
>>
>> stopwords.txt
>> #Standard english stop words taken from Lucene's
>>
>> StopAnalyzer
>>
>> a
>> b
>> c
>> ....
>> an
>> and
>> are
>>
>> Running SolR 6.6.2.
>>
>> Is there anything I could do to prevent this ?
>>
>> Thanks
>> Guilherme
>>
>>
>>
>>
>>
>>
>>
>> --
>> --
>> Regards,
>>
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>>
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>>
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  *8173*
>>
>> --
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
>>
>>
>>
>>
>>
>>
>> --
>> --
>> Regards,
>>
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>>
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>>
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  *8173*
>>
>> --
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
>>
>>
>>
>>
>>
>>
>>
>> --
>> --
>> Regards,
>>
>> Paras Lehana [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>>
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>>
>> Mob.: +91-9560911996 <tel:+91-9560911996>
>> Work: 01203916600 | Extn:  8173
>>
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> --
>> Regards,
>>
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>>
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>>
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  *8173*
>>
>> --
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
>>
>>
>>
>>
>>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.
>
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri
Hi,

> Have you tried reindexing the documents and compare the results? No issues
> if you cannot do that - let's try something else. I was going through the
> whole mail and your files. You had said:
Yes, but since it didn't work as suggested, I kept it as you suggested.

>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>> don't get anything (which make sense).
>
> Why did you think that not getting anything when you add dbId made sense?
> Asking because I may be missing something here.
I am searching for text, and I was searching in an ID field, so getting no match there made sense to me.
(I will come back to this soon.)

OK, I've been adding and removing fields in the qf and I could isolate half of the problem. First, I have a field type called keyword_field; I added the stopwords filter to it and it worked. Second, the other half shows up when I add the fields of type id (<fieldType name="id" class="solr.StrField" />).

Do you think I should also add the stopwords filter to the fieldType id?
(I tried, and it worked, but I am not sure this is conceptually correct; as I understand it, an id should remain intact.)

Thanks
Guilherme
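
For reference, the keyword_field change described above could look roughly like the sketch below. The real definition of keyword_field is not shown in this thread, so the tokenizer and lowercase filter are assumptions; only the StopFilterFactory line mirrors the one already used in text_field. The filter is added to the query analyzer only, so the indexed values stay untouched.

```xml
<!-- Hypothetical definition: the real keyword_field in schema.xml may differ -->
<fieldType name="keyword_field" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumption: the whole value is indexed as a single lowercased token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same stopword list as text_field, so a query token that is exactly "and" or "a" is dropped here too -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

This only helps if the query is split on whitespace before analysis, so that "and" and "a" reach the filter as separate tokens; the parsedquery section of the debug output shows whether that is the case.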

> On 18 Nov 2019, at 05:37, Paras Lehana <[hidden email]> wrote:
>
> Hi Guilherme,
>
> Have you tried reindexing the documents and compare the results? No issues
> if you cannot do that - let's try something else. I was going through the
> whole mail and your files. You had said:
>
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>> don't get anything (which make sense).
>
>
> Why did you think that not getting anything when you add dbId made sense?
> Asking because I may be missing something here.
>
> Also, what is the purpose of so many qf's? Going through your documents and
> config files, I found that your dbId's are string of numbers and I don't
> think you want to find your query terms in dbId, right?
> Do you want to boost the score by the values in dbId?
>
> Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
> your terms don't match with the values in dbId for any document, the score
> produced by this scoring is 0. 100x or 1x of 0 is still 0.
> I still need to see how this scoring gets added up in edismax parser but do
> reevaluate the usage of these qfs. Same goes for other qf boosts. :)
>
>
> On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri <[hidden email]> wrote:
>
>> Hi Paras
>> No worries.
>> No I didn’t find anything. This is annoying now...
>> Yes! They do contain dbId. Absolutely all my docs contains dbId and it is
>> actually my key, if you check again the schema.xml
>>
>> Cheers
>> Guilherme
>>
>> On 15 Nov 2019, at 05:37, Paras Lehana <[hidden email]> wrote:
>>
>>
>> Hey Guilherme,
>>
>> I was a bit busy for the past few days and couldn't read your mail. So,
>> did you find anything? Anyways, as I had expected, the culprit is
>> definitely among the qfs. Do the documents in concern contain dbId? I
>> suggest you to cross check the fields in your document with those impacting
>> the result in qf.
>>
>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <[hidden email]> wrote:
>>
>>> What I can't understand is:
>>> I search for the exact term - "Immunoregulatory interactions between a
>>> Lymphoid *and a* non-Lymphoid cell" - and it doesn't work, but if I search
>>> for "Immunoregulatory interactions between a Lymphoid *and* non-Lymphoid
>>> cell" then it works.
>>>
>>> On 11 Nov 2019, at 12:24, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Thanks
>>>
>>> Removing stopwords is another story. I'm curious to find the reason
>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>> really necessary.
>>>
>>> Yes. It has always made sense the way we've been using them.
>>>
>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>> is working as expected. The problem definitely lies in the configuration of
>>> edismax.
>>>
>>> I see.
>>>
>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>>
>>> OK, using q now, I removed all qf, performed the search and got 23
>>> results, with the one I really want at the top.
>>> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), then
>>> I don't get anything (which makes sense). However, if I query name_exact, I
>>> get the 23 results again, and unfortunately if I query stId^1.0
>>> name_exact^10.0 I still don't get any results.
>>>
>>> In summary
>>> - without qf - 23 results
>>> - dbId - 0 results
>>> - name_exact - 16 results
>>> - name - 23 results
>>> - dbId^1.0 name_exact^10.0 - 0 results
>>> - 0 results if any other field (stId, dbId (key)) is added on top of the
>>> name fields (name_exact, etc.).
>>>
>>> Definitely lost here! :-/
>>>
>>>
>>> On 11 Nov 2019, at 07:59, Paras Lehana <[hidden email]>
>>> wrote:
>>>
>>> Hi
>>>
>>> So I don't think removing it completely is the way to go from the scenario
>>> we have
>>>
>>>
>>>
>>> Removing stopwords is another story. I'm curious to find the reason
>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>> really necessary.
>>>
>>>
>>> Quite a considerable increase
>>>
>>>
>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>> is working as expected. The problem definitely lies in the configuration of
>>> edismax.
>>>
>>>
>>>
>>> I am sorry but I didn't understand what do you want me to do exactly with
>>> the lst (??) and qf and bf.
>>>
>>>
>>>
>>> What combinations did you try? I was referring to the field-level boosting
>>> you have applied in edismax config.
>>>
>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>> request handler. There are many qf and some bq boosts. I want you to remove
>>> all of these, check the response again (with q now) and then add them back
>>> one by one, watching for the point where numFound changes drastically.
>>>
>>> On Fri, 8 Nov 2019 at 23:47, David Hastings <[hidden email]> wrote:
>>>
>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>> pretty well for such a solution, but for a full index the size became
>>> prohibitive
>>>
>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <[hidden email]> wrote:
>>>
>>> If we had IDF for phrases, they would be super effective. The 2X weight is
>>> a hack that mostly works.
>>>
>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>> On Nov 8, 2019, at 11:08 AM, David Hastings <[hidden email]> wrote:
>>>
>>> the pf and qf fields are REALLY nice for this
>>>
>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <[hidden email]> wrote:
>>>
>>> I always enable phrase searching in edismax for exactly this reason.
>>>
>>> Something like:
>>>
>>>    <str name="qf">title^8 keywords^4 text</str>
>>>    <str name="pf">title^16 keywords^8 text^2</str>
>>>
>>> To deal with concepts in queries, a classifier and/or named entity
>>> extractor can be helpful. If you have a list of concepts (“controlled
>>> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
>>> term can be queried against the field matching that vocabulary.
>>>
>>> This is how LinkedIn separates people, companies, and places, for example.
>>>
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <[hidden email]> wrote:
>>>
>>> Look at the “mm” parameter, try setting it to 100%. Although that’s not
>>> entirely likely to do what you want either since virtually every doc will
>>> have “a” in it. But at least you’d get docs that have both terms.
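A minimal sketch of the mm suggestion, assuming it is set as a handler default rather than per request. The handler name /search and the field name come from this thread; everything else in the real handler is omitted here.

```xml
<!-- Illustrative fragment only: edismax with every optional clause required -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name</str>
    <!-- mm = "minimum should match"; 100% requires all optional clauses to match -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

The same value can also be sent as the mm parameter on a single request, which is an easier way to experiment before changing solrconfig.xml.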
>>>
>>>
>>> You may also be able to search for things like “Lamin A” _only as a
>>> phrase_ and have some luck. But this is a gnarly problem in general. Some
>>> people have been able to substitute synonyms and/or shingles to make this
>>> work at the expense of a larger index.
>>>
>>>
>>> This is a generic problem with context. “Lamin A” is really a “concept”,
>>> not just two words that happen to be near each other. Searching as a phrase
>>> is an OOB-but-naive way to try to make it more likely that the ranked
>>> results refer to the _concept_ of “Lamin A”. The assumption here is “if
>>> these two words appear next to each other, they’re more likely to be what I
>>> want”. I say “naive” because “Lamins: A new approach to...” would _also_ be
>>> found for a naive phrase search. (I have no idea whether such a title makes
>>> sense or not, but you figured that out already)...
>>>
>>> To do this well you’d have to dive in to NLP/Machine learning.
>>>
>>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
>>>
>>>
>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Hi Walter and Paras
>>>
>>> I indexed it removing all the references to StopWordFilter and I went
>>> from 121 results to near 20K, as the search term q="Lymphoid and a
>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A". So I
>>> don't think removing it completely is the way to go from the scenario we
>>> have, but I appreciate the suggestion…
>>>
>>> Yes the response is using fl=*
>>> I am trying some combinations at the moment, but no success yet.
>>>
>>> defType=edismax
>>> q.alt=Lymphoid and a non-Lymphoid cell
>>> Number of results=1599
>>> Quite a considerable increase, even though reasonably meaningful results.
>>>
>>> I am sorry but I didn't understand what you want me to do exactly
>>> with the lst (??) and qf and bf.
>>>
>>> Thanks everyone for your input
>>>
>>>
>>> On 8 Nov 2019, at 06:45, Paras Lehana <[hidden email]> wrote:
>>>
>>> Hi Guilherme
>>>
>>> By accident, I ended up querying using the default handler
>>> (/select) and it worked.
>>>
>>> You've just found the culprit. Thanks for giving the material I
>>> requested. Your analysis chain is working as expected. I don't see any
>>> issue in either StopWordFilter or your boosts. I also use a boost of 50
>>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>>> iphone) but I take Walter's suggestion and will try to optimize my
>>> weights. I agree that this 50 was not researched much by us either
>>> (we never faced performance or relevance issues).
>>>
>>> See the major difference between the two handlers - edismax. I'm pretty
>>> sure that your problem lies in the parsing of queries (you can confirm that
>>> from the parsedquery key in the debug section of both JSON responses). I
>>> hope you have provided the response with fl=*. Replace q with q.alt in your
>>> /search handler query and I think you should start getting responses. That's
>>> because q.alt uses the standard parser. If you want to keep using edismax, I
>>> suggest you test the responses after removing some combination of lst (qf,
>>> bf) and find what's keeping the documents from coming up. I'm out of office
>>> today - otherwise I would certainly have tried analyzing the field values of
>>> the document in the /select request and comparing them with qf/bq in the
>>> solrconfig.xml /search handler. Do this for me and you'll certainly find
>>> something.
>>>
>>>
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <[hidden email]> wrote:
>>>
>>> I normally use a weight of 8 for the most important field, like title.
>>> Other fields might get a 4 or 2.
>>>
>>> I add a “pf” field with the weights doubled, so that phrase matches
>>> have a higher weight.
>>>
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>> early web search engines. With different relevance algorithms and totally
>>> different evaluation and tuning systems, they settled on weights of 8 and
>>> 7.5 for HTML titles. With the two radically different systems getting
>>> the same number, I decided that was a property of the documents, not of the
>>> search engines.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Hi Wunder,
>>>
>>> My indexer takes quite a few hours to run. I am shortening it
>>> to run faster, but I also need to make sure it gives what we are expecting.
>>> This implementation's been there for >4y, and massively used.
>>>
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>>> of configuring Solr.
>>>
>>> I've inherited that implementation and I am really keen to adjust
>>> it. What would you recommend?
>>>
>>>
>>> Cheers
>>> Guilherme
>>>
>>> On 7 Nov 2019, at 14:43, Walter Underwood <[hidden email]> wrote:
>>>
>>> Thanks for posting the files. Looking at schema.xml, I see that you
>>> still are using StopFilterFactory. The first advice we gave you was to
>>> remove that.
>>>
>>> Remove StopFilterFactory everywhere and reindex.
>>>
>>> You will continue to have problems matching stopwords until you do that.
>>>
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>>> of configuring Solr.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Hi Paras, everyone
>>>
>>> Thank you again for your input and suggestions. I am sorry to hear
>>> you had trouble with the attachments. I will host them somewhere and share
>>> the links.
>>>
>>> I don't tweak my index; I get the data from the graph database,
>>> create a document as they are and save to Solr.
>>>
>>> So, I am sending the new analysis screen, querying the way you
>>> suggested, and also the results with params and the Solr query URL.
>>>
>>> During the process of querying what you asked I found something
>>> really weird (at least for me). By accident, I ended up querying using
>>> the default handler (/select) and it worked. Then if I use the one I must
>>> use, it sadly doesn't work. I am posting both results and I will also
>>> post the handlers.
>>>
>>> Here is the link with all the files mentioned before
>>>
>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>>
>>> If the link doesn't work www dot dropbox dot com slash sh slash
>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>
>>>
>>> Thanks
>>>
>>> On 7 Nov 2019, at 05:23, Paras Lehana <[hidden email]> wrote:
>>>
>>> Hi Guilherme.
>>>
>>> I am sending the analysis result and the json result as requested.
>>>
>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>> though).
>>>
>>> From the analysis screen, the analysis is working as expected. One of the
>>> reasons I can initially think of for query="lymphoid and *a* non-lymphoid
>>> cell" not matching a document containing "Lymphoid and a non-Lymphoid cell"
>>> is: the stopword "a" is probably present post-analysis in either the query
>>> or the index. Did you tweak your index-time analysis after indexing?
>>>
>>> Do two things:
>>>
>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>> providing the link here.
>>> 2. Give the same JSON output as you have sent but this time with
>>> *"echoParams=all"*. Also, post the exact Solr query URL.
>>>
>>>
>>>
>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <[hidden email]> wrote:
>>>
>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>> Apache server is fairly aggressive about stripping attachments though, so
>>> it’s also possible they didn’t make it through.
>>>
>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Thanks Erick.
>>>
>>> First, your index and query analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
>>>
>>> I will investigate the min length and post the results later.
>>>
>>> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
>>>
>>> This is the URL in my application, not Solr params. That's the query string.
>>>
>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
>>>
>>> This is part of the application. Species will be used later on in Solr
>>> to filter the results. That's not Solr; those are my app params.
>>>
>>> Third, the easiest way to see what’s happening under the covers is to
>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>> that part.
>>>
>>> The two JSON files I've sent were run with debugQuery=on, and the explain
>>> tag is present.
>>>
>>> I will try searching the way you mentioned.
>>>
>>> Thanks for your input
>>>
>>> Guilherme
>>>
>>> On 6 Nov 2019, at 14:14, Erick Erickson <[hidden email]> wrote:
>>>
>>> Fwd to another server
>>>
>>> First, your index and query analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
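One way to rule out a mismatch between index-time and query-time analysis is to declare a single analyzer, which Solr then applies to both. This is a sketch only: it reuses the index chain quoted elsewhere in this thread, and whether that chain suits the query side of this application is an assumption to verify on the analysis page.

```xml
<fieldType name="text_field" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <!-- No type attribute: the same chain is used for indexing and querying -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

With a shared chain, the index and query columns on the analysis page should show identical tokens for the same input text.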
>>>
>>>
>>> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
>>>
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>
>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
>>>
>>> Third, the easiest way to see what’s happening under the covers is to
>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>> that part.
>>>
>>>
>>> 90% + of the time, the question “why didn’t this query do what I
>>> expect” is answered by looking at the “&debug=query” output and the
>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>> at _both_ the query and index output. Also, and very important about the
>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>> put in the text boxes has made it through the query parser intact and is
>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>> Now you type “word1 word2” into the analysis text box and it looks like
>>> what you expect. That’s misleading because the query is _parsed_ as
>>> "field:word1 default_search_field:word2". This is where “&debug=query”
>>> helps.
>>>
>>>
>>> Best,
>>> Erick
>>>
>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <[hidden email]> wrote:
>>>
>>> Hi Walter,
>>>
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>>
>>> I think the OP's concern is different results when adding a stopword. I
>>> think he's using the filter factory correctly - the query chain includes
>>> the filter as well so it should remove "a" while querying.
>>>
>>> *@Guilherme*, please post the results for both queries, the document in
>>> the results you are concerned about, and the full output of the analysis
>>> screen (for both query and index).
>>>
>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <[hidden email]> wrote:
>>>
>>> No.
>>>
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>>
>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>> schema.xml.
>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>> config.
>>> 3. Reindex all of the documents.
>>>
>>> When indexed with the new analysis chain, the stopwords will not be
>>> removed and they will be searchable.
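Step 1 applied to the index analyzer quoted in this thread might look like the sketch below; the query analyzer would need the same line removed. One observation that is not part of the advice above: the LengthFilterFactory with min="2" would still drop one-letter tokens such as "a", so that setting may need a separate look.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ClassicFilterFactory"/>
  <!-- Still removes tokens shorter than 2 characters, including "a" -->
  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- StopFilterFactory line removed -->
</analyzer>
```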
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email] <mailto:[hidden email]>
>>> http://observer.wunderwood.org/ <
>>>
>>> http://observer.wunderwood.org/>  (my blog)
>>>
>>>
>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Ok. I am kind of lost now.
>>> If I open up the console > analysis and perform it, that's the final
>>> result.
>>>
>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>
>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
>>> add to solr. Is that correct ?
>>>
>>>
>>> Thanks David
>>>
>>> On 5 Nov 2019, at 14:48, David Hastings <[hidden email]> wrote:
>>>
>>> Fwd to another server
>>>
>>> no,
>>>  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>
>>> is still using stopwords and should be removed, in my opinion of course,
>>> based on your use case may be different, but i generally axe any reference
>>> to them at all
>>>
>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Thanks.
>>> Haven't I done this here ?
>>> <fieldType name="text_field" class="solr.TextField"
>>> positionIncrementGap="100" omitNorms="false" >
>>> <analyzer type="index">
>>>  <tokenizer class="solr.StandardTokenizerFactory"/>
>>>  <filter class="solr.ClassicFilterFactory"/>
>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>>
>>>
>>> On 5 Nov 2019, at 14:15, David Hastings <[hidden email]> wrote:
>>>
>>> Fwd to another server
>>>
>>> The first thing you should do is remove any reference to stop words and
>>> never use them, then re-index your data and try it again.
>>>
>>>
>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>> I am performing a search to match a name (text_field), however this term
>>> contains 'and' and 'a' and it doesn't return any records. If I remove 'a'
>>> then it works.
>>> e.g
>>> Search Term: lymphoid and a non-lymphoid cell
>>> doesn't work:
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>
>>> Search term: lymphoid and non-lymphoid cell
>>> works:
>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>
>>> interested in the first result
>>>
>>> schema.xml
>>> <field name="name" type="text_field" indexed="true" stored="true"
>>> omitNorms="false" required="true" multiValued="false"/>
>>>
>>> <fieldType name="text_field" class="solr.TextField"
>>> positionIncrementGap="100" omitNorms="false" >
>>> <analyzer type="index">
>>>  <tokenizer class="solr.StandardTokenizerFactory"/>
>>>  <filter class="solr.ClassicFilterFactory"/>
>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>> <analyzer type="query">
>>>  <tokenizer class="solr.PatternTokenizerFactory"
>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>  <filter class="solr.PatternReplaceFilterFactory"
>>> pattern="^[/._:]+" replacement=""/>
>>>  <filter class="solr.PatternReplaceFilterFactory"
>>> pattern="[/._:]+$" replacement=""/>
>>>  <filter class="solr.PatternReplaceFilterFactory"
>>> pattern="[_]" replacement=" "/>
>>>  <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>  <filter class="solr.LowerCaseFilterFactory"/>
>>>  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>> </fieldType>
>>>
>>> stopwords.txt
>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>
>>> a
>>> b
>>> c
>>> ....
>>> an
>>> and
>>> are
>>>
>>> Running SolR 6.6.2.
>>>
>>> Is there anything I could do to prevent this ?
>>>
>>> Thanks
>>> Guilherme
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> --
>>> Regards,
>>>
>>> *Paras Lehana* [65871]
>>> Development Engineer, Auto-Suggest,
>>> IndiaMART Intermesh Ltd.
>>>
>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>> Noida, UP, IN - 201303
>>>
>>> Mob.: +91-9560911996
>>> Work: 01203916600 | Extn:  *8173*
>>>
>>> --
>>> IMPORTANT:
>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>
>>
>>
>
