Protecting Tokens from Any Analysis

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Hi All,

This is likely a rudimentary question, but I can’t seem to find a straightforward answer in the forums or the documentation: is there a way to protect tokens from ANY analysis? I know things like the KeywordMarkerFilterFactory protect tokens from stemming, but we have some terms we don’t even want our tokenizer to touch. Mostly, these are IBM-specific acronyms, such as IT:ibm. In this case, we would want to maintain the colon and the capitalization (otherwise “it” would be taken out as a stopword).

Any advice is appreciated!

Thank you,
Audrey

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
[hidden email]


Re: Protecting Tokens from Any Analysis

Alexandre Rafalovitch
If you don't want it to be touched by a tokenizer, how would the
protection step know that the sequence of characters you want to
protect is "IT:ibm" and not "this is an IT:ibm term I want to
protect"?

It sounds to me like you may want to:
1) copyField to a second field
2) Apply a much lighter (whitespace?) tokenizer to that second field
3) Run the results through something like KeepWordFilterFactory
4) Search both fields with a boost on the second, higher-signal field
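
The four steps above can be sketched outside Solr to show what the second field would end up holding. This is only a minimal Python illustration, not Solr configuration: the function names and the one-entry term list are assumptions made up for the example, and a real setup would declare the field type in the schema.

```python
# Hypothetical list of protected terms; in Solr this would be the words
# file that KeepWordFilterFactory reads.
PROTECTED_TERMS = {"IT:ibm"}


def whitespace_tokenize(text):
    """Split on whitespace only, so colons and capitalization survive."""
    return text.split()


def keep_words(tokens, keep):
    """Drop every token that is not listed verbatim in the keep list."""
    return [token for token in tokens if token in keep]


def analyze_protected_field(text):
    """Analysis chain for the copied field: light tokenizer, then keep-word filter."""
    return keep_words(whitespace_tokenize(text), PROTECTED_TERMS)


print(analyze_protected_field("this is an IT:ibm term I want to protect"))
```

One limitation of this sketch: a term followed directly by punctuation, such as "IT:ibm," with a trailing comma, would not match the list, because whitespace splitting leaves the comma attached.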

The other option is to run a CharFilter (PatternReplaceCharFilterFactory),
which is applied before the tokenizer, to map known complex acronyms to
substitutions the tokenizer will not split, e.g. "IT:ibm" -> "term365".
As long as it is done at both index and query time, they will still
match. You may need a bunch of them, or some sort of lookup map.
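
A rough Python sketch of that second option, assuming a small lookup map and a deliberately aggressive stand-in tokenizer (both hypothetical, chosen only for illustration). It shows why the substitution has to run on both the indexed text and the query. In Solr itself, a map like this could probably be expressed with MappingCharFilterFactory and a mapping file rather than one regex per acronym.

```python
import re

# Hypothetical lookup map; "term365" is the placeholder used in the message above.
ACRONYM_MAP = {"IT:ibm": "term365"}

# Try longer keys first so overlapping acronyms resolve predictably.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(ACRONYM_MAP, key=len, reverse=True))
)


def char_filter(text):
    """Runs before tokenization: swap each known acronym for its placeholder."""
    return _PATTERN.sub(lambda match: ACRONYM_MAP[match.group(0)], text)


def simple_tokenize(text):
    """Stand-in for a heavier tokenizer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def analyze(text):
    """Same chain at index time and query time, so both sides produce the placeholder."""
    return simple_tokenize(char_filter(text))


print(simple_tokenize("IT:ibm"))       # without the char filter the acronym is split
print(analyze("Ask the IT:ibm team"))  # index side
print(analyze("IT:ibm"))               # query side
```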

Regards,
   Alex.

On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
[hidden email] <[hidden email]> wrote:


Re: Protecting Tokens from Any Analysis

David Hastings
Another thing to add to the above,
>
> IT:ibm. In this case, we would want to maintain the colon and the
> capitalization (otherwise “it” would be taken out as a stopword).
>
Stopwords are a thing of the past at this point. There is no benefit to
using them now that hardware is so cheap.

On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <[hidden email]>
wrote:


Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Hey Alex,

Thank you!

Re: stopwords being a thing of the past due to the affordability of hardware...can you expand? I'm not sure I understand.


On 10/8/19, 1:01 PM, "David Hastings" <[hidden email]> wrote:


Re: Re: Protecting Tokens from Any Analysis

Alexandre Rafalovitch
Stopwords (this has been discussed on the mailing list several times, as
I recall): the idea is that removing them used to be one of the tricks
for making the index as small as possible to allow faster search,
stopwords being the most common words.
These days, disk space is not an issue most of the time and there have
been many optimizations to make stopwords less relevant. Plus, like
you said, sometimes the stopword management actively gets in the way.
Here is an interesting - if old - article about it too:
https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be

Regards,
   Alex.

On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
[hidden email] <[hidden email]> wrote:


Re: Re: Protecting Tokens from Any Analysis

Walter Underwood
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Stopwords were used when we were running search engines on 16-bit computers with 50-megabyte disks, like the PDP-11. Removing them avoided storing and processing long posting lists.

Think of removing stopwords as a binary weighting on frequent terms, either on or off (not in the index). With idf, we have a proportional weighting for frequent terms. That gives better results than binary weighting.
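
To make the contrast concrete, here is a small Python sketch comparing the two weightings. The idf formula is one common smoothed variant, chosen as an assumption for illustration (scoring models differ in the exact form), and the document counts are invented.

```python
import math


def binary_stopword_weight(term, stopwords):
    """Stopword removal as a weighting: the term is either dropped (0) or kept (1)."""
    return 0.0 if term in stopwords else 1.0


def idf(doc_freq, num_docs):
    """A smoothed inverse document frequency: the more documents contain a term, the lower its weight."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1.0


NUM_DOCS = 1000
# Hypothetical document frequencies, made up for this example.
DOC_FREQS = {"the": 990, "it": 700, "ibm": 400, "shingle": 3}
STOPWORDS = {"the", "it"}

for term, doc_freq in DOC_FREQS.items():
    print(term, binary_stopword_weight(term, STOPWORDS), round(idf(doc_freq, NUM_DOCS), 3))
```

With the binary scheme, "it" contributes nothing at all. With idf it still contributes, just much less than a rare term.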

Removing stopwords makes some searches impossible. The classic example is “to be or not to be”, which is 100% stopwords. This is a real-world problem. When I was building search for Netflix a dozen years ago, I hit several movie or TV titles which were all stopwords. I wrote about them in this blog post.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
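
The all-stopword problem is easy to reproduce. The sketch below uses a small, hypothetical stop list and a lowercase-and-split analyzer, both assumptions for illustration, to show the query disappearing entirely. It also shows the original poster's "IT" case.

```python
# Hypothetical stop list for illustration; real deployments configure their own.
STOPWORDS = {"a", "an", "and", "be", "in", "is", "it", "not", "of", "or", "the", "to"}


def analyze(query, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, then drop stopwords."""
    return [token for token in query.lower().split() if token not in stopwords]


print(analyze("to be or not to be"))  # every token is removed
print(analyze("IT support"))          # "IT" is lowercased to "it" and removed
```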

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


Re: Re: Protecting Tokens from Any Analysis

Erick Erickson
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
The theory behind stopwords is that they are “safe” to remove when calculating relevance, so we can squeeze every last bit of usefulness out of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve come a long way since then and the necessity of removing stopwords from the indexed tokens to conserve RAM and disk is much less relevant than it used to be in “the bad old days” when the idea of stopwords was invented.

I’m not quite so confident as David that there is “no benefit”, but I’ll totally agree that you should remove stopwords only _after_ you have some evidence that removing them is A Good Thing in your situation.

And removing stopwords leads to some interesting corner cases. Consider a search for “to be or not to be” if they’re all stopwords.

Best,
Erick


Re: Re: Protecting Tokens from Any Analysis

David Hastings
In reply to this post by Alexandre Rafalovitch
Another add-on, as the previous two were pretty much spot on. Compare a search for "drive in" with one for "drive on":

https://www.google.com/search?rlz=1C5CHFA_enUS814US819&sxsrf=ACYBGNTi2tQTQH6TycDKwRNEn9g2km9awg%3A1570632176627&ei=8PGdXa7tJeem_QaatJ_oAg&q=drive+in&oq=drive+in&gs_l=psy-ab.3..0l10.35669.36730..37042...0.4..1.434.1152.4j3j4-1......0....1..gws-wiz.......0i71j35i39j0i273j0i67j0i131j0i273i70i249.agjl1cqAyog&ved=0ahUKEwiupdfntI_lAhVnU98KHRraBy0Q4dUDCAs&uact=5

vs

https://www.google.com/search?rlz=1C5CHFA_enUS814US819&sxsrf=ACYBGNRFNjzWADDR7awohPfgg8qGXqOlmg%3A1570632182338&ei=9vGdXZ2VFKW8ggeuw73IDQ&q=drive+on&oq=drive+on&gs_l=psy-ab.3..0l10.35301.37396..37917...0.4..0.83.590.8....2..0....1..gws-wiz.......0i71j35i39j0i273j0i131j0i67j0i3.34FIDQtvfOE&ved=0ahUKEwid6LPqtI_lAhUlnuAKHa5hD9kQ4dUDCAs&uact=5


On Wed, Oct 9, 2019 at 10:41 AM Alexandre Rafalovitch <[hidden email]>
wrote:


Re: Re: Protecting Tokens from Any Analysis

David Hastings
In reply to this post by Erick Erickson
However, with all that said, stopwords CAN be useful in some situations. I
combine stopword removal with the shingle filter (ShingleFilterFactory) to
create "interesting phrases" (not really) that I use for my "more like
this" needs. For example, both
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which I can then use to relate other documents that are much more similar
in that regard, rather than just using the "interesting words"
europe, vacation

With the stop words left in, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

Just something to keep in mind: there are a lot of creative ways to use
stopwords depending on your needs. I use the above for a VERY basic ML
teacher and it works way better than shingles that keep the stop words.
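
The shingle example above can be reproduced with a few lines of Python. This is a simplified sketch under stated assumptions: the two-word stop list and the function names are hypothetical, and Solr's real filters also track token positions, which this sketch ignores.

```python
STOPWORDS = {"for", "on"}  # hypothetical minimal stop list for this example


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stop words before shingling."""
    return [token for token in tokens if token not in stopwords]


def shingles(tokens, size=2, separator="_"):
    """Join each run of `size` adjacent tokens into one shingle."""
    return [
        separator.join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


for phrase in ("europe for vacation", "europe on vacation"):
    tokens = phrase.split()
    print(phrase)
    print("  stop words removed:", shingles(remove_stopwords(tokens)))
    print("  stop words kept:   ", shingles(tokens))
```

With the stop words removed, both phrases collapse to the same shingle, which is what makes it usable as a shared signal between documents.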

On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <[hidden email]>
wrote:


Re: Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop words?


On 10/9/19, 11:14 AM, "David Hastings" <[hidden email]> wrote:

    However, with all that said, stopwords CAN be useful in some situations.  I
    combine stopwords with the shingle factory to create "interesting phrases"
    (not really) that i use in "my more like this" needs.  for example,
    europe for vacation
    europe on vacation
    will create the shingle
    europe_vacation
    which i can then use to relate other documents that would be much
    more similar in such regard, rather than just using the "interesting words"
    europe, vacation
   
    with stop words, the shingles would be
    europe_for
    for_vacation
    and
    europe_on
    on_vacation
   
    just something to keep in mind,  theres a lot of creative ways to use
    stopwords depending on your needs.  i use the above for a VERY basic ML
    teacher and it works way better than using stopwords,
   
   
   
   
   
   
   
   
   
   
   
   
   
    On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <[hidden email]>
    wrote:
   
    > The theory behind stopwords is that they are “safe” to remove when
    > calculating relevance, so we can squeeze every last bit of usefulness out
    > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
    > come a long way since then and the necessity of removing stopwords from the
    > indexed tokens to conserve RAM and disk is much less relevant than it used
    > to be in “the bad old days” when the idea of stopwords was invented.
    >
    > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
    > totally agree that you should remove stopwords only _after_ you have some
    > evidence that removing them is A Good Thing in your situation.
    >
    > And removing stopwords leads to some interesting corner cases. Consider a
    > search for “to be or not to be” if they’re all stopwords.
    >
    > Best,
    > Erick
    >
    > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
    > [hidden email] <[hidden email]> wrote:
    > >
    > > Hey Alex,
    > >
    > > Thank you!
    > >
    > > Re: stopwords being a thing of the past due to the affordability of
    > hardware...can you expand? I'm not sure I understand.
    > >
    > > --
    > > Audrey Lorberfeld
    > > Data Scientist, w3 Search
    > > IBM
    > > [hidden email]
    > >
    > >
    > > On 10/8/19, 1:01 PM, "David Hastings" <[hidden email]>
    > wrote:
    > >
    > >    Another thing to add to the above,
    > >>
    > >> IT:ibm. In this case, we would want to maintain the colon and the
    > >> capitalization (otherwise “it” would be taken out as a stopword).
    > >>
    > >    stopwords are a thing of the past at this point.  there is no benefit
    > to
    > >    using them now with hardware being so cheap.
    > >
    > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
    > [hidden email]>
    > >    wrote:
    > >
    > >> If you don't want it to be touched by a tokenizer, how would the
    > >> protection step know that the sequence of characters you want to
    > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
    > >> protect"?
    > >>
    > >> What it sounds to me is that you may want to:
    > >> 1) copyField to a second field
    > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
    > >> 3) Run the results through something like KeepWordFilterFactory
    > >> 4) Search both fields with a boost on the second, higher-signal field
    > >>
    > >> The other option is to run CharacterFilter,
    > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
    > >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
    > >> term365". As long as it is done on both indexing and query, they will
    > >> still match. You may have to have a bunch of them or write some sort
    > >> of lookup map.
    > >>
    > >> Regards,
    > >>   Alex.
    > >>
    > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
    > >> [hidden email] <[hidden email]> wrote:
    > >>>
    > >>> Hi All,
    > >>>
    > >>> This is likely a rudimentary question, but I can’t seem to find a
    > >> straight-forward answer on forums or the documentation…is there a way to
    > >> protect tokens from ANY analysis? I know things like the
    > >> KeywordMarkerFilterFactory protect tokens from stemming, but we have
    > some
    > >> terms we don’t even want our tokenizer to touch. Mostly, these are
    > >> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
    > >> maintain the colon and the capitalization (otherwise “it” would be taken
    > >> out as a stopword).
    > >>>
    > >>> Any advice is appreciated!
    > >>>
    > >>> Thank you,
    > >>> Audrey
    > >>>
    > >>> --
    > >>> Audrey Lorberfeld
    > >>> Data Scientist, w3 Search
    > >>> IBM
    > >>> [hidden email]
    > >>>
    > >>
    > >
    > >
    >
    >
   


Re: Re: Re: Protecting Tokens from Any Analysis

David Hastings
Only in my More Like This tools, which have a very specific purpose;
otherwise, no.

On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - [hidden email]
<[hidden email]> wrote:

> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority of you all do NOT use stop
> words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> [hidden email]

Re: Protecting Tokens from Any Analysis

Walter Underwood
In reply to this post by David Hastings
We did something like that with Infoseek and Ultraseek. We had a set of
“glue words” that made noun phrases and indexed patterns like “noun glue noun”
as single tokens.

I remember Doug Cutting saying that Nutch did something similar using pairs,
but using that as a prefilter instead of as a relevance term.

This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek always
beat Google in relevance tests, probably because of phrase IDF.

More Like This could do the same thing, but it seems to be really slow and
not especially useful as a search component.
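
As a rough illustration of the phrase IDF idea above, here is a minimal Python sketch. It is not the Infoseek or Ultraseek implementation: the glue-word list, the function names, and the four-document corpus are all assumptions made up for the example, and it treats any non-glue word as a "noun" instead of doing real part-of-speech tagging. It only shows that a "word glue word" pattern indexed as a single token is rarer, and so carries a higher IDF, than its component words.

```python
import math
from collections import Counter

# Hypothetical glue-word list; the thread does not give the real one.
GLUE_WORDS = {"of", "for", "on", "in", "to"}


def phrase_tokens(text):
    """Emit one token per 'word glue word' pattern, e.g. 'bill_of_rights'."""
    words = text.lower().split()
    tokens = []
    for i in range(len(words) - 2):
        left, glue, right = words[i:i + 3]
        # Assumption: any non-glue word stands in for a noun.
        if glue in GLUE_WORDS and left not in GLUE_WORDS and right not in GLUE_WORDS:
            tokens.append(f"{left}_{glue}_{right}")
    return tokens


def idf_table(docs, tokenizer):
    """Classic IDF, log(N / df), for every token the tokenizer produces."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenizer(doc)))  # count each token once per document
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


# Made-up corpus, purely for illustration.
docs = [
    "bill of rights explained",
    "the bill arrived late",
    "know your rights",
    "rights and a bill",
]

word_idf = idf_table(docs, lambda d: d.lower().split())
phrase_idf = idf_table(docs, phrase_tokens)

# The phrase token occurs in one document; "bill" and "rights" occur in three each.
print(phrase_idf["bill_of_rights"] > word_idf["bill"])    # True
print(phrase_idf["bill_of_rights"] > word_idf["rights"])  # True
```

In this toy corpus the two single words match three of four documents each, while the phrase token matches only one, which is the kind of signal a phrase-level IDF would add to ranking.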

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2019, at 8:14 AM, David Hastings <[hidden email]> wrote:
>
> However, with all that said, stopwords CAN be useful in some situations.  I
> combine stopwords with the shingle factory to create "interesting phrases"
> (not really) that i use in "my more like this" needs.  for example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the "interesting words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to use
> stopwords depending on your needs.  i use the above for a VERY basic ML
> teacher and it works way better than using stopwords,


Re: Protecting Tokens from Any Analysis

David Hastings
Yeah, I don't use it as a search, only for, well, finding more documents like
that one :). For my purposes I tested 2- to 5-part shingles and found
that the 2-part shingles actually gave me better results, for my use
case, than anything longer.
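
For readers who want to see the shingle behavior concretely, here is a small plain-Python sketch. It is not Solr's ShingleFilterFactory, whose handling of the positions left behind by removed stopwords may differ depending on configuration; the function name, the `size` parameter, and the two-word stopword list are assumptions chosen to reproduce the europe/vacation example from the earlier message.

```python
# Hypothetical stopword list, just enough for the thread's example.
STOPWORDS = {"for", "on"}


def shingles(text, size=2, stopwords=frozenset()):
    """Drop stopwords, then join every run of `size` adjacent words with '_'."""
    words = [w for w in text.lower().split() if w not in stopwords]
    return ["_".join(words[i:i + size]) for i in range(len(words) - size + 1)]


# With stopwords removed, both phrasings collapse to the same shingle.
print(shingles("europe for vacation", stopwords=STOPWORDS))  # ['europe_vacation']
print(shingles("europe on vacation", stopwords=STOPWORDS))   # ['europe_vacation']

# With stopwords kept, the two phrasings share no shingle at all.
print(shingles("europe for vacation"))  # ['europe_for', 'for_vacation']
print(shingles("europe on vacation"))   # ['europe_on', 'on_vacation']

# Longer shingles are more specific, so they match fewer documents.
print(shingles("europe for vacation", size=3))  # ['europe_for_vacation']
```

The last call hints at why longer shingles may not help: each extra word makes the token rarer, so fewer documents share it.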

I don't suppose you could point me to any of the phrase IDF documentation
for Solr by chance?  That would be fun to poke around with.

On Wed, Oct 9, 2019 at 2:49 PM Walter Underwood <[hidden email]>
wrote:

> We did something like that with Infoseek and Ultraseek. We had a set of
> “glue words” that made noun phrases and indexed patterns like “noun glue
> noun”
> as single tokens.
>
> I remember Doug Cutting saying that Nutch did something similar using
> pairs,
> but using that as a prefilter instead of as a relevance term.
>
> This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek
> always
> beat Google in relevance tests, probably because of phrase IDF.
>
> More Like This could do the same thing, but it seems to be really slow and
> not especially useful as a search component.
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)

Re: Re: Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
In reply to this post by Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
Also, in terms of computational cost, it would seem that including most terms/not having a stop list would take a toll on the system. For instance, right now we have "ibm" as a stop word because it appears everywhere in our corpus. If we did not include it in the stop words file, we would have to retrieve every single document in our corpus and rank them. That's a high computational cost, no?
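
To make that concern concrete, here is a toy Python sketch of an inverted index. Everything in it (the four documents, the function names, OR semantics for the query) is an assumption for illustration, and it ignores the optimizations a real engine applies. It shows two things: a term that occurs in every document does pull the whole corpus into the candidate set, and a BM25-style IDF gives that same term a weight close to zero, so it contributes little to the ranking.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def bm25_idf(doc_freq, n_docs):
    """BM25-style IDF: close to zero for a term found in every document."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Made-up corpus in which "ibm" appears everywhere.
docs = [
    "ibm cloud pricing",
    "ibm server setup guide",
    "ibm travel policy",
    "ibm server warranty",
]
index = build_index(docs)


def candidates(query):
    """OR semantics: every document containing at least one query term."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits


print(len(candidates("ibm server")))  # 4
print(len(candidates("server")))      # 2
print(round(bm25_idf(len(index["ibm"]), len(docs)), 3))     # 0.105
print(round(bm25_idf(len(index["server"]), len(docs)), 3))  # 0.693
```

Whether the larger candidate set matters in practice depends on corpus size, hardware, and query parser settings, so it is worth measuring on a real index before deciding either way.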

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
[hidden email]
 

On 10/9/19, 2:31 PM, "Audrey Lorberfeld - [hidden email]" <[hidden email]> wrote:

    Wow, thank you so much, everyone. This is all incredibly helpful insight.
   
    So, would it be fair to say that the majority of you all do NOT use stop words?
   
    --
    Audrey Lorberfeld
    Data Scientist, w3 Search
    IBM
    [hidden email]
     
   


Re: Re: Re: Re: Protecting Tokens from Any Analysis

David Hastings
If you have anything close to a decent server you won't notice it at all.  I'm
at about 21 million documents, the index varies between 450 GB and 800 GB
depending on merges, and I serve about 60k searches a day and stay sub-second
non stop, and this is on a single core/non-cloud environment.

On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - [hidden email]
<[hidden email]> wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop list would take a toll on the system. For instance,
> right now we have "ibm" as a stop word because it appears everywhere in our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> [hidden email]

Re: Re: Re: Re: Protecting Tokens from Any Analysis

David Hastings
Oh, and by 'non stop' I mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings <[hidden email]>
wrote:

> if you have anything close to a decent server you wont notice it all.  im
> at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> [hidden email] <[hidden email]> wrote:
>
>> Also, in terms of computational cost, it would seem that including most
>> terms/not having a stop list would take a toll on the system. For instance,
>> right now we have "ibm" as a stop word because it appears everywhere in our
>> corpus. If we did not include it in the stop words file, we would have to
>> retrieve every single document in our corpus and rank them. That's a high
>> computational cost, no?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> [hidden email]
>>
>>
>> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - [hidden email]" <
>> [hidden email]> wrote:
>>
>>     Wow, thank you so much, everyone. This is all incredibly helpful
>> insight.
>>
>>     So, would it be fair to say that the majority of you all do NOT use
>> stop words?
>>
>>     --
>>     Audrey Lorberfeld
>>     Data Scientist, w3 Search
>>     IBM
>>     [hidden email]
>>
>>
>>     On 10/9/19, 11:14 AM, "David Hastings" <[hidden email]>
>> wrote:
>>
>>         However, with all that said, stopwords CAN be useful in some
>> situations.  I
>>         combine stopwords with the shingle factory to create "interesting
>> phrases"
>>         (not really) that i use in "my more like this" needs.  for
>> example,
>>         europe for vacation
>>         europe on vacation
>>         will create the shingle
>>         europe_vacation
>>         which i can then use to relate other documents that would be much
>>         more similar in such regard, rather than just using the
>> "interesting words"
>>         europe, vacation
>>
>>         with stop words, the shingles would be
>>         europe_for
>>         for_vacation
>>         and
>>         europe_on
>>         on_vacation
>>
>>         just something to keep in mind,  theres a lot of creative ways to
>> use
>>         stopwords depending on your needs.  i use the above for a VERY
>> basic ML
>>         teacher and it works way better than using stopwords,
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>         On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
>> [hidden email]>
>>         wrote:
>>
>>         > The theory behind stopwords is that they are “safe” to remove
>> when
>>         > calculating relevance, so we can squeeze every last bit of
>> usefulness out
>>         > of very constrained hardware (think 64K of memory. Yes
>> kilobytes). We’ve
>>         > come a long way since then and the necessity of removing
>> stopwords from the
>>         > indexed tokens to conserve RAM and disk is much less relevant
>> than it used
>>         > to be in “the bad old days” when the idea of stopwords was
>> invented.
>>         >
>>         > I’m not quite so confident as Alex that there is “no benefit”,
>> but I’ll
>>         > totally agree that you should remove stopwords only _after_ you
>> have some
>>         > evidence that removing them is A Good Thing in your situation.
>>         >
>>         > And removing stopwords leads to some interesting corner cases.
>> Consider a
>>         > search for “to be or not to be” if they’re all stopwords.
>>         >
>>         > Best,
>>         > Erick
>>         >
>>         > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>>         > [hidden email] <[hidden email]> wrote:
>>         > >
>>         > > Hey Alex,
>>         > >
>>         > > Thank you!
>>         > >
>>         > > Re: stopwords being a thing of the past due to the
>> affordability of
>>         > hardware...can you expand? I'm not sure I understand.
>>         > >
>>         > > --
>>         > > Audrey Lorberfeld
>>         > > Data Scientist, w3 Search
>>         > > IBM
>>         > > [hidden email]
>>         > >
>>         > >
>>         > > On 10/8/19, 1:01 PM, "David Hastings" <
>> [hidden email]>
>>         > wrote:
>>         > >
>>         > >    Another thing to add to the above,
>>         > >>
>>         > >> IT:ibm. In this case, we would want to maintain the colon
>> and the
>>         > >> capitalization (otherwise “it” would be taken out as a
>> stopword).
>>         > >>
>>         > >    stopwords are a thing of the past at this point.  there is
>> no benefit
>>         > to
>>         > >    using them now with hardware being so cheap.
>>         > >
>>         > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
>>         > [hidden email]>
>>         > >    wrote:
>>         > >
>>         > >> If you don't want it to be touched by a tokenizer, how would
>> the
>>         > >> protection step know that the sequence of characters you
>> want to
>>         > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>>         > >> protect"?
>>         > >>
>>         > >> What it sounds to me is that you may want to:
>>         > >> 1) copyField to a second field
>>         > >> 2) Apply a much lighter (whitespace?) tokenizer to that
>> second field
>>         > >> 3) Run the results through something like
>> KeepWordFilterFactory
>>         > >> 4) Search both fields with a boost on the second,
>> higher-signal field
>>         > >>
>>         > >> The other option is to run CharacterFilter,
>>         > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to
>> map known
>>         > >> complex acronyms to non-tokenizable substitutions. E.g.
>> "IT:ibm ->
>>         > >> term365". As long as it is done on both indexing and query,
>> they will
>>         > >> still match. You may have to have a bunch of them or write
>> some sort
>>         > >> of lookup map.
>>         > >>
>>         > >> Regards,
>>         > >>   Alex.
>>         > >>
>>         > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>>         > >> [hidden email] <[hidden email]> wrote:
>>         > >>>
>>         > >>> Hi All,
>>         > >>>
>>         > >>> This is likely a rudimentary question, but I can’t seem to
>> find a
>>         > >> straight-forward answer on forums or the documentation…is
>> there a way to
>>         > >> protect tokens from ANY analysis? I know things like the
>>         > >> KeywordMarkerFilterFactory protect tokens from stemming, but
>> we have
>>         > some
>>         > >> terms we don’t even want our tokenizer to touch. Mostly,
>> these are
>>         > >> IBM-specific acronyms, such as IT:ibm. In this case, we
>> would want to
>>         > >> maintain the colon and the capitalization (otherwise “it”
>> would be taken
>>         > >> out as a stopword).
>>         > >>>
>>         > >>> Any advice is appreciated!
>>         > >>>
>>         > >>> Thank you,
>>         > >>> Audrey
>>         > >>>
>>         > >>> --
>>         > >>> Audrey Lorberfeld
>>         > >>> Data Scientist, w3 Search
>>         > >>> IBM
>>         > >>> [hidden email]
>>         > >>>
>>         > >>
>>         > >
>>         > >
>>         >
>>         >
>>
>>
>>
>>
>>

Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
In reply to this post by David Hastings
True... I guess another rub here is that we're using the edismax parser, so all of our queries are inherently OR queries. So for a query like 'the ibm way', the search engine would have to:

1) retrieve a document list for:
 --> "ibm" (this list is probably 80% of the documents)
 --> "the" (this list is 100% of the English documents)
 --> "way"
2) apply the edismax parser:
 --> foreach term
 --> --> foreach document in term
 --> --> --> score it

So it seems like it would take a toll on our system... but maybe that's incorrect! (For reference, our corpus is ~5MM documents, multi-language, and we get ~80k-100k queries/day.)

Are you using edismax?
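
The retrieve-then-score steps above can be sketched in a few lines of Python. This is only a toy term-at-a-time model with made-up postings and an IDF-weighted term frequency; it is not a description of how Solr or Lucene actually executes an edismax query, and the names INDEX, NUM_DOCS, idf and or_query are hypothetical, chosen for illustration.

```python
import math
from collections import defaultdict

# Toy inverted index: term -> {doc_id: term frequency}. All data is invented.
INDEX = {
    "the": {1: 3, 2: 1, 3: 2, 4: 1},   # appears in every document
    "ibm": {1: 1, 2: 2, 4: 1},         # appears in most documents
    "way": {3: 1},                     # rare term
}
NUM_DOCS = 4


def idf(term):
    """Inverse document frequency: the more common the term, the lower the weight."""
    df = len(INDEX.get(term, {}))
    return math.log(1 + (NUM_DOCS - df + 0.5) / (df + 0.5))


def or_query(terms, top_k=10):
    """Score every document that contains at least one query term (OR semantics)."""
    scores = defaultdict(float)
    for term in terms:                                   # foreach term
        weight = idf(term)
        for doc_id, tf in INDEX.get(term, {}).items():   # foreach document in term
            scores[doc_id] += weight * tf                # score it
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    for doc_id, score in or_query(["the", "ibm", "way"]):
        print(doc_id, round(score, 3))
```

In this toy data the rare term "way" decides the ranking, while "the" and "ibm" add little to any document's score. The work done is proportional to the combined length of the postings lists visited, which is the cost being asked about here.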

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
[hidden email]
 

On 10/9/19, 3:11 PM, "David Hastings" <[hidden email]> wrote:

    If you have anything close to a decent server you won't notice it at all. I'm
    at about 21 million documents, the index varies between 450 GB and 800 GB
    depending on merges, and I get about 60k searches a day and stay sub-second
    non-stop, and this is on a single core/non-cloud environment.
   
   


Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

David Hastings
Yup. You're going to find Solr is WAY more efficient than you think when it
comes to complex queries.
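
For a sense of scale, the daily volumes quoted in this thread can be converted to average queries per second. This is a minimal sketch and it assumes traffic is spread evenly over the day, which real traffic is not, so peak rates will be higher; the function name average_qps is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(queries_per_day):
    """Average queries per second, assuming an even spread over 24 hours."""
    return queries_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Daily volumes mentioned in the thread.
    for label, per_day in [("60k/day", 60_000), ("80k/day", 80_000), ("100k/day", 100_000)]:
        print(label, round(average_qps(per_day), 2), "queries/sec on average")
```

All three figures come out at roughly one query per second or less on average.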

On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - [hidden email]
<[hidden email]> wrote:

> True... I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherently OR queries. So for a query like 'the ibm
> way', the search engine would have to:
>
> 1) retrieve a document list for:
>  --> "ibm" (this list is probably 80% of the documents)
>  --> "the" (this list is 100% of the English documents)
>  --> "way"
> 2) apply the edismax parser:
>  --> foreach term
>  --> --> foreach document in term
>  --> --> --> score it
>
> So it seems like it would take a toll on our system... but maybe that's
> incorrect! (For reference, our corpus is ~5MM documents, multi-language,
> and we get ~80k-100k queries/day.)
>
> Are you using edismax?

Re: Protecting Tokens from Any Analysis

Walter Underwood
I wouldn’t worry about performance with that setup. I just checked on a production
system with 13 million docs in four shards, so 3+ million per shard. I searched on
the most common term in the title field and got a response in 31 milliseconds.
This was probably not cached, because the collection gets frequent updates and
is getting limited public traffic. That will change on Monday.

Make sure that you have more free RAM than the size of the index. Allow
for the size of the JVM, OS, etc.

Make sure you have plenty of CPU. After you have the RAM, CPU is the
bottleneck.
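
As a rough illustration of the RAM rule of thumb above, here is a back-of-the-envelope check. The numbers (a 40 GB index, an 8 GB JVM heap, 2 GB for the OS) are invented for the example, and the helper names required_ram_gb and fits_in_ram are hypothetical; this is a sketch of the sizing arithmetic, not a Solr tuning tool.

```python
def required_ram_gb(index_gb, jvm_heap_gb, os_overhead_gb=2.0):
    """RAM needed so the whole index fits in free memory next to the JVM and OS."""
    return index_gb + jvm_heap_gb + os_overhead_gb


def fits_in_ram(total_ram_gb, index_gb, jvm_heap_gb, os_overhead_gb=2.0):
    """True if the machine has more RAM than the index, JVM and OS need together."""
    return total_ram_gb > required_ram_gb(index_gb, jvm_heap_gb, os_overhead_gb)


if __name__ == "__main__":
    # Hypothetical machine sizes checked against a 40 GB index and 8 GB heap.
    for total in (48, 64):
        print(total, "GB machine:", "ok" if fits_in_ram(total, 40, 8) else "too small")
```

With these assumed figures a 64 GB machine leaves room for the index to stay cached, while a 48 GB machine does not.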

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2019, at 12:33 PM, David Hastings <[hidden email]> wrote:
>
> Yup. You're going to find Solr is WAY more efficient than you think when it
> comes to complex queries.