Solr stemming -> preserve original words

Thushara Wijeratna-2
hello,

Is it possible to retrieve the original words once solr (Porter algorithm)
stems them?
I need to index a bunch of data, store it in solr, and get back a list of
most frequent terms out of solr. and i want to see the non-stemmed version
of this data.

so basically, i want to enhance this:
http://localhost:8983/solr/admin/schema.jsp to see the "top terms" in
non-stemmed form.

thanks,
thushara
Reply | Threaded
Open this post in threaded view
|

Re: Solr stemming -> preserve original words

iorixxx
I think the best way to get non-stemmed top terms is to index the field using a fieldType that does not employ any stem filter. For example:

<fieldType name="non_stemmed_text" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>

By using copyField you can store two (or more) versions of a field: stemmed and non-stemmed.

Just a new field:
<field name="text" type="non_stemmed_text" indexed="true" stored="true" />

And a copy field:
<copyField source="your_original_field" dest="text" />

The Schema Browser (Field: text) will then give you the top terms.

Re: Solr stemming -> preserve original words

Thushara Wijeratna-2
hi Ahmet,

thanks. when i look at the non_stemmed_text field to get the top terms, i
will not be getting the useful feature of aggregating many related words
into one (which is done by stemming).

for ex: if a document has run(10), running(20), runner(2), runners(8) - i
would like the "top term" to be "run" here. i think with the
non-stemmed solution, i will see run, running, runner, runners as separate
top terms so if the term "weather" happens to occur 21 times in the
document, it will replace any version of "run" as the top term.

of course i could go back to the text field for top terms where i will see
"run", but some of the terms in the text field will be non-english (stemmed
beyond english, ex: archiv, perman). so how can i tell if a term i see in
the text field is a "badly stemmed" word or not?

maybe at this point i could use a dictionary? if a term in the text field is
not in the dictionary, i would try to find a prefix match from the
non-stemmed field? or maybe there's a better way?
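
A rough sketch of the dictionary-plus-prefix idea above, in plain Java. The class name, method name, and both inputs (a word list and a map of non-stemmed term frequencies) are assumptions made for illustration; nothing here is part of Solr or Lucene.

```java
import java.util.Map;
import java.util.Set;

class StemRepair {
    // Returns a display word for a stemmed term:
    //  - the stem itself if the dictionary contains it,
    //  - otherwise the most frequent non-stemmed term that starts with the stem,
    //  - otherwise the stem unchanged.
    // dictionary and nonStemmedFreqs are hypothetical inputs supplied by the caller.
    static String displayWord(String stem, Set<String> dictionary,
                              Map<String, Integer> nonStemmedFreqs) {
        if (dictionary.contains(stem)) {
            return stem;
        }
        String best = stem;
        int bestFreq = -1;
        for (Map.Entry<String, Integer> e : nonStemmedFreqs.entrySet()) {
            if (e.getKey().startsWith(stem) && e.getValue() > bestFreq) {
                best = e.getKey();
                bestFreq = e.getValue();
            }
        }
        return best;
    }
}
```

With "archive" and "archives" in the non-stemmed map, the stem "archiv" would come back as whichever of the two is more frequent. One known gap: the Porter stem is not always a prefix of the original word ("happy" stems to "happi"), so a prefix match will miss those cases.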

thanks,
thushara

Re: Solr stemming -> preserve original words

Chris Harris-2
It seems like what's desired is not so much a stemmer as what you might call
a "canonicalizer", which would translate each source word not into its
"stem" but into its "most canonical form". Critically, the latter, by
definition, is always a legitimate word, e.g. "run". What's more, it's
always the "most appropriate word" or "most general word", or some such.

I'm not sure you could implement this except through a massive dictionary.
And you'd have trouble because some words would probably be ambiguous
between whether they should canonicalize this way or that.
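
To make the "canonicalizer" idea concrete, here is a minimal sketch backed by a hand-built lookup table. The class and its methods are hypothetical, and the table would have to be filled from some real dictionary, which is the expensive part.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class Canonicalizer {
    // surface form -> canonical form; every value is a legitimate word
    private final Map<String, String> canonical = new HashMap<>();

    // Registers the canonical word and all surface forms that map to it.
    // A form registered twice keeps only the latest mapping.
    void add(String canonicalForm, String... surfaceForms) {
        canonical.put(canonicalForm, canonicalForm);
        for (String form : surfaceForms) {
            canonical.put(form, canonicalForm);
        }
    }

    // Looks up the canonical form; unknown words come back lowercased but
    // otherwise unchanged, so the result is never a truncated stem.
    String canonicalize(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return canonical.getOrDefault(key, key);
    }
}
```

After add("run", "running", "runner", "runners"), canonicalize("Running") gives "run". The overwrite behaviour in add is where the ambiguity shows up: a word that could belong to two canonical forms simply ends up under whichever was registered last.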

Re: Solr stemming -> preserve original words

iorixxx
In reply to this post by Thushara Wijeratna-2
I didn't understand what exactly you want.

if a document has run(10), running(20), runner(2), runners(8):
(assuming stemmer reduces all those words to run)
with non-stemmed you will see:
running(20)
run(10)
runners(8)
runner(2)

with stemmed you will see:
run(40)

You want to see run as a top term but also you want to see the original words that formed that term?
run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner

Or do you want to see the most frequent terms that passed through the stem filter verbatim (terms that the stemmer didn't modify)?

What do you mean by a "badly stemmed" word?


Re: Solr stemming -> preserve original words

Thushara Wijeratna-2
Chris, Ahmet - thanks for the responses.

Ahmet - yes, i want to see "run" as a top term + the original words that
formed that term.
The reason is that due to mis-stemming, the terms could become non-english.
ex:  "permanent" would stem to "perman", "archive" would become "archiv".

I need to extract a set of keywords from the indexed content - I'd like
these to be correct full english words.

thanks,
thushara

Re: Solr stemming -> preserve original words

iorixxx
I still don't understand your final goal, but if you want output in the form of
"run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner"
you need to index your documents using the standard analyzer. Then walk through the index using org.apache.lucene.index.IndexReader and stem each term with the stemmer. Storing the stems (keys) and the original word lists (values) in a map will give that kind of output.
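
The map-building step could look roughly like the sketch below. It deliberately leaves Lucene out: termFreqs stands in for the (term, frequency) pairs you would collect while walking the index, and toyStem is a made-up suffix stripper that only handles the "run" example (it is not the Porter algorithm). Pass in a real stemmer in its place.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

class StemGroups {
    // Hypothetical stand-in for a real stemmer; it only knows the suffixes
    // needed for run/running/runner/runners.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ners", "ning", "ner", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Groups original terms and their frequencies under their stem:
    // stem (key) -> { original word -> frequency } (value).
    static Map<String, Map<String, Integer>> groupByStem(
            Map<String, Integer> termFreqs, UnaryOperator<String> stemmer) {
        Map<String, Map<String, Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String stem = stemmer.apply(e.getKey());
            groups.computeIfAbsent(stem, k -> new TreeMap<>())
                  .merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return groups;
    }

    // Total frequency of a stem is the sum over its original words.
    static int total(Map<String, Integer> originals) {
        int sum = 0;
        for (int n : originals.values()) {
            sum += n;
        }
        return sum;
    }
}
```

Called as groupByStem(termFreqs, StemGroups::toyStem) on the counts from the example, the "run" group holds all four original words and total() on it returns 40, while "weather" stays in a group of its own.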

However, if seeing something like the following list (not exactly what you want, but similar) on schema.jsp would help you,

run=>run
run=>running
run=>runner
run=>runners

add one line of code

newstr = newstr + "=>" +  new String(termBuffer, 0, len);

to org.apache.solr.analysis.EnglishPorterFilterFactory.java between lines #116 and #117.

Rename the file, compile the code, and put your jar file in the lib directory under your solr home. Now you can use your new FilterFactory in your schema.xml.
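
If you go this route, the "stem=>original" terms still need to be folded back into one line per stem. A small sketch of that post-processing, assuming you have copied the top terms and their counts into a map yourself (the class and method names are hypothetical, not part of Solr):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopTermReport {
    // Splits terms of the form "stem=>original" and groups originals by stem.
    static Map<String, Map<String, Integer>> group(Map<String, Integer> topTerms) {
        Map<String, Map<String, Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : topTerms.entrySet()) {
            int sep = e.getKey().indexOf("=>");
            if (sep < 0) {
                continue; // term was not produced by the modified filter
            }
            String stem = e.getKey().substring(0, sep);
            String original = e.getKey().substring(sep + 2);
            groups.computeIfAbsent(stem, k -> new TreeMap<>())
                  .merge(original, e.getValue(), Integer::sum);
        }
        return groups;
    }

    // Formats one group as "run(40) => 20 from running, 10 from run, ...",
    // listing the most frequent original word first.
    static String format(String stem, Map<String, Integer> originals) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(originals.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        int total = 0;
        StringBuilder parts = new StringBuilder();
        for (Map.Entry<String, Integer> e : sorted) {
            total += e.getValue();
            if (parts.length() > 0) {
                parts.append(", ");
            }
            parts.append(e.getValue()).append(" from ").append(e.getKey());
        }
        return stem + "(" + total + ") => " + parts;
    }
}
```

For the four "run=>..." terms with counts 10, 20, 2 and 8, format("run", group(topTerms).get("run")) produces "run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner".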

