Problems when changing stoplist file

Marie-Christine Plogmann
Hi,

I am currently using the demo class IndexFiles to index a corpus. I have replaced the StandardAnalyzer with a GermanAnalyzer, and with that change indexing works fine.
But if I specify a different stopword list, the tokenization doesn't seem to work properly: in most cases, some letters are missing at the end of the terms. Has anybody encountered a similar problem? What could be the cause?

Thanks!
Marie

RE: Problems when changing stoplist file

steve_rowe
Hi Marie,

On 09/11/2008 at 4:03 AM, Marie-Christine Plogmann wrote:
> I am currently using the demo class IndexFiles to index some
> corpus. I have replaced the Standard by a GermanAnalyzer.
> Here, indexing works fine.
> But if I specify a different stopword list that should be
> used, the tokenization doesn't seem to work properly. Mostly
> some letters are missing at the end. Has somebody encountered
> a similar problem? What could be the problem?

Are you sure that this only occurs after you change the stopword list?

I assume you're using the GermanAnalyzer in contrib/. It constructs an analysis pipeline consisting of StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and finally GermanStemFilter. GermanStemFilter invokes GermanStemmer <http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_3_2/contrib/analyzers/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?view=markup>, an implementation of the stemming algorithm described in the paper linked from here: <http://www.inf.fu-berlin.de/inst/pubs/tr-b-99-16.abstract.html>.

A basic question to get out of the way: Are you aware that the stemming operation removes letters from the end of some words?
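
To make that concrete, here is a minimal, self-contained sketch in plain Java. It is not Lucene code: the class name ToyGermanPipeline, the analyze/toyStem helpers, and the suffix list are all made up for illustration, and the suffix stripping is far cruder than what GermanStemmer actually does. It only mimics the order of the last three pipeline stages (lowercase, stop filter, stem) to show that shortened terms come from the stemming step, whichever stopword list is in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyGermanPipeline {
    // Hypothetical suffix list for illustration only; these are NOT the
    // rules implemented by Lucene's GermanStemmer.
    private static final String[] SUFFIXES = {"ern", "en", "er", "es", "e", "s"};

    // Strip the first matching suffix, keeping at least three characters.
    static String toyStem(String term) {
        for (String suffix : SUFFIXES) {
            int stemLength = term.length() - suffix.length();
            if (term.endsWith(suffix) && stemLength >= 3) {
                return term.substring(0, stemLength);
            }
        }
        return term;
    }

    // Split on non-alphanumerics, lowercase, drop stopwords, optionally stem.
    static List<String> analyze(String text, Set<String> stopwords, boolean stem) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String term = raw.toLowerCase(Locale.GERMAN);
            if (stopwords.contains(term)) {
                continue; // the stop filter removes whole terms, never letters
            }
            terms.add(stem ? toyStem(term) : term);
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "Die Kinder spielen in den Feldern";
        Set<String> listA = new HashSet<>(Arrays.asList("die", "in", "den"));
        Set<String> listB = new HashSet<>(Arrays.asList("die"));

        System.out.println(analyze(text, listA, false)); // [kinder, spielen, feldern]
        System.out.println(analyze(text, listA, true));  // [kind, spiel, feld]
        System.out.println(analyze(text, listB, true));  // [kind, spiel, in, den, feld]
    }
}
```

In this sketch, swapping listA for listB changes which terms survive, but the endings are cut only when the stemming flag is on. If you see the same pattern in your index, the missing letters are most likely the stemmer's doing rather than an effect of the new stopword list.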

Steve