[jira] Created: (LUCENE-1003) RussianAnalyzer's tokenizer skips numbers from input text,


[jira] Created: (LUCENE-1003) RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
RussianAnalyzer's tokenizer skips numbers from input text,
----------------------------------------------------------

                 Key: LUCENE-1003
                 URL: https://issues.apache.org/jira/browse/LUCENE-1003
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.2
            Reporter: TUSUR OpenTeam


RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.

{code:title=TestRussianAnalyzer.java|borderStyle=solid}

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.ru.RussianCharsets;

public class TestRussianAnalyzer extends TestCase {

  // JUnit 3 creates a new TestCase instance per test method,
  // so each test reads from its own fresh reader.
  Reader reader = new StringReader("text 1000");

  // Default charset: the number is dropped.
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // Charset extended with digits: the number is kept.
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  // Expects two tokens: "text", then a non-null token for "1000".
  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  // Copies RussianCharsets.UnicodeRussian and appends the digits '0'..'9'.
  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System
        .arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    for (char digit = '0'; digit <= '9'; digit++) {
      russianChars[length++] = digit;
    }
    return russianChars;
  }
}

{code}
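
To illustrate the symptom without a Lucene dependency, here is a minimal sketch of a tokenizer that keeps letters plus any characters listed in a configurable char array. It is a simplified stand-in written for this note, not Lucene's actual tokenizer code; the class name CharsetTokenizerSketch and the helpers isTokenChar and tokenize are hypothetical. Under that assumption, digits are treated as separators unless they are added to the array, which mirrors the behaviour described in the report.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (an assumption, not Lucene code): a character belongs to a
// token if it is a letter or if it appears in the extra character set.
public class CharsetTokenizerSketch {

    // Hypothetical helper: true if c is a letter or is listed in extraChars.
    static boolean isTokenChar(char c, char[] extraChars) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : extraChars) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    // Splits text into maximal runs of token characters.
    static List<String> tokenize(String text, char[] extraChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extraChars)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] noExtras = new char[0];
        char[] digits = "0123456789".toCharArray();
        System.out.println(tokenize("text 1000", noExtras)); // [text]
        System.out.println(tokenize("text 1000", digits));   // [text, 1000]
    }
}
```

With an empty extra set the number never forms a token, which corresponds to the failing assertNotNull in the test case above; with the digits added, "1000" comes through as a second token.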

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-1003) RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TUSUR OpenTeam updated LUCENE-1003:
-----------------------------------

    Description:
RussianAnalyzer's tokenizer skips numbers in the input text, so the resulting token stream is missing them. The problem can be solved by adding the digits to RussianCharsets.UnicodeRussian. See the test case below for details.

{code:title=TestRussianAnalyzer.java|borderStyle=solid}

public class TestRussianAnalyzer extends TestCase {

  Reader reader = new StringReader("text 1000");

  // test FAILS
  public void testStemmer() {
    testAnalyzer(new RussianAnalyzer());
  }

  // test PASSES
  public void testFixedRussianAnalyzer() {
    testAnalyzer(new RussianAnalyzer(getRussianCharSet()));
  }

  private void testAnalyzer(RussianAnalyzer analyzer) {
    try {
      TokenStream stream = analyzer.tokenStream("text", reader);
      assertEquals("text", stream.next().termText());
      assertNotNull(stream.next());
    } catch (IOException e) {
      fail(e.getMessage());
    }
  }

  private char[] getRussianCharSet() {
    int length = RussianCharsets.UnicodeRussian.length;
    final char[] russianChars = new char[length + 10];

    System
        .arraycopy(RussianCharsets.UnicodeRussian, 0, russianChars, 0, length);
    russianChars[length++] = '0';
    russianChars[length++] = '1';
    russianChars[length++] = '2';
    russianChars[length++] = '3';
    russianChars[length++] = '4';
    russianChars[length++] = '5';
    russianChars[length++] = '6';
    russianChars[length++] = '7';
    russianChars[length++] = '8';
    russianChars[length] = '9';
    return russianChars;
  }
}

{code}




[jira] Commented: (LUCENE-1003) RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528663 ]

Nick Menere commented on LUCENE-1003:
-------------------------------------

Yeah,
I raised this on the dev list a few months ago and didn't get much response.

I think I might even be responsible for that code above.  It was meant more as a hack to [get a customer up and running|http://jira.atlassian.com/browse/JRA-12399].

Cheers,
Nick



[jira] Commented: (LUCENE-1003) RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528685 ]

TUSUR OpenTeam commented on LUCENE-1003:
----------------------------------------

Yeah, Nick, the code above was taken from your JIRA issue. We weren't able to find a similar issue in the Lucene issue tracker. We use Lucene a lot, so we need this bug fixed in the core.


[jira] Updated: (LUCENE-1003) RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TUSUR OpenTeam updated LUCENE-1003:
-----------------------------------

    Attachment: RussianCharsets.java.patch

Patch that adds numbers to RussianCharsets.
Usage: patch RussianCharsets.java < RussianCharsets.java.patch


[jira] Updated: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TUSUR OpenTeam updated LUCENE-1003:
-----------------------------------

    Summary: [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,  (was: RussianAnalyzer's tokenizer skips numbers from input text,)


[jira] Commented: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528832 ]

Grant Ingersoll commented on LUCENE-1003:
-----------------------------------------

Minor nit: can you add the test case to the patch as well?


[jira] Updated: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1003:
-------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])
         Assignee: Otis Gospodnetic

TUSUR OpenTeam: would it be possible to get a unit test, too?



[jira] Updated: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lihachev updated LUCENE-1003:
------------------------------------

    Attachment: TestRussianAnalyzer.java.patch

Patch that adds a new test to TestRussianAnalyzer.
Usage: patch TestRussianAnalyzer.java < TestRussianAnalyzer.java.patch


[jira] Resolved: (LUCENE-1003) [PATCH] RussianAnalyzer's tokenizer skips numbers from input text,

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved LUCENE-1003.
--------------------------------------

    Resolution: Fixed
