Token startOffsets with HtmlStripReader

Chris Harris-2
https://issues.apache.org/jira/browse/SOLR-42 changed the
HTMLStripReader so that Tokens from a TokenStream made with
HTMLStripWhitespaceTokenizerFactory would have the correct
Token.startOffset() values. If I'm not mistaken, though, the
HTMLStripReader in trunk still gets offsets wrong when the input
contains an XML processing instruction such as

  <?xml version="1.0" encoding="UTF-8" ?>

SOLR-42 is marked as resolved, so I'll write up what I know here.
I'm hoping someone who understands HTMLStripReader better than I do
can fix this quickly.

To demonstrate the problem, I wrote a small test class that tokenizes
some text with a tokenizer from HTMLStripWhitespaceTokenizerFactory
and then prints each token's startOffset along with the (up to) five
characters of the original string starting at that offset. Things work
fine for most test strings, but when the input has a processing
instruction, the startOffset is one character too small (49 instead
of 50 in the last case below). Here's the output:

-------------------------------------
String to test: <uniqueKey>id</uniqueKey>
  Token info:
    token 'id'
      startOffset: 11
      char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- Unless this field is marked with
required="false", it will be a required field -->
<uniqueKey>id</uniqueKey>
  Token info:
    token 'id'
      startOffset: 99
      char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- And now: two elements --> <element1>one</element1>
  <element2>two</element2>
  Token info:
    token 'one'
      startOffset: 41
      char at startOffset, and next few: 'one</'
    token 'two'
      startOffset: 68
      char at startOffset, and next few: 'two</'
-------------------------------------
String to test: <?xml version="1.0" encoding="UTF-8" ?><uniqueKey>id</uniqueKey>
  Token info:
    token 'id'
      startOffset: 49
      char at startOffset, and next few: '>id</'
-------------------------------------
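
The off-by-one in the last case can be confirmed without Solr at all, just by searching the raw string. This is a minimal sketch; OffsetCheck and its two methods are hypothetical helpers for illustration, not part of Lucene or Solr.

```java
// Sketch only: OffsetCheck is a hypothetical helper, not a Solr/Lucene class.
class OffsetCheck {

    /** True if the text starting at reportedOffset in source begins with tokenText. */
    static boolean offsetMatches(String source, String tokenText, int reportedOffset) {
        // startsWith(prefix, offset) returns false for out-of-range offsets.
        return source.startsWith(tokenText, reportedOffset);
    }

    /**
     * Offset of the first occurrence of tokenText that directly follows a '>',
     * or -1 if there is none. Good enough for the simple test strings here.
     */
    static int expectedOffset(String source, String tokenText) {
        int closeOfTag = source.indexOf(">" + tokenText);
        return closeOfTag < 0 ? -1 : closeOfTag + 1;
    }

    public static void main(String[] args) {
        String withHeader =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><uniqueKey>id</uniqueKey>";

        System.out.println(expectedOffset(withHeader, "id"));     // prints 50
        System.out.println(offsetMatches(withHeader, "id", 49));  // prints false
        System.out.println(offsetMatches(withHeader, "id", 50));  // prints true
    }
}
```

The processing instruction is 39 characters and <uniqueKey> is 11, so 'id' really starts at 50, and the reported 49 points at the '>' that closes <uniqueKey>.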

I've also modified one of the existing test cases to identify the
problem. I will paste the rest of my code below.

Thanks,
Chris

*******************************

[Source code for the test program whose output appears above]

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.*;
import org.apache.solr.analysis.*;


public class Baz
{
        public static void main(String args[]) throws IOException
        {
                String singleElement = "<uniqueKey>id</uniqueKey>";
                String singleElementWithComment = "<!-- Unless this field is marked with required=\"false\", it will be a required field --> <uniqueKey>id</uniqueKey>";
                String twoElementsWithComment = "<!-- And now: two elements --> <element1>one</element1>\n  <element2>two</element2>";
                String elementWithXmlHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><uniqueKey>id</uniqueKey>";


                testStr(singleElement);
                testStr(singleElementWithComment);
                testStr(twoElementsWithComment);
                testStr(elementWithXmlHeader);
        }

        static void testStr(String s) throws IOException
        {
                System.out.println("-------------------------------------");
                System.out.println("String to test: " + s);
                System.out.println("  Token info:");
                StringReader reader = new StringReader(s);

                HTMLStripWhitespaceTokenizerFactory factory = new HTMLStripWhitespaceTokenizerFactory();

                //This standard factory also gets processing instructions wrong:
                //HTMLStripStandardTokenizerFactory factory = new HTMLStripStandardTokenizerFactory();

                TokenStream ts = factory.create(reader);

                while (true)
                {
                        Token t = ts.next();
                        if (t == null)
                        {
                                break;
                        }

                        String tokenText = new String(t.termBuffer(), 0, t.termLength());
                        // Clamp the end index so a token near the end of the input
                        // doesn't cause a StringIndexOutOfBoundsException.
                        int end = Math.min(s.length(), t.startOffset() + 5);
                        String startOffsetStr = s.substring(t.startOffset(), end);
                        System.out.println("    token '" + tokenText + "'");
                        System.out.println("      startOffset: " + t.startOffset());
                        System.out.println("      char at startOffset, and next few: '" + startOffsetStr + "'");
                }
        }
}

***************************

[Here's the unit test]

  public void testXmlProcessingInstruction() throws IOException {
    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><p>Here is a paragraph.</p>";
    // Every stripped markup character should come back as one space:
    // 39 for the processing instruction + 3 for <p> in front, 4 for </p> at the end.
    String gold = "                                          Here is a paragraph.    ";
    HTMLStripReader reader = new HTMLStripReader(new StringReader(html));
    StringBuilder builder = new StringBuilder();
    int ch = -1;
    char [] goldArray = gold.toCharArray();
    int position = 0;
    while ((ch = reader.read()) != -1){
      char theChar = (char) ch;
      builder.append(theChar);
      assertTrue("\"" + theChar + "\"" + " at position: " + position
              + " does not equal: " + goldArray[position]
              + " Buffer so far: " + builder + "<EOB>", theChar == goldArray[position]);
      position++;
    }
    assertTrue(gold + " is not equal to " + builder.toString(), gold.equals(builder.toString()));
  }
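
Rather than counting spaces by hand, a gold string like the one above could be generated. The following is a rough sketch, assuming the intended behavior is that each stripped markup character is replaced by exactly one space (which is what the hand-written gold string suggests). GoldBuilder and blankOutMarkup are hypothetical names, and this naive scan is not how HTMLStripReader itself works: it ignores entities, scripts, and any '>' inside comments or attribute values.

```java
// Sketch only: GoldBuilder is a hypothetical test helper, not a Solr class.
class GoldBuilder {

    /**
     * Replaces every character from a '<' through the next '>' (inclusive)
     * with a single space, so the result has the same length as the input
     * and the remaining text keeps its original offsets.
     */
    static String blankOutMarkup(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inMarkup = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inMarkup = true;
            }
            out.append(inMarkup ? ' ' : c);
            if (c == '>') {
                inMarkup = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><p>Here is a paragraph.</p>";
        String gold = blankOutMarkup(html);

        System.out.println(gold.length() == html.length());  // prints true
        System.out.println(gold.indexOf("Here"));             // prints 42
    }
}
```

Because the output has the same length as the input, any offset into the stripped text is also a valid offset into the original, which is the property the test is checking.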

Re: Token startOffsets with HtmlStripReader

Grant Ingersoll-2
Hi Chris,
You should be able to reopen SOLR-42. Please attach your test case
and/or patch to it. This does sound like a problem.

-Grant


On Feb 13, 2008, at 8:28 PM, Chris Harris wrote:

> [...]

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Token startOffsets with HtmlStripReader

Chris Harris-2
Ok, I've filed everything from my email under SOLR-42. It looks like I
don't have permissions to reopen the bug, so maybe someone else can do
that, if appropriate.

Thanks,
Chris

On Thu, Feb 14, 2008 at 3:56 AM, Grant Ingersoll <[hidden email]> wrote:
> Hi Chris,
>  You should be able to reopen SOLR-42.  Please attach your test case
>  and/or patch on it.  This does sound like a problem.
>
>  -Grant