Indexing HTML

7 messages

Indexing HTML

mgkimsal
Hello

I'm trying to index individual lines of an HTML file, and I'm hitting this
error:

TEXT must be immediately followed by END_TAG and not START_TAG

I've got something that looks like

<add>
<doc>
<field name="id">4</field>
<field name="line"><a href="foobar"><b><i>linktext</i></b></a></field>
</doc>
</add>

Actually, that sample code above, as its own data file POSTed to SOLR,
throws

parser must be on START_TAG or TEXT to read text (position: START_TAG seen
...&lt;field name="line"&gt;&lt;a href="foobar"&gt;... @4:37

as an error.

Any clues as to how I can do this?  I'd like to keep the original copy of
each line intact in the index.

Thanks!

--
Michael Kimsal
http://webdevradio.com
Reply | Threaded
Open this post in threaded view
|

Re: Indexing HTML

Thierry Collogne
I think you can use the HTMLStripWhitespaceTokenizerFactory.

Look here:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

I hope this helps



Re: Indexing HTML

Erik Hatcher
In reply to this post by mgkimsal
Michael,

I think the issue is that you're not escaping the <field> values.
Send something like this to Solr instead:

  <field name="line">&lt;a href="foobar"&gt;&lt;b&gt;&lt;i&gt;linktext&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;</field>
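
For anyone generating these files from a script, the escaping step might look roughly like the Python sketch below. It assumes the standard library's xml.sax.saxutils.escape is good enough for the job; build_add_xml is a made-up helper name for illustration, not part of Solr or of any client library.

```python
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(doc_id, line_html):
    """Return a Solr <add> message with the HTML line stored as escaped text.

    build_add_xml is a hypothetical helper used only for this illustration.
    """
    fields = [("id", str(doc_id)), ("line", line_html)]
    parts = ["<add>", "<doc>"]
    for name, value in fields:
        # escape() turns &, < and > into entities, so the XML parser sees
        # the value as plain text. The <field> tags themselves stay as-is.
        parts.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    parts.extend(["</doc>", "</add>"])
    return "\n".join(parts)


payload = build_add_xml(4, '<a href="foobar"><b><i>linktext</i></b></a>')
print(payload)
```

Only the field values pass through escape(); the surrounding <add>, <doc> and <field> markup is written out unescaped.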

        Erik



Re: Indexing HTML

mgkimsal
What's odd about this is that the error seems to indicate that I did.

The full text (minus the stack trace) was

org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT
to read text (position: START_TAG seen ...&lt;field name="line"&gt;&lt;a
href="foobar"&gt;... @4:37)

Or is that just a byproduct of how SOLR reports the errors back - always
escaping them?

Thanks guys - I'll have another crack at this tonight.



Re: Indexing HTML

Erik Hatcher

On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote:
> What's odd about this is that the error seems to indicate that I did.

Actually the error message looks like you escaped too much.  You  
should _not_ escape <field>, only the contents of it.
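
One way to check a payload before POSTing it is to parse it locally and confirm that every <field> holds only text. The sketch below uses Python's xml.etree.ElementTree; fields_are_text_only is a hypothetical name, and the check only approximates what Solr's own parser requires.

```python
import xml.etree.ElementTree as ET


def fields_are_text_only(add_xml):
    """Return True if no <field> element in the message has child elements."""
    root = ET.fromstring(add_xml)
    return all(len(field) == 0 for field in root.iter("field"))


raw = '<add><doc><field name="line"><a href="foobar">linktext</a></field></doc></add>'
escaped = ('<add><doc><field name="line">'
           '&lt;a href="foobar"&gt;linktext&lt;/a&gt;'
           '</field></doc></add>')

print(fields_are_text_only(raw))      # False: <a> is parsed as a child element
print(fields_are_text_only(escaped))  # True: the markup is ordinary text

# The parser decodes the entities, so the original HTML line comes back intact.
print(ET.fromstring(escaped).find("doc/field").text)
```

The last line shows why escaping does not lose anything: once parsed, the field text is the original HTML line again.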

        Erik




Re: Indexing HTML

Ravish Bhagdev
Hi Erik, All,

I escaped the HTML text into entities before sending it to Solr, and indexing
went fine.  The problem now is that when I get back a snippet with
highlighted text for the HTML field, it's not well formed, because the
highlighting sometimes doesn't include the entire tag if one is present.  For
example:

<lst name="0008369D">

        <arr name="document">

        <str>
ound-color: #FFFFFF; text-align: left; text-indent: 0px;
<em>line-heigh</em>t: normal ; margin-top: 0px; margin-ri
</str>
</arr>
</lst>

<lst name="0008369B">

        <arr name="document">

        <str>
/TR&gt;<br />
&lt;TR align=&quot;left<em>&quot;  va</em>lign=&quot;middle&quot;
style=&quot; height: 28.800000px;&q
</str>
</arr>
</lst>
</lst>

Because of this I cannot present the resulting HTML in a web page.  Is
it possible to strip out all HTML tags completely in the result set?
Would you recommend sending stripped-out text to Solr instead?  But
doesn't Solr use HTML features while searching (anchors, titles, etc.)?

Why is there no documentation about indexing HTML specifically using
Solr?  How does Nutch do it?  Does it strip out HTML in the snippets
it returns?

Any help will be appreciated.

Thanks,
Ravi


Re: Indexing HTML

Mike Klaas
On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote:

>
> Because of this I cannot present the resulting HTML in a web page.  Is
> it possible to strip out all HTML tags completely in the result set?
> Would you recommend sending stripped-out text to Solr instead?  But
> doesn't Solr use HTML features while searching (anchors, titles, etc.)?
>
> Why is there no documentation about indexing HTML specifically using
> Solr?  How does Nutch do it?  Does it strip out HTML in the snippets
> it returns?

Solr isn't a web search engine, and doesn't do any special processing
of HTML (although you can ask it to strip HTML if you want).

I recommend stripping the HTML yourself, and putting titles, anchors,
etc. in separate fields.
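
As a rough illustration of that approach, the Python sketch below pulls the title, the anchor text and the visible text out of a page with the standard library's html.parser. FieldExtractor, html_to_fields and the three field names are assumptions made for this example; real field names would have to match whatever the schema defines, and the values would still need XML-escaping before being sent to Solr.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect title text, anchor text and all visible text from an HTML page.

    FieldExtractor and the field names used below are illustrative only.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []
        self.title, self.anchors, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; this tolerates sloppy nesting
        # and unclosed tags such as <br>.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        chunk = data.strip()
        # Ignore whitespace-only runs and anything inside <script>/<style>.
        if not chunk or {"script", "style"} & set(self.open_tags):
            return
        if "title" in self.open_tags:
            self.title.append(chunk)
            return
        if "a" in self.open_tags:
            self.anchors.append(chunk)
        self.text.append(chunk)


def html_to_fields(html):
    """Return a dict of candidate field values extracted from the page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "anchors": parser.anchors,
        "text": " ".join(parser.text),
    }


page = ('<html><head><title>Demo page</title></head><body>'
        '<p style="line-height: normal">See <a href="foobar"><b>linktext</b></a> here.</p>'
        '</body></html>')
print(html_to_fields(page))
```

Because the "text" value contains no markup or style attributes, highlighted snippets taken from it should not end up with tags cut off in the middle.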

I believe that it would be possible to write this as a Solr
update-handler plugin, if you wanted it to all run in one place.

-Mike