Indexing HTML content... (Embed HTML into XML?)

Ravish Bhagdev
Hello,

Sorry if this is a stupid question.  I'm trying to index an HTML file as
one of the fields in Solr.  I've set up an appropriate analyzer in the
schema, but I'm not sure how to add the HTML content to Solr.  Putting raw
HTML inside a field tag is obviously not valid XML.  How do I add HTML
content?  I hope the question is clear.

Thanks,
Ravi

Re: Indexing HTML content... (Embed HTML into XML?)

Jérôme Etévé-2
You need to encode your HTML content so it can be included as a normal
'string' value in your XML element.

As far as I remember, the only unsafe characters you have to encode as
entities are:
<  -> &lt;
>  -> &gt;
"  -> &quot;
&  -> &amp;

(google xml entities to be sure).

I don't know what language you use, but in Perl, for instance, you can
use something like:
use HTML::Entities;
my $xmlString = encode_entities($rawHTML, '<>&"');

You also need to make sure your HTML is encoded in UTF-8, to comply
with Solr's need for UTF-8 encoded XML.

I hope it helps.

J.

--
Jerome Eteve.
[hidden email]
http://jerome.eteve.free.fr/

Re: Indexing HTML content... (Embed HTML into XML?)

Ravish Bhagdev
Thanks Jérôme!

It seems to work now.  I just hope the provided
HTMLStripWhitespaceTokenizerFactory will strip the right tags now.

I use Java and successfully used the HtmlEncoder class provided in
http://itext.ugent.be/library/api/ for the encoding (just in case
someone happens to search this thread).
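
For anyone who would rather not pull in a library, the escaping described earlier in the thread can be sketched in plain Java roughly as below. This is only an illustration, not what iText does internally: the class name SolrHtmlField, the helper names, and the field name "content" are all made up for the example, and a real schema will probably need other fields too (a unique key, for instance).

```java
import java.nio.charset.StandardCharsets;

public class SolrHtmlField {

    // Replace the four characters listed in the thread with XML entities.
    public static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Wrap the escaped HTML in a Solr XML <add> message.
    // The field name is whatever your schema defines (hypothetical here).
    public static String toAddMessage(String fieldName, String rawHtml) {
        return "<add><doc><field name=\"" + escapeXml(fieldName) + "\">"
                + escapeXml(rawHtml)
                + "</field></doc></add>";
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Tom & Jerry</p>";
        String xml = toAddMessage("content", html);
        System.out.println(xml);

        // Send the message to Solr as UTF-8 bytes, not the platform default.
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(body.length + " bytes to post");
    }
}
```

Note that this does not filter control characters that are illegal in XML 1.0, so HTML from untrusted sources may need extra cleaning first.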

Ravi
