Problem with html code inside xml

Problem with html code inside xml

steve.christin@gmail.com
Hello,

I have a problem with HTML code that is embedded in an XML file.

Sample source:

<content>
        <stories>
                <div class="storyTitle">
                         Les débats
                </div>
                <div class="storyIntroductionText">
                        Le premier tour des élections fédérales se déroulera le 21  
octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
vous, dont plusieurs grands débats à l'enseigne de Forums.
                </div>
                <div class="paragraph">
                        <div class="paragraphTitle"/>
                        <div class="paragraphText">
                                my para textehere
                                <br/>
                                <br/>
                                Vous trouverez sur cette page toutes les dates et les heures de  
ces différents rendez-vous ainsi que le nom et les partis des  
débatteurs. De plus, vous pourrez également écouter ou réécouter  
l'ensemble de ces émissions.
                        </div>
                </div>
....
---------
When I run a query on Solr, I get something like this in the
source of the XML result:

<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraph"</span>
<span class="markup">&gt;</span><div class="expander-content">
<div class="indent"><span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraphTitle"</span>
<span class="markup">/&gt;</span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup">&lt;</span>
...

This is not what I want. I want to keep the HTML tags as they are,
without any of this formatting.

The br and a tags are well formed in the XML and JSON results, but
the div tags are not kept.
---------
In schema.xml I have this for the HTML content:

<fieldType name="html" class="solr.TextField" />

<field name="storyFullText" type="html" indexed="true" stored="true" multiValued="true"/>

---------

Any help would be appreciated.

Thanks in advance.

S. Christin






Re: Problem with html code inside xml

Jérôme Etévé-2
If I understand correctly, you want to keep the raw HTML code in Solr like this
(in the XML file you post):

<field name="storyFullText">
  <html></html>
</field>

I think you should escape your content to protect the characters that are
special in XML:
<  ->  &lt;
>  ->  &gt;
"  ->  &quot;
&  ->  &amp;

If you use Perl, have a look at HTML::Entities.
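
For readers who are not using Perl, the same escaping can be sketched in Python with the standard library. This is only an illustration under assumptions: the build_add_doc helper and the "id" field are hypothetical, and only the storyFullText field name is taken from the schema quoted in this thread.

```python
from xml.sax.saxutils import escape


def escape_for_xml(raw_html):
    """Escape the characters that are special in XML.

    escape() replaces &, < and > by default; the extra mapping also
    replaces the double quote, matching the table above.
    """
    return escape(raw_html, {'"': "&quot;"})


def build_add_doc(doc_id, raw_html):
    """Build a Solr <add> message (hypothetical helper; the "id" field is assumed)."""
    return (
        "<add><doc>"
        '<field name="id">' + escape_for_xml(doc_id) + "</field>"
        '<field name="storyFullText">' + escape_for_xml(raw_html) + "</field>"
        "</doc></add>"
    )


if __name__ == "__main__":
    print(build_add_doc("story-1", '<div class="paragraph">a & b<br/></div>'))
```

Because the ampersand is replaced first, the entities produced for the other characters are not escaped a second time.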



--
Jerome Eteve.
[hidden email]
http://jerome.eteve.free.fr/

Re: Problem with html code inside xml

Thorsten Scherler-3
On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:

> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
>
> <field name="storyFullText">
>   <html></html>
> </field>
>
> I think you should encode your content to protect these xml entities:
> <  ->  &lt;
> > -> &gt;
> " -> &quot;
> & -> &amp;
>
> If you use perl, have a look at HTML::Entities.

AFAIR you cannot return raw tags; they always get transformed to
entities. The solution is to run an XSL transformation on the
response that transforms the entities back into tags.

Have a look at the thread
http://marc.info/?t=116775837900001&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2
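
If the response is consumed by a program rather than rendered through XSL, a plain XML parser already turns the entities back into the original markup. The sketch below is not the XSL approach from the linked thread, only an illustration of the same idea in Python; the sample response is made up and much shorter than a real Solr response, and stored_values is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened response; a real Solr response has more elements.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="storyFullText">
        <str>&lt;div class="paragraph"&gt;my para text&lt;br/&gt;&lt;/div&gt;</str>
      </arr>
    </doc>
  </result>
</response>"""


def stored_values(response_xml, field_name):
    """Return the stored values of a multi-valued field, with entities decoded."""
    root = ET.fromstring(response_xml)
    values = []
    for arr in root.iter("arr"):
        if arr.get("name") == field_name:
            values.extend(s.text or "" for s in arr.findall("str"))
    return values


if __name__ == "__main__":
    for value in stored_values(SAMPLE_RESPONSE, "storyFullText"):
        print(value)
```

The parser decodes &lt; and &gt; while reading, so the caller gets the div and br tags back as ordinary text.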

HTH

salu2

--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions


Re: Problem with html code inside xml

steve.christin@gmail.com
Thanks.

I used this solution: wrap the HTML in a CDATA section in the XML to be
indexed, i.e. <![CDATA[ my HTML code here ]]>. It works, and nothing needs
to change in the XSL.
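
One detail to watch with this approach: a CDATA section cannot itself contain the sequence ]]>. A minimal Python sketch of the wrapping, assuming the XML to be indexed is generated by a script (the cdata and story_field helpers are hypothetical; storyFullText is the field from the schema in this thread):

```python
def cdata(text):
    """Wrap text in a CDATA section.

    A CDATA section ends at the first "]]>", so any occurrence inside the
    text is split across two adjacent sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def story_field(raw_html):
    """Build one <field> element holding raw HTML (hypothetical helper)."""
    return '<field name="storyFullText">' + cdata(raw_html) + "</field>"


if __name__ == "__main__":
    print(story_field('<div class="paragraphText">my para text<br/></div>'))
```

An XML parser reads adjacent CDATA sections as one run of text, so the field value comes out unchanged.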

In the schema I use this fieldType

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.ISOLatin1AccentFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
</fieldType>

----------
Now a question:
I created a second field to index only the text of this HTML code.

I created a field type:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.ISOLatin1AccentFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
</fieldType>

Everything works (the div and p tags are removed), but some
<strong>nnn</strong> or <br/> tags are still in the text after
indexing.

If you have any idea how to solve this problem, that would be great.

Thanks

S. Christin



-------------


On 25 Sept. 07 at 13:14, Thorsten Scherler wrote:

> On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
>> If I understand, you want to keep the raw html code in solr like that
>> (in your posting xml file):
>>
>> <field name="storyFullText">
>>   <html></html>
>> </field>
>>
>> I think you should encode your content to protect these xml entities:
>> <  ->  &lt;
>> > -> &gt;
>> " -> &quot;
>> & -> &amp;
>>
>> If you use perl, have a look at HTML::Entities.
>
> AFAIR you cannot use tags, they always are getting transformed to
> entities. The solution is to have a xsl transformation after the
> response that transforms the entities back to tags.
>
> Have a look at the thread
> http://marc.info/?t=116775837900001&r=1&w=2
> and especially at
> http://marc.info/?l=solr-user&m=116782664828926&w=2
>
> HTH
>
> salu2

Re: Re: Problem with html code inside xml

Ycrux
In reply to this post by steve.christin@gmail.com
Hi!

I'm facing a similar problem. Some HTML docs are indexed correctly and others are simply rejected, even though I encoded all problematic HTML tags as Thorsten suggested.

In the following example, "my_doc.xml" is a valid XML file that complies with the fields in my Solr schema:

$ java -jar post.jar ./my_doc.xml

SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update

Is there any way to make Solr more verbose than that?
Do I need to dig into the Java code to understand what is happening?
I'm looking for a simple solution.

Thanks in advance

cheers
Y.

----Original message----

>From: "[hidden email]"
>Subject: Re: Problem with html code inside xml
>Date: Tue, 2 Oct 2007 16:15:26 +0200
>To: [hidden email]

Re: Problem with html code inside xml

hossman
In reply to this post by steve.christin@gmail.com
: I created a field type:
:
: <fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">

        ...

: Everything works (the div tags, p tags are removed) but some
: <strong>nnn</strong>   or <br/> tags are style in the text after indexing.

i cut/paste that fieldtype into the example schema.xml, and experimented
with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp) and
both of those examples were correctly stripped.

do you have a more specific example of something that doesn't work?

Hmm... it seems like maybe the problem is examples like this...
        blahblah<strong>nnn</strong>
...if the tag is directly adjacent to other text, it may not get stripped
off ... i'm not sure if that's specific to the HtmlWhitespaceTokenizer.
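
One way to check what the plain text should look like, independently of Solr, is to strip the tags on the client side. The sketch below uses Python's standard html.parser; it is not how Solr's tokenizer is implemented, and strip_tags is a hypothetical helper. It replaces every tag with a space, so a tag directly adjacent to text still separates the words.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text content, putting a space wherever a tag appeared."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of an HTML fragment with tags removed (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join("".join(collector.parts).split())


if __name__ == "__main__":
    print(strip_tags("blahblah<strong>nnn</strong>"))
```

Whether "blahblah" and "nnn" should end up as one token or two is a design choice; this sketch keeps them apart.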




-Hoss

Re: Re: Problem with html code inside xml

hossman
In reply to this post by Ycrux

: SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update
:
: Is there any way to let "Solr" to be more verbose than that ?

Solr outputs all errors using whatever default error page format your
servlet container uses; it also logs all errors to the servlet container's
logging system.

this specific error indicates that post.jar could not connect to Solr at
all (hence the "FATAL: Connection error" and the hint that perhaps Solr
isn't actually running at the URL post.jar is trying to contact.)

If you are using the example Jetty setup that comes with Solr, and you
send a document that triggers a Solr error, post.jar will output something
like this (in this specific case, the problem is that the document
being posted is total gibberish, and not XML at all)...

SimplePostTool: FATAL: Solr returned an error: ParseError_at_rowcol11_Message_only_whitespace_content_allowed_before_start_tag_and_not___javaxxmlstreamXMLStreamException_ParseError_at_rowcol11_Message_only_whitespace_content_allowed_before_start_tag_and_not___at_combeaxmlstreamMXParserparsePrologMXParserjava2044__at_combeaxmlstreamMXParsernextImplMXParserjava1947__at_combeaxmlstreamMXParsernextMXParserjava1333__at_orgapachesolrhandlerXmlUpdateRequestHandlerprocessUpdateXmlUpdateRequestHandlerjava148__at_orgapachesolrhandlerXmlUpdateRequestHandlerhandleRequestBodyXmlUpdateRequestHandlerjava123__at_orgapachesolrhandlerRequestHandlerBasehandleRequestRequestHandlerBasejava78__at_orgapachesolrcoreSolrCoreexecuteSolrCorejava807__at_orgapachesolrservletSolrDispatchFilterexecuteSolrDispatchFilterjava206__at_orgapachesolrservletSolrDispatchFilterdoFilterSolrDispatchFilterjava174__at_orgmortbayjettyservletServletHandler$CachedChaindoFilterServletHandlerjava1089__at_orgmortbayjettyservletServletHandlerhandleServletHandlerjava365__at_orgmortbayjettysecuritySecurityHandlerhandleSecurityHandlerjava216__at_orgmortbayjettyservletSessionHandlerhandleSessionHandlerjava181__at_orgmortbayjettyhandlerContextHandlerhandleContextHandlerjava712__at_orgmortbayjettywebappWebAppContexthandleWebAppContextjava405__at_orgmortbayjettyhandlerContextHandlerCollectionhandleContextHandlerCollectionjava211__at_orgmortbayjettyhandlerHandlerCollectionhandleHandlerCollectionjava114__at_orgmortbayjettyhandlerHandlerWrapperhandleHandlerWrapperjava139__at_orgmortbayjettyServerhandleServerjava285__at_orgmortbayjettyHttpConnectionhandleRequestHttpConnectionjava502__at_orgmortbayjettyHttpConnection$RequestHandlercontentHttpConnectionjava835__at_orgmortbayjettyHttpParserparseNextHttpParserjava641__at_orgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_orgmortbayjettyHttpCo
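
To read the servlet container's error page directly, rather than only the status code, the document can also be posted from a small script. This is a rough sketch in Python, assuming Solr listens at the URL shown in the post.jar output above; build_update_request and post_update are hypothetical helper names, and the content type is an assumption that suits XML update messages.

```python
import urllib.error
import urllib.request

# URL taken from the post.jar output quoted in this thread.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_bytes, url=UPDATE_URL):
    """Build a POST request carrying an XML update message."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_update(xml_bytes, url=UPDATE_URL):
    """POST the message and return (status, body), keeping error bodies too."""
    request = build_update_request(xml_bytes, url)
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # On a 4xx or 5xx response the body is the container's error page,
        # which usually contains the exception message.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling post_update(open("my_doc.xml", "rb").read()) and printing the returned body should show whatever the server put in its error page.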



-Hoss

Re: Problem with html code inside xml

steve.christin@gmail.com
In reply to this post by hossman
Well... the XML output has changed, and I now receive
&lt;strong&gt;hhhhhh&lt;strong&gt; (sic!)

So the problem is not really a problem...

Thanks

Steve

On 3 Oct. 07 at 01:09, Chris Hostetter wrote:

> : I created a field type:
> :
> : <fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
>
> ...
>
> : Everything works (the div tags, p tags are removed) but some
> : <strong>nnn</strong>   or <br/> tags are style in the text after indexing.
>
> i cut/paste that fieldtype into the example schema.xml, and experimented
> with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp) and
> both of those examples were correctly stripped.
>
> do you have a more specific example of something that doesn't work?
>
> Hmm... it seems like maybe the problem is examples like this...
> blahblah<strong>nnn</strong>
> ...if the tag is directly adjacent to other text, it may not get stripped
> off ... i'm not sure if that's specific to the HtmlWhitespaceTokenizer.
>
> -Hoss


Re: Problem with html code inside xml

jtal
In reply to this post by Jérôme Etévé-2

When I use HTML::Entities to encode my text, I get this error:

SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity named 'para'

It's complaining about finding &para; in my text. Does anyone know why this is a problem?




Jérôme Etévé-2 wrote
If I understand, you want to keep the raw html code in solr like that
(in your posting xml file):

<field name="storyFullText">
  <html></html>
</field>

I think you should encode your content to protect these xml entities:
<  ->  &lt;
> -> &gt;
" -> &quot;
& -> &amp;

If you use perl, have a look at HTML::Entities.



Re: Problem with html code inside xml

Reece-3
Just use CDATA so that the parser ignores the HTML characters.

http://www.w3schools.com/xml/xml_cdata.asp

-Reece



On Fri, Mar 7, 2008 at 5:11 PM, Latj <[hidden email]> wrote:

>
>
>  When I use HTML::Entities to encode my text, I get this error:
>
>  SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
>  named 'para'
>
>  Its complaining about finding:   &para;   in my text. Anyone know why this
>  is a problem?

Re: Problem with html code inside xml

Yonik Seeley-2
In reply to this post by jtal
On Fri, Mar 7, 2008 at 5:11 PM, Latj <[hidden email]> wrote:
>  When I use HTML::Entities to encode my text, I get this error:
>
>  SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
>  named 'para'
>
>  Its complaining about finding:   &para;   in my text. Anyone know why this
>  is a problem?

&para; is an HTML entity, not standard in XML.
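
XML predefines only five named entities (&lt;, &gt;, &amp;, &quot; and &apos;), so anything else that an HTML encoder emits has to be turned back into a literal character before posting. A minimal Python sketch of that clean-up, assuming the text has already been HTML-encoded and the document is sent as UTF-8; html_entities_to_xml and parses_as_xml are hypothetical helper names:

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def html_entities_to_xml(text):
    """Decode all HTML entities, then re-escape only &, < and > for XML."""
    return escape(html.unescape(text))


def parses_as_xml(fragment):
    """Report whether the fragment is acceptable as XML element content."""
    try:
        ET.fromstring("<field>" + fragment + "</field>")
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    encoded = "Section &para; 3 &amp; more"
    print(parses_as_xml(encoded))
    print(parses_as_xml(html_entities_to_xml(encoded)))
```

The first check fails because the parser has no definition for the para entity; after the conversion the pilcrow is a plain character and only the ampersand stays escaped.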

-Yonik