How to best index user-generated content

nicksnels1
Hi,

I want users to add content to my site using TinyMCE, which generates HTML.
When I tried adding the data to Solr, Solr refused to add it (or at least
generated an error):

SEVERE: org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT to read text (position: START_TAG seen ...<field name="text"><p>... @4:39)
    at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1071)
    at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:910)
    at org.apache.solr.core.SolrCore.update(SolrCore.java:685)
    at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
    at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:275)
    at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:80)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)

I searched the archives to resolve this, since I didn't want to strip out
the HTML entirely. The solution turned out to be wrapping the HTML text in
a <![CDATA[ ... ]]> section, like so:

<add><doc>
   <field name="text"><![CDATA[#{field.text}]]></field>
</doc></add>
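
One caveat with this approach is that user input may itself contain the sequence ]]>, which would end the CDATA section early. A minimal Python sketch of one way to guard against that follows. The helper names to_cdata and build_add_xml are made up for illustration, and the field name is assumed to come from trusted code rather than from users.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in a CDATA section, splitting any embedded ']]>'."""
    # ']]>' may not appear inside CDATA, so close the section after ']]'
    # and open a new one before the '>'.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_add_xml(field_name, html):
    """Build an <add> message with a single field (hypothetical helper).

    field_name is assumed to be a trusted, XML-safe attribute value.
    """
    return '<add><doc><field name="%s">%s</field></doc></add>' % (
        field_name,
        to_cdata(html),
    )


# Round-trip check: a standard XML parser should return the original HTML.
sample = "<p>Fish & chips ]]> done</p>"
parsed = ET.fromstring(build_add_xml("text", sample))
assert parsed.find("doc/field").text == sample
```

The round-trip through ElementTree is only a local sanity check that the message is well-formed; it does not involve Solr itself.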

This also drew my attention to another problem: characters like < > & are
all 'invalid' characters between XML tags. Does that mean I have to put
<![CDATA[ around all the fields I want to index? I don't know and can't
control what my users will input. Is this the only solution, or is there a
way for Solr to handle these 'invalid' characters in the indexed text by
itself, without generating errors?

Kind regards,

Nick

Re: How to best index user-generated content

Tim Archambault-2
Whatever programming language you are using probably has a function that
makes text "XML-safe". For example, I'm using ColdFusion to integrate with
Solr, and all user data is escaped as follows:

#xmlformat(usergeneratedcontent)#

My guess is that PHP, ASP, etc. all have a function like this.
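
For comparison, here is a rough Python equivalent of the same escaping approach, using escape and quoteattr from the standard library module xml.sax.saxutils. The helper names solr_field and solr_add are invented for this sketch, and the message layout simply mirrors the <add><doc> example in the question.

```python
from xml.sax.saxutils import escape, quoteattr


def solr_field(name, value):
    """Render one <field> element with XML-safe content."""
    # quoteattr() escapes the name and adds the surrounding quotes;
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))


def solr_add(fields):
    """Render an <add><doc> message from a dict of field name -> text."""
    body = "".join(solr_field(n, v) for n, v in fields.items())
    return "<add><doc>%s</doc></add>" % body


message = solr_add({"text": "<p>Tom & Jerry</p>"})
print(message)
```

With escaping applied to every field, no CDATA wrapper is needed, and the XML parser hands the original characters back to the indexer.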


On 9/20/06, Nick Snels <[hidden email]> wrote:
