Indexing XML files

Indexing XML files

mirko-9
Hi,

I am trying to index an xml file as a field in lucene, see example below:

<add>
 <doc>
  <field name="title">As You Like it</field>
  <field name="author">Shakespeare, William</field>
  <field name="record"><myxml>here goes the xml...</myxml></field>
 </doc>
</add>

I can index the title and author fields because they are strings, but the
record field is itself XML, and I run into problems because I cannot
directly post an XML value using the post.sh script (Solr complains).


I wonder what would be the correct (and relatively simple) way of doing it.
Ideally, I would like to store the xml as is, and index only the content
removing the xml-tags (I believe there is HTMLStripWhitespaceAnalyzer for
that).
I would also like to output the result as XML (so simple escaping does not
work for me).


So far, I had the idea of escaping the XML record and then unescaping it for
internal storage and using the analyzer for indexing (which would possibly
require creating a class like XMLField or similar).

thanks,
mirko

Re: Indexing XML files

Chris Hostetter-3

Since XML is the transport for sending data to Solr, you need to make sure
all field values are XML escaped.

If you wanted to index a plain text "title" and that title contained an
ampersand character....

        Sense & Sensibility

...you would need to XML escape that as...

        Sense &amp; Sensibility

...Solr internally will treat that consistently as the Java string "Sense
& Sensibility" and when it comes time to return that string back to your
query clients, will output it in whatever form is appropriate for your
ResponseWriter -- if that's XML, then it will be XML escaped again, if
it's JSON or something like it, it can probably be left alone.

The same holds true for any other characters you want to include in your
field values: Solr doesn't care that the *value* itself is an XML string,
just that you properly escape the value in your XML <add><doc> message to
Solr...

 <add>
  <doc>
   <field name="title">As You Like it</field>
   <field name="author">Shakespeare, William</field>
   <field name="record">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</field>
  </doc>
 </add>

...does that make sense?

: Ideally, I would like to store the xml as is, and index only the content
: removing the xml-tags (I believe there is HTMLStripWhitespaceAnalyzer for
: that).
: And output the result as an xml (so, simple escaping does not work for me).

The escaping is just to send the data to Solr -- once sent, Solr will
process the unescaped string when dealing with analyzers, etc. exactly as
you'd expect.
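
A minimal sketch of that escaping step in Java, for anyone building the <add> message by hand. The class and method names (SolrAddMessageSketch, escapeXml, field) are made up for illustration and are not part of Solr; the sketch only builds the message string and does not post it anywhere.

```java
final class SolrAddMessageSketch {

    // Escape the five characters that are special in XML text and attributes.
    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Wrap one name/value pair in a <field> element, escaping both parts.
    static String field(String name, String value) {
        return "<field name=\"" + escapeXml(name) + "\">" + escapeXml(value) + "</field>";
    }

    public static void main(String[] args) {
        // The record is XML itself; it is escaped like any other string value.
        String record = "<myxml>here goes the xml...</myxml>";
        String message = "<add><doc>"
                + field("title", "As You Like it")
                + field("author", "Shakespeare, William")
                + field("record", record)
                + "</doc></add>";
        System.out.println(message);
    }
}
```

The record value goes through the same escapeXml call as the title; nothing about it being XML changes how it is sent.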


-Hoss


Re: Indexing XML files

mirko-9
Hi,

Thanks for the quick response.  Now, I have one more question.
Is it possible to get the result for a query back in the following form
(considering the input is the escaped xml, what you mentioned before):

<response>
 <responseHeader>
  <status>0</status>
  <QTime>0</QTime>
 </responseHeader>

 <result numFound="1" start="0">
  <doc>
   <str name="label">As You Like It (Promptbook of McVicars 1860)</str>
   <str name="author">Shakespeare, William,</str>
   <str name="record"><myxml>...</myxml></str>
  </doc>
 </result>
</response>

Note that here the XML data is not escaped.  If this is possible, what do I
have to do to get such results back?  Would <str> need to be replaced with a
type, say <xml>, which has a different write method?  Or will I only be able
to display escaped XML within <str> (and any other types)?  If so, why?

thanks,
mirko




Re: Indexing XML files

Yonik Seeley-2
On 12/5/06, [hidden email] <[hidden email]> wrote:

> Thanks for the quick response.  Now, I have one more question.
> Is it possible to get the result for a query back in the following form
> (considering the input is the escaped xml, what you mentioned before):
>
> <response>
>  <responseHeader>
>   <status>0</status>
>   <QTime>0</QTime>
>  </responseHeader>
>
>  <result numFound="1" start="0">
>   <doc>
>    <str name="label">As You Like It (Promptbook of McVicars 1860)</str>
>    <str name="author">Shakespeare, William,</str>
>    <str name="record"><myxml>...</myxml></str>
>   </doc>
>  </result>
> </response>
>
> Note that here the XML data is not escaped.

I bet it is escaped, but your browser has helpfully displayed it as unescaped.
Try doing CTRL-U in firefox to see the real source for the reply.


-Yonik

Re: Indexing XML files

mirko-9
You are right, it is escaped.  But my question is: (how) can I
make it unescaped?

mirko


Quoting Yonik Seeley <[hidden email]>:

...
>
> I bet it is escaped, but your browser has helpfully displayed it as
> unescaped.
> Try doing CTRL-U in firefox to see the real source for the reply.
>
>
> -Yonik
>



Re: Indexing XML files

Yonik Seeley-2
On 12/5/06, [hidden email] <[hidden email]> wrote:
> You are right, it is escaped.  But my question is: (how) can I
> make it unescaped?

For what purpose?
If you use an XML parser, the values it gives back to you will be unescaped.
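
To illustrate the point about parsers, here is a small sketch using the JDK's built-in DOM parser. The response string is a cut-down stand-in for a real Solr response, and the class and helper names (UnescapeByParsingSketch, parse, fieldValue) are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class UnescapeByParsingSketch {

    // Parse an XML string into a DOM tree; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Return the text of the first <str> element with the given name, or null if absent.
    static String fieldValue(Document response, String name) {
        NodeList strs = response.getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element str = (Element) strs.item(i);
            if (name.equals(str.getAttribute("name"))) {
                return str.getTextContent(); // entity references are already resolved here
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String response = "<response><result numFound=\"1\" start=\"0\"><doc>"
                + "<str name=\"record\">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</str>"
                + "</doc></result></response>";
        String record = fieldValue(parse(response), "record");
        System.out.println(record);
        // The unescaped value is well-formed XML in its own right, so it can be parsed again.
        System.out.println(parse(record).getDocumentElement().getNodeName());
    }
}
```

The second parse is the extra step a client pays for the escaping: the field value comes back as a plain string and has to be parsed separately if it is to be treated as a tree.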

-Yonik

Re: Indexing XML files

Mike Klaas
In reply to this post by mirko-9
On 12/5/06, [hidden email] <[hidden email]> wrote:
> You are right, it is escaped.  But my question is: (how) can I
> make it unescaped?

I don't think Solr will support such functionality.  The XML that Solr
uses to return data is completely orthogonal to the XML embedded in
the data, and mixing the two would have utterly unpredictable results.
What if a document contained a <str ...> element?  That could crash
the parsing code, or leave it vulnerable to injection attacks.

Try using the JSON output format if you absolutely have no way of
unescaping the resulting data (though I'd expect that any
self-respecting xml parser would do that for you).
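
A rough illustration of the <str ...> concern, using the JDK DOM parser. The wrapper format and all names here (RawEmbeddingSketch, embedRaw, embedEscaped, countStr) are invented for the example and only mimic the shape of a response; this is not Solr code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class RawEmbeddingSketch {

    // Place the stored value into the response without any escaping.
    static String embedRaw(String value) {
        return "<doc><str name=\"record\">" + value + "</str></doc>";
    }

    // Escape the markup characters first, then embed.
    static String embedEscaped(String value) {
        String escaped = value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return embedRaw(escaped);
    }

    // Count <str> elements in the document, or return -1 if it is not well-formed.
    static int countStr(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors without printing
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("str").getLength();
        } catch (SAXException e) {
            return -1;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String injected = "</str><str name=\"author\">Someone Else";
        System.out.println(countStr(embedRaw(injected)));     // an extra <str> element appears
        System.out.println(countStr(embedEscaped(injected))); // exactly one <str> element
        System.out.println(countStr(embedRaw("<myxml>")));    // unbalanced value: not well-formed
    }
}
```

With raw embedding, a stored value can add fields to the response or make it unparseable; with escaping, the same value stays inside a single <str> element.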

-Mike

Re: Indexing XML files

mirko-9
In reply to this post by Yonik Seeley-2
Hi,

The idea is to apply an XSLT transformation to the result.  But it seems that
I would have to apply two transformations in a row: one that unescapes the
escaped node and a second that performs the actual transformation...

mirko


Quoting Yonik Seeley <[hidden email]>:

> On 12/5/06, [hidden email] <[hidden email]> wrote:
> > You are right, it is escaped.  But my question is: (how) can I
> > make it unescaped?
>
> For what purpose?
> If you use an XML parser, the values it gives back to you will be unescaped.
>
> -Yonik
>



Re: Indexing XML files

Walter Underwood, Netflix
At some point, it would be simpler to write a custom response handler
and generate the output in your desired XML format.

wunder

On 12/5/06 1:52 PM, "[hidden email]" <[hidden email]> wrote:

> Hi,
>
> the idea is to apply XSLT transformation on the result.  But it seems that
> I would have to apply two transformations in a row, one which unescapes the
> escaped node and a second which performs the actual transformation...
>
> mirko
>
>
> Quoting Yonik Seeley <[hidden email]>:
>
>> On 12/5/06, [hidden email] <[hidden email]> wrote:
>>> You are right, it is escaped.  But my question is: (how) can I
>>> make it unescaped?
>>
>> For what purpose?
>> If you use an XML parser, the values it gives back to you will be unescaped.
>>
>> -Yonik


Re: Indexing XML files

Chris Hostetter-3

: At some point, it would be simpler to write a custom response handler
: and generate the output in your desired XML format.

I think Walter's got the right idea ... as a general rule, we want to make
the XmlResponseWriter "bullet proof" so that no matter what data you put
into your index, it is guaranteed to produce a well-formed XML document
that conforms to a specified DTD or XSD (see SOLR-17 for one we already
have but haven't figured out what to do with yet).

But I can certainly understand your use case: you know you have
well-formed XML values in some fields, and want to be able to apply
a simple XSL transform on the whole response, and use XPath selectors to
pull data out of your response fields.

The best approach I can think of that should work for you out of the box
is what you already said: two XSL transforms ... one can be applied
on the Solr server using the XSLT response writer (wt=xslt) -- just create
an XSL that generates XML and unescapes the fields you know will contain
well-formed XML data -- then apply your second transform client-side (or
using a proxy).
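
A sketch of what the first, unescaping transform might look like, run here through the JDK's javax.xml.transform API so it can be tried outside Solr. It relies on disable-output-escaping, which is an optional XSLT 1.0 feature: processors may ignore it, and it only takes effect when the result is serialized to a stream. The field name "record" comes from this thread; the class name, the stylesheet, and the cut-down input document are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class UnescapeTransformSketch {

    // Identity transform, except that the text of <str name="record"> is written out unescaped.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"str[@name='record']\">"
        + "<xsl:copy><xsl:copy-of select=\"@*\"/>"
        + "<xsl:value-of select=\".\" disable-output-escaping=\"yes\"/>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Apply the stylesheet to a response document and return the serialized result.
    static String unescapeRecord(String responseXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(responseXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        String response = "<doc><str name=\"title\">As You Like it</str>"
                + "<str name=\"record\">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</str></doc>";
        System.out.println(unescapeRecord(response));
    }
}
```

The output is only well-formed if the stored record is, so this step is safe only for fields known to hold well-formed XML. The second transform can then address the record's elements with ordinary XPath.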

If you're interested in writing a bit of custom Java code, you could in
fact write a new FieldType (which could easily subclass TextField) with a
custom "write" method that just outputs the raw value directly, and then
load your field type as a plugin...

        http://wiki.apache.org/solr/SolrPlugins

-Hoss


Re: Indexing XML files

Graham O'Regan-2
In reply to this post by Chris Hostetter-3
couldn't you use a cdata section?


Re: Indexing XML files

Yonik Seeley-2
On 12/6/06, Graham O'Regan <[hidden email]> wrote:
> couldn't you use a cdata section?

That's just another form of escaping.  Mirko actually wants the XML
field value to be part of the XML of Solr's response, not encapsulated
by it.

-Yonik

Re: Indexing XML files

mirko-9
In reply to this post by Chris Hostetter-3
Thank you all for the quick responses.  They were very helpful.

My XML is well-formed, so I ended up implementing my own FieldType:

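// Imports assume the Solr 1.x package layout.
import java.io.IOException;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.request.XMLWriter;
import org.apache.solr.schema.TextField;

public class XMLField extends TextField {
  // escape=false: emit the stored value raw, so it must be well-formed XML.
  public void write(XMLWriter xmlWriter, String name, Fieldable f) throws IOException {
    xmlWriter.writePrim("xml", name, f.stringValue(), false);
  }
}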

I looked at the XSD and there is one thing I don't understand:

If the goal is to conform to the XSD (and hence to the types used in the XSD),
then how would it be possible to use user-defined field types as plugins?
Wouldn't they violate the same principle?

thanks,
mirko


Quoting Chris Hostetter <[hidden email]>:
...
> I think Walter's got the right idea ... as a general rule, we want to make
> the XmlResponseWriter "bullet proof" so that no matter what data you put
> into your index, it is guaranteed to produce a well-formed XML document
> that conforms to a specified DTD or XSD (see SOLR-17 for one we already
> have but haven't figured out what to do with yet).
>
...

> if you're interested in writing a bit of custom java code you could in
> fact write a new FieldType (which could easily subclass TextField) with a
> custom "write" method that just outputs the raw value directly, and then
> load your field type as a plugin...
>
> http://wiki.apache.org/solr/SolrPlugins
>
> -Hoss
>



Re: Indexing XML files

Chris Hostetter-3

: I looked at the XSD and there is one thing I don't understand:
:
: If the desired way is to conform to the XSD (and hence the types used in XSD),
: then how would it be possible to use user-defined fieldtypes as plugins?  Wouldn't
: they violate the same principle?

The XSD is intended to match the behavior of the XmlResponseWriter and the
core Solr code base ... if you write a new ResponseWriter (or use one of
the other built-in ResponseWriters like JSON or Ruby) then all bets are
off.  If you are writing a new FieldType, then you might still be able to
use the XSD as is if your data can easily be represented using one of the
"primitive" types (i.e. I might add a new LonLatFieldType class for
efficiently storing/searching geographic coordinates, but when writing as
XML the syntax <str>+37.774395-122.422156</str> might work fine).
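
As a rough sketch of that idea: a coordinate pair can be flattened into a single string so it still fits an existing <str> element. LonLatFieldType is hypothetical in the paragraph above, and the class and method names below are equally made up. This shows only the string formatting, not an actual Solr FieldType.

```java
import java.util.Locale;

final class LonLatFormatSketch {

    // Format latitude and longitude as signed decimal degrees, written back to back.
    static String toExternal(double lat, double lon) {
        // Locale.ROOT keeps the decimal separator a '.' regardless of the default locale.
        return String.format(Locale.ROOT, "%+.6f%+.6f", lat, lon);
    }

    // Wrap the formatted pair in a plain <str> element, as a primitive type would be written.
    static String toStrElement(String name, double lat, double lon) {
        return "<str name=\"" + name + "\">" + toExternal(lat, lon) + "</str>";
    }

    public static void main(String[] args) {
        System.out.println(toStrElement("location", 37.774395, -122.422156));
    }
}
```

Because the value contains only digits, signs, and dots, no escaping question arises, and the response still uses only element types the existing XSD already knows.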

In a case like yours, where you genuinely need to extend the list of valid
tags, XMLSchema has a mechanism for that by letting you define your
own XSD which can reuse the elements defined in the main XSD. (the same
way DTDs can reuse elements from other DTDs)

All of this is a somewhat theoretical issue, since Solr doesn't
currently do anything with that XSD ... I assume if/when it does, it will
be voluntary (i.e. there might be a config option to have it include an XSD
of your choice in the XML header of the responses so you can validate if you
choose to).



-Hoss