XML querying

XML querying

Luis Neves-3

Hello.
What I do now to index XML documents is use a Filter to strip the markup.
This works, but it's impossible to know where in the document the match is located.
What would it take to make it possible to specify a filter query that accepts XPath
expressions? Something like:

fq=xmlField:/book/content/text()

This way only the "/book/content/" element would be searched.

Does that make sense? Is this possible?

--
Luis Neves

Re: XML querying

Thorsten Scherler-3
On Mon, 2007-01-15 at 12:23 +0000, Luis Neves wrote:

> Hello.
> What I do now to index XML documents it's to use a Filter to strip the markup,
> this works but it's impossible to know where in the document is the match located.
> What would it take to make possible to specify a filter query that accepts xpath
> expressions?... something like:
>
> fq=xmlField:/book/content/text()
>
> This way only the "/book/content/" element was searched.
>
> Did I make sense? Is this possible?

AFAIK short answer: no.

The field is ALWAYS plain text. There is no xmlField type.

...but why don't you just add your text to multiple fields when indexing?

Instead of simply stripping the markup, apply the XPath above to your document
and create different fields, like:
<field name="content"><xsl:value-of select="/book/content/text()"/></field>
<field name="more"><xsl:value-of select="/book/more/text()"/></field>
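
For anyone doing the extraction outside XSLT, here is a minimal Python sketch of the same idea: pull selected elements out of the source document and emit one Solr field per element. The field names "content" and "more" come from the example above; the FIELD_PATHS mapping, the book_to_solr_add helper, and the sample document are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: Solr field name -> element path relative to the root <book>.
FIELD_PATHS = {
    "content": "content",
    "more": "more",
}


def book_to_solr_add(book_xml: str) -> str:
    """Build a Solr <add><doc> update message with one field per extracted element."""
    book = ET.fromstring(book_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for field_name, path in FIELD_PATHS.items():
        for node in book.findall(path):
            field = ET.SubElement(doc, "field", name=field_name)
            # itertext() also picks up text inside nested inline markup,
            # which is slightly broader than XPath text() on the element.
            field.text = "".join(node.itertext()).strip()
    return ET.tostring(add, encoding="unicode")


# Made-up sample document, not taken from the thread.
sample = "<book><content>Call me <i>Ishmael</i>.</content><more>Notes</more></book>"
solr_add_xml = book_to_solr_add(sample)
print(solr_add_xml)
```

The resulting message can then be posted to Solr's update handler. A query such as fq=content:something would then be restricted to the text that came from /book/content.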

Makes sense?

HTH

salu2



Re: XML querying

Luis Neves-3

Hi!

Thorsten Scherler wrote:

> On Mon, 2007-01-15 at 12:23 +0000, Luis Neves wrote:
>> Hello.
>> What I do now to index XML documents it's to use a Filter to strip the markup,
>> this works but it's impossible to know where in the document is the match located.
>> What would it take to make possible to specify a filter query that accepts xpath
>> expressions?... something like:
>>
>> fq=xmlField:/book/content/text()
>>
>> This way only the "/book/content/" element was searched.
>>
>> Did I make sense? Is this possible?
>
> AFAIK short answer: no.
>
> The field is ALWAYS plain text. There is no xmlField type.
>
> ...but why don't you just add your text in multiple field when indexing.
>
> Instead of plain stripping the markup do above xpath on your document
> and create different fields. Like
> <field name="content"> <xsl:value-of
> select="/book/content/text()"/></field>
> <field name="more"> <xsl:value-of select="/book/more/text()"/></field>
>
> Makes sense?

Yes, but I have documents with different schemas in the same "xml field". Also,
that way I would have to know the schema of the documents being indexed (which
I don't).

The schema I use is something like:
<field name="DocumentType" type="string" indexed="true" stored="true"/>
<field name="Document" type="text" indexed="true" stored="true"/>

Where each distinct DocumentType has its own schema.

I could revise this approach to use a Solr instance for each DocumentType, but I
would have to find a way to "merge" results from the different instances because
I also need to search across different DocumentTypes... I guess I'm SOL :-(


--
Luis Neves

Re: XML querying

Yonik Seeley-2
On 1/15/07, Luis Neves <[hidden email]> wrote:
> Yes, but I have documents with different schemas on the same "xml field", also,
> that way I  would have to know the schema of the documents being indexed (which
> I don't).

Solr and Lucene don't really support indexing structured data such as
XML... people are looking at ways to add flexible indexing to Lucene
so that XML indexing could be supported.  When that happens, then
we'll figure out how to fit that into Solr.

There are also XML databases out there, but performance currently
isn't great from what I've heard.

-Yonik

Re: XML querying

thorsten
In reply to this post by Luis Neves-3
On Mon, 2007-01-15 at 13:42 +0000, Luis Neves wrote:

> Hi!
>
> Thorsten Scherler wrote:
>
> > On Mon, 2007-01-15 at 12:23 +0000, Luis Neves wrote:
> >> Hello.
> >> What I do now to index XML documents it's to use a Filter to strip the markup,
> >> this works but it's impossible to know where in the document is the match located.
> >> What would it take to make possible to specify a filter query that accepts xpath
> >> expressions?... something like:
> >>
> >> fq=xmlField:/book/content/text()
> >>
> >> This way only the "/book/content/" element was searched.
> >>
> >> Did I make sense? Is this possible?
> >
> > AFAIK short answer: no.
> >
> > The field is ALWAYS plain text. There is no xmlField type.
> >
> > ...but why don't you just add your text in multiple field when indexing.
> >
> > Instead of plain stripping the markup do above xpath on your document
> > and create different fields. Like
> > <field name="content"> <xsl:value-of
> > select="/book/content/text()"/></field>
> > <field name="more"> <xsl:value-of select="/book/more/text()"/></field>
> >
> > Makes sense?
>
> Yes, but I have documents with different schemas on the same "xml field", also,
> that way I  would have to know the schema of the documents being indexed (which
> I don't).
>
> The schema I use is something like:
> <field name="DocumentType" type="string" indexed="true" stored="true"/>
> <field name="Document" type="text" indexed="true" stored="true"/>
>
> Where each distinct DocumentType has its own schema.
>
> I could revise this approach to use an Solr instance for each DocumentType but I
> would have to find a way to "merge" results from the different instances because
> I also need to search across different DocumentTypes... I guess I'm SOL :-(
>

I think you should explain your use case a wee bit more.

> >> What I do now to index XML documents it's to use a Filter to strip the markup,
> >> this works but it's impossible to know where in the document is the match located.

Why do you need to know where?

Maybe we can think of something.

salu2
--
thorsten

"Together we stand, divided we fall!"
Hey you (Pink Floyd)



Re: XML querying

Luis Neves-3
Hi,

Thorsten Scherler wrote:
> On Mon, 2007-01-15 at 13:42 +0000, Luis Neves wrote:

>
> I think you should explain your use case a wee bit more.
>
>>>> What I do now to index XML documents it's to use a Filter to strip the markup,
>>>> this works but it's impossible to know where in the document is the match located.
>
> why do you need to know where?

Poorly phrased on my part. Ideally I want to apply "lucene filters" to the XML
content, something like what Nux does:
<http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html>


--
Luis Neves

Re: XML querying

Thorsten Scherler-3
On Wed, 2007-01-17 at 09:36 +0000, Luis Neves wrote:

> Hi,
>
> Thorsten Scherler wrote:
> > On Mon, 2007-01-15 at 13:42 +0000, Luis Neves wrote:
>
> >
> > I think you should explain your use case a wee bit more.
> >
> >>>> What I do now to index XML documents it's to use a Filter to strip
> > the markup,
> >>>> this works but it's impossible to know where in the document is the match located.
> >
> > why do you need to know where?
>
> Poorly phrased from my part. Ideally I want to apply "lucene filters" to the xml
> content.
> Something like what Nux does:
> <http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html>
>

http://dsd.lbl.gov/nux/#Google-like realtime fulltext search via Apache Lucene engine

If you have a look at this you will see that the Lucene search is plain
and not XQuery based. It is more that you can define relations, as in
SQL when connecting two tables via keys. As I understand it, it will return
the docs that match the XPath
/books/book[author="James" and lucene:match(abstract, $query)]
where the lucene match is based on a normal Lucene query.

I reckon it should be very easy to do something like this in a client
environment like cocoon/forrest. See the Nux code to get an idea.
If I needed to solve this, I would look for a component that allows
XQuery, like Nux, and a component that lets me query a Solr
server.

Then you "just" need a custom method that matches up the documents for
which both components return a result.
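
A rough Python sketch of that matching step, under several assumptions: the XML documents are available to the client keyed by the same id that Solr returns, the structural test is simple enough for the limited XPath subset in Python's ElementTree (a real setup would use an XQuery engine such as Nux), and the Solr response is stubbed as a plain list rather than fetched from a server. The function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def ids_matching_structure(docs, author):
    """Return the ids of XML docs that contain a <book> by the given author."""
    matching = set()
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # ElementTree's XPath subset: [author='...'] tests a child's text.
        # Note: an author name containing a quote would need escaping.
        if root.findall(f"book[author='{author}']"):
            matching.add(doc_id)
    return matching


def combine(structural_ids, solr_hits):
    """Keep the (already ranked) Solr hits whose id also passed the structural test."""
    return [hit for hit in solr_hits if hit["id"] in structural_ids]


# Made-up documents, keyed by the id they would have in the Solr index.
docs = {
    "1": "<books><book><author>James</author><abstract>salmon fishing</abstract></book></books>",
    "2": "<books><book><author>Anna</author><abstract>salmon recipes</abstract></book></books>",
}

# Stand-in for the hits of a normal full-text query (e.g. "salmon") against Solr.
solr_hits = [{"id": "2", "score": 1.4}, {"id": "1", "score": 0.9}]

result = combine(ids_matching_structure(docs, "James"), solr_hits)
print(result)
```

This intersection works at the document level only, so it is coarser than the Nux example: it cannot guarantee that the full-text match occurred inside the same element that satisfied the structural condition.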

salu2
