Indexing XML

Indexing XML

PAUWELS  Benoit
Hi,

 

I wish to index well-formed XML documents as they are.

I have a database filled with MARCXML records. An example of these looks like this:

 

        <record
            ns0:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
            xmlns="http://www.loc.gov/MARC21/slim" xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
            <leader>00000nam  22      a 4500</leader>
            <controlfield tag="001">000500000</controlfield>
            <controlfield tag="005">20050826220257.0</controlfield>
            <controlfield tag="008">000710s1998    xx      r     000 0 dut d</controlfield>
            <datafield ind1=" " ind2=" " tag="040">
                <subfield code="a">Univ</subfield>
            </datafield>
            <datafield ind1="1" ind2=" " tag="100">
                <subfield code="a">van Wetten, J. W.</subfield>
            </datafield>
            <datafield ind1="1" ind2="3" tag="245">
                <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
                <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
            </datafield>
        </record>

 

The idea is to create Lucene indexes on specific MARC fields and store the complete MARC record in Lucene 'as is'. In the presentation layer of my application I would then have this complete MARC record at hand, and as such have full flexibility on which MARC fields to display. So I want to create the following record through XSLT and feed this to SOLR.

 

<doc>
<field name="title">De positie van vrouwen in de asielprocedure</field>
<field name="author">van Wetten, J. W.</field>
...
<field name="originalRecord">
  <record
            ns0:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
            xmlns="http://www.loc.gov/MARC21/slim" xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
            <leader>00000nam  22      a 4500</leader>
            <controlfield tag="001">000500000</controlfield>
            <controlfield tag="005">20050826220257.0</controlfield>
            <controlfield tag="008">000710s1998    xx      r     000 0 dut d</controlfield>
            <datafield ind1=" " ind2=" " tag="040">
                <subfield code="a">UGent</subfield>
            </datafield>
            <datafield ind1="1" ind2=" " tag="100">
                <subfield code="a">van Wetten, J. W.</subfield>
            </datafield>
            <datafield ind1="1" ind2="3" tag="245">
                <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
                <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
            </datafield>
        </record>
</field>
</doc>

 

I have the following in my schema.xml:

 

<field name="author" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="title" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="originalRecord" type="text" indexed="false" stored="true"/>

 

 

SOLR has of course a problem with the XML in the 'originalRecord' field.

Is there a solution to this? Has anyone done this before?

 

Thanks a lot.

Benoit.

 

 

=============================
PAUWELS Benoit
Université Libre de Bruxelles - Libraries
Head of Automation
Av. F.D. Roosevelt 50, CP 180
1050 BRUSSELS
Belgium
Tel: + 32 2 650 23 91
Fax: + 32 2 650 23 91
=============================

 

 


Re: Indexing XML

Pieter Berkel
> SOLR has of course a problem with the XML in the 'originalRecord' field.
> Is there a solution to this? Has anyone done this before?


I would suggest changing the field type of "originalRecord" to "string"
rather than "text", and if you're still having trouble with the XML data,
simply wrap it in a CDATA section:

<field name="originalRecord"><![CDATA[ ... ]]></field>

cheers,
Piete
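
For readers who build the update document in code rather than by hand, here is a minimal Python sketch of the two ways to embed the record as plain text: entity escaping and the CDATA wrapping suggested above. The helper names `solr_field` and `solr_field_cdata` are hypothetical, and the sketch only builds the `<field>` element; it does not talk to Solr. One caveat with CDATA is that the record must not contain the sequence `]]>`, so the second helper splits that sequence across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def solr_field(name: str, raw_xml: str) -> str:
    """Embed raw XML as text by escaping &, < and > (hypothetical helper)."""
    return f'<field name="{name}">{escape(raw_xml)}</field>'


def solr_field_cdata(name: str, raw_xml: str) -> str:
    """Embed raw XML as text inside CDATA (hypothetical helper)."""
    # "]]>" would end the CDATA section early, so split it across two sections.
    safe = raw_xml.replace("]]>", "]]]]><![CDATA[>")
    return f'<field name="{name}"><![CDATA[{safe}]]></field>'


# Shortened stand-in for a MARCXML record.
record = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<leader>00000nam  22      a 4500</leader>'
    '</record>'
)

# Either way, an XML parser sees the record as one text value, not as child elements.
for build in (solr_field, solr_field_cdata):
    parsed = ET.fromstring(build("originalRecord", record))
    assert parsed.text == record
```

Both forms are equivalent to an XML parser; escaping is usually the safer default because it needs no special case for `]]>`.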

Re: Indexing XML

Alan Rykhus
In reply to this post by PAUWELS Benoit
Hello Benoit,

An additional thing to check out is the work being done on fac-back-opac.
They have a parser that will parse native MARC records.

I would assume that if you can extract your records in MARC XML you can
extract them in native MARC.

I've used the parser and it works well.

al

On Fri, 2007-10-05 at 02:44 -0500, PAUWELS Benoit wrote:

> I wish to index well formed xml documents as they are.
--
Alan Rykhus
PALS, A Program of the Minnesota State Colleges and Universities
(507)389-1975
[hidden email]



Re: Indexing XML

Walter Underwood, Netflix
In reply to this post by PAUWELS Benoit
Solr is not an XML engine (or a MARC engine). It uses XML as an input format
for fielded data. It does not index or search arbitrary XML. You need to
convert your XML into Solr's format.

I would recommend expressing MARC in a Solr schema, then working on the
input XML. The input XML depends on the schema.
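
As a rough illustration of that conversion, the following Python sketch maps one MARCXML record onto the three fields from the schema quoted in the original post (`title`, `author`, `originalRecord`). The function names are hypothetical, the choice of 245$a and 100$a follows the example in the question, and stripping the trailing " /" from the title is an assumption made to match the sample output there. Serializing through ElementTree escapes the stored record automatically.

```python
import xml.etree.ElementTree as ET

MARC = {"m": "http://www.loc.gov/MARC21/slim"}


def subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"m:datafield[@tag='{tag}']/m:subfield[@code='{code}']"
    node = record.find(path, MARC)
    return node.text if node is not None else None


def to_solr_doc(marcxml: str) -> str:
    """Build a Solr <doc> from one MARCXML record (hypothetical helper)."""
    record = ET.fromstring(marcxml)
    title = subfield(record, "245", "a")
    fields = {
        # Assumption: drop the trailing ISBD " /" so the title reads cleanly.
        "title": title.rstrip(" /") if title else None,
        "author": subfield(record, "100", "a"),
        # The untouched record, stored as text for the presentation layer.
        "originalRecord": marcxml,
    }
    doc = ET.Element("doc")
    for name, value in fields.items():
        if value is not None:
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(doc, encoding="unicode")
```

The resulting `<doc>` still has to be wrapped in whatever update message the Solr version in use expects.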

If you need an XML engine, I'd recommend MarkLogic (commercial), a very
good product.

wunder

On 10/5/07 12:44 AM, "PAUWELS  Benoit" <[hidden email]> wrote:

> I wish to index well formed xml documents as they are.


Re: Indexing XML

wsgrah
Benoit,

Are you familiar with the VuFind project (http://www.vufind.org)? Look at
the PHP code in the import folder to see how the indexing works (there's
an XSL transformation that then updates the index).
I've also written some initial code to use embedded Solr to do this
indexing directly from MARC-format files, including holding the entire
MARCXML record in the index.
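
Once the whole MARCXML record is held in the index as a text value, the presentation layer can parse it again and pick whichever fields it wants to show. The Python sketch below is only an illustration of that round trip; `display_fields` is a hypothetical helper, and it assumes the stored value comes back from the search response as the original, unescaped XML string.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def display_fields(stored_record: str, wanted_tags):
    """Yield (tag, code, text) for each subfield of the requested datafields."""
    record = ET.fromstring(stored_record)
    for datafield in record.iter(MARC_NS + "datafield"):
        if datafield.get("tag") in wanted_tags:
            for sub in datafield.iter(MARC_NS + "subfield"):
                yield datafield.get("tag"), sub.get("code"), sub.text
```

Calling `list(display_fields(stored, {"100", "245"}))` on a stored record would give the author and title subfields, and the set of tags can change without reindexing.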

You can contact me off-list if you have questions...

Wayne



--
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */