DIH transformer problems

Lemke, Michael  SZ/HZA-ZSW
I am having a little fight with the DataImportHandler and the
application of RegexTransformer and TemplateTransformer.
A stripped-down version of what I am trying in data-config.xml, which
is taken pretty much from the various Solr wikis:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
         <entity name="wf" rootEntity="false" dataSource="null"
             processor="FileListEntityProcessor"
             baseDir="d:\inetpub\webapps\searchserver\solr\importdaten\import_wiki"
             fileName="wiki_..\.xml">
            <entity name="doc"
                 processor="XPathEntityProcessor"
                 forEach="/mediawiki/page"
                 stream="true"
                 url="${wf.fileAbsolutePath}"
                 transformer="RegexTransformer,HTMLStripTransformer,TemplateTransformer"
                 >
              <field column="ilang" template="${wf.fileAbsolutePath}" regex=".*?(..)\.xml" replaceWith="$1"/>
              <field column="HEADER" xpath="/mediawiki/page/title" required="true" stripHTML="true"/>

              <field column="xxCONTENT" xpath="/mediawiki/page/revision/text"/>
              <field column="xxCONTENT" regex="(?m)^=====(.+?)=====$"
                      replaceWith="&lt;h4&gt;$1&lt;/h4&gt;"/>

              <!-- more regex transforms here -->
              <field column="xxCONTENT" stripHTML="true"/>

              <field column="NGLANG"             template="${doc.ilang}" />
              <field column="CONTENTPREVIEW" template="${doc.xxCONTENT}"/>
            </entity>
         </entity>
    </document>
</dataConfig>

The problem is with ilang.  The regex is not applied, no matter what I try.  Even
a straightforward  <... regex=".*" replaceWith="en" ...> doesn't work.  I always
end up with the full pathname.

The regexes on xxCONTENT work fine, however.  So it's not that my regex is wrong or
that regexes don't work at all.

I tried all sorts of things like intermediate columns, sourceColumn or different
sequences in the transformer attribute.  It all led to different errors.  Nothing
worked or led to any clues.

What am I doing wrong here?  This is with Solr 1.4.1.


Thanks,
Michael

Re: DIH transformer problems

Alexandre Rafalovitch
What are you actually trying to do on a business level? Maybe that's
something that can be handled better by sticking an
UpdateRequestProcessor chain _after_ DIH?

As to your configuration, you have the xxCONTENT column definition twice.
It might be working, but I think it is non-deterministic. For ilang,
you don't seem to have an xpath attribute, so I suspect it is just being
skipped altogether.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


RE: DIH transformer problems

Lemke, Michael  SZ/HZA-ZSW
On Tuesday, November 04, 2014 4:07 PM
Alexandre Rafalovitch wrote:
>
>What are you actually trying to do on a business level?

I am importing a wiki extract and the goal here is to extract the
wiki's language from the filename.

The language is also in an attribute within the imported XML,
but it has a namespace prefix.  DIH doesn't find the attribute.  I tried
both with and without the prefix.  I'd actually prefer that option.

Example:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">

Both
xpath="/mediawiki/@xml:lang"
xpath="/mediawiki/@lang"
return nothing while
xpath="/mediawiki/@version"
correctly picks up the version attribute.

>Maybe that's
>something that can be handled better by sticking an
>UpdateRequestProcessor chain _after_ DIH?

Haven't looked at that.  Is it as simple as DIH?

>
>As to your configuration, you have xxCONTENT column definition twice.
>It might be working, but I think it is non-deterministic.

In fact there are many more xxCONTENT definitions.  The idea is to
apply many unrelated regex substitutions.  That part does work.
The actual goal is to replace mediawiki's wikitext with plain
text.

>For ilang,
>you don't seem to have an xpath attribute, so I suspect it is just being
>skipped altogether.

It gets its value from the template attribute.  That part does work.
But the value is not transformed further by the regex.  Why?

There is a similar example at
https://wiki.apache.org/solr/DataImportHandler#Transformers_Example
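
For comparison, here is a minimal, untested sketch of an arrangement in which the regex reads from a separate column. It rests on two assumptions that this thread does not confirm: that DIH runs the transformers in the order they are listed in the transformer attribute (so TemplateTransformer would have to come before RegexTransformer for the regex to see the templated value), and that RegexTransformer's source-column attribute is spelled sourceColName. The column name ipath is hypothetical. Since other orderings are reported above to have produced errors, treat this only as a starting point.

```xml
<entity name="doc"
        processor="XPathEntityProcessor"
        forEach="/mediawiki/page"
        stream="true"
        url="${wf.fileAbsolutePath}"
        transformer="TemplateTransformer,RegexTransformer,HTMLStripTransformer">
  <!-- Step 1 (TemplateTransformer): copy the file path into a scratch column.
       "ipath" is a hypothetical name, not part of the original config. -->
  <field column="ipath" template="${wf.fileAbsolutePath}"/>
  <!-- Step 2 (RegexTransformer): derive the two-letter language code
       from the scratch column and store it in ilang. -->
  <field column="ilang" sourceColName="ipath" regex=".*?(..)\.xml" replaceWith="$1"/>
  <field column="HEADER" xpath="/mediawiki/page/title" required="true" stripHTML="true"/>
</entity>
```

Under this ordering any template that refers to ${doc.ilang} would be evaluated before the regex has run, so writing the regex result directly into the final column (NGLANG in the config above) may be simpler than chaining through ilang.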

Michael

Re: DIH transformer problems

Alexandre Rafalovitch
On 4 November 2014 10:42, Lemke, Michael  ST/HZA-ZSW
<[hidden email]> wrote:

> On Tuesday, November 04, 2014 4:07 PM
> Alexandre Rafalovitch wrote:
>>
>>What are you actually trying to do on a business level?
>
> I am importing a wiki extract and the goal here is to extract the
> wiki's language from the filename.
>
> The language is also in an attribute within the imported XML,
> but it has a namespace prefix.  DIH doesn't find the attribute.  I tried
> both with and without the prefix.  I'd actually prefer that option.
>
> Example:
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
>
> Both
> xpath="/mediawiki/@xml:lang"
> xpath="/mediawiki/@lang"
> return nothing while
> xpath="/mediawiki/@version"
> correctly picks up the version attribute.

DIH ignores namespaces (it does not support them), so I would expect
'xpath="/mediawiki/@lang"' to work, unless the XML parser strips that
attribute away, which is possible.

>
>>Maybe that's
>>something that can be handled better by sticking an
>>UpdateRequestProcessor chain _after_ DIH?
>
> Haven't looked at that.  Is it as simple as DIH?

Simpler :-) And you can find the full list of the processors at
http://www.solr-start.com/info/update-request-processors/
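
A rough, untested sketch of what such a chain could look like in solrconfig.xml. It assumes a Solr release that ships RegexReplaceProcessorFactory and honors the update.chain parameter (Solr 4.x or later, so not the 1.4.1 mentioned earlier), and that DIH has put the full file path into NGLANG. The chain name extract-lang is hypothetical. Two plain replacements are used instead of a $1 group reference, because how group references in the replacement are treated may depend on the Solr version.

```xml
<!-- solrconfig.xml sketch; assumes Solr 4.x or later -->
<updateRequestProcessorChain name="extract-lang"> <!-- hypothetical chain name -->
  <!-- Drop everything up to and including the last "wiki_" in the path. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">NGLANG</str>
    <str name="pattern">^.*wiki_</str>
    <str name="replacement"></str>
  </processor>
  <!-- Drop the trailing ".xml", leaving only the language code. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">NGLANG</str>
    <str name="pattern">\.xml$</str>
    <str name="replacement"></str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Route documents produced by DIH through the chain above. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">extract-lang</str>
  </lst>
</requestHandler>
```

With this in place the DIH side would only need a template field that copies ${wf.fileAbsolutePath} into NGLANG, and no RegexTransformer for that column.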

>
>>
>>As to your configuration, you have xxCONTENT column definition twice.
>>It might be working, but I think it is non-deterministic.
>
> In fact there are many more xxCONTENT definitions.  The idea is to
> apply many unrelated regex substitutions.  That part does work.
> The actual goal is to replace mediawiki's wikitext with plain
> text.

Don't know if this is helpful:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerFactory.html
That's much later in the chain, once the text is already in Solr.
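
For illustration, a minimal schema.xml sketch of a field type that uses that tokenizer, assuming a Solr version that ships WikipediaTokenizerFactory. The type name text_wiki and the field name CONTENT_WIKI are hypothetical.

```xml
<!-- schema.xml sketch: the wiki markup is handled at analysis time -->
<fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WikipediaTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that receives the raw wikitext from DIH. -->
<field name="CONTENT_WIKI" type="text_wiki" indexed="true" stored="true"/>
```

Because analysis only affects the indexed tokens, the stored value would still contain the raw wikitext, so this does not replace the regex cleanup if plain text is needed for display.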

Out of clues on everything else.

Regards,
   Alex.