dataimporter tika doesn't extract certain div


Andreas Owen
I want Tika to index only the content of <div id="content">...</div> in the field "text". Unfortunately, it is indexing the whole page. Can't XPath do this?

data-config.xml:

<dataConfig>
        <dataSource type="BinFileDataSource" name="data"/>
        <dataSource type="BinURLDataSource" name="dataUrl"/>
        <dataSource type="URLDataSource" name="main"/>
        <document>
                <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main"> <!--transformer="script:GenerateId"-->
                        <field column="title" xpath="//title" />
                        <field column="id" xpath="//id" />
                        <field column="file" xpath="//file" />
                        <field column="path" xpath="//path" />
                        <field column="url" xpath="//url" />
                        <field column="Author" xpath="//author" />

                        <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
                                <field column="text" xpath="//div[@id='content']" />
                        </entity>
                </entity>
        </document>
</dataConfig>

Re: dataimporter tika doesn't extract certain div

Shalin Shekhar Mangar
I don't know much about Tika, but in the example data-config.xml that
you posted, the "xpath" attribute on the field "text" won't work,
because the xpath attribute is used only by XPathEntityProcessor.

On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <[hidden email]> wrote:

> I want Tika to index only the content of <div id="content">...</div> in the field "text". Unfortunately, it is indexing the whole page. Can't XPath do this?

--
Regards,
Shalin Shekhar Mangar.

Re: dataimporter tika doesn't extract certain div

Andreas Owen
So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}" forEach="/div[@id='content']" dataSource="main">
        <entity name="tika" processor="TikaEntityProcessor" url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
                <field column="text" />
        </entity>
</entity>

But now I don't know how to pass the text to Tika. What do I put in url and dataSource?


On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:

> I don't know much about Tika, but in the example data-config.xml that
> you posted, the "xpath" attribute on the field "text" won't work,
> because the xpath attribute is used only by XPathEntityProcessor.

Re: dataimporter tika doesn't extract certain div

Shalin Shekhar Mangar
No, that wouldn't work. You probably need a custom
Transformer to extract the right div content. I do not know whether
TikaEntityProcessor supports such a thing.
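
For illustration, one way the custom Transformer suggested above might look is a DIH ScriptTransformer attached to the Tika entity. This is an untested sketch, not something confirmed in this thread. The function name keepContentDiv is made up, and the sketch assumes that transformers are applied to rows produced by TikaEntityProcessor, that format="html" with htmlMapper="identity" leaves the div and its id attribute in the "text" column, and that the JVM provides a JavaScript engine for ScriptTransformer.

```xml
<dataConfig>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>

  <!-- Hypothetical helper (not from the thread): keeps only the markup found
       inside the div whose id is "content", counting nested div tags. -->
  <script><![CDATA[
    function keepContentDiv(row) {
      var raw = row.get('text');
      if (raw == null) { return row; }           // nothing extracted, leave the row alone
      var html = String(raw);                    // Java String to JavaScript string

      // Locate the opening tag of the target div.
      var open = /<div\b[^>]*\sid\s*=\s*["']content["'][^>]*>/i.exec(html);
      if (open == null) { return row; }          // div not found, keep the full page

      var start = open.index + open[0].length;   // first character after the opening tag
      var tag = /<(\/?)div\b[^>]*>/gi;           // every opening or closing div tag
      tag.lastIndex = start;
      var depth = 1;
      var m;
      while ((m = tag.exec(html)) !== null) {
        if (/\/>$/.test(m[0])) { continue; }     // self-closing div, depth unchanged
        depth += (m[1] === '/') ? -1 : 1;
        if (depth === 0) {                       // reached the matching closing tag
          row.put('text', html.substring(start, m.index));
          break;
        }
      }
      return row;                                // unchanged if the div never closes
    }
  ]]></script>

  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />

      <!-- Assumption: the script transformer runs on each row Tika emits. -->
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="script:keepContentDiv">
        <field column="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

The function counts nested div tags rather than relying on a single non-greedy regex, which would stop at the first inner closing div. The field would still contain markup afterwards; stripping it would be a separate step (for example DIH's HTMLStripTransformer or an HTML-stripping char filter in schema.xml) and is not shown here.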

On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen <[hidden email]> wrote:

> So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?

--
Regards,
Shalin Shekhar Mangar.

Re: dataimporter tika doesn't extract certain div

Andreas Owen
Or could I use a filter in schema.xml, where I define a field type with some filter that understands XPath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:

> No, that wouldn't work. You probably need a custom
> Transformer to extract the right div content. I do not know whether
> TikaEntityProcessor supports such a thing.