indexing data from rich documents - Tika with solr3.1


indexing data from rich documents - Tika with solr3.1

scorpking
Hi everyone,
I have a problem with Tika and Solr. I succeeded in indexing data from various file formats (pdf, doc...) using an absolute file path. But now I have a link on the internet (e.g. http://myweb/filename.pdf) and I want to index from that link, and it does not work. I don't know why. This is my dataconfig.xml:

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin"/>
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor" url=" http://myweb/filename.pdf" format="text" dataSource="bin">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>


When I replace url=" http://myweb/filename.pdf" with an absolute file path, it works fine.
Does anyone know what is wrong?
Thanks for your help.

Re: indexing data from rich documents - Tika with solr3.1

Erik Hatcher-4
If the only thing you're doing is indexing file content, then you can bypass using the Data Import Handler altogether and use the ExtractingRequestHandler (aka Solr Cell).  And you can feed in a file from a URL using the stream.url capability, like the stream.file example here: <http://wiki.apache.org/solr/ExtractingRequestHandler#Configuration>

Something like -  http://localhost:8983/solr/update/extract?stream.url=http://myweb/filename.pdf&literal.id=filename.pdf

But to fix your existing DIH config, it looks like you should be using BinURLDataSource rather than BinFileDataSource - other than that, it looks fine.
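
A minimal sketch of that second fix, reusing the config from the original question: only the dataSource type changes, and the leading space inside the url attribute is dropped on the assumption that it was a paste artifact. The host myweb is the poster's placeholder.

```xml
<dataConfig>
    <!-- BinURLDataSource fetches binary content over HTTP(S);
         BinFileDataSource only reads files from the local disk -->
    <dataSource type="BinURLDataSource" name="bin"/>
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="http://myweb/filename.pdf" format="text" dataSource="bin">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>
```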

        Erik

(In reply to scorpking's message of Sep 9, 2011, 06:58.)


Re: indexing data from rich documents - Tika with solr3.1

scorpking
Oh, that works for me. Thank you very much, Erik Hatcher-4. I have managed to index from an https URL.

Re: indexing data from rich documents - Tika with solr3.1

scorpking
In reply to this post by Erik Hatcher-4
Hi,
Can you help me with this problem?
I have indexed data from multiple files using the Tika libraries, and I have indexed data over HTTP, but only for a single file (e.g. http://myweb/filename.pdf). Now I have files in many formats under one HTTP path (e.g. http://myweb/files/). I tried to index from that HTTP path, but it does not work. This is my data-config:

<dataConfig>
    <dataSource type="BinURLDataSource" name="bin" encoding="utf-8"/>
    <document>
        <entity name="sd" processor="FileListEntityProcessor" fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)" baseDir="http://www.lc.unsw.edu.au/onlib/pdf/"
                recursive="true" rootEntity="false" transformer="DateFormatTransformer">
            <entity name="tika-test" processor="TikaEntityProcessor" url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
            <field column="file" name="filename"/>
        </entity>
    </document>
</dataConfig>


Error:
Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: http://www.lc.unsw.edu.au/onlib/pdf/ is not a directory Processing Document # 1
        at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:124)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:69)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:552)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)

Thanks for your help.

Re: indexing data from rich documents - Tika with solr3.1

Erick Erickson
FileListEntityProcessor presupposes it's looking at files on disk. It
doesn't know anything about the web. So, as the stack trace
indicates, it tries to open a directory called http://..... and fails.

What is it you're really trying to do here? Perhaps if you explain
your higher-level problem we can provide some help.
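
One possible way to stay inside DIH, sketched here as an assumption rather than something proposed in this thread: LineEntityProcessor can read a plain-text list of URLs (one per line, exposed as the rawLine column) and hand each one to a nested TikaEntityProcessor. The list file urls.txt is hypothetical and would have to be produced separately, for example from the directory listing. The id mapping assumes the schema's uniqueKey is a field named id.

```xml
<dataConfig>
    <!-- Character data source: reads the plain-text list of URLs -->
    <dataSource type="URLDataSource" name="list" encoding="UTF-8"/>
    <!-- Binary data source: fetches each document for Tika -->
    <dataSource type="BinURLDataSource" name="bin"/>
    <document>
        <!-- urls.txt is a hypothetical file with one document URL per line -->
        <entity name="urls" processor="LineEntityProcessor"
                url="http://myweb/files/urls.txt"
                acceptLineRegex="(?i).*\.(pdf|doc)$"
                dataSource="list" rootEntity="false">
            <!-- Assumption: the schema's uniqueKey field is called "id" -->
            <field column="rawLine" name="id"/>
            <entity name="tika-test" processor="TikaEntityProcessor"
                    url="${urls.rawLine}" format="text" dataSource="bin">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```

The outer entity sets rootEntity="false" so that one Solr document is created per nested Tika entity row, mirroring the FileListEntityProcessor layout in the question.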

Best
Erick

(In reply to scorpking's message of Mon, Sep 12, 2011, 11:53 PM.)

Re: indexing data from rich documents - Tika with solr3.1

scorpking
Hi Erick Erickson,
We have files in many formats (doc, ppt, pdf, ...). The goal is to let people search the detailed educational content inside those files. I am new to Solr, so I may not understand Apache Tika in enough depth. At the moment I cannot index PDF files from an HTTP directory, although indexing a single file works. Thanks for your attention.


Re: indexing data from rich documents - Tika with solr3.1

Erik Hatcher-4
Maybe this quick script will get you running?

    <http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/>


(In reply to scorpking's message of Sep 15, 2011, 00:44.)


Re: indexing data from rich documents - Tika with solr3.1

scorpking
Hi Erik Hatcher-4,
I tried the approach from your URL, but I have a problem. In your case you knew the absolute path of the files (Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")), so you could index them. In my case I don't know an absolute file path. I only know the HTTP address where the files are (for example, see http://www.lc.unsw.edu.au/onlib/pdf/). Is there another way? Thanks


Re: indexing data from rich documents - Tika with solr3.1

Erik Hatcher-4

(In reply to scorpking's message of Sep 18, 2011, 21:52.)

Write a little script that takes the HTTP directory listing like that, and then uses stream.url (rather than stream.file as my example used).
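
Note that stream.url is only honored when remote streaming is enabled in solrconfig.xml. A fragment along these lines, modeled on the example solrconfig.xml shipped with Solr 3.x, turns it on. The upload limit shown is an illustrative value and is not taken from this thread.

```xml
<!-- solrconfig.xml (fragment) -->
<requestDispatcher handleSelect="true">
    <!-- enableRemoteStreaming must be true for stream.url and stream.file to work -->
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000"/>
</requestDispatcher>
```

Because this lets Solr fetch arbitrary URLs on request, it is usually sensible to restrict who can reach the update handlers when it is enabled.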

        Erik


Re: indexing data from rich documents - Tika with solr3.1

scorpking
Yeah, I want to use DIH, and I tried to configure my dataconfig file, but something is wrong. This is my config:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:sqlserver://ipAddress;databaseName=VTC_Edu" user="myuser" password="mypass" name="VTCEduDocument"/>
    <dataSource type="BinURLDataSource" name="dsurl"/>
    <document>
        <entity name="VTCEduDocument" pk="pk_document_id" query="select TOP 10 pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]"
                transformer="vn.vtc.solr.transformer.ImageFilter,vn.vtc.solr.transformer.RemoveHTML,RegexTransformer,TemplateTransformer,vn.vtc.solr.transformer.vntransformer,vn.vtc.solr.correctUnicodeString.correctUnicodeString,vn.vtc.solr.unescapeHtmlString.UnescapeHtmlString,vn.vtc.solr.correctISOString.correctISOString">
            <field column="pk_document_id" name="pk_document_id"/>
            <field column="s_path_origin" name="s_path_origin"/>
        </entity>
        <entity processor="TikaEntityProcessor" dataSource="dsurl" format="text" url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>


And here is the error:
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 1
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:89)
        at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
        at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
        at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: java.net.MalformedURLException: no protocol: nullselect TOP 10 pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:81)
        ... 10 more


???
Thanks
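
One plausible reading of the stack trace above: SqlEntityProcessor.initQuery is calling BinURLDataSource.getData, so the SQL entity appears to have been given the dsurl data source instead of the JDBC one, and the query string is then parsed as a URL (hence "no protocol"). The Tika entity is also a sibling of the SQL entity, so ${VTCEduDocument.s_path_origin} would have no row to resolve against. Below is a sketch that names the data source explicitly and nests the Tika entity. The custom transformer list is omitted for brevity, the entity name tika is hypothetical, and this is not confirmed as the config the poster finally used.

```xml
<dataConfig>
    <dataSource type="JdbcDataSource" name="VTCEduDocument"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://ipAddress;databaseName=VTC_Edu"
                user="myuser" password="mypass"/>
    <dataSource type="BinURLDataSource" name="dsurl"/>
    <document>
        <!-- dataSource is named explicitly so the SQL query goes to JDBC, not to dsurl -->
        <entity name="VTCEduDocument" dataSource="VTCEduDocument" pk="pk_document_id"
                query="select TOP 10 pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]">
            <field column="pk_document_id" name="pk_document_id"/>
            <field column="s_path_origin" name="s_path_origin"/>
            <!-- Nested, so ${VTCEduDocument.s_path_origin} is resolved for each SQL row -->
            <entity name="tika" processor="TikaEntityProcessor" dataSource="dsurl" format="text"
                    url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```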

Re: indexing data from rich documents - Tika with solr3.1

scorpking
Hi all, thanks very much to everyone who helped me. I have now indexed from HTTP using DIH.