Tika HTTP 400 Errors with DIH

Teague James
Hi all,

I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
field. In the DIH Tika uses that field to fetch and parse the documents. The
URL from the field is valid and will download the document in the browser
just fine. But Tika is getting HTTP response code 400. Any ideas why?

ERROR
BinURLDataSource
java.io.IOException: Server returned HTTP response code: 400 for URL:

EntityProcessorWrapper
Exception in entity : tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url

DIH
<dataConfig>
    <!-- ds-1: JDBC connection to SQL Server via the jTDS driver -->
    <dataSource type="JdbcDataSource"
                name="ds-1"
                driver="net.sourceforge.jtds.jdbc.Driver"
                url="jdbc:jtds:sqlserver://1.2.3.4/database;instance=INSTANCE;user=USER;password=PASSWORD" />

    <!-- ds-2: fetches the binary document behind each URL -->
    <dataSource type="BinURLDataSource" name="ds-2" />

    <document>
        <entity name="db_content" dataSource="ds-1"
                transformer="ClobTransformer, RegexTransformer"
                query="SELECT ContentID,
                              DownloadURL
                       FROM DATABASE.VIEW">
            <field column="ContentID" name="id" />
            <field column="DownloadURL" clob="true" name="DownloadURL" />

            <!-- Tika downloads and parses the document at DownloadURL -->
            <entity name="tika_content" processor="TikaEntityProcessor"
                    url="${db_content.DownloadURL}"
                    onError="continue" dataSource="ds-2">
                <field column="TikaParsedContent" />
            </entity>
        </entity>
    </document>
</dataConfig>

SCHEMA - Fields
<field name="DownloadURL" type="string" indexed="true" stored="true" />
<field name="TikaParsedContent" type="text_general" indexed="true" stored="true" multiValued="true"/>




Re: Tika HTTP 400 Errors with DIH

Alexandre Rafalovitch
On 2 December 2014 at 13:19, Teague James <[hidden email]> wrote:
> clob="true"

What is ClobTransformer doing on the DownloadURL field? Is it
possible that it is corrupting the value somehow?

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
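
One way to rule out a value that picks up stray whitespace on its way out of the CLOB is to trim it in the same entity before Tika sees it. The fragment below is only a sketch under that assumption: it reuses the db_content entity and the DownloadURL column from the original post, relies on RegexTransformer's regex and replaceWith attributes, and has not been run against this database. It does not percent-encode spaces inside the URL, which a browser does automatically, so that is worth checking separately.

```xml
<!-- Sketch only: trims leading/trailing whitespace from DownloadURL.
     ClobTransformer runs first (CLOB to String), then RegexTransformer. -->
<entity name="db_content" dataSource="ds-1"
        transformer="ClobTransformer, RegexTransformer"
        query="SELECT ContentID, DownloadURL FROM DATABASE.VIEW">
    <field column="ContentID" name="id" />
    <!-- regex + replaceWith: every match is replaced with the empty string -->
    <field column="DownloadURL" clob="true" name="DownloadURL"
           regex="^\s+|\s+$" replaceWith="" />

    <entity name="tika_content" processor="TikaEntityProcessor"
            url="${db_content.DownloadURL}"
            onError="continue" dataSource="ds-2">
        <field column="TikaParsedContent" />
    </entity>
</entity>
```

If the 400 persists with the trimmed value, whitespace around the URL is unlikely to be the cause and the actual request still needs to be inspected.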

RE: Tika HTTP 400 Errors with DIH

Teague James
The database stores the URL as a CLOB. Querying Solr shows that the field value is "http://www.someaddress.com/documents/document1.docx".
The URL works if I copy and paste it to the browser, but Tika gets a 400 error.

Any ideas?

Thanks!
-Teague
-----Original Message-----
From: Alexandre Rafalovitch [mailto:[hidden email]]
Sent: Tuesday, December 02, 2014 1:45 PM
To: solr-user
Subject: Re: Tika HTTP 400 Errors with DIH

On 2 December 2014 at 13:19, Teague James <[hidden email]> wrote:
> clob="true"

What is ClobTransformer doing on the DownloadURL field? Is it possible that it is corrupting the value somehow?


Re: Tika HTTP 400 Errors with DIH

Alexandre Rafalovitch
400 error means something wrong on the server (resource not found).
So, it would be useful to see what URL is actually being requested.

Can you run some sort of network tracer (dtrace, Wireshark, etc.) to
see the actual network request? That will cut the problem in half
for you.

Regards,
   Alex.


On 4 December 2014 at 09:42, Teague James <[hidden email]> wrote:

> The database stores the URL as a CLOB. Querying Solr shows that the field value is "http://www.someaddress.com/documents/document1.docx"
> The URL works if I copy and paste it to the browser, but Tika gets a 400 error.

Re: Tika HTTP 400 Errors with DIH

Walter Underwood
No, 400 should mean that the request was bad. When the server fails, that is a 500.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/


On Dec 4, 2014, at 8:43 AM, Alexandre Rafalovitch <[hidden email]> wrote:

> 400 error means something wrong on the server (resource not found).
> So, it would be useful to see what URL is actually being requested.

Re: Tika HTTP 400 Errors with DIH

Alexandre Rafalovitch
Right. Resource not found (on server).

The end result is the same. If it works in the browser but not from
the application, then either a different URL is being requested or,
somehow, the request is not even going to the same server.

The solution (watching network traffic) is still the same, right?

Regards,
   Alex.


On 4 December 2014 at 11:51, Walter Underwood <[hidden email]> wrote:

> No, 400 should mean that the request was bad. When the server fails, that is a 500.

RE: Tika HTTP 400 Errors with DIH

Teague James
Alex,

Your suggestion might be a solution, but the issue isn't that the resource isn't found. As Walter said, 400 is a "bad request", which makes me wonder: what is DIH/Tika doing when it tries to access the documents? What is the "request" that is bad? Is there any other way to suss this out? Placing a network monitor in this case would be on the extreme end of difficult.

I know that the URL stored is good and that the resource exists by copying it out of a Solr query and pasting it into the browser, so that eliminates 404 and 500 errors. Is the format of the URL correct? Is there some other setting I've missed?

I appreciate the suggestions!

-Teague


-----Original Message-----
From: Alexandre Rafalovitch [mailto:[hidden email]]
Sent: Thursday, December 04, 2014 12:22 PM
To: solr-user
Subject: Re: Tika HTTP 400 Errors with DIH

Right. Resource not found (on server).

The end result is the same. If it works in the browser but not from the application, then either a different URL is being requested or, somehow, the request is not even going to the same server.

The solution (watching network traffic) is still the same, right?


RE: Tika HTTP 400 Errors with DIH

steve-4
A good HTTP debugger would likely help (Wireshark or Fiddler2, for example):
http://www.telerik.com/fiddler
https://www.wireshark.org/download.html
For example, it could show the HTTP headers that the "client" uses to request info from an API, and then show the results of that query. One small caveat: I have not tried this with a "standalone" server or with any Solr-type project.
Cheers!
Steve

> From: [hidden email]
> To: [hidden email]
> Subject: RE: Tika HTTP 400 Errors with DIH
> Date: Fri, 5 Dec 2014 12:03:23 -0500
>
> Alex,
>
> Your suggestion might be a solution, but the issue isn't that the resource isn't found. Like Walter said 400 is a "bad request" which makes me wonder, what is the DIH/Tika doing when trying to access the documents? What is the "request" that is bad? Is there any other way to suss this out? Placing a network monitor in this case would be on the extreme end of difficult.

Re: Tika HTTP 400 Errors with DIH

Dan Davis-2
In reply to this post by Teague James
I would say that you could determine a row that gives a bad URL, and then
run it in the DIH admin interface (or from the command line) with "debug"
enabled. The url parameter going into Tika should be visible in its
transformed form before the next entity gets going. This works in a similar
scenario for me.
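
The value handed to Tika can also be written to the Solr log. This is a minimal sketch, assuming the entity names from the original post and DIH's LogTransformer with its logTemplate and logLevel entity attributes; the square brackets are just an illustrative way to make leading or trailing whitespace visible, and the fragment has not been run against this setup.

```xml
<!-- Sketch only: LogTransformer is listed last so it logs the value
     after ClobTransformer and RegexTransformer have run. -->
<entity name="db_content" dataSource="ds-1"
        transformer="ClobTransformer, RegexTransformer, LogTransformer"
        logTemplate="DownloadURL=[${db_content.DownloadURL}]"
        logLevel="info"
        query="SELECT ContentID, DownloadURL FROM DATABASE.VIEW">
    <field column="ContentID" name="id" />
    <field column="DownloadURL" clob="true" name="DownloadURL" />
    <!-- nested tika_content entity unchanged from the original config -->
</entity>
```

Comparing the logged value character by character with the URL that works in the browser shows whether the two requests really target the same address.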

On Tue, Dec 2, 2014 at 1:19 PM, Teague James <[hidden email]>
wrote:

> Hi all,
>
> I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
> field. In the DIH Tika uses that field to fetch and parse the documents. The
> URL from the field is valid and will download the document in the browser
> just fine. But Tika is getting HTTP response code 400. Any ideas why?