Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Tim Allison
All,

  Over the weekend, Shawn Heisey very kindly drafted a wikipage about the
challenges of using Solr's ExtractingRequestHandler and the guidance to
avoid it in production.

   I completely agree with this point, and I think that Shawn did a very
nice job of capturing some of the challenges.  If you have any feedback or
would like to make edits, see:

https://wiki.apache.org/solr/RecommendCustomIndexingWithTika

   Cheers,

                 Tim

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

kkrugler
Thanks for the ref, Tim.

I’m curious why SolrCell doesn’t fire up threads when parsing docs with Tika (or use the fork parser), to mitigate issues with hangs & crashes?

— Ken

> On May 29, 2018, at 11:54 AM, Tim Allison <[hidden email]> wrote:
>
> All,
>
>  Over the weekend, Shawn Heisey very kindly drafted a wikipage about the
> challenges of using Solr's ExtractingRequestHandler and the guidance to
> avoid it in production.
>
>   I completely agree with this point, and I think that Shawn did a very
> nice job of capturing some of the challenges.  If you have any feedback or
> would like to make edits, see:
>
> https://wiki.apache.org/solr/RecommendCustomIndexingWithTika
>
>   Cheers,
>
>                 Tim

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378


Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Luís Filipe Nassif
Hi Ken,

Threads will not help with OutOfMemoryErrors or crashes caused by native
libs. ForkParser can help, after the refactoring started by Tim to handle
some of its limitations. See TIKA-2653

2018-05-29 16:11 GMT-03:00 Ken Krugler <[hidden email]>:

> Thanks for the ref, Tim.
>
> I’m curious why SolrCell doesn’t fire up threads when parsing docs with
> Tika (or use the fork parser), to mitigate issues with hangs & crashes?
>
> — Ken

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Dave Fisher
In reply to this post by kkrugler
Having run a Solr service, I can say that you strive for quick query responses and want to avoid anything that can pause the JVM. You work hard to keep your updates quick and NRT. Text extraction from XML-based documents like Office files and from big object files like PDFs is memory intensive and should be sandboxed on separate VMs.

Regards,
Dave

> On May 29, 2018, at 12:11 PM, Ken Krugler <[hidden email]> wrote:
>
> Thanks for the ref, Tim.
>
> I’m curious why SolrCell doesn’t fire up threads when parsing docs with Tika (or use the fork parser), to mitigate issues with hangs & crashes?
>
> — Ken

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Luís Filipe Nassif
In reply to this post by Luís Filipe Nassif
Related to this, do we have any guidance to help Java users choose
between ForkParser and TikaServer?

2018-05-29 16:18 GMT-03:00 Luís Filipe Nassif <[hidden email]>:

> Hi Ken,
>
> Threads will not help with OutOfMemoryErrors or crashes caused by native
> libs. ForkParser can help, after the refactoring started by Tim to handle
> some of its limitations. See TIKA-2653

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Tim Allison
In reply to this post by Luís Filipe Nassif
Y, my mods to the ForkParser should make it more robust, and will help with
OOMs, permanent hangs and native lib crashing.  But those changes are still
in the works...

On Tue, May 29, 2018 at 3:18 PM, Luís Filipe Nassif <[hidden email]>
wrote:

> Hi Ken,
>
> Threads will not help with OutOfMemoryErrors or crashes caused by native
> libs. ForkParser can help, after the refactoring started by Tim to handle
> some of its limitations. See TIKA-2653

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

kkrugler
In reply to this post by Luís Filipe Nassif
Hi Luis,

Yes, threads can only mitigate (not solve) some (not all) of the problems.

But they do help with hangs, which is why we use them when parsing docs with Tika in Hadoop jobs.

That is also why I was curious why ERH (ExtractingRequestHandler) didn't take advantage of that approach; it's better than not using threads at all.

— Ken
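
[Editor's sketch] The thread-based mitigation described above can be illustrated with a time-boxed worker thread. This is a minimal sketch using only the JDK, not code from SolrCell or from the Hadoop jobs mentioned; the class name TimeBoxedParse, the method parseWithTimeout, and the Callable standing in for a real Tika parse call are all hypothetical. As noted earlier in the thread, this only helps with hangs. It does not protect the JVM from OutOfMemoryErrors or native library crashes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeBoxedParse {

    /**
     * Runs parseTask on a worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if the task timed out or threw.
     */
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis)
            throws InterruptedException {
        // Daemon thread: a parser stuck in a non-interruptible loop
        // cannot keep the JVM from exiting.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<String> result = executor.submit(parseTask);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a parser that ignores interrupts keeps
            // running in the background, which is why this is only a mitigation.
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The parse itself failed; the caller treats the document as unparseable.
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-ins for a real Tika parse call.
        String ok = parseWithTimeout(() -> "extracted text", 1000);
        String hung = parseWithTimeout(() -> {
            Thread.sleep(60_000); // simulates a hang
            return "never returned";
        }, 200);
        System.out.println(ok);   // prints "extracted text"
        System.out.println(hung); // prints "null"
    }
}
```

A thread that ignores the interrupt still holds its memory and CPU, so a long-running service would also want to count abandoned workers and restart once too many accumulate.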

> On May 29, 2018, at 12:18 PM, Luís Filipe Nassif <[hidden email]> wrote:
>
> Hi Ken,
>
> Threads will not help with OutOfMemoryErrors or crashes caused by native
> libs. ForkParser can help, after the refactoring started by Tim to handle
> some of its limitations. See TIKA-2653

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra


Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Tim Allison
In reply to this post by Tim Allison
1: CORRECTION: the ForkParser by itself (without my mods) will protect
against OOMs, permanent hangs, and native lib crashes.  My proposed mods (on
TIKA-2653) only move the parser dependencies out of Solr's dependencies.

2: NOTE: there is also a discussion on where to place this information.
Cassandra Targett advocates putting this guidance in the main users' guide.
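
[Editor's sketch] The protection described in point 1 comes from parsing in a separate child process: if the child runs out of memory, hangs, or crashes in native code, the parent only sees a timeout or a bad exit status. The sketch below shows that general idea with the JDK's ProcessBuilder. It is not ForkParser's API or implementation; the class IsolatedRun, the Outcome enum, and the use of `java -version` as a stand-in for a real parsing child are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class IsolatedRun {

    /** Outcome of one child-process run, as seen by the parent. */
    enum Outcome { OK, FAILED, TIMED_OUT }

    static Outcome run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard child output so a full pipe buffer cannot block the child.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process child = pb.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // A hung child is killed; the parent JVM keeps running.
            child.destroyForcibly();
            child.waitFor();
            return Outcome.TIMED_OUT;
        }
        // A non-zero exit status covers an OOM or a native crash in the child.
        return child.exitValue() == 0 ? Outcome.OK : Outcome.FAILED;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in child: the same JVM binary running "-version".
        // A real setup would launch a parsing main class with its own -Xmx.
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        System.out.println(run(List.of(java, "-version"), 30));
    }
}
```

The price of this isolation is process startup and inter-process communication, which is one reason to keep a pool of children alive rather than forking per document.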

On Tue, May 29, 2018 at 3:22 PM, Tim Allison <[hidden email]> wrote:

> Y, my mods to the ForkParser should make it more robust, and will help
> with OOMs, permanent hangs and native lib crashing.  But those changes are
> still in the works...

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Tim Allison
Ken,
  Once TIKA-2653 is done and 1.19(?) is released, I'll propose switching
ERH to the ForkParser.  There's also an open ticket for using tika-server.
I think users should have both options.

On Tue, May 29, 2018 at 3:25 PM, Tim Allison <[hidden email]> wrote:

> 1: CORRECTION: the ForkParser by itself (without my mods) will protect
> against ooms, permanent hangs, and native lib crashing.  My proposed mods (on
> TIKA-2653) only move the parser dependencies out of Solr's dependencies.
>
> 2: note: Also, note the discussion on where to place this information.
> Cassandra Targett advocates putting this guidance in the main users' guide.

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Luís Filipe Nassif
Hi Tim,

Could you clarify the pros and cons of ForkParser (after your
refactoring) versus TikaServer? Maybe we should send those to the users list
and put them on the wiki...

Thanks

2018-05-29 16:27 GMT-03:00 Tim Allison <[hidden email]>:

> Ken,
>   Once TIKA-2653 is done and 1.19(?) is released, I'll propose switching
> ERH to the ForkParser.  There's also an open ticket for using tika-server.
> I think users should have both options.

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

Tim Allison
Y.  Will do.  I'll be interested to compare the performance.  One of the
obvious pros of tika-server is that you can move Tika off your vm and m.
:)  The downside is that you have to manage it and open ports, which is
simple in many applications and impossible in others.

Once my refactoring is done, we should offer the use of the ForkParser
within tika-server because tika-server is vulnerable to permanent
hangs/oom...
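
[Editor's sketch] For readers weighing the tika-server option, a client can be as small as one HTTP PUT to the server's /tika endpoint with an Accept header of text/plain. The sketch below uses the JDK 11+ HttpClient. The class and method names are hypothetical, and the base URL assumes a tika-server instance on localhost at its default port 9998; the timeouts are arbitrary example values.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.time.Duration;

final class TikaServerClient {

    /** Builds a PUT request to the /tika endpoint that asks for plain text back. */
    static HttpRequest buildRequest(URI serverBase, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                // Request timeout, since the server can hang on a bad file.
                .timeout(Duration.ofSeconds(60))
                .header("Accept", "text/plain")
                .PUT(body)
                .build();
    }

    /** Sends the file to tika-server and returns the extracted text. */
    static String extractText(URI serverBase, Path file) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpResponse<String> response = client.send(
                buildRequest(serverBase, HttpRequest.BodyPublishers.ofFile(file)),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException(
                    "tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: tika-server is already running on localhost:9998
        // and the file to parse is given as the first argument.
        URI server = URI.create("http://localhost:9998");
        System.out.println(extractText(server, Path.of(args[0])));
    }
}
```

The client-side timeout only protects the caller. As noted above, the server process itself still needs to be watched and restarted if it hangs or runs out of memory.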

On Tue, May 29, 2018 at 3:35 PM, Luís Filipe Nassif <[hidden email]>
wrote:

> Hi Tim,
>
> Could you clarify the pros and cons between ForkParser (after your
> refactoring) and TikaServer? Maybe we should send those to users list and
> wiki...
>
> Thanks