Specialized Solr Application


Terry Steichen
I have from time to time posted questions to this list (and received
very prompt and helpful responses).  But it seems that many of you are
operating in a very different space from me.  The problems (and
lessons learned) I encounter are often very different from those
reflected in exchanges with most other participants.

So I thought it would be useful to describe what I'm about, and see if
there are others out there with similar implementations (or interest in
moving in that direction).  A sort of pay-forward.

We (the Lakota Peoples Law Office) are a small public interest, pro bono
law firm actively engaged in defending Native American North Dakota
Water Protector clients against (ridiculously excessive) criminal charges. 

I have a small Solr (6.6.0) implementation - just one shard.  I'm using
the cloud mode mainly to be able to implement access controls.  The
server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 8GB of
RAM and 4 CPU cores.  We presently have 8 collections with a total
of about 60,000 documents, mostly PDFs and emails.  The indexed
documents are partly our own files and partly those we obtain through
legal discovery (which, surprisingly, is allowed in ND for criminal
cases).  We only have a few users (mostly our lawyers and a couple of
researchers), so traffic is minimal.  However, there's a premium
on precision (and recall) in searches.

The document repository is local to the server.  I piggyback on the
embedded Jetty httpd in order to serve files (selected from the
hitlists).  I just use a symbolic link to tie the repository to
Solr/Jetty's "webapp" subdirectory.

We provide remote access via ssh with port forwarding.  This gives very
snappy performance over fully encrypted links, and it appears quite stable.

I've had some bizarre behavior apparently caused by an interaction
between repository permissions, Solr permissions and the ssh link.  It
seems "solved" for the moment, but time will tell for how long.

If there are any folks out there who have similar requirements, I'd be
more than happy to share the insights I've gained and the problems I've
encountered and (I think) overcome.  There are so many unique parts of
this small-scale, specialized application (many dimensions of which are
not strictly internal to Solr) that it probably wouldn't be appreciated
if I dumped them all on this (excellent) Solr list.  So, if you encounter
problems peculiar to this kind of setup, we can perhaps help handle them
off-list (although if they have more general Solr applicability, we
should, of course, post them to the list).

Terry Steichen


Re: Specialized Solr Application

Charlie Hull-3

Hi Terry,

Sounds like a fascinating use case. We have some similar clients -
small-scale law firms and publishers - who have taken advantage of Solr.

One thing I would encourage you to do is to blog and/or talk about what
you've built. Lucene Revolution is worth applying to talk at and if you
do manage to get accepted - or if you go anyway - you'll meet lots of
others with similar challenges and come away with a huge amount of
useful information and contacts. Otherwise there are lots of smaller
Meetup events (we run the London, UK one).

Don't assume just because some people here are describing their 350
billion document learning-to-rank clustered monster that the small
applications don't matter - they really do, and the fact that they're
possible to build at all is a testament to the open source model and how
we share information and tips.

Cheers

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

RE: Specialized Solr Application

Allison, Timothy B.
+1 to Charlie's guidance.

And...

>60,000 documents, mostly pdfs and emails.
> However, there's a premium on precision (and recall) in searches.

Please, oh, please, no matter what you're using for content/text extraction and/or OCR, run tika-eval[1] on the output to ensure that you are getting mostly language-y content out of your documents.  Ping us on the Tika users list if you have any questions.

Bad text, bad search. 😊

[1] https://wiki.apache.org/tika/TikaEval



Re: Specialized Solr Application

Terry Steichen
Hi Timothy,

As I understand it, Tika is integrated with Solr.  All my indexed
documents declare that they've been parsed by Tika.  For the .eml files
it's org.apache.tika.parser.mail.RFC822Parser; Word docs show they were
parsed by org.apache.tika.parser.microsoft.ooxml.OOXMLParser; and PDF
files show org.apache.tika.parser.pdf.PDFParser.

What do you mean by improving the output with "tika-eval"?  I confess I
don't completely understand how documents should be prepared for
indexing.  But with the .eml docs, Solr/Tika seems to properly pull out
things like date, subject, to and from.  For other (so-called "rich text")
documents (like PDFs and Word files), the metadata is not so useful, but
on the other hand, there's not much consistent structure to the
documents I have to deal with.

I may be missing something - am I?

Regards,

Terry




Re: Specialized Solr Application

Erick Erickson
Terry:

Tika has a horrible problem to deal with and it's approaching a
miracle that it does so well ;)

Let's take a PDF file. Which vendor's version? From what _decade_? Did
that vendor adhere
to the spec? Every spec has gray areas so even good-faith efforts can
result in some version/vendor
behaving slightly differently from the other.

And what about Word vs. PDF? One might have "last_modified" and the
other might have "last_edited" to mean the same thing. You mentioned
that you're aware of this; you can make it more useful if you have
finer-grained control over the ETL process.

You say "As I understand it, Tika is integrated with Solr"  which is
correct, you're talking about
the "Extracting Request Handler". However that has a couple of
important caveats:

1> It does the best it can. But Tika has a _lot_ of tuning options
that allow you to get down-and-dirty
with the data you're indexing. You mentioned that precision is
important. You can do some interesting
things with extracting specific fields from specific kinds of
documents and making use of them. The
"last_modified" and "last_edited" fields above are an example.

2> It loads the work on a single Solr node. So the very expensive
process of extracting data from the semi-structured documents is all on
the Solr node. If you use Tika in a client-side program you can
parallelize the extraction and get through your indexing much more quickly.

3> Tika can occasionally get its knickers in a knot over some
particular document. That'll also bring
down the Solr instance.

Here's a blog that can get you started doing client-side parsing,
ignore the RDBMS bits.
https://lucidworks.com/2012/02/14/indexing-with-solrj/
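
For anyone who wants to try that client-side route, below is a minimal
sketch of the idea (not Terry's setup): it assumes SolrJ 6.x and Tika are
on the classpath, and the collection name "docs" and the field names are
made up for illustration - adjust to your own schema.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ClientSideIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs").build()) {
            AutoDetectParser parser = new AutoDetectParser();
            for (String arg : args) {
                Path path = Paths.get(arg);
                // -1 removes the default 100k character write limit
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata meta = new Metadata();
                try (InputStream in = Files.newInputStream(path)) {
                    parser.parse(in, handler, meta, new ParseContext());
                } catch (Exception e) {
                    // A document Tika chokes on only skips itself;
                    // it can't take the Solr node down with it.
                    System.err.println("Failed to parse " + path + ": " + e);
                    continue;
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", path.toAbsolutePath().toString());
                doc.addField("content_type", meta.get(Metadata.CONTENT_TYPE));
                // Normalize vendor-specific metadata to one field name here.
                String modified = meta.get(TikaCoreProperties.MODIFIED);
                if (modified != null) {
                    doc.addField("last_modified", modified);
                }
                doc.addField("text", handler.toString());
                solr.add(doc);
            }
            solr.commit();
        }
    }
}

Running the per-file loop in a thread pool is what buys you the parallel
extraction mentioned in point 2> above.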

I'll leave Tim to talk about tika-eval ;) But the general problem is
that the extraction process can
result in garbage, lots of garbage. OCR is particularly prone to
nonsense. PDFs can be tricky,
there's a spacing parameter that, depending on its setting, can
render "e r i c k" as five separate letters or as my name.

Hey, you asked! Don't complain about long answers ;)

Best,
Erick


Re: Specialized Solr Application

Terry Steichen
Thanks, Erick.  What I don't understand is this: "rich text" documents
(e.g., PDF and DOC) lack any internal structure (unlike JSON, XML, etc.),
so there's not much potential in trying to get really precise in parsing
them.  Or am I overlooking something here?

And, as you say, the metadata of such documents is somewhat variable
(some PDFs have a field and others don't), which suggests that you may
not want the parser to be too rigid.

Moreover, as I noted earlier, most of the metadata fields of such
documents seem to be of little value (since many document authors are
not consistent in creating that information). 

I take your point about non-optimum Tika workload distribution - but I
am only occasionally doing indexing so I don't think that would be a
significant factor (for me, at least).

A point of possible interest: I was recently indexing a set of about
13,000 documents and at one point, a document caused Solr to crash.  I
had to restart it.  I removed the offending document, and restarted the
indexing.  It eventually happened again, so I did the same thing.
It then completed indexing successfully.  IOW, out of 13,000 documents
there were two that caused a crash, but once they were removed, the
other 12,998 were parsed/indexed fine.

On OCRs, I presume you're referring to PDFs that are images?  Part of
our team uses Acrobat Pro to screen and convert such documents (which
are very common in legal circles) so they can be searched.  Or did you
mean something else?

Thanks for the insights.  And the long answers (from you, Tim and
Charlie).  These are helping me (and I hope others on the list) to
better understand some of the nuances of effectively implementing
(small-scale) Solr.




Re: Specialized Solr Application

Erick Erickson
Terry:

If your process works, then it works and there's no real reason to change.

I was commingling the structure of the content with the metadata. You're
right that the content doesn't really have any useful structure. Sometimes
you can get some useful information out of the metadata, particularly
metadata that doesn't require a user action (last_modified and the like,
sometimes).

Whether that effort is worth it in your use-case is, of course, a valid
question.....

bq: On OCRs, I presume you're referring to PDFs that are images?

No, I was referring to scanned images. I once had to try to index
a document (I wouldn't lie to you) that was a scanned image of
a "family tree" where the most remote ancestor was written
vertically on the trunk, and each branch had a descendant
written at various angles. The resulting scanned image
was run through an OCR program that produced... well, let's
just say, little of value ;)

Best,
Erick


RE: Specialized Solr Application

Allison, Timothy B.
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during content extraction.[1]  I had two big concerns when I heard of your task:



1) image-only PDFs, which can parse without problems but which might yield 0 content.

2) emails (see, e.g. SOLR-12048)



It sounds like you're taking care of 1), and 2) doesn't apply because you're using Tika (although note that we've made some major changes to our RFC822 parsing in the upcoming Tika 1.18).  So, no need to read further! 😊



In general, surprising things can happen during the content extraction phase, and unless you are monitoring/measuring/evaluating what's extracted, your search system can yield results that are downright dangerous if you assume that the full stack is actually working.



I worked with one batch of documents where HALF of the Excel files weren't being parsed.  They all had the same quirk, which caused an exception in POI; and because they were inside zip files, and Tika's legacy/default behavior is to silently ignore embedded exceptions, the owners of the search system had _no idea_ that they'd never be able to find those documents.  At one point, Tika wasn't extracting sdt form fields in docx or form fields in pdf...at all...imagine if your document set was a bunch of docx files with sdts or PDFs with form fields...  We just fixed a bug to pull text from joined shapes in ppt...we've been missing that text for years!



Those are a few horror stories, I have many, and there are countless more yet to be discovered!



The goal of tika-eval[2] is to allow you to see if things don't look right based on your expectations.[3]  It doesn't help with indexing at all per se, but it can allow you to see odd things and 1) change your processing pipeline (add OCR where necessary or use an alternate parser for some file formats) or 2) raise an issue to fix bugs in the content extraction libraries, or at least 3) recognize that you aren't getting reliable content out of ~x% of your documents.  If manually checking PDFs to determine whether or not to run OCR is a hassle, run tika-eval and identify those docs that have a low word count/page ratio.
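
As a toy illustration only (this is not tika-eval's code, just the flavor of the per-document statistics it reports), a small check along these lines flags extracts that don't look much like language: few tokens, a low share of purely alphabetic tokens, or a tiny average token length.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ExtractSanityCheck {
    // args[0]: a directory of plain-text extracts, e.g. produced by
    // java -jar tika-app.jar -t -i <input_dir> -o <output_dir>
    public static void main(String[] args) throws IOException {
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            paths.filter(p -> p.toString().endsWith(".txt"))
                 .forEach(ExtractSanityCheck::check);
        }
    }

    static void check(Path p) {
        try {
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            String[] tokens = text.trim().split("\\s+");
            long alpha = 0;
            long totalLen = 0;
            for (String t : tokens) {
                totalLen += t.length();
                if (!t.isEmpty() && t.chars().allMatch(Character::isLetter)) {
                    alpha++;
                }
            }
            double alphaRatio = (double) alpha / tokens.length;
            double avgLen = (double) totalLen / tokens.length;
            // Thresholds are arbitrary; tune them against documents you trust.
            if (tokens.length < 20 || alphaRatio < 0.5 || avgLen < 2.5) {
                System.out.printf("SUSPECT %s tokens=%d alphaRatio=%.2f avgLen=%.1f%n",
                        p, tokens.length, alphaRatio, avgLen);
            }
        } catch (IOException e) {
            System.err.println("Could not read " + p + ": " + e);
        }
    }
}

tika-eval does far more (language id, out-of-vocabulary ratios, exception counts), but that's the general shape of the check.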



A couple of handfuls of Welsh documents; I thought we only had English...what?!  No, that's just bad content extraction (character mapping failure in the PDF or other mojibake).  Average token length in this document is 1, and it is supposed to be English...what?  No, that's the spacing problem that Erick mentioned.  Average words per page in some PDFs = 2?  No, that's an image-only PDF...that needs to go through OCR.  Ratio of out-of-vocabulary words = 90%?  No, that's character encoding mojibake.





> I was recently indexing a set of about 13,000 documents and at one
> point, a document caused solr to crash.  I had to restart it.  I removed
> the offending document, and restarted the indexing.  It then eventually
> happened again, so I did the same thing.



Crash, crash like OOM?  If you're able to share that with Tika or PDFBox, we can _try_ to fix the underlying bug if there is one.  Sometimes, though, our parsers require far more memory than is ideal. 😐



If you have questions about tika-eval, please ask over on the Tika list.  Apologies for too many words.  Thank you, all, for this discussion!



Cheers,



           Tim





P.S. On metadata author vs. creator, for a good while, we've been trying to standardize to Dublin core -- dc:creator.  If you see areas for improvement, let us know.



[1] https://www.slideshare.net/TimAllison6/haystack-2018-apachetikaevaltallison

[2] https://wiki.apache.org/tika/TikaEval

[3] Obviously, without ground truth, there is no automated way to detect the sdt/form field/grouped text box problems, but tika-eval does what it can to identify and count:

a) catastrophic problems (oom, permanent hang)

b) catchable exceptions

c) corrupted text

d) nearly entirely missing text





Re: Specialized Solr Application

Terry Steichen
Thanks, Tim.  Some quick comments and questions:

    1) The toughest PDFs to identify are those that are partly
    searchable (text) and partly not (image-based text).  However, I've
    found that such documents tend to exist in clusters.

    2) Email documents (.eml) are no problem, provided the -filetypes
    eml option is included in the indexing command.  Otherwise the indexing
    is not recursive for them and you'll completely (and silently) miss all
    such documents in lower subdirectories.

    3) I have indexed other repositories and noticed some silent
    failures (mostly for large .doc documents).  I wish there were some way
    to log these errors so it would be obvious which documents have been
    excluded.

    4) I still don't understand the use of tika-eval - is that an
    application that you run against a collection, or what?

    5) I've seen reference to tika-server - but I have no idea how
    that tool might be usefully applied.

    6) Adobe Acrobat Pro apparently has a batch mode suitable for
    flagging unsearchable (that is, image-based) pdf files and fixing them.

    7) Another problem I've encountered is documents that are themselves
    a composite of other documents (like an email thread).  The problem
    is that a hit on such a document doesn't tell you much about the
    true relevance of each contained document.  You have to do a
    laborious manual search to figure it out.

    8) Is there a way to return the size of a matching document (which,
    I think, would help identify non-searchable/image documents)?

Regards,

Terry






RE: Specialized Solr Application

Allison, Timothy B.
>    1) the toughest pdfs to identify are those that are partly
>    searchable (text) and partly not (image-based text).  However, I've
>    found that such documents tend to exist in clusters.

Agreed.  We should do something better in Tika to identify image-only pages on a page-by-page basis, and then ship those with very little text to tesseract.  We don't currently do this.

>    3) I have indexed other repositories and noticed some silent
>    failures (mostly for large .doc documents).  Wish there was some way
>    to log these errors so it would be obvious what documents have been
>    excluded.

Agreed on the Solr side.  You can run `java -jar tika-app.jar -J -t -i <input_dir> -o <output_dir>` and then tika-eval on the <output_dir> to count exceptions, even exceptions in embedded documents, which are now silently ignored. ☹

>    4) I still don't understand the use of tika-eval - is that an
>    application that you run against a collection or what?

Currently, it is set up to run against a directory of extracts (text+metadata extracted from pdfs/word/etc).  It will give you info about # of exceptions, lang id, and some other statistics that can help you get a sense of how well content extraction worked.  It wouldn't take much to add an adapter that would have it run against Solr to run the same content statistics.

>    5) I've seen reference to tika-server - but I have no idea on how
>    that tool might be usefully applied.

We have to harden it, but the benefit is that you isolate the Tika process in its own JVM so that it can't harm Solr.  By harden, I mean we need to spawn a child process and set a parent process that will kill and restart on OOM or permanent hang.  We don't have that yet.  Tika very rarely runs into serious, show-stopping problems (kill -9 just might solve your problem).  If you only have a few tens of thousands of docs, you aren't likely to run into these problems.  If you're processing a few million, especially noisy things that come off the internet, you're more likely to run into these kinds of problems.

>    6) Adobe Acrobat Pro apparently has a batch mode suitable for
>    flagging unsearchable (that is, image-based) pdf files and fixing them.

Great.  If you have commercial tools available, use them.  IMHO, we have a ways to go on our OCR integration with PDFs.

>    7) Another problem I've encountered is documents that are themselves
>    a composite of other documents (like an email thread).  The problem
>    is that a hit on such a document doesn't tell you much about the
>    true relevance of each contained document.  You have to do a
>    laborious manual search to figure it out.

Agreed.  Concordance search can be useful for making sense of large documents <self_promotion> https://github.com/mitre/rhapsode </self_promotion>.  The other thing that can be useful for handling genuine attachments (PDFs inside of email) is to treat the embedded docs as their own standalone/child docs (see the GitHub link and SOLR-7229).
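
If it helps, here's a bare-bones SolrJ sketch of that parent/child layout (field names, IDs and values are made up; it assumes a schema with the _root_ field, which recent default configsets include):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParentChildSketch {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/docs").build()) {
            // Parent: the email itself
            SolrInputDocument email = new SolrInputDocument();
            email.addField("id", "msg-001");
            email.addField("doc_type", "email");
            email.addField("subject", "Discovery production, batch 3");

            // Child: one attachment extracted from that email
            SolrInputDocument attachment = new SolrInputDocument();
            attachment.addField("id", "msg-001-att-1");
            attachment.addField("doc_type", "attachment");
            attachment.addField("text", "...extracted text of the attached PDF...");
            email.addChildDocument(attachment);

            solr.add(email);
            solr.commit();
        }
    }
}

With that layout you can query the attachments directly (e.g. q=text:foo with fq=doc_type:attachment) to see exactly which contained document matched, or use a block-join query such as q={!parent which="doc_type:email"}text:foo to roll the hit up to the owning email.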


>    8) Is there a way to return the size of a matching document (which,
>    I think, would help identify non-searchable/image documents)?

Not that I'm aware of, but that's one of the stats calculated by tika-eval.  Length of extracted string, number of tokens, number of alphabetic tokens, number of "common words" (I took top 20k most common words from Wikipedia dumps per lang)...and others.

Cheers,

            Tim