solr cell: write entire file content binary to index along with metadata


lee carroll
Does the solr cell contrib give access to the file's raw content along with
the extracted metadata?

cheers Lee C

Re: solr cell: write entire file content binary to index along with metadata

Shawn Heisey-2
On 4/24/2018 10:26 AM, Lee Carroll wrote:
> Does the solr cell contrib give access to the file's raw content along with
> the extracted metadata?

That's not usually the kind of information you want to have in a Solr
index.  Most of the time, there will be an entry in the Solr index that
tells the system making queries how to locate the actual data -- a
filename, a URL, a database lookup key, etc.

I have no idea whether solr-cell can put the info in the index.  My best
guess would be that it can't, since putting the entire binary content
into the index isn't recommended.

We don't recommend using solr-cell for production indexing.  If you
follow recommendations and write your own indexing program using Tika,
then you can do pretty much anything you want, including writing the
full content into the index.
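
As a rough illustration of what "writing the full content into the index" could look like from a separate indexing program, here is a minimal Python sketch. It assumes the text and metadata have already been extracted by Tika elsewhere. The field names (`content_txt`, `raw_content_bin`, the `meta_*_s` pattern) and both function names are hypothetical, and the raw-content field is assumed to be declared in the schema with a binary field type (`solr.BinaryField`), which takes base64-encoded strings.

```python
import base64
import json
import re
import urllib.request


def build_solr_doc(doc_id, raw_bytes, metadata, text):
    """Build one Solr document holding extracted text, metadata, and the raw file."""
    doc = {
        "id": doc_id,
        "content_txt": text,  # extracted text, assumed to come from Tika
        # Raw bytes are base64-encoded so they can travel inside JSON; the
        # target field is assumed to be a binary field in the schema.
        "raw_content_bin": base64.b64encode(raw_bytes).decode("ascii"),
    }
    for key, value in metadata.items():
        # Turn a metadata key such as "Content-Type" into a field-name-safe
        # token; the meta_*_s naming is an illustrative convention only.
        safe_key = re.sub(r"[^a-z0-9]+", "_", key.lower()).strip("_")
        doc["meta_%s_s" % safe_key] = value
    return doc


def send_to_solr(docs, update_url):
    """POST a list of documents to a Solr JSON update endpoint."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

A call might look like `send_to_solr([doc], "http://localhost:8983/solr/files/update?commit=true")`, where `files` is a made-up collection name. Base64 adds roughly a third to the size of every stored file, which is one practical reason this approach suits prototyping better than production.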

Thanks,
Shawn


Re: solr cell: write entire file content binary to index along with metadata

lee carroll
> That's not usually the kind of information you want to have in a Solr
> index.  Most of the time, there will be an entry in the Solr index that
> tells the system making queries how to locate the actual data -- a
> filename, a URL, a database lookup key, etc.


Agreed. The app will have a few implementations for storing the binary
file. The easiest one for a user to configure for prototyping would be a
store-in-index implementation. A live implementation would probably use
the file system.

> We don't recommend using solr-cell for production indexing.


OK. Are the reasons:

Performance? I think we have a rather modest indexing requirement (1000 a
day... on a busy day).

Security? The indexing workflow is: files are uploaded to a public-facing
server with auth, written to disk, scanned, copied to an internal server,
and ingested into the index from there.

Are there other reasons we should worry about?

Cheers Lee C


Re: solr cell: write entire file content binary to index along with metadata

Shawn Heisey-2
On 4/25/2018 4:02 AM, Lee Carroll wrote:

>     *We don't recommend using solr-cell for production indexing.*
>
> Ok. Are the reasons for:
>
> Performance. I think we have rather modest index requirement (1000 a day...
> on a busy day)
>
> Security. The index workflow is, upload files to public facing server with
> auth. Files written to disk, scanned and copied to internal server and
> ingested into index via here.
>
>   other reasons we should worry about ?

Tika is the underlying technology in solr-cell.  Tika is a separate
Apache product designed for parsing common rich-text formats, like
Microsoft Office, PDF, etc.

http://tika.apache.org/

The problems that can result are related to running Tika inside of Solr,
which is what solr-cell does.

The Tika authors try very hard to make sure that Tika doesn't misbehave,
but the very nature of what Tika does means it is somewhat prone to
misbehaving.  Many of the file formats that Tika processes are
undocumented, or whatever documentation exists is not available to
open source developers.  Also, sometimes documents in those formats will
be constructed in a way that the Tika authors have never seen before, or
they may completely violate the conventions the authors DO know about.

Long story short -- Tika can encounter documents that can cause it to
crash, or to consume all the memory in the system, or misbehave in other
ways.  If Tika is running inside Solr, then when it has a problem, Solr
itself can blow up and have a problem too.

For this reason, and because Tika can sometimes use a lot of resources
even when it is working correctly, we recommend running it outside of
Solr in another program that takes its output and sends it to Solr. 
Ideally, it will be running on a completely different machine than Solr
is running on.
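
To make the isolation idea concrete, here is a minimal Python sketch of an indexing program that runs extraction in a child process with a time limit, so that a crash or hang in the extractor cannot take the caller (or Solr) down with it. The function name `extract_isolated` is hypothetical, and the sketch is not specific to Tika: it wraps whatever command you hand it.

```python
import subprocess


def extract_isolated(command, timeout_seconds):
    """Run an extraction command in a child process.

    Returns (ok, output). A crash, non-zero exit, or timeout in the child
    yields ok=False instead of propagating into the calling process.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a runaway
        # parse does not linger in the background.
        return False, "timed out after %s seconds" % timeout_seconds
    if completed.returncode != 0:
        return False, completed.stderr.decode("utf-8", errors="replace")
    return True, completed.stdout.decode("utf-8", errors="replace")
```

With the Tika command-line app, the command might be something like `["java", "-Xmx512m", "-jar", "tika-app.jar", "--text", path]`, where the jar path and the heap limit are placeholders to adapt. Only when `ok` is true would the program go on to send the extracted text to Solr; a failed document can be logged and skipped.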

Thanks,
Shawn


Re: solr cell: write entire file content binary to index along with metadata

Rahul Singh-3
Lucene (the major underlying technology in Solr) can handle any data, but it's optimized to be an index, not a file store. It's better to keep the files in another database or storage system, such as Cassandra or S3, than in Solr.

In our experience, running Tika as a separate binary or microservice in a pre-index step can improve the overall stability of the Solr service.
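
A minimal sketch of that pre-index step in Python, assuming a Tika server is listening on its default port 9998 and exposes the `/rmeta/text` endpoint, which returns a JSON list whose first entry describes the container document and carries the extracted text under `X-TIKA:content`. The function names, the Solr field names, and the `storage_key` value are hypothetical; the file itself stays in whatever external store is used and only a pointer goes into the index.

```python
import json
import urllib.request


def fetch_rmeta(file_bytes, tika_url="http://localhost:9998/rmeta/text"):
    """PUT a file to a Tika server and return its parsed JSON response."""
    request = urllib.request.Request(
        tika_url,
        data=file_bytes,
        headers={"Accept": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def rmeta_to_solr_doc(doc_id, storage_key, rmeta):
    """Map the first rmeta entry (the container document) to a Solr document."""
    entry = dict(rmeta[0])  # copy, so the caller's data is left untouched
    text = entry.pop("X-TIKA:content", "")
    return {
        "id": doc_id,
        # Pointer to where the raw file really lives (path, URL, lookup key).
        "storage_key_s": storage_key,
        "content_txt": text.strip(),
        "content_type_s": entry.get("Content-Type", ""),
    }
```

The resulting dictionary can be posted to Solr's JSON update handler like any other document. Depending on the Tika version, some metadata values may come back as lists rather than strings, so a real mapping would need to handle both.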


--
Rahul Singh
[hidden email]

Anant Corporation
