Not-indexed, Stored Thumbnails or NoSQL?

Joe MA
Greetings,

I have an index where I import documents such as PowerPoint, PDF, and so forth. One nice feature I added is that, for each document, I store a thumbnail of the first page as an encoded String (uuencode) in a stored, not-indexed field. This thumbnail is displayed when the user finds a document.

I am wondering how efficient this is as the index grows to perhaps hundreds of thousands, if not millions, of documents. Is it a good idea?
These encoded strings could be several hundred bytes in size, are of course completely unique for each file indexed, and provide no 'search' value. On the surface, it seems like there could be a better way to do this, given the size as well as the extra retrieval time for Lucene to pull these fields for found documents.

Since I also have a unique hash for each document in the index, it would not be too difficult to set up a separate, independent NoSQL key/value store with the thumbnail images, such as MongoDB or similar, and then retrieve the thumbnails from that store instead of keeping them in the Lucene index.  Does this seem like a better approach? Or is Lucene stored field retrieval efficient enough that there would be no benefit to doing this?  Any other ideas?
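
As a rough illustration of the side-store idea described above, the sketch below keys thumbnails by a document hash. It rests on assumptions: `ThumbnailStore` and `documentHash` are hypothetical names, SHA-256 is an assumed choice (the post does not say which hash is used), and an in-memory map stands in for MongoDB or any other key/value store.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Toy stand-in for an external key/value store that holds thumbnails by document hash. */
final class ThumbnailStore {
    // In-memory map used purely for illustration; a real deployment would call the external store.
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    /** Hex-encoded SHA-256 of the document bytes (assumed to be the hash kept in the index). */
    static String documentHash(byte[] documentBytes) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(documentBytes);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /** Stores a defensive copy of the thumbnail under the given hash. */
    void put(String hash, byte[] thumbnail) {
        blobs.put(hash, thumbnail.clone());
    }

    /** Returns a copy of the thumbnail, or empty if nothing is stored for that hash. */
    Optional<byte[]> get(String hash) {
        byte[] thumbnail = blobs.get(hash);
        return thumbnail == null ? Optional.empty() : Optional.of(thumbnail.clone());
    }
}
```

With this layout the search result only needs to carry the hash, and the display layer fetches the image in a second lookup. That second round trip is the cost to weigh against keeping the blob in the index.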

Thanks in advance,
J

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Not-indexed, Stored Thumbnails or NoSQL?

Uwe Schindler
Hi,

It's perfectly fine to store binary blobs in Lucene. This does not affect performance of queries. The stored data is also compressed using LZ4.

Just one thing: why the hell UUEncode? You can store binary blobs as is. Just pass a byte[] as the stored field value; StoredField has a constructor that takes a byte array. When you read it back through the IndexReader, you get it as a byte array too. That's the most efficient way to encode it.
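
To make the cost of a text encoding concrete, here is a small sketch using only the JDK. Base64 is swapped in for uuencode because the JDK ships no uuencode codec; both map every 3 input bytes to 4 output characters, and uuencode adds per-line framing on top. The class name `EncodingOverhead` and the 600-byte size are assumptions for illustration, not figures from the thread.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

/** Compares a raw byte[] with its text-encoded form (Base64 as a stand-in for uuencode). */
final class EncodingOverhead {
    /** Encodes raw bytes as text, the way a String-valued field would have to hold them. */
    static String toText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decodes the text form back into the original bytes. */
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] thumbnail = new byte[600];        // hypothetical thumbnail size
        new Random(42).nextBytes(thumbnail);     // pseudo-random bytes stand in for image data
        String text = toText(thumbnail);
        System.out.println("raw bytes:     " + thumbnail.length);
        System.out.println("encoded chars: " + text.length());
        System.out.println("round trip ok: " + Arrays.equals(thumbnail, fromText(text)));
    }
}
```

For a 600-byte input the text form is 800 characters, a one-third increase before any compression, and every read pays for a decode step as well. Passing the byte[] straight to the stored field avoids both.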

No need for a side database.

Uwe

In reply to Joe MA's message of December 2, 2018, 9:20:13 AM UTC.

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

Re: Not-indexed, Stored Thumbnails or NoSQL?

Arjen van der Meijden
In reply to this post by Joe MA
I'd think it depends on your application.

If it's a web application and you're generating HTML, it may be better
for the (client-side) performance to have those images load via a
webserver that can directly access the images as files (although you
could generate the images inline with base64). If it's some application
that has to load and display an image itself, then having easy control
over the entire document will likely outweigh most potential advantages
of a second database.
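
The inline base64 option mentioned above can be sketched as follows. This is only an illustration: `InlineThumbnail`, `toDataUri` and `toImgTag` are hypothetical names, and the `image/png` MIME type in the usage line is an assumption, since the thread does not say which image format the thumbnails use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Builds an HTML data URI so a thumbnail can be embedded directly in the result page. */
final class InlineThumbnail {
    /** Returns a data URI of the form data:<mimeType>;base64,<payload>. */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    /** Wraps the data URI in a minimal img element with a fixed alt text. */
    static String toImgTag(byte[] imageBytes, String mimeType) {
        return "<img src=\"" + toDataUri(imageBytes, mimeType) + "\" alt=\"thumbnail\">";
    }

    public static void main(String[] args) {
        // Placeholder bytes; real code would pass the stored thumbnail.
        byte[] fakeImage = "hi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toImgTag(fakeImage, "image/png"));
    }
}
```

Inlining saves one HTTP request per thumbnail but makes every result page larger and prevents the browser from caching images separately, which is the trade-off against serving them as files.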

Btw, Lucene can be considered a NoSQL storage ;) If you really do get
millions of documents, it may be interesting to store them elsewhere if
otherwise the database gets too large (but see Uwe's reply for ways to
reduce the storage overhead).
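
A back-of-the-envelope estimate helps judge when "too large" becomes a concern. The sketch below is plain arithmetic under assumed inputs: the class name `ThumbnailStorageEstimate`, one million documents and 600 bytes per thumbnail are all illustrative, and the result ignores Lucene's own compression and any other index overhead.

```java
/** Rough storage totals for thumbnails, before any compression by the index. */
final class ThumbnailStorageEstimate {
    /** Total bytes when each thumbnail is stored as a raw byte array. */
    static long rawBytes(long documents, int bytesPerThumbnail) {
        return Math.multiplyExact(documents, (long) bytesPerThumbnail);
    }

    /** Total characters when each thumbnail is text-encoded at 4 characters per 3 bytes, padded. */
    static long textEncodedChars(long documents, int bytesPerThumbnail) {
        long perDocument = 4L * ((bytesPerThumbnail + 2) / 3);
        return Math.multiplyExact(documents, perDocument);
    }

    public static void main(String[] args) {
        long documents = 1_000_000L;   // assumed corpus size
        int thumbnailBytes = 600;      // assumed average thumbnail size
        System.out.println("raw:          " + rawBytes(documents, thumbnailBytes) + " bytes");
        System.out.println("text-encoded: " + textEncodedChars(documents, thumbnailBytes) + " chars");
    }
}
```

With these inputs the raw total is 600,000,000 bytes and the text-encoded total is 800,000,000 characters, so at this thumbnail size the blobs stay well under a gigabyte per million documents.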

Best regards,

Arjen

Re: Not-indexed, Stored Thumbnails or NoSQL?

Michael Sokolov-4
Also, you should know that stored fields are handled in a block for each
document, so when you retrieve your first page you are in some sense also
paying for skipping over (loading into RAM and decoding) the large blobs.
As you scale, and consider other storage options, that is something to
keep in mind.
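
The effect described above can be shown with a toy model. This is not Lucene's on-disk format: `StoredRecordModel` is a hypothetical class, the (name, length, bytes) layout is invented for illustration, and Deflate from the JDK stands in for LZ4. The only point it makes is that when all of a document's fields live in one compressed record, reading a small field stored after a blob means decoding the blob first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Toy model: one document's stored fields kept together in a single compressed record. */
final class StoredRecordModel {
    /** A field value plus the number of bytes of other fields decoded before reaching it. */
    static final class Lookup {
        final byte[] value;
        final long otherBytesDecoded;

        Lookup(byte[] value, long otherBytesDecoded) {
            this.value = value;
            this.otherBytesDecoded = otherBytesDecoded;
        }
    }

    /** Writes fields in iteration order as (name, length, bytes) and compresses the whole record. */
    static byte[] writeRecord(Map<String, byte[]> fields) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(buffer))) {
            out.writeInt(fields.size());
            for (Map.Entry<String, byte[]> field : fields.entrySet()) {
                out.writeUTF(field.getKey());
                out.writeInt(field.getValue().length);
                out.write(field.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads one field, decoding and discarding every field stored before it. */
    static Lookup readField(byte[] record, String wanted) {
        long skipped = 0;
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(record)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                if (name.equals(wanted)) {
                    return new Lookup(value, skipped);
                }
                skipped += value.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalArgumentException("no such field: " + wanted);
    }

    public static void main(String[] args) {
        Map<String, byte[]> fields = new LinkedHashMap<>();
        fields.put("thumbnail", new byte[600]); // hypothetical blob, stored first
        fields.put("title", "Quarterly report".getBytes(StandardCharsets.UTF_8));
        Lookup hit = readField(writeRecord(fields), "title");
        System.out.println(new String(hit.value, StandardCharsets.UTF_8)
                + " (decoded " + hit.otherBytesDecoded + " blob bytes first)");
    }
}
```

In this model, fetching the title costs 600 bytes of blob decoding that the caller never uses. If result pages often need only the small fields, that hidden cost is one argument for keeping large blobs elsewhere.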

In reply to Arjen van der Meijden's message of Sun, Dec 2, 2018, 7:17 AM.