Apache SOLR Design Query


NetUser MSUser
Hi team,


We have a business case like the one below.

There are nearly 150 GB of documents (PDF, PPT, Word, Excel, MSG) currently
stored on a network path. To implement text search over these, we are planning
to use Solr. The plan is listed below.

1) Use a high-configuration Windows server (16 GB RAM, 1 TB disk, etc.).
2) Keep all the files on this server.
3) Index all the above docs into Solr (installed on the same Windows server).
We will use the Solr post command to post documents to this server (a sketch
of this step follows the list).
4) Through a web application, users can further add or remove files to/from
the shared path on this server.
5) A web UI to search the text from these docs and display the file names.
Users can click and download the files.
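
For step 3, a minimal sketch of what the Solr post tool does for rich
documents: it sends each file to Solr's ExtractingRequestHandler
(/update/extract, i.e. Solr Cell), which runs Tika server-side to pull out the
text. The core name "docs" and the localhost URL are assumptions for
illustration.

import requests

# Solr Cell endpoint; core name "docs" and host are assumptions.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/docs/update/extract"

def index_file(path, doc_id):
    # Send the binary file; Tika extracts the text inside Solr.
    with open(path, "rb") as f:
        resp = requests.post(
            SOLR_EXTRACT_URL,
            params={"literal.id": doc_id, "commit": "true"},
            files={"file": f},
        )
    resp.raise_for_status()

index_file(r"\\fileserver\share\report.pdf", "report.pdf")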

Listed below are the queries we have.

1) Since we cannot index specific fields here (the search is across all text
in docs of various types; a user can search for any text, and it might be in
an Excel, DOC, PPT, or .MSG file), will querying the search data (via a REST
API from the web application) have any performance hit? A sample query is
sketched below.
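
A hedged sketch of the kind of REST query the web UI would issue, assuming
the default configset, where Solr Cell maps extracted content into the
catch-all _text_ field; only stored fields (here, id) come back:

import requests

resp = requests.get(
    "http://localhost:8983/solr/docs/select",
    params={"q": '_text_:"quarterly report"', "fl": "id", "rows": 20},
)
# Print the matching file names (ids) for display in the UI.
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])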

2) Is it the right decision to keep the physical files in a shared folder on
the server itself (as a shared drive) instead of storing them in a database or
other storage?


Regards,
MS

Re: Apache SOLR Design Query

Rahul Singh-3
This is a good start. A few things to consider:

1. Extract the contents via Tika externally or via Tika Server (see the extraction sketch after this list).
2. Create a canonical "Item" document schema with title, metadata, contents, imagePreview (something to consider), etc.
3. Use the extracted Tika data to populate your index.
4. Unless you need highlighting, index only the actual contents, and store the rest of the fields (a schema sketch follows below).
5. Shared file storage is probably OK, but you may want to add a caching layer via Nginx and serve files through it. That way you don't hit the disk every time (a config sketch follows below).
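
For items 1 and 3, a minimal sketch, assuming a standalone Tika Server on its
default port 9998 (PUT /tika returns extracted plain text) and a Solr core
named "docs"; the field names match the hypothetical "Item" schema sketched
next:

import requests

TIKA_URL = "http://localhost:9998/tika"
SOLR_URL = "http://localhost:8983/solr/docs/update/json/docs"

def extract_and_index(path, doc_id):
    # Item 1: Tika Server turns the binary file into plain text.
    with open(path, "rb") as f:
        text = requests.put(TIKA_URL, data=f,
                            headers={"Accept": "text/plain"}).text
    # Item 3: populate the index with a canonical "Item" document.
    doc = {"id": doc_id, "title": doc_id, "filename": doc_id, "contents": text}
    requests.post(SOLR_URL, json=doc,
                  params={"commit": "true"}).raise_for_status()

extract_and_index(r"\\fileserver\share\report.pdf", "report.pdf")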
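
For items 2 and 4, a sketch of that "Item" schema via Solr's Schema API:
contents is indexed but not stored (fine when you don't need highlighting),
while the display fields are stored. The field names and the "docs" core are
assumptions, not a prescribed schema:

import requests

SCHEMA_URL = "http://localhost:8983/solr/docs/schema"

fields = [
    {"name": "title",    "type": "text_general", "indexed": True, "stored": True},
    {"name": "filename", "type": "string",       "indexed": True, "stored": True},
    # Index-only: searchable, but not returned in results or highlightable.
    {"name": "contents", "type": "text_general", "indexed": True, "stored": False},
]
for field in fields:
    requests.post(SCHEMA_URL, json={"add-field": field}).raise_for_status()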
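
And for item 5, a hedged nginx.conf fragment (placed in the http context):
Nginx fronts the file server and caches downloads on its own local disk, so
repeated fetches don't hit the share every time. The upstream name, paths,
and cache sizes are placeholders:

proxy_cache_path /var/cache/nginx/files levels=1:2 keys_zone=files:10m max_size=10g;

server {
    listen 80;
    location /files/ {
        # Upstream serving the shared folder; name is a placeholder.
        proxy_pass        http://fileserver/;
        proxy_cache       files;
        proxy_cache_valid 200 1h;   # keep successful responses for an hour
    }
}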


--
Rahul Singh
[hidden email]

Anant Corporation
