What is the best way of Indexing different formats of documents?


What is the best way of Indexing different formats of documents?

sangs8788
Hi,

I am a newbie to Solr and come from a database background. We have a requirement to index files of different formats (X12, EDIFACT, CSV, XML).
The input files can be of any format, and we need to do content-based search on them.

From the web I understand we can use Tika to extract the content and store it in Solr. What I want to know is: is there a better approach for indexing files in Solr? Can we index documents by streaming them directly from the application? If so, what are the disadvantages of doing that (versus DIH, which fetches from the database)? Could someone share some insight on this? Are there any web links I can refer to for more detail? Please do help.

Thanks
Sangeetha


Re: What is the best way of Indexing different formats of documents?

Swaraj Kumar
You can always choose either DIH or /update/extract to index docs in Solr.
There are multiple benefits of DIH, which I am listing below (example
commands follow the list):

1. Clean and re-index using a single command.
2. DIH can also optimize the index after import using optimize=true.
3. You can do delta-import based on the last index time, whereas with
/update/extract you need to do a manual operation for delta imports.
4. You can use multiple entity processors and transformers with DIH,
which is very useful for indexing exactly the data you want.
5. The query parameter "rows" limits the number of records imported.
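
For illustration, the usual DIH commands look something like this
(assuming a DataImportHandler registered at /dataimport; the host, port
and core name "mycore" are placeholders):

    # Full import: wipe the index first (clean=true), optimize afterwards
    http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true&optimize=true

    # Delta import: only rows changed since the stored last_index_time
    http://localhost:8983/solr/mycore/dataimport?command=delta-import&clean=false

    # Check progress of a running import
    http://localhost:8983/solr/mycore/dataimport?command=status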

Regards,


Swaraj Kumar
Senior Software Engineer I
MakeMyTrip.com
Mob No- 9811774497


Re: What is the best way of Indexing different formats of documents?

Malcolm Upayavira Holmes
In reply to this post by sangs8788



You can have Solr do the Tika work for you by posting to
/update/extract. See here:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

You can only post one document at a time, and you will have to provide
extra metadata fields in the URL you post to (e.g. the document ID).

If the extracting update handler can handle what you need, then you are
good. Otherwise, you will want to write your own code to call Tika, then
push the extracted content as a plain document.

Solr is just an HTTP server, so your application can post binary files
for Solr to ingest with Tika, or otherwise.
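
As a minimal sketch of that from SolrJ (assuming SolrJ 5.x, a core named
"mycore", and the handler at its default /update/extract path; the file
and id are placeholders):

    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractPost {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
            // Post one binary file; Tika runs inside Solr via the extracting handler
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/path/to/report.pdf"), "application/pdf");
            req.setParam("literal.id", "report-1"); // extra metadata goes on the URL as literal.* params
            req.setParam("commit", "true");
            solr.request(req);
            solr.close();
        }
    }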

Upayavira
Reply | Threaded
Open this post in threaded view
|

Re: What is the best way of Indexing different formats of documents?

Yavar Husain
In reply to this post by sangs8788
I have indexed heterogeneous sources, including a variety of NoSQL stores,
RDBMSs and rich documents (PDF, Word, etc.), using SolrJ. The only
prerequisite for using SolrJ is that you have an API to fetch data from
your data source (say, JDBC for an RDBMS, Tika for extracting text content
from rich documents, etc.); beyond that SolrJ is so damn great and simple.
It's as simple as downloading the jar and writing a few lines of code to
send data to your Solr server after pre-processing your data. More details here:

http://lucidworks.com/blog/indexing-with-solrj/

https://wiki.apache.org/solr/Solrj

http://www.solrtutorial.com/solrj-tutorial.html
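
A rough sketch of that route (assuming SolrJ 5.x and the Tika facade
class on the classpath; the core name, field names and file path are
placeholders, and your schema must have matching fields):

    import java.io.File;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class TikaThenSolrJ {
        public static void main(String[] args) throws Exception {
            // Extract plain text from any format Tika understands (PDF, Word, XML, ...)
            Tika tika = new Tika();
            File f = new File("/path/to/invoice.docx");
            String text = tika.parseToString(f);

            // Send it to Solr as an ordinary document, pre-processed however you like
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", f.getName());
            doc.addField("content", text);
            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }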

Cheers,
Yavar




Re: What is the best way of Indexing different formats of documents?

Dan Davis-2
Sangeetha,

You can also run Tika directly from the Data Import Handler, and Data
Import Handler can be made to run several threads if you can partition
the input documents by directory or database id. I've done 4 "threads"
by having a base configuration that does an Oracle query like this:

      SELECT * FROM (SELECT id, url, ..., MOD(ROWNUM, 4) AS threadid
      FROM ... WHERE ...) WHERE threadid = %d

A bash/sed script writes several data import handler XML files.
I can then index several threads at a time.

Each of these threads can then use all the transformers, e.g.
templateTransformer, etc.
XML can be transformed via XSLT.

The Data Import Handler has other entities that go out to the web and then
index the document via Tika.
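
For the filesystem case, a DIH configuration along these lines walks a
directory and runs each file through Tika (a sketch only - the baseDir,
entity names and field names are assumptions to adapt):

    <dataConfig>
      <dataSource type="BinFileDataSource" name="bin"/>
      <document>
        <!-- FileListEntityProcessor enumerates the files... -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/data/incoming" fileName=".*" recursive="true"
                rootEntity="false" dataSource="null">
          <!-- ...and TikaEntityProcessor extracts text from each one -->
          <entity name="doc" processor="TikaEntityProcessor" dataSource="bin"
                  url="${files.fileAbsolutePath}" format="text">
            <field column="text" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>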

If you are indexing generic HTML, you may want to figure out an approach
to SOLR-3808 and SOLR-2250. These can be resolved by recompiling Solr and
Tika locally, because Boilerpipe has a bug that has been fixed but not
pushed to Maven Central. Without that, the ASF cannot include the fix,
but distributions such as LucidWorks Solr Enterprise can.

I can drop some configs into github.com if I clean them up to obfuscate
host names, passwords, and such.



Re: What is the best way of Indexing different formats of documents?

Erick Erickson
The disadvantages of DIH are:
1> it's a black box; debugging it isn't easy.
2> it puts all the work on the Solr node. Parsing documents in various
formats can be pretty heavyweight and steal cycles from indexing and
searching.
2a> the extracting request handler also puts all the load on Solr, FWIW.


Personally I prefer an external program (and I was gratified to see
Yavar's reference to the indexing with SolrJ article...). But then I'm
a Java programmer by training, so that seems easy...

Best,
Erick


Re: What is the best way of Indexing different formats of documents?

sangs8788
In reply to this post by Swaraj Kumar
I just want to index certain documents, and there will not be any
updates happening on the indexed documents.

In our existing system we already have DIH implemented, which indexes
documents from SQL Server (as you said, based on last index time). In
that case the metadata is available in the database, but if we are
streaming via URL we would need to append the metadata too. Correct me
if I am wrong.

Is /update/extract the ExtractingRequestHandler you are talking about?
Is that where we post the metadata in the URL to the Solr server? When
you say manual operation, what is it you are talking about?

Can you please clarify.

RE: What is the best way of Indexing different formats of documents?

sangs8788
In reply to this post by Swaraj Kumar
Hi Swaraj,

Thanks for the answers.

From my understanding, we can index:

- Using DIH from a database.
- Using DIH from the filesystem - this is where I am concentrating.
  - For this we can use SolrJ with Tika (Solr Cell) from the Java layer
    to extract the content and send the data through the REST API to the
    Solr server.
  - Or we can use the ExtractingRequestHandler to do the job.

I just want to index certain documents, and there will not be any
updates happening on the indexed documents.

In our existing system we already have DIH implemented, which indexes
documents from SQL Server (as you said, based on last index time). In
that case the metadata is available in the database.

But if we are streaming via URL, we would need to append the metadata
too. Correct me if I am wrong. And how does the indexing happen here -
based on last index time or something else? Also, for the
ExtractingRequestHandler, when you say manual operation, what is it you
are talking about? Can you please clarify.

Thanks

Sangeetha




Re: What is the best way of Indexing different formats of documents?

Swaraj Kumar
Hi Sangeetha,

/update/extract is the ExtractingRequestHandler, yes.

If you only want to index the data, you can do it with the
ExtractingRequestHandler. I don't think it requires other metadata, but
you need to provide literal.id in the request to supply a value for the
unique id field.
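
For example, something like this posts one file with its unique id
supplied on the URL (core name, file name and id value are placeholders):

    curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&commit=true" \
         -F "myfile=@tutorial.pdf"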

For more information:
https://wiki.apache.org/solr/ExtractingRequestHandler
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika



Regards,


Swaraj Kumar
Senior Software Engineer I
MakeMyTrip.com
Mob No- 9811774497
