Lucene is not able to index certain words of a txt file converted from PDF

Gaurav Sharma-4
Hi,

I am using Lucene to index and search documents.
I have a PDF file (Lucene_in_action.pdf), which I converted to a txt file using PDFBox.
I indexed that txt file, but searching fails to find certain words, although Lucene does return results when I search for other words.
I have not been able to find the reason for this.
Could anyone help me work out the cause?

Thanks in advance.

Re: Lucene is not able to index certain words of a txt file converted from PDF

Otis Gospodnetic-2
Hi,

Use the java-user list; there are more people on it.

You need to change the IndexWriter setting that tells Lucene how many tokens from a document to index. By default it indexes only the first 10,000. I can't remember the parameter name, but look at the IndexWriter javadocs; it's right there.
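To make the effect of such a limit concrete, here is a minimal stand-alone sketch. It does not use Lucene and is not Lucene's implementation; the class `TokenLimitDemo`, the method `indexedTerms`, and the sample document are hypothetical names made up for illustration, and plain whitespace splitting stands in for whatever analyzer is actually in use.

```java
import java.util.HashSet;
import java.util.Set;

// Stand-alone simulation of a token limit; it does not call or reproduce Lucene code.
final class TokenLimitDemo {

    // Returns the distinct terms kept when only the first maxTokens
    // whitespace-separated tokens of the text are indexed.
    static Set<String> indexedTerms(String text, int maxTokens) {
        Set<String> terms = new HashSet<>();
        int seen = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            if (seen >= maxTokens) {
                break; // everything past the limit is silently dropped
            }
            terms.add(token);
            seen++;
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical document: 10,000 filler tokens followed by one rare word.
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            doc.append("filler ");
        }
        doc.append("appendix");

        // With a limit of 10,000 the last word never reaches the index.
        System.out.println(indexedTerms(doc.toString(), 10_000).contains("appendix"));  // false
        // With a larger limit it does.
        System.out.println(indexedTerms(doc.toString(), 100_000).contains("appendix")); // true
    }
}
```

This matches the symptom described: words that occur only late in a long file cannot be found, while words that also occur near the beginning can.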

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----

> From: m657m <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, June 18, 2008 8:24:53 AM
> Subject: Lucene is not able to index certain words of a txt file converted from PDF

Re: Lucene is not able to index certain words of a txt file converted from PDF

Gaurav Sharma-4
Thanks a lot.
The API call is:
writer.setMaxFieldLength(100000);
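Before settling on a value, it may be worth checking that it actually covers the whole file. The sketch below is only a rough estimate under stated assumptions: the class `FieldLengthCheck` and its methods are hypothetical, and it counts whitespace-separated tokens, which may differ from the number of tokens the analyzer produces.

```java
// Rough token-count check; a real analyzer may tokenize differently.
final class FieldLengthCheck {

    // Counts whitespace-separated tokens in the text.
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // True when the text has more tokens than the given limit,
    // meaning the tail of the text would not be indexed.
    static boolean exceedsLimit(String text, int maxFieldLength) {
        return countTokens(text) > maxFieldLength;
    }

    public static void main(String[] args) {
        // Hypothetical text of 120,000 tokens standing in for a converted book.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 120_000; i++) {
            text.append("word ");
        }

        System.out.println(countTokens(text.toString()));           // 120000
        System.out.println(exceedsLimit(text.toString(), 100_000)); // true
    }
}
```

If the estimate comes out above the configured value, the limit passed to setMaxFieldLength would need to be raised further for the end of the file to be searchable.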
