Apache Solr not indexing complete PDF file using Tika

Apache Solr not indexing complete PDF file using Tika

Manoj Saini
Hello Guys,

I am using Apache Solr 3.3.0 with Tika 1.0.

I have PDF files which I am pushing into Solr for content searching. Solr
is indexing the PDF files and I can see them in the Solr admin interface
for search. But the issue is that Solr is not indexing the whole file content.
It is indexing only up to a limited size.

Am I missing something, some configuration, or is this the behavior of
Apache Solr?

I have tried updating solrconfig.xml. I have changed ramBufferSizeMB and
maxFieldLength.

Thanks,

Best Regards,
Manoj Saini | Sr. Software Engineer | Stigasoft
m: +91 98 1034 1281 | e: [hidden email] | w: www.stigasoft.com


Re: Apache Solr not indexing complete PDF file using Tika

Erick Erickson
You can index 2B tokens, so raising maxFieldLength should have
fixed your problem, at least as far as Solr is concerned. How
many tokens get indexed? I'm not as familiar with Tika, but
there may be some kind of limit parameter there (although I
don't remember this coming up before).

Did you restart Solr after making the change to solrconfig.xml?

If you're seeing 10,000 tokens or so, that's the default for
maxFieldLength.

I'd recommend stopping Solr, running "rm -rf <solr home>/data/index",
and restarting Solr, just to be sure you're not seeing leftover
junk. You'll have to re-index your docs after changing
the maxFieldLength param.
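
One rough way to answer the "how many tokens get indexed?" question is sketched below. It assumes you already have the PDF's extracted text as a string, and it uses plain whitespace splitting, which only approximates whatever analyzer the Solr field uses. The function names and the probe-term idea are illustrative assumptions, not part of Solr or Tika: the sketch picks words that first occur after the 10,000th token, so you can search for them in Solr and see whether content past the default maxFieldLength made it into the index.

```python
# Default maxFieldLength mentioned in this thread (10,000 tokens).
DEFAULT_MAX_FIELD_LENGTH = 10_000


def count_tokens(text: str) -> int:
    """Approximate token count: whitespace-separated words.

    Solr's analyzer may tokenize differently, so treat this as a rough figure.
    """
    return len(text.split())


def terms_only_after(text: str, limit: int = DEFAULT_MAX_FIELD_LENGTH,
                     max_terms: int = 5) -> list[str]:
    """Return up to max_terms lowercased words that first appear after
    position `limit` in the text (hypothetical helper for this sketch).

    If a search for these words does not match the document, the field
    was probably cut off at or before `limit` tokens.
    """
    tokens = text.lower().split()
    head = set(tokens[:limit])   # words already present before the limit
    seen = set()
    probes = []
    for tok in tokens[limit:]:
        if tok not in head and tok not in seen:
            seen.add(tok)
            probes.append(tok)
            if len(probes) == max_terms:
                break
    return probes
```

If count_tokens reports far more than 10,000 tokens and none of the probe terms match the document in a search, that is consistent with the default maxFieldLength still being in effect.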


Best
Erick


On Mon, Apr 2, 2012 at 7:19 AM, Manoj Saini <[hidden email]> wrote:

> Hello Guys,
>
> I am using Apache Solr 3.3.0 with Tika 1.0.
>
> I have PDF files which I am pushing into Solr for content searching. Solr
> is indexing the PDF files and I can see them in the Solr admin interface
> for search. But the issue is that Solr is not indexing the whole file content.
> It is indexing only up to a limited size.
>
> Am I missing something, some configuration, or is this the behavior of
> Apache Solr?
>
> I have tried updating solrconfig.xml. I have changed ramBufferSizeMB and
> maxFieldLength.
>
> Thanks
> Manoj Saini

Re: Apache Solr not indexing complete PDF file using Tika

Ravish Bhagdev
I'd also suggest extracting the text with tika-app (shipped with the Tika
distribution as an executable jar) from the PDF(s) in question, to see whether
the problem is with extraction or with indexing.
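
A minimal sketch of that check, assuming Java is on the PATH and the tika-app jar has been downloaded locally. The jar file name and PDF path in the usage comment are placeholders, and the helper names are made up for illustration. It relies on tika-app's --text option, which prints the extracted plain text to standard output.

```python
import subprocess


def tika_text_command(tika_jar: str, pdf_path: str) -> list[str]:
    """Build the command line that asks tika-app for plain-text output."""
    return ["java", "-jar", str(tika_jar), "--text", str(pdf_path)]


def extract_text(tika_jar: str, pdf_path: str) -> str:
    """Run tika-app and return the extracted text.

    Raises subprocess.CalledProcessError if tika-app exits with an error.
    The output is decoded as UTF-8 with replacement characters, which is
    an assumption: tika-app may write in the platform's default encoding.
    """
    result = subprocess.run(
        tika_text_command(tika_jar, pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example usage (placeholder file names):
# text = extract_text("tika-app-1.0.jar", "report.pdf")
# print(len(text.split()), "whitespace-separated tokens extracted")
```

If the extracted text already stops short of the end of the PDF, the problem is on the extraction side; if it is complete, look at the indexing side.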

Rav

On Mon, Apr 2, 2012 at 1:55 PM, Erick Erickson <[hidden email]> wrote:

> You can index 2B tokens, so raising maxFieldLength should have
> fixed your problem, at least as far as Solr is concerned. How
> many tokens get indexed? I'm not as familiar with Tika, but
> there may be some kind of limit parameter there (although I
> don't remember this coming up before).
>
> Did you restart Solr after making the change to solrconfig.xml?
>
> If you're seeing 10,000 tokens or so, that's the default for
> maxFieldLength.
>
> I'd recommend stopping Solr, running "rm -rf <solr home>/data/index",
> and restarting Solr, just to be sure you're not seeing leftover
> junk. You'll have to re-index your docs after changing
> the maxFieldLength param.
>
>
> Best
> Erick