solr issue with pdf forms

solr issue with pdf forms

Steve.Scholl
Hi guys,

I hope you can help me with my issue. We are using a Solr setup and have the following problem:
- regular PDF files are indexed just fine
- PDF files with writable form fields look like this:
Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind

Somehow the blank space character is not indexed correctly.

Is this a known issue? Does anybody have an idea?

Thanks a lot
Best
Steve

Odp.: solr issue with pdf forms

Tomasz Borek
Off the top of my head, I'd look into how writable PDFs are created and encoded.

@LAFK_PL

AW: Odp.: solr issue with pdf forms

Steve.Scholl
Thanks for your answer. Maybe my English is not good enough: what are you trying to say? Sorry, I didn't get the point.
:-(



Re: Odp.: solr issue with pdf forms

Erick Erickson
Are they not _indexed_ correctly or not being displayed correctly?
Take a look at admin UI>>schema browser>> your field and press the
"load terms" button. That'll show you what is _in_ the index as
opposed to what the raw data looked like.

When you return the field in a Solr search, you get a verbatim,
un-analyzed copy of your original input. My guess is that your browser
isn't using a compatible character encoding for display.
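
To illustrate the kind of mismatch guessed at here, a minimal Python sketch (not from the thread; the sample word and the two encodings are assumptions chosen for illustration) shows how the same text looks when its bytes are decoded with the wrong encoding:

```python
# Assumption: the sample word and the encodings are illustrative only.
text = "bestätige"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# UTF-8 bytes read as Latin-1: every byte maps to some character,
# so the two-byte "ä" turns into two unrelated characters.
as_latin1 = utf8_bytes.decode("latin-1")

# Latin-1 bytes read as UTF-8: the lone 0xE4 byte is not valid UTF-8,
# so the decoder substitutes U+FFFD (shown as a question mark in a diamond).
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

If the stored bytes are fine and only the display is decoding them wrongly, switching the browser or client to the right encoding should make the odd characters disappear without reindexing.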

Best,
Erick


Re: solr issue with pdf forms

Dan Davis-2
In reply to this post by Steve.Scholl
Steve,

Are you using ExtractingRequestHandler / DataImportHandler or extracting
the text content from the PDF outside of Solr?


Re: Odp.: solr issue with pdf forms

Dan Davis-2
In reply to this post by Erick Erickson
+1 - I like Erick's answer.  Let me know if that turns out to be the
problem - I'm interested in this problem and would be happy to help.


AW: Odp.: solr issue with pdf forms

Steve.Scholl
In reply to this post by Erick Erickson
Hey Erick,

thanks for your answer. They are not indexed correctly. In the Solr admin interface, too, I see the typical question marks inside a diamond where a blank space should be.
I now figured out the following (not sure if it is relevant at all):
- PDF documents created with "Acrobat PDFMaker 10.0 for Word" are indexed correctly, no issues
- PDF documents (with editable form fields) created with "Adobe InDesign CS5 (7.0.1)"  are indexed with the blank space issue
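
One way to pin down what the odd character actually is would be to list the code points of a returned field value. The following is only a sketch: the helper name `describe_non_ascii` and the sample string are hypothetical, and the sample assumes the character is U+FFFD, which the thread does not confirm.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Assumption: a made-up value in which the space was replaced by U+FFFD.
sample = "Ich\ufffdbest\u00e4tige"
for entry in describe_non_ascii(sample):
    print(entry)
```

If the value really contains U+FFFD, the original character was already lost before the text reached the client. If it contains something else, for example a non-breaking space (U+00A0), the name printed here would say so.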

Best
Steve


Re: Odp.: solr issue with pdf forms

Erick Erickson
When you say "they're not indexed correctly", what's your evidence?
You cannot rely on the display in the browser; that's the raw input
just as it was sent to Solr, _not_ the actual tokens in the index.
What do you see when you go to the admin schema browser page and load
the actual tokens?

Or use the TermsComponent
(https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
to see the actual terms in the index as opposed to the stored data you
see in the browser when you look at search results.
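
As a rough sketch of what such a TermsComponent request could look like, the Python below only builds the URL. It assumes a `/terms` request handler is registered in solrconfig.xml, that the base URL (here a local default) is adjusted to include a core name where needed, and that the field is called `content`; the helper name `terms_url` and the prefix value are hypothetical.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, prefix=None, limit=20):
    """Build a TermsComponent request URL for one field."""
    params = {
        "terms": "true",
        "terms.fl": field,       # field whose indexed terms should be listed
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",
    }
    if prefix:
        params["terms.prefix"] = prefix  # only list terms starting with this prefix
    return f"{base_url}/terms?{urlencode(params)}"


# Assumption: local Solr, a field named "content", an arbitrary prefix.
print(terms_url("http://localhost:8983/solr", "content", prefix="unter"))
```

Opening the printed URL in a browser or with curl returns the indexed terms, which can then be compared with the stored text shown in search results.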

If the actual terms don't seem right _in the index_, we need to see
your analysis chain, i.e. your fieldType definition.

I'm 90% sure you're seeing the stored data and your terms are indexed
just fine, but I've certainly been wrong before, more times than I
want to remember...

Best,
Erick


Re: Odp.: solr issue with pdf forms

Dan Davis-2
Steve,

You gave as an example:

Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�
vollständig�sind

This sentence is probably from the PDF form label content, rather than form
values.   Sometimes in PDF, the form's value fields are kept in a separate
file.   I'm 99% sure Tika won't be able to handle that, because it handles
one file at a time.   If the form's value fields are in the PDF, Tika
should be able to handle it, but may be making some small errors that could
be addressed.

When you look at the form in Acrobat Reader, can you see whether the
indexed words contain any words from the form fields' values?

If you have a form where the data is not sensitive, I can investigate.   If
you are interested in this contact me offline - to [hidden email] or
[hidden email].

Thanks,

Dan


AW: Odp.: solr issue with pdf forms

Steve.Scholl
In reply to this post by Erick Erickson
Hey Erick,

thanks a lot for your answer. I went to the admin schema browser, but what should I see there? Sorry, I'm not familiar with the admin schema browser. :-(

Best
Steve



Re: Odp.: solr issue with pdf forms

Erick Erickson
Steve:

Right, it's not exactly obvious. Bring up the admin UI, something like
http://localhost:8983/solr. From there you have to select a core in
the 'core selector' drop-down on the left side. If you're using
SolrCloud, this will have a rather strange name, but it should be easy
to identify what collection it belongs to.

At that point you'll see a bunch of new options, among them "schema
browser". From there, select your field from the drop-down that will
appear, then a button should pop up "load term info".

NOTE: you can get the same information from the TermsComponent, see:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
This is a little more flexible because you can, among other things,
specify the place to start. In your case you might specify
terms.prefix=mein, which will show you the terms that are actually
being _searched_ as opposed to the stored text. The stored text is what
you see in the browser when you search for docs, and it is sometimes
misleading, as it (probably) is here.
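
Once a terms response is available, checking it for damaged tokens can be automated. The sketch below is an assumption-heavy illustration: it assumes the JSON response has the shape `{"terms": {field: [term, count, term, count, ...]}}`, which can differ with the Solr version and the `json.nl` setting, and both the helper name and the sample data are made up.

```python
def flag_suspicious_terms(terms_response, field):
    """Return (term, count) pairs whose term contains U+FFFD."""
    flat = terms_response["terms"][field]
    # The assumed layout alternates term and count: pair them up.
    pairs = list(zip(flat[0::2], flat[1::2]))
    return [(term, count) for term, count in pairs if "\ufffd" in term]


# Assumption: invented response data, not taken from the thread.
sample_response = {
    "terms": {
        "content": ["mein", 12, "meiner", 7, "meiner\ufffdunterschrift", 1]
    }
}
print(flag_suspicious_terms(sample_response, "content"))
```

An empty result would support the view that the index itself is fine and only the stored or displayed text is affected.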

Best,
Erick


AW: Odp.: solr issue with pdf forms

Steve.Scholl
Erick,

thanks a lot for helping me here. In my case it is the "content" field which is not displayed correctly. So I went to the schema browser as you pointed out. Here is the information I found:
Field: content
Field Type: text
Properties:  Indexed, Tokenized, Stored, TermVector Stored
Schema:  Indexed, Tokenized, Stored, TermVector Stored
Index:  Indexed, Tokenized, Stored, TermVector Stored
Copied Into: spell teaser
Position Increment Gap:  100
Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:  
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.SynonymFilterFactory args:{synonyms: german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4 minWordSize: 5 dictionary: german/german-common-nouns.txt luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.StopFilterFactory args:{words: german/stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.GermanNormalizationFilterFactory args:{luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_36 }
Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:  
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 catenateNumbers: 0 }
org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.StopFilterFactory args:{words: german/stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.GermanNormalizationFilterFactory args:{luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_36 }
Distinct:  160403
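
Since the chain above starts with a whitespace tokenizer, it may help to see why a non-whitespace character in place of a space matters at that first step. The Python below is only an analogy: `str.split()` stands in for the tokenizer (its notion of whitespace is not identical to Java's), the broken sample assumes the character is U+FFFD, and later filters in the chain such as WordDelimiterFilterFactory may still split the token further.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a stand-in for a whitespace tokenizer."""
    return text.split()


normal = "Ich bestätige mit meiner Unterschrift"
# Assumption: spaces replaced by U+FFFD, as the displayed text suggests.
broken = "Ich\ufffdbestätige\ufffdmit\ufffdmeiner\ufffdUnterschrift"

print(len(whitespace_tokens(normal)))  # one token per word
print(len(whitespace_tokens(broken)))  # a single long token
```

This is why the terms actually present in the index, rather than the stored text, decide whether searching is affected.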

Does this somehow help to figure out the issue?
Thanks
Best
Steve


-----Ursprüngliche Nachricht-----
Von: Erick Erickson [mailto:[hidden email]]
Gesendet: Freitag, 24. April 2015 20:15
An: [hidden email]
Betreff: Re: Odp.: solr issue with pdf forms

Steve:

Right, it's not exactly obvious. Bring up the admin UI, something like http://localhost:8983/solr. From there you have to select a core in the 'core selector' drop-down on the left side. If you're using SolrCloud, this will have a rather strange name, but it should be easy to identify what collection it belongs to.

At that point you'll see a bunch of new options, among them "schema browser". From there, select your field from the drop-down that will appear, then a button should pop up "load term info".

NOTE: you can get the same information from the TermsComponent, see:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
This is a little more flexible because you can, among other things, specify the place to start. In your case you might specify terms.prefix=mein which will show you the terms that are actually being _searched_ as opposed to being stored. This latter is what you see in the browser when you search for docs and is sometimes misleading as you're (probably) seeing.

Best,
Erick

On Fri, Apr 24, 2015 at 1:58 AM,  <[hidden email]> wrote:

> Hey Erick,
>
> thanks a lot for your answer. I went to the admin schema browser, but
> what should I see there? Sorry I'm not firm with the admin schema
> browser. :-(
>
> Best
> Steve
>
>
> -----Ursprüngliche Nachricht-----
> Von: Erick Erickson [mailto:[hidden email]]
> Gesendet: Donnerstag, 23. April 2015 18:00
> An: [hidden email]
> Betreff: Re: Odp.: solr issue with pdf forms
>
> When you say "they're not indexed correctly", what's your evidence?
> You cannot rely
> on the display in the browser, that's the raw input just as it was sent to Solr, _not_ the actual tokens in the index. What do you see when you go to the admin schema browser pate and load the actual tokens.
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you see in the browser when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see your analysis chain, i.e. your fieldType definition.
>
> I'm, 90% sure you're seeing the stored data and your terms are indexed just fine, but I've certainly been wrong before, more times than I want to remember.....
>
> Best,
> Erick
>

Re: Odp.: solr issue with pdf forms

Erick Erickson
We're still not quite there. There should be a "load term info" button
on that page. Clicking that button will show you the terms in your
index (as opposed to the raw stored input which is what you get when
you look at results in the browser). My bet is that you'll see
perfectly normal tokens in the index that will NOT have the wonky
characters you see in the display.

If that's the case, then you have a browser issue and Solr is working
perfectly fine. On the other hand, if the individual terms are weird,
then you have something more fundamental going on.
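
As a rough illustration of that check (a sketch, not something taken from the Solr documentation): browsers usually draw U+FFFD, the Unicode replacement character, as a question mark inside a diamond, so a list of indexed terms can be scanned for it. The sample terms and the helper name `split_terms` are made up for this example; real terms would come from the schema browser or the TermsComponent.

```python
# U+FFFD is the Unicode replacement character; browsers usually draw it
# as a question mark inside a diamond.
REPLACEMENT_CHAR = "\ufffd"


def split_terms(terms):
    """Split terms into (clean, broken); broken terms contain U+FFFD."""
    clean = [t for t in terms if REPLACEMENT_CHAR not in t]
    broken = [t for t in terms if REPLACEMENT_CHAR in t]
    return clean, broken


# Hypothetical sample terms, not taken from a real index.
sample = ["meiner", "unterschrift", "ich\ufffdbest\u00e4tige"]
clean, broken = split_terms(sample)
print("clean: ", clean)
print("broken:", broken)
```

If `broken` stays empty for the real terms, the index is probably fine and the problem is on the display side; otherwise the bad characters made it into the index.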

Which is why I mentioned the TermsComponent. That will return indexed
tokens, and allows you a bit more flexibility than the admin page in
terms of what tokens you see, but it's essentially the same
information.
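
A minimal sketch of such a request, assuming a `/terms` request handler is configured on the core. The core name `mycore` is a placeholder and the helper `terms_url` is made up for this example; `terms.fl`, `terms.prefix` and `terms.limit` are TermsComponent parameters, and `mein` is the prefix suggested earlier in this thread.

```python
from urllib.parse import urlencode


def terms_url(base_url, core, field, prefix, limit=20):
    """Build a TermsComponent URL for indexed terms of `field` starting with `prefix`."""
    params = [
        ("terms.fl", field),          # field whose indexed terms are listed
        ("terms.prefix", prefix),     # only terms starting with this string
        ("terms.limit", str(limit)),  # maximum number of terms returned
    ]
    return f"{base_url}/{core}/terms?{urlencode(params)}"


# "mycore" is a placeholder; use the core chosen in the core selector.
print(terms_url("http://localhost:8983/solr", "mycore", "content", "mein"))
```

Opening the printed URL in a browser should list the matching indexed terms together with the number of documents that contain each one.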

Best,
Erick


AW: Odp.: solr issue with pdf forms

Steve.Scholl
Thanks a lot for being patient with me. Unfortunately there is no "load term info" button. :-(
Could you maybe help me use the TermsComponent instead? I read that it is configured by default.

Thanks a lot
Best
Steve


Re: Odp.: solr issue with pdf forms

Erick Erickson
There better be.

1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for the TermsComponent. What have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick


AW: Odp.: solr issue with pdf forms

Steve.Scholl
Sorry, but there really isn't... :-/

I have never used the TermsComponent, so I first checked whether it is configured, and it is.
Then I tried to get an idea of how it works by running the examples described in the documentation.
After that I tried to figure out how to get the output for the "miscoded" PDF content.
My first step was to find the fields I need:

http://IP:8080/solr/core_de/terms?terms.fl=content&terms.fl=fileReferenceDocumentId&terms.fl=fileName

This gives me a top 10 list of the indexed documents and shows the fields content, fileReferenceDocumentId and fileName if I understand the documentation correctly.
Then I tried to limit the output to the specific file that has the encoding issues:

http://IP:8080/solr/core_de/terms?terms.fl=content&terms.fl=fileReferenceDocumentId&terms.fl=fileName&terms.prefix=CODING-ISSUE.pdf

But then the content of the content field is no longer shown. :-(
The result looks like this:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="content"/>
<lst name="fileReferenceDocumentId"/>
<lst name="fileName">
<int name=" CODING-ISSUE.pdf ">3</int>
</lst>
</lst>
</response>

Any help would be appreciated
Thanks a lot
Best
Steve
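
One possible reading of the response above: the TermsComponent lists the indexed terms of each field, together with the number of documents containing each term, and `terms.prefix` filters those terms by their leading characters. It does not select a document. Under that reading, `content` comes back empty simply because no indexed term in `content` starts with `CODING-ISSUE.pdf`. The sketch below parses the XML as posted, using only the Python standard library; the function name `parse_terms` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The response as posted above.
RESPONSE = """<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="content"/>
<lst name="fileReferenceDocumentId"/>
<lst name="fileName">
<int name=" CODING-ISSUE.pdf ">3</int>
</lst>
</lst>
</response>"""


def parse_terms(xml_text):
    """Return {field: {term: count}} from a TermsComponent XML response."""
    root = ET.fromstring(xml_text)
    terms_node = root.find("./lst[@name='terms']")
    result = {}
    if terms_node is None:
        return result
    for field in terms_node.findall("lst"):
        # Each <int> child is one indexed term with its document count.
        result[field.get("name")] = {
            term.get("name"): int(term.text) for term in field.findall("int")
        }
    return result


for field, terms in parse_terms(RESPONSE).items():
    print(field, terms)
```

Querying `terms.fl=content` on its own, with a prefix taken from the affected text (for example the `terms.prefix=mein` suggested earlier in the thread), may say more about what actually ended up in the index.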


-----Original Message-----
From: Erick Erickson [mailto:[hidden email]]
Sent: Wednesday, April 29, 2015 03:07
To: [hidden email]
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for the TermsComponent. What have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,  <[hidden email]> wrote:

> Thanks a lot for being patient with me. Unfortunately there is no "load term info" button. :-(
> Could you maybe help me use the TermsComponent instead? I read that it is configured by default.
>
> Thanks a lot
> Best
> Steve
>
> -----Original Message-----
> From: Erick Erickson [mailto:[hidden email]]
> Sent: Monday, April 27, 2015 17:23
> To: [hidden email]
> Subject: Re: Odp.: solr issue with pdf forms
>
> We're still not quite there. There should be a "load term info" button on that page. Clicking that button will show you the terms in your index (as opposed to the raw stored input which is what you get when you look at results in the browser). My bet is that you'll see perfectly normal tokens in the index that will NOT have the wonky characters you see in the display.
>
> If that's the case, then you have a browser issue and Solr is working perfectly fine. On the other hand, if the individual terms are weird, then you have something more fundamental going on.
>
> Which is why I mentioned the TermsComponent. That will return indexed tokens, and allows you a bit more flexibility than the admin page in terms of what tokens you see, but it's essentially the same information.
>
> Best,
> Erick
>
> On Sun, Apr 26, 2015 at 11:18 PM,  <[hidden email]> wrote:
>> Erick,
>>
>> thanks a lot for helping me here. In my case it ist he "content" field which is displayed not correctly. So I went tot he schema browser like you pointed out. Here ist he information I found:
>> Field: content
>> Field Type: text
>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
>> Index:  Indexed, Tokenized, Stored, TermVector Stored Copied Into:
>> spell teaser Position Increment Gap:  100 Index Analyzer:
>> org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> catenateAll: 0 catenateNumbers: 1 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SynonymFilterFactory args:{synonyms:
>> german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion:
>> LUCENE_36 }
>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>> args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>> minWordSize: 5 dictionary: german/german-common-nouns.txt
>> luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory args:{words:
>> german/stopwords.txt ignoreCase: true enablePositionIncrements: true
>> luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
>> german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36
>> } org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 } Query Analyzer:
>> org.apache.solr.analysis.TokenizerChain Details Tokenizer Class:
>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> catenateAll: 0 catenateNumbers: 0 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory args:{words:
>> german/stopwords.txt ignoreCase: true enablePositionIncrements: true
>> luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
>> german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36
>> } org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> args:{luceneMatchVersion: LUCENE_36 }
>> Distinct:  160403
>>
>> Does this somehow help to figure out the issue?
>> Thanks
>> Best
>> Steve
>>
>>
>> -----Ursprüngliche Nachricht-----
>> Von: Erick Erickson [mailto:[hidden email]]
>> Gesendet: Freitag, 24. April 2015 20:15
>> An: [hidden email]
>> Betreff: Re: Odp.: solr issue with pdf forms
>>
>> Steve:
>>
>> Right, it's not exactly obvious. Bring up the admin UI, something like http://localhost:8983/solr. From there you have to select a core in the 'core selector' drop-down on the left side. If you're using SolrCloud, this will have a rather strange name, but it should be easy to identify what collection it belongs to.
>>
>> At that point you'll see a bunch of new options, among them "schema browser". From there, select your field from the drop-down that will appear, then a button should pop up "load term info".
>>
>> NOTE: you can get the same information from the TermsComponent, see:
>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
>> This is a little more flexible because you can, among other things, specify the place to start. In your case you might specify terms.prefix=mein which will show you the terms that are actually being _searched_ as opposed to being stored. This latter is what you see in the browser when you search for docs and is sometimes misleading as you're (probably) seeing.
>>
>> Best,
>> Erick
>>
>> On Fri, Apr 24, 2015 at 1:58 AM,  <[hidden email]> wrote:
>>> Hey Erick,
>>>
>>> thanks a lot for your answer. I went to the admin schema browser,
>>> but what should I see there? Sorry I'm not firm with the admin
>>> schema browser. :-(
>>>
>>> Best
>>> Steve
>>>
>>>
>>> -----Ursprüngliche Nachricht-----
>>> Von: Erick Erickson [mailto:[hidden email]]
>>> Gesendet: Donnerstag, 23. April 2015 18:00
>>> An: [hidden email]
>>> Betreff: Re: Odp.: solr issue with pdf forms
>>>
>>> When you say "they're not indexed correctly", what's your evidence?
>>> You cannot rely
>>> on the display in the browser, that's the raw input just as it was sent to Solr, _not_ the actual tokens in the index. What do you see when you go to the admin schema browser pate and load the actual tokens.
>>>
>>> Or use the TermsComponent
>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Componen
>>> t
>>> ) to see the actual terms in the index as opposed to the stored data
>>> you see in the browser when you look at search results.
>>>
>>> If the actual terms don't seem right _in the index_ we need to see your analysis chain, i.e. your fieldType definition.
>>>
>>> I'm, 90% sure you're seeing the stored data and your terms are indexed just fine, but I've certainly been wrong before, more times than I want to remember.....
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Apr 23, 2015 at 1:18 AM,  <[hidden email]> wrote:
>>>> Hey Erick,
>>>>
>>>> thanks for your answer. They are not indexed correctly. Also throught the solr admin interface I see these typical questionmarks within a rhombus where a blank space should be.
>>>> I now figured out the following (not sure if it is relevant at all):
>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>>>> indexed correctly, no issues
>>>> - PDF documents (with editable form fields) created with "Adobe
>>>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>>>
>>>> Best
>>>> Steve
>>>>
>>>> -----Ursprüngliche Nachricht-----
>>>> Von: Erick Erickson [mailto:[hidden email]]
>>>> Gesendet: Mittwoch, 22. April 2015 17:11
>>>> An: [hidden email]
>>>> Betreff: Re: Odp.: solr issue with pdf forms
>>>>
>>>> Are they not _indexed_ correctly or not being displayed correctly?
>>>> Take a look at admin UI>>schema browser>> your field and press the "load terms" button. That'll show you what is _in_ the index as opposed to what the raw data looked like.
>>>>
>>>> When you return the field in a Solr search, you get a verbatim, un-analyzed copy of your original input. My guess is that your browser isn't using the compatible character encoding for display.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Wed, Apr 22, 2015 at 7:08 AM,  <[hidden email]> wrote:
>>>>> Thanks for your answer. Maybe my English is not good enough, what are you trying to say? Sorry I didn't get the point.
>>>>> :-(
>>>>>
>>>>>
>>>>> -----Ursprüngliche Nachricht-----
>>>>> Von: LAFK [mailto:[hidden email]]
>>>>> Gesendet: Mittwoch, 22. April 2015 14:01
>>>>> An: [hidden email]; [hidden email]
>>>>> Betreff: Odp.: solr issue with pdf forms
>>>>>
>>>>> Out of my head I'd follow how are writable PDFs created and encoded.
>>>>>
>>>>> @LAFK_PL
>>>>>   Oryginalna wiadomość
>>>>> Od: [hidden email]
>>>>> Wysłano: środa, 22 kwietnia 2015 12:41
>>>>> Do: [hidden email]
>>>>> Odpowiedz: [hidden email]
>>>>> Temat: solr issue with pdf forms
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> hopefully you can help me with my issue. We are using a solr setup and have the following issue:
>>>>> - usual pdf files are indexed just fine
>>>>> - pdf files with writable form-fields look like this:
>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
>>>>> und v ollständig sind
>>>>>
>>>>> Somehow the blank space character is not indexed correctly.
>>>>>
>>>>> Is this a know issue? Does anybody have an idea?
>>>>>
>>>>> Thanks a lot
>>>>> Best
>>>>> Steve

RE: Odp.: solr issue with pdf forms

Allison, Timothy B.
In reply to this post by Erick Erickson
I completely agree with Erick about the utility of the TermsComponent for seeing what is actually being indexed. If you find problems there, and if you haven't done so already, you might also investigate further down the stack. It might make sense to run tika-app.jar (whichever version you are using in DIH or another mechanism) or even pdfbox-app.jar (ExtractText option) on your files outside of Solr to see what text or noise you get for the files that are causing problems.
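
Once the text has been extracted outside of Solr, it helps to know exactly which code points stand where the blanks should be. The following is a rough sketch, assuming the extracted text is already available as a Python string; the function name `suspicious_characters` and the choice of what counts as "suspicious" are assumptions made for illustration, not anything Tika or PDFBox provides.

```python
import unicodedata
from collections import Counter


def suspicious_characters(text):
    """Count characters that often signal extraction trouble.

    Flags U+FFFD (the replacement character), control characters other
    than tab/newline/carriage return, and space separators other than
    the plain ASCII space U+0020.
    """
    counts = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        if ch == "\ufffd":
            flagged = True
        elif category == "Cc":
            flagged = ch not in "\t\n\r"
        elif category == "Zs":
            flagged = ch != " "
        else:
            flagged = False
        if flagged:
            # Label each hit with its code point and Unicode name.
            label = "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = "Ich\ufffdbestätige\ufffdmit\ufffdmeiner\ufffdUnterschrift"
    for label, count in suspicious_characters(sample).items():
        print(count, label)
```

If the same characters show up in the text extracted outside of Solr, the problem sits in the extraction step rather than in the Solr analysis chain.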




Re: Odp.: solr issue with pdf forms

Erick Erickson
Steve:

I'd just look at one field at a time....

Presumably you have a field that's displaying poorly, "content"? Just
look at _that_ field, as

http://IP:8080/solr/core_de/terms?terms.fl=content
or
http://IP:8080/solr/core_de/terms?terms.fl=content&terms.prefix=d

Now, that should show you terms in the index for the "content" field.
If no terms are displayed, then you aren't putting anything in there,
which would probably mean that you have indexed="false" for the content
field. And if that's the case, then this thread has nothing at all to
do with indexing, it's just a display problem in your browser. I bet you
can't search meaningfully on that field either, which would be another
symptom of having indexed="false", as would not being able to get
anything from the schema browser.
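
The single-field requests above can be assembled with a small helper. This is only a sketch: the helper name `terms_url` is hypothetical, and the only request parameters it uses are the ones that appear in this thread (`terms.fl`, `terms.prefix`, `terms.limit`).

```python
from urllib.parse import urlencode


def terms_url(base, field, prefix=None, limit=None):
    """Build a /terms request URL for exactly one field."""
    params = [("terms.fl", field)]
    if prefix is not None:
        params.append(("terms.prefix", prefix))
    if limit is not None:
        params.append(("terms.limit", str(limit)))
    # urlencode escapes any characters that are not URL-safe.
    return base.rstrip("/") + "/terms?" + urlencode(params)


if __name__ == "__main__":
    base = "http://IP:8080/solr/core_de"  # placeholder host, as in the URLs above
    print(terms_url(base, "content"))
    print(terms_url(base, "content", prefix="d"))
```

Querying one field per request keeps a prefix meant for one field from filtering the others.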

So let's see your field definition before going any further.

And really go back and try to understand the difference between
indexed and stored. Once again the _stored_ data is what's displayed
in your browser. The browser settings determine how "odd" characters
are being displayed. Solr is just returning what you gave it in that
case.

_indexed_ data (i.e. indexed="true"), on the other hand, is broken down
by your analysis chain into searchable tokens, and _that's_ what is
searched when you, say, specify ?q=content:whatever.
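
The stored/indexed distinction can be illustrated with a toy model. This is a deliberately crude sketch and not Solr's analysis: it only splits on ASCII whitespace and lowercases, while the real chain posted earlier in this thread also applies WordDelimiterFilterFactory, stemming and other filters, so actual index terms may differ. The function name `toy_index_terms` is made up for this example.

```python
import re


def toy_index_terms(stored_value):
    """Very rough stand-in for whitespace tokenizing plus lowercasing."""
    tokens = re.split(r"[ \t\r\n]+", stored_value)
    return [tok.lower() for tok in tokens if tok]


if __name__ == "__main__":
    clean = "Ich bestätige mit meiner Unterschrift"
    broken = "Ich\ufffdbestätige\ufffdmit"

    # The stored value is returned verbatim; the terms are what gets searched.
    print("stored :", repr(clean))
    print("indexed:", toy_index_terms(clean))

    # With no real blank between the words, this toy model sees a single token.
    print("stored :", repr(broken))
    print("indexed:", toy_index_terms(broken))
```

In this toy model a character that is not whitespace never separates words, so whatever stands in for the blank ends up inside the term instead of between terms.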

Your original problem statement is that odd characters are being
indexed. I think that is a total red herring and this is all about
display, which has nothing really to do with Solr.

Best,
Erick


AW: Odp.: solr issue with pdf forms

Steve.Scholl
Thank you very much for the detailed information.
I have now checked the properties of the content field. As far as I can tell it is indexed, right?
Field: content
Properties: Indexed, Tokenized, Stored, TermVector Stored
Schema: Indexed, Tokenized, Stored, TermVector Stored
Index: Indexed, Tokenized, Stored, TermVector Stored

I then queried the content field again with:
http://172.29.200.17:8080/solr/core_de/terms?terms.fl=content&terms.limit=2000
I can now see a lot of content, and I also see those odd characters there.

Thanks
Best
Steve

-----Original Message-----
From: Erick Erickson [mailto:[hidden email]]
Sent: Wednesday, April 29, 2015 16:07
To: [hidden email]
Subject: Re: Odp.: solr issue with pdf forms

Steve:

I'd just look at one field at a time....

Presumably you have a field that's displaying poorly, "content"? Just look at _that_ field, as

http://IP:8080/solr/core_de/terms?terms.fl=content
or
http://IP:8080/solr/core_de/terms?terms.fl=content&terms.prefix=d

Now, that should show you terms in the index for the "content" field.
If you don't have any being displayed, then you aren't putting anything in there, which would probably mean that you have indexed="false" for the content field. And if that's the case, then this thread has nothing at all to do with indexing; it's just a display problem in your browser. I bet you can't search meaningfully on that field either, which would be another symptom of having indexed="false", as would not being able to get anything from the schema browser.
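
The TermsComponent requests above can also be scripted, which makes it easier to scan a long term list for odd characters. What follows is only a minimal sketch: the function names are made up for illustration, `wt=json` is assumed to be available on the handler, and the response is assumed to have the usual `terms` section with either a flat `[term, count, ...]` list or a `{term: count}` map per field.

```python
import json
import unicodedata
from urllib.parse import urlencode
from urllib.request import urlopen


def terms_url(base_url, field, prefix=None, limit=100):
    """Build a TermsComponent request URL for a single field."""
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    if prefix:
        params["terms.prefix"] = prefix
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def fetch_terms(url):
    """Fetch and parse the JSON response (needs a reachable Solr instance)."""
    with urlopen(url) as reply:
        return json.load(reply)


def suspicious_terms(response, field):
    """Return indexed terms containing control or replacement characters."""
    entries = response.get("terms", {}).get(field, [])
    # Accept both the flat [term, count, ...] list and a {term: count} map.
    terms = list(entries) if isinstance(entries, dict) else entries[0::2]
    return [
        term for term in terms
        if any(unicodedata.category(ch) == "Cc" or ch == "\ufffd" for ch in term)
    ]
```

Calling `suspicious_terms(fetch_terms(terms_url("http://IP:8080/solr/core_de", "content")), "content")` would list only the terms worth a closer look; an empty list suggests the indexed tokens themselves are clean.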

So let's see your field definition before going any further.

And really go back and try to understand the difference between indexed and stored. Once again the _stored_ data is what's displayed in your browser. The browser settings determine how "odd" characters are being displayed. Solr is just returning what you gave it in that case.

_indexed_ data (i.e. indexed="true"), on the other hand, is broken down by your analysis chain into searchable tokens and _that's_ what is searched when you, say, specify ?q=content:whatever.
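
To make the stored/indexed distinction concrete, here is a toy illustration. It is not Solr code: the two functions are hypothetical stand-ins, and the "analysis" only models whitespace tokenizing plus lowercasing, not a real chain with synonyms, stemming and so on.

```python
def stored_value(text):
    """A stored field comes back verbatim, exactly as it was sent in."""
    return text


def indexed_tokens(text):
    """Rough stand-in for an analysis chain: split on whitespace, lowercase."""
    return [token.lower() for token in text.split()]


raw = "Ich bestätige mit meiner Unterschrift"
print(stored_value(raw))    # what a search result displays
print(indexed_tokens(raw))  # what a query on the field is matched against
```

The point is only that the two views can differ: whatever is displayed in a result list is the first one, while searching works on the second.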

Your original problem statement is that odd characters are being indexed. I think that is a total red herring and this is all about display, which has nothing really to do with Solr.

Best,
Erick


AW: Odp.: solr issue with pdf forms

Steve.Scholl
In reply to this post by Allison, Timothy B.
Hey, thanks a lot for the hint about pdfbox-app.jar.
For testing purposes I have now extracted the text of an affected PDF form and of a regular PDF file.
The result is the following:

Usual pdf file:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
labore et d

pdf form:
Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz

Best
Steve
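
If the `^H` in the pdfbox output above is caret notation (as many terminals and editors display it), it stands for the backspace control character U+0008 sitting where a space is expected. One possible workaround, sketched here under the assumption that the extracted text can be cleaned before it is sent to Solr, is to replace such control characters with spaces. The function name is hypothetical and this is not something proposed in the thread itself.

```python
import unicodedata


def replace_control_chars(text, replacement=" "):
    """Replace control characters (except tab, newline, carriage return)."""
    keep = {"\t", "\n", "\r"}
    return "".join(
        replacement if unicodedata.category(ch) == "Cc" and ch not in keep else ch
        for ch in text
    )


sample = "Bitte\x08legen\x08Sie\x08dem\x08Antrag Kopien"
print(replace_control_chars(sample))  # Bitte legen Sie dem Antrag Kopien
```

Replacing rather than deleting the character keeps the word boundaries, so a whitespace-based tokenizer would see separate words again.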

-----Original Message-----
From: Allison, Timothy B. [mailto:[hidden email]]
Sent: Wednesday, April 29, 2015 14:16
To: [hidden email]
Cc: [hidden email]
Subject: RE: Odp.: solr issue with pdf forms

I completely agree with Erick about the utility of the TermsComponent to see what is actually being indexed.  If you find problems there and if you haven't done so already, you might also investigate further down the stack.  It might make sense to run the tika-app.jar (whichever version you are using in DIH or other mechanism?) or even the pdfbox-app.jar (ExtractText option) on your files outside of Solr to see what text/noise you're getting for the files that are causing problems.
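
For running the extraction outside of Solr as suggested, a small wrapper along these lines might help when several files need checking. This is a sketch under assumptions: the jar file names are placeholders for whatever versions are actually installed, `ExtractText` is the PDFBox 1.x/2.x style invocation (newer releases may use a different syntax), and `--text` is tika-app's plain-text output option.

```python
import subprocess


def pdfbox_extract_command(jar, pdf, out_txt):
    """Command line for PDFBox's ExtractText tool, writing to out_txt."""
    return ["java", "-jar", jar, "ExtractText", pdf, out_txt]


def tika_text_command(jar, pdf):
    """Command line for tika-app's plain-text output, written to stdout."""
    return ["java", "-jar", jar, "--text", pdf]


def run(command):
    """Run one extraction command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run(tika_text_command("tika-app.jar", "form.pdf"))` would return the text Tika extracts, which can then be compared with what ends up in the index.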


