encoding problem when retrieving document field value

encoding problem when retrieving document field value

G.Long
Hi :)

My index (Lucene 3.5) contains a field called title. Its value is
indexed (analyzed and stored) with the WhitespaceAnalyzer and can
contain HTML entities such as &#146; or &#176;

My problem is that when I retrieve values from this field, some of the
HTML entities are missing.
For example:

Luke tells me that the stored value is: "l&#146;application n&#176;
90-1258" and when I retrieve the field value in my application, I get
"l’application n° 90-1258".

The apostrophe is not in the returned value, whereas the ° character is
present.

What could be the problem?

Thanks,

Gary



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: encoding problem when retrieving document field value

Uwe Schindler
Hi G. Long,

Most likely, the problem is in your application. Lucene does not change the value stored in the index. For stored fields, Lucene does not deal with entities; the value is just binary data to Lucene. From your application's perspective, it is String in -> String out. Maybe you strip the entities when you output the data to the user?

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


Re: encoding problem when retrieving document field value

G.Long
Hi :)

I've got this result directly from tncTitle in the following code:

field = doc.getFieldable(IndexConstants.FIELD_TNC_TITLE);
if (field != null) {
    tncTitle = field.stringValue();
}

PS: in my previous email, copying and pasting the apostrophe's HTML
numeric entity made it display correctly, although it does not when I
debug my code. What I get from field.stringValue() is:
"lapplication n° 90-1258"

Gary


Re: encoding problem when retrieving document field value

Jack Krupansky-2
In reply to this post by G.Long
What is the hex value for that second character returned that appears to
display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use
2", so who knows what it might display as. All that is important is the
binary/hex value.
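
One way to find that hex value is to dump the code points of the string
returned by field.stringValue(). The following is only a sketch: the class
name CodePointDump and the sample title are hypothetical stand-ins, not
code from the application in question.

```java
final class CodePointDump {

    /** Renders each code point of s as U+XXXX, separated by single spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 1 or 2 chars so surrogate pairs are handled correctly.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the value returned by field.stringValue().
        String title = "l\u0092application n\u00B0 90-1258";
        // Dump only the first three characters to keep the output short.
        System.out.println(dump(title.substring(0, 3))); // prints U+006C U+0092 U+0061
    }
}
```

An invisible character such as U+0092 shows up clearly in this form, even
when a debugger or console renders nothing for it.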

Out of curiosity, how did your application come about picking a PU Unicode
character?

-- Jack Krupansky


Re: encoding problem when retrieving document field value

Trejkaz
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky <[hidden email]> wrote:
> What is the hex value for that second character returned that appears to
> display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use
> 2", so who knows what it might display as.

Well, if they're dealing with HTML, then it will display as a right
single quotation mark, because that's what HTML5 now specifies that
you're supposed to do with it.
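
To illustrate that rule, here is a rough Java sketch of how a numeric
character reference in the 0x80-0x9F range ends up as its windows-1252
interpretation. NumericCharRef and resolve are hypothetical names, and
leaning on the JDK's windows-1252 decoder is a shortcut standing in for
the HTML5 lookup table, assuming the runtime ships that charset.

```java
import java.nio.charset.Charset;

final class NumericCharRef {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Returns the code point that a numeric character reference such as
     * &#146; resolves to, remapping 0x80-0x9F through windows-1252.
     */
    static int resolve(int value) {
        if (value >= 0x80 && value <= 0x9F) {
            String decoded = new String(new byte[] { (byte) value }, CP1252);
            int cp = decoded.codePointAt(0);
            // Bytes with no windows-1252 assignment decode to U+FFFD here;
            // for those, keep the original value unchanged.
            return cp == 0xFFFD ? value : cp;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(String.format("U+%04X", resolve(146))); // prints U+2019
        System.out.println(String.format("U+%04X", resolve(176))); // prints U+00B0
    }
}
```

This matches what was observed in the thread: &#146; renders as a right
single quotation mark in a browser, while &#176; is outside the remapped
range and stays the degree sign.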

TX


Re: encoding problem when retrieving document field value

G.Long
In reply to this post by Jack Krupansky-2
Hi :)

I found the source of the problem. It is indeed the input string. It
comes from a CSV export from a relational database. The InputStream of
this CSV file was decoded with the wrong charset (ISO-8859-1 instead of
CP1252), so the right single quote was returned as the character
corresponding to hex 92 and was indexed as is in Lucene.
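
For anyone hitting the same thing, here is a small sketch of the effect
in Java. It assumes the export contains the byte 0x92 for the apostrophe;
the class name CsvCharsetDemo, its helpers, and the sample bytes are
made up for illustration and are not the actual import code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class CsvCharsetDemo {

    static final Charset LATIN1 = Charset.forName("ISO-8859-1");
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Reads the first line of a stream, decoding its bytes with the given charset. */
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Repairs text that was decoded as ISO-8859-1 but was really CP1252.
     * Only valid if every char in the input came from that wrong decoding.
     */
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(LATIN1), CP1252);
    }

    public static void main(String[] args) {
        // Hypothetical CSV bytes: 'l', 0x92 (right single quote in CP1252), 'a'.
        byte[] csv = { 'l', (byte) 0x92, 'a' };

        String wrong = readFirstLine(new ByteArrayInputStream(csv), LATIN1);
        String right = readFirstLine(new ByteArrayInputStream(csv), CP1252);

        System.out.println(Integer.toHexString(wrong.charAt(1)));         // prints 92
        System.out.println(Integer.toHexString(right.charAt(1)));         // prints 2019
        System.out.println(Integer.toHexString(repair(wrong).charAt(1))); // prints 2019
    }
}
```

Passing the charset explicitly to the InputStreamReader is the real fix;
repair() is only a fallback for strings that were already decoded the
wrong way, and the index still has to be rebuilt from the corrected text.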

The problem was outside the scope of Lucene, as Uwe Schindler said :)

Thanks for your help :)

Gary
