problems getting data into solr index

problems getting data into solr index

vanderkerkof
Hello everyone

I'm running Solr 1.1 with Jetty, and I'm having problems looping through a MySQL database with Python and putting the data into the Solr index.

Here's the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369: ordinal not in range(128)

I think that means there is a UTF-8 character in the data that is outside the ASCII range.  Please let me know if I'm wrong.

So Solr can't decode the character and therefore stops committing any more data to the index.

Is there a simple way to tell Solr to accept UTF-8 characters?

I've read about this topic on your site and on others, but so far I'm more confused than when I started.


Re: problems getting data into solr index

Yonik Seeley-2
On 6/13/07, vanderkerkoff <[hidden email]> wrote:
> I'm running solr1.2 and Jetty, I'm having problems looping through a mysql
> database with python and putting the data into the solr index.
>
> Here's the error
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
> ordinal not in range(128)

There are two issues: which character encoding you tell Solr to use, via
Content-type in the HTTP headers (Solr defaults to UTF-8), and whether
what you send actually matches that encoding.

If you can get the complete message (including HTTP headers) that is
being sent to Solr, that would help people debug the problem.

One easy way is to use netcat to pretend to be solr:
1) shut down solr
2) start up netcat on solr's port
  nc -l -p 8983
3) send your update message from the client as you normally would

-Yonik
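
If netcat is not to hand, the same capture can be approximated in Python. This is only a sketch under stated assumptions: `read_http_request` and `capture_one_request` are made-up helper names, the code handles a single request that carries a Content-Length header, and it is written for Python 3 rather than the Python 2 used in this thread.

```python
import socket

REPLY = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"


def read_http_request(conn):
    """Read one HTTP request from a connected socket and return its raw bytes."""
    raw = b""
    # Read until the blank line that ends the headers.
    while b"\r\n\r\n" not in raw:
        chunk = conn.recv(4096)
        if not chunk:
            return raw
        raw += chunk
    head, sep, body = raw.partition(b"\r\n\r\n")
    # Look up Content-Length so we know how many body bytes to expect.
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    while len(body) < length:
        chunk = conn.recv(4096)
        if not chunk:
            break
        body += chunk
    # Send a minimal reply so the client does not hang waiting for one.
    conn.sendall(REPLY)
    return head + sep + body


def capture_one_request(port=8983):
    """Listen on Solr's port (Solr must be stopped) and capture one request."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind(("127.0.0.1", port))
        server.listen(1)
        conn, _addr = server.accept()
        with conn:
            return read_http_request(conn)
```

Printing the result of `capture_one_request()` shows the request as a bytes literal, so any non-ASCII bytes in the body appear as escapes such as `\xe2` and can be compared against the declared charset.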

Re: problems getting data into solr index

Ryan McKinley
In reply to this post by vanderkerkof
 >
 > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
 > 369: ordinal not in range(128)
 >

What character is at position 369?  Make sure it is valid Unicode...


>
> Is there a simple way to tell solr to accept UTF8 characters?
>

Solr can accept UTF8 characters... check the utf8-example.xml example in
exampledocs.

If you can put the character at position 369 into "utf8-example.xml" and
post it successfully (using post.sh or post.jar), then I suspect that
whatever you are using to post the XML is not encoding the stream properly.
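
One way to answer the "what character is at position 369" question is to decode a few bytes starting at that offset and look up the character's name. The sketch below assumes Python 3 and assumes the data is UTF-8; `describe_char_at` is a hypothetical helper, and the sample bytes are invented because the thread never shows the real row. For what it is worth, 0xE2 is the lead byte of the three-byte UTF-8 sequences for U+2000 to U+2FFF, which is where the curly quotes live.

```python
import unicodedata


def describe_char_at(data, position, encoding="utf-8"):
    """Decode the character starting at a byte offset and name it."""
    # A UTF-8 character is at most 4 bytes long, so a 4-byte window is enough.
    window = data[position:position + 4]
    text = window.decode(encoding, errors="replace")
    char = text[0]
    return "U+%04X %s" % (ord(char), unicodedata.name(char, "<unnamed>"))


# Invented sample: 369 ASCII bytes, then a UTF-8 encoded curly apostrophe.
sample = b"a" * 369 + "\u2019".encode("utf-8") + b"s"
print(describe_char_at(sample, 369))
```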

Re: problems getting data into solr index

Chris Hostetter-3
In reply to this post by vanderkerkof

: I'm running solr1.2 and Jetty, I'm having problems looping through a mysql
: database with python and putting the data into the solr index.
:
: Here's the error
:
: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
: ordinal not in range(128)

I may be missing something here, but I don't think that error is coming
from Solr ... "UnicodeDecodeError" appears to be a Python error message,
so I suspect the problem is between MySQL and your Python script ... I bet
if you change your script to comment out the lines where you talk to Solr,
and just read the data from MySQL and throw it to /dev/null, you'd still
see that message.

http://wiki.wxpython.org/UnicodeDecodeError


-Hoss
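
The error can be reproduced with no Solr involved at all, which supports the point above. This is a sketch in Python 3, where the ASCII decode has to be spelled out; in Python 2 the same decode happened implicitly whenever a bytestring was mixed with unicode. The `row` value is invented for illustration.

```python
def decode_as_ascii(raw):
    """Mimic Python 2's implicit str -> unicode conversion, which uses ASCII."""
    return raw.decode("ascii")


# Invented row: 369 ASCII bytes followed by a UTF-8 encoded curly apostrophe.
row = b"x" * 369 + "\u2019".encode("utf-8")

try:
    decode_as_ascii(row)
    message = None
except UnicodeDecodeError as exc:
    # The exception text names the codec, the offending byte and its offset.
    message = str(exc)

print(message)
```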


Re: problems getting data into solr index

Brian Whitman
In reply to this post by vanderkerkof

On Jun 13, 2007, at 11:37 AM, vanderkerkoff wrote:

>
> Hello everyone
>
> I'm running solr1.1 and Jetty, I'm having problems looping through a mysql
> database with python and putting the data into the solr index.
>
> Here's the error
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 369:
> ordinal not in range(128)
>

Post the line of code this is breaking on. Are you pulling the data
from MySQL as UTF-8? Are you setting the encoding of MySQLdb?

Solr has no problems with proper UTF-8, and you don't need to do
anything special to get it to work. Check out the newer solr.py in JIRA.
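
For the MySQLdb side, the connection can be asked for UTF-8 and for decoded text explicitly. A minimal sketch: `charset` and `use_unicode` are keyword arguments of `MySQLdb.connect`, while the helper names and the host, user, password and database values are placeholders, not anything from this thread.

```python
def mysql_connect_kwargs(host, user, password, database):
    """Connection settings that ask MySQLdb for UTF-8 and decoded text."""
    return {
        "host": host,
        "user": user,
        "passwd": password,
        "db": database,
        "charset": "utf8",      # character set used on the connection
        "use_unicode": True,    # return text columns as unicode, not raw bytes
    }


def connect(**kwargs):
    # Imported lazily so the settings above can be inspected without a server.
    import MySQLdb  # third-party package (MySQL-python / mysqlclient)
    return MySQLdb.connect(**kwargs)
```

With these settings the rows come back as unicode objects, so nothing downstream has to guess which encoding the bytes were in.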



Re: problems getting data into solr index

vanderkerkof
In reply to this post by Chris Hostetter-3
Hello Hoss

Thanks for replying. I tried what you suggested as the initial step of my troubleshooting, and it outputs the data fine.

It was what I suspected initially as well, but thanks for the advice.



Re: problems getting data into solr index

vanderkerkof
In reply to this post by Yonik Seeley-2
Hi Yonik

Here's the output from netcat

POST /solr/update HTTP/1.1
Host: localhost:8983
Accept-Encoding: identity
Content-Length: 83
Content-Type: text/xml; charset=utf-8

That looks OK to me, but I am a bit twp you see.

:-)
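
The capture above shows only the headers, so it cannot say whether the body matches them. One thing worth keeping in mind when comparing the two is that Content-Length counts bytes, not characters. A small sketch in Python 3 with an invented body:

```python
def content_length(text, encoding="utf-8"):
    """Number of bytes the body occupies once encoded for the wire."""
    return len(text.encode(encoding))


# Invented body containing one curly apostrophe (3 bytes in UTF-8).
body = "<add><doc><field name=\"title\">It\u2019s here</field></doc></add>"
chars = len(body)
octets = content_length(body)
print(chars, octets)
```

If the header value equals the character count rather than the byte count, the client is probably measuring the string before encoding it.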

Re: problems getting data into solr index

James liu-2
is it ok?


--
regards
jl

Re: problems getting data into solr index

vanderkerkof
In reply to this post by Brian Whitman
Hi Brian

I've now set the MySQL database to default to charset utf8, and everything else is utf8 as well (collation, etc.).

I think I know what the problem is, and it's a really old one and I feel foolish now for not realising it earlier.

Our content people are copying and pasting sh*t from word into the content.

:-)

Now that the database is utf8, I'd like to write something to change the crap from Word into a readable value before it gets into the database.  I'm using Python, so I suppose this is more of a Python question than a Solr one.

Anyone got any tips anyway?



Re: problems getting data into solr index

Mike Klaas
On 14-Jun-07, at 4:30 AM, vanderkerkoff wrote:

>
> Hi Brian
>
> I've now set the mysqldb to be default charset utf8, and everything else is
> utf8.  collation etc etc.
>
> I think I know what the problem is, and it's a really old one and I feel
> foolish now for not realising it earlier.
>
> Our content people are copying and pasting sh*t from word into the content.
>
> :-)
>
> Now that the database is utf8, I'd like to write something to change the
> crap from word into a readable value before it gets into the database.
> Using python, so I suppose this is more of a python question than a solr
> one.
>
> Anyone got any tips anyway?

I've dealt with tons of issues with Python and unicode, but I need
more information before proceeding with tips.

Specifically, what is the format of the "shit" being copied and
pasted into your app, and what Python datatype is handling it?  I
suspect it is encoded somehow, which could be problematic.  Is it
going through a web browser?  How is it getting into MySQL?

-Mike



Re: problems getting data into solr index

vanderkerkof
Hi Mike
The characters that are giving us problems are the old favourites: apostrophes and quotes pasted from Microsoft Word into a Django web site. I'm not using Django's newforms yet, but still using the old forms.

Any help, tips, or pointers to sites I should read would be gratefully received, Mike.

I'm coming round to the idea that I might have to strip these odd characters out with Python before they get sent into the database. That would be the most sensible option, I think.
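
Rather than stripping the Word characters outright, one option is to swap them for plain equivalents so the text stays readable. The sketch below assumes Python 3 and assumes the input is already text (unicode), not bytes; `WORD_CHARS` and `plain_punctuation` are made-up names, and the table covers only a handful of common characters.

```python
# Typographic characters Word commonly substitutes, mapped to ASCII stand-ins.
WORD_CHARS = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark / apostrophe
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2026: "...",  # horizontal ellipsis
}


def plain_punctuation(text):
    """Replace Word-style punctuation with plain ASCII equivalents."""
    return text.translate(WORD_CHARS)


print(plain_punctuation("\u201cIt\u2019s fine\u201d"))
```

Anything not in the table passes through untouched, so this is a clean-up step, not a guarantee that the result is pure ASCII.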



Re: problems getting data into solr index

Mike Klaas
Hi,

To diagnose this properly, you're going to have to figure out if  
you're dealing with encoded bytes or unicode, and what django does.  
See http://www.joelonsoftware.com/articles/Unicode.html.

As a short-term solution, you can force things to ASCII (silently dropping every non-ASCII character) using:

str(s.decode('ascii', 'ignore')) # assuming s is a bytestring
u.encode('ascii', 'ignore') # assuming u is a unicode string

-Mike
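
For readers on Python 3, where bytes and text are separate types, the two lines above might look like the sketch below. The function names are invented for illustration; the `'ignore'` error handler is the same one used in the original lines.

```python
def ascii_from_bytes(s):
    """Decode a bytestring, silently dropping every non-ASCII byte."""
    return s.decode("ascii", "ignore")


def ascii_from_text(u):
    """Encode a text string, silently dropping every non-ASCII character."""
    return u.encode("ascii", "ignore")


print(ascii_from_bytes(b"it\xe2\x80\x99s"))   # the three apostrophe bytes vanish
print(ascii_from_text("it\u2019s"))           # the apostrophe character vanishes
```

Note that the characters are removed, not replaced, so "it's" typed in Word comes out as "its".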


Re: problems getting data into solr index

vanderkerkof
Cheers Mike, I've read the page, and it's starting to get into my brain now.

Django was giving me a unicode string, so I did some encoding and decoding, and now the data is getting into Solr. It's simply not passing the characters that are causing problems, which is great.

I'm going to follow the same sort of principle in my python code when I'm adding the items, so I can keep my solr index up to date as and when things are entered.

Here's the code I'm using to enter the data.

http://pastie.textmate.org/71367

Two little things:

1.  I'm getting an error when it's trying to optimise the index

AttributeError: SolrConnection instance has no attribute 'optimise'

You don't know what that is about, do you?

I'm still on solr1.1 as we were having trouble getting this sort of interaction to work with 1.2, not sure if it's related.

2.  I've used your suggestions to force the output into ASCII, but if I try to force it into UTF-8, which I thought Solr would accept, it fails.  I'm not sure why though.

 



Re: problems getting data into solr index

vanderkerkof
I think I've resolved this.

I've edited that solr.py file to use optimize=True on commit, and moved the commit outside of the loop.

http://pastie.textmate.org/71392

The data is going in and it's optimizing once, but it's showing as commit = 0 in the stats page of my Solr.

There are no errors that I can see, and the data is definitely in the index, as I can now search for it.
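
The "commit once, outside the loop" shape can be sketched as follows. The pastie code is not reproduced here, so this uses `FakeConnection`, a hypothetical stand-in that only counts calls; its `add` method is an assumption about what a connection object offers, and only `commit(optimize=True)` is taken from this thread.

```python
class FakeConnection:
    """Hypothetical stand-in for a solr.py connection; it only counts calls."""

    def __init__(self):
        self.added = []
        self.commits = 0
        self.optimized = False

    def add(self, **fields):
        self.added.append(fields)

    def commit(self, optimize=False):
        self.commits += 1
        self.optimized = self.optimized or optimize


def index_rows(conn, rows):
    """Add every row, then commit (and optimize) once at the end."""
    for row in rows:
        conn.add(**row)
    conn.commit(optimize=True)
    return len(rows)


conn = FakeConnection()
index_rows(conn, [{"id": "1"}, {"id": "2"}])
print(conn.commits)
```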



Re: problems getting data into solr index

Mike Klaas
In reply to this post by vanderkerkof
On 18-Jun-07, at 6:27 AM, vanderkerkoff wrote:

>
> Cheers Mike, read the page, it's starting to get into my brain now.
>
> Django was giving me unicode string, so I did some encoding and decoding and
> now the data is getting into solr, and it's simply not passing the
> characters that are causing problems, which is great.

Glad to hear that it is working.

> 2 little things, I'm getting an error when it's trying to optimise the index
>
> AttributeError: SolrConnection instance has no attribute 'optimise'
>
> You don't know what that is about do you?

Er, it means that SolrConnection has no optimise method.  Instead, do:

conn.commit(optimize=True)

> I'm still on solr1.1 as we were having trouble getting this sort of
> interaction to work with 1.2, not sure if it's related.
>
> 2.  I've used your suggestions to force the output into ascii, but if I try
> to force it into utf8, which I thought solr would accept, it fails.  I'm not
> sure why though.

Perhaps this is why: solr.py expects unicode.  You can pass it an ASCII
bytestring, and it will transparently convert it to unicode because ASCII is
Python's default codec.  If you pass it a UTF-8 bytestring, it will try to
convert it to unicode using the ASCII codec and fail.

So, you could skip the .encode('ascii', 'ignore') line completely.
Of course, you'd then have the characters in the text.  I'm not quite sure
what you're after, since leaving the text in UTF-8 would keep the funny
characters that you wanted to strip.

-Mike
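
The failure described above can be shown in a few lines. This is a Python 3 sketch, so the ASCII decode that Python 2 performed implicitly is written out explicitly; `to_text` is a hypothetical helper, not part of solr.py.

```python
def to_text(value, encoding="utf-8"):
    """Return text (unicode), decoding bytes with an explicit encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


utf8_bytes = "it\u2019s".encode("utf-8")

# Decoding with the ASCII codec is what Python 2 did implicitly, and it fails.
try:
    utf8_bytes.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# Decoding with the encoding the bytes really use succeeds.
text = to_text(utf8_bytes)
print(implicit_ok, text == "it\u2019s")
```

The general habit this illustrates is to decode bytes to unicode once, at the edge of the program, and hand only unicode to libraries that expect it.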

Re: problems getting data into solr index

vanderkerkof
Hello Mike, Brian

My brain is approaching saturation point, and I'm reading these two opinions as opposing each other.

I'm sure I'm reading them incorrectly, but they seem to contradict each other.

Do they?

Brian Whitman wrote
Solr has no problems with proper utf8 and you don't need to do  
anything special to get it to work. Check out the newer solr.py in JIRA.
Mike Klaas wrote
Perhaps this is why: solr.py expects unicode.  You can pass it ascii,  
and it will transparently convert to unicode fine because that is the  
default codec.  If you end up with utf-8, it will try to convert to  
unicode using the ascii codec and fail.

Re: problems getting data into solr index

Brian Whitman
Mike is talking about solr.py, the Python script; I'm talking about
Solr itself.
I think your problem is in the former. You should play around with
unicode in Python for a while. Remember that your terminal itself
probably doesn't support UTF-8; the biggest problem I run into is doing

 > print utf8string

Python forces you to be good about this stuff, but it's a steep
climb. Google for "python unicode" and read the various tutorials to
get a handle on it.

-b
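
When the terminal cannot display the characters, printing an escaped form sidesteps the problem while debugging. A minimal Python 3 sketch; `printable` is a made-up helper name, and `backslashreplace` is a standard codec error handler.

```python
def printable(text):
    """Return an ASCII-only rendering that any terminal can display."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


print(printable("it\u2019s"))
```

This changes only what is shown on screen; the original string is untouched and can still be sent on as-is.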



--
http://variogr.am/
[hidden email]




Re: problems getting data into solr index

Mike Klaas
In reply to this post by vanderkerkof


On 20-Jun-07, at 6:38 AM, vanderkerkoff wrote:

>
> Hello Mike, Brian
>
> My brain is approaching saturation point and I'm reading these two opinions
> as opposing each other.
>
> I'm sure I'm reading it incorrectly, but they seem to contradict each other.
>
> Are they?

solr.py takes unicode and encodes it as utf-8 to send to Solr.

-Mike
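
The "unicode in, UTF-8 out" step can be sketched with the standard library alone. This is not solr.py's code, just an illustration in Python 3: `build_add_request` is a hypothetical helper, the URL and Content-Type are the ones from the netcat capture earlier in the thread, and the request is built but never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_request(fields, url="http://localhost:8983/solr/update"):
    """Build (but do not send) an HTTP POST carrying one add/doc message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # text (unicode) in, never pre-encoded bytes
    # Serialising with encoding="utf-8" yields bytes ready for the wire.
    body = ET.tostring(add, encoding="utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


req = build_add_request({"id": "1", "title": "it\u2019s"})
print(req.data)
```

Encoding happens exactly once, at the point where the XML is serialised, which is why passing already-encoded UTF-8 into such a layer tends to go wrong.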

Re: problems getting data into solr index

vanderkerkof
Hi Mike, Brian

Thanks for helping with this, and for clearing up my misunderstanding.  solr.py the Python module and Solr the server are two different things; I've got you.

The issues I have are compounded by the fact that we're hovering between using the Unicode branch of Django and the older branch that has newforms, both of which have an impact on what I'm trying to do.

It's getting closer to being resolved, and it's down to your advice, so thanks again.