Is there a way to force content extraction with a given encoding

lala
I am using the /update/extract request handler to push documents into Solr,
but some text documents encoded as windows-1255 (Arabic texts) are not
extracted properly: the extracted text is unreadable.

I searched the web and the Solr documentation and found nothing. If possible,
I need to send the file encoding as a parameter so that the Tika parser knows
which encoding to use.

Note that I am using Solr 7.5.

Is there a way to achieve that?

Re: Is there a way to force content extraction with a given encoding

Jörn Franke
I would convert them to UTF-8 before posting, and use UTF-8 throughout your application. Most of the web and most applications use UTF-8. If you use other encodings, you will keep running into problems.
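
A minimal sketch of that conversion step in Python, assuming the source files really are windows-1255 as stated in the question. The function name `convert_to_utf8` and the file names in the usage note are hypothetical, chosen only for illustration:

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="windows-1255"):
    """Re-encode a text file from source_encoding to UTF-8.

    Assumption: the file is entirely in source_encoding. Decoding is
    strict, so bytes that are invalid in that encoding raise
    UnicodeDecodeError instead of silently producing garbage.
    """
    # Work on raw bytes so no newline translation happens on the way.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For example, `convert_to_utf8("doc.txt", "doc.utf8.txt")` writes a UTF-8 copy, which can then be posted to /update/extract the same way as before.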

> On 08.11.2019 at 07:47, lala <[hidden email]> wrote:
>
> I am using the /update/extract request handler to push documents into Solr,
> but some text documents encoded as windows-1255 (Arabic texts) are not
> extracted properly: the extracted text is unreadable.
>
> I searched the web and the Solr documentation and found nothing. If possible,
> I need to send the file encoding as a parameter so that the Tika parser knows
> which encoding to use.
>
> Is there a way to achieve that?