OT: Parsing Russian text from RTF

adb
Not directly Lucene related, but I'm out of ideas and I'm not a Russian speaker...

I'm extracting text from RTF to pump into Lucene.  I'm using the original
RTFEditorKit() code shown in LIA, p252 (actually, it's Nutch's RTFParser).

I have an RTF document, which starts with

---
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset204{\*\fname
Times New Roman;}Times New Roman CYR;}{\f1\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green0\blue128;\red0\green0\blue0;}
\viewkind4\uc1\pard\tx360\cf1\f0\fs20\'c1\'ee\'eb\'fc\'f8\'e8\'ed\'f1\'f2\'e2\'ee
---

which should be 'Большинство', but the RTFReader translationTable always maps
the RTF bytes to chars using latin1, and the correct translationTable is never
set.  The "fcharset204" is Russian, apparently CP1251, but there's a lovely
line in the RTFReader class

/* TODO: per-font font encodings ( \fcharset control word ) ? */
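
As a sanity check on the CP1251 reading, here is a minimal sketch that pulls
the \'xx escapes out of the fragment above and decodes the bytes with a named
charset.  The class and method names (RtfHexEscapes, decode) are made up for
illustration; this is not part of RTFEditorKit or Nutch's RTFParser, and it
ignores everything in the RTF except the hex escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Swing or Nutch.
public class RtfHexEscapes {
    // Matches an RTF hex escape such as \'c1 and captures the two hex digits.
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    // Collects the byte value of every \'xx escape in the fragment and decodes
    // the resulting byte sequence with the given charset. All other RTF
    // content (control words, groups, plain text) is ignored.
    public static String decode(String rtfFragment, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(rtfFragment);
        while (m.find()) {
            bytes.write(Integer.parseInt(m.group(1), 16));
        }
        return new String(bytes.toByteArray(), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The escaped text from the sample document above.
        String fragment = "\\'c1\\'ee\\'eb\\'fc\\'f8\\'e8\\'ed\\'f1\\'f2\\'e2\\'ee";
        // Decoded as CP1251, the charset that \fcharset204 points to.
        System.out.println(decode(fragment, "windows-1251"));
        // Decoded as latin1, which is what the parser appears to be doing.
        System.out.println(decode(fragment, "ISO-8859-1"));
    }
}
```

With windows-1251 the bytes should decode to 'Большинство'; with ISO-8859-1
the same bytes come out as 'Áîëüøèíñòâî', which matches the kind of garbage I
get from the parser.  That suggests the bytes in the RTF are fine and only the
charset choice is wrong.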

Does anyone know if the RTF above is correct?  The only place the translation
table is set during the parse is when the 'ansi' keyword is encountered.

Other than that, anyone have any ideas about getting the text out of the RTF
properly?

Thanks
Antony

