Bug in Content+TextParser?

I am using the Nutch 0.9 parsing framework on its own. I create a Content with
the contentType text/plain; charset="windows-1251". However, Content does not
preserve the charset part of the content type, so when TextParser calls

String encoding = StringUtil.parseCharacterEncoding(content.getContentType());

it always gets null because the contentType no longer contains the charset string.
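
To make the symptom concrete, here is a minimal standalone sketch of charset extraction from a content-type string. It is not the Nutch implementation of StringUtil.parseCharacterEncoding; the class CharsetParam and its parse method are hypothetical names used only for illustration. It shows that a charset can only be recovered while the parameter is still part of the string.

```java
import java.util.Locale;

/** Hypothetical stand-in, not Nutch code: reads the charset parameter of a content-type value. */
final class CharsetParam {
    private CharsetParam() {}

    /** Returns the charset parameter value, or null if the string carries none. */
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type and are separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes, as in charset="windows-1251".
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }
}
```

With the full string text/plain; charset="windows-1251" this sketch returns windows-1251; with the stripped string text/plain it returns null, which matches the behaviour described above.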

I see from the trunk that all this has changed quite a lot and I read about the
changes, but I'm not sure if I'm doing something wrong or if it ever worked.

Can anyone confirm whether this is a known problem, and whether there is a
simple known solution? I could simply store the full contentType and add a new
method to get it, which TextParser would then use, but is there a more elegant
solution?
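
For reference, the workaround described above could look roughly like the sketch below. TypedContent is a hypothetical holder class, not Nutch's Content, and getFullContentType and decode are assumed names. The idea is only that the original content-type string is kept next to the bare media type, so a text parser can still find the charset.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder, not Nutch's Content: keeps the content-type string as supplied. */
final class TypedContent {
    // Matches a charset parameter, quoted or unquoted, case-insensitively.
    private static final Pattern CHARSET =
        Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\";\\s]+)\"?");

    private final byte[] bytes;
    private final String fullContentType;

    TypedContent(byte[] bytes, String fullContentType) {
        this.bytes = bytes.clone();
        this.fullContentType = fullContentType;
    }

    /** Bare media type with parameters stripped, e.g. "text/plain". */
    String getContentType() {
        int semi = fullContentType.indexOf(';');
        return (semi < 0 ? fullContentType : fullContentType.substring(0, semi)).trim();
    }

    /** The proposed extra accessor: the content type exactly as supplied. */
    String getFullContentType() {
        return fullContentType;
    }

    /** Decodes the bytes with the declared charset, or with the fallback if none is usable. */
    String decode(Charset fallback) {
        Charset cs = fallback;
        Matcher m = CHARSET.matcher(fullContentType);
        if (m.find()) {
            try {
                cs = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names; keep the fallback.
                cs = fallback;
            }
        }
        return new String(bytes, cs);
    }
}
```

A parser built on this would call decode with its configured default encoding as the fallback. Whether an extra accessor like this or some other mechanism is the better fit for Nutch itself is exactly the open question here.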