[jira] Assigned: (LUCENE-589) Demo HTML parser doesn't work for international documents

Soren Daugaard (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-589:
----------------------------------

    Assignee: Robert Muir

> Demo HTML parser doesn't work for international documents
> ---------------------------------------------------------
>
>                 Key: LUCENE-589
>                 URL: https://issues.apache.org/jira/browse/LUCENE-589
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Examples
>    Affects Versions: 2.0.0
>            Reporter: Curtis d'Entremont
>            Assignee: Robert Muir
>            Priority: Minor
>
> JavaCC assumes ASCII input, so the parser won't work with, say, Japanese documents. Ideally it would read the charset from the HTML markup, but that can be tricky. For now, assuming Unicode would do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }
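
For context, here is a minimal sketch of the kind of input the option is meant to cover. It is not part of the Lucene demo: the class name CharsetDecodeSketch and its helper methods are hypothetical, and the UTF-8 charset is an assumption for illustration only. It decodes HTML bytes through a Reader with an explicit charset and checks whether the resulting characters fall outside the ASCII range, which is the case a grammar built without UNICODE_INPUT does not expect.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene or JavaCC.
public class CharsetDecodeSketch {

    // Decode raw bytes through a Reader using an explicitly chosen charset.
    public static String decode(byte[] raw, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Report whether any char lies outside the 7-bit ASCII range.
    public static boolean hasNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A Japanese title, written with Unicode escapes so the source stays ASCII.
        String html = "<title>\u65e5\u672c\u8a9e</title>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8); // assumed encoding

        String decoded = decode(raw, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + decoded.equals(html));
        System.out.println("contains non-ASCII chars: " + hasNonAscii(decoded));
    }
}
```

Decoding the bytes correctly is a separate step from the grammar option: the Reader decides which characters are produced, while UNICODE_INPUT governs whether the generated token manager is prepared to receive characters beyond the ASCII range.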

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

