[jira] [Commented] (TIKA-245) Support of CHM Format

Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888778#comment-13888778 ]

Prashanth Ramaswamy commented on TIKA-245:

Hi, I still get an ArrayIndexOutOfBoundsException when trying to parse CHM files.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: -1
        at java.util.ArrayList.elementData(ArrayList.java:382)
        at java.util.ArrayList.get(ArrayList.java:395)
        at org.apache.tika.parser.chm.core.ChmExtractor.<init>(ChmExtractor.java:178)

An older comment said this was fixed. Is that so, or is the bug still there?
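
For reference, an index of -1 passed to ArrayList.get usually comes from a failed lookup (for example List.indexOf returning -1) that is used without a check. I have not confirmed that this is what happens at ChmExtractor.java:178, so the following is only a sketch of the general pattern, using plain JDK classes; the class name, the findEntry helper, and the entry names are hypothetical and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class NegativeIndexDemo {

    // Hypothetical helper, not Tika code: looks up an entry by name and
    // returns an empty Optional instead of calling get(-1) when it is absent.
    public static Optional<String> findEntry(List<String> entries, String name) {
        int index = entries.indexOf(name); // -1 when the name is not in the list
        if (index < 0) {
            return Optional.empty();
        }
        return Optional.of(entries.get(index));
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        entries.add("/index.html");

        // Unguarded lookup: indexOf returns -1, so get(-1) throws.
        // The exact exception class and message depend on the JVM, but it is
        // always a subclass of IndexOutOfBoundsException.
        try {
            entries.get(entries.indexOf("/missing.html"));
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e);
        }

        // Guarded lookup: the caller can decide how to report a missing entry.
        System.out.println("guarded lookup found entry: "
                + findEntry(entries, "/missing.html").isPresent());
    }
}
```

If the cause in the parser is similar, a guard like this would let it report an unsupported or malformed CHM file instead of failing with an index error.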

> Support of CHM Format
> ---------------------
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.10
>         Attachments: TIKA-245.oleg.20110806.PATCH, TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
> It might be a good idea to support the CHM file format of Windows. Some information is available at http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. The CHM format contains HTML files, which can be parsed by Tika. So the "only" problem is to extract the data from the CHM file.

This message was sent by Atlassian JIRA