[jira] Created: (TIKA-251) package parser ignoring tika-config.xml

JIRA jira@apache.org
package parser ignoring tika-config.xml
----------------------------------------

                 Key: TIKA-251
                 URL: https://issues.apache.org/jira/browse/TIKA-251
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Jonathan Koren


I created my own ContentHandler and XmlParser, which echo out the DOM tree of the XML file being parsed.  I modified tika-config.xml so that AutoDetectParser calls this parser for XML files:

       <parser name="parse-xml" class="XmlParser">
               <mime>application/xml</mime>
       </parser>

If Tika parses an XML file directly, the right thing is done:

        resourceName: 1001281.xml
        ComplexIndexerTaskThread()
        XmlParser Begins
        SCH: start document
        SCH: start element nitf
        SCH: a: change.date=June 10, 2005
        SCH: a: change.time=19:30
        SCH: a: version=-//IPTC//DTD NITF 3.3//EN
        SCH: start element head
        SCH: start element title
        Apprentices Sample Life Of Doctors In Villages
        SCH: end element title
        SCH: start element meta
        SCH: a: content=Y11DOC$01
        SCH: a: name=slug

and so on for the fragment:

        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
        <nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
        <head>
        <title>Apprentices Sample Life Of Doctors In Villages</title>
        <meta content="Y11DOC$01" name="slug"/>
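
A handler that produces a trace in the style shown above can be sketched with the JDK's own SAX classes. This is only an illustration, not the reporter's code: the class names EchoHandlerDemo and EchoHandler are hypothetical, the "SCH:" prefix is copied from the output above, and the DOCTYPE line is left out of the sample input so that the parser does not try to fetch the external DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the reporter's echoing handler; this is not Tika code.
class EchoHandlerDemo {

    /** Records each SAX event as one line, imitating the "SCH:" trace. */
    static class EchoHandler extends DefaultHandler {
        final List<String> lines = new ArrayList<>();

        @Override
        public void startDocument() {
            lines.add("SCH: start document");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            lines.add("SCH: start element " + qName);
            // One line per attribute, in the order the parser reports them.
            for (int i = 0; i < atts.getLength(); i++) {
                lines.add("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Skip whitespace-only chunks; echo everything else verbatim.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            lines.add("SCH: end element " + qName);
        }
    }

    /** Parses the given XML string and returns the recorded trace lines. */
    static List<String> trace(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nitf change.date=\"June 10, 2005\" change.time=\"19:30\">"
                + "<head><title>Apprentices Sample Life Of Doctors In Villages</title>"
                + "<meta content=\"Y11DOC$01\" name=\"slug\"/></head></nitf>";
        trace(xml).forEach(System.out::println);
    }
}
```

When a handler like this receives the events of the XML document itself, element names such as nitf and meta appear in the trace. When it instead receives the XHTML that a package parser emits, only html, div, h1 and p elements appear, which matches the second trace in this report.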


Now, if I put this XML file inside a gzipped tar file, my XmlParser isn't called.  Instead, the file is somehow converted to plain text, which is not correct.  Example output:

        fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
        resourceName: aaa.tar.gz
        ComplexIndexerTaskThread()
        SCH: start document
        SCH: start element html
        SCH: start element head
        SCH: start element title

        SCH: end element title

        SCH: end element head
        SCH: start element body
        SCH: start element div
        SCH: a: class=package-entry
        SCH: subfile 1 detected!
        SCH: start element h1
        aaa.tar
        SCH: subfile 1's name is aaa.tar

        SCH: end element h1
        SCH: start element div
        SCH: a: class=package-entry
        SCH: subfile 2 detected!
        SCH: start element h1
        1001281.xml
        SCH: subfile 2's name is 1001281.xml

        SCH: end element h1
        SCH: start element p


   Apprentices Sample Life Of Doctors In Villages


and so on.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-251) package parser ignoring tika-config.xml

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Koren updated TIKA-251:
--------------------------------

    Priority: Minor  (was: Major)

[jira] Commented: (TIKA-251) package parser ignoring tika-config.xml

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724199#action_12724199 ]

Jukka Zitting commented on TIKA-251:
------------------------------------

The package parser might not be picking up your custom configuration. Are you using a recent version from trunk?

See TIKA-238, which should fix the issue of PackageParser always using the default Tika configuration.
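
The suspected mechanism can be illustrated with a toy model. Everything in the sketch below is hypothetical (the class names, the map-based "configuration", and the parser labels are invented for illustration) and it is not Tika's implementation. It only shows how a package parser that builds its own default configuration would ignore a custom mapping for archive entries, and how passing the caller's configuration through would avoid that.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are not Tika's implementation.
class PackageDelegationSketch {

    /** Stand-in for the default configuration: MIME type -> parser name. */
    static Map<String, String> defaultConfig() {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/xml", "built-in XML text extractor");
        return parsers;
    }

    /** Stand-in for the customized tika-config.xml from the report. */
    static Map<String, String> customConfig() {
        Map<String, String> parsers = defaultConfig();
        parsers.put("application/xml", "XmlParser");
        return parsers;
    }

    /** A package parser that delegates each archive entry according to its own config. */
    static class ToyPackageParser {
        private final Map<String, String> config;

        /** Suspected behaviour: always falls back to the default configuration. */
        ToyPackageParser() {
            this(defaultConfig());
        }

        /** Alternative: the caller's configuration is passed through. */
        ToyPackageParser(Map<String, String> config) {
            this.config = config;
        }

        String parserForEntry(String mimeType) {
            return config.getOrDefault(mimeType, "no parser");
        }
    }

    public static void main(String[] args) {
        Map<String, String> custom = customConfig();
        System.out.println("direct parse:            " + custom.get("application/xml"));
        System.out.println("entry, default config:   "
                + new ToyPackageParser().parserForEntry("application/xml"));
        System.out.println("entry, config passed in: "
                + new ToyPackageParser(custom).parserForEntry("application/xml"));
    }
}
```

In this model the direct parse and the archive entry disagree only when the package parser is built without the caller's configuration, which is the symptom described in the report.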

[jira] Commented: (TIKA-251) package parser ignoring tika-config.xml

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724352#action_12724352 ]

Jonathan Koren commented on TIKA-251:
-------------------------------------

Just updated and reran `mvn install` to make sure.  

bash-3.2# svn update
At revision 788551.


