[jira] Created: (TIKA-337) SWF parser

JIRA jira@apache.org
SWF parser
----------

                 Key: TIKA-337
                 URL: https://issues.apache.org/jira/browse/TIKA-337
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Julien Nioche


Here is an initial implementation of a SWF parser, which uses JavaSWF and has been adapted from A. Bialecki's implementation for Nutch.
The main differences from the Nutch implementation are that we use the latest version of JavaSWF and do not try to extract text from the actions or structured URLs. As usual, URLs can be obtained from the extracted text using ParserPostProcessor.
JavaSWF has changed quite a bit since the Nutch integration, and I wanted to keep this initial port nice and simple. It should be possible to extract the URLs from the actions using JavaSWF's API; I think this is what they did in Heritrix.
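
To illustrate the idea of recovering URLs from already-extracted text, here is a minimal, self-contained sketch in plain Java. It is not the code in ParserPostProcessor or in the attached patch; the class name UrlExtractionSketch, the method extractUrls, and the simplified URL pattern are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans plain text (e.g. text extracted from a SWF file)
// for http/https URLs. The pattern is deliberately simple and is an assumption,
// not the expression used by Tika.
final class UrlExtractionSketch {

    // Matches "http://" or "https://" followed by characters that are not
    // whitespace, quotes, or angle brackets.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every match in order of appearance; duplicates are kept.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String extracted = "Play the movie at http://example.com/movie.swf today";
        for (String url : extractUrls(extracted)) {
            System.out.println(url);
        }
    }
}
```

A pattern like this only finds URLs that appear literally in the extracted text; URLs embedded in actions would still require going through JavaSWF's API, as noted above.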


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-337) SWF parser

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-337:
-------------------------------

    Attachment: TIKA-337.patch

Patch for the SWF parser.

[jira] Resolved: (TIKA-337) SWF parser

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-337.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

Resolving as a duplicate of the earlier issue TIKA-147. I'll add a comment there pointing to your patch.

Before applying the patch, we'll need to get the JavaSWF library uploaded to Maven Central. I sent a message to the JavaSWF support list about this.

[jira] Updated: (TIKA-337) SWF parser

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-337:
-------------------------------

    Attachment: test.swf

Test file for the SWF parser.
