[jira] [Created] (TIKA-2810) Back off to tagsoup when xml parser fails on Tika xhtml in tika-eval


JIRA jira@apache.org
Tim Allison created TIKA-2810:
---------------------------------

             Summary: Back off to tagsoup when xml parser fails on Tika xhtml in tika-eval
                 Key: TIKA-2810
                 URL: https://issues.apache.org/jira/browse/TIKA-2810
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


In TIKA-2791, we added extraction of structure tags. If the XML parser fails on Tika's xhtml, we currently back off to treating the full xhtml as if it were a string of text that happened to include markup.

It would be better to back off to the html parser (tagsoup) so that content comparisons still work accurately even when the markup is not well-formed, e.g. mis-nested tags: <b><i></b></i>
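
The proposed back-off could look roughly like the sketch below. This is an illustration only, not tika-eval's code: it uses Python's standard-library xml.etree.ElementTree as a stand-in for the strict XML parser and html.parser as a stand-in for tagsoup, and the names extract_content and _LenientTextExtractor are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _LenientTextExtractor(HTMLParser):
    """Hypothetical helper: collects text and start-tag names
    without requiring well-formed nesting (stand-in for tagsoup)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_content(xhtml):
    """Return (text, tags, parser_used) for an xhtml string.

    Tries the strict XML parser first; on a parse failure, backs off
    to the lenient html parser instead of treating the markup as text.
    """
    try:
        root = ET.fromstring(xhtml)
    except ET.ParseError:
        lenient = _LenientTextExtractor()
        lenient.feed(xhtml)
        lenient.close()
        return "".join(lenient.text_parts), lenient.tags, "lenient-html"
    text = "".join(root.itertext())
    tags = [el.tag for el in root.iter()]
    return text, tags, "strict-xml"


good = "<html><body><p>hello <b>world</b></p></body></html>"
bad = "<html><body><p>hello <b><i>world</b></i></p></body></html>"
print(extract_content(good))
print(extract_content(bad))
```

In this sketch both inputs yield the same extracted text, so a content comparison is not skewed by markup tokens leaking into the text when the strict parse fails. Real Tika xhtml declares a namespace, which the strict branch here would include in the tag names; the sketch leaves that out for brevity.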



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)