Blogger RSS Parsing Error

Blogger RSS Parsing Error

mikeyc
Hey all,
I'm trying to parse Blogger RSS feeds and I'm getting errors whenever certain elements are encountered. Specifically, the elements are prefixed with "st1". I believe these are Microsoft Smart Tags, though I'm not 100% sure. Has anyone parsed these feeds successfully? If so, can you point me in the right direction?

I have attached the error message below for reference.  

Thanks,
Mike

org.apache.commons.feedparser.FeedParserException: org.jdom.JDOMException: Error on line 46: The prefix "st1" for element "st1:country-region" is not bound.
        at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:86)
        at org.apache.nutch.parse.rss.RSSParser.getParse(RSSParser.java:116)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:225)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:137)
Caused by: org.jdom.JDOMException: Error on line 46: The prefix "st1" for element "st1:country-region" is not bound.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:367)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:673)
        at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:73)
        ... 4 more
Caused by: org.xml.sax.SAXParseException: The prefix "st1" for element "st1:country-region" is not bound.
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:354)
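
For reference, the "prefix ... is not bound" message in the trace above means the feed uses an `st1:` element without an `xmlns:st1` declaration in scope, so a namespace-aware XML parser rejects the document. The sketch below reproduces that with the JDK's built-in parser rather than JDOM or commons-feedparser. The feed fragment, the class name `UnboundPrefixDemo`, the helper `parses`, and the namespace URI are all assumptions made for illustration; none of them come from Nutch or from the actual Blogger feed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class UnboundPrefixDemo {
    // Hypothetical feed fragment (not the real Blogger feed):
    // the st1 prefix is used but never declared with xmlns:st1.
    static final String UNBOUND =
        "<rss version=\"2.0\"><channel><item><description>"
        + "<st1:country-region>Canada</st1:country-region>"
        + "</description></item></channel></rss>";

    // Same fragment with the prefix declared on the root element.
    // The URI is an assumption; any URI is enough to bind the prefix.
    static final String BOUND = UNBOUND.replace(
        "<rss version=\"2.0\">",
        "<rss version=\"2.0\" xmlns:st1=\"urn:schemas-microsoft-com:office:smarttags\">");

    /** Returns true if the XML parses, false if the parser reports a fatal error. */
    static boolean parses(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            // Anything else (I/O, parser configuration) is unexpected in this demo.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses(UNBOUND, true));   // namespace-aware, prefix undeclared
        System.out.println(parses(UNBOUND, false));  // namespace processing switched off
        System.out.println(parses(BOUND, true));     // namespace-aware, prefix declared
    }
}
```

On a stock JDK this should print false, true, true. Neither workaround is necessarily right for Nutch: turning off namespace awareness may affect code that depends on namespaced RSS elements, and injecting a declaration means rewriting the feed before it reaches the parser.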



Re: Blogger RSS Parsing Error

chrismattmann
Hi Mike,

  The RSS parser for Nutch is based on Kevin Burton's commons-feedparser in the Jakarta Sandbox. Here is the documentation for that feedparser:

http://jakarta.apache.org/commons/sandbox/feedparser/

You might want to post your question to the commons-feedparser mailing list: Kevin is the real RSS guru, and I bet he could help you out.

  As for your guess that it's an unrecognized tag, I think you're probably right. The question now is whether your fetch is actually failing because of this. In the RSS parser, line 116 (the call to the "parse" function) is inside a try/catch block, so I assume what you pasted is just the stack trace from the log output, right?

Anyway, good luck with your problem!

Cheers,
  Chris

Re: Blogger RSS Parsing Error

mikeyc
Chris,
OK, I'll try the commons-feedparser mailing list. And yes, that was the stack trace from the log output.

Thanks again,
Mike