Nutch 1.14 issues


Nutch 1.14 issues

Arkadi.Kosmynin

Hi guys,

 

I am porting Arch (https://www.atnf.csiro.au/computing/software/arch/) to Nutch 1.14 and Solr 7.2, and I have come across a few serious issues, of which you should be aware:

 

1. NUTCH-2071 is still an issue in 1.14, because the returned parseResult is never null. If a parser fails to parse a document, it returns an empty result rather than null. This means that, in a chain of parser candidates, only the first one gets a chance to parse the document.

2. Nutch adopted Tika as a general parsing tool and stopped supporting the “legacy” parsing plugins (OO, MS). I kept using them and had hoped to drop them in the next version of Arch, which I am preparing for release, but I still can’t, because Tika fails to parse too many documents on our site. When I reinforce Tika with the legacy parsers, I achieve an almost 100% parsing success rate. This is why NUTCH-2071 is important for Arch. I think you should bring the legacy parsers back to Nutch, because the quality of parsing of “real life” data, such as ours, is not great without them.

3. The lines defining the fall-back (*) plugin in parse-plugins.xml are not effective: they are ignored as long as at least one plugin claims * in its plugin.xml file. In some cases, Nutch assigns the * capability to plugins that don’t even claim it. For example, I can’t understand why the Arch content blocking plugin gets it.

4. In earlier versions of Nutch, using the native libraries really helped: it reduced the crawl time of our site from a couple of days to 6-7 hours. In Nutch 1.14, I don’t notice this. I’ve obtained the Hadoop libraries, placed them where they are expected, and even inserted an explicit load-library call in my code, but I still don’t notice any significant time savings.

5. The Feed plugin seems to have a major problem. Line 102 in FeedIndexingFilter.java threw a NumberFormatException (which caused the entire crawl to fail!) because it was trying to parse a date in string format as a number. Given that this piece of metadata was generated by the feed parser (same plugin), it seems that the plugin is in disagreement with itself.

6. This is less important, but when Tika fails to parse a document, it generates a scary error message and an ugly stack trace. I think this should be a one-line warning, because other parsers may still parse the document successfully.
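
To illustrate item 1, here is a minimal Java sketch of a parser chain that treats an empty result the same way as a null one, so that later candidates still get their turn. This is not Nutch code: `ParserChainSketch`, `Result`, and `parseWithFallback` are hypothetical names made up for the example, and the real Nutch parser API looks different.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ParserChainSketch {

    /** Hypothetical stand-in for a parse result; NOT Nutch's ParseResult class. */
    static final class Result {
        final String text;
        Result(String text) { this.text = text; }
        boolean isEmpty() { return text == null || text.isEmpty(); }
    }

    /**
     * Tries each candidate parser in order and returns the first non-empty result,
     * or null if every candidate fails. A parser that returns null, returns an
     * empty result, or throws is treated as having failed.
     */
    static Result parseWithFallback(List<Function<byte[], Result>> parsers, byte[] content) {
        for (Function<byte[], Result> parser : parsers) {
            Result result;
            try {
                result = parser.apply(content);
            } catch (RuntimeException e) {
                continue; // one failing parser should not end the chain
            }
            if (result != null && !result.isEmpty()) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The first candidate "fails" by returning an empty (but non-null) result.
        Function<byte[], Result> failing = content -> new Result("");
        // The second candidate plays the role of a legacy fallback parser.
        Function<byte[], Result> legacy = content -> new Result("parsed by fallback");

        Result r = parseWithFallback(Arrays.asList(failing, legacy), new byte[0]);
        System.out.println(r == null ? "unparsed" : r.text); // prints "parsed by fallback"
    }
}
```

If the check were only `result != null`, the empty result from the first candidate would be returned and the fallback parser would never run, which is the behaviour described in item 1.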

 

Hope this helps.

 

Regards,

 

Arkadi


Re: Nutch 1.14 issues

Sebastian Nagel
Hi Arkadi,

thanks for your feedback and suggestions.
I can understand your frustration but I also want to clarify:

- Arch is a nice project, for sure. But Arch is GPL licensed
  which makes contributions a one-way route (Nutch -> Arch)
  and keeps me from even looking into the Arch sources. Sorry.

- Please take the time to split your list of issues into separate
  requests on the mailing list or open separate Jira issues.
  Also take care that the problems are reproducible by sharing
  documents that failed to parse, log snippets, config files, etc.

- Sorry about NUTCH-2071, I took this mainly as a class path issue
  in the parse-tika plugin (which is solved). Now I understand better
  what your objective is and I'll review and try to fix it
  (in combination with NUTCH-1993). But again: please take the time
  to explain your objectives, ping committers if fixes make no progress,
  etc.

- Nutch is a community project. There are no "paid" committers. This
  means although some of us are paid to configure/operate/adapt crawlers
  nobody is delegated to fix issues, support Nutch users, etc.
  That's voluntary work.

- Everybody is welcome to contribute (patches, documentation, support
  on the mailing list, etc.)  Because Nutch is a small project this
  will help us definitely.


Thanks,
Sebastian



On 06/12/2018 08:46 AM, [hidden email] wrote:



Re: Nutch 1.14 issues

Arkadi.Kosmynin
Hi Sebastian,

Sorry, clarifying my objectives:

I am not frustrated, just trying to help. I did not write this message to request fixes for Arch. All these issues have been fixed in Arch, except perhaps the native library issue, but I may fix it as well, if lucky enough. I wrote that message to contribute back to Nutch, because I consider these issues (at least, some of them) very important for Nutch.

I do understand that Nutch is supported by volunteers, and I really appreciate the work you are doing.

I will open JIRA issues.

Regards,

Arkadi  
________________________________________
From: Sebastian Nagel <[hidden email]>
Sent: Wednesday, 13 June 2018 12:24 AM
To: [hidden email]
Subject: Re: Nutch 1.14 issues



RE: Nutch 1.14 issues

Markus Jelsma-2
In reply to this post by Arkadi.Kosmynin
Hi,

I've got some tests failing here on a vanilla master check out.

    [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec
    [junit] Test org.apache.nutch.net.TestURLNormalizers FAILED

Jurian had protocol-http's test failing just now, but running ant test on my system with a clean checkout didn't run the plugin tests at all. Whatever I do, the plugin tests won't run.

Markus



 
 

RE: Nutch 1.14 issues

Markus Jelsma-2
Ah, wrong thread. But it seems some things are not entirely right for the 1.15 release just yet.
Markus

 
 
-----Original message-----

> From:Markus Jelsma <[hidden email]>
> Sent: Wednesday 13th June 2018 12:44
> To: [hidden email]
> Subject: RE: Nutch 1.14 issues

Re: Nutch 1.14 issues

Sebastian Nagel
Hi Markus,

On Jenkins all unit tests have passed including plugins:
  https://builds.apache.org/job/Nutch-trunk/3536/testReport/

(same on my laptop running Ubuntu 18.04 and on a Ubuntu 16.04 server)

Could be related to the Java version.
% java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

But let's discuss the test failures in separate threads.

Sebastian


On 06/13/2018 12:45 PM, Markus Jelsma wrote:
