Nutch trunk js-parser problem with extremely long and meaningless Elements

Ned Rockson-3
I've run into this problem before: while running the parser, it gets
caught in really deep regex loops.  As a quick fix I changed
urlfilter-prefix to reject URLs longer than 300 characters and to make
sure none of the characters have ASCII values below 32 (control
characters).  I just ran into another case today, but this time it's in
the JS parser.  Take a look at the source for
http://www.magic-cadeaux.fr/ where it lists the function
swap(image, num).  If it weren't for all of the slashes it would be
well-formed JavaScript, but unfortunately the parse-js plugin doesn't
deal with it correctly; it just hangs in a very deep loop.  A browser
such as Firefox, however, seems to handle it fine.  Is there a way we
can deal with these cases other than limiting the size of an Element?
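
To make the two guards above concrete, here is a minimal sketch in Java. It assumes the hang comes from backtracking in java.util.regex, which I have not confirmed for parse-js. The names GuardedRegex, isAcceptableUrl, findWithTimeout and DeadlineCharSequence are hypothetical and not part of Nutch. The first method is the length/control-character check described above. The second wraps the input in a CharSequence that throws once a deadline passes, so a runaway match is abandoned instead of hanging the parser.

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: guards against runaway regex matching. */
final class GuardedRegex {

    /** The 300-character limit mentioned in the post. */
    static final int MAX_URL_LENGTH = 300;

    /** Rejects URLs that are too long or contain control characters (code < 32). */
    static boolean isAcceptableUrl(String url) {
        if (url.length() > MAX_URL_LENGTH) {
            return false;
        }
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) < 32) {
                return false;
            }
        }
        return true;
    }

    /** Thrown by DeadlineCharSequence when the time budget is used up. */
    static final class RegexTimeoutException extends RuntimeException {
    }

    /** CharSequence wrapper whose charAt() fails once the deadline has passed. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The matcher reads its input through charAt(), so this check
            // runs repeatedly while the regex engine is backtracking.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /**
     * Runs pattern.find() on the input, giving up after timeoutMillis.
     * In this sketch a timeout is simply reported as "no match".
     */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(input, deadline));
        try {
            return matcher.find();
        } catch (RegexTimeoutException e) {
            return false;
        }
    }
}
```

A caller could run something like GuardedRegex.findWithTimeout(pattern, scriptText, 500) for each pattern and skip the Element on a timeout, which bounds the time spent rather than the Element size. Treating a timeout as "no match" is a simplification; a real plugin would probably want to log it or report it separately.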