both html parser have bug with javascript


both html parser have bug with javascript

Ilia S. Yatsenko
Hello :)

Sorry for my limited English.

I have an issue with both HTML parsers.

I see the following text in summaries:

2JavaScript1.3JavaScriptJavaScriptjavascriptjavascript1.1javascript1.2javascript1.3javascript my text description.

Or

2javascript my text description.

Or

javascriptjavascript1.2javascript my text description.

But the summaries should not contain it.

Respectfully


RE: both html parser have bug with javascript

Ilia S. Yatsenko
Oops, I see my mistake O-)

-----Original Message-----
From: Ilia S. Yatsenko [mailto:[hidden email]]
Sent: Sunday, July 03, 2005 6:06 PM
To: [hidden email]
Subject: both html parser have bug with javascript



RE: both html parser have bug with javascript

Chirag Chaman
Actually, I think the JavaScript is there because it's part of the HTML page --
but it should not be part of the summaries.  Has anyone found a way to keep
"JavaScript" or "text/css", which show up from time to time, out of the
summaries?

CC-


-----Original Message-----
From: Ilia S. Yatsenko [mailto:[hidden email]]
Sent: Sunday, July 03, 2005 12:09 PM
To: [hidden email]
Subject: RE: both html parser have bug with javascript



RE: both html parser have bug with javascript

Ilia S. Yatsenko
In reply to this post by Ilia S. Yatsenko
I thought "javascript" was shown in summaries because I had enabled the
parse-js plug-in. I disabled it and built a new database, but got the same
result :(

-----Original Message-----
From: Ilia S. Yatsenko [mailto:[hidden email]]
Sent: Sunday, July 03, 2005 7:09 PM
To: [hidden email]
Subject: RE: both html parser have bug with javascript



RE: both html parser have bug with javascript

Ilia S. Yatsenko
And this <%@ Language=VBScript %> is shown in summaries too.

I thought ANY text between < and >, including unknown tags, should always be
ignored.

:)

-----Original Message-----
From: Ilia S. Yatsenko [mailto:[hidden email]]
Sent: Monday, July 04, 2005 6:33 AM
To: [hidden email]
Subject: RE: both html parser have bug with javascript



Re: both html parser have bug with javascript

Andrzej Białecki-2
In reply to this post by Chirag Chaman
Chirag Chaman wrote:
> Actually, I think the JavaScript is there as it's part of the HTML page --
> but it should not be part of the summaries.  Has anyone found a solution to
> not showing the "JavaScript" or "text/css" -- that shows up from time to
> time?

The summary is generated from the parse_text data, so the problem already
occurs during parsing.

Actually, I think the problem is caused by my patch to DOMContentUtils
;-), which adds script language, stylesheet type and so on to the output
text.

From your comments I gather that you'd rather not have it there - I'll
fix it.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: both html parser have bug with javascript

Chirag Chaman
Andrzej,

Thank you -- and here we were going nuts thinking the problem might have
been with the plugin!
Would it be possible to post a patch file of the changes once you have
made them? Our version of Nutch is different from SVN.

Thanks again.

CC-
 

-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Monday, July 04, 2005 6:05 AM
To: [hidden email]
Subject: Re: both html parser have bug with javascript





Re: both html parser have bug with javascript

Andrzej Białecki-2
Chirag Chaman wrote:
> Andrzej,
>
> Thank you -- and here we were going nuts thinking the problem might have
> been with the plugin!
> Would it be possible to post the patch file of the changes once you have
> made them as our version of Nutch is different from SVN.

I suggest keeping a vanilla version around and porting diffs to your
tree; otherwise you will end up with an increasingly out-of-sync version...

The change itself is trivial (available as 'svn diff -r 179640
DOMContentUtils.java'):

Index: DOMContentUtils.java
===================================================================
--- DOMContentUtils.java        (revision 179640)
+++ DOMContentUtils.java        (working copy)
@@ -102,25 +102,9 @@
                                               boolean abortOnNestedAnchors,
                                               int anchorDepth) {
      if ("script".equalsIgnoreCase(node.getNodeName())) {
-      Node n = node.getAttributes().getNamedItem("language");
-      if (n != null) {
-        String text = n.getNodeValue();
-        sb.append(text);
-      }
        return false;
      }
      if ("style".equalsIgnoreCase(node.getNodeName())) {
-      Node n = node.getAttributes().getNamedItem("rel");
-      if (n != null) {
-        String text = n.getNodeValue();
-        sb.append(text);
-      }
-      n = node.getAttributes().getNamedItem("type");
-      if (n != null) {
-        String text = n.getNodeValue();
-        if (sb.length() > 0) sb.append(", ");
-        sb.append(text);
-      }
        return false;
      }
      if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {

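For readers without a Nutch checkout, the effect of the patch can be sketched in a standalone form: a DOM walk that returns as soon as it reaches a script or style element, so neither the attribute values nor the element body end up in the extracted text. This is only an illustration under assumptions, not the Nutch code: the class and method names (TextExtractionSketch, extractText, collectText) are hypothetical, and it uses the JDK's XML parser on well-formed XHTML rather than Nutch's HTML parsers.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Nutch.
public class TextExtractionSketch {

  /** Parses a well-formed XHTML string and returns its visible text. */
  static String extractText(String xhtml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
      StringBuilder sb = new StringBuilder();
      collectText(doc.getDocumentElement(), sb);
      return sb.toString();
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("could not parse input", e);
    }
  }

  /** Appends text nodes to sb, skipping script and style subtrees entirely. */
  static void collectText(Node node, StringBuilder sb) {
    String name = node.getNodeName();
    if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
      // Mirrors the patched behaviour: append nothing, not even the
      // "language", "rel" or "type" attribute values.
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), sb);
    }
  }

  public static void main(String[] args) {
    String page =
        "<html><head>"
            + "<script language=\"JavaScript1.3\">var x = 1;</script>"
            + "<style type=\"text/css\">p { color: red; }</style>"
            + "</head><body><p>my text description.</p></body></html>";
    System.out.println(extractText(page)); // prints "my text description."
  }
}
```

With the attribute-appending lines removed by the diff above, the same page would have produced text starting with "JavaScript1.3", which matches the prefixes reported in the summaries.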

> Thankx again.

You're welcome.




RE: both html parser have bug with javascript

Chirag Chaman
Andrzej,

Thanks -- this works!!!


-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Monday, July 04, 2005 11:55 AM
To: [hidden email]
Subject: Re: both html parser have bug with javascript
