[jira] Created: (NUTCH-425) parse-js pollutes anchor text with base URL of source page


[jira] Created: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

JIRA jira@apache.org
parse-js pollutes anchor text with base URL of source page
----------------------------------------------------------

                 Key: NUTCH-425
                 URL: https://issues.apache.org/jira/browse/NUTCH-425
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: [hidden email]


The parse-js plugin always adds a URL, usually the page's base URL, as the anchor text for any link it discovers while parsing JavaScript.  Anchor text is tokenized when indexed and by default gets a heavy weighting.  As a result, pages often rank high in search results for no reason other than that the query term appears in these (URL) anchors.

See http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg06935.html for related user list postings.

Here is an extract from the linkdb exhibiting the problem:

https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks:
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
 fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
 fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
 fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
 fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx 
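
To illustrate why such anchors matter, here is a rough sketch of how one of the URL anchors above could break into searchable terms. The class name AnchorTokenSketch and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions made for illustration; the analyzer Nutch actually applies may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Nutch code.
class AnchorTokenSketch {

    // Assumed tokenization: lowercase, then split on runs of non-alphanumeric characters.
    static List<String> tokenize(String anchor) {
        List<String> terms = new ArrayList<>();
        for (String term : anchor.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // One of the anchors recorded against the westpac.com.au page in the extract above.
        String anchor = "http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx";
        // Each printed term would be indexed as anchor text of the link target.
        System.out.println(tokenize(anchor));
    }
}
```

Under that assumption, terms such as "ticketek" and "buyers" end up attached to the westpac.com.au page even though they come only from the URL of the linking page.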

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Updated: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

[hidden email] updated NUTCH-425:
------------------------------------

    Attachment: nutch425.patch



[jira] Commented: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291 ]

[hidden email] commented on NUTCH-425:
-----------------------------------------

I took a look at what is passed to parse-js, both when it is called from parsehtml and when it is run as the parser for standalone JavaScript files.  There doesn't appear to be anything at hand that could reasonably be construed as 'anchor text' when a URL is found in JavaScript.  Following on from this, the attached patch does the most basic 'fix': it sets the anchor text parameter to the empty string when getJSLinks is called.
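
As a minimal sketch of the change described above (not the contents of nutch425.patch, and not the real parse-js source), the following shows the difference between passing the page URL and passing the empty string as the anchor. The class JsLinkSketch, the Link type, the two-argument getJsLinks signature and the URL regex are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the actual parse-js plugin.
class JsLinkSketch {

    // Simplified outlink: a target URL plus the anchor text stored with it.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Assumed pattern: absolute http(s) URLs inside quoted JavaScript string literals.
    private static final Pattern QUOTED_URL =
        Pattern.compile("[\"'](https?://[^\"'\\s]+)[\"']");

    // Assumed signature: every link found in the script gets the caller-supplied anchor.
    static List<Link> getJsLinks(String script, String anchor) {
        List<Link> links = new ArrayList<>();
        Matcher m = QUOTED_URL.matcher(script);
        while (m.find()) {
            links.add(new Link(m.group(1), anchor));
        }
        return links;
    }

    public static void main(String[] args) {
        String pageUrl = "http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx";
        String script =
            "window.open('https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030');";

        // Reported behaviour: the source page URL is stored as the anchor text.
        for (Link l : getJsLinks(script, pageUrl)) {
            System.out.println("before: " + l.toUrl + " anchor: " + l.anchor);
        }
        // Behaviour described for the patch: the anchor is the empty string.
        for (Link l : getJsLinks(script, "")) {
            System.out.println("after:  " + l.toUrl + " anchor: [" + l.anchor + "]");
        }
    }
}
```

With an empty anchor the link itself is still recorded, so the target remains discoverable; only the misleading anchor text is dropped.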



[jira] Closed: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-425.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.0
         Assignee: Andrzej Bialecki

Fixed in rev. 493085. Thank you!

