Normalizing URLs with anchors

kkrugler
Hi all,

The default regex-normalize.xml currently strips out PHP session ids.

I'm wondering whether it would also make sense to remove anchor text
from URLs. For example, currently these two URLs are treated as
different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at
the end of a URL?

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Normalizing URLs with anchors

Otis Gospodnetic-2-2
I think it's safe to strip anchors, as they simply point to a different portion of the same page for browser rendering.  I do that for Simpy while normalizing URLs, in order not to have duplicates like this.
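A minimal sketch of this kind of anchor stripping, written in Python rather than as a regex-normalize.xml rule. The function name strip_fragment is hypothetical, and this is not the code used by Simpy or Nutch. It assumes the fragment starts at the first "#" in the URL and runs to the end.

```python
import re

# Everything from the first '#' to the end of the string is treated as the fragment.
_FRAGMENT = re.compile(r"#.*\Z", re.DOTALL)


def strip_fragment(url: str) -> str:
    """Return url with any trailing '#fragment' removed (hypothetical helper)."""
    return _FRAGMENT.sub("", url)


with_anchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex"
without_anchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html"

# Both forms normalize to the same string, so they would no longer count as two URLs.
print(strip_fragment(with_anchor) == strip_fragment(without_anchor))
```

One caveat with this approach: pages that use the fragment to select content through client-side scripting would also be collapsed into a single URL.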

Otis

----- Original Message ----
From: Ken Krugler <[hidden email]>
To: [hidden email]
Sent: Thu 05 Jan 2006 04:40:07 PM EST
Subject: Normalizing URLs with anchors

Hi all,

The default regex-normalize.xml currently strips out PHP session ids.

I'm wondering whether it would also make sense to remove anchor text
from URLs. For example, currently these two URLs are treated as
different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at
the end of a URL?

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200



Re: Normalizing URLs with anchors

Doug Cutting-2
In reply to this post by kkrugler
Ken Krugler wrote:

> I'm wondering whether it would also make sense to remove anchor text
> from URLs. For example, currently these two URLs are treated as different:
>
> http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex 
>
> and
>
> http://www.dina.kvl.dk/~sestoft/gcsharp/index.html 
>
> Is it safe to always strip # followed by (valid anchor characters) at
> the end of a URL?

Yes, I think so.  Please submit a patch.

Are there other common session ids that we should remove in this file?

Doug

Speed up searching

luti
In reply to this post by Otis Gospodnetic-2-2
Dear Developers,

I think this great improvement is missing from the latest Nutch/Lucene
nightly build:
http://issues.apache.org/jira/browse/LUCENE-443

Best Regards,
    Ferenc