Terminating slashes in URL normalization

Chris Schneider-2
Gang,

Pardon my ignorance, but I noticed recently that some URLs were
duplicated in my crawldb, once with a terminating slash and once
without it. For example, both of the following URLs were found in the
same crawldb:

http://mail.python.org/mailman/listinfo/
http://mail.python.org/mailman/listinfo

As I understand it, if the URL refers to a folder on the server, a
terminating slash should be added to the URL, since this improves
performance of loading the page (presumably because the server
doesn't have to check to see if it refers to a file). See
<http://en.wikipedia.org/wiki/URL_normalization> for more details.

Given this, shouldn't the default URL normalizer just add a slash to
the end of a URL that doesn't have a file extension?
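
As a rough illustration only, the heuristic proposed above might look like the Python sketch below. The function name `add_slash_if_no_extension` is hypothetical and this is not Nutch's normalizer. It guesses from the shape of the URL alone, so it cannot know whether the server really treats the path as a folder.

```python
from urllib.parse import urlsplit, urlunsplit


def add_slash_if_no_extension(url: str) -> str:
    """Append a trailing slash when the last path segment has no '.' in it.

    Hypothetical helper: it looks only at the URL text, never at the server.
    """
    parts = urlsplit(url)
    path = parts.path
    if not path.endswith("/"):
        # Treat "no dot in the last segment" as "no file extension".
        last_segment = path.rsplit("/", 1)[-1]
        if "." not in last_segment:
            path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(add_slash_if_no_extension("http://mail.python.org/mailman/listinfo"))
print(add_slash_if_no_extension("http://example.com/index.html"))
```

The first call prints the URL with a slash appended and the second prints its input unchanged. Whether the rewritten URL names the same resource as the original is a separate question that the heuristic cannot answer.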

- Schmed
--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------

Re: Terminating slashes in URL normalization

Jukka Zitting
Hi,

On 8/5/06, Chris Schneider <[hidden email]> wrote:
> Given this, shouldn't the default URL normalizer just add a slash to
> the end of a URL that doesn't have a file extension?

Section 6.2.4 of RFC 3986 suggests that a crawler could do such a
normalization if it detects that
http://mail.python.org/mailman/listinfo redirects to
http://mail.python.org/mailman/listinfo/. I think just blindly adding
the slash without knowing about the redirection is incorrect.
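
A minimal Python sketch of that redirect-aware approach, under stated assumptions: `redirect_target` is a hypothetical callable supplied by the crawler that returns the URL a fetch was redirected to (or `None`), and the URLs carry no query string or fragment. It is not how Nutch implements normalization.

```python
from typing import Callable, Optional


def normalize_with_redirect(
    url: str,
    redirect_target: Callable[[str], Optional[str]],
) -> str:
    """Adopt the trailing-slash form only when a redirect confirms it."""
    if url.endswith("/"):
        return url
    target = redirect_target(url)
    # Only an observed redirect to exactly "url + /" justifies the rewrite.
    if target == url + "/":
        return target
    return url


# Stand-in for redirects observed during a crawl (made-up data).
observed = {"http://example.com/docs": "http://example.com/docs/"}

print(normalize_with_redirect("http://example.com/docs", observed.get))
print(normalize_with_redirect("http://example.com/page", observed.get))
```

The first call prints the slash form because a redirect was recorded for it; the second prints the URL untouched because nothing is known about it.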

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [hidden email]
Software craftsmanship, JCR consulting, and Java development

Re: Terminating slashes in URL normalization

Sami Siren-2
In reply to this post by Chris Schneider-2
Chris Schneider wrote:

> Gang,
>
> Pardon my ignorance, but I noticed recently that some URLs were
> duplicated in my crawldb, once with a terminating slash and once
> without it. For example, both of the following URLs were found in the
> same crawldb:
>
> http://mail.python.org/mailman/listinfo/ 
> http://mail.python.org/mailman/listinfo
>
> As I understand it, if the URL refers to a folder on the server, a
> terminating slash should be added to the URL, since this improves
> performance of loading the page (presumably because the server
> doesn't have to check to see if it refers to a file). See
> <http://en.wikipedia.org/wiki/URL_normalization> for more details.
>
> Given this, shouldn't the default URL normalizer just add a slash to
> the end of a URL that doesn't have a file extension?

There's no way we can tell (from outside) whether a single URL points to
a directory or not (or whether the URL could be normalized in the way
you describe).

For example, try:
  http://en.wikipedia.org/wiki/URL_normalization
  http://en.wikipedia.org/wiki/URL_normalization/

The referenced paper [http://www2006.org/programme/item.php?id=p20]
presents an interesting idea for eliminating redundant URLs from a list
of URLs.

Currently, duplicate pages can be caught (from search results) by running
dedup on the index. If you have run dedup and still see those two pages in
search results, then please check the hash for each page. Dedup only
catches pages with an identical hash, and it is quite common for a web site
to change a very small part of the HTML content on every request.
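
To illustrate why identical hashes matter, here is a small Python sketch of hash-based duplicate grouping. It is not Nutch's dedup code: the function name `group_by_content_hash` and the sample pages are made up, and SHA-256 is an assumption, since the thread does not say which hash dedup uses.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_by_content_hash(pages: Dict[str, bytes]) -> List[List[str]]:
    """Return groups of URLs whose fetched bodies are byte-for-byte equal."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[hashlib.sha256(body).hexdigest()].append(url)
    # Keep only digests shared by more than one URL.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


pages = {
    "http://example.com/list": b"<html>same body</html>",
    "http://example.com/list/": b"<html>same body</html>",
    # One changed byte gives a different digest, so this page is not grouped.
    "http://example.com/other": b"<html>same body!</html>",
}
print(group_by_content_hash(pages))
```

Only the two URLs with byte-identical bodies end up in a group. A page that differs by a single byte, for instance because of a timestamp or counter, gets a different digest and is missed by this kind of exact-hash comparison.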

It might be a good idea to extend the current functionality with some kind
of tagging of redundant (by content) URLs in webdb to prevent them from
being fetched again.

--
  Sami Siren

Re: Terminating slashes in URL normalization

Jukka Zitting
In reply to this post by Jukka Zitting
Hi,

On 8/5/06, Jukka Zitting <[hidden email]> wrote:
> Section 6.2.4 of RFC 3986 suggests that a crawler could do such a
> normalization if it detects that
> http://mail.python.org/mailman/listinfo redirects to
> http://mail.python.org/mailman/listinfo/.

Which it of course doesn't... :-) Another reasonable normalization aid
would be to detect that both URIs return identical information, like
with the dedup tool mentioned by Sami.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [hidden email]
Software craftsmanship, JCR consulting, and Java development

Re: Terminating slashes in URL normalization

Chris Schneider-2
In reply to this post by Jukka Zitting
Jukka,

>On 8/5/06, Chris Schneider <[hidden email]> wrote:
>>Given this, shouldn't the default URL normalizer just add a slash to
>>the end of a URL that doesn't have a file extension?

At 8:41 AM +0300 8/5/06, Jukka Zitting wrote:
>Section 6.2.4 of RFC 3986 suggests that a crawler could do such a
>normalization if it detects that
>http://mail.python.org/mailman/listinfo redirects to
>http://mail.python.org/mailman/listinfo/. I think just blindly adding
>the slash without knowing about the redirection is incorrect.

I wasn't thinking about redirection. You are correct; dedup is the way to handle this problem.

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------