Relative urls, interpage href anchors


webdev1977
I am seeing an issue when crawling HTML pages that have relative URLs embedded in them. I know there is an ongoing issue related to relative URLs that begin with a ?, but this seems to be a different issue.

In regex-normalize.xml there is the following pattern:

<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

Here is my url:
http://myhost.com/my_page.php?id=23141

The source of this page contains the following href:
href="#R&D_in_Research_Books"

It tries to fetch this URL:
http://myhost.com/my_page.php?id=23141&D_in_Research_Books&D&D_in_Research_Books&D&D&D_in_Research_Books&D_in_Research_Books

WTH??? Commenting out that pattern stops the madness. Otherwise it runs in a continual loop that never ends and just keeps generating more and more URLs with "&D_in_Research_Books" tacked onto the end.

I have over 1 MILLION of these in my crawldb (it has been running for over a week).
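
The growth described above can be reproduced outside Nutch. The sketch below is a rough Python approximation, not Nutch code. It assumes the href is first resolved against the page URL and that the pattern is then applied as a plain regex replacement (Python writes the back-reference as \1 where the config uses $1). The repeated rounds are only one plausible reading of the loop: each normalized URL is new to the crawldb, gets fetched, and yields the same href again. The names normalize and crawl_rounds are made up for this illustration.

```python
import re
from urllib.parse import urljoin

# Pattern from regex-normalize.xml as quoted in the post (with &amp; decoded to &).
ANCHOR_PATTERN = re.compile(r"#.*?(\?|&|$)")


def normalize(url: str) -> str:
    """Apply the pattern with substitution $1 (written \\1 in Python)."""
    return ANCHOR_PATTERN.sub(r"\1", url)


def crawl_rounds(start: str, href: str, rounds: int) -> list[str]:
    """Resolve the same href against each newly produced URL, then normalize it."""
    produced = []
    page = start
    for _ in range(rounds):
        page = normalize(urljoin(page, href))
        produced.append(page)
    return produced


if __name__ == "__main__":
    for url in crawl_rounds(
        "http://myhost.com/my_page.php?id=23141", "#R&D_in_Research_Books", 3
    ):
        print(url)
```

The lazy match stops at the first "&" inside the anchor, so only "#R" is dropped and "&D_in_Research_Books" survives as if it were a query parameter. Every round therefore prints a URL one "&D_in_Research_Books" longer than the previous one.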




Re: Relative urls, interpage href anchors

Sebastian Nagel
Hi,

I had the same problem with this pattern. I think the pattern is intended
to remove page anchors while keeping accidentally misplaced query parameters
(behind the anchor). In my case, there were anchor links of the form
    #action?param1&param2
processed by some JavaScript code.

I wouldn't stop removing inner-page anchors (which is what commenting out
the pattern does). Better to change it so that the complete anchor is deleted:

<regex>
    <pattern>#.*</pattern>
    <substitution></substitution>
</regex>
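
To compare the two rules side by side, here is a small Python sketch. It is an approximation of the regex replacement only, not of Nutch's normalizer chain, and the example.com URL is a made-up stand-in for the "#action?param1&param2" style of link. The names OLD_RULE, NEW_RULE and apply_rule are hypothetical.

```python
import re

# (pattern, substitution) pairs; $1 in the config is \1 in Python.
OLD_RULE = (re.compile(r"#.*?(\?|&|$)"), r"\1")  # keeps whatever follows the first ? or &
NEW_RULE = (re.compile(r"#.*"), "")              # deletes the complete anchor


def apply_rule(rule, url: str) -> str:
    """Apply one normalization rule to a URL string."""
    pattern, substitution = rule
    return pattern.sub(substitution, url)


if __name__ == "__main__":
    samples = [
        "http://myhost.com/my_page.php?id=23141#R&D_in_Research_Books",
        "http://example.com/page#action?param1&param2",  # hypothetical URL
    ]
    for url in samples:
        print("old:", apply_rule(OLD_RULE, url))
        print("new:", apply_rule(NEW_RULE, url))
```

With the old rule, the text after the first "?" or "&" in the anchor is spliced back into the URL. With the new rule, both samples reduce to the URL without any anchor.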

Is it worth opening an issue to fix this pattern in general?

Sebastian


On 03/27/2012 02:43 PM, webdev1977 wrote:

> --
> View this message in context: http://lucene.472066.n3.nabble.com/Relative-urls-interpage-href-anchors-tp3861215p3861215.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Relative urls, interpage href anchors

Julien Nioche-4
Hi,

Yes, please do open an issue.

Thanks

Julien

On 27 March 2012 22:17, Sebastian Nagel <[hidden email]> wrote:



--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Relative urls, interpage href anchors

Markus Jelsma-2
Doesn't the BasicURLNormalizer already remove the anchor? I think it
rebuilds the URL without the anchor if one is present.
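
As an illustration of what "rebuilding the URL without the anchor" could look like, here is a short Python sketch. It is not the BasicURLNormalizer source, and whether that plugin behaves this way is the open question above. The function name drop_fragment is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_fragment(url: str) -> str:
    """Split the URL into components and reassemble it with an empty fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


if __name__ == "__main__":
    print(drop_fragment("http://myhost.com/my_page.php?id=23141#R&D_in_Research_Books"))
```

Because the URL is parsed into components first, an "&" or "?" inside the anchor is never mistaken for part of the query string.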

On Wed, 28 Mar 2012 10:22:39 +0100, Julien Nioche <[hidden email]> wrote:


--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350