case-insensitivity needed


case-insensitivity needed

Schwank, Désirée
Hello community,

We use Nutch in combination with Solr to crawl internet and intranet sites for our clients. Unfortunately, I have not found a suitable solution for the following problem, but I am convinced there has to be one.

The versions installed on our Debian Linux system are Solr 4.10.2 and Nutch 1.9. However, the sites being crawled are served by a Windows web server (IIS). Nutch on Linux treats URLs as case-sensitive, whereas the Windows server treats them as case-insensitive.

I have tried the following substitution in regex-normalize.xml:
<regex>
   <pattern>([A-Z]+)</pattern>
   <substitution>\L$1</substitution>
</regex>

First, this is useless because some URLs must not be changed to lowercase, for example because of parameters or servlet names:
https://eservice2.gkd-re.de/bsointer320/DokumentServlet?dokumentenname=320l1946.pdf
https://www.gladbeck.de/Leben_Wohnen/autostart.asp?db=404&form=report&searchfieldBeginndatum.max=06.09.2017&searchfieldAblaufdatum.min=06.09.2017&top=5

Second, it doesn't work, supposedly because of the installed Nutch version 1.9. I have read somewhere that this has not worked since Nutch 1.5, for whatever reason. It was suggested to use a custom URL normalizer. Alternatively, it might be possible to prepare some regular expressions. Could that be what the deduplication mentioned in this message is about (see https://www.mail-archive.com/user@.../msg03904.html)?

Thanks in advance for any help or useful hints.

Kind regards
Désirée Schwank
Team Verfahrensintegration/E-Government
Gemeinsame Kommunale Datenzentrale Recklinghausen
Zweckverband
Castroper Straße 30, 45665 Recklinghausen
Tel.: +49(0)2361-3033-247
Fax: +49(0)2361-3033-333
E-Mail: [hidden email]
Internet: www.gkd-re.de
Please use our OTRS ticket system and send your requests to [hidden email]. That way you can be sure that someone always takes care of your problem.

Re: case-insensitivity needed

Sebastian Nagel
Hi,

It's a deduplication problem caused by different rules regarding case in URLs (cf. [1]).
As you mentioned, it's hard to handle by URL normalization:
- only the path element of a URL (protocol://host/path?query=value) has to be normalized,
  not necessarily the parameters, which are handled by the application (ASP, cf. [2])
- and only for servers running on Windows, or more precisely Windows IIS
  (eservice2.gkd-re.de appears to run on Linux)

A custom URL normalizer would be possible:
- check whether the host belongs to the list of Windows servers
- convert the path element to lowercase
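
The two steps above could be sketched roughly as follows. This is only a minimal, self-contained illustration, not a ready-made Nutch plugin: the class name, the method name and the host list are made up for the example, and in a real setup the logic would presumably live in a plugin for Nutch's URL normalizer extension point (org.apache.nutch.net.URLNormalizer), with the host list read from configuration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

public class WindowsPathLowercaser {

    // Hypothetical list of hosts known to be served by Windows IIS.
    private static final Set<String> WINDOWS_HOSTS = Set.of("www.gladbeck.de");

    /** Lowercases only the path of URLs on listed hosts; query and fragment stay untouched. */
    public static String normalize(String urlString) {
        URL url;
        try {
            url = new URL(urlString);
        } catch (MalformedURLException e) {
            return urlString; // leave unparseable URLs unchanged
        }
        String host = url.getHost().toLowerCase(Locale.ROOT);
        if (!WINDOWS_HOSTS.contains(host)) {
            return urlString; // host is not in the list: case may matter
        }
        String path = url.getPath();
        String lowerPath = path.toLowerCase(Locale.ROOT);
        if (path.equals(lowerPath)) {
            return urlString; // nothing to do
        }
        // Rebuild the URL from its parts, replacing only the path.
        StringBuilder sb = new StringBuilder();
        sb.append(url.getProtocol()).append("://").append(url.getAuthority()).append(lowerPath);
        if (url.getQuery() != null) {
            sb.append('?').append(url.getQuery());
        }
        if (url.getRef() != null) {
            sb.append('#').append(url.getRef());
        }
        return sb.toString();
    }
}
```

With this sketch, the path of a www.gladbeck.de URL is lowercased while its query parameters keep their case, and URLs on eservice2.gkd-re.de pass through unchanged.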

The regex normalizer has not supported \L since it was moved from ORO to Java regexes
(NUTCH-1013). Even with \L, it would be difficult (perhaps impossible) to formulate a proper
regular expression which catches only path elements for certain hosts.
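
To illustrate the \L point with plain java.util.regex (this is not the code of the regex normalizer itself, and the class and method names are made up): a backslash in a Java replacement string only escapes the following character, so \L turns into a literal "L" instead of switching to lowercase. Lowercasing a match needs explicit code, for example a find/appendReplacement loop.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowercaseReplacement {

    private static final Pattern UPPER = Pattern.compile("([A-Z]+)");

    /** Same pattern and substitution as in the regex-normalize.xml attempt. */
    static String withBackslashL(String s) {
        // The Java string "\\L$1" is the replacement \L$1: a literal 'L' followed by group 1.
        return UPPER.matcher(s).replaceAll("\\L$1");
    }

    /** Lowercases each match explicitly, since the replacement syntax cannot do it. */
    static String withExplicitLowercase(String s) {
        Matcher m = UPPER.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String lower = m.group(1).toLowerCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(lower));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withBackslashL("Leben_Wohnen"));        // LLeben_LWohnen
        System.out.println(withExplicitLowercase("Leben_Wohnen")); // leben_wohnen
    }
}
```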

Maybe it's best to find a pragmatic solution which could be one of:

- (if you are in contact with the web admins)
  * find the links causing the duplicates and fix them
    (duplicates are also an issue for SEO, so it may be worth doing the work)
  * it may be possible to configure Windows IIS to send redirects if the case does not match

- (if there are few of these duplicates)
  maintain a list of duplicates and send deletions to Solr just after each run of Nutch

- (if there are many duplicates)
  use "nutch dedup" to remove duplicates by content, but make sure that a signature
  is chosen (see property db.signature.class) that does recognize the duplicates
  * org.apache.nutch.crawl.MD5Signature may not work because the paths differing in case
    can appear in the HTML as hrefs
  * org.apache.nutch.crawl.TextMD5Signature should work but is only available since
    Nutch 1.10 (it's easily ported, see NUTCH-1693)
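
A rough sketch of why the choice of signature matters. It does not reproduce the actual Nutch signature classes; the class name, the helper methods and the crude tag stripping are assumptions made for illustration. Two pages that differ only in the case of an href get different digests over the raw HTML, but the same digest over the extracted text.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {

    /** MD5 of a string as 32 hex digits. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Very crude text extraction: drop tags, collapse whitespace (illustration only). */
    static String visibleText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String pageA = "<p>See <a href=\"/Leben_Wohnen/\">Living</a></p>";
        String pageB = "<p>See <a href=\"/leben_wohnen/\">Living</a></p>";

        // Digest over the raw HTML: the differing href case changes the signature.
        System.out.println(md5Hex(pageA).equals(md5Hex(pageB)));
        // Digest over the text only: both pages yield the same signature.
        System.out.println(md5Hex(visibleText(pageA)).equals(md5Hex(visibleText(pageB))));
    }
}
```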


Best,
Sebastian

[1] https://webmasters.stackexchange.com/questions/90339/why-are-urls-case-sensitive
[2] https://forums.iis.net/t/1165661.aspx


On 09/07/2017 05:37 PM, Schwank, Désirée wrote:

> [original message, quoted in full above]