[jira] Created: (NUTCH-659) Help! No urls fetched for internal repository website

Sebastian Nagel (Jira)
Help! No urls fetched for internal repository website
-----------------------------------------------------

                 Key: NUTCH-659
                 URL: https://issues.apache.org/jira/browse/NUTCH-659
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: nutch 0.9, TOMCAT6.0.18, JAVA 1.6.0_10, CentOS 5.2
            Reporter: Bryan
            Priority: Critical


I am new to Nutch and have set it up to search my company's internal websites. The version is nutch-2008-11-02_04-01-26.tar.

My internal company websites include several HTTP websites.

Another is an SVN repository served over HTTPS, with pages in an XML structure using <dir> and <file> tags.

Search over the HTTP websites works well.

HTTPS itself works: some links on the HTTP websites point to Word files under the SVN website, and those files are indexed.

But Nutch does not crawl the SVN website itself. If I crawl only the SVN website, the result is always: 0 urls fetched.

My nutch-site.xml is as follows:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
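For illustration, the plugin.includes value is a regular expression over plugin ids. The sketch below is written in Python rather than Nutch's own Java and assumes the expression has to match an entire plugin id. `is_included` is a hypothetical helper, not a Nutch API, and `protocol-http` and `parse-xml` appear only as example ids to test against.

```python
import re

# Value copied from the plugin.includes property in the post.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|urlfilter-regex|"
    r"parse-(text|html|js|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|"
    r"index-(basic|anchor)|query-(basic|site|url)|summary-basic|"
    r"scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str) -> bool:
    """Hypothetical helper: True if the whole id matches the expression.

    Assumption: the entire plugin id must match, not just a prefix.
    """
    return re.fullmatch(PLUGIN_INCLUDES, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("protocol-httpclient", "protocol-http",
                      "parse-msword", "parse-xml"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption, `protocol-httpclient` and `parse-msword` are selected by this value, while the example ids `protocol-http` and `parse-xml` are not.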

 

# skip file:, ftp:, & mailto: urls
-^(ftp|mailto):

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*smartlabs.com.au/
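One thing worth checking, though the thread does not confirm it as the cause: the accept rule above starts with `^http://`, so a URL beginning with `https://` would not match it. The sketch below, in Python rather than Nutch's Java, assumes the filter applies the rules in order, lets the first matching rule decide, and rejects a URL that no rule matches. The host names `intranet` and `svn` are hypothetical, and the `RULES_WITH_HTTPS` variant is an illustrative suggestion, not something taken from the post.

```python
import re

# Rules copied from the post, in order: (accept?, pattern).
RULES = [
    (False, re.compile(r"^(ftp|mailto):")),
    (True, re.compile(r"^http://([a-z0-9]*\.)*smartlabs.com.au/")),
]

# Illustrative variant (not from the post): also accept https, escape the dots.
RULES_WITH_HTTPS = [
    (False, re.compile(r"^(ftp|mailto):")),
    (True, re.compile(r"^https?://([a-z0-9]*\.)*smartlabs\.com\.au/")),
]


def passes_filter(url, rules):
    """First rule whose pattern is found in the URL decides.

    Assumption: a URL that matches no rule is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    # Hypothetical URLs; the real host names are not given in the post.
    for url in ("http://intranet.smartlabs.com.au/index.html",
                "https://svn.smartlabs.com.au/repos/"):
        print(url, passes_filter(url, RULES),
              passes_filter(url, RULES_WITH_HTTPS))
```

If the filter behaves as assumed, the `https://` URL is dropped by the original rules and accepted by the variant, which would be consistent with a crawl seeded only with the SVN site fetching nothing.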

 

Any help would be much appreciated. Thanks in advance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (NUTCH-659) Help! No urls fetched for internal repository website

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-659.
------------------------------------

    Resolution: Invalid

Please ask questions on the mailing list.
