[jira] Created: (NUTCH-599) nutch crawl and index problem

[jira] Created: (NUTCH-599) nutch crawl and index problem

JIRA jira@apache.org
nutch crawl and index problem
-----------------------------

                 Key: NUTCH-599
                 URL: https://issues.apache.org/jira/browse/NUTCH-599
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
         Environment: hadoop-0.12.2, java jdk1.6.0
            Reporter: sudarat
             Fix For: 0.9.0


First, I set the following rules:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# skip everything else
+.

I put these rules in conf/crawl-urlfilter.txt and ran the command "bin/nutch crawl urls -dir crawled -depth 3". I can crawl http://guide.kanook.com, but I can't crawl http://www.kapook.com. Why can't some web pages be crawled? Also, the index directory produced by the crawl doesn't contain the segments files that Nutch search needs; it contains only:

-rw-r--r-- 1 nutch users   365 Jan  7 16:47 _0.fdt
-rw-r--r-- 1 nutch users     8 Jan  7 16:47 _0.fdx
-rw-r--r-- 1 nutch users    66 Jan  7 16:47 _0.fnm
-rw-r--r-- 1 nutch users   370 Jan  7 16:47 _0.frq
-rw-r--r-- 1 nutch users     9 Jan  7 16:47 _0.nrm
-rw-r--r-- 1 nutch users   611 Jan  7 16:47 _0.prx
-rw-r--r-- 1 nutch users   135 Jan  7 16:47 _0.tii
-rw-r--r-- 1 nutch users 10553 Jan  7 16:47 _0.tis
-rw-r--r-- 1 nutch users     0 Jan  7 16:47 index.done
-rw-r--r-- 1 nutch users    41 Jan  7 16:47 segments_2
-rw-r--r-- 1 nutch users    20 Jan  7 16:47 segments.gen

How can I solve this?
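
To see how the filter rules quoted in this report treat a given URL, here is a minimal sketch. It assumes the rules are checked top to bottom, that the first matching rule decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. That is a reading of the comments shipped in crawl-urlfilter.txt, re-implemented in Python for illustration; it is not Nutch's own code, and the names `RULES` and `accepts` are hypothetical.

```python
import re

# Active rules from the report, in file order, as (sign, pattern).
# The image-suffix rule is commented out in the report, so it is omitted here.
RULES = [
    ("-", r"^(file|ftp|mailto):"),      # skip file:, ftp:, mailto: URLs
    ("-", r"[?*!@=]"),                  # skip probable query URLs
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),    # skip repeated path segments
    ("+", r"."),                        # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule matching `url` is a '+' rule.

    Assumption: a rule matches when its pattern is found anywhere in the URL.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: URLs matching no rule are rejected


if __name__ == "__main__":
    for url in (
        "http://guide.kanook.com",
        "http://www.kapook.com",
        "http://www.kapook.com/page?id=1",
    ):
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions both seed URLs from the report pass the filter, while any link containing `?` or `=` is dropped. Note that the final `+.` rule accepts every URL not rejected earlier, despite its "skip everything else" comment. This suggests the rules alone may not explain the failed crawl, but the sketch cannot confirm that.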


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Created: (NUTCH-599) nutch crawl and index problem

Susam Pal
I replied to this query of yours yesterday on
[hidden email]. If you haven't received the reply,
you are probably not subscribed to the nutch-user mailing list. If
you haven't subscribed, please do so by sending a blank mail to
[hidden email].

Nutch 0.9 works fine for us, so this is not a bug in Nutch 0.9. It
looks like a configuration problem at your end. Please discuss it
on [hidden email] instead of submitting it as a
bug in Nutch.

Regards,
Susam Pal

Re: [jira] Created: (NUTCH-599) nutch crawl and index problem

Susam Pal
I wanted to send this as a private reply but sent it to the list
instead. Sorry for the inconvenience.

[jira] Closed: (NUTCH-599) nutch crawl and index problem

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-599.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.9.0)
                   1.0.0
         Assignee: Doğacan Güney

Please use nutch-user for asking questions.
