two questions about nutch url filter when inject

two questions about nutch url filter when inject

beansproud
Hi all,

    I have two questions about URL filtering during inject.

    First, I found that during inject, Nutch uses "regex-urlfilter.txt" as its default filter. In that file, I found this regex:
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/
    I can't understand why this type of URL would cause loops. If anybody knows, please tell me.
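
    For illustration, here is a small Java sketch of what this rule matches. It assumes the leading "-" is only the exclude sign and that the rest of the line is a regular expression searched for anywhere in the URL. The class name, method name, and example.com URLs are made up for this example and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class LoopRuleDemo {
    // The rule from regex-urlfilter.txt without its leading '-'
    // (assumption: '-' only marks the rule as an exclude rule).
    static final Pattern LOOP = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Assumption: a rule applies when the pattern is found anywhere in the URL.
    static boolean isRejected(String url) {
        return LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, chosen only to exercise the pattern.
        String[] urls = {
            "http://example.com/a/b/a/b/a/b/",   // "/a" occurs three times
            "http://example.com/a/b/a/b/",       // "/a" occurs only twice
            "http://example.com/docs/index.html" // no repeated segment
        };
        for (String url : urls) {
            System.out.println((isRejected(url) ? "rejected: " : "kept:     ") + url);
        }
    }
}
```

    A URL like the first one is typical of a crawler trap, where a badly resolved relative link keeps appending the same path segments on every fetch. That is presumably the kind of loop the comment in the file refers to.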

    Second, when I changed this file, the output of Nutch didn't show any change. Only after I recompiled did it change. This took me 3 hours to figure out. Can anybody tell me why?

Re: two questions about nutch url filter when inject

Eric J. Christeson-2

On Jun 18, 2008, at 9:38 AM, beansproud wrote:


> Second, when I changed this file, the output of nutch doesn't show any
> change. And when I recompiled, it changed. This takes me 3 hours, can anybody
> tell me why ?

What changes did you make to the file, and what specifically changed  
when you recompiled?

eric

--
Eric J. Christeson                                  
<[hidden email]>
Information Technology Services         (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs which
are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law



Re: two questions about nutch url filter when inject

beansproud

Eric J. Christeson-2 wrote
On Jun 18, 2008, at 9:38 AM, beansproud wrote:


> Second, when I changed this file, the output of nutch doesn't show any
> change. And when I recompiled, it changed. This takes me 3 hours, can anybody
> tell me why ?

What changes did you make to the file, and what specifically changed  
when you recompiled?

I disabled the query filter:
-[?*!@=]
by commenting it out:
#-[?*!@=]
so that I can crawl URLs that contain a query string.

Before I recompiled, Nutch couldn't fetch URLs with query strings. After I recompiled, it worked.
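
To make the effect of that rule concrete, here is a minimal Java sketch. It assumes the rule body is a regular expression searched for anywhere in the URL and that a commented-out rule is simply not loaded. The class name, the ruleEnabled flag, and the example.com URLs are invented for this example and are not Nutch code.

```java
import java.util.regex.Pattern;

public class QueryRuleDemo {
    // The rule "-[?*!@=]" without its leading '-' (the exclude sign).
    static final Pattern QUERY_CHARS = Pattern.compile("[?*!@=]");

    // ruleEnabled = false models the commented-out line "#-[?*!@=]"
    // (assumption: a commented rule is simply ignored).
    static boolean isRejected(String url, boolean ruleEnabled) {
        return ruleEnabled && QUERY_CHARS.matcher(url).find();
    }

    public static void main(String[] args) {
        String queryUrl = "http://example.com/search?q=nutch"; // hypothetical URL
        System.out.println("rule active:        rejected=" + isRejected(queryUrl, true));
        System.out.println("rule commented out: rejected=" + isRejected(queryUrl, false));
    }
}
```

With the rule active, any URL containing ?, *, !, @ or = is dropped. With the rule commented out, the same URL passes this check.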
