Code to be modified

Code to be modified

Vineet Garg-3
Hi,

I have read that to stop Nutch from crawling parent directories, the
following code has to be modified:

org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).

I have just started using Nutch and cannot find where this code is
located. Can anybody tell me how I can modify it?

Vineet Garg

Re: Code to be modified

Martin Kuen
Hi,

I know that this advice can be found in several places on the internet.
However, it is not true that you have to modify code to achieve this.

See the FAQ in the Nutch wiki: "Nutch crawling parent directories for file
protocol -> misconfigured URLFilters"
<http://wiki.apache.org/nutch/FAQ#head-f64e7589b2f12792d6d781f3db23840a8f3a1e10>

You can achieve the desired behaviour by adjusting your regexes.


Hope it helps,

martin


On Fri, Mar 28, 2008 at 12:32 PM, Vineet Garg <[hidden email]> wrote:

> Hi,
>
> I have read that to stop Nutch from crawling parent directories, the
> following code has to be modified:
>
> org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
>
> I have just started using Nutch and cannot find where this code is
> located. Can anybody tell me how I can modify it?
>
> Vineet Garg

Re: Code to be modified

Vineet Garg-3
Hi

I configured the regex-urlfilter.txt file and included the following lines:

+^file:///hm/vineetg/SPD38/libraries/
+^file:///hm/vineetg/SPD38/share/doc/
-.

because I want to crawl the /hm/vineetg/SPD38/libraries/ and
/hm/vineetg/SPD38/share/doc/ directories. However, it is still crawling
the parent directories and generating errors too.

Did I configure the regex-urlfilter.txt file correctly?

Vineet
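
For readers following along, here is a rough Java sketch of how an ordered list of +/- regex rules like the one above could be evaluated. It is a simplified model, not Nutch's own code: the class name RegexRuleSketch and its methods are hypothetical, and it assumes that rules are tried top to bottom, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified model of an ordered +/- regex rule list (hypothetical, not a Nutch class). */
class RegexRuleSketch {

    /** One rule: '+' means accept, '-' means reject. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines such as "+^file:///some/dir/" or "-."; blank lines and '#' comments are skipped. */
    RegexRuleSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Assumed semantics: the first rule whose pattern is found in the URL wins; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexRuleSketch filter = new RegexRuleSketch(Arrays.asList(
                "+^file:///hm/vineetg/SPD38/libraries/",
                "+^file:///hm/vineetg/SPD38/share/doc/",
                "-."));
        // A URL below one of the two listed directories
        System.out.println(filter.accepts("file:///hm/vineetg/SPD38/libraries/index.html")); // true
        // The parent directory itself only matches the final "-." rule
        System.out.println(filter.accepts("file:///hm/vineetg/SPD38/")); // false
    }
}
```

Under these assumptions the three rules would reject the parent directory, so if parent directories are still being fetched it is worth checking whether this rule file is the one actually being read.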


Re: Code to be modified

Martin Kuen
Hm, okay. I actually thought this would fix your problem.

I think your configuration is fine. However, do you use the "crawl"
command? In that case "crawl-urlfilter.txt" is used, not
"regex-urlfilter.txt".

What kind of errors are generated? Please post them as well.


best regards,

martin


On Mon, Mar 31, 2008 at 8:39 AM, Vineet Garg <[hidden email]> wrote:

> Hi
>
> I configured the regex-urlfilter.txt file and included the following lines:
>
> +^file:///hm/vineetg/SPD38/libraries/
> +^file:///hm/vineetg/SPD38/share/doc/
> -.
>
> because I want to crawl the /hm/vineetg/SPD38/libraries/ and
> /hm/vineetg/SPD38/share/doc/ directories. However, it is still crawling
> the parent directories and generating errors too.
>
> Did I configure the regex-urlfilter.txt file correctly?
>
> Vineet

Re: Code to be modified

Vineet Garg-3
Hi

Yes, I am using the "crawl" command. I have corrected those errors, but
Nutch is still crawling parent directories. How can I resolve this?

Regards,
vineet




Re: Code to be modified

Dennis Kubes-2
You may want to try the prefix-urlfilter instead of the regex urlfilter.
The prefix urlfilter is configured in the conf/prefix-urlfilter.txt file
and needs to be enabled in the plugin.includes property in
nutch-site.xml.

Dennis
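
To illustrate the idea behind prefix-based filtering, here is a minimal Java sketch. It is not Nutch's implementation: the class PrefixRuleSketch is hypothetical, and it simply assumes that a URL is accepted when it starts with one of the listed prefixes and is rejected otherwise.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified prefix check (hypothetical, not a Nutch class). */
class PrefixRuleSketch {

    private final List<String> prefixes;

    PrefixRuleSketch(List<String> prefixes) {
        this.prefixes = prefixes;
    }

    /** Assumed semantics: accept only URLs that begin with one of the configured prefixes. */
    boolean accepts(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PrefixRuleSketch filter = new PrefixRuleSketch(Arrays.asList(
                "file:///hm/vineetg/SPD38/libraries/",
                "file:///hm/vineetg/SPD38/share/doc/"));
        // Inside one of the listed directories
        System.out.println(filter.accepts("file:///hm/vineetg/SPD38/share/doc/readme.txt")); // true
        // The parent directory does not start with either prefix
        System.out.println(filter.accepts("file:///hm/vineetg/SPD38/")); // false
    }
}
```

Because a plain prefix comparison involves no regex syntax, there is less room for a pattern to match more than intended.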
