Quite basic questions


Kai Hagemeister
Hello,

I have a few basic questions and hope that somebody can assist.
I'm trying to search several domains. It seems fairly simple to crawl
one particular domain (intranet search), which is defined in the
configuration file, but this seems to be limited to that one specified
domain. I could also search the whole web (web search) by supplying
different URLs via a URL file. But I want to crawl complete domains
without going outside them.
So, if I hand over the URLs bla.com and blub.net, only pages from these
domains should be fetched. I tried setting the parameter follow
outsitelinks to 0, but then links inside the domain were ignored as well.
Is there a way to accomplish this? I mean a way other than changing the
sourcecode :-).
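
One approach that might work without touching the source, assuming the installed Nutch version applies a regex URL filter file made of "+pattern" and "-pattern" lines in which the first matching line decides, is to accept only the wanted hosts and reject everything else. The sketch below only prints such rules; domain_rules is a hypothetical helper, not part of Nutch, and which file the rules belong in depends on the setup.

```shell
#!/bin/sh
# Sketch: print URL-filter rules that accept only the given domains.
# Assumption: the filter file uses "+regex" (accept) and "-regex" (reject)
# lines, and the first matching line wins.

domain_rules() {
    for domain in "$@"; do
        # Escape dots so that "bla.com" cannot match "blaXcom".
        escaped=$(printf '%s' "$domain" | sed 's/\./\\./g')
        # Accept the domain itself and any of its subdomains.
        printf '+^http://([a-z0-9-]*\\.)*%s/\n' "$escaped"
    done
    # Reject every URL that did not match one of the rules above.
    printf '%s\n' '-.'
}

domain_rules bla.com blub.net
```

The printed lines would then be placed in the URL filter file that the crawl actually reads. Note that each accept rule expects a slash after the host name, so seed URLs would need to be written like http://bla.com/.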
Furthermore, I created a directory db for the database and one for
segments. Then I started Tomcat from the parent directory of segments. The
Java class seems to look for a child directory named segments under the
current working directory. The problem: after each update of the index I
have to restart Tomcat :-(. It gets worse: each time I start the processes,
I must delete the database and the segments.
How do I set up a reasonable fetching cycle? Could somebody give an
example?
My idea would be to put the following snippet in an endless loop and run
it with nohup:

bin/nutch generate db segments -topN 1000
s1=`ls -d segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch updatedb db $s1
bin/nutch index $s1
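
As a rough sketch of the endless-loop idea, the five commands above could be wrapped in a script like the one below. The variable names (NUTCH, DB, SEGMENTS, TOPN, PAUSE), the --loop switch and the one-hour pause are assumptions made for illustration, not Nutch settings; only the bin/nutch commands themselves come from the snippet above.

```shell
#!/bin/sh
# Sketch of a repeated generate/fetch/updatedb/index cycle.
# All variable names here are illustrative, not Nutch configuration.
NUTCH=${NUTCH:-bin/nutch}
DB=${DB:-db}
SEGMENTS=${SEGMENTS:-segments}
TOPN=${TOPN:-1000}     # passed to -topN, as in the original snippet
PAUSE=${PAUSE:-3600}   # seconds to wait between cycles (arbitrary choice)

# Print the newest segment directory. Segment names start with a
# timestamp, so the last entry in sorted order is the most recent one.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Run one cycle and stop at the first failing step, so that a failed
# fetch is not fed into updatedb and index.
run_cycle() {
    "$NUTCH" generate "$DB" "$SEGMENTS" -topN "$TOPN" || return 1
    seg=$(latest_segment "$SEGMENTS")
    [ -n "$seg" ] || return 1
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$DB" "$seg" || return 1
    "$NUTCH" index "$seg" || return 1
}

# Loop only when asked to, so the functions can be tried on their own.
if [ "${1:-}" = "--loop" ]; then
    while :; do
        run_cycle || echo "cycle failed, retrying after pause" >&2
        sleep "$PAUSE"
    done
fi
```

Saved under a hypothetical name such as cycle.sh, it could be started with: nohup sh cycle.sh --loop &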

Would this be advisable? And can somebody explain the meaning of -topN 1000?
Is there no other way than restarting Tomcat?
I would appreciate any assistance.
Best regards
Kai


--
P L A N O M E D I A

Re: [Nutch-general] NewbieNutcher....

Kai Hagemeister
Hello Niclas,

After calling ./nutch updatedb db $s1, you must index the fetched pages by
calling ./nutch index $s1.

Kai




Re: Quite basic questions

Gal Nitzan
In reply to this post by Kai Hagemeister
Hi,

Take a look here: http://issues.apache.org/jira/browse/NUTCH-100

If you have further questions...

Regards,

Gal




Re: Quite basic questions

Kai Hagemeister
Hello Gal,

thanks for your reply.

> Take a look here: http://issues.apache.org/jira/browse/NUTCH-100
>
> If you have further questions...

I have a problem with nutch-extensionpoints: there are no Java source
files in src, so I can't compile.
I tried removing the entry for nutch-extensionpoints from build.xml so
that I could compile the sources without it, but it seems that
nutch-extensionpoints is vital.
Any ideas?

Kai
