Injecting Into Intranet Crawl

Robert Sanford
I'm running version 0.7.2 and using the intranet crawl, where I
specify a list of site root URIs in a text file along with a list of
regexes for allowed URIs.
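
For readers unfamiliar with this kind of setup, a minimal sketch of how such an allow-list behaves may help. It assumes the filter file uses one `+regex` or `-regex` rule per line, that the first matching rule decides, and that a URL matching no rule is dropped. The rules and the example.com hosts are made up for illustration, and Python's `re.search` only approximates the Java regex matching that Nutch performs.

```python
import re

# Hypothetical rules in the style of a Nutch URL filter file.
# '+' means include, '-' means exclude; lines starting with '#' are comments.
RULES_TEXT = r"""
# skip image and archive suffixes
-\.(gif|jpg|png|zip|gz)$
# accept anything on the (made-up) intranet hosts
+^http://([a-z0-9]*\.)*example\.com/
# reject everything else
-.
"""

def parse_rules(text):
    """Return a list of (allow, compiled_regex) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def url_allowed(rules, url):
    """The first matching rule decides; no match means the URL is rejected."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False

rules = parse_rules(RULES_TEXT)
for url in ("http://wiki.example.com/index.html",
            "http://wiki.example.com/logo.gif",
            "http://other.org/"):
    print(url, url_allowed(rules, url))
```

Because rule order matters in this sketch, the suffix exclusions sit above the host rule; swapping them would let the image URL through.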

The question that I have is how to inject a new site into the crawl.

If I simply add a site URI to the file, I have to restart the crawl
from scratch, and I can't reuse the output directory from the previous
run. When the new crawl finishes, I have to copy it over the old one and
then restart my app server. That doesn't make sense. I really want to
just give it a new site root and have it added to the index.

Is that possible using the intranet config option?

rjsjr

Re: Injecting Into Intranet Crawl

Thomas Delnoij-3
For something like this, it's best to use the whole-web crawling concepts explained in the tutorial.

Rgrds, Thomas

On 7/25/06, Robert Sanford <[hidden email]> wrote:

> I'm running version 0.7.2 and I'm using the Intranet crawl where I
> specify a list of site root URIs in a text file along with a list of
> regex for allowed URIs.
>
> The question that I have is how to inject a new site into the crawl.
>
> If I simply add a site URI into the file I have to completely restart
> the crawl and can't use the same output directory as I used previously
> and when that finishes I have to copy over the old one and then restart
> my app server. That doesn't make sense... I really want to just give it
> a new site root and have it added to the index.
>
> Is that possible using the intranet config option?
>
> rjsjr
>

RE: Injecting Into Intranet Crawl

Robert Sanford
> -----Original Message-----
> From: Thomas Delnoij [mailto:[hidden email]]
> Sent: Tuesday, July 25, 2006 2:53 PM
> To: [hidden email]
> Subject: Re: Injecting Into Intranet Crawl
>
> For stuff like this best use whole web concepts as explained
> in the tutorial.
>
> Rgrds, Thomas

The tutorial suggests using a segment of the DMOZ directory, which really
doesn't work for me, as I only want to index a specific collection of
sites. But that tutorial does use the "inject" command, which may
actually be useful.

From the CommandLine Options page in the wiki I find...

Usage: bin/nutch inject (-local | -ndfs <namenode:port>) <db_dir>
(-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset
<subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc]
[-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]

So I would use something like
  bin/nutch inject crawl.out urls.txt

where "crawl.out" is the result of my original crawl and "urls.txt" is
my original list of home pages. Or is "urls.txt" supposed to be a file
containing only the home pages to be injected? Unlike the "crawl"
command, the wiki doesn't list what each of the options means, so I have
to guess. My assumptions based on that help text are:
1 - My urls.txt file will be modified by the inject command and
2 - My crawl.out directory will be updated with index information from
the injected site. I think I may have to run some additional commands to
get the index updated, but I'm not 100% sure. Maybe the maintenance shell
script from http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
is what I need to rescan all of the sites I want indexed?
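
Read literally, the usage string puts the database directory first and names the seed list with `-urlfile`, which suggests the file is an input listing the URLs to add rather than something the command rewrites. A small sketch of assembling that command line follows. The `crawl.out/db` location of the web database and the `new-sites.txt` file name are assumptions that nothing in this thread confirms, and the function only builds the argument list without running Nutch.

```python
import shlex

def inject_argv(db_dir, url_file, nutch="bin/nutch", local=True):
    """Build an argument list that follows the quoted usage string:
    inject (-local | -ndfs <namenode:port>) <db_dir> -urlfile <url_file>
    Only the -local variant is covered here.
    """
    if not db_dir or not url_file:
        raise ValueError("db_dir and url_file are both required")
    argv = [nutch, "inject"]
    if local:
        argv.append("-local")
    argv += [db_dir, "-urlfile", url_file]
    return argv

# Hypothetical paths: both the db location inside the crawl output
# directory and the file holding only the new site roots are assumptions.
argv = inject_argv("crawl.out/db", "new-sites.txt")
print(shlex.join(argv))
```

Keeping the new site roots in their own file, separate from the original seed list, is a choice made for the sketch so that the printed command shows exactly what would be added.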

Assuming that I eventually get the syntax of the inject command correct,
I still have to ask about conf/crawl-urlfilter.txt, because I modified
that to only allow the URIs that I want crawled. Does the inject command
modify that file, or do I have to add those domains manually?
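
Nothing in the usage string suggests that inject touches the filter file, so presumably new domains have to be added by hand. A minimal sketch of generating the line for a new site root follows, assuming the `+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/` pattern style from the intranet section of the tutorial. The host name is invented, and escaping the dots and stripping a leading `www.` are extra choices made for the sketch, not requirements of the tutorial.

```python
import re
from urllib.parse import urlsplit

def filter_line_for(site_root):
    """Return a '+' rule that accepts the host of site_root and its subdomains."""
    parts = urlsplit(site_root)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError(f"expected an http:// site root, got {site_root!r}")
    host = parts.hostname
    # Drop a leading "www." so sibling hosts such as docs.<domain> also match.
    if host.startswith("www."):
        host = host[4:]
    return "+^http://([a-z0-9]*\\.)*" + re.escape(host) + "/"

# Hypothetical site root used only for illustration.
print(filter_line_for("http://www.example.com/"))
```

The printed line would be pasted into the filter file above any catch-all reject rule, since rule order is assumed to matter.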

Many thanks!

rjsjr

