Intranet Crawling and Whole-web Crawling

Intranet Crawling and Whole-web Crawling

Berlin Brown
I noticed in the documentation that you can do both whole-web crawling
and intranet crawling.  My question: can you combine a database built by
crawling a provided set of URLs with the database you created through
whole-web crawling?

For example, this creates a directory crawl.test, which contains a database:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

bin/nutch admin new/db -create

Here, I am creating a database in the directory 'new'.  Can I add the
two databases together?  For example, say I run a whole-web crawl and
then want to crawl a separate set of URLs: can I add those to the same
index?

bin/nutch crawl urls -dir new -depth 3 >& crawl.log

?

Re: Intranet Crawling and Whole-web Crawling

Thomas Delnoij-3
You can merge the indexes together using the nutch merge command.

Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...
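
As a rough illustration of that usage line, here is a minimal sketch that assembles a merge command for the two crawl directories from the question. It assumes each crawl keeps its segments under <crawl-dir>/segments/ and that those segments have already been indexed; the helper build_merge_cmd and the output path merged.index are hypothetical names, not part of Nutch. It only prints the command so you can check the paths before running it.

```shell
#!/bin/sh
# Sketch: build a "bin/nutch merge" command line from several crawl directories.
# Assumptions (not from the thread): segments live in <crawl-dir>/segments/,
# and "merged.index" is a hypothetical output index path.

build_merge_cmd() {
    out_index=$1
    shift
    cmd="bin/nutch merge -local $out_index"
    for crawl_dir in "$@"; do
        for seg in "$crawl_dir"/segments/*; do
            # Only append real directories; this skips an unexpanded glob
            # when a crawl directory has no segments.
            if [ -d "$seg" ]; then
                cmd="$cmd $seg"
            fi
        done
    done
    printf '%s\n' "$cmd"
}

# Print the command for review instead of executing it.
build_merge_cmd merged.index crawl.test new
```

If the printed command looks right, run it from the Nutch install directory. Segment paths containing spaces would need quoting that this sketch does not handle.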

Rgrds, Thomas


On 3/18/06, Berlin Brown <[hidden email]> wrote:

>
> I noticed in the documentation that you can do whole web crawling and
> intranet.  My question, can you combine a database that you crawl
> through a set provided URLs to the database you created with whole-web
> crawling.
>
> For example, here you create a directory crawl.test, this contains a
> database.
>
> bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
>
> bin/nutch admin new/db -create
>
> Here, I am creating a database in the directory 'new'.  Can I add the
> two databases together.  For example, let me say I run through
> whole-web crawling and then I want to crawl a set of URLs, can I add
> those to the index.
>
> bin/nutch crawl urls -dir new -depth 3 >& crawl.log
>
> ?
>

Re: Intranet Crawling and Whole-web Crawling

Berlin Brown
Thanks a lot, that is exactly it.

On 3/19/06, TDLN <[hidden email]> wrote:

> You can merge the indexes together using the nutch merge command.
>
> Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir
> <workingdir>] outputIndex segments...
>
> Rgrds, Thomas