intranet crawling

intranet crawling

Edward Quick

Hi,

I want to do an exhaustive scan of our intranet but running

bin/nutch crawl urls -dir crawl -depth 9 -topN 50

doesn't get everything. I've increased this now to

bin/nutch crawl urls -dir crawl -depth 30 -topN 1000

and it's certainly running longer but I'm not sure if this will still miss any pages. Is there any way of doing this so I get an index of the whole intranet?

Thanks,

Ed.

Re: intranet crawling

David Jashi
It may be a crude solution to that problem, but when I wanted ALL of my
video hosting site indexed, I simply generated a list like

http://tvali.ge/index.php?action=watch&v=495
http://tvali.ge/index.php?action=watch&v=496
http://tvali.ge/index.php?action=watch&v=497
....

from the MySQL table containing the list of posts, and put it into the urls dir.
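
A minimal sketch of that generation step, in Python. It only assumes that the post ids are already available as a list of integers. The function names, the seed.txt file name, and the query mentioned in the comment are hypothetical and not taken from the real site; only the URL pattern comes from the example list above.

```python
from pathlib import Path

# URL pattern copied from the example list above; only the id changes.
URL_TEMPLATE = "http://tvali.ge/index.php?action=watch&v={post_id}"


def build_seed_urls(post_ids):
    """Return one watch-page URL per post id, de-duplicated and sorted by id."""
    return [URL_TEMPLATE.format(post_id=i) for i in sorted(set(post_ids))]


def write_seed_file(urls, urls_dir="urls", filename="seed.txt"):
    """Write the URLs, one per line, into the directory given to bin/nutch crawl."""
    target = Path(urls_dir)
    target.mkdir(parents=True, exist_ok=True)
    seed_path = target / filename
    seed_path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_path


if __name__ == "__main__":
    # Hypothetical ids; in practice they would come from a query such as
    # SELECT id FROM posts (table and column names are assumptions).
    ids = [495, 496, 497]
    print(write_seed_file(build_seed_urls(ids)))
```

With every page listed as a seed, the crawl no longer depends on link discovery to reach those pages. It is still worth checking the URL filter configuration, since filter rules can exclude URLs with query strings like these.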

On Thu, Sep 4, 2008 at 6:56 PM, Edward Quick <[hidden email]> wrote:

>
> Hi,
>
> I want to do an exhaustive scan of our intranet but running
>
> bin/nutch crawl urls -dir crawl -depth 9 -topN 50
>
> doesn't get everything. I've increased this now to
>
> bin/nutch crawl urls -dir crawl -depth 30 -topN 1000
>
> and it's certainly running longer but I'm not sure if this will still miss any pages. Is there any way of doing this so I get an index of the whole intranet?
>
> Thanks,
>
> Ed.



--
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[hidden email]
