crawl and update one url already in crawldb


crawl and update one url already in crawldb

webdev1977
I have created an application that can detect when files are created/modified/deleted on one of our Windows share drives, and I would like to know if it is possible, upon such a notification, to crawl just a single URL that is already in the crawldb.

I think it is possible to run individual new crawls for each URL with the goal of merging the linkdbs and crawldbs at some point (once a night), but I wonder if there is a more efficient way of doing this. The other obstacle is that the main crawldb is part of a continuous looping crawl that technically could never end (unless I force it to). Would it be an issue to update a database that could potentially be locked at any point in time?

Thanks!

Re: crawl and update one url already in crawldb

Markus Jelsma-2


On Thursday 22 March 2012 13:53:02 webdev1977 wrote:
> I have created an application that can detect when files are
> created/modified/deleted on one of our Windows share drives, and I would
> like to know if it is possible, upon such a notification, to crawl just a
> single URL that is already in the crawldb.
>

Easiest would be to use the freegenerator tool to generate a segment from a
plain text file with seed URLs, much like the injector. That segment can then
later be joined with other segments when updating the crawldb.
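
For example (paths and the seed file are made up, and this assumes a 1.x
bin/nutch that exposes the tool as freegen):

  # one changed URL per line in a plain text seed file
  echo "file:////fileserver/share/docs/report.doc" > urls/changed.txt

  # generate a segment straight from the seed file, bypassing the crawldb
  bin/nutch freegen urls/changed.txt crawl/segments

  # fetch and parse it like any other segment
  SEGMENT=crawl/segments/`ls -t crawl/segments | head -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT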

> I think it is possible to run individual new crawls for each URL with the
> goal of merging the linkdbs and crawldbs at some point (once a night), but
> I wonder if there is a more efficient way of doing this. The other
> obstacle is that the main crawldb is part of a continuous looping crawl
> that technically could never end (unless I force it to). Would it be an
> issue to update a database that could potentially be locked at any point
> in time?
>
> Thanks!

--
Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

webdev1977
Thanks for the quick response, Markus!

How would that fit into this continuous crawling scenario? (I am trying to get the updates into Solr as quickly as possible. :-)

If I am doing the generate --> fetch $SEGMENT --> parse $SEGMENT --> updatedb crawldb $SEGMENT --> solrindex --> solrdedup cycle, and I happen to be generating an "on the fly" segment (not yet done) when the updatedb command runs (changing it to the -dir option), isn't that bad?
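
For reference, my cycle is roughly this (paths and the Solr URL are placeholders):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=crawl/segments/`ls -t crawl/segments | head -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
  bin/nutch solrdedup http://localhost:8983/solr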

Has anyone tested the mergedb command with potentially hundreds and hundreds of dbs to merge (one per changed URL)?

Re: crawl and update one url already in crawldb

Markus Jelsma-2


On Thursday 22 March 2012 14:10:41 webdev1977 wrote:

> Thanks for the quick response, Markus!
>
> How would that fit into this continuous crawling scenario? (I am trying
> to get the updates into Solr as quickly as possible. :-)
>
> If I am doing the generate --> fetch $SEGMENT --> parse $SEGMENT -->
> updatedb crawldb $SEGMENT --> solrindex --> solrdedup cycle, and I happen
> to be generating an "on the fly" segment (not yet done) when the updatedb
> command runs (changing it to the -dir option), isn't that bad?

You can just fetch and parse that tiny segment and have it updated into the
crawldb together with another segment; you don't have to update with only one
segment. -dir is fine, but you can also list the segments explicitly.
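
For example (segment names made up):

  # one updatedb run covering the regular segment and the tiny
  # on-the-fly segment together
  bin/nutch updatedb crawl/crawldb crawl/segments/20120322131500 crawl/segments/20120322140000

  # or let -dir pick up everything under the segments directory
  bin/nutch updatedb crawl/crawldb -dir crawl/segments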


> Has anyone tested the mergedb command with potentially hundreds and
> hundreds of dbs to merge (one per changed URL)?

I wouldn't try that. More scripting and locking horror, and it's an I/O
consumer.


--
Markus Jelsma - CTO - Openindex

Re: crawl and update one url already in crawldb

webdev1977
I just tried it out and so far, so good. Not a near-instant solution, but it works ;-) One last question...

If I am running a bunch of bin/nutch commands from the same directory, I seem to be having an issue. I am assuming it is with the mapred system and its various tmp files (running in local mode). Is it possible to run multiple commands using the same Nutch directory without causing conflicts?

Re: crawl and update one url already in crawldb

Markus Jelsma-2
Use Hadoop or set the hadoop.tmp.dir per job. If you don't, things will break.
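
For example (tmp paths made up; this assumes the command accepts Hadoop
generic options, which the 1.x tools run through ToolRunner do):

  # give each concurrent local-mode command its own tmp dir
  bin/nutch fetch -Dhadoop.tmp.dir=/tmp/nutch-job1 $SEGMENT_A &
  bin/nutch fetch -Dhadoop.tmp.dir=/tmp/nutch-job2 $SEGMENT_B &
  wait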

On Thursday 22 March 2012 15:29:50 webdev1977 wrote:

> I just tried it out and so far, so good. Not a near-instant solution, but
> it works ;-) One last question...
>
> If I am running a bunch of bin/nutch commands from the same directory, I
> seem to be having an issue. I am assuming it is with the mapred system
> and its various tmp files (running in local mode). Is it possible to run
> multiple commands using the same Nutch directory without causing
> conflicts?

--
Markus Jelsma - CTO - Openindex