Question about crawldb and segments


Question about crawldb and segments

Jason Camp
Hi,
    I've been using Nutch 0.7 for a few months, and recently started
working with 0.8.  I'm testing everything right now on a single server,
using the local file system.  I generated 10 segments with 100k urls in
each and fetched the content.  Then I ran updatedb, but it looks like
the crawldb isn't working properly.  For example, I ran the updatedb
command on one segment, and -stats shows this:

060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447

I then ran the updatedb against the next segment, and -stats now shows this:

060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377


Any idea why the number of fetched urls would actually go down? What I
*think* is happening is that the crawldb only contains the data from the
most recent updatedb, not the accumulated total from all of them. Does
this make sense? I ran the test doing each segment and running -stats,
and they are all around 80k for fetched and 1.7m for unfetched, but the
numbers don't seem to be accumulating.
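
For quick comparison, the two -stats dumps above can be diffed with a
short throwaway script (not part of Nutch; the regex just matches the
status lines as printed):

```python
# Diff two Nutch `readdb -stats` dumps to see whether counts accumulate.
import re

def parse_stats(text):
    """Pull {status_name: count} out of lines like
    '060409 140035 status 2 (DB_fetched):    82608'."""
    stats = {}
    for line in text.splitlines():
        m = re.search(r"\((DB_\w+)\):\s+(\d+)", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

before = parse_stats("""
060409 140035 status 1 (DB_unfetched):  1732457
060409 140035 status 2 (DB_fetched):    82608
060409 140035 status 3 (DB_gone):       3447
""")
after = parse_stats("""
060409 150737 status 1 (DB_unfetched):  1777642
060409 150737 status 2 (DB_fetched):    81629
060409 150737 status 3 (DB_gone):       3377
""")
for key in sorted(before):
    print(key, after[key] - before[key])
# DB_fetched actually drops by 979 instead of growing by ~80k
```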

Since readsegs is broken in 0.8, I can't really get an idea of what is
actually in the segments. Is there an alternative way to see how many
urls are actually in the segment and fetched?

If you have any ideas, please let me know. Thanks a lot!

Jason


Re: Question about crawldb and segments

Jason Camp
Doh, I think I found the problem. After using Luke to dig through
the indexed segments, it looks like all of the segments I generated
contain exactly the same urls.  When you generate a segment with the top
100k urls, I'm guessing they are not marked in any way to prevent the
next generate from grabbing the same urls? I'd like to generate multiple
segments in a row and send them off to another server - is this possible
using the local file system?


Jason




Re: Question about crawldb and segments

Andrzej Białecki
Jason Camp wrote:
> Doh, I think I found out the problem. After using luke to dig through
> the indexed segments, it looks like all of the segments that I
> generated contain the same exact urls.  When you generate a segment
> with the top 100k urls, I'm guessing they are not marked in any way to
> prevent the next generate from grabbing the same urls? I'd like to
> generate multiple segments in a row, and send them off to another
> server, is this possible using the local file system?

No, at the moment they are not marked in any way. This is on my TODO
list, but not with a high priority.
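
Until then, a workaround is to interleave updatedb between generates, so
each generate sees the previous fetch reflected in the crawldb (a
sketch; crawl/crawldb, crawl/segments, and -topN are placeholders):

```shell
# one round of the cycle; repeat as needed
bin/nutch generate crawl/crawldb crawl/segments -topN 100000
seg=`ls -d crawl/segments/* | tail -1`   # the segment just generated
bin/nutch fetch $seg
bin/nutch updatedb crawl/crawldb $seg
# the next generate now skips urls already marked DB_fetched
```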

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Question about crawldb and segments

Doug Cutting
In reply to this post by Jason Camp
Jason Camp wrote:
> I'd like to generate multiple
> segments in a row, and send them off to another server, is this possible
> using the local file system?

The Hadoop-based Nutch now automates multiple, parallel fetches for you,
so there is less need to manually generate multiple segments.  Try
configuring your servers as slaves (by adding them to conf/slaves) and
configuring a master (by setting fs.default.name and mapred.job.tracker
in conf/hadoop-site.xml), then use bin/start-all.sh to start the
daemons.  Then copy your root url directory to dfs with something like
'bin/hadoop dfs -put roots roots'.  Then you can run a multi-machine
crawl with 'bin/nutch crawl'.  Or if you need finer-grained control, you
can still step through the inject, generate, fetch, updatedb, generate,
fetch, ... cycle, except now each step runs across all slave nodes.

This is outlined in the Hadoop javadoc:

http://lucene.apache.org/hadoop/docs/api/overview-summary.html#overview_description

Doug

Re: Question about crawldb and segments

Jason Camp
    There's one problem with this in my environment - I'm basically
running two datacenters with different uses. Hopefully this won't sound
too confusing :)  We're using one datacenter to do the fetching, with a
hadoop cluster of servers, and one datacenter to store the content and
make it available for searching, with another cluster of hadoop servers.
My understanding is that when you make servers into slaves for a
cluster, there's no way to define specifically what they do - i.e.,
these servers just do fetching, these servers just do indexing. Is that
correct? Unfortunately, in our scenario bandwidth is cheap at our
fetching datacenter, but adding additional disk capacity is expensive -
so we are fetching the data and sending it back to the other cluster (by
exporting segments from ndfs, copying, and importing).
    I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster and define
which servers I want to perform which tasks, so for instance we could
use one group of servers to fetch the data, and the other group of
servers to store the data and perform the indexing/etc. If there's a
better way to do something like this than what we're doing, or if you
think we're just insane for doing it this way, please let me know :) Thanks!

Jason





Re: Question about crawldb and segments

Doug Cutting
Jason Camp wrote:
> Unfortunately, in our scenario bandwidth is cheap at our fetching
> datacenter, but adding additional disk capacity is expensive - so we
> are fetching the data and sending it back to the other cluster (by
> exporting segments from ndfs, copying, and importing).

But to perform the copies, you're using a lot of bandwidth to your
"indexing" datacenter, no?  Copying segments probably takes almost as
much bandwidth as fetching them...

>     I know this sounds a bit messy, but it was the only way we could
> come up with to utilize the benefits of both datacenters. Ideally, I'd
> love to be able to have all of the servers in one cluster, and define
> which servers I want to perform which tasks, so for instance we could
> use the one group of servers to fetch the data, but the other group of
> servers to store the data and perform the indexing/etc. If there's a
> better way to do something like this than what we're doing, or if you
> think we're just insane for doing it this way, please let me know :) Thanks!

You can use different sets of machines for dfs and MapReduce, by
starting them in differently configured installations.  So you could run
dfs only in your "indexing" datacenter, and MapReduce in both
datacenters configured to talk to the same dfs, at the "indexing"
datacenter.  Then your fetch tasks at the "fetch" datacenter would write
their output to the "indexing" datacenter's dfs.  And
parse/updatedb/generate/index/etc. could all run at the other
datacenter.  Does that make sense?
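
Concretely, the fetch datacenter's conf/hadoop-site.xml could point
fs.default.name at the indexing datacenter's namenode while running its
own jobtracker (a sketch; all hostnames are placeholders):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode.indexing.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.fetch.example.com:9001</value>
  </property>
</configuration>
```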

Doug

Re: Question about crawldb and segments

Jason Camp
Hey Doug,
    I'm finally picking this up again, and I believe I have everything
configured as you suggested - I'm running dfs at the indexing
datacenter, along with a job tracker/namenode/task trackers. I'm running
a job tracker/namenode/task trackers at the crawling datacenter, and it's
all pointed to use dfs at the indexing datacenter.

If I have mapreduce set to run locally, everything seems to be working
fine. But if I set mapred.job.tracker to the job tracker that's running
in the crawl datacenter, I get this when trying to run a fetch:

060430 155743 Waiting to find target node: crawl01/XX.XX.XX.XX:50010

For some reason, even though I'm specifying the fs.default.name as my
other datacenter, it's trying to make itself a dfs node also. I was
experimenting with some of the shell scripts, and it seems that in any
configuration where this server is a task tracker, it also tries to be a
dfs node. Am I missing something, or is there no way to make this server
only a task tracker and not a dfs node also?
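
For what it's worth, bin/start-all.sh starts both dfs and MapReduce
daemons on every listed slave; starting the daemons individually should
avoid that (a sketch, assuming the per-daemon hadoop-daemon.sh helper
shipped with contemporary Hadoop releases):

```shell
# on a fetch-datacenter node: run only a tasktracker, no datanode
bin/hadoop-daemon.sh start tasktracker
```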

Let me know if this makes sense - it's even a little confusing for me,
and I set it up :)

Thanks!

Jason





Re: Question about crawldb and segments

vis
In reply to this post by Jason Camp
Sorry, I am on holiday until the 8th of May.

Please contact the [hidden email] for urgent matters.

Kind regards, Herman.