Has anybody thought of replacing CrawlDb with any kind of relational DB?

15 messages

Has anybody thought of replacing CrawlDb with any kind of relational DB?

wangxu-3
Has anybody thought of replacing CrawlDb with any kind of relational
DB? MySQL, for example?

The CrawlDb is so difficult to manipulate. I often need to edit several
entries in the CrawlDb, but that costs too much time waiting for MapReduce.

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Nuther
Hi, wangxu.

You wrote on April 13, 2007, at 1:03:31:

> Has anybody thought of replacing CrawlDb with any kind of relational
> DB? MySQL, for example?
>
> The CrawlDb is so difficult to manipulate. I often need to edit several
> entries in the CrawlDb, but that costs too much time waiting for MapReduce.
You think MySQL would give you higher speed? :)
Just try DataPark Search with a large number of URLs :)
and you will see the difference ;)




Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Andrzej Białecki-2
In reply to this post by wangxu-3
wangxu wrote:
> Has anybody thought of replacing CrawlDb with any kind of relational
> DB? MySQL, for example?
>
> The CrawlDb is so difficult to manipulate. I often need to edit several
> entries in the CrawlDb, but that costs too much time waiting for MapReduce.

Please run the following test using your favorite relational DB:

* create a table with 300 million rows and 10 columns of mixed types

* select 1 million rows, sorted by some value

* update 1 million rows to different values

If you find that these operations take less time than with the current
CrawlDb, then we will have to revisit this issue. :)
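
For concreteness, a minimal sketch of such a test in plain JDBC against
MySQL; the crawldb table layout, JDBC URL, and credentials are
illustrative assumptions, not anything from Nutch:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CrawlDbBenchmark {
  public static void main(String[] args) throws Exception {
    // Hypothetical database and credentials -- adjust for your setup.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/crawltest", "user", "pass");
    Statement st = conn.createStatement();

    // Assumes a pre-populated table of ~300 million rows, e.g.:
    //   CREATE TABLE crawldb (url VARCHAR(512) PRIMARY KEY,
    //     status TINYINT, score FLOAT, fetch_time BIGINT, ...);

    // Select 1 million rows, sorted by some value.
    long t0 = System.currentTimeMillis();
    ResultSet rs = st.executeQuery(
        "SELECT url, score FROM crawldb ORDER BY score DESC LIMIT 1000000");
    while (rs.next()) { /* drain the result set */ }
    rs.close();
    System.out.println("select: " + (System.currentTimeMillis() - t0) + " ms");

    // Update 1 million rows to different values (MySQL allows
    // ORDER BY ... LIMIT on single-table UPDATEs).
    t0 = System.currentTimeMillis();
    st.executeUpdate(
        "UPDATE crawldb SET score = score * 0.5 ORDER BY score DESC LIMIT 1000000");
    System.out.println("update: " + (System.currentTimeMillis() - t0) + " ms");

    st.close();
    conn.close();
  }
}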


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Sami Siren-2
In reply to this post by wangxu-3
wangxu wrote:
> Has anybody thought of replacing CrawlDb with any kind of relational
> DB? MySQL, for example?
>
> The CrawlDb is so difficult to manipulate. I often need to edit several
> entries in the CrawlDb, but that costs too much time waiting for MapReduce.

Once, when I was young and restless, I went down the path of a
relational DB. It kind of worked with a few million records. I am not
trying to do that anymore.

Perhaps your problem is that you process too few records at a time?
Quite often I see examples where people fetch a few hundred or a few
thousand pages at a time. That may be a good amount for small crawls, but
if your goal is bigger, you need bigger segments to get there.

--
 Sami Siren



Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Dennis Kubes
In reply to this post by Andrzej Białecki-2


Andrzej Bialecki wrote:

> wangxu wrote:
>> Has anybody thought of replacing CrawlDb with any kind of relational
>> DB? MySQL, for example?
>>
>> The CrawlDb is so difficult to manipulate. I often need to edit several
>> entries in the CrawlDb, but that costs too much time waiting for MapReduce.
>
> Please run the following test using your favorite relational DB:
>
> * create a table with 300 million rows and 10 columns of mixed types
>
> * select 1 million rows, sorted by some value
>
> * update 1 million rows to different values
>
> If you find that these operations take less time than with the current
> CrawlDb, then we will have to revisit this issue. :)

That is so funny.

RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Howie Wang
In reply to this post by wangxu-3
> > Please run the following test using your favorite relational DB:
> >
> > * create a table with 300 million rows and 10 columns of mixed types
> >
> > * select 1 million rows, sorted by some value
> >
> > * update 1 million rows to different values
> >
> > If you find that these operations take less time than with the current
> > CrawlDb, then we will have to revisit this issue. :)
>
> That is so funny.

I think the original question and the above answer show the big
difference in the ways that Nutch is being used. For a small niche
search engine with fewer than a few million pages, it would probably be
performant to use a relational DB.

I have a webdb with 5 million records, and usually fetch 20k pages at a
time. It takes me about 1 hour to do an updatedb. Injecting just a few
dozen new URLs takes about 20 minutes. On a relational DB, I know the
injecting would be *much* faster, and I think the updatedb step would be
too.

Also, for smaller engines, raw throughput doesn't matter as much, and
other considerations like robustness and flexibility can be more
important. With a relational DB, I could recover from a crashed crawl
with a simple SQL update, or remove a set of bogus URLs from the db just
as easily. Now, when I want to tweak the webdb in an unanticipated way,
I have to write a custom piece of Java to do it.

Just thought I'd throw in a perspective from a niche search guy.

Howie
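
For illustration, the kind of one-off SQL fix described here might look
like this over JDBC; the pages table and its columns are made-up
stand-ins for whatever schema such a port would use:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WebDbQuickFix {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/webdb", "user", "pass"); // hypothetical
    Statement st = conn.createStatement();

    // Recover from a crashed crawl: reset pages stuck mid-fetch.
    st.executeUpdate(
        "UPDATE pages SET status = 'unfetched' WHERE status = 'fetching'");

    // Remove a set of bogus URLs in one statement.
    st.executeUpdate(
        "DELETE FROM pages WHERE url LIKE '%jsessionid=%'");

    st.close();
    conn.close();
  }
}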

RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Howie Wang
In reply to this post by wangxu-3
Sorry about the previous crappily formatted message. In brief, my point
was that a relational DB might perform better for small niche users, and
plus you get the flexibility of SQL. No more writing custom code to
tweak the webdb.

Howie

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Andrzej Białecki-2
Howie Wang wrote:
> Sorry about the previous crappily formatted message. In brief, my point
> was that a relational DB might perform better for small niche users,
> and plus you get the flexibility of SQL. No more writing custom code to
> tweak the webdb.
>
> Howie

Generally speaking, I agree that it would be a good option to have,
especially for smaller setups - but it would require extensive
modifications to many tools in Nutch. Unless you are willing to provide
patches that implement it without breaking the large-scale case, I think
we should let the matter rest ...


--
Best regards,
Andrzej Bialecki


RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Howie Wang
In reply to this post by wangxu-3
I definitely don't expect people to write it just because it happens to
be useful to me :-)  Call me crazy, but I'm thinking of implementing
this when I get some free time (whenever that will be). It seems that I
would just need to implement IWebDBWriter and IWebDBReader, and then add
a command-line option to the tools (something like -mysql) to specify
the type of db to instantiate. It would affect about 15 files, but the
tools changes would be simple -- a few if statements here and there.
Does that sound right?

Howie
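
As a self-contained sketch of that dispatch: the Writer interface below
is only a stand-in, not the real 0.7 IWebDBWriter, and both
implementations are hypothetical placeholders.

import java.util.Arrays;

public class WriterDispatchSketch {

  // Stand-in for the real 0.7 IWebDBWriter; only the dispatch is shown.
  interface Writer {
    void addPage(String url) throws Exception;
    void close() throws Exception;
  }

  // Hypothetical file-backed implementation (the existing code path).
  static class FileBackedWriter implements Writer {
    public void addPage(String url) { System.out.println("file: " + url); }
    public void close() { }
  }

  // Hypothetical JDBC-backed implementation.
  static class MySQLWriter implements Writer {
    public void addPage(String url) { System.out.println("mysql: " + url); }
    public void close() { }
  }

  public static void main(String[] args) throws Exception {
    // The "few if statements here and there" in each tool:
    boolean useMysql = Arrays.asList(args).contains("-mysql");
    Writer writer = useMysql ? new MySQLWriter() : new FileBackedWriter();
    writer.addPage("http://example.com/");
    writer.close();
  }
}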

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Arun Sharma-3
Actually, the Nutch people are kind of autocratic; don't expect more from
them. They do what they have decided.... I am waiting for a really stable
product with incremental indexing, which detects and adds/removes pages
as soon as they are added/removed. But they don't want to do this, and I
don't know why. What is their mission? If we join together to implement
this, it would be better. I can work on this as a weekend project.
         Ping me if you want.


On 4/13/07, Howie Wang <[hidden email]> wrote:

>
> I definitely don't expect people to write it just because it happens
> to be useful to me :-)  Call me crazy, but I'm thinking of implementing
> this when I get some free time (whenever that will be). It seems that I
> would just need to implement IWebDBWriter and IWebDBReader, and then
> add a command-line option to the tools (something like -mysql) to
> specify the type of db to instantiate. It would affect about 15 files,
> but the tools changes would be simple -- a few if statements here and
> there. Does that sound right?
>
> Howie

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Doug Cutting
Arun Kaundal wrote:
> Actually, the Nutch people are kind of autocratic; don't expect more
> from them. They do what they have decided....

Have you submitted patches that have been ignored or rejected?

Each Nutch contributor indeed does what he or she decides.  Nutch is not
a service organization that implements every feature that someone
requests.  It is a collaborative project of volunteers.  Each
contributor adds things they need, and others share the benefits.

> I am waiting for a really stable product with incremental indexing,
> which detects and adds/removes pages as soon as they are added/removed.
> But they don't want to do this, and I don't know why.

Perhaps because this is difficult, especially while still supporting
large crawls.  But if others don't want to implement this, I encourage
you to try to implement it, and, if you succeed, contribute it back to
the project.  That's the way Nutch grows.

> What is their mission? If we join together to implement this, it would
> be better. I can work on this as a weekend project.
>         Ping me if you want.

You can of course fork Nutch, or start a new project from scratch.  But
you ought to also consider submitting patches to Nutch, working with
other contributors to solve your problems here, before abandoning Nutch
in favor of another project.

Cheers,

Doug

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Andrzej Białecki-2
In reply to this post by Howie Wang
Howie Wang wrote:
> I definitely don't expect people to write it just because it happens
> to be useful to me :-)  Call me crazy, but I'm thinking of implementing
> this when I get some free time (whenever that will be). It seems that I
> would just need to implement IWebDBWriter and IWebDBReader, and then
> add a command-line option to the tools (something like -mysql) to
> specify the type of db to instantiate. It would affect about 15 files,
> but the tools changes would be simple -- a few if statements here and
> there. Does that sound right?
>
> Howie

You are talking about the codebase from branch 0.7. This branch is not
under active development. The current codebase is very different - it
uses the MapReduce framework to process data in a distributed fashion.

So, there is no single interface for writing the CrawlDb. There is one
class for reading the CrawlDb, but usually the data in the DB is used
not standalone, but as one of many inputs to a map-reduce job.

To summarize - I think it would be very difficult to do this with the
current codebase.
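
That said, a single CrawlDb entry can still be read standalone, since
the data is just Hadoop MapFiles of <Text, CrawlDatum> pairs. A rough
sketch, with partition handling omitted and the path purely
illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A CrawlDb holds one MapFile per reduce partition; a real lookup
    // must pick the partition the key hashes to. Here we just probe
    // the first one.
    MapFile.Reader reader =
        new MapFile.Reader(fs, "crawl/crawldb/current/part-00000", conf);

    CrawlDatum datum = new CrawlDatum();
    if (reader.get(new Text("http://example.com/"), datum) != null) {
      System.out.println(datum);
    }
    reader.close();
  }
}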

--
Best regards,
Andrzej Bialecki


RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Howie Wang
In reply to this post by wangxu-3
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I
might still try it since I'm not planning on upgrading for a while, but
it sounds like it's not going to port to the current versions.

Howie

Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

wangxu-3
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

> Howie Wang wrote:
>> [...]
>
> You are talking about the codebase from branch 0.7. This branch is not
> under active development. The current codebase is very different - it
> uses the MapReduce framework to process data in a distributed fashion.
>
> So, there is no single interface for writing the CrawlDb. There is one
> class for reading the CrawlDb, but usually the data in the DB is used
> not standalone, but as one of many inputs to a map-reduce job.
>
> To summarize - I think it would be very difficult to do this with the
> current codebase.
>
My URLs are at most on the level of 1,000,000 per site, so perhaps I can
do some tests and go on with the idea.

Based on 0.9, it seems the simplest way to achieve this is as follows.
For any MapReduce job associated with the CrawlDb, I add operations
like these: read the relational DB to generate a temporary CrawlDb as
the job's input path, then read the job-generated CrawlDb to update the
relational DB. Is that right?
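
A minimal sketch of that export step, assuming the 0.9 API: rows are
read from a hypothetical pages table over JDBC and written as
<Text, CrawlDatum> pairs into the MapFile layout a CrawlDb uses (data
under current/part-NNNNN). Table, column, and path names are
illustrative, and the CrawlDatum setters should be verified against the
actual 0.9 source:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class RdbmsToCrawlDb {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // CrawlDb data lives in MapFiles under <db>/current/part-NNNNN.
    MapFile.Writer writer = new MapFile.Writer(conf, fs,
        "crawldb_tmp/current/part-00000", Text.class, CrawlDatum.class);

    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/crawldb", "user", "pass"); // hypothetical
    Statement st = conn.createStatement();
    // MapFile keys must arrive in sorted order, hence the ORDER BY.
    ResultSet rs = st.executeQuery(
        "SELECT url, status, score FROM pages ORDER BY url");

    while (rs.next()) {
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus((byte) rs.getInt("status")); // e.g. CrawlDatum.STATUS_DB_UNFETCHED
      datum.setScore(rs.getFloat("score"));
      writer.append(new Text(rs.getString("url")), datum);
    }

    rs.close();
    st.close();
    conn.close();
    writer.close();
  }
}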


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

Andrzej Białecki-2
wangxu wrote:

> Andrzej Bialecki wrote:
>> [...]
>
> My URLs are at most on the level of 1,000,000 per site, so perhaps I
> can do some tests and go on with the idea.
>
> Based on 0.9, it seems the simplest way to achieve this is as follows.
> For any MapReduce job associated with the CrawlDb, I add operations
> like these: read the relational DB to generate a temporary CrawlDb as
> the job's input path, then read the job-generated CrawlDb to update
> the relational DB. Is that right?

Yes, it should be possible; you just need to keep track of the page
statuses and metadata that are normally kept in the CrawlDb. Also, if
you want to update the relational DB in a map-reduce job, you need to be
careful about opening new connections to the DB - it's best to set up
the connection in the Mapper/Reducer configure() methods.
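
A minimal sketch of that pattern, using the old org.apache.hadoop.mapred
API that Nutch 0.9 builds on; the JDBC URL and the UPDATE statement are
illustrative assumptions:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class DbUpdateReducer extends MapReduceBase implements Reducer {

  private Connection conn;
  private PreparedStatement update;

  // Called once per task: open the connection here, not per record.
  public void configure(JobConf job) {
    try {
      conn = DriverManager.getConnection(
          "jdbc:mysql://dbhost/crawldb", "user", "pass"); // hypothetical
      update = conn.prepareStatement(
          "UPDATE pages SET status = ? WHERE url = ?");   // illustrative
    } catch (SQLException e) {
      throw new RuntimeException("cannot open DB connection", e);
    }
  }

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    CrawlDatum datum = (CrawlDatum) values.next();
    try {
      update.setInt(1, datum.getStatus());
      update.setString(2, key.toString());
      update.executeUpdate();
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
    output.collect(key, datum);
  }

  // Called once per task: release the connection.
  public void close() throws IOException {
    try { conn.close(); } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }
}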


--
Best regards,
Andrzej Bialecki