Nutch + HBase


Marcus Herou
Hi.

Has anyone tried to implement HBase as storage for:

* CrawlDB
* LinkDB
* Fetched and parsed URL data

I think it would be useful to be able to search all three of these DBs.
Currently it is rather hard to use the crawled data without actually
indexing it.

Kindly

//Marcus

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Nutch + HBase

Andrzej Białecki-2
Marcus Herou wrote:
> Hi.
>
> Has anyone tried to implement HBase as storage for:


Not yet. We are waiting for HBase to reach a certain level of stability and
efficiency.

>
> * CrawlDB

Yes.

> * LinkDB

Yes.

> * Fetched and parsed URL data

I don't think so, for performance reasons: the page storage needs to
offer high-performance search and retrieval operations, and I don't think
HBase can provide that level of performance. For now, the current segment
format (or the future shard format) is the best option.

>
> I think it would be useful to be able to search all three of these DBs.
> Currently it is rather hard to use the crawled data without actually
> indexing it.

That's true - on the other hand, the current set of features is
optimized (read: minimized ;) ) to support the primary functionality,
and to do it well.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch + HBase

Marcus Herou
Hi, thanks for the answer.

I will not use HBase for free-text searching; for that, Lucene is far more
mature, scalable, etc.

What I want to use HBase for is a somewhat more familiar and clean concept
of storing data than large sequential files spread out on HDFS.
Typical use-cases:

* Search with Lucene in some way: Solr, NutchBean etc.
* Get the actual data from HBase or some other clustered db based on a
primary key which is stored in Lucene.
* Applications get an easier integration point than using CrawlDb.get(...)
or dump.
* This avoids storing the same data in two (or more) places and
wasting disk.
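
The first two use-cases above (search the index, then fetch the record by the primary key the index returned) can be sketched roughly as follows. This is only a minimal in-memory illustration: `TinyIndex` and `RowStore` are hypothetical stand-ins, not Lucene or HBase APIs, and the URLs are made up.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str       # used as the primary key (an assumption for this sketch)
    content: str


class TinyIndex:
    """Toy inverted index: maps a term to the primary keys that contain it."""

    def __init__(self):
        self._postings = {}

    def add(self, key, text):
        for term in set(text.lower().split()):
            self._postings.setdefault(term, set()).add(key)

    def search(self, term):
        # Returns primary keys only, never the stored content.
        return sorted(self._postings.get(term.lower(), set()))


class RowStore:
    """Toy key-value store standing in for HBase or another clustered db."""

    def __init__(self):
        self._rows = {}

    def put(self, key, page):
        self._rows[key] = page

    def get(self, key):
        return self._rows.get(key)


def search_pages(index, store, term):
    """Step 1: ask the index for keys. Step 2: fetch each row by key."""
    pages = []
    for key in index.search(term):
        page = store.get(key)
        if page is not None:
            pages.append(page)
    return pages


index, store = TinyIndex(), RowStore()
for page in [Page("http://example.com/a", "nutch crawls the web"),
             Page("http://example.com/b", "hbase stores rows")]:
    store.put(page.url, page)          # content lives in exactly one place
    index.add(page.url, page.content)  # index holds terms and keys only

hits = search_pages(index, store, "hbase")
```

The point of the split is that the index never holds the page content, so an application can go straight to `store.get(key)` without touching the crawl data files.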

The "yes" answers in your mail: were they referring to actual implementations?

Kindly

//Marcus






On Tue, Jun 17, 2008 at 9:07 PM, Andrzej Bialecki <[hidden email]> wrote:

> Marcus Herou wrote:
>
>> Hi.
>>
>> Has anyone tried to implement HBase as storage for:
>>
>
> Not yet. We are waiting for HBase to reach a certain level of stability and
> efficiency.
>
>> * CrawlDB
>>
>
> Yes.
>
>> * LinkDB
>>
>
> Yes.
>
>> * Fetched and parsed URL data
>>
>
> I don't think so, for performance reasons: the page storage needs to offer
> high-performance search and retrieval operations, and I don't think HBase
> can provide that level of performance. For now, the current segment format
> (or the future shard format) is the best option.
>
>> I think it would be useful to be able to search all three of these DBs.
>> Currently it is rather hard to use the crawled data without actually
>> indexing it.
>>
>
> That's true - on the other hand, the current set of features is optimized
> (read: minimized ;) ) to support the primary functionality, and to do it
> well.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



Re: Nutch + HBase

Andrzej Białecki-2
Marcus Herou wrote:

> Hi, thanks for the answer.
>
> I will not use HBase for free-text searching; for that, Lucene is far more
> mature, scalable, etc.
>
> What I want to use HBase for is a somewhat more familiar and clean concept
> of storing data than large sequential files spread out on HDFS.
> Typical use-cases:
>
> * Search with Lucene in some way: Solr, NutchBean etc.
> * Get the actual data from HBase or some other clustered db based on a
> primary key which is stored in Lucene.

For occasional retrieval this might be OK, but for quick access to many
random records I doubt the performance would be acceptable.

> * Applications get an easier integration point than using CrawlDb.get(...)
> or dump.
> * This avoids storing the same data in two (or more) places and
> wasting disk.

Hmm ... Keep in mind that parse text and parse data are needed when
searching, and for this you need to offer maximum performance. If you
plan to keep parse text and parse data in HBase, this means that you will
have to create a second copy of this data in a format suitable for fast
retrieval.
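
To make the "second copy" concrete, here is a rough sketch of exporting rows from a primary store into a flat blob with an offset table, so that a lookup becomes one slice instead of a store query. The layout is invented for illustration and is not the Nutch segment format; the `primary` dict is a hypothetical stand-in for rows kept in HBase.

```python
import io


def export_for_retrieval(rows):
    """Copy rows (key -> text) into one byte blob plus a key -> (offset, length) table."""
    blob = io.BytesIO()
    offsets = {}
    for key in sorted(rows):
        data = rows[key].encode("utf-8")
        offsets[key] = (blob.tell(), len(data))  # where this record starts, and its size
        blob.write(data)
    return blob.getvalue(), offsets


def read_text(blob, offsets, key):
    """Fetch one record from the exported copy with a single slice."""
    offset, length = offsets[key]
    return blob[offset:offset + length].decode("utf-8")


# Hypothetical primary copy (stand-in for rows stored in HBase).
primary = {
    "http://example.com/a": "parsed text of page a",
    "http://example.com/b": "parsed text of page b",
}

# The export duplicates every byte of the parsed text.
blob, offsets = export_for_retrieval(primary)
```

Note that `blob` is as large as all the parsed text in `primary` combined, which is exactly the duplication the use-case list was hoping to avoid.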

>
> The "yes" answers in your mail: were they referring to actual implementations?

Unfortunately, no :) I only meant that in my opinion it would be
desirable to move this part of Nutch to HBase.

