How to index in real time?

Scott Green
Hi list,

First, I don't know whether the nutch-dev mailing list is suitable for
this topic. If I have posted in the wrong place, please tell me where I
should ask this question. Thanks.

The question is how to index resources in real time in Nutch. The
question was prompted by GMail. I don't know exactly what is behind
GMail, but it should be built on GFS. When I receive an email or send
one out and press "Search Mail" immediately, the search always finds
it. I'd appreciate it if someone could explain how GMail works.

Also, any advice on how to hack Nutch/Hadoop to achieve this? Thanks

Re: How to index in real time?

Enis Soztutar
Scott Green wrote:

> Hi list,
>
> First, I don't know whether the nutch-dev mailing list is suitable for
> this topic. If I have posted in the wrong place, please tell me where I
> should ask this question. Thanks.
>
> The question is how to index resources in real time in Nutch. The
> question was prompted by GMail. I don't know exactly what is behind
> GMail, but it should be built on GFS. When I receive an email or send
> one out and press "Search Mail" immediately, the search always finds
> it. I'd appreciate it if someone could explain how GMail works.
>
> Also, any advice on how to hack Nutch/Hadoop to achieve this? Thanks
>
Hi,
Most Google projects use a scalable data structure called Bigtable.
Orkut, Google Earth, Finance and Writely are reported to use it, and I
suppose GMail uses Bigtable as well. Bigtable is built on GFS and
designed to scale to the petabyte level, and they are working to take
it to the next level.

As far as I know, you have to rebuild the index every time or merge the
indexes, so there is no online index building. Consider asking this on
the Lucene mailing list.

RE: How to index in real time?

Alan Tanaman
Hi,

> As far as I know, you have to rebuild the index every time or merge the
> indexes, so there is no online index building. Consider asking this on
> the Lucene mailing list.

We are doing something similar to online (well, not sub-second but
sub-minute).  We are doing this by applying the adaptive-fetch patch, and
limiting the scope of each crawl so that we are only taking the items that
change.

As for the indexing, we are still using the existing mechanism, which
creates an entire index at once and then merges, but we are planning to
write a patch that uses the Lucene API to update the existing index:
- add the new documents
- delete existing documents and re-add them so that they refer to the new
  segment IDs
- delete obsolete documents

You need to be aware that this is not the most efficient usage of Nutch, but
it should make it easier for use in an enterprise environment.
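
To make the three update steps above more concrete, here is a minimal
sketch of the idea. It is a toy in-memory inverted index in Python, not
Nutch or Lucene code, and every name in it (TinyIndex, apply_crawl, the
"seg-1"/"seg-2" segment IDs, the example URLs) is hypothetical and only
there for illustration. It assumes documents are keyed by URL and that a
changed document is handled as a delete followed by a re-add.

```python
from collections import defaultdict


def _terms(text):
    """Split text into a set of lowercase whitespace-separated terms."""
    return set(text.lower().split())


class TinyIndex:
    """Toy in-memory inverted index keyed by URL (a stand-in, not Lucene)."""

    def __init__(self):
        self.docs = {}                    # url -> (segment_id, text)
        self.postings = defaultdict(set)  # term -> set of urls

    def delete(self, url):
        """Remove a document and its postings; do nothing if it is unknown."""
        entry = self.docs.pop(url, None)
        if entry is None:
            return
        for term in _terms(entry[1]):
            self.postings[term].discard(url)
            if not self.postings[term]:
                del self.postings[term]

    def add(self, url, segment_id, text):
        """Add a document; an existing one is deleted first, then re-added."""
        self.delete(url)
        self.docs[url] = (segment_id, text)
        for term in _terms(text):
            self.postings[term].add(url)

    def search(self, term):
        """Return the sorted URLs of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))


def apply_crawl(index, segment_id, fetched, gone):
    """Apply one small crawl to an existing index without rebuilding it.

    fetched: url -> text for new or changed pages (added or re-added so
             they point at the new segment ID)
    gone:    urls that no longer exist (obsolete documents to delete)
    """
    for url, text in fetched.items():
        index.add(url, segment_id, text)
    for url in gone:
        index.delete(url)


# Usage: an initial crawl, then a small follow-up crawl of changed items.
index = TinyIndex()
apply_crawl(index, "seg-1",
            {"http://a": "nutch crawl", "http://b": "lucene index"}, [])
apply_crawl(index, "seg-2",
            {"http://a": "nutch realtime index"}, ["http://b"])
```

After the second call the changed page is searchable under its new terms
and refers to the new segment, while the obsolete page is gone, all
without rebuilding the rest of the index.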

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com