Index update and Google Dance

Index update and Google Dance

Jack.Tang
Hi

I read the GFS document and the NDFS document on the wiki. One
interesting question: does NDFS support updating the index on the fly?

As you know, Google updates its index via the "Google Dance". It is said
that the replicator in GFS places three copies of each chunk on different
datanodes. During an index update, two of the copies are locked and
updated first, and finally the remaining one is updated. I do not know
whether this is true or not. However, from the NDFS document
(http://wiki.apache.org/nutch/NutchDistributedFileSystem) I learned:

   1. Files can only be written once. After the first write, they
become read-only. (Although they can be deleted.)
   2. Files are stream-oriented; you can only append bytes, and you
can only read/seek forward.
   3. There are no user permissions or quotas, although these could be
added fairly easily.

There are no "write" or "lock" I/O semantics, so we cannot update the
index dynamically, right?
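To make the constraint concrete, here is a rough Java sketch of that
write-once contract. The WriteOnceFs interface and every name in it are
purely illustrative - they are not the actual NDFS classes:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical facade -- NOT the real NDFS API, just the contract the
// wiki describes: write-once files, append-only writes, forward-only reads.
interface WriteOnceFs {
    OutputStream create(String path) throws IOException; // fails if path exists
    InputStream  open(String path)   throws IOException; // read/skip forward only
    boolean      delete(String path) throws IOException; // the only way to "update"
}

class WriteOnceUsage {
    static void demo(WriteOnceFs fs) throws IOException {
        // 1. Write once: bytes can only be appended, and after close()
        //    the file is permanently read-only.
        try (OutputStream out = fs.create("/segments/seg-001/data")) {
            out.write("record-1\n".getBytes());
            out.write("record-2\n".getBytes());
        }

        // 2. Stream-oriented reading: only forward seeks are possible.
        try (InputStream in = fs.open("/segments/seg-001/data")) {
            in.skip(9);  // forward "seek" past record-1
            in.read();   // read onwards from there
        }

        // 3. No in-place update, no locks: "updating" means delete + rewrite.
        fs.delete("/segments/seg-001/data");
        try (OutputStream out = fs.create("/segments/seg-001/data")) {
            out.write("record-1-revised\n".getBytes());
        }
    }
}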

Regards
/Jack

P.S.
What is the Google Dance? http://www.metamend.com/google-dance.html
Google Dance - The Index Update of the Google Search Engine:
http://dance.efactory.de/
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Index update and Google Dance

Stefan Groschupf-2
Nutch uses the concept of segments, and yes, you can update part of the
index by simply deleting older segments and generating/fetching new
segments.
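A minimal sketch of that segment-swap style of update (illustrative Java
only, not actual Nutch code; the directory layout is assumed):

import java.io.File;

// Illustrative only -- the "update" is nothing more than swapping whole
// segment directories that the searcher reads.
class SegmentSwap {
    /** Replace an old segment directory with a freshly fetched one. */
    static void swap(File segmentsDir, String oldName, File freshSegment) {
        File oldSegment = new File(segmentsDir, oldName);
        deleteRecursively(oldSegment);                      // retire stale data
        File target = new File(segmentsDir, freshSegment.getName());
        if (!freshSegment.renameTo(target))                 // publish new segment
            throw new RuntimeException("could not publish " + target);
        // searchers that re-list segmentsDir now see the new segment
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null)
            for (File c : children) deleteRecursively(c);
        f.delete();
    }
}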
Stefan

On 08.11.2005 at 18:38, Jack Tang wrote:

> [...]

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: Index update and Google Dance

Jack.Tang
Hi Stefan

Deleting is totally OK if there are NO references to the chunks (segments).
Also, will the master balance search requests? Say there are 3 slaves:
slave 1, 2, and 3, and three copies of the chunks are distributed across
them. If slave 1 is 90% busy, slave 2 is 80% busy, and slave 3 is idle,
what does NDFS do in this case? Or could you tell me where I should
start learning?

Regards
/Jack

On 11/9/05, Stefan Groschupf <[hidden email]> wrote:

> [...]


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Index update and Google Dance

Andrzej Białecki-2
Jack Tang wrote:

>Hi Stefan
>
>Deleting is totally OK if there are NO references to the chunks (segments).
>Also, will the master balance search requests? Say there are 3 slaves:
>slave 1, 2, and 3, and three copies of the chunks are distributed across
>them. If slave 1 is 90% busy, slave 2 is 80% busy, and slave 3 is idle,
>what does NDFS do in this case?

NDFS has almost nothing to do with this case, because the distributed
search uses a separate distributed protocol (DistributedSearch$Server
and DistributedSearch$Client), which can use both local segment data and
segment data on NDFS.

Unfortunately, there is no mechanism (yet) in this IPC to do load
balancing. What's more important, the current architecture assumes that
there is exactly one copy of each unique document in the segments
deployed for searching (so there is no redundancy on this level), and
the process to manage this is largely manual... :-( This is clearly an
area to be further developed.

In my opinion the "master" (which is the DistributedSearch$Client)
should know which segments are deployed on which servers, and also which
complete set of segments comprises the total index. If you combine this
with information about the current load per server, the "master" could
dispatch partial queries to be run against a subset of segments on each
server, depending on its local load, in such a way that the partial runs
globally cover the complete set of segments. In this scenario it would
be possible to deploy replicas of segments across the set of DS$Server-s
without getting duplicate results.
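A minimal sketch of such a dispatcher (hypothetical Java - nothing like
this exists in Nutch yet; the replica map and the load figures are
assumed inputs supplied by some monitoring mechanism):

import java.util.*;

// Sketch of the idea above: for each segment, pick the least-loaded
// server holding a replica, so the partial queries together cover the
// complete segment set exactly once.
class LoadAwareDispatcher {
    // segment name -> servers holding a replica of it
    final Map<String, List<String>> replicas;
    // server -> current load estimate (0.0 idle .. 1.0 saturated)
    final Map<String, Double> load;

    LoadAwareDispatcher(Map<String, List<String>> replicas,
                        Map<String, Double> load) {
        this.replicas = replicas;
        this.load = load;
    }

    /** Returns server -> subset of segments it should search for this query. */
    Map<String, List<String>> plan(Set<String> completeSegmentSet) {
        Map<String, List<String>> assignment = new HashMap<>();
        for (String segment : completeSegmentSet) {
            String best = replicas.get(segment).stream()
                .min(Comparator.comparingDouble(load::get))
                .orElseThrow(() -> new IllegalStateException("no replica: " + segment));
            assignment.computeIfAbsent(best, s -> new ArrayList<>()).add(segment);
        }
        return assignment; // each segment searched exactly once -> no duplicate hits
    }
}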

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Index update and Google Dance

Jack.Tang
Hi Andrzej

In the document, Michael said:
"I'd strongly recommend using the system with a replication rate of 3
copies, 2 minimum. Desired replication can be set in nutch config file
using "ndfs.replication" property, and MIN_REPLICATION constant is
located in ndfs/FSNamesystem.java (and set to 1 by default)."

Say I own 1 master machine and 6 datanodes, and I set ndfs.replication
to 3. After crawling, what is the distribution of the chunks (suppose
only 2 chunks exist)? Please correct me if I am wrong.

DataNode    Chunk No.
A           1#
B           1#
C           1#

D           2#
E           2#
F           2#

If I own 5 datanodes, it should be

DataNode    Chunk No.
A           1#
B           1#
C           1#

D           2#
E           2#

I mean B and C are exact mirrors of A. Is this true in NDFS?

Below is the Google architecture in my head:

                DataNode A
Master      DataNode B               GoogleCrawler
                DataNode C
                ......
The GoogleCrawler keeps running all the time. One day, it gets a fetchlist
from DataNode A, crawls all the pages and indexes them, then it tells the
Master "I want to update DataNode A's index"; finally it acquires a "read
lock" and a "write lock", and the index is updated. Then some operation is
applied to DataNodes B and C.

Comments?

Regards
/Jack


On 11/9/05, Andrzej Bialecki <[hidden email]> wrote:

> [...]


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Index update and Google Dance

Andrzej Białecki-2
Jack Tang wrote:

>Hi Andrzej
>
>In the document, Michael said:
>"I'd strongly recommend using the system with a replication rate of 3
>copies, 2 minimum. Desired replication can be set in nutch config file
>using "ndfs.replication" property, and MIN_REPLICATION constant is
>located in ndfs/FSNamesystem.java (and set to 1 by default)."

This is pertinent only to the parts of the system that use NDFS. The
distributed search part of Nutch does not have to use NDFS; in fact,
using local storage on multiple machines gives performance benefits...
although it creates a maintenance problem...

The difference between how GFS and NDFS work is that Google FS chunks
play the role of Nutch segments (the chunks are fully usable fragments
of the index). NDFS, however, does NOT work at such a high level: it
does not replicate segments, only opaque data blocks, which are not
directly usable by high-level apps (like a segment reader or index
reader) and need to be wrapped in a facade. NDFS is ideal for sequential
access, but much less so for random access - and Lucene indexes require
efficient random access.

You can think of NDFS as a very simple distributed filesystem like any
other out there (e.g. Coda) - it doesn't have the high-level semantics
of GFS (yet).

>Say I own 1 master machine and 6 datanodes, and I set ndfs.replication
>to 3. After crawling, what is the distribution of the chunks (suppose
>only 2 chunks exist)? Please correct me if I am wrong.
>
>DataNode    Chunk No.
>A           1#
>B           1#
>C           1#
>
>D           2#
>E           2#
>F           2#
>
>If I own 5 datanodes, it should be
>
>DataNode    Chunk No.
>A           1#
>B           1#
>C           1#
>
>D           2#
>E           2#
>
>I mean B and C are exact mirrors of A. Is this true in NDFS?

This is probably true for NDFS - the "master" is an NDFS namenode, and
the data nodes are NDFS datanodes - with one important correction: the
replication is NOT at the high level of "chunks" (segments), but at the
low level of data blocks. This means that high-level structures like
segments are NOT replicated as such - what you end up with is segments
backed by replicated storage.

>Below is the Google architecture in my head:
>
>                DataNode A
>Master      DataNode B               GoogleCrawler
>                DataNode C
>                ......
>The GoogleCrawler keeps running all the time. One day, it gets a fetchlist
>from DataNode A, crawls all the pages and indexes them, then it tells the
>Master "I want to update DataNode A's index"; finally it acquires a "read
>lock" and a "write lock", and the index is updated. Then some operation is
>applied to DataNodes B and C.

That's not how it works in Nutch. In Nutch you simply deploy new
segments to the search servers and delete the old ones, and the
individual search servers periodically check the list of available
segments to update their internal lists.
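A rough sketch of such a periodic re-scan (illustrative Java; not the
actual DistributedSearch$Server code, and the 60-second interval is an
arbitrary choice):

import java.io.File;
import java.util.*;
import java.util.concurrent.*;

// Sketch: a search server re-lists its segments directory on a timer
// and adjusts its internal segment set when deployments change.
class SegmentWatcher {
    private volatile Set<String> served = Collections.emptySet();

    void start(File segmentsDir) {
        ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            String[] names = segmentsDir.list();
            Set<String> current = names == null
                ? Collections.<String>emptySet()
                : new HashSet<>(Arrays.asList(names));
            if (!current.equals(served)) {
                served = current; // open searchers for new segments,
                                  // close searchers for removed ones
                System.out.println("segment set changed: " + current);
            }
        }, 0, 60, TimeUnit.SECONDS);
    }
}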

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Index update and Google Dance

Stefan Groschupf-2
In reply to this post by Jack.Tang

> three copies of the chunks are distributed across them. If slave 1
> is 90% busy, slave 2 is 80% busy, and slave 3 is idle, what does
> NDFS do in this case?
Actually you have to do that manually for now, but there will be an
automatic solution later.

> Or could you tell me where I should start learning?
The Nutch wiki, and maybe this URL can help:
http://wiki.media-style.com/display/nutchDocu/Home

Stefan


Re: Index update and Google Dance

Jack.Tang
In reply to this post by Andrzej Białecki-2
Thanks for your explanation, Andrzej.
I am going to read some of the NDFS source code and ask smarter
questions later. Thanks again.

Regards
/Jack

On 11/9/05, Andrzej Bialecki <[hidden email]> wrote:

> [...]


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Index update and Google Dance

Doug Cutting-2
In reply to this post by Jack.Tang
Jack Tang wrote:

> Below is the Google architecture in my head:
>
>                 DataNode A
> Master      DataNode B               GoogleCrawler
>                 DataNode C
>                 ......
> The GoogleCrawler keeps running all the time. One day, it gets a fetchlist
> from DataNode A, crawls all the pages and indexes them, then it tells the
> Master "I want to update DataNode A's index"; finally it acquires a "read
> lock" and a "write lock", and the index is updated. Then some operation is
> applied to DataNodes B and C.

Do you have evidence that this is how Google updates their index?  I've
never seen much published about that.

In the future I would like to implement a more automated distributed
search system than Nutch currently has.  One way to do this might be to
use MapReduce.  Each map task's input could be an index and some segment
data.  The map method would serve queries, i.e., run a Nutch
DistributedSearch.Server.  It would first copy the index out of NDFS to
the local disk, for better performance.  It would never exit normally,
but rather "map" forever.  When a new version of the index (new set of
segments, new boosts, and/or new deletions, etc.) is ready to deploy,
then a new job could be submitted.  If the number of map tasks (i.e.,
indexes) is kept equal to or less than the number of nodes, and each node
is permitted to run two or more tasks, then two versions of the index
can be served at once.
(listening for searches on different ports), and search front-ends are
using it, then the old version can be stopped by killing its MapReduce
job.  If a node dies, the MapReduce job tracker would automatically
re-start its task on another node.
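A rough sketch of what such a "map forever" task might look like
(hypothetical Java - no such job exists in Nutch; SearchServer below is
just a stand-in for something like DistributedSearch.Server):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Sketch only: names and steps are illustrative of the description above.
class ServeForeverMapTask {

    /** The "map" call: input is an index location, and it never returns. */
    void map(Path indexOnDfs) throws Exception {
        // 1. Copy the index out of the distributed FS to local disk,
        //    since Lucene needs efficient random access.
        Path local = Files.createTempDirectory("index-");
        copyTree(indexOnDfs, local);

        // 2. Serve queries over the local copy, on its own port, so an
        //    old and a new index version can run side by side on one node.
        new SearchServer(local.toFile(), 0 /* any free port */).start();

        // 3. "Map" forever; the task is stopped by killing its job once
        //    the front-ends have switched to the new index version.
        while (true) Thread.sleep(60_000);
    }

    static void copyTree(Path from, Path to) throws Exception {
        try (Stream<Path> paths = Files.walk(from)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = to.resolve(from.relativize(p).toString());
                if (Files.isDirectory(p)) Files.createDirectories(dest);
                else Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /** Minimal stand-in so the sketch is self-contained. */
    static class SearchServer {
        SearchServer(File index, int port) { /* bind socket, open index */ }
        void start() { /* accept and answer queries in background threads */ }
    }
}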

If there were an affinity method between tasks and task trackers, then
attempts could be made to re-deploy new versions of indexes whose, e.g.,
only boosts or deletions have changed, to the same nodes as before.
Then the copy of the index to the local disk could be incremental, only
copying the parts of the index/segment that have changed.

Doug

Re: Index update and Google Dance

Jack.Tang
Hi Doug

On 11/10/05, Doug Cutting <[hidden email]> wrote:

> Jack Tang wrote:
> > Below is the Google architecture in my head: [...]
>
> Do you have evidence that this is how Google updates their index?  I've
> never seen much published about that.
No. Google engineers came to my university for recruitment, and I asked
them how Google updates its index. One of them told me that, according
to the Google FS architecture, the chunks are locked first, and of
course the Master knows this, so it will not let any search query refer
to the chunks being updated. I read the GFS document and tried to give
my own explanation of Google's index updating.

Regards
/Jack

> [...]


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Index update and Google Dance

Byron Miller-2
In reply to this post by Doug Cutting-2
Doug, I love you

hehehe :)

Great vision for how things could work!  

--- Doug Cutting <[hidden email]> wrote:

> [...]


mapSearcher was Re: Index update and Google Dance

Stefan Groschupf-2
In reply to this post by Doug Cutting-2
Hi Doug,
> In the future I would like to implement a more automated  
> distributed search system than Nutch currently has.  One way to do  
> this might be to use MapReduce.  Each map task's input could be an  
> index and some segment data.  The map method would serve queries,  
> i.e., run a Nutch DistributedSearch.Server.  It would first copy  
> the index out of NDFS to the local disk, for better performance.

I have two questions regarding this mechanism.
First, how do you plan to make the running search servers known to the
master (the search client)? I can imagine a mechanism similar to the one
the tasktracker and jobtracker use - a kind of heartbeat message.
Second, wouldn't there also be a possibility to solve NUTCH-92
(DistributedSearch incorrectly scores results) by first running a
MapReduce task over the indexes that counts terms, and then holding the
result somehow in the memory of the master (the search client)? But I'm
not sure whether that might be too much data.
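A minimal sketch of that second idea (illustrative Java, not an existing
Nutch job): merge per-server document frequencies into global counts, so
the client can score with a global IDF instead of each server's local
one:

import java.util.HashMap;
import java.util.Map;

// Sketch: the "reduce" side of a term-counting job, held in the
// client's memory and consulted at scoring time.
class GlobalIdf {
    final Map<String, Long> globalDf = new HashMap<>();
    long globalDocCount = 0;

    /** Merge one server's (term -> docFreq) map into the global totals. */
    void merge(Map<String, Long> serverDf, long serverDocCount) {
        globalDocCount += serverDocCount;
        serverDf.forEach((term, df) -> globalDf.merge(term, df, Long::sum));
    }

    /** Classic Lucene-style IDF computed from the merged global statistics. */
    double idf(String term) {
        long df = globalDf.getOrDefault(term, 0L);
        return 1.0 + Math.log((double) globalDocCount / (df + 1));
    }
}

The memory cost is roughly one counter per distinct term across all
indexes, which is what makes the "too much data" worry plausible for a
large vocabulary.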

Stefan