Finding term vector per host using Hadoop

Finding term vector per host using Hadoop

Arif Iqbal
In the MapReduce paper published by Google at OSDI 2004, they give
term vector per host as an example use of MapReduce. They write:

""" Term-Vector per Host: A term vector summarizes the most important words
that occur in a document or a set of documents as a list of
(word; frequency) pairs. The map function emits a (hostname; term vector)
pair for each input document (where the hostname is extracted from the URL
of the document). The reduce function is passed all per-document term
vectors for a given host. It adds these term vectors together, throwing away
infrequent terms, and then emits a final (hostname; term vector) pair. """

I want to implement the same thing and was wondering whether this is
possible with Hadoop, i.e. a map function that emits (hostname, term
vector) pairs. If so, could someone share some sample code?

Cheers,
Arif

Re: Finding term vector per host using hadoop

Andrzej Białecki-2
Arif Iqbal wrote:

> [...]

Yes, in fact this should be quite easy; you can follow exactly the steps
described above. You can use Lucene's MemoryIndex to quickly create a
term vector for each document, emit it from the map() operation as
<host, termVector> for each input document, and then sum the term
vectors in reduce(). Look at Grep.java or WordCount.java in the
examples; your MapReduce job will follow a very similar pattern.
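To make the shape of the job concrete, here is a minimal, self-contained sketch of that logic in plain Java (no Hadoop or Lucene dependencies; the class name, method names, and the minFreq cutoff are all illustrative, not from any Hadoop example). termVector() is what map() would compute per document, and merge() is the body of reduce() for one hostname key:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermVectors {

    // Build a (word -> frequency) term vector for one document.
    public static Map<String, Integer> termVector(String document) {
        Map<String, Integer> tv = new HashMap<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                tv.merge(word, 1, Integer::sum);
            }
        }
        return tv;
    }

    // Sum all per-document term vectors for one host and throw away
    // terms whose total frequency is below minFreq -- what reduce()
    // would do for a given hostname key.
    public static Map<String, Integer> merge(List<Map<String, Integer>> vectors,
                                             int minFreq) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> tv : vectors) {
            tv.forEach((word, freq) -> total.merge(word, freq, Integer::sum));
        }
        total.values().removeIf(freq -> freq < minFreq);
        return total;
    }
}
```

In the real job these two methods would sit inside map() and reduce(), with the hostname as the key and the term vector wrapped in a Writable (see below in the thread).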

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Finding term vector per host using hadoop

Arif Iqbal
Can we emit (word, termVector) from the map function? I am new to
Hadoop, and as far as I can tell none of the bundled examples emits any
data type other than IntWritable or Text. Is it possible to emit other
data types such as a term vector, a Hashtable, etc.?
Thanks.

On 12/8/06, Andrzej Bialecki <[hidden email]> wrote:

> [...]

Re: Finding term vector per host using hadoop

Andrzej Białecki-2
Arif Iqbal wrote:

> [...]

Sure, as long as they implement Writable (for values) or
WritableComparable (for keys). Practically speaking, you need to wrap
any internal data structure that you use in an implementation of
Writable / WritableComparable.
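For illustration, a sketch of what such a wrapper could look like for a term vector. To stay runnable without the Hadoop jars, this version implements the two methods of the Writable contract (write(DataOutput) and readFields(DataInput)) directly rather than the interface itself; the class name is made up for the example:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper for a (word -> frequency) term vector.
// In a real job it would declare "implements Writable"; the
// serialization logic below is the same either way.
public class TermVectorWritable {

    private final Map<String, Integer> vector = new HashMap<>();

    public Map<String, Integer> get() {
        return vector;
    }

    // Writable contract: serialize this object to a binary stream.
    public void write(DataOutput out) throws IOException {
        out.writeInt(vector.size());
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue());
        }
    }

    // Writable contract: deserialize, replacing the current state.
    public void readFields(DataInput in) throws IOException {
        vector.clear();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            vector.put(in.readUTF(), in.readInt());
        }
    }
}
```

A round trip through DataOutputStream / DataInputStream over a byte array is an easy way to check that write() and readFields() agree.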




Re: Finding term vector per host using hadoop

Lukáš Vlček
Hi,

Is there a good example of how to wrap a complex custom data structure
in Writable (or WritableComparable)? I haven't found anything on the wiki.

Let's imagine I need to wrap a tree-like structure (nodes, edges, and a
couple of other properties per node). Is there any existing code in
Hadoop I could draw inspiration from?

Thanks,
Lukas

On 12/8/06, Andrzej Bialecki <[hidden email]> wrote:

> [...]

Re: Finding term vector per host using hadoop

Andrzej Białecki-2
Lukas Vlcek wrote:

> [...]

These illustrate serialization of Map-like structures:

org.apache.nutch.crawl.MapWritable
org.apache.nutch.metadata.Metadata

I don't think we have examples of tree-like structures, but the
serialization parts would look similar; you would just need to traverse
the tree depth-first.

And if you need to process values stored in several different classes,
you could use ObjectWritable to wrap them.
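A sketch of the depth-first idea for a tree node, again self-contained (no Hadoop types; the TreeNode class and its single String property are made up for the example). Each node writes its own payload, then its child count, then recurses, and readFields() replays the same order:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node; write()/readFields() follow the Writable
// contract and serialize the whole subtree depth-first.
public class TreeNode {

    public String label = "";
    public final List<TreeNode> children = new ArrayList<>();

    public void write(DataOutput out) throws IOException {
        out.writeUTF(label);            // this node's payload
        out.writeInt(children.size());  // how many subtrees follow
        for (TreeNode child : children) {
            child.write(out);           // recurse depth-first
        }
    }

    public void readFields(DataInput in) throws IOException {
        label = in.readUTF();
        children.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            TreeNode child = new TreeNode();
            child.readFields(in);       // recurse in the same order
            children.add(child);
        }
    }
}
```

Because every subtree is length-prefixed by its child count, no explicit end-of-tree marker is needed.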




Re: Finding term vector per host using hadoop

Lukáš Vlček
Hi,

I have just found the ObjectWritable class in the org.apache.hadoop.io
package. However, it does not support any type from the Java Collections
framework. For a tree-like data structure it is useful to use a
LinkedList for a node's children (as opposed to a fixed-size array), and
this is not directly supported by Hadoop at the moment.

Do you think it would be hard to extend ObjectWritable so that it
handles Collections as well? Would that be a useful feature /
contribution to the Hadoop community?

Regards,
Lukas

On 12/12/06, Andrzej Bialecki <[hidden email]> wrote:

> [...]

Re: Finding term vector per host using hadoop

Dennis Kubes
One thing we have done in the past is a SerializableWritable that writes
a Serializable object out to a byte stream, which is then stored in the
writable. If anyone is interested, email me and I will send you the code
for the SerializableWritable.
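Roughly, the idea is to serialize the object with ObjectOutputStream into a byte array and store that length-prefixed array in the writable. A self-contained approximation of the approach described above (a guess at the technique, not Dennis's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Wraps any Serializable object as a length-prefixed byte array,
// following the Writable write/readFields contract.
public class SerializableWritable {

    private Serializable instance;

    public SerializableWritable() {}

    public SerializableWritable(Serializable instance) {
        this.instance = instance;
    }

    public Serializable get() {
        return instance;
    }

    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(instance);   // standard Java serialization
        oos.flush();
        byte[] bytes = bos.toByteArray();
        out.writeInt(bytes.length);  // length prefix
        out.write(bytes);
    }

    public void readFields(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        try {
            instance = (Serializable) new ObjectInputStream(
                new ByteArrayInputStream(bytes)).readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}
```

This handles any Serializable type, including the Collections classes mentioned above, at the cost of Java serialization's bulkier wire format.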

Dennis

Lukas Vlcek wrote:

> [...]

Re: Finding term vector per host using hadoop

Lukáš Vlček
Hi,

I would appreciate it if you could send me your code.
I am not sure whether I will use the Serializable approach, but it could
be a useful source of inspiration :-)

Regards,
Lukas

On 12/13/06, Dennis Kubes <[hidden email]> wrote:

> [...]

Re: Finding term vector per host using hadoop

Dennis Kubes


Lukas Vlcek wrote:

> [...]

Re: Finding term vector per host using hadoop

Arif Iqbal
Dennis,

I also want this code. Kindly send it to me.

Cheers,
AI

On 12/14/06, Lukas Vlcek <[hidden email]> wrote:

> [...]

Re: Finding term vector per host using hadoop

Lukáš Vlček
Hi,
I have already sent it to Arif (on behalf of Dennis).
Lukas

On 12/14/06, Arif Iqbal <[hidden email]> wrote:

> [...]