Google Analytics in Hadoop ?

Alex McLintock
Hi Folks,

This is not 100% a Nutch question... and I hate it when other people say "I
know my question is off topic....." so why I am doing it myself, I don't
know.

I am looking at building a system similar to Google Analytics - in that it
logs page requests on third-party sites using some kind of JavaScript,
processes those logs, and produces reports. I see there are open source
tools for this which are MySQL/RDBMS-backed - but I want a Hadoop-backed
system for scalability. Do I need to implement it myself, or is anyone
already working on such a thing?

To bring this back to Nutch, I would also like to fetch and index all the
pages which are logged in this way so that my system knows what they are
about. (But I don't really need any web crawling after that.)
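
One way to read the requirement above is that the logged page URLs become the
seed list for a single fetch round. The following is a minimal sketch, not an
existing tool: it assumes the logged URLs are available as plain strings, and
the function name seed_urls is hypothetical. Nutch's injector reads plain-text
seed files with one URL per line, so the result could be written out in that
form.

```python
from urllib.parse import urlsplit


def seed_urls(logged_urls):
    """Return de-duplicated http(s) URLs, in first-seen order.

    `logged_urls` is any iterable of raw URL strings taken from the
    page-request logs (an assumed input format, for illustration only).
    """
    seen = set()
    seeds = []
    for raw in logged_urls:
        parts = urlsplit(raw.strip())
        # Skip anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Drop the fragment: it never reaches the server, so two URLs
        # differing only in "#..." refer to the same fetched page.
        normalized = parts._replace(fragment="").geturl()
        if normalized not in seen:
            seen.add(normalized)
            seeds.append(normalized)
    return seeds


if __name__ == "__main__":
    sample = [
        "http://example.org/a#top",
        "http://example.org/a",
        "not a url",
        "https://example.org/b",
    ]
    # Print one URL per line, the layout expected in a seed file.
    print("\n".join(seed_urls(sample)))
```

Running only one fetch round over such a list would avoid following links
beyond the logged pages, which matches the "no crawling after that" wish.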

Any ideas?

Cheers

Re: Google Analytics in Hadoop ?

mohajeri
If you only want to process logs, you don't need to use Nutch. Since you are
interested in storing them in Hadoop, there are several log processors with
a Hadoop backend. Cloudera has one whose name I have forgotten, but here is
another one:
http://incubator.apache.org/chukwa/docs/r0.3.0/design.html
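
To make the log-processing step concrete, here is a minimal sketch of a
page-view count written in map/reduce style. It is not taken from Chukwa or
any other tool mentioned in this thread. The log layout (tab-separated
timestamp, site, URL) and the function names map_page_views and
reduce_page_views are assumptions made for illustration. It runs locally in
plain Python; the same two steps could be split into a mapper and a reducer
for a Hadoop job.

```python
from collections import defaultdict


def map_page_views(lines):
    """Map step: emit ((site, url), 1) for every well-formed log line.

    Assumed (hypothetical) line layout: timestamp<TAB>site<TAB>url
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # ignore malformed lines instead of failing the job
        _timestamp, site, url = fields
        yield (site, url), 1


def reduce_page_views(pairs):
    """Reduce step: sum the counts for each (site, url) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "2012-04-30T08:36:00\texample.org\t/index.html\n",
        "2012-04-30T08:37:10\texample.org\t/index.html\n",
        "2012-04-30T08:38:45\texample.org\t/about.html\n",
        "a malformed line\n",
    ]
    report = reduce_page_views(map_page_views(sample))
    for (site, url), views in sorted(report.items()):
        print(site, url, views)
```

Keeping the map and reduce steps as separate functions mirrors how the work
would be divided across a cluster: the map step needs only one line at a
time, and the reduce step needs only the pairs that share a key.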

On Mon, Apr 30, 2012 at 8:36 AM, Alex McLintock <[hidden email]> wrote:

> Hi Folks,
>
> This is not 100% a Nutch question... and I hate it when other people say "I
> know my question is off topic....." so why I am doing it myself, I don't
> know.
>
> I am looking at building a system similar to Google Analytics - in that it
> logs page requests on third-party sites using some kind of JavaScript,
> processes those logs, and produces reports. I see there are open source
> tools for this which are MySQL/RDBMS-backed - but I want a Hadoop-backed
> system for scalability. Do I need to implement it myself, or is anyone
> already working on such a thing?
>
> To bring this back to Nutch I would also like to fetch and index all the
> pages which are logged in this way so that my system knows what they are
> about. (But I don't really need any web crawling after that)
>
> Any ideas?
>
> Cheers
>

Re: Google Analytics in Hadoop ?

lewis john mcgibbney
Hi Alex,

Further to this, the crawling should be fairly straightforward once
you've cracked the processing part of your problem.
Please get back to us if you encounter problems.

Lewis

On Mon, Apr 30, 2012 at 5:43 PM, Peyman Mohajerian <[hidden email]> wrote:

> If you only want to process logs, you don't need to use Nutch. Since you are
> interested in storing them in Hadoop, there are several log processors with
> a Hadoop backend. Cloudera has one whose name I have forgotten, but here is
> another one:
> http://incubator.apache.org/chukwa/docs/r0.3.0/design.html


