How to configure Apache Gora to take only ol as column family?


suyashaoc
Hello nutch-users,

I want to modify gora-hbase-mapping.xml so that only outlinks are ingested
into HBase. When I comment out any of the column families, I get an error.

Please suggest a solution so that I can store only the ol column family in HBase.

Thanks,
Suyash
Re: How to configure Apache gora to take only ol as column family ?

lewis john mcgibbney-2
Hi Suyash,

One way to address this is to comment out every place where the WebPage [0]
object is populated within each job (and possibly each plugin).
An example would be as follows:
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/parse/ParseUtil.java#L358
You would need to step through the entire codebase and comment out the code
that sets (and maybe gets) values on the WebPage object.
The alternative is to create a new WebPage schema containing only the
outlinks data structure, then use the 'ant generate-gora-src' target to
regenerate the WebPage class.
https://github.com/apache/nutch/blob/2.x/build.xml#L612-L623
You can then attempt to recompile the project and address each compile
error in turn until all that remains is code pertaining to outlinks.
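
For illustration only, a trimmed HBase mapping to go with such an
outlinks-only schema might look roughly like the sketch below. It assumes
the element and attribute names of the stock Nutch 2.x
gora-hbase-mapping.xml and a table called "webpage"; check both against
your own conf directory. On its own this file is not enough: the WebPage
class still has to be regenerated from the reduced schema, otherwise the
remaining fields have no mapping and the jobs fail.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical trimmed mapping: only the ol family and the outlinks field remain. -->
<gora-orm>
  <table name="webpage">
    <!-- Single column family for outlinks; maxVersions value is an assumption. -->
    <family name="ol" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage">
    <!-- outlinks (map of target URL to anchor text) stored in the ol family. -->
    <field name="outlinks" family="ol"/>
  </class>
</gora-orm>
```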
hth
Lewis

[0]
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/storage/WebPage.java

On Thu, Mar 16, 2017 at 2:45 AM, <[hidden email]> wrote:

>
> From: suyash singh <[hidden email]>
> To: [hidden email]
> Cc:
> Bcc:
> Date: Tue, 14 Mar 2017 01:30:49 +0530
> Subject: Re: extract elements from each url as json and write it to s3
> Hi,
> I think you have to use a database like MongoDB. Write your custom Gora
> MongoDB mapping.xml and pass your JSON object to it.
>
> Thanks,
> suyash
>
>