nutch-2.x with hbase filter option


nutch-2.x with hbase filter option

alxsss
Hello,

I was wondering when nutch-2.x with the HBase filter option is planned to be released.

Thanks.
Alex.

Re: nutch-2.x with hbase filter option

Talat Uyarer
Hi Alex,

The next release, 2.3, will have filtering and HBase 0.94 support. Nothing is
certain yet; we are waiting for the Gora 0.4 release now.

Talat
On 5 Mar 2014 at 02:40, <[hidden email]> wrote:

> Hello,
>
> I was wondering when nutch-2.x with the HBase filter option is planned to be
> released.
>
> Thanks.
> Alex.
>

Re: nutch-2.x with hbase filter option

alxsss
Hi,

I took a look at the fetcher and indexer, and I do not see any changes regarding the filter options sent to the backend datastore.

I was expecting that the Nutch code would specify filtering on a parameter, let's say on batchId. If this is not the case, then how will the filtering happen?

Thanks.
Alex.
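
The idea behind a datastore-side filter can be sketched as follows. This is a plain-Java illustration, not Gora or Nutch code: `Row`, `ToyStore`, `scanAll` and `scanWithFilter` are hypothetical names, and the in-memory list only stands in for an HBase table. The point is that a predicate on `batchId` evaluated inside the store reduces what is handed back to the job, whereas a client-side check still transfers every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-ins for illustration only; none of this is Gora or Nutch API.
final class FilterPushdownSketch {

  /** A stored row: a URL plus the batch it was generated in. */
  static final class Row {
    final String url;
    final String batchId;
    Row(String url, String batchId) { this.url = url; this.batchId = batchId; }
  }

  /** Toy datastore that counts how many rows it hands back to the caller. */
  static final class ToyStore {
    private final List<Row> rows = new ArrayList<>();
    int rowsTransferred = 0;

    void put(Row row) { rows.add(row); }

    /** No filter: every row is shipped and the caller has to filter. */
    List<Row> scanAll() {
      rowsTransferred += rows.size();
      return new ArrayList<>(rows);
    }

    /** Filter evaluated inside the store: only matching rows are shipped. */
    List<Row> scanWithFilter(Predicate<Row> filter) {
      List<Row> out = new ArrayList<>();
      for (Row r : rows) {
        if (filter.test(r)) out.add(r);
      }
      rowsTransferred += out.size();
      return out;
    }
  }

  /** Three example rows, two in batch-1 and one in batch-2. */
  static ToyStore sampleStore() {
    ToyStore store = new ToyStore();
    store.put(new Row("http://example.com/a", "batch-1"));
    store.put(new Row("http://example.com/b", "batch-1"));
    store.put(new Row("http://example.com/c", "batch-2"));
    return store;
  }

  public static void main(String[] args) {
    // Client-side filtering: all rows cross the wire, then most are discarded.
    ToyStore clientSide = sampleStore();
    int kept = 0;
    for (Row r : clientSide.scanAll()) {
      if ("batch-2".equals(r.batchId)) kept++;
    }
    System.out.println("client-side: kept " + kept + ", transferred " + clientSide.rowsTransferred);

    // Store-side filtering: the predicate travels to the store instead.
    ToyStore storeSide = sampleStore();
    List<Row> hits = storeSide.scanWithFilter(r -> "batch-2".equals(r.batchId));
    System.out.println("store-side: kept " + hits.size() + ", transferred " + storeSide.rowsTransferred);
  }
}
```

In a real backend the predicate would be expressed as a filter object that the datastore understands rather than as a Java lambda; the sketch only shows where the work happens.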

 

 

-----Original Message-----
From: Talat Uyarer <[hidden email]>
To: user <[hidden email]>
Cc: nutch-user <[hidden email]>
Sent: Tue, Mar 4, 2014 8:32 pm
Subject: Re: nutch-2.x with hbase filter option


Hi Alex,

The next release, 2.3, will have filtering and HBase 0.94 support. Nothing is
certain yet; we are waiting for the Gora 0.4 release now.

Talat

Re: nutch-2.x with hbase filter option

Otis Gospodnetic-2
This?

https://issues.apache.org/jira/browse/NUTCH-1674

Otis
---- 
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 




On Friday, March 7, 2014 7:18 PM, "[hidden email]" <[hidden email]> wrote:
 
Hi,

>
>I took a look at the fetcher and indexer, and I do not see any changes regarding the filter options sent to the backend datastore.
>
>I was expecting that the Nutch code would specify filtering on a parameter, let's say on batchId. If this is not the case, then how will the filtering happen?
>
>Thanks.
>Alex.

Re: nutch-2.x with hbase filter option

alxsss
Thanks Otis. This is what I was looking for.

After applying this patch to the current trunk, with some modifications, I have
gora-core-0.4-SNAPSHOT.jar
gora-hbase-0.4-SNAPSHOT.jar

With hbase-0.94.17.jar, the inject command gives
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V

Do you know which version of HBase this patch must be used with?

Thanks.
Alex.
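
The `(I)V` at the end of the error is a JVM method descriptor: the caller was compiled against a `setMaxVersions` that takes one `int` and returns `void`. A `NoSuchMethodError` of this shape usually means that the library on the runtime classpath declares the method with a different parameter list or return type than the one the caller was compiled against. The sketch below decodes such descriptors with the JDK's `java.lang.invoke.MethodType`; the class name `DescriptorDecoder` and its `decode` helper are made up for this example, and nothing here touches HBase itself.

```java
import java.lang.invoke.MethodType;

// Illustrative helper; the class and method names are invented for this sketch.
final class DescriptorDecoder {

  /** Parses a JVM method descriptor such as "(I)V" into a MethodType. */
  static MethodType decode(String descriptor) {
    // A null loader means class names are resolved via the system class loader.
    return MethodType.fromMethodDescriptorString(descriptor, null);
  }

  public static void main(String[] args) {
    String[] samples = {"(I)V", "()I", "()Ljava/lang/Integer;"};
    for (String d : samples) {
      MethodType t = decode(d);
      System.out.println(d + " -> parameters " + t.parameterList()
          + ", returns " + t.returnType().getName());
    }
  }
}
```

Reading the descriptor this way makes it easier to compare what the caller expects with what the jar on the classpath actually declares.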


 

 

 

-----Original Message-----
From: Otis Gospodnetic <[hidden email]>
To: user <[hidden email]>
Sent: Fri, Mar 7, 2014 9:46 pm
Subject: Re: nutch-2.x with hbase filter option


This:

https://issues.apache.org/jira/browse/NUTCH-1674


 ?

Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 






Re: nutch-2.x with hbase filter option

Alparslan Avcı
Hi Alex,

If you installed gora-core-0.4-SNAPSHOT.jar and
gora-hbase-0.4-SNAPSHOT.jar from Gora trunk, you can use the filtering option,
but you have to use the hbase-0.90.x family. However, if you
installed gora-core-0.4-SNAPSHOT.jar and gora-hbase-0.4-SNAPSHOT.jar
from the GORA_94 branch (which now also supports filtering), you can use
the hbase-0.94.x family. I do not think minor version changes will cause
problems.

Alparslan


On 26-03-2014 22:09, [hidden email] wrote:

> Thanks Otis. This is what I was looking for.
>
> After applying this patch to the current trunk with some modifications I have
> gora-core-0.4-SNAPSHOT.jar
> gora-hbase-0.4-SNAPSHOT.jar
>
> With hbase-0.94.17.jar the inject command gives
> Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V
>
> Do you know with which version of hbase this patch must be used?
>
> Thanks.
> Alex.

Re: nutch-2.x with hbase filter option

alxsss
Hi Alparslan,

I downloaded the GORA_94 branch, and with the libs from it I get

14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: test_urls
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/gora/persistency/StateManager
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
        at java.lang.Class.getConstructor0(Class.java:2699)
        at java.lang.Class.getConstructor(Class.java:1657)
        at org.apache.gora.util.ReflectionUtils.getConstructor(ReflectionUtils.java:44)
        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:78)
        at org.apache.gora.persistency.impl.BeanFactoryImpl.<init>(BeanFactoryImpl.java:66)
        at org.apache.gora.store.impl.DataStoreBase.initialize(DataStoreBase.java:91)
        at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:111)
        at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:76)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: org.apache.gora.persistency.StateManager
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 23 more

Looks like class StateManager is missing.

Please advise.

Thanks.
Alex.
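
One way to confirm which classes a job actually sees is to ask the JVM directly. The helper below is a generic sketch (`ClasspathProbe` and `describe` are hypothetical names): it tries to load a class by name without initializing it and reports the jar or directory it came from. Run with the same classpath as the failing command and given `org.apache.gora.persistency.StateManager` as an argument, it would show whether any jar on that classpath still provides the class.

```java
import java.security.CodeSource;

// Generic diagnostic sketch; not part of Nutch or Gora.
final class ClasspathProbe {

  /** Returns where the named class was loaded from, or a "not found" message. */
  static String describe(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers.
      Class<?> c = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
      CodeSource src = c.getProtectionDomain().getCodeSource();
      // Classes that ship with the JDK typically have no code source.
      return className + " found in " + (src == null ? "the JDK runtime" : src.getLocation());
    } catch (ClassNotFoundException | LinkageError e) {
      return className + " not found (" + e.getClass().getSimpleName() + ")";
    }
  }

  public static void main(String[] args) {
    // Pass fully qualified class names as arguments.
    for (String name : args) {
      System.out.println(describe(name));
    }
  }
}
```

If the class is reported as missing while the code that needs it is present, the application classes and the library jars were most likely built from different versions.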

 

 

 

-----Original Message-----
From: Alparslan Avcı <[hidden email]>
To: user <[hidden email]>
Sent: Thu, Mar 27, 2014 12:01 am
Subject: Re: nutch-2.x with hbase filter option


Hi Alex,

If you installed gora-core-0.4-SNAPSHOT.jar and
gora-hbase-0.4-SNAPSHOT.jar from Gora trunk, you can use the filtering option,
but you have to use the hbase-0.90.x family. However, if you
installed gora-core-0.4-SNAPSHOT.jar and gora-hbase-0.4-SNAPSHOT.jar
from the GORA_94 branch (which now also supports filtering), you can use
the hbase-0.94.x family. I do not think minor version changes will cause
problems.

Alparslan



Re: nutch-2.x with hbase filter option

Alparslan Avcı
Hi Alex,

Nutch 2.x does not support GORA_94 yet. However, there is an issue
(https://issues.apache.org/jira/browse/NUTCH-1714) about this, and you
can use the patches from it if you want. The patches were uploaded in January
2014, though, so some further changes may be needed.

Alparslan

On 27-03-2014 20:39, [hidden email] wrote:

> Hi Alparslan,
>
> I downloaded the GORA_94 branch, and with the libs from it I get
>
> 14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: test_urls
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/gora/persistency/StateManager
>          at java.lang.Class.getDeclaredConstructors0(Native Method)
>          at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
>          at java.lang.Class.getConstructor0(Class.java:2699)
>          at java.lang.Class.getConstructor(Class.java:1657)
>          at org.apache.gora.util.ReflectionUtils.getConstructor(ReflectionUtils.java:44)
>          at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:78)
>          at org.apache.gora.persistency.impl.BeanFactoryImpl.<init>(BeanFactoryImpl.java:66)
>          at org.apache.gora.store.impl.DataStoreBase.initialize(DataStoreBase.java:91)
>          at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:111)
>          at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
>          at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
>          at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
>          at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:76)
>          at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
>          at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
>          at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
>          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>          at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.lang.ClassNotFoundException: org.apache.gora.persistency.StateManager
>          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>          at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>          at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>          ... 23 more
>
> Looks like class StateManager is missing.
>
> Please advise.
>
> Thanks.
> Alex.

Re: nutch-2.x with hbase filter option

lewis john mcgibbney
In reply to this post by alxsss
Hi alxsss,

On Sat, Mar 29, 2014 at 10:15 PM, <[hidden email]> wrote:

>
> I downloaded the GORA_94 branch, and with the libs from it I get
>>
>> 14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
>> test_urls
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/gora/persistency/StateManager
>>
>
Please see the patch in https://issues.apache.org/jira/browse/NUTCH-1714
There has been a SIGNIFICANT amount of work done on the persistency API in
GORA_94; as a consequence, API usage in Nutch has also changed a bit.
Let us know how you get on.
Also thanks Alparslan for the patch in NUTCH-1714
Thanks
Lewis

Re: nutch-2.x with hbase filter option

alxsss
I have applied the patch to the current trunk.

Here is the output of ant:

    [javac] Compiling 2 source files to /home/kingson/nutch-2.x-svn/build/classes
    [javac] /nutch-2.x/src/java/org/apache/nutch/host/HostDbUpdateReducer.java:78: cannot find symbol
    [javac] symbol  : method newBuilder()
    [javac] location: class org.apache.nutch.storage.Host
    [javac]     Host host = Host.newBuilder().build();
    [javac]                     ^
    [javac] /nutch-2.x/src/java/org/apache/nutch/host/HostInjectorJob.java:124: cannot find symbol
    [javac] symbol  : method newBuilder()
    [javac] location: class org.apache.nutch.storage.Host
    [javac]       Host host = Host.newBuilder().build();
    [javac]                       ^
    [javac] 2 errors

BUILD FAILED
/nutch-2.x/build.xml:101: Compile failed; see the compiler error output for details.


Also, here are the files that import or use the StateManager class, which seems to have been removed in GORA_94:

$ grep -r StateManager src/
src/java/org/apache/nutch/storage/Host.java:import org.apache.gora.persistency.StateManager;
src/java/org/apache/nutch/storage/Host.java:import org.apache.gora.persistency.impl.StateManagerImpl;
src/java/org/apache/nutch/storage/Host.java:    this(new StateManagerImpl());
src/java/org/apache/nutch/storage/Host.java:  public Host(StateManager stateManager) {
src/java/org/apache/nutch/storage/Host.java:  public Host newInstance(StateManager stateManager) {
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, _field);
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, 0);
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, 0);
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, 1);
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, 1);
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, 2);
src/java/org/apache/nutch/storage/Host.java:    getStateManager().setDirty(this, 2);
src/java/org/apache/nutch/storage/WebPage.java:import org.apache.gora.persistency.StateManager;
src/java/org/apache/nutch/storage/WebPage.java:import org.apache.gora.persistency.impl.StateManagerImpl;
src/java/org/apache/nutch/storage/WebPage.java:    this(new StateManagerImpl());
src/java/org/apache/nutch/storage/WebPage.java:  public WebPage(StateManager stateManager) {
src/java/org/apache/nutch/storage/WebPage.java:  public WebPage newInstance(StateManager stateManager) {
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, _field);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 18);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 18);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 19);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 19);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 20);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 20);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 21);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 21);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 22);
src/java/org/apache/nutch/storage/WebPage.java:    getStateManager().setDirty(this, 22);
src/java/org/apache/nutch/storage/ParseStatus.java:import org.apache.gora.persistency.StateManager;
src/java/org/apache/nutch/storage/ParseStatus.java:import org.apache.gora.persistency.impl.StateManagerImpl;
src/java/org/apache/nutch/storage/ParseStatus.java:    this(new StateManagerImpl());
src/java/org/apache/nutch/storage/ParseStatus.java:  public ParseStatus(StateManager stateManager) {
src/java/org/apache/nutch/storage/ParseStatus.java:  public ParseStatus newInstance(StateManager stateManager) {
src/java/org/apache/nutch/storage/ParseStatus.java:    getStateManager().setDirty(this, _field);
src/java/org/apache/nutch/storage/ParseStatus.java:    getStateManager().setDirty(this, 2);
src/java/org/apache/nutch/storage/ProtocolStatus.java:import org.apache.gora.persistency.StateManager;
src/java/org/apache/nutch/storage/ProtocolStatus.java:import org.apache.gora.persistency.impl.StateManagerImpl;
src/java/org/apache/nutch/storage/ProtocolStatus.java:    this(new StateManagerImpl());
src/java/org/apache/nutch/storage/ProtocolStatus.java:  public ProtocolStatus(StateManager stateManager) {
src/java/org/apache/nutch/storage/ProtocolStatus.java:  public ProtocolStatus newInstance(StateManager stateManager) {
src/java/org/apache/nutch/storage/ProtocolStatus.java:    getStateManager().setDirty(this, _field);
src/java/org/apache/nutch/storage/ProtocolStatus.java:    getStateManager().setDirty(this, 1);

Thanks.
Alex.

 

 

 

-----Original Message-----
From: Lewis John Mcgibbney <[hidden email]>
To: user <[hidden email]>
Sent: Sun, Mar 30, 2014 3:37 am
Subject: Re: nutch-2.x with hbase filter option


Hi alxsss,

On Sat, Mar 29, 2014 at 10:15 PM, <[hidden email]> wrote:

>
> I downloaded the GORA_94 branch, and with the libs from it I get
>>
>> 14/03/27 11:21:19 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
>> test_urls
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/gora/persistency/StateManager
>>
>
Please see the patch in https://issues.apache.org/jira/browse/NUTCH-1714
There has been a SIGNIFICANT amount of work done on the persistency API in
GORA_94; as a consequence, API usage in Nutch has also changed a bit.
Let us know how you get on.
Also thanks Alparslan for the patch in NUTCH-1714
Thanks
Lewis

 

Re: nutch-2.x with hbase filter option

alxsss
Hi,

I was able to fix these errors by making some changes to the code and using avro-1.7.
Now, when I run the updatedb command, it gives

2014-04-09 14:29:36,460 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: org.apache.nutch.storage.WebPage.getFetchInterval()I
        at org.apache.nutch.crawl.AdaptiveFetchSchedule.setFetchSchedule(AdaptiveFetchSchedule.java:85)
        at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:141)
        at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:41)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
 

Thanks.
Alex.

Re: nutch-2.x with hbase filter option

Alparslan Avcı
Hi Alex,

It seems from the log that your build's WebPage class has no
getFetchInterval() method. However, that method exists in the 2.x branch and
in the patch in NUTCH-1714.

Have you changed any code in the WebPage class? And do you run in local or
distributed mode?


Thanks,
Alparslan
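
A method can exist in source and still be "missing" at run time, because the JVM links calls by name plus full descriptor, return type included. The sketch below imitates that lookup with `MethodHandles.Lookup.findVirtual`. `LinkageDemo` and its nested `Page` class are invented for the example (`Page` only mimics a getter that returns `Integer`); they are not Nutch classes.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Illustration only; Page is a hypothetical stand-in, not org.apache.nutch.storage.WebPage.
final class LinkageDemo {

  /** Mimics a regenerated bean whose getter returns the boxed type. */
  static final class Page {
    public Integer getFetchInterval() { return 2592000; }
  }

  /** True if Page declares getFetchInterval() with exactly this return type. */
  static boolean resolves(Class<?> returnType) {
    try {
      // Like the JVM's own linkage, the lookup matches on name and exact method type.
      MethodHandles.lookup().findVirtual(
          Page.class, "getFetchInterval", MethodType.methodType(returnType));
      return true;
    } catch (NoSuchMethodException | IllegalAccessException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("getFetchInterval()I resolves: " + resolves(int.class));
    System.out.println("getFetchInterval()Ljava/lang/Integer; resolves: " + resolves(Integer.class));
  }
}
```

Code compiled against an older class with an `int` getter asks for the first form, so it may be worth checking that every caller was recompiled against the new class and that no stale job jar is being deployed.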


On 10-04-2014 01:07, alxsss wrote:

> Hi,
>
> I was able to fix these errors making some changes to code and using
> avro-1.7.
> Now when I run updatedb command it gives
>
> 2014-04-09 14:29:36,460 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: org.apache.nutch.storage.WebPage.getFetchInterval()I
>          at org.apache.nutch.crawl.AdaptiveFetchSchedule.setFetchSchedule(AdaptiveFetchSchedule.java:85)
>          at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:141)
>          at org.apache.nutch.crawl.DbUpdateReducer.reduce(DbUpdateReducer.java:41)
>          at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>          at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:396)
>          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>          at org.apache.hadoop.mapred.Child.main(Child.java:249)
>  
>
> Thanks.
> Alex.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/nutch-2-x-with-hbase-filter-option-tp4121242p4130227.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


Re: nutch-2.x with hbase filter option

alxsss
In reply to this post by alxsss



 

Hi,

It turned out that the error is caused by a mismatch between the new method's return type (Integer) and the variable receiving the result. It went away after changing the following lines.

src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
CHANGE


 int interval = page.getFetchInterval();
    switch (state) {
      case FetchSchedule.STATUS_MODIFIED:
        interval *= (1.0f - DEC_RATE);
        break;
      case FetchSchedule.STATUS_NOTMODIFIED:
        interval *= (1.0f + INC_RATE);
        break;
      case FetchSchedule.STATUS_UNKNOWN:
        break;
    }

 

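 TO
  Integer interval = page.getFetchInterval();
    switch (state) {
      case FetchSchedule.STATUS_MODIFIED:
        // cast the product; (int) (1.0f - DEC_RATE) alone truncates to 0 for rates between 0 and 1
        interval = (int) (interval * (1.0f - DEC_RATE));
        break;
      case FetchSchedule.STATUS_NOTMODIFIED:
        // likewise, (int) (1.0f + INC_RATE) alone truncates to 1 for rates between 0 and 1
        interval = (int) (interval * (1.0f + INC_RATE));
        break;
      case FetchSchedule.STATUS_UNKNOWN:
        break;
    }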

---------------------------------------------------------------
src/java/org/apache/nutch/indexer/IndexingJob.java
//complains  NoSuchMethodError  getMinorCode()
CHANGE
  if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
          || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        return; // filter urls not parsed
      }

TO
if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
          || (int)pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
        return; // filter urls not parsed
      }
-----------------------------------------------------------------
src/java/org/apache/nutch/indexer/IndexUtil.java

//complains  NoSuchMethodError  getBatchId()
CHANGE
if (page.getBatchId() != null) {
      doc.add("batchId", page.getBatchId().toString());
    }

TO

 if ((Utf8)page.getBatchId() != null) {
      doc.add("batchId", page.getBatchId().toString());
    }

import org.apache.avro.util.Utf8;
-----------------------------------------------------------------------

With these additional changes it works so far :).


Thanks.
Alex.
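
When an integer interval is scaled by a fractional rate, the placement of the `(int)` cast matters: a cast applied to the factor alone truncates it before the multiplication happens. The sketch below shows both placements. `IntervalScaling` is a made-up class, not part of Nutch, and the 0.25f rates are example values chosen only because they are exact in binary floating point.

```java
// Illustration of cast placement; the names and rate values are assumptions for this sketch.
final class IntervalScaling {
  static final float DEC_RATE = 0.25f; // example value only
  static final float INC_RATE = 0.25f; // example value only

  /** Cast applied to the factor: the fractional part is lost before multiplying. */
  static int castFactor(int interval, float factor) {
    return interval * (int) factor;
  }

  /** Cast applied to the product: the fractional factor is kept until the end. */
  static int castProduct(int interval, float factor) {
    return (int) (interval * factor);
  }

  public static void main(String[] args) {
    int interval = 1000;
    System.out.println("factor cast, shrink:  " + castFactor(interval, 1.0f - DEC_RATE));
    System.out.println("factor cast, grow:    " + castFactor(interval, 1.0f + INC_RATE));
    System.out.println("product cast, shrink: " + castProduct(interval, 1.0f - DEC_RATE));
    System.out.println("product cast, grow:   " + castProduct(interval, 1.0f + INC_RATE));
  }
}
```

With the cast on the factor, a shrink step collapses the interval to zero and a grow step leaves it unchanged, so an adaptive schedule written that way would no longer adapt.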


-----Original Message-----
From: Alparslan Avcı <[hidden email]>
To: user <[hidden email]>
Sent: Thu, Apr 10, 2014 7:09 am
Subject: Re: nutch-2.x with hbase filter option


Hi Alex,

It seems from the log that your build's WebPage class has no
getFetchInterval() method. However, that method exists in the 2.x branch and
in the patch in NUTCH-1714.

Have you changed any code in WebPage class? And do you run in local or
distributed mode?


Thanks,
Alparslan



Re: nutch-2.x with hbase filter option

lewis john mcgibbney
In reply to this post by alxsss
Hi Alex,

Regarding your findings and the code you've posted:
can you please open an issue on Jira and post the code changes there (or, even
better, a patch :) )?
If you could link it to the Gora 0.4 upgrade issue in the Nutch Jira, it would
be excellent.
Thanks v much Alex.
Lewis

On Fri, Apr 11, 2014 at 11:14 AM, <[hidden email]> wrote:

>
> It turned out that the error is generated because of the mismatch of the
> new function's return type(Integer) and the resulting variable. After
> changing these lines
> in
>
>
SNIP


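>  TO
>   Integer interval = page.getFetchInterval();
>     switch (state) {
>       case FetchSchedule.STATUS_MODIFIED:
>         // cast the product; (int) (1.0f - DEC_RATE) alone truncates to 0 for rates between 0 and 1
>         interval = (int) (interval * (1.0f - DEC_RATE));
>         break;
>       case FetchSchedule.STATUS_NOTMODIFIED:
>         // likewise, (int) (1.0f + INC_RATE) alone truncates to 1 for rates between 0 and 1
>         interval = (int) (interval * (1.0f + INC_RATE));
>         break;
>       case FetchSchedule.STATUS_UNKNOWN:
>         break;
>     }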
>
>

Re: nutch-2.x with hbase filter option

alxsss
Actually, the patch provided in the original issue
 https://issues.apache.org/jira/browse/NUTCH-1714
 is not completely applicable to the current trunk, because the Nutch code has changed since then. I modified the original patch so that it applies to the current trunk.
I can combine the modified patch with the changes I made to the code and generate a new patch. If this is not advisable, please let me know.


Thanks.
Alex.

 

 

 

-----Original Message-----
From: Lewis John Mcgibbney <[hidden email]>
To: user <[hidden email]>
Sent: Fri, Apr 11, 2014 1:46 pm
Subject: Re: nutch-2.x with hbase filter option


Hi Alex,

Regarding your findings and the code you've posted:
can you please open an issue on Jira and post the code changes there (or, even
better, a patch :) )?
If you could link it to the Gora 0.4 upgrade issue in the Nutch Jira, it would
be excellent.
Thanks v much Alex.
Lewis
