updatedb deletes all metadata except _csh_

alxsss
Hello,


I am using nutch-2.x with GORA_94. I noticed that the second updatedb deletes all metadata except _csh_ for pages from the first fetch. The steps to reproduce are the following:
1. inject
2. generate batchId 1
3. fetch batchId 1, which adds some metadata to the mtdt field
4. updatedb batchId 1
5. generate batchId 2
6. fetch batchId 2
7. updatedb batchId 2

Then check whether the metadata for URLs with batchId 1 is still present.


Thanks.
Alex.

Re: updatedb deletes all metadata except _csh_

Julien Nioche-4
Any Nutch-2 users or committers to help Alex on this one?

Re: updatedb deletes all metadata except _csh_

lewis john mcgibbney
In reply to this post by alxsss
Hi Alex,

On Tue, Jun 17, 2014 at 2:06 PM, <[hidden email]> wrote:

>
> I am using nutch-2.x with GORA_97.


You mean GORA-94, the Avro upgrade?
With which gora- backend please?


> Further investigation shows that DbUpdateReducer
> calls
>  inlinkedScoreData.clear();
>

I see this on line ~72 of DbUpdateReducer


>
> and it calls this function
>
>  public void readFields(DataInput in) throws IOException {
>

Can you please point me to where ScoreDatum#readFields is called?


>
> And metaData.clear(); line clears all metadata.
>

Yes this should result in an empty HashMap data structure.


>
> Why metaData.clear(); line is needed in this function?
>
>
It is poorly documented and this class has not been altered for some time, so
off the top of my head I have to say that I do not know why. Based on the
Javadoc for Writable, an overridden readFields "...should attempt to
re-use storage in the existing object where possible", so I am not sure why
we clear the metadata from the HashMap structure. I would need to debug
this to understand.
If you can provide more context on where ScoreDatum#readFields is called
then I can set a breakpoint there.
Thanks Alex
Lewis
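
One plausible reason for the metaData.clear() call, not confirmed anywhere in this thread, is that Hadoop commonly reuses a single Writable instance across records, so readFields has to reset any collection it fills or entries from the previous record leak into the next one. The sketch below illustrates that effect with plain java.io only. ReusableDatum, its clearFirst flag, and the helper methods are hypothetical names for illustration; this is not Nutch's ScoreDatum.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Writable-style class (assumption, not Nutch code).
class ReusableDatum {
  final Map<String, String> metaData = new HashMap<>();
  private final boolean clearFirst;

  ReusableDatum(boolean clearFirst) {
    this.clearFirst = clearFirst;
  }

  // Serializes the map as: entry count, then key/value pairs.
  void write(DataOutput out) throws IOException {
    out.writeInt(metaData.size());
    for (Map.Entry<String, String> e : metaData.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  // Deserializes into this (possibly reused) object.
  void readFields(DataInput in) throws IOException {
    if (clearFirst) {
      metaData.clear(); // drop whatever the previous record left behind
    }
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      String key = in.readUTF();
      String value = in.readUTF();
      metaData.put(key, value);
    }
  }

  // Builds the serialized form of a record holding one metadata entry.
  static byte[] serialize(String key, String value) throws IOException {
    ReusableDatum d = new ReusableDatum(true);
    d.metaData.put(key, value);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    d.write(new DataOutputStream(bytes));
    return bytes.toByteArray();
  }

  // Reads two different records into the same object, as a framework might.
  static Map<String, String> readTwoRecords(boolean clearFirst) {
    try {
      ReusableDatum reused = new ReusableDatum(clearFirst);
      reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("a", "1"))));
      reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("b", "2"))));
      return reused.metaData;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("with clear():    " + readTwoRecords(true));
    System.out.println("without clear(): " + readTwoRecords(false));
  }
}
```

With the clear() call the reused object ends up holding only the second record's entry; without it, the first record's entry is still present after the second read. Under that reading, clearing the map still reuses the storage (the same HashMap instance) and only discards stale contents.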

Re: updatedb deletes all metadata except _csh_

alxsss
Hello,


I have gora_94 with hbase-0.94.17 and avro-1.7.6.

I have investigated further, and it turned out that the culprit is not inlinkedScoreData.clear(). I also found another issue in addition to the deletion of custom metadata.

For simplicity, let's consider only one seed URL, say mydomain.com, whose page contains two <a> tags:

http://mydomain.com has <a href="http://mydomain.com">Home</a> and <a href="http://mydomain.com/page1">Page1</a>

The same <a> tags are in http://mydomain.com/page1, i.e.

http://mydomain.com/page1 has <a href="http://mydomain.com">Home</a> and <a href="http://mydomain.com/page1">Page1</a>
When we do


bin/nutch inject seed
bin/nutch generate -batchId 1
bin/nutch fetch 1
bin/nutch updatedb 1


mydomain.com is fetched, and after bin/nutch updatedb 1 the page
http://mydomain.com/page1 is added as an outlink.

In the second round

bin/nutch generate -batchId 2
bin/nutch fetch 2

http://mydomain.com/page1 is fetched and parsed. However, in

bin/nutch updatedb 2

http://mydomain.com appears as an outlink of http://mydomain.com/page1, and DbUpdateReducer.java treats it as a new page.

So the first issue is that the custom metadata for http://mydomain.com is deleted after bin/nutch updatedb 2.
The second issue is that the status of http://mydomain.com changes from fetched to unfetched.


I will investigate further and post again.


Thanks.
Alex.
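
The two-round scenario described in this post can be modeled with a deliberately simplified sketch. It assumes the update step sees only the rows of the current batch, so an outlink pointing at a page from an earlier batch looks like a brand-new URL and its row is rewritten from scratch. BatchFilteredUpdate, Page, updateDb, and runTwoRounds are hypothetical names; a HashMap stands in for the datastore, and none of this is actual Nutch or Gora code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the reported behavior (assumption, not Nutch code).
class BatchFilteredUpdate {

  // Hypothetical row: a status plus free-form metadata.
  static final class Page {
    String status;
    final Map<String, String> metadata = new HashMap<>();

    Page(String status) {
      this.status = status;
    }
  }

  // Models updatedb for one batch in which only fetchedUrl belongs to the batch.
  static void updateDb(Map<String, Page> store, String fetchedUrl, String... outlinks) {
    store.get(fetchedUrl).status = "fetched";
    for (String target : outlinks) {
      if (target.equals(fetchedUrl)) {
        continue; // self-link: this row is part of the batch, nothing to add
      }
      // The existing row for target is invisible to this batch, so the outlink
      // is treated as a new page and a fresh row overwrites the old one.
      store.put(target, new Page("unfetched"));
    }
  }

  // Replays inject, round 1 and round 2, and returns the seed page's final row.
  static Page runTwoRounds() {
    String home = "http://mydomain.com";
    String page1 = "http://mydomain.com/page1";
    Map<String, Page> store = new HashMap<>();

    store.put(home, new Page("unfetched"));          // inject
    store.get(home).metadata.put("custom", "value"); // fetch 1 adds metadata
    updateDb(store, home, home, page1);              // updatedb 1

    updateDb(store, page1, home, page1);             // fetch 2 + updatedb 2
    return store.get(home);
  }

  public static void main(String[] args) {
    Page home = runTwoRounds();
    System.out.println(home.status + " " + home.metadata); // prints "unfetched {}"
  }
}
```

In this model the seed page ends the second round as unfetched with empty metadata, which matches both reported symptoms. The real reducer keeps _csh_, so the actual overwrite is evidently not a full replacement of the row.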





Re: updatedb deletes all metadata except _csh_

alxsss
Hi,

So far, this looks like a bug in updatedb when filtering with batchId.

I could find only one solution: check whether new pages are already in the datastore and, if they are, skip them.
Alternatively, running updatedb with the -all option also works.

Thanks.
Alex.

Re: updatedb deletes all metadata except _csh_

lewis john mcgibbney
In reply to this post by alxsss
Hi Alex,

I am really sorry for not making the connection here.

On Tue, Jun 24, 2014 at 12:31 AM, <[hidden email]> wrote:

>
> So far, this looks like a bug in updatedb when filtering with batchId.
>
> I could only found one solution, to check if new pages are in the datastore
> and if they are skip them.
> Otherwise updatedb with option -all will also work.
>

https://issues.apache.org/jira/browse/NUTCH-1679

If you can run with this patch, then please post your results here.

Re: updatedb deletes all metadata except _csh_

alxsss
Hi,


I had already come up with changes similar to the ones in this patch. My only suggestion for the patch is to move the check for whether the URL exists in the datastore below

if (!additionsAllowed) {
  return;
}

and to close the datastore.


Thanks.
Alex.
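
The ordering suggested above can be sketched as follows: test the cheap additionsAllowed flag first, do the datastore lookup only afterwards, and release the datastore handle when done. This is a minimal illustration under stated assumptions, not the NUTCH-1679 patch. GuardedUpdate, Store, exists, and addNewPage are hypothetical names, and a HashMap stands in for the real datastore.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the guard order discussed in the thread (assumption, not the actual patch).
class GuardedUpdate {

  // Hypothetical datastore handle that counts lookups and can be closed.
  static final class Store implements AutoCloseable {
    final Map<String, String> rows; // url -> status
    int lookups = 0;
    boolean closed = false;

    Store(Map<String, String> rows) {
      this.rows = rows;
    }

    boolean exists(String url) {
      lookups++;
      return rows.containsKey(url);
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Returns true only if a new row was actually created for url.
  static boolean addNewPage(Store store, String url, boolean additionsAllowed) {
    if (!additionsAllowed) {
      return false; // cheap flag check first: no datastore lookup is paid
    }
    if (store.exists(url)) {
      return false; // already known: leave status and metadata untouched
    }
    store.rows.put(url, "unfetched");
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> rows = new HashMap<>();
    rows.put("http://mydomain.com", "fetched");
    // try-with-resources closes the handle even if an exception is thrown.
    try (Store store = new Store(rows)) {
      System.out.println(addNewPage(store, "http://mydomain.com", true));
      System.out.println(addNewPage(store, "http://mydomain.com/page1", true));
    }
    System.out.println(rows.get("http://mydomain.com"));
  }
}
```

Putting the flag check first means that a run with additions disabled never touches the datastore, and an existing row keeps its fetched status instead of being recreated as unfetched.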

Re: updatedb deletes all metadata except _csh_

lewis john mcgibbney
In reply to this post by alxsss
Hi Alex,

On Thu, Jun 26, 2014 at 5:48 AM, <[hidden email]> wrote:

> I already came up with similar changes to the code as in this patch. Only
> suggestion to this patch's code is that to move checking if url exists in
> the datastore under
>
>
> if (!additionsAllowed) {
>          return;
>        }
>
>
> and close datastore.
>
>
Is it possible for you to attach your working patch to the issue? I
would like to converge on what I am running, i.e. the patch on the issue,
and what you are suggesting.
Thank you if this is possible.
Lewis