CrawlDatum.metaData should never be null

Andrzej Białecki-2
Hi,

Per subject, I think it should follow the same pattern as other metadata
maps in ParseData and Content. Currently when we allocate new
CrawlDatum, metaData is null, which complicates the logic in all places
that want to handle metaData.

When CrawlDatum is serialized, we already check if metaData.size() > 0,
and if not then nothing is written out. So, it doesn't make much sense
to use null here - savings on the object creation are also minimal.

If there are no objections, I'll make the change to always allocate
metaData = new MapWritable(), whenever we create CrawlDatum.
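
A minimal sketch of the proposal above, not the actual Nutch code. It assumes a hypothetical stand-in class (EagerDatum) and uses a plain HashMap in place of MapWritable so that it runs without Hadoop. The one-byte presence flag is also an assumption about how the "nothing is written out" check could look.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum; HashMap replaces MapWritable here.
class EagerDatum {
    // Always allocated, so callers never need a null check.
    private final Map<String, String> metaData = new HashMap<>();

    Map<String, String> getMetaData() {
        return metaData;
    }

    // Writes a presence flag (an assumption of this sketch), then the
    // entries only when the map is non-empty.
    void write(DataOutputStream out) throws IOException {
        boolean hasMeta = metaData.size() > 0;
        out.writeBoolean(hasMeta);
        if (hasMeta) {
            out.writeInt(metaData.size());
            for (Map.Entry<String, String> e : metaData.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
    }

    // Helper for illustration: number of bytes produced by write().
    static int serializedSize(EagerDatum datum) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            datum.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

In this sketch an empty map costs only the flag byte on disk, so the remaining cost of always allocating is the in-memory map object itself.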

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: CrawlDatum.metaData should never be null

Stefan Groschupf-2
Hi Andrzej,
this was specifically requested by Doug: do not instantiate the object by
default, since that consumes too many resources.
So I changed it to the way it works today.

Stefan

On 25.04.2006, at 21:40, Andrzej Bialecki wrote:

> Hi,
>
> Per subject, I think it should follow the same pattern as other  
> metadata maps in ParseData and Content. Currently when we allocate  
> new CrawlDatum, metaData is null, which complicates the logic in  
> all places that want to handle metaData.
>
> When CrawlDatum is serialized, we already check if metaData.size()  
> > 0, and if not then nothing is written out. So, it doesn't make  
> much sense to use null here - savings on the object creation are  
> also minimal.
>
> If there are no objections, I'll make the change to always allocate  
> metaData = new MapWritable(), whenever we create CrawlDatum.

Re: CrawlDatum.metaData should never be null

Andrzej Białecki-2
Stefan Groschupf wrote:
> Hi Andrzej,
> this was specifically requested by Doug: do not instantiate the object by
> default, since that consumes too many resources.
> So I changed it to the way it works today.

Hmm.. I understand his point. But it means that I have to always put "if
(datum.getMetaData() == null)" check, which pollutes the code in all
places that deal with metadata. Currently this is just CrawlDbReducer
(but it already looks ugly there), but it will be like that in any place
that wants to use metadata.

If that's really such a big concern, then perhaps we should also set
ParseData.contentMeta and parseMeta to null, as well as Content.metadata ...

or perhaps the CrawlDatum.getMetaData() should instantiate it, this way
if you don't call the getter you won't get any allocation.
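
A rough sketch of that lazy getter, under stated assumptions: LazyDatum is a hypothetical stand-in for CrawlDatum, HashMap replaces MapWritable so the example runs on its own, and hasMetaData() is an invented helper that lets internal code such as serialization test for metadata without triggering the allocation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum; HashMap replaces MapWritable here.
class LazyDatum {
    // Stays null until somebody actually asks for the metadata.
    private Map<String, String> metaData;

    // Allocates the map on first access and returns the same map afterwards.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Invented helper: checks for entries without allocating anything.
    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }
}
```

Callers can then write datum.getMetaData().put(...) with no null check, while a datum whose getter is never called allocates no map. The sketch is not thread-safe, which would matter if one instance were shared between threads.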


Re: CrawlDatum.metaData should never be null

Doug Cutting
Andrzej Bialecki wrote:
> Hmm.. I understand his point. But it means that I have to always put "if
> (datum.getMetaData() == null)" check, which pollutes the code in all
> places that deal with metadata. Currently this is just CrawlDbReducer
> (but it already looks ugly there), but it will be like that in any place
> that wants to use metadata.

One thing to consider might be to add some methods to CrawlDatum like:

    public Writable getMeta(Writable key);

to minimize the null checks.
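
As a sketch of what such accessors might look like (AccessorDatum and putMeta are hypothetical names, and String keys and values stand in for Writable so the example runs without Hadoop):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum; String replaces Writable here.
class AccessorDatum {
    // May stay null for the whole lifetime of the object.
    private Map<String, String> metaData;

    // Null-safe read: returns null when the key is absent or when no
    // metadata map has been allocated yet.
    public String getMeta(String key) {
        return metaData == null ? null : metaData.get(key);
    }

    // Invented companion method: allocates the map on the first write only.
    public void putMeta(String key, String value) {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        metaData.put(key, value);
    }
}
```

The null check then lives in one place inside the class, and readers of metadata never allocate anything.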

Or we can simply abandon this probably premature optimization.  The
MapReduce code now reuses keys and values (unless you're using a
combiner...) so the allocation should be less of an issue.

Doug

Re: CrawlDatum.metaData should never be null

Jérôme Charron
In reply to this post by Andrzej Białecki-2
> or perhaps the CrawlDatum.getMetaData() should instantiate it, this way
> if you don't call the getter you won't get any allocation.

+1 for this kind of "lazy initialisation".

Jérôme