[jira] Created: (NUTCH-59) meta data support in webdb

[jira] Created: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
meta data support in webdb
---------------------------

         Key: NUTCH-59
         URL: http://issues.apache.org/jira/browse/NUTCH-59
     Project: Nutch
        Type: New Feature
    Reporter: Stefan Groschupf
    Priority: Minor


Metadata support in the WebDB would be very useful for a new set of Nutch features that need long-lived metadata.

Currently, page metadata has to be regenerated or looked up every time a page is re-fetched (every 30 days). In the long run, metadata stored in the WebDB would bring a dramatic performance improvement for such tasks.
Furthermore, storing metadata in the WebDB would make a new generation of link-list generation filters possible.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-59?page=all ]

Stefan Groschupf updated NUTCH-59:
-----------------------------------

    Attachment: webDBMetaDataPatch.txt

Add meta data support to webdb.


[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364122 ]

James Jonas commented on NUTCH-59:
----------------------------------

I would like to offer my vote for Nutch-59  (+1)

I do have some comments with regards to the metadata infrastructure in Nutch. Here are some of my thoughts.

Storing metadata in WebDB opens up a long list of potential new uses for Nutch.

- Location based Queries
  - The topic of the page relates to what city, state, country, geospatial coordinate
  - This page has multiple locations (list of WalMart Stores)
  - The server is located in this country (legal domicile)
  - The content is targeted to this geographic group (middle east, east chicago)
  - this particular location has this list of websites associated with it (garage.com has invested in X companies located in this area)
  - Directions to the store on the website (mapquest)
  - List of other website/store in the area (google local)
  ...

- People and Organizations
 - whois info
 - webmaster
 - editor(s)
 - company that owns the website
 - group within the company that owns the website.

There are several other metadata classes that can be associated with a page.
 - Dublin Core (as mentioned in other Nutch requirement docs)
 - CWM - Common Warehouse Metadata - links data warehouse (data mart) information to a web page.
 - Products (Froogle, Business.com...)

There are also newer forms of popular website technologies, each of which carries its own set of metadata.
 - wiki (license, topic...)
 - blog (topic, person, group...)
 - personal profiles (dating, facebook.com)
 - ontologies (dmoz, jena - owl, wordnet)
 - ...

Unstructured data (the web) contains a long list of coarse-grained classes of metadata that can be associated with each Page (artifact).

  A CONCEPTUAL META-MODEL FOR UNSTRUCTURED DATA
    http://www.tdan.com/i024fe01.htm


The models that persist metadata can become very complex.

 A UNIVERSAL PERSON AND ORGANIZATION DATA MODEL:
 THE PARTY/PARTY-RELATIONSHIP PATTERN
    http://www.tdan.com/i021ht04.htm

As well as the repositories that persist this type of data:

  Advanced Meta Data Architecture
   http://www.tdan.com/i013fe01.htm

Summary:
- large number of types of metadata
- metadata models can be complex
- number of different architectures for storing metadata
- persisting metadata can be costly (query time, updates...)

Some Options
(1) WebDb Metadata Storage (changes to index,queryfilter..)
  - Nutch-59
  - Nutch-139
  ...
  with tools and plugins
  - Ontologies
  - Geospatial
  ...
(2) Internal Metadata Store - Create a MetaDB store that provides local storage of denormalized metadata in Lucene. This could use an optimized subset of a Metadata API.

(3) Metadata API - Formal API from Nutch into other external Metadata Repositories (lucene, mysql, DB2, Jena (OWL), GIS ...)
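
To make option (3) more concrete, here is a minimal Java sketch of what a formal metadata API with interchangeable backends could look like. Everything in it is an assumption made for illustration: `MetadataRepository`, `InMemoryMetadataRepository`, and `MetadataApiDemo` are hypothetical names and are not part of Nutch. The in-memory class only stands in for a local store; an external repository would implement the same interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical API: these names are illustrative and do not exist in Nutch. */
interface MetadataRepository {
    void store(String url, String field, String value);

    Optional<String> lookup(String url, String field);
}

/** Stand-in for a local store; an external repository would implement the same interface. */
final class InMemoryMetadataRepository implements MetadataRepository {
    // url -> (field -> value)
    private final Map<String, Map<String, String>> byUrl = new HashMap<>();

    @Override
    public void store(String url, String field, String value) {
        byUrl.computeIfAbsent(url, u -> new HashMap<>()).put(field, value);
    }

    @Override
    public Optional<String> lookup(String url, String field) {
        Map<String, String> fields = byUrl.get(url);
        return fields == null ? Optional.empty() : Optional.ofNullable(fields.get(field));
    }
}

final class MetadataApiDemo {
    /** Caller code depends only on the interface, not on where the data lives. */
    static String describe(MetadataRepository repo, String url) {
        return url + " -> " + repo.lookup(url, "city").orElse("(no city)");
    }

    public static void main(String[] args) {
        MetadataRepository repo = new InMemoryMetadataRepository();
        repo.store("http://example.org/store", "city", "Chicago");
        System.out.println(describe(repo, "http://example.org/store"));
        System.out.println(describe(repo, "http://example.org/other"));
    }
}
```

Because `describe` only sees the interface, swapping the local store for an external one would not change the calling code, although each lookup would then pay the cost of a call to that store.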

Issues to consider:
- persisting metadata in WebDb/Index offers faster queries
- as metadata becomes larger and more complex and the number of pages increases (50 million - 6 billion), updates and searches will suffer
- use of external stores will impact any processes that require a call to that store
- external metadata stores can persist more complex forms of metadata
- Lucene, which is optimized for unstructured data, may not be the best persistence mechanism for complex metadata


Feedback:
Please tell me if I'm close in articulating some of the issues that may need to be considered in defining a metadata architecture for Nutch. Suggesting solutions (Metadata API and MetaDB) at this stage is only meant to enhance the discussion. A few more iterations on the requirements for a broader metadata architecture are necessary before we start laying down concrete solutions.

Thanks,
James



[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364127 ]

Doug Cutting commented on NUTCH-59:
-----------------------------------

This patch is to the 0.7 release and will not work in the current trunk.

Please see:

http://www.mail-archive.com/nutch-dev@.../msg02140.html

and

http://issues.apache.org/jira/browse/NUTCH-61

So extensible metadata should be added to CrawlDatum when a fix for NUTCH-61 is committed to trunk.



[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364136 ]

Stefan Groschupf commented on NUTCH-59:
---------------------------------------

Nutch 0.8 is very different from 0.7 in the way it stores page data and the link graph. Therefore a reimplementation of metadata support for Nutch 0.8 is on my todo list. It will be a simple HashMap-style API to store and retrieve key-value tuples. Data will be stored in a separate file.
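
As a rough illustration of a HashMap-style API backed by its own file, the following Java sketch keeps key-value tuples per page URL and writes them to a file that is independent of the page database. It is not the planned Nutch 0.8 implementation: the class name `PageMetaStore`, its methods, and the on-disk layout are all assumptions chosen for brevity.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-page key/value metadata kept in its own file. */
final class PageMetaStore {
    // url -> (key -> value); only pages that have metadata get an entry
    private final Map<String, Map<String, String>> pages = new HashMap<>();

    void put(String url, String key, String value) {
        pages.computeIfAbsent(url, u -> new HashMap<>()).put(key, value);
    }

    /** Returns the stored value, or null if the page or key is unknown. */
    String get(String url, String key) {
        Map<String, String> meta = pages.get(url);
        return meta == null ? null : meta.get(key);
    }

    /** Writes all tuples to a separate file (assumed layout: counts followed by UTF strings). */
    void save(Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(pages.size());
            for (Map.Entry<String, Map<String, String>> page : pages.entrySet()) {
                out.writeUTF(page.getKey());
                out.writeInt(page.getValue().size());
                for (Map.Entry<String, String> kv : page.getValue().entrySet()) {
                    out.writeUTF(kv.getKey());
                    out.writeUTF(kv.getValue());
                }
            }
        }
    }

    /** Reads a file written by save() back into a new store. */
    static PageMetaStore load(Path file) throws IOException {
        PageMetaStore store = new PageMetaStore();
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            int pageCount = in.readInt();
            for (int p = 0; p < pageCount; p++) {
                String url = in.readUTF();
                int tupleCount = in.readInt();
                for (int t = 0; t < tupleCount; t++) {
                    String key = in.readUTF();
                    String value = in.readUTF();
                    store.put(url, key, value);
                }
            }
        }
        return store;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pagemeta", ".bin");
        PageMetaStore store = new PageMetaStore();
        store.put("http://example.org/", "topic", "Arts");
        store.save(file);
        System.out.println(PageMetaStore.load(file).get("http://example.org/", "topic"));
        Files.delete(file);
    }
}
```

The sketch loads the whole file into memory, which is only reasonable for small crawls; a real store would need a layout that can be read and updated incrementally.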

 


[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364165 ]

James Jonas commented on NUTCH-59:
----------------------------------

Stefan,

Spot on.

Use of HashMaps - very fast

Use of separate file instead of extending WebDB - good

Background
Initially this will help limit the size of the MetaDB (the separate file). For example, the association of DMOZ topics with Pages would only be one-to-one on the first fetch. On subsequent fetches, websites outside the DMOZ list would have a blank topic field, needlessly filling up space in WebDB. (Some databases are more efficient at managing this type of dead space. Lucene may be one of these.) The next scenario is adding a new metadata association (simple location: city, state (province), country). Here the MetaDB (a temporary name for the convenience of discussion) would only relate to the Region section of the DMOZ list, but some of the non-DMOZ pages would also have such a Location association.

This leads to the question of potentially splitting the file into multiple files, one for each metadata artifact (topic, location). As the list of metadata artifacts grows, so does the number of files. This dance between denormalized data (single big files) and normalized data (many smaller files with complex relationships) will over time affect the speed of the queries. This type of performance penalty can be exacerbated further when you move to metadata repositories, which persist both the metadata and the model of the metadata (the customer now rolls his eyes back and passes out as you continue speaking of meta-meta models).
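
The trade-off described above can be made tangible with a small count. The Java sketch below is purely illustrative: the class `MetaLayoutCost`, the `Row` type, and the four sample pages are invented for the example. It compares how many blank cells a single wide record per page would carry against how many entries each per-artifact file would hold.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch comparing one wide file with one file per metadata artifact. */
final class MetaLayoutCost {
    /** One page's metadata; null means the page has no value for that field. */
    static final class Row {
        final String url;
        final String topic;
        final String location;

        Row(String url, String topic, String location) {
            this.url = url;
            this.topic = topic;
            this.location = location;
        }
    }

    /** Denormalized: every page carries every field, so missing values become blank cells. */
    static int blankCellsInSingleFile(List<Row> rows) {
        int blanks = 0;
        for (Row r : rows) {
            if (r.topic == null) blanks++;
            if (r.location == null) blanks++;
        }
        return blanks;
    }

    /** Normalized: the topic file holds only pages that actually have a topic. */
    static int entriesInTopicFile(List<Row> rows) {
        int n = 0;
        for (Row r : rows) {
            if (r.topic != null) n++;
        }
        return n;
    }

    /** Normalized: the location file holds only pages that actually have a location. */
    static int entriesInLocationFile(List<Row> rows) {
        int n = 0;
        for (Row r : rows) {
            if (r.location != null) n++;
        }
        return n;
    }

    /** Invented sample: two DMOZ-listed pages and two pages outside the DMOZ list. */
    static List<Row> sample() {
        return Arrays.asList(
            new Row("http://dmoz-listed.example/a", "Regional", "Chicago, IL, US"),
            new Row("http://dmoz-listed.example/b", "Arts", null),
            new Row("http://other.example/c", null, "Berlin, DE"),
            new Row("http://other.example/d", null, null));
    }

    public static void main(String[] args) {
        List<Row> rows = sample();
        System.out.println("blank cells in single file: " + blankCellsInSingleFile(rows));
        System.out.println("entries in topic file: " + entriesInTopicFile(rows));
        System.out.println("entries in location file: " + entriesInLocationFile(rows));
    }
}
```

With this sample the single file carries 4 blank cells out of 8, while the two per-artifact files hold 2 entries each and no blanks. The price is that a query touching both topic and location has to consult two files.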

That being said, for simplicity's sake, I would not get too far ahead of the game. Your decision to use a single separate file gets the job done. Changes to the other components (index, QueryFilter) to handle Extensible Metadata seem like the higher priority. I just wanted to give you a flavor of how metadata stores grow from simple to complex, and to note that some planning is often helpful in avoiding small hiccups in the user's migration from a set of simple metadata stores to more complex structures. Normally applications go through a series of learning experiences as they move up the complexity slope for metadata. (Sometimes these applications (companies) actually survive; several don't.)

Quick HOW TO for building a metadata store:
- Write down a list of metadata that you think you may wish to store
- Map this list to Use Cases that create specific value to the user
- Assign each metadata artifact a standard priority (must have, should have, could have, won't have) (or a, b, c; red, white, blue; whatever) based on your use cases.
- Define the API containing only a link to metadata that seems the most useful (must haves)
- Define a simple metadata model to contain that short list of metadata exposed in your API
- Define and implement the physical model to support that API. The semantics of the model will normally be greater than what is exposed
- Keep the API stable, grow the underlying physical model. Do Not Expose the physical model.
- Carefully expand the scope of the API based on what creates real value to the user
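
The first steps of this list can be sketched in a few lines of Java. The example is an illustration only: `MetadataPlan` and its methods are hypothetical names, and the rule that the physical model carries everything except the won't-haves is an assumption made here to show a model that is wider than the API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: rate candidate metadata, expose only must-haves through the API. */
final class MetadataPlan {
    enum Priority { MUST_HAVE, SHOULD_HAVE, COULD_HAVE, WONT_HAVE }

    // Candidate artifact -> priority, kept in the order the candidates were written down.
    private final Map<String, Priority> candidates = new LinkedHashMap<>();

    void rate(String artifact, Priority priority) {
        candidates.put(artifact, priority);
    }

    /** Fields exposed through the first version of the API: must-haves only. */
    List<String> apiFields() {
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Priority> e : candidates.entrySet()) {
            if (e.getValue() == Priority.MUST_HAVE) {
                fields.add(e.getKey());
            }
        }
        return fields;
    }

    /** Assumption: the physical model carries everything except the won't-haves. */
    List<String> physicalFields() {
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Priority> e : candidates.entrySet()) {
            if (e.getValue() != Priority.WONT_HAVE) {
                fields.add(e.getKey());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        MetadataPlan plan = new MetadataPlan();
        plan.rate("topic", Priority.MUST_HAVE);
        plan.rate("location", Priority.SHOULD_HAVE);
        plan.rate("whois", Priority.COULD_HAVE);
        plan.rate("personal profile", Priority.WONT_HAVE);
        System.out.println("API fields: " + plan.apiFields());
        System.out.println("Physical fields: " + plan.physicalFields());
    }
}
```

Promoting an artifact later (for example rating "location" as a must-have) widens the API without touching the physical field list, which is the point of keeping the two separate.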

What happens is that the underlying model will change radically over time and will often become the limiting factor in persisting more complex metadata artifacts. (Think of a person inside a hierarchical organization with matrixed relationships and associations to both titles and roles. Yuk. It can get fun very quickly.) Most applications bind their software tightly to the physical metamodel (it's easy: just expose it). The result is unsatisfied customers as the metamodel has to change over time. Competition usually swoops in, since they can greenfield their metamodels while you are stuck supporting the semantics of your previous application.

2-cents worth of comments

PS
I'm very interested in testing out your DMOZ Topic Metadata Extension on 0.8. I have a couple of websites that might find a use for it.

Thanks,
James


[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365008 ]

James Jonas commented on NUTCH-59:
----------------------------------

I deployed this patch into a Nutch 0.7.1 sandbox and performed a test run. The 'topic' metadata has been captured. Congrats!

How do I display this information inside the 'more' section of my query result page?
How do I use this metadata to filter a standard query?

Thanks,
James


[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365009 ]

Stefan Groschupf commented on NUTCH-59:
---------------------------------------

Please let's move this discussion to the user mailing list, since this is not a 'real' issue comment.
Also please note that metadata support for Nutch 0.8 is under development and will hopefully land in the sources soon. So it may be a better idea to wait for Nutch 0.8 metadata support.


[jira] Commented: (NUTCH-59) meta data support in webdb

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365012 ]

James Jonas commented on NUTCH-59:
----------------------------------

Thanks,

I have been tracking Nutch-139 and Nutch-192 and look forward to these patches being committed into the .8 trunk.

James

