Fwd: 11 Forwarded Messages

g.marras




Content object is not properly initialized in map method of ParseSegment
------------------------------------------------------------------------

                 Key: NUTCH-405
                 URL: http://issues.apache.org/jira/browse/NUTCH-405
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Sami Siren
         Assigned To: Sami Siren


When unparsed segments are parsed, the Content object is not properly initialized in the map method. (This was a result of recent modifications to the Content object in NUTCH-395.)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

scott green wrote:
> On 11/21/06, scott green <[hidden email]> wrote:
>> Hi
>>
>> Is nutch-gui dead? why i cannot find any source in svn repo?
>>
> I mean nutch-admin GUI.
>
Hi,

I am working on the patch to make it work with the current trunk. I will
upload the patch to JIRA when I'm done.

Does Nutch support HTTPS and sessions?


    [ http://issues.apache.org/jira/browse/NUTCH-251?page=comments#action_12451527 ]
           
Sami Siren commented on NUTCH-251:
----------------------------------

>I am a strong supporter of XML. Can we not re-think about this like SOLR-58 or plain/jsp like the way hadoop does it?

I would say neither of those. We should concentrate on building a good Java admin API. Everything after that is an implementation detail, as the API can then easily be exposed through XML or something else remotely usable. Done this way, the admin functionality can easily be integrated into various places and technologies.

Some kind of extension mechanism needs to be used because Nutch is extensible in general (you could plug additions into the admin GUI the same way you plug functionality into Nutch). IMO that is not the first priority. I would propose putting in the basic functionality first, for configuring, scheduling, and generally managing crawls, then adding more functionality on top of that.


> Administration GUI
> ------------------
>
>                 Key: NUTCH-251
>                 URL: http://issues.apache.org/jira/browse/NUTCH-251
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: hadoop_nutch_gui_v1.patch, nutch_gui_plugins_v1.zip, nutch_gui_v1.patch
>
>
> Having a web based administration interface would help to make nutch administration and management much more user friendly.

       

Chris, Rida,

Here are the changes that I have made to XMLParseConfig.java in the
populateConfig(Document doc) method:


if (elemNode.getAttribute("nodeXpath") != null) {
    String nodeXpath = elemNode.getAttributeValue("namespace");
    xip.setNodeXpath(nodeXpath);
}
List fieldList = XPath.selectNodes(elemNode, "field");

if (fieldList != null) { // modified 20062011 by Armel
    for (int j = 0; j < fieldList.size(); j++) {
        Element elem = (Element) fieldList.get(j);
        XMLField xf = populateXMLField(elem);
        fieldsColl.add(xf);
    }
}

/*
 * modified by Armel
 * 20062011
 * if fieldList is empty because it doesn't contain
 * an element "field"
 */
// selectNodes is expected to return an empty list rather than null
// when no "field" child exists, so test for emptiness as well
if (fieldList == null || fieldList.isEmpty()) {
    XMLField xf = populateXMLField(elemNode);
    fieldsColl.add(xf);
}

And the populateXMLField(Element el) method:

if (elem.getAttribute("name") != null)
    xf.setFieldName(elem.getAttributeValue("name"));

if (elem.getAttribute("name") == null) { // modified by Armel
    List att = elem.getAttributes();
    if (att != null) { // modified by Armel - loop and create field accordingly
        for (int i = 0; i < att.size(); i++) {
            Attribute at = (Attribute) att.get(i);
            xf.setFieldName(elem.getAttributeValue(at.getName()));
        }
    }
}
if (elem.getAttribute("xpath") != null)
    xf.setFieldXPath(elem.getAttributeValue("xpath"));

This is supposed to implement the feature I want; please advise.

Armel
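
The idea behind the change above (use element names as field names when no configured "field" entries apply) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK's built-in DOM classes instead of JDOM, it does not touch parse-xml's XMLField or XMLParseConfig classes, and the names ElementFieldSketch and fieldsFromElements are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch, not part of parse-xml: leaf element names become field names. */
final class ElementFieldSketch {

    /** Maps each leaf element name to the text values found under that name. */
    static Map<String, List<String>> fieldsFromElements(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(doc.getDocumentElement(), fields);
        return fields;
    }

    /** Walks the tree depth-first and records every element that has no element children. */
    private static void collect(Node node, Map<String, List<String>> fields) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, fields);
            }
        }
        if (!hasElementChild) {
            String value = node.getTextContent().trim();
            if (!value.isEmpty()) {
                // Repeated element names accumulate values instead of overwriting them.
                fields.computeIfAbsent(node.getNodeName(), k -> new ArrayList<>()).add(value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<customer><name>Ada</name><address>1 Main St</address></customer>";
        System.out.println(fieldsFromElements(xml));
    }
}
```

Collecting values per name keeps repeated elements (for example several address rows) from overwriting each other, which is one way to avoid the "last attribute wins" behaviour of a single setFieldName call inside a loop.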

-----Original Message-----
From: Chris Mattmann [mailto:[hidden email]]
Sent: 20 November 2006 23:30
To: [hidden email]
Subject: Re: What's the status of Nutch-GUI?

Hi Armel,

On 11/20/06 1:44 PM, "Armel T. Nene" <[hidden email]> wrote:

> Hi Chris,
>
> I am trying to extend parse-xml to enable the creation of Lucene fields
> straight from an XML file. For example, a database table that has been parsed
> as an XML file should be stored in the index with the relevant fields, i.e.
> customer name, address and so on. This file will not have a namespace
> associated with it and should not be stored as "xmlcontent" in the database.
> Currently, parse-xml looks for known fields in the document and stores the
> associated values with the field name. I have added an extra condition: if
> the known fields are not present in the current document, the elements or
> nodes in the document should become the new fields stored in the index with
> their values.

I think that this is fine.
>
> Therefore, when parse-xml receives an XML document with no namespace
> available, it will parse the document and store each element name as a new
> field in the index, with the element's associated value.
>
> Let me know if I am on the right track, because I know I don't have to write
> a separate plugin for this feature but can just extend (or modify)
> parse-xml.

I think that parse-xml will support what you are talking about. In terms of
the "check" that you are doing to see if a field exists or not before adding
another value for it in the index, as I understood Lucene, I believe that
you could just omit this check and add the field regardless. If you add
multiple values for the same field in a Document, e.g:

<snip>
Document doc = new Document();

doc.add(new Field("fieldname", "fieldvalue", ...));
doc.add(new Field("fieldname", "fieldvalue2",...));

</snip>

Both values, "fieldvalue" and "fieldvalue2", will get stored in the
index for the key "fieldname". So, if I understand you correctly (which I
may not ;) ), then I think you can omit the check that you are talking about
above and just add the same field name twice.

HTH,
  Chris
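
To make the multi-value point above concrete without depending on Lucene, here is a small stand-in model. It is an assumption-laden sketch: MultiValueDocSketch is a hypothetical class, not Lucene's Document, and it only mimics the behaviour described in the message (adding a second field with the same name appends a value rather than replacing the first).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index document; NOT Lucene's Document class. */
final class MultiValueDocSketch {
    // Each entry is a {name, value} pair, kept in insertion order.
    private final List<String[]> fields = new ArrayList<>();

    /** Appends a field; an existing field with the same name is kept, not replaced. */
    void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    /** Returns every value added under the given name, in insertion order. */
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (String[] field : fields) {
            if (field[0].equals(name)) {
                values.add(field[1]);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        MultiValueDocSketch doc = new MultiValueDocSketch();
        doc.add("fieldname", "fieldvalue");
        doc.add("fieldname", "fieldvalue2");
        System.out.println(doc.getValues("fieldname"));
    }
}
```

Because the model stores a list of name/value pairs rather than a map keyed by name, no "does this field already exist?" check is needed before adding a value.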

>
> Cheers,
>
> Armel
>
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[hidden email]]
> Sent: 20 November 2006 18:40
> To: [hidden email]
> Subject: Re: What's the status of Nutch-GUI?
>
> Hi Sami and Scott,
>
>  This is on my TO-DO list as one of the items that I will begin working on
> getting into the sources as a committer. Additionally, I plan on integrating
> and testing the parse-xml plugin into the source tree. As soon as I get my
> Apache account and SVN access, I will start working on this.
>
> Thanks!
>
> Cheers,
>   Chris
>
>
>
> On 11/20/06 9:24 AM, "Sami Siren" <[hidden email]> wrote:
>
>> scott green wrote:
>>> Hi
>>>
>>> Is nutch-gui dead? why i cannot find any source in svn repo?
>>
>> Unfortunately the sources for the admin gui never got into svn. It would
>> be great if someone could pick it up and bring it up to date to get it
>> integrated.
>>
>> --
>>   Sami Siren
>>
>
>
>
>
______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.








Rida,

There is something I would like to clarify: when using a namespace and XPath
to store content in the index, can this be seen as multi-field? For example,
if we are storing customer name and customer address, which have been declared
in an XML configuration file, is that multi-field? Please explain; sorry, I am
quite new to the Nutch architecture.

Armel

-----Original Message-----
From: Rida Benjelloun (JIRA) [mailto:[hidden email]]
Sent: 20 November 2006 22:16
To: [hidden email]
Subject: [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.

    [ http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12451452 ]
           
Rida Benjelloun commented on NUTCH-185:
---------------------------------------

Nutch doesn't support multi-valued fields, so I decided to merge the content
in the same field. If you want to search the field, you should index it as
"Text" instead of "keyword".



> XMLParser is configurable xml parser plugin.
> --------------------------------------------
>
>                 Key: NUTCH-185
>                 URL: http://issues.apache.org/jira/browse/NUTCH-185
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, indexer
>    Affects Versions: 0.7.2, 0.8.1, 0.8
>         Environment: OS Independent
>            Reporter: Rida Benjelloun
>         Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip
>
>
> The XML parser is a configurable plugin. It uses XPath and namespaces to do the mapping between the XML elements and Lucene fields.
> Information:
> 1- Copy "xmlparser-conf.xml" to the nutch/conf dir
> 2- To index your custom XML file, you have to modify "xmlparser-conf.xml".
> This parser uses namespaces and XPath to parse XML content.
> The config file does the mapping between the XML nodes (using XPath) and Lucene fields.
> Example : <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
> 3- The xmlIndexerProperties element encapsulates a set of fields associated with a namespace.
> If the namespace is found in the XML document, the fields represented by the namespace will be indexed.
> Example :
> <xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
>   <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4" />
>   <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0" />
> </xmlIndexerProperties>
> 4- It is possible to define a default namespace that will be applied when the parser
> doesn't find any namespace in the document, or when the namespace found in the XML document doesn't match the namespace defined in the xmlIndexerProperties.
> Example :
> <xmlIndexerProperties type="filePerDocument" namespace="default">
>   <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0" />
> </xmlIndexerProperties>
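
The field-to-XPath mapping described in the issue can be illustrated with the JDK's own XPath support. This is a sketch under assumptions, not the plugin's code: the class XPathFieldMappingSketch and its extract method are hypothetical, the "dc" prefix is bound by hand rather than read from xmlparser-conf.xml, and the type and boost attributes are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Hypothetical sketch: evaluate one XPath per configured field name. */
final class XPathFieldMappingSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns field name -> string value of the first node matched by that field's XPath. */
    static Map<String, String> extract(String xml, Map<String, String> fieldToXPath)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so prefixed steps like dc:title can match
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return DC_NS.equals(uri) ? "dc" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldToXPath.entrySet()) {
            // evaluate(...) yields an empty string when the expression matches nothing
            fields.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("dctitle", "//dc:title");
        mapping.put("dccreator", "//dc:creator");
        String xml = "<record xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>Example title</dc:title>"
                + "<dc:creator>Example author</dc:creator></record>";
        System.out.println(extract(xml, mapping));
    }
}
```

The mapping passed to extract plays the role of the field entries in the configuration file: the key corresponds to the name attribute and the value to the xpath attribute.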


       







    [ http://issues.apache.org/jira/browse/NUTCH-251?page=comments#action_12451419 ]
           
nutch.newbie commented on NUTCH-251:
------------------------------------

Some random thoughts...

I am a strong supporter of XML. Can we not re-think about this like SOLR-58 or plain/jsp like the way hadoop does it?

http://issues.apache.org/jira/browse/SOLR-58
 
Do we really need to use the Nutch plugin architecture? The patch is currently outdated, so I think it would be a good idea to give it another round of discussion.




       
