[jira] Created: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
multiple field values in HitDetails
-----------------------------------

         Key: NUTCH-204
         URL: http://issues.apache.org/jira/browse/NUTCH-204
     Project: Nutch
        Type: Improvement
  Components: searcher  
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
     Fix For: 0.8-dev


Improvement as Howie Wang suggested.
http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/%3c43D7D45A.2070609@...%3e


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-204?page=all ]

Stefan Groschupf updated NUTCH-204:
-----------------------------------

    Attachment: DetailGetValues070206.patch

Patch that adds getValues to HitDetails.

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12366472 ]

Stefan Groschupf commented on NUTCH-204:
----------------------------------------

Any improvement suggestions or negative comments? If not, it would be great if someone with write access to svn could commit this, since I have a metadata-related patch I want to contribute that depends on this one. Also, this was a user request.
Thanks!

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367513 ]

Jerome Charron commented on NUTCH-204:
--------------------------------------

Hi Stefan,

There is something I don't understand about this patch. Lucene manages multi-valued fields by having many single-valued Field objects with the same name. My question is: why not keep this logic?
It would avoid patching the HitSearcher and modifying the HitDetails constructor signature.
The idea I have in mind is to add a generic name/value(s) container (like Metadata but without the syntax-tolerant feature; in fact, the current Metadata would internally use this generic container) that HitDetails would use to store multi-valued fields.
What do you think about this?
I imagine you are very busy with the admin GUI (it is really a big challenge, and a big new feature), so if you are OK with my proposed solution, I will code it.

Regards

Jérôme


[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367520 ]

Stefan Groschupf commented on NUTCH-204:
----------------------------------------

> There is something I don't understand about this patch. Lucene manages multi-valued fields by having many single-valued Field objects with the same name. My question is: why not keep this logic?

Sure, that would be possible. My idea was that we don't need these many identical keys; they just eat bytes that we do not really need to transfer over the network.
HitDetails is a Writable, and with multiple search servers distributed across a network it makes sense to minimize network I/O, since getting details should be as fast as possible.
Would you agree? However, I agree there are other ways to achieve that. If you see room for improvement, feel free; in any case I would really love to see the feature in the sources.
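
A rough sketch of the byte-count argument above. It is not Nutch's Writable code: the class name WireSizeSketch and both layouts are assumptions made for illustration, and java.io.DataOutputStream.writeUTF stands in for whatever string encoding HitDetails really uses on the wire.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not Nutch code: measures two wire layouts for one field.
class WireSizeSketch {

  // Layout A: each value is written together with its own copy of the name.
  static int flatSize(String name, String[] values) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(values.length);        // number of name/value pairs
      for (String value : values) {
        out.writeUTF(name);
        out.writeUTF(value);
      }
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Layout B: the name is written once, followed by a count and all values.
  static int groupedSize(String name, String[] values) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(1);                    // number of distinct names
      out.writeUTF(name);
      out.writeInt(values.length);        // number of values for this name
      for (String value : values) {
        out.writeUTF(value);
      }
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With the name "anchor" and the three values "a", "b" and "c", the flat layout takes 37 bytes and the grouped layout 25. With a single value the grouped layout is the larger one (19 bytes against 15), because under these assumed layouts it pays an extra 4-byte count per name; the saving only appears for names that actually repeat.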

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367530 ]

Jerome Charron commented on NUTCH-204:
--------------------------------------

> HitDetails is a Writable, and with multiple search servers distributed across a network it makes
> sense to minimize network I/O, since getting details should be as fast as possible.
Sure, Stefan.
I will take this into account, of course. Using a map-like structure in HitDetails will reduce the bytes used by not duplicating keys.
I will commit something in the next few days.

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367539 ]

Stefan Groschupf commented on NUTCH-204:
----------------------------------------

Wouldn't you end up with something very similar to what we have now, with one key and multiple values per key?
The Lucene Document provides a getValues method, so I do not see any change to the Lucene API concepts you mentioned in your first post.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#getValues(java.lang.String)
Sorry, I still do not understand your improvement suggestion. Can you give some more details?

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367551 ]

Jerome Charron commented on NUTCH-204:
--------------------------------------

You are right, it is very similar.
The only difference is that it doesn't require modifications to HitSearcher and doesn't change the HitDetails constructor signature. Another thing is that it avoids exposing String[][] arrays, which are (in my own opinion) neither very elegant nor easy to manipulate.
One last thing: it will provide a generic MultiProperties container in the util package (which can be useful for many other purposes).
I agree with you, there is no real improvement in my solution, just clearer code (I hope).
Once it is done, I will attach a patch so that you can review it, instead of committing.
Regards
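
For readers following the thread, a minimal sketch of what a generic name/value(s) container of this kind could look like. It is not the MultiProperties class from the patch, whose contents are not shown in this thread; the class name MultiValueContainer and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a generic name/values container (not the patch's class).
class MultiValueContainer {

  // One entry per name; each name maps to its values in insertion order.
  private final Map<String, List<String>> entries = new LinkedHashMap<>();

  // Appends a value under the given name, creating the entry if needed.
  void add(String name, String value) {
    entries.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  // Returns the first value stored under the name, or null if there is none.
  String getValue(String name) {
    List<String> values = entries.get(name);
    return values == null ? null : values.get(0);
  }

  // Returns all values stored under the name; an empty array if there are none.
  String[] getValues(String name) {
    List<String> values = entries.get(name);
    return values == null ? new String[0] : values.toArray(new String[0]);
  }
}
```

Callers work with a flat String[] per name and never see a String[][]. The cost, raised later in the thread, is one more object allocated per set of details.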

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367552 ]

Stefan Groschupf commented on NUTCH-204:
----------------------------------------

Makes sense. I see, thanks for the clarification.

[jira] Updated: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-204?page=all ]

Jerome Charron updated NUTCH-204:
---------------------------------

    Attachment: NUTCH-204.jc.060227.patch

Stefan,

Here is a proposed patch (NUTCH-204.jc.060227.patch).
If you agree, I will commit it.

Jérôme

[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367991 ]

Stefan Groschupf commented on NUTCH-204:
----------------------------------------

Jérôme,
After taking a look at the HitDetails object again, after some time, I noticed that I had completely overlooked that all values are already stored as key:value tuples in the HitDetails object.
The problem is rather that public String getValue(String field) just returns the value of the first field matching the field name. Accessing all values is already possible using getLength, getField and getValue, isn't it?

From my point of view we should keep things as lightweight as possible and just add one method, getValues, to the HitDetails object. It could look like this:
public String[] getValues(String field) {
  ArrayList arrayList = new ArrayList();
  // Collect the value of every field whose name matches.
  for (int i = 0; i < length; i++) {
    if (fields[i].equals(field)) {
      arrayList.add(values[i]);
    }
  }
  if (arrayList.size() > 0) {
    return (String[]) arrayList.toArray(new String[arrayList.size()]);
  }
  return null;
}
So I think introducing a new property object that needs to be instantiated and serialized every time is just overhead we should not add.
HitDetails influences search performance, and with one more object instantiated for each HitDetails we would slow it down by triggering gc twice as often as before.
Would you agree to just adding a getValues method to the HitDetails object?
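
A self-contained sketch of the proposal above, for readers who want to try it outside Nutch. The class FlatDetails is a hypothetical stand-in for HitDetails; only the names fields, values, length, getValue and getValues come from the message, and the constructor is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HitDetails: parallel arrays of names and values.
class FlatDetails {
  private final int length;
  private final String[] fields;
  private final String[] values;

  FlatDetails(String[] fields, String[] values) {
    if (fields.length != values.length) {
      throw new IllegalArgumentException("fields and values must have the same length");
    }
    this.length = fields.length;
    this.fields = fields;
    this.values = values;
  }

  // Returns the value of the first field with this name, or null if there is none.
  String getValue(String field) {
    for (int i = 0; i < length; i++) {
      if (fields[i].equals(field)) {
        return values[i];
      }
    }
    return null;
  }

  // Returns the values of all fields with this name, or null if there is none.
  String[] getValues(String field) {
    List<String> matches = new ArrayList<>();
    for (int i = 0; i < length; i++) {
      if (fields[i].equals(field)) {
        matches.add(values[i]);
      }
    }
    return matches.isEmpty() ? null : matches.toArray(new String[0]);
  }
}
```

No extra object is stored per instance; the only allocation happens inside getValues, and each call scans the whole array once.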



[jira] Closed: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-204?page=all ]
     
Jerome Charron closed NUTCH-204:
--------------------------------

    Resolution: Fixed

Committed: http://svn.apache.org/viewcvs.cgi?rev=381465&view=rev
Thanks Stefan for pointing out the performance issue of my patch.
Perhaps in a later patch we can add a cache of field/values to avoid iterating over the whole list each time the getValues method is called.
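
One possible shape for such a cache, as a sketch only. The thread does not say how it would be built, so the class CachedDetails, the lazy construction and the choice of a HashMap are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups values by field name on the first getValues call.
class CachedDetails {
  private final String[] fields;
  private final String[] values;
  private Map<String, String[]> cache;   // stays null until first needed

  CachedDetails(String[] fields, String[] values) {
    if (fields.length != values.length) {
      throw new IllegalArgumentException("fields and values must have the same length");
    }
    this.fields = fields;
    this.values = values;
  }

  // Returns all values for the field, or null if there is none.
  String[] getValues(String field) {
    if (cache == null) {
      cache = buildCache();              // single pass over the whole list
    }
    return cache.get(field);
  }

  private Map<String, String[]> buildCache() {
    Map<String, List<String>> grouped = new HashMap<>();
    for (int i = 0; i < fields.length; i++) {
      grouped.computeIfAbsent(fields[i], k -> new ArrayList<>()).add(values[i]);
    }
    Map<String, String[]> result = new HashMap<>();
    for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
      result.put(entry.getKey(), entry.getValue().toArray(new String[0]));
    }
    return result;
  }
}
```

The list is scanned once, on the first call. Later calls are map lookups that return the same array instance, so callers should not modify it. The sketch is not thread-safe, and details that are never asked for multiple values pay nothing.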


[jira] Commented: (NUTCH-204) multiple field values in HitDetails

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12368038 ]

Stefan Groschupf commented on NUTCH-204:
----------------------------------------

Yes that is a good idea. Thanks for getting this into the sources.
Cheers,
Stefan
