developing a parse-/index-/query- plugin set

chrismattmann
Hi Folks,

I was wondering if anybody could give me some advice on what I'm doing
wrong in the following situation.

I am trying to fetch and search some bioinformatics data with specific data
elements that I want to parse out, index, and search on. For instance, for
each page of data I fetch, I would like to store things like PROTOCOL_ID
and CONTACT_EMAIL.

To go about this, I wrote a parse-specimen plugin to extract the specific
metadata elements I wanted to index. I have tested and verified that this
part of the process is working: I instrumented the code with LOG.log calls
to verify that, after the page content is fetched, the metadata is added to
the Properties object that is sent back with the ParseImpl.

Next, I wrote an index-specimen plugin that takes the reconstructed parse
data (as all indexing plugins do), gets the specific properties that I
captured during the parse, adds them to the Lucene document, and returns
the document. I have verified that this portion of the process is working
as well: again using LOG.log calls, I confirmed that the fields are added
to the Document object, which is then returned.

At that point I thought I could deploy and start up the Nutch web app and
run queries like "PROTOCOL_ID:36.0" and "CONTACT_EMAIL:[hidden email]", and
that, since the metadata was stored in the index, the hits would come back.
However, I found out that this wasn't the case. After some snooping around,
it seems that for such a query to work, you also need to write a query-xxx
plugin that declares its support for the specific indexed fields you want
to search on.

I've been trying to do this for the last day and a half, and for the life
of me I can't get it working. Could someone give me some help or
suggestions on how to do this? I modeled my current (non-working)
query-specimen plugin on the query-more plugin. To test whether I could at
least get the PROTOCOL_ID and CONTACT_EMAIL queries working, I wrote two
classes, ProtocolIDQueryFilter and ContactEmailQueryFilter, which simply
extend RawFieldQueryFilter and pass "PROTOCOL_ID" and "CONTACT_EMAIL",
respectively, to its constructor; again, this is what I saw in the
query-more plugin. Additionally, in the plugin.xml file for the
query-specimen plugin, I've declared that my plugin supports those two raw
fields, in the following fashion:

   <extension id="gov.nasa.jpl.edrn.nutch.searcher.specimen"
              name="Specimen Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="ProtocolIDQueryFilter"
                      class="gov.nasa.jpl.edrn.nutch.searcher.specimen.ProtocolIDQueryFilter"
                      raw-fields="PROTOCOL_ID"/>
   </extension>

   <extension id="gov.nasa.jpl.edrn.nutch.searcher.specimen"
              name="Specimen Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="ContactEmailQueryFilter"
                      class="gov.nasa.jpl.edrn.nutch.searcher.specimen.ContactEmailQueryFilter"
                      raw-fields="CONTACT_EMAIL"/>
   </extension>

However, after rebuilding the Nutch webapp with the query-specimen plugin
enabled (I have verified via the log files that it is actually enabled) and
trying queries such as "PROTOCOL_ID:36.0", the queries still don't work.
I've verified that the fields were indexed correctly and that 36.0 is a
valid value for PROTOCOL_ID: when I run a regular query that I know returns
hits (I've only indexed 3 documents so far) and click on the "explain"
link, it shows that all the fields I want to query on (such as PROTOCOL_ID
and CONTACT_EMAIL) were indexed, along with the value of each field, such
as PROTOCOL_ID = 36.0. So now I'm stuck. I can't get the queries to work,
and if anyone can help me with this, I would be really appreciative.

One more thing: a lot of my fields hold numeric-like values, such as 36.0,
2.0, etc. However, I indexed them as Field.Text() values in the Lucene
document. I've never done this before, so if that was the wrong thing to
do, might that be the problem? Here is the snippet of code in my
index-specimen plugin where I index the fields:

 

    public Document filter(Document doc, Parse parse, FetcherOutput fo)
            throws IndexingException {

        // Get the metadata captured by the parse-specimen plugin.
        Properties metadata = parse.getData().getMetadata();

        // Add each captured element to the Lucene document.
        for (int i = 0; i < edrnCDES.length; i++) {
            String key = edrnCDES[i];
            String val = (String) metadata.get(key);
            if (val != null) {
                LOG.log(Level.INFO,
                        "SpecimenIndexer:adding [" + key + "=>" + val + "]");
                doc.add(Field.Text(key, val));
            }
        }

        return doc;
    }

"edrnCDES" is an array of the field names I want to index, such as
"PROTOCOL_ID" and "CONTACT_EMAIL". So, does the fact that some of these
fields hold numerical values make a difference, even though I'm indexing
them as text? One thing I do know is that even the non-numerical fields,
e.g., CONTACT_EMAIL, aren't working, so I suspect that the numerical
values aren't what's causing my problem.
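
On the numeric question, a small sketch may help. It assumes that a value
indexed as text ends up as a plain string term (it ignores whatever the
analyzer does to the value first), so matching and ordering follow string
rules rather than numeric ones. The class and method names are made up for
illustration and are not part of Lucene or Nutch.

```java
/** Shows that text terms are compared as strings, not as numbers. */
final class TextTermSketch {

    /** Exact term match: both sides must be the same sequence of characters. */
    static boolean termEquals(String indexed, String queried) {
        return indexed.equals(queried);
    }

    /** True if a sorts before b in plain string (lexicographic) order. */
    static boolean sortsBefore(String a, String b) {
        return a.compareTo(b) < 0;
    }

    public static void main(String[] args) {
        System.out.println(termEquals("36.0", "36.0")); // true: same characters
        System.out.println(termEquals("36.0", "36"));   // false: numerically equal, different term
        System.out.println(sortsBefore("10.0", "2.0")); // true: '1' sorts before '2'
    }
}
```

Under that assumption, an exact query for the value as it was indexed is
unaffected by the value being "numeric"; only numeric comparisons and
ranges would behave unexpectedly.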

 

If anyone can provide any help on this, again, I would appreciate it. Thanks
a lot!

Cheers,
  Chris

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 

RE: developing a parse-/index-/query- plugin set

chrismattmann
Hi Folks,

I've done some tracing on the problem I posted earlier about developing a
parse-/index-/query- plugin set. As far as I can tell, the NutchAnalysis
class lowercases all field names by default, e.g., PROTOCOL_ID becomes
protocol_id. Then the Query.parse method calls Query.fixup. If fixup can't
match the field in a clause to one of the fields provided by the filters
registered with the QueryFilters class, it turns the clause into a
default-field clause, so protocol_id becomes "protocol id", two separate
tokens. To top it all off, the query PROTOCOL_ID:36.0 ends up as
"protocol id 36 0".

So it seems that fields that are indexed and used in a field query must be
fully lowercase to work? Additionally, it seems that they can't contain
symbols such as "_"; is that correct? Would you consider this a bug? Maybe
it was intended to work this way, but I haven't found anything in the Nutch
documentation that states that fields should be lowercase.
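
The rewriting described above can be sketched as a toy model. This is not
Nutch's code: the class, the method, and the split-on-non-alphanumerics
rule are assumptions chosen only to reproduce the observed result, and the
sketch says nothing about how Nutch itself treats underscores in a
registered field name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of the observed query rewriting; NOT Nutch's actual code. */
final class FieldFixupSketch {

    /**
     * Rewrites one "field:value" clause. registeredFields stands in for the
     * field names declared by the enabled query-filter plugins.
     */
    static String rewrite(String clause, Set<String> registeredFields) {
        int colon = clause.indexOf(':');
        if (colon < 0) {
            return String.join(" ", tokenize(clause));
        }
        // Step 1: the field name is lowercased.
        String field = clause.substring(0, colon).toLowerCase(Locale.ROOT);
        String value = clause.substring(colon + 1);
        // Step 2: a field that some filter declares stays a field clause.
        if (registeredFields.contains(field)) {
            return field + ":" + value;
        }
        // Step 3: otherwise everything falls back to default-field terms.
        return String.join(" ", tokenize(field + " " + value));
    }

    /** Assumed tokenizer: lowercase, split on anything but letters and digits. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Registered in upper case, as in the original plugin.xml.
        System.out.println(rewrite("PROTOCOL_ID:36.0", Set.of("PROTOCOL_ID")));
        // Registered in lower case without the underscore.
        System.out.println(rewrite("protocolid:36.0", Set.of("protocolid")));
    }
}
```

With an upper-case registration the first call prints "protocol id 36 0",
because the lowercased field no longer matches any declared field; the
second call keeps the clause as protocolid:36.0.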

 

Okay, so that's one thing. I then made all my field names lowercase and
removed the "_" character, so PROTOCOL_ID became protocolid. However, I'm
still stuck. The query now gets formulated correctly: for instance,
"protocolid:36.0" is passed to the filter as protocolid:36.0. The filter I
wrote correctly recognizes that it can handle that term and adds a
TermQuery clause to the BooleanQuery output by the QueryFilters class.
However, my query for protocolid:36.0 still returns nothing. I've traced
the call all the way down to the LuceneQueryOptimizer.optimize method, and
I added two System.out.println calls at the end of that method to see
what's going on. Here is the small snippet of code:

 

 

    if (sortField == null && !reverse) {
      System.out.println("Performing Lucene Query: "+query);
      System.out.println("using filter "+filter+" and numHits = "+numHits);
      return searcher.search(query, filter, numHits);
    } else {
      return searcher.search(query, filter, numHits,
                             new Sort(sortField, reverse));
    }

 

 

Okay, and here is what those System.out.println calls print for me:

    Performing Lucene Query:
    using filter QueryFilter(+contactemail:[hidden email]^0.0) and numHits = 20
    051016 190347 11 total hits: 0

However, even though I can look at my results and see that there is a
result with:

    contactemail = [hidden email]

I still get no hits. Does anybody have any clue as to what I'm doing wrong?

Thanks in advance.

Cheers,
  Chris

 

Re: developing a parse-/index-/query- plugin set

Andrzej Białecki-2
Chris Mattmann wrote:

> I still get no hits. Does anybody have any clue as to what I'm doing wrong?

I have a clue (which is not the same as a solution ;-) ). Please use
Luke and check what the terms look like in your index. The best way to do
it is to open the index, then go to one of the documents and press
"Reconstruct & Edit". The dialog that pops up shows the content of all
fields, and also how they were tokenized (which is more
important). It's possible that NutchAnalyzer swallowed some of the text
you are looking for... you should see that in the tokenized field
content. If your query plugin returns the clause as you wrote it, i.e.
with at sign, dots and whatever, then a corresponding token needs to
show up in the tokenized content - and I bet it doesn't, because it was
broken into parts by the tokenizer...
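
To make the concern concrete, here is a minimal sketch of why a raw query
term can miss a tokenized field. The tokenizer below is a hypothetical
stand-in that splits on every non-alphanumeric character; it is not
NutchAnalyzer, whose real rules should be checked in Luke as described
above. The address is a made-up example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy illustration of term matching; the splitting rule is an assumption. */
final class TokenMatchSketch {

    /** Hypothetical tokenizer: lowercases and splits on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** A raw term query matches only if an identical term was indexed. */
    static boolean rawTermMatches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        String value = "someone@example.org"; // made-up address
        List<String> tokenized = tokenize(value);  // what a splitting analyzer would index
        List<String> untokenized = List.of(value); // the value kept as a single term

        System.out.println(tokenized);                            // [someone, example, org]
        System.out.println(rawTermMatches(tokenized, value));     // false
        System.out.println(rawTermMatches(untokenized, value));   // true
    }
}
```

If the index holds the split tokens, the unsplit query term has nothing to
match; if the index holds the whole value as one term, it matches.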


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: developing a parse-/index-/query- plugin set

chrismattmann
Hi Andrzej,


On 10/17/05 10:59 AM, "Andrzej Bialecki" <[hidden email]> wrote:

> Chris Mattmann wrote:
>
>> I still get no hits. Does anybody have any clue as to what I'm doing wrong?
>
> I have a clue (which is not the same as a solution ;-) ). Please use
> Luke and check what the terms look like in your index. The best way to do
> it is to open the index, then go to one of the documents and press
> "Reconstruct & Edit". The dialog that pops up shows the content of all
> fields, and also how they were tokenized (which is more
> important). It's possible that NutchAnalyzer swallowed some of the text
> you are looking for... you should see that in the tokenized field
> content. If your query plugin returns the clause as you wrote it, i.e.
> with at sign, dots and whatever, then a corresponding token needs to
> show up in the tokenized content - and I bet it doesn't, because it was
> broken into parts by the tokenizer...
>

I downloaded Luke from the getopt site at the peak of my frustration and
browsed my small index of 3 documents (which I can send to you in a
separate email if you want to look at it; it's really small). I looked up
the "contactemail" field for one of the documents in the index. I also
verified, as I mentioned, that my query was being captured by the filter
correctly. For instance, a query for "contactemail:[hidden email]"
correctly shows up as "contactemail:[hidden email]". When I used Luke to
look up the doc in the index and its corresponding contactemail field, here
is what appeared under the "tokenized" tab:

"[hidden email]"

This is exactly how it was stored, and exactly how I queried on it. So I'm
not really sure what the problem is here. Thanks for the suggestion,
however. Any other ideas? :-)


Take care,
  Chris


Re: developing a parse-/index-/query- plugin set

Doug Cutting-2
In reply to this post by chrismattmann
Chris Mattmann wrote:
>  So, one thing it seems is that fields to be indexed, and used in a field
> query must be fully lowercase to work? Additionally, it seems that they
> can't have symbols in them, such as "_", is that correct? Would you guys
> consider this to be a bug?

Yes, this sounds like a bug.

> Performing Lucene Query:
>
> using filter QueryFilter(+contactemail:[hidden email]^0.0) and
> numHits = 20
>
> 051016 190347 11 total hits: 0

A query whose only clause has a boost of 0.0 will return no results.
Nutch uses the convention that clauses whose boost is 0.0 may be
converted to filters, for efficiency.  A filter affects the set of hits,
but not their ranking.  So a boost of 0.0 is used to declare that a
clause does not affect ranking and may not be used in isolation.  This
makes it akin to searching for "filetype:pdf" on Google--filetype is
only used to filter other queries and may not be a standalone query.
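
The convention can be illustrated with a toy model. This is not Nutch or
Lucene code: the Clause type, the search method, and the three-document
index are invented for the example, and only the visible behavior (a
zero-boost clause restricts hits but cannot produce them alone) is taken
from the explanation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of the boost-0.0 convention; NOT Nutch or Lucene code. */
final class ZeroBoostSketch {

    /** A required "field:term" clause with a boost (hypothetical type). */
    static final class Clause {
        final String field;
        final String term;
        final float boost;

        Clause(String field, String term, float boost) {
            this.field = field;
            this.term = term;
            this.boost = boost;
        }
    }

    /** Returns the positions of the matching documents, in index order. */
    static List<Integer> search(List<Map<String, List<String>>> docs, List<Clause> clauses) {
        List<Clause> ranking = new ArrayList<>();
        List<Clause> filters = new ArrayList<>();
        for (Clause c : clauses) {
            if (c.boost == 0.0f) {
                filters.add(c);   // restricts the hit set, never ranks it
            } else {
                ranking.add(c);
            }
        }
        List<Integer> hits = new ArrayList<>();
        if (ranking.isEmpty()) {
            return hits;          // only filters left: nothing to filter, so no hits
        }
        for (int i = 0; i < docs.size(); i++) {
            if (matchesAll(docs.get(i), ranking) && matchesAll(docs.get(i), filters)) {
                hits.add(i);
            }
        }
        return hits;
    }

    private static boolean matchesAll(Map<String, List<String>> doc, List<Clause> clauses) {
        for (Clause c : clauses) {
            List<String> terms = doc.get(c.field);
            if (terms == null || !terms.contains(c.term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up index: field name -> indexed terms.
        List<Map<String, List<String>>> docs = List.of(
            Map.of("contactemail", List.of("a@example.org"), "content", List.of("specimen")),
            Map.of("contactemail", List.of("b@example.org"), "content", List.of("specimen")),
            Map.of("contactemail", List.of("b@example.org"), "content", List.of("protocol")));

        Clause emailAsFilter = new Clause("contactemail", "a@example.org", 0.0f);
        Clause emailRanked = new Clause("contactemail", "a@example.org", 1.0f);
        Clause word = new Clause("content", "specimen", 1.0f);

        System.out.println(search(docs, List.of(emailAsFilter)));       // []
        System.out.println(search(docs, List.of(emailAsFilter, word))); // [0]
        System.out.println(search(docs, List.of(emailRanked)));         // [0]
    }
}
```

In this model the zero-boost clause alone yields nothing, the same clause
combined with an ordinary term narrows that term's hits, and giving the
clause a non-zero boost lets it stand on its own.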

Doug
Re: developing a parse-/index-/query- plugin set

chrismattmann
Hi Doug,


On 10/17/05 11:38 AM, "Doug Cutting" <[hidden email]> wrote:

> Chris Mattmann wrote:
>>  So, one thing it seems is that fields to be indexed, and used in a field
>> query must be fully lowercase to work? Additionally, it seems that they
>> can't have symbols in them, such as "_", is that correct? Would you guys
>> consider this to be a bug?
>
> Yes, this sounds like a bug.

Okay, I will look into why this is happening, and if I can figure it out, I
will try to submit a patch.


>
>> Performing Lucene Query:
>>
>> using filter QueryFilter(+contactemail:[hidden email]^0.0) and
>> numHits = 20
>>
>> 051016 190347 11 total hits: 0
>
> A query whose only clause has a boost of 0.0 will return no results.
> Nutch uses the convention that clauses whose boost is 0.0 may be
> converted to filters, for efficiency.  A filter affects the set of hits,
> but not their ranking.  So a boost of 0.0 is used to declare that a
> clause does not affect ranking and may not be used in isolation.  This
> makes it akin to searching for "filetype:pdf" on Google--filetype is
> only used to filter other queries and may not be a standalone query.

Okay, this makes sense. In fact, when I do a query now for:

"contactemail:[hidden email] specimen"

The query actually works. Of the 3 documents I indexed only one of them has
the contactemail [hidden email], and so I only got one result
back. So your answer there makes total sense. So, my question to you then
is, what type of QueryFilter should I develop in order to get my query for
contactemail:<email address> to work as a standalone query? For instance,
right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be
the right way to do it now. Is there a class in Nutch that I can sub-class
to get most of the functionality for doing a type:<value> query as a
standalone query?

Thanks for the help.

Cheers,
  Chris

Re: developing a parse-/index-/query- plugin set

Doug Cutting-2
Chris Mattmann wrote:
> So, my question to you then
> is, what type of QueryFilter should I develop in order to get my query for
> contactemail:<email address> to work as a standalone query? For instance,
> right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be
> the right way to do it now. Is there a class in Nutch that I can sub-class
> to get most of the functionality for doing a type:<value> query as a
> standalone query?

You can simply pass a non-zero boost to the RawFieldQueryFilter
constructor, e.g.:

public class MyQueryFilter extends RawFieldQueryFilter {
   public MyQueryFilter() {
     super("myfield", 1.0f);
   }
}

Or you can implement QueryFilter directly.  There's not that much to it.

Doug
Re: developing a parse-/index-/query- plugin set

chrismattmann
Hi Doug,

 Thanks, that worked.

Cheers,
  Chris


