writing a metadata content tag

Raghavendra Prabhu
Hi guys

Sorry for the follow-up mail.

My requirement, as I mentioned previously, is to stamp documents
with some kind of type.

How do I do it?

For example, add "sports" to a field TYPEFIELD on seeing "football" or
"tennis" in the extracted text.

For example, add "technology" to the same field TYPEFIELD on seeing
"web" or "internet".

Where do I add this?

Rgds

Prabhu

RE: writing a metadata content tag

Howie Wang
You need to write your own indexing filter plugin. Take a look
at index-basic. In BasicIndexingFilter.java there are a whole
bunch of lines that do something like:

doc.add(Field.Text("myfield", myFieldValue));

Just add your own field. You have access to title, anchor,
and page text in this function. Search the text for your
keywords and add whatever field you want.

To search on this field, you'll have to create a query filter plugin also
so that you can search for "myfield:sports".  See query-site for an
example. You'll only have to change a couple of lines of code:

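import org.apache.nutch.searcher.RawFieldQueryFilter;

// Lets queries of the form "myfield:sports" match the indexed field.
public class MyQueryFilter extends RawFieldQueryFilter {
  public MyQueryFilter() {
    super("myfield");
  }
}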

Don't forget to add your new plugins to nutch-site.xml.

By the way, I would recommend writing some extra code to
allow yourself to read in keywords from a file and map them
to your category. It's kind of a pain to edit the code every
time you think of a new keyword.
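
A minimal sketch of that idea, in plain Java with no Nutch dependency.
The class name KeywordCategorizer, the "category=keyword1,keyword2" line
format, and the whole-word matching rule are assumptions made for
illustration; none of them come from Nutch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Nutch class): maps keywords to categories.
class KeywordCategorizer {
  private final Map<String, String> keywordToCategory = new HashMap<>();

  // Reads lines of the assumed form "category=keyword1,keyword2".
  // Blank lines, lines starting with '#', and lines without '=' are skipped.
  static KeywordCategorizer load(Reader source) {
    KeywordCategorizer c = new KeywordCategorizer();
    try (BufferedReader in = new BufferedReader(source)) {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        int eq = line.indexOf('=');
        if (line.isEmpty() || line.startsWith("#") || eq < 0) continue;
        String category = line.substring(0, eq).trim();
        for (String kw : line.substring(eq + 1).split(",")) {
          kw = kw.trim().toLowerCase(Locale.ROOT);
          if (!kw.isEmpty()) c.keywordToCategory.put(kw, category);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return c;
  }

  // Returns the category of the first word in the text that is a known
  // keyword (case-insensitive, whole words only), or null if none matches.
  String categorize(String text) {
    String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    for (String token : tokens) {
      String category = keywordToCategory.get(token);
      if (category != null) return category;
    }
    return null;
  }

  public static void main(String[] args) {
    String config = "sports=football,tennis\ntechnology=web,internet\n";
    KeywordCategorizer c = load(new StringReader(config));
    System.out.println(c.categorize("Tennis results from the weekend")); // prints "sports"
  }
}
```

An indexing filter could call categorize on the page text and pass the
result to doc.add as in the line above. In practice the Reader would be
a FileReader on the keyword file, so adding a keyword only means editing
that file.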

Howie


Re: writing a metadata content tag

Raghavendra Prabhu
Hi Howie

What you have described is the indexing step.

I am talking about content.

I thought there are three steps:

parse-filter
index-filter
query-filter

I think you are referring to the second step, index-filter. I want to
know more about the first step, parse-filter.

What I want to do is add some header info in parse-filter, which
index-filter will then use to set the value of my new FIELD.

Rgds
Prabhu



Re: writing a metadata content tag

Howie Wang
>What i want to do is i should add some header info in parse-filter which
>will be used by index-filter to add my own nature of the new FIELD
>
>Rgds
>Prabhu

I would recommend doing it at the index phase if possible. If the end
goal is to have it searchable from the index, ask if you really need to have
the information at the parsing stage. If you decide you want to
tweak your keywords, it's easy to re-index. If you do it at the parsing
stage, it will take twice as long since you have to re-parse and then
re-index. Plus re-parsing is not complicated, but involves kind of a
hack with renaming a bunch of directories.

One reason to do your analysis at parse time is that it's easier to
get the entire page contents like HTML tags in case you need that
for categorization. If you don't need this stuff, you probably don't
need to categorize at the parsing phase.

If you really want to do it at parse time, it's not difficult. Take a
look at parse-html. You can use the metadata object to store
your category. Look in HtmlParseFilter.java in getParse. Just do:

metadata.put("myfield", "sports");

In your index filter, you can then do a metadata.get to get your
category and then index it.
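
The handoff between the two steps might look roughly like this. It is
only a sketch: java.util.Properties stands in for the Nutch metadata
object (the real type depends on the Nutch version), and
CategoryHandoff, parseStep and indexStep are hypothetical names, not
Nutch APIs.

```java
import java.util.Locale;
import java.util.Properties;

// Stand-ins for the parse and index steps; real Nutch filters have
// different signatures. Only the put/get handoff is the point here.
class CategoryHandoff {
  static final String FIELD = "myfield";

  // Parse step: pick a category from the page text and store it.
  static void parseStep(String pageText, Properties metadata) {
    String lower = pageText.toLowerCase(Locale.ROOT);
    if (lower.contains("football") || lower.contains("tennis")) {
      metadata.put(FIELD, "sports");
    } else if (lower.contains("web") || lower.contains("internet")) {
      metadata.put(FIELD, "technology");
    }
  }

  // Index step: read the stored category back. A null result means the
  // parse step did not stamp this document, so no field should be added.
  static String indexStep(Properties metadata) {
    return metadata.getProperty(FIELD);
  }

  public static void main(String[] args) {
    Properties metadata = new Properties();
    parseStep("Tennis open starts Monday", metadata);
    System.out.println(indexStep(metadata)); // prints "sports"
  }
}
```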

Howie



Re: writing a metadata content tag

Raghavendra Prabhu
Hi Howie

That is what I am looking at.

But as you said, to generalize for all requirements, including the
intranet requirement, I am better off doing what you said.

Rgds
Prabhu



RE: writing a metadata content tag:use case example

Richard Braman
I am following this thread as I have a similar issue to deal with in my
upcoming development.  Howie, thanks for your insights into this, as I
think this may solve my problem.

I am trying to index Title 26 of the US Code
http://www.access.gpo.gov/uscode/title26/title26.html

The problem is that I don't want the search engine's users to have to go
crazy trying to find a particular code section.

Generally the code is cited by users in this format: 26USC1
which translates to Title 26, Section 1.

Fortunately, the government puts the citation at the top of each page:
[CITE: 26USC1]
See
http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC1

My goal is to parse that citation out and make it so that I can let
users search on the citation.

So would I do something like

1. parse out the citation
2. metadata.put(<citation>, <citation>);

?

Thanks for your help on this.



Re: writing a metadata content tag:use case example

Thomas Delnoij-3
Richard,

> So would I do something like
>
> 1. parse out the citation
> 2. metadata.put(<citation>, <citation>);

Yes, I think that is the way to proceed. Then implement the indexing and
query filters, all as described in the WritingPluginExample tutorial:
http://wiki.apache.org/nutch/WritingPluginExample
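
For the citation case, step 1 could be sketched with a regular
expression. This assumes the page text contains the marker exactly in
the form "[CITE: 26USC1]" and that the section part is alphanumeric.
CitationExtractor and the field name "cite" are hypothetical, and
java.util.Properties again only stands in for the Nutch metadata
object. The sketch stores the citation under a fixed key rather than
using the citation as its own key, so that an index filter can look it
up by name.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class) for pulling the citation
// marker out of the page text.
class CitationExtractor {
  // Matches e.g. "[CITE: 26USC1]"; group 1 is the title, group 2 the section.
  private static final Pattern CITE =
      Pattern.compile("\\[CITE:\\s*(\\d+)USC(\\w+)\\]");

  // Returns the first citation in the compact form "26USC1", or null if
  // the text contains no marker.
  static String extract(String pageText) {
    Matcher m = CITE.matcher(pageText);
    return m.find() ? m.group(1) + "USC" + m.group(2) : null;
  }

  // Stores the citation under the assumed field name "cite", if present.
  static void stamp(String pageText, Properties metadata) {
    String cite = extract(pageText);
    if (cite != null) metadata.put("cite", cite);
  }

  public static void main(String[] args) {
    System.out.println(extract("From the U.S. Code Online [CITE: 26USC1]")); // prints "26USC1"
  }
}
```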

Rgrds, Thomas
