Feature idea - Indexing Text Lengths

Feature idea - Indexing Text Lengths

Hero Doug
Sorry I can't give more than an idea; I'm not a Java developer, but I think the idea could prove useful.

I'm not completely sure how the spider works while indexing, but I've noticed that a site like w3schools.com lists a lot of keywords in its side menus. So if I index just that one site (which many people may use Nutch for) and search for "asp", I get a lot of pages that have little to do with ASP but get high placement because the keyword is repeated over and over in the side menu.

The idea is to set a minimum length for the sentences that get entered into the index. So, after parsing a page, any words that don't make up what appears to be a complete sentence get ignored.

Hopefully I can properly explain what I'm thinking with this example:

A typical webpage may look like this.
----------------------------
<table>
        <tr>
                <td><a href='manual/'>php manual</a></td>
                <td><a href='functions/'>php functions</a></td>
                <td><a href='arrays/'>php arrays</a></td>
                <td><a href='variables/'>php variables</a></td>
                <td><a href='modules/'>php modules</a></td>
        </tr>
</table>
<table>
        <tr>
                <td>This page gives detailed information about how to compile php with aspell capabilities. First, you need a computer,......</td>
        </tr>
</table>
----------------------------

Once the HTML is stripped (but not the line breaks), it may look something like this:

----------------------------


php manual
php functions
php arrays
php variables
php modules




This page gives detailed information about how to compile php with aspell capabilities. First, you need a computer,......


----------------------------
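For anyone who wants to reproduce the stripping step above, here is a minimal sketch in Python (not Nutch's actual parser, which is written in Java). It uses the standard library's `html.parser` to drop tags while leaving line breaks in place. The names `TextExtractor` and `strip_tags` are made up for this illustration, and the sample page is a shortened version of the one above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops tags; line breaks in the source survive."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, including whitespace.
        self.chunks.append(data)


def strip_tags(html):
    # Hypothetical helper: returns the page text with all markup removed.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


page = """<table>
  <tr>
    <td><a href='manual/'>php manual</a></td>
    <td><a href='functions/'>php functions</a></td>
  </tr>
</table>
<table>
  <tr>
    <td>This page gives detailed information about how to compile php with aspell capabilities.</td>
  </tr>
</table>"""

# One entry per source line, with indentation trimmed; blank lines are kept.
lines = [line.strip() for line in strip_tags(page).splitlines()]
```

Each menu link ends up on its own short line and the body text on one long line, which is the property the rest of the idea relies on.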

So, there are a lot of leftover words from the side column menus. Since they're no more than two words long, I would love to be able to ignore them, since I don't believe they're always related to the content of the page. Being able to configure a threshold of 3 words, 5 words, 20 words, etc. could help increase relevancy, since users will be visiting that page to read the content, not the side menu.
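To make the proposal concrete, here is a rough sketch in Python of the filter being described. It is only an illustration of the idea, not an existing Nutch feature: `keep_long_lines` and its `min_words` parameter are hypothetical names, and "sentence" is approximated as one line of stripped text.

```python
def keep_long_lines(text, min_words=3):
    """Keep only lines with at least `min_words` whitespace-separated words."""
    kept = []
    for line in text.splitlines():
        # Blank lines have zero words, so they are dropped as well.
        if len(line.split()) >= min_words:
            kept.append(line.strip())
    return "\n".join(kept)


stripped = """
php manual
php functions
php arrays
php variables
php modules

This page gives detailed information about how to compile php with aspell capabilities. First, you need a computer
"""

# With a threshold of 3 words, the two-word menu entries are ignored
# and only the body text would be passed on to the indexer.
print(keep_long_lines(stripped, min_words=3))
```

A fixed threshold is a blunt tool: a menu entry of three or more words would still get through, and a short but meaningful heading would be thrown away, so the right value probably depends on the site being indexed.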




Re: Feature idea - Indexing Text Lengths

Jérôme Charron
> Sorry I can't give more than an idea; I'm not a Java developer, but I think
> the idea could prove useful.
> The idea is to limit the length of sentences that get entered into the
> index. So, after parsing a page, any words that don't make up what appears
> to be a complete sentence get ignored.

Douglas,

Here is a previous discussion about this subject on the list:
http://www.mail-archive.com/nutch-dev@.../msg03070.html
Take a look at this thread. This problem is not so easy.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Feature idea - Indexing Text Lengths

Hero Doug
Thanks for the link, it was an interesting read. It seems like they're overcomplicating things a bit. To me it's just a matter of counting how long a sentence is: if you look at most web pages, the sentences in their side columns are usually short filler, while the sentences in the main content area are longer.

Anyway, I'll leave it to the Java pros. Thanks for the link.
