indexing Guides? Indexing names


Lee Goddard
Could you recommend a good guide to constructing an index: analyzers,
filters, and so on?

I've inherited a set-up that indexes company names. It does a great job
on 1,000 names or so, but when I put in a million or more, the results
make no sense.

My test search is for 'A & B Household', which should find 'A & B
Households'. With a million records (and several tens of millions still
to come), that name gets the same score as other names with different
initials.

Is it possible to weight the individual initials as words?

Would you recommend employing a stemmer?

Thanks in anticipation
Lee

Re: indexing Guides? Indexing names

Ted Dunning
On Tue, Jun 10, 2014 at 8:08 AM, Lee Goddard <[hidden email]> wrote:

> Is it possible to weight the individual initials as words?
>
> Would you recommend employing a stemmer?
>
>
Yes, it is definitely possible. But don't just use any stemmer. You need
to adapt something so that you preserve initial letters and likely use
heuristics, possibly including preserving case.

You will also probably want to include alternative forms in other fields.
These would include nicknames, stock symbols and abbreviations.
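
As a rough illustration of the advice above, here is a minimal, library-independent Python sketch of a name normalizer that keeps single-letter initials as tokens of their own and applies only a very conservative plural rule. The names `light_stem` and `normalize_name`, and the stemming rule itself, are assumptions made for illustration; they are not part of any search library's API.

```python
import re

# Hypothetical helpers; the rules are illustrative assumptions,
# not a recommended production stemmer.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|&")


def light_stem(token):
    """Strip a trailing plural 's' from longer words only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize_name(name):
    """Tokenize a company name, keeping single-letter initials intact."""
    tokens = []
    for raw in TOKEN_RE.findall(name):
        if raw == "&":
            tokens.append("and")        # treat '&' as a word
        elif len(raw) == 1:
            tokens.append(raw.upper())  # an initial: never stemmed, stored upper-case
        else:
            tokens.append(light_stem(raw.lower()))
    return tokens
```

Under these assumptions, 'A & B Household' and 'A & B Households' reduce to the same token list, while 'C & D Households' differs in exactly the initials, which is what a scorer would need in order to tell them apart.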

Re: indexing Guides? Indexing names

Lee Goddard

On 10/06/2014 18:40, Ted Dunning wrote:
> On Tue, Jun 10, 2014 at 8:08 AM, Lee Goddard <[hidden email]> wrote:
>
>> Is it possible to weight the individual initials as words?
>>
>> Would you recommend employing a stemmer?
>
> Yes, it is definitely possible. But don't just use any stemmer. You
> need to adapt something so that you preserve initial letters and
> likely use heuristics, possibly including preserving case.

Am I going to have to write a parser in Java for that, or is it a matter
of combining what is in the box? I've previously created indexes of photos
(my own parser) and indexes of documents, but indexing a single company
name is quite a new idea to me.

> You will also probably want to include alternative forms in other
> fields. These would include nicknames, stock symbols and
> abbreviations.

Not in this case: it's simply an interface for finding information the
state holds on the affairs of a company, so the only alternative forms
are those of the final element of the registered company name. It might
be 'Limited' but people may search for 'ltd'; it may be 'SE' but people
may search for 'european'.

TIA
Lee
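
For the suffix variants described in this post ('Limited' vs 'ltd', 'SE' vs 'european'), one possible approach is a small lookup applied to the final word of the name, used identically at index time and at query time. The sketch below is plain Python with no search library involved; `SUFFIX_FORMS` and `canonical_suffix` are hypothetical names, and the table holds only the two pairs mentioned above.

```python
# Variants a user might type, mapped to the registered form.
# Only the two pairs mentioned in the post are included.
SUFFIX_FORMS = {
    "ltd": "limited",
    "european": "se",
}


def canonical_suffix(name):
    """Return the name lower-cased, with its final word canonicalized."""
    words = name.lower().replace(".", "").split()
    if words and words[-1] in SUFFIX_FORMS:
        words[-1] = SUFFIX_FORMS[words[-1]]
    return " ".join(words)
```

With this, `canonical_suffix("Acme Ltd.")` and `canonical_suffix("Acme Limited")` both give the same string, so either spelling in a query would line up with the registered name. Restricting the lookup to the last word is a design assumption that avoids rewriting 'European' when it appears in the middle of a name.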