Which analyzer

Which analyzer

spring
Hi,

I have a huge number of documents which contain mainly numbers and dates
(German format dd.MM.yyyy), like this:

Tgr. gilt ab           01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99
01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99  46X0     01
0000048010108    0512070010
 Gefahrenklass                01       01       01       01       01
01       01       01       01       01       01       01  46X0     01
0000049010108    0512070010
 Bezahlte Std.            152,25   152,25   152,25   152,25   152,25
152,25   152,25   152,25   152,25   152,25   152,25   152,25  46X0     01
0000050010108    0512070010
 Woech.Arbzeit             35,00    35,00    35,00    35,00    35,00
35,00    35,00    35,00    35,00    35,00    35,00    35,00  46X0     01
0000051010108    0512070010
 Monatl.Arbzt.            152,25   152,25   152,25   152,25   152,25
152,25   152,25   152,25   152,25   152,25   152,25   152,25  

Which analyzer should I use when someone searches for a certain number or
date?

Thank you.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Which analyzer

Erick Erickson
*How* do you want to search them? If it's simply exact matches, then
WhitespaceAnalyzer should work fine.

But if you want to, for example, look at date ranges or number
ranges, you'll have to be more clever.

What do you want to accomplish?

Best
Erick
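
On the range case mentioned above: in Lucene versions of this period, range queries compare terms as plain strings, so a date written as dd.MM.yyyy or an amount written as 152,25 does not sort in date or numeric order. One common workaround is to index a second, sortable form of each value. The following is only a sketch of that idea in plain Java with no Lucene calls; the names SortableKeys, dateKey and amountKey, and the fixed width of 10 digits, are assumptions made for illustration and do not come from this thread.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class SortableKeys {
    // Input format as described in the original post: dd.MM.yyyy
    private static final DateTimeFormatter GERMAN = DateTimeFormatter.ofPattern("dd.MM.uuuu");
    // Year first, so that string order equals chronological order.
    private static final DateTimeFormatter SORTABLE = DateTimeFormatter.ofPattern("uuuuMMdd");

    // "01.02.1999" becomes "19990201".
    static String dateKey(String germanDate) {
        return LocalDate.parse(germanDate, GERMAN).format(SORTABLE);
    }

    // "152,45" becomes "0000015245" (cents, zero-padded to 10 digits).
    // Assumes non-negative amounts with at most two decimal places;
    // longValueExact() throws ArithmeticException if more decimals are present.
    static String amountKey(String germanAmount) {
        BigDecimal value = new BigDecimal(germanAmount.replace(',', '.'));
        long cents = value.movePointRight(2).longValueExact();
        return String.format(Locale.ROOT, "%010d", cents);
    }

    public static void main(String[] args) {
        System.out.println(dateKey("01.02.1999"));   // 19990201
        System.out.println(amountKey("152,45"));     // 0000015245
        // String comparison now agrees with date order.
        System.out.println(dateKey("31.12.1998").compareTo(dateKey("01.02.1999")) < 0); // true
    }
}
```

The keys would go into a separate field from the original text, so that exact-match searches on the raw values keep working.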

On Feb 7, 2008 3:25 PM, <[hidden email]> wrote:

> Which analyzer should I use when someone searches for a certain number or
> date?

RE: Which analyzer

spring
Hello,

Let's say the document contains

01.02.1999

and

152,45

Then I want to search for:

01.02.1999 AND 152,45
01.02.1999
152,45
1999
152

Thank you.

> -----Original Message-----
> From: Erick Erickson [mailto:[hidden email]]
> Sent: Friday, 8 February 2008 00:20
> To: [hidden email]
> Subject: Re: Which analyzer
>
> *How* do you want to search them? If it's simply exact matches, then
> WhitespaceAnalyzer should work fine.
>
> But if you want to, for example, look at date ranges or number
> ranges, you'll have to be more clever.
>
> What do you want to accomplish?
>
> Best
> Erick

Searching the backlog of mailing list for lucene java users

Mitchell, Erica
Hi,

I've found this link to trawl through the backlog of questions and
answers from this mailing list.

I'm new to Lucene and don't want to send questions that have already
been answered on the list.
Is there another link for searching previous entries, rather than clicking
the Prev and Next links and looking through the titles of the threads?

http://readlist.com/lists/lucene.apache.org/java-user/4/20264.html


Thanks,
erica

----------------------------
IONA Technologies PLC (registered in Ireland)
Registered Number: 171387
Registered Address: The IONA Building, Shelbourne Road, Dublin 4, Ireland


Re: Searching the backlog of mailing list for lucene java users

Grant Ingersoll-2
http://wiki.apache.org/lucene-java/MailingListArchives has a variety  
of options (although the readlist one is not listed)


On Feb 8, 2008, at 6:31 AM, Mitchell, Erica wrote:

> Hi,
>
> I've found this link to trawl through the backlog of questions and
> answers from this mailing list.
>
> I'm new to lucene and don't want to send questions that have already
> been answered in the list.
> Is there another link for searching previous entries, rather than clicking
> the Prev and Next links and looking through the titles of the threads?
>
> http://readlist.com/lists/lucene.apache.org/java-user/4/20264.html

RE: Searching the backlog of mailing list for lucene java users

steve_rowe
Hi Erica,

Another good place to look is at the FAQ:

   <http://wiki.apache.org/lucene-java/LuceneFAQ>

Steve

On 02/08/2008 at 8:10 AM, Grant Ingersoll wrote:

> http://wiki.apache.org/lucene-java/MailingListArchives has a variety
> of options (although the readlist one is not listed)

Re: Which analyzer

Erick Erickson
In reply to this post by spring
WhitespaceAnalyzer should do the trick. Give it a try...

My point was that RangeQuerys wouldn't work very well,
but since you're not trying to do that, WhitespaceAnalyzer
should handle your case.

Erick
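
One caveat on the last two searches in the list (1999 and 152): a whitespace-only tokenizer keeps 01.02.1999 and 152,45 as single terms, so a plain term query for 1999 or 152 would not be expected to match them. The sketch below imitates the two tokenization strategies in plain Java and does not call Lucene; TokenSketch, whitespaceTokens and splitTokens are hypothetical names, and indexing the split parts next to the whole token is just one possible approach, which in Lucene would need a custom analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenSketch {
    // Mimics whitespace-only tokenization: a token is whatever sits
    // between runs of whitespace, punctuation included.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Keeps every whole token and also adds its parts, split on '.' and ','.
    static List<String> splitTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            out.add(token);
            String[] parts = token.split("[.,]");
            if (parts.length > 1) {
                for (String part : parts) {
                    if (!part.isEmpty()) {
                        out.add(part);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "01.02.1999   152,45";
        System.out.println(whitespaceTokens(line));               // [01.02.1999, 152,45]
        System.out.println(whitespaceTokens(line).contains("1999")); // false
        System.out.println(splitTokens(line).contains("1999"));      // true
        System.out.println(splitTokens(line).contains("152"));       // true
    }
}
```

Whole-value searches such as 01.02.1999 AND 152,45 work under either strategy, since the complete tokens are kept in both.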

On Feb 8, 2008 4:40 AM, <[hidden email]> wrote:

> Hello,
>
> Let's say the document contains
>
> 01.02.1999
>
> and
>
> 152,45
>
> Then I want to search for:
>
> 01.02.1999 AND 152,45
> 01.02.1999
> 152,45
> 1999
> 152
>
> Thank you.

MultiFieldQueryParser question

Mitchell, Erica
Hi,


I'm trying to build up an index of fields to represent an
org.eclipse.emf.ecore.EObject;


So I'm adding these fields to my lucene doc

doc.add(new Field(NAME, cls.getName(),
        Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field(IDENTITY, obj.eGet(cls.getEIDAttribute()).toString(),
        Field.Store.YES, Field.Index.TOKENIZED));

I then get the EAttributes from my EObject and I want to add these to
the index

for (Object o : cls.getEAllAttributes()) {
    if (o instanceof EAttribute) {
        EAttribute attribute = (EAttribute) o;
        String attributeName = attribute.getName();
        String attributeValue = obj.eGet(attribute).toString();

        // Each attribute becomes its own field, named after the attribute.
        doc.add(new Field(attributeName, attributeValue,
                Field.Store.YES, Field.Index.TOKENIZED));
    }
}


This is how I'm searching all fields for text containing "aaa"

String[] searchFields = {IDENTITY, NAME, ATTRIBUTE};

MultiFieldQueryParser multiparser =
        new MultiFieldQueryParser(searchFields, new StandardAnalyzer());
multiparser.setDefaultOperator(QueryParser.Operator.OR);
Query query = multiparser.parse("aaa*");
Hits hits = isearcher.search(query);


So using Luke I can see my index contains information like this

<id> aaa2
<id> aaa1

<guid> aaa2
<guid> pi4

<name> attribute
<name> poliyinstance


I'd like my query to return both the id fields and the guids.
The problem is that the guid is added to the index by virtue of my
doc.add(new Field(attributeName, attributeValue, ...)) call.

So for the query to return the attributes correctly, I'd need to pass in
the attributeName, which I don't have.

Is the answer simply that I should be keeping an array of the
attribute names returned, and passing these as part of the String[] into
searchFields?

Thanks a million,
Erica


RE: Which analyzer

spring
In reply to this post by Erick Erickson
OK, I will try it.
Thank you.

> -----Original Message-----
> From: Erick Erickson [mailto:[hidden email]]
> Sent: Friday, 8 February 2008 14:25
> To: [hidden email]
> Subject: Re: Which analyzer
>
> WhitespaceAnalyzer should do the trick. Give it a try...
>
> My point was that RangeQuerys wouldn't work very well,
> but since you're not trying to do that, WhitespaceAnalyzer
> should handle your case.
>
> Erick

RE: MultiFieldQueryParser question

Mitchell, Erica
In reply to this post by Mitchell, Erica
Solved it, for those who were wondering where I went wrong.

I built up an ArrayList of attribute names while adding my attributes to
the Lucene doc and updated my MultiFieldQueryParser to search all of these
attribute names, as follows

// attrList holds the attribute names (Strings) collected while indexing.
String[] attributeNamesArray = (String[]) attrList.toArray(new String[0]);
String[] otherFieldsArray = {IDENTITY, NAME};
String[] searchFields =
        new String[attributeNamesArray.length + otherFieldsArray.length];

// Attribute names first, then the two fixed fields.
System.arraycopy(attributeNamesArray, 0, searchFields, 0,
        attributeNamesArray.length);
System.arraycopy(otherFieldsArray, 0, searchFields,
        attributeNamesArray.length, otherFieldsArray.length);

MultiFieldQueryParser multiparser =
        new MultiFieldQueryParser(searchFields, new StandardAnalyzer());
multiparser.setDefaultOperator(QueryParser.Operator.OR);
Query query = multiparser.parse("aaa*");


When I call toString() on my query, the syntax looks correct

Query guid:aaa* name:aaa* value:aaa* id:aaa* name:aaa*

When I debug into the hits, I can see that only the 2 id
results are returned.
It's not finding the guid as a separate hit because it's in the same
document as the id.
I discovered this by using Luke to inspect the documents.
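
The behaviour described above is consistent with hits being whole documents rather than individual fields: a document that matches in both id and guid still counts once. The toy model below is plain Java, not Lucene code; HitSketch, matchingDocs and sampleDocs are made-up names, and the way the values from the Luke output are paired into two documents is a guess made only to show the counting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HitSketch {
    // Returns the indexes of documents in which at least one of the given
    // fields starts with the prefix. A document is listed once, however
    // many of its fields match.
    static List<Integer> matchingDocs(List<Map<String, String>> docs,
                                      String[] fields, String prefix) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            for (String field : fields) {
                String value = docs.get(i).get(field);
                if (value != null && value.startsWith(prefix)) {
                    hits.add(i);
                    break; // this document is already a hit
                }
            }
        }
        return hits;
    }

    // Two hypothetical documents; the field/value pairing is assumed.
    static List<Map<String, String>> sampleDocs() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "aaa1");
        first.put("guid", "pi4");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "aaa2");
        second.put("guid", "aaa2");

        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(first);
        docs.add(second);
        return docs;
    }

    public static void main(String[] args) {
        String[] fields = {"guid", "name", "value", "id"};
        // Three field values start with "aaa", but they live in two documents.
        System.out.println(matchingDocs(sampleDocs(), fields, "aaa").size()); // 2
    }
}
```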




Re: MultiFieldQueryParser question

hossman
In reply to this post by Mitchell, Erica

: Subject: MultiFieldQueryParser question
: Date: Fri, 8 Feb 2008 14:53:37 -0000
: Message-ID:
:     <[hidden email]>
: In-Reply-To: <[hidden email]>

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss

