lucene and UTF-8

lucene and UTF-8

John Cherouvim
Hello

I'm having some problems indexing my UTF-8 HTML pages. I am running
Lucene on Linux and I cannot understand why the index that gets generated
depends on the locale of my operating system.
If I do set | grep LANG I get LANG=el_GR, which is Greek. If I set this
to en_US the generated index will be different. Why is this the case? My
HTML files are all UTF-8.

Also, is there a Lucene index browser? I am currently using Luke, which
is good but it doesn't show the Greek UTF-8 from within the index
correctly. Is this a matter of a setting in Luke?

Regards,
J.




Re: lucene and UTF-8

John Haxby-2
John Cherouvim wrote:

> I'm having some problems indexing my UTF-8 html pages. I am running
> lucene on Linux and I cannot understand why the index generated
> depends on the locale of my operating system.
> If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set
> this to en_US the index generated will be different. Why is this the
> case? My HTMLs are all UTF-8.

What version of Linux are you using?

On Fedora Core 4 (and probably other Fedoras and RHEL) LANG=el_GR sets
the character set to ISO 8859-7, e.g. (on my various machines):

    $ LANG=el_GR date | iconv -f iso88597
    ??? ??? 29 11:59:19 BST 2005
    $ LANG=el_GR.utf8 date
    ??? ??? 29 12:01:40 BST 2005

(Everything in FC4 is UTF-8 so it displays right and it seems that the
Greek for "Sep" is "Sep" -- no surprises there I guess.)

In your case, replacing "date" in the LANG=el_GR.utf8 invocation above with
whatever command you use to generate the indexes should do the right thing.
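
For example, if the indexing job were started as "java MyIndexer" (a made-up
name standing in for your real command), that would be:

    $ LANG=el_GR.utf8 java MyIndexer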

jch


Re: lucene and UTF-8

Andrzej Białecki-2
In reply to this post by John Cherouvim
John Cherouvim wrote:
> Hello
>
> I'm having some problems indexing my UTF-8 html pages. I am running
> lucene on Linux and I cannot understand why the index generated
> depends on the locale of my operating system.
> If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this
> to en_US the index generated will be different. Why is this the case? My
> HTMLs are all UTF-8.

I think the difference comes from the default character encoding: if the
page is NOT clearly marked as UTF-8, then the system has to guess, and
it guesses differently depending on the current locale.
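
A quick way to see which default encoding the JVM has picked up from the
locale is to print the file.encoding system property (a small diagnostic
sketch, nothing Lucene-specific):

    public class ShowDefaultEncoding {
        public static void main(String[] args) {
            // The JVM derives this from LANG/LC_* at startup, so an el_GR
            // session and an en_US session will typically report different
            // values here.
            System.out.println(System.getProperty("file.encoding"));
        }
    }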

>
> Also, is there a lucene index browser? I am currently using Luke, which
> is good but it doesn't show the Greek UTF-8 from within the index
> correctly. Is this a matter of a setting in Luke?

It's a matter of setting the appropriate font in Settings.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



luke start problem

Dirk Hennig
Hello,

I downloaded lukeall.jar, put it in my classpath and tried to start it:
 > java org.getopt.luke.Luke

and I get:
------
Exception in thread "main" java.lang.SecurityException: class
"org.apache.lucene.store.IndexInput"'s signer information does not match
signer information of other classes in the same package
        at java.lang.ClassLoader.checkCerts(ClassLoader.java:575)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:503)
        at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:246)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:54)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:193)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:186)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:265)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:262)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:322)
------

What's the problem?

Or am I using Luke the wrong way?
I couldn't find any documentation on how to use it. Is there something
available?

Thanx,
Dirk



TermDocs.freq()

pgwillia
In reply to this post by Andrzej Białecki-2
I am finding that the TermDocs.freq() method is returning an incorrect value.
I was wondering if anyone else had experienced this problem.

I am using tp = IndexReader.termPositions( queryTerm ) to return an object
which implements TermPositions.  I then use tp.skipTo( docid ) to go
directly to the document from which I wish to retrieve term positions.  The
following for loop adds the positions to my ArrayList, which I use later:

for( int pos = tp.nextPosition(), k = 0;
        k < tp.freq();
        pos = tp.nextPosition(), k++ )
{
        positionMatches.add( new Integer( pos ) );
}

In a document which I know has 48 references to the term, a frequency of
23 is returned.  There doesn't seem to be a pattern to this as some other
documents have (frequency, actual): (25, 48), (36, 43), (30, 149).

These frequencies are from results within my code and confirmed in Luke,
so I'm pretty certain that this isn't an error on my part.

I've been trying to find out where the origin of this issue is without
luck thus far.  Any help or advice would be appreciated.

Thanks,
Tricia


Re: TermDocs.freq()

Jérôme BENOIS
Hello everybody,

I would like to implement something like "Google
Suggest" (http://www.google.com/webhp?complete=1&hl=en), but how do I get
the suggested similar queries and the number of results for each?

Do you have any idea?

Thanks,
Jérôme.


Re: TermDocs.freq()

greggersh
Save user queries in a database along with the number of
results from the last time each was queried, and use that as
the suggestion base.

Notice that Google's result count in Suggest differs
from the actual result count.  They are not computing
results on the fly.

Greg
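
A minimal sketch of that approach in plain Java (the class and method names
are made up for illustration, and an in-memory map stands in for the
database of saved queries):

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class SuggestionIndex {
        // query text -> Integer result count recorded the last time it was run
        private final TreeMap counts = new TreeMap();

        public void record(String query, int resultCount) {
            counts.put(query, new Integer(resultCount));
        }

        // All saved queries that start with what the user has typed so far
        // (assumes no saved query contains the character '\uffff').
        public SortedMap suggest(String prefix) {
            return counts.subMap(prefix, prefix + '\uffff');
        }
    }

On each keystroke the UI calls suggest() with the current prefix and shows the
matching queries with their stored counts, so no real search runs while the
user is still typing.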

--- Jérôme BENOIS <[hidden email]> wrote:

> Hello everybody,
>
> I would like implement a "Google
> Suggest"
> (http://www.google.com/webhp?complete=1&hl=en) like
> but how to
> get similar criteria and number of results.
>
> Are you an idea ?
>
> Thanks,
> Jérôme.
>



               

Re: lucene and UTF-8

Chris Hostetter-3
In reply to this post by John Cherouvim

: I'm having some problems indexing my UTF-8 html pages. I am running
: lucene on Linux and I cannot understand why the index generated
: depends on the locale of my operating system.
: If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this
: to en_US the index generated will be different. Why is this the case? My
: HTMLs are all UTF-8.

To elaborate a little bit more on the comments other people have made, the
differences you are seeing are most likely related to your JVM using the
LANG variable to determine what the default charset will be when you open
readers.  You should look carefully at how you are opening the HTML files
and reading them in. If you aren't specifying the Charset explicitly in
your code, then you're getting the system default.
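
For example, something along these lines (a minimal sketch; the class and file
names are placeholders) always decodes the HTML bytes as UTF-8, so the text
handed to the analyzer no longer depends on LANG:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class Utf8Files {
        public static Reader open(File htmlFile) throws IOException {
            // Naming the charset here overrides the locale-derived default
            // that a plain FileReader would silently use.
            return new BufferedReader(
                new InputStreamReader(new FileInputStream(htmlFile), "UTF-8"));
        }
    }

The Reader returned here can then be fed to whatever parses the HTML and
builds the Lucene Document, exactly as before.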


-Hoss



Re: TermDocs.freq()

pgwillia
In reply to this post by pgwillia
To follow up on my post from Thursday: I have written a very basic test
for TermPositions.  It shows that only the first 10001 tokens are
considered when determining term frequency (i.e. if the search term is at
a position greater than 10001, my test fails).

Is this by design?  Is there an obvious work-around so that the frequency
that I receive is correct for my document?

Thank you for your consideration,
Tricia

On Thu, 29 Sep 2005, Tricia Williams wrote:

> I am finding that TermDocs.freq() method is returning an incorrect value.
> I was wondering if anyone else had experienced this problem.
>
> I am using tp = IndexReader.termPositions( queryTerm ) to return a object
> which implements TermPositions.  I then use tp.skipTo( docid ) to go
> directly to the document from which I wish to retrieve term positions. The
> following for loop adds the positions to my ArrayList which I use later:
>
> for( int pos = tp.nextPosition(), k = 0;
> k < tp.freq();
> pos = tp.nextPosition(), k++ )
> {
> positionMatches.add( new Integer( pos ) );
> }
>
> In a document which I know has 48 references to the term, a frequency of
> 23 is returned.  There doesn't seem to be a pattern to this as some other
> documents have (frequency, actual): (25, 48), (36, 43), (30, 149).
>
> These frequencies are from results within my code and confirmed in Luke,
> so I'm pretty certain that this isn't an error on my part.
>
> I've been trying to find out where the origin of this issue is without
> luck thus far.  Any help or advice would be appreciated.
>
> Thanks,
> Tricia
>


Re: TermDocs.freq()

Yonik Seeley
See IndexWriter.setMaxFieldLength()

-Yonik
Now hiring -- http://tinyurl.com/7m67g
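
That setting caps how many tokens of each field are indexed; the default is
10,000, which lines up with the cut-off Tricia is seeing. A minimal sketch of
raising it, assuming a Lucene version that has the setter (older releases
exposed a public maxFieldLength field on IndexWriter instead):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class BigFieldIndexer {
        public static IndexWriter open(String indexDir) throws Exception {
            IndexWriter writer =
                new IndexWriter(indexDir, new StandardAnalyzer(), true);
            // Index every token instead of stopping after the default 10,000.
            writer.setMaxFieldLength(Integer.MAX_VALUE);
            return writer;
        }
    }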

On 10/3/05, Tricia Williams <[hidden email]> wrote:

>
> To follow up on my post from Thursday. I have written a very basic test
> for TermPositions. This test allows me to identify that only the
> first 10001 tokens are considered to determine term frequency (ie with
> the searching term in a position greater than 10001 my test fails).
>
> Is this by design? Is there an obvious work-around so that the frequency
> that I receive is correct for my document?
>
> Thank you for your consideration,
> Tricia
>
> On Thu, 29 Sep 2005, Tricia Williams wrote:
>
> > I am finding that TermDocs.freq() method is returning an incorrect
> value.
> > I was wondering if anyone else had experienced this problem.
> >
> > I am using tp = IndexReader.termPositions( queryTerm ) to return a
> object
> > which implements TermPositions. I then use tp.skipTo( docid ) to go
> > directly to the document from which I wish to retrieve term positions.
> The
> > following for loop adds the positions to my ArrayList which I use later:
> >
> > for( int pos = tp.nextPosition(), k = 0;
> > k < tp.freq();
> > pos = tp.nextPosition(), k++ )
> > {
> > positionMatches.add( new Integer( pos ) );
> > }
> >
> > In a document which I know has 48 references to the term, a frequency of
> > 23 is returned. There doesn't seem to be a pattern to this as some other
> > documents have (frequency, actual): (25, 48), (36, 43), (30, 149).
> >
> > These frequencies are from results within my code and confirmed in Luke,
> > so I'm pretty certain that this isn't an error on my part.
> >
> > I've been trying to find out where the origin of this issue is without
> > luck thus far. Any help or advice would be appreciated.
> >
> > Thanks,
> > Tricia
> >