Preliminary, fundamental question about the demo

pagod
Hi,

I just started with Lucene today, and the first thing I did was try out the small demo. I followed the instructions in "Getting started - Building and Installing the Basic Demo" to the letter: I downloaded the JAR files (2.3.2), unpacked them, and launched the indexer on the src directory. That worked fine and indexed all Java files in the directory and its subdirectories. I didn't try to search for a swearword, but I did try to search for "vector". The fact that I got only one result whereas the demo says I should get a bunch of them isn't really the problem. The problem is that I got only one result although the word "vector" appears in TWO documents:
src/demo/org/apache/lucene/demo/html/HTMLParser.java
src/demo/org/apache/lucene/demo/SearchFiles.java
(I checked that with grep)

When I enter my query, I get a very clear answer:
Enter query:
vector
Searching for: vector
1 total matching documents
1. src/demo/org/apache/lucene/demo/SearchFiles.java

grep's version:
[silenos:apache/lucene/demo] veda> pwd
/home/veda/lucene/lucene-2.3.2/src/demo/org/apache/lucene/demo
[silenos:apache/lucene/demo] veda> grep -i vector * */*
SearchFiles.java:   * are all identical, then single norm vector may be shared. */
html/HTMLParser.java:  private java.util.Vector jj_expentries = new java.util.Vector();
[silenos:apache/lucene/demo] veda>


So my question is a very easy one: what happened? Is there special processing for Java files, like for HTML documents, which leaves comments out? Is it a bug only in the "demo" part of this small program (this would be surprising, as other queries seem to be working fine)? Is there actually a way I can check the content of my index: which files were actually indexed, or search for one file in particular? A bit like a field search, but with the URI of the file itself (though I think I read this is implementation-dependent, meaning one could do it programmatically, but it's not in the demo, right?).

Anyway, thx for your answers. I hope there is a good one to this question, cos I'd feel rather deceived if a search engine so obviously ignores some results...

David

Re: Preliminary, fundamental question about the demo

pagod
ok, my mistake. apparently the dot '.' is not considered a separator, so documents containing "java.util.Vector" will *not* be matched by a search for "vector". quite surprising if you ask me, but well, this can most probably be changed...
D

Re: Preliminary, fundamental question about the demo

hossman


Hello,

Two things you should know:

1) this is the general@lucene list -- it's the starting point for people
with questions about the entire Lucene project where they really have no
idea where to get started.  You seem to be asking about the Lucene-Java
demo code, so i'm assuming you are interested in writing Java code that
uses the Lucene search library to build your own applications.  In that
case, your best bet for future assistance is the java-user@lucene mailing
list.  (if i'm wrong, and you are more interested in using applications
already built with the Lucene library such as Solr or Nutch; or with using
the .Net port of the library, these subprojects all have their own
subproject mailing lists as well)...

        http://lucene.apache.org/mail.html

2) regarding this comment...

: ok, my mistake. apparently the dot '.' is not considered a separator, so
: documents containing "java.util.Vector" will *not* be matched by a search
: for "vector". quite surprising if you ask me, but well, this can most
: probably be changed...

That is a specific behavior of the "Analyzer" used when analyzing the
text.  It most certainly can be changed, and there is a wide variety of
Analyzers available that come with Lucene (particularly in the analysis
contrib package).
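
To make the tokenization point concrete, here is a minimal plain-Java
sketch. It is not Lucene code and does not reproduce any Lucene analyzer;
the class and method names (TokenizerSketch, splitOnWhitespace,
splitOnNonLetters, matches) are hypothetical. It only shows how the
choice of token boundaries decides whether a query for "vector" can
match text containing "java.util.Vector". The whitespace-only splitter
is a deliberately simplified stand-in for an analyzer that keeps the
dotted name as one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT Lucene's analysis code.
final class TokenizerSketch {

    // Splits on whitespace only, so "java.util.Vector" stays one token.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Splits at every non-letter character, so '.' acts as a separator.
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A query term matches only if it equals a whole token.
    static boolean matches(List<String> tokens, String query) {
        return tokens.contains(query.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String line =
            "private java.util.Vector jj_expentries = new java.util.Vector();";
        System.out.println(matches(splitOnWhitespace(line), "vector"));
        System.out.println(matches(splitOnNonLetters(line), "vector"));
    }
}
```

With the first splitter the only candidate tokens are the whole dotted
names, so "vector" does not match; with the second, "vector" becomes a
token of its own. An analyzer that divides text at non-letters (Lucene's
SimpleAnalyzer is described that way) would behave like the second case.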

The other oddity to arise from what you are seeing is that recent
versions of Lucene have reduced the usage of the Vector class quite a bit,
but the tutorial still uses it as an example.  i'll commit a quick fix for
that.




-Hoss