[lucy-user] Feature question about Lucy vs. Ferret

[lucy-user] Feature question about Lucy vs. Ferret

Andrew S. Townley
Hi folks,

I'm trying to integrate fulltext search into my primarily Ruby project.  Initially, I'd discovered Ferret, and I see that David is updating the git repository periodically.  After that, I discovered the whole plan to merge KinoSearch and Ferret into Lucy, so I have to admit I'm a bit confused--especially given the time elapsed since that announcement (although I did read some of the list archives about the licensing/IP issues, so can understand).

I'd hoped it would be easy to integrate Ferret into my project because of the way that it allows separate, field-based indexing.  However, I'm a bit puzzled why there's no way to actually get more information about the search hits than the document ID and a score.

I tried to look through the recent KinoSearch documentation to see if it (and therefore lucy) is also going to have this limitation, but I don't know Perl, and I have a bit of a hard time figuring out exactly what's going on.  If it were C, C++, Ruby, Python, Java, C# or even a few other obscure languages, I might have better luck. ;)

Anyway, what I want to do is index a bunch of structures - essentially Ruby Hash objects - with named values and be able to know at least which of the fields actually matched the search query.  Ideally, I'd also like to have the offset and length of the match so I could do highlighting on the original data and not have to store everything effectively twice.

I'm only a few days into learning Ferret, and it had looked like it would fit the bill for my immediate - and I do mean "immediate" need.  However, now that I've discovered this, I'm entertaining alternatives.

With that background in mind, I have the following questions:

1) Can lucy store multi-field "documents" a la Ferret?

2) Can lucy give me the match result information I'm looking for within each document as part of the search hit information?

3) How would you relate the completeness/stability of the core C library and Ruby bindings?

Any information, answers or suggestions would be really appreciated.

Cheers,

ast
--
Andrew S. Townley <[hidden email]>
http://atownley.org


Re: [lucy-user] Feature question about Lucy vs. Ferret

Peter Karman
Hi Andrew,

Andrew S. Townley wrote on 02/23/2011 09:51 AM:

>
> 1) Can lucy store multi-field "documents" a la Ferret?

Yes.

>
> 2) Can lucy give me the match result information I'm looking for within each document as part of the search hit information?
>

For highlighting and snippet extraction? yes.

> 3) How would you relate the completeness/stability of the core C library and Ruby bindings?
>

Alas, here's the rub. There are no Ruby bindings at present. The core C
code is stable and "complete" (for some value of "complete" -- i.e. it
works). But to date there are only Perl bindings.

I posted about this on the Ferret list a while back, inviting Ruby
developers to come have a look and help jump-start the Ruby
implementation. I realize your project has some immediate needs; please
also consider hanging around and helping us define the Ruby
implementation. Subscribe to lucy-dev to get started.

--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Feature question about Lucy vs. Ferret

Andrew S. Townley
Hi Peter,

Thanks for the quick reply!

On 23 Feb 2011, at 4:49 PM, Peter Karman wrote:

> Hi Andrew,
>
> Andrew S. Townley wrote on 02/23/2011 09:51 AM:
>
>>
>> 1) Can lucy store multi-field "documents" a la Ferret?
>
> Yes.

Great.

>>
>> 2) Can lucy give me the match result information I'm looking for within each document as part of the search hit information?
>>
>
> For highlighting and snippet extraction? yes.

Well, actually, I want it for more than that.  For my particular needs, I need to get the field name where the match occurred in the document, and then I'd ideally like to have the start offset into that field and the length of the match.

This is the core information I can't get right now from Ferret.  For example (never mind about the accuracy of the information here ;):

valkyrie$ irb
>> require 'ferret'
=> true
>> include Ferret
=> Object
>> index = Index::Index.new
=> #<Ferret::Index::Index:0x101342108 @default_input_field=:id, @mon_waiting_queue=[], @qp=nil, @default_field=:*, @key=nil, @auto_flush=false, @mon_entering_queue=[], @open=true, @dir=#<Ferret::Store::RAMDirectory:0x101342068>, @mon_count=0, @id_field=:id, @reader=nil, @searcher=nil, @close_dir=true, @mon_owner=nil, @writer=nil, @options={:dir=>#<Ferret::Store::RAMDirectory:0x101342068>, :analyzer=>#<Ferret::Analysis::StandardAnalyzer:0x101341e60>, :lock_retry_time=>2, :default_field=>:*}>
>> index << {:title => "Fred flinstone", :description => "The cartoon series" }
=> nil
>> index << {:title => "The Flinstones", :description => "Fred flinstone's family" }
=> nil
>> index.search("flinstones")
=> #<struct Ferret::Search::TopDocs total_hits=1, hits=[#<struct Ferret::Search::Hit doc=1, score=0.254271149635315>], max_score=0.254271149635315, searcher=#<Ferret::Search::Searcher:0x101314e60>>

The Ferret::Search::Hit gives me the document number and the score, but that's it.  In whatever list format the results are actually in, I'd also like to have the information I mentioned.  If you weren't storing the offset information, then it would make sense for it not to be available, but if you were, then I'd expect to have the whole thing right there.  I can't see how there'd be a performance issue in providing this information.
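
The information being asked for here can be approximated after the fact, once the original Hash for a hit is in hand. What follows is only a minimal sketch under stated assumptions: `matching_fields` is a hypothetical helper, not part of Ferret or Lucy, and its whitespace/punctuation tokenization is a stand-in for a real analyzer (no stemming, no stop words).

```ruby
# Hypothetical helper, not a Ferret or Lucy API: given the original Hash
# for a hit and the query terms, report which fields contain each term.
# Tokenization is a naive stand-in for a real analyzer.
def matching_fields(doc, terms)
  result = Hash.new { |hash, field| hash[field] = [] }
  doc.each do |field, text|
    tokens = text.downcase.scan(/\w+/)
    terms.each do |term|
      result[field] << term if tokens.include?(term.downcase)
    end
  end
  result
end

docs = [
  { :title => "Fred flinstone", :description => "The cartoon series" },
  { :title => "The Flinstones", :description => "Fred flinstone's family" }
]

# Document 1 is the hit from the irb session above.
p matching_fields(docs[1], ["flinstones"])
```

For document 1 only the :title field is reported, because the naive tokenizer splits "flinstone's" into "flinstone" and "s". The cost is one pass over every field of every hit, which is exactly the overhead the question is trying to avoid.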

I just want to make sure we're on the same page, as this is a critical feature for what I'm trying to do.

>
>> 3) How would you relate the completeness/stability of the core C library and Ruby bindings?
>>
>
> Alas, here's the rub. There are no Ruby bindings at present. The core C
> code is stable and "complete" (for some value of "complete" -- i.e. it
> works). But to date there are only Perl bindings.
>
> I posted about this on the Ferret list a while back, inviting Ruby
> developers to come have a look and help jump-start the Ruby
> implementation. I realize your project has some immediate needs; please
> also consider hanging around and helping us define the Ruby
> implementation. Subscribe to lucy-dev to get started.


Thanks for the information.  Unfortunate.  Thanks for the offer to help out though.  It might be a while before I have any bandwidth, but depending on how things go, lucy might be the best long-term solution.

In digging around Google in the interim between now and my original note, I re-read the charter for Lucy.  One of the things that struck me was the "implementing as much functionality in high-level languages as possible" comment.  What does this mean, exactly?

Part of the reason I ask has to do with the future of my own project.  Much of what I have now will eventually be rewritten piecemeal in C++ and then wrapped via SWIG so I can have Ruby and Java bindings as well as use it in other environments natively supporting C/C++.  Whatever route I end up going for fulltext, this is something that would need to support the same kind of thing as I'd actually be leveraging it more from the C++ code than the Ruby code.

With the way the statement above is phrased, it seems like this wouldn't really be possible.  It also seems like there might be an awful lot of duplication of effort involved in actually creating each language binding.  Why was this approach chosen rather than put all the muscle in the C code and provide thin wrappers--even via SWIG or something more hand-tailored where necessary/appropriate?

I tried to dig through the lucy SVN repository via the web UI, but I couldn't really figure out what's there.  The code generator framework you're using is something I haven't seen before, but at least it explains why I couldn't find the Ruby bindings! :)

Anyway, thanks for the answers.  Presently tinkering with the Ferret internals since it seems like there ought to be a way to expose what I want (it's in the explain output), but there's a lot of code, and I'm certainly no search engine expert!

Cheers,

ast
--
Andrew S. Townley <[hidden email]>
http://atownley.org


Re: [lucy-user] Feature question about Lucy vs. Ferret

Nathan Kurz
On Wed, Feb 23, 2011 at 9:14 AM, Andrew S. Townley <[hidden email]> wrote:
> Well, actually, I want it for more than that.  For my particular needs, I need to get the field name where the match occurred in the document, and then I'd ideally like to have the start offset into that field and the length of the match.

Apart from the lack of Ruby bindings, this won't be a problem with
Lucy.  It's a data-forward approach, so that if the information is in
the indexes, you'll have access to it.  You might need to write a
custom Hit class (or the like), but it will certainly be possible.

> One of the things that struck me was the "implementing as much functionality in high-level languages as possible" comment.  What does this mean, exactly?
> Why was this approach chosen rather than put all the muscle in the C code and provide thin wrappers--even via SWIG or something more hand-tailored where necessary/appropriate?

I think you're missing an implied "And not only that, if you order by
midnight tonight now you'll also receive..."  Lucy is/will-have a
complete C core that can be used directly, but it will also be
possible to override the functionality class-by-class in Perl, Ruby,
Python, etc.   It's the added potential for accessing this
functionality from a scripting language that is being highlighted, not
the requirement.

> Part of the reason I ask has to do with the future of my own project.  Much of what I have now will eventually be rewritten piecemeal in C++ and then wrapped via SWIG so I can have Ruby and Java bindings as well as use it in other environments natively supporting C/C++.  Whatever route I end up going for fulltext, this is something that would need to support the same kind of thing as I'd actually be leveraging it more from the C++ code than the Ruby code.

Sounds like an excellent fit for Lucy.   In the same way that we hope
to allow the C-core to be overridden with scripting languages for fast
prototyping, it also should be easy to then selectively optimize
that with C++.  It's an ambitious multilingual goal, so it's possible
it will not be fully achieved, but your sort of application is exactly
the reason this approach was chosen.

Nathan Kurz
[hidden email]

Re: [lucy-user] Feature question about Lucy vs. Ferret

Andrew S. Townley
Hi Nathan,

On 23 Feb 2011, at 6:01 PM, Nathan Kurz wrote:

> On Wed, Feb 23, 2011 at 9:14 AM, Andrew S. Townley <[hidden email]> wrote:
>> Well, actually, I want it for more than that.  For my particular needs, I need to get the field name where the match occurred in the document, and then I'd ideally like to have the start offset into that field and the length of the match.
>
> Apart from the lack of Ruby bindings, this won't be a problem with
> Lucy.  It's a data-forward approach, so that if the information is in
> the indexes, you'll have access to it.  You might need to write a
> custom Hit class (or the like), but it will certainly be possible.

Sounds good.

>
>> One of the things that struck me was the "implementing as much functionality in high-level languages as possible" comment.  What does this mean, exactly?
>> Why was this approach chosen rather than put all the muscle in the C code and provide thin wrappers--even via SWIG or something more hand-tailored where necessary/appropriate?
>
> I think you're missing an implied "And not only that, if you order by
> midnight tonight now you'll also receive..."  Lucy is/will-have a
> complete C core that can be used directly, but it will also be
> possible to override the functionality class-by-class in Perl, Ruby,
> Python, etc.   It's the added potential for accessing this
> functionality from a scripting language that is being highlighted, not
> the requirement.

Does it come with flying cars too?? ;) http://xkcd.com/864/

>> Part of the reason I ask has to do with the future of my own project.  Much of what I have now will eventually be rewritten piecemeal in C++ and then wrapped via SWIG so I can have Ruby and Java bindings as well as use it in other environments natively supporting C/C++.  Whatever route I end up going for fulltext, this is something that would need to support the same kind of thing as I'd actually be leveraging it more from the C++ code than the Ruby code.
>
> Sounds like an excellent fit for Lucy.   In the same way that we hope
> to allow the C-core to be overridden with scripting languages for fast
> prototyping, it also should be easy to then selectively optimize
> that with C++.  It's an ambitious multilingual goal, so it's possible
> it will not be fully achieved, but your sort of application is exactly
> the reason this approach was chosen.


Well, all that makes me very optimistic about the future of Lucy.  I'll keep my eye on things, and if I have time, I'll certainly help if I can.  This is a pretty core feature of the system I'm building - and it's not your average Web search application - so I'm sure I'll be able to contribute some requirements that differ from the middle-of-the-road ones.

Thanks for your replies, guys.

Cheers,

ast
--
Andrew S. Townley <[hidden email]>
http://atownley.org


Re: [lucy-user] Feature question about Lucy vs. Ferret

Marvin Humphrey
In reply to this post by Andrew S. Townley
On Wed, Feb 23, 2011 at 05:14:49PM +0000, Andrew S. Townley wrote:

> In digging around Google in the interim between now and my original note, I
> re-read the charter for Lucy.  One of the things that struck me was the
> "implementing as much functionality in high-level languages as possible"
> comment.  What does this mean, exactly?

That's the 2006 Lucene sub-project proposal at
<http://wiki.apache.org/jakarta-lucene/LucyProposal/>.  Lucy was rebooted last
July, entering the Incubator with a new proposal at
<http://wiki.apache.org/incubator/LucyProposal>, absorbing the KinoSearch
codebase, and aiming to become a top-level Apache project.

We should probably go tag that outdated 2006 page with a note indicating that
it's obsolete.

Our approach has changed over the years.  Now nearly all core code is in C.

Marvin Humphrey


Re: [lucy-user] Feature question about Lucy vs. Ferret

Marvin Humphrey
In reply to this post by Andrew S. Townley
On Wed, Feb 23, 2011 at 05:14:49PM +0000, Andrew S. Townley wrote:
> Well, actually, I want it for more than that.  For my particular needs, I
> need to get the field name where the match occurred in the document, and
> then I'd ideally like to have the start offset into that field and the
> length of the match.
 
> The Ferret::Search::Hit gives me the document number and the score, but
> that's it.  In whatever list format the results are actually in, I'd also
> like to have the information I mentioned.  If you weren't storing the offset
> information, then it would make sense for it not to be available, but if you
> were, then I'd expect to have the whole thing right there.  I can't see how
> there'd be a performance issue in providing this information.

You have to generate that information after the fact, by post-processing the
Hits that come back.  Lucy, Lucene, and Ferret all have the same behavior in
this regard.

Matching and scoring are highly abstracted for speed.  The matching engine
does not scan raw document content, a la an RDBMS full table scan -- instead,
it iterates over heavily optimized data structures devoid of introspection
overhead.  At the end of a search, you will only have documents and scores --
not sophisticated metadata about what part of the subquery matched and what
parts didn't and how much each matching part contributed to the score.
Keeping track of such metadata during the matching phase would be
prohibitively expensive.
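
To make the point concrete, here is a toy inverted index in Ruby. It is a sketch under loud assumptions: `ToyIndex` is invented for illustration, the "score" is a bare term-frequency sum rather than real TF/IDF, and nothing here reflects the actual data structures in Lucy, Lucene, or Ferret. What it does show is that a search walks posting lists and accumulates scores per document, so the field that contributed each score is gone by the time the results come back.

```ruby
# Toy inverted index, for illustration only (not Lucy/Lucene/Ferret code).
# Postings map [field, term] => { doc_id => term frequency }.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, key| hash[key] = Hash.new(0) }
    @fields = []
  end

  def add(doc_id, doc)
    doc.each do |field, text|
      @fields << field unless @fields.include?(field)
      text.downcase.scan(/\w+/).each do |term|
        @postings[[field, term]][doc_id] += 1
      end
    end
  end

  # Returns [[doc_id, score], ...] in descending score order.
  # Scores from every field are summed into one number per document,
  # so the result carries no record of which field matched.
  def search(term)
    scores = Hash.new(0.0)
    @fields.each do |field|
      key = [field, term.downcase]
      next unless @postings.key?(key)
      @postings[key].each { |doc_id, tf| scores[doc_id] += tf }
    end
    scores.sort_by { |doc_id, score| [-score, doc_id] }
  end
end

index = ToyIndex.new
index.add(0, { :title => "Fred flinstone", :description => "The cartoon series" })
index.add(1, { :title => "The Flinstones", :description => "Fred flinstone's family" })
p index.search("fred")
```

Both documents come back with the same score for "fred", even though one matched in :title and the other in :description.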

In Lucy, our highlighting capabilities are powered by the Highlight_Spans()
method, which is invoked on a derivative of the Query object:

    /** Return an array of Span objects, indicating where in the given
     * field the text that matches the parent query occurs.  In this case,
     * the span's offset and length are measured in Unicode code points.
     * The default implementation returns an empty array.    
     *  
     * @param searcher A Searcher.
     * @param doc_vec A DocVector.
     * @param field The name of the field.
     */  
    public incremented VArray*
    Highlight_Spans(Compiler *self, Searcher *searcher,
                    DocVector *doc_vec, const CharBuf *field);

Perhaps that might be of use for you.
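
For readers who don't read C, the shape of the result (spans with an offset and a length, both measured in Unicode code points) can be sketched in Ruby. This is only an illustration: the `Span` struct and the `highlight_spans` function below are hypothetical stand-ins that scan raw text for a single term, whereas the real method works from a Searcher and a DocVector as documented above.

```ruby
# Hypothetical stand-in for illustration; not the Lucy API.
# offset and length are in characters, which for a UTF-8 string
# are Unicode code points.
Span = Struct.new(:offset, :length)

def highlight_spans(text, term)
  spans = []
  pattern = /\b#{Regexp.escape(term)}\b/i
  text.scan(pattern) do
    match = Regexp.last_match
    # MatchData#begin returns a character offset, not a byte offset.
    spans << Span.new(match.begin(0), match[0].length)
  end
  spans
end

highlight_spans("Café Fred, fred again", "fred").each do |span|
  puts "offset=#{span.offset} length=#{span.length}"
end
```

The "é" counts as one position even though it occupies two bytes in UTF-8, which is why code-point offsets are the convenient unit for highlighting the original data.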

> Part of the reason I ask has to do with the future of my own project.  Much
> of what I have now will eventually be rewritten piecemeal in C++ and then
> wrapped via SWIG so I can have Ruby and Java bindings as well as use it in
> other environments natively supporting C/C++.  Whatever route I end up going
> for fulltext, this is something that would need to support the same kind of
> thing as I'd actually be leveraging it more from the C++ code than the Ruby
> code.

I concur with Nate that this is exactly the kind of project that we would like
to support with Lucy.

> With the way the statement above is phrased, it seems like this wouldn't
> really be possible.  It also seems like there might be an awful lot of
> duplication of effort involved in actually creating each language binding.
> Why was this approach chosen rather than put all the muscle in the C code
> and provide thin wrappers--even via SWIG or something more hand-tailored
> where necessary/appropriate?

From the very start we've been determined to make Lucy's bindings feel like
native code in the host language, so that users would feel as at home as
possible.  However, we've changed our approach over the years.  Now nearly
everything's in C, but we've modified our object model to make e.g. native
subclassing transparent and easy.  This approach has proven highly successful;
most KinoSearch power users do some degree of subclassing, and a number of
projects have been published on CPAN.

> I tried to dig through the lucy SVN repository via the web UI, but I
> couldn't really figure out what's there.  The code generator framework
> you're using is something I haven't seen before, but at least it explains
> why I couldn't find the Ruby bindings! :)

There's a short high-level introduction to the Lucy codebase here:

  http://svn.apache.org/repos/asf/incubator/lucy/trunk/core/Lucy/Docs/DevGuide.cfh

> Presently tinkering with the Ferret internals since it seems like  there
> ought to be a way to expose what I want (it's in the explain output)

That might work.  Most people use the Explanation API for tuning and
troubleshooting, though; it might prove a little expensive or unwieldy for
what you're doing.

Marvin Humphrey


Re: [lucy-user] Feature question about Lucy vs. Ferret

Nathan Kurz
On Fri, Feb 25, 2011 at 8:12 AM, Marvin Humphrey <[hidden email]> wrote:
> At the end of a search, you will only have documents and scores --
> not sophisticated metadata about what part of the subquery matched and what
> parts didn't and how much each matching part contributed to the score.
> Keeping track of such metadata during the matching phase would be
> prohibitively expensive.

It's only prohibitive if you don't need that data.  If you actually need
it (as Andrew seems to), and are going to do it in post-processing
anyway, it's just the cost of doing business.

My kick has been about making it easy to swap in non-TF/IDF scorers.
I think part of doing so will be adding greater room for scratch data
to Hits returned. My canonical example is that I want it to be
possible to do alphabetical sorting of Hits by a category field.   At
some point you need a collector that can see field values, which if
you squint right is just a special case of what Andrew wants.
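
As a rough illustration of the idea, here is a Ruby sketch of a collector that looks up a field value for each hit and keeps it as scratch data. Everything in it is an assumption made for the example: `Hit`, `CategoryCollector`, and the `collect` callback are invented names, and the stored fields are a plain Hash rather than anything read from a real index.

```ruby
# Hypothetical sketch; none of these names are Lucy or Ferret APIs.
Hit = Struct.new(:doc_id, :score, :scratch)

class CategoryCollector
  # stored_fields: { doc_id => { :category => "..." } }, assumed to be
  # available to the collector at match time.
  def initialize(stored_fields)
    @stored_fields = stored_fields
    @hits = []
  end

  # Called once per matching document by the (imaginary) matcher.
  def collect(doc_id, score)
    category = @stored_fields.fetch(doc_id).fetch(:category)
    @hits << Hit.new(doc_id, score, { :category => category })
  end

  # Alphabetical by category, then best score first within a category.
  def sorted_hits
    @hits.sort_by { |hit| [hit.scratch[:category], -hit.score] }
  end
end

stored = {
  0 => { :category => "toys" },
  1 => { :category => "books" },
  2 => { :category => "books" }
}
collector = CategoryCollector.new(stored)
collector.collect(0, 0.9)
collector.collect(1, 0.2)
collector.collect(2, 0.7)
collector.sorted_hits.each { |hit| puts "#{hit.scratch[:category]} #{hit.doc_id} #{hit.score}" }
```

The same scratch slot could just as easily hold the name of the field that matched, which is the sense in which the two requests are related.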

While I can see the argument that this is traditionally not the way
that TF/IDF systems work, it's this potential for search/database
hybridization that makes Lucy so attractive to me.

Nathan Kurz
[hidden email]

Re: [lucy-user] Feature question about Lucy vs. Ferret

Andrew S. Townley
In reply to this post by Marvin Humphrey
Hi Marvin,

On 25 Feb 2011, at 4:12 PM, Marvin Humphrey wrote:

> On Wed, Feb 23, 2011 at 05:14:49PM +0000, Andrew S. Townley wrote:
>> Well, actually, I want it for more than that.  For my particular needs, I
>> need to get the field name where the match occurred in the document, and
>> then I'd ideally like to have the start offset into that field and the
>> length of the match.
>
>> The Ferret::Search::Hit gives me the document number and the score, but
>> that's it.  In whatever list format the results are actually in, I'd also
>> like to have the information I mentioned.  If you weren't storing the offset
>> information, then it would make sense for it not to be available, but if you
>> were, then I'd expect to have the whole thing right there.  I can't see how
>> there'd be a performance issue in providing this information.
>
> You have to generate that information after the fact, by post-processing the
> Hits that come back.  Lucy, Lucene, and Ferret all have the same behavior in
> this regard.
>
> Matching and scoring are highly abstracted for speed.  The matching engine
> does not scan raw document content, a la an RDBMS full table scan -- instead,
> it iterates over heavily optimized data structures devoid of introspection
> overhead.  At the end of a search, you will only have documents and scores --
> not sophisticated metadata about what part of the subquery matched and what
> parts didn't and how much each matching part contributed to the score.
> Keeping track of such metadata during the matching phase would be
> prohibitively expensive.

I can understand the need to abstract a lot of things for speed.  I'm no search expert as I've said before, but I don't understand why at the very least the field information (e.g. name) can't be encoded in this data structure in such a way that you can determine this information at match time.  Highlighting and offsets are a different matter, and I never thought it was doing a full-text scan or a table scan like an RDBMS.  If I wanted that, I'd just use regex searches (which I do in some cases for small datasets).

Obviously, I'm missing something here, but to me I don't see why it matters to keep track of fields at all if you don't have the information about which field matched an "all fields" or "multiple field" search query to hand when you get the match information back in terms of term and field.  Obviously, actually finding the offsets is a much more expensive operation, and I'm ok with having to do that after the search is completed--even if I have to do my own matching without API support for highlighting.  However, this is only possible if I know what term and what field and don't have to effectively perform the search again on the document (which is what Ferret seems to require).

> In Lucy, our highlighting capabilities are powered by the Highlight_Spans()
> method, which is invoked on a derivative of the Query object:
>
>    /** Return an array of Span objects, indicating where in the given
>     * field the text that matches the parent query occurs.  In this case,
>     * the span's offset and length are measured in Unicode code points.
>     * The default implementation returns an empty array.    
>     *  
>     * @param searcher A Searcher.
>     * @param doc_vec A DocVector.
>     * @param field The name of the field.
>     */  
>    public incremented VArray*
>    Highlight_Spans(Compiler *self, Searcher *searcher,
>                    DocVector *doc_vec, const CharBuf *field);
>
> Perhaps that might be of use for you.

This API has the same problem as Ferret--if I don't know which field matched, then I've got to try all the fields (maybe > 20 in some cases) on the document.  If you need this information to display to users, then it doesn't matter how fast the search is, because you're going to slow down the whole interaction by checking anywhere from 2 to x fields for each match in the results chunk you're processing.

The advantage of the fulltext search capabilities exposed via a query language like FQL or whatever Lucy uses is that you can effectively defer all of the introspection/heavy lifting of the searching and results matching to the underlying fulltext system (or, at least that's the way I see it).  If you then don't have enough information available to fully describe the matches in an efficient way, then the only other option you have is to both pre-process the query to see if any explicit fields are present, and then, if not, try all of the fields indexed to see if they happen to match (effectively performing the search again over the result set).
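
The pre-processing step described above might look something like the following Ruby sketch. It assumes a simple "field:term" prefix syntax and uses a regular expression instead of a real query parser, so it will misread quoted phrases or anything more elaborate; `candidate_fields` is a hypothetical helper, not a Ferret or Lucy function.

```ruby
# Hypothetical helper: pick the fields worth checking for a given query.
# Assumes explicit fields are written as "field:term"; a real query
# parser would be needed for anything beyond this simple case.
def candidate_fields(query, all_fields)
  explicit = query.scan(/\b(\w+):/).flatten.map(&:to_sym).uniq
  known = explicit & all_fields
  # No usable explicit field: fall back to checking every indexed field.
  known.empty? ? all_fields : known
end

fields = [:title, :description]
p candidate_fields("title:flinstones", fields)
p candidate_fields("flinstones", fields)
```

With an explicit field the per-hit work drops to the named fields; without one, every field still has to be tried for every hit, which is the cost being objected to.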

Maybe I'm using it wrong, or maybe I just don't get it, but these are the kinds of things I need to do.

[snip]

>> I tried to dig through the lucy SVN repository via the web UI, but I
>> couldn't really figure out what's there.  The code generator framework
>> you're using is something I haven't seen before, but at least it explains
>> why I couldn't find the Ruby bindings! :)
>
> There's a short high-level introduction to the Lucy codebase here:
>
>  http://svn.apache.org/repos/asf/incubator/lucy/trunk/core/Lucy/Docs/DevGuide.cfh

You weren't kidding about the "short" part! :)  Still, thanks for the pointer.  I'd seen it earlier.

>> Presently tinkering with the Ferret internals since it seems like  there
>> ought to be a way to expose what I want (it's in the explain output)
>
> That might work.  Most people use the Explanation API for tuning and
> troubleshooting, though; it might prove a little expensive or unwieldy for
> what you're doing.

After spending about 12-14 hours trying to get my head around the code and the way the searching worked, I gave up.  There wasn't a good, consistent API abstraction that allowed you to access the same information from the internals of the search code that was leveraged by the explain code--and the fact that explain is overloaded for each subclass, but not in a universal way, would've required more surgery than I was prepared to do at the C level given the time I have.  Jens took an alternative approach and implemented some changes at the Ruby level.  These helped, but they still required some tweaking to be used by both the Searcher API and the Index API since, again, some of the information available for the index isn't available for the Searcher API.

For now, thanks to Jens' patch, I have the capability to do what I need to do with Ferret--even if it isn't as fast as it could be.  However, unless the same type of information is exposed at an API level in Lucy, the same kinds of workarounds would be required to use Lucy instead of Ferret for my application.

At least my Wed turned out not to be a total wasted day after all! :)

Cheers for all the information,

ast
--
Andrew S. Townley <[hidden email]>
http://atownley.org


Re: [lucy-user] Feature question about Lucy vs. Ferret

Andrew S. Townley
In reply to this post by Nathan Kurz

On 25 Feb 2011, at 8:03 PM, Nathan Kurz wrote:

> On Fri, Feb 25, 2011 at 8:12 AM, Marvin Humphrey <[hidden email]> wrote:
>>  At the end of a search, you will only have documents and scores --
>> not sophisticated metadata about what part of the subquery matched and what
>> parts didn't and how much each matching part contributed to the score.
>> Keeping track of such metadata during the matching phase would be
>> prohibitively expensive.
>
> It's only prohibitive if you don't need that data.  If you actually need
> it (as Andrew seems to), and are going to do it in post-processing
> anyway, it's just the cost of doing business.

Exactly.  Fulltext is only one of several indexing/searching mechanisms I have, and 99% of the time, the only reason I'm going to use the fulltext index is to display the results to humans.  Thanks to Web search engines, users have certain expectations of being able to see the highlighted matches, so that's the standard use case I have.

I'm happy to have the option of setting a :your_performance_will_suck_do_you_really_want_to_do_this flag to :yes_damnit in order to get the results I want, but I'd prefer to have the API for dealing with the results as straightforward as possible -- oh, yeah, and I'll hardly ever be storing the information being queried in the index itself as I've already got a place for it to live, and it needs to be available to the other indexing methods too.

> My kick has been about making it easy to swap in non-TF/IDF scorers.
> I think part of doing so will be adding greater room for scratch data
> to Hits returned. My canonical example is that I want it to be
> possible to do alphabetical sorting of Hits by a category field.   At
> some point you need a collector that can see field values, which if
> you squint right is just a special case of what Andrew wants.
>
> While I can see the argument that this is traditionally not the way
> that TF/IDF systems work, it's this potential for search/database
> hybridization that makes Lucy so attractive to me.


Not knowing that much about TF/IDF systems, all I can agree with is the part about the fulltext/other indexing hybrid approach being an essential part of information management in the future.

People thought things like Datablades/ORDBMS didn't make sense in RDBMS systems either until vendors proved that you could essentially have your cake and eat it too from a performance and flexibility perspective.  I see this as just part of the evolution of search technology on the basis of realizations that the closed-world view of systems is a vestige of the past.  Given Lucy's potential and its approach, it seems sensible to try and design for the future here too.

Again, thanks for all the discussion and information.

Cheers,

ast
--
Andrew S. Townley <[hidden email]>
http://atownley.org


Re: [lucy-user] Feature question about Lucy vs. Ferret

Peter Karman
In reply to this post by Andrew S. Townley
Andrew S. Townley wrote on 2/26/11 7:17 AM:

>> You have to generate that information after the fact, by post-processing
>> the Hits that come back.  Lucy, Lucene, and Ferret all have the same
>> behavior in this regard.
>>
>> Matching and scoring are highly abstracted for speed.  The matching engine
>> does not scan raw document content, a la an RDBMS full table scan --
>> instead, it iterates over heavily optimized data structures devoid of
>> introspection overhead.  At the end of a search, you will only have
>> documents and scores -- not sophisticated metadata about what part of the
>> subquery matched and what parts didn't and how much each matching part
>> contributed to the score. Keeping track of such metadata during the
>> matching phase would be prohibitively expensive.
>
> I can understand the need to abstract a lot of things for speed.  I'm no
> search expert as I've said before, but I don't understand why at the very
> least the field information (e.g. name) can't be encoded in this data
> structure in such a way that you can determine this information at match
> time.  Highlighting and offsets are a different matter, and I never thought
> it was doing a full-text scan or a table scan like an RDBMS.  If I wanted
> that, I'd just use regex searches (which I do in some cases for small
> datasets).
>
> Obviously, I'm missing something here, but to me I don't see why it matters
> to keep track of fields at all if you don't have the information about which
> field matched an "all fields" or "multiple field" search query to hand when
> you get the match information back in terms of term and field.  Obviously,
> actually finding the offsets is a much more expensive operation, and I'm ok
> with having to do that after the search is completed--even if I have to do my
> own matching without API support for highlighting.  However, this is only
> possible if I know what term and what field and don't have to effectively
> perform the search again on the document (which is what Ferret seems to
> require).
>

I miss this feature too (native interrogation of HitDoc objects to discover
which field(s) generated the hit).

Marvin, where would be the appropriate place to extend Lucy in this way? I'm
guessing Search::Searcher and Search::MatchDoc?


--
Peter Karman  .  http://peknet.com/  .  [hidden email]