Quantcast

Automata and Transducer on Lucene 6

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Automata and Transducer on Lucene 6

Juarez Sampaio
Hello everyone,

Recently I've watched a few videos and read a few blog posts on Lucene's
Automata and how one can speed up things by 100x when properly using
Automata and Transducers. "I can definitely use a boost like this", right?
The problem is that this material I've read was writen to Lucene 4 and it
seems the API has changes a lot since then.

To beggin with, *I can't find transducers* anywhere and I'm missing a few
Automata construction capabilities such as union (it used to be located on
the class BasicOperations). I think what I am really missing is an intro to
Automata classes on Lucene 6. *Can someone point me to a link introducing
Automata (and possibly Transducers) on Lucene 6?*

So far I've been learning by navigating java docs with ctrl + F, which
hasn't been productive: It took me a while to figure out I had to use a
AutomatonRun to check that the automaton accepts a given char sequence. And
before that I had tried to manually start from node 0 and manually traverse
the Automata and check for a final state at the end of a String. I'd really
appreciate some guidance here.

I'd like to read something written by who designed these classes. What
motivated, usage examples, what it is good for and what it is not good for.
Maybe a history of the development of Automata on Lucene. Where they built
for in-memory usage only? Is there a good way to go about serializing it?
If possible, I'd like some explanation on the mad pointers structure used
to efficiently implement automata. From the videos I watched I was
expecting a byte[] implementation, but looking at the code I see a couple
of int[] used to represent states and transitions. What happened to the
byte[] implementation of Lucene 4?
--
Juarez
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Dawid Weiss-2
> I'd like to read something written by who designed these classes. What
> motivated, usage examples, what it is good for and what it is not good for.
> Maybe a history of the development of Automata on Lucene

Are you looking for a historical book on Lucene development or are you
looking to solve a particular problem? Perhaps if you explain the problem
then it'll be easier to come up with answers (or hints as to where you may
begin looking for them).

You stated so many questions that it's hard to address them without
spending a few hours just typing...

But just so that it's evident I tried:

- the FST class implements a transducer; check out examples (tests) as they're
the best documentation on how to use these classes; they do use byte[]
underneath;

- Automaton etc. are completely independent and used for slightly different
purposes (it's brics library ported to Lucene). Again -- tests will be
helpful to understand how they work. These classes use object
representation of states and nodes
until you "compile" them into a RunAutomaton which is essentially an
immutable deterministic
automaton compiled into byte[]. It has benefits, but also drawbacks
(you can't change it). Determinization
of arbitrary automata can be costly or prohibitive.

Dawid


On Tue, Apr 18, 2017 at 3:58 PM, Juarez Sampaio
<[hidden email]> wrote:

> Hello everyone,
>
> Recently I've watched a few videos and read a few blog posts on Lucene's
> Automata and how one can speed up things by 100x when properly using
> Automata and Transducers. "I can definitely use a boost like this", right?
> The problem is that this material I've read was writen to Lucene 4 and it
> seems the API has changes a lot since then.
>
> To beggin with, *I can't find transducers* anywhere and I'm missing a few
> Automata construction capabilities such as union (it used to be located on
> the class BasicOperations). I think what I am really missing is an intro to
> Automata classes on Lucene 6. *Can someone point me to a link introducing
> Automata (and possibly Transducers) on Lucene 6?*
>
> So far I've been learning by navigating java docs with ctrl + F, which
> hasn't been productive: It took me a while to figure out I had to use a
> AutomatonRun to check that the automaton accepts a given char sequence. And
> before that I had tried to manually start from node 0 and manually traverse
> the Automata and check for a final state at the end of a String. I'd really
> appreciate some guidance here.
>
> I'd like to read something written by who designed these classes. What
> motivated, usage examples, what it is good for and what it is not good for.
> Maybe a history of the development of Automata on Lucene. Where they built
> for in-memory usage only? Is there a good way to go about serializing it?
> If possible, I'd like some explanation on the mad pointers structure used
> to efficiently implement automata. From the videos I watched I was
> expecting a byte[] implementation, but looking at the code I see a couple
> of int[] used to represent states and transitions. What happened to the
> byte[] implementation of Lucene 4?
> --
> Juarez

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Michael McCandless-2
On Tue, Apr 18, 2017 at 2:33 PM, Dawid Weiss <[hidden email]> wrote:

- Automaton etc. are completely independent and used for slightly different
> purposes (it's brics library ported to Lucene). Again -- tests will be
> helpful to understand how they work. These classes use object
> representation of states and nodes
>

One small correction: we moved away from objects to more compact int[] a
while ago for our automata implementation.

+1 to use the tests to learn how things work; I don't know of any guide /
high level documentation for these low level classes, sorry.  Maybe write
it up yourself and set it free somewhere online ;)

The union method is now in oal.util.automata.Operations

Mike McCandless

http://blog.mikemccandless.com
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Robert Muir
On Tue, Apr 18, 2017 at 5:16 PM, Michael McCandless
<[hidden email]> wrote:

>
> +1 to use the tests to learn how things work; I don't know of any guide /
> high level documentation for these low level classes, sorry.  Maybe write
> it up yourself and set it free somewhere online ;)

For the FST case there are some simple examples in the package level
docs, but i think the layout of javadocs html does not make this
obvious and they can be missed. Maybe they help to get started with /
can be fixed if they are out of date.

See http://lucene.apache.org/core/6_5_0/core/org/apache/lucene/util/fst/package-summary.html
(scroll to the bottom).

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Dawid Weiss-2
In reply to this post by Michael McCandless-2
> One small correction: we moved away from objects to more compact int[] a
> while ago for our automata implementation.

Right, forgot about that. There are still some trappy object-heavy
utilities like this one:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L127-L129

and the API is using objects (Transition) for an 'inout' mutable
holder type which may be confusing at first (but is unavoidable in
Java).

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Juarez Sampaio
Thanks for the reply everyone!

I've spent some time looking at tests and source code and I've learned a
lot about Lucene's Automata and FST. Way more productive than scanning
javadocs. Thanks for the hint.

*> Are you looking for a historical book on Lucene development or are you*
*> looking to solve a particular problem?*

Dawid, the thing is that I am not even sure that Automata are the perfect
fit for my project and I thought some literature on it would help me decide
whether to use it or not. Furthermore, I'll probably have to modify
Lucene's Automata implementation for my project, and I thought that reading
about design choices would allow me to better understand how to improve it.

Anyway, I am done with basic usage and whenever I find time to write about
what I've learned I'll make sure to share the link here.

I'll list some features I'll need to add to Lucene 6 base implementation of
Automata and FST. I'd like to hear your thoughts on it, specially if you
find any of them particularly a waste of time.

1. I need to be able to add data to an FST (add new strings and update the
mapped value in the case of FST). I thought about a multi-layer strategy
where old data has been compressed to an FST format whereas new data is
added to a delta partition (probably a BST or a simple list). A background
process merges delta into the closed FST. The merge process consists of
materializing all strings encoded into the FST, merge this list with
strings on delta, and then construct a new FST. Probably the merge can be
done during the process of enumerating, since the enumeration happens in
lexicographical order.

2. I need to have multiple FST loaded and to be able to search them on
demand. I thought about modifying the implementation to access data on a
memory mapped file instead of a raw in memory byte[] or int[]' s.

Juarez

On Wed, Apr 19, 2017 at 3:39 AM Dawid Weiss <[hidden email]> wrote:

> > One small correction: we moved away from objects to more compact int[] a
> > while ago for our automata implementation.
>
> Right, forgot about that. There are still some trappy object-heavy
> utilities like this one:
>
>
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L127-L129
>
> and the API is using objects (Transition) for an 'inout' mutable
> holder type which may be confusing at first (but is unavoidable in
> Java).
>
> Dawid
>
--
Juarez
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Dawid Weiss-2
> Dawid, the thing is that I am not even sure that Automata are the perfect
> fit for my project and I thought some literature on it would help me decide
> whether to use it or not.

Still looks to me like you're approaching the problem from the wrong
side or don't
want to share the core problem which you're trying to solve. All
automata inside Lucene
serve a purpose somewhere -- they're not part of Lucene's search API
or even the core
goal of the library; they merely serve as an implementation detail for
some higher-level
use cases (say, efficient storage of a large number of term-value
pairs). It is this kind of
"high-level goal" I asked about. Your answer only adds to the mystery:

> Furthermore, I'll probably have to modify Lucene's
> Automata implementation for my project, and I thought that reading about
> design choices would allow me to better understand how to improve it.

Why? Is there something in Lucene's core search API that is not
working for you? Or maybe you need
an automaton library, not Lucene? In that case there are a few other,
more efficient alternatives out
there (in C/C++).

> 1. I need to be able to add data to an FST (add new strings and update the
> mapped value in the case of FST). I thought about a multi-layer strategy
> where old data has been compressed to an FST format whereas new data is
> added to a delta partition (probably a BST or a simple list).

Sure, it'll work. You could also build a dynamically merged view of
several FSTs (if their keys are distinct, or provide a value-merging
function). It'd be a nice addition to Lucene if you can share it.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Automata and Transducer on Lucene 6

Chris Hostetter-3

: pairs). It is this kind of
: "high-level goal" I asked about. Your answer only adds to the mystery:

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341



-Hoss
http://www.lucidworks.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Loading...