questions of FST

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

questions of FST

guohuawu227
I am a software developer from China. Recently I am reading the source of lucene of 6.6.0. I can not understand some code of FST. List them below.

1 . There is a field called 'inCounts' in the class of FST. But in my opinion every node of FST has only one arc pointing to it. If a node is not shared, it is certain it has only one arc pointing to it since it is like a tree. If a node is shared,it is returned directly from NodeHash so the number of it's coming in arc is not updated. So I just can't understand this attribute though it is used is 'pach' method. 

2. Is about the method of 'pach' of class FST. This line 'long delta = newNodeAddress.get((int) arc.target) + addressError - writer.getPosition() - 2;' . I know it is comparing which one is smaller but I just can't understand why 2 is minused. 

3. If a FST is packed,then there are 3 ways to save the target of an arc, first is the difference of the target and this arc ,second is the ordinal number from a hashmap if the target is among those having top N incoming arcs, third is absolute position of the node. But in the method 'readNextRealArc' , I can not understand the process of the absolute position while reading. Below is the code.

if (packed) {
final long pos = in.getPosition();
final long code = in.readVLong();
if (arc.flag(BIT_TARGET_DELTA)) {
// record the difference
} else if (code < nodeRefToAddress.size()) {
// record the ordinal number from a hashmap
} else {// Absolute position
arc.target = code;//I can not understa this.In opinion, nodeRefToAddress.size() should be minused because it is summed into the absolute value in 'pack' method.
}
} else {

-----------------------
Please pardon me if I use some wrong words ,my English is not good.Thanks for reading the email. Looking forward to the reply.
--------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: questions of FST

Michael McCandless-2
Hello!

A few quick answers:

In theory, an FST node can have inCount > 1 (when it is in the "suffix" part of the FST), but it sounds like maybe you found an exciting bug if indeed FST Builder fails to ever increment it.  E.g. try building some FSTs here: http://examples.mikemccandless.com/fst.py the default FST there already shows 2 nodes with inCount = 2.

Note that the pack method has been deleted in newer Lucene versions, because it was sizable complexity, not very well tested and not so big gains in practice.

I'm not sure about question 2)

Question 3) does sound like a bug, because as the code now stands it means we cannot reference an address below nodeRefToAddress.size() which seems like it might be problematic.

These might be real bugs!  I would strongly suggest ensuring pack() really is working for your use case before trusting it!

On Thu, Jan 7, 2021 at 10:14 PM guohuawu227 <[hidden email]> wrote:
I am a software developer from China. Recently I am reading the source of lucene of 6.6.0. I can not understand some code of FST. List them below.

1 . There is a field called 'inCounts' in the class of FST. But in my opinion every node of FST has only one arc pointing to it. If a node is not shared, it is certain it has only one arc pointing to it since it is like a tree. If a node is shared,it is returned directly from NodeHash so the number of it's coming in arc is not updated. So I just can't understand this attribute though it is used is 'pach' method. 

2. Is about the method of 'pach' of class FST. This line 'long delta = newNodeAddress.get((int) arc.target) + addressError - writer.getPosition() - 2;' . I know it is comparing which one is smaller but I just can't understand why 2 is minused. 

3. If a FST is packed,then there are 3 ways to save the target of an arc, first is the difference of the target and this arc ,second is the ordinal number from a hashmap if the target is among those having top N incoming arcs, third is absolute position of the node. But in the method 'readNextRealArc' , I can not understand the process of the absolute position while reading. Below is the code.

if (packed) {
final long pos = in.getPosition();
final long code = in.readVLong();
if (arc.flag(BIT_TARGET_DELTA)) {
// record the difference
} else if (code < nodeRefToAddress.size()) {
// record the ordinal number from a hashmap
} else {// Absolute position
arc.target = code;//I can not understa this.In opinion, nodeRefToAddress.size() should be minused because it is summed into the absolute value in 'pack' method.
}
} else {

-----------------------
Please pardon me if I use some wrong words ,my English is not good.Thanks for reading the email. Looking forward to the reply.
--------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email]