Document links

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Document links

mark harwood
I've been looking at Graph Databases recently (neo4j, OrientDb, InfiniteGraph)
as a faster alternative to relational stores. I notice they either embed Lucene
for indexing node properties or (in the case of OrientDB) are talking about
doing this.

I think their fundamental performance advantage over relational stores is that
they don't have to de-reference foreign keys in a b-tree index to get from a
source node to a target node. Instead they use internally-generated IDs to act
like pointers with more-or-less direct references between nodes/vertexes.  As a
result they can follow links very quickly. This got me thinking could Lucene
adopt the idea of creating links between documents that were equally fast using
Lucene doc ids?

Maybe the user API would look something like this...

    indexWriter.addLink(fromDocId, toDocId);
    DocIdSet reader.getInboundLinks(docId);
    DocIdSet reader.getOutboundLinks(docId);


Internally a new index file structure would be needed to record link info. Any
recorded links that connect documents from different segments would need careful
adjustment of referenced link IDs when segments merge and Lucene doc IDs are
shuffled.

As well as handling typical graphs (social networks, web data) this could
potentially be used to support tagging operations where apps could create "tag"
documents and then link them to existing documents that are being tagged without
having to update the target doc. There are probably a ton of applications for
this stuff.

I see the Graph DBs busy recreating transactional support, indexes, segment
merging etc and it seems to me that Lucene has a pretty good head start with
this stuff.
Anyone else think this might be an area worth exploring?

Cheers
Mark




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Paul Elschot
When the (primary) key values are provided by the user,
one could use additional small documents to only store/index
these relations whenever they change.

Wouldn't that be sufficient?

Regards,
Paul Elschot



Op dinsdag 21 september 2010 00:35:02 schreef mark harwood:

> I've been looking at Graph Databases recently (neo4j, OrientDb, InfiniteGraph)
> as a faster alternative to relational stores. I notice they either embed Lucene
> for indexing node properties or (in the case of OrientDB) are talking about
> doing this.
>
> I think their fundamental performance advantage over relational stores is that
> they don't have to de-reference foreign keys in a b-tree index to get from a
> source node to a target node. Instead they use internally-generated IDs to act
> like pointers with more-or-less direct references between nodes/vertexes.  As a
> result they can follow links very quickly. This got me thinking could Lucene
> adopt the idea of creating links between documents that were equally fast using
> Lucene doc ids?
>
> Maybe the user API would look something like this...
>
>     indexWriter.addLink(fromDocId, toDocId);
>     DocIdSet reader.getInboundLinks(docId);
>     DocIdSet reader.getOutboundLinks(docId);
>
>
> Internally a new index file structure would be needed to record link info. Any
> recorded links that connect documents from different segments would need careful
> adjustment of referenced link IDs when segments merge and Lucene doc IDs are
> shuffled.
>
> As well as handling typical graphs (social networks, web data) this could
> potentially be used to support tagging operations where apps could create "tag"
> documents and then link them to existing documents that are being tagged without
> having to update the target doc. There are probably a ton of applications for
> this stuff.
>
> I see the Graph DBs busy recreating transactional support, indexes, segment
> merging etc and it seems to me that Lucene has a pretty good head start with
> this stuff.
> Anyone else think this might be an area worth exploring?
>
> Cheers
> Mark
>
>
>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
>>Wouldn't that be sufficient?

Not for some apps. I tried playing the "Kevin Bacon" game using a Lucene index
of IMDB data using actorID and movieID keys.
The difference between that and Neo4j on the same data and query  is night and
day. The graph databases are really onto something when resolving a relationship
doesn't first require an index to look up endpoints.





----- Original Message ----
From: Paul Elschot <[hidden email]>
To: [hidden email]
Sent: Tue, 21 September, 2010 17:25:31
Subject: Re: Document links

When the (primary) key values are provided by the user,
one could use additional small documents to only store/index
these relations whenever they change.

Wouldn't that be sufficient?

Regards,
Paul Elschot



Op dinsdag 21 september 2010 00:35:02 schreef mark harwood:
> I've been looking at Graph Databases recently (neo4j, OrientDb, InfiniteGraph)

> as a faster alternative to relational stores. I notice they either embed Lucene
>
> for indexing node properties or (in the case of OrientDB) are talking about
> doing this.
>
> I think their fundamental performance advantage over relational stores is that

> they don't have to de-reference foreign keys in a b-tree index to get from a
> source node to a target node. Instead they use internally-generated IDs to act

> like pointers with more-or-less direct references between nodes/vertexes.  As a
>
> result they can follow links very quickly. This got me thinking could Lucene
> adopt the idea of creating links between documents that were equally fast using
>
> Lucene doc ids?
>
> Maybe the user API would look something like this...
>
>     indexWriter.addLink(fromDocId, toDocId);
>     DocIdSet reader.getInboundLinks(docId);
>     DocIdSet reader.getOutboundLinks(docId);
>
>
> Internally a new index file structure would be needed to record link info. Any

> recorded links that connect documents from different segments would need
>careful
>
> adjustment of referenced link IDs when segments merge and Lucene doc IDs are
> shuffled.
>
> As well as handling typical graphs (social networks, web data) this could
> potentially be used to support tagging operations where apps could create "tag"
>
> documents and then link them to existing documents that are being tagged
>without
>
> having to update the target doc. There are probably a ton of applications for
> this stuff.
>
> I see the Graph DBs busy recreating transactional support, indexes, segment
> merging etc and it seems to me that Lucene has a pretty good head start with
> this stuff.
> Anyone else think this might be an area worth exploring?
>
> Cheers
> Mark
>
>
>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Paul Elschot
Op dinsdag 21 september 2010 18:30:08 schreef mark harwood:
> >>Wouldn't that be sufficient?
>
> Not for some apps. I tried playing the "Kevin Bacon" game using a Lucene index
> of IMDB data using actorID and movieID keys.
> The difference between that and Neo4j on the same data and query  is night and
> day. The graph databases are really onto something when resolving a relationship
> doesn't first require an index to look up endpoints.

When the key values are given by the user this would boil down to adding
the primary and foreign key to Lucene, but that does not appear to be the idea.

It should be possible to randomly add and delete such relationships after
indexWriter.addDocument(), is that the idea?

Adding such relationships by docId would need the addition of
a separate (from the segments) index structure (probably some B-tree) that
would have segmentId-segmentDocId as (part of) the keys, and also as (part of) the values.

Would each link also have an attribute (think payload)?
Would such relationships be named (sth like foreign key field names)?


Regards,
Paul Elschot


>
>
> ----- Original Message ----
> From: Paul Elschot <[hidden email]>
> To: [hidden email]
> Sent: Tue, 21 September, 2010 17:25:31
> Subject: Re: Document links
>
> When the (primary) key values are provided by the user,
> one could use additional small documents to only store/index
> these relations whenever they change.
>
> Wouldn't that be sufficient?
>
> Regards,
> Paul Elschot
>
>
>
> Op dinsdag 21 september 2010 00:35:02 schreef mark harwood:
> > I've been looking at Graph Databases recently (neo4j, OrientDb, InfiniteGraph)
>
> > as a faster alternative to relational stores. I notice they either embed Lucene
> >
> > for indexing node properties or (in the case of OrientDB) are talking about
> > doing this.
> >
> > I think their fundamental performance advantage over relational stores is that
>
> > they don't have to de-reference foreign keys in a b-tree index to get from a
> > source node to a target node. Instead they use internally-generated IDs to act
>
> > like pointers with more-or-less direct references between nodes/vertexes.  As a
> >
> > result they can follow links very quickly. This got me thinking could Lucene
> > adopt the idea of creating links between documents that were equally fast using
> >
> > Lucene doc ids?
> >
> > Maybe the user API would look something like this...
> >
> >     indexWriter.addLink(fromDocId, toDocId);
> >     DocIdSet reader.getInboundLinks(docId);
> >     DocIdSet reader.getOutboundLinks(docId);
> >
> >
> > Internally a new index file structure would be needed to record link info. Any
>
> > recorded links that connect documents from different segments would need
> >careful
> >
> > adjustment of referenced link IDs when segments merge and Lucene doc IDs are
> > shuffled.
> >
> > As well as handling typical graphs (social networks, web data) this could
> > potentially be used to support tagging operations where apps could create "tag"
> >
> > documents and then link them to existing documents that are being tagged
> >without
> >
> > having to update the target doc. There are probably a ton of applications for
> > this stuff.
> >
> > I see the Graph DBs busy recreating transactional support, indexes, segment
> > merging etc and it seems to me that Lucene has a pretty good head start with
> > this stuff.
> > Anyone else think this might be an area worth exploring?
> >
> > Cheers
> > Mark
> >
> >
> >      
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood

It should be possible to randomly add and delete such relationships after
indexWriter.addDocument(), is that the idea?


Yes. A "like" action may, for example allow me to tag an existing document by connecting 2 documents - my personal "like" document and a document with content of interest.
   doc 1   = [user:mark       tag:like]
   doc 56 = [title:Lucene    body:Lucene is a search library...]

I then call:
   indexWriter.addLink(1,56)

If this was my first "Like" then I may need to contemplate using a variation of the above API that allows a yet-to-be-committed "Document" object in place of the doc ids.


Adding such relationships by docId would need the addition of
a separate (from the segments) index structure


Yes, I need to think about the detail of file structures next. For now I'm sticking with thinking about user API and functionality and assuming we can maintain cross-segment docid references that get updated somehow at merge time.



Would each link also have an attribute (think payload)?

I was thinking if attributes are needed (e.g. a star rating on my document "like" example) then this could be catered for with a document e.g. rather than linking the single doc [user:mark tag:like] to all my liked docs I could create specific doc instances of [user:mark rating:5 tag:like] and linking via that. 


Would such relationships be named (sth like foreign key field names)?

For now I was thinking of storing simple docid->docid links.

Once we have these links we could do some funky things:
{pseudo code:}
       //My fave docs from last week
       int myLikesDocId=searchForLuceneDocWithUserNameAndTag("mark", "like");
       DocIdSet myLikedDocs =indexReader.getOutboundLinks(myLikesDocId)
       searcher.search(lastWeekRangeQuery, new Filter(myLikedDocs));

      //Other users who share my interests
      DocIdSet usersWhoLikeWhatILike = indexReader.getInboundLinks(myLikedDocs);


Cheers
Mark

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Ryan McKinley
In reply to this post by mark harwood
> Maybe the user API would look something like this...
>
>    indexWriter.addLink(fromDocId, toDocId);
>    DocIdSet reader.getInboundLinks(docId);
>    DocIdSet reader.getOutboundLinks(docId);
>

To support 'typed' links, it would be nice to have:

    indexWriter.addLink("link", fromDocId, toDocId);
    DocIdSet reader.getInboundLinks(docId, "link");
    DocIdSet reader.getOutboundLinks(docId, "link");

In my app, I build this graph that has many types of links that need
to be handled differently.

> Anyone else think this might be an area worth exploring?
>

for sure!

ryan

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
Some inital thoughts on the challenges in maintaining docid->docid links:

https://spreadsheets.google.com/ccc?key=0AsKVSn5SGg_wdHhMUW9ya0xxUFI3VXBHZGZHVUo4RkE&hl=en&authkey=CLOmwrgL#gid=0




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
In reply to this post by mark harwood
This slideshow has a first-cut on the Lucene file format extensions required to
support fast linking between documents:

http://www.slideshare.net/MarkHarwood/linking-lucene-documents


Interested in any of your thoughts.

Cheers,
Mark





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Grant Ingersoll-2
While not exactly equivalent, it reminds me of our earlier discussion around "layered segments" for dealing with field updates [1], [2], albeit this is a bit more generic since one could not only use the links for relating documents, but one could use "special" links underneath the covers in Lucene to maintain/mark which fields have been updated and then traverse to them.

[1] http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
[2] http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e

On Sep 24, 2010, at 10:36 AM, mark harwood wrote:

> This slideshow has a first-cut on the Lucene file format extensions required to
> support fast linking between documents:
>
> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>
>
> Interested in any of your thoughts.
>
> Cheers,
> Mark
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
>>While not exactly equivalent, it reminds me of our earlier discussion around
>>"layered segments" for dealing with field updates

Right. Fast discovery of document relations is a foundation on which lots of
things like this can build. Relations can be given types to support a number of
different use cases.



----- Original Message ----
From: Grant Ingersoll <[hidden email]>
To: [hidden email]
Sent: Fri, 24 September, 2010 16:26:27
Subject: Re: Document links

While not exactly equivalent, it reminds me of our earlier discussion around
"layered segments" for dealing with field updates [1], [2], albeit this is a bit
more generic since one could not only use the links for relating documents, but
one could use "special" links underneath the covers in Lucene to maintain/mark
which fields have been updated and then traverse to them.

[1]
http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384

[2]
http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e


On Sep 24, 2010, at 10:36 AM, mark harwood wrote:

> This slideshow has a first-cut on the Lucene file format extensions required to
>
> support fast linking between documents:
>
> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>
>
> Interested in any of your thoughts.
>
> Cheers,
> Mark
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Paul Elschot
Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
> >>While not exactly equivalent, it reminds me of our earlier discussion around
> >>"layered segments" for dealing with field updates
>
> Right. Fast discovery of document relations is a foundation on which lots of
> things like this can build. Relations can be given types to support a number of
> different use cases.

How about using this (bsd licenced) tree as a starting point:
http://bplusdotnet.sourceforge.net/
It has various keys: ao. byte array, String and long.

A fixed size byte array as key seems to be just fine: two bytes for a field number,
four for the segment number and four for the in-segment document id.
The separate segment number would allow to minimize the updates
in the tree during merges. One could also use the normal doc id directly.

The value could then be a similar to the key, but without
the field number, and with an indication of the direction of the link.
Or perhaps the direction of the link should be added to the key.
A link would be present twice, once for each direction.
Also both directions could have their own payloads.

It could be put in its own file as a separate 'segment', or maybe
each segment could allow for allocation of a part of this tree.

I like this somehow, in case it is done right one might never
need a relational database again. Well, almost...

Regards,
Paul Elschot


>
>
>
> ----- Original Message ----
> From: Grant Ingersoll <[hidden email]>
> To: [hidden email]
> Sent: Fri, 24 September, 2010 16:26:27
> Subject: Re: Document links
>
> While not exactly equivalent, it reminds me of our earlier discussion around
> "layered segments" for dealing with field updates [1], [2], albeit this is a bit
> more generic since one could not only use the links for relating documents, but
> one could use "special" links underneath the covers in Lucene to maintain/mark
> which fields have been updated and then traverse to them.
>
> [1]
> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>
> [2]
> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>
>
> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>
> > This slideshow has a first-cut on the Lucene file format extensions required to
> >
> > support fast linking between documents:
> >
> > http://www.slideshare.net/MarkHarwood/linking-lucene-documents
> >
> >
> > Interested in any of your thoughts.
> >
> > Cheers,
> > Mark
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>      
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
My starting point in the solution I propose was to eliminate linking via any type of key. Key lookups mean indexes and indexes mean disk seeks. Graph traversals have exponential numbers of links and so all these index disk seeks start to stack up. The solution I propose uses doc ids as more-or-less direct pointers into file structures avoiding any index lookup.
I've started coding up some tests using the file structures I outlined and will compare that with a traditional key-based approach.

For reference - playing the "Kevin Bacon game" on a traditional Lucene index of IMDB data took 18 seconds to find a short path that Neo4j finds in 200 milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m edges).
Going from actor->movies->actors->movies produces a lot of key lookups and the difference between key indexes and direct node pointers becomes clear.
I know path finding analysis is perhaps not a typical Lucene application but other forms of link analysis e.g. recommendation engines require similar performance.

Cheers
Mark



On 25 Sep 2010, at 11:41, Paul Elschot wrote:

> Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>>>> While not exactly equivalent, it reminds me of our earlier discussion around
>>>> "layered segments" for dealing with field updates
>>
>> Right. Fast discovery of document relations is a foundation on which lots of
>> things like this can build. Relations can be given types to support a number of
>> different use cases.
>
> How about using this (bsd licenced) tree as a starting point:
> http://bplusdotnet.sourceforge.net/
> It has various keys: ao. byte array, String and long.
>
> A fixed size byte array as key seems to be just fine: two bytes for a field number,
> four for the segment number and four for the in-segment document id.
> The separate segment number would allow to minimize the updates
> in the tree during merges. One could also use the normal doc id directly.
>
> The value could then be a similar to the key, but without
> the field number, and with an indication of the direction of the link.
> Or perhaps the direction of the link should be added to the key.
> A link would be present twice, once for each direction.
> Also both directions could have their own payloads.
>
> It could be put in its own file as a separate 'segment', or maybe
> each segment could allow for allocation of a part of this tree.
>
> I like this somehow, in case it is done right one might never
> need a relational database again. Well, almost...
>
> Regards,
> Paul Elschot
>
>
>>
>>
>>
>> ----- Original Message ----
>> From: Grant Ingersoll <[hidden email]>
>> To: [hidden email]
>> Sent: Fri, 24 September, 2010 16:26:27
>> Subject: Re: Document links
>>
>> While not exactly equivalent, it reminds me of our earlier discussion around
>> "layered segments" for dealing with field updates [1], [2], albeit this is a bit
>> more generic since one could not only use the links for relating documents, but
>> one could use "special" links underneath the covers in Lucene to maintain/mark
>> which fields have been updated and then traverse to them.
>>
>> [1]
>> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>> [2]
>> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>>
>> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>>
>>> This slideshow has a first-cut on the Lucene file format extensions required to
>>>
>>> support fast linking between documents:
>>>
>>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>>>
>>>
>>> Interested in any of your thoughts.
>>>
>>> Cheers,
>>> Mark
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Paul Elschot
Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
> My starting point in the solution I propose was to eliminate linking via any type of key. Key lookups mean indexes and indexes mean disk seeks. Graph traversals have exponential numbers of links and so all these index disk seeks start to stack up. The solution I propose uses doc ids as more-or-less direct pointers into file structures avoiding any index lookup.
> I've started coding up some tests using the file structures I outlined and will compare that with a traditional key-based approach.

Both these on disk data structures and the ones in a B+ tree have seek offsets into files
that require disk seeks. And both could use document ids as key values.

But do these disk data structures support dynamic addition and deletion of (larger
numbers of) document links?

B+ trees are a standard solution for problems like this one, and it would probably
not be easy to outperform them.
It may be possible to improve performance of B+ trees somewhat by specializing
for the fairly simple keys that would be needed, and by encoding very short lists of links
for a single document directly into a seek offset to avoid the actual seek, but that's
about it.

Regards,
Paul Elschot

>
> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of IMDB data took 18 seconds to find a short path that Neo4j finds in 200 milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m edges).
> Going from actor->movies->actors->movies produces a lot of key lookups and the difference between key indexes and direct node pointers becomes clear.
> I know path finding analysis is perhaps not a typical Lucene application but other forms of link analysis e.g. recommendation engines require similar performance.
>
> Cheers
> Mark
>
>
>
> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>
> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
> >>>> While not exactly equivalent, it reminds me of our earlier discussion around
> >>>> "layered segments" for dealing with field updates
> >>
> >> Right. Fast discovery of document relations is a foundation on which lots of
> >> things like this can build. Relations can be given types to support a number of
> >> different use cases.
> >
> > How about using this (bsd licenced) tree as a starting point:
> > http://bplusdotnet.sourceforge.net/
> > It has various keys: ao. byte array, String and long.
> >
> > A fixed size byte array as key seems to be just fine: two bytes for a field number,
> > four for the segment number and four for the in-segment document id.
> > The separate segment number would allow to minimize the updates
> > in the tree during merges. One could also use the normal doc id directly.
> >
> > The value could then be a similar to the key, but without
> > the field number, and with an indication of the direction of the link.
> > Or perhaps the direction of the link should be added to the key.
> > A link would be present twice, once for each direction.
> > Also both directions could have their own payloads.
> >
> > It could be put in its own file as a separate 'segment', or maybe
> > each segment could allow for allocation of a part of this tree.
> >
> > I like this somehow, in case it is done right one might never
> > need a relational database again. Well, almost...
> >
> > Regards,
> > Paul Elschot
> >
> >
> >>
> >>
> >>
> >> ----- Original Message ----
> >> From: Grant Ingersoll <[hidden email]>
> >> To: [hidden email]
> >> Sent: Fri, 24 September, 2010 16:26:27
> >> Subject: Re: Document links
> >>
> >> While not exactly equivalent, it reminds me of our earlier discussion around
> >> "layered segments" for dealing with field updates [1], [2], albeit this is a bit
> >> more generic since one could not only use the links for relating documents, but
> >> one could use "special" links underneath the covers in Lucene to maintain/mark
> >> which fields have been updated and then traverse to them.
> >>
> >> [1]
> >> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
> >>
> >> [2]
> >> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
> >>
> >>
> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
> >>
> >>> This slideshow has a first-cut on the Lucene file format extensions required to
> >>>
> >>> support fast linking between documents:
> >>>
> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
> >>>
> >>>
> >>> Interested in any of your thoughts.
> >>>
> >>> Cheers,
> >>> Mark
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [hidden email]
> >>> For additional commands, e-mail: [hidden email]
> >>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
>>Both these on disk data structures and the ones in a B+ tree have seek offsets
>>into files
>>that require disk seeks. And both could use document ids as key values.

Yep. However my approach doesn't use a doc id as a key that is searched in any
B+ tree index (which involves disk seeks) - it is used as direct offset into a
file to get the pointer into a "links" data structure.



>>But do these disk data structures support dynamic addition and deletion of
>>(larger
>>numbers of) document links?

Yes, the slide deck I linked to shows how links (like documents) spend the early
stages of life being merged frequently in the smaller, newer segments and over
time migrate into larger, more stable segments as part of Lucene transactions.

That's the theory - I'm currently benchmarking an early prototype.



----- Original Message ----
From: Paul Elschot <[hidden email]>
To: [hidden email]
Sent: Sat, 25 September, 2010 22:03:28
Subject: Re: Document links

Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
> My starting point in the solution I propose was to eliminate linking via any
>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>traversals have exponential numbers of links and so all these index disk seeks
>start to stack up. The solution I propose uses doc ids as more-or-less direct
>pointers into file structures avoiding any index lookup.
> I've started coding up some tests using the file structures I outlined and will
>compare that with a traditional key-based approach.

Both these on disk data structures and the ones in a B+ tree have seek offsets
into files
that require disk seeks. And both could use document ids as key values.

But do these disk data structures support dynamic addition and deletion of
(larger
numbers of) document links?

B+ trees are a standard solution for problems like this one, and it would
probably
not be easy to outperform them.
It may be possible to improve performance of B+ trees somewhat by specializing
for the fairly simple keys that would be needed, and by encoding very short
lists of links
for a single document directly into a seek offset to avoid the actual seek, but
that's
about it.

Regards,
Paul Elschot

>
> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of
>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m
>edges).
> Going from actor->movies->actors->movies produces a lot of key lookups and the
>difference between key indexes and direct node pointers becomes clear.
> I know path finding analysis is perhaps not a typical Lucene application but
>other forms of link analysis e.g. recommendation engines require similar
>performance.
>
> Cheers
> Mark
>
>
>
> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>
> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>around
>
> >>>> "layered segments" for dealing with field updates
> >>
> >> Right. Fast discovery of document relations is a foundation on which lots of
>
> >> things like this can build. Relations can be given types to support a number
>of
>
> >> different use cases.
> >
> > How about using this (bsd licenced) tree as a starting point:
> > http://bplusdotnet.sourceforge.net/
> > It has various keys: ao. byte array, String and long.
> >
> > A fixed size byte array as key seems to be just fine: two bytes for a field
>number,
> > four for the segment number and four for the in-segment document id.
> > The separate segment number would allow to minimize the updates
> > in the tree during merges. One could also use the normal doc id directly.
> >
> > The value could then be a similar to the key, but without
> > the field number, and with an indication of the direction of the link.
> > Or perhaps the direction of the link should be added to the key.
> > A link would be present twice, once for each direction.
> > Also both directions could have their own payloads.
> >
> > It could be put in its own file as a separate 'segment', or maybe
> > each segment could allow for allocation of a part of this tree.
> >
> > I like this somehow, in case it is done right one might never
> > need a relational database again. Well, almost...
> >
> > Regards,
> > Paul Elschot
> >
> >
> >>
> >>
> >>
> >> ----- Original Message ----
> >> From: Grant Ingersoll <[hidden email]>
> >> To: [hidden email]
> >> Sent: Fri, 24 September, 2010 16:26:27
> >> Subject: Re: Document links
> >>
> >> While not exactly equivalent, it reminds me of our earlier discussion around
>
> >> "layered segments" for dealing with field updates [1], [2], albeit this is a
>bit
>
> >> more generic since one could not only use the links for relating documents,
>but
>
> >> one could use "special" links underneath the covers in Lucene to
>maintain/mark
>
> >> which fields have been updated and then traverse to them.
> >>
> >> [1]
> >>
>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>
> >>
> >> [2]
> >>
>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>
> >>
> >>
> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
> >>
> >>> This slideshow has a first-cut on the Lucene file format extensions
>required to
>
> >>>
> >>> support fast linking between documents:
> >>>
> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
> >>>
> >>>
> >>> Interested in any of your thoughts.
> >>>
> >>> Cheers,
> >>> Mark
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [hidden email]
> >>> For additional commands, e-mail: [hidden email]
> >>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Grant Ingersoll-2
Hi Mark,

Any update on this?

-Grant

On Sep 25, 2010, at 5:42 PM, mark harwood wrote:

>>> Both these on disk data structures and the ones in a B+ tree have seek offsets
>>> into files
>>> that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>> But do these disk data structures support dynamic addition and deletion of
>>> (larger
>>> numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> ----- Original Message ----
> From: Paul Elschot <[hidden email]>
> To: [hidden email]
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>> type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>> traversals have exponential numbers of links and so all these index disk seeks
>> start to stack up. The solution I propose uses doc ids as more-or-less direct
>> pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and will
>> compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, but
> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of
>> IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>> milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m
>> edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and the
>> difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>> other forms of link analysis e.g. recommendation engines require similar
>> performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>>> Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>>>>>> While not exactly equivalent, it reminds me of our earlier discussion
>> around
>>
>>>>>> "layered segments" for dealing with field updates
>>>>
>>>> Right. Fast discovery of document relations is a foundation on which lots of
>>
>>>> things like this can build. Relations can be given types to support a number
>> of
>>
>>>> different use cases.
>>>
>>> How about using this (bsd licenced) tree as a starting point:
>>> http://bplusdotnet.sourceforge.net/
>>> It has various keys: ao. byte array, String and long.
>>>
>>> A fixed size byte array as key seems to be just fine: two bytes for a field
>> number,
>>> four for the segment number and four for the in-segment document id.
>>> The separate segment number would allow to minimize the updates
>>> in the tree during merges. One could also use the normal doc id directly.
>>>
>>> The value could then be a similar to the key, but without
>>> the field number, and with an indication of the direction of the link.
>>> Or perhaps the direction of the link should be added to the key.
>>> A link would be present twice, once for each direction.
>>> Also both directions could have their own payloads.
>>>
>>> It could be put in its own file as a separate 'segment', or maybe
>>> each segment could allow for allocation of a part of this tree.
>>>
>>> I like this somehow, in case it is done right one might never
>>> need a relational database again. Well, almost...
>>>
>>> Regards,
>>> Paul Elschot
>>>
>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: Grant Ingersoll <[hidden email]>
>>>> To: [hidden email]
>>>> Sent: Fri, 24 September, 2010 16:26:27
>>>> Subject: Re: Document links
>>>>
>>>> While not exactly equivalent, it reminds me of our earlier discussion around
>>
>>>> "layered segments" for dealing with field updates [1], [2], albeit this is a
>> bit
>>
>>>> more generic since one could not only use the links for relating documents,
>> but
>>
>>>> one could use "special" links underneath the covers in Lucene to
>> maintain/mark
>>
>>>> which fields have been updated and then traverse to them.
>>>>
>>>> [1]
>>>>
>> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>>>>
>>>> [2]
>>>>
>> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>>>>
>>>>
>>>> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>>>>
>>>>> This slideshow has a first-cut on the Lucene file format extensions
>> required to
>>
>>>>>
>>>>> support fast linking between documents:
>>>>>
>>>>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>>>>>
>>>>>
>>>>> Interested in any of your thoughts.
>>>>>
>>>>> Cheers,
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [hidden email]
>>>>> For additional commands, e-mail: [hidden email]
>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Ryan McKinley
In reply to this post by mark harwood
Any updates/progress with this?

I'm looking at ways to implement an RTree with lucene -- and this
discussion seems relevant

thanks
ryan


On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <[hidden email]> wrote:

>>>Both these on disk data structures and the ones in a B+ tree have seek offsets
>>>into files
>>>that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>>But do these disk data structures support dynamic addition and deletion of
>>>(larger
>>>numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> ----- Original Message ----
> From: Paul Elschot <[hidden email]>
> To: [hidden email]
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>>traversals have exponential numbers of links and so all these index disk seeks
>>start to stack up. The solution I propose uses doc ids as more-or-less direct
>>pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and will
>>compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, but
> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index of
>>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>>milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m
>>edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and the
>>difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>>other forms of link analysis e.g. recommendation engines require similar
>>performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>>around
>>
>> >>>> "layered segments" for dealing with field updates
>> >>
>> >> Right. Fast discovery of document relations is a foundation on which lots of
>>
>> >> things like this can build. Relations can be given types to support a number
>>of
>>
>> >> different use cases.
>> >
>> > How about using this (bsd licenced) tree as a starting point:
>> > http://bplusdotnet.sourceforge.net/
>> > It has various keys: ao. byte array, String and long.
>> >
>> > A fixed size byte array as key seems to be just fine: two bytes for a field
>>number,
>> > four for the segment number and four for the in-segment document id.
>> > The separate segment number would allow to minimize the updates
>> > in the tree during merges. One could also use the normal doc id directly.
>> >
>> > The value could then be a similar to the key, but without
>> > the field number, and with an indication of the direction of the link.
>> > Or perhaps the direction of the link should be added to the key.
>> > A link would be present twice, once for each direction.
>> > Also both directions could have their own payloads.
>> >
>> > It could be put in its own file as a separate 'segment', or maybe
>> > each segment could allow for allocation of a part of this tree.
>> >
>> > I like this somehow, in case it is done right one might never
>> > need a relational database again. Well, almost...
>> >
>> > Regards,
>> > Paul Elschot
>> >
>> >
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> From: Grant Ingersoll <[hidden email]>
>> >> To: [hidden email]
>> >> Sent: Fri, 24 September, 2010 16:26:27
>> >> Subject: Re: Document links
>> >>
>> >> While not exactly equivalent, it reminds me of our earlier discussion around
>>
>> >> "layered segments" for dealing with field updates [1], [2], albeit this is a
>>bit
>>
>> >> more generic since one could not only use the links for relating documents,
>>but
>>
>> >> one could use "special" links underneath the covers in Lucene to
>>maintain/mark
>>
>> >> which fields have been updated and then traverse to them.
>> >>
>> >> [1]
>> >>
>>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>> >>
>> >> [2]
>> >>
>>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>> >>
>> >>
>> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> >>
>> >>> This slideshow has a first-cut on the Lucene file format extensions
>>required to
>>
>> >>>
>> >>> support fast linking between documents:
>> >>>
>> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>> >>>
>> >>>
>> >>> Interested in any of your thoughts.
>> >>>
>> >>> Cheers,
>> >>> Mark
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [hidden email]
>> >>> For additional commands, e-mail: [hidden email]
>> >>>
>> >>
>> >> --------------------------
>> >> Grant Ingersoll
>> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email]
>> > For additional commands, e-mail: [hidden email]
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

mark harwood
I came to the conclusion that the transient meaning of document ids is too
deeply ingrained in Lucene's design to use them to underpin any reliable
linking.
While it might work for relatively static indexes, any index with a reasonable
number of updates or deletes will invalidate any stored document references in
ways which are very hard to track. Lucene's compaction shuffles IDs without
taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling
IDs" here: http://goo.gl/5UbJi )


Cheers,
Mark


----- Original Message ----
From: Ryan McKinley <[hidden email]>
To: [hidden email]
Sent: Mon, 8 November, 2010 19:03:59
Subject: Re: Document links

Any updates/progress with this?

I'm looking at ways to implement an RTree with lucene -- and this
discussion seems relevant

thanks
ryan


On Sat, Sep 25, 2010 at 5:42 PM, mark harwood <[hidden email]> wrote:
>>>Both these on disk data structures and the ones in a B+ tree have seek
offsets

>>>into files
>>>that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>>But do these disk data structures support dynamic addition and deletion of
>>>(larger
>>>numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the
>early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> ----- Original Message ----
> From: Paul Elschot <[hidden email]>
> To: [hidden email]
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>>traversals have exponential numbers of links and so all these index disk seeks
>>start to stack up. The solution I propose uses doc ids as more-or-less direct
>>pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and
>will
>>compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek,
but

> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index
>of
>>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>>milliseconds on the same data (and this was a disk based graph of 3m nodes,
10m
>>edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and
the

>>difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>>other forms of link analysis e.g. recommendation engines require similar
>>performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>>around
>>
>> >>>> "layered segments" for dealing with field updates
>> >>
>> >> Right. Fast discovery of document relations is a foundation on which lots
>of
>>
>> >> things like this can build. Relations can be given types to support a
>number
>>of
>>
>> >> different use cases.
>> >
>> > How about using this (bsd licenced) tree as a starting point:
>> > http://bplusdotnet.sourceforge.net/
>> > It has various keys: ao. byte array, String and long.
>> >
>> > A fixed size byte array as key seems to be just fine: two bytes for a field
>>number,
>> > four for the segment number and four for the in-segment document id.
>> > The separate segment number would allow to minimize the updates
>> > in the tree during merges. One could also use the normal doc id directly.
>> >
>> > The value could then be a similar to the key, but without
>> > the field number, and with an indication of the direction of the link.
>> > Or perhaps the direction of the link should be added to the key.
>> > A link would be present twice, once for each direction.
>> > Also both directions could have their own payloads.
>> >
>> > It could be put in its own file as a separate 'segment', or maybe
>> > each segment could allow for allocation of a part of this tree.
>> >
>> > I like this somehow, in case it is done right one might never
>> > need a relational database again. Well, almost...
>> >
>> > Regards,
>> > Paul Elschot
>> >
>> >
>> >>
>> >>
>> >>
>> >> ----- Original Message ----
>> >> From: Grant Ingersoll <[hidden email]>
>> >> To: [hidden email]
>> >> Sent: Fri, 24 September, 2010 16:26:27
>> >> Subject: Re: Document links
>> >>
>> >> While not exactly equivalent, it reminds me of our earlier discussion
>around
>>
>> >> "layered segments" for dealing with field updates [1], [2], albeit this is
>a
>>bit
>>
>> >> more generic since one could not only use the links for relating
documents,

>>but
>>
>> >> one could use "special" links underneath the covers in Lucene to
>>maintain/mark
>>
>> >> which fields have been updated and then traverse to them.
>> >>
>> >> [1]
>> >>
>>http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>>
>>
>> >>
>> >> [2]
>> >>
>>http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>>
>>
>> >>
>> >>
>> >> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> >>
>> >>> This slideshow has a first-cut on the Lucene file format extensions
>>required to
>>
>> >>>
>> >>> support fast linking between documents:
>> >>>
>> >>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>> >>>
>> >>>
>> >>> Interested in any of your thoughts.
>> >>>
>> >>> Cheers,
>> >>> Mark
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [hidden email]
>> >>> For additional commands, e-mail: [hidden email]
>> >>>
>> >>
>> >> --------------------------
>> >> Grant Ingersoll
>> >> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>> >>
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [hidden email]
>> >> For additional commands, e-mail: [hidden email]
>> >>
>> >>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email]
>> > For additional commands, e-mail: [hidden email]
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Paul Elschot
In reply to this post by Ryan McKinley
On Monday 08 November 2010 20:03:59 Ryan McKinley wrote:
> Any updates/progress with this?
>
> I'm looking at ways to implement an RTree with lucene -- and this
> discussion seems relevant

Did you consider merging the bits of each dimension into a NumericField?

For example: one dimension a0 a1 .. an
and a second dimension b0 b1 .. bn
into a0 b0 a1 b1 .. an bn
and then index this number as a NumericField.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Ryan McKinley
On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot <[hidden email]> wrote:

> On Monday 08 November 2010 20:03:59 Ryan McKinley wrote:
>> Any updates/progress with this?
>>
>> I'm looking at ways to implement an RTree with lucene -- and this
>> discussion seems relevant
>
> Did you consider merging the bits of each dimension into a NumericField?
>
> For example: one dimension a0 a1 .. an
> and a second dimension b0 b1 .. bn
> into a0 b0 a1 b1 .. an bn
> and then index this number as a NumericField.
>

Something like the geohash algorithm but with n dimensions?

The linking work that Mark discussed seems nice since it would give
faster access to navigating the tree -- finding N nearest neigbhors
etc...

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Document links

Ryan McKinley
In reply to this post by mark harwood
On Mon, Nov 8, 2010 at 2:52 PM, mark harwood <[hidden email]> wrote:
> I came to the conclusion that the transient meaning of document ids is too
> deeply ingrained in Lucene's design to use them to underpin any reliable
> linking.

What about if we define an id field (like in solr)?

Whatever does the traversal would need to make a Map<id,docID>, but
that is still better then then needing to do a query for each link.


> While it might work for relatively static indexes, any index with a reasonable
> number of updates or deletes will invalidate any stored document references in
> ways which are very hard to track. Lucene's compaction shuffles IDs without
> taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling
> IDs" here: http://goo.gl/5UbJi )
>

oh ya -- and it is even more akward since each subreader often reuses
the same docId

ryan

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12