Using Solr as a Database?


Using Solr as a Database?

Ralph Soika
Inspired by an article by Uwe Schindler in the latest issue of the
German JavaMagazin, I wonder whether Solr can also be used as a database.

In our open source project Imixs-Workflow we have been using Lucene
<https://imixs.org/doc/engine/queries.html> for several years with
great success. We have unstructured, document-like data generated by the
workflow engine. We store all the data in a blob column of a
transactional RDBMS and index it with Lucene. This works great and is
impressively fast, even with complex queries.

The thing is that we do not store any fields in Lucene - only the
primary key of our dataset is stored there. The document data itself
lives in the SQL database.
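
Roughly, our indexing side looks like the following minimal sketch (the
index path, field names and analyzer are simplified for illustration and
are not our actual code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class IdOnlyIndexer {
        public static void index(String primaryKey, String fullText) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/workitem-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                // Only the primary key is stored; it is what a search hit returns.
                doc.add(new StringField("id", primaryKey, Field.Store.YES));
                // The content is indexed for full-text search but NOT stored --
                // the actual data stays in the RDBMS blob column.
                doc.add(new TextField("content", fullText, Field.Store.NO));
                writer.updateDocument(new Term("id", primaryKey), doc);
            }
        }
    }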

Now, as far as I understand, Solr is a cluster-enabled datastore which
could also be used to store all the data from our documents.
The problem with relational databases has always been the lack of
cloud/cluster support for making the data more resilient through
redundancy over several nodes.

What do you think? Is Solr an alternative for storing and indexing the
data, instead of using Lucene in combination with an RDBMS?


===
Ralph


Re: Using Solr as a Database?

Jörn Franke
It depends on what you want to do with it. You can store all fields in Solr and filter on them. However, as soon as ACID guarantees come into play, or if you need to join the data, you will probably need something other than Solr (or use workarounds, e.g. flattening the tables).

Maybe you can describe in more detail what the users do in Solr or in the database.


Re: Using Solr as a Database?

Erick Erickson
You must be able to rebuild your index completely when, at some point, you change your schema in incompatible ways. For that reason, you either have to play tricks with Solr (e.g. store all fields, or the original document, or….) or somehow have access to the original document.

Furthermore, starting with Lucene 8, Lucene will not even open an index _ever_ touched with Lucene 6. In general you can’t even open an index with Lucene X that was ever worked on with Lucene X-2 (starting where X = 8).

That said, it’s a common pattern to put enough information into Solr that a user can identify the documents they need, then go to the system-of-record for the full document, whether that is an RDBMS or a file system or whatever. I’ve seen lots of hybrid systems that store additional data besides the id and let the user get to the document she wants; only when she clicks on a single document does the system go to the system-of-record and fetch it. Think of a Google search where the information you see as the result of a search is stored in Solr, but when the user clicks on a link the original doc is fetched from someplace other than Solr.
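
As a rough illustration of that hybrid pattern (the collection URL, field
names and SQL query are made up for the sketch; in real code the Solr
client would also be reused, not created per request):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class HybridSearch {
        // 1) Solr answers the search and returns only the id plus display fields.
        static SolrDocument findHit(String userQuery) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/workitems").build()) {
                SolrQuery q = new SolrQuery(userQuery);
                q.setFields("id", "title");   // enough to render a result list
                q.setRows(10);
                return solr.query(q).getResults().stream().findFirst().orElse(null);
            }
        }

        // 2) Only when the user opens a hit is the full document loaded
        //    from the system of record (the RDBMS).
        static byte[] loadFullDocument(String id) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:postgresql://db/workflow");
                 PreparedStatement ps = con.prepareStatement("SELECT data FROM workitems WHERE id = ?")) {
                ps.setString(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getBytes("data") : null;
                }
            }
        }
    }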

FWIW,
Erick



Re: Using Solr as a Database?

Ralph Soika
Thanks Jörn and Erick for your explanations.

What I do so far is the following:

  * I have an RDBMS with one totally flattened table holding all the data
and the id.
  * The data is unstructured. Fields can vary from document to document.
I have no fixed schema. A dataset is represented by a HashMap.
  * Lucene (7.5) is perfect for indexing the data - with analysed
full-text fields and also with non-analysed fields (see the sketch below).
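
A minimal sketch of how such a schema-less dataset can be mapped to
Lucene fields (the split into analysed and non-analysed fields is
illustrative, not our actual implementation):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    import java.util.Map;
    import java.util.Set;

    public class ItemIndexMapper {
        /**
         * Builds a Lucene Document from a schema-less item map.
         * Field names listed in 'fullTextFields' are analysed (tokenised);
         * everything else is indexed as a single non-analysed keyword.
         */
        static Document toLuceneDoc(Map<String, Object> item, Set<String> fullTextFields) {
            Document doc = new Document();
            for (Map.Entry<String, Object> entry : item.entrySet()) {
                String name = entry.getKey();
                String value = String.valueOf(entry.getValue());
                if (fullTextFields.contains(name)) {
                    doc.add(new TextField(name, value, Field.Store.NO));   // analysed full-text
                } else {
                    doc.add(new StringField(name, value, Field.Store.NO)); // exact-match keyword
                }
            }
            return doc;
        }
    }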

The whole system is highly transactional as it runs on Java EE with JPA
and Session EJBs.
I can easily rebuild my index at any time, as I have all the data in the
RDBMS. And of course it was necessary in the past to rebuild the index
for many projects after upgrading Lucene (e.g. from 4.x to 7.x).

So, as far as I understand, you recommend leaving the data in the RDBMS?

The problem with an RDBMS is that you cannot easily scale over many nodes
in a masterless cluster. This is why I thought Solr could solve the
problem easily. On the other hand, my Lucene index does not scale over
multiple nodes either. Maybe Solr would be a solution for scaling just
the index?

Another solution I am working on is to store all my data in an HA
Cassandra cluster, because I do not need the SQL core functionality. But
in that case I would only replace the RDBMS with Cassandra, and
Lucene/Solr would again hold only the index.

So Solr can't improve my architecture, except that the search index
could be distributed across multiple nodes. Did I get that right?


===
Ralph




Re: Using Solr as a Database?

Walter Underwood
In reply to this post by Ralph Soika

> On Jun 2, 2019, at 6:28 AM, Ralph Soika <[hidden email]> wrote:
>
> Now as far as I understand is solr a cluster enabled datastore which can be used to store also all the data form our document.

That understanding is incorrect. Solr is not a data store. Reasoning based on that false assumption leads to false statements.

I’ve used Solr for about a dozen years and I’ve worked at two different non-relational database companies. Solr does not meet the minimal requirements for a reliable data store. For example, there is no transactional backup or even dump/load.

If you use Solr as your primary repository, you will lose all your data at some point.

On Monday, I need to delete an index with 45 million documents and recreate it from the source repository. I screwed up and made an incompatible schema change. Part of the index is written one way and the other part another way. Solr won’t even open the index now. So all that data is unrecoverable.

If you need a cluster-aware data store with search features, buy it from MarkLogic.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


Re: Using Solr as a Database?

David Hastings
In reply to this post by Ralph Soika
You *can* use Solr as a database, in the same sense that you *can* use a chainsaw to remodel your bathroom. Is it the right tool for the job? No. Can you make it work? Yes. As for an HA/clustered RDBMS, Galera Cluster works great for MariaDB and is ACID-compliant; I’m sure any other database has its own cluster product, or I would hope so. Generally it’s best to use the right tool for the job and not force behavior from something it’s not intended to do.


That being said, I heavily abuse Solr as a data store, since pre-processing data before it goes into the index is more efficient for me than a bunch of SQL joins for the outgoing product. But for actually editing and storing the data, a relational DB is easier to deal with, and it is easier to find others who can work with it.




Re: Using Solr as a Database?

Erick Erickson
In reply to this post by Ralph Soika
Not exactly. If I’m reading this right, you have all the data in the RDBMS now and will continue to keep it there, correct? That’s what I call the “system of record”. So you’re not talking about getting rid of the RDBMS, but rather about copying it all over into Solr and periodically updating your Solr indexes. So at any time you can throw the Solr cluster away and re-create it by ingesting from the RDBMS (or whatever data store you settle on).

In that case, storing all your fields in Solr is perfectly reasonable, and SolrCloud will scale as necessary. There are some practical considerations, mostly having to do with hardware. You get HA/DR with Solr at the expense of multiple copies of the index etc.
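
For illustration, a full rebuild from the RDBMS into Solr could look
roughly like this (table, column and collection names are invented for
the sketch; a real ingest would batch the adds):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class FullReindex {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection("jdbc:postgresql://db/workflow");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, title, status, created FROM workitems");
                 SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/workitems").build()) {

                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    // Copy every selected column; Solr holds a full, disposable copy of the row.
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        doc.addField(meta.getColumnName(i), rs.getObject(i));
                    }
                    solr.add(doc);
                }
                solr.commit();  // make the rebuilt index visible to searchers
            }
        }
    }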

What I and others are saying is that putting all your data in Solr then throwing the RDBMS (or whatever) away is not a good idea.

Best,
Erick



Re: Using Solr as a Database?

Ralph Soika
Thanks a lot again for your answers. I now have a much better
understanding of what Solr is intended for.

Thanks for your help

===
Ralph



Re: Using Solr as a Database?

Shawn Heisey-2
In reply to this post by Ralph Soika
On 6/2/2019 7:28 AM, Ralph Soika wrote:

This is not intended to contradict the other replies you've gotten, only
to supplement them.

> Now as far as I understand is solr a cluster enabled datastore which can
> be used to store also all the data form our document.
> The problem with relational databases was always the lack of
> cloud/cluster support to get more stable data by using redundancy over
> serveral nodes.

At its heart, Solr is using something you already understand -- Lucene.
Certain functionality is implemented above that in Solr -- facets
being probably the primary example.  For the most part, if you wouldn't
use Lucene for some purpose, you shouldn't use Solr for that purpose
either -- because Solr is written with the Lucene API.

Search is Solr's primary function, and what it is optimized to do.  Any
other use, even when it is possible, is probably going to be better
handled by another piece of software.

We have done what we can to eliminate problems that lose data, but data
retention in the face of all potential problems is not one of the design
goals.  Things CAN go wrong that result in data loss ... while most
database software is hardened against data loss from even unexpected
problems.

> What do you think? Is solr an alternative to store and index data
> instead of useing Lucene in combination with RDBMS?

In general, no.  There are things databases can do that Solr can't, and
some things that a database is better at than Solr is.  Solr is good at
search, and things related to search.

If you have the system resources, putting a complete copy of your data
in Solr is not necessarily a bad thing.  Some amazing things can be done
in the arena of data mining.  The facet feature that I mentioned above
tends to be very usable for that.
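
For example, a simple facet query over such a copy might look roughly
like this (collection and field names are invented for the illustration):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetExample {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/workitems").build()) {
                SolrQuery q = new SolrQuery("*:*");
                q.setRows(0);                       // we only want the counts
                q.addFacetField("workflowstatus");  // e.g. count documents per status
                QueryResponse rsp = solr.query(q);
                for (FacetField.Count c : rsp.getFacetField("workflowstatus").getValues()) {
                    System.out.println(c.getName() + ": " + c.getCount());
                }
            }
        }
    }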

Thanks,
Shawn

RE: Using Solr as a Database?

Davis, Daniel (NIH/NLM) [C]
I think the sweet spot of Cassandra and Solr should be mentioned in this discussion.   Cassandra is more scalable/clusterable than an RDBMS, without losing all of the structure that is desirable in an RDBMS.  

In contrast, if you use a full document store such as MongoDB, you lose some of the ability to know what is in your schema.

DataStax markets a platform that combines Cassandra (as a distributed replacement for an RDBMS) with Solr, so that records managed in Cassandra are indexed and kept up to date.

If your real problem with an RDBMS is the lack of scaling, but you like the ability to specify columnar structure explicitly, then this combination might be a good fit.

Now, MongoDB is also a strong alternative to an RDBMS.

The other thing to recall though is that the power of sharding has reached into the databases themselves, and databases such as PostgreSQL can operate with some tables sharded and other tables duplicated.   See https://pgdash.io/blog/postgres-11-sharding.html.



Re: Using Solr as a Database?

Christopher Schultz
In reply to this post by Ralph Soika

Ralph,

On 6/2/19 16:32, Ralph Soika wrote:
> The whole system is highly transactional as it runs on Java EE with
> JPA and Session EJBs.

And you write-through from your application -> RDBMS -> Lucene/Solr?

How are you handling commits (both soft and hard) and re-opening the
index?

> So, as far as I understand, you recommend to leave the data in the
> RDBMS?

I certainly would, even if it's just to allow a rebuild of the index
from a "trusted" source.

> The problem with RDBMS is that you can not easily scale over many
> nodes with a master less cluster.

That sounds like a problem with your choice of RDBMS, and not with
RDBMSs in general.

> This was why I thought Solr can solve this problem easily. On the
> other hand my Lucene index also did not scale over multiple nodes.

If you want a clustered document-store[1], you might want to look at a
storage system designed for that purpose such as CouchDB or MongoDB.
Lucene/Solr is really best used as a distillation of data stored
elsewhere and not as a backing-store itself.

> Maybe Solr would be a solution to scale just the index?

That's exactly what Solr is for.

> Another solution I am working on is to store all my data in a HA
> Cassandra cluster because I do not need the SQL-Core
> functionallity. But in this case I only replace the RDBMS with
> Cassandra and Lucene/Solr holds again only the index.

This seems like another plausible solution.

> So Solr can't improve my architecture, with the exception of the
> fact that the search index could be distributed across multiple
> nodes with Solr. Did I get that right?

Yes.

Hope that helps,
-chris

[1]
https://en.wikipedia.org/wiki/Document-oriented_database#Implementations

Re: Using Solr as a Database?

Christopher Schultz
In reply to this post by Davis, Daniel (NIH/NLM) [C]

Daniel,

On 6/3/19 16:26, Davis, Daniel (NIH/NLM) [C] wrote:
> I think the sweet spot of Cassandra and Solr should be mentioned
> in this discussion.   Cassandra is more scalable/clusterable than
> an RDBMS, without losing all of the structure that is desirable in
> an RDBMS.

Amusingly enough, there is also Solandra if you don't want to choose :)

https://github.com/tjake/Solandra

It's a lot like DataStax.

> The other thing to recall though is that the power of sharding has
> reached into the databases themselves, and databases such as
> PostgreSQL can operate with some tables sharded and other tables
> duplicated.   See
> https://pgdash.io/blog/postgres-11-sharding.html.

Even MySQL and MariaDB -- the most bare-bones solutions in the RDBMS
space -- now have clustering available to them, so it's hard to defend
an RDBMS solution at this point that does NOT provide clustering, or
something similar.

-chris

Re: Using Solr as a Database?

Ralph Soika
In reply to this post by Christopher Schultz
Hello Christopher,

On 03.06.19 23:13, Christopher Schultz wrote:

> Ralph,
>
> On 6/2/19 16:32, Ralph Soika wrote:
>> The whole system is highly transactional as it runs on Java EE with
>> JPA and Session EJBs.
> And you write-through from your application -> RDBMS -> Lucene/Solr?
>
> How are you handling commits (both soft and hard) and re-opening the
> index?

we use "Change Data Capture" events which we write during a transaction
into the RDBMS.

The Lucene Service consumes this event log entries async. E.g. if a
search request occurs. So we have the guarantee the the search result
contains only commited data. This works great and is fast enough 
because in the worst case only the first search request will be delayed
with the small amount of updating new entries from another completed
transaction.
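
As a rough sketch of that pattern (class, table and field names here are
invented for the example, not our actual Imixs-Workflow implementation):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    import javax.persistence.EntityManager;
    import java.util.List;

    /**
     * Sketch of the change-data-capture idea: index events are written into an
     * event-log table inside the business transaction; before a search, pending
     * events are applied to Lucene, so searches only ever see committed data.
     */
    public class EventLogIndexer {

        private final EntityManager em;
        private final IndexWriter writer;

        public EventLogIndexer(EntityManager em, IndexWriter writer) {
            this.em = em;
            this.writer = writer;
        }

        /** Called lazily, e.g. at the start of a search request. */
        public void flushPendingEvents() throws Exception {
            @SuppressWarnings("unchecked")
            List<Object[]> events = em
                    .createNativeQuery("SELECT id, content FROM index_event ORDER BY created")
                    .getResultList();
            for (Object[] event : events) {
                String id = (String) event[0];
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                // ... add the remaining fields from event[1] here ...
                writer.updateDocument(new Term("id", id), doc);
                em.createNativeQuery("DELETE FROM index_event WHERE id = ?")
                  .setParameter(1, id).executeUpdate();   // consume the event exactly once
            }
            writer.commit();   // make the changes visible to the next IndexReader
        }
    }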



===

Ralph