Data import handler not indexing all data

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Data import handler not indexing all data

Yangrui Guo
Hello

I'm being troubled by solr's data import handler. My solr version is 5.3.1
and mysql is 5.5. I tried to index imdb data but found solr only partially
indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the query
result was 1636549. However DIH only fetched and indexed 287041 rows. I
didn't see any error in the log. Why was this happening?

Here's my data-config.xml

<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
<document>
<entity name="director" transformer="RegexTransformer" query="SELECT
DISTINCT * FROM imdb.director">
<field name="id" column="id" />
<field name="content_type" column="content_type" />
</entity>
</document>
</dataConfig>

Yangrui Guo
Reply | Threaded
Open this post in threaded view
|

Re: Data import handler not indexing all data

Alexandre Rafalovitch
Just to get the paranoid option out of the way, is 'id' actually the
column that has unique ids in your database? If you do "select
distinct id from imdb.director" - how many items do you get?

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 18:21, Yangrui Guo <[hidden email]> wrote:

> Hello
>
> I'm being troubled by solr's data import handler. My solr version is 5.3.1
> and mysql is 5.5. I tried to index imdb data but found solr only partially
> indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the query
> result was 1636549. However DIH only fetched and indexed 287041 rows. I
> didn't see any error in the log. Why was this happening?
>
> Here's my data-config.xml
>
> <dataConfig>
> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> <document>
> <entity name="director" transformer="RegexTransformer" query="SELECT
> DISTINCT * FROM imdb.director">
> <field name="id" column="id" />
> <field name="content_type" column="content_type" />
> </entity>
> </document>
> </dataConfig>
>
> Yangrui Guo
Reply | Threaded
Open this post in threaded view
|

Re: Data import handler not indexing all data

Yangrui Guo
Hi thanks for the continued support. I'm really worried as my project
deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
distinct in the beginning of the query because IMDB doesn't have a table
for cast & crew. It puts movie and person and their roles into one huge
table 'cast_info'. Hence there are multiple rows for a director, one row
per his movie.

On Saturday, November 7, 2015, Alexandre Rafalovitch <[hidden email]>
wrote:

> Just to get the paranoid option out of the way, is 'id' actually the
> column that has unique ids in your database? If you do "select
> distinct id from imdb.director" - how many items do you get?
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 18:21, Yangrui Guo <[hidden email]
> <javascript:;>> wrote:
> > Hello
> >
> > I'm being troubled by solr's data import handler. My solr version is
> 5.3.1
> > and mysql is 5.5. I tried to index imdb data but found solr only
> partially
> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> query
> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
> > didn't see any error in the log. Why was this happening?
> >
> > Here's my data-config.xml
> >
> > <dataConfig>
> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> > <document>
> > <entity name="director" transformer="RegexTransformer" query="SELECT
> > DISTINCT * FROM imdb.director">
> > <field name="id" column="id" />
> > <field name="content_type" column="content_type" />
> > </entity>
> > </document>
> > </dataConfig>
> >
> > Yangrui Guo
>
Reply | Threaded
Open this post in threaded view
|

Re: Data import handler not indexing all data

Alexandre Rafalovitch
That's not quite the question I asked. Do a distinct on 'id' only in
the database itself. If your ids are NOT unique, you need to create a
composite or a virtual id for Solr. Because whatever your
solrconfig.xml say is uniqueKey will be used to deduplicate the
documents. If you have 10 documents with the same id value, only one
will be in the final Solr.

I am not saying that's where the problem is, DIH is fiddly. But just
get that out of the way.

If that's not the case, you may need to isolate which documents are
failing. The easiest way to do so is probably to index a smaller
subset of records, say 1000. Pick a condition in your SQL to do so
(e.g. id value range). Then, see how many made it into Solr. If not
all 1000, export the list of IDs from SQL, then a list of IDs from
Solr (use CSV format and just fl=id). Sort both, compare, see what ids
are missing. Look what is strange about those documents as opposed to
the documents that did make it into Solr. Try to push one of those
missing documents explicitly into Solr by either modifying SQL query
in DIH or as CSV or whatever.

Good luck,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 19:07, Yangrui Guo <[hidden email]> wrote:

> Hi thanks for the continued support. I'm really worried as my project
> deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> distinct in the beginning of the query because IMDB doesn't have a table
> for cast & crew. It puts movie and person and their roles into one huge
> table 'cast_info'. Hence there are multiple rows for a director, one row
> per his movie.
>
> On Saturday, November 7, 2015, Alexandre Rafalovitch <[hidden email]>
> wrote:
>
>> Just to get the paranoid option out of the way, is 'id' actually the
>> column that has unique ids in your database? If you do "select
>> distinct id from imdb.director" - how many items do you get?
>>
>> Regards,
>>    Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 7 November 2015 at 18:21, Yangrui Guo <[hidden email]
>> <javascript:;>> wrote:
>> > Hello
>> >
>> > I'm being troubled by solr's data import handler. My solr version is
>> 5.3.1
>> > and mysql is 5.5. I tried to index imdb data but found solr only
>> partially
>> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
>> query
>> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
>> > didn't see any error in the log. Why was this happening?
>> >
>> > Here's my data-config.xml
>> >
>> > <dataConfig>
>> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>> > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
>> > <document>
>> > <entity name="director" transformer="RegexTransformer" query="SELECT
>> > DISTINCT * FROM imdb.director">
>> > <field name="id" column="id" />
>> > <field name="content_type" column="content_type" />
>> > </entity>
>> > </document>
>> > </dataConfig>
>> >
>> > Yangrui Guo
>>
Reply | Threaded
Open this post in threaded view
|

Re: Data import handler not indexing all data

Yangrui Guo
Yes the id is unique. If I only select distinct id,count(id) I get the same
results. However I found this is more likely a MySQL issue. I created a new
table called director1 and ran query "insert into director1 select * from
director" I got only 287041 results inserted, which was the same as Solr. I
don't know why the same query is causing two different results.

On Saturday, November 7, 2015, Alexandre Rafalovitch <[hidden email]>
wrote:

> That's not quite the question I asked. Do a distinct on 'id' only in
> the database itself. If your ids are NOT unique, you need to create a
> composite or a virtual id for Solr. Because whatever your
> solrconfig.xml say is uniqueKey will be used to deduplicate the
> documents. If you have 10 documents with the same id value, only one
> will be in the final Solr.
>
> I am not saying that's where the problem is, DIH is fiddly. But just
> get that out of the way.
>
> If that's not the case, you may need to isolate which documents are
> failing. The easiest way to do so is probably to index a smaller
> subset of records, say 1000. Pick a condition in your SQL to do so
> (e.g. id value range). Then, see how many made it into Solr. If not
> all 1000, export the list of IDs from SQL, then a list of IDs from
> Solr (use CSV format and just fl=id). Sort both, compare, see what ids
> are missing. Look what is strange about those documents as opposed to
> the documents that did make it into Solr. Try to push one of those
> missing documents explicitly into Solr by either modifying SQL query
> in DIH or as CSV or whatever.
>
> Good luck,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:07, Yangrui Guo <[hidden email]
> <javascript:;>> wrote:
> > Hi thanks for the continued support. I'm really worried as my project
> > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> > distinct in the beginning of the query because IMDB doesn't have a table
> > for cast & crew. It puts movie and person and their roles into one huge
> > table 'cast_info'. Hence there are multiple rows for a director, one row
> > per his movie.
> >
> > On Saturday, November 7, 2015, Alexandre Rafalovitch <[hidden email]
> <javascript:;>>
> > wrote:
> >
> >> Just to get the paranoid option out of the way, is 'id' actually the
> >> column that has unique ids in your database? If you do "select
> >> distinct id from imdb.director" - how many items do you get?
> >>
> >> Regards,
> >>    Alex.
> >> ----
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 7 November 2015 at 18:21, Yangrui Guo <[hidden email]
> <javascript:;>
> >> <javascript:;>> wrote:
> >> > Hello
> >> >
> >> > I'm being troubled by solr's data import handler. My solr version is
> >> 5.3.1
> >> > and mysql is 5.5. I tried to index imdb data but found solr only
> >> partially
> >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> >> query
> >> > result was 1636549. However DIH only fetched and indexed 287041 rows.
> I
> >> > didn't see any error in the log. Why was this happening?
> >> >
> >> > Here's my data-config.xml
> >> >
> >> > <dataConfig>
> >> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> >> > url="jdbc:mysql://localhost:3306/imdb" user="root"
> password="password" />
> >> > <document>
> >> > <entity name="director" transformer="RegexTransformer" query="SELECT
> >> > DISTINCT * FROM imdb.director">
> >> > <field name="id" column="id" />
> >> > <field name="content_type" column="content_type" />
> >> > </entity>
> >> > </document>
> >> > </dataConfig>
> >> >
> >> > Yangrui Guo
> >>
>