number of files indexed (re-formatted)

Nan Yu
Sorry, I just found out that the mailing list only takes plain text, so my previous post looks really messy. I have reformatted it below.


Hi,
    I did a simple indexing of a directory that contains a lot of PDF, text, DOC, ZIP, etc. files. The file contents have no common structure; I would like to index them and later search for keywords within the files.


    After creating the core, I indexed the files in the directory using the following command: 


bin/post -p 8983 -m 10g -c myCore /DATA_FOLDER > solr_indexing.log


    The log file shows something like below (the first and last few lines in the log file):


java -classpath /solr/solr-8.3.0/dist/solr-core-8.3.0.jar -Dauto=yes -Dport=8983 -Dm=15g -Dc=myCore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool /DATA_FOLDER
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/myCore/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
...
...
...
POSTing file Report.pdf (application/pdf) to [base]/extract
47256 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/myCore/update...
Time spent: 1:03:59.587




But when I look at the result in the browser, the core overview (http://localhost:8983/solr/#/myCore/core-overview) shows:
Num Docs: 47648


Most of the indexed files have an id whose value is the full path of the file, such as /DATA_FOLDER/20180321/Report.pdf


But for about 400 of them, the id looks like: 232d7bd6-c586-4726-8d2b-bc9b1febcff4.


So my questions are:
(1) Why are the two numbers different (log file vs. overview)?
(2) For the ids that are not a full file path, how do I know where they come from (the original file)?




Thanks for your help!
Nan




PS: a few examples of query results for those strange ids:


{
        "bolt-small-online":["Test strip-north"],
        "3696714.008":[3702848.584],
        "380614.564":[376900.143],
        "100.038":[111.074],
        "gpo-bolt":["teststrip"],
        "id":"232d7bd6-c586-4726-8d2b-bc9b1febcff4",
        "_version_":1652839231413813252
}




{
        "Date":["8/24/2001"],
        "EXT31":[0],
        "EXT32":[0.12],
        "Aggregate":[0.12],
        "Pounds_Vap":[37],
        "Gallons_Vap":[5.8],
        "Gallons_Liq":[0],
        "Gallons_Tot":[5.8],
        "Avg_Rate":[1.8],
        "Gallons_Rec":[577],
        "Water":[577],
        "id":"840c05af-caf0-4407-8753-dcc6957abcc5",
        "Well_s_":["EXT31;EXT32"],
        "Time__hrs_":[3.25],
        "_version_":1652898731969740800
}


{
        "2":[4],
        "SFS1":["PLM1"],
        "1.00":[1.0],
        "69":[79],
        "id":"e675a6f5-0a3e-41b1-b1fe-b3098d0be725",
        "_version_":1652825435791163395
}
Re: number of files indexed (re-formatted)

Jörn Franke
This depends on your ingestion process. Unique ids that are not filenames usually mean either that the document did not come from a file, or that your ingestion process does not pass the file name along. In this case the collection seems to be configured to generate a unique identifier.

Maybe you can describe in more detail how you process the files.

A wild guess is that they come from inside a zip file. In that case, metadata from Tika could be used to build an id by concatenating the zip file name and the name of the file inside the zip.
However, we don’t know how you have defined your ingestion process, so this is pure speculation on my side.
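
To make the numbers concrete, here is a minimal Python sketch (not part of any Solr tooling) that takes the two counts from the original post and separates path-like ids from ids that look auto-generated. The helper names is_generated_id and split_ids are made up for illustration, and the check only assumes that the generated ids have the canonical 8-4-4-4-12 hex form seen in the examples.

```python
import re

# Canonical UUID text form, as in the example ids from the original post.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_generated_id(doc_id):
    """Return True if the id looks like a generated UUID rather than a path."""
    return UUID_PATTERN.fullmatch(doc_id) is not None


def split_ids(ids):
    """Split ids into (path-like ids, UUID-like ids), preserving order."""
    paths = [i for i in ids if not is_generated_id(i)]
    generated = [i for i in ids if is_generated_id(i)]
    return paths, generated


# Counts reported in the original post.
files_indexed = 47256   # "47256 files indexed." in the post tool log
num_docs = 47648        # "Num Docs" in the core overview
extra_docs = num_docs - files_indexed

# Ids quoted in the original post.
sample_ids = [
    "/DATA_FOLDER/20180321/Report.pdf",
    "232d7bd6-c586-4726-8d2b-bc9b1febcff4",
    "840c05af-caf0-4407-8753-dcc6957abcc5",
    "e675a6f5-0a3e-41b1-b1fe-b3098d0be725",
]
paths, generated = split_ids(sample_ids)

if __name__ == "__main__":
    print("documents beyond the file count:", extra_docs)
    print("path-like ids:", paths)
    print("UUID-like ids:", generated)
```

The difference between the two counts is 392, which is in line with the "about 400" documents that carry a generated id, so the extra documents and the UUID-like ids are very likely the same set.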

> On 18.12.2019 at 16:40, Nan Yu <[hidden email]> wrote:
Re: number of files indexed (re-formatted)

Erick Erickson
I’d urge you to consider moving from ExtractingRequestHandler (i.e. just sending the files to Solr) to running the Tika parser externally. ExtractingRequestHandler is a great way to get started, but I’ve often found that I need much finer control over the process.

Here’s the full treatment:
https://lucidworks.com/post/indexing-with-solrj/
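
As a rough illustration of what "finer control" can mean, the sketch below walks a directory on the client side, sets every document's id to the file path explicitly, and prepares a JSON update request for Solr. This is not the SolrJ code from the article above: it is a Python standard-library sketch, extract_text is a hypothetical stand-in for the step where Tika would run, and the field name content_txt is an assumption that has to match your schema. The core name and port are taken from the original post.

```python
import json
import os
import urllib.request

# Assumption: core name and port as in the original post.
SOLR_UPDATE_URL = "http://localhost:8983/solr/myCore/update?commit=true"


def extract_text(path):
    """Hypothetical placeholder for the external Tika parse.

    Here the file is simply read as text; a real pipeline would call
    Tika at this point and could also return metadata fields.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def build_docs(root, extract=extract_text):
    """Build one Solr document per file, with the file path as the id."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # "content_txt" is an assumed field name, not from the thread.
            docs.append({"id": path, "content_txt": extract(path)})
    return docs


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Prepare (but do not send) a JSON POST carrying the documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is left to the caller, for example:
#   urllib.request.urlopen(build_update_request(build_docs("/DATA_FOLDER")))
```

Because the client decides the id for every document it sends, nothing is left for Solr to generate, and each document can be traced back to its source file.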

Best,
Erick

> On Dec 18, 2019, at 11:15 AM, Jörn Franke <[hidden email]> wrote: