Not all EML files are indexing during indexing

Not all EML files are indexing during indexing

Zheng Lin Edwin Yeo
Hi,

I am running this on Solr 7.6.0.

I currently have a folder containing more than 2 million EML files. The
folder is constantly changing: existing EML files are updated with the
latest information and new EML files are added.

Each indexing run is supposed to index the new EML files and re-index
those whose content has changed. However, I found that not all new EML
files are indexed in each run.

Could this be caused by the large number of files in the folder, or is
there some other reason?

Regards,
Edwin

Re: Not all EML files are indexing during indexing

Charlie Hull-3
Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any
code for actually walking a folder, extracting the content from .eml
files and pushing this data into its index, so I'm guessing you've built
something external?

Charlie


(In reply to Zheng Lin Edwin Yeo's message of 01/06/2020 02:13.)

--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com


Re: Not all EML files are indexing during indexing

Zheng Lin Edwin Yeo
Hi Charlie,

The main indexing code is from Solr's SimplePostTool, but we have made
some modifications to it.

A PowerShell script walks the folder, the Tika that comes with Solr
extracts the content from each .eml file, and the OCR that comes with
Solr handles the images in the .eml files.

We modified the SimplePostTool code to check whether a file already
exists in the index by running a Solr search query on its ID. I'm
wondering whether the issue is caused by the PowerShell script, or by
the query in the SimplePostTool code, not being able to keep up with the
large number of files?
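
For illustration, an existence check of this kind can be sketched as a
single query on the unique key. This is a minimal Python sketch, not the
modified SimplePostTool code: the host, the core name `emails`, the use of
the file path as the `id` value, and the helper names are all assumptions.
Note that a search only sees committed documents, so a check that runs
before the previous batch is committed may report a file as missing.

```python
import json
import urllib.parse
import urllib.request

# Assumed host and core name; adjust to the real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/emails/select"


def build_exists_query(doc_id):
    """Return the query string for an ID lookup that fetches no rows."""
    # The {!term} parser matches the raw value, so paths need no escaping.
    params = {"q": "{!term f=id}" + doc_id, "rows": "0", "wt": "json"}
    return urllib.parse.urlencode(params)


def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def exists_in_index(doc_id, fetch=http_get_json):
    """True if at least one committed document has this id."""
    response = fetch(SOLR_SELECT_URL + "?" + build_exists_query(doc_id))
    return response["response"]["numFound"] > 0
```

With 2 million files this means 2 million round trips per run, which is
one reason to time this step separately from the rest of the pipeline.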

Regards,
Edwin


(In reply to Charlie Hull's message of Mon, 1 Jun 2020 at 17:19.)

Re: Not all EML files are indexing during indexing

Charlie Hull-3
Ah OK. I haven't used SimplePostTool myself and I note the docs say
"View this not as a best-practice code example, but as a standalone
example built with an explicit purpose of not having external jar
dependencies."

I'm wondering if it's some kind of synchronisation issue between new
files arriving in the folder and being picked up by your PowerShell
script. Hard to say really without seeing all the code... perhaps take
out the Tika & Solr parts for now and verify that the rest of your code
really can spot every new or updated file that arrives?
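
One way to run that check in isolation is to snapshot the folder twice and
compare the results. The following is a minimal Python sketch under
assumptions: the function names are hypothetical, and a file counts as
"changed" when its modification time or size differs, which may not match
the rule the PowerShell script uses.

```python
import os


def snapshot(folder, suffix=".eml"):
    """Map each matching file path to its (mtime_ns, size) at scan time."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def diff_snapshots(old, new):
    """Return (added, changed, removed) paths between two snapshots."""
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    removed = sorted(p for p in old if p not in new)
    return added, changed, removed
```

Taking a snapshot before and after a batch of known file changes, and
comparing the diff with what the PowerShell script reported, would show
whether the file-walking step is the one dropping files.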

If it was me I'd probably build a standalone indexer script in Python
that did the file handling, called out to a separate Tika service for
extraction, and posted the results to Solr.
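
A rough outline of such a script, as a sketch only: it assumes a Tika
server on localhost:9998, a Solr core named `emails`, a text field named
`content`, and the file path as the document `id`. None of these come
from the thread, and the function names are hypothetical.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"                       # assumed Tika server address
SOLR_UPDATE_URL = "http://localhost:8983/solr/emails/update"  # assumed core name


def extract_text(path, tika_url=TIKA_URL):
    """Send the raw file to the Tika server and return the extracted plain text."""
    with open(path, "rb") as fh:
        req = urllib.request.Request(
            tika_url, data=fh.read(), method="PUT",
            headers={"Accept": "text/plain", "Content-Type": "message/rfc822"},
        )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8", errors="replace")


def post_to_solr(docs, solr_url=SOLR_UPDATE_URL):
    """POST a batch of documents as a JSON array to Solr's update handler."""
    req = urllib.request.Request(
        solr_url + "?commitWithin=10000",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()


def index_files(paths, extract=extract_text, post=post_to_solr, batch_size=100):
    """Extract each file and post in batches; return the paths that failed."""
    failed, batch = [], []
    for path in paths:
        try:
            batch.append({"id": path, "content": extract(path)})
        except Exception:
            failed.append(path)  # record it and keep going
            continue
        if len(batch) >= batch_size:
            post(batch)
            batch = []
    if batch:
        post(batch)
    return failed
```

Returning the list of failed paths, rather than stopping or silently
skipping, gives each run an explicit record of which files did not make
it into the index.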

Cheers


Charlie





(In reply to Zheng Lin Edwin Yeo's message of 02/06/2020 14:48.)

Re: Not all EML files are indexing during indexing

Walter Underwood

> On Jun 2, 2020, at 7:40 AM, Charlie Hull <[hidden email]> wrote:
>
> If it was me I'd probably build a standalone indexer script in Python that did the file handling, called out to a separate Tika service for extraction, posted to Solr.

I would do the same thing, and I would base that script on Scrapy (https://scrapy.org). I worked on a Python-based web spider for about ten years.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


Re: Not all EML files are indexing during indexing

Charlie Hull-3
I think the OP is indexing flat files, not web pages (but otherwise, I
agree with you that Scrapy is great - I know some of the people behind
it too and they're a good bunch).

Charlie

(In reply to Walter Underwood's message of 02/06/2020 16:41.)