Feasability

Feasability

Chris Manu
Hello,


I want to start off by saying that I am not a programmer and have very little knowledge in this area.

What I would like to know is whether Apache would be capable of doing the following:

Take an extensive list (A) of strings of unique words (these are titles, anywhere from 4 to 30 words long), saved in either an Excel worksheet or a text file, and search for instances (B) where these can be found in PDF files saved on a hard drive (over 100k files). The search would need to use fuzzy logic rather than exact matching, and the output would be an Excel file listing the unique string found (A), the name of the file in which the match was made (B), the page number where the match was made, and the surrounding text on either side of the match.

As well, would this be a complicated program, and would it be usable by novices coached in the process necessary to input the title file (A) and direct the search to the relevant folder containing the PDF files (B)?


I eagerly await (hopefully) an affirmative answer.


Cheers!

Re: Feasability

Xavier Morera
The answer is yes, but you would need to do some programming and
configuring.

On Wed, Nov 30, 2016 at 7:54 PM, Chris Manu <[hidden email]> wrote:

> Hello,
>
>
> I want to start off by saying that I am not a programmer...and have very
> little knowledge in this area.
>
>
> What I would like to know if Apache would be capable of doing the
> following:
>
> Take an extensive list (A) of strings of unique words (these are titles -
> anywhere from 4 words to 30) saved in either an Excel worksheet or in a
> text file and search for instances (B) where these can be found in PDF
> files saved on a hard drive (over 100k files). The search would need to be
> done using a fuzzy logic rather than exact matching and the output would be
> in an Excel file list the unique string found (A), the file name in which
> the match was made (B), the page number where the match was made and the
> surrounding text on either side of As well, would this be a complicated
> program, usable by novices coached in the process necessary to input the
> title file (A) and direct the search to the relevant folder containing the
> PDF files (B).
>
>
> I eagerly await (hopefully) an affirmative answer.
>
>
> Cheers!
>
>


--

*Xavier Morera*

Entrepreneur | Author & Trainer | Consultant | Developer & Scrum Master

*www.xaviermorera.com <http://www.xaviermorera.com/>*

office:  (305) 600-4919

cel:     +506 8849-8866

skype: xmorera
Twitter <https://twitter.com/xmorera> | LinkedIn
<https://www.linkedin.com/in/xmorera> | Pluralsight Author
<http://www.pluralsight.com/author/xavier-morera>

Re: Feasability

Chris Manu
Thank you for responding. So, theoretically, I would need to hire someone with Apache programming experience to do this correctly (given that I know nothing about programming)? What type of experience should I look for?


________________________________
From: Xavier Morera <[hidden email]>
Sent: December 1, 2016 2:23 AM
To: [hidden email]
Subject: Re: Feasability

The answer is yes, but you would need to do some programming and
configuring.


Re: Feasability

Reda Kouba
Someone with good programming experience and a good knowledge of Lucene and information retrieval (IR).

best,
reda

> On 1 Dec. 2016, at 14:33, Chris Manu <[hidden email]> wrote:
>
> Thank you for responding. So, theoretically, I would need to hire someone with Apache programing experience to do this correct (given that I know nothing about programing)? What type of experience should I look for?

Re: Feasability

Xavier Morera
Yes, I would say you need someone with Solr experience and some development skills.

Xavier

--------------------------------------------
Sent from a small attention grabbing screen


> On Nov 30, 2016, at 21:37, Reda Kouba <[hidden email]> wrote:
>
> Someone with a good experience in programming and a good knowledge of Lucene and IR.
>
> best,
> reda

Re: Feasability

Charlie Hull-3
Hi,

You have several options:

1. Try to recruit someone with existing Solr skills (hard: there is a
skills shortage, certainly in the UK and I suspect worldwide, and good
Solr people are rare and expensive).
2. Try to recruit someone with good enterprise Java skills and, ideally,
some interest in search and Solr, and train them up (slightly easier but a
longer timescale; various organisations provide Solr training/mentoring,
including my own).
3. Engage a consultancy like us with pre-existing experience in Solr
development to build what you need (much quicker, but obviously there's a
cost).
4. Buy a 'packaged' Solr-based search engine such as Fusion from our
partner Lucidworks, which will save a lot of time/effort developing
something (very quick, but with both an initial and an ongoing
subscription cost).

Hope this helps!

Charlie
Flax
www.flax.co.uk

On 1 December 2016 at 05:21, Xavier Morera <[hidden email]> wrote:

> Yes, you need someone I would say with Solr and some sort of development
> skills.
>
> Xavier

Re: Feasability

Jeremy Branham
In reply to this post by Chris Manu
Someone like this maybe?
https://www.linkedin.com/in/jeremybranham

;)



Sent from my Sprint Phone.


Re: Feasability

Xavier Morera
In reply to this post by Chris Manu
I have two courses and a book on Solr, aimed at getting started. If you
watch the first part of both courses you could get a better idea of what
needs to be done. They are on Pluralsight:

Getting Started with Enterprise Search Using Apache Solr
<https://www.pluralsight.com/courses/enterprise-search-using-apache-solr>

Implementing Search in .NET Applications
<https://www.pluralsight.com/courses/implementing-search-dotnet-applications>


Re: Feasability

Ted Dunning
In reply to this post by Chris Manu
On Thu, Dec 1, 2016 at 12:33 PM, Chris Manu <[hidden email]> wrote:

> Thank you for responding. So, theoretically, I would need to hire someone
> with Apache programing experience to do this correct (given that I know
> nothing about programing)? What type of experience should I look for?
>

Chris,

In addition to the Solr recommendation that you are hearing (which is a
fine one), you should expand your search to include Elasticsearch.
Elasticsearch is built on Apache Lucene, but the overall system is not
itself an Apache project.

What you describe (pulling words from one place, finding them in another)
is very doable with Apache software.
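
As a rough illustration of the search half of this, here is a minimal sketch of how one title could be turned into a fuzzy Solr query. It uses only the Python standard library to build the request URL and does not contact a server. The fuzzy operator (`word~1`) and the `hl` highlighting parameters are standard Lucene/Solr query features; the collection name `pdfs`, the field names `content` and `id`, and the local URL are assumptions made up for the example. It also assumes the indexed text was lowercased at index time.

```python
import re
from urllib.parse import urlencode


def fuzzy_title_query(title, field="content", max_edits=1):
    """Build a Lucene-syntax query in which every word of the title
    must match within max_edits character edits."""
    if max_edits not in (0, 1, 2):
        raise ValueError("Lucene fuzzy queries accept an edit distance of 0, 1 or 2")
    # Keep only word characters so no query-syntax characters need escaping.
    words = re.findall(r"\w+", title.lower())
    if not words:
        raise ValueError("title contains no searchable words")
    terms = " AND ".join(f"{word}~{max_edits}" for word in words)
    return f"{field}:({terms})"


def select_url(title, base="http://localhost:8983/solr/pdfs/select"):
    """Return a select URL asking for matching document ids plus
    highlighted snippets (the text surrounding each match)."""
    params = {
        "q": fuzzy_title_query(title),
        "fl": "id",            # hypothetical field holding the file name
        "hl": "true",          # ask Solr for surrounding-text snippets
        "hl.fl": "content",    # hypothetical field holding the PDF text
        "rows": 10,
        "wt": "json",
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(fuzzy_title_query("Annual Report: 2016 Results"))
    print(select_url("Annual Report: 2016 Results"))
```

For the title above, the query string before URL encoding is `content:(annual~1 AND report~1 AND 2016~1 AND results~1)`. Very short words tolerate a one-character edit poorly (many unrelated words are one edit away from "of"), so a real setup would probably treat them differently.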

In addition to the search function, you should look at the PDFBox project
for extracting data from PDF files. The Apache POI project has software
that will help you get data from Excel files.

Re: Feasability

Alex Ott
I would recommend using Apache Tika if you need to extract text from files
of different types. Going to PDFBox or POI is only required if you need to
dig into the internals of these file formats; if you only need text, then
Tika will be the easier choice.
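
To make the overall flow concrete, here is a small sketch of the matching and reporting step once the text has been extracted. It assumes the text of each PDF is already available as a list of page strings per file (for example from Tika or PDFBox; the extraction itself is not shown). It uses Python's standard `difflib` as a stand-in for fuzzy matching, which is not how Solr or Lucene score matches, and it writes CSV, which Excel can open. The names `pages_by_file`, `threshold` and `context_words` are made up for the illustration.

```python
import csv
import io
from difflib import SequenceMatcher


def find_fuzzy_matches(titles, pages_by_file, threshold=0.85, context_words=5):
    """Yield (title, file name, page number, surrounding text) for each
    page on which a title approximately appears."""
    for file_name, pages in pages_by_file.items():
        for page_no, text in enumerate(pages, start=1):
            words = text.split()
            for title in titles:
                n = len(title.split())
                target = title.lower()
                # Slide a window of the title's length across the page.
                for i in range(max(1, len(words) - n + 1)):
                    window = " ".join(words[i:i + n]).lower()
                    score = SequenceMatcher(None, target, window, autojunk=False).ratio()
                    if score >= threshold:
                        lo = max(0, i - context_words)
                        context = " ".join(words[lo:i + n + context_words])
                        yield title, file_name, page_no, context
                        break  # report only the first hit per title and page


def write_report(rows, out):
    """Write the matches as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["title", "file", "page", "context"])
    writer.writerows(rows)


if __name__ == "__main__":
    titles = ["Fuzzy Matching of Document Titles"]
    pages_by_file = {
        "a.pdf": [
            "Intro text here.",
            "We discuss fuzy matching of document titles in depth today.",
        ]
    }
    buffer = io.StringIO()
    write_report(find_fuzzy_matches(titles, pages_by_file), buffer)
    print(buffer.getvalue())
```

This brute-force comparison checks every title against every window on every page, so it would be slow on 100k files with a long title list. That cost is the main reason to index the extracted text in a search engine and query the index instead.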

On Sun, Dec 4, 2016 at 4:01 AM, Ted Dunning <[hidden email]> wrote:

> Chris,
>
> In addition to the Solr recommendation that you are hearing (which is a
> fine one), you should expand your search to include Elasticsearch.
> Elasticsearch is based on Apache software, but is not itself an Apache
> project for the overall system.
>
> What you describe (pulling words from one place, finding them in another)
> is very doable with Apache software.
>
> In addition to the search function, you should look at the PdfBox project
> for extracting data from PDF files. The Apache POI project has software
> that will help you get data from excel files.
>



--
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott