Solr web crawler with recursive option

7 messages

Solr web crawler with recursive option

Shivprasad Shetty
Hello Team,


I am working with Solr for the first time and have completed the setup. I have created a core from the command line and now want to crawl a third-party site.
If I try individual links, I am able to crawl and index them into the core. This was done using:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com

Now I want to supply a single URL and, using the recursive option (-Drecursive), let the tool crawl the entire site.
Note that I am pointing at a website with around 125 pages, and I am using the commands below:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com

I then get the following error:


POSTed web resource http://www.example.com (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
        ... 5 more



I would be very grateful if anyone could help me solve this issue; I have been trying to fix it for a couple of days.
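Since individual links do post successfully, one possible stop-gap (an editor's sketch; urls.txt and the page URLs are hypothetical, and the actual post.jar invocation is left commented out) is to loop over a hand-maintained list of page URLs and post each one individually instead of relying on -Drecursive:

```shell
# Build a list of the pages to index (hypothetical URLs for illustration).
printf '%s\n' \
  'http://www.example.com/page1' \
  'http://www.example.com/page2' > urls.txt

# Post each page one at a time, the same way the working single-link
# command does. The java line is commented out here because it assumes
# a live Solr endpoint.
while read -r url; do
  echo "posting: $url"
  # java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar "$url"
done < urls.txt
```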


Regards,
ShivprasadS


Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail, delete and then destroy all copies of the original message.

Re: Solr web crawler with recursive option

Alexandre Rafalovitch
In reply to this post by Shivprasad Shetty
One of the files the post tool identified as XML is not actually XML. Possibly a 404 error page or some such, so the tool is trying to parse the file and sees non-XML content right at the start. Or, if you are sure it is an XML file, maybe there is a BOM (byte-order mark). Either way, try to isolate the specific file.

On the bigger picture, though: if crawling is an actual part of the project rather than just a test, you should use a proper crawler that integrates with Solr, such as Apache Nutch or StormCrawler.

Regards,
     Alex
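To the point above about isolating the specific file, a small sketch of how one might check a fetched page for a UTF-8 BOM. The page here is fabricated locally (page.html is a made-up name); in practice you would first save the real page, e.g. with curl -s http://www.example.com/somepage -o page.html:

```shell
# Fabricate a "fetched" page that starts with a UTF-8 BOM (octal
# \357\273\277 = bytes ef bb bf) followed by ordinary HTML.
printf '%b' '\0357\0273\0277<!DOCTYPE html><html></html>' > page.html

# Hex-dump the first three bytes; a UTF-8 BOM shows up as "ef bb bf".
head -c 3 page.html | od -An -tx1

# Flag the page if it starts with a BOM.
if head -c 3 page.html | od -An -tx1 | grep -q 'ef bb bf'; then
  echo "page.html starts with a UTF-8 BOM"
fi
```

The same hex-dump would also expose other junk before the markup, such as the body of an HTML error page where XML was expected.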

On Thu, Apr 11, 2019, 6:09 AM Shivprasad Shetty, <[hidden email]>
wrote:


Re: Solr web crawler with recursive option

Erick Erickson
In reply to this post by Shivprasad Shetty
You are sending malformed XML to Solr. This can be something as silly as extra spaces at the beginning. I'd capture the page being sent to Solr and run it through a formatter to check it…
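A sketch of that capture-and-check step. Here captured.html is fabricated with a single leading space, which is already enough to produce a "Content is not allowed in prolog"-style error; in a real run you would capture the actual page first, e.g. curl -s http://www.example.com -o captured.html:

```shell
# Fabricate a captured page whose XML declaration is preceded by a space,
# making it ill-formed XML despite looking fine at a glance.
printf ' <?xml version="1.0"?><root/>' > captured.html

# python3's bundled XML parser reports the same class of error the
# post tool's DOM parser hits.
python3 -c '
import xml.dom.minidom
try:
    xml.dom.minidom.parse("captured.html")
    print("well-formed")
except Exception as e:
    print("parse error:", e)
'
```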

Best,
Erick

> On Apr 11, 2019, at 3:49 AM, Shivprasad Shetty <[hidden email]> wrote:

Re: Solr web crawler with recursive option

Jan Høydahl / Cominvent
I think there may actually be a bug. I was not able to crawl some other website either.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 11 Apr 2019, at 18:55, Erick Erickson <[hidden email]> wrote:

Re: Solr web crawler with recursive option

Andrew MacKay
You should look at Apache Nutch, which has Solr client support; it has all the indexing options you need and provides a schema for building a Solr collection with all the fields required for indexing.

We have used it and it works well; it also supports sitemap.xml to simplify indexing.

On Fri, Apr 12, 2019 at 6:43 AM Jan Høydahl <[hidden email]> wrote:


--
CONFIDENTIALITY NOTICE: The information contained in this email is
privileged and confidential and intended only for the use of the individual
or entity to whom it is addressed.   If you receive this message in error,
please notify the sender immediately at 613-729-1100 and destroy the
original message and all copies. Thank you.