encode special characters in url

encode special characters in url

Jun Zhou
Hi all,

I'm using Nutch 1.6 to crawl a web site that has lots of special
characters in its URLs, like "?,=@" etc.  For each character, I can add a
regex to regex-normalize.xml to change it into its percent-encoded form.

My question is: is there an easier way to do this? Something like a url-encode
method that encodes all the special characters, rather than adding regexes one by one?

Thanks!
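
A minimal sketch of the kind of "url-encode method" asked about, in plain Java and independent of Nutch. It assumes that only individual query-parameter values need encoding; the class name QueryValueEncoder, the helper encodeValue, and the example.com URL are made up for illustration. Note that "?" and "=" normally act as delimiters in a URL, so encoding a whole URL string in one call would change its meaning. Encoding one component at a time avoids that.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Nutch.
final class QueryValueEncoder {

    // Percent-encodes a single query parameter VALUE (not a whole URL).
    static String encodeValue(String value) {
        try {
            // URLEncoder produces form encoding, where a space becomes '+'.
            // A literal '+' in the input is already emitted as %2B, so any
            // '+' left in the output stands for a space and can become %20.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed example URL; only the value after "q=" is encoded.
        String url = "http://example.com/search?q=" + encodeValue("a,b=c@d?");
        System.out.println(url);
        // prints http://example.com/search?q=a%2Cb%3Dc%40d%3F
    }
}
```

The delimiters that give the URL its structure (the first "?" and the "=" after "q") are left alone, and only the characters inside the value are percent-encoded.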

Re: encode special characters in url

Rajinimaski
Hi,

 I think this thread should be useful:
http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html



Thanks & Regards
Rajani Maski



On Sun, Apr 7, 2013 at 4:56 AM, Jun Zhou <[hidden email]> wrote:

> Hi all,
>
> I'm using nutch 1.6 to crawl a web site which have lots of special
> characters in the url, like "?,=@" etc.  For each character, I can add a
> regex in the regex-normalize.xml to change it into percent encoding.
>
> My question is, is there an easier way to do this? Like a url-encode method
> to encode all the special characters rather than add regex one by one?
>
> Thanks!
>

Re: encode special characters in url

amuseme
Hi Jun

Can you use one regex pattern to match all the special cases? Or maybe you
can write your own URL normalizer plugin to fit your requirements.
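
Both ideas can be sketched together in plain Java: one character-class pattern that matches every special character, and a method with the same (urlString, scope) shape that a Nutch URL normalizer's normalize method takes. This is only an illustration under assumptions. The class name SpecialCharNormalizer is hypothetical, and the character set is copied from the question and would need trimming for a real crawl, since "?" and "=" usually delimit the query string. A real plugin would, as far as I know, implement org.apache.nutch.net.URLNormalizer and be registered through a plugin.xml; that wiring is left out and should be checked against the Nutch version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; it does not depend on Nutch.
final class SpecialCharNormalizer {

    // One pattern for all characters to encode. The set "?,=@" comes from
    // the question and is an assumption, not a recommendation.
    private static final Pattern SPECIAL = Pattern.compile("[?,=@]");

    // Mirrors the (urlString, scope) parameters of a normalizer method.
    // The scope argument is ignored in this sketch.
    static String normalize(String urlString, String scope) {
        Matcher m = SPECIAL.matcher(urlString);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Percent-encode each UTF-8 byte of the matched character.
            StringBuilder enc = new StringBuilder();
            for (byte b : m.group().getBytes(StandardCharsets.UTF_8)) {
                enc.append(String.format("%%%02X", b & 0xFF));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(enc.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/a,b@c", null));
        // prints http://example.com/a%2Cb%40c
    }
}
```

Extending the character class is then a one-line change, instead of one regex rule per character.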


On Wed, Apr 10, 2013 at 8:17 PM, Rajani Maski <[hidden email]> wrote:

> Hi,
>
>  I think this thread should be useful:
>
> http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html
>
>
>
> Thanks & Regards
> Rajani Maski



--
Don't Grow Old, Grow Up... :-)

Re: encode special characters in url

Jun Zhou
In reply to this post by Rajinimaski
Thanks Rajani!

Actually, the problem is special characters in the URL, not in the content.
Thanks anyway!


On Wed, Apr 10, 2013 at 5:17 AM, Rajani Maski <[hidden email]> wrote:

> Hi,
>
>  I think this thread should be useful:
>
> http://lucene.472066.n3.nabble.com/Parsed-content-in-form-of-special-characters-td4047239.html
>
>
>
> Thanks & Regards
> Rajani Maski

Re: encode special characters in url

Jun Zhou
In reply to this post by amuseme
Thanks, Feng.

I thought this might be a common problem when using Nutch. I'll try your
suggestions.


Best Regards,
Jun Zhou
University of Southern California
http://www-scf.usc.edu/~junzhou


On Wed, Apr 10, 2013 at 7:11 AM, feng lu <[hidden email]> wrote:

> Hi Jun
>
> Can you use one regex pattern to match all special situations. or maybe you
> can extend your own url normalizer plugin to fit your requirement.