outlink extractor finds lots of junk

AJ Chen-2
During fetching, OutlinkExtractor.getOutlinks() finds lots of junk, such as
the following:
rdf:about=
xmlns:pdf=
http://ns.adobe.com/pdf/1.3/
pdf:Producer
pdf:Producer
rdf:Description
rdf:Description
rdf:about=
xmlns:xap=
http://ns.adobe.com/xap/1.0/
xap:CreatorTool
xap:CreatorTool
xap:ModifyDate
T14:43:23-07:00

This happens because the defined URL_PATTERN matches strings that are not web
links. Is there a fix for it? Is there a way to set the protocols (e.g. http,
https) for the desired outlinks, so that only links using the specified
protocols are treated as outlinks? I'm using the 0.9-dev code.
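
To make the request concrete, here is a rough sketch of the kind of protocol filter I have in mind. It is not Nutch code: the class OutlinkProtocolFilter and the method filterByProtocol are hypothetical names, and it works on plain strings rather than on Nutch's Outlink objects. It assumes a candidate counts as a web link only if it has the form scheme://... with the scheme in an allowed set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps only candidates whose
// scheme is in the allowed set and is followed by "//".
final class OutlinkProtocolFilter {

    private OutlinkProtocolFilter() {
    }

    static List<String> filterByProtocol(List<String> candidates, Set<String> allowedSchemes) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate == null) {
                continue;
            }
            int colon = candidate.indexOf(':');
            if (colon <= 0) {
                continue; // no scheme at all
            }
            // Schemes are case-insensitive, so compare in lower case.
            String scheme = candidate.substring(0, colon).toLowerCase(Locale.ROOT);
            // Require "//" after the colon so that strings such as
            // "pdf:Producer" or "T14:43:23-07:00" are rejected.
            if (allowedSchemes.contains(scheme) && candidate.startsWith("//", colon + 1)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With {"http", "https"} as the allowed set, this would drop entries like rdf:about=, pdf:Producer and T14:43:23-07:00 from the list above. It would still keep http://ns.adobe.com/pdf/1.3/, since that is a well-formed http URL even though it is only an XML namespace, so a protocol filter alone would not remove all of the junk.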

Thanks,
--
AJ Chen, PhD
http://web2express.org