Links in documents

thorsten
Hi all,

I am looking at
http://incubator.apache.org/tika/apidocs/org/apache/tika/parser/Parser.html
and wonder how I can extract links from the parsed document?

I ask because I want to use tika for the parsing part of droids instead
of the custom code I have implemented ATM.

As I understand the API, I pass a SAX handler, the stream and a metadata
object to the parser. The parser then populates the handler and the
metadata object by parsing the stream.

An essential part of a crawler is extracting links from the documents it
parses. How does extracting out-links fit into tika?

I mean, I can use tika to populate the SAX handler and extract the
links from there, or would it make sense to do it in the parser?

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions

Re: Links in documents

Jukka Zitting-3
Hi,

On Tue, Mar 18, 2008 at 9:54 PM, Thorsten Scherler <[hidden email]> wrote:
>  An essential part of a crawler is extracting links from the documents it
>  parses. How does extracting out-links fit into tika?
>
>  I mean, I can use tika to populate the SAX handler and extract the
>  links from there, or would it make sense to do it in the parser?

Good point. So far we've mostly focused on extracting just the text
content of the documents with indexing as the main use case in mind.
However, the crawler use case is important and Tika should also
support extraction of links from documents.

Currently I think only the HTML parser preserves links as <A
href="..">...</A> tags, and you can catch them from the SAX events
generated by Tika. However, the HTML parser should really be producing
<a href="...">...</a> tags using the standard XHTML namespace.

We should add similar support also for the other parsers that can
extract link information from documents.

PS. There is also the ParserPostProcessor decorator that uses a regexp
to detect links within the document text, and populates the "outlinks"
metadata property with the detected links. I think we should replace
that functionality with a ContentHandler decorator that does the same
thing using <a href="..."/> tags.
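
For illustration, the regexp approach mentioned in the PS can be sketched
roughly as follows. The class name RegexOutlinks and the pattern itself are
assumptions made for this example; this is not the actual
ParserPostProcessor code, only a minimal sketch of detecting http(s) links
in already extracted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika: finds http(s) URLs in plain text. */
final class RegexOutlinks {
    // Deliberately simple: scheme, then everything up to whitespace, a quote or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s<>\"']+");

    static List<String> find(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String url = m.group();
            // Drop trailing punctuation that usually belongs to the sentence, not the URL.
            while (!url.isEmpty() && ".,;:!?)".indexOf(url.charAt(url.length() - 1)) >= 0) {
                url = url.substring(0, url.length() - 1);
            }
            links.add(url);
        }
        return links;
    }
}
```

A caller could then store the returned list wherever it likes, for example
in a metadata property such as "outlinks". A regexp like this only sees
URLs that appear in the text itself, which is one reason to prefer the
<a href="..."/> based approach.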

BR,

Jukka Zitting

Re: Links in documents

Jukka Zitting-3
Hi,

On Tue, Mar 18, 2008 at 10:28 PM, Jukka Zitting <[hidden email]> wrote:
>  Currently I think only the HTML parser preserves links as <A
>  href="..">...</A> tags, and you can catch them from the SAX events
>  generated by Tika. However, the HTML parser should really be producing
>  <a href="...">...</a> tags using the standard XHTML namespace.

I fixed that with TIKA-128, so you can now use something like this:

    final List<String> links = new ArrayList<String>();
    handler = new TeeContentHandler(handler, new DefaultHandler() {
        public void startElement(
                String uri, String local, String name, Attributes attributes) {
            String ns = XHTMLContentHandler.XHTML;
            if (ns.equals(uri) && "a".equals(local)) {
                String href = attributes.getValue("", "href"); // href is in no namespace
                if (href != null) {
                    links.add(href);
                }
            }
        }
    });
    parser.parse(stream, handler, metadata);

Perhaps we should turn that into a Tika utility class, something like
org.apache.tika.sax.LinkContentHandler. Is it important just to get a
list of link URIs, or would the link text also be interesting to a
typical client?
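
As a rough idea of what such a utility class could look like, here is a
minimal sketch that depends only on the JDK's SAX classes. The name
LinkCollector and the collect() helper are hypothetical and not an existing
Tika API. The sketch assumes a namespace-aware SAX source, which reports
unprefixed attributes such as href with an empty namespace URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical utility, not an existing Tika class: collects href values of XHTML a elements. */
final class LinkCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String name, Attributes attributes) {
        if (XHTML.equals(uri) && "a".equals(local)) {
            // Unprefixed attributes are reported with an empty namespace URI.
            String href = attributes.getValue("", "href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    /** Demo driver: parses a well-formed XHTML string with the JDK's SAX parser. */
    static List<String> collect(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            LinkCollector collector = new LinkCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getLinks();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

With Tika, the same handler instance would simply be passed to the parser
(possibly through a tee, as in the snippet above) instead of going through
collect().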

BR,

Jukka Zitting

Re: Links in documents

thorsten
On Wed, 2008-03-19 at 11:07 +0200, Jukka Zitting wrote:

> Hi,
>
> On Tue, Mar 18, 2008 at 10:28 PM, Jukka Zitting <[hidden email]> wrote:
> >  Currently I think only the HTML parser preserves links as <A
> >  href="..">...</A> tags, and you can catch them from the SAX events
> >  generated by Tika. However, the HTML parser should really be producing
> >  <a href="...">...</a> tags using the standard XHTML namespace.
>
> I fixed that with TIKA-128, so you can now use something like this:
>
>     final List<String> links = new ArrayList<String>();
>     handler = new TeeContentHandler(handler, new DefaultHandler() {
>         public void startElement(
>                 String uri, String local, String name, Attributes attributes) {
>             String ns = XHTMLContentHandler.XHTML;
>             if (ns.equals(uri) && "a".equals(local)) {
>                 String href = attributes.getValue("", "href"); // href is in no namespace
>                 if (href != null) {
>                     links.add(href);
>                 }
>             }
>         }
>     });
>     parser.parse(stream, handler, metadata);
>
> Perhaps we should turn that into a Tika utility class, something like
> org.apache.tika.sax.LinkContentHandler.

Sounds good. We should add more elements for the link recognition,
though.

I mean ATM we are looking for <a/>, but for a crawler that scrapes the
whole page all external resources are links and need to be saved.

Meaning for xhtml the following elements are important:
- <img src="..."/>
- <link href="..."/>
- <script src="..."/>
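
To make the list above concrete, here is a minimal sketch of a handler that
is driven by a table of element names and the attribute that carries the
URI. The class name ResourceLinkCollector and the table are assumptions for
illustration only, and the demo driver uses the JDK's SAX parser on
well-formed XHTML rather than Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects URIs from a, link, img and script elements. */
final class ResourceLinkCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    // Element name -> attribute that carries the URI.
    private static final Map<String, String> LINK_ATTRIBUTES =
            Map.of("a", "href", "link", "href", "img", "src", "script", "src");

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String name, Attributes attributes) {
        if (!XHTML.equals(uri)) {
            return;
        }
        String attribute = LINK_ATTRIBUTES.get(local);
        if (attribute != null) {
            String value = attributes.getValue("", attribute);
            if (value != null) {
                links.add(value);
            }
        }
    }

    /** Demo driver: returns the collected URIs in document order. */
    static List<String> collect(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ResourceLinkCollector collector = new ResourceLinkCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

Keeping the element/attribute pairs in one table means further elements can
be added without touching the handler logic.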

In css files there can as well be links to either images or other css
files.

> Is it important just to get a
> list of link URIs, or would the link text also be interesting to a
> typical client?

For a crawler IMO only the link.

Thanks for the update, I will have a closer look ASAP.

salu2

>
> BR,
>
> Jukka Zitting
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions

Re: Links in documents

Jukka Zitting-3
Hi,

On Wed, Mar 19, 2008 at 11:33 PM, Thorsten Scherler <[hidden email]> wrote:

>  Sounds good. We should add more elements for the link recognition,
>  though.
>
>  I mean ATM we are looking for <a/>, but for a crawler that scrapes the
>  whole page all external resources are links and need to be saved.
>
>  Meaning for xhtml the following elements are important:
>  - <img src="..."/>
>  - <link href="..."/>
>  - <script src="..."/>

Good point!

I'm wondering how best to handle those in Tika, since <img/> and
<script/> tags don't really have much meaning in the scope of text
extraction. Perhaps we should map <img src="..." alt="..."/> to <a
href="...">...</a> or something like that to keep the client view
simple.

Not sure what to do with <script/> tags, perhaps those links should go
into a metadata property? I don't think inline scripts should be part of
the extracted text content (but others may disagree), so script links
should probably also not be included in the XHTML output.

>  In css files there can as well be links to either images or other css
>  files.

We don't currently have explicit CSS parser support in Tika, the plain
text extractor comes closest. I'll see if we could add something that
would allow easy detection of links within CSS files.
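
One possible shape for such link detection in CSS, as a minimal sketch: a
regexp that picks up url(...) references and quoted @import targets. The
class name CssLinks and the pattern are assumptions for this example; the
pattern does not understand CSS comments or escapes, so it is no
replacement for a real CSS parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds url(...) and @import "..." references in CSS text. */
final class CssLinks {
    // Alternative 1: @import "x" or @import 'x'  (groups 1 and 2)
    // Alternative 2: url(x), url("x"), url('x')  (groups 3 and 4)
    private static final Pattern REFERENCE = Pattern.compile(
            "@import\\s+(['\"])(.*?)\\1|url\\(\\s*(['\"]?)(.*?)\\3\\s*\\)",
            Pattern.CASE_INSENSITIVE);

    static List<String> find(String css) {
        List<String> links = new ArrayList<>();
        Matcher m = REFERENCE.matcher(css);
        while (m.find()) {
            // Exactly one of the two alternatives matched; pick its capture group.
            String target = (m.group(2) != null) ? m.group(2) : m.group(4);
            target = target.trim();
            if (!target.isEmpty()) {
                links.add(target);
            }
        }
        return links;
    }
}
```

Because both forms are matched by a single pattern, the references come
back in the order in which they appear in the style sheet.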

On a related note, currently there's no base URI support in Tika. We
probably should add that, and treat a dc:identifier URI (if available)
as the base unless one has been explicitly specified in the document.
Also, if a base URI is available, Tika should automatically make all
relative URIs absolute to make client life easier.
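
Making relative URIs absolute is straightforward with java.net.URI, so a
minimal sketch may help. The class name LinkResolver is hypothetical, and
treating a null base as "leave the link untouched" is an assumption of this
example, not existing Tika behaviour.

```java
import java.net.URI;

/** Hypothetical helper: resolves a link against an optional base URI. */
final class LinkResolver {
    static String resolve(String base, String href) {
        if (base == null) {
            return href; // no base URI known: report the link as found
        }
        try {
            // URI.resolve handles "../", plain relative paths and already absolute URIs.
            return URI.create(base).resolve(href).toString();
        } catch (IllegalArgumentException e) {
            return href; // malformed input: leave the link as found
        }
    }
}
```

For example, resolving "../img/logo.png" against
"http://example.org/docs/index.html" gives
"http://example.org/img/logo.png", while an already absolute link is
returned unchanged.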

BR,

Jukka Zitting

Re: Links in documents

thorsten
On Wed, 2008-03-19 at 23:50 +0200, Jukka Zitting wrote:

> Hi,
>
> On Wed, Mar 19, 2008 at 11:33 PM, Thorsten Scherler <[hidden email]> wrote:
> >  Sounds good. We should add more elements for the link recognition,
> >  though.
> >
> > >  I mean ATM we are looking for <a/>, but for a crawler that scrapes the
> > >  whole page all external resources are links and need to be saved.
> >
> >  Meaning for xhtml the following elements are important:
> >  - <img src="..."/>
> >  - <link href="..."/>
> >  - <script src="..."/>
>
> Good point!
>
> I'm wondering how best to handle those in Tika, since <img/> and
> <script/> tags don't really have much meaning in the scope of text
> extraction. Perhaps we should map <img src="..." alt="..."/> to <a
> href="...">...</a> or something like that to keep the client view
> simple.

Maybe something like
http://svn.apache.org/repos/asf/labs/droids/trunk/src/core/java/org/apache/droids/parse/Outlink.java

You said in another mail that the outlinks are stored in a metadata
object. Why not store them in a dedicated object instead? I guess besides
the depth variable (which does not make an awful lot of sense for tika)

>
> Not sure what to do with <script/> tags, perhaps those links should go
> into a metadata property?

IMO all links should go into the same outlink object. Tika should not
treat them any further, just report them.

> I don't think inline scripts should be part of
> the extracted text content (but others may disagree), so script links
> should probably also not be included in the XHTML output.

I agree. However, AJAX is becoming more popular and some page content
can only be reached via scripts. Not sure where this leaves us.

>
> >  In css files there can as well be links to either images or other css
> >  files.
>
> We don't currently have explicit CSS parser support in Tika, the plain
> text extractor comes closest. I'll see if we could add something that
> would allow easy detection of links within CSS files.

http://svn.apache.org/repos/asf/forrest/trunk/main/webapp/resources/chaperon/
In forrest we are using chaperon for this, but I'm not sure whether tika
should parse css files. It is similar to inline scripts, isn't it?

>
> On a related note, currently there's no base URI support in Tika. We
> probably should add that, and treat a dc:identifier URI (if available)
> as the base unless one has been explicitly specified in the document.
> Also, if a base URI is available, Tika should automatically make all
> relative URIs absolute to make client life easier.

IMO that should be the problem of the client, because there are
situations where the client may prefer relative links.

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions

Re: Links in documents

Jukka Zitting-3
Hi,

On Thu, Mar 20, 2008 at 11:56 PM, Thorsten Scherler
<[hidden email]> wrote:
>  You said in another mail that the outlinks are stored in a metadata
>  object. Why not store them in a dedicated object instead? I guess besides
>  the depth variable (which does not make an awful lot of sense for tika)

I'd prefer to keep the Tika interfaces as general as possible and
avoid special cases that only serve a single use case. If there's a
reasonable way for a parser to report outgoing links through the
existing API, then I think we should use it. We can then of course add
a utility class that takes such information and presents it in a more
convenient way to a specific use case or client, but the basic
mechanism should be as generic as possible.

For example, while (AFAIUI) you're only interested in getting the set
of outgoing URIs in a document, some other crawler/indexers might also
want to know the link text and perhaps even the surrounding words as
context information to use when indexing the target document. Instead
of extending an explicit Outlink class with such information (it's not
clear what all information would be needed), I'd rather use the
existing XHTML SAX stream for that.
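
As an illustration of getting more than the URI out of the SAX stream, here
is a minimal sketch of a handler that also records the link text. The names
AnchorTextCollector and Link are hypothetical, and the demo driver uses the
JDK's SAX parser on well-formed XHTML instead of a Tika parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records each link's URI together with its text. */
final class AnchorTextCollector extends DefaultHandler {
    /** One outgoing link: target URI plus the text inside the a element. */
    static final class Link {
        final String uri;
        final String text;
        Link(String uri, String text) { this.uri = uri; this.text = text; }
    }

    private static final String XHTML = "http://www.w3.org/1999/xhtml";
    private final List<Link> links = new ArrayList<>();
    private String currentHref;        // set while inside an a element with href
    private StringBuilder currentText; // non-null only while collecting link text

    @Override
    public void startElement(String uri, String local, String name, Attributes attributes) {
        if (XHTML.equals(uri) && "a".equals(local)) {
            currentHref = attributes.getValue("", "href");
            currentText = (currentHref != null) ? new StringBuilder() : null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentText != null) {
            currentText.append(ch, start, length); // includes text of nested elements
        }
    }

    @Override
    public void endElement(String uri, String local, String name) {
        if (XHTML.equals(uri) && "a".equals(local) && currentText != null) {
            links.add(new Link(currentHref, currentText.toString().trim()));
            currentHref = null;
            currentText = null;
        }
    }

    /** Demo driver: parses a well-formed XHTML string with the JDK's SAX parser. */
    static List<Link> collect(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            AnchorTextCollector collector = new AnchorTextCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

A client that only needs the URIs can ignore the text field, and one that
wants surrounding context could extend characters() to keep a window of
text outside the link as well.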

>  > I don't think inline scripts should be part of
>  > the extracted text content (but others may disagree), so script links
>  > should probably also not be included in the XHTML output.
>
>  I agree. However, AJAX is becoming more popular and some page content
>  can only be reached via scripts. Not sure where this leaves us.

We could perhaps add a configuration option to the HTML parser to keep
or drop any inline scripts, comments, and/or style sheets. See also
the other thread for potential syntax-aware parsing of JavaScript and
CSS. We could turn the HTML parser into a composite parser class that
calls other language-specific parsers when it encounters inline
content in some specific language.

>  > We don't currently have explicit CSS parser support in Tika, the plain
>  > text extractor comes closest. I'll see if we could add something that
>  > would allow easy detection of links within CSS files.
>
>  http://svn.apache.org/repos/asf/forrest/trunk/main/webapp/resources/chaperon/
>  In forrest we are using chaperon for this, but I'm not sure whether tika
>  should parse css files. It is similar to inline scripts, isn't it?

Seems useful! Anything that goes beyond text/plain is good...

>  > On a related note, currently there's no base URI support in Tika. We
>  > probably should add that, and treat a dc:identifier URI (if available)
>  > as the base unless one has been explicitly specified in the document.
>  > Also, if a base URI is available, Tika should automatically make all
>  > relative URIs absolute to make client life easier.
>
>  IMO that should be the problem of the client, because there are
>  situations where the client may prefer relative links.

Good point. Perhaps another place where we should make the parser
configurable, as other clients would probably be better served with
"cooked" links so they wouldn't need to worry about base URIs etc.

BR,

Jukka Zitting