Scanning for the URLs in a web page

How do I match URLs within an HTML document?

A Uniform Resource Locator (URL) is a string of text that describes the location of a network-accessible resource. URLs are used extensively in HTML documents and, for various reasons, we often need to manipulate them. To do that we first need to be able to scan the documents and identify the URLs.

This document describes a super-simple yet effective scanner that I recently wrote in PHP for a small project.

What's in a URL?

Detailed information about Uniform Resource Locators can be found in a number of places on the web, including the popular Wikipedia.org URL page. As expected the bottom of the Wikipedia page includes extensive references to more detailed information.

It does not take much research to find out that URLs are more complicated than we expect them to be. It turns out that we generally use only a small subset of the elements that can go into a URL. For this reason the scanner presented here is drastically simplified. Rather than trying to support all the protocols (called schemes in the official reference material), this scanner attempts only to cover the most common ground: basic ftp, http and mailto URLs, along with the relative references that we often use in a typical web page. These generally take the following formats:

  • [scheme:// hostname /] path [/path] [? query] [# fragment]
  • scheme: userid @ hostname

In addition, the hostname can be accompanied by a User ID, a Password and a Port specification:

  • [scheme:// [userid [: password] @] hostname [: port] /]
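
To make the pieces named in the formats above concrete, here is a minimal sketch that splits a sample URL with PHP's built-in parse_url() function. parse_url() is not part of the scanner; it is used here only to put names on the components, and the sample URLs are taken from examples elsewhere on this page.

```php
<?php
// Split a full URL into the components named in the formats above.
$url   = 'http://userid:password@www.mydomain.com:80/path?options#frag';
$parts = parse_url( $url );

// Print each component that parse_url() reported, one per line.
foreach ( $parts as $name => $value )
   echo str_pad( $name, 10 ), $value, "\n";

// A bare relative reference has no scheme or host at all; only the
// path component is reported.
$relative = parse_url( 'public_files' );
var_dump( $relative );
```

Note that parse_url() only splits a string that is already known to be a URL. It does nothing to help find URLs inside a page, which is the problem the scanner addresses.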

The real issues

At first, when I started working on this code, I thought the problem would be identifying URLs. Later I discovered that the hard part is identifying the more limited subsets of a URL. Let me explain: It's relatively easy to scan for and find a full reference in a web page, such as this one:

<a href="http://mysite.com/public_files">Public Files</a>

The protocol spec can be quickly matched by any scanner because of the unique :// sequence between the protocol name and the host name. The problem is that there is often a page in the root of mysite.com that contains a relative link to the public files folder, like this:

<a href="public_files">Public Files</a>

In essence, from the perspective of a software scanner, relative URLs often look exactly like any of the other words in a web page. Notice also the HTML tags in the above text. The closing anchor tag contains /a which, technically, looks like a root-relative path to a resource called a. As you can imagine the scanner that I am presenting here contains relatively little code to scan for URLs. The bulk of the effort went into writing code to deal with the practical problems that arise when trying to figure out what strings of text represent URLs and what strings need to be ignored.
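
The closing-tag problem is easy to reproduce. The sketch below uses a deliberately naive pattern (a slash followed by word characters) that is an assumption made up for this illustration and is not the URL_PATTERN from the module. It misses the real relative link and reports the closing anchor tag instead.

```php
<?php
// A deliberately naive "root-relative path" pattern: a slash followed
// by one or more word characters. Hypothetical, for illustration only.
$naive_pattern = '=/[[:word:]]+=';

$html = '<a href="public_files">Public Files</a>';

preg_match_all( $naive_pattern, $html, $matches );
$naive_hits = $matches[0];

// The only slash in the sample belongs to the closing tag, so the
// relative link "public_files" is missed and "/a" is reported.
print_r( $naive_hits );
```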

Fortunately we can accomplish much of what we need to do using the sophisticated and powerful Perl Compatible Regular Expression Library that is included in recent releases of PHP. Detailed documentation for this text scanning library is available from the pcre.org web site.

The Solution

Below you will find the source code for URL_PATTERN between lines 94 and 204. This is my attempt at a PCRE expression to match URLs within a web page.

Below that, between lines 308 and 354, you can see the code for the list_urls() function. (The comment for the function is way up on line 206. It seems really far away because of the Unit Testing active comments that are embedded before the function code.)

The URL_PATTERN PCRE expression is passed to the PHP preg_match_all() function along with the text to scan. The function fills in a rather complicated nested array containing information about the matching results.

Fortunately we are only interested in the named subpatterns url1, url2, url3 and url4 that were matched from the main URL_PATTERN. Starting from line 332 there are a couple of simple loops that fish out the desired information and build a results array which is returned to the caller.

The array consists of a set of Position => URL pairs, one for each matched URL. The position is an offset from the start of the text block to the start of the matched URL. You can see a sample of the output from the function by looking at the Unit Test Code embedded within the comment for the function. The test code begins at line 228 where you can see the $sample_text that will be passed to the list_urls() function. In line 263 you can see the var_dump() of the $results array returned by the function.
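
The general idea of fishing named subpatterns out of the preg_match_all() results can be sketched with a much smaller pattern. Everything named demo_ below is hypothetical and exists only for this illustration; the two-branch pattern is far simpler than URL_PATTERN and would not be good enough for real pages.

```php
<?php
// Hypothetical two-branch pattern: "full" matches scheme://..., while
// "htmltag" soaks up closing tags such as </a> so they can be ignored.
$demo_pattern = '=(?<full>[a-z]{3,9}://[^"\s]+)|<(?<htmltag>/[a-z]+)>=i';

// Collect Position => URL pairs for the subpattern names in $wanted.
function demo_collect( $pattern, $text, array $wanted )
{
   $res = array();
   if ( preg_match_all( $pattern, $text, $res, PREG_OFFSET_CAPTURE ) === false )
      throw new Exception( "preg_match_all() failed" );

   $results = array();
   foreach ( $wanted as $name )
      foreach ( $res[$name] as $hit )
         // A branch that did not take part in a match shows up as an
         // empty entry; skip those and keep the real ( text, offset ) pairs.
         if ( is_array( $hit ) && $hit[0] !== "" && $hit[1] !== -1 )
            $results[(int)$hit[1]] = $hit[0];

   ksort( $results ); // order of appearance in the text
   return $results;
}

$html        = '<a href="http://mysite.com/public_files">Public Files</a>';
$demo_result = demo_collect( $demo_pattern, $html, array( "full" ) );
$demo_tags   = demo_collect( $demo_pattern, $html, array( "htmltag" ) );

var_dump( $demo_result ); // offset 9 => "http://mysite.com/public_files"
```

Asking for "full" only leaves the closing tag out of the result, even though the pattern matched it. That is the same trick the htmltag subpattern plays in the module.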

The Unit Test code for this module is expected to run under phpdt/DocTest. (Note that the original repository for that project was at Google DocTest.)

Using list_urls() and unique_urls()

To use the list_urls() function you need only include the url-scanner.inc file in your module. Pass a string containing the HTML document you would like to scan and the function will return the list of matched URLs and their offsets in the string. See the source code comments for information about the $sort_flag in the event that you want the results to be returned in order of their appearance in the source document. The source code for the module contains a unit test that you can check to see a working call and the results it produces.

In most cases you will probably have a list of URLs that you wish to update. You will then want to know if one of the URLs to update exists in the list of URLs returned by the scanner. In this case you might prefer to call the unique_urls() function. It returns an associative array where the scanned URLs are the keys to the array. This makes it easy to test for existence of the URL rather than having to search the results set:

$results = unique_urls( $web_page_contents );

...

if ( isset( $results[$URL_that_I_am_seeking] ) )
{
   $found = $results[$URL_that_I_am_seeking];

   echo "Regarding the url: '{$URL_that_I_am_seeking}'\n";
   echo "I found ", count($found), " occurrence(s) of it.\n";
   echo "They are at the following offsets: ";
   foreach( $found as $offset )
      echo "{$offset} ";

   ...
}

As you can see from the sample code above, the unique_urls() function returns position information for each occurrence of a URL that it finds. Optionally, it can also return context information from the original document. See the comments and the unit test code in the module source for more information about that.
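
The shape of that result (each URL as a key, with a list of offsets as its value) can be sketched as follows. The group_offsets_by_url() helper is hypothetical and is not taken from url-scanner.inc; it only shows how Position => URL pairs could be regrouped, using a few of the offsets from the module's unit test as sample data.

```php
<?php
// Hypothetical helper: turn Position => URL pairs into
// URL => array( offset, ... ) so that isset() can test for a URL.
function group_offsets_by_url( array $position_to_url )
{
   $by_url = array();
   foreach ( $position_to_url as $position => $url )
      $by_url[$url][] = $position;

   return $by_url;
}

// Sample data borrowed from the unit test output in the module.
$pairs = array( 270 => "community",
                307 => "/my/relative/image.png",
                350 => "/my/relative/image.png" );

$by_url = group_offsets_by_url( $pairs );

// An existence test is now a single isset() call.
var_dump( isset( $by_url["community"] ) );
var_dump( $by_url["/my/relative/image.png"] );
```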

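Use the substr_replace() function to replace URLs within your text. Work from the highest offset down to the lowest so that a replacement of a different length does not shift the positions of the URLs that have not been processed yet:

$result = list_urls( $my_document_text );
krsort( $result ); // highest offset first

foreach ( $result as $pos => $input_url )
{
   if ( ($replacement_url = find_my_replacement_url( $input_url )) !== false )
      $my_document_text = substr_replace( $my_document_text,
                                          $replacement_url,
                                          $pos, strlen( $input_url ) );
}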
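
Here is a small self-contained run of the same idea on a two-link string. The offsets are typed in by hand rather than produced by list_urls(), and the replacement table is made up for the example. Because both replacements are longer than the originals, the offsets would go stale if the links were processed front to back, so the sketch walks them from the highest offset down.

```php
<?php
$text = '<a href="a.html">A</a> <a href="b.html">B</a>';

// Position => URL pairs, in the form list_urls() reports them
// (typed in by hand for this example).
$found = array( 9 => "a.html", 32 => "b.html" );

// Hypothetical replacement table.
$replacements = array( "a.html" => "pages/alpha.html",
                       "b.html" => "pages/beta.html" );

krsort( $found ); // highest offset first, so earlier offsets stay valid

foreach ( $found as $pos => $old_url )
   if ( isset( $replacements[$old_url] ) )
      $text = substr_replace( $text, $replacements[$old_url],
                              $pos, strlen( $old_url ) );

echo $text, "\n";
// prints: <a href="pages/alpha.html">A</a> <a href="pages/beta.html">B</a>
```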

Did I miss anything important? Please let me know.

Summary

While this was a non-trivial exercise for me, I'm glad I was able to find a relatively simple yet suitable solution. Still, it's a scanner - not a parser. It has no understanding of the meaning of the text that it is scanning and, therefore, can be expected to fail as new sets of input documents are passed through it. Fortunately it is good enough for me given the work I'm doing now. I hope you also get good mileage out of it. Let me know. Thanks.

Download

A zip file containing the latest copy of the url-scanner.inc PHP module is available here:

url-scanner.zip

Below is the source code for the module as of December 2nd, 2012:

Filename: url-scanner.inc
0001 <?php /* -*- mode: php; mode: mmm; coding: utf-8-unix; -*- */
0002 /**
0003  *
0004  * URL Scanner
0005  *
0006  * A scanner based on the following references (but drastically
0007  * simplified:)
0008  *
0009  *    http://en.wikipedia.org/wiki/URL
0010  *    http://www.w3.org/Addressing/URL/5_BNF.html
0011  *    
0012  * This module contains a function, list_urls() that accepts a single
0013  * string parameter (usually a long string containing the contents of
0014  * an HTML page.) It scans the block for URLs and returns a
0015  * list of those found. The scanned URLs match one of the following
0016  * patterns:
0017  *
0018  *    [scheme:// [userid [: password] @] hostname [: port] /] path [/path] [? params] [# fragment]
0019  *    scheme: userid @ hostname
0020  *
0021  * Note that this is a scanner - it is not a parser. It does not
0022  * understand the context of any URL that it finds. Therefore
0023  * it will match URLs that are in the text of a web page as
0024  * well as the URL-related attributes of the HTML tags. If this is a
0025  * problem you can modify the code below to look for the HTML
0026  * attributes that are found by the pattern as a way of narrowing the
0027  * scan. However, to get the job done more accurately, you may want to
0028  * visit the ANTLR web site and get an HTML parser:
0029  *
0030  *    http://www.antlr.org/
0031  *
0032  * Performance? Well, the PHP interpreter is passing the pattern below
0033  * to fully compiled code that has been kicking around for many
0034  * years. Most likely it is already optimized to a large extent. If
0035  * this code is too slow the first step to take to speed it up would
0036  * be to try to simplify parts of the pattern below. The next step
0037  * would be to switch to drastically simpler patterns and write code
0038  * to go through all the resulting search results to remove those that
0039  * are unwanted.
0040  *
0041  *
0042  * UNIT TESTING & Documentation: 
0043  *
0044  * The active comments below are written for PHPDT/DocTest:
0045  * http://code.google.com/p/testing-doctest/wiki/Documentation
0046  *
0047  * Try the highlight_file() function in PHP to generate an updated
0048  * HTML copy of this script:
0049  *
0050  *    $ php -r "highlight_file( 'url-scanner.inc' );" >url-scanner.html
0051  *
0052  * --
0053  *
0054  * Copyright (c) 2012 by Sam Azer <sam at azertech.net>. 
0055  *                       All Rights Reserved.
0056  *
0057  * This program is free software: you can redistribute it and/or modify
0058  * it under the terms of the GNU General Public License as published by
0059  * the Free Software Foundation, either version 3 of the License, or
0060  * (at your option) any later version.
0061  *
0062  * This program is distributed in the hope that it will be useful,
0063  * but WITHOUT ANY WARRANTY; without even the implied warranty of
0064  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0065  * GNU General Public License for more details.
0066  *
0067  * You should have received a copy of the GNU General Public License
0068  * along with this program.  If not, see <http://www.gnu.org/licenses/>.
0069  *
0070  * @author Sam Azer <sam at azertech.net>
0071  * @version 5.0
0072  * @package aforms
0073  * @subpackage lib-general
0074  * @license http://www.gnu.org/licenses/gpl.html GPL
0075  * @copyright 2001-2012 by Sam Azer, All rights reserved
0076  * @link http://www.azertech.net/
0077  * @link http://www.samazer.net/about
0078  *
0079  */
0080
0081
0082 /**
0083  * The following PCRE pattern tries to match URLs in a few
0084  * common cases. The subpatterns url1, url2 and url3 have high
0085  * confidence as URLs in their specific cases. The subpattern
0086  * htmltag will match cases such as </tag> where /tag is a valid URL
0087  * that we generally do not want in the result set.
0088  * 
0089  * The list_urls() function below will scan the results returned by
0090  * preg_match_all() and aggregate the high-confidence URL strings
0091  * only - discarding the htmltags.
0092  *
0093  */
0094 define( "URL_PATTERN",
0095         "=" // using = as a delim char
0096       . "("
0097           . "(" // try first to match a full web URL
0098               . "(?<url1>" // ie: http://userid:password@www.mydomain.com:80/path?options#frag
0099                   . "[a-z]{3,9}"  // http
0100                   . "://"         // ://
0101                   . "("
0102                       . "("
0103                           . "[^:/;&#?[:space:]]+" // userid
0104                           . "("
0105                               . ":"
0106                               . "[^:/;&#?[:space:]]+" // :password
0107                           . ")?"
0108                           . "@"       // userid / password followed by @
0109                       . ")?"
0110                       . "[^:/;&#?[:space:]]+" // domain name
0111                       . "(:[0-9]+)?"          // optional :<port> number
0112                   . ")"
0113                   . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]*"
0114                       . "[\?\#][^\'\"[:space:]]*"
0115               . ")"
0116           . "|"
0117               . "(?<url2>"
0118                   . "("
0119                       . "[a-z]{3,9}"  // http
0120                       . "://"         // ://
0121                       . "("
0122                           . "[^:/;&#?[:space:]]+" // userid
0123                           . "("
0124                               . ":"
0125                               . "[^:/;&#?[:space:]]+" // :password
0126                           . ")?"
0127                           . "@"       // userid / password followed by @
0128                       . ")?"
0129                       . "[^:/;&#?[:space:]]+" // domain name
0130                       . "(:[0-9]+)?"          // optional :<port> number
0131                   . ")?"
0132                   . "(" // look here for /path/ or /path/?#frag
0133                       . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]*"
0134                       . "/"
0135                       . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]*"
0136                       . "([\?\#][^\'\"[:space:]]*)?"
0137                   . ")"
0138               . ")"
0139           . "|" // next try to match mailto:<email> URLs
0140               . "(?<url3>"
0141                   . "[a-z]{3,9}" // mailto
0142                   . ":"          // : name
0143                   . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]+"
0144                   . "@"          // @ my.domain.name
0145                   . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]+"
0146               . ")"
0147           . ")"
0148
0149      /**
0150       * here we only look for partial URLs because the more complete URLs
0151       * will be detected by the pattern above
0152       */
0153       . "|" // try to match relative URLs within HTML tag attributes
0154           . "(?:" // look for attribute = <delim>, ie: src="<url>"
0155               . "(action|background|cite|classid|codebase|data|"
0156               .         "formaction|href|icon|longdesc|manifest|"
0157               .         "poster|profile|src|usemap)"
0158               . "([[:space:]]*[\=][[:space:]]*)"
0159               . "([[:space:]]*[\"\'][[:space:]]*)"
0160           . ")"
0161           . "(?<url4>"
0162               . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]+"
0163               . "([\?\#][^\'\"[:space:]]*)?"
0164           . ")"
0165           . "(?:[[:space:]]*[\"\'])"
0166      /**
0167       * This next pattern will match a pathref such as /a or /form
0168       * which appears frequently in an HTML page - surrounded by
0169       * angle-brackets. By adding this pattern and giving it the name
0170       * "htmltags" we make sure that any paths that match the pattern
0171       * will be listed in the results under htmltags - which we can
0172       * safely ignore.
0173       */
0174       . "|"
0175           . "(?:"  // match an opening angle bracket
0176               . "([[:space:]]*[<][[:space:]]*)"
0177           . ")"
0178           . "(?<htmltag>"
0179               . "/" // match the slash that indicates a closing tag
0180               . "[[:space:]]*[[:word:]]+" // match the tag name
0181           . ")"
0182           . "(?:" // match the closing angle bracket
0183               . "([[:space:]]*[>][[:space:]]*)"
0184           . ")"
0185      /**
0186       * This next pattern will match a word with a ? or # at the
0187       * end. It is used to prevent words at the end of a sentence from
0188       * being matched as paths with queries or fragments.
0189       */
0190       . "|"
0191           . "(?:"  // match a word break
0192               . "([[:space:]]*)"
0193           . ")"
0194           . "(?<bareword>"
0195               . "[^\\\|\=\'\(\[\]\);:#^<>\"!*&?[:space:]]+"
0196               . "[\?\#]"
0197           . ")"
0198           . "(?:"  // match a word break
0199               . "([[:space:]]*)"
0200           . ")"
0201       . ")"
0202       . "=i" // close the pattern and specify case-insensitive search
0203 );
0204
0205
0206 /**
0207  *
0208  * list_urls( $txt_block, $sort_flag = false )
0209  *
0210  * Scans a block of text in the $txt_block string and returns an array
0211  * of URLs that match the subpatterns in the URL_PATTERN above. The
0212  * returned value is an array of Position => URL pairs, one for each
0213  * matched URL.
0214  *
0215  * @param string $txt_block a block of text such as an HTML web page
0216  * @param bool   $sort_flag defaults to false. Pass true to get 
0217  *               the return results sorted in order of their 
0218  *               appearance in the source document
0219  *
0220  * @return array an array of Position => URL pairs where Position is
0221  *               the character count in the text block at which the
0222  *               URL was matched. If preg_match_all() fails the return
0223  *               value of this function is boolean false. However, as
0224  *               this will almost always be the result of a coding
0225  *               error, the function is generally coded to call an
0226  *               exit function or throw an exception instead.
0227  *
0228  * <code>
0229  * 
0230  * $sample_text = <<< EOT
0231  * <p>This is a test document containing URLs. Will it 
0232  *    match the wrong words? I'm not sure. we need to be able to 
0233  *    match strings like rfc123/text/plain as relative paths and
0234  *    strings like http://www.kubuntu.org/ as full URLs.
0235  *    So, let's test it:
0236  * <ul>
0237  * <li><a href="community">info</a> 
0238  * <li><img src = ' /my/relative/image.png ' >
0239  * <li><img src =' /my/relative/image.png '>
0240  * <li><img 
0241  * src = 
0242  * '/my/relative/image.png' 
0243  * >
0244  * <li><a action="mailto:info@unknown.com">info</a>
0245  * <li><a formaction="community#frag">info</a>
0246  * <li><a href="/community#frag">info</a>
0247  * <li><a href="/community/PostfixAmavisNew#frag">info</a>
0248  * <li><a href="/community/PostfixAmavisNew?abc=123&amp;foo=%2c&amp;zot=1#frag">info</a>
0249  * <li><a href="PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0250  * <li><a href="/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0251  * <li><a href="community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0252  * <li><a href="/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0253  * <li><a href="https://help.ubuntu.com/community/PostfixAmavisNew">Info</a>
0254  * <li><a href="https://userid@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0255  * <li><a href="https://userid:password@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0256  * <li><a href="https://userid:password@help.ubuntu.com:80/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0257  * </ul>
0258  * EOT;
0259  * 
0260  * $result = list_urls( $sample_text, true );
0261  * var_dump( $result );
0262  *
0263  * // expects:
0264  * // array(19) {
0265  * //   [136] =>
0266  * //   string(17) "rfc123/text/plain"
0267  * //   [192] =>
0268  * //   string(23) "http://www.kubuntu.org/"
0269  * //   [270] =>
0270  * //   string(9) "community"
0271  * //   [307] =>
0272  * //   string(22) "/my/relative/image.png"
0273  * //   [350] =>
0274  * //   string(22) "/my/relative/image.png"
0275  * //   [392] =>
0276  * //   string(22) "/my/relative/image.png"
0277  * //   [433] =>
0278  * //   string(23) "mailto:info@unknown.com"
0279  * //   [486] =>
0280  * //   string(14) "community#frag"
0281  * //   [524] =>
0282  * //   string(15) "/community#frag"
0283  * //   [563] =>
0284  * //   string(32) "/community/PostfixAmavisNew#frag"
0285  * //   [619] =>
0286  * //   string(62) "/community/PostfixAmavisNew?abc=123&amp;foo=%2c&amp;zot=1#frag"
0287  * //   [705] =>
0288  * //   string(51) "PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0289  * //   [780] =>
0290  * //   string(52) "/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0291  * //   [856] =>
0292  * //   string(61) "community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0293  * //   [941] =>
0294  * //   string(62) "/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0295  * //   [1027] =>
0296  * //   string(50) "https://help.ubuntu.com/community/PostfixAmavisNew"
0297  * //   [1101] =>
0298  * //   string(92) "https://userid@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0299  * //   [1217] =>
0300  * //   string(101) "https://userid:password@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0301  * //   [1342] =>
0302  * //   string(104) "https://userid:password@help.ubuntu.com:80/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag"
0303  * // }
0304  *
0305  * </code>
0306  *
0307  */
0308 function list_urls( $txt_block, $sort_flag = false )
0309 {
0310   /**
0311    * start by getting a list of all the strings that match the
0312    * URL_PATTERN:
0313    */
0314    $res = array();
0315    if ( ($v = preg_match_all( URL_PATTERN, $txt_block, $res,
0316                               PREG_OFFSET_CAPTURE )) === false )
0317    {
0318      /**
0319       * Failure here is generally the result of a problem with the
0320       * pattern. It's quite alright to crash here as long as the error
0321       * is logged to alert the administrator.
0322       */
0323       // return false;
0324       throw new Exception( "preg_match_all() returns an error result" );
0325    }
0326
0327
0328   /**
0329    * now loop through the results and pack the url1, url2 and url3
0330    * subpattern results into a final results array:
0331    */
0332    $subpatterns = array( "url1", "url2", "url3", "url4" );
0333    $results     = array();
0334    foreach ( $subpatterns as $subpattern )
0335       foreach ( $res[$subpattern] as $idx => $result )
0336          if ( is_array( $result ) )
0337             if ( $result[0] > ""
0338               && $result[1] !== -1
0339               && $result[0] !== "/"
0340               && $result[0] !== "//"
0341               && $result[0] !== "///"
0342                )
0343                $results[(int)$result[1]] = $result[0];
0344
0345    /**
0346     * Optionally sort the results so that they are in order of
0347     * appearance in the HTML text:
0348     */
0349    if ( $sort_flag )
0350       ksort( $results );
0351
0352    return $results;
0353 }
0355
0356
0357 /**
0358  *
0359  * unique_urls( $txt, $context_chars = false, $sort_results = false )
0360  *
0361  * Calls list_urls() to scan a block of text in the $txt string and
0362  * collect an array of URLs that match the subpatterns in the
0363  * URL_PATTERN above. The returned value can take one of two forms: If
0364  * the $context_chars count is 0 or false, the resulting array is a
0365  * list of URL and array of offset pairs, ie: 
0366  *
0367  * $result = array( URL => array( offset, ... ), ... );
0368  *
0369  * This is handy if you need a list of the Unique URLs that were found
0370  * in the input text along with an array of offsets for each occurrence
0371  * of a URL.
0372  *
0373  * In the case where you request a number of characters of context for
0374  * each occurrence of a URL, the return result is modified to include a
0375  * context string along with each offset, ie:
0376  *
0377  * $result = array( URL => array( array( "context"  => <context string>,
0378  *                                       "position" => offset ),
0379  *                                ... ),
0380  *                  ... );
0381  *
0382  * The context string contains the original text from the text block
0383  * at the position of the URL surrounded by the number of characters
0384  * specified in $context_chars. This allows the developer to review
0385  * the matched URL text in the context of the original text block in
0386  * which it was found.
0387  *
0388  * @param string $txt a block of text such as an HTML web page 
0389  * @param bool   $sort_flag defaults to false. Pass true to get 
0390  *               the return results sorted in order of their 
0391  *               appearance in the source document
0392  * @param int $context_chars the number of characters of context to
0393  *               collect before and after the matched URL.
0394  *
0395  * @return array an array keyed by URL, in one of the two forms
0396  *               described above. If preg_match_all() fails the
0397  *               return value of this function is boolean false.
0398  *               However, as this will almost always be the result
0399  *               of a coding error, the function is generally coded
0400  *               to call an exit function or throw an exception
0401  *               instead.
0402  *
0403  * <code>
0404  * 
0405  * $sample_text = <<< EOT
0406  * <p>This is a test document containing URLs. Will it 
0407  *    match the wrong words? I'm not sure. we need to be able to 
0408  *    match strings like rfc123/text/plain as relative paths and
0409  *    strings like http://www.kubuntu.org/ as full URLs.
0410  *    So, let's test it:
0411  * <ul>
0412  * <li><a href="community">info</a> 
0413  * <li><img src = ' /my/relative/image.png ' >
0414  * <li><img src =' /my/relative/image.png '>
0415  * <li><img 
0416  * src = 
0417  * '/my/relative/image.png' 
0418  * >
0419  * <li><a action="mailto:info@unknown.com">info</a>
0420  * <li><a formaction="community#frag">info</a>
0421  * <li><a href="/community#frag">info</a>
0422  * <li><a href="/community/PostfixAmavisNew#frag">info</a>
0423  * <li><a href="/community/PostfixAmavisNew?abc=123&amp;foo=%2c&amp;zot=1#frag">info</a>
0424  * <li><a href="PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0425  * <li><a href="/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0426  * <li><a href="community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0427  * <li><a href="/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0428  * <li><a href="https://help.ubuntu.com/community/PostfixAmavisNew">Info</a>
0429  * <li><a href="https://userid@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0430  * <li><a href="https://userid:password@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0431  * <li><a href="https://userid:password@help.ubuntu.com:80/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>
0432  * </ul>
0433  * EOT;
0434  * 
0435  * $result = unique_urls( $sample_text, false, true );
0436  * var_dump( $result );
0437  *
0438  * echo "\n***next***\n\n";
0439  *
0440  * $result = unique_urls( $sample_text, 50, true );
0441  * var_dump( $result );
0442  *
0443  * // expects:
0444  * // array(17) {
0445  * //   'rfc123/text/plain' =>
0446  * //   array(1) {
0447  * //     [0] =>
0448  * //     int(136)
0449  * //   }
0450  * //   'http://www.kubuntu.org/' =>
0451  * //   array(1) {
0452  * //     [0] =>
0453  * //     int(192)
0454  * //   }
0455  * //   'community' =>
0456  * //   array(1) {
0457  * //     [0] =>
0458  * //     int(270)
0459  * //   }
0460  * //   '/my/relative/image.png' =>
0461  * //   array(3) {
0462  * //     [0] =>
0463  * //     int(307)
0464  * //     [1] =>
0465  * //     int(350)
0466  * //     [2] =>
0467  * //     int(392)
0468  * //   }
0469  * //   'mailto:info@unknown.com' =>
0470  * //   array(1) {
0471  * //     [0] =>
0472  * //     int(433)
0473  * //   }
0474  * //   'community#frag' =>
0475  * //   array(1) {
0476  * //     [0] =>
0477  * //     int(486)
0478  * //   }
0479  * //   '/community#frag' =>
0480  * //   array(1) {
0481  * //     [0] =>
0482  * //     int(524)
0483  * //   }
0484  * //   '/community/PostfixAmavisNew#frag' =>
0485  * //   array(1) {
0486  * //     [0] =>
0487  * //     int(563)
0488  * //   }
0489  * //   '/community/PostfixAmavisNew?abc=123&amp;foo=%2c&amp;zot=1#frag' =>
0490  * //   array(1) {
0491  * //     [0] =>
0492  * //     int(619)
0493  * //   }
0494  * //   'PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
0495  * //   array(1) {
0496  * //     [0] =>
0497  * //     int(705)
0498  * //   }
0499  * //   '/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
0500  * //   array(1) {
0501  * //     [0] =>
0502  * //     int(780)
0503  * //   }
0504  * //   'community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
0505  * //   array(1) {
0506  * //     [0] =>
0507  * //     int(856)
0508  * //   }
0509  * //   '/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
0510  * //   array(1) {
0511  * //     [0] =>
0512  * //     int(941)
0513  * //   }
0514  * //   'https://help.ubuntu.com/community/PostfixAmavisNew' =>
0515  * //   array(1) {
0516  * //     [0] =>
0517  * //     int(1027)
0518  * //   }
0519  * //   'https://userid@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
0520  * //   array(1) {
0521  * //     [0] =>
0522  * //     int(1101)
0523  * //   }
0524  * //   'https://userid:password@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
0525  * //   array(1) {
0526  * //     [0] =>
0527  * //     int(1217)
0528  * //   }
 * //   'https://userid:password@help.ubuntu.com:80/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     int(1342)
 * //   }
 * // }
 * //
 * // ***next***
 * //
 * // array(17) {
 * //   'rfc123/text/plain' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(117) "sure. we need to be able to\n   match strings like rfc123/text/plain as relative paths and\n   strings like http://www."
 * //       'position' =>
 * //       int(136)
 * //     }
 * //   }
 * //   'http://www.kubuntu.org/' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(123) "/text/plain as relative paths and\n   strings like http://www.kubuntu.org/ as full URLs.\n   So, let's test it:\n<ul>\n<li><a h"
 * //       'position' =>
 * //       int(192)
 * //     }
 * //   }
 * //   'community' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(109) "ull URLs.\n   So, let's test it:\n<ul>\n<li><a href="community">info</a>\n<li><img src = ' /my/relative/image.png"
 * //       'position' =>
 * //       int(270)
 * //     }
 * //   }
 * //   '/my/relative/image.png' =>
 * //   array(3) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(122) "<li><a href="community">info</a>\n<li><img src = ' /my/relative/image.png ' >\n<li><img src =' /my/relative/image.png '>\n<li"
 * //       'position' =>
 * //       int(307)
 * //     }
 * //     [1] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(122) "rc = ' /my/relative/image.png ' >\n<li><img src =' /my/relative/image.png '>\n<li><img\nsrc =\n'/my/relative/image.png'\n>\n<li>"
 * //       'position' =>
 * //       int(350)
 * //     }
 * //     [2] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(122) " src =' /my/relative/image.png '>\n<li><img\nsrc =\n'/my/relative/image.png'\n>\n<li><a action="mailto:info@unknown.com">info</"
 * //       'position' =>
 * //       int(392)
 * //     }
 * //   }
 * //   'mailto:info@unknown.com' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(123) "g\nsrc =\n'/my/relative/image.png'\n>\n<li><a action="mailto:info@unknown.com">info</a>\n<li><a formaction="community#frag">info"
 * //       'position' =>
 * //       int(433)
 * //     }
 * //   }
 * //   'community#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(114) "lto:info@unknown.com">info</a>\n<li><a formaction="community#frag">info</a>\n<li><a href="/community#frag">info</a>\n"
 * //       'position' =>
 * //       int(486)
 * //     }
 * //   }
 * //   '/community#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(115) "formaction="community#frag">info</a>\n<li><a href="/community#frag">info</a>\n<li><a href="/community/PostfixAmavisNe"
 * //       'position' =>
 * //       int(524)
 * //     }
 * //   }
 * //   '/community/PostfixAmavisNew#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(132) "i><a href="/community#frag">info</a>\n<li><a href="/community/PostfixAmavisNew#frag">info</a>\n<li><a href="/community/PostfixAmavisNe"
 * //       'position' =>
 * //       int(563)
 * //     }
 * //   }
 * //   '/community/PostfixAmavisNew?abc=123&amp;foo=%2c&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(162) "nity/PostfixAmavisNew#frag">info</a>\n<li><a href="/community/PostfixAmavisNew?abc=123&amp;foo=%2c&amp;zot=1#frag">info</a>\n<li><a href="PostfixAmavisNew?abc=123&a"
 * //       'position' =>
 * //       int(619)
 * //     }
 * //   }
 * //   'PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(151) "amp;foo=%2c&amp;zot=1#frag">info</a>\n<li><a href="PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="/PostfixAmavisNew?abc=123&"
 * //       'position' =>
 * //       int(705)
 * //     }
 * //   }
 * //   '/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(152) "amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="community/PostfixAmavisNew"
 * //       'position' =>
 * //       int(780)
 * //     }
 * //   }
 * //   'community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(161) "amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="/community/PostfixAmavisNe"
 * //       'position' =>
 * //       int(856)
 * //     }
 * //   }
 * //   '/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(162) "amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="https://help.ubuntu.com/co"
 * //       'position' =>
 * //       int(941)
 * //     }
 * //   }
 * //   'https://help.ubuntu.com/community/PostfixAmavisNew' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(150) "amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="https://help.ubuntu.com/community/PostfixAmavisNew">Info</a>\n<li><a href="https://userid@help.ubuntu"
 * //       'position' =>
 * //       int(1027)
 * //     }
 * //   }
 * //   'https://userid@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(192) "community/PostfixAmavisNew">Info</a>\n<li><a href="https://userid@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="https://userid:password@he"
 * //       'position' =>
 * //       int(1101)
 * //     }
 * //   }
 * //   'https://userid:password@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(201) "amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="https://userid:password@help.ubuntu.com/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="https://userid:password@he"
 * //       'position' =>
 * //       int(1217)
 * //     }
 * //   }
 * //   'https://userid:password@help.ubuntu.com:80/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag' =>
 * //   array(1) {
 * //     [0] =>
 * //     array(2) {
 * //       'context' =>
 * //       string(170) "amp;foo=bar&amp;zot=1#frag">info</a>\n<li><a href="https://userid:password@help.ubuntu.com:80/community/PostfixAmavisNew?abc=123&amp;foo=bar&amp;zot=1#frag">info</a>\n</ul>"
 * //       'position' =>
 * //       int(1342)
 * //     }
 * //   }
 * // }
 * </code>
 *
 */
function unique_urls( $txt_block, $context_chars = false, $sort_flag = false )
{
   $txtlen = strlen( $txt_block );

   $unique = array();
   $urls   = list_urls( $txt_block, $sort_flag );   // position => URL

   foreach ( $urls as $pos => $url )
   {
      // First time this URL is seen: start its list of occurrences.
      if ( !isset( $unique[$url] ) )
         $unique[$url] = array();

      $len = strlen( $url );

      // Clamp the context window to the bounds of the text block.
      if ( ($min = $pos - $context_chars) < 0 )
         $min = 0;

      if ( ($max = $pos + $len + $context_chars) >= $txtlen )
         $max = $txtlen;

      // Without context, record only the position; otherwise record the
      // surrounding text together with the position.
      if ( !$context_chars )
         $unique[$url][] = $pos;
      else
         $unique[$url][] = array( "context"  => substr( $txt_block, $min, $max - $min ),
                                  "position" => $pos,
                                );
   }

   return $unique;
}
