Include all external URLs in site statistics
The stats page has long shown a count of the number of unique external links referenced on the site. Before this change the stats showed 483 unique external links. The relevant code is contained in the snippet below:
import re
from typing import List
from urllib.parse import urlparse

def extract_external_links(text: str) -> List:
    url_pattern = r"(https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+(?:/[^)\s]*)?)"
    matches = re.findall(url_pattern, text)
    external_links = set()
    for url in matches:
        parsed_url = urlparse(url)
        # Keep anything whose host isn't this site
        if parsed_url.netloc.lower() != "silasjelley.com":
            external_links.add(url)
    return list(external_links)
...
try:
    plain_text = document["content"]["plain"]
    external_links = extract_external_links(plain_text)
    document["links"]["external"].update(external_links)
    site.links["external"].update(external_links)
except KeyError:
    pass
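To illustrate the behaviour, here is a quick, hypothetical invocation (the input text and URLs are made up, not drawn from the site):

sample = (
    "See https://example.org/post) and https://silasjelley.com/notes "
    "plus https://api.example.com/v1?key=abc"
)
print(sorted(extract_external_links(sample)))
# ['https://api.example.com/v1?key=abc', 'https://example.org/post']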
Up to now I’ve only included links within the main body of a page (content["plain"]), but my document format has evolved since then, with more and more links being structured into the [source] and [via] tables.
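For a sense of what the extraction code sees, a parsed document with those tables ends up as nested dictionaries, roughly like this (the field values here are invented for illustration):

document = {
    "content": {"plain": "Body text with inline links..."},
    "source": {"url": "https://example.org/essays/attention"},
    "via": {"url": "https://example.net/2024/01/links"},
    "links": {"external": set()},
}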
First I wanted to grasp what kind of difference this would make, so, as usual, I started with a shell pipeline search through the output of the build script:
rg --no-filename \
   --only-matching \
   --no-line-number \
   'http.*?://[^ \[\]"&<>)]*' \
   **/*.html | \
  sed \
    -e "s/'.*//" \
    -e '/silasjelley\.com/d' \
    -e 's,/$,,' \
    -e 's/#.*//' | \
  sort | \
  uniq | \
  wc -l
The regex for link matching was pretty coarse, so I tidied and normalised the output with a few sed patterns (stripping anything after a stray quote, trimming trailing slashes, and dropping URL fragments) and deleted internal silasjelley.com links, before sorting, de-duplicating, and counting (sort | uniq | wc -l).
The result: 907 unique external links, almost double the 483 contained purely in the main body content.
Now that number is slightly inflated, as it includes links to external APIs such as https://api.maptiler.com/maps/topo-v2/{z}/{x}/{y}.png?key=APIKEY for my walk map, so the final figure should be slightly lower.
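If I ever wanted the headline count to exclude those, one simple option would be to drop anything containing a template placeholder before counting (a hypothetical refinement, not part of this change):

# Hypothetical: skip templated API endpoints like .../{z}/{x}/{y}.png
external_links = {url for url in external_links if "{" not in url}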
This change includes the [source] and [via] tables when searching for links:
try:
    plain_text = document.get("content", {}).get("plain", "") + " "
    plain_text += document.get("source", {}).get("url", "") + " "
    plain_text += document.get("via", {}).get("url", "") + " "
    external_links = extract_external_links(plain_text)
    document["links"]["external"].update(external_links)
    site.links["external"].update(external_links)
except KeyError:
    print(f"KeyError while compiling external links from {document['filename']}")
    pass
After the change, the stats reflect 869 unique external links.