✅ SEP 16: Extract backlinks from plaintext

Status: Finished 

Date: 2023-04-08 
Commit: e521f96aa448b83f61e25b9f986e17456934ec15
> Implemented during a rewrite of `build_backlinks()` function. Allowed
> dependency on Beautiful Soup library to be dropped entirely. Runs lighter and
> faster, and behaves in a more obvious and inspectable way.

Currently (as at 2023-04-07) the `build_backlinks()` function extracts links from the content of a document after it has been processed into HTML, using the Beautiful Soup library. This is a large dependency that is pulled in only for this single purpose. I have a feeling that the same robust discovery of internal backlinks could be achieved using a simple `re.finditer(r'\[.*\]\(.*\)', text)` loop on the unprocessed plaintext. This would remove the dependency and be more consistent with the overall programming style and design goals of the project.
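A minimal sketch of what such a plaintext loop might look like. The pattern here is an assumption rather than the one finally committed: it is a tighter variant of the regex above, because the greedy `.*` would merge two links on the same line into one match. The name `find_inline_links` is hypothetical.

```python
import re

# Assumed pattern: label in [...], then the target up to the first space or ')'.
# An optional title after the target, e.g. (url "title"), is skipped.
INLINE_LINK = re.compile(r'\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)')

def find_inline_links(text):
    """Return the deduplicated set of inline link targets found in plaintext."""
    return {match.group(2) for match in INLINE_LINK.finditer(text)}

sample = "See [one](/notes/one) and [two](/notes/two), then [one again](/notes/one)."
print(sorted(find_inline_links(sample)))  # ['/notes/one', '/notes/two']
```

Note that this pattern would also match image syntax such as `![alt](src)`, so a real implementation would probably need to filter those out, along with external URLs.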

An incidental but desirable benefit of this change would be the ability to execute the `build_backlinks()` function before `generate_html()`. This is preferred because it means that all document metadata would be fully compiled before the document content is processed into HTML, allowing for greater flexibility and a better conceptual separation of concerns.

Before any change to the code is finalised, the new function would have to be tested against the existing one to ensure equivalent behaviour. This can be tested using a count of retrieved and deduplicated links.
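One way the comparison could be sketched, assuming each extractor can hand back a plain list of link targets. The function name `compare_link_sets` and the sample lists are hypothetical. Alongside the counts it also reports the set differences, since two extractors could return the same number of links while disagreeing on which ones.

```python
def compare_link_sets(old_links, new_links):
    """Deduplicate both results and report counts plus any disagreement."""
    old, new = set(old_links), set(new_links)
    return {
        "old_count": len(old),
        "new_count": len(new),
        "missing_from_new": sorted(old - new),  # found by the old method only
        "extra_in_new": sorted(new - old),      # found by the new method only
    }

# Hypothetical results from the HTML-based and plaintext-based extractors.
old_result = ["/a", "/b", "/b", "/c"]
new_result = ["/a", "/b"]

report = compare_link_sets(old_result, new_result)
print(report["old_count"], report["new_count"])  # 3 2
print(report["missing_from_new"])                # ['/c']
```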


  1. This build step must be carried out after the `insert_substitutes()` function has run, so that UUID- and prefix-referenced links are also included.

  2. Currently the new method will fail to pick up links that are written in the plaintext source in preformatted HTML. Will need to either expand the function to resolve this, or never use HTML links in source documents. If the latter then it would make sense to write a simple linter to check for this at build time.

  3. My first attempt to implement this (around midnight on 2023-04-07) is revealing the possible brittleness of this approach, which I think is why I went with the BS library to begin with…

    For some reason any attempt to extract links from the root index creates weird parse issues. For now this is excluded using a simple `if page['slug'] != '':`.

    Aha, just realised one of the reasons why this is causing such a bother… REFERENCE style links. The primitive plaintext link finder completely misses any link of the form [linktext][link]!
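The two gaps noted above (reference-style links and raw HTML links) could be approached along these lines. This is a sketch under assumptions, not the committed implementation: the names `find_reference_links` and `lint_html_links` are hypothetical, and the patterns cover only the common `[text][label]`, `[label][]` and `[label]: target` forms.

```python
import re

# [text][label] or the collapsed form [label][]
REF_USE = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')
# A definition line such as "[label]: /some/target"
REF_DEF = re.compile(r'^[ ]{0,3}\[([^\]]+)\]:\s*(\S+)', re.MULTILINE)
# A raw HTML anchor, which the plaintext link finder would miss
HTML_LINK = re.compile(r'<a\s[^>]*href\s*=', re.IGNORECASE)

def find_reference_links(text):
    """Resolve reference-style links to their targets (labels are case-insensitive)."""
    definitions = {m.group(1).lower(): m.group(2) for m in REF_DEF.finditer(text)}
    targets = set()
    for m in REF_USE.finditer(text):
        label = (m.group(2) or m.group(1)).lower()  # empty label reuses the link text
        if label in definitions:
            targets.add(definitions[label])
    return targets

def lint_html_links(text):
    """Return the 1-based line numbers that contain a raw HTML link."""
    return [number for number, line in enumerate(text.splitlines(), start=1)
            if HTML_LINK.search(line)]

sample = """Read [the intro][intro] and [notes][].
Also <a href="/raw">raw HTML</a>.

[intro]: /pages/intro
[notes]: /pages/notes
"""

print(sorted(find_reference_links(sample)))  # ['/pages/intro', '/pages/notes']
print(lint_html_links(sample))               # [2]
```

The linter only reports line numbers; whether a hit should fail the build or just print a warning is left open.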