✅ SEP 16: Extract backlinks from plaintext
Status: Finished
Date: 2023-04-08
Commit: e521f96aa448b83f61e25b9f986e17456934ec15
> Implemented during a rewrite of `build_backlinks()` function. Allowed
> dependency on Beautiful Soup library to be dropped entirely. Runs lighter and
> faster, and behaves in a more obvious and inspectable way.
Currently (as at 2023-04-07) the `build_backlinks()` function extracts links
from the content of a document after it has been processed into HTML. This is
done using the Beautiful Soup library. This is a large dependency that is
pulled in only for this single purpose. I have a feeling that the same robust
discovery of internal backlinks could be achieved using a simple
`re.finditer(r'\[.*\]\(.*\)', text)` loop on the unprocessed plaintext. This
would remove this large dependency and be more consistent with the overall
programming style and design goals of the project.
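A minimal sketch of the plaintext approach, assuming links are written as inline Markdown `[text](target)`. The function name `find_inline_links` and the sample text are hypothetical and not taken from the project. The sketch uses a stricter variant of the pattern, because the greedy `.*` form would treat everything from the first `[` to the last `)` on a line as a single match when that line holds more than one link.

```python
import re

# Stricter variant of the pattern above: the character classes stop each
# group at its closing delimiter, so two links on one line stay separate.
INLINE_LINK = re.compile(r'\[([^\]]*)\]\(([^)]*)\)')


def find_inline_links(text):
    """Return the deduplicated targets of inline links, in first-seen order."""
    seen = {}
    for match in INLINE_LINK.finditer(text):
        seen.setdefault(match.group(2), None)
    return list(seen)


# Illustrative input only
sample = "See [one](/a) and [two](/b), then [one again](/a)."
print(find_inline_links(sample))  # ['/a', '/b']
```

Keeping the targets in a dict preserves the order in which links first appear while still dropping duplicates.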
An incidental but desirable benefit of this change would be the ability to
execute the `build_backlinks()` function before `generate_html()`, which is
preferred because it means that all document metadata would be fully compiled
before the document content is processed into HTML, allowing for greater
flexibility, and a better conceptual separation of concerns.
Before any change to the code is finalised, the new function would have to be tested against the existing one to ensure equivalent behaviour. This can be checked by comparing counts of retrieved and deduplicated links.
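One way to sketch that comparison, assuming the output of each implementation can be reduced to a mapping of page slug to the links it found. The names `compare_link_sets`, `old_links` and `new_links` are hypothetical, and the sample data is made up for illustration. Comparing the deduplicated sets is slightly stricter than comparing counts alone, since it also shows which links differ.

```python
def compare_link_sets(old_links, new_links):
    """Compare deduplicated links per page from two extractors.

    Both arguments map a page slug to an iterable of link targets.
    Returns a dict of slug -> (missing, extra) for pages that differ.
    """
    differences = {}
    for slug in sorted(set(old_links) | set(new_links)):
        old = set(old_links.get(slug, ()))
        new = set(new_links.get(slug, ()))
        if old != new:
            # missing: found only by the old extractor; extra: only by the new
            differences[slug] = (sorted(old - new), sorted(new - old))
    return differences


# Illustrative data only
old_links = {"about": ["/a", "/b", "/a"], "notes": ["/c"]}
new_links = {"about": ["/a", "/b"], "notes": []}
print(compare_link_sets(old_links, new_links))  # {'notes': (['/c'], [])}
```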
Considerations:
- This build step must be carried out after the `insert_substitutes()` function has run, in order that UUID and prefix referenced links are also included.
- Currently the new method will fail to pick up links that are written in the plaintext source in preformatted HTML. Will need to either expand the function to resolve this, or never use HTML links in source documents. If the latter then it would make sense to write a simple linter to check for this at build time.
- My first attempt to implement this (around midnight on 2023-04-07) is revealing the possible brittleness of this approach, which I think is why I went with the BS library to begin with…
For some reason any attempt to extract links from the root index creates weird parse issues. For now this is excluded using a simple `if page['slug'] != '':`.
Aha, just realised one of the reasons why this is causing such a bother… REFERENCE style links. The primitive plaintext link finder completely misses any link of the form `[linktext][link]`!
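A build-time check for the two link forms the simple inline pattern misses, raw HTML anchors and reference-style links, could be sketched as follows. This is only an illustration: the function name `find_unsupported_links` and both patterns are assumptions, and the patterns are deliberately simple and do not amount to a full Markdown or HTML parse.

```python
import re

# Raw HTML anchors such as <a href="/page">text</a>
HTML_LINK = re.compile(r'<a\s[^>]*href\s*=', re.IGNORECASE)
# Reference-style links such as [linktext][link]
REFERENCE_LINK = re.compile(r'\[[^\]]+\]\[[^\]]*\]')


def find_unsupported_links(text):
    """Return (line_number, kind, snippet) for links the inline finder misses."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in (("html", HTML_LINK), ("reference", REFERENCE_LINK)):
            for match in pattern.finditer(line):
                problems.append((number, kind, match.group(0)))
    return problems


# Illustrative input only
sample = (
    'An [inline](/ok) link.\n'
    'A <a href="/raw">raw</a> link.\n'
    'A [ref][target] link.'
)
for problem in find_unsupported_links(sample):
    print(problem)
```

Reporting the line number with each hit makes it easy to fix the source document by hand, which fits the option of banning these link forms outright.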