Distinguishing between prose and code when counting words

I’m including more and larger code snippets in documents throughout the site (this one included), and I didn’t like that these were being counted in the site’s ‘wordcount’. I’ve always distinguished between my own words and reference words: when you see a quote in one of my documents, it isn’t counted in my own wordcount. Now it’s time to distinguish more granularly between my prose and my code.

The first change was to the SiteMetadata dataclass.

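from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SiteMetadata:
    ...
    # default_factory gives each instance its own fresh tally dict
    words: Dict[str, Any] = field(
        default_factory=lambda: {
            "self": 0,
            "drafts": 0,
            "code": {
                "lines": 0,
                "words": 0,
            },
            "references": 0,
        }
    )
    ...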

The same block was also added to the DocumentMetadata dataclass, just without the drafts key.
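
As a rough sketch of what that might look like, assuming DocumentMetadata is declared the same way as SiteMetadata. Its other fields are omitted, and the helper name new_word_counts is hypothetical rather than taken from the site's code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


def new_word_counts() -> Dict[str, Any]:
    # Hypothetical helper: builds a fresh tally dict, without the "drafts" key
    return {
        "self": 0,
        "code": {"lines": 0, "words": 0},
        "references": 0,
    }


@dataclass
class DocumentMetadata:
    # Other fields omitted; only the word tallies are sketched here
    words: Dict[str, Any] = field(default_factory=new_word_counts)


a = DocumentMetadata()
b = DocumentMetadata()
a.words["code"]["lines"] += 3
print(b.words["code"]["lines"])  # 0, because each instance gets its own dict
```

The default_factory matters here: dataclasses reject a plain dict as a default value, and a factory guarantees that one page's tallies never leak into another's.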

The logic for tallying up the totals isn’t very elegant, and I consider this a first draft, but it works. First, the prose and reference words in each document are counted and added to the page and site totals:

    for key, page in documents.items():
        # Count whitespace-separated tokens in the plain content and the quoted source
        prose_wordcount = len(page.content.get("plain", "").split())
        references_wordcount = len(page.source.get("text", "").split())

        # Every page tracks its own prose count, whether draft or not
        page.words["self"] += prose_wordcount
        if page.status == "draft":
            # Drafts only feed the site-wide drafts total
            site.words["drafts"] += prose_wordcount
        else:
            site.words["self"] += prose_wordcount
            # References are only tallied for non-draft pages
            site.words["references"] += references_wordcount
            page.words["references"] += references_wordcount

        logger.debug(f"  {key}, {page.title[:40]}")
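
To see the loop's effect on a small input, here is a self-contained sketch. The page and site objects are stand-ins built with SimpleNamespace, the helper names empty_counts and make_page are hypothetical, and the "published" status string is an assumption (the post only shows "draft").

```python
from types import SimpleNamespace


def empty_counts():
    # Same shape as the words dict on the site metadata
    return {"self": 0, "drafts": 0, "code": {"lines": 0, "words": 0}, "references": 0}


def make_page(status, plain, quoted=""):
    # Stand-in for the real page object; only the fields the loop reads
    return SimpleNamespace(
        status=status,
        content={"plain": plain},
        source={"text": quoted},
        words=empty_counts(),
    )


site = SimpleNamespace(words=empty_counts())
documents = {
    "a": make_page("published", "four words of prose", quoted="two quoted"),
    "b": make_page("draft", "an unfinished draft"),
}

for key, page in documents.items():
    prose = len(page.content.get("plain", "").split())
    refs = len(page.source.get("text", "").split())
    page.words["self"] += prose
    if page.status == "draft":
        site.words["drafts"] += prose
    else:
        site.words["self"] += prose
        site.words["references"] += refs
        page.words["references"] += refs

print(site.words["self"], site.words["drafts"], site.words["references"])  # 4 3 2
```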

And then, during the syntax highlighting, the lines and words of each code block are tallied up, and the code words are subtracted from the prose wordcounts.

def save_code_block(match):
    leading_space = match.group(1)
    raw_html_marker = match.group(2)
    language = match.group(3)
    code = match.group(4).rstrip()
    trailing_space = match.group(5)
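    # Count whitespace-separated tokens and lines in this code block
    code_words = len(code.split())
    code_lines = len(code.splitlines())
    # Add them to both the per-page and the site-wide code tallies
    page.words["code"]["lines"] += code_lines
    page.words["code"]["words"] += code_words
    site.words["code"]["lines"] += code_lines
    site.words["code"]["words"] += code_words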
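    # Remove the wordcount of codeblocks from the prose wordcounts
    page.words["self"] -= code_words
    # A draft's prose was added to the drafts total, so subtract it from there
    if page.status == "draft":
        site.words["drafts"] -= code_words
    else:
        site.words["self"] -= code_words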

    ...
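
The count-then-subtract idea can be tried in isolation with a regex callback. This is only a sketch under stated assumptions: the FENCE pattern and the sample text are hypothetical, the site's real pattern (with its five groups) is not shown in the post, and the callback here leaves the text untouched instead of highlighting it.

```python
import re

# Hypothetical pattern: a fenced block with an optional language tag.
# Group 1 is the language, group 2 is the code itself.
FENCE = re.compile(r"```(\w*)\n(.*?)\n```", re.DOTALL)

text = "Some prose here.\n```python\nx = 1\nprint(x)\n```\nMore prose."

# Stand-in for the plain content: fences dropped, code kept in place
plain = FENCE.sub(lambda m: m.group(2), text)

# The first pass counts everything, code included
counts = {"self": len(plain.split()), "code": {"lines": 0, "words": 0}}


def save_code_block(match):
    code = match.group(2).rstrip()
    code_words = len(code.split())
    code_lines = len(code.splitlines())
    counts["code"]["lines"] += code_lines
    counts["code"]["words"] += code_words
    # Take the code's words back out of the prose total
    counts["self"] -= code_words
    return match.group(0)  # leave the text unchanged in this sketch


FENCE.sub(save_code_block, text)
print(counts)  # {'self': 5, 'code': {'lines': 2, 'words': 4}}
```

The subtraction only balances if the first pass counted the same tokens the callback sees, which is why the sketch builds plain from the same code text.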

The result is exactly what I wanted, but the method isn’t super elegant.

Including this post, the current counts are as follows:

222,852 words

    136,784 of my own published words
     21,441 words of unpublished drafts
     64,627 words of quotes and reference material
      6,659 lines of code
     24,182 words of code
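
The headline total appears to be the sum of the first three figures, with the words of code left out. A quick check of that arithmetic, using the numbers above in a dict shaped like the site metadata (the variable names are just for illustration):

```python
counts = {
    "self": 136_784,
    "drafts": 21_441,
    "references": 64_627,
    "code": {"lines": 6_659, "words": 24_182},
}

# Prose, drafts, and references make up the total; code is reported separately
total = counts["self"] + counts["drafts"] + counts["references"]
print(f"{total:,} words")  # 222,852 words
```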