Distinguishing between prose and code when counting words
I’m including more and larger code snippets in documents throughout the site (such as this one), and I didn’t like that these were being counted in the ‘wordcount’ of the site. I’ve always distinguished between my words and reference words: when you see a quote in one of my documents, it isn’t counted in my own wordcount. Now it’s time to distinguish more granularly between my prose and my code.
The first change was to the SiteMetadata dataclass.
@dataclass
class SiteMetadata:
    ...
    words: Dict[str, Any] = field(
        default_factory=lambda: {
            "self": 0,
            "drafts": 0,
            "code": {
                "lines": 0,
                "words": 0,
            },
            "references": 0,
        }
    )
    ...
The same block was also added to the DocumentMetadata dataclass, just without the drafts key.
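As a minimal sketch of how the two dataclasses could share that block, the snippet below builds the counters from one helper. The helper name `empty_word_counts` and its `include_drafts` parameter are hypothetical and not part of the site's code, and both classes are trimmed down to the single `words` field.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


def empty_word_counts(include_drafts: bool) -> Dict[str, Any]:
    """Return a fresh counter dict (hypothetical helper, for illustration)."""
    counts: Dict[str, Any] = {
        "self": 0,
        "code": {"lines": 0, "words": 0},
        "references": 0,
    }
    if include_drafts:
        counts["drafts"] = 0
    return counts


@dataclass
class DocumentMetadata:
    # Only the field discussed here; the real class has other fields.
    words: Dict[str, Any] = field(
        default_factory=lambda: empty_word_counts(include_drafts=False)
    )


@dataclass
class SiteMetadata:
    # Same block as DocumentMetadata, plus the site-wide "drafts" bucket.
    words: Dict[str, Any] = field(
        default_factory=lambda: empty_word_counts(include_drafts=True)
    )
```

Using `default_factory` matters here because the counters are mutable: each instance gets its own dict, so incrementing one document's totals never leaks into another's.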
The logic for tallying up the totals isn’t very elegant; I consider this a first draft, but it works. First, the words of each whole document are summed:
for key, page in documents.items():
    prose_wordcount = len(page.content.get("plain", "").split())
    references_wordcount = len(page.source.get("text", "").split())
    if page.status == "draft":
        page.words["self"] += prose_wordcount
        site.words["drafts"] += prose_wordcount
    else:
        page.words["self"] += prose_wordcount
        site.words["self"] += prose_wordcount
    site.words["references"] += references_wordcount
    page.words["references"] += references_wordcount
    logger.debug(f" {key}, {page.title[:40]}")
And then, during the syntax highlighting, the lines and words of each code block are tallied up, and the code words are subtracted from the prose word counts.
def save_code_block(match):
    leading_space = match.group(1)
    raw_html_marker = match.group(2)
    language = match.group(3)
    code = match.group(4).rstrip()
    trailing_space = match.group(5)
    code_words = len(code.split())
    code_lines = len(code.splitlines())
    page.words["code"]["lines"] += code_lines
    page.words["code"]["words"] += code_words
    site.words["code"]["lines"] += code_lines
    site.words["code"]["words"] += code_words
    # Remove the wordcount of codeblocks from the prose wordcounts
    page.words["self"] -= code_words
    site.words["self"] -= code_words
    ...
The result is exactly what I wanted, but the method isn’t super elegant.
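One possibly tidier variation, sketched here under assumptions rather than taken from the site's code: instead of counting everything and subtracting afterwards, tally each code block inside a `re.sub` callback and count the prose from whatever text remains. The `FENCE` pattern, the `count_words` function, and the assumption that code blocks are delimited by triple backticks are all hypothetical; the site's real regex has five groups and is not shown above.

```python
import re

TICKS = "`" * 3  # triple-backtick fence marker, assumed for this sketch

# Hypothetical pattern: group 1 is the language, group 2 is the code body.
FENCE = re.compile(TICKS + r"(\w*)\n(.*?)" + TICKS, re.DOTALL)


def count_words(text):
    """Split a document's counts into prose words and code lines/words."""
    counts = {"self": 0, "code": {"lines": 0, "words": 0}}

    def tally(match):
        code = match.group(2).rstrip()
        counts["code"]["lines"] += len(code.splitlines())
        counts["code"]["words"] += len(code.split())
        # Replace the block with a space so it is never counted as prose.
        return " "

    prose = FENCE.sub(tally, text)
    counts["self"] = len(prose.split())
    return counts
```

Because the code is removed before the prose is counted, there is no subtraction step, and so no need to decide which site-level bucket (published or drafts) a subtraction should come out of.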
Including this post, the current counts are as follows:
222,852 words
136,784 of my own published words
21,441 words of unpublished drafts
64,627 words of quotes and reference material
6,659 lines of code
24,182 words of code
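As a quick sanity check on these figures, the published, draft, and reference buckets sum exactly to the headline total, which suggests the words of code are reported separately rather than included in it. The variable names below are only for illustration.

```python
# Figures copied from the list above.
own_published = 136_784
drafts = 21_441
references = 64_627
code_words = 24_182
headline_total = 222_852

# The three prose and reference buckets account for the whole headline total.
prose_and_references = own_published + drafts + references
assert prose_and_references == headline_total

# Counting code as well would give a larger figure than the headline.
total_including_code = headline_total + code_words
```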