Wrestling Calibre to organise books my way

I’ve always been annoyed by the way Calibre organises ebooks. Today, though, I decided to dig into its export templating syntax, and I’ve come up with something I’m happy with. As with MusicBrainz previously, I’m preserving this here so I never have to fiddle with its arcane syntax again!

program:
    first_author = list_item(field('authors'),0,'&');
    author    = re(re(re(re(re(re(re(re( first_author, '\.', ''), '&', 'and'), '^\s+|\s+$', ''), ':', ''), '[\[\]\(\)\{\}]', ''), "'", ''), ',', ''), '\s+', '-');
    title     = re(re(re(re(re(re(re( field('title'),       '&', 'And'), '^\s+|\s+$', ''), ':', ''), '[\[\]\(\)\{\}]', ''), "'", ''), ',', ''), '\s+', '-');
    publisher = re(re(re(re(re(re(re( field('publisher'),   '&', 'And'), '^\s+|\s+$', ''), ':', ''), '[\[\]\(\)\{\}]', ''), "'", ''), ',', ''), '\s+', '-');
    year      = format_date(field('pubdate'),'yyyy');
    titleyear = "_" & year;

    series_part = test(
        field('series'),
        '-' & re(re(re(re(re(re(re( field('series'), '&', 'And'), '^\s+|\s+$', ''), ':', ''), '[\[\]\(\)\{\}]', ''), "'", ''), ',', ''), '\s+', '-')
        & '-' & format_number(field('series_index'), '0') & '_',
        ''
    );

    publisher_part = test(
        field('publisher'),
        '_' & publisher,
        ''
    );

    author & '/' & year & '/' & series_part & title & titleyear & publisher_part

Of course, just as soon as I’m done fiddling I learn that Calibre also supports Python templating, so I’ve since rewritten it in that far preferable syntax and expanded it to cover a lot more edge cases. The code is pretty straightforward.

NOTE: The python: at the beginning is a necessary sigil for Calibre to enter Python Template Mode.

python:
def evaluate(book, context):
    import re

    def clean(s):
        if not s:
            return ""
        s = re.sub(r"&", "and", str(s))
        s = re.sub(r"[\.\[\]\(\)\{\}\'\":,]", "", s)
        s = re.sub(r"[\\/<>|?*]", "-", s)
        s = re.sub(r"\s+", "-", s.strip())
        return re.sub(r"[-_]+", lambda m: m.group()[0], s).strip("-_")

    def first_author_name(book):
        if a_sort := (book.get("author_sort") or "").strip():
            first = re.split(r"\s*[&;]\s*|\s+and\s+", a_sort)[0].strip()
            if "," in first:
                last, firsts = [p.strip() for p in first.split(",", 1)]
                return f"{firsts} {last}".strip()
            return first

    # Build components
    author_clean = clean(first_author_name(book) or "Unknown")
    title = clean(book.get("title", ""))
    publisher = clean(book.get("publisher", ""))

    year = ""
    if pubdate := book.get("pubdate"):
        if hasattr(pubdate, "strftime") and (y := pubdate.strftime("%Y")).isdigit() and int(y) >= 1000:
            year = y

    series_part = ""
    if series := book.get("series"):
        # "or 0" guards against a series_index of None
        series_part = f"{clean(series)}-{int(book.get('series_index') or 0)}_"

    # Assemble path
    path = f"{author_clean}/"
    if year:
        path += f"{year}/"
    path += f"{series_part}{title}"
    if year:
        path += f"_{year}"
    if publisher:
        path += f"_{publisher}"

    return re.sub(r"[-_]+", lambda m: m.group()[0], path).strip("-_")
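
To poke at the cleaning rules without opening Calibre, here is a minimal sketch that copies the clean helper from the template above into a plain Python script and feeds it a few strings. The input strings are my guesses at what the original metadata looked like (they are not taken from Calibre), and "Either/Or" is just a made-up case to exercise the slash handling.

```python
import re


def clean(s):
    """Copy of the clean() helper from the template above."""
    if not s:
        return ""
    s = re.sub(r"&", "and", str(s))
    # Drop dots, brackets, quotes, colons and commas entirely
    s = re.sub(r"[\.\[\]\(\)\{\}\'\":,]", "", s)
    # Characters that are unsafe in file names become hyphens
    s = re.sub(r"[\\/<>|?*]", "-", s)
    # Whitespace runs become a single hyphen
    s = re.sub(r"\s+", "-", s.strip())
    # Collapse runs of separators down to the first character of the run
    return re.sub(r"[-_]+", lambda m: m.group()[0], s).strip("-_")


# Assumed sample inputs, chosen to resemble the metadata behind my library
samples = [
    "Man's Search for Meaning",
    "Barnes & Noble",
    "Viktor E. Frankl",
    "Either/Or",
]

for raw in samples:
    print(f"{raw!r} -> {clean(raw)!r}")
```

Running it outside Calibre like this makes it much quicker to try a new edge case than saving a book to disk each time.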

Things weren’t as smooth as they could have been, however. After banging my head against the wall for nearly an hour, having got my Python script to exactly where I wanted it, a persistent bug emerged wherein one of my variables was returning an unexpected value for no discernible reason…

I found the cause. For no bloody good reason, Kovid Goyal, the ~~benevolent~~ dictator of Calibre, has opted to run a couple of arbitrary string replacements on custom templates at runtime. Buried in a 464-line Python file installed by Calibre at /usr/lib/calibre/calibre/library/save_to_disk.py is this function:

def preprocess_template(template):
    template = template.replace('//', '/')
    template = template.replace('{author}', '{authors}')
    template = template.replace('{tag}', '{tags}')
    if not isinstance(template, str):
        template = template.decode(preferred_encoding, 'replace')
    return template

WHAT! Why are you replacing strings in a user’s template, Kovid?! What’s most insidious is that he’s not replacing author but {author}, so this bug only bites you if you use an author variable inside an f-string (e.g. path = f"{author}/{year}" in my case).
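
To see the effect in isolation, here is a minimal sketch that applies the same three replacements to a tiny snippet of Python source and then executes the result. The function below is a stand-in that copies the replace calls quoted above; it is not imported from Calibre, and the sample source string (including the author name and year) is made up for illustration.

```python
def preprocess_template(template: str) -> str:
    # Stand-in copy of the three replacements quoted above (not Calibre's module)
    template = template.replace('//', '/')
    template = template.replace('{author}', '{authors}')
    template = template.replace('{tag}', '{tags}')
    return template


# Hypothetical two-line script with an "author" variable used in an f-string
source = 'author = "Some-Author"\npath = f"{author}/2020"\n'

mangled = preprocess_template(source)
print(mangled)

# Executing the mangled source reproduces the NameError from my traceback
error_message = None
try:
    exec(mangled, {})
except NameError as err:
    error_message = str(err)

print("NameError:", error_message)
```

The assignment on the first line survives untouched because it has no braces around it; only the f-string reference gets rewritten, which is why the variable appears to vanish.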

Now that I know the source of this weird feckin bug, it’s a trivial fix: rename my author variable to anything else. I went with author_clean. Bugger me.
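
For future templates, a small checker can flag the substrings that the function quoted earlier rewrites before Calibre ever sees them. This is only a sketch: the find_hazards helper and the two sample templates are hypothetical, and the list of substrings is taken from the three replace calls shown above, so it may not match other Calibre versions.

```python
# Substrings rewritten by the preprocess_template function quoted earlier
HAZARDS = ("//", "{author}", "{tag}")


def find_hazards(template: str) -> list[tuple[int, str]]:
    """Return (line_number, substring) pairs for each risky substring found."""
    hits = []
    for lineno, line in enumerate(template.splitlines(), start=1):
        for needle in HAZARDS:
            if needle in line:
                hits.append((lineno, needle))
    return hits


# Hypothetical before/after snippets mirroring my rename
before = 'author = "x"\npath = f"{author}/{year}"\n'
after = 'author_clean = "x"\npath = f"{author_clean}/{year}"\n'

print(find_hazards(before))  # [(2, '{author}')]
print(find_hazards(after))   # []
```

Note that the "//" replacement would presumably also turn Python floor division into true division, so it is worth flagging as well.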

For future me, here’s the full error that eventually led me back to the source of this mayhem.

calibre, version 8.7.0
ERROR: Error while saving: Failed to save any books to disk, click "Show details" for more information

Failed to save: Confessions of an English Opium-Eater by Thomas De Quincey to disk, with error:
    Traceback (most recent call last):
      File "/usr/lib/calibre/calibre/utils/formatter.py", line 1770, in _run_python_template
        rslt = compiled_template(self.book, self.python_context_object)
      File "<string>", line 94, in evaluate
    NameError: name 'authors' is not defined. Did you mean: 'author'?

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/calibre/calibre/library/save_to_disk.py", line 286, in get_path_components
        components = get_components(opts.template, mi, book_id, opts.timefmt, path_length,
            ascii_filename if opts.asciiize else sanitize_file_name,
            to_lowercase=opts.to_lowercase,
            replace_whitespace=opts.replace_whitespace, safe_format=False,
            last_has_extension=False, single_dir=opts.single_dir)
      File "/usr/lib/calibre/calibre/library/save_to_disk.py", line 251, in get_components
        components = Formatter().unsafe_format(template, format_args, mi)
      File "/usr/lib/calibre/calibre/utils/formatter.py", line 1978, in unsafe_format
        return self.evaluate(fmt, [], kwargs, self.global_vars)
               ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/calibre/calibre/utils/formatter.py", line 1891, in evaluate
        ans = self._eval_python_template(fmt[7:], self.column_name)
      File "/usr/lib/calibre/calibre/utils/formatter.py", line 1758, in _eval_python_template
        return self._run_python_template(func, arguments=None)
               ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/calibre/calibre/utils/formatter.py", line 1784, in _run_python_template
        raise ValueError(_('Error in function {0} on line {1} : {2} - {3}').format(
                        ss.name, ss.lineno, type(e).__name__, str(e)))
    ValueError: Error in function evaluate on line 94 : NameError - name 'authors' is not defined

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/calibre/calibre/gui2/save.py", line 138, in do_one_collect
        self.collect_data(book_id)
        ~~~~~~~~~~~~~~~~~^^^^^^^^^
      File "/usr/lib/calibre/calibre/gui2/save.py", line 146, in collect_data
        components = get_path_components(self.opts, mi, book_id, self.path_length)
      File "/usr/lib/calibre/calibre/library/save_to_disk.py", line 292, in get_path_components
        raise ValueError(_('Failed to calculate path for '
            'save to disk. Template: %(templ)s\n'
            'Error: %(err)s')%dict(templ=opts.template, err=e))
    ValueError: Failed to calculate path for save to disk. Template: python:
    def evaluate(book, context):
        import re
        import ast

        def clean(s):
            if not s:
                return ""
            s = str(s)
            # Replace & with "and", remove dots, brackets, quotes, colons, commas
            s = re.sub(r"&", "and", s)
            s = re.sub(r"\.", "", s)
            s = re.sub(r"[\[\]\(\)\{\}]", "", s)
            s = re.sub(r"[\'\":,]", "", s)
            s = s.strip()
            # Whitespace -> hyphen
            s = re.sub(r"\s+", "-", s)
            # Collapse repeated separators
            s = re.sub(r"-+", "-", s)
            s = re.sub(r"_+", "_", s)
            # Trim edges
            return s.strip("-_")

        def first_author_name(book):
            # Prefer author_sort: "Last, First & Last2, First2"
            a_sort = book.get("author_sort", "") or ""
            if a_sort:
                first = re.split(r"\s*&\s*|\s*;\s*|\s+and\s+", a_sort)[0].strip()
                if "," in first:
                    last, firsts = [p.strip() for p in first.split(",", 1)]
                    return f"{firsts} {last}".strip()
                return first

            # Fallback: authors (may be list or a stringified list)
            authors = book.get("authors", [])
            if isinstance(authors, (list, tuple)):
                return authors[0] if authors else ""

            if isinstance(authors, str):
                s = authors.strip()
                if s.startswith("[") and s.endswith("]"):
                    try:
                        # Attempt to parse it as a Python literal
                        parsed = ast.literal_eval(s)
                        if isinstance(parsed, (list, tuple)) and parsed:
                            return str(parsed[0])
                    except (ValueError, SyntaxError):
                        # If ast fails, fall back to string manipulation.
                        s_no_brackets = s[1:-1].strip()
                        first_author_str = s_no_brackets.split(",")[0].strip()
                        return first_author_str.strip("'\"")

                # If it's a regular string with multiple authors, split by common delimiters
                if "&" in s or "," in s or ";" in s:
                    return re.split(r"\s*&\s*|\s*,\s*|\s*;\s*", s)[0].strip()

                return s  # It's a single author name as a string

            return ""

        # ---- AUTHOR ----
        author_name = first_author_name(book) or "Unknown"
        author = clean(author_name)

        # ---- TITLE ----
        title = clean(book.get("title", ""))

        # ---- PUBLISHER ----
        publisher = clean(book.get("publisher", ""))

        # ---- YEAR ----
        pubdate = book.get("pubdate", None)
        year = ""
        if pubdate and hasattr(pubdate, "strftime"):
            y = pubdate.strftime("%Y")
            if y.isdigit() and int(y) >= 1000:
                year = y

        # ---- SERIES ----
        series = book.get("series", "")
        if series:
            series_index = book.get("series_index", 0)
            series_part = f"-{clean(series)}-{int(series_index):0d}_"
        else:
            series_part = ""

        # ---- PUBLISHER PART ----
        publisher_part = f"_{publisher}" if publisher else ""

        # ---- TITLE+YEAR ----
        titleyear = f"_{year}" if year else ""

        # ---- BUILD PATH ----
        path = f"{authors}/"
        if year:
            path += f"{year}/"
        path += f"{series_part}{title}{titleyear}{publisher_part}"

        # Final safety pass
        path = re.sub(r"-+", "-", path)
        path = re.sub(r"_+", "_", path)
        return path.strip("-_")
    Error: Error in function evaluate on line 94 : NameError - name 'authors' is not defined

And lastly, here’s a sample of my ebooks nicely organised.

library/documents/books
├── Alexandre-Dumas
│   └── 2004
│       └── Count-of-Monte-Cristo-Abridged_2004_Barnes-and-Noble.epub
├── Arundhati-Roy
│   └── 2020
│       └── Azadi-Freedom-Fascism-Fiction_2020_Haymarket-Books.epub
├── Avvaiyar
│   └── 2009
│       └── Give-Eat-and-Live-Poems-of-Avvaiyar_2009_Red-Hen-Press.pdf
├── Ellen-Lupton
│   └── 2010
│       └── Thinking-with-Type-A-Critical-Guide-for-Designers-Writers-Editors-and-Students-2nd-Edition_2010.epub
├── George-Saunders
│   └── 2021
│       └── A-Swim-in-a-Pond-in-the-Rain_2021_Random-House-Publishing-Group.epub
├── Julia-Cameron
│   └── 2016
│       └── The-Artists-Way-25th-Anniversary-Edition_2016_Penguin-Publishing-Group.epub
├── Kyle-Siemens
│   └── 2022
│       └── Piranha-Fishing-in-the-Amazon_2022.pdf
├── Lao-tzu
│   └── 1996
│       └── Taoteching-With-Selected-Commentaries-from-the-Past-2000-Years_1996_Red-Pine.pdf
├── Martha-Beck
│   └── 2021
│       └── The-Way-of-Integrity-Finding-the-Path-to-Your-True-Self_2021_Penguin-Publishing-Group.epub
├── Raynor-Winn
│   └── 2018
│       └── The-Salt-Path_2018_Penguin-Books-Limited.epub
├── Sir-Ernest-Henry-Shackleton
│   └── 2012
│       └── South-The-Story-of-Shackletons-Last-Expedition-1914-1917_2012_Duke-Classics.epub
├── Susan-Sontag
│   └── 2021
│       └── On-Photography_2021.pdf
├── Ta-Nehisi-Coates
│   └── 2015
│       └── Between-the-World-and-Me_2015_Random-House-Publishing-Group.epub
└── Viktor-E-Frankl
    └── 2006
        └── Mans-Search-for-Meaning_2006_Beacon-Press.epub

I also dug out my PDF fix-up tricks for a couple of PDFs. The first was just missing an EOF marker at the end of the file, so:

pdftk \
    Cappadocia-A-Travel-Guide-Through-the-Land-of-Fairychimneys-and-Rock-Castles_2010_Books-on-Demand.pdf \
    output \
    Cappadocia-A-Travel-Guide-Through-the-Land-of-Fairychimneys-and-Rock-Castles_2010_Books-on-Demand-Repaired.pdf
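
A quick way to spot this kind of damage before reaching for pdftk is to look for the end-of-file marker directly. The sketch below assumes "missing an EOF" means the trailing %%EOF marker was absent; the has_eof_marker helper is hypothetical, the 1024-byte search window is an assumption based on common reader behaviour, and the byte strings are made-up stand-ins rather than real PDFs.

```python
from pathlib import Path


def has_eof_marker(data: bytes, window: int = 1024) -> bool:
    """True if the PDF end-of-file marker appears in the last `window` bytes."""
    return b"%%EOF" in data[-window:]


def check_file(path: str) -> bool:
    """Convenience wrapper that reads a file from disk and checks its tail."""
    return has_eof_marker(Path(path).read_bytes())


# Made-up stand-ins for a complete and a truncated file
good = b"%PDF-1.7\n...objects...\nstartxref\n0\n%%EOF\n"
truncated = good[: good.rfind(b"%%EOF")]

print(has_eof_marker(good))       # True
print(has_eof_marker(truncated))  # False
```

This only checks for the marker; it says nothing about whether the cross-reference table is intact.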

The second had major xref issues throughout, and even using ghostscript to try to fully rewrite it proved insufficient :(

ghostscript \
    -o Cool-Tools-A-Catalog-of-Possibilities_2014-Repaired.pdf \
    -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
    Cool-Tools-A-Catalog-of-Possibilities_2014.pdf

I’ll have to revisit that one another time. The file still works fine because PDF readers are tolerant, but it being malformed means I can’t fix up all the metadata I’ve added.