Martijn has Bear

Custom reStructuredText Directives with Pandoc

A little over a week ago there was a brief mention of reStructuredText (ReST) in the IndieWeb chat. I was a bit critical of it then, and that remark ended up on the wiki.

I did not want the wiki to only have one throw-away line of criticism from chat, so I returned to flesh out what my criticism was.

One of the things I really like about ReST is how it has always supported custom directives. Theoretically this meant you could use it for all sorts of content and simply patch in any missing elements. I say theoretically, as the thing I really do not like about ReST is how few options you have to parse it outside of Python.

As the IndieWeb wiki page talks about Ryan using pandoc to do the parsing, I wondered if that might be an avenue of implementing custom directives!

Pandoc offers a simple way to add extra transformation rules through filters. These come in two variants: so-called traditional filters and Lua filters. As my Lua-fu is quite weak, I will go for the traditional ones.

For my proof of concept, I will implement a YouTube embed. In ReST this will take the form of a .. youtube:: directive. In the HTML output this should be the lite-youtube custom element from Paul Irish.

Filters are run on the internal representation of a document that pandoc creates after parsing the input. This is given to the filter in the form of a JSON string that needs to be parsed. The filter should then output a JSON string in the same format for pandoc to process further.
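
To make that round trip concrete, here is a minimal sketch of the JSON exchange, assuming the usual top-level shape with `pandoc-api-version`, `meta` and `blocks` keys. The `applyFilter` helper and the sample document are made up for illustration, and the version numbers are placeholders rather than anything a particular pandoc release requires.

```javascript
// Hypothetical helper: parse pandoc's JSON, transform the top-level blocks,
// and serialise the result back to a JSON string.
function applyFilter(jsonText, transformBlock) {
  const ast = JSON.parse(jsonText);
  ast.blocks = ast.blocks.map(transformBlock);
  return JSON.stringify(ast);
}

// Illustrative input: a document holding the single paragraph "Hello".
const sampleInput = JSON.stringify({
  "pandoc-api-version": [1, 23, 1], // placeholder version numbers
  meta: {},
  blocks: [{ t: "Para", c: [{ t: "Str", c: "Hello" }] }],
});

// An identity transform hands every block back unchanged.
const sampleOutput = applyFilter(sampleInput, (block) => block);
```

In a real filter the input string would come from stdin and the output would go to stdout; only the transform in the middle changes from filter to filter.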

Pandoc parses unknown directives into Div blocks, with the directive name as the first class. To insert code that pandoc will leave untouched during further processing, we can use RawBlock blocks. With just that, we can already build a functioning filter:

#!/usr/bin/env bun
const ast = await Bun.stdin.json();
ast.blocks.forEach((block, index) => {
  if (block.t === "Div" && block.c[0][1][0] === "youtube") {
    ast.blocks[index] = {
      t: "RawBlock",
      c: ["html", `<lite-youtube videoid="???"></lite-youtube>`]
    };
  }
});
process.stdout.write(JSON.stringify(ast));

Next we still need to extract the video ID. I am working with the following ReST syntax:

.. youtube:: https://www.youtube.com/watch?v=dQw4w9WgXcQ

The way pandoc ends up parsing this is as a Div with a nested Para that includes a Link. It would have been better if it had a special case for a single line. But I am guessing it does this because the usual ReST directives are different types of admonitions like:

.. error:: The text can start here
   and continue after softwrapping.

   And even include a second paragraph.

Assuming I will never change the structure of providing the YouTube link on the first line, the simplest way is to walk the tree to where we know this happens:

const url = block.c[1][0].c[0].c[2][0];
// block.c[1][0]       is { t: "Para", c: Inline[] }
// block.c[1][0].c[0]  is { t: "Link", c: [Attr, Inline[], Target] }
// Target is [url, title], so .c[2][0] is the URL string
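
Hard-coded indexing like this throws a TypeError as soon as the directive body has a different shape. A slightly more defensive sketch, assuming the Div, Para and Link structure described above, could check each level and return null otherwise. The `extractDirectiveUrl` name and the sample block are illustrative, not part of pandoc.

```javascript
// Hypothetical helper: return the URL of a directive whose body starts with
// a paragraph that starts with a link, or null when the shape is different.
function extractDirectiveUrl(block) {
  if (block?.t !== "Div") return null;
  const children = block.c?.[1]; // Div content is [Attr, Block[]]
  const para = children?.[0];
  if (para?.t !== "Para") return null;
  const link = para.c?.[0];
  if (link?.t !== "Link") return null;
  const target = link.c?.[2]; // Link content is [Attr, Inline[], Target]
  return target?.[0] ?? null; // Target is [url, title]
}

// Illustrative block, shaped like the Div described above.
const sampleUrl = "https://www.youtube.com/watch?v=dQw4w9WgXcQ";
const sampleDiv = {
  t: "Div",
  c: [
    ["", ["youtube"], []],
    [{
      t: "Para",
      c: [{
        t: "Link",
        c: [["", [], []], [{ t: "Str", c: sampleUrl }], [sampleUrl, ""]],
      }],
    }],
  ],
};
```

Calling `extractDirectiveUrl(sampleDiv)` gives back the URL, while an empty directive or a directive that starts with plain text yields null instead of crashing the filter.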

We also need to extract the actual video ID from the URL. There are a couple of different potential URLs to think about here. I am just going with the simplest ones where the video ID is part of the search parameters (or whatever you want to call them).

function parseYt(url) {
  try {
    const parsedUrl = new URL(url);
    return parsedUrl.searchParams.get("v");
  } catch (e) {
    return null;
  }
}
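
If other link styles should work too, the same idea extends a little. The sketch below, called `parseYtLoose` purely for illustration, additionally accepts `youtu.be/<id>` short links and `/embed/<id>` paths. Which URL forms are worth supporting is an assumption, it does not validate the ID itself, and unlike `parseYt` it rejects hosts other than YouTube's.

```javascript
// Hypothetical variant of parseYt that understands a few more URL shapes.
function parseYtLoose(url) {
  let parsed;
  try {
    parsed = new URL(url);
  } catch (e) {
    return null; // not a valid absolute URL
  }
  const host = parsed.hostname.replace(/^www\./, "");
  if (host === "youtu.be") {
    // Short links carry the ID as the first path segment.
    return parsed.pathname.split("/")[1] || null;
  }
  if (host === "youtube.com") {
    // Regular watch links carry the ID in the "v" search parameter.
    const fromQuery = parsed.searchParams.get("v");
    if (fromQuery) return fromQuery;
    // Embed links carry it as the path segment after /embed/.
    const match = parsed.pathname.match(/^\/embed\/([^/]+)/);
    return match ? match[1] : null;
  }
  return null; // any other host is ignored
}
```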

There are still some interesting things left as a future exercise. The custom HTML element supports a number of parameters, and ReST directives support options too. Maybe a future transformation could support that:

.. youtube:: https://www.youtube.com/watch?v=dQw4w9WgXcQ
   :playlabel: An extremely wise man sharing their thoughts.

<lite-youtube videoid="dQw4w9WgXcQ" playlabel="An extremely wise man sharing their thoughts."></lite-youtube>
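
How pandoc exposes directive options to a filter is not covered here, so the following sketch only handles the output side: building the element from a video ID plus a plain object of extra attributes, with the values escaped. Both `escapeAttr` and `renderLiteYoutube` are hypothetical helper names, and attribute names are assumed to be trusted.

```javascript
// Escape characters that would break out of a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build the RawBlock for a video ID and an object of extra attributes.
// Attribute names are used as-is; only the values are escaped.
function renderLiteYoutube(videoId, options = {}) {
  const attrs = Object.entries({ videoid: videoId, ...options })
    .map(([name, value]) => `${name}="${escapeAttr(value)}"`)
    .join(" ");
  return { t: "RawBlock", c: ["html", `<lite-youtube ${attrs}></lite-youtube>`] };
}
```

Escaping matters even without options, since the video ID comes straight from the document and ends up inside raw HTML.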

It would also be good to turn this into a Lua filter for the performance gains.

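But I am happy to have established that I can use a non-Python language and a non-Python dependency to implement my own custom ReST directives! Here is the complete filter, saved as youtube.js and marked executable so that pandoc honours the shebang line, followed by the command that runs it: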

#!/usr/bin/env bun
function parseYt(url) {
  try {
    const parsedUrl = new URL(url);
    return parsedUrl.searchParams.get("v");
  } catch (e) {
    return null;
  }
}
const ast = await Bun.stdin.json();
ast.blocks.forEach((block, index) => {
  if (block.t === "Div" && block.c[0][1][0] === "youtube") {
    const videoId = parseYt(block.c[1][0].c[0].c[2][0]);
    if (videoId !== null) {
      ast.blocks[index] = {
        t: "RawBlock",
        c: ["html", `<lite-youtube videoid="${videoId}"></lite-youtube>`]
      };
    }
  }
});
process.stdout.write(JSON.stringify(ast));

pandoc -s article.rst --filter youtube.js -o article.html