go-sanitise-html: The Making Of
First published Sat Dec 16, 2023
tl;dr: keep in mind what you’d prefer to happen in the “worst case” scenario. When sanitising HTML you could loop through and remove anything you don’t want, but it may be better to construct a new output and only copy in what’s valid. Don’t give up that safety for insignificant performance gains.
Earlier this year I created a library called go-sanitise-html.
I needed two main features:
- Remove tags that aren’t in the whitelist, while keeping any inner contents. (For example, `Hello <strong>world</strong>` might become `Hello world`.)
- Remove attributes on allowed tags, if that tag doesn’t have the attribute in the whitelist.
I’ll spare you the story of trying to find an existing library as it’s pretty boring - I didn’t find one that I was happy with, so I decided to write my own.
First attempt: The wrong way
My first attempt to create this library was a disaster, because I made a critical design mistake early on: I decided to loop through the AST and unwrap any disallowed HTML nodes in place. I didn’t do this without thought though - I chose it because I figured it was the most efficient way. I only had to walk the tree once, and there was no duplicate of the tree in memory.
There were major downsides though - it was tricky to keep the pointers correct when removing nodes, so that everything was still checked without falling into an infinite loop. Code readability and maintainability also took a major hit, since there were many places where pointers had to be updated manually, and if you forgot one it could crash or miss checking some branches.
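To give a feel for that bookkeeping, here is a minimal sketch of in-place unwrapping. It is not the library’s original code: the `Node` type and the `unwrapInPlace`, `render` and `demoInPlace` functions are hypothetical names made up for illustration, and the sketch uses a slice of children rather than sibling pointers. Even so, the core hazard shows up: after hoisting a node’s children into its parent, the position must not advance, or the hoisted children are never checked.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, minimal tree node for this sketch.
// An empty Tag marks a text node whose content is in Text.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// unwrapInPlace mutates parent, replacing every disallowed child
// element with that child's own children.
func unwrapInPlace(parent *Node, allowed map[string]bool) {
	for i := 0; i < len(parent.Children); {
		c := parent.Children[i]
		if c.Tag != "" && !allowed[c.Tag] {
			// Copy the tail first: the append below may overwrite it,
			// because it reuses the same backing array.
			rest := append([]*Node{}, parent.Children[i+1:]...)
			parent.Children = append(append(parent.Children[:i], c.Children...), rest...)
			// Do NOT advance i. The hoisted children now sit at index i
			// and have not been checked yet.
			continue
		}
		unwrapInPlace(c, allowed)
		i++
	}
}

// render prints the nodes back out (no escaping, to keep the sketch short).
func render(nodes []*Node) string {
	var b strings.Builder
	for _, n := range nodes {
		if n.Tag == "" {
			b.WriteString(n.Text)
			continue
		}
		b.WriteString("<" + n.Tag + ">" + render(n.Children) + "</" + n.Tag + ">")
	}
	return b.String()
}

// demoInPlace unwraps Hello <em><strong>world</strong></em><b>!</b>
// with only <b> allowed.
func demoInPlace() string {
	root := &Node{Tag: "div", Children: []*Node{
		{Text: "Hello "},
		{Tag: "em", Children: []*Node{
			{Tag: "strong", Children: []*Node{{Text: "world"}}},
		}},
		{Tag: "b", Children: []*Node{{Text: "!"}}},
	}}
	unwrapInPlace(root, map[string]bool{"b": true})
	return render(root.Children)
}

func main() {
	fmt.Println(demoInPlace()) // Hello world<b>!</b>
}
```

If the loop did advance `i` after splicing, the nested `<strong>` in this example would slip through unchecked, which is exactly the kind of failure that lets unsafe data into the output.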
Ultimately, though I seemingly got this version working, I could never trust it fully as there was too much scope for some unforeseen edge-case that could cause a crash, or let unsafe data through.
Rewrite: How I should have done it from the start
In the end I decided to rewrite the library. In hindsight my earlier design choice was a serious case of over-optimising, given how insignificant the memory usage here is anyway!
I took a more sensible approach this time: I walk the parsed input AST and copy over anything that is valid into a new output. The code is much cleaner, and the outcome is much better if there’s an issue, because anything the code fails to handle simply never makes it into the output. I’d prefer missing output to unsafe output!
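The copy-based approach can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual go-sanitise-html code: the `Node` and `Whitelist` types and the `sanitise`, `render` and `demoCopy` functions are hypothetical, and the sketch ignores details such as void elements and comments.

```go
package main

import (
	"fmt"
	"html"
	"sort"
	"strings"
)

// Node is a hypothetical, minimal tree node for this sketch.
// An empty Tag marks a text node whose content is in Text.
type Node struct {
	Tag      string
	Text     string
	Attrs    map[string]string
	Children []*Node
}

// Whitelist maps each allowed tag to the set of attributes allowed on it.
type Whitelist map[string]map[string]bool

// sanitise never modifies its input. It builds a new list of nodes and
// copies in only what the whitelist allows.
func sanitise(in []*Node, wl Whitelist) []*Node {
	var out []*Node
	for _, n := range in {
		if n.Tag == "" {
			out = append(out, &Node{Text: n.Text})
			continue
		}
		kids := sanitise(n.Children, wl)
		allowedAttrs, ok := wl[n.Tag]
		if !ok {
			// Disallowed tag: drop the tag, keep its sanitised contents.
			out = append(out, kids...)
			continue
		}
		cp := &Node{Tag: n.Tag, Attrs: map[string]string{}, Children: kids}
		for k, v := range n.Attrs {
			if allowedAttrs[k] {
				cp.Attrs[k] = v
			}
		}
		out = append(out, cp)
	}
	return out
}

// render serialises nodes, escaping text and attribute values.
func render(nodes []*Node) string {
	var b strings.Builder
	for _, n := range nodes {
		if n.Tag == "" {
			b.WriteString(html.EscapeString(n.Text))
			continue
		}
		b.WriteString("<" + n.Tag)
		keys := make([]string, 0, len(n.Attrs))
		for k := range n.Attrs {
			keys = append(keys, k)
		}
		sort.Strings(keys) // stable attribute order
		for _, k := range keys {
			b.WriteString(" " + k + `="` + html.EscapeString(n.Attrs[k]) + `"`)
		}
		b.WriteString(">" + render(n.Children) + "</" + n.Tag + ">")
	}
	return b.String()
}

// demoCopy sanitises <p class="x" onclick="evil()">Hello <strong>world</strong></p>
// with only <p> and its class attribute allowed.
func demoCopy() string {
	in := []*Node{{
		Tag:   "p",
		Attrs: map[string]string{"class": "x", "onclick": "evil()"},
		Children: []*Node{
			{Text: "Hello "},
			{Tag: "strong", Children: []*Node{{Text: "world"}}},
		},
	}}
	wl := Whitelist{"p": {"class": true}}
	return render(sanitise(in, wl))
}

func main() {
	fmt.Println(demoCopy()) // <p class="x">Hello world</p>
}
```

The useful property is that nothing reaches the output unless a line of code explicitly puts it there. A forgotten case results in missing content rather than unsafe content.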
No matter how long I’ve been programming, I always seem to forget this lesson. So future self - keep this in mind next time! Optimise for code readability over performance. It’ll save you hours.