Python html5lib Skipped Elements

I've been working on some interesting Python stuff at Mozilla, and one recent task called for rendering a page, finding elements with a URL attribute value (like img[src] or a[href]), and ensuring those values become absolute URLs.  One problem I encountered when using html5lib was that LINK and IMG elements were being skipped when I tokenized the HTML.  After browsing through the html5lib source code, I found a variable called voidElements which included both LINK and IMG:

voidElements = frozenset((
    "base",
    "command",
    "event-source",
    "link",
    "meta",
    "hr",
    "br",
    "img",
    "embed",
    "param",
    "area",
    "col",
    "input",
    "source"
))

When I commented out those two elements, they were found on the next run of my routine, meaning their presence in the set was causing my problem.  Here's how I skirted the issue:

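from html5lib import constants as html5lib_constants

# Copy every void element into a mutable set
new_void_set = set()
for item in html5lib_constants.voidElements:
    new_void_set.add(item)
# Drop the two elements that were being skipped
new_void_set.remove('link')
new_void_set.remove('img')
# Freeze the result and swap it in for the library's original set
html5lib_constants.voidElements = frozenset(new_void_set)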

Since voidElements is a frozenset, I couldn't simply remove LINK and IMG in place, so I needed to create a new frozenset without those elements.  Let me know if there's a more Pythonic way of creating this frozenset.  In any event, delving into the deep recesses of html5lib paid off and I accomplished the goal!
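
One candidate for a more Pythonic version is plain set difference: a frozenset can't be modified, but subtracting from it returns a brand new frozenset. Here is a minimal sketch of that idea. The `without` helper is a hypothetical name of my own, not part of html5lib, and `void_elements` is a stand-in copied from the listing above so the snippet runs without html5lib installed.

```python
# Stand-in for html5lib's voidElements, copied from the listing above
# so this sketch runs on its own (assumption: same contents).
void_elements = frozenset((
    "base", "command", "event-source", "link", "meta", "hr", "br",
    "img", "embed", "param", "area", "col", "input", "source",
))


def without(elements, *names):
    """Return a new frozenset with the given names removed.

    Hypothetical helper for illustration; not an html5lib function.
    """
    # Set difference never mutates the original; because the left-hand
    # operand is a frozenset, the result is a frozenset too.
    return elements - frozenset(names)


trimmed = without(void_elements, "link", "img")
```

Against the real library, the equivalent would presumably be `html5lib_constants.voidElements = without(html5lib_constants.voidElements, "link", "img")`. Keep in mind that rebinding a module attribute like this only affects code that looks up `voidElements` on the module when it runs; anything that imported the name directly beforehand keeps the original set.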

Discussion

  1. You could use a list comprehension to filter the elements instead; something like:

    html5lib_constants.voidElements = frozenset([e for e in html5lib_constants.voidElements if e not in ["link", "img"]])

    Sounds like it would be useful to add void element overriding as a feature to the library in the future, though.
