Python html5lib Skipped Elements

By  on  

I've been working on some interesting python stuff at Mozilla and one task recently called for called for rending a page and then finding elements with a URL attribute value (like img[src] or a[href]) and ensuring they become absolute URLs.  One problem I encountered when using html5lib was that LINK and IMG elements were being skipped when I tokenized the HTML.  After browsing through the html5lib source code, I found a variable called voidElements which included both LINK and IMAGE:

voidElements = frozenset((
    "base",
    "command",
    "event-source",
    "link",
    "meta",
    "hr",
    "br",
    "img",
    "embed",
    "param",
    "area",
    "col",
    "input",
    "source"
))

When I commented out those two elements, they were found upon next run of my routine, meaning their presence in the set were causing me problems.  Here's how I skirted the issue:

new_void_set = set()
for item in html5lib_constants.voidElements:
	new_void_set.add(item)
new_void_set.remove('link')
new_void_set.remove('img')
html5lib_constants.voidElements = frozenset(new_void_set)

Since voidElements is a frozenset, I couldn't simply remove LINK and IMG, so I needed to create a new frozenset without those elements.  Let me know if there's a more python-ish way of creating this frozen set.  In an event, delving into the deep recesses of html5lib paid off and I accomplished the goal!

Recent Features

  • By
    Being a Dev Dad

    I get asked loads of questions every day but I'm always surprised that they're rarely questions about code or even tech -- many of the questions I get are more about non-dev stuff like what my office is like, what software I use, and oftentimes...

  • By
    Send Text Messages with PHP

    Kids these days, I tell ya.  All they care about is the technology.  The video games.  The bottled water.  Oh, and the texting, always the texting.  Back in my day, all we had was...OK, I had all of these things too.  But I still don't get...

Incredible Demos

  • By
    MooTools-Like Element Creation in jQuery

    I really dislike jQuery's element creation syntax. It's basically the same as typing out HTML but within a JavaScript string...ugly! Luckily Basil Goldman has created a jQuery plugin that allows you to create elements using MooTools-like syntax. Standard jQuery Element Creation Looks exactly like writing out...

  • By
    dat.gui:  Exceptional JavaScript Interface Controller

    We all love trusted JavaScript frameworks like MooTools, jQuery, and Dojo, but there's a big push toward using focused micro-frameworks for smaller purposes. Of course, there are positives and negatives to using them.  Positives include smaller JS footprint (especially good for mobile) and less cruft, negatives...

Discussion

  1. You could use list comprehension to filter the elements instead; something like:

    html5lib_constants.voidElements = frozenset([e for e in html5lib_constants.voidElements if e not in [“link”, “img”]])

    Sounds like it would be useful to add void element overriding as a feature to the library in the future, though.

Wrap your code in <pre class="{language}"></pre> tags, link to a GitHub gist, JSFiddle fiddle, or CodePen pen to embed!