Custom XML notes for Cambridge lectures

Document formats keep evolving. New formats are continually created and abandoned, leaving behind an incredible amount of stagnant data. Why should I have to learn how to deal with every popular trend when I just want to store my thoughts?

XML is extremely powerful because it can express semantic ideas in a form that is easily transformed into common formats. This lets people write data for one specific application and reuse it later in a variety of circumstances. By chaining transformations, data can be converted into almost any desired representation.

Last term I designed a custom format for exam revision notes, converting the erratic lecture slides into a uniform format beautifully typeset with HTML and CSS. The HTML was generated from the XML by a short XSL transformation, which most modern browsers support. This meant that each set of notes could be viewed on my Eee PC (Ubuntu), my desktop (Windows 7) and the machines at the Cambridge Computer Lab (Windows XP/Linux). It is an efficient setup that anyone can emulate.

An example section

This is the general format of the document:

<section>
	<title>Natural language interfaces and dialogue systems</title>
	<p><defn><for>Natural Language User Interfaces</for> 
	(LUI) are <is>a type of computer human interface where linguistic 
	phenomena such as verbs, phrases and clauses act as UI controls for creating, 
	selecting and modifying data in software applications</is>.</defn>
		<resource author="Wikipedia" 
			url="http://en.wikipedia.org/wiki/Natural_language_user_interface">
			Natural language user interface
		</resource>
	</p>
</section>

My aim is to refine this portable document format so that I can write lecture notes in it from the very beginning of third-year Computer Science. Because both versions are XML, an XSLT transformation can convert notes from the old version to the new one.

Required features

  • Hierarchical textual content
  • ASCIIMathML to convert inline LaTeX mathematics to MathML, which can be rendered in documents served with the MIME type application/xhtml+xml
  • Automatic hyphenation to increase readability using Hyphenator.js
  • References to external documents using URLs
  • Tables and lists (using HTML syntax)
  • Syntax highlighting for common languages, provided by Gorbatchev’s SyntaxHighlighter

Optional extras include

  • Rich graphics from external files (unfortunately inline graphics add too much complexity to the document)
  • Internal references, such as "see also" links, to connect related topics
  • A section for the syllabus and links to where each topic is explained (contents)
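A contents listing of this kind falls out of the hierarchy almost for free. The following is only a minimal sketch in Python's standard library, not the XSL used for the notes. It assumes that sections nest directly inside one another and that each carries a title child element; the outline function and every title after the first are made up for illustration.

```python
import xml.etree.ElementTree as ET


def outline(section, depth=1):
    """Yield (depth, title) for a <section> and its nested sections."""
    # Assumption: each section has at most one <title> child.
    yield depth, section.findtext("title", default="(untitled)")
    for child in section.findall("section"):
        yield from outline(child, depth + 1)


# Hypothetical sample notes; only the element names come from the format.
notes = ET.fromstring(
    "<section><title>Natural language processing</title>"
    "<section><title>Dialogue systems</title></section>"
    "<section><title>Parsing</title>"
    "<section><title>Chart parsers</title></section>"
    "</section>"
    "</section>"
)

# Print an indented table of contents, one title per line.
for depth, title in outline(notes):
    print("  " * (depth - 1) + title)
```

Because every level uses the same element name, the depth comes from the nesting alone and no per-level rules are needed.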

Changes from the last version

The first version was based on a hierarchy of topic, section and subsection, with titles as attributes. A little research into existing XML formats such as DocBook would have settled many of the design decisions. Why reinvent when teams of intelligent people have already come up with a better solution? Using different element names for each level of the hierarchy was a mistake: it hampered refactoring, because moving content to a different level meant renaming its elements.

In the first version, elements such as seealso, result and source signified external references. These will be unified into a single resource element with multiple uses. Elsewhere there will be an emphasis on using character data (element content) rather than attribute values to represent information; title, for example, will no longer be an attribute.
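The mapping from the first version to the new one is mechanical, which is what makes an automatic conversion feasible. The sketch below illustrates that mapping in Python's standard library rather than XSLT, so it is a stand-in for the real stylesheet and not a copy of it. The element names come from the description above; the sample input, the migrate function and the kind attribute that records the old element name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names taken from the text; everything else about the old
# format (such as which attributes references carried) is an assumption.
LEVELS = {"topic", "section", "subsection"}
REFERENCES = {"seealso", "result", "source"}


def migrate(elem):
    """Rewrite an old-format element tree in place and return it."""
    # Recurse first so children are converted before the parent changes.
    for child in list(elem):
        migrate(child)

    if elem.tag in LEVELS:
        # Every level of the hierarchy becomes a plain <section>.
        elem.tag = "section"
        title_text = elem.attrib.pop("title", None)
        if title_text is not None:
            # The title attribute becomes a leading <title> child element.
            title = ET.Element("title")
            title.text = title_text
            # Keep any text that preceded the first child after the title.
            title.tail = elem.text
            elem.text = None
            elem.insert(0, title)
    elif elem.tag in REFERENCES:
        # Hypothetical "kind" attribute remembers the old element name.
        elem.set("kind", elem.tag)
        elem.tag = "resource"
    return elem


# Hypothetical old-format input.
old = ET.fromstring(
    '<topic title="NLP">'
    '<section title="Dialogue systems">'
    '<p>See <seealso url="http://example.org/">this page</seealso>.</p>'
    '</section>'
    '</topic>'
)
new = migrate(old)
print(ET.tostring(new, encoding="unicode"))
```

An XSLT version would express the same three rules as templates: one matching the level elements, one matching the reference elements, and an identity template for everything else.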

Colloquial restatements should be used to repeat key sections, explaining the underlying concepts in plainer language. These can be set in a handwriting font and rendered in collapsible regions in the transformed XHTML.

Definitions must fit into the document prose while being structured semantically, so that they can be collected into a summary section of definitions. A niche feature this would support is exporting question/answer pairs to SuperMemo, using a supermemo.xsl transformation that applies only to definitions.
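To show why the semantic markup matters, here is a rough sketch of pulling definitions out of the prose as question/answer pairs. It is written in Python's standard library and is not the supermemo.xsl mentioned above. The defn, for and is element names come from the example section; the helper names, the shortened sample text and the plain "Q:/A:" output layout are assumptions.

```python
import xml.etree.ElementTree as ET


def text_of(elem):
    """Collapse an element's text, including nested markup, to one line."""
    return " ".join("".join(elem.itertext()).split())


def definition_pairs(root):
    """Yield (term, meaning) for every <defn> with <for> and <is> children."""
    for defn in root.iter("defn"):
        term = defn.find("for")
        meaning = defn.find("is")
        if term is not None and meaning is not None:
            yield text_of(term), text_of(meaning)


# Shortened version of the example section, for illustration only.
notes = ET.fromstring(
    "<section>"
    "<title>Natural language interfaces and dialogue systems</title>"
    "<p><defn><for>Natural Language User Interfaces</for> (LUI) are "
    "<is>a type of computer human interface where linguistic\n"
    "phenomena act as UI controls</is>.</defn></p>"
    "</section>"
)

for term, meaning in definition_pairs(notes):
    # Plain "Q:/A:" lines are an assumed import layout, not a confirmed one.
    print(f"Q: {term}")
    print(f"A: {meaning}")
    print()
```

The surrounding words ("(LUI) are", the full stop) stay in the prose but are ignored by the export, which is the point of marking up the term and its meaning separately.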