[XAR2] New implementation of Blocklayout compiler
- From: Marcel van der Boom <marcel (at) hsdev.com>
- Date: Fri, 03 Nov 2006 20:01:33 +0100
Jason wrote:
> The available entities (the ones people will use) should be defined in the
> DTD for XHTML. Could we just pull those in?
>
We can, but that would be a bit wasteful (there are LOTS of them for
XHTML, for example). We have to see how this one develops. To get going
at least, I've been using numeric entities (I think two in all of core):
&#160; being the 99% one; the other I already forgot.
> If someone used &pound; in a template say, and the output page was XHTML
> UTF-8, then what would (or should) appear in the output page? Would it be
> '£' or &pound;? My guess would be the former.
>
Oh my, there are not many people who can answer this correctly and
precisely, and there is no single correct answer either; there are so
many factors involved.
Let me say what I know, lots of gaps probably.
If the input XML contains &pound;, assume it is defined, either
explicitly (in the XSL later on) or by a reference to a DOCTYPE
somewhere. Let's say it is defined as &#1234; (meaning Unicode
character 1234); that's it, input-wise. XSLT doesn't need to know more
(not even that it is actually a pound sign). For the input, for example,
&euro;, &#8364; and a literal € are *exactly* the same. (Let's hope that
all comes through OK :-) )
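
To make the "exactly the same on input" point concrete, here is a minimal sketch in Python (not XSLT, and not what Xaraya itself does). It assumes the standard xml.etree.ElementTree parser, a made-up <t> element, and an internal DOCTYPE subset that defines &pound; as the real pound code point 163 rather than the illustrative 1234 used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the internal DTD subset defines &pound; ourselves,
# the way a DOCTYPE (or the stylesheet) would have to.
DOC = '<!DOCTYPE t [<!ENTITY pound "&#163;">]><t>{}</t>'


def parsed_text(content):
    """Return the character data the parser hands to the application."""
    return ET.fromstring(DOC.format(content)).text


named = parsed_text("&pound;")        # entity defined in the DOCTYPE
decimal = parsed_text("&#163;")       # decimal character reference
hexadecimal = parsed_text("&#xA3;")   # hexadecimal character reference
literal = parsed_text("\u00a3")       # the character itself

# All four spellings arrive as the same single character number.
print(named == decimal == hexadecimal == literal)
print([ord(c) for c in named])
```

After parsing, nothing remembers which spelling was used; the application only ever sees the character number.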
XSLT does its work based on its XSL transformation and gets character
1234 in. What it does with that from now on, becomes dependent on a
number of things and the tools used. One thing it needs at least is the
definition of &pound;, either delivered by the doctype of the input or
defined in the transformation itself.
The transform as such just puts character number 1234 in the result
document, done. (assuming the template matches go through and all that)
The last bit of the transform is (most likely) serialising the result
document into a stream of bytes, like an XML document in some
encoding. The output document may or may not contain a doctype, over
which you have some control in XSL. This does not change what bytes
are put into the output; the encoding decides that.
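
A rough illustration of "the encoding decides the bytes", again as a Python sketch rather than an XSLT run. The <price> element is invented for the example, and ElementTree's serialiser stands in for whatever serialiser the XSLT processor uses.

```python
import xml.etree.ElementTree as ET

# Result tree holding one abstract character, U+00A3, plus plain text.
root = ET.Element("price")
root.text = "\u00a3 5"

# Same tree, three output encodings, three different byte streams.
as_utf8 = ET.tostring(root, encoding="utf-8")         # two bytes: C2 A3
as_latin1 = ET.tostring(root, encoding="iso-8859-1")  # one byte: A3
as_ascii = ET.tostring(root, encoding="us-ascii")     # falls back to &#163;

for label, data in [("utf-8", as_utf8),
                    ("iso-8859-1", as_latin1),
                    ("us-ascii", as_ascii)]:
    print(label, data)
```

When the target encoding cannot represent the character, this serialiser writes a numeric character reference instead; the tree itself is identical in all three cases.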
That doc in turn is then sent off to its destination (say, a browser)
which renders the bytes into something we can look at. We all know the
last bit is very different for different browsers even if given the
exact same set of bytes.
Now the fun begins:
- if character 1234 isn't available in the character table the browser
uses -> boom!
- if the character is available, but your font has no glyph for it -> boom!
- if the encoding of the XML document is one your browser can't handle
--> boom!
- if character 1234 is not in the doctype, or there is no doctype, it
can't replace &#1234; with anything possibly more comfortable.
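
Two of these failure modes can be imitated outside a browser. The sketch below is only an analogy in Python: the helper name `encodable` is made up, and decoding with the wrong codec stands in for a browser that guesses or mishandles the document encoding.

```python
def encodable(ch, encoding):
    """True if the character exists in the given character table."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Character 1234 (U+04D2) is not in ISO-8859-1, the pound sign is.
print(encodable("\u04d2", "iso-8859-1"))
print(encodable("\u00a3", "iso-8859-1"))
print(encodable("\u04d2", "utf-8"))

# Bytes written as UTF-8 but read as ISO-8859-1: the two bytes of the
# pound sign come out as two separate, wrong characters.
pound_bytes = "\u00a3".encode("utf-8")
misread = pound_bytes.decode("iso-8859-1")
print(repr(misread))
```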
Let's say all of the above is taken care of; then what is actually
displayed when you look at a textual representation of the document
(which in itself is a new transformation) depends again on your tool.
- it can be &#1234; (browser leaves it alone)
- it can be &pound; (browser looked it up in the document's doctype)
- it can be &blah; (browser looked it up, result doctype defined it
differently)
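
The "looked it up" versus "leaves it alone" choice can be sketched as a name lookup. This assumes Python's html.entities table (the HTML 4 entity names) as a stand-in for whatever doctype the tool consults; `textual_form` is a hypothetical helper, not a browser API.

```python
from html.entities import codepoint2name


def textual_form(ch):
    """Show a character as a named entity if the table has one,
    otherwise as a numeric character reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"


print(textual_form("\u00a3"))  # pound sign: the table has a name for it
print(textual_form("\u04d2"))  # character 1234: no name, stays numeric
print(textual_form("\u00a0"))  # non-breaking space
```

A different table (a different doctype) would give different labels for the same character numbers, which is the &blah; case above.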
The best way to cope with entities is not to pay attention to them for
too _long_, in my experience. They don't exist; everything is a character
number and entities are just labels "at that given time". Bit of a
"Schrödinger's cat" situation. :-) As soon as you look, it's something
else :-)
In practice, I think most browsers would show you &pound; or &#1234;
in their source, more likely if the doctype is one of the W3C doctypes
(forgot the link).
I've been up to my neck in XML for a lot of years and I often get it
wrong, still. :-)
hope this helps.
marcel