[XAR2] New implementation of Blocklayout compiler



Jason wrote:
> The available entities (the ones people will use) should be defined in the 
> DTD for XHTML. Could we just pull those in?
> 
We can, but that would be a bit wasteful (there are LOTS of them for 
XHTML, for example). We have to see how this one develops. To get going 
at least, I've been using numeric entities (I think two in all of core): 
&nbsp; (&#160;) is the 99% one, the other I already forgot.

> If someone used &pound; in a template say, and the output page was XHTML 
> UTF-8, then what would (or should) appear in the output page? Would it be 
> '£' or &pound;? My guess would be the former.
> 
Oh my, there are not many people who can answer this correctly and 
precisely. There is no single correct answer either; there are so many 
factors involved.

Let me say what I know, probably with lots of gaps.

If the input XML contains &pound;, assume it is defined, either 
explicitly (in the XSL later on) or by a reference to a DOCTYPE 
somewhere. Let's say it is defined as &#1234; (meaning Unicode 
character 1234). That's it, input-wise. XSLT doesn't need to know more 
(not even that it is actually a pound sign). For the input, for example, 
&#x20AC;, &euro; and € are *exactly* the same. (Let's hope that all 
comes through ok :-) )
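
A minimal sketch of that point, in Python with the standard library's 
xml.etree.ElementTree (chosen only for illustration; it is not what the 
compiler uses). The tiny document and its internal DTD subset defining 
&euro; are made up for the example. After parsing, the numeric 
reference, the named entity and the literal character are 
indistinguishable:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the internal DTD subset defines &euro; as U+20AC.
doc = (
    '<!DOCTYPE r [ <!ENTITY euro "&#x20AC;"> ]>'
    "<r><a>&#x20AC;</a><b>&euro;</b><c>\u20ac</c></r>"
)

root = ET.fromstring(doc)
texts = [child.text for child in root]

# Print the code point each spelling arrived as; the parser has
# already resolved the references, so no entity names survive.
print([hex(ord(t)) for t in texts])
print(texts[0] == texts[1] == texts[2])
```

Without the ENTITY declaration the parser would reject &euro; as 
undefined, which is the "assuming it is defined" part.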

XSLT does its work based on its XSL transformation and gets character 
1234 in. What it does with that from now on depends on a 
number of things and the tools used. One thing it needs at least is the 
definition of &pound;, either delivered by the doctype of the input or 
defined in the transformation itself.

The transform as such just puts character number 1234 in the result 
document, done. (assuming the template matches go through and all that)

The last bit of the transform is (most likely) serialising the result 
document into a stream of bytes, like an XML document in some 
encoding. The output document may or may not contain a doctype, over 
which you have some control in XSL. This does not change what bytes 
are put into the output; the encoding decides that.
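
A rough illustration of "the encoding decides the bytes", again in 
Python's xml.etree.ElementTree rather than an XSLT processor (so take 
it as a sketch of the idea, not of any particular XSLT tool). It uses 
the real code point of the pound sign, U+00A3, instead of the made-up 
1234 above; the element name "p" is arbitrary:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("p")
elem.text = "\u00a3"  # U+00A3 POUND SIGN

# Same result tree, three different serialisations.
utf8 = ET.tostring(elem, encoding="utf-8")
latin1 = ET.tostring(elem, encoding="iso-8859-1")
ascii_ = ET.tostring(elem, encoding="us-ascii")

for name, data in (("utf-8", utf8), ("iso-8859-1", latin1), ("us-ascii", ascii_)):
    print(name, data)

# Parsing any of them back yields the same single character.
roundtrip = [ET.fromstring(data).text for data in (utf8, latin1, ascii_)]
print(roundtrip)
```

In this sketch UTF-8 writes the character as two bytes, ISO-8859-1 as 
one, and plain ASCII cannot hold it at all, so the serialiser falls 
back to a numeric character reference. None of that depends on a 
doctype.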

That doc in turn is then sent off to its destination (say, a browser) 
which renders the bytes into something we can look at. We all know the 
last bit is very different for different browsers even if given the 
exact same set of bytes.

Now the fun begins:
- if character 1234 isn't available in the character table the browser 
uses -> boom!
- if the character is available, but your font has no glyph for it -> boom!
- if the encoding of the XML document is out of reach of your browser 
--> boom!
- if character 1234 is not in the doctype or there is no doctype, it 
can't replace &#1234; with anything possibly more comfortable.
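
One common flavour of the encoding "boom" can be sketched in a few 
lines of Python. The assumption here is a receiver that wrongly reads 
UTF-8 bytes as ISO-8859-1; real browsers have more elaborate detection, 
so this only shows the mechanism:

```python
pound = "\u00a3"  # U+00A3 POUND SIGN

raw = pound.encode("utf-8")        # the bytes that go over the wire
wrong = raw.decode("iso-8859-1")   # receiver guesses the wrong encoding
right = raw.decode("utf-8")        # receiver uses the declared encoding

print(raw)    # two bytes for one character
print(wrong)  # two characters: a stray accented capital A, then the pound sign
print(right == pound)
```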

Let's say all of the above is taken care of. Then what is actually 
displayed when you look at a textual representation of the document 
(which in itself is a new transformation) depends again on your tool.

- it can be &#1234; (browser leaves it alone)
- it can be &pound; (browser looked it up in the document's doctype)
- it can be &blah; (browser looked it up, result doctype defined it 
differently)

The best way to cope with entities is not to pay attention to them for 
too _long_, in my experience. They don't exist; everything is a character 
number and entities are just labels "at that given time". Bit of a 
"Schrödinger's cat" situation. :-) As soon as you look, it's something 
else :-)

In practice, I think most browsers would show you &pound; or &#1234; 
in their source, more likely if the doctype is one of the W3C doctypes 
(forgot the link).

I've been up to my neck in XML for a lot of years and I still often get 
it wrong. :-)

Hope this helps.

marcel
_______________________________________________
Xaraya_devel mailing list
Xaraya_devel@xaraya.com
http://xaraya.com/mailman/listinfo/xaraya_devel