“A reader from India informs us about the following incident:”
Recent issues with Malayalam language encoding in Unicode are getting worse. Unicode's decisions are not at all open, and meeting minutes are not published on time (see http://www.unicode.org/consortium/utc-minutes). Recently a Microsoft representative argued for a change in the encoding of a letter sequence in Malayalam to hide a bug in their Kartika font, and the UTC approved it.
This broke Unicode's backward-compatibility policy, violated the linguistic rules of the language, and is completely contrary to what is taught in schools.
More details are available in Praveen’s blog post linked [below].
[The] Blog post questions the “open standard” status of Unicode
Microsoft accused of harming Unicode, causing problems for the Indian population
There is also a lot of muddying of the waters and misconception about what Unicode and its encoding forms (UTF-8, UTF-16, and UTF-32) actually are.
For example, some coders think that UTF-8 always produces smaller files. In fact, this is only really true for mostly-ASCII text. For code points from U+0800 up to U+FFFF, the file will be larger in UTF-8 (3 bytes per character) than in UTF-16 (2 bytes per character).
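To make the size comparison concrete, here is a quick Python sketch (the sample strings are just illustrative):

```python
# Byte lengths of the same text under different Unicode encodings.
samples = {
    "ASCII": "hello",       # U+0000..U+007F: 1 byte each in UTF-8, 2 in UTF-16
    "Greek": "αβγδε",       # U+0080..U+07FF: 2 bytes each in both encodings
    "Malayalam": "മലയാളം",  # U+0800..U+FFFF: 3 bytes in UTF-8, 2 in UTF-16
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # LE variant, so no BOM is counted
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For the Malayalam sample (6 code points), UTF-8 takes 18 bytes against UTF-16's 12, while for the ASCII sample the situation reverses: 5 bytes against 10.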
Also, ‘Unicode’ overlaps with the ISO 10646 standard, which defines the Universal Character Set (UCS); ISO 10646 originally defined three implementation levels, and Unicode corresponds to Level 3. M$ in the early days, however, implied that the ‘Unicode’ encoding was UCS-2, which made no provision for characters outside the BMP (U+0000 to U+FFFF).

When it became clear that more than 64K characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters.

On POSIX systems (Linux and Unix), using UCS-2 (or UCS-4) would lead to very severe problems: strings in these encodings can contain, as parts of many wide characters, bytes like “\0” or “/” which have a special meaning in filenames and other C library function parameters. In addition, the majority of Unix tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc. This is where many of the early compatibility issues between Windows and Linux stemmed from.

UTF-8 was introduced to provide an ASCII-backwards-compatible multi-byte encoding. The definitions of UTF-8 in UCS and in Unicode differed slightly: in UCS, up to 6-byte-long UTF-8 sequences were possible, representing characters up to U-7FFFFFFF, while in Unicode only UTF-8 sequences up to 4 bytes long are defined, representing characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.)
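Both points above can be demonstrated in a few lines of Python (U+1D11E, MUSICAL SYMBOL G CLEF, is just a convenient non-BMP example):

```python
# A character outside the BMP becomes a surrogate pair in UTF-16:
# U+1D11E -> high surrogate U+D834, low surrogate U+DD1E.
clef = "\U0001D11E"
assert clef.encode("utf-16-be") == b"\xd8\x34\xdd\x1e"  # two 16-bit units

# Why UCS-2/UTF-16 breaks byte-oriented Unix APIs: the 16-bit units of an
# ordinary ASCII string contain NUL bytes, which C strings treat as
# terminators.
path = "/etc".encode("utf-16-be")
assert b"\x00" in path  # an embedded NUL inside every ASCII character

# The same text in UTF-8 is byte-for-byte identical to ASCII, with no
# embedded NULs, which is why it is safe in filenames and C strings.
assert "/etc".encode("utf-8") == b"/etc"
```

The surrogate arithmetic: 0x1D11E − 0x10000 = 0xD11E; the top 10 bits (0x034) are added to 0xD800 and the bottom 10 bits (0x11E) to 0xDC00.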
Unicode was put together by an international committee. Everyone came out of the process unhappy, which I take as a sign that it was properly done: e.g., a lot of Chinese and Japanese ideograms look alike and were unified under the same code points (Han unification), even though deep down underneath they’re not always really the same character, nor do they necessarily collate the same way… and neither of those nations was happy about being made to share character codes with the other. Ultimately, politics and practical needs were both satisfied, apparently to no one’s complete satisfaction. As I said… GOOD.
I’ve worked on only two international web sites, and one of those almost doesn’t count, because it’s an English/Spanish site… you can construct one of those using what you can find in most intro HTML books. (It did, however, have to function completely in either language, down to the last detail… hey, it was a lawyer’s site. 🙂 ) But the foresight of code page designers and browser implementers in making possible a much more diverse variety of language options still impresses me.
I blame M$ for 99 percent of what’s wrong in our industry today. But Unicode is not one of their mistakes, and they did not manage to co-opt and spoil it.
When I became project manager for my first major engineering project after leaving the MI world, my priority was management and control of the project from design to production. I did a lot of research and decided to use Apollo workstations for many reasons, including that they had the only real SCCS/RCS system at the time, called DSEE (popularly known as ‘dizzy’). It was brilliant and went on to become ClearCase (still available from IBM today). No programmer could create or modify any project code except through dizzy, and it tracked everything. We used it to create code blocks (modules today) that could be merged with other blocks and cut down on long-term development cycles. It was also used to control all documentation, including checklists etc. I sent all the staff on training (including myself, of course) for all the tools and systems used. We encountered very few problems because we made change/project management the core of the project from day one, not an afterthought.

After HP took over Apollo, we moved to another SCCS/RCS system that looked very promising, from a US company whose name I forget! It was a long name… Their logo was a Napoleon-era French soldier… Drat! I can’t remember the name. They changed their name in the ’90s to TRUE Software, Inc. and were eventually taken over by McCabe, whom I became a distributor for in the ’90s.
We used to call it “pencil whipping”, checking off things that you were supposed to have done, but didn’t actually do because “they are just file fillers to make the suits happy and not important”.
It’s simple, Steve – after years of being lucky, the luck ran out. One of the crew of the supply ship said that the rig crew had been complaining about this well for months, that it wasn’t cooperating and was constantly causing problems. Based on what I’ve been told about the geology of the Gulf off our beaches, I get the feeling that there’s a hell of a lot more gas than oil in that site, and they weren’t ready for a gas well.
OT: Yes, the Я is the soft vowel pronounced “ya” in Russian. It is also the pronoun “I”, and the last letter in the alphabet. The Russians say “ah” to “ya” instead of A to Z. Of course I like ♥, ‽, ∝, and all the other little bits available.
Aside to Bryan: thanks for the lesson in Unicode. My “need” (heh) to display a backwards R in another of your threads yesterday led to my discovery of the ‘ya’ (is that right??) in the Russian code block, which led me to spend a pleasant couple of hours exploring all the code blocks, and which characters are displayed by which browsers in which fonts. I had no idea that FF and IE are so powerful in displaying a wide variety of languages. Sometimes the most basic things escape one’s attention until one’s nose is thrust in them. 😆
“Field modifications”, “jury rigging”, “temporary fixes”, and the list goes on.
Yeah, there’s no point in trying to explain that you can’t power a 12-volt device from a 5-volt power supply, or that the ratings on the parts used do make a difference.
Now they get into the blame game over who made the changes that made the BOP worthless for its intended purpose.
The World is full of clueless naive or ignorant fools. It’s why it’s such a toilet.