“A reader from India informs us about the following incident:”
Recent issues with Malayalam language encoding in Unicode are getting worse. Unicode's decisions are not at all open, and meeting minutes are not published on time (see http://www.unicode.org/consortium/utc-minutes). Recently a Microsoft representative argued for a change in the encoding of a letter sequence in Malayalam to hide a bug in their Kartika font, and the UTC approved it.
This broke Unicode's backward-compatibility policy, violated the linguistic rules of the language, and is completely contrary to what is taught in schools.
More details are available in Praveen’s blog post linked [below].
[The] Blog post questions the “open standard” status of Unicode
Microsoft accused of harming Unicode, causing problems for the Indian population
There is also a lot of muddying of the waters and misconception about what Unicode and its encoding forms (UTF-8, UTF-16, and UTF-32) actually are.
For example, some coders think that UTF-8 always produces smaller files. In fact, this is only really true for mostly-ASCII text. For code points from U+0800 up to U+FFFF, the file will be larger in UTF-8 (3 bytes per character) than in UTF-16 (2 bytes per character).
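To make the size comparison concrete, here is a quick Python sketch (the sample strings are just illustrative):

```python
# Byte lengths of the same text under different Unicode encodings.
samples = {
    "ASCII": "hello",       # U+0000..U+007F: 1 byte each in UTF-8, 2 in UTF-16
    "Greek": "αβγδε",       # U+0080..U+07FF: 2 bytes each in both encodings
    "Malayalam": "മലയാളം",  # U+0800..U+FFFF: 3 bytes in UTF-8, 2 in UTF-16
}
for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))  # LE variant, so no BOM is counted
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For the Malayalam sample (6 code points), UTF-8 takes 18 bytes against UTF-16's 12, while for the ASCII sample the situation reverses: 5 bytes against 10.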
Also, ‘Unicode’ overlaps with the ISO 10646 standard, which defines the Universal Character Set (UCS); ISO 10646 originally defined three implementation levels, and Unicode corresponds to Level 3. M$ in the early days, however, implied that the ‘Unicode’ encoding was UCS-2, which made no provision for characters outside the BMP (U+0000 to U+FFFF).

When it became clear that more than 64K characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters.

On POSIX systems (Linux and Unix), using UCS-2 (or UCS-4) would lead to very severe problems: strings in these encodings can contain, as parts of many wide characters, bytes like “\0” or “/” which have a special meaning in filenames and other C library function parameters. In addition, the majority of Unix tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc. This is where many of the early compatibility issues between Windows and Linux stemmed from.

UTF-8 was introduced to provide an ASCII-backwards-compatible multi-byte encoding. The definitions of UTF-8 in UCS and in Unicode differed slightly: in UCS, up to 6-byte-long UTF-8 sequences were possible, representing characters up to U-7FFFFFFF, while in Unicode only UTF-8 sequences up to 4 bytes long are defined, representing characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.)
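Both points above can be demonstrated in a few lines of Python (U+1D11E, MUSICAL SYMBOL G CLEF, is just a convenient non-BMP example):

```python
# A character outside the BMP becomes a surrogate pair in UTF-16:
# U+1D11E -> high surrogate U+D834, low surrogate U+DD1E.
clef = "\U0001D11E"
assert clef.encode("utf-16-be") == b"\xd8\x34\xdd\x1e"  # two 16-bit units

# Why UCS-2/UTF-16 breaks byte-oriented Unix APIs: the 16-bit units of an
# ordinary ASCII string contain NUL bytes, which C strings treat as
# terminators.
path = "/etc".encode("utf-16-be")
assert b"\x00" in path  # an embedded NUL inside every ASCII character

# The same text in UTF-8 is byte-for-byte identical to ASCII, with no
# embedded NULs, which is why it is safe in filenames and C strings.
assert "/etc".encode("utf-8") == b"/etc"
```

The surrogate arithmetic: 0x1D11E − 0x10000 = 0xD11E; the top 10 bits (0x034) are added to 0xD800 and the bottom 10 bits (0x11E) to 0xDC00.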
Unicode was put together by an international committee. Everyone came out of the process unhappy, which I take as a sign that it was properly done: e.g., a lot of Chinese and Japanese ideograms look alike and were unified under the same code points (Han unification), even though deep down underneath they’re not always really the same character, nor do they necessarily collate the same way… and neither of those nations was happy about being made to share character codes with the other. Ultimately, politics and practical needs were both satisfied, apparently to no one’s complete satisfaction. As I said… GOOD.
I’ve worked on only two international web sites, and one of those almost doesn’t count, because it’s an English/Spanish site… you can construct one of those using what you can find in most intro HTML books. (It did, however, have to function completely in either language, down to the last detail… hey, it was a lawyer’s site. 🙂 ) But the foresight of code page designers and browser implementers in making possible a much more diverse variety of language options still impresses me.
I blame M$ for 99 percent of what’s wrong in our industry today. But Unicode is not one of their mistakes, and they did not manage to co-opt and spoil it.
When I became project manager for my first major engineering project after leaving the MI world, my priority was management and control of the project from design to production. I did a lot of research and decided to use Apollo workstations for many reasons, including that they had the only real SCCS/RCS system at the time, called DSEE (popularly known as ‘dizzy’). It was brilliant and went on to become ClearCase (still available from IBM today). No programmer could create or modify any project code except through dizzy, and it tracked everything. We used it to create code blocks (modules today) that could be merged with other blocks and cut down on long-term development cycles. It was also used to control all documentation, including checklists etc. I sent all the staff on training (including myself, of course) for all the tools and systems used. We encountered very few problems because we made change/project management the core of the project from day one, not an afterthought.

After HP took over Apollo, we moved to another SCCS/RCS system that looked very promising, from a US company whose name I forget! It was a long name… Their logo was a Napoleon-era French soldier… Drat! I can’t remember the name. They changed their name in the ’90s to TRUE Software, Inc. and were eventually taken over by McCabe, whom I became a distributor for in the ’90s.
We used to call it “pencil whipping”, checking off things that you were supposed to have done, but didn’t actually do because “they are just file fillers to make the suits happy and not important”.
It’s simple, Steve – after years of being lucky, the luck ran out. One of the crew of the supply ship said that the rig crew had been complaining about this well for months, that it wasn’t cooperating and was constantly causing problems. Based on what I’ve been told about the geology of the Gulf off our beaches, I get the feeling that there’s a hell of a lot more gas than oil in that site, and they weren’t ready for a gas well.
OT: Yes, the Я is the soft vowel pronounced “ya” in Russian. It is also the pronoun “I”, and the last letter in the alphabet. The Russians say “ah” to “ya” instead of A to Z. Of course I like ♥, ‽, ∝, and all the other little bits available.
Aside to Bryan: thanks for the lesson in Unicode. My “need” (heh) to display a backwards R in another of your threads yesterday led to my discovery of the ‘ya’ (is that right??) in the Russian code block, which led me to spend a pleasant couple of hours exploring all the code blocks, and which characters are displayed by which browsers in which fonts. I had no idea that FF and IE are so powerful in displaying a wide variety of languages. Sometimes the most basic things escape one’s attention until one’s nose is thrust in them. 😆
“Field modifications”, “jury rigging”, “temporary fixes”, and the list goes on.
Yeah, there’s no point in trying to explain that you can’t power a 12-volt device from a 5-volt power supply, or that the ratings on the parts used do make a difference.
Now they get into the blame game over who made the changes that made the BOP worthless for its intended purpose.
The World is full of clueless naive or ignorant fools. It’s why it’s such a toilet.