việtual: 2006

Abstract

This paper outlines a roadmap for encoding the Tay script of Việt Nam in Unicode (and ISO 10646). The roadmap consists of three phases that lead from the current state of multiple encodings to a universally accepted and widely implemented standard. It supports the eventual goals of preserving the written Tay culture in Việt Nam and opening it to the world of online information.

Introduction

The Workshop on Encoding and Digitizing the Thái Việt Nam Script, held on 3 November 2006 in Điện Biên Phủ, Việt Nam, was attended by 70 participants of diverse backgrounds and disciplines (mostly from the surrounding provinces, with a handful from abroad). It concluded with a consensus on encoding the Thái script in Việt Nam, using the Thái Sơn La repertoire and glyph set as the underlying basis, with “Tay” or “Thái Việt Nam” as the script name. (In this paper, the word “Tay” is used to designate “Thái Việt Nam”.) The strong desire expressed at the Workshop is to encode the Tay script in Unicode (and ISO 10646) to allow interchange of Tay throughout the world.

The current state of the Tay script is characterized by multiple encodings which are incompatible with each other, making it impossible for Tay users to conveniently exchange data and information, or to make it available in the online world. The goal of getting Tay into Unicode (and ISO 10646) would, on the other hand, preserve the Tay culture and open the world of information to the introduction of Tay data and information. With respect to Việt Nam in particular, the Tay script would be the last script group to have a character encoding standard, after the Latin, Khmer, Nôm, and Chàm scripts.

The roadmap proposed in this paper includes three phases:

Phase 1 – definition: to integrate the latest consensus on the script into the current proposal to the Unicode Technical Committee (UTC).
Phase 2 – standardization: to work toward the approval of the proposal by Unicode with suitable technical improvements, and an affirmative vote by ISO (International Organization for Standardization).
Phase 3 – implementation: to achieve the inclusion of the Tay character encoding in the major operating systems,

which will lead to the Tay script becoming available for digitization, data interchange, and information dissemination in the multilingual environment.

Phase 1: Definition

An initial proposal was presented to the Unicode Technical Committee (UTC) in February 2006; the UTC provided feedback regarding details of the proposal. The November 2006 Workshop in Điện Biên Phủ reached consensus on remaining technical issues regarding the Tay script.

The Phase 1 deliverable will be:

a revised proposal for Tay in Unicode, that integrates input from the UTC and from the November 2006 Workshop.

The revised proposal will include the following contents:

repertoire, to be based on the Thái Sơn La font set,
additional characters to meet the needs and usage from other Tay communities,
a defined storage order, namely the visual order,
a defined character collation order,
an encoding table, and
a complete set of character names

all of the above being jointly determined by users inside and outside Việt Nam, with assistance from specialists as appropriate.
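
The choice of storage order can be illustrated with a minimal sketch. It assumes that, as in related scripts such as Lao, some vowel signs are written to the left of the consonant they are pronounced after. In visual order such a vowel is stored before its consonant; in logical (phonetic) order it is stored after. The code points below are hypothetical placeholders in the Private Use Area, not proposed Tay assignments, and the function name is only illustrative.

```python
# Hypothetical placeholder code points (Private Use Area); the real Tay
# assignments are not defined in this paper.
CONSONANT_K = "\uE000"  # stand-in for a consonant
VOWEL_PRE = "\uE040"    # stand-in for a vowel written to the LEFT of its consonant
VOWEL_POST = "\uE041"   # stand-in for a vowel written to the right

PREPOSED_VOWELS = {VOWEL_PRE}


def visual_to_logical(text: str) -> str:
    """Convert visual order to logical order by moving each left-written
    vowel after the character that immediately follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PREPOSED_VOWELS and i + 1 < len(text):
            out.append(text[i + 1])  # consonant first (spoken order)
            out.append(ch)           # then the vowel
            i += 2
        else:
            out.append(ch)           # everything else keeps its position
            i += 1
    return "".join(out)


visual = VOWEL_PRE + CONSONANT_K      # as written on the page
logical = visual_to_logical(visual)   # as pronounced
```

With visual order, text is stored exactly as typed and displayed, so fonts need no reordering logic; with logical order, searching and sorting by pronunciation become simpler. The sketch only shows the difference between the two and does not recommend either.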

This phase is expected to be of shortest duration, and can be completed as soon as the necessary input is received from all the parties concerned, and proper inspection of the revised proposal can be completed.

In this phase, users would continue using the existing Tay fonts and input methods, even though their underlying character encoding may only be 8-bit.

Phase 2: Standardization

With the revised proposal completed in Phase 1, Phase 2 would first consist of working with the UTC on the technical details of the Tay script, and then with the ISO bodies in order to achieve an affirmative vote.

The Phase 2 goal is:

a Tay encoding in the BMP (Basic Multilingual Plane).

The Phase 2 deliverables will be:

the UTC’s approval of the revised proposal, and
an affirmative vote from ISO (International Organization for Standardization) on Tay in Unicode.

The duration of this phase is determined by the meeting schedules of the UTC and of the responsible ISO working group, ISO/IEC JTC1/SC2/WG2. Linguistic and technical specialists can participate with the UTC in presenting and discussing the revised proposal. Since ISO’s membership consists of national standards bodies, Việt Nam’s standards body, TCVN/STAMEC, will need to participate in the vote on whether to approve the revised proposal.

In this phase, users will continue to use the Tay fonts and input methods as in Phase 1. However, specialists may choose to re-encode the 8-bit Tay repertoire into 16-bit code points in the Private Use Area (PUA), following the SIL PUA, to provide a framework in which fonts and tools for Tay can be developed and tested. These intermediate 16-bit fonts and tools should not be distributed to ordinary users, in order to minimize the complications of multiple code conversions.
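
As a rough illustration of such a re-encoding, the sketch below moves the upper half of an 8-bit font encoding into the Private Use Area of the BMP (U+E000 to U+F8FF). The byte range, the base code point, and the function name are assumptions made for the example; they do not reflect the actual SIL PUA assignments or any existing Tay font.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area

# Hypothetical: assume the legacy font keeps its Tay glyphs in bytes
# 0x80-0xFF and that they move, in the same order, to a block at PUA_BASE.
LEGACY_FIRST = 0x80
PUA_BASE = 0xE000


def legacy_to_pua(data: bytes) -> str:
    """Re-encode 8-bit legacy text as a string of BMP code points."""
    chars = []
    for b in data:
        if b < LEGACY_FIRST:
            chars.append(chr(b))  # ASCII passes through unchanged
        else:
            cp = PUA_BASE + (b - LEGACY_FIRST)
            if not PUA_START <= cp <= PUA_END:
                raise ValueError(f"code point U+{cp:04X} is outside the PUA")
            chars.append(chr(cp))
    return "".join(chars)


sample = legacy_to_pua(b"A\x80\xff")
```

A fixed offset is the simplest possible scheme; a real re-encoding would more likely use an explicit table so that glyph variants and unused slots can be handled individually.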

Some technical activities in this phase include:

implement fonts in the SIL PUA,
define and implement input methods (keyboard layouts),
define a Tay locale,
determine the ISO script code for Tay,

while:

attendance at UTC meetings,
attendance at ISO/IEC JTC1/SC2/WG2 meetings,
voting by Việt Nam’s national standards body, TCVN/STAMEC,

would be required whenever Tay appears on the agenda.

Phase 3: Implementation

As Phase 2 concludes, the Tay character set will be assigned to a specific encoding table in Unicode (and ISO 10646). This character encoding table will be definitive for the future. Data and information can be created in this encoding for interchange and dissemination worldwide.

The Phase 3 deliverable is:

an add-on implementation, including fonts and input methods, pending the built-in implementations.

The Phase 3 goals are:

built-in implementations of Tay in the major operating systems (Windows, Linux, Mac OS, Symbian) and web applications (eg, search), with:
- fonts to include the Tay charset (likely),
- input methods for Tay (somewhat likely),
- search (likely),
- localization -- locale, GUI, ... (not likely)

In this phase, fonts and tools implemented for the definitive encoding table can be distributed to the user. A tool that must be included is a converter from legacy encodings to the definitive encoding.
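
A minimal sketch of such a converter follows. It assumes two hypothetical legacy encodings, `legacy_a` and `legacy_b`, that place the same two Tay characters at different byte values. All byte values and target code points are placeholders, since the definitive encoding table does not yet exist.

```python
# Hypothetical mapping tables: one per legacy encoding, each mapping a
# legacy byte to a (placeholder) code point in the definitive encoding.
TABLES = {
    "legacy_a": {0x80: 0xE100, 0x81: 0xE101},
    "legacy_b": {0xC0: 0xE100, 0xC1: 0xE101},
}


def convert(data: bytes, encoding: str) -> str:
    """Convert legacy 8-bit text to the definitive encoding."""
    table = TABLES[encoding]
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))  # ASCII passes through unchanged
        elif b in table:
            out.append(chr(table[b]))
        else:
            # Refuse to guess: an unmapped byte means the table is incomplete.
            raise ValueError(f"byte 0x{b:02X} has no mapping in {encoding}")
    return "".join(out)


# The same text stored in two incompatible legacy encodings
# converges on a single representation after conversion.
from_a = convert(b"\x80\x81", "legacy_a")
from_b = convert(b"\xc0\xc1", "legacy_b")
```

Raising an error on unmapped bytes, rather than passing them through, keeps silently corrupted text out of the converted data.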

The duration for this Phase is indeterminate, as the built-in implementations of the new Unicode script sets depend on the respective vendors, and their determination of market demand. Consequently, add-on implementations are necessary.

Success factors

The success factors for this endeavor must include:

participation of indigenous speakers and users,
convincing presentation to standardization bodies, backed by solid evidence of usage by a community of users,
feedback from computer and linguistic experts,
implementation by vendors,
provision of tools and fonts to users.

Conclusion

The 3-phase roadmap is a realistic and realizable plan for quickly getting Tay into Unicode (and ISO 10646). However, it will require concerted efforts from everyone concerned: users, researchers and specialists, and other interested parties and funders who care deeply about preserving a culture through technology.

Appendix 1: Summary 2006-11-05

Following is a summary of the discussion on Sunday, between Jim Brase, Ngô Trung Việt & [James] Đỗ Bá Phước, based on the consensus of the Workshop in Điện Biên Phủ on Friday, 2006-11-03:

Script name: “Tay”, or “Thái Việt Nam”
- Jim will write to Peter Constable regarding the ISO 3-letter code
- We will also need to define the script-country pair, eg: th-VN
Repertoire:
- The Tay repertoire in the ThaiSonLa.ttf font is considered to be the minimum set
- Additions are:
  - Lai Châu aspirated
  - g, r
  - e, j and/or combining marks
    - (consider issues of search, sort)
  - hatted a
- Propose a minimum set first, then add later
Storage order is logical order, ie phonetic order
Character sort order will be based on:
- either Thái Sơn La sort order, as defined by reference to a book of epic songs
- or Lao sort order
Encoding table to be based on the Lao table in Unicode
Character naming to be based on Lao naming conventions in Unicode
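
Whichever sort order is chosen, it will differ from plain code point order, so tools will need an explicit collation table. The sketch below shows the idea with Latin letters as stand-ins; the sequence in `ORDER` is a hypothetical placeholder, not the Thái Sơn La or Lao order.

```python
# Hypothetical collation sequence, using Latin stand-ins for Tay letters.
ORDER = "cab"
RANK = {ch: i for i, ch in enumerate(ORDER)}


def sort_key(word: str):
    """Build a sort key from the collation table.

    Characters missing from the table sort after all listed characters,
    and among themselves by code point.
    """
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]


words = ["ab", "c", "b"]
sorted_words = sorted(words, key=sort_key)
```

Here `sorted_words` is `["c", "ab", "b"]`, whereas plain code point sorting would give `["ab", "b", "c"]`.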

Appendix 2: Miscellany

http://groups.google.com/group/tayvn
wiki
"Tay" is the complete spelling, and is not to be confused with "Tày" or "Tây".

References

Workshop Proceedings (November 2005)
Jim Brase & Ngô Trung Việt, Proposal (February 2006)
Lò Mai Cương, Thái Sơn La fonts & keyboards (July 2006)
Workshop Proceedings (November 2006)
Ngô Trung Việt, Report on the November 2006 Workshop
http://en.wikipedia.org/wiki/Tai_Dam_language
http://www.ethnologue.com/show_language.asp?code=blt
http://www.omniglot.com/writing/taidam.htm
http://scripts.sil.org/SILPUAassignments
http://scripts.sil.org/UnicodePUA
http://scripts.sil.org/VendorUseOfPUA

Revision history

2006-12-01 jDo - add BMP encoding as Phase 2 goal.
2006-11-30 Ngô Trung Việt - Vietnamese translation: http://docs.google.com/View?docid=dfwzjsv9_5znc4b3.
2006-11-28 jDo - revised & expanded.
2006-11-11 [James] Đỗ Bá Phước - new: http://vietual.blogspot.com/2006/11/roadmap-for-tay-thi-vit-nam-into_28.html.

Unicode support still varies greatly amongst the major free web email services.

Following is the updated ranking from best to worst for the top four:

Gmail: messages with Unicode text composed in a web browser appear correctly everywhere, both in browsers and in desktop POP3 clients such as Outlook Express.
AIM|AOL Mail: most improved service, with messages with Unicode text composed in a web browser -- in both plaintext and richtext/HTML messages -- appearing correctly everywhere. The mailbox can now be accessed with desktop clients through the IMAP protocol. The negative is that the web interface is ad-laden and quite confusing.
Yahoo!Mail: with the older, non-beta version, messages with Unicode text appear correctly on Yahoo!Mail or Gmail, but not in Outlook Express. POP3 access is not free in the US.
Hotmail: this is the worst. Messages with Unicode text composed in a web browser do not display correctly anywhere! Come on, Microsoft -- make Hotmail as Unicode-savvy as your desktop applications.

So, try out the first two, but avoid the last!

việtual

2006-11-28

Roadmap for Tay (Thái Việt Nam) into Unicode

2006-08-12

Unicode-savvy (or not) web email services
