1. Aim of Establishing GSK

In the research and development of speech / natural language processing technology, the importance of language resources such as speech data, lexicons, text corpora, terminology, and various language tools has grown in recent years. In particular, as can be seen in the recent trend of “corpus-based speech / natural language processing,” probabilistic and statistical methods using corpora and real, large-scale data have been successful.

Speech and language data are important both as a foundation of the knowledge information processing field and for the development of the information-communication industry. Nevertheless, because constructing large-scale speech and language data resources requires enormous effort, time, and money, developing them at each individual research site is difficult. At present, research sites that wish to use such data have little choice but to rely on resources developed elsewhere. Many of the general large-scale language data resources that exist were developed by publishers and newspaper companies, which are different types of businesses from the organizations that perform speech / natural language processing research and development, and moreover were not originally developed with research goals in mind.

Because of this, users who want to use language resources, including speech and language data, have no choice but to negotiate rights and costs individually with each language resource provider. Providers, for their part, may face forms of use that they had not previously conceived of, so hesitation and confusion about providing language resources can also be seen. Nor is there any established general rule for offering language resources. As a result, this situation has become a primary obstacle to Japan’s development of speech / natural language processing technology and research.

In view of this, establishing a system for the use of language resources, and offering them in a form agreeable to both providers and users, would promote both the circulation of language resources and Japan’s research on speech and natural language processing. There is a pressing need to establish such a system in order to contribute to the development of the speech and language industry. Such a system would contribute not only to speech / natural language processing but also, more broadly, to the development of linguistics research as a whole.

In Europe and the United States, the necessity for such a system was recognized early, and two consortiums were created: LDC (Linguistic Data Consortium) in the United States and ELRA (European Language Resources Association) in Europe. These consortiums, which officially support one another, collect speech / language resources developed at various sites and carry out a mediation business that distributes these resources to users who wish to use them (levying usage fees on behalf of the parties providing language resources and receiving a fixed margin). In this way, users can acquire and use the language resources they need through a simple procedure. In Japan too, it is desirable to establish an organization that collects and distributes language resources as the LDC and ELRA do.

Against this background, GSK aspires to be the organization that contributes to the promotion of Japanese scientific research and technology in this field by promoting the circulation of the language resources that are indispensable to the research and development of speech / natural language processing. Furthermore, by not limiting our target to domestic Japanese language resources but expanding in the near future to the rest of Asia, we expect to play a role as one of the three large consortiums of Europe, America, and Asia, and expect that this will lead to further international contributions to natural language processing technology and language research.

2. Importance of GSK

Through the circulation of language resources, Gengo-Shigen-Kyokai (“GSK”) offers the following merits both to parties providing language resources and to parties using them.

2.1 For parties providing language resources

  • By supporting new uses that were not traditionally conceived of, GSK stimulates new demand, which in turn leads to revenue, and rapid improvement in language resources can also be expected.
  • In the near future, the following two merits can be expected:
    • Since GSK will act as proxy for contracts (mediation business), providers will not need to deal with troublesome contract procedures.
    • All data will be used under a contract in which copyrights and related handling issues are clearly defined, which protects against misuse and rights infringement.

2.2 For parties using language resources

  • Large-scale language resources can be used at little or no cost.
  • Effective utilization of otherwise stored and unused language resources can be facilitated.
  • In the near future, since contracts and mediation will be carried out by GSK, users will not need to negotiate individually with resource providers; only a simple process will be required before language resources can be used.