Viatcheslav Yatsko, 2005 - 2011 COMPUTATIONAL LINGUISTICS LABORATORY

LINGUISTICS AND INFORMATICS

This Web site has been created by V. Yatsko (last name also spelt "Iatsko"), head of the Computational Linguistics Laboratory (CLL), Professor in the Department of Information Technologies and Systems at Katanov State University of Khakasia (KSUK) located in Abakan, Russia.
The CLL at KSUK was founded in 2002 to conduct work in the following areas.

1. Applied linguistics research, development of computer systems to be used in CALL.
So far two such systems have been created: PASS and TITE (see below).

2. Automatic text summarization research.
V. Yatsko is the author of the symmetric summarization conception that underlies PASS and ETS and allows them to produce coherent and adequate summaries. The UNIS summarizer has been designed to perform adaptive text summarization. For details see the 'Publications' section [1-4].

3. Evaluation of the Internet information retrieval systems and text summarization systems.
V. Yatsko is the author of the depth of user's search conception, described in a paper submitted to Dialog-2006 (http://www.dialog-21.ru/default.asp). The reference dictionary conception is being developed to evaluate automatic text summarization systems as well as Internet information retrieval systems.

4. Discourse analysis.
The integrational discourse analysis conception suggested by V. Yatsko [5-7] distinguishes between surface and deep levels of discourse structure. We are currently investigating various types of possessive discourse and the linguistic features of possessive relations that differentiate between alienable and inalienable possession [8].

5. Corpus linguistics and computer learner corpora research project.
Linguistic Toolbox is a concordance that outperforms existing analogues in functionality.
CLCR is an ongoing project aimed at 1) creating corpora of texts (dictations, expositions, compositions, etc.) produced by Russian-speaking learners of English; 2) creating tools for error tagging and automatic analysis of these corpora; 3) contrastive analysis of Russian learner corpora against corpora produced by speakers of other languages. The project is in line with research done by Granger et al. [9].

6. Text/data mining.
We completed a commercial project on sentiment mining, creating a system that recognizes and analyzes users' opinions about commercial products. The system works on an ontology and a linear grammar specially developed for this project.
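
For illustration, here is a toy Python sketch of how an ontology of product aspects can be combined with a simple pattern grammar to extract opinions; the ontology entries, the polarity lexicon, and the single "ASPECT is OPINION" pattern are our illustrative assumptions, not the actual resources developed for the project.

    import re

    ONTOLOGY = {"phone", "camera", "battery"}          # product aspects
    POLARITY = {"great": 1, "good": 1, "poor": -1, "terrible": -1}

    def extract_opinions(sentence):
        # Match the toy pattern "ASPECT is OPINION" and keep only pairs
        # licensed by the ontology and the polarity lexicon.
        opinions = []
        for aspect, adj in re.findall(r"(\w+) is (\w+)", sentence.lower()):
            if aspect in ONTOLOGY and adj in POLARITY:
                opinions.append((aspect, adj, POLARITY[adj]))
        return opinions

    print(extract_opinions("The camera is great but the battery is poor."))
    # -> [('camera', 'great', 1), ('battery', 'poor', -1)]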

OUR PRODUCTS

So far ten products have been created:
- Compare - a program for comparative linguistics research;
- PASS - a semi-automatic network text summarization system for foreign language teaching;
- TITE - a bilingual network translation system for foreign language teaching;
- ETS - an event tracking summarizer;
- UNIS Summarizer - a system that recognizes the genre of the input text and uses summarization algorithms optimized for that genre;
- TEXOR Topic Extractor - a system specially designed to recognize and extract topics of chats;
- Y-Stemmer - a program that allows the user to get the stems of words in the input text;
- STAR-Tagger - a program that annotates input text with part-of-speech tags;
- TF*IDF Ranker - a program that computes weights for terms in the input text;
- Linguistic Toolbox - a concordance with which the user can create his/her own annotated corpora and perform various types of searches.

This page focuses on the latest five products. Descriptions of all products are available on the CLL's main site.

TF*IDF Ranker

TF*IDF Ranker allows the user to get scores for each term of the input text according to the classic formula or its modified version. This is the classic formula, widely used in term weighting techniques:

w_ij = tf_ij * log2(N / n)

where

w_ij = weight of term T_j in document D_i

tf_ij = frequency of term T_j in document D_i

N = number of documents in the corpus

n = number of documents in which term T_j occurs at least once

Once you get a list of terms with their weights arranged in descending order, you can copy it to an external editor and use it in various ways: for example, to filter out stop-words, i.e. words with zero or low scores, or, on the contrary, to take the most salient words, those with the highest weights, to represent the content of the input text.

A drawback of this formula is that terms that occur in the input text but cannot be found in the corpus get zero scores. In many cases such terms may be important for text understanding: a scientist may describe an invention using newly coined terms, or a writer may invent neologisms that are not registered in existing corpora. That is why we modified the classic formula: if a word occurs in the input text but does not occur in the corpus, n in the formula is assigned the value 1 rather than 0.
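
As an illustration, here is a minimal Python sketch of both scoring variants; the function name and the whitespace-tokenized toy corpus are our assumptions for the example, not the Ranker's internals.

    import math
    from collections import Counter

    def tfidf_scores(text_tokens, corpus_docs, modified=False):
        # corpus_docs is a list of token lists, one per corpus document,
        # so N = len(corpus_docs). With modified=True a term absent from
        # the corpus is scored with n = 1 (the modified formula) instead
        # of getting a zero score (the classic formula).
        N = len(corpus_docs)
        doc_freq = Counter()
        for doc in corpus_docs:
            doc_freq.update(set(doc))       # count each term once per document

        scores = {}
        for term, tf in Counter(text_tokens).items():
            n = doc_freq.get(term, 0)
            if n == 0 and not modified:
                scores[term] = 0.0          # classic formula: unseen terms score 0
                continue
            scores[term] = tf * math.log2(N / max(n, 1))
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
    text = "the cat sat on the xylograph".split()
    print(tfidf_scores(text, corpus, modified=True))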

The main problem with this TF*IDF technique is the number of texts in the corpus, i.e. the value of N. How many texts must the corpus include to be representative enough? No formal criteria have been developed so far; we are working on this problem and hope to suggest a solution in the near future.

TF*IDF Ranker works on Windows machines and requires the .NET Framework. It supports English texts in .txt format.

How to use

1) Use the add button to add texts and create a corpus.
2) Upload a text to analyze.
3) Select the classic or the modified formula; the classic version is the default option.
4) Click analyze.
5) Get a list of terms arranged in descending order of their weights.
6) Copy the list to an external editor for further processing.


Requirements
Windows 98/NT/2000/XP
300 MHz processor
0.5 MB of free disk space

Download TF*IDF Ranker

Y-Stemmer

Stemming, as a procedure of automatic morphological analysis, has been an indispensable feature of information retrieval and text summarization since the early 1960s. The general idea underlying stemming is to identify words that are the same in meaning but different in form by removing suffixes and endings. Such identification is important for correct term weighting and significantly increases the effectiveness of information retrieval.

By now a number of stemmers for different languages have been created, the most famous English ones being the Porter stemmer and the Paice/Husk (Lancaster) stemmer. Both are algorithmic stemmers that work on lists of suffixes specific to English.

Y-Stemmer (Yatsko's stemmer), in contrast with existing stemmers, is built on STAR-Tagger: the input text is first annotated with POS tags, and then the suffixes and endings specific to the given part of speech are removed. Because stemming relies on preliminary POS tagging, the number of overstemming mistakes is reduced.

Another specific feature of Y-Stemmer is the identification of irregular verb forms, as well as nouns and pronouns with irregular plural forms. For example, Y-Stemmer will identify was, were, are, and am as forms of the verb be, and buys, buying, and bought as forms of the verb buy.
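
The following toy Python sketch illustrates the general idea of POS-guided stemming with an irregular-form lookup; the suffix tables and the lookup dictionary are illustrative stand-ins, not Y-Stemmer's actual rules.

    # Irregular forms resolve straight to a lemma, as in the examples above.
    IRREGULAR = {"was": "be", "were": "be", "are": "be", "am": "be",
                 "bought": "buy", "buying": "buy", "buys": "buy"}

    # Which suffixes to strip depends on the part of speech of the token.
    SUFFIXES_BY_POS = {
        "NN": ["es", "s"],               # noun endings
        "VB": ["ing", "ed", "es", "s"],  # verb endings
        "JJ": ["est", "er"],             # adjective endings
    }

    def stem(token, pos):
        word = token.lower()
        if word in IRREGULAR:
            return IRREGULAR[word]
        for suffix in SUFFIXES_BY_POS.get(pos[:2], []):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    print(stem("bought", "VBD"))   # -> buy
    print(stem("books", "NNS"))    # -> book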

Stemmers are usually integrated into an NLP system and used during text preprocessing. We distribute Y-Stemmer as a stand-alone application for testing purposes; it can also be used for educational purposes and for term weighting.

To get your text stemmed, just open it in Y-Stemmer; stems will be given in square brackets.



Check the only stemmed words box to get only stems, without word forms.



We evaluated our stemmer against the Paice/Husk stemmer and found that the quality of our stemmer is 98.7% (1.3% mistakes per 1000 words), while the quality of the Lancaster stemmer is 88.87% (11.13% mistakes per 1000 words). Y-Stemmer thus outperformed the Paice/Husk stemmer by 9.83%.

Download Y-Stemmer

STAR-Tagger

STAR-Tagger (Starikov's Tagger, or StarT) allows the user to get a text annotated with POS tags. It works on the well-known bidirectional inference algorithm, according to which a POS tag is assigned to a token depending on the POS tags of the tokens to the right and to the left of the current token.

Part-of-speech tagging has been widely used in corpus linguistics and in recent decades has become an indispensable component in such fields as text mining and text classification/categorization. Application of tagging in these fields faces one major problem: it is a time-consuming procedure that badly affects the speed of an NLP system when performed dynamically. To make POS tagging faster we modified the bidirectional inference algorithm, excluding from it two parameters that can be computed on the fly once the rest of the parameters are known. For details see our paper [16] in the list of publications at the foot of the page.
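
The following toy Python sketch shows the general shape of easiest-first bidirectional inference; the stub scoring function stands in for StarT's trained model, and our modification of the algorithm is not reproduced here.

    def bidirectional_tag(tokens, score):
        # score(token, left_tag, right_tag) returns (tag, confidence) pairs.
        # Tags are fixed one at a time, most confident decision first, so
        # later decisions can condition on tags already assigned on both sides.
        tags = [None] * len(tokens)
        remaining = set(range(len(tokens)))
        while remaining:
            best = None                              # (confidence, position, tag)
            for i in remaining:
                left = tags[i - 1] if i > 0 else None
                right = tags[i + 1] if i + 1 < len(tokens) else None
                tag, conf = max(score(tokens[i], left, right),
                                key=lambda tc: tc[1])
                if best is None or conf > best[0]:
                    best = (conf, i, tag)
            _, i, tag = best
            tags[i] = tag
            remaining.remove(i)
        return tags

    # Stub model: "the" is almost certainly DT; a token right of DT is a noun.
    def toy_score(token, left_tag, right_tag):
        if token == "the":
            return [("DT", 0.99)]
        if left_tag == "DT":
            return [("NN", 0.9), ("VB", 0.1)]
        return [("VB", 0.5), ("NN", 0.4)]

    print(bidirectional_tag(["the", "cat", "sleeps"], toy_score))
    # -> ['DT', 'NN', 'VB']
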
As a result, StarT works much faster than its immediate analogue, a tagger developed by Japanese scientists that employs the same algorithm. The table below displays the results of tests conducted on a Pentium 4 machine (2.8 GHz, 768 MB of RAM).

Text size | T&T tagger (http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger) | StarT
10 KB     | 2 sec.         | in no time (< 1 sec.)
50 KB     | 9 sec.         | < 1 sec.
100 KB    | 17 sec.        | 1 sec.
500 KB    | 1 min. 22 sec. | 3 sec.
1000 KB   | 2 min. 50 sec. | 6 sec.


We also evaluated the quality of StarT against that of the tagger used in the American National Corpus (ANC) by matching their annotated texts against texts annotated manually by human experts. We found that the quality of StarT was 99.27% (0.73% mistakes per 1000 words), while the quality of the ANC tagger was 99.33% (0.67% mistakes per 1000 words).

The ANC texts were chosen for the contrastive analysis because this corpus is the most recent one and employs the most modern software. While developing StarT we took the ANC as a model and used the same tagset.

To use StarT, open a text from a directory by clicking the load and tag button. You can then copy the annotated text to an external editor.



StarT can process English texts in .txt format on Windows machines and requires the .NET Framework.

Download StarT

UNIS Summarizer

UNIS is a system that performs adaptive text summarization: it recognizes the genre of the input text and employs algorithms optimized for that genre. This prototype version supports recognition of three genres: fiction, academic, and newspaper.
To implement genre recognition we manually distinguished linguistic parameters of texts (such as the distribution of proper nouns and of different pronouns) and then conducted a number of experiments using a neural network to find out which parameters are specific to each of the three genres.
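
By way of illustration, here is a toy Python sketch of genre recognition from hand-crafted features with a small neural network (scikit-learn); the feature profiles and training data are invented for the example and are not UNIS's actual parameters.

    from sklearn.neural_network import MLPClassifier

    # Each row: [proper-noun ratio, personal-pronoun ratio, avg sentence length].
    X_train = [
        [0.02, 0.09, 14.0],   # fiction-like profile
        [0.01, 0.01, 27.0],   # academic-like profile
        [0.08, 0.02, 21.0],   # newspaper-like profile
    ]
    y_train = ["fiction", "academic", "newspaper"]

    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)

    # Predicted genre for an unseen feature profile.
    print(clf.predict([[0.07, 0.02, 20.0]]))
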
As soon as you open a text (use the leftmost button), it is identified as fiction, academic, or newspaper. Press the summarize (rightmost) button to get a summary. In the left section of the program window you can see a list of terms with their weights arranged in descending order; the terms with the highest weights are supposed to be the most salient in the given text. The summary size is determined dynamically by the system depending on the text structure.



We conducted evaluations comparing summaries produced by UNIS with those created by human experts and found that the quality of newspaper summaries was about 90%, while the quality of fiction and academic summaries ranged between 65% and 70%.

UNIS can process English texts in .txt format on Windows machines and requires the .NET Framework.

Requirements
Windows 98/NT/2000/XP
300 MHz processor
0.5 MB of free disk space

Download UNIS

Linguistic Toolbox (LIT)

Linguistic Toolbox is a concordance that allows the user to create his/her own POS-annotated corpora and perform all types of searches adopted in corpus linguistics. LIT has a built-in part-of-speech tagger that we created ourselves, STAR-Tagger (StarT) (see above).
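
As a simple illustration of the kind of search a concordance performs, here is a toy KWIC (key word in context) routine in Python; the tokenization and the window size are our illustrative choices, not LIT's.

    def kwic(tokens, keyword, window=4):
        # Yield every occurrence of keyword with `window` tokens of context.
        keyword = keyword.lower()
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                yield f"{left:>30} [{tok}] {right}"

    text = "The cat sat on the mat and the cat slept".split()
    for line in kwic(text, "cat"):
        print(line)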

Requirements:

Windows 2000/XP/Vista/7
1500 MHz processor
Mono runtime
1 GB of free disk space

Download LIT tutorial

Download Linguistic Toolbox

All these programs are distributed as freeware under the GNU General Public License.

CONTACTS
Laboratory's address: Lenin Street 90, Abakan, Russia, 655017
Tel/fax: +7 (3902) 260227
E-mail: iatsko@gmail.com
You are also invited to visit other Web pages created by me (V.Yatsko).


Literary Web page (in English and Russian)

Educational Web page (in Russian)

Find my picture here


