JeRTeh - Društvo za jezičke resurse i tehnologije
Good Dictionary EXamples: API manual

The GDEX web service can be invoked, for example, with curl on Unix as follows:

		curl -d '{
			"data": "Ovo je neki nasumični latinični tekst",
			"lang": "sr",
			"kwic": "neki",
			"kwic_pos": "A",
			"feature_names": ["sentence_length", "avg_word_len"]
			}' \
		-H "Content-Type: application/json" -X POST https://gdex.jerteh.rs/features
	
Fields description:
data (str)
	mandatory; the text (sentence/example) to analyse
lang (str)
	optional; either 'sr' (default) or 'en'
kwic (str)
	optional; needed only for headword-dependent features
kwic_pos (str)
	optional; POS tag of the kwic
feature_names (list)
	optional; list of strings naming the requested features; if omitted, all features are returned
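
The same request can be assembled from a script. The following is a minimal Python sketch, not an official client: the helper names build_payload and request_features are hypothetical, the endpoint is the one from the curl example, and the assumption that the service answers with JSON is not confirmed by this manual.

```python
import json
import urllib.request

GDEX_URL = "https://gdex.jerteh.rs/features"  # endpoint taken from the curl example


def build_payload(data, lang=None, kwic=None, kwic_pos=None, feature_names=None):
    """Assemble the request body; only 'data' is mandatory (hypothetical helper)."""
    if not isinstance(data, str) or not data.strip():
        raise ValueError("'data' is mandatory and must be a non-empty string")
    if lang is not None and lang not in ("sr", "en"):
        raise ValueError("'lang' must be 'sr' or 'en'")
    payload = {"data": data}
    optional = {
        "lang": lang,
        "kwic": kwic,
        "kwic_pos": kwic_pos,
        "feature_names": list(feature_names) if feature_names is not None else None,
    }
    # Leave out optional fields that were not supplied, so server defaults apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def request_features(payload, url=GDEX_URL, timeout=30):
    """POST the payload as JSON and decode the reply.

    Assumption: the reply is a JSON document; the manual does not
    describe the response format.
    """
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = build_payload(
    "Ovo je neki nasumični latinični tekst",
    lang="sr",
    kwic="neki",
    kwic_pos="A",
    feature_names=["sentence_length", "avg_word_len"],
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
# To actually contact the service: result = request_features(payload)
```

Building the payload separately from sending it keeps the field checks testable without network access.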

If you use this API, please cite:

Ranka Stanković, Branislava Šandrih, Rada Stijović, Cvetana Krstev, Duško Vitas, and Aleksandra Marković. SASA Dictionary as the Gold Standard for Good Dictionary Examples for Serbian. In I. Kosem, T. Zingano Kuhn, M. Correia, J. P. Ferreria, M. Jansen, I. Pereira, J. Kallas, M. Jakubíček, S. Krek, and C. Tiberius, editors, Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference, pages 248–269. 1-3 October 2019, Sintra, Portugal. Brno: Lexical Computing CZ, s.r.o., 2019. URL

Feature name | Lemma independent | Description | Return type | Language
sentence_length | T | Count of all characters | int | en, sr
no_all_nospace_chars | T | Count of all characters that are not spaces | int | en, sr
no_digits | T | Count of digits | int | en, sr
no_weird_chars | T | Count of unusual characters (listed below the table) | int | en, sr
no_commas | T | Count of commas | int | en, sr
no_fullstops | T | Count of full stops | int | en, sr
no_punctuation | T | Count of all punctuation marks | int | en, sr
no_all_tokens | T | Count of all tokens | int | en, sr
avg_token_len | T | Average token length | float | en, sr
max_token_len | T | Maximum token length | int | en, sr
exist_token_longer_than_12 | T | There is a token with more than 12 characters | bool | en, sr
more_than_7_tokens | T | There are more than 7 tokens | bool | en, sr
less_than_60_tokens | T | There are fewer than 60 tokens | bool | en, sr
no_all_words | T | Count of all words | int | en, sr
avg_word_len | T | Average word length | float | en, sr
max_word_len | T | Maximum word length | int | en, sr
min_word_len | T | Minimum word length | int | en, sr
no_capitalised_words | T | Count of words that begin with an uppercase letter and are not in the first position | int | en, sr
grammarly_sentence | T | Sentence begins with an uppercase letter and ends with punctuation | bool | en, sr
contains_web_or_email | T | Contains an email or a web address | bool | en, sr
no_stopwords | F | Count of stop words | int | en, sr
no_pronouns | T | Number of tokens tagged as pronouns | int | en, sr
no_rare_tokens | T | Number of tokens with frequency <= 10 in a reference corpus | int | sr
avg_freq_in_corpus | T | Average word frequency in a reference corpus | float | sr
no_known_words | T | Number of non-stop, non-punctuation words found in the reference corpus | int | sr
no_unknown_words | T | Number of non-stop, non-punctuation words not found in the reference corpus | int | sr
blacklist_count | T | Count of blacklisted words | int | sr
kwic_occurs_more_than_1 | F | True if the KWIC appears more than once | int | en, sr
no_repeated_lemmas | T | Number of lemmas that appear multiple times | int | en, sr
between_15_40_tokens | T | Sentence contains between 15 and 40 tokens | bool | en, sr
no_tokens_mixed_symbols | T | Number of tokens with mixed symbols (e.g. letters and digits) | int | en, sr
no_of_proper_names | T | Number of tokens tagged as proper names | int | en, sr
init_word_tag | T | POS tag of the first word | char | en, sr
kwic_abs_position | F | Position of the KWIC in the sentence (absolute value, character position) | int | en, sr
kwic_found | F | An input KWIC or a similar word ma | string | en, sr
kwic_norm_position_char | F | Position of the KWIC in the sentence (percent, character normalised) | float | en, sr
kwic_norm_position_token | F | Position of the KWIC in the sentence (percent, token normalised) | float | en, sr
stoplist_sentence_initial_words | F | Initial word of the sentence is present in a stoplist | bool | sr

Characters counted by no_weird_chars: —ȍȋȃȂȇȁòàъ!"#$%&'()*+-/:;<=>?@[\]^_`{|}~’„”…
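
To make the simplest descriptions concrete, the sketch below recomputes a few character-level features locally. It is an illustration based only on the descriptions in the table, not the service's implementation: the function name simple_features is hypothetical, and treating every whitespace character as a "space" is an assumption, so the service may return different values in edge cases.

```python
def simple_features(text):
    """Approximate a few character-level GDEX features (illustrative only)."""
    return {
        # Count of all characters, spaces included.
        "sentence_length": len(text),
        # Assumption: any whitespace character counts as a space.
        "no_all_nospace_chars": sum(1 for ch in text if not ch.isspace()),
        # Count of digit characters.
        "no_digits": sum(1 for ch in text if ch.isdigit()),
        # Count of commas and of full stops.
        "no_commas": text.count(","),
        "no_fullstops": text.count("."),
    }


features = simple_features("U 2019. godini, bilo je 3 primera.")
for name, value in features.items():
    print(name, value)
```

Token- and lemma-based features (for example no_all_tokens or no_repeated_lemmas) depend on the tokenizer, tagger and corpora used by the service, so they are not reproduced here.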