JeRTeh - Društvo za jezičke resurse i tehnologije (Society for Language Resources and Technologies)
[Here comes an explanation of the project...]
The Web service can be invoked, e.g. using curl, in the following manner:
        curl -d '{
                "data": "Ovo je neki nasumičan latinični tekst",
                "lang": "sr",
                "kwic": "neki",
                "kwic_pos": "A",
                "feature_names": ["sentence_length", "avg_word_len"]}' \
            -H "Content-Type: application/json" -X POST http://147.91.183.8:12347/features
        
with fields:
data (string): mandatory
lang (string): optional (default 'sr')
kwic (string): optional (only needed for kwic-d features)
kwic_pos (string): optional (does not affect feature extraction; reserved for later use)
feature_names (list of strings): optional (if omitted, a dict of all feature values is returned)
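The same request can also be assembled programmatically. The following is a minimal Python sketch using only the standard library; the helper names (`build_payload`, `build_request`, `extract_features`) are hypothetical, and it assumes, without confirmation from this page, that the service replies with a JSON body.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
FEATURES_URL = "http://147.91.183.8:12347/features"


def build_payload(data, lang=None, kwic=None, kwic_pos=None, feature_names=None):
    """Build the JSON body; only `data` is mandatory, the rest are optional."""
    if not isinstance(data, str):
        raise ValueError("'data' is mandatory and must be a string")
    payload = {"data": data}
    optional = {
        "lang": lang,
        "kwic": kwic,
        "kwic_pos": kwic_pos,
        "feature_names": feature_names,
    }
    # Leave out unset optional keys so the service applies its own defaults.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def build_request(payload, url=FEATURES_URL):
    """Wrap the payload in a POST request with a JSON content type."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_features(payload, url=FEATURES_URL, timeout=10):
    """Send the request; assumes (not confirmed here) that the reply is JSON."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (requires network access to the service):
# payload = build_payload("Ovo je neki nasumičan latinični tekst",
#                         kwic="neki", feature_names=["sentence_length"])
# print(extract_features(payload))
```

Omitting unset keys, rather than sending them as null, keeps the request equivalent to a curl call that simply leaves those fields out.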

A list of all available features that can be extracted

| Name of the Feature | Lemma Independent | Description | Return Type | Language |
|---|---|---|---|---|
| sentence_length | T | Count of all characters | int | en, sr |
| no_all_nospace_chars | T | Count of all characters that are not spaces | int | en, sr |
| no_digits | T | Count of digits | int | en, sr |
| no_weird_chars | T | Count of characters: —ȍȋȃȂȇȁòàъ!"#$%&\'()*+-/:;<=>?@[\\]^_`{\|}~’„”… | int | en, sr |
| no_commas | T | Count of commas | int | en, sr |
| no_fullstops | T | Count of full stops | int | en, sr |
| no_punctuation | T | Count of all punctuation marks | int | en, sr |
| no_all_tokens | T | Count of all tokens | int | en, sr |
| avg_token_len | T | Average token length | float | en, sr |
| max_token_len | T | Max token length | int | en, sr |
| exist_token_longer_than_12 | T | There is a token with more than 12 characters | bool | en, sr |
| more_than_7_tokens | T | There are more than 7 tokens | bool | en, sr |
| less_than_60_tokens | T | There are fewer than 60 tokens | bool | en, sr |
| no_all_words | T | Count of all words | int | en, sr |
| avg_word_len | T | Average word length | float | en, sr |
| max_word_len | T | Max word length | int | en, sr |
| min_word_len | T | Min word length | int | en, sr |
| no_capitalised_words | T | Count of words that begin with uppercase, not in the 1st position | int | en, sr |
| grammarly_sentence | T | Sentence begins with uppercase and ends with punctuation | bool | en, sr |
| contains_web_or_email | T | Contains an email or a web address | bool | en, sr |
| no_stopwords | | Count of stop-words | int | en, sr |
| no_pronouns | T | Number of tokens tagged as pronouns | int | en, sr |
| no_rare_tokens | T | Number of tokens with frequency <= 10 in a reference corpus | int | sr |
| avg_freq_in_corpus | T | Average word frequency in a reference corpus | float | sr |
| blacklist_count | T | Count of blacklisted words | int | sr |
| kwic_occurs_more_than_1 | F | True if KWIC appears more than once | int | en, sr |
| no_repeated_lemmas | T | Number of lemmas that appear multiple times | int | en, sr |
| between_15_40_tokens | T | Sentence contains between 15 and 40 tokens | bool | en, sr |
| no_tokens_mixed_symbols | T | Number of tokens with mixed symbols (e.g. letters and digits) | int | en, sr |
| no_of_proper_names | T | Number of tokens tagged as proper names | int | en, sr |
| init_word_tag | T | POS-tag of the first word | char | en, sr |
| kwic_position_char | F | Position of KWIC in a sentence (percent, character) | float | en, sr |
| kwic_position_token | F | Position of KWIC in a sentence (percent, token) | float | |
| stoplist_sentence_initial_words | F | Initial word of a sentence is present in a stoplist | bool | sr |
| stoplist_sentence_initial_mwe | F | A MWE from the fixed list found at the beginning of the sentence | bool | sr |
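
Since the Language column lists several features for Serbian only, a client may want to check its `feature_names` before sending a request with `"lang": "en"`. The sketch below is a hypothetical client-side helper, not part of the service; the set is transcribed from the feature list above, and how the service itself reacts to an unsupported feature is not documented here.

```python
# Features that the list above marks as available for "sr" only.
# kwic_position_token has no language listed, so it is deliberately left out.
SR_ONLY_FEATURES = frozenset({
    "no_rare_tokens",
    "avg_freq_in_corpus",
    "blacklist_count",
    "stoplist_sentence_initial_words",
    "stoplist_sentence_initial_mwe",
})


def unsupported_features(lang, feature_names):
    """Return the requested names that the feature list does not offer for `lang`."""
    if lang not in ("en", "sr"):
        raise ValueError("the feature list only covers 'en' and 'sr'")
    if lang == "sr":
        return []
    # Keep the caller's order so the result is easy to report back.
    return [name for name in feature_names if name in SR_ONLY_FEATURES]
```

For example, `unsupported_features("en", ["sentence_length", "blacklist_count"])` flags only `blacklist_count`.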