Feature name | Lemma independent | Description | Return Type | Language |
sentence_length | T | Count of all characters | int | en, sr |
no_all_nospace_chars | T | Count of all characters that are not spaces | int | en, sr |
no_digits | T | Count of digits | int | en, sr |
no_weird_chars | T | Count of characters: —ȍȋȃȂȇȁòàъ!"#$%&\'()*+-/:;<=>?@[\\]^_`{|}~’„”… | int | en, sr |
no_commas | T | Count of commas | int | en, sr |
no_fullstops | T | Count of full stops | int | en, sr |
no_punctuation | T | Count of all punctuation marks | int | en, sr |
no_all_tokens | T | Count of all tokens | int | en, sr |
avg_token_len | T | Average token length | float | en, sr |
max_token_len | T | Max token length | int | en, sr |
exist_token_longer_than_12 | T | There is a token with more than 12 characters | bool | en, sr |
more_than_7_tokens | T | There are more than 7 tokens | bool | en, sr |
less_than_60_tokens | T | There are fewer than 60 tokens | bool | en, sr |
no_all_words | T | Count of all words | int | en, sr |
avg_word_len | T | Average word length | float | en, sr |
max_word_len | T | Max word length | int | en, sr |
min_word_len | T | Min word length | int | en, sr |
no_capitalised_words | T | Count of words that begin with an uppercase letter and are not in the first position | int | en, sr |
grammarly_sentence | T | Sentence begins with uppercase and ends with punctuation | bool | en, sr |
contains_web_or_email | T | Contains an email or a web address | bool | en, sr |
no_stopwords | F | Count of stop-words | int | en, sr |
no_pronouns | T | Number of tokens tagged as pronouns | int | en, sr |
no_rare_tokens | T | Number of tokens with frequency <= 10 in a reference corpus | int | sr |
avg_freq_in_corpus | T | Average word frequency in a reference corpus | float | sr |
no_known_words | T | Number of non-stop and non-punctuation words found in the reference corpus | int | sr |
no_unknown_words | T | Number of non-stop and non-punctuation words not found in the reference corpus | int | sr |
blacklist_count | T | Count of blacklisted words | int | sr |
kwic_occurs_more_than_1 | F | True if the KWIC appears more than once | int | en, sr |
no_repeated_lemmas | T | Number of lemmas that appear multiple times | int | en, sr |
between_15_40_tokens | T | Sentence contains between 15 and 40 tokens | bool | en, sr |
no_tokens_mixed_symbols | T | Number of tokens with mixed symbols (e.g. letters and digits) | int | en, sr |
no_of_proper_names | T | Number of tokens tagged as proper names | int | en, sr |
init_word_tag | T | POS-tag of the first word | char | en, sr |
kwic_abs_position | F | Position of KWIC in a sentence (absolute value, char position) | int | en, sr |
kwic_found | F | An input KWIC or a similar word ma | string | en, sr |
kwic_norm_position_char | F | Position of KWIC in a sentence (percent, character normalised) | float | en, sr |
kwic_norm_position_token | F | Position of KWIC in a sentence (percent, token normalised) | float | en, sr |
stoplist_sentence_initial_words | F | Initial word of a sentence is present in a stoplist | bool | sr |
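
To make a few of the surface features above concrete, here is a minimal Python sketch that computes some of them for a single sentence. It is not the original extractor. The function name `surface_features` is hypothetical, and several choices are assumptions the table does not state: tokens are taken to be whitespace-separated chunks, the 15 to 40 range is treated as inclusive, and "ends with punctuation" is checked against ASCII punctuation only.

```python
import string


def surface_features(sentence: str) -> dict:
    """Compute a handful of lemma-independent surface features.

    Assumptions (not from the feature table): whitespace tokenisation,
    inclusive 15-40 token range, ASCII-only sentence-final punctuation.
    """
    tokens = sentence.split()
    token_lengths = [len(t) for t in tokens]
    stripped = sentence.strip()
    return {
        "sentence_length": len(sentence),
        "no_all_nospace_chars": sum(1 for c in sentence if c != " "),
        "no_digits": sum(1 for c in sentence if c.isdigit()),
        "no_commas": sentence.count(","),
        "no_fullstops": sentence.count("."),
        "no_all_tokens": len(tokens),
        # Guard against empty input to avoid division by zero.
        "avg_token_len": sum(token_lengths) / len(tokens) if tokens else 0.0,
        "max_token_len": max(token_lengths, default=0),
        "exist_token_longer_than_12": any(n > 12 for n in token_lengths),
        "more_than_7_tokens": len(tokens) > 7,
        "less_than_60_tokens": len(tokens) < 60,
        "between_15_40_tokens": 15 <= len(tokens) <= 40,
        "grammarly_sentence": (
            bool(stripped)
            and stripped[0].isupper()
            and stripped[-1] in string.punctuation
        ),
    }


if __name__ == "__main__":
    for name, value in surface_features("Hello, world 42.").items():
        print(f"{name}: {value}")
```

Features that depend on a POS tagger, a stop-word list, a reference corpus, or the KWIC are left out of the sketch because they need external resources that the table does not specify.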