text_tagger package¶

Submodules¶

text_tagger.compare¶

class text_tagger.compare.Compare(database)¶

Bases: object

Class that wraps a database to compare the texts of two tags

Parameters: database – the database object with tags the module will work on
Returns: Compare object that can compare different tags on the database

get_similarity(tag1, tag_column1, tag2, tag_column2, embeding_method='tf-idf', dist_method='cos')¶

Function that compares two tags of the database

Parameters

tag1 – tag to slice the database with
tag_column1 – column of the database the tag is from
tag2 – second tag to slice the database with
tag_column2 – second column of the database the tag is from
embeding_method – method that will be used to embed the texts of each tag, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]
dist_method – method that will be used to measure the distance between tags, default = “cos”

Returns

list of most likely tags for each text
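
To make the default combination of “tf-idf” and “cos” concrete, here is a minimal standard-library sketch of comparing two tags by TF-IDF weighting and cosine similarity. It is an illustration of the general technique, not the package's implementation: the helper names (tfidf_vectors, cosine_similarity), the smoothed IDF formula, and the sample data are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            word: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical data: each tag is represented by its texts joined into one document.
texts_by_tag = {
    "sports": "the team won the match and the fans cheered",
    "football": "the team lost the match but the fans stayed",
    "cooking": "simmer the sauce and season the soup",
}
tags = list(texts_by_tag)
vectors = dict(zip(tags, tfidf_vectors(list(texts_by_tag.values()))))

sim_close = cosine_similarity(vectors["sports"], vectors["football"])
sim_far = cosine_similarity(vectors["sports"], vectors["cooking"])
print(f"sports vs football: {sim_close:.3f}")
print(f"sports vs cooking:  {sim_far:.3f}")
```

Tags that share more distinctive vocabulary score closer to 1; tags that only share common words score lower.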

text_tagger.dataset_manager¶

class text_tagger.dataset_manager.DataBase(path, text_column, tags_columns, low_memory=False)¶

Bases: object

create_index(per_tag=True)¶: Function to create an index that maps each word to its number of appearances in the corpus documents and in each tag
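
As a rough picture of what such an index can look like, the sketch below builds word counts for a whole corpus and for each tag with collections.Counter. The row layout and the names corpus_index and tag_index are hypothetical; the structure the package actually stores may differ.

```python
from collections import Counter, defaultdict

# Hypothetical (text, tag) rows standing in for the dataset.
rows = [
    ("the cat sat", "animals"),
    ("the dog ran", "animals"),
    ("the stock fell", "finance"),
]

corpus_index = Counter()           # word -> count over the whole corpus
tag_index = defaultdict(Counter)   # tag -> (word -> count within that tag)

for text, tag in rows:
    words = text.lower().split()
    corpus_index.update(words)
    tag_index[tag].update(words)

print(corpus_index["the"])
print(tag_index["animals"]["the"])
print(tag_index["finance"]["stock"])
```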

export(target='text')¶

Function to export the dataset data to a .csv or .txt file

Parameters: target – if “csv”, stores the dataset as a CSV file with all modifications; if “text”, stores the text column in a .txt file; default = “text”

generate_embedings(method='tf-idf', tag=None, tag_column=None, return_model=False)¶

Function that generates an embedding for the corpus texts and saves it in the object

Parameters

method – method that will be used to perform the embedding, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]
tag – specific tag for the embedding, if a tag-specific embedding is needed, default = None
tag_column – specific tag_column for the embedding, if a tag-specific embedding is needed, default = None
return_model – whether the function should return only the vectors of the texts or also the model used to generate those vectors, default = False

Returns

embedding vectors for each text in the dataset or tag, plus the model used to generate the embeddings if return_model = True

generate_tags(method='tf-idf', n_tags=10, vectors=None)¶

Function that uses a specific method and clustering to generate new tags for the dataset; those tags will be added as “AutoTag” in the dataset

Parameters

method – method that will be used to perform the clustering, default = “tf-idf” can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
n_tags – number of tags to generate
vectors – user-supplied embedding vectors for each text, for users who wish to generate tags according to their own embedding system

load()¶: Function to load a previously pickled database object

most_important_word(tag, tag_column, n_words=5, method='PMI')¶

Function that finds the most important words in a tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
n_words – number of words to return, default = 5
method – method that will be used to get the words, default = “PMI”; can be one of [“P”, “PMI”, “NPMI”]

Returns

list of most important words in dataframe
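
The “P”, “PMI” and “NPMI” options plausibly refer to probability, pointwise mutual information, and normalized PMI. The sketch below shows one common document-level formulation of these scores; it is an assumption about what the names mean, not the package's code, and the function word_scores and the sample rows are hypothetical.

```python
import math

# Hypothetical (text, tag) rows.
rows = [
    ("goal scored in the match", "sports"),
    ("the match ended in a draw", "sports"),
    ("the market closed higher", "finance"),
    ("shares fell in the market", "finance"),
]

def word_scores(rows, tag):
    """Score each word of `tag` by P(word | tag), PMI and NPMI over documents."""
    n_docs = len(rows)
    docs = [(set(text.lower().split()), t) for text, t in rows]
    n_tag = sum(1 for _, t in docs if t == tag)
    p_tag = n_tag / n_docs
    vocab = set().union(*(words for words, t in docs if t == tag))
    scores = {}
    for word in vocab:
        n_word = sum(1 for words, _ in docs if word in words)
        n_joint = sum(1 for words, t in docs if t == tag and word in words)
        p_word, p_joint = n_word / n_docs, n_joint / n_docs
        # PMI compares the joint probability with what independence would predict.
        pmi = math.log(p_joint / (p_word * p_tag))
        # NPMI rescales PMI to [-1, 1]; it is 1 when word and tag always co-occur.
        npmi = 1.0 if p_joint == 1.0 else pmi / -math.log(p_joint)
        scores[word] = {"P": n_joint / n_tag, "PMI": pmi, "NPMI": npmi}
    return scores

scores = word_scores(rows, "sports")
# Rank by NPMI, breaking ties alphabetically, and keep the top three words.
top = sorted(scores, key=lambda w: (-scores[w]["NPMI"], w))[:3]
print(top)
```

A word that appears in every document, such as “the” here, gets a PMI of zero because it carries no information about the tag.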

open()¶: Function to open the dataset according to self.path

save()¶: Function to save the database object as pickle

text_tagger.extract¶

class text_tagger.extract.Extract(database)¶

Bases: object

Class that wraps a dataframe and allows for the extraction of features of each tag, such as main words, texts, wordcloud, and lda interpretations

Parameters: database – database object to explore.
Returns: Extract object with many different functions that interpret the database

get_lda(tag, tag_column)¶

Function that generates an lda visualization for the tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from

Returns

A local server that hosts the lda visualization

get_similarity(tag, tag_column, word1, word2=None, use_pretrained=True)¶

Makes a calculation with word vectors to find similar words

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
word1 – first word to use in the comparison
word2 – word to get the similarity with the first; if None, the most similar word to the first will be returned, default = None
use_pretrained – whether the previously used embedding for this tag should be reused for this calculation, default = True

Example

word1 = “man”, word2 = “woman” –> 0.8 (high)

word1 = “man”, word2 = None –> “woman”

Returns: words with highest similarity

get_size(tag, tag_column)¶

Function to check how many texts there are in a specific tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from

Returns

number of texts in dataframe

get_text(tag, tag_column, n_texts=5, method='tf-idf')¶

Function that finds the most important texts in a tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
n_texts – number of texts to return, default = 5
method – method that will be used to get the texts, default = “tf-idf” can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]

Returns

list of most important texts in dataframe

get_wordcloud(tag, tag_column, raw=False, save=False)¶

Function that generates a wordcloud for the tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
raw – if the data used should be the raw data or the preprocessed data, default False
save – whether to save the generated figure in the local folder, default False

Returns

plots the word cloud of the tag

get_words(tag, tag_column, n_words=5, method='NPMI')¶

Function that finds the most important words in a tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
n_words – number of words to return, default = 5
method – method that will be used to get the words, default = “NPMI”; can be one of [“P”, “PMI”, “NPMI”, “word2vec”]

Returns

list of most important words in dataframe

make_analogy(tag, tag_column, relation, target, use_pretrained=True)¶

Makes an analogy with data from inside the tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
relation – two words with the desired relation
target – word that the relation will be applied to
use_pretrained – whether the previously used embedding for this tag should be reused for this calculation, default = True

Example

relation=[‘man’, ‘king’], target=[‘woman’] –> queen: 0.8965

Returns: results of the analogy

make_word_difference(tag, tag_column, positive, negative, use_pretrained=True)¶

Makes a calculation with word vectors to find similar words

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from
positive – list of words that will be added
negative – list of words that will be subtracted
use_pretrained – whether the previously used embedding for this tag should be reused for this calculation, default = True

Example

positive=[‘woman’, ‘king’], negative=[‘man’] –> queen: 0.8965

Returns: results of the analogy
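
The arithmetic behind this kind of word-vector calculation can be sketched with a toy vocabulary: add the vectors of the positive words, subtract the negative ones, and return the remaining word closest to the result by cosine similarity. The 2-d vectors, the word_difference helper and its topn argument below are hypothetical, chosen only so the example is easy to follow; a trained embedding would have many more dimensions.

```python
import math

# Hypothetical 2-d word vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.0),
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def word_difference(positive, negative, topn=1):
    """Add the `positive` vectors, subtract the `negative` ones, rank the rest by cosine."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Input words are excluded so the answer is a new word.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return [(w, cosine(query, vectors[w])) for w in ranked[:topn]]

result = word_difference(positive=["woman", "king"], negative=["man"])
print(result)
```

With these toy vectors, “woman” + “king” - “man” lands on the vector for “queen”, which mirrors the example given for this method.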

text_tagger.generate¶

class text_tagger.generate.Generate(database, max_sequence_len=20)¶

Bases: object

Class that wraps a database to generate new texts according to a specific tag

Parameters

database – the database object with tags the model will be trained from
max_sequence_len – the maximum number of words that the texts used to train the model should have, default = 20

Returns

Generate object that can train a model on a tag and generate new text from it

generate(seed_text, next_words=20, T=0.9)¶

Generate a new text with the trained model

Parameters

seed_text – text the model will try to continue from based on what it learned
next_words – how many words to generate forward, default = 20
T – temperature: how strongly the generator favors the higher-probability words. Closer to 1, the model will be more realistic and repetitive; closer to 0, more creative and nonsensical; default = 0.9

Returns

newly generated text

train(tag, tag_column)¶

Function that trains the Generate object on the texts of a specific tag

Parameters

tag – tag to slice the database with
tag_column – column of the database the tag is from

text_tagger.identify¶

class text_tagger.identify.Identify(database)¶

Bases: object

Class that wraps a database to identify to which tag new texts belong

Parameters: database – the database object with tags for the object to refer to
Returns: Identify object that can identify a text according to embeddings

identify(texts, method='tf-idf', n_searches=3)¶

Function that finds the most likely tags for a text

Parameters

texts – string or list of strings with the text that should be identified
method – method that will be used to get the tag, default = “tf-idf” can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
n_searches – how many tags to return for each text, default = 3

Returns

list of most likely tags for each text
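
One simple way to picture this ranking: build a word-count profile for every tag, then order the tags by cosine similarity between each new text and those profiles. The sketch below uses plain counts instead of any of the listed embedding methods, and rank_tags, profiles and the sample data are hypothetical names and values, so treat it as an illustration of the idea only.

```python
import math
from collections import Counter

# Hypothetical tagged texts.
tagged_texts = [
    ("the striker scored a late goal", "sports"),
    ("the keeper saved the penalty", "sports"),
    ("the bank raised interest rates", "finance"),
    ("investors sold shares after the report", "finance"),
    ("bake the bread until golden", "cooking"),
]

def cosine(u, v):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# One word-count profile per tag.
profiles = {}
for text, tag in tagged_texts:
    profiles.setdefault(tag, Counter()).update(text.lower().split())

def rank_tags(texts, n_searches=3):
    """Return, for each text, the n_searches tags with the most similar profile."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single string as well as a list
    results = []
    for text in texts:
        vector = Counter(text.lower().split())
        ranked = sorted(profiles, key=lambda tag: cosine(vector, profiles[tag]), reverse=True)
        results.append(ranked[:n_searches])
    return results

print(rank_tags("the bank sold its shares", n_searches=2))
```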

text_tagger.preprocess¶

class text_tagger.preprocess.Preprocess(tags_types, filter_flags={'digits': True, 'links': True, 'punct': True, 'refs': True, 'simbols': True, 'stopwords': True, 'text_only': False, 'tokenize': True}, languages=['english'], other_stopwords=[])¶

Bases: object

Class that preprocesses a database to filter text and convert tags to absolute ones

Parameters

tags_types – dictionary mapping each tag name to a list of configurations: [method, number of clusters, original tags]; example: {“Lat_Long”:(“numeric-simple”, 200, [“Longitude”, “Latitude”])}
filter_flags – dictionary with different keys for the text preprocessing
languages – list of languages to be used for the stopwords
other_stopwords – list of manual stopwords to be used

Returns

object capable of filtering a dataframe

filter_text(text)¶

Function to filter a single text according to the Preprocess object's filter_flags

Parameters: text – text to filter
Returns: filtered text
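
The sketch below suggests how a flag-driven text filter of this kind can be written with the standard library. It covers only a subset of the filter_flags keys; the lowercasing step, the regular expressions, the tiny stopword list, and the choice to return tokens when “tokenize” is set are all assumptions and may not match the package's behavior.

```python
import re
import string

# Hypothetical subset of the filter_flags keys shown in the class signature.
filter_flags = {"links": True, "digits": True, "punct": True, "stopwords": True, "tokenize": True}
stopwords = {"the", "a", "is", "at", "of"}  # tiny illustrative list

def filter_text(text, flags=filter_flags):
    """Apply each enabled filter in turn and return tokens or a joined string."""
    text = text.lower()
    if flags.get("links"):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if flags.get("digits"):
        text = re.sub(r"\d+", " ", text)
    if flags.get("punct"):
        # Replace every ASCII punctuation character with a space.
        text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = text.split()
    if flags.get("stopwords"):
        tokens = [t for t in tokens if t not in stopwords]
    return tokens if flags.get("tokenize") else " ".join(tokens)

print(filter_text("The price is 42 dollars, see https://example.com/offer!"))
```

Removing links before punctuation matters here: stripping punctuation first would break a URL into fragments that the link pattern no longer matches.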

numeric_process(data, method, n)¶

Processes n-d numerical tags into an absolute numerical tag

Parameters

data – pd.Series that must be processed
method – method that will be used in the conversion: simple or cluster
n – number of divisions or clusters

Returns

new tag pd.Series
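
If the simple method is read as equal-width binning (an assumption, since the conversion is not described in detail), turning a numeric tag into n divisions can be sketched as follows for the one-dimensional case. The function simple_bins and the sample latitudes are hypothetical, and plain lists are used in place of a pd.Series to keep the example dependency-free.

```python
def simple_bins(values, n):
    """Map each number to one of n equal-width bins, labeled 0 .. n-1."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so everything falls in a single bin.
        return [0 for _ in values]
    width = (high - low) / n
    # The maximum value would land in bin n, so clamp it into the last bin.
    return [min(int((v - low) / width), n - 1) for v in values]

latitudes = [-23.5, -22.9, -15.8, -3.1, 1.4]
print(simple_bins(latitudes, n=4))
```

A clustering-based conversion would instead group values by proximity, which suits multi-dimensional tags such as latitude and longitude pairs better than fixed-width divisions.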

preprocess(database)¶

Function that receives a database object and preprocesses the data

Parameters: database – database to be preprocessed

preprocess_tags()¶: Function that preprocesses all tag columns according to the object's configuration

preprocess_text(text_series)¶

Function to filter a pd.Series of texts according to the Preprocess object's filter_flags

Parameters: text_series – pd.Series of texts to filter
Returns: pd.Series of filtered texts
