text_tagger package¶
Submodules¶
text_tagger.compare¶
-
class
text_tagger.compare.
Compare
(database)¶ Bases:
object
Class that wraps a database to compare the texts of two tags
- Parameters
database – the database object with tags the module will work on
- Returns
Compare object that can compare different tags in the database
-
get_similarity
(tag1, tag_column1, tag2, tag_column2, embeding_method='tf-idf', dist_method='cos')¶ Function that compares two tags of the database
- Parameters
tag1 – tag to slice the database with
tag_column1 – column of the database the tag is from
tag2 – second tag to slice the database with
tag_column2 – second column of the database the tag is from
embeding_method – method that will be used to embed the texts of each tag, default = “tf-idf”, can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
dist_method – method that will be used to measure the distance between tags, default = “cos”
- Returns
list of most likely tags for each text
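The “tf-idf” embedding and “cos” distance options named above can be illustrated with a minimal standard-library sketch. This is not the package's implementation: the helper names tf_idf_vectors and cosine_similarity, the smoothed idf formula, and the sample texts are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {word: weight} dict per document (tf times a smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            w: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical data: the joined texts of three tags.
sports, football, cooking = tf_idf_vectors([
    "the team won the match",
    "the team lost the game",
    "bake the cake in the oven",
])
sim_close = cosine_similarity(sports, football)  # tags sharing vocabulary
sim_far = cosine_similarity(sports, cooking)     # tags sharing only "the"
```

In this toy data the two sport-related tags score higher than the sport and cooking pair, which is the kind of ordering a tag comparison is meant to expose.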
text_tagger.dataset_manager¶
-
class
text_tagger.dataset_manager.
DataBase
(path, text_column, tags_columns, low_memory=False)¶ Bases:
object
-
create_index
(per_tag=True)¶ Function to create an index that maps each word to its number of appearances in the corpus documents and in each tag
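A word index of this kind can be sketched with collections.Counter. The function name create_word_index, its arguments, and the whitespace tokenization are assumptions for illustration; the source does not show how the index is stored.

```python
from collections import Counter, defaultdict

def create_word_index(texts, tags=None):
    """Count word appearances over the whole corpus and, optionally, per tag."""
    corpus_index = Counter()
    per_tag_index = defaultdict(Counter)
    for i, text in enumerate(texts):
        words = text.lower().split()  # assumption: whitespace tokenization
        corpus_index.update(words)
        if tags is not None:
            per_tag_index[tags[i]].update(words)
    return corpus_index, dict(per_tag_index)

# Hypothetical corpus with one tag per text.
texts = ["good food", "bad food", "good service"]
tags = ["positive", "negative", "positive"]
corpus_index, per_tag_index = create_word_index(texts, tags)
```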
-
export
(target='text')¶ Function to export the dataset data into a csv or .txt file
- Parameters
target – if “csv”, the dataset is stored as a csv with all modifications; if “text”, the text column is stored in a .txt file; default = “text”
-
generate_embedings
(method='tf-idf', tag=None, tag_column=None, return_model=False)¶ Function that generates an embedding for the corpus texts and saves it in the object
- Parameters
method – method that will be used to perform the embedding, default = “tf-idf”, can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
tag – specific tag for the embedding, if a tag-specific embedding is needed, default = None
tag_column – specific tag_column for the embedding, if a tag-specific embedding is needed, default = None
return_model – whether the function should return only the vectors of the texts or also the model used to generate those vectors, default = False
- Returns
embedding vectors for each text in the dataset or tag, plus the model used to generate the embeddings if return_model = True
Function that uses a specific method and clustering to generate new tags for the dataset; those tags will be added as “AutoTag” in the dataset
- Parameters
method – embedding method that the clustering will be performed on, default = “tf-idf”, can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
n_tags – number of tags to generate
vectors – user-supplied embedding vectors for each text, for users who wish to generate tags according to their own embedding system
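The source does not say which clustering algorithm produces the “AutoTag” values. As one possible reading, the sketch here clusters user-supplied vectors with plain k-means (Lloyd's algorithm); the function name k_means, the deterministic initialization, and the toy vectors are assumptions.

```python
def k_means(vectors, n_tags, n_iter=20):
    """Assign each vector to one of n_tags clusters (plain Lloyd's algorithm).

    Assumes len(vectors) >= n_tags.
    """
    # Deterministic start: spread the initial centroids over the input order.
    step = len(vectors) // n_tags
    centroids = [list(vectors[i * step]) for i in range(n_tags)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(n_tags),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(n_tags):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical 2-d embeddings forming two obvious groups.
vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
auto_tags = k_means(vectors, n_tags=2)
```

Each returned label plays the role of a generated tag: texts whose vectors lie close together receive the same label.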
-
load
()¶ Function to load a previously pickled database object
-
most_important_word
(tag, tag_column, n_words=5, method='PMI')¶ Function that finds the most important words in a tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
n_words – number of words to return for the tag, default = 5
method – method that will be used to get the words, default = “PMI”, can be [“P”, “PMI”, “NPMI”]
- Returns
list of most important words in dataframe
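The “P”, “PMI” and “NPMI” options are not defined in the source. Assuming they stand for the probability of a word within the tag, pointwise mutual information, and its normalized variant, a token-level sketch could look like this. The function names, the tokenization, and the sample texts are hypothetical.

```python
import math
from collections import Counter

def word_scores(tag_texts, all_texts, method="PMI"):
    """Score each word of a tag with P, PMI or NPMI (token-level counts).

    Assumes tag_texts is a subset of all_texts, so every tag word
    also appears in the corpus counts.
    """
    tag_counts = Counter(w for t in tag_texts for w in t.lower().split())
    all_counts = Counter(w for t in all_texts for w in t.lower().split())
    n_tag = sum(tag_counts.values())
    n_all = sum(all_counts.values())
    scores = {}
    for word, count in tag_counts.items():
        p_word_given_tag = count / n_tag      # P(word | tag)
        p_word = all_counts[word] / n_all     # P(word)
        p_joint = count / n_all               # P(word, tag)
        pmi = math.log(p_word_given_tag / p_word)
        if method == "P":
            scores[word] = p_word_given_tag
        elif method == "PMI":
            scores[word] = pmi
        elif method == "NPMI":
            # Dividing by -log P(word, tag) bounds the score in [-1, 1].
            scores[word] = pmi / -math.log(p_joint) if p_joint < 1 else 1.0
        else:
            raise ValueError(f"unknown method: {method}")
    return scores

def top_words(scores, n_words=5):
    """Return the n_words highest-scoring words."""
    return sorted(scores, key=scores.get, reverse=True)[:n_words]

# Hypothetical corpus: the first two texts belong to the tag of interest.
all_texts = ["the pizza was great", "the pasta was great", "the bus was late"]
tag_texts = all_texts[:2]
pmi_scores = word_scores(tag_texts, all_texts, method="PMI")
npmi_scores = word_scores(tag_texts, all_texts, method="NPMI")
best = top_words(pmi_scores, n_words=3)
```

Words that are spread evenly over the corpus, such as “the”, get a PMI of zero, while words concentrated in the tag rise to the top.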
-
open
()¶ Function to open the dataset according to self.path
-
save
()¶ Function to save the database object as pickle
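The save and load pair described above follows the usual pickle round trip. The sketch uses a plain dict as a stand-in for the database object; the helper names, the file name, and the dict layout are assumptions, and the real methods take no path argument.

```python
import os
import pickle
import tempfile

def save_database(database, pickle_path):
    """Serialize the object to disk with pickle."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(database, handle)

def load_database(pickle_path):
    """Restore a previously pickled object."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)

# Hypothetical stand-in for a database object.
database = {
    "path": "reviews.csv",
    "text_column": "text",
    "rows": [{"text": "good food", "tag": "positive"}],
}
pickle_path = os.path.join(tempfile.mkdtemp(), "database.pkl")
save_database(database, pickle_path)
restored = load_database(pickle_path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.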
-
text_tagger.extract¶
-
class
text_tagger.extract.
Extract
(database)¶ Bases:
object
Class that wraps a dataframe and allows for the extraction of features of each tag, such as main words, texts, wordcloud, and lda interpretations
- Parameters
database – database object to explore.
- Returns
Extract object with many different functions that interpret the database
-
get_lda
(tag, tag_column)¶ Function that generates an lda visualization for the tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
- Returns
A local server that hosts the lda visualization
-
get_similarity
(tag, tag_column, word1, word2=None, use_pretrained=True)¶ Makes a calculation with word vectors to find similar words
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
word1 – first word to use in the comparison
word2 – word to get the similarity with the first; if None, the most similar word to the first will be returned, default = None
use_pretrained – whether the previously used embedding for this tag should be reused for this calculation, default = True
Example
word1 = “man”, word2 = “woman” –> 0.8 (high)
word1 = “man”, word2=None –> “Woman”
- Returns
words with highest similarity
-
get_size
(tag, tag_column)¶ Function to check how many texts there are in a specific tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
- Returns
number of texts in dataframe
-
get_text
(tag, tag_column, n_texts=5, method='tf-idf')¶ Function that finds the most important texts in a tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
n_texts – number of texts to return for the tag, default = 5
method – method that will be used to get the texts, default = “tf-idf”, can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
- Returns
list of most important texts in dataframe
-
get_wordcloud
(tag, tag_column, raw=False, save=False)¶ Function that generates a wordcloud for the tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
raw – whether the data used should be the raw data or the preprocessed data, default False
save – whether or not to save the generated figure in the local folder, default False
- Returns
plots the word cloud of the tag
-
get_words
(tag, tag_column, n_words=5, method='NPMI')¶ Function that finds the most important words in a tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
n_words – number of words to return for the tag, default = 5
method – method that will be used to get the words, default = “NPMI”, can be [“P”, “PMI”, “NPMI”, “word2vec”]
- Returns
list of most important words in dataframe
-
make_analogy
(tag, tag_column, relation, target, use_pretrained=True)¶ Makes an analogy with data from inside the tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
relation – two words with the desired relation
target – word that the relation will be applied to
use_pretrained – whether the previously used embedding for this tag should be reused for this calculation, default = True
Example
relation=[‘man’, ‘king’], target=[‘woman’] –> queen: 0.8965
- Returns
results of the analogy
-
make_word_difference
(tag, tag_column, positive, negative, use_pretrained=True)¶ Makes a calculation with word vectors to narrow down similar words
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
positive – list of words that will be added
negative – list of words that will be subtracted
use_pretrained – whether the previously used embedding for this tag should be reused for this calculation, default = True
Example
positive=[‘woman’, ‘king’], negative=[‘man’] –> queen: 0.8965
- Returns
results of the analogy
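The positive and negative word lists describe vector arithmetic on word embeddings: add the positive vectors, subtract the negative ones, and rank the remaining words by similarity to the result. The sketch uses hand-made 2-d vectors rather than a trained model; the function names and the toy embedding are assumptions, and real word2vec tooling typically also normalizes the vectors first.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def word_difference(vectors, positive, negative):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The input words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(((w, cosine(query, vectors[w])) for w in candidates),
                  key=lambda pair: pair[1], reverse=True)

# Hypothetical embedding: first axis ~ royalty, second axis ~ gender.
vectors = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
results = word_difference(vectors, positive=["woman", "king"], negative=["man"])
best_word, best_score = results[0]
```

An analogy such as relation=[‘man’, ‘king’], target=[‘woman’] reduces to the same computation with positive=[‘woman’, ‘king’] and negative=[‘man’].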
text_tagger.generate¶
-
class
text_tagger.generate.
Generate
(database, max_sequence_len=20)¶ Bases:
object
Class that wraps a database to generate new texts according to a specific tag
- Parameters
database – the database object with tags the model will be trained from
max_sequence_len – the maximum number of words that the texts used to train the model should have, default = 20
- Returns
Generate object that can train a model on a tag and generate new text from it
-
generate
(seed_text, next_words=20, T=0.9)¶ Generate a new text with the trained model
- Parameters
seed_text – text the model will try to continue from based on what it learned
next_words – how many words to generate forward, default = 20
T – temperature: how much the generator will favor the higher-probability words. Closer to 1, the output is more realistic and repetitive; closer to 0, it is more creative and nonsensical. Default = 0.9
- Returns
newly generated text
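The source does not show how T is applied to the model's output. One reading consistent with the description (T near 1 keeps the predicted distribution, T near 0 flattens it toward uniform) is to raise each probability to the power T and renormalize. The sketch follows that assumption; the function names and the sample distribution are hypothetical.

```python
import random

def apply_temperature(probabilities, T):
    """Raise each probability to the power T and renormalize.

    T = 1 leaves the distribution unchanged; T = 0 makes it uniform.
    """
    weights = [p ** T for p in probabilities]
    total = sum(weights)
    return [w / total for w in weights]

def sample_next_word(words, probabilities, T=0.9, rng=random):
    """Draw one word from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probabilities, T)
    return rng.choices(words, weights=adjusted, k=1)[0]

# Hypothetical next-word distribution predicted by a trained model.
words = ["the", "cat", "sat"]
probabilities = [0.7, 0.2, 0.1]
unchanged = apply_temperature(probabilities, T=1.0)
flattened = apply_temperature(probabilities, T=0.1)
uniform = apply_temperature(probabilities, T=0.0)
next_word = sample_next_word(words, probabilities, rng=random.Random(0))
```

Under this convention a low T gives unlikely words almost the same chance as likely ones, which matches the “creative and nonsensical” end of the description.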
-
train
(tag, tag_column)¶ Function that trains the generate object on the texts of a specific tag
- Parameters
tag – tag to slice the database with
tag_column – column of the database the tag is from
text_tagger.identify¶
-
class
text_tagger.identify.
Identify
(database)¶ Bases:
object
Class that wraps a database to identify which tag new texts belong to
- Parameters
database – the database object with tags for the object to refer to
- Returns
Identify object that can identify a text according to embeddings
-
identify
(texts, method='tf-idf', n_searches=3)¶ Function that finds the most likely tags for a text
- Parameters
texts – string or list of strings with the text that should be identified
method – method that will be used to get the tag, default = “tf-idf”, can be [“tf-idf”, “cbow”, “doc2vec”, “lda”]
n_searches – how many tags to return for each text, default = 3
- Returns
list of most likely tags for each text
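Returning the n_searches most likely tags amounts to ranking the tags by how similar their texts are to the new text. The sketch uses raw word counts and cosine similarity in place of the package's embedding methods; the function names, the tag_texts mapping, and the sample data are assumptions.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts of a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def identify(texts, tag_texts, n_searches=3):
    """Return, for each text, the n_searches tags with the most similar vocabulary."""
    if isinstance(texts, str):  # accept a single string or a list of strings
        texts = [texts]
    tag_vectors = {tag: bag_of_words(" ".join(docs)) for tag, docs in tag_texts.items()}
    results = []
    for text in texts:
        vector = bag_of_words(text)
        ranked = sorted(tag_vectors,
                        key=lambda tag: cosine(vector, tag_vectors[tag]),
                        reverse=True)
        results.append(ranked[:n_searches])
    return results

# Hypothetical tagged corpus.
tag_texts = {
    "sports": ["the team won the match", "a great goal"],
    "cooking": ["bake the cake", "boil the pasta"],
    "travel": ["the flight was late"],
}
likely_tags = identify("the team scored a goal", tag_texts, n_searches=2)
```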
text_tagger.preprocess¶
-
class
text_tagger.preprocess.
Preprocess
(tags_types, filter_flags={'digits': True, 'links': True, 'punct': True, 'refs': True, 'simbols': True, 'stopwords': True, 'text_only': False, 'tokenize': True}, languages=['english'], other_stopwords=[])¶ Bases:
object
Class that preprocesses a database to filter text and convert tags to absolute ones
- Parameters
tags_types – dictionary mapping each tag_name to a list of configurations: [method, number of clusters, original tags]; example: {“Lat_Long”:(“numeric-simple”, 200, [“Longitude”, “Latitude”])}
filter_flags – dictionary with the different keys that control the text preprocessing
languages – list of languages to be used for the stopwords
other_stopwords – list of manual stopwords to be used
- Returns
object capable of filtering a dataframe
-
filter_text
(text)¶ Function to filter a single text according to the preprocess object's filter_flags
- Parameters
text – text to filter
- Returns
filtered text
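Flag-driven filtering of this kind can be sketched with the standard library. Only the “links”, “digits”, “punct”, “stopwords” and “tokenize” keys from the signature are covered; the regular expressions, the lowercasing step, and the meaning given to each flag are assumptions, since the source does not define them.

```python
import re
import string

def filter_text(text, flags, stopwords=frozenset()):
    """Filter one text according to a dict of boolean flags."""
    text = text.lower()  # assumption: matching is case-insensitive
    if flags.get("links"):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if flags.get("digits"):
        text = re.sub(r"\d+", " ", text)
    if flags.get("punct"):
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if flags.get("stopwords"):
        tokens = [t for t in tokens if t not in stopwords]
    # Return a token list when "tokenize" is set, otherwise a joined string.
    return tokens if flags.get("tokenize") else " ".join(tokens)

flags = {"links": True, "digits": True, "punct": True,
         "stopwords": True, "tokenize": True}
sample = "Visit https://example.com for the 3 best tips!"
tokens = filter_text(sample, flags, stopwords={"for", "the"})
joined = filter_text(sample, dict(flags, tokenize=False), stopwords={"for", "the"})
```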
-
numeric_process
(data, method, n)¶ Processes n-d numerical tags into an absolute numerical tag
- Parameters
data – pd.Series that must be processed
method – method that will be used in the conversion, either simple or cluster
n – number of divisions or clusters
- Returns
new tag pd.Series
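The source does not define the “simple” method. Assuming it means splitting the value range into n equal-width divisions, a one-dimensional sketch on a plain list (rather than a pd.Series) could look like this; the function name and the sample values are hypothetical.

```python
def simple_bins(values, n):
    """Map each number to the index (0..n-1) of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / n
    if width == 0:
        # All values are identical, so they share a single bin.
        return [0 for _ in values]
    # The maximum would land in bin n, so clamp it into the last bin.
    return [min(int((v - low) / width), n - 1) for v in values]

# Hypothetical numeric tag, e.g. latitudes.
latitudes = [-10.0, -5.0, 0.0, 5.0, 10.0]
bins = simple_bins(latitudes, n=4)
```

Each bin index can then serve as a categorical tag, which is how a continuous column becomes usable for slicing the database.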
-
preprocess
(database)¶ Function that receives a database object and preprocesses its data
- Parameters
database – database to be preprocessed
Function that preprocesses all tag columns according to the object's configuration
-
preprocess_text
(text_series)¶ Function to filter a pd.Series of texts according to the preprocess object's filter_flags
- Parameters
text_series – pd.Series of texts to filter
- Returns
pd.Series of filtered texts