text_tagger package

Submodules

text_tagger.compare

class text_tagger.compare.Compare(database)

Bases: object

Class that wraps a database to compare the texts of two tags

Parameters

database – the database object with tags the module will work on

Returns

Compare object that can compare different tags in the database

get_similarity(tag1, tag_column1, tag2, tag_column2, embeding_method='tf-idf', dist_method='cos')

Function that compares two tags of the database

Parameters
  • tag1 – tag to slice the database with

  • tag_column1 – column of the database the tag is from

  • tag2 – second tag to slice the database with

  • tag_column2 – second column of the database the tag is from

  • embeding_method – method that will be used to embed the tags, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]

  • dist_method – method that will be used to measure the distance between tags, default = “cos”

Returns

list of most likely tags for each text
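
The default combination of “tf-idf” embedding and “cos” distance can be illustrated with a small standalone sketch. It does not use the package itself: the helper names, the toy token lists, and the smoothed idf variant are assumptions made for illustration, and the package's actual weighting may differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed smoothed idf: log((1 + N) / (1 + df)) + 1.
            w: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy data: all texts of each tag joined into one token list.
sports = "the team won the match and the fans cheered".split()
football = "the team lost the match but the fans stayed".split()
cooking = "simmer the sauce and season the soup".split()

vec_sports, vec_football, vec_cooking = tfidf_vectors([sports, football, cooking])
sim_close = cosine(vec_sports, vec_football)  # tags sharing many words
sim_far = cosine(vec_sports, vec_cooking)     # tags sharing few words
```

In this toy setup the two sport-related tags score higher than the sport and cooking pair, which is the kind of ordering a tag comparison is meant to expose.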

text_tagger.dataset_manager

class text_tagger.dataset_manager.DataBase(path, text_column, tags_columns, low_memory=False)

Bases: object

create_index(per_tag=True)

Function to create an index that maps each word to its number of appearances in the corpus documents and in each tag
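
A word index of this kind can be sketched with the standard library. The rows and variable names below are hypothetical and only show the shape of a corpus-level and a per-tag count; they are not the package's internal structures.

```python
from collections import Counter, defaultdict

# Hypothetical (text, tag) pairs standing in for the dataset rows.
rows = [
    ("good food good price", "restaurant"),
    ("bad food", "restaurant"),
    ("good movie", "cinema"),
]

corpus_index = Counter()          # word -> count over the whole corpus
tag_index = defaultdict(Counter)  # tag -> (word -> count inside that tag)

for text, tag in rows:
    words = text.split()
    corpus_index.update(words)
    tag_index[tag].update(words)
```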

export(target='text')

Function to export the dataset to a csv or .txt file

Parameters

target – if “csv”, stores the dataset as csv with all modifications; if “text”, stores the text column in a .txt file; default = “text”

generate_embedings(method='tf-idf', tag=None, tag_column=None, return_model=False)

Function that generates an embedding for the corpus texts and saves it in the object

Parameters
  • method – method that will be used to perform the embedding, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]

  • tag – specific tag for the embedding, if a tag-specific embedding is needed, default = None

  • tag_column – specific tag_column for the embedding, if a tag-specific embedding is needed, default = None

  • return_model – whether the function should return only the vectors of the texts or also the model used to generate them, default = False

Returns

embedding vectors for each text in the dataset or tag; also the model used to generate the embeddings if return_model = True

generate_tags(method='tf-idf', n_tags=10, vectors=None)

Function that uses a specific method and clustering to generate new tags for the dataset; those tags will be added as “AutoTag” in the dataset

Parameters
  • method – method that will be used to perform the clustering, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]

  • n_tags – number of tags to generate, default = 10

  • vectors – user-supplied embedding vectors for each text, for users who wish to generate tags according to their own embedding system, default = None
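
The idea of turning embedding vectors into tags by clustering can be sketched as follows. The documentation does not say which clustering algorithm is used, so plain k-means is an assumption here, as are the function name and the toy 2-d vectors.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; returns one cluster id per point."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Hypothetical 2-d embedding vectors, one per text.
vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
auto_tags = kmeans(vectors, k=2)  # one cluster id per text, usable as a tag
```

Each cluster id plays the role of an automatically generated tag; with real embeddings, k would correspond to n_tags.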

load()

Function to load a previously pickled database object

most_important_word(tag, tag_column, n_words=5, method='PMI')

Function that finds the most important words in a tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • n_words – number of words to return, default = 5

  • method – method that will be used to get the words, default = “PMI”; can be one of [“P”, “PMI”, “NPMI”]

Returns

list of most important words in dataframe
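
For readers unfamiliar with the “PMI” and “NPMI” options, the sketch below scores the words of one tag with the textbook definitions, PMI = log(p(word, tag) / (p(word) p(tag))) and NPMI = PMI / (-log p(word, tag)). The function name, the token-level probability estimates, and the toy data are assumptions; the package may estimate the probabilities differently.

```python
import math
from collections import Counter

def word_scores(tag_tokens, other_tokens):
    """Score each word of a tag by PMI and NPMI against the whole corpus."""
    tag_counts = Counter(tag_tokens)
    all_counts = tag_counts + Counter(other_tokens)
    total = sum(all_counts.values())
    p_tag = len(tag_tokens) / total
    scores = {}
    for word, count in tag_counts.items():
        p_joint = count / total            # p(word, tag)
        p_word = all_counts[word] / total  # p(word)
        pmi = math.log(p_joint / (p_word * p_tag))
        # NPMI rescales PMI into [-1, 1] by dividing by -log p(word, tag).
        npmi = pmi / -math.log(p_joint)
        scores[word] = (pmi, npmi)
    return scores

# Hypothetical toy tokens for one tag and for the rest of the corpus.
tag_tokens = "goal match goal team".split()
other_tokens = "team recipe oven recipe".split()

scores = word_scores(tag_tokens, other_tokens)
top_words = sorted(scores, key=lambda w: scores[w][1], reverse=True)[:2]
```

Here "team" appears equally inside and outside the tag, so its PMI is zero, while "goal" and "match" occur only inside the tag and rank first.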

open()

Function to open the dataset according to self.path

save()

Function to save the database object as pickle

text_tagger.extract

class text_tagger.extract.Extract(database)

Bases: object

Class that wraps a dataframe and allows for the extraction of features of each tag, such as main words, texts, wordcloud, and lda interpretations

Parameters

database – database object to explore.

Returns

Extract object with many different functions that interpret the database

get_lda(tag, tag_column)

Function that generates an lda visualization for the tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

Returns

A local server that hosts the lda visualization

get_similarity(tag, tag_column, word1, word2=None, use_pretrained=True)

Makes a calculation with word vectors to narrow similar words

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • word1 – first word to use in comparison

  • word2 – word to get the similarity with the first; if None, the most similar word to the first will be returned, default = None

  • use_pretrained – if the previously used embedding for this tag should be reused for this calculation, default = True

Example

word1 = “man”, word2 = “woman” –> 0.8 (high)

word1 = “man”, word2=None –> “Woman”

Returns

words with highest similarity

get_size(tag, tag_column)

Function to check how many texts there are in a specific tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

Returns

number of texts in dataframe

get_text(tag, tag_column, n_texts=5, method='tf-idf')

Function that finds the most important texts in a tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • n_texts – number of texts to return, default = 5

  • method – method that will be used to get the texts, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]

Returns

list of most important texts in dataframe

get_wordcloud(tag, tag_column, raw=False, save=False)

Function that generates a wordcloud for the tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • raw – if the data used should be the raw data or the preprocessed data, default False

  • save – whether or not to save the generated figure in the local folder, default False

Returns

plots the word cloud of the tag

get_words(tag, tag_column, n_words=5, method='NPMI')

Function that finds the most important words in a tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • n_words – number of words to return, default = 5

  • method – method that will be used to get the words, default = “NPMI”; can be one of [“P”, “PMI”, “NPMI”, “word2vec”]

Returns

list of most important words in dataframe

make_analogy(tag, tag_column, relation, target, use_pretrained=True)

Makes an analogy with data from inside the tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • relation – 2 words with the desired relation

  • target – word that the relation will be applied to

  • use_pretrained – if the previously used embedding for this tag should be reused for this calculation, default = True

Example

relation=[‘man’, ‘king’], target=[‘woman’] –> queen: 0.8965

Returns

results of the analogy

make_word_difference(tag, tag_column, positive, negative, use_pretrained=True)

Makes a calculation with word vectors to narrow similar words

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

  • positive – list of words that will be added

  • negative – list of words that will be subtracted

  • use_pretrained – if the previously used embedding for this tag should be reused for this calculation, default = True

Example

positive=[‘woman’, ‘king’], negative=[‘man’] –> queen: 0.8965

Returns

results of the analogy
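
The positive/negative example reads as word-vector arithmetic: add the positive vectors, subtract the negative ones, and rank the remaining words by similarity to the result. The sketch below illustrates that idea on hypothetical 2-d toy vectors; the vectors, the function name, and the choice of cosine similarity for ranking are assumptions, not the package's implementation.

```python
import math

# Hypothetical 2-d word vectors: (royalty, gender). Real embeddings would be
# learned from the texts of the tag and have many more dimensions.
vectors = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def word_difference(positive, negative):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    query = [0.0, 0.0]
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Input words are excluded so the answer is always a new word.
    candidates = [w for w in vectors if w not in positive + negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return [(w, cosine(query, vectors[w])) for w in ranked]

result = word_difference(positive=["woman", "king"], negative=["man"])
```

With these toy vectors the top-ranked word is "queen". Under the same assumption, an analogy with relation=[‘man’, ‘king’] and target=[‘woman’] corresponds to the same arithmetic, king minus man plus woman.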

text_tagger.generate

class text_tagger.generate.Generate(database, max_sequence_len=20)

Bases: object

Class that wraps a database to generate new texts according to a specific tag

Parameters
  • database – the database object with tags the model will be trained from

  • max_sequence_len – the maximum number of words that the texts used to train the model should have, default = 20

Returns

Generate object that can train a model on a tag and generate new text from it

generate(seed_text, next_words=20, T=0.9)

Generate a new text with the trained model

Parameters
  • seed_text – text the model will try to continue from based on what it learned

  • next_words – how many words to generate forward, default = 20

  • T – temperature: how much the generator values the higher probabilities for each word. Closer to 1, the model is more realistic and repetitive; closer to 0, more creative and nonsensical. Default = 0.9

Returns

newly generated text

train(tag, tag_column)

Function that trains the Generate object on the texts of a specific tag

Parameters
  • tag – tag to slice the database with

  • tag_column – column of the database the tag is from

text_tagger.identify

class text_tagger.identify.Identify(database)

Bases: object

Class that wraps a database to identify which tag some new texts belong to

Parameters

database – the database object with tags for the object to refer to

Returns

Identify object that can identify a text according to embeddings

identify(texts, method='tf-idf', n_searches=3)

Function that finds the most likely tags for a text

Parameters
  • texts – string or list of strings with the text that should be identified

  • method – method that will be used to get the tag, default = “tf-idf”; can be one of [“tf-idf”, “cbow”, “doc2vec”, “lda”]

  • n_searches – how many tags to return for each text, default = 3

Returns

list of most likely tags for each text
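
Conceptually, identifying a text means embedding it and returning the n_searches closest tags. The sketch below shows that ranking step with raw word counts and cosine similarity; the tag profiles, the function names, and the use of plain counts instead of one of the listed embedding methods are simplifying assumptions.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical tag profiles: word counts over all texts carrying each tag.
tag_profiles = {
    "sports": Counter("team match goal team fans".split()),
    "cooking": Counter("recipe oven sauce recipe".split()),
    "travel": Counter("flight hotel beach team".split()),
}

def identify(text, n_searches=2):
    """Return the n_searches tags whose profile is closest to the text."""
    vector = Counter(text.lower().split())
    ranked = sorted(tag_profiles, key=lambda tag: cosine(vector, tag_profiles[tag]), reverse=True)
    return ranked[:n_searches]

best = identify("the team scored a goal")
```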

text_tagger.preprocess

class text_tagger.preprocess.Preprocess(tags_types, filter_flags={'digits': True, 'links': True, 'punct': True, 'refs': True, 'simbols': True, 'stopwords': True, 'text_only': False, 'tokenize': True}, languages=['english'], other_stopwords=[])

Bases: object

Class that preprocesses a database to filter text and convert tags to absolute ones

Parameters
  • tags_types – dictionary mapping each tag name to a list of configurations [method, number of clusters, original tags]; example: {“Lat_Long”: (“numeric-simple”, 200, [“Longitude”, “Latitude”])}

  • filter_flags – dictionary of flags that enable or disable each text-preprocessing step

  • languages – list of languages to be used for the stopwords

  • other_stopwords – list of manual stopwords to be used

Returns

object capable of filtering a dataframe

filter_text(text)

Function to filter a single text according to the preprocess object's filter_flags

Parameters

text – text to filter

Returns

filtered text
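
A flag-driven text filter of this kind might look like the sketch below. Only some of the flag names from the class signature are covered, and what each flag removes (for example, treating “refs” as @mentions and #hashtags) is an assumption, as are the stopword list and the lowercasing step.

```python
import re
import string

# Hypothetical stopword list; a real run would load one per language.
STOPWORDS = {"the", "a", "is", "at", "and"}

def filter_text(text, flags):
    """Apply each enabled filter in turn and return the remaining tokens."""
    if flags.get("links"):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if flags.get("refs"):
        text = re.sub(r"[@#]\w+", " ", text)  # assumed: @mentions and #hashtags
    if flags.get("digits"):
        text = re.sub(r"\d+", " ", text)
    if flags.get("punct"):
        # Replace every punctuation character with a space.
        text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = text.lower().split()
    if flags.get("stopwords"):
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

flags = {"links": True, "refs": True, "digits": True, "punct": True, "stopwords": True}
tokens = filter_text("The match is at 9pm, see https://example.com @fan!", flags)
```

Links and references are removed before punctuation so that the characters that identify them (":", "/", "@") are still present when the patterns run.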

numeric_process(data, method, n)

Process n-d numerical tags into an absolute numerical tag

Parameters
  • data – pd.Series that must be processed

  • method – method that will be used in the conversion, simple or cluster

  • n – number of divisions or clusters

Returns

new tag pd.Series
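
As an illustration of what a “simple” conversion into n divisions could look like, the sketch below bins a 1-d list of numbers into equal-width intervals. Equal-width binning, the function name, and the use of a plain list instead of a pd.Series are all assumptions; the package's own rule and its handling of n-d data may differ.

```python
def simple_bins(values, n):
    """Map each number to one of n equal-width intervals (0 .. n-1)."""
    low, high = min(values), max(values)
    width = (high - low) / n or 1.0  # avoid dividing by zero when all values match
    # The maximum value would land in bin n, so it is clamped into the last bin.
    return [min(int((v - low) / width), n - 1) for v in values]

latitudes = [0.0, 2.5, 5.0, 7.5, 10.0]
bins = simple_bins(latitudes, n=4)
```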

preprocess(database)

Function that receives a database object and preprocesses the data

Parameters

database – database to be preprocessed

preprocess_tags()

Function that preprocesses all tag columns according to the object's configuration

preprocess_text(text_series)

Function to filter a pd.Series of texts according to the preprocess object's filter_flags

Parameters

text_series – pd.Series of texts to filter

Returns

pd.Series of filtered texts

Module contents