texthero.preprocessing.stem

stem(input: pandas.core.series.Series, stem='snowball', language='english') → pandas.core.series.Series

Stem a Series using either the Porter or the Snowball NLTK stemmer.

Stemming removes the end of a word using a heuristic process. It is useful in contexts where the meaning of a word matters more than its derivation. Stemming is very efficient, which makes it well suited to large datasets.

texthero.preprocessing.stem makes use of two NLTK stemming algorithms, nltk.stem.SnowballStemmer and nltk.stem.PorterStemmer. SnowballStemmer should be used when the Pandas Series contains non-English text, as it has multi-language support.
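The approach described above can be sketched directly with NLTK and pandas. This is a minimal illustration, not necessarily how texthero implements it: the helper name `stem_series` is hypothetical, and splitting tokens on whitespace is an assumption made for the example.

```python
import pandas as pd
from nltk.stem import PorterStemmer, SnowballStemmer


def stem_series(s: pd.Series, stem: str = "snowball", language: str = "english") -> pd.Series:
    """Hypothetical helper: stem every whitespace-separated token in a Series."""
    if stem == "porter":
        stemmer = PorterStemmer()  # English only
    elif stem == "snowball":
        stemmer = SnowballStemmer(language)  # multi-language support
    else:
        raise ValueError("stem must be either 'porter' or 'snowball'")

    # Assumption: tokens are split on whitespace, so punctuation stays attached
    # to the word it follows and can prevent that word from being stemmed.
    return s.str.split().apply(lambda tokens: " ".join(stemmer.stem(t) for t in tokens))


s = pd.Series("I used to go \t\n running.")
print(stem_series(s)[0])
```

Because the trailing period stays attached to "running." in this sketch, that token is left unstemmed, which matches the output shown in the Examples section.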

Parameters

input : Pandas Series
stem : str (snowball by default)
    Stemming algorithm. It can be either ‘snowball’ or ‘porter’.
language : str (english by default)
    Supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian, spanish and swedish.
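As a quick sanity check, the language names listed above can be compared against what the installed NLTK version reports. The sketch below relies on the `SnowballStemmer.languages` class attribute; the `documented` list is simply copied from the parameter description, and "klingon" is a made-up value used to show the failure mode.

```python
from nltk.stem import SnowballStemmer

# Languages copied from the parameter description above.
documented = [
    "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
    "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish",
]

# NLTK exposes the languages it supports as a class attribute.
missing = [lang for lang in documented if lang not in SnowballStemmer.languages]
print(missing)

# An unsupported language is rejected when the stemmer is constructed.
try:
    SnowballStemmer("klingon")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

NLTK may report additional languages beyond the ones documented here, so the check only confirms that none of the documented names are missing.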

Notes

By default NLTK stemming algorithms lowercase all text.
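The lowercasing behaviour can be observed by calling the two NLTK stemmers directly on a capitalized word. This is a small sketch using default constructor arguments; the variable names are illustrative only.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

word = "Running"

# Both stemmers lowercase the input before applying their rules.
snowball_result = SnowballStemmer("english").stem(word)
porter_result = PorterStemmer().stem(word)

print(snowball_result, porter_result)
```

If the original casing matters for a later step, one option is to keep the unstemmed Series alongside the stemmed one.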

Examples

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series("I used to go \t\n running.")
>>> hero.preprocessing.stem(s)
0    i use to go running.
dtype: object