Distinguishing Poems from Spam - Text Vectorization
[1]:
import pandas as pd
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.feature_extraction
Load the DataFrame from the other Jupyter notebook.
[2]:
%%capture
%run "01 Unterscheide Gedichte von Spam - naiver Ansatz.ipynb"
df
Split the dataset.
[3]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    df["text"].values, df["category"].values, test_size=0.33, random_state=42
)
Define the learning pipeline. More information on this can be found at https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.
[4]:
clf = sklearn.linear_model.LogisticRegression()  # define the classifier
pipeline = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),  # count word frequencies
    # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),  # use logarithmic weighting instead of absolute counts
    ('clf', clf),
])
What does the CountVectorizer do? Here is a closer look:
[5]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
X = vectorizer.fit_transform(X_train)
X_dense = X.toarray()  # materialize the sparse matrix once
pd.DataFrame(data={
    "words": vectorizer.get_feature_names_out(),
    **{
        f"counts_entry_{i}": X_dense[i]
        for i in range(len(X_dense))
    }
}).set_index("words").T
[5]:
words | 00 | 000 | 0086 | 010 | 018 | 021 | 0210 | 0211 | 0214 | 0221 | ... | zhu | zirconia | zone | zu | zugesprochen | zum | zusteht | zwei | áfrica | über |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
counts_entry_0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
counts_entry_98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_99 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts_entry_102 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
103 rows × 4149 columns
For clarification: counts_entry_<i>
stands for the i-th text in the dataset.
Check what the sklearn.feature_extraction.text.TfidfTransformer
would add. Information on this can be found at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
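As a starting point for that exploration, the following sketch (again on a hypothetical toy corpus) chains a TfidfTransformer after the raw counts. With the default settings, each row is L2-normalized and words that appear in every document are damped by their inverse document frequency:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["the cat sat", "the cat and the dog"]  # toy corpus, not the real df
counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)  # idf-weighted, L2-normalized rows
print(tfidf.toarray().round(2))
```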
Task 1
Should the line # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),  # use logarithmic weighting instead of absolute counts
further up be commented back in?
Your answer: …
Train the classifier clf
. The return value of pipeline.score
corresponds to the return value of the method score
of the classifier clf
.
[6]:
pipeline.fit(X_train, y_train)
pipeline.score(X_train, y_train)
[6]:
1.0
[7]:
pipeline.score(X_test, y_test)
[7]:
1.0
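The full workflow can also be reproduced without the notebook's df on a small hypothetical dataset (all texts and labels below are invented for illustration). The same Pipeline is fitted and then used to classify an unseen text:

```python
import sklearn.pipeline
import sklearn.linear_model
import sklearn.feature_extraction

# toy data (invented), standing in for df["text"] / df["category"]
texts = [
    "roses are red violets are blue",
    "win money now click here",
    "the moon shines over silent hills",
    "free offer buy now limited deal",
]
labels = ["poem", "spam", "poem", "spam"]

pipe = sklearn.pipeline.Pipeline([
    ("vect", sklearn.feature_extraction.text.CountVectorizer()),
    ("clf", sklearn.linear_model.LogisticRegression()),
])
pipe.fit(texts, labels)
print(pipe.predict(["click to win a free offer"]))  # expected: ['spam']
```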
This work by Marvin Kastner is licensed under a Creative Commons Attribution 4.0 International License.