Distinguishing Poems from Spam - Text Vectorization

[1]:
import pandas as pd
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.model_selection  # needed for train_test_split below
import sklearn.pipeline

Load the DataFrame df from the other Jupyter notebook.

[2]:
%%capture
%run "01 Unterscheide Gedichte von Spam - naiver Ansatz.ipynb"

df

Split the dataset into a training set and a test set.

[3]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    df["text"].values, df["category"].values, test_size=0.33, random_state=42
)
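
To illustrate what test_size=0.33 does, here is a minimal sketch on made-up data. The ten texts and the labels "poem" and "spam" are assumptions for illustration only; the real values come from df.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten short texts with alternating labels.
toy_texts = np.array([f"text {i}" for i in range(10)])
toy_labels = np.array(["poem", "spam"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42
)

# The test-set size is rounded up: ceil(0.33 * 10) = 4, leaving 6 for training.
print(len(X_tr), len(X_te))
```

Fixing random_state makes the shuffle reproducible, so repeated runs yield the same split.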

Define the learning pipeline. More information is available at https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

[4]:
clf = sklearn.linear_model.LogisticRegression()  # Define the classifier

pipeline = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),    # Count how often each word occurs
    # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),  # Reweight the counts by log-scaled inverse document frequency
    ('clf', clf),
])

What does the CountVectorizer do? Here is a closer look:

[5]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
X = vectorizer.fit_transform(X_train)
counts = X.toarray()  # Convert the sparse matrix to a dense array only once
pd.DataFrame(data={
    "words": vectorizer.get_feature_names_out(),
    **{
        f"counts_entry_{i}": counts[i]
        for i in range(len(counts))
    }
}).set_index("words").T
[5]:
words 00 000 0086 010 018 021 0210 0211 0214 0221 ... zhu zirconia zone zu zugesprochen zum zusteht zwei áfrica über
counts_entry_0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
counts_entry_98 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_99 0 0 2 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_100 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_101 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
counts_entry_102 0 0 2 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

103 rows × 4149 columns

Note: counts_entry_<i> refers to the i-th text in the training set X_train, and each column holds the counts of one word from the learned vocabulary.

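Check what the sklearn.feature_extraction.text.TfidfTransformer would contribute. Information is available at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. Note that the linked page documents the TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.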
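
As a starting point for this check, the sketch below applies the TfidfTransformer with its default settings to a made-up three-sentence corpus (an assumption for illustration). It compares the two-step route with the TfidfVectorizer and looks at how a rarer word is weighted relative to a more common one.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical mini-corpus; not taken from the notebook's dataset.
tfidf_docs = [
    "roses are red",
    "buy cheap roses now",
    "red red wine",
]

count_step = CountVectorizer()
raw_counts = count_step.fit_transform(tfidf_docs)
two_step = TfidfTransformer().fit_transform(raw_counts).toarray()
one_step = TfidfVectorizer().fit_transform(tfidf_docs).toarray()

# Both routes yield the same weights.
same_weights = np.allclose(two_step, one_step)
print(same_weights)

# With the default norm="l2", every row is scaled to unit Euclidean length.
row_norms = np.linalg.norm(two_step, axis=1)
print(row_norms)

# In the first text, "are" (found in one text) outweighs "red" (found in two),
# although both occur exactly once there.
col = count_step.vocabulary_
print(two_step[0, col["are"]], two_step[0, col["red"]])
```

Words that appear in many texts are weighted down, and the row normalization removes the influence of text length.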

Exercise 1

Should the line # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()), in the pipeline definition above stay commented out, or should it be enabled?

Your answer:

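Train the pipeline, which also trains the classifier clf. pipeline.score passes the data through the preceding steps and then returns the result of the score method of the classifier clf, which for LogisticRegression is the mean accuracy.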

[6]:
pipeline.fit(X_train, y_train)
pipeline.score(X_train, y_train)
[6]:
1.0
[7]:
pipeline.score(X_test, y_test)
[7]:
1.0
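
The relationship between pipeline.score and clf.score can be verified by hand. The sketch below does this on four made-up texts with the assumed labels "poem" and "spam"; it uses separate names so that the fitted pipeline above is not overwritten.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data; not taken from the notebook's dataset.
demo_texts = [
    "roses are red",
    "violets are blue",
    "buy cheap pills now",
    "win money now",
]
demo_labels = ["poem", "poem", "spam", "spam"]

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression()),
])
demo_pipeline.fit(demo_texts, demo_labels)

# Manual route: vectorize with the fitted step, then score the classifier.
demo_features = demo_pipeline.named_steps["vect"].transform(demo_texts)
manual_score = demo_pipeline.named_steps["clf"].score(demo_features, demo_labels)

# Pipeline route: both steps in one call.
pipeline_score = demo_pipeline.score(demo_texts, demo_labels)
print(manual_score == pipeline_score)
```

The pipeline saves the manual transform step and ensures that the vocabulary learned during fit is reused when scoring.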

This work by Marvin Kastner is licensed under a Creative Commons Attribution 4.0 International License.