Distinguishing Poems from Spam - Text Vectorization

[1]:

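import pandas as pd
import sklearn.linear_model
import sklearn.model_selection  # needed for train_test_split below
import sklearn.pipeline
import sklearn.feature_extraction.text  # needed for CountVectorizer and TfidfTransformer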

Load the DataFrame df from the other Jupyter notebook.

[2]:

%%capture
%run "01 Unterscheide Gedichte von Spam - naiver Ansatz.ipynb"

df

Split the dataset into a training set and a test set.

[3]:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    df["text"].values, df["category"].values, test_size=0.33, random_state=42
)

Define the learning pipeline. More information is available at https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

[4]:

clf = sklearn.linear_model.LogisticRegression()  # Define the classifier

pipeline = sklearn.pipeline.Pipeline([
    ('vect', sklearn.feature_extraction.text.CountVectorizer()),    # Count how often each word occurs
    # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()),  # Use logarithmically weighted values instead of absolute counts
    ('clf', clf),
])

What does the CountVectorizer do? Here is a closer look:

[5]:

vectorizer = sklearn.feature_extraction.text.CountVectorizer()
X = vectorizer.fit_transform(X_train)  # sparse document-term matrix
counts = X.toarray()  # convert to a dense array once instead of once per row
pd.DataFrame(data={
    "words": vectorizer.get_feature_names_out(),
    **{
        f"counts_entry_{i}": counts[i]
        for i in range(len(counts))
    }
}).set_index("words").T

[5]:

words	00	000	0086	010	018	021	0210	0211	0214	0221	...	zhu	zirconia	zone	zu	zugesprochen	zum	zusteht	zwei	áfrica	über
counts_entry_0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
counts_entry_98	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_99	0	0	2	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_100	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_101	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
counts_entry_102	0	0	2	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

103 rows × 4149 columns

Note: counts_entry_<i> stands for the i-th text in the training set X_train, and each column holds the number of times the corresponding word occurs in that text.
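
The counting step can be reproduced by hand on a tiny corpus. The following is a minimal sketch, not the scikit-learn implementation: the three texts are hypothetical and not taken from the dataset, and the tokenizer only approximates the CountVectorizer defaults (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary).

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the notebook's dataset.
texts = [
    "Roses are red, violets are blue",
    "Win a FREE prize, click now",
    "Free free prize!",
]

# Assumption: approximates the CountVectorizer defaults
# (lowercase the text, keep tokens of at least two word characters).
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN.findall(text.lower())


# One Counter per text: word -> number of occurrences in that text.
counts_per_text = [Counter(tokenize(t)) for t in texts]

# The vocabulary is the sorted set of all words seen in any text.
vocabulary = sorted(set().union(*counts_per_text))

# Dense count matrix: one row per text, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts_per_text]

for i, row in enumerate(matrix):
    print(f"counts_entry_{i}", dict(zip(vocabulary, row)))
```

Note that the single-character token "a" does not make it into the vocabulary in this sketch, and that most entries are zero, which is why the real CountVectorizer returns a sparse matrix.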

Check what the sklearn.feature_extraction.text.TfidfTransformer would contribute. Information is available at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html, the documentation of the TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.
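
To get a feeling for the reweighting, the TF-IDF computation can be sketched in plain Python. This is an illustration under assumptions, not the library code: the count matrix is a hypothetical toy example, and the formula follows what the scikit-learn documentation describes for the default settings of TfidfTransformer (smooth_idf=True, norm="l2").

```python
import math

# Hypothetical toy count matrix (rows: texts, columns: words).
vocabulary = ["free", "prize", "roses"]
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [0, 0, 3],
]

n_texts = len(counts)

# Document frequency: in how many texts does each word occur at least once?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency: words that occur in many texts
# receive a smaller weight than words that occur in few texts.
idf = [math.log((1 + n_texts) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a row to Euclidean length 1 (all-zero rows stay unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


# Multiply each count by the word's idf weight, then normalize each row.
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in counts]

for word, weight in zip(vocabulary, idf):
    print(f"idf({word}) = {weight:.3f}")
for row in tfidf:
    print([round(v, 3) for v in row])
```

In this sketch "free" occurs in two of the three texts and therefore receives a lower weight than "prize" and "roses", which occur in one text each. The row normalization also removes the influence of text length.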

Task 1

Should the commented-out line # ('tfidf', sklearn.feature_extraction.text.TfidfTransformer()), in the pipeline definition above be uncommented, that is, should the TF-IDF step be enabled?

Your answer: …

Train the classifier clf. The value returned by pipeline.score equals the value returned by the score method of the classifier clf when it is applied to the vectorized texts; for LogisticRegression this is the mean accuracy.

[6]:

pipeline.fit(X_train, y_train)
pipeline.score(X_train, y_train)

[6]:

1.0

[7]:

pipeline.score(X_test, y_test)

[7]:

1.0
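
The statement that pipeline.score matches the score of the final classifier can be checked on a small example. The sketch below uses a hypothetical four-text toy dataset and hypothetical names (toy_pipeline, texts, labels) rather than the notebook's df; it compares the pipeline's score with the classifier's own score on the vectorized texts and with an accuracy computed by hand.

```python
import numpy as np
import sklearn.feature_extraction.text
import sklearn.linear_model
import sklearn.pipeline

# Hypothetical toy data, not the notebook's dataset.
texts = np.array([
    "roses are red violets are blue",
    "the moon sings softly over the sea",
    "win a free prize click now",
    "cheap pills free shipping click here",
])
labels = np.array(["poem", "poem", "spam", "spam"])

toy_pipeline = sklearn.pipeline.Pipeline([
    ("vect", sklearn.feature_extraction.text.CountVectorizer()),
    ("clf", sklearn.linear_model.LogisticRegression()),
])
toy_pipeline.fit(texts, labels)

# Score as reported by the whole pipeline.
pipeline_score = toy_pipeline.score(texts, labels)

# Score of the classifier alone, applied to the vectorized texts.
features = toy_pipeline.named_steps["vect"].transform(texts)
clf_score = toy_pipeline.named_steps["clf"].score(features, labels)

# Mean accuracy computed by hand from the predictions.
manual_accuracy = float(np.mean(toy_pipeline.predict(texts) == labels))

print(pipeline_score, clf_score, manual_accuracy)
```

All three numbers agree, because the pipeline only transforms the texts before handing them to the classifier's score method.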

This work by Marvin Kastner is licensed under a Creative Commons Attribution 4.0 International License.
