Distinguishing Poems from Spam: A Naive Approach

[1]:
import string
import numpy as np
import pandas as pd
import sklearn.model_selection
import sklearn.tree
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
import collections

Load the data

[2]:
df_spam = pd.read_csv("spam_betreff_und_text.csv", index_col=0)
df_spam
[2]:
Betreff Von: (Name) Text
0 Re: Injection mold qhobjesrl Dear sir,\n\nWe have almost 20 years experienc...
1 Re: Quotation for Pump/dispenser Bottle (empty) Gray Liu Dear Sir/Madam\n\nWe are a manufacturer of PET...
2 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
3 Re: Offer Plastic Pet Bottle From China Gray Liu ?Dear Sir/Madam\n\nI am sorry to bother you. w...
4 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
5 CE certified medical face mask in London (Type... uouzfi To whom it may concern:\n\n\n\n\nHi Sir/Madam,...
6 Re: Professional mask manufacturer uduoemt Dear \n\nHope everything will better !\n\n\n\n...
7 Re: Offer Plastic Pet Bottle From China Gray Liu ?Dear Sir/Madam\n\nI am sorry to bother you. w...
8 Re: Offer Plastic Pet Bottle From China Gray Liu ?Dear Sir/Madam\n\nI am sorry to bother you. w...
9 CE certified medical face mask in London (Type... crhajg To whom it may concern:\n\n\n\n\nHi Sir/Madam,...
10 Informationen Xber E-Mail-Marketing ISA Email Marketing Guten Morgen\n\nMein Name ist Fabi?n Torre und...
11 Reliable international logistics service from ... auiubrnn Dear Friend\nThis is Sauron from Lionway Inter...
12 Re: Quotation for Pump/dispenser Bottle (empty) Gray Liu Dear Sir/Madam\n\nWe are a manufacturer of PET...
13 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
14 Re: manufacturer of ceramic foam filter Tracy Dear Friends,\n\nWe Manufacture\n\n1. Silicon ...
15 Re: Offer empty PET bottle with custom size Gray Liu Dear Sir/Madam\n\nI am sorry to bother you. we...
16 Re:Customized metal sheet Auto metal CNC machi... rhtuuhjrq Dear friend,\n\n\n\n\nGood day to you.\n\n\n\n...
19 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
20 Gewinnbenachrichtigung Vega und Paulson AKTENZEIHEN: BXK/ 57 - 876 / 11-20\nKUNDENNUMM...
21 Re: Offer Plastic Pet Bottle From China Gray Liu ?Dear Sir/Madam\n\nI am sorry to bother you. w...
22 You have an order worth $379 Your Order <img alt='' src='http://example.org'/>\n\n<CEN...
23 Re: center surface slitter rewinder Mini slitt... liaeukuwsa Hello,\n\nThis is Sasha from the HHM Slitter &...
24 Re: medical mask Judy Hello?\nWe are new factory which is build quic...
25 Re: quotation for pet bottle in China Gray Liu Dear Sir/Madam\n\nI am sorry to bother you. we...
26 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
27 Re: It's important to get fast delivery for Pr... rseuxxaqkh Hi Manager,\n\nStay safe and well!\n\nThere ar...
28 Re: quotation for pet bottle in China Gray Liu Dear Sir/Madam\n\nI am sorry to bother you. we...
29 Re: manufacturer of ceramic foam filter Tracy Dear Friends,\n\nWe Manufacture\n\n1. Silicon ...
30 Re:Do you need a disposable epidemic mask? ewpyhdobo Changzhou Youyi Hygiene Products Co., Ltd. is ...
31 Informationen Xber E-Mail-Marketing ISA Email Marketing Guten Morgen\n\nMein Name ist Fabi?n Torre und...
32 Re: stable quality of ceramic foam filter and ... Tracy Dear Friend\n\nWish you hava a nice day!\nWe a...
33 Call for Papers Savant Journals \n\n\nSavant Publishing House invites research...
34 Para: Marvin Kastner Loida Damian Estimado Marvin Kastner:\n\n\n\n\nSu investiga...
35 Re: Offer Plastic Pet Bottle From China Gray Liu ?Dear Sir/Madam\n\nI am sorry to bother you. w...
36 Re: medical mask Judy Hello?\nWe are new factory which is build quic...
37 WARNING: Mailbox Storage is Almost Full ! tuhh.de Dear marvin.kastner,\n\nYour mailbox storage i...
38 Editing & Proofreading Harvard Proofreading \nIt is hard to get ahead in the academic worl...
39 Re: stable quality of ceramic foam filter and ... Tracy Dear Friend\n\nWish you hava a nice day!\nWe a...
40 PO (RFQ: SM20917760C) 10/09/2020 Dr. Zhu Gao Dearmarvin.kastner,\n\n\n?????????????10/09/20...
41 Re: manufacturer of ceramic foam filter Tracy Dear Friends,\n\nWe Manufacture\n\n1. Silicon ...
42 Consulta en relación al trabajo académico de M... Loida Damian Estimado Marvin Kastner:\n\n\n\n\nQuería consu...
43 New job opportunities at AKADEUS Anna Nytko Having difficulty reading this email? View it ...
44 Re: Offer Plastic Pet Bottle From China Gray Liu ?Dear Sir/Madam\n\nI am sorry to bother you. w...
45 Event Contingency Plan - Remote Presenter and ... Jesse Eric Tanner \n\nHello Marvin Kastner,\n\n\nI reached out t...
46 Re: medical mask Judy Hello?\nWe are new factory which is build quic...
47 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
48 Open positions of your Faculty anna.westerberg@faculty-clubs.com Dear Friends,\n\nHope you are doing very well!...
49 Re: Sensor Soap Dispensers Manufacturer hzizri Dear friend,\n\n\n\n\nGood day to you!\n\n\n\n...
50 Fast Proofreading Service Harvard Proofreading English Dissertation, Thesis, or Proposal Edit...
51 Re: stable quality of ceramic foam filter and ... Tracy Dear Friend\n\nWish you hava a nice day!\nWe a...
52 Re: precision forged gears for outriggers and ... oqrodzug Dear friend,\n\n\n\n\nWe are the professional ...
53 Re: Offer empty PET bottle with custom size Gray Liu Dear Sir/Madam\n\nI am sorry to bother you. we...
54 Re: Offer empty PET bottle with custom size Gray Liu Dear Sir/Madam\n\nI am sorry to bother you. we...
55 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \...
56 Drawings and Specification micheal k wong Dear Sir,\n \nPlease check the attached drawin...
57 Urgent Request For Quotation Dinesh Kumar Hello marvin.kastner,\n\nCIS GAZ Romania compa...
58 MASKE IST NICHT GLEICH MASKE OrangeBlue \t\nMASKE IST NICHT GLEICH MASKE\n\n <http://e...
59 Informationen Xber E-Mail-Marketing ISA Email Marketing Guten Morgen\n\nMein Name ist Fabi?n Torre und...
[3]:
df_poems = pd.read_csv("poems.csv", index_col=0)
df_poems
[3]:
poets titles fulltexts
0 Mark Lemon How To Make A Man Of Consequence A brow austere, a circumspective eye.\nA frequ...
1 Edmund Hodgson Yates All-Saints In a church which is furnish'd with mullion an...
2 Jonathan Swift Gentle Echo On Woman, A \nIN THE DORIC MANNER\n\n\nShepherd. Echo, I w...
3 Richard Brinsley Butler Sheridan Wife, A Lord Erskine, at women presuming to rail,\nCal...
4 Richard Brinsley Butler Sheridan Literary Lady, The \nWhat motley cares Corilla's mind perplex,\nW...
... ... ... ...
94 Charles Sibley Plaidie, The \nUpon ane stormy Sunday,\nComing adoon the la...
95 Francis Davison Are Women Fair? "Are women fair?" Ay, wondrous fair to see, to...
96 Henry S. Leigh Maud \nNay, I cannot come into the garden just now,...
97 Unknown Two Fishers \nOne morning when Spring was in her teens,\nA...
98 Fred W. Loring Fair Millinger, The \nBy the Watertown Horse-Car Conductor\n\n\nIt...

99 rows × 3 columns

Merge the two datasets.

[4]:
df_poems_merger = df_poems.copy()
df_poems_merger = df_poems_merger.assign(category="poem")
df_poems_merger.columns = ["creator", "title", "text", "category"]
[5]:
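df_spam_merger = df_spam.copy()
df_spam_merger = df_spam_merger.assign(category="spam")
# Note: the columns are renamed by position, so "Betreff" (subject) ends up in "creator"
# and "Von: (Name)" (sender) in "title"; only "text" and "category" are used below.
df_spam_merger.columns = ["creator", "title", "text", "category"]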
[6]:
df = pd.concat([df_poems_merger, df_spam_merger])
df
[6]:
creator title text category
0 Mark Lemon How To Make A Man Of Consequence A brow austere, a circumspective eye.\nA frequ... poem
1 Edmund Hodgson Yates All-Saints In a church which is furnish'd with mullion an... poem
2 Jonathan Swift Gentle Echo On Woman, A \nIN THE DORIC MANNER\n\n\nShepherd. Echo, I w... poem
3 Richard Brinsley Butler Sheridan Wife, A Lord Erskine, at women presuming to rail,\nCal... poem
4 Richard Brinsley Butler Sheridan Literary Lady, The \nWhat motley cares Corilla's mind perplex,\nW... poem
... ... ... ... ...
55 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \... spam
56 Drawings and Specification micheal k wong Dear Sir,\n \nPlease check the attached drawin... spam
57 Urgent Request For Quotation Dinesh Kumar Hello marvin.kastner,\n\nCIS GAZ Romania compa... spam
58 MASKE IST NICHT GLEICH MASKE OrangeBlue \t\nMASKE IST NICHT GLEICH MASKE\n\n <http://e... spam
59 Informationen Xber E-Mail-Marketing ISA Email Marketing Guten Morgen\n\nMein Name ist Fabi?n Torre und... spam

157 rows × 4 columns
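
Assigning to `.columns` renames by position, which makes it easy to mix up columns when the two source tables list them in a different order. The following is a minimal sketch of a name-based alternative, using made-up one-row tables (`df_poems_demo`, `df_spam_demo` and their contents are hypothetical; only the column names are taken from the CSV files above). It is not the notebook's method, just one way to keep sender and subject apart.

```python
import pandas as pd

# Made-up miniature versions of the two tables (column names as in the CSV files)
df_poems_demo = pd.DataFrame({
    "poets": ["Some Poet"],
    "titles": ["Some Title"],
    "fulltexts": ["Roses are red"],
})
df_spam_demo = pd.DataFrame({
    "Betreff": ["Re: Offer"],
    "Von: (Name)": ["Some Sender"],
    "Text": ["Dear Sir/Madam"],
})

# Rename by name, so that the sender becomes "creator" and the subject becomes "title"
poems_renamed = df_poems_demo.rename(
    columns={"poets": "creator", "titles": "title", "fulltexts": "text"}
).assign(category="poem")
spam_renamed = df_spam_demo.rename(
    columns={"Von: (Name)": "creator", "Betreff": "title", "Text": "text"}
).assign(category="spam")

# ignore_index=True gives the combined table a fresh 0..n-1 index
df_demo = pd.concat([poems_renamed, spam_renamed], ignore_index=True)
print(df_demo)
```

`pd.concat` aligns the two tables by column name, so the differing column order of the spam table does not matter here.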

[7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 157 entries, 0 to 59
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   creator   157 non-null    object
 1   title     156 non-null    object
 2   text      156 non-null    object
 3   category  157 non-null    object
dtypes: object(4)
memory usage: 6.1+ KB

Remove rows with missing values

[8]:
df = df.dropna()
df
[8]:
creator title text category
0 Mark Lemon How To Make A Man Of Consequence A brow austere, a circumspective eye.\nA frequ... poem
1 Edmund Hodgson Yates All-Saints In a church which is furnish'd with mullion an... poem
2 Jonathan Swift Gentle Echo On Woman, A \nIN THE DORIC MANNER\n\n\nShepherd. Echo, I w... poem
3 Richard Brinsley Butler Sheridan Wife, A Lord Erskine, at women presuming to rail,\nCal... poem
4 Richard Brinsley Butler Sheridan Literary Lady, The \nWhat motley cares Corilla's mind perplex,\nW... poem
... ... ... ... ...
55 ISA Email Marketing Datenbank 2020 ISA Email Marketing <http://example.org> \n\n\n\n\nProduktcode \... spam
56 Drawings and Specification micheal k wong Dear Sir,\n \nPlease check the attached drawin... spam
57 Urgent Request For Quotation Dinesh Kumar Hello marvin.kastner,\n\nCIS GAZ Romania compa... spam
58 MASKE IST NICHT GLEICH MASKE OrangeBlue \t\nMASKE IST NICHT GLEICH MASKE\n\n <http://e... spam
59 Informationen Xber E-Mail-Marketing ISA Email Marketing Guten Morgen\n\nMein Name ist Fabi?n Torre und... spam

155 rows × 4 columns
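
`df.dropna()` drops every row that has a missing value in any column. Since only the `text` column feeds the features below, a narrower variant would be to drop only rows without text. The sketch below illustrates the difference on a made-up table (`df_demo` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up table with one missing title and one missing text
df_demo = pd.DataFrame({
    "title": ["A", None, "C"],
    "text": ["some text", "more text", np.nan],
    "category": ["poem", "spam", "spam"],
})

# Missing values per column
print(df_demo.isna().sum())

# Drop only rows whose "text" is missing; rows with a missing title are kept
df_clean = df_demo.dropna(subset=["text"])
print(len(df_clean))

# For comparison: dropping rows with any missing value removes more rows
print(len(df_demo.dropna()))
```

Applied to the table above, where `title` and `text` each have one missing value, the subset variant would keep the row that lacks only a title.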

Feature Engineering

A vector \(x\) is created for each entry. Most ML methods can only process numeric values in the form of vectors and matrices, which is why texts have to be prepared specially.

[9]:
features = []

for i, row in df.iterrows():
    text = row["text"]
    letters = [c for c in text if c in string.ascii_letters]
    uppercase = [c for c in letters if c in string.ascii_uppercase]
    features.append({
        "category": row["category"],
        # Text length in characters
        "Textlänge": len(text),
        # Occurrences of "money" or "geld" (case-insensitive)
        "Anzahl 'Geld'": text.lower().count("money") + text.lower().count("geld"),
        # Number of exclamation marks
        "Anzahl '!'": text.count("!"),
        # Share of uppercase letters among all ASCII letters (0 if there are none)
        "Großbuchstaben": len(uppercase) / len(letters) if letters else 0.0,
    })

df_text_features = pd.DataFrame(features)
df_text_features
[9]:
category Textlänge Anzahl 'Geld' Anzahl '!' Großbuchstaben
0 poem 322 0 0 0.031746
1 poem 397 0 0 0.049180
2 poem 1279 0 1 0.102510
3 poem 378 0 0 0.040816
4 poem 953 0 1 0.036339
... ... ... ... ... ...
150 spam 3372 0 0 0.142170
151 spam 346 0 0 0.061728
152 spam 791 0 1 0.103261
153 spam 1339 0 1 0.190999
154 spam 342 0 0 0.096899

155 rows × 5 columns
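
To make the idea of a feature vector concrete, the sketch below applies the same four features to a single made-up string. The helper name `text_to_features` and the example text are assumptions for illustration and do not appear in the notebook.

```python
import string


def text_to_features(text):
    """Return [length, money/geld count, '!' count, uppercase share] for one text."""
    letters = [c for c in text if c in string.ascii_letters]
    uppercase = [c for c in letters if c in string.ascii_uppercase]
    return [
        len(text),
        text.lower().count("money") + text.lower().count("geld"),
        text.count("!"),
        len(uppercase) / len(letters) if letters else 0.0,
    ]


example = "WIN Money now!"
x = text_to_features(example)
print(x)
```

For this string the vector holds a length of 14, one hit for "money", one exclamation mark, and an uppercase share of 4 out of 11 letters.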

Split the data

[10]:
df_text_features_only = df_text_features.drop("category", axis=1)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    df_text_features_only.values, df_text_features["category"].values,
    test_size=0.33, random_state=42
)

X_train = np.stack(X_train, axis=0)
X_test = np.stack(X_test, axis=0)
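
The effect of `test_size=0.33` on the row counts can be checked on dummy data. This is a small sketch with a made-up feature matrix and made-up labels (`X_demo`, `y_demo` and the 100/55 class counts are hypothetical); it also shows the optional `stratify` argument, which the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy feature matrix with as many rows as the cleaned table (155) and 4 features
X_demo = np.arange(155 * 4).reshape(155, 4)
# Made-up labels, only for illustration
y_demo = np.array(["poem"] * 100 + ["spam"] * 55)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(len(X_tr), len(X_te))

# Optional variant: stratify=y_demo keeps the class proportions similar in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)
print(len(X_tr_s), len(X_te_s))
```

With 155 rows, both calls produce 103 training rows and 52 test rows.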

Train a decision tree

[11]:
clf = sklearn.tree.DecisionTreeClassifier(max_depth=10)
clf = clf.fit(X_train, y_train)

Compute the accuracy on the training set. The learning algorithm has already seen these data.

[12]:
clf.score(X_train, y_train)
[12]:
1.0

Examine the result

Compute the accuracy on the test set. This shows how well the trained model generalizes to data it has not seen.

[13]:
clf.score(X_test, y_test)
[13]:
0.6538461538461539
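
For a classifier, `score` returns the mean accuracy, that is, the share of correctly predicted labels. The sketch below computes it by hand and with `sklearn.metrics.accuracy_score` on made-up labels (`y_true` and `y_pred` are hypothetical and unrelated to the data above).

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions
y_true = ["poem", "poem", "spam", "spam", "spam"]
y_pred = ["poem", "spam", "spam", "spam", "poem"]

# Accuracy = number of correct predictions / number of predictions
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(manual)                          # 3 of 5 predictions are correct
print(accuracy_score(y_true, y_pred))  # same value via scikit-learn
```

A training accuracy of 1.0 next to a test accuracy of about 0.65 suggests that the tree has largely memorized the training data.
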
[14]:
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap="BuPu")
plt.show()
[Figure: confusion matrix of the decision tree on the test set]
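
The numbers behind such a plot can also be obtained as a plain array. A minimal sketch with `sklearn.metrics.confusion_matrix` on made-up labels (`y_true` and `y_pred` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
y_true = ["poem", "poem", "spam", "spam", "spam"]
y_pred = ["poem", "spam", "spam", "spam", "poem"]
labels = ["poem", "spam"]

# Rows are the true classes, columns the predicted classes, both in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

The diagonal holds the correct predictions; the off-diagonal entries show which class was mistaken for which.
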
[15]:
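# Class distribution in the training set
y_train_counter = collections.Counter(y_train)
display(y_train_counter)
# most_common() sorts the classes by frequency, most frequent first
sorted_class_names_with_counts = y_train_counter.most_common()
display(sorted_class_names_with_counts)
# Note: plot_tree expects class names in the order of clf.classes_ (sorted labels);
# the frequency order only happens to coincide with it here.
sorted_class_names = [el[0] for el in sorted_class_names_with_counts]
sorted_class_names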
Counter({'poem': 60, 'spam': 43})
[('poem', 60), ('spam', 43)]
[15]:
['poem', 'spam']

Reminder: the top line of each node shows the comparison (using <=) that splits the samples to the left (condition true) and to the right (condition false).

[16]:
plt.figure(figsize=(27, 10))
sklearn.tree.plot_tree(
    clf,
    feature_names=list(df_text_features_only.columns),
    class_names=[str(c) for c in clf.classes_]  # Documentation: "Names of each of the target classes in ascending numerical order"
)
plt.show()
[Figure: plot of the trained decision tree]
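
The left/right convention from the reminder above can be checked on a tiny tree. This is a sketch on made-up one-dimensional data (`X_demo`, `y_demo` and `tree_clf` are hypothetical names); it reads the root split from the fitted tree and looks up which child a sample is routed to.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: small values are "poem", large values are "spam"
X_demo = np.array([[1.0], [2.0], [10.0], [11.0]])
y_demo = np.array(["poem", "poem", "spam", "spam"])

tree_clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree = tree_clf.tree_

root_threshold = tree.threshold[0]   # the root test reads: feature <= threshold
left_child = tree.children_left[0]   # node reached when the test is true
right_child = tree.children_right[0] # node reached when the test is false

# apply() returns the index of the leaf each sample ends up in
goes_left = tree_clf.apply(np.array([[3.0]]))[0] == left_child
goes_right = tree_clf.apply(np.array([[9.0]]))[0] == right_child
print(root_threshold, goes_left, goes_right)

# The class order used by the classifier (and expected by plot_tree's class_names)
print(tree_clf.classes_)
```

A sample that satisfies the comparison ends up in the left child, all others in the right child; `classes_` lists the labels in sorted order.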

This work by Marvin Kastner is licensed under a Creative Commons Attribution 4.0 International License.