Distinguish poems from spam - naive approach
[1]:
import string
import numpy as np
import pandas as pd
import sklearn.model_selection  # imported explicitly because train_test_split is used below
import sklearn.tree
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
import collections
Load data
[2]:
df_spam = pd.read_csv("spam_betreff_und_text.csv", index_col=0)
df_spam
[2]:
 | Betreff | Von: (Name) | Text |
---|---|---|---|
0 | Re: Injection mold | qhobjesrl | Dear sir,\n\nWe have almost 20 years experienc... |
1 | Re: Quotation for Pump/dispenser Bottle (empty) | Gray Liu | Dear Sir/Madam\n\nWe are a manufacturer of PET... |
2 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
3 | Re: Offer Plastic Pet Bottle From China | Gray Liu | ?Dear Sir/Madam\n\nI am sorry to bother you. w... |
4 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
5 | CE certified medical face mask in London (Type... | uouzfi | To whom it may concern:\n\n\n\n\nHi Sir/Madam,... |
6 | Re: Professional mask manufacturer | uduoemt | Dear \n\nHope everything will better !\n\n\n\n... |
7 | Re: Offer Plastic Pet Bottle From China | Gray Liu | ?Dear Sir/Madam\n\nI am sorry to bother you. w... |
8 | Re: Offer Plastic Pet Bottle From China | Gray Liu | ?Dear Sir/Madam\n\nI am sorry to bother you. w... |
9 | CE certified medical face mask in London (Type... | crhajg | To whom it may concern:\n\n\n\n\nHi Sir/Madam,... |
10 | Informationen Xber E-Mail-Marketing | ISA Email Marketing | Guten Morgen\n\nMein Name ist Fabi?n Torre und... |
11 | Reliable international logistics service from ... | auiubrnn | Dear Friend\nThis is Sauron from Lionway Inter... |
12 | Re: Quotation for Pump/dispenser Bottle (empty) | Gray Liu | Dear Sir/Madam\n\nWe are a manufacturer of PET... |
13 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
14 | Re: manufacturer of ceramic foam filter | Tracy | Dear Friends,\n\nWe Manufacture\n\n1. Silicon ... |
15 | Re: Offer empty PET bottle with custom size | Gray Liu | Dear Sir/Madam\n\nI am sorry to bother you. we... |
16 | Re:Customized metal sheet Auto metal CNC machi... | rhtuuhjrq | Dear friend,\n\n\n\n\nGood day to you.\n\n\n\n... |
19 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
20 | Gewinnbenachrichtigung | Vega und Paulson | AKTENZEIHEN: BXK/ 57 - 876 / 11-20\nKUNDENNUMM... |
21 | Re: Offer Plastic Pet Bottle From China | Gray Liu | ?Dear Sir/Madam\n\nI am sorry to bother you. w... |
22 | You have an order worth $379 | Your Order | <img alt='' src='http://example.org'/>\n\n<CEN... |
23 | Re: center surface slitter rewinder Mini slitt... | liaeukuwsa | Hello,\n\nThis is Sasha from the HHM Slitter &... |
24 | Re: medical mask | Judy | Hello?\nWe are new factory which is build quic... |
25 | Re: quotation for pet bottle in China | Gray Liu | Dear Sir/Madam\n\nI am sorry to bother you. we... |
26 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
27 | Re: It's important to get fast delivery for Pr... | rseuxxaqkh | Hi Manager,\n\nStay safe and well!\n\nThere ar... |
28 | Re: quotation for pet bottle in China | Gray Liu | Dear Sir/Madam\n\nI am sorry to bother you. we... |
29 | Re: manufacturer of ceramic foam filter | Tracy | Dear Friends,\n\nWe Manufacture\n\n1. Silicon ... |
30 | Re:Do you need a disposable epidemic mask? | ewpyhdobo | Changzhou Youyi Hygiene Products Co., Ltd. is ... |
31 | Informationen Xber E-Mail-Marketing | ISA Email Marketing | Guten Morgen\n\nMein Name ist Fabi?n Torre und... |
32 | Re: stable quality of ceramic foam filter and ... | Tracy | Dear Friend\n\nWish you hava a nice day!\nWe a... |
33 | Call for Papers | Savant Journals | \n\n\nSavant Publishing House invites research... |
34 | Para: Marvin Kastner | Loida Damian | Estimado Marvin Kastner:\n\n\n\n\nSu investiga... |
35 | Re: Offer Plastic Pet Bottle From China | Gray Liu | ?Dear Sir/Madam\n\nI am sorry to bother you. w... |
36 | Re: medical mask | Judy | Hello?\nWe are new factory which is build quic... |
37 | WARNING: Mailbox Storage is Almost Full ! | tuhh.de | Dear marvin.kastner,\n\nYour mailbox storage i... |
38 | Editing & Proofreading | Harvard Proofreading | \nIt is hard to get ahead in the academic worl... |
39 | Re: stable quality of ceramic foam filter and ... | Tracy | Dear Friend\n\nWish you hava a nice day!\nWe a... |
40 | PO (RFQ: SM20917760C) 10/09/2020 | Dr. Zhu Gao | Dearmarvin.kastner,\n\n\n?????????????10/09/20... |
41 | Re: manufacturer of ceramic foam filter | Tracy | Dear Friends,\n\nWe Manufacture\n\n1. Silicon ... |
42 | Consulta en relación al trabajo académico de M... | Loida Damian | Estimado Marvin Kastner:\n\n\n\n\nQuería consu... |
43 | New job opportunities at AKADEUS | Anna Nytko | Having difficulty reading this email? View it ... |
44 | Re: Offer Plastic Pet Bottle From China | Gray Liu | ?Dear Sir/Madam\n\nI am sorry to bother you. w... |
45 | Event Contingency Plan - Remote Presenter and ... | Jesse Eric Tanner | \n\nHello Marvin Kastner,\n\n\nI reached out t... |
46 | Re: medical mask | Judy | Hello?\nWe are new factory which is build quic... |
47 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
48 | Open positions of your Faculty | anna.westerberg@faculty-clubs.com | Dear Friends,\n\nHope you are doing very well!... |
49 | Re: Sensor Soap Dispensers Manufacturer | hzizri | Dear friend,\n\n\n\n\nGood day to you!\n\n\n\n... |
50 | Fast Proofreading Service | Harvard Proofreading | English Dissertation, Thesis, or Proposal Edit... |
51 | Re: stable quality of ceramic foam filter and ... | Tracy | Dear Friend\n\nWish you hava a nice day!\nWe a... |
52 | Re: precision forged gears for outriggers and ... | oqrodzug | Dear friend,\n\n\n\n\nWe are the professional ... |
53 | Re: Offer empty PET bottle with custom size | Gray Liu | Dear Sir/Madam\n\nI am sorry to bother you. we... |
54 | Re: Offer empty PET bottle with custom size | Gray Liu | Dear Sir/Madam\n\nI am sorry to bother you. we... |
55 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... |
56 | Drawings and Specification | micheal k wong | Dear Sir,\n \nPlease check the attached drawin... |
57 | Urgent Request For Quotation | Dinesh Kumar | Hello marvin.kastner,\n\nCIS GAZ Romania compa... |
58 | MASKE IST NICHT GLEICH MASKE | OrangeBlue | \t\nMASKE IST NICHT GLEICH MASKE\n\n <http://e... |
59 | Informationen Xber E-Mail-Marketing | ISA Email Marketing | Guten Morgen\n\nMein Name ist Fabi?n Torre und... |
[3]:
df_poems = pd.read_csv("poems.csv", index_col=0)
df_poems
[3]:
 | poets | titles | fulltexts |
---|---|---|---|
0 | Mark Lemon | How To Make A Man Of Consequence | A brow austere, a circumspective eye.\nA frequ... |
1 | Edmund Hodgson Yates | All-Saints | In a church which is furnish'd with mullion an... |
2 | Jonathan Swift | Gentle Echo On Woman, A | \nIN THE DORIC MANNER\n\n\nShepherd. Echo, I w... |
3 | Richard Brinsley Butler Sheridan | Wife, A | Lord Erskine, at women presuming to rail,\nCal... |
4 | Richard Brinsley Butler Sheridan | Literary Lady, The | \nWhat motley cares Corilla's mind perplex,\nW... |
... | ... | ... | ... |
94 | Charles Sibley | Plaidie, The | \nUpon ane stormy Sunday,\nComing adoon the la... |
95 | Francis Davison | Are Women Fair? | "Are women fair?" Ay, wondrous fair to see, to... |
96 | Henry S. Leigh | Maud | \nNay, I cannot come into the garden just now,... |
97 | Unknown | Two Fishers | \nOne morning when Spring was in her teens,\nA... |
98 | Fred W. Loring | Fair Millinger, The | \nBy the Watertown Horse-Car Conductor\n\n\nIt... |
99 rows × 3 columns
Merge the two datasets.
[4]:
df_poems_merger = df_poems.copy()
df_poems_merger = df_poems_merger.assign(category="poem")
df_poems_merger.columns = ["creator", "title", "text", "category"]
[5]:
df_spam_merger = df_spam.copy()
df_spam_merger = df_spam_merger.assign(category="spam")
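# Note: the spam columns are (Betreff, Von: (Name), Text), so this positional rename
# labels the subject as "creator" and the sender as "title".
# Only "text" and "category" are used for the features further down.
df_spam_merger.columns = ["creator", "title", "text", "category"]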
[6]:
df = pd.concat([df_poems_merger, df_spam_merger])
df
[6]:
 | creator | title | text | category |
---|---|---|---|---|
0 | Mark Lemon | How To Make A Man Of Consequence | A brow austere, a circumspective eye.\nA frequ... | poem |
1 | Edmund Hodgson Yates | All-Saints | In a church which is furnish'd with mullion an... | poem |
2 | Jonathan Swift | Gentle Echo On Woman, A | \nIN THE DORIC MANNER\n\n\nShepherd. Echo, I w... | poem |
3 | Richard Brinsley Butler Sheridan | Wife, A | Lord Erskine, at women presuming to rail,\nCal... | poem |
4 | Richard Brinsley Butler Sheridan | Literary Lady, The | \nWhat motley cares Corilla's mind perplex,\nW... | poem |
... | ... | ... | ... | ... |
55 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... | spam |
56 | Drawings and Specification | micheal k wong | Dear Sir,\n \nPlease check the attached drawin... | spam |
57 | Urgent Request For Quotation | Dinesh Kumar | Hello marvin.kastner,\n\nCIS GAZ Romania compa... | spam |
58 | MASKE IST NICHT GLEICH MASKE | OrangeBlue | \t\nMASKE IST NICHT GLEICH MASKE\n\n <http://e... | spam |
59 | Informationen Xber E-Mail-Marketing | ISA Email Marketing | Guten Morgen\n\nMein Name ist Fabi?n Torre und... | spam |
157 rows × 4 columns
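`pd.concat` keeps the row labels of each input frame, so the merged frame contains duplicate index values. The following minimal sketch uses two small made-up frames (the contents are hypothetical, not the notebook's data) to show this effect and how `ignore_index=True` would produce a fresh index instead.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical contents, not the notebook's data).
poems = pd.DataFrame({"text": ["a rose", "a brook"], "category": "poem"})
spam = pd.DataFrame({"text": ["buy now", "cheap masks"], "category": "spam"})

# Default behaviour: the original row labels are kept, so 0 and 1 appear twice.
kept = pd.concat([poems, spam])
print(kept.index.tolist())

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
fresh = pd.concat([poems, spam], ignore_index=True)
print(fresh.index.tolist())
```

Duplicate labels do not matter for this notebook, because the rows are only iterated over later, but they can be surprising when selecting rows with `.loc`.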
[7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 157 entries, 0 to 59
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 creator 157 non-null object
1 title 156 non-null object
2 text 156 non-null object
3 category 157 non-null object
dtypes: object(4)
memory usage: 6.1+ KB
Remove rows with missing values
[8]:
df = df.dropna()
df
[8]:
 | creator | title | text | category |
---|---|---|---|---|
0 | Mark Lemon | How To Make A Man Of Consequence | A brow austere, a circumspective eye.\nA frequ... | poem |
1 | Edmund Hodgson Yates | All-Saints | In a church which is furnish'd with mullion an... | poem |
2 | Jonathan Swift | Gentle Echo On Woman, A | \nIN THE DORIC MANNER\n\n\nShepherd. Echo, I w... | poem |
3 | Richard Brinsley Butler Sheridan | Wife, A | Lord Erskine, at women presuming to rail,\nCal... | poem |
4 | Richard Brinsley Butler Sheridan | Literary Lady, The | \nWhat motley cares Corilla's mind perplex,\nW... | poem |
... | ... | ... | ... | ... |
55 | ISA Email Marketing Datenbank 2020 | ISA Email Marketing | <http://example.org> \n\n\n\n\nProduktcode \... | spam |
56 | Drawings and Specification | micheal k wong | Dear Sir,\n \nPlease check the attached drawin... | spam |
57 | Urgent Request For Quotation | Dinesh Kumar | Hello marvin.kastner,\n\nCIS GAZ Romania compa... | spam |
58 | MASKE IST NICHT GLEICH MASKE | OrangeBlue | \t\nMASKE IST NICHT GLEICH MASKE\n\n <http://e... | spam |
59 | Informationen Xber E-Mail-Marketing | ISA Email Marketing | Guten Morgen\n\nMein Name ist Fabi?n Torre und... | spam |
155 rows × 4 columns
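Before dropping rows it can help to see where the missing values are. This is a small sketch on a made-up frame (the column names mirror the notebook, the contents are hypothetical) that counts missing values per column and lists the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing title and one missing text.
frame = pd.DataFrame({
    "title": ["Maud", None, "Offer"],
    "text": ["Nay, I cannot come", "Dear Sir", np.nan],
})

# Number of missing values in each column.
missing_per_column = frame.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = frame[frame.isna().any(axis=1)]
print(rows_with_missing)

# dropna() removes exactly those rows.
cleaned = frame.dropna()
print(len(cleaned))
```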
Feature Engineering
A vector \(x\) is created for each entry. Most ML methods can only process numeric values in the form of vectors and matrices, which is why texts have to be specially prepared.
[9]:
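features = []
for i, row in df.iterrows():
    text = row["text"]
    # Only ASCII letters are counted, so umlauts and other non-ASCII letters are ignored.
    letters = [x for x in text if x in string.ascii_letters]
    features.append({
        "category": row["category"],
        "Textlänge": len(text),  # text length in characters
        # occurrences of "money" or "geld", case-insensitive
        "Anzahl 'Geld'": text.lower().count("money") + text.lower().count("geld"),
        "Anzahl '!'": text.count("!"),  # number of exclamation marks
        # share of uppercase letters among all ASCII letters
        "Großbuchstaben": (len([x for x in letters if x in string.ascii_uppercase]) /
                           len(letters))
    })
df_text_features = pd.DataFrame(features)
df_text_features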
[9]:
 | category | Textlänge | Anzahl 'Geld' | Anzahl '!' | Großbuchstaben |
---|---|---|---|---|---|
0 | poem | 322 | 0 | 0 | 0.031746 |
1 | poem | 397 | 0 | 0 | 0.049180 |
2 | poem | 1279 | 0 | 1 | 0.102510 |
3 | poem | 378 | 0 | 0 | 0.040816 |
4 | poem | 953 | 0 | 1 | 0.036339 |
... | ... | ... | ... | ... | ... |
150 | spam | 3372 | 0 | 0 | 0.142170 |
151 | spam | 346 | 0 | 0 | 0.061728 |
152 | spam | 791 | 0 | 1 | 0.103261 |
153 | spam | 1339 | 0 | 1 | 0.190999 |
154 | spam | 342 | 0 | 0 | 0.096899 |
155 rows × 5 columns
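To make the mapping from text to numbers concrete, here is a minimal sketch that computes the same four quantities for a single string. The function name `text_features` and the English dictionary keys are illustrative choices, not part of the notebook, and the guard against texts without letters is an added assumption.

```python
import string


def text_features(text):
    """Return the four hand-crafted features for one text as a dict."""
    letters = [ch for ch in text if ch in string.ascii_letters]
    uppercase = [ch for ch in letters if ch in string.ascii_uppercase]
    lowered = text.lower()
    return {
        "length": len(text),
        "money_count": lowered.count("money") + lowered.count("geld"),
        "exclamation_count": text.count("!"),
        # Assumption: a text without any ASCII letters gets a share of 0.0.
        "uppercase_share": len(uppercase) / len(letters) if letters else 0.0,
    }


example = text_features("WIN Money now!!")
print(example)
```

Each text thus becomes a vector of four numbers, which is the form a decision tree can work with.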
Split the data
[10]:
df_text_features_only = df_text_features.drop("category", axis=1)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    df_text_features_only.values, df_text_features["category"].values,
    test_size=0.33, random_state=42
)
X_train = np.stack(X_train, axis=0)
X_test = np.stack(X_test, axis=0)
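With a dataset this small, a purely random split can shift the class proportions between the training and test sets. One possible variation, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data (the arrays `X_demo` and `y_demo` are made up) to show that the class shares stay close to the original 60/40 ratio.

```python
import collections

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 60 "poem" and 40 "spam" labels with dummy features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["poem"] * 60 + ["spam"] * 40)

# stratify=y_demo asks for (approximately) equal class shares in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(collections.Counter(y_tr))
print(collections.Counter(y_te))
```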
Train a decision tree
[11]:
clf = sklearn.tree.DecisionTreeClassifier(max_depth=10)
clf = clf.fit(X_train, y_train)
Compute the accuracy on the training set. The learning algorithm has already seen these data.
[12]:
clf.score(X_train, y_train)
[12]:
1.0
Examine the result
Compute the accuracy on the test set. This shows how well the trained model generalizes to data it has not seen.
[13]:
clf.score(X_test, y_test)
[13]:
0.6538461538461539
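A training accuracy of 1.0 next to a test accuracy of about 0.65 suggests that the tree largely memorizes the training set. One way to explore this is to compare several values of `max_depth`. The sketch below does so on synthetic data from `make_classification`, which only stands in for the notebook's features, so the numbers it prints say nothing about the poem and spam data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class problem with four features (an assumption for illustration).
X_syn, y_syn = make_classification(
    n_samples=155, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

# Train one tree per depth limit and record training and test accuracy.
scores = {}
for depth in (1, 2, 3, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, test={scores[depth][1]:.2f}")
```

Training accuracy can only grow with depth, while test accuracy often stops improving or drops once the tree starts fitting noise.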
[14]:
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap="BuPu")
plt.show()
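The confusion matrix can also be inspected as plain numbers. This is a minimal sketch with made-up labels (the lists `y_true` and `y_pred` are hypothetical, not the notebook's predictions). Rows correspond to the true class and columns to the predicted class, and the accuracy is the share of entries on the diagonal.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six samples.
y_true = ["poem", "poem", "poem", "spam", "spam", "spam"]
y_pred = ["poem", "poem", "spam", "spam", "poem", "spam"]
labels = ["poem", "spam"]

# Rows: true class, columns: predicted class, both in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Correct predictions lie on the diagonal.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```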
[15]:
y_train_counter = collections.Counter(y_train)
display(y_train_counter)
sorted_class_names_with_counts = list(reversed(sorted(y_train_counter.items(), key=lambda x: x[1])))
display(sorted_class_names_with_counts)
sorted_class_names = [el[0] for el in sorted_class_names_with_counts]
sorted_class_names
Counter({'poem': 60, 'spam': 43})
[('poem', 60), ('spam', 43)]
[15]:
['poem', 'spam']
Reminder: the top line of each node shows the comparison with <= that decides the split. Samples for which it holds go to the left, all others go to the right.
[16]:
plt.figure(figsize=(27, 10))
sklearn.tree.plot_tree(
    clf,
    feature_names=list(df_text_features_only.columns),
    # Documentation: "Names of each of the target classes in ascending numerical order".
    # That is the order of clf.classes_; the count-sorted list happens to match it here.
    class_names=sorted_class_names
)
plt.show()
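For a tree of depth 10 the plot is hard to read, and a text dump of the rules can be easier to follow. The sketch below trains a depth-1 tree on four made-up samples (the feature names `length` and `exclamations` and all values are hypothetical) and prints it with `sklearn.tree.export_text`. The branch that satisfies the `<=` condition is listed first.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [text length, number of exclamation marks].
X_toy = [[100, 0], [200, 1], [1500, 0], [3000, 1]]
y_toy = ["poem", "poem", "spam", "spam"]

# A single split is enough to separate the two classes by length.
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# export_text renders each split as a "feature <= threshold" rule.
rules = export_text(toy_tree, feature_names=["length", "exclamations"])
print(rules)
```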
This work by Marvin Kastner is licensed under a Creative Commons Attribution 4.0 International License.