Date: 2024-12-31  Source: hfw.cc  Author: hfw.cc



Assessment 4 MATH70094: Programming for Data Science Autumn 2024
Assessment 4
This assessment contains two questions that will test your ability to work with files and data in
R and Python, as well as how to create and package your code in these two languages. Question
1 is on R, while Question 2 is on Python. The available marks are indicated in brackets for each
question. Note that this assessment will count 50% towards the final grade for this module. This
assessment will be marked, and feedback will be provided.
Make sure that you carefully read the following sections on Background and Submission Instructions.
Background
In this assessment, we want to build a spam classifier to decide for a given message string if it is a
genuine message (ham) that we want to keep, or if it is not a genuine message (spam) that should be
filtered out. A message can be thought of as a random sequence of words, but since we hardly ever
see the same message twice it is common to ignore the word order, and to simply record how many
times each word appears. We therefore represent a message by a random vector $X \in \{0, 1, 2, \dots\}^p$
with counts for a vocabulary of p words. The vocabulary stays fixed for all the messages.
Denote by $\Pr(X)$ the probability of a specific message, let S be the event that X corresponds to
spam and let H be the complementary event that X is ham. From Bayes' theorem the probability of
X being spam is
$$\Pr(S \mid X) = \frac{\Pr(X \mid S)\,\Pr(S)}{\Pr(X)}.$$
Here, $\Pr(S)$ is the prior probability of an arbitrary message being spam, and $\Pr(X \mid S)$ is the
probability to see message X given that we know it is spam. Similarly, the probability of X being
ham is
$$\Pr(H \mid X) = \frac{\Pr(X \mid H)\,\Pr(H)}{\Pr(X)}.$$
If $\Pr(S \mid X) > \Pr(H \mid X)$, we classify the message X as spam, otherwise as ham.
To simplify the estimation of the probabilities $\Pr(X \mid S)$ and $\Pr(X \mid H)$ from training data we make
a second simplifying assumption, namely we assume that the probability of any word appearing in a
message is independent of any other word appearing or not. This means
$$\Pr(X \mid S) = \prod_{j=1}^{p} \Pr(X_j \mid S), \qquad \Pr(X \mid H) = \prod_{j=1}^{p} \Pr(X_j \mid H),$$
where $X_j$ is the count of the jth word in the vocabulary. With these assumptions, the classifier is
called the Naive Bayes classifier. Despite its simplicity, it works surprisingly well in practice.
Suppose now we have training data represented by a matrix $M \in \{0, 1, 2, \dots\}^{n \times p}$ containing word
counts for n messages and a vector $\text{spam\_type} \in \{\text{ham}, \text{spam}\}^n$ assigning each message to a label
ham or spam. For example, $M_{ij}$ is the number of times word j appears in message i and $\text{spam\_type}_i$
is its label. By combining the information in M and spam_type we can compute $n_S$ and $n_H$, the
total number of spam and ham messages, $n_{S,j}$ and $n_{H,j}$ the number of times the jth word appears in
spam and ham messages, as well as $N_S$ and $N_H$ the total number of words in spam and ham messages.
With this we form the estimates
$$\Pr(S) \approx \frac{n_S}{n_S + n_H}, \qquad \Pr(H) \approx \frac{n_H}{n_S + n_H},$$
$$\Pr(X_j \mid S) \approx \frac{n_{S,j} + \alpha}{N_S + \alpha \times (N_S + N_H)}, \qquad \Pr(X_j \mid H) \approx \frac{n_{H,j} + \alpha}{N_H + \alpha \times (N_S + N_H)}.$$
The scalar $\alpha \in (0, 1]$ helps prevent zero estimates. Note that by applying the logarithm,
$\Pr(S \mid X) > \Pr(H \mid X)$ is equivalent to
$$\sum_{j=1}^{p} \log \Pr(X_j \mid S) + \log \Pr(S) > \sum_{j=1}^{p} \log \Pr(X_j \mid H) + \log \Pr(H).$$
To avoid numerical errors when multiplying many near-zero numbers in the approximation of the
products $\prod_{j=1}^{p} \Pr(X_j \mid S)$ and $\prod_{j=1}^{p} \Pr(X_j \mid H)$ from the estimates above, it is therefore better to
base the classification on the logarithms of the estimates.
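The estimates and the log-based comparison above can be illustrated with a minimal Python sketch that uses only the standard library. It is not a reference solution: the function names log_estimates and is_spam are hypothetical, the counts are made-up toy values for a three-word vocabulary, and letting word j contribute once per occurrence in the message is an assumption about how the sums are evaluated for a given count vector.

```python
import math

def log_estimates(counts_ham, counts_spam, n_ham, n_spam, alpha=0.5):
    """Return log word probabilities and log priors following the estimates above."""
    total_ham = sum(counts_ham)    # N_H: total number of words in ham messages
    total_spam = sum(counts_spam)  # N_S: total number of words in spam messages
    smoothing = alpha * (total_spam + total_ham)
    log_probs_ham = [math.log((c + alpha) / (total_ham + smoothing)) for c in counts_ham]
    log_probs_spam = [math.log((c + alpha) / (total_spam + smoothing)) for c in counts_spam]
    log_prior_ham = math.log(n_ham / (n_ham + n_spam))
    log_prior_spam = math.log(n_spam / (n_ham + n_spam))
    return log_probs_ham, log_probs_spam, log_prior_ham, log_prior_spam

def is_spam(x, log_probs_ham, log_probs_spam, log_prior_ham, log_prior_spam):
    """Compare the two log scores for a count vector x (assumed weighting by x[j])."""
    score_spam = log_prior_spam + sum(n * lp for n, lp in zip(x, log_probs_spam))
    score_ham = log_prior_ham + sum(n * lp for n, lp in zip(x, log_probs_ham))
    return score_spam > score_ham

# Toy data (made up): 1 ham message, 2 spam messages, vocabulary of 3 words.
estimates = log_estimates([1, 0, 1], [3, 3, 0], n_ham=1, n_spam=2, alpha=1)
print(is_spam([1, 1, 0], *estimates))
print(is_spam([0, 0, 1], *estimates))
```

Working with sums of logarithms keeps the scores in a numerically safe range even when a message contains many words with small estimated probabilities.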
Submission Instructions
Along with this PDF, you are provided with two folders files_train and files_test which
contain within subfolders messages (formed of strings), and two files train.csv and test.csv.
Create files according to the two questions below, and then create one zip file
(https://docs.fileformat.com/compression/zip/) named CID_PDS_Assessment4.zip with:
• the files train.csv and test.csv,
• a folder corpus, containing your R package files,
• a folder spamfilter, containing your Python package files,
• the corpus_0.0.1.tar.gz file created in Question 1,
• the file process_corpus.R created in Question 1,
• the file filter.py created in Question 2.
This can be visualised as follows:
CID_PDS_Assessment4.zip
|-- train.csv
|-- test.csv
|-- corpus folder
|-- spamfilter folder
|-- corpus_0.0.1.tar.gz
|-- process_corpus.R
|-- filter.py
Note the following before submitting:
• Do not add the folders files_train and files_test to your zip file.
• Replace CID in CID_PDS_Assessment4.zip by your own college ID number. For
example, if your college ID number is 12345678, then the zip file should be named
12345678_PDS_Assessment4.zip.
• The only external Python and R libraries allowed in this assessment are:
– Python: NumPy, Pandas, unittest,
– R: testthat, R6, stringr, stopwords
You should not load additional (non-base) libraries.
• For Python, provide docstring comments, and for R roxygen2-style comments (as described
in the Blackboard videos of week 9) for every attribute and method you define. You should
also add code comments as usual.
• Please answer in each cell/code block only the corresponding subpart (e.g., only answer Part
D(i) in the cell below the heading Part D(i)). The markers will try, where possible, not to
penalize answers to parts for errors in previous parts. For example, if you cannot do Part D(i),
leave the corresponding cell blank and do Part D(ii) assuming Part D(i) is working.
• You may use code and variables from previous subparts in your answers of a particular part.
• Marks may be deducted if these layout and format instructions are not followed.
Submit the zip file on Blackboard in the Assessment 4 submission tab in the module page. The
deadline is Monday 06 January 2025 at 09:00am, UK time.
Please note Imperial College’s policy on the late submission of assessments. This assessment must
be attempted individually. Your submission must be your own, unaided work. Candidates are
prohibited from discussing assessed coursework, and must abide by Imperial College’s rules. Enabling
other candidates to plagiarise your work constitutes an examination offence. To ensure quality
assurance is maintained, departments may choose to invite a random selection of students to an
‘authenticity interview’ on their submitted assessments.
Question 1 - R (60 marks)
The aim of this question is to build a package for loading and cleaning messages from data.
Some functions that may be useful in this question are:
• gsub, sapply, readLines, Filter,
• str_split from the stringr package,
• stopwords("en") from the stopwords package.
Code clarity (5 marks)
There is a famous saying among software developers that code is read more often than it is written.
Five marks will be awarded (or not awarded) based on the clarity of the code and appropriate use
of comments.
Part A (25 marks)
Create a script file corpus.R with an R6 class CorpusR6 containing
• private attributes: ham_strings (vector of strings), spam_strings (vector of strings),
• public methods:
– initialize: a function that takes the string name of a source folder as input, reads for
each message file in this folder (and also within subfolders) the contents of the file line
by line, and adds the message text (without the message head) either to spam_strings
or ham_strings depending on if the file name contains the substring "spam" or not.
– clean_messages: a function that modifies all the messages stored in the two private
variables; it proceeds for each message string as follows:
∗ transforms the message string to lower case,
∗ splits the string into words (tokens) separated by arbitrarily long whitespace and
creates with these words a vector of strings,
∗ removes from the end of each token any arbitrary sequence of punctuation characters,
∗ removes any token that belongs to the list of English stopwords obtained from calling
stopwords("en"),
∗ removes from each token any remaining punctuation,
∗ removes all tokens of length less than three,
∗ collapses the vector of tokens into one string, with tokens separated by whitespace.
(We will not make more modifications to the tokens, even though we could.)
– print: a function that prints the CorpusR6 object. For example, when corpus is a
CorpusR6 object formed of 4345 ham messages and 6** spam messages, then we have as
output
> corpus
CorpusR6 object
Number of Ham files: 4345
Number of Spam files: 6**
– save_to_csv: a function that takes the name of a target csv file as input and writes to
it one line per message, containing either ham or spam followed, after a comma, by the
corresponding message string from ham_strings or spam_strings
(the format should be as in the provided files train.csv and test.csv).
In addition to providing these attributes and functions, include appropriate documentation, input
checks (for every argument!) and unit tests, which test all specifications listed above.
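The cleaning steps listed for clean_messages can be sketched for a single message string as follows. The question itself asks for R; Python is used here purely for illustration. The function name clean_message and the tiny STOPWORDS set are hypothetical stand-ins (the real list would come from stopwords("en")), and "punctuation" is assumed to mean the ASCII characters in string.punctuation.

```python
import string

# Hypothetical stand-in for the English stop-word list; not the real list.
STOPWORDS = {"the", "a", "and", "is", "it's"}

def clean_message(message, stopwords=STOPWORDS):
    """Apply the listed cleaning steps to one message string."""
    # Lower-case, then split on arbitrarily long runs of whitespace.
    tokens = message.lower().split()
    # Strip any trailing sequence of punctuation from each token.
    tokens = [t.rstrip(string.punctuation) for t in tokens]
    # Drop stop words.
    tokens = [t for t in tokens if t not in stopwords]
    # Remove any punctuation remaining inside the tokens.
    remove_punct = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(remove_punct) for t in tokens]
    # Drop tokens shorter than three characters.
    tokens = [t for t in tokens if len(t) >= 3]
    # Collapse back into one whitespace-separated string.
    return " ".join(tokens)

print(clean_message("Hello!!!   the WORLD... it's a can't-miss deal, ok?"))
```

The order of the steps matters: stop words such as "it's" are matched before the inner punctuation is removed, which is why the stop-word filter comes before the final punctuation pass.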
Part B (15 marks)
Create a package called corpus which contains the code from Part A and exposes the functions
in Part A to the user. Make sure that devtools::document(), devtools::test() and
devtools::check() do not produce any errors or warnings (notes are OK) when called from within
the folder corpus. The result should be the file corpus_0.0.1.tar.gz, if you have chosen version
number 0.0.1.
Part C (15 marks)
Make sure that the class from Part A is available. Create a script file process_corpus.R which
creates two CorpusR6 objects corpus_train and corpus_test from the provided folders files_train
and files_test (you should set a path that works for you, the markers will set another one, make
sure these folders are not part of the final zip file!). For the two objects, clean all messages that
were contained in the folders using clean_messages and print the R6 objects to the screen. Finally,
use save_to_csv to save corpus_train to the file train.csv and corpus_test to test.csv.
The two csv files should contain the same entries as the provided files.
Question 2 - Python (** marks)
The aim of this question is to build a package for spam classification with Naive Bayes using Test
driven development and Defensive Programming, along with an application to a real data set.
Code clarity (5 marks)
There is a famous saying among software developers that code is read more often than it is written.
Five marks will be awarded (or not awarded) based on the clarity of the code and appropriate use
of comments.
Part A (20 marks)
Create a script file utils.py with three functions:
• tokenize: A function that takes a message string as input, splits it into words (tokens) along
whitespace and returns a list with the token strings. There should be no whitespace left
anywhere within the token strings.
• document_terms: A function that takes a list of word lists each created with tokenize as
input and returns a Document Term Dataframe. This Dataframe has as many rows as there
are word lists in the list, and has as many columns as there are unique words in the word lists.
Each column corresponds to one word, and each entry of the Dataframe counts how many
times a word appears in a document/message (the Dataframe is basically the matrix M from
the Background section).
• compute_word_counts: A function that takes a Document Term Dataframe (created with
document_terms) and a list spam_types of strings (with entries either ham or spam) as inputs,
and returns a 2 × p matrix with counts, where p is the length of the vocabulary, and where
the first row contains the overall counts for words in ham messages and the second row for
spam messages.
In addition to providing these functions, include appropriate documentation, input checks (for every
argument!) and unit tests, which test all specifications listed above.
For example, for the first function we expect the following output:
>>> tokenize(" properly separated text")
['properly', 'separated', 'text']
As an example for the latter two functions, suppose we have the following:
>>> doc1 = ["call", "here", "win", "prize", "money"]
>>> doc2 = ["call", "money", "call", "money", "bargain"]
>>> doc3 = ["call", "here", "information"]
>>> word_lists = [doc1, doc2, doc3]
>>> dtm = document_terms(word_lists)
>>> spam_types = ["spam", "spam", "ham"]
>>> word_counts = compute_word_counts(dtm,spam_types)
In this case, dtm should be
   call  here  win  prize  money  bargain  information
1     1     1    1      1      1        0            0
2     2     0    0      0      2        1            0
3     1     1    0      0      0        0            1
while word_counts should be
        call  here  win  prize  money  bargain  information
n_ham      1     1    0      0      0        0            1
n_spam     3     1    1      1      3        1            0
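The counting logic behind these two tables can be sketched without any external library. This is a simplified illustration, not the requested implementation: it returns plain lists instead of a Dataframe, omits all input checks, and orders the vocabulary by first appearance (an assumption that happens to match the column order shown above). Wrapping the result as pandas.DataFrame(rows, columns=vocabulary) would be one way to obtain a Dataframe.

```python
def tokenize(message):
    """Split a message on whitespace; str.split() leaves no whitespace in tokens."""
    return message.split()

def document_terms(word_lists):
    """Return (vocabulary, rows) where rows[i][j] counts word j in document i."""
    index = {}  # word -> column position, in order of first appearance
    for words in word_lists:
        for word in words:
            index.setdefault(word, len(index))
    rows = []
    for words in word_lists:
        row = [0] * len(index)
        for word in words:
            row[index[word]] += 1
        rows.append(row)
    return list(index), rows

def compute_word_counts(rows, spam_types):
    """Sum the document rows per label; first row is ham, second row is spam."""
    n_words = len(rows[0]) if rows else 0
    counts = {"ham": [0] * n_words, "spam": [0] * n_words}
    for row, label in zip(rows, spam_types):
        for j, count in enumerate(row):
            counts[label][j] += count
    return [counts["ham"], counts["spam"]]

doc1 = ["call", "here", "win", "prize", "money"]
doc2 = ["call", "money", "call", "money", "bargain"]
doc3 = ["call", "here", "information"]
vocabulary, rows = document_terms([doc1, doc2, doc3])
word_counts = compute_word_counts(rows, ["spam", "spam", "ham"])
print(vocabulary)
print(rows)
print(word_counts)
```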
Part B (25 marks)
Create a script file classifier.py with a class NaiveBayes containing
• private attributes: __word_counts (a matrix), __spam_types (a list),
• public attributes: log_probs_ham (a list containing the logs of the approximated $\Pr(X_j \mid H)$
from the Background section), log_probs_spam (a list containing the logs of the approximated
$\Pr(X_j \mid S)$), log_prior_ham (the log of the approximated $\Pr(H)$), log_prior_spam (the log
of the approximated $\Pr(S)$),
• public methods:
– __init__: a function that takes word_counts and spam_types as arguments, and sets
the two corresponding private attributes;
– get_spam_types: a function that returns the private attribute spam_types;
– get_word_counts: a function that returns the private attribute word_counts;
– fit: a function that takes as argument an α value (with default α = 0.5), and sets the
values of the public attributes as described in the Background section;
– classify: a function that takes a message string, tokenizes it using the method tokenize
from Part A and returns a classification ham or spam, as explained in the Background
section; words in the message that were not seen in the training data are ignored;
– print: a print method that prints the object as specified in the example below.
In addition to providing this class, include appropriate documentation and unit tests, which test all
specifications listed above.
Continuing the example from Part A we should obtain
>>> nb = NaiveBayes(word_counts,spam_types)
>>> nb.fit(1)
>>> nb
NaiveBayes object
vocabulary size: 7
top 5 ham words: call,here,information,win,prize
top 5 spam words: call,money,here,win,prize
(prior_ham,prior_spam): (0.3333333,0.6666667)
Here, top 5 ham words corresponds to the five largest values in log_probs_ham, and top 5 spam
words corresponds to the five largest values in log_probs_spam.
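Several words in this example share the same estimated probability, so the "top 5" lists depend on how ties are broken. The sketch below recomputes the log probabilities for the Part A counts with α = 1 and ranks them with Python's stable sort, so that tied words keep their vocabulary order. That tie-breaking rule is an assumption; it is used here because it reproduces the listing above. The helper names log_probs and top_words are hypothetical.

```python
import math

vocabulary = ["call", "here", "win", "prize", "money", "bargain", "information"]
counts_ham = [1, 1, 0, 0, 0, 0, 1]
counts_spam = [3, 1, 1, 1, 3, 1, 0]
alpha = 1.0
total_words = sum(counts_ham) + sum(counts_spam)  # N_S + N_H

def log_probs(counts):
    """Log of the smoothed estimates from the Background section."""
    denominator = sum(counts) + alpha * total_words
    return [math.log((c + alpha) / denominator) for c in counts]

def top_words(values, k=5):
    """Names of the k largest values; sorted() is stable, so ties keep vocabulary order."""
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=True)
    return [vocabulary[j] for j in order[:k]]

print(",".join(top_words(log_probs(counts_ham))))
print(",".join(top_words(log_probs(counts_spam))))
```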
Part C (15 marks)
Create a folder containing the data for a Python package called spamfilter that contains the code from
Parts A and B and exposes the functions in Part A and the class in Part B to the user. Make sure
that running python -m pytest and python -m build from within the folder spamfilter in the
command line does not produce any errors or failures. The result of python -m build should be the
file spamfilter-0.0.1.tar.gz and/or spamfilter-0.0.1*******.whl in the folder spamfilter/dist,
if you have chosen version number 0.0.1 (and ******* are optional other characters).
Part D (25 marks)
Make sure that the functions and class from Parts A and B are available. Create a script file
filter.py with code as described below.
D(i)
Load the file train.csv and create with this an object from the NaiveBayes class. Fit the object
with α = 1 and print it to the console.
D(ii)
Load also the file test.csv and classify the messages in both train.csv and test.csv using the
classifier from Part D(i). Print the confusion tables (comparing predicted classifications to actual ones)
for both cases, and print the accuracy (sum of the diagonal of the confusion matrix divided by the total
number of messages in the respective set of messages) to the console. Comment briefly on the difference
between the accuracies for both cases.
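As a small illustration of the confusion table and accuracy described in D(ii), the following sketch uses only the standard library and made-up labels. The names confusion_table and accuracy, and the layout with actual labels as rows and predicted labels as columns, are assumptions for this example.

```python
from collections import Counter

LABELS = ("ham", "spam")

def confusion_table(actual, predicted):
    """Rows are actual labels, columns are predicted labels, both in LABELS order."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in LABELS] for a in LABELS]

def accuracy(table):
    """Sum of the diagonal divided by the total number of messages."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total

# Made-up labels for five messages.
actual = ["ham", "ham", "spam", "spam", "spam"]
predicted = ["ham", "spam", "spam", "spam", "ham"]
table = confusion_table(actual, predicted)
print(table)
print(accuracy(table))
```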
D(iii)
Continuing on from D(ii), now consider α a tuning parameter. Fit NaiveBayes to the messages in train.csv
for 10 evenly spaced values of α in the interval [0,1]. Determine the best such α in terms of achieving
the highest accuracy when testing on the messages in test.csv.
Discuss briefly if choosing the tuning parameter α in this way is reasonable.
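The grid search in D(iii) can be sketched generically. The helper names alpha_grid and best_alpha are hypothetical, and accuracy_for stands for any user-supplied function that fits the classifier for a given α and returns its accuracy; a toy function is used below in its place. Note that the grid on [0,1] includes α = 0, for which an estimate can be exactly zero when a word has a zero count, so its logarithm is undefined and may need special handling.

```python
def alpha_grid(n=10, low=0.0, high=1.0):
    """Return n evenly spaced values from low to high, endpoints included."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def best_alpha(alphas, accuracy_for):
    """Return the alpha with the highest accuracy (the first one in case of ties)."""
    return max(alphas, key=accuracy_for)

grid = alpha_grid()

# Toy stand-in for "fit with this alpha and measure accuracy"; not real results.
def toy_accuracy(alpha):
    return 1.0 - abs(alpha - 1.0)

print(grid)
print(best_alpha(grid, toy_accuracy))
```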



