




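The 4th “Certification Cup” Mathematics China International Mathematical Modeling Contest
Letter of Commitment

We have carefully read the rules of the 4th “Certification Cup” Mathematics China International Mathematical Modeling Contest.
We fully understand that, once the contest has started, team members may not research or discuss the contest problem in any way (including telephone, e-mail and online consultation) with anyone outside the team (including advisors).
We know that plagiarizing the work of others violates the contest rules. If we cite the work of others or other public material (including material found online), it must be listed clearly, in the prescribed reference format, both at the point of citation in the text and in the references.
We solemnly pledge to abide strictly by the contest rules in order to keep the contest fair and impartial. If we violate the rules, we will be dealt with seriously.
We allow the Mathematics China website to publish the paper so that users can learn from and exchange it; the website does not need our prior consent for non-commercial exchange of the paper.

Our team number: 1235
Problem chosen: C
Team members (signature): Member 1: Wang Dongquan (王東全); Member 2: Wu Zhuoqi (吳卓其); Member 3: Zhou Yang (周洋)
Team coach (signature): Yang Jianbo (楊劍波)

Team# 1235 Page 2 of 21

The 4th “Certification Cup” Mathematics China International Mathematical Modeling Contest
Numbering Page
Team number (to be filled in by each team in advance): 1235
Unified contest number (assigned by the organizing committee before the paper is sent to the judges):
Judging number (assigned by the judging panel before review):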
Using Data Mining Techniques for Detecting Terror-Related Activities on the Web

Abstract:
The number of terror attacks is increasing year by year. On November 13, 2015, the terrorist attack in Paris caused hundreds of casualties. The hazards of cyber terrorism have become more and more serious. The USA has enacted a number of laws aimed at preventing cyber terrorism, such as the “USA PATRIOT Act”. It is necessary to establish a model that helps prevent the spread of terrorist networks and that monitors and finds people with a tendency toward terrorism. The Internet behavior analysis and risk assessment model (IBARA) was established to assess the Internet behavior of the people who are monitored. In this paper, based on IBARA, we not only study the relationship between people's Internet behavior and their possible terrorist tendency, but also analyze and discuss a relative quantitative risk index of individual terrorism tendency and the relevant strategies for preventing terrorist attacks.

Firstly, Internet behavior is divided into two parts: Web text and images. The complex vector space word frequency analysis algorithm is adopted to establish the personal tendency of terrorism risk index sub-module (PTTRISM), which can predict people's tendency toward terrorism. In PTTRISM, an individual's behavior with respect to Web text is analyzed using keyword extraction and frequency analysis techniques. Based on the analysis results, a value is assigned to the individual's terrorism risk index. Using PTTRISM to analyze the data sample, we conclude that most people who have accessed terrorism-related information are not likely to become potential terrorists. PTTRISM can calculate a person's risk index for the tendency toward terrorism by analyzing Internet behavior.

Secondly, the object of network monitoring is in fact not one person but a large number of people, which makes the monitoring data too large and complex. To facilitate rapid and efficient classification and analysis of big data, a big data clustering statistics sub-module (MDCSSM) is established based on density-based clustering. At the same time, to shorten the computing time of MDCSSM, the standard particle swarm optimization (PSO) with the weight-shrink factor is adopted. This realizes effective, fast and automatic clustering analysis of datasets. Validation of the sub-model shows that it can be used to analyze a large amount of data. Due to the scarcity of monitoring data, we use some frequently tested public datasets, “Iris”, “Glass”, “Wine” and “Aggregation”, in place of the monitoring data to verify the clustering algorithm. The clustering results demonstrate that the clustering algorithm can categorize the monitoring datasets in an effective, fast and automatic manner.

Finally, based on IBARA, we propose the following suggestions to President Obama for fighting terrorism:
1. Put more resources into fighting terrorism on the network. A User Online Monitoring System of Behavior and Psychology could be built to monitor and assess the behavior of the public.
2. Establish an information security evaluation system to weaken and even prevent terrorist propaganda through the network.
3. Strengthen public anti-terrorism education and raise public awareness of anti-terrorism.

Due to time constraints, the model still has some defects that need to be improved. In the PTTRI sub-module, factors of voice and
image files are not considered. In the MDCS sub-module, the selection of the fitness function in the clustering analysis could be further improved. With further improvement of the model, we will get more accurate results.

Key words: PSO, word frequency analysis algorithm, density-based clustering, terrorism, Internet behavior

Contents
I. Introduction
II. The Description of the Problem
2.1 Our Approximation of the Whole Course of Data Mining for Terrorists on Websites
2.2 The Differences in Weights and Sizes of Available Data
III. IBARA
3.1 PTTRISM
3.1.1 Terms, Definitions and Symbols in PTTRISM
3.1.2 Assumptions in PTTRISM
3.1.3 The Model of Terrorism-Related Website Browsing and Vector Space Models of Lexical Meaning
3.1.4 The Model of Risk Index
3.1.5 Solutions and Results for PTTRISM
3.1.6 Strength and Weakness in PTTRISM
3.2 MDCSSM
3.2.1 Extra Symbols
3.2.2 Additional Assumptions
3.2.3 The Foundation of MDCSSM to Categorize Big Data
3.2.4 The Results of MDCSSM
3.2.5 Strength and Weakness
IV. Conclusions
4.1 Conclusions of the Problems
4.2 Methods Used in our Models
4.3 Applications of our Models
V. Proposal to Fighting Terrorism
VI. References

I. Introduction
In order to indicate the origin of web-related terrorism problems, the following background is worth mentioning.
Terrorist cells are using the Internet infrastructure to exchange information and recruit new members and supporters [1, 2] (Lemos 2002; Kelley 2002). For example, high-speed Internet connections were used intensively by members of the infamous Hamburg Cell that was largely responsible for the preparation of the September 11 attacks against the United States [3] (Corbin 2002). This is one reason for the major effort made by law enforcement agencies around the world
in gathering information from the Web about terror-related activities. It is believed that the detection of terrorists on the Web might prevent further terrorist attacks [2] (Kelley 2002). One way to detect terrorist activity on the Web is to eavesdrop on all traffic of Web sites associated with terrorist organizations in order to detect the accessing users based on their IP addresses. Unfortunately, it is difficult to monitor terrorist sites [3] (such as Azzam Publications (Corbin 2002)), since they do not use fixed IP addresses and URLs. The geographical locations of the Web servers hosting those sites also change frequently in order to prevent successful eavesdropping. To overcome this problem, law enforcement agencies are trying to detect terrorists by monitoring all ISP traffic [4] (Ingram 2001), though the privacy issues raised still prevent the relevant laws from being enforced.

Figure 1: the annual number of
terrorist attacks from 1968 to 2009

II. The Description of the Problem
2.1 Our Approximation of the Whole Course of Data Mining for Terrorists on Websites
How often does the monitored Internet user visit websites that contain terrorism-related information and terrorist propaganda?
The lexical meaning
of the contents of their emails, chats, post views and downloaded text files.
As for other file formats, such as videos, images and audio, image description and voice recognition techniques are used as tools to detect terrorists.
For categorizing the monitoring data, clustering techniques are adopted to partition the data in an effective, fast and automatic manner.
Present some useful suggestions to President Obama for fighting terrorism.

2.2 The Differences in Weights and Sizes of Available Data
Due to differences between the collected datasets, it is necessary to preprocess the available data, such as text datasets, numerical datasets, image datasets and even voice datasets.
1) The Preprocess of Text Data: remove non-alphabetical characters from the text dataset and put the text into MATLAB cell structures.
2) The Preprocess of Image Data: remove non-imagery information from the image datasets and convert the RGB images into gray-value images. If the image datasets are polluted by noise, it is necessary to denoise the images before analyzing the relevant information.
3) The Preprocess of Voice: if the audio datasets are polluted by noise, audio-denoising steps need to be applied before the auditory information is extracted.
4) The Preprocess of Numerical Dataset: because the data samples differ in units and magnitudes, the numerical dataset needs to be normalized and standardized.

III. IBARA
3.1 PTTRISM
In this paper a new
methodology to detect users accessing terrorist-related information by Frequency-Analysis Techniques, Vector Space Models of Lexical Meaning [5], Image Description [6], Voice Recognition [7] and Data Clustering is presented.

3.1.1 Terms, Definitions and Symbols in PTTRISM
The signs and definitions are mostly generated from our models in this paper.
$R$ is the risk index, which denotes the degree of risk that the Internet user may pose.
$P_t^{text}$ is the degree to which the text content that the Internet user is involved with is related to terrorism during the time interval $t$.
$P_t^{image}$ is the degree to which the images that the Internet user browses and downloads are related to terrorism during the time interval $t$.
$P_t^{audio}$ is the degree to which the audio that the Internet user listens to and downloads is related to terrorism during the time interval $t$.
$w_{i,j}$ is the weight factor of the vector space $q$.

3.1.2 Assumptions in PTTRISM
The main design criteria for the proposed methodology are:
Training the detection algorithm should be based on the content of existing terrorist sites and known terrorist traffic on the Web.
Detection should be carried out in real time. This goal can be achieved only if terrorist information interests are presented in a compact manner for efficient processing.
The detection sensitivity should be controlled by user-defined parameters to enable calibration of the desired detection performance.
No information related to terrorism is encrypted by encryption algorithms such as RSA.
All information that can be monitored is presented as images, audio and text.
The social attributes of the monitored person are neglected and only the network properties are considered.

3.1.3 The Model of Terrorism-Related Website Browsing and Vector Space Models of Lexical Meaning
One major issue in this model is the representation of the textual content of Web pages. More specifically, there is a need to represent the content of terror-related pages as against the content of a currently accessed page in order to efficiently compute the similarity between them [9]. This study uses the vector-space model commonly used in Information Retrieval applications for representing terrorists' interests and each accessed Web page. In the vector-space model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query $q$ are also weighted. Let $w_{i,q}$ be the weight associated with the pair $(k_i, q)$, where $w_{i,q} \geq 0$. Then the query vector is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the total number of index terms. The vector for a document $d_j$ is represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the
query $q$ as the correlation between the vectors $\vec{d_j}$ and $\vec{q}$. This correlation can be measured by the cosine of the angle between these two vectors as

$$ \mathrm{sim} = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} \tag{3-1-1} $$

where $|\vec{d_j}|$ and $|\vec{q}|$ are the norms of the document and query vectors. In the vector space model, the frequency of a term $k_i$ inside a document $d_j$ is

$$ freq_{i,j} = \frac{n_i}{N_j} \tag{3-1-2} $$

The normalized frequency of term $k_i$ inside a document $d_j$ is given by

$$ f_{i,j} = \frac{freq_{i,j}}{\max(freq_{i,j})} \tag{3-1-3} $$

The best known term-weighting schemes use weights which are given by

$$ w_{i,j} = -f_{i,j} \log(freq_{i,j}) \tag{3-1-4} $$

In this paper each Web page is considered as a document and is represented as a vector. The terrorists' interests are represented by several vectors, where each vector relates to a different topic of interest. The query of the methodology defines and represents the typical behavior of terrorist users based on the content of their Web activities. The query is based on a set of Web pages that were downloaded from terrorist-related sites and is the main input of the detection algorithm. It is assumed that it is possible to collect Web pages from terror-related sites. The content of the collected pages is the input to the
Vector Generator module that converts the pages into vectors of weighted terms [10] (each page is converted to one vector).
In order to define the degree to which the Internet user browses terrorism-related websites during the time interval $t$, the function $b(m)$, which measures how much the Internet user behaves like a potential terrorist when browsing the website $m$, is defined as follows:

$$ b(m) = \mathrm{sim} \cdot \chi(\mathrm{sim}) \tag{3-1-5} $$

where

$$ \chi(x) = \begin{cases} 1, & x \geq \text{threshold} \\ 0, & x < \text{threshold} \end{cases} $$

$$ P_t^{text} = \frac{\sum b(m)}{M} \tag{3-1-6} $$

In this paper, we adopt 0.5 as the value of threshold. The query in this paper is listed in the table below.

ID   Details of Queries
1    Bomb Suicide
2    Gunfire
3    Kidnap
4    Massacre
5    Attack to Civilians
6    Islamic State of Iraq and al Shams
7    Qaeda
8    al-Shabaab
9    Islamic State
10   hijack
11   Assassination

3.1.4 The Model of Risk Index
Here we report the remarkable finding that identical patterns of violence are currently emerging within these different international arenas. Not only have the wars in Iraq and Colombia evolved to yield the same power-law behavior, but this behavior is currently of the same quantitative form as the war in Afghanistan and global terrorism in non-G7 countries. Not only is the model's power-law behavior in excellent agreement with the data from Iraq, Colombia and non-G7 terrorism, it is also consistent with data obtained from the recent war in Afghanistan. Power-law distributions are known to arise in a large number of physical, biological, economic and social systems. In the present context, a power-law distribution means that the probability that an event will occur with behavior $P$ is given by [12]

$$ R(P) = C P^{-\alpha} \tag{3-1-7} $$

where $P \in (0, 1]$, $P = P_t^{text}$, and $C$ and $\alpha$ are positive coefficients [13, 14]. Previous studies have shown that the distribution obtained from past terrorist attacks exhibits a power law with [15] $\alpha = 1.809$.
Since we cannot obtain the coefficient $C$ effectively, we define a relative risk index $r$ among the group of people who are monitored during the specific time interval as follows:

$$ r = \frac{R_a(P)}{\sum_a R_a(P)} \tag{3-1-8} $$

3.1.5 Solutions and Results for PTTRISM
1) The Solution Steps to PTTRISM
1. Generation of Term-Frequency matrix
This is the term-frequency matrix of all unique terms in the documents $d_j$ with $j = 1, 2, \ldots, N$. The term-document matrix $Freq$ is an $M \times N$ matrix with unique terms $t_i$, $i = 1, 2, \ldots, M$, in the dictionary and $N$ documents. The elements of $Freq$ are represented as $freq_{i,j}$, where each element indicates the frequency of the $i$th term in the $j$th document.
The Cranfield data collection is preprocessed to convert it into 1398 individual text files. Also, non-embedding special characters and numerals have been removed from these files. 79,728 words have been collected, which are then processed to find the frequency of unique words in each document. The dictionary of unique words contains 7805 words. Thus the term-frequency matrix is of size $7805 \times 1398$.
2. Generation of Query matrix and Term-weight calculations and result
After stop-list words and non-embedding special characters are removed, the remaining text is used as the query, which contributes to the
set of 1398 unique queries represented as $q$. Here, we have taken the titles of the documents as queries, instead of the dataset queries, so as to judge the relevancy more profoundly. The generated matrix for the 1398 queries is $Q_{1398 \times 7805}$. The term-frequency matrix is processed to get the term weights according to the term-weighting schemes.

2) The Results of PTTRISM
Figure 2: Index terms in a dictionary
Figure 2 shows the distribution of index terms in the dictionary for individual documents. The dictionary consists of 7805 unique terms.
Figure 3: Frequency count of each unique term among data collection
Figure 3 shows the frequency count of each unique term in the dictionary across the complete dataset. Some unique terms with high frequency across all documents are shown, such as (ISIS, 2059), (Qaeda, 1245), (hijack, 1076) and (Assassination, 897).
Figure 4: the distribution of the P value among the monitored persons
Figure 5: the distribution of the r value among the monitored persons
In Figure 5, we define 0.1 as the threshold of the risk index. If a person's risk index is beyond 0.1, he or she may become a potential terrorist; otherwise, he or she is more likely to be an ordinary person.
From the 1398 individual text files obtained from 1398 individuals, we can draw the conclusion that most people who have accessed terrorism-related information are not likely to become potential terrorists. Only 12 of the monitored persons are likely to become potential terrorists, and all of their risk indexes are beyond ...
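
The scoring chain of Sections 3.1.3 and 3.1.4 (term weights, cosine similarity, the thresholded score b(m), the text degree P, and the relative risk index r) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the paper, which mentions MATLAB. The function names and the toy tokens are hypothetical. Reading n_i in (3-1-2) as the count of term i and N_j as the number of tokens in the document is an assumption, as is reading (3-1-8) as a normalization over the monitored group, which cancels the unknown coefficient C.

```python
import math
from collections import Counter


def term_weights(tokens, vocabulary):
    """Weights w_i = -f_i * log(freq_i), after (3-1-2) to (3-1-4).

    Assumption: freq_i = n_i / N, with n_i the count of term i and N the
    number of tokens in the document, and f_i = freq_i / max(freq).
    """
    counts = Counter(tokens)
    total = len(tokens)
    freq = [counts[term] / total if total else 0.0 for term in vocabulary]
    max_freq = max(freq, default=0.0)
    if max_freq == 0.0:
        return [0.0] * len(vocabulary)
    # A term that never occurs gets weight 0 (the limit of x*log(x) at 0).
    return [-(fr / max_freq) * math.log(fr) if fr > 0.0 else 0.0 for fr in freq]


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors, after (3-1-1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def browsing_score(sim, threshold=0.5):
    """b(m) = sim * chi(sim), after (3-1-5); 0.5 is the paper's threshold."""
    return sim if sim >= threshold else 0.0


def text_degree(pages, query_tokens, vocabulary, threshold=0.5):
    """P = sum of b(m) over the visited pages divided by their number (3-1-6)."""
    if not pages:
        return 0.0
    query = term_weights(query_tokens, vocabulary)
    scores = [
        browsing_score(cosine_similarity(term_weights(page, vocabulary), query), threshold)
        for page in pages
    ]
    return sum(scores) / len(scores)


def relative_risk(p_values, alpha=1.809):
    """Relative risk index r, after (3-1-7) and (3-1-8).

    R(P) = C * P**(-alpha); dividing by the group total cancels C.
    """
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("each P must lie in (0, 1]")
    raw = [p ** (-alpha) for p in p_values]
    total = sum(raw)
    return [value / total for value in raw]


if __name__ == "__main__":
    # Toy data with neutral placeholder tokens; not taken from the paper.
    vocabulary = ["alpha", "beta", "gamma"]
    query_tokens = ["alpha", "beta"]
    pages = [["alpha", "beta"], ["gamma", "gamma", "delta"]]
    print(text_degree(pages, query_tokens, vocabulary))
    print(relative_risk([0.2, 0.5, 1.0]))
```

Because (3-1-7) is transcribed literally, a smaller P yields a larger R, so the values returned by `relative_risk` depend heavily on the smallest P in the group; the sketch rejects P = 0, which lies outside the stated domain.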