
Text style analysis using trace ratio criterion patch alignment embedding

Peng Tang, Mingbo Zhao*, Tommy W.S.
Department of Electronic Engineering, City University of Hong Kong

Neurocomputing 136 (2014) 201-212
DOI: 10.1016/j.neucom.2014.01.012

Article history: Received 25 June 2013; received in revised form 2 January 2014; accepted 6 January 2014; communicated by X. Gao; available online 29 January 2014.

Keywords: Text style analysis; Trace ratio criterion patch alignment embedding; Style markers; Text clustering

* Corresponding author. Tel.: +852 34427756; fax: +852 27887791.
0925-2312/$ - see front matter © 2014 Elsevier B.V.

Abstract

An effective algorithm for extracting cues of text styles is proposed in this paper. When processing document collections, the documents are first converted to a high-dimensional data set with the assistance of a group of style markers. We also employ the Trace Ratio Criterion Patch Alignment Embedding (TR-PAE) to obtain a lower-dimensional representation in a textual space. TR-PAE has the advantage that the inter-class separability and intra-class compactness are well characterized by the specially designed intrinsic graph and penalty graph, which are based on a discriminative patch alignment strategy. Another advantage is that the proposed method is based on the trace ratio criterion, which directly represents the average between-class distance and average within-class distance in the low-dimensional space. To evaluate our proposed algorithm, three corpuses are designed and collected using existing popular corpuses and real-life data covering diverse topics and genres. Extensive simulations are conducted to illustrate the feasibility and effectiveness of our implementation. Our simulations demonstrate that the proposed method is able to extract the deeply hidden information of styles of given documents, and that reliable analysis results on text styles can be provided efficiently.

© 2014 Elsevier B.V.

1. Introduction

The tasks of text analysis, such as text classification and document categorization, have gained a prominent status in the information systems field, due to the availability of documents created by the World Wide Web and the increasing demand to retrieve them by flexible means [1]. In these text analysis tasks, papers are usually measured and classified according to their contents, or topics, and genres, or types. A third type of text analysis can exist. For example, different writers can write in different styles when they describe the same thing, and experienced readers of English usually have no difficulty in telling whether an essay was written by a native English speaker or a non-native one by looking at the author's wording and writing style, which are distinct from the essay's genre or topic. Here, the cues that differentiate native speakers from English as a Second Language (ESL) speakers are writing text styles. Therefore, the style cues of documents can be an effective measure in automatic text processing tasks. In this research, we will analyze the text styles using style markers and machine learning approaches.

Document styles are obviously affected by topics and genres of documents. Consequently, approaches that handle topic-based or genre-based features in documents are of great help to our research on text analysis. The Term Frequency (TF), as well as the Term Frequency Inverse Document Frequency (TF-IDF), is applied to most document models like Vector Space Models or Probabilistic Models [2,3]. It can be noticed that simple textual features, like word (or term) frequency and length of sentences, are the most widely used. For instance, word length and sentence length features have been used to test genre classes and authorship [4-6]. Tweedie has pointed out that the richness of vocabulary highly depends on text length and is very unstable [7]. Many works have been established with the assistance of POS tagging. For example, POS tagging features are used to detect text genre [8-12]. N-grams mixed with POS tagging are used to investigate the influence of syntax structure [12]. Feldman et al. extracted text genre features with POS histograms and machine learning technologies [13]. Biber [14] defined “style markers”, regarded as a formal definition of style of texts, as a set of measurable patterns. Kessler identified four generic cues on the basis of style markers [8]. It is also believed that writers can be a determining factor for writing habits. There is also research work focusing on identifying the authorship of given documents. Style markers are utilized to deal with unrestricted text for an authorship-based classification, and an accuracy of 50% or above has been reported when a 10-author corpus is processed [15]. In [15], multiple regression and discriminant analysis are employed to analyse genres of documents. Similar approaches are also applied in web document classification [16-18]. In these analyses, the style markers of text, together with HTML tags and entities, are considered as textual features. By using regression and discriminant analysis, differences among given documents can be revealed. Using the occurrence frequency of the most widely used words from a training corpus as style markers has also been studied [19,20,15]. Textual features and self-organizing maps are used for text classification [1,21]. Proximity-based information between words to extract extra features of documents is
also widely used for retrieving information. Petkova and Croft propose a document representation model based on the proximity between occurrences of entities and terms [22]. Lv and Zhai propagate the word count using the so-called Positional Language Model to obtain a virtual propagated word count, which is then applied to other language models [23]. Different from the above method, our proposed method models a given text as a lexicon of weighted word pairs. In this paper, the weight of a word pair, calculated by using proximity-based kernels in many applications, refers to the closeness between the two terms of the word pair. Syntactic features perform better than simple textual statistics such as word frequency and length of sentences in genre classification [20,14]. It is reported, however, that the syntax-dependent features are computationally expensive and time-consuming [8]. To balance the computational performance and the effectiveness of the analysis result, we use POS bigrams and trigrams, which can encode useful syntactic information [24].

Most of the above-mentioned approaches to document analysis contain two parts. First, they generate a matrix by using a set of style markers. Second, they use regression, discriminant analysis, classifiers, or other machine learning methods to evaluate the results. Hence, if we want to improve the text analysis results, two approaches can be applied: (1) exploring more effective features and (2) utilizing more advanced machine learning methods that can better make use of hidden information in the original data. In this paper, we mainly focus on the latter approach. Specifically, to better represent the features of text or documents, we first collected several corpuses using real-life textual data and existing popular corpuses for text analysis in different scenarios. In this way, document analysis is transformed into a feature extraction problem with a high-dimensional and non-Gaussian data set. We then develop an effective approach to handle such a data set for document analysis.

However, dealing with high-dimensional data has always been a major problem for pattern recognition. Hence, dimensionality reduction techniques can be used to reduce the complexity of the original data and embed high-dimensional data into low-dimensional data, while keeping most of the desired intrinsic information [25,26]. Over the past decades, many dimensionality reduction methods have been proposed [27,28]. PCA pursues the direction of maximum variance for optimal reconstruction [29,30]. For linear supervised methods, LDA and its variants find the optimal solution that maximizes the distance between the means of the classes while minimizing the variance within each class [31]. Due to the utilization of label information, LDA can achieve better classification results than those obtained by PCA if sufficient labeled samples are provided.

To find the intrinsic manifold structure of the data, nonlinear dimensionality reduction methods such as ISOMAP [25], Locally Linear Embedding (LLE) [26] and Laplacian Eigenmap (LE) [32] were developed. These methods preserve the local structures and look for a direct nonlinear embedding of the data in a global coordinate system. For example, ISOMAP aims to preserve global geodesic distances of all pairs of measurements; LLE uses linear coefficients, which reconstruct a given measurement by its neighbors, to represent the local geometry; LE is able to preserve the proximity relationships by using an undirected weighted graph to indicate neighbor relations of pairwise measurements. But it is worth noting that all the above methods suffer from the out-of-sample problem [33]. To deal with the problem, He et al. [33] developed the Locality Preserving Projections (LPP), in which a linear projection matrix is used for mapping new-coming samples and which extends LE to its linear version.

The aforementioned methods are developed based on the specific knowledge of field experts for their own purposes. Recently, Yan et al. [27] demonstrated that several dimension reduction methods (e.g. PCA, LDA, ISOMAP, LLE and LE) can be unified in a graph embedding framework, in which the statistical and geometrical properties of the data are encoded as graph relationships. Zhang et al. [28] further reformulated several dimension reduction methods into a unified patch alignment framework (PAF), which consists of two parts: local patch construction and whole alignment, and showed that the above methods are different in the local patch construction stage and share an almost identical whole alignment stage. In addition, this general framework, which was originally used by local tangent space alignment (LTSA) [34], has been widely used in different fields such as correspondence construction [35], image retrieval [36] and distance metric learning [37] by constructing different patches corresponding to different applications.

In general, most of the above methods are unsupervised and do not use label information. However, label information is of great importance when handling classification problems. In addition, though LDA can achieve promising performance as a supervised method, it is developed based on the assumption that the samples in each class follow a Gaussian distribution. In many applications such as text classification problems, samples in a data set, however, may follow a non-Gaussian distribution that cannot satisfy the above assumption. Without this assumption, the separation of different classes may not be well characterized, which degrades the classification performance [31]. To solve this problem, some supervised dimensionality reduction methods have adopted the idea from the mentioned unsupervised manifold methods for better preserving the discriminative information. These methods usually start from the local structure of data and preserve the geometric information provided by data points and the label information. Typical methods include Supervised Locality Preserving Projection (SLPP) [38], Discriminative Locality Alignment (DLA) [28], Stable Orthogonal Local Discriminant Embedding (SOLDE) [39], Sparse Neighbor Selection and Sparse Representation based Enhancement (SNS-SRE) [40], Unsupervised Transfer Learning based Target Detection (UTLD) [41], etc.

To better unveil the hidden information in the given high-dimensional data created by a group of specified style markers, a fast trace ratio criterion patch alignment embedding (TR-PAE) method is introduced. Our proposed method has the advantage that the inter-class separability and intra-class compactness are well characterized by the specially designed intrinsic graph and penalty graph, which are based on the discriminative patch alignment method. This strategy is essential for extracting the deeply hidden information about text styles in document collections. Another advantage is that the proposed method is based on the trace ratio criterion, which directly represents the average between-class distance and average within-class distance in the low-dimensional space. This advantage is helpful to directly obtain intuitive text analysis results. In this paper, we have performed an extensive study using style marker collections, our collected corpuses and the proposed TR-PAE. The simulation results show that using our method, we can distinguish various styles of documents of different genres. Meanwhile, the styles of British and American written English, as well as English from Asian areas, can be separated. Moreover, our proposed algorithm can separate news items collected from the same media and of the same genre, but composed in different decades.

The rest of this paper is organized as follows. The corpuses and textual features, i.e. the style marker collections, collected and used in our study are addressed in Section 2. In Section 3, we briefly overview the conventional linear discriminant analysis. Our proposed fast trace ratio criterion patch alignment embedding
method is subsequently elaborated. In Section 4, data sets used in visualization and classifications, experimental configurations and corresponding experimental results are presented. We conclude our algorithm in Section 5.

2. Corpuses and features used

To our knowledge, there are few corpuses aimed at analyzing styles of text or documents. Most existing corpuses are for classification by the contents, topics, types, or genres of the documents. To conduct our text style analysis, we collected several corpuses using real-life textual data and existing popular corpuses for text analysis in different scenarios. The components of Corpus 1, Corpus 2 and Corpus 3 are listed in Tables 1-3, respectively. Corpus 1 consists of a group of materials covering various contents and genres, aiming to examine the performance of our proposed algorithm for analyzing the styles among different topics and genres. News and reportage items from different English media, i.e. the British and American English media as the native English media, and the English media in the CJK (China, Japan and Korea) area as the non-native English media, form Corpus 2. This corpus is designed for testing our algorithm when dealing with documents of the same genre but composed in different regions. Corpus 3 includes news and reports from the same English media and covering the same topics. We collect this corpus mainly for validating the feasibility of using our algorithm under such harsh conditions.

We choose some popular style markers of documents to construct a high-dimensional data set [18,15,42]. Here, each style marker is considered as a textual feature at the token level, lexical level or structural level. The 200 style markers used in our study are listed in Table 4, which means we construct a 200-dimensional data set by using them. Token-level features include the classic term frequencies for the words, numbers, punctuation marks, and special symbols. Lexical-level features consist of both function words and content words, POS-tagged tokens and some useful word usage statistics that indicate the high- and low-frequency collocation information among words and some other textual statistics. In our study, we obtain the listed style markers with the assistance of scripts and the package NLTK. Compared to the features used in [18,15,42], we removed features related to topics and genres so that the bias caused by topics and genres can be minimized. Moreover, features concerning frequent/function words are enhanced. Such methods also aim to eliminate the bias caused by genres and topics of different documents.

3. Feature extraction based on fast trace ratio criterion linear discriminant analysis

3.1. Related work

3.1.1. Review of trace ratio linear discriminant analysis

LDA uses the within-class scatter matrix $S_w$ to evaluate the compactness within each class and the between-class scatter matrix $S_b$ to evaluate the separability of different classes. The goal of LDA is to find a linear transformation matrix $W \in \mathbb{R}^{D \times d}$ for which the trace of the between-class scatter matrix is maximized, while the trace of the within-class scatter matrix is minimized. Let $X = \{x_1, x_2, \ldots, x_l\} \in \mathbb{R}^{D \times l}$ be the training set, where each $x_i$ belongs to a class $c_i \in \{1, 2, \ldots, c\}$. Let $l_i$ be the number of data points in the $i$th class and $l$ be the number of data points in all classes. Then, the between-class scatter matrix $S_b$, within-class scatter matrix $S_w$, and total-class scatter matrix $S_t$ are defined as follows:

$$S_b = \sum_{i=1}^{c} l_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad S_w = \sum_{i=1}^{c} \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_t = \sum_{i=1}^{c} \sum_{x \in c_i} (x - \mu)(x - \mu)^T,$$

where $\mu_i$ is the mean of the data points in the $i$th class and $\mu$ is the mean of all data points.
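As a rough illustration of the quantities reviewed in Section 3.1.1, the sketch below computes the between-class, within-class and total scatter matrices and evaluates the trace ratio tr(W^T S_b W) / tr(W^T S_w W) for a candidate projection. It assumes the standard LDA definitions built from class means and a global mean, stores each sample as a row (rather than as a column of X), and uses hypothetical helper names (`scatter_matrices`, `trace_ratio`). It is plain trace ratio bookkeeping in pure Python, not the authors' TR-PAE implementation.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def add_outer(M, u, scale=1.0):
    """Return M + scale * u u^T without modifying M."""
    return [[M[i][j] + scale * u[i] * u[j] for j in range(len(u))]
            for i in range(len(u))]


def scatter_matrices(samples, labels):
    """Return (S_b, S_w, S_t); each sample is a row of D features."""
    D = len(samples[0])
    zero = [[0.0] * D for _ in range(D)]
    mu = mean_vector(samples)                      # global mean
    S_b, S_w, S_t = zero, zero, zero               # add_outer never mutates
    for c in sorted(set(labels)):
        members = [x for x, y in zip(samples, labels) if y == c]
        mu_c = mean_vector(members)                # class mean
        diff = [a - b for a, b in zip(mu_c, mu)]
        S_b = add_outer(S_b, diff, scale=len(members))
        for x in members:
            S_w = add_outer(S_w, [a - b for a, b in zip(x, mu_c)])
    for x in samples:
        S_t = add_outer(S_t, [a - b for a, b in zip(x, mu)])
    return S_b, S_w, S_t


def trace_ratio(columns, S_b, S_w):
    """tr(W^T S_b W) / tr(W^T S_w W), with W given as a list of its columns."""
    def quad(S, w):
        return sum(w[i] * S[i][j] * w[j]
                   for i in range(len(w)) for j in range(len(w)))
    num = sum(quad(S_b, w) for w in columns)
    den = sum(quad(S_w, w) for w in columns)
    return num / den


# Toy data: two well-separated classes in two dimensions.
samples = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5],
           [6.0, 7.0], [7.0, 6.0], [6.5, 6.5]]
labels = [0, 0, 0, 1, 1, 1]
S_b, S_w, S_t = scatter_matrices(samples, labels)
print(trace_ratio([[1.0, 0.0]], S_b, S_w))
```

A larger ratio means the projected classes are further apart relative to their internal spread, which is the sense in which the criterion compares between-class and within-class distances. The total scatter returned here equals the sum of the other two, which gives a quick sanity check on any implementation.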

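To make the token-level and lexical-level style markers of Section 2 more concrete, here is a minimal sketch that turns a document into a handful of such features. The feature names, the five-word function-word list and the regular-expression tokenizer are all assumptions made for illustration. The paper's 200 markers (Table 4) and its NLTK-based scripts are not reproduced here.

```python
import re
import string

# Hypothetical, tiny function-word list chosen only for this example.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def style_markers(text):
    """Map a document to a small dictionary of illustrative style markers."""
    # Naive sentence split on terminal punctuation (decimals would be split too).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Tokens: alphabetic words, numbers, or single non-space symbols.
    tokens = re.findall(r"[A-Za-z']+|\d+(?:\.\d+)?|[^\w\s]", text)
    words = [t.lower() for t in tokens if t[0].isalpha()]
    numbers = [t for t in tokens if t[0].isdigit()]
    puncts = [t for t in tokens if t in string.punctuation]
    n_tokens = max(len(tokens), 1)
    n_words = max(len(words), 1)
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "number_ratio": len(numbers) / n_tokens,
        "punctuation_ratio": len(puncts) / n_tokens,
    }
    # Relative frequency of each function word among all words.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = words.count(fw) / n_words
    return features


markers = style_markers("The cat sat. The dog ate 2 bones!")
for name, value in markers.items():
    print(name, round(value, 3))
```

Stacking one such vector per document gives the kind of document-by-marker matrix that a dimensionality reduction method can then take as input. Function-word frequencies are used here because they depend little on topic, in line with the stated aim of reducing topic and genre bias.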