Computer Science and Software Engineering
http://hdl.handle.net/2374.MIA/96
2024-03-14T20:11:42Z
A Detailed Stylometric Investigation of the İnce Memed Tetralogy*
http://hdl.handle.net/2374.MIA/6680
Patton, Jon; Can, Fazli
We analyze the four İnce Memed novels of Yaşar Kemal using six style markers: “most frequent words,” “syllable counts,” “word type (part of speech) information,” “sentence length in terms of words,” “word length in text,” and “word length in vocabulary.” For the analysis, we divide each novel into 5,000-word text blocks and count the frequencies of each style marker in these blocks. The principal component analysis results show clear separation between the first two and the last two volumes; the blocks of the first two novels are also distinguishable from each other, whereas the blocks of the last two volumes are intermixed. This parallels the fact that the author planned the last two volumes as three separate novels but later condensed them into two. The style markers showing the best separation are “most frequent words” and “sentence length.” We use stepwise discriminant analysis to determine the best discriminators of each style marker and then use them in cross-validation. The results concur with the principal component analysis results. For example, cross-validation with “most frequent words” and “sentence length” assigns 87% and 81% of the text blocks, respectively, to their correct volumes. Further investigation based on multivariate analysis of variance (MANOVA) reveals how the attributes of each style marker group distinguish among the volumes.
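As a rough illustration of the block-based counting described in this abstract, the following minimal sketch splits a token list into fixed-size blocks and builds a word-length frequency profile for each block. The function names, the choice to drop a trailing partial block, and the cap on word length are assumptions made for the example; the abstract does not specify how the authors handled these details.

```python
from collections import Counter


def split_into_blocks(words, block_size=5000):
    """Split a list of word tokens into consecutive blocks of block_size words.

    Assumption: a trailing partial block is dropped so that all blocks
    have the same length and their frequencies are directly comparable.
    """
    n_full = len(words) // block_size
    return [words[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def word_length_profile(block, max_len=10):
    """Count how many words of each length (1..max_len) occur in a block.

    Assumption: words longer than max_len are pooled into the last bin.
    """
    counts = Counter(min(len(word), max_len) for word in block)
    return [counts.get(length, 0) for length in range(1, max_len + 1)]


if __name__ == "__main__":
    # Tiny hypothetical token list; a real study would tokenize a full novel.
    tokens = ["a", "bb", "ccc", "dd", "e"]
    for block in split_into_blocks(tokens, block_size=2):
        print(block, word_length_profile(block, max_len=3))
```

Each profile is one row of a block-by-feature table, which is the kind of input that principal component analysis or discriminant analysis would then operate on.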
Effectiveness Assessment of the Cover Coefficient Based Clustering Methodology
http://hdl.handle.net/2374.MIA/6659
Can, Fazli; Ozkarahan, Esen
An algorithm for document clustering is introduced. The basic concept of the algorithm, the Cover Coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the IR effectiveness of the algorithm is comparable to that of a very demanding complete linkage clustering method that is known to have good performance. The experiments also show that the algorithm is 15.1 to 63.5 percent (47.5 percent on average) better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm, and have shown improvements in retrieval effectiveness. Two document databases are used in the experiments: TODS and INSPEC; the latter is a common database with 12,684 documents.
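The abstract does not give the formula for estimating the number of clusters. The sketch below assumes the commonly cited formulation of the cover coefficient, in which the diagonal entry c_ii of the CC matrix (the "decoupling coefficient" of document i) is computed from a document-term matrix D as c_ii = (1 / row_sum_i) * Σ_k d_ik² / col_sum_k, and the estimated number of clusters is the sum of these diagonal entries. The function names are hypothetical, and the code covers only the cluster-count estimate, not seed selection or cluster formation.

```python
def decoupling_coefficients(doc_term):
    """Return c_ii for every document in a document-term matrix.

    doc_term is a list of rows (documents), each a list of non-negative
    term weights. Assumption: every document has at least one term and
    every term occurs in at least one document.
    """
    n_terms = len(doc_term[0])
    row_sums = [sum(row) for row in doc_term]
    col_sums = [sum(row[k] for row in doc_term) for k in range(n_terms)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        raise ValueError("empty documents or unused terms are not supported")
    return [
        sum(d * d / col_sums[k] for k, d in enumerate(row)) / row_sums[i]
        for i, row in enumerate(doc_term)
    ]


def estimate_cluster_count(doc_term):
    """Estimate the number of clusters as the sum of decoupling coefficients."""
    return sum(decoupling_coefficients(doc_term))


if __name__ == "__main__":
    # Two documents with no shared terms: fully decoupled, so two clusters.
    print(estimate_cluster_count([[1, 0], [0, 1]]))
    # Two identical documents: fully coupled, so one cluster.
    print(estimate_cluster_count([[1, 1], [1, 1]]))
```

Under this formulation the estimate needs only row and column sums plus one pass over the matrix, which is consistent with the low complexity the abstract claims for the clustering process.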
Towards the Realization of a DSML for Machine Learning: A Baseball Analytics Use Case
http://hdl.handle.net/2374.MIA/6224
Koseler, Kaan; Stephan, Matthew
Using machine learning (ML) for big data is challenging, requiring specialized knowledge of the domain, learning algorithms, and software engineering. To demonstrate the viability of model-driven engineering in the ML domain, we consider an ML use case of baseball analytics by extending and applying an existing, but untested, ML domain-specific modeling language (DSML). Additionally, we aim to make ML software development more accessible and formalized, and to help facilitate future research in this area. This paper describes our plan, initial work, and anticipated contributions in extending, testing, and validating this DSML, and in implementing a code generation scheme targeted at a binary classification baseball problem.
Keywords: Model-driven engineering; Domain-specific modeling languages; Machine learning; Analytics; Baseball
A Survey of Baseball Machine Learning: A Technical Report
http://hdl.handle.net/2374.MIA/6218
Koseler, Kaan; Stephan, Matthew
Statistical analysis of baseball has long been popular, albeit only in a limited capacity until relatively recently. The recent proliferation of computers has added tremendous power and opportunity to this field. Even an amateur baseball fan can perform types of analyses that were unimaginable decades ago. In particular, analysts can easily apply machine learning algorithms to large baseball data sets to derive meaningful and novel insights into player and team performance. These algorithms fall mostly under three problem classes: regression, binary classification, and multiclass classification. Professional teams have made extensive use of these algorithms, funding analytics departments within their own organizations and creating a thriving multi-million dollar industry. In the interest of stimulating new research and for the purpose of serving as a go-to resource for academic and industrial analysts, we have performed a systematic literature review of machine learning algorithms and approaches that have been applied to baseball analytics. We also provide our insights on possible future applications. We categorize all the approaches we encountered during our survey and summarize our findings in two tables. We find that two algorithms dominated the literature: (1) Support Vector Machines for classification problems and (2) Bayesian inference for both classification and regression problems. These algorithms are often implemented manually but can also be easily applied using existing software, such as WEKA or the scikit-learn Python library. We speculate that the current popularity of neural networks in the general machine learning literature will soon carry over into baseball analytics, although we found relatively few existing articles using this approach when compiling this report.