CSDS

SHORT COURSES

Short Course 1

Hugo Tremonte de Carvalho (Federal University of Rio de Janeiro, Brazil).

Title: Applications of Markov Models in Music
Abstract: Music is a fundamental aspect of our culture and was one of the first media of communication in human history. Since music is a product of our cognition, some sort of regularity is expected to be embedded within it, independently of the musical genre under consideration. In this short course we analyze this phenomenon from a statistical viewpoint; more precisely, we present how Markov models can be employed to estimate high-level information such as tonality and chord progression from the sheet music and the waveform, respectively. The first part of the course introduces the relevant music-theoretical concepts, so previous musical knowledge is desirable but not required. (slides in English, talk in Portuguese).
References: Meinard Müller - Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. David Temperley - Music and Probability.
Bio: Hugo T. de Carvalho holds undergraduate and master's degrees in Applied Mathematics (UFRJ, 2011 and 2013, respectively), and completed his D.Sc. in Electrical Engineering in 2017 at COPPE/UFRJ, where he united two of his greatest passions, Music and Mathematics, by employing Bayesian methods in audio restoration. Hugo is now an assistant professor at the Department of Statistical Methods (IM/UFRJ), a member of the MusMat Research Group, and a member of the editorial board of MusMat • Brazilian Journal of Music and Mathematics. His main research interests are Music and Mathematics, Music Information Retrieval, Statistical Signal Processing, and Computational Statistics. In his spare time, Hugo plays classical guitar and is learning to play the harmonica.

Short Course 2

David Banks (Duke University, USA).

Title: Nonparametric Regression
Abstract: This course describes nonparametric regression, including the additive model and its generalizations, as well as the LASSO and LARS. It then proceeds to classification (SVMs, random forests, boosting). The Curse of Dimensionality is described in both contexts. The emphasis is on the strengths and weaknesses of the tools, with guidance on when a particular method should be used. The audience should have a basic knowledge of multiple linear regression.
Bio: David Banks is a professor of statistics at Duke University and a fellow of the ASA, IMS and AAAS. He is a past editor of the Journal of the American Statistical Association and founding editor of Statistics and Public Policy. His research areas include agent-based models, adversarial risk analysis, dynamic networks, text data, and human rights statistics.

Short Course 3

Title: Introduction to Big Data Analytics with R
Abstract: Big data analytics examines large-scale data to uncover hidden patterns, correlations and other insights. The analysis of massive databases is a key issue for most applications today, and parallel computing is one suitable approach to it. In general, the data analysis community has difficulty handling massive amounts of data on local machines and often requires high-performance computing servers. Apache Spark is a widely used tool in this context, designed to process large amounts of data in a distributed way. In this short course we present an introduction to the analysis of massive data sets, covering how to import and manipulate data and how to apply supervised and unsupervised algorithms using the R language, with a focus on the sparklyr package.

Samuel Macêdo (CCNM/DAFG/IFPE, Brazil).

Bio: Samuel Macêdo is a statistician and is pursuing a doctorate in computer science, both at UFPE. He is a specialist in R programming and has worked since 2012 with machine learning, big data, cloud computing, and package and app development. On the development side, he is a contributor to and committer of RStudio's sparklyr package and the author of the variantspark, sparkhail, and raws.profile packages. He is the developer and coordinator of the ARIA project, software produced for the Ministry of Education (MEC) to automate the updating of the student dataset of the SISTEC platform. Twelve institutes in Brazil currently use this software. Since February 2021 he has been posting videos on YouTube teaching data science and R programming: https://www.youtube.com/channel/UCj19IjFlNSmCAOwrlq-Svbg

Anderson Ara (LEG/DEST/UFPR, Brazil).

Bio: Anderson Ara holds a Bachelor's degree (2009) and a Master's degree (2011) in Statistics from the Federal University of São Carlos (UFSCar), and a PhD in Statistics (2016) through the Graduate Programs in Statistics (PPGEst-UFSCar) and Graduate Studies in Computer Science (PPG-CC-UFSCar). He is an Assistant Professor at the Department of Statistics, Federal University of Paraná (DEST-UFPR), and a lecturer in the Specialization in Data Science & Big Data (DSBD-UFPR), the MBA in Financial Analytics (DAAGE-UTFPR), and the Specialization in Data Science and Big Data (ECD-UFBA). He is a researcher at the Statistics and Geoinformation Laboratory (LEG-UFPR) and at the Center for Data Integration and Knowledge for Health (CIDACS/Fiocruz), as well as in the Graduate Program in Numerical Methods in Engineering (PPGMNE) at UFPR and the Graduate Program in Mathematics (PGMAT) at UFBA. His main interests are Statistical Machine Learning, Statistical Inference, and Computational Methods. He is a research fellow of the Microsoft AI for Earth Program and of the Wellcome Trust Foundation.

Short Course 4

Tahir Ekin (Texas State University, USA).

Title: Fraud Analytics
Abstract: Fraud has been around since the early days of commerce, continuously evolving and adapting to changing times. Fraudulent cases are seen in a wide range of domains such as finance, credit cards, telecommunications, insurance and health care. Examples include, but are not limited to, post-COVID-19 instances in financial stimulus, unemployment eligibility and health care procurement. In health care, for instance, overpayments are estimated to amount to up to ten percent of total expenditures. This short course presents the use of analytical methods for fraud assessment. Fraud data and its types will be introduced with some examples and pre-processing techniques. Next, the course will cover the use of visualization and unsupervised methods (outlier detection, clustering, topic models) to describe data and reveal hidden relationships. Supervised methods such as classification and regression, in contrast, can be used with labeled data sets for prediction. These methods will be discussed using examples from the finance and health care industries.
Bio: Tahir Ekin is an associate professor of quantitative methods in the McCoy College of Business, Texas State University. His book Statistics and Health Care Fraud: How to Save Billions was published as part of the ASA/CRC Series on Statistical Reasoning in Science and Society. His work has been published in a variety of journals, including International Statistical Review, Applied Statistics and American Statistician. He has given training on fraud analytics in workshops sponsored by the European Health Care Fraud and Corruption Network, ISI and INFORMS. Dr. Ekin is an elected member of ISI.

Short Course 5

Crysttian Paixão (Federal University of Santa Catarina, Brazil).

Title: Preparation of textual database for analysis
Abstract: Many databases are currently available, notably those from social networks. These databases consist of text messages generated by their users, in different formats and standards. Because of this lack of uniformity, the databases require preprocessing. Having access to the data therefore does not mean that analysis can begin immediately; preparatory steps are needed. Preparation essentially consists of standardizing the textual database so that information can be extracted from it. In this mini course, the main text manipulation procedures will be presented, with a brief overview of regular expressions and some analyses. (slides in English, talk in Portuguese).
Bio: Crysttian Paixão holds a Bachelor's Degree in Computer Science (2006), a Master's in Systems Engineering (2008), and a Ph.D. in Agricultural Statistics and Experimentation (2012), all from the Federal University of Lavras, and completed a postdoc at the School of Applied Mathematics (EMAp) at Fundação Getulio Vargas. He is currently Adjunct Professor A at the Federal University of Santa Catarina. He develops research projects in the areas of Statistics, Computational Modeling and Artificial Intelligence. He is a member of the Theoretical, Applied and Computational Statistical Research Group of the Directory of Research Groups in Brazil.

Third Conference on Statistics and Data Science - CSDS 2021

Salvador, 28 - 30 October, 2021 (online)