The rapid change in computational capabilities has made Big Data a major modern statistical challenge. Less well understood is the rise of Complex Data as a perhaps greater challenge. Object Oriented Data Analysis (OODA) is a framework for addressing this, in particular providing a general approach to the definition, representation, visualization and analysis of Complex Data. The notion of OODA generally guides data analysis by providing a useful terminology for interdisciplinary discussion of the many choices typically needed in modern complex data analyses. The main ideas are illustrated through several OODA contexts, including shapes, trees (in the sense of graph theory), covariance matrices and nonnegative curves as data objects.
Time-dependent regionalization, or spatially restricted grouping, is an important field of research. One of its main goals is to identify and cluster neighboring regions with similar dynamics, while also investigating how these spatial patterns evolve over time. Regionalization, or spatially restricted clustering, is challenging because partitioning a map involves searching over a very large number of possible partitions. The task is even more challenging in the spatio-temporal clustering context.
Joint work with Jessica Pavani (Calgary University) and Fernando A. Quintana (PUC-Chile).
A fundamental task in the statistical analysis of data is to detect and estimate interesting "structures" hidden in it. In this talk I'll focus on aspects of this problem under a high dimensional regime, where each observed sample has many coordinates and the number of samples is limited. We will show how in such cases: (i) standard methods to detect structure in high dimensions, such as principal component analysis, may not work well; (ii) sparsity can come to the rescue, although it brings significant statistical and computational challenges; and (iii) some interesting phenomena may occur in semi-supervised learning settings, where for a few of the samples we are also given their underlying labels. Specifically, merging labeled and unlabeled data may have significant computational benefits in high dimensions.
Generalized additive partial linear models combine two well-known classes of regression: generalized linear models and generalized additive models. A single model can thus combine parametric effects, where the interest lies in interpreting the estimated coefficients, with nonlinear relationships between continuous explanatory variables and the response, where the interest lies in controlling for the effects of those continuous variables and in predicting the response. The main purpose of penalization is to reduce the parameter space of the additive functions and thereby smooth the fitted curves. All of this complicates the development of iterative procedures for parameter estimation and for inference, since one works with penalized, that is, non-regular, likelihoods. There are two main lines of parametric estimation in this class: direct estimation through Fisher scoring type procedures and conditional estimation through the Gauss-Seidel procedure. One of the most critical points in this kind of modeling, however, is the estimation of the smoothing parameters, for which there are numerous proposals for global and local estimation. Generalized partially linear single-index models, in turn, seek to reduce the dimension of the additive functions by grouping the continuous explanatory variables as linear predictors inside additive functions. This generally yields a substantial parametric reduction, although at the cost of more complex estimation and inference. One advantage of this approach is the possibility of constructing specific indices, such as meteorological and econometric indices, among others. Existing algorithms for single-index modeling have shown instability and a lack of efficiency for moderate and large samples, so developing more stable algorithms remains a challenge (Silva & Paula).
In this talk we will review these two classes of models, discussing the main methodological and computational aspects, as well as extensions to more general classes and applications to real data sets.
Title and abstract will be announced soon.
This talk presents two recently proposed dynamic models for different kinds of remote sensing images. The models share a similar dynamic structure but differ in their random component, reflecting the characteristics of each system and signal type. To model amplitude data from synthetic aperture radar (SAR) images, which are positive continuous data, the Rayleigh distribution is considered. To model images with doubly bounded values, such as the NDVI vegetation index, the Kumaraswamy distribution is assumed. Fundamental concepts of these sensors and of image processing will be presented, motivating each of the models. Aspects of inference on the parameters and detection algorithms will be introduced and evaluated numerically. Computational experiments on real images demonstrate the usefulness of the proposed methodologies in practical image processing problems. Finally, ongoing work and future topics will be discussed.
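For readers unfamiliar with the two distributions named in this abstract, the following Python sketch writes out their standard textbook forms. It is only an illustration: the parameter names `sigma`, `a` and `b` and all function names are assumptions made here, the Kumaraswamy law is shown on its usual support (0, 1), and the dynamic models in the talk may well use a different parameterization.

```python
import math


def rayleigh_cdf(x, sigma):
    """Rayleigh CDF for x >= 0 with scale sigma > 0 (textbook form)."""
    return 1.0 - math.exp(-x * x / (2.0 * sigma * sigma))


def rayleigh_quantile(u, sigma):
    """Inverse of the Rayleigh CDF for 0 <= u < 1."""
    return sigma * math.sqrt(-2.0 * math.log(1.0 - u))


def kumaraswamy_pdf(x, a, b):
    """Kumaraswamy density on (0, 1) with shape parameters a, b > 0."""
    return a * b * x ** (a - 1.0) * (1.0 - x ** a) ** (b - 1.0)


def kumaraswamy_cdf(x, a, b):
    """Kumaraswamy CDF on (0, 1)."""
    return 1.0 - (1.0 - x ** a) ** b


def kumaraswamy_quantile(u, a, b):
    """Inverse of the Kumaraswamy CDF for 0 < u < 1."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
```

Both quantile functions are available in closed form, which makes simulation by inverse transform sampling straightforward.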
Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it samples a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This talk focuses on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
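To make the idea of kernel smoothing for value estimation concrete, here is a minimal Python sketch of a Nadaraya-Watson estimator used as a baseline. It is not the speakers' method: the use of a numeric feature vector per prompt, the Gaussian kernel, the fixed `bandwidth`, and all function names are assumptions chosen for illustration.

```python
import math


def gaussian_weight(x, y, bandwidth):
    """Gaussian kernel weight between two feature vectors (hypothetical prompt features)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-0.5 * sq_dist / bandwidth ** 2)


def kernel_value(query, features, rewards, bandwidth=1.0):
    """Nadaraya-Watson estimate of the expected reward at `query`.

    `features[i]` describes the prompt of trace i and `rewards[i]` is that
    trace's observed reward; nearby prompts share information.
    """
    weights = [gaussian_weight(query, f, bandwidth) for f in features]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the global mean reward.
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total


def kernel_advantages(features, rewards, bandwidth=1.0):
    """Reward minus the kernel-smoothed value, one entry per sampled trace."""
    return [
        r - kernel_value(f, features, rewards, bandwidth)
        for f, r in zip(features, rewards)
    ]
```

The resulting advantages could then weight the log-probability gradients of the sampled traces. In this sketch each trace also contributes to its own baseline; a leave-one-out variant would be a natural refinement.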
Symbolic Data Analysis (SDA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) offer an innovative path toward privacy in supervised machine learning algorithms. Through the use of SDA and t-SNE, sensitive information can be generalised, protecting individuals from re-identification and meeting regulatory privacy standards. The main challenges are computational complexity and a potential loss of accuracy. In the era of data-driven decision making, we propose a new approach that uses SDA and t-SNE as tools to create supervised machine learning models that guarantee individual privacy, aiming to meet ethical and legal requirements while allowing researchers to explore the insights needed for innovation without access to the original data.
Scientific inquiry often involves uncovering causal relationships and distinguishing these from correlations. Causal inference provides a framework for establishing causal effects. This involves carefully formulating the causal question of interest, specifying the estimand, and explicitly stating the assumptions under which it can be estimated using the data at hand. The estimation step often involves developing and using novel statistical and computational tools. Machine learning has now become established for prediction problems, but does not perform well for causal tasks. There is also interest in using causal reasoning when building and interpreting machine learning algorithms. Doing so can help reduce unfairness and other algorithmic biases stemming from the training data not being representative of the target population. Causality can also help with interpretability and explainability of machine learning outputs. In this talk, I will review causal machine learning, a framework to 'de-bias' standard machine learning algorithms so they perform well for causal tasks. I will also discuss the role causal inference can play in machine learning to improve algorithmic fairness and explainability.
Traditional Arithmetic Reduction of Age (ARA) models, commonly using a Power Law Process (PLP) with β > 1, primarily assume system degradation. However, some systems exhibit initial improvement (β < 1) due to factors like adaptation or updates, a scenario current ARA models inadequately address. This study introduces ARAM, a modified ARA model capable of robustly modeling systems experiencing either degradation or initial improvement between failures. Additionally, we propose a novel PLP reparameterization as a time truncation, which preserves the original interpretation of PLP parameters. Illustrated with a real-world dataset, ARAM emerges as a valuable tool for enhancing the understanding of reliability in systems subject to imperfect repairs and varied performance trends.
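As background for the role of β, the sketch below writes out the textbook Power Law Process intensity and a standard ARA1-type intensity in Python. It does not implement ARAM or the reparameterization proposed in the study: the scale parameter `eta`, the repair-efficiency parameter `rho`, and all function names are assumptions introduced here for illustration.

```python
import math


def plp_intensity(t, beta, eta):
    """Power Law Process intensity: (beta / eta) * (t / eta) ** (beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def plp_cumulative(t, beta, eta):
    """Expected number of failures in (0, t] under the PLP: (t / eta) ** beta."""
    return (t / eta) ** beta


def ara1_intensity(t, last_failure_time, rho, beta, eta):
    """Textbook ARA1 intensity: the PLP intensity evaluated at a reduced (virtual) age.

    `rho` in [0, 1] is the assumed repair efficiency; rho = 0 leaves the age
    unchanged (minimal repair) and rho = 1 removes all age accumulated up to
    the last failure.
    """
    virtual_age = t - rho * last_failure_time
    return plp_intensity(virtual_age, beta, eta)
```

With β > 1 the baseline intensity increases with age, so reducing the age lowers the failure intensity. With β < 1 the baseline intensity decreases with age, so the same age reduction raises the intensity, which appears to be the kind of situation the abstract describes as inadequately handled by standard ARA models.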