---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- research
- machine-learning
- reinforcement-learning
- bandits
pretty_name: Michal Valko Research Papers
---

# Michal Valko - Research Papers Collection

This dataset contains references to research papers by Michal Valko and collaborators.

## Papers

- [RL-finetuning LLMs from on- and off-policy data with a single algorithm](https://arxiv.org/abs/2503.19612)
- [Preference optimization with multi-sample comparisons](https://arxiv.org/abs/2410.12138)
- [Optimal design for reward modeling in RLHF](https://arxiv.org/abs/2410.17055)
- [A new bound on the cumulant generating function of Dirichlet processes](https://arxiv.org/abs/2409.18621)
- [Understanding the performance gap between online and offline alignment algorithms](https://arxiv.org/abs/2405.08448)
- [Sharp deviations bounds for Dirichlet weighted sums with application to analysis of Bayesian algorithms](https://arxiv.org/abs/2304.03056)
- [KL-entropy-regularized RL with a generative model is minimax optimal](https://arxiv.org/abs/2205.14211)
- [Accelerating Nash learning from human feedback via Mirror Prox](https://arxiv.org/abs/2505.19731)
- [The Llama 3 herd of models](https://arxiv.org/abs/2407.21783)
- [Metacognitive capabilities of LLMs: An exploration in mathematical problem solving](https://arxiv.org/abs/2405.12205)
- [Local and adaptive mirror descents in extensive-form games](https://arxiv.org/abs/2309.00656)
- [Nash learning from human feedback](https://arxiv.org/abs/2312.00886)
- [Human alignment of large language models through online preference optimisation](https://arxiv.org/abs/2403.08635)
- [Generalized preference optimization: A unified approach to offline alignment](https://arxiv.org/abs/2402.05749)
- [Decoding-time realignment of language models](https://arxiv.org/abs/2402.02992)
- [A general theoretical paradigm to understand learning from human preferences](https://arxiv.org/abs/2310.12036)
- [Unlocking the power of representations in long-term novelty-based exploration](https://arxiv.org/abs/2305.01521)
- [Demonstration-regularized RL](https://arxiv.org/abs/2310.17303)
- [Model-free posterior sampling via learning rate randomization](https://arxiv.org/abs/2310.18186)
- [Curiosity in hindsight: Intrinsic exploration in stochastic environments](https://arxiv.org/abs/2211.10515)
- [VA-learning as a more efficient alternative to Q-learning](https://arxiv.org/abs/2305.18161)
- [Fast rates for maximum entropy exploration](https://arxiv.org/abs/2303.08059)
- [Adapting to game trees in zero-sum imperfect information games](https://arxiv.org/abs/2212.12567)
- [Understanding self-predictive learning for reinforcement learning](https://arxiv.org/abs/2212.03319)
- [DoMo-AC: Doubly multi-step off-policy actor-critic algorithm](https://arxiv.org/abs/2305.18501)
- [Regularization and variance-weighted regression achieves minimax optimality in linear MDPs: Theory and practice](https://arxiv.org/abs/2305.13185)
- [Quantile credit assignment](https://arxiv.org/abs/2302.14041)
- [Half-Hop: A graph upsampling approach for slowing down message passing](https://arxiv.org/abs/2308.09198)
- [BYOL-Explore: Exploration by bootstrapped prediction](https://arxiv.org/abs/2206.08332)
- [Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees](https://arxiv.org/abs/2209.14414)
- [From Dirichlet to Rubin: Optimistic exploration in RL without bonuses](https://arxiv.org/abs/2205.07704)
- [Retrieval-augmented reinforcement learning](https://arxiv.org/abs/2202.08417)
- [Scaling Gaussian process optimization by evaluating a few unique candidates multiple times](https://arxiv.org/abs/2201.12909)
- [Large-scale representation learning on graphs via bootstrapping](https://arxiv.org/abs/2102.06514)
- [Adaptive multi-goal exploration](https://arxiv.org/abs/2111.12045)
- [Marginalized operators for off-policy reinforcement learning](https://arxiv.org/abs/2203.16177)
- [Drop, Swap, and Generate: A self-supervised approach for generating neural activity](https://arxiv.org/abs/2111.02338)
- [Stochastic shortest path: minimax, parameter-free and towards horizon-free regret](https://arxiv.org/abs/2104.11186)
- [A provably efficient sample collection strategy for reinforcement learning](https://arxiv.org/abs/2007.06437)
- [Model-free learning for two-player zero-sum partially observable Markov games with perfect recall](https://arxiv.org/abs/2106.06279)
- [Unifying gradient estimators for meta-reinforcement learning via off-policy evaluation](https://arxiv.org/abs/2106.13125)
- [Broaden your views for self-supervised video learning](https://arxiv.org/abs/2103.16559)
- [UCB Momentum Q-learning: Correcting the bias without forgetting](https://arxiv.org/abs/2103.01312)
- [Fast active learning for pure exploration in reinforcement learning](https://arxiv.org/abs/2007.13442)
- [Revisiting Peng's Q(λ) for modern reinforcement learning](https://arxiv.org/abs/2103.00107)
- [Taylor expansion of discount factors](https://arxiv.org/abs/2106.06170)
- [Online A-optimal design and active linear regression](https://arxiv.org/abs/1906.08509)
- [Kernel-based reinforcement learning: A finite-time analysis](https://arxiv.org/abs/2004.05599)
- [Game plan: What AI can do for football, and what football can do for AI](https://arxiv.org/abs/2011.09192)
- [A kernel-based approach to non-stationary reinforcement learning in metric spaces](https://arxiv.org/abs/2007.05078)
- [Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited](https://arxiv.org/abs/2010.03531)
- [Adaptive reward-free exploration](https://arxiv.org/abs/2006.06294)
- [Fast sampling from β-ensembles](https://arxiv.org/abs/2003.02344)
- [Mine Your Own vieW: Self-supervised learning through across-sample prediction](https://arxiv.org/abs/2102.10106)
- [Bootstrap Your Own Latent: A new approach to self-supervised learning](https://arxiv.org/abs/2006.07733)
- [BYOL works even without batch statistics](https://arxiv.org/abs/2010.10241)
- [Improved sample complexity for incremental autonomous exploration in MDPs](https://arxiv.org/abs/2012.14755)
- [Sampling from a k-DPP without looking at all items](https://arxiv.org/abs/2006.16947)
- [Statistical efficiency of Thompson sampling for combinatorial semi-bandits](https://arxiv.org/abs/2006.06613)
- [Planning in Markov decision processes with gap-dependent sample complexity](https://arxiv.org/abs/2006.05879)
- [Monte-Carlo tree search as regularized policy optimization](https://arxiv.org/abs/2007.12509)
- [Taylor expansion policy optimization](https://arxiv.org/abs/2003.06259)
- [Gamification of pure exploration for linear bandits](https://arxiv.org/abs/2007.00953)
- [No-regret exploration in goal-oriented reinforcement learning](https://arxiv.org/abs/1912.03517)
- [Improved sleeping bandits with stochastic action sets and adversarial rewards](https://arxiv.org/abs/2004.06248)
- [Stochastic bandits with arm-dependent delays](https://arxiv.org/abs/2006.10459)
- [Near-linear time Gaussian process optimization with adaptive batching and resparsification](https://arxiv.org/abs/2002.09954)
- [Fixed-confidence guarantees for Bayesian best-arm identification](https://arxiv.org/abs/1910.10945)
- [Multiagent evaluation under incomplete information](https://arxiv.org/abs/1909.09849)
- [Exact sampling of determinantal point processes with sublinear time preprocessing](https://arxiv.org/abs/1905.13476)
- [Exploiting structure of uncertainty for efficient matroid semi-bandits](https://arxiv.org/abs/1902.03794)
- [DPPy: Sampling determinantal point processes with Python](https://arxiv.org/abs/1809.07258)
- [Rotting bandits are not harder than stochastic ones](https://arxiv.org/abs/1811.11043)
- [Finding the bandit in a graph: Sequential search-and-stop](https://arxiv.org/abs/1806.02282)
- [Optimistic optimization of a Brownian](https://arxiv.org/abs/1901.04884)
- [Second-order kernel online convex optimization with adaptive sketching](https://arxiv.org/abs/1706.04892)
- [Zonotope hit-and-run for efficient sampling from projection DPPs](https://arxiv.org/abs/1705.10498)
- [Distributed adaptive sampling for kernel matrix approximation](https://arxiv.org/abs/1803.10172)
- [Simple regret for infinitely many armed bandits](https://arxiv.org/abs/1505.04627)
- [Cheap Bandits](https://arxiv.org/abs/1506.04782)
- [Geometric entropic exploration](https://arxiv.org/abs/2101.02055)
- [On the approximation relationship between optimizing ratio of submodular (RS) and difference of submodular (DS) functions](https://arxiv.org/abs/2101.01631)
- [Learning to Act Greedily: Polymatroid Semi-Bandits](https://arxiv.org/abs/1405.7752)

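Every entry above follows the same `[title](https://arxiv.org/abs/ID)` pattern, so the list can be mined programmatically. A minimal sketch with the Python standard library (the inline `readme_text` sample stands in for the full README text, which you would read from disk in practice; the regex and variable names are illustrative, not part of the dataset):

```python
import re

# A few entries copied from the list above; in practice, load the full
# README.md text here instead of this inline sample.
readme_text = """
- [Nash learning from human feedback](https://arxiv.org/abs/2312.00886)
- [Taylor expansion policy optimization](https://arxiv.org/abs/2003.06259)
- [Cheap Bandits](https://arxiv.org/abs/1506.04782)
"""

# Match markdown links of the form [title](https://arxiv.org/abs/ID).
PAPER_RE = re.compile(r"\[([^\]]+)\]\(https://arxiv\.org/abs/([0-9.]+)\)")

# Map each arXiv identifier to its paper title.
papers = {arxiv_id: title for title, arxiv_id in PAPER_RE.findall(readme_text)}
print(papers["2312.00886"])  # → Nash learning from human feedback
```

The resulting mapping can then be joined with other metadata (e.g. abstracts fetched from arXiv) as needed.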
## Citation

If you use any of these papers, please cite the original work.

## Contact

- Website: https://researchers.mila.quebec/en/profile/michal-valko
- HAL: https://hal.science/search/index/?q=authIdHal_s:michal