---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- research
- machine-learning
- reinforcement-learning
- bandits
pretty_name: Michal Valko Research Papers
---

# Michal Valko - Research Papers Collection

This dataset contains references to research papers by Michal Valko and collaborators.
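
## Usage

Since the paper list lives in this card rather than in a separate data file, the simplest programmatic access is to download the card itself from the Hub. A minimal sketch using `huggingface_hub` follows; the repo id `mvalko/research-papers` is a hypothetical placeholder, so substitute the actual dataset id.

```python
# Minimal sketch: fetch this dataset card from the Hugging Face Hub.
# NOTE: the repo id below is a hypothetical placeholder, not the real id.
from huggingface_hub import hf_hub_download

readme_path = hf_hub_download(
    repo_id="mvalko/research-papers",  # hypothetical placeholder
    filename="README.md",
    repo_type="dataset",
)

with open(readme_path, encoding="utf-8") as f:
    card = f.read()

print(card[:200])  # peek at the card text
```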

## Papers

- [RL-finetuning LLMs from on- and off-policy data with a single algorithm](https://arxiv.org/abs/2503.19612)
- [Preference optimization with multi-sample comparisons](https://arxiv.org/abs/2410.12138)
- [Optimal design for reward modeling in RLHF](https://arxiv.org/abs/2410.17055)
- [A new bound on the cumulant generating function of Dirichlet processes](https://arxiv.org/abs/2409.18621)
- [Understanding the performance gap between online and offline alignment algorithms](https://arxiv.org/abs/2405.08448)
- [Sharp deviations bounds for Dirichlet weighted sums with application to analysis of Bayesian algorithms](https://arxiv.org/abs/2304.03056)
- [KL-entropy-regularized RL with a generative model is minimax optimal](https://arxiv.org/abs/2205.14211)
- [Accelerating Nash learning from human feedback via Mirror Prox](https://arxiv.org/abs/2505.19731)
- [The Llama 3 herd of models](https://arxiv.org/abs/2407.21783)
- [Metacognitive capabilities of LLMs: An exploration in mathematical problem solving](https://arxiv.org/abs/2405.12205)
- [Local and adaptive mirror descents in extensive-form games](https://arxiv.org/abs/2309.00656)
- [Nash learning from human feedback](https://arxiv.org/abs/2312.00886)
- [Human alignment of large language models through online preference optimisation](https://arxiv.org/abs/2403.08635)
- [Generalized preference optimization: A unified approach to offline alignment](https://arxiv.org/abs/2402.05749)
- [Decoding-time realignment of language models](https://arxiv.org/abs/2402.02992)
- [A general theoretical paradigm to understand learning from human preferences](https://arxiv.org/abs/2310.12036)
- [Unlocking the power of representations in long-term novelty-based exploration](https://arxiv.org/abs/2305.01521)
- [Demonstration-regularized RL](https://arxiv.org/abs/2310.17303)
- [Model-free posterior sampling via learning rate randomization](https://arxiv.org/abs/2310.18186)
- [Curiosity in hindsight: Intrinsic exploration in stochastic environments](https://arxiv.org/abs/2211.10515)
- [VA-learning as a more efficient alternative to Q-learning](https://arxiv.org/abs/2305.18161)
- [Fast rates for maximum entropy exploration](https://arxiv.org/abs/2303.08059)
- [Adapting to game trees in zero-sum imperfect information games](https://arxiv.org/abs/2212.12567)
- [Understanding self-predictive learning for reinforcement learning](https://arxiv.org/abs/2212.03319)
- [DoMo-AC: Doubly multi-step off-policy actor-critic algorithm](https://arxiv.org/abs/2305.18501)
- [Regularization and variance-weighted regression achieves minimax optimality in linear MDPs: Theory and practice](https://arxiv.org/abs/2305.13185)
- [Quantile credit assignment](https://arxiv.org/abs/2302.14041)
- [Half-Hop: A graph upsampling approach for slowing down message passing](https://arxiv.org/abs/2308.09198)
- [BYOL-Explore: Exploration by bootstrapped prediction](https://arxiv.org/abs/2206.08332)
- [Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees](https://arxiv.org/abs/2209.14414)
- [From Dirichlet to Rubin: Optimistic exploration in RL without bonuses](https://arxiv.org/abs/2205.07704)
- [Retrieval-augmented reinforcement learning](https://arxiv.org/abs/2202.08417)
- [Scaling Gaussian process optimization by evaluating a few unique candidates multiple times](https://arxiv.org/abs/2201.12909)
- [Large-scale representation learning on graphs via bootstrapping](https://arxiv.org/abs/2102.06514)
- [Adaptive multi-goal exploration](https://arxiv.org/abs/2111.12045)
- [Marginalized operators for off-policy reinforcement learning](https://arxiv.org/abs/2203.16177)
- [Drop, Swap, and Generate: A self-supervised approach for generating neural activity](https://arxiv.org/abs/2111.02338)
- [Stochastic shortest path: minimax, parameter-free and towards horizon-free regret](https://arxiv.org/abs/2104.11186)
- [A provably efficient sample collection strategy for reinforcement learning](https://arxiv.org/abs/2007.06437)
- [Model-free learning for two-player zero-sum partially observable Markov games with perfect recall](https://arxiv.org/abs/2106.06279)
- [Unifying gradient estimators for meta-reinforcement learning via off-policy evaluation](https://arxiv.org/abs/2106.13125)
- [Broaden your views for self-supervised video learning](https://arxiv.org/abs/2103.16559)
- [UCB Momentum Q-learning: Correcting the bias without forgetting](https://arxiv.org/abs/2103.01312)
- [Fast active learning for pure exploration in reinforcement learning](https://arxiv.org/abs/2007.13442)
- [Revisiting Peng's Q(λ) for modern reinforcement learning](https://arxiv.org/abs/2103.00107)
- [Taylor expansion of discount factors](https://arxiv.org/abs/2106.06170)
- [Online A-optimal design and active linear regression](https://arxiv.org/abs/1906.08509)
- [Kernel-based reinforcement learning: A finite-time analysis](https://arxiv.org/abs/2004.05599)
- [Game plan: What AI can do for football, and what football can do for AI](https://arxiv.org/abs/2011.09192)
- [A kernel-based approach to non-stationary reinforcement learning in metric spaces](https://arxiv.org/abs/2007.05078)
- [Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited](https://arxiv.org/abs/2010.03531)
- [Adaptive reward-free exploration](https://arxiv.org/abs/2006.06294)
- [Fast sampling from β-ensembles](https://arxiv.org/abs/2003.02344)
- [Mine Your Own vieW: Self-supervised learning through across-sample prediction](https://arxiv.org/abs/2102.10106)
- [Bootstrap Your Own Latent: A new approach to self-supervised learning](https://arxiv.org/abs/2006.07733)
- [BYOL works even without batch statistics](https://arxiv.org/abs/2010.10241)
- [Improved sample complexity for incremental autonomous exploration in MDPs](https://arxiv.org/abs/2012.14755)
- [Sampling from a k-DPP without looking at all items](https://arxiv.org/abs/2006.16947)
- [Statistical efficiency of Thompson sampling for combinatorial semi-bandits](https://arxiv.org/abs/2006.06613)
- [Planning in Markov decision processes with gap-dependent sample complexity](https://arxiv.org/abs/2006.05879)
- [Monte-Carlo tree search as regularized policy optimization](https://arxiv.org/abs/2007.12509)
- [Taylor expansion policy optimization](https://arxiv.org/abs/2003.06259)
- [Gamification of pure exploration for linear bandits](https://arxiv.org/abs/2007.00953)
- [No-regret exploration in goal-oriented reinforcement learning](https://arxiv.org/abs/1912.03517)
- [Improved sleeping bandits with stochastic action sets and adversarial rewards](https://arxiv.org/abs/2004.06248)
- [Stochastic bandits with arm-dependent delays](https://arxiv.org/abs/2006.10459)
- [Near-linear time Gaussian process optimization with adaptive batching and resparsification](https://arxiv.org/abs/2002.09954)
- [Fixed-confidence guarantees for Bayesian best-arm identification](https://arxiv.org/abs/1910.10945)
- [Multiagent evaluation under incomplete information](https://arxiv.org/abs/1909.09849)
- [Exact sampling of determinantal point processes with sublinear time preprocessing](https://arxiv.org/abs/1905.13476)
- [Exploiting structure of uncertainty for efficient matroid semi-bandits](https://arxiv.org/abs/1902.03794)
- [DPPy: Sampling determinantal point processes with Python](https://arxiv.org/abs/1809.07258)
- [Rotting bandits are not harder than stochastic ones](https://arxiv.org/abs/1811.11043)
- [Finding the bandit in a graph: Sequential search-and-stop](https://arxiv.org/abs/1806.02282)
- [Optimistic optimization of a Brownian](https://arxiv.org/abs/1901.04884)
- [Second-order kernel online convex optimization with adaptive sketching](https://arxiv.org/abs/1706.04892)
- [Zonotope hit-and-run for efficient sampling from projection DPPs](https://arxiv.org/abs/1705.10498)
- [Distributed adaptive sampling for kernel matrix approximation](https://arxiv.org/abs/1803.10172)
- [Simple regret for infinitely many armed bandits](https://arxiv.org/abs/1505.04627)
- [Cheap Bandits](https://arxiv.org/abs/1506.04782)
- [Geometric entropic exploration](https://arxiv.org/abs/2101.02055)
- [On the approximation relationship between optimizing ratio of submodular (RS) and difference of submodular (DS) functions](https://arxiv.org/abs/2101.01631)
- [Learning to Act Greedily: Polymatroid Semi-Bandits](https://arxiv.org/abs/1405.7752)
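
The list above is plain Markdown, so titles and arXiv identifiers can be recovered with one regular expression. A minimal sketch, assuming `card` holds this README's text (for example, loaded as in the Usage snippet):

```python
import re

# Each paper is a Markdown list item of the form:
#   - [Title](https://arxiv.org/abs/<id>)
PAPER_RE = re.compile(
    r"^- \[(?P<title>.+?)\]\(https://arxiv\.org/abs/(?P<id>[\d.]+)\)",
    re.MULTILINE,
)

papers = [m.groupdict() for m in PAPER_RE.finditer(card)]
print(len(papers), "papers found")
print(papers[0])  # e.g. {'title': 'RL-finetuning ...', 'id': '2503.19612'}
```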

## Citation

If you use any of these papers, please cite the original work.
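
Every entry links to an arXiv abstract, so citation metadata can be fetched from arXiv's public Atom API (`http://export.arxiv.org/api/query`). The sketch below assembles a rough `@misc` BibTeX stub from that metadata; the entry-key scheme is an arbitrary choice here, and the published version of a paper should be preferred when one exists.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the API

def arxiv_bibtex_stub(arxiv_id: str) -> str:
    """Build a rough @misc BibTeX entry from arXiv's public Atom API."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        entry = ET.parse(resp).getroot().find(f"{ATOM}entry")
    title = " ".join(entry.findtext(f"{ATOM}title").split())  # unwrap lines
    authors = [a.findtext(f"{ATOM}name") for a in entry.findall(f"{ATOM}author")]
    year = entry.findtext(f"{ATOM}published")[:4]
    key = f"{authors[0].split()[-1].lower()}{year}"  # arbitrary key scheme
    return (
        f"@misc{{{key},\n"
        f"  title  = {{{title}}},\n"
        f"  author = {{{' and '.join(authors)}}},\n"
        f"  year   = {{{year}}},\n"
        f"  eprint = {{{arxiv_id}}},\n"
        "}"
    )

print(arxiv_bibtex_stub("2312.00886"))  # Nash learning from human feedback
```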

## Contact

- Website: https://researchers.mila.quebec/en/profile/michal-valko
- HAL: https://hal.science/search/index/?q=authIdHal_s:michal