---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- research
- machine-learning
- reinforcement-learning
- bandits
pretty_name: Michal Valko Research Papers
---

# Michal Valko - Research Papers Collection

This dataset contains references to research papers by Michal Valko and collaborators.

## Papers

- [RL-finetuning LLMs from on- and off-policy data with a single algorithm](https://arxiv.org/abs/2503.19612)
- [Preference optimization with multi-sample comparisons](https://arxiv.org/abs/2410.12138)
- [Optimal design for reward modeling in RLHF](https://arxiv.org/abs/2410.17055)
- [A new bound on the cumulant generating function of Dirichlet processes](https://arxiv.org/abs/2409.18621)
- [Understanding the performance gap between online and offline alignment algorithms](https://arxiv.org/abs/2405.08448)
- [Sharp deviations bounds for Dirichlet weighted sums with application to analysis of Bayesian algorithms](https://arxiv.org/abs/2304.03056)
- [KL-entropy-regularized RL with a generative model is minimax optimal](https://arxiv.org/abs/2205.14211)
- [Accelerating Nash learning from human feedback via Mirror Prox](https://arxiv.org/abs/2505.19731)
- [The Llama 3 herd of models](https://arxiv.org/abs/2407.21783)
- [Metacognitive capabilities of LLMs: An exploration in mathematical problem solving](https://arxiv.org/abs/2405.12205)
- [Local and adaptive mirror descents in extensive-form games](https://arxiv.org/abs/2309.00656)
- [Nash learning from human feedback](https://arxiv.org/abs/2312.00886)
- [Human alignment of large language models through online preference optimisation](https://arxiv.org/abs/2403.08635)
- [Generalized preference optimization: A unified approach to offline alignment](https://arxiv.org/abs/2402.05749)
- [Decoding-time realignment of language models](https://arxiv.org/abs/2402.02992)
- [A general theoretical paradigm to understand learning from human preferences](https://arxiv.org/abs/2310.12036)
- [Unlocking the power of representations in long-term novelty-based exploration](https://arxiv.org/abs/2305.01521)
- [Demonstration-regularized RL](https://arxiv.org/abs/2310.17303)
- [Model-free posterior sampling via learning rate randomization](https://arxiv.org/abs/2310.18186)
- [Curiosity in hindsight: Intrinsic exploration in stochastic environments](https://arxiv.org/abs/2211.10515)
- [VA-learning as a more efficient alternative to Q-learning](https://arxiv.org/abs/2305.18161)
- [Fast rates for maximum entropy exploration](https://arxiv.org/abs/2303.08059)
- [Adapting to game trees in zero-sum imperfect information games](https://arxiv.org/abs/2212.12567)
- [Understanding self-predictive learning for reinforcement learning](https://arxiv.org/abs/2212.03319)
- [DoMo-AC: Doubly multi-step off-policy actor-critic algorithm](https://arxiv.org/abs/2305.18501)
- [Regularization and variance-weighted regression achieves minimax optimality in linear MDPs: Theory and practice](https://arxiv.org/abs/2305.13185)
- [Quantile credit assignment](https://arxiv.org/abs/2302.14041)
- [Half-Hop: A graph upsampling approach for slowing down message passing](https://arxiv.org/abs/2308.09198)
- [BYOL-Explore: Exploration by bootstrapped prediction](https://arxiv.org/abs/2206.08332)
- [Optimistic posterior sampling for reinforcement learning with few samples and tight guarantees](https://arxiv.org/abs/2209.14414)
- [From Dirichlet to Rubin: Optimistic exploration in RL without bonuses](https://arxiv.org/abs/2205.07704)
- [Retrieval-augmented reinforcement learning](https://arxiv.org/abs/2202.08417)
- [Scaling Gaussian process optimization by evaluating a few unique candidates multiple times](https://arxiv.org/abs/2201.12909)
- [Large-scale representation learning on graphs via bootstrapping](https://arxiv.org/abs/2102.06514)
- [Adaptive multi-goal exploration](https://arxiv.org/abs/2111.12045)
- [Marginalized operators for off-policy reinforcement learning](https://arxiv.org/abs/2203.16177)
- [Drop, Swap, and Generate: A self-supervised approach for generating neural activity](https://arxiv.org/abs/2111.02338)
- [Stochastic shortest path: minimax, parameter-free and towards horizon-free regret](https://arxiv.org/abs/2104.11186)
- [A provably efficient sample collection strategy for reinforcement learning](https://arxiv.org/abs/2007.06437)
- [Model-free learning for two-player zero-sum partially observable Markov games with perfect recall](https://arxiv.org/abs/2106.06279)
- [Unifying gradient estimators for meta-reinforcement learning via off-policy evaluation](https://arxiv.org/abs/2106.13125)
- [Broaden your views for self-supervised video learning](https://arxiv.org/abs/2103.16559)
- [UCB Momentum Q-learning: Correcting the bias without forgetting](https://arxiv.org/abs/2103.01312)
- [Fast active learning for pure exploration in reinforcement learning](https://arxiv.org/abs/2007.13442)
- [Revisiting Peng's Q(λ) for modern reinforcement learning](https://arxiv.org/abs/2103.00107)
- [Taylor expansion of discount factors](https://arxiv.org/abs/2106.06170)
- [Online A-optimal design and active linear regression](https://arxiv.org/abs/1906.08509)
- [Kernel-based reinforcement learning: A finite-time analysis](https://arxiv.org/abs/2004.05599)
- [Game plan: What AI can do for football, and what football can do for AI](https://arxiv.org/abs/2011.09192)
- [A kernel-based approach to non-stationary reinforcement learning in metric spaces](https://arxiv.org/abs/2007.05078)
- [Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited](https://arxiv.org/abs/2010.03531)
- [Adaptive reward-free exploration](https://arxiv.org/abs/2006.06294)
- [Fast sampling from β-ensembles](https://arxiv.org/abs/2003.02344)
- [Mine Your Own vieW: Self-supervised learning through across-sample prediction](https://arxiv.org/abs/2102.10106)
- [Bootstrap Your Own Latent: A new approach to self-supervised learning](https://arxiv.org/abs/2006.07733)
- [BYOL works even without batch statistics](https://arxiv.org/abs/2010.10241)
- [Improved sample complexity for incremental autonomous exploration in MDPs](https://arxiv.org/abs/2012.14755)
- [Sampling from a k-DPP without looking at all items](https://arxiv.org/abs/2006.16947)
- [Statistical efficiency of Thompson sampling for combinatorial semi-bandits](https://arxiv.org/abs/2006.06613)
- [Planning in Markov decision processes with gap-dependent sample complexity](https://arxiv.org/abs/2006.05879)
- [Monte-Carlo tree search as regularized policy optimization](https://arxiv.org/abs/2007.12509)
- [Taylor expansion policy optimization](https://arxiv.org/abs/2003.06259)
- [Gamification of pure exploration for linear bandits](https://arxiv.org/abs/2007.00953)
- [No-regret exploration in goal-oriented reinforcement learning](https://arxiv.org/abs/1912.03517)
- [Improved sleeping bandits with stochastic action sets and adversarial rewards](https://arxiv.org/abs/2004.06248)
- [Stochastic bandits with arm-dependent delays](https://arxiv.org/abs/2006.10459)
- [Near-linear time Gaussian process optimization with adaptive batching and resparsification](https://arxiv.org/abs/2002.09954)
- [Fixed-confidence guarantees for Bayesian best-arm identification](https://arxiv.org/abs/1910.10945)
- [Multiagent evaluation under incomplete information](https://arxiv.org/abs/1909.09849)
- [Exact sampling of determinantal point processes with sublinear time preprocessing](https://arxiv.org/abs/1905.13476)
- [Exploiting structure of uncertainty for efficient matroid semi-bandits](https://arxiv.org/abs/1902.03794)
- [DPPy: Sampling determinantal point processes with Python](https://arxiv.org/abs/1809.07258)
- [Rotting bandits are not harder than stochastic ones](https://arxiv.org/abs/1811.11043)
- [Finding the bandit in a graph: Sequential search-and-stop](https://arxiv.org/abs/1806.02282)
- [Optimistic optimization of a Brownian](https://arxiv.org/abs/1901.04884)
- [Second-order kernel online convex optimization with adaptive sketching](https://arxiv.org/abs/1706.04892)
- [Zonotope hit-and-run for efficient sampling from projection DPPs](https://arxiv.org/abs/1705.10498)
- [Distributed adaptive sampling for kernel matrix approximation](https://arxiv.org/abs/1803.10172)
- [Simple regret for infinitely many armed bandits](https://arxiv.org/abs/1505.04627)
- [Cheap Bandits](https://arxiv.org/abs/1506.04782)
- [Geometric entropic exploration](https://arxiv.org/abs/2101.02055)
- [On the approximation relationship between optimizing ratio of submodular (RS) and difference of submodular (DS) functions](https://arxiv.org/abs/2101.01631)
- [Learning to Act Greedily: Polymatroid Semi-Bandits](https://arxiv.org/abs/1405.7752)

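Every entry above follows the same `[title](https://arxiv.org/abs/ID)` pattern, so the list can be mined programmatically. A minimal sketch with the Python standard library (the inline `readme_text` sample stands in for the full README text, which you would read from disk in practice; the regex and variable names are illustrative, not part of the dataset):

```python
import re

# A few entries copied from the list above; in practice, load the full
# README.md text here instead of this inline sample.
readme_text = """
- [Nash learning from human feedback](https://arxiv.org/abs/2312.00886)
- [Taylor expansion policy optimization](https://arxiv.org/abs/2003.06259)
- [Cheap Bandits](https://arxiv.org/abs/1506.04782)
"""

# Match markdown links of the form [title](https://arxiv.org/abs/ID).
PAPER_RE = re.compile(r"\[([^\]]+)\]\(https://arxiv\.org/abs/([0-9.]+)\)")

# Map each arXiv identifier to its paper title.
papers = {arxiv_id: title for title, arxiv_id in PAPER_RE.findall(readme_text)}
print(papers["2312.00886"])  # → Nash learning from human feedback
```

The resulting mapping can then be joined with other metadata (e.g. abstracts fetched from arXiv) as needed.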
## Citation

If you use any of these papers, please cite the original work.

## Contact

- Website: https://researchers.mila.quebec/en/profile/michal-valko
- HAL: https://hal.science/search/index/?q=authIdHal_s:michal