Emergent Semantics – Model_256_FLOAT (285M)
This repository provides Model_256_FLOAT (285M), an ablation model from the paper "Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations" (TMLR, 2025).
This checkpoint isolates the effect of floating-point / normalized frozen embeddings (and the geometry they induce), while still keeping the embeddings non-trainable and non-semantic.
Key idea (what this ablation tests)
This model is a close counterpart to Model_256_BIT, but the embedding vectors are floats rather than binary.
Pipeline (high-level; a code sketch follows the list):
- Assign each token a random unique code (collision-free "unique ID per token" guaranteed by construction).
- Convert the code into a vector representation.
- Apply PCA projection to obtain a compact `n_embed = 256` representation.
- Apply L2 normalization (so each token embedding has unit norm).
- Freeze the embedding table (`requires_grad=False`) during training.
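A minimal sketch of this pipeline, assuming random 1024-bit codes as the token IDs and `torch.pca_lowrank` for the PCA step; the repository's actual preprocessing (code format, PCA implementation) may differ:

```python
import torch

vocab_size, code_bits, n_embed = 65536, 1024, 256  # code_bits is an assumed value for illustration

# 1) Random code per token; collisions are vanishingly unlikely at this width
#    (the paper guarantees uniqueness by construction).
torch.manual_seed(0)
codes = torch.randint(0, 2, (vocab_size, code_bits)).float()

# 2) PCA projection to a compact n_embed = 256 representation
codes = codes - codes.mean(dim=0, keepdim=True)
_, _, V = torch.pca_lowrank(codes, q=n_embed, center=False)
emb = codes @ V                                    # (vocab_size, 256)

# 3) L2-normalize so each token embedding has unit norm
emb = torch.nn.functional.normalize(emb, dim=-1)

# 4) Freeze the table: non-trainable, non-semantic input embeddings
embedding = torch.nn.Embedding.from_pretrained(emb, freeze=True)
assert not embedding.weight.requires_grad
```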
So Model_256_FLOAT tests whether improvements/convergence differences come from:
- simply having a stable token identifier (random, frozen), or
- additionally having a continuous normalized geometry (float values + normalization), even without any semantic or glyph information.
To match the Transformer hidden size, the 256-dim embedding is expanded to 1024 via a non-trainable repetition: `repeat_interleave(4)` → 256 * 4 = 1024.
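For illustration, the expansion step amounts to the following (a sketch; how it is wired into the model's forward pass may differ):

```python
import torch

n_embed, d_model = 256, 1024
frozen_emb = torch.randn(5, n_embed)                # e.g. embeddings for 5 tokens
expanded = frozen_emb.repeat_interleave(4, dim=-1)  # each value repeated 4x along the feature dim
assert expanded.shape == (5, d_model)               # 256 * 4 = 1024
```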
Important: parameter count difference (vs 335M models)
This checkpoint has ~285M parameters, while models with a standard n_embed=1024 embedding table (e.g. UNI_GLYPH / unfrozen baselines) are ~335M.
The difference is primarily the embedding table size:
- Standard embedding params: `vocab_size * 1024 = 65536 * 1024 ≈ 67.1M`
- This model's embedding params: `vocab_size * 256 = 65536 * 256 ≈ 16.8M`
The Transformer backbone is the same (layers/heads/d_model), but the total parameter count is lower because the embedding matrix is smaller.
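The arithmetic behind the gap, using the numbers above:

```python
vocab_size = 65536
standard_emb_params = vocab_size * 1024   # 67,108,864  (~67.1M)
this_model_emb_params = vocab_size * 256  # 16,777,216  (~16.8M)
print(f"embedding gap ≈ {(standard_emb_params - this_model_emb_params) / 1e6:.1f}M")  # ≈ 50.3M
```

That ~50M gap accounts for roughly the difference between the ~335M and ~285M totals.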
Model summary
- Architecture: decoder-only Transformer (GPT-like)
- Hidden size (`d_model`): 1024
- Layers: 16
- Heads: 32
- Positional encoding: rotary embeddings
- Activation: GELU
- Tokenizer / vocab size: 65,536 (bvv241-2-3 compatible)
- Input embeddings: frozen, float, `n_embed=256`, derived from random unique IDs + PCA + L2 normalization, expanded to 1024 by repetition (non-trainable)
- Output head: not tied to the input embeddings (trained separately)
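For quick reference, the summary above restated as a plain dictionary (the keys are illustrative, not the checkpoint's actual config field names):

```python
model_summary = {
    "architecture": "decoder-only Transformer (GPT-like)",
    "d_model": 1024,
    "num_layers": 16,
    "num_heads": 32,
    "positional_encoding": "rotary",
    "activation": "gelu",
    "vocab_size": 65536,
    "frozen_input_embedding_dim": 256,  # expanded to 1024 by non-trainable repetition
    "tie_word_embeddings": False,       # output head trained separately
}
```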
Tokenizer
The intended tokenizer is bvv241-2-3 (same vocab size and indexing): https://huggingface.co/Bochkov/bvv241-2-3
You may load the tokenizer either from this model repo (if included) or from the standalone tokenizer repo. The key requirement is exact vocab alignment.
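A minimal sanity check, assuming the tokenizer loads through the standard Transformers interface (the exact count may differ slightly if added special tokens are present):

```python
from transformers import AutoTokenizer

# Load from the standalone tokenizer repo (loading from the model repo should be
# equivalent if the tokenizer files are included there).
tok = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")
assert len(tok) == 65536, "vocab must align exactly with the model's 65,536-entry embedding table"
```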
How to use (Transformers)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-256-float-285m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-256-float-285m",
    trust_remote_code=True,
).to('cuda')

# Encode a prompt and generate greedily
inputs = torch.tensor(
    [tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")],
    dtype=torch.long,
    device='cuda',
)
outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of Japan?
# Answer:San Juan
```
Intended use
This model is intended for research only, especially for:
- Comparing binary vs float normalized frozen embeddings under the same `n_embed`
- Studying whether normalization / continuous geometry affects convergence and reasoning benchmarks
- Controlled comparisons vs:
  - Model_256_BIT
  - Model_UNI_GLYPH
  - trainable-embedding baselines
Not intended for production deployment.
Related links
- Model collection (paper artifacts): https://huggingface.co/collections/Bochkov/emergent-semantics-beyond-token-embeddings
- UNI_GLYPH main model (frozen visual glyph embeddings): https://huggingface.co/Bochkov/emergent-semantics-model-uni-glyph-335m
- Tokenizer: https://huggingface.co/Bochkov/bvv241-2-3
- Code (GitHub): https://github.com/AVBochkov/Embeddings
🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o},
  note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```