Emergent Semantics: Model_UNI_GLYPH (335M)

This repository provides Model_UNI_GLYPH (335M): a decoder-only Transformer language model in which the entire input embedding layer is frozen and initialized from visual Unicode glyph representations (rendered glyph bitmaps → PCA projection → L2 normalization).
The model is released as part of the paper:

📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) - https://openreview.net/forum?id=Odh8IynO1o

📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) - https://arxiv.org/abs/2507.07129

📚 Blog Article

Primary goal: research and ablations on where semantic structure emerges in Transformer LMs when input embeddings are non-trainable and non-semantic.


Key idea

Standard LMs learn token embeddings jointly with the rest of the network. In this work, embeddings are treated as fixed structural primitives rather than "meaning vectors".

Model_UNI_GLYPH uses:

  • Frozen nn.Embedding weights (no gradient updates)
  • Embeddings derived from Unicode glyph images (deterministic, precomputed; the construction recipe is sketched below)
  • A standard decoder-only Transformer backbone trained normally

This isolates semantics as an emergent property of Transformer layers, rather than a property of learned input embeddings.
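
The checkpoint ships with its embedding matrix already precomputed, so none of this is required to use the model; the sketch below is only meant to make the recipe concrete (render glyph → PCA-project → L2-normalize → freeze). The rendering details (PIL's default font, 48x48 bitmaps, one codepoint per token id) and the use of scikit-learn PCA are illustrative assumptions, not the exact pipeline from the paper.

import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont
from sklearn.decomposition import PCA

VOCAB_SIZE, D_MODEL, BITMAP = 65_536, 1024, 48
FONT = ImageFont.load_default()  # placeholder; a broad-coverage TTF would be used in practice

def render_glyph(codepoint: int, size: int = BITMAP) -> np.ndarray:
    """Rasterize one Unicode codepoint to a flattened grayscale bitmap."""
    ch = " " if 0xD800 <= codepoint <= 0xDFFF else chr(codepoint)  # skip surrogates
    img = Image.new("L", (size, size), color=0)
    try:
        ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=FONT)
    except (UnicodeEncodeError, OSError):
        pass  # glyphs the placeholder font cannot draw are left blank
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

# 1) Render a glyph bitmap for every token id (simplified: one codepoint per token).
bitmaps = np.stack([render_glyph(cp) for cp in range(VOCAB_SIZE)])

# 2) PCA-project the flattened bitmaps down to d_model = 1024 dimensions.
projected = PCA(n_components=D_MODEL).fit_transform(bitmaps)

# 3) L2-normalize every embedding row.
projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8

# 4) Load into nn.Embedding and freeze it so it never receives gradient updates.
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
embedding.weight.data.copy_(torch.from_numpy(projected))
embedding.weight.requires_grad_(False)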


Model summary

  • Architecture: decoder-only Transformer (GPT-like)
  • Hidden size (d_model): 1024
  • Layers: 16
  • Heads: 32
  • Positional encoding: rotary embeddings
  • Activation: GELU
  • Input embeddings: frozen, glyph-based (visual Unicode, PCA-projected)
  • Output head: not tied to the input embeddings (trained separately; see the sanity check below)
  • Vocabulary size: 65,536
  • Tokenizer: Bochkov/bvv241-2-3
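
A quick way to check the frozen-embedding and untied-head properties on the loaded checkpoint, assuming the custom model code follows the standard Transformers accessors (the expected values in the comments come from this card and are not verified here):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-uni-glyph-335m",
    trust_remote_code=True,
)

emb = model.get_input_embeddings()
head = model.get_output_embeddings()

print(tuple(emb.weight.shape))      # expected: (65536, 1024)
print(emb.weight.requires_grad)     # expected: False if the checkpoint keeps the embedding frozen
if head is not None:
    # Untied head: the output projection does not share storage with the input embedding.
    print(head.weight.data_ptr() == emb.weight.data_ptr())   # expected: False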

Intended use

This model is intended for:

  • Research on emergent semantics in Transformers
  • Studying optimization effects of frozen vs. trainable embeddings
  • Multilingual / Unicode-heavy text experiments where deterministic coverage is useful (a quick tokenizer round-trip check is sketched after this section)
  • Reproducible ablations on embedding initialization

Not intended for production deployments. It is a research artifact trained under constrained compute/data to enable controlled comparisons.
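
For the multilingual use case, a minimal round-trip check with the released tokenizer might look like the following. The sample strings are arbitrary (the second and third are Russian and Japanese renderings of the first), and exact round-trip fidelity depends on the tokenizer's normalization:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")

samples = [
    "Tokyo is the capital of Japan.",
    "Токио - столица Японии.",
    "東京は日本の首都です。",
]
for text in samples:
    ids = tok.encode(text, add_special_tokens=False)
    # Report token count and whether decoding reproduces the input exactly.
    print(len(ids), tok.decode(ids) == text)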


Evaluation (reported)

These are the key numbers reported for this checkpoint:

  • MMLU: 23.81 (5-shot)
  • CommonsenseQA (C-SENSE): 19.75
  • ARC-E: 22.5
  • ARC-C: 22.1

Note: The objective of the project is not SOTA performance. The primary comparison is against architecturally identical baselines that differ only in whether the embeddings are trainable; a minimal scoring sketch follows below.
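
The numbers above are taken from the paper. As a rough illustration of how such multiple-choice scores are typically obtained (this is not the paper's exact harness or prompt format, and the question and answer options below are made up), one can rank answer options by the log-likelihood the model assigns to each continuation:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "Bochkov/emergent-semantics-model-uni-glyph-335m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).to("cuda").eval()

@torch.no_grad()
def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` as a continuation of `prompt`."""
    prompt_ids = tok.encode(prompt, add_special_tokens=False)
    full_ids = tok.encode(prompt + choice, add_special_tokens=False)
    ids = torch.tensor([full_ids], device="cuda")
    logits = model(ids).logits[0]                  # (seq_len, vocab)
    logprobs = F.log_softmax(logits[:-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    start = len(prompt_ids) - 1                    # first predicted token of the continuation
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

prompt = "Question: What is the capital of Japan?\nAnswer:"
choices = [" Tokyo", " Kyoto", " Osaka", " Sapporo"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])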


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-uni-glyph-335m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/emergent-semantics-model-uni-glyph-335m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))

# Expected output:
# Question: What is the capital of Japan?
# Answer:Tokyo (city)
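
The example above uses greedy decoding. Sampled generation works through the same generate call with standard sampling arguments; the temperature/top-p values below are arbitrary examples (not tuned for this checkpoint), and model, inputs, and tokenizer are reused from the snippet above:

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,   # example values only
    top_p=0.95,
)
print(tokenizer.decode(outputs[0].tolist()))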

Training overview (high level)

  • Training data: multilingual Wikipedia subsets + a small portion of SFT-style QA data (see paper for details)
  • Scale: ~4B tokens (resource-constrained setting for controlled comparisons)
  • Hardware: H100 80GB (reported setup). A minimal training-loop sketch follows below.
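
The training code is not part of this release; the following is only a sketch of the mechanical point that matters here: the optimizer is built over trainable parameters only, so the frozen glyph embedding never receives updates. The optimizer choice, learning rate, and dataloader are placeholders, and the model is assumed to follow the standard Transformers causal-LM interface (labels → .loss).

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-uni-glyph-335m", trust_remote_code=True
).to("cuda")

# Keep the glyph embedding frozen; everything else stays trainable.
model.get_input_embeddings().weight.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-4)   # placeholder hyperparameters

model.train()
for input_ids in dataloader:                        # placeholder: yields LongTensor batches of token ids
    input_ids = input_ids.to("cuda")
    loss = model(input_ids, labels=input_ids).loss  # standard causal-LM objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()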

Limitations

  • Research-focused; not tuned for safety, factuality, or instruction following.
  • Benchmark scores reflect a constrained training regime; do not compare directly to large-scale LLMs trained on hundreds of billions of tokens.
  • Frozen embeddings may encode surface/form biases (e.g., length/density effects from glyph rasterization + PCA).

Related repositories


🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}