Emergent Semantics: Model_UNI_GLYPH (335M)

This repository provides Model_UNI_GLYPH (335M): a decoder-only Transformer language model in which the entire input embedding layer is frozen and initialized from visual Unicode glyph representations (rendered glyph bitmaps → PCA projection → L2 normalization).
The model is released as part of the paper:

📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) - https://openreview.net/forum?id=Odh8IynO1o

📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) - https://arxiv.org/abs/2507.07129

📚 Blog Article

Primary goal: research and ablations on where semantic structure emerges in Transformer LMs when input embeddings are non-trainable and non-semantic.


Key idea

Standard LMs learn token embeddings jointly with the rest of the network. In this work, embeddings are treated as fixed structural primitives rather than "meaning vectors".

Model_UNI_GLYPH uses:

  • Frozen nn.Embedding weights (no gradient updates)
  • Embeddings derived from Unicode glyph images (deterministic, precomputed; the construction recipe is sketched below)
  • A standard decoder-only Transformer backbone trained normally

This isolates semantics as an emergent property of Transformer layers, rather than a property of learned input embeddings.
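
The checkpoint ships with its embedding matrix already precomputed, so none of this is required to use the model; the sketch below is only meant to make the recipe concrete (render glyph → PCA-project → L2-normalize → freeze). The rendering details (PIL's default font, 48x48 bitmaps, one codepoint per token id) and the use of scikit-learn PCA are illustrative assumptions, not the exact pipeline from the paper.

import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont
from sklearn.decomposition import PCA

VOCAB_SIZE, D_MODEL, BITMAP = 65_536, 1024, 48
FONT = ImageFont.load_default()  # placeholder; a broad-coverage TTF would be used in practice

def render_glyph(codepoint: int, size: int = BITMAP) -> np.ndarray:
    """Rasterize one Unicode codepoint to a flattened grayscale bitmap."""
    ch = " " if 0xD800 <= codepoint <= 0xDFFF else chr(codepoint)  # skip surrogates
    img = Image.new("L", (size, size), color=0)
    try:
        ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=FONT)
    except (UnicodeEncodeError, OSError):
        pass  # glyphs the placeholder font cannot draw are left blank
    return np.asarray(img, dtype=np.float32).ravel() / 255.0

# 1) Render a glyph bitmap for every token id (simplified: one codepoint per token).
bitmaps = np.stack([render_glyph(cp) for cp in range(VOCAB_SIZE)])

# 2) PCA-project the flattened bitmaps down to d_model = 1024 dimensions.
projected = PCA(n_components=D_MODEL).fit_transform(bitmaps)

# 3) L2-normalize every embedding row.
projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8

# 4) Load into nn.Embedding and freeze it so it never receives gradient updates.
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
embedding.weight.data.copy_(torch.from_numpy(projected))
embedding.weight.requires_grad_(False)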


Model summary

  • Architecture: decoder-only Transformer (GPT-like)
  • Hidden size (d_model): 1024
  • Layers: 16
  • Heads: 32
  • Positional encoding: rotary embeddings
  • Activation: GELU
  • Input embeddings: frozen, glyph-based (visual Unicode, PCA-projected)
  • Output head: not tied to the input embeddings (trained separately; see the sanity check below)
  • Vocabulary size: 65,536
  • Tokenizer: Bochkov/bvv241-2-3
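
A quick way to check the frozen-embedding and untied-head properties on the loaded checkpoint, assuming the custom model code follows the standard Transformers accessors (the expected values in the comments come from this card and are not verified here):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-uni-glyph-335m",
    trust_remote_code=True,
)

emb = model.get_input_embeddings()
head = model.get_output_embeddings()

print(tuple(emb.weight.shape))      # expected: (65536, 1024)
print(emb.weight.requires_grad)     # expected: False if the checkpoint keeps the embedding frozen
if head is not None:
    # Untied head: the output projection does not share storage with the input embedding.
    print(head.weight.data_ptr() == emb.weight.data_ptr())   # expected: False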

Intended use

This model is intended for:

  • Research on emergent semantics in Transformers
  • Studying optimization effects of frozen vs. trainable embeddings
  • Multilingual / Unicode-heavy text experiments where deterministic coverage is useful (a quick tokenizer round-trip check is sketched after this section)
  • Reproducible ablations on embedding initialization

Not intended for production deployments. It is a research artifact trained under constrained compute/data to enable controlled comparisons.
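
For the multilingual use case, a minimal round-trip check with the released tokenizer might look like the following. The sample strings are arbitrary (the second and third are Russian and Japanese renderings of the first), and exact round-trip fidelity depends on the tokenizer's normalization:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")

samples = [
    "Tokyo is the capital of Japan.",
    "Токио - столица Японии.",
    "東京は日本の首都です。",
]
for text in samples:
    ids = tok.encode(text, add_special_tokens=False)
    # Report token count and whether decoding reproduces the input exactly.
    print(len(ids), tok.decode(ids) == text)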


Evaluation (reported)

These are the key numbers reported for this checkpoint:

  • MMLU: 23.81 (5-shot)
  • CommonsenseQA (C-SENSE): 19.75
  • ARC-E: 22.5
  • ARC-C: 22.1

Note: The objective of the project is not SOTA performance. The primary comparison is against architecturally identical baselines that differ only in whether the embeddings are trainable; a minimal scoring sketch follows below.
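
The numbers above are taken from the paper. As a rough illustration of how such multiple-choice scores are typically obtained (this is not the paper's exact harness or prompt format, and the question and answer options below are made up), one can rank answer options by the log-likelihood the model assigns to each continuation:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "Bochkov/emergent-semantics-model-uni-glyph-335m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).to("cuda").eval()

@torch.no_grad()
def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` as a continuation of `prompt`."""
    prompt_ids = tok.encode(prompt, add_special_tokens=False)
    full_ids = tok.encode(prompt + choice, add_special_tokens=False)
    ids = torch.tensor([full_ids], device="cuda")
    logits = model(ids).logits[0]                  # (seq_len, vocab)
    logprobs = F.log_softmax(logits[:-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    start = len(prompt_ids) - 1                    # first predicted token of the continuation
    return logprobs[start:].gather(1, targets[start:, None]).sum().item()

prompt = "Question: What is the capital of Japan?\nAnswer:"
choices = [" Tokyo", " Kyoto", " Osaka", " Sapporo"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[max(range(len(choices)), key=lambda i: scores[i])])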


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/emergent-semantics-model-uni-glyph-335m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/emergent-semantics-model-uni-glyph-335m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of Japan?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))

# Expected output:
# Question: What is the capital of Japan?
# Answer:Tokyo (city)
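
The example above uses greedy decoding. Sampled generation works through the same generate call with standard sampling arguments; the temperature/top-p values below are arbitrary examples (not tuned for this checkpoint), and model, inputs, and tokenizer are reused from the snippet above:

outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,   # example values only
    top_p=0.95,
)
print(tokenizer.decode(outputs[0].tolist()))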

Training overview (high level)

  • Training data: multilingual Wikipedia subsets + a small portion of SFT-style QA data (see paper for details)
  • Scale: ~4B tokens (resource-constrained setting for controlled comparisons)
  • Hardware: H100 80GB (reported setup). A minimal training-loop sketch follows below.
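
The training code is not part of this release; the following is only a sketch of the mechanical point that matters here: the optimizer is built over trainable parameters only, so the frozen glyph embedding never receives updates. The optimizer choice, learning rate, and dataloader are placeholders, and the model is assumed to follow the standard Transformers causal-LM interface (labels → .loss).

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/emergent-semantics-model-uni-glyph-335m", trust_remote_code=True
).to("cuda")

# Keep the glyph embedding frozen; everything else stays trainable.
model.get_input_embeddings().weight.requires_grad_(False)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-4)   # placeholder hyperparameters

model.train()
for input_ids in dataloader:                        # placeholder: yields LongTensor batches of token ids
    input_ids = input_ids.to("cuda")
    loss = model(input_ids, labels=input_ids).loss  # standard causal-LM objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()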

Limitations

  • Research-focused; not tuned for safety, factuality, or instruction following.
  • Benchmark scores reflect a constrained training regime; do not compare directly to large-scale LLMs trained on hundreds of billions of tokens.
  • Frozen embeddings may encode surface/form biases (e.g., length/density effects from glyph rasterization + PCA).

Related repositories


🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}