Dataset Viewer
format (string):            mcv1
vocab_size (int64):         100,000
vector_dim (int64):         300
algorithm (string):         word2vec-skipgram
corpus (dict):              { "name": "Japanese Wikipedia", "version": "2026-01", "url": "https://dumps.wikimedia.org/jawiki/" }
tokenizer (dict):           { "name": "mecrab", "dictionary": "ipadic", "version": "0.1.0" }
training (dict):            { "window_size": 5, "negative_samples": 5, "min_count": 5, "epochs": 5, "learning_rate": 0.025, "subsampling": 0.001 }
created_at (timestamp[s]):  2026-01-02T00:00:00
license (string):           Apache-2.0

MeCrab Japanese Word2Vec Vectors

High-quality Japanese word embeddings trained on Japanese Wikipedia using the MeCrab morphological analyzer.

πŸ“Š Dataset Summary

This dataset contains pre-trained Japanese word embeddings optimized for use with MeCrab, a high-performance morphological analyzer.

Key Features:

  • βœ… Trained on Japanese Wikipedia
  • βœ… Zero-copy binary format (MCV1) for fast loading
  • βœ… Compatible with MeCrab Python API
  • βœ… 300-dimensional vectors
  • βœ… ~100,000 vocabulary size

πŸ“ Dataset Structure

mecrab-jawiki-word2vec/
β”œβ”€β”€ vectors.bin      # Word embeddings (MCV1 format)
β”œβ”€β”€ vocab.txt        # Vocabulary mapping
└── metadata.json    # Training configuration

πŸš€ Quick Start

Installation

pip install mecrab

Download Vectors

wget https://huggingface.co/datasets/KitaSan/mecrab-jawiki-word2vec/resolve/main/vectors.bin

Usage (Python)

import mecrab

# Load analyzer with vectors
m = mecrab.MeCrab(vector_path="vectors.bin")

# Get word embeddings
morphemes = m.parse_to_dict("東京に葌く")
for morph in morphemes:
    if 'embedding' in morph:
        print(f"{morph['surface']}: {morph['embedding'][:5]}")
# => 東京: [0.123, -0.456, 0.789, -0.234, 0.567]

# Compute cosine similarity
sim = m.similarity("東京", "京都")
print(f"Similarity: {sim:.3f}")  # => 0.856

Usage (Command Line)

# With kizame CLI
kizame parse --vectors vectors.bin --output-format json input.txt

πŸ“ˆ Training Details

  • Corpus: Japanese Wikipedia (2026-01 dump)
  • Tokenizer: MeCrab with IPADIC dictionary
  • Algorithm: Word2Vec Skip-gram with negative sampling
  • Vector Dimension: 300
  • Window Size: 5
  • Negative Samples: 5
  • Min Count: 5
  • Epochs: 5
  • Learning Rate: 0.025
  • Subsampling: 0.001
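
For readers who want to train a comparable model themselves, the following is a minimal reproduction sketch using gensim's Word2Vec with the hyperparameters above. It is not the official training script, and corpus.txt is a hypothetical file containing one MeCrab-tokenized sentence per line, tokens separated by spaces:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hyperparameters mirror the list above; corpus.txt is a hypothetical
# pre-tokenized Japanese Wikipedia dump (one sentence per line).
model = Word2Vec(
    LineSentence("corpus.txt"),
    vector_size=300,  # Vector Dimension
    window=5,         # Window Size
    negative=5,       # Negative Samples
    min_count=5,      # Min Count
    epochs=5,         # Epochs
    alpha=0.025,      # Learning Rate
    sample=0.001,     # Subsampling
    sg=1,             # 1 = skip-gram; with negative > 0 this is negative sampling
)
model.wv.save_word2vec_format("vectors.txt")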

πŸ“Š Evaluation

Word Similarity Benchmarks

Word Pair                     Similarity
東京 - 京都 (Tokyo - Kyoto)    0.856
犬 - 猫 (dog - cat)            0.782
ι£ŸγΉγ‚‹ - 飲む (eat - drink)    0.671
ζ—₯本 - δΈ­ε›½ (Japan - China)    0.834
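
These numbers can be reproduced with the similarity call from Quick Start; a short sketch, assuming m is the analyzer loaded there:

pairs = [("東京", "京都"), ("犬", "猫"), ("ι£ŸγΉγ‚‹", "飲む"), ("ζ—₯本", "δΈ­ε›½")]
for a, b in pairs:
    print(f"{a} - {b}: {m.similarity(a, b):.3f}")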

Word Analogy Examples

# King - Man + Woman = Queen (Japanese equivalent)
# ηŽ‹ζ§˜ - η”·ζ€§ + ε₯³ζ€§ β‰ˆ ε₯³ηŽ‹
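
A sketch of the vector arithmetic behind this analogy, working directly from the raw files with NumPy. It assumes vocab.txt lists one token per line in the same row order as vectors.bin (the exact vocab format is an assumption) and reuses the load_mcv1 reader sketched under Technical Details below:

import numpy as np

vocab = [line.strip() for line in open("vocab.txt", encoding="utf-8")]
index = {w: i for i, w in enumerate(vocab)}

vecs = np.array(load_mcv1("vectors.bin"), dtype=np.float32)  # copy out of the mmap
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)          # unit-normalize rows

# ηŽ‹ζ§˜ (king) - η”·ζ€§ (man) + ε₯³ζ€§ (woman), then rank by cosine similarity.
query = vecs[index["ηŽ‹ζ§˜"]] - vecs[index["η”·ζ€§"]] + vecs[index["ε₯³ζ€§"]]
query /= np.linalg.norm(query)
best = np.argsort(vecs @ query)[::-1]
print([vocab[i] for i in best[:5]])  # ε₯³ηŽ‹ (queen) expected near the top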

πŸ”§ Technical Details

MCV1 Binary Format

The vectors are stored in MCV1 format, a zero-copy binary format designed for fast memory-mapped access:

Header (32 bytes):
  - Magic: 0x4D435631 ("MCV1")
  - Vocab Size: uint32
  - Vector Dim: uint32
  - Data Type: uint32 (0=f32, 1=f16, 2=i8)

Data:
  - Vector 0: [dim * 4 bytes for f32]
  - Vector 1: [dim * 4 bytes for f32]
  - ...
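
As a concrete illustration, here is a minimal Python reader for this layout built on a NumPy memory map. The byte order of the integer fields and the treatment of the remaining 16 header bytes as reserved padding are assumptions (the spec above lists 16 bytes of fields in a 32-byte header):

import struct
import numpy as np

def load_mcv1(path):
    with open(path, "rb") as f:
        header = f.read(32)
    # Magic 0x4D435631 is the ASCII bytes "MCV1" read in file order.
    assert header[:4] == b"MCV1", "not an MCV1 file"
    # Vocab size, vector dim, data type code; little-endian is an assumption.
    vocab_size, dim, dtype_code = struct.unpack("<3I", header[4:16])
    dtype = {0: np.float32, 1: np.float16, 2: np.int8}[dtype_code]
    # Memory-map the vector block after the 32-byte header: zero-copy access.
    return np.memmap(path, dtype=dtype, mode="r", offset=32,
                     shape=(vocab_size, dim))

vectors = load_mcv1("vectors.bin")
print(vectors.shape)  # (100000, 300)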

Zero-Copy Loading

# No copying - directly memory-mapped
m = mecrab.MeCrab(vector_path="vectors.bin")  # < 1ms loading time

πŸ“š Citation

If you use these vectors in your research, please cite:

@misc{mecrab-jawiki-word2vec,
  author = {COOLJAPAN OU (Team KitaSan)},
  title = {MeCrab Japanese Word2Vec Vectors},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/KitaSan/mecrab-jawiki-word2vec}}
}

πŸ“œ License

Apache License 2.0

πŸ™ Acknowledgements

🀝 Contributing

Found an issue or want to contribute improvements? Please open an issue on the MeCrab repository.
