MeCrab Japanese Word2Vec Vectors

Paper: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) (Mikolov et al., 2013)
| format (string) | vocab_size (int64) | vector_dim (int64) | algorithm (string) | corpus (dict) | tokenizer (dict) | training (dict) | created_at (timestamp[s]) | license (string) |
|---|---|---|---|---|---|---|---|---|
| mcv1 | 100,000 | 300 | word2vec-skipgram | `{"name": "Japanese Wikipedia", "version": "2026-01", "url": "https://dumps.wikimedia.org/jawiki/"}` | `{"name": "mecrab", "dictionary": "ipadic", "version": "0.1.0"}` | `{"window_size": 5, "negative_samples": 5, "min_count": 5, "epochs": 5, "learning_rate": 0.025, "subsampling": 0.001}` | 2026-01-02T00:00:00 | Apache-2.0 |
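The `training` dict maps one-to-one onto standard word2vec hyperparameters. As a rough illustration (not the actual training script used for this dataset), an equivalent skip-gram setup in gensim would look like this:

```python
from gensim.models import Word2Vec

# Toy corpus so the snippet runs; the real corpus is MeCrab-tokenized Wikipedia.
sentences = [["東京", "に", "行く"], ["京都", "に", "行く"]] * 5

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram (rather than CBOW)
    vector_size=300,  # vector_dim
    window=5,         # window_size
    negative=5,       # negative_samples
    min_count=5,      # min_count
    epochs=5,         # epochs
    alpha=0.025,      # learning_rate
    sample=0.001,     # subsampling
)
print(model.wv["東京"][:5])  # first 5 dimensions of a 300-dim vector
```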
High-quality Japanese word embeddings trained on Wikipedia using the MeCrab morphological analyzer.

This dataset contains pre-trained Japanese word embeddings optimized for use with MeCrab, a high-performance morphological analyzer.

Key Features:

- 100,000-word vocabulary of 300-dimensional skip-gram (word2vec) vectors
- Trained on the 2026-01 Japanese Wikipedia dump, tokenized with MeCrab (ipadic dictionary)
- Shipped in the zero-copy MCV1 binary format for fast memory-mapped loading

Dataset Structure:
```
mecrab-jawiki-word2vec/
├── vectors.bin     # Word embeddings (MCV1 format)
├── vocab.txt       # Vocabulary mapping
└── metadata.json   # Training configuration
```
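Assuming metadata.json mirrors the fields in the metadata table above (the exact schema is not documented here, so these key names are an assumption), the training configuration can be inspected with the standard library:

```python
import json

with open("metadata.json") as f:
    meta = json.load(f)

# Key names assumed to match the metadata table above.
print(meta["training"]["window_size"])  # -> 5
print(meta["corpus"]["version"])        # -> "2026-01"
```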
Installation:

```bash
pip install mecrab
wget https://huggingface.co/datasets/KitaSan/mecrab-jawiki-word2vec/resolve/main/vectors.bin
```
Usage:

```python
import mecrab

# Load analyzer with vectors
m = mecrab.MeCrab(vector_path="vectors.bin")

# Get word embeddings
morphemes = m.parse_to_dict("東京に行く")
for morph in morphemes:
    if 'embedding' in morph:
        print(f"{morph['surface']}: {morph['embedding'][:5]}")
# => 東京: [0.123, -0.456, 0.789, -0.234, 0.567]

# Compute cosine similarity
sim = m.similarity("東京", "京都")
print(f"Similarity: {sim:.3f}")  # => 0.856
```
```bash
# With the kizame CLI
kizame parse --vectors vectors.bin --output-format json input.txt
```
| Word Pair | Similarity |
|---|---|
| 東京 - 京都 (Tokyo - Kyoto) | 0.856 |
| 犬 - 猫 (dog - cat) | 0.782 |
| 食べる - 飲む (eat - drink) | 0.671 |
| 日本 - 中国 (Japan - China) | 0.834 |
```python
# King - Man + Woman = Queen (Japanese equivalent)
# 王様 - 男性 + 女性 ≈ 女王
```
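The card does not show MeCrab performing analogies directly, but the underlying vector arithmetic is easy to reproduce with plain numpy. A minimal sketch, assuming `vecs` is a dict you have built from vocab.txt and vectors.bin, mapping each surface form to its embedding row:

```python
import numpy as np

def analogy(vecs: dict, a: str, b: str, c: str) -> str:
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, v in vecs.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        sim = float(np.dot(v, target) / np.linalg.norm(v))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# analogy(vecs, "王様", "男性", "女性")  # ideally -> "女王"
```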
The vectors are stored in MCV1 format, a zero-copy binary format designed for fast memory-mapped access:
```
Header (32 bytes):
  - Magic: 0x4D435631 ("MCV1")
  - Vocab Size: uint32
  - Vector Dim: uint32
  - Data Type: uint32 (0=f32, 1=f16, 2=i8)

Data:
  - Vector 0: [dim * 4 bytes]
  - Vector 1: [dim * 4 bytes]
  - ...
```
```python
# No copying - directly memory-mapped
m = mecrab.MeCrab(vector_path="vectors.bin")  # < 1ms loading time
```
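To illustrate what that zero-copy load involves, here is a minimal reader sketch in plain numpy. It assumes the header layout above, little-endian fields, f32 vectors, and one token per line in vocab.txt (the vocab file format and endianness are assumptions, not documented here):

```python
import struct
import numpy as np

def load_mcv1(vector_path: str, vocab_path: str):
    """Memory-map an MCV1 vector file; returns (vocab, vectors)."""
    with open(vector_path, "rb") as f:
        header = f.read(32)
    assert header[:4] == b"MCV1", "not an MCV1 file"  # magic bytes (assumed order)
    vocab_size, dim, dtype_code = struct.unpack("<3I", header[4:16])  # little-endian assumed
    assert dtype_code == 0, "this sketch handles f32 only"
    # np.memmap reads pages on demand instead of copying the file into RAM.
    vectors = np.memmap(vector_path, dtype=np.float32, mode="r",
                        offset=32, shape=(vocab_size, dim))
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.split()[0] for line in f]  # one token per line (assumed)
    return vocab, vectors

vocab, vectors = load_mcv1("vectors.bin", "vocab.txt")
print(vectors[vocab.index("東京")][:5])
```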
If you use these vectors in your research, please cite:
```bibtex
@misc{mecrab-jawiki-word2vec,
  author       = {COOLJAPAN OU (Team KitaSan)},
  title        = {MeCrab Japanese Word2Vec Vectors},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/KitaSan/mecrab-jawiki-word2vec}}
}
```
Apache License 2.0
Found an issue or want to contribute improvements? Please open an issue on the MeCrab repository.