
Anki tools for language learning

A modular collection of tools and scripts to enhance your Anki-based language learning. These tools focus on listening, sentence mining, sentence decks, and more. Built for language learners and immersion enthusiasts with Linux knowledge.

Tools Overview

  • audio-extractor: Extract Anki card audio by language into playlists for passive listening
  • batch_importer: Generate TTS audio from sentence lists and import into Anki
  • word-scraper: Extract & lemmatize words from Anki decks (frequency analysis, mining)
  • yt-transcript: Mine vocabulary/sentences from YouTube transcripts for analysis
  • deck-converter*: Convert TSV + audio into .apkg Anki decks using a config-driven workflow
  • youtube-to-anki*: Convert YouTube subtitles/audio into fully timestamped Anki cards

* = I haven't used these tools in a long time and will update them when I use them again

Requirements

Each tool has its own set of dependencies. Common dependencies include:

  • Python3
  • Anki with AnkiConnect
  • yt-dlp, jq, yq, spaCy, gTTS, youtube-transcript-api, pyyaml, genanki, fugashi, regex, requests
  • ffmpeg

Personally, I like to have one venv that contains all the prerequisites.

python3.12 -m venv ~/.venv/anki-tools
source ~/.venv/anki-tools/bin/activate
python3 -m pip install -U pip
pip install gtts jq yq spacy youtube-transcript-api pyyaml genanki 'fugashi[unidic-lite]' regex requests

# Also install ffmpeg (Fedora shown here; use your distro's package manager)
sudo dnf install ffmpeg

That way, whenever you want to run these scripts, you can just source the venv and run the appropriate script.

source ~/.venv/anki-tools/bin/activate

Getting started

Clone the repository:

git clone https://git.pawelsarkowicz.xyz/ps/anki-tools.git
cd anki-tools

Then explore.

Most scripts assume:

  • Anki is running
  • the AnkiConnect add-on is enabled (default: http://localhost:8765; see the quick connectivity check below)
  • your Anki cards are basic, with audio on the front and the sentence (in the target language) on the back. These tools only look at the first line of the back, so you can keep notes/translations/etc. on the following lines if you like.
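
If you want to sanity-check the AnkiConnect side before running anything, a minimal version check against the default endpoint looks roughly like this (a sketch using AnkiConnect's standard JSON API; nothing here is specific to these scripts):

import requests

# Ask AnkiConnect for its API version; a successful response means Anki is
# running and the add-on is listening on the default port.
resp = requests.post(
    "http://localhost:8765",
    json={"action": "version", "version": 6},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # e.g. {'result': 6, 'error': None}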

Language support

  • 🇯🇵 日本語
  • 🇪🇸 Español

audio-extractor

Purpose: Extract audio referenced by [sound:...] tags from Anki decks, grouped by language.

Usage:

./extract_anki_audio.py jp [--concat] [--outdir DIR] [--copy-only-new]
./extract_anki_audio.py es [--concat] [--outdir DIR] [--copy-only-new]

Outputs:

  • Copies audio into ~/Languages/Anki/anki-audio/<language>/ by default
  • Writes <language>.m3u
  • With --concat, writes <language>_concat.mp3 (keeps individual files)
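
For context, the [sound:...] references the extractor works from are plain text tags inside the note fields, so pulling out the filenames is essentially a regex pass over the field text. An illustrative sketch (not the script itself):

import re

# A field as it might appear on one of the basic cards described above
field = "これは例文です。[sound:rec_001.mp3]"

# Capture every referenced audio filename
sound_files = re.findall(r"\[sound:([^\]]+)\]", field)
print(sound_files)  # ['rec_001.mp3']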

Requirements

  • Anki + AnkiConnect
  • requests
  • ffmpeg (only if you use --concat)

batch_importer

Purpose: Generate TTS audio from a sentence list and add notes to Anki via AnkiConnect.

Usage

./batch_anki_import.sh [jp|es] [--concat] [--outdir DIR]

  • Keeps all individual MP3s.
  • If --concat is passed, also writes one combined MP3 for the run.
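
To give a sense of what one iteration of a run boils down to: generate TTS for a sentence with gTTS, push the file into Anki's media collection, and add a note that references it via AnkiConnect. This is a rough Python sketch, not the script itself; the deck name is a placeholder, and the real script handles batching, duplicates, and cleanup:

import base64
import requests
from gtts import gTTS

ANKI = "http://localhost:8765"
sentence = "Hoy hace muy buen tiempo."

# 1. Synthesize the sentence audio
gTTS(sentence, lang="es").save("sentence_0001.mp3")
with open("sentence_0001.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# 2. Store the file in Anki's media folder
requests.post(ANKI, json={
    "action": "storeMediaFile", "version": 6,
    "params": {"filename": "sentence_0001.mp3", "data": audio_b64},
}, timeout=10).raise_for_status()

# 3. Add a basic note: audio on the front, sentence on the back
note = {
    "deckName": "Español",  # placeholder deck name
    "modelName": "Basic",
    "fields": {"Front": "[sound:sentence_0001.mp3]", "Back": sentence},
}
requests.post(ANKI, json={
    "action": "addNote", "version": 6,
    "params": {"note": note},
}, timeout=10).raise_for_status()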

Requirements

  • Anki + AnkiConnect
  • gtts-cli, ffmpeg, curl

Sentence files

  • Japanese: ~/Languages/Anki/sentences_jp.txt
  • Spanish: ~/Languages/Anki/sentences_es.txt

word-scraper

Extract frequent words from Anki notes using AnkiConnect and spaCy. This is primarily intended for language learning workflows (currently Japanese and Spanish).

The script:

  • queries notes from Anki
  • extracts visible text from a chosen field
  • tokenizes with spaCy
  • filters out stopwords / grammar
  • counts word frequencies
  • writes a sorted word list to a text file
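
Condensed into a few lines, that pipeline looks roughly like the following Spanish-flavoured sketch (the real script also strips HTML from the field, applies per-language POS filtering, and handles all of the options listed below):

from collections import Counter
import requests
import spacy

ANKI = "http://localhost:8765"

def anki(action, **params):
    r = requests.post(ANKI, json={"action": action, "version": 6, "params": params}, timeout=10)
    r.raise_for_status()
    return r.json()["result"]

# 1. Query notes from Anki and take the first line of the chosen field
note_ids = anki("findNotes", query='deck:"Español"')
notes = anki("notesInfo", notes=note_ids)
texts = [n["fields"]["Back"]["value"].split("\n", 1)[0] for n in notes]

# 2. Tokenize with spaCy, drop stopwords and non-words, count lemmas
nlp = spacy.load("es_core_news_sm")
counts = Counter()
for doc in nlp.pipe(texts):
    counts.update(t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop)

# 3. Write a sorted "word frequency" list
with open("words_es.txt", "w", encoding="utf-8") as f:
    for word, freq in counts.most_common():
        if freq >= 2:
            f.write(f"{word} {freq}\n")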

Requirements

  • Anki + AnkiConnect
  • Python 3.12 (recommended; spaCy is not yet stable on 3.14)
  • spacy, regex, requests
  • spaCy models:
python -m spacy download es_core_news_sm
python -m spacy download ja_core_news_lg

Usage

./word_scraper.py {jp,es} [options]

Options

  • --query QUERY: Full Anki search query (e.g. deck:"Español" tag:foo)
  • --deck DECK: Deck name (repeatable); if omitted, decks are inferred from the language
  • --field FIELD: Note field to read (default: Back)
  • --min-freq N: Minimum frequency to include (default: 2)
  • --outdir DIR: Output directory (default: ~/Languages/Anki/anki-words/<language>)
  • --out FILE: Output file path (default: <outdir>/words_<lang>.txt)
  • --full-field: Use the full field text instead of only the first visible line
  • --spacy-model MODEL: Override the spaCy model name
  • --logfile FILE: Log file path

Examples

Basic usage (auto-detected decks)

./word_scraper.py jp
./word_scraper.py es

Specify a deck explicitly

./word_scraper.py jp --deck "日本語"
./word_scraper.py es --deck "Español"

Use a custom Anki query

./word_scraper.py es --query 'deck:"Español" tag:youtube'

Change output location and frequency threshold

./word_scraper.py jp --min-freq 3 --out words_jp.txt
./word_scraper.py es --outdir ~/tmp/words --out spanish_words.txt

Process full field text (not just first line)

./word_scraper.py jp --full-field

Output format

The output file contains one entry per line:

word frequency

Examples:

comer 12
hablar 9
行く (行き) 8
見る (見た) 6

  • Spanish output uses lemmas
  • Japanese output includes lemma (surface) when they differ
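
The lemma (surface) pairing maps onto spaCy's lemma and surface-form attributes. One plausible way to produce it, keeping only the POS classes mentioned below (illustrative, not necessarily the script's exact logic):

import spacy

nlp = spacy.load("ja_core_news_lg")
for token in nlp("昨日映画を見た"):
    # Keep the POS classes the script keeps: nouns, verbs, adjectives, proper nouns
    if token.pos_ in ("NOUN", "VERB", "ADJ", "PROPN"):
        if token.lemma_ != token.text:
            print(f"{token.lemma_} ({token.text})")  # lemma differs from the surface form
        else:
            print(token.lemma_)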

Language-specific notes

Japanese

  • Filters out particles and common grammar
  • Keeps nouns, verbs, adjectives, and proper nouns
  • Requires regex for Unicode script matching

Spanish

  • Filters stopwords
  • Keeps alphabetic tokens only
  • Lemmatized output

yt-transcript

Extract vocabulary or sentence-level text from YouTube video subtitles (transcripts), for language learning or analysis.

The script:

  • fetches captions via youtube-transcript-api
  • supports Spanish (es) and Japanese (jp)
  • tokenizes Japanese using MeCab (via fugashi)
  • outputs either:
    • word frequency lists, or
    • timestamped transcript lines
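
Stripped down, the vocab flow amounts to fetching the caption track and counting tokens. A sketch assuming the classic youtube_transcript_api get_transcript() interface and fugashi's default tagger (the real script adds stopword filtering and the CLI options below):

from collections import Counter

import fugashi
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "VIDEO_ID"  # placeholder

# Japanese caption track: a list of {"text", "start", "duration"} entries
entries = YouTubeTranscriptApi.get_transcript(video_id, languages=["ja"])

# Tokenize with MeCab (via fugashi) and count surface forms
tagger = fugashi.Tagger()
counts = Counter()
for entry in entries:
    counts.update(word.surface for word in tagger(entry["text"]))

for word, n in counts.most_common(20):
    print(f"{word}: {n}")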

Features

  • Extract full vocabulary lists with frequency counts
  • Extract sentences (with timestamps or sentence indices)
  • Support for Japanese tokenization
  • Optional: stopword filtering
  • Modular and extendable for future features like CSV export or audio slicing

Requirements

  • youtube-transcript-api
  • For Japanese tokenization:
pip install "fugashi[unidic-lite]"

Usage

./yt-transcript.py {jp,es} <video_url_or_id> [options]

Options

  • --mode {vocab,sentences}: Output mode (default: vocab)
  • --top N: Show only the top N words (vocab mode)
  • --no-stopwords: Keep common words
  • --raw: (Spanish only) Do not lowercase tokens

Examples

Extract Spanish vocabulary

./yt-transcript.py es https://youtu.be/VIDEO_ID

Top 50 words

./yt-transcript.py es VIDEO_ID --top 50

Japanese transcript with timestamps

./yt-transcript.py jp VIDEO_ID --mode sentences

Keep Spanish casing and stopwords

./yt-transcript.py es VIDEO_ID --raw --no-stopwords

Output formats

Vocabulary mode

palabra: count

Example:

comer: 12
hablar: 9

Sentence mode

[12.34s] sentence text here

Example:

[45.67s] 今日はいい天気ですね
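
Each sentence-mode line is just a transcript entry's start time and text. As a rough illustration, assuming entries shaped like the get_transcript() results sketched earlier:

# One entry as returned by the transcript API (keys assumed, see above)
entry = {"text": "今日はいい天気ですね", "start": 45.67, "duration": 3.2}
print(f"[{entry['start']:.2f}s] {entry['text']}")  # -> [45.67s] 今日はいい天気ですね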

Language Notes

Spanish

  • Simple regex-based tokenizer
  • Accented characters supported
  • Lowercased by default

Japanese

  • Uses fugashi (MeCab)
  • Outputs surface forms
  • Filters via stopword list only (no POS filtering)

License

This project is licensed under the MIT License. See the LICENSE file for details.