Rename project to Saiki and unify CLI
13
CHANGELOG.md
Normal file
@@ -0,0 +1,13 @@
# Changelog

## Unreleased

- Renamed the project to Saiki (`採記`).
- Replaced the old script collection with one CLI: `./saiki.py`.
- Added YAML configuration loading from `~/.config/saiki/config.yaml`.
- Added YouTube transcript exports for Anki-ready sentence mining.
- Added known/new word comparison.
- Added a Python TTS importer that accepts plain text, TSV, and CSV sentence
  sources.
- Added focused tests for pure logic.

448
README.md
@@ -1,322 +1,234 @@
|
||||
# Anki tools for language learning
|
||||
# Saiki
|
||||
|
||||
A modular collection of tools and scripts to enhance your Anki-based language learning. These tools focus on listening, sentence mining, sentence decks, and more. Built for language learners and immersion enthusiasts with Linux knowledge.
|
||||
**Saiki** (`採記`) is a small toolkit for Anki-based language learning workflows:
|
||||
listening playlists, word mining, YouTube transcript mining, TTS sentence
|
||||
imports, and known/new word comparison.
|
||||
|
||||
### Tools Overview
|
||||
The name is a coined Japanese compound from `採` as in gathering/collecting and
|
||||
`記` as in remembering or recording. Pronunciation: `saiki`, roughly
|
||||
"sigh-key".
|
||||
|
||||
| Tool | Purpose |
|
||||
|-------------------------------------------|--------------------------------------------------------------------------|
|
||||
| [`audio-extractor`](#audio-extractor) | Extract Anki card audio by language into playlists for passive listening |
|
||||
| [`batch_importer`](#batch_importer) | Generate TTS audio from sentence lists and import into Anki |
|
||||
| [`word-scraper`](#word-scraper) | Extract & lemmatize words from Anki decks (frequency analysis, mining) |
|
||||
| [`yt-transcript`](#yt-transcript) | Mine vocabulary/sentences from YouTube transcripts for analysis |
|
||||
| `deck-converter`* | Convert TSV+audio into `.apkg` Anki decks using config-driven workflow |
|
||||
| `youtube-to-anki`* | Convert YouTube subtitles/audio into fully timestamped Anki cards |
|
||||
```shell
|
||||
./saiki.py --help
|
||||
```
|
||||
|
||||
*=haven't used these tools in a very long time and will update them when I use them again
|
||||
## Requirements
|
||||
|
||||
### Requirements
|
||||
|
||||
Each tool has its own set of dependencies. Common dependencies include:
|
||||
- Python3
|
||||
- Python 3.12 recommended
|
||||
- [Anki](https://apps.ankiweb.net/) with [AnkiConnect](https://github.com/amikey/anki-connect)
|
||||
- `yt-dlp`, `jq`, `yq`, `spaCy`, `gTTS`, `youtube-transcript-api`, `pyyaml`, `genanki`, `fugashi`, `regex`, `requests`
|
||||
- `ffmpeg`
|
||||
|
||||
Personally, I like to have one venv that contains all the prerequisites.
|
||||
- Python dependencies from `requirements.txt`
|
||||
- spaCy models for word mining:
|
||||
|
||||
```shell
|
||||
python3.12 -m venv ~/.venv/anki-tools
|
||||
source ~/.venv/anki-tools/bin/activate
|
||||
python3 -m pip install -U pip
|
||||
pip install -r requirements.txt
|
||||
|
||||
# Also install system command-line dependencies
|
||||
sudo dnf install ffmpeg jq
|
||||
```
|
||||
That way, whenever you want to run these scripts, you can just source the venv and run the appropriate script.
|
||||
|
||||
```shell
|
||||
source ~/.venv/anki-tools/bin/activate
|
||||
```
|
||||
|
||||
### Getting started
|
||||
|
||||
Clone the repository:
|
||||
```shell
|
||||
git clone https://git.pawelsarkowicz.xyz/ps/anki-tools.git
|
||||
cd anki-tools
|
||||
```
|
||||
Then explore.
|
||||
|
||||
Most scripts assume:
|
||||
- Anki is running
|
||||
- the AnkiConnect add-on is enabled (default: http://localhost:8765)
|
||||
- your Anki cards are Basic notes, with audio on the front and the sentence (in the target language) on the back. These tools only look at the first line of the back, so you can have notes/translations/etc. on the following lines if you like.
|
||||

|
||||
|
||||
### Shared configuration
|
||||
|
||||
Common settings live in `anki_common.py`, including:
|
||||
- the AnkiConnect URL
|
||||
- language code mappings (`jp`, `es`)
|
||||
- deck-to-language mappings
|
||||
- default output directories
|
||||
- the default Anki `collection.media` path used by `audio_extractor.py`
|
||||
|
||||
If you rename your decks, add another language, or use a different default media location, update `anki_common.py` once instead of editing each script separately. Some settings can also be overridden at runtime, such as `audio_extractor.py --media-dir`.
|
||||
|
||||
### Language support
|
||||
- 🇯🇵 日本語
|
||||
- 🇪🇸 Español
|
||||
|
||||
|
||||
## audio-extractor
|
||||
**Purpose**: Extract audio referenced by `[sound:...]` tags from Anki decks, grouped by language.
|
||||
|
||||
### Usage:
|
||||
|
||||
```bash
|
||||
./audio_extractor.py jp [--concat] [--outdir DIR] [--media-dir DIR] [--copy-only-new]
|
||||
./audio_extractor.py es [--concat] [--outdir DIR] [--media-dir DIR] [--copy-only-new]
|
||||
```
|
||||
|
||||
Outputs:
|
||||
- Copies audio into `~/Languages/Anki/anki-audio/<language>/` by default
|
||||
- Writes `<language>.m3u`, including audio copied into subfolders
|
||||
- With `--concat`, writes `<language>_concat.mp3` (keeps individual files)
|
||||
|
||||
Options:
|
||||
- `--media-dir DIR`: override the Anki `collection.media` directory. By default, this uses the common Flatpak path: `~/.var/app/net.ankiweb.Anki/data/Anki2/User 1/collection.media`
|
||||
|
||||
### Requirements
|
||||
- Anki + AnkiConnect
|
||||
- `requests`
|
||||
- `ffmpeg` (only if you use `--concat`)
|
||||
|
||||
## batch_importer
|
||||
**Purpose**: Generate TTS audio from a sentence list and add notes to Anki via AnkiConnect.
|
||||
|
||||
### Usage
|
||||
|
||||
```bash
|
||||
./batch_anki_import.sh [jp|es] [--tags TAG1,TAG2,...]
|
||||
```
|
||||
|
||||
| Option | Description |
|
||||
| -------------------------- | ------------------------------------------------------------------------------------------------ |
|
||||
| `--tags TAG1,TAG2,...` | Comma-separated list of tags. `text-to-speech` is always included. Default: `AI-generated` |
|
||||
|
||||
### Tag behavior
|
||||
- By default, cards are tagged with `text-to-speech` and `AI-generated`
|
||||
- When `--tags` is specified, `text-to-speech` is always included, and `AI-generated` is replaced by the custom tags
|
||||
- Examples:
|
||||
- `./batch_anki_import.sh jp` → tags: `text-to-speech`, `AI-generated`
|
||||
- `./batch_anki_import.sh jp --tags manual` → tags: `text-to-speech`, `manual`
|
||||
- `./batch_anki_import.sh es --tags "youtube,media"` → tags: `text-to-speech`, `youtube`, `media`
|
||||
|
||||
### Requirements
|
||||
- Anki + AnkiConnect
|
||||
- `gtts-cli`, `ffmpeg`, `curl`, `jq`
|
||||
|
||||
### Sentence files
|
||||
- Japanese: `~/Languages/Anki/sentences_jp.txt`
|
||||
- Spanish: `~/Languages/Anki/sentences_es.txt`
|
||||
|
||||
### Notes
|
||||
- Audio files are generated in a temporary directory and cleaned up after import. No local audio files are retained.
|
||||
- Sentences and tags are encoded as JSON with `jq`, so quotes and punctuation in sentence files are handled safely.
|
||||
|
||||
## word-scraper
|
||||
|
||||
Extract frequent words from Anki notes using **AnkiConnect** and **spaCy**.
|
||||
This is primarily intended for language learning workflows (currently Japanese and Spanish).
|
||||
|
||||
The script:
|
||||
- queries notes from Anki
|
||||
- extracts visible text from a chosen field
|
||||
- tokenizes with spaCy
|
||||
- filters out stopwords / grammar
|
||||
- counts word frequencies
|
||||
- writes a sorted word list to a text file
|
||||
|
||||
|
||||
### Requirements
|
||||
|
||||
- Anki + AnkiConnect
- Python **3.12** (recommended; spaCy is not yet stable on 3.14)
|
||||
- `spacy`, `regex`, `requests`
|
||||
- spaCy models:
|
||||
```bash
|
||||
python -m spacy download es_core_news_sm
|
||||
python -m spacy download ja_core_news_lg
|
||||
```
|
||||
|
||||
### Usage
|
||||
```bash
|
||||
./word_scraper.py {jp,es} [options]
|
||||
Setup example:
|
||||
|
||||
```shell
|
||||
python3.12 -m venv ~/.venv/saiki
|
||||
source ~/.venv/saiki/bin/activate
|
||||
python3 -m pip install -U pip
|
||||
pip install -r requirements.txt
|
||||
sudo dnf install ffmpeg
|
||||
```
|
||||
|
||||
| Option | Description |
|
||||
| --------------------- | -------------------------------------------------------------------- |
|
||||
| `--query QUERY` | Full Anki search query (e.g. `deck:"Español" tag:foo`) |
|
||||
| `--deck DECK` | Deck name (repeatable). If omitted, decks are inferred from language |
|
||||
| `--field FIELD` | Note field to read (default: `Back`) |
|
||||
| `--min-freq N` | Minimum frequency to include (default: `2`) |
|
||||
| `--outdir DIR` | Output directory (default: `~/Languages/Anki/anki-words/<language>`) |
|
||||
| `--out FILE` | Output file path (default: `<outdir>/words_<lang>.txt`) |
|
||||
| `--full-field` | Use full field text instead of only the first visible line |
|
||||
| `--spacy-model MODEL` | Override spaCy model name |
|
||||
| `--logfile FILE` | Log file path |
|
||||
## Configuration
|
||||
|
||||
### Examples
|
||||
#### Basic usage (auto-detected decks)
|
||||
```bash
|
||||
./word_scraper.py jp
|
||||
./word_scraper.py es
|
||||
Defaults are built in, but you can override them with YAML:
|
||||
|
||||
```shell
|
||||
~/.config/saiki/config.yaml
|
||||
```
|
||||
|
||||
#### Specify a deck explicitly
|
||||
```bash
|
||||
./word_scraper.py jp --deck "日本語"
|
||||
./word_scraper.py es --deck "Español"
|
||||
Or pass a config explicitly:
|
||||
|
||||
```shell
|
||||
./saiki.py --config ./config.yaml words jp
|
||||
```
|
||||
|
||||
#### Use a custom Anki query
|
||||
```bash
|
||||
./word_scraper.py es --query 'deck:"Español" tag:youtube'
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
anki_connect_url: http://localhost:8765
|
||||
media_dir: ~/.var/app/net.ankiweb.Anki/data/Anki2/User 1/collection.media
|
||||
audio_output_root: ~/Languages/Anki/anki-audio
|
||||
word_output_root: ~/Languages/Anki/anki-words
|
||||
sentence_dir: ~/Languages/Anki
|
||||
note_model: Basic
|
||||
fields:
|
||||
front: Front
|
||||
back: Back
|
||||
languages:
|
||||
jp:
|
||||
name: japanese
|
||||
transcript_code: ja
|
||||
tts_code: ja
|
||||
tts_tld: com
|
||||
tts_tempo: 1.35
|
||||
decks: ["日本語"]
|
||||
field: Back
|
||||
word_model: ja_core_news_lg
|
||||
sentence_file: sentences_jp.txt
|
||||
es:
|
||||
name: spanish
|
||||
transcript_code: es
|
||||
tts_code: es
|
||||
tts_tld: es
|
||||
tts_tempo: 1.25
|
||||
decks: ["Español"]
|
||||
field: Back
|
||||
word_model: es_core_news_sm
|
||||
sentence_file: sentences_es.txt
|
||||
```
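
The "built-in defaults, overridden by YAML" behaviour can be pictured as a recursive dictionary merge. The following is a minimal sketch under stated assumptions, not Saiki's actual loader: `DEFAULTS`, `merge_config`, and the trimmed-down key set are illustrative names, and the parsed YAML is represented by a plain dict (in practice it would presumably come from a YAML parser).

```python
from copy import deepcopy

# Hypothetical subset of built-in defaults, mirroring keys from the example config.
DEFAULTS = {
    "anki_connect_url": "http://localhost:8765",
    "note_model": "Basic",
    "fields": {"front": "Front", "back": "Back"},
    "languages": {
        "es": {"name": "spanish", "tts_tempo": 1.25, "decks": ["Español"]},
    },
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict in which values from `overrides` win over `defaults`.

    Nested dicts are merged key by key; any other value (including lists)
    is replaced wholesale. Neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for the parsed contents of a user's config.yaml (assumed shape).
user_config = {
    "fields": {"back": "Sentence"},
    "languages": {"es": {"tts_tempo": 1.1}},
}

config = merge_config(DEFAULTS, user_config)
print(config["fields"])
print(config["languages"]["es"])
```

With a merge like this, a user file only needs to list the keys it changes; everything else keeps its default. Whether lists such as `decks` should be replaced or extended is a design choice, and this sketch replaces them.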
|
||||
|
||||
#### Change output location and frequency threshold
|
||||
```bash
|
||||
./word_scraper.py jp --min-freq 3 --out words_jp.txt
|
||||
./word_scraper.py es --outdir ~/tmp/words --out spanish_words.txt
|
||||
A copyable template is also available at `examples/config.yaml`.
|
||||
|
||||
Supported language codes by default:
|
||||
|
||||
- `jp`
|
||||
- `es`
|
||||
|
||||
## CLI
|
||||
|
||||
### Audio
|
||||
|
||||
Extract audio referenced by `[sound:...]` tags from configured decks and create
|
||||
an `.m3u` playlist.
|
||||
|
||||
```shell
|
||||
./saiki.py audio jp
|
||||
./saiki.py audio es --concat
|
||||
./saiki.py audio jp --media-dir ~/.local/share/Anki2/User\ 1/collection.media --copy-only-new
|
||||
```
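
To illustrate what "audio referenced by `[sound:...]` tags" means in practice, here is a small sketch of the extraction step. The regular expression is the one used by the removed `audio_extractor.py`; the helper names `sound_files` and `playlist_text` are hypothetical, and the playlist step is simplified (the original script walked the output directory instead of using the extracted names directly).

```python
import re

# Same pattern as the removed audio_extractor.py: non-greedy match up to the first "]".
SOUND_TAG = re.compile(r"\[sound:(.+?)\]")


def sound_files(field_value: str) -> list[str]:
    """Return every media filename referenced by a [sound:...] tag in a field."""
    return SOUND_TAG.findall(field_value)


def playlist_text(filenames: list[str]) -> str:
    """Build an .m3u body: unique names, sorted, one per line (simplified)."""
    return "".join(f"{name}\n" for name in sorted(set(filenames)))


front = "[sound:tts_001.mp3]<br>[sound:sub/clip.ogg]"
names = sound_files(front)
print(names)
print(playlist_text(names), end="")
```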
|
||||
|
||||
#### Process full field text (not just first line)
|
||||
```bash
|
||||
./word_scraper.py jp --full-field
|
||||
Outputs go to `~/Languages/Anki/anki-audio/<language>/` by default.
|
||||
|
||||
### Words
|
||||
|
||||
Extract frequent words from Anki notes using AnkiConnect and spaCy.
|
||||
|
||||
```shell
|
||||
./saiki.py words jp
|
||||
./saiki.py words es --deck "Español"
|
||||
./saiki.py words es --query 'deck:"Español" tag:youtube'
|
||||
./saiki.py words jp --min-freq 3 --out words_jp.txt
|
||||
./saiki.py words jp --full-field
|
||||
```
|
||||
|
||||
### Output format
|
||||
The output file contains one entry per line:
|
||||
```
|
||||
Output format:
|
||||
|
||||
```text
|
||||
word frequency
|
||||
```
|
||||
|
||||
Examples:
|
||||
```
|
||||
|
||||
```text
|
||||
comer 12
|
||||
hablar 9
|
||||
行く (行き) 8
|
||||
見る (見た) 6
|
||||
```
|
||||
|
||||
- Spanish output uses lemmas
|
||||
- Japanese output includes lemma (surface) when they differ
|
||||
### YouTube
|
||||
|
||||
### Language-specific notes
|
||||
#### Japanese
|
||||
- Filters out particles and common grammar
|
||||
- Keeps nouns, verbs, adjectives, and proper nouns
|
||||
- Requires `regex` for Unicode script matching
|
||||
Mine vocabulary or sentence rows from YouTube subtitles.
|
||||
|
||||
#### Spanish
|
||||
- Filters stopwords
|
||||
- Keeps alphabetic tokens only
|
||||
- Lemmatized output
|
||||
|
||||
## yt-transcript
|
||||
Extract vocabulary or sentence-level text from YouTube video subtitles (transcripts), for language learning or analysis.
|
||||
|
||||
The script:
|
||||
- fetches captions via `youtube-transcript-api`
|
||||
- supports **Spanish (es)** and **Japanese (jp)**
|
||||
- tokenizes Japanese using **MeCab (via fugashi)**
|
||||
- outputs either:
|
||||
- word frequency lists, or
|
||||
- timestamped transcript lines
|
||||
|
||||
### Features
|
||||
|
||||
- Extract full vocabulary lists with frequency counts
|
||||
- Extract sentences (with timestamps or sentence indices)
|
||||
- Support for Japanese tokenization
|
||||
- Optional: stopword filtering
|
||||
- Modular and extendable for future features like CSV export or audio slicing
|
||||
|
||||
### Requirements
|
||||
- `youtube-transcript-api`
|
||||
- For Japanese tokenization:
|
||||
```
|
||||
pip install "fugashi[unidic-lite]"
|
||||
```
|
||||
|
||||
### Usage
|
||||
```shell
|
||||
./yt-transcript.py {jp,es} <video_url_or_id> [options]
|
||||
./saiki.py youtube es VIDEO_ID
|
||||
./saiki.py youtube es VIDEO_ID --top 50
|
||||
./saiki.py youtube jp VIDEO_ID --mode sentences
|
||||
./saiki.py youtube es VIDEO_ID --raw --no-stopwords
|
||||
```
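
As a rough picture of what vocabulary mode computes, the sketch below counts tokens from a transcript string. It is an illustration only: the tokenizer regex, the tiny `STOPWORDS` set, and the parameter names `keep_stopwords` and `lowercase` are assumptions standing in for `--no-stopwords` and `--raw`, not the tool's real implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"el", "la", "de", "que", "y", "en"}


def vocab_counts(text, top=None, keep_stopwords=False, lowercase=True):
    """Return (word, count) pairs, most frequent first.

    `top` limits the result to the N most common words (like --top N).
    """
    if lowercase:
        text = text.lower()
    # Runs of Unicode letters, so accented characters stay inside a token.
    tokens = re.findall(r"[^\W\d_]+", text)
    if not keep_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return Counter(tokens).most_common(top)


counts = vocab_counts("El niño come y la niña come.")
print(counts)
print(vocab_counts("El niño come y la niña come.", top=1))
```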
|
||||
|
||||
### Options
|
||||
| Option | Description |
|
||||
| -------------------------- | -------------------------------------- |
|
||||
| `--mode {vocab,sentences}` | Output mode (default: `vocab`) |
|
||||
| `--top N` | Show only the top N words (vocab mode) |
|
||||
| `--no-stopwords` | Keep common words |
|
||||
| `--raw` | (Spanish only) Do not lowercase tokens |
|
||||
Export Anki-ready sentence rows:
|
||||
|
||||
|
||||
### Examples
|
||||
#### Extract Spanish vocabulary
|
||||
```bash
|
||||
./yt-transcript.py es https://youtu.be/VIDEO_ID
|
||||
```shell
|
||||
./saiki.py youtube es VIDEO_ID --mode sentences --out youtube.tsv
|
||||
```
|
||||
|
||||
#### Top 50 words
|
||||
```bash
|
||||
./yt-transcript.py es VIDEO_ID --top 50
|
||||
Export only rows that appear to contain unknown vocabulary:
|
||||
|
||||
```shell
|
||||
./saiki.py youtube es VIDEO_ID \
|
||||
--mode sentences \
|
||||
--out youtube_new.tsv \
|
||||
--known-words ~/Languages/Anki/anki-words/spanish/words_es.txt \
|
||||
--only-new
|
||||
```
|
||||
|
||||
#### Japanese transcript with timestamps
|
||||
```bash
|
||||
./yt-transcript.py jp VIDEO_ID --mode sentences
|
||||
Sentence exports contain:
|
||||
|
||||
```text
|
||||
sentence timestamp video_url vocab_guess
|
||||
```
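
For readers who want to produce or post-process such a file themselves, here is a minimal sketch of writing rows with those four columns as TSV using the standard library. The presence of a header row, the timestamp format, and the sample values are assumptions; only the column names come from the listing above.

```python
import csv
import io

# Column names as listed for sentence exports.
COLUMNS = ["sentence", "timestamp", "video_url", "vocab_guess"]


def rows_to_tsv(rows: list[dict]) -> str:
    """Serialize rows to tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative row; the values are made up.
rows = [
    {
        "sentence": "Hoy hace buen tiempo.",
        "timestamp": "12.34",
        "video_url": "https://youtu.be/VIDEO_ID",
        "vocab_guess": "tiempo",
    }
]
tsv = rows_to_tsv(rows)
print(tsv, end="")
```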
|
||||
|
||||
#### Keep Spanish casing and stopwords
|
||||
```bash
|
||||
./yt-transcript.py es VIDEO_ID --raw --no-stopwords
|
||||
### Import
|
||||
|
||||
Generate TTS audio and add sentence cards to Anki.
|
||||
|
||||
```shell
|
||||
./saiki.py import es
|
||||
./saiki.py import jp ~/Languages/Anki/sentences_jp.txt
|
||||
./saiki.py import es youtube.tsv --tags youtube,manual
|
||||
```
|
||||
|
||||
### Output formats
|
||||
#### Vocabulary mode
|
||||
```
|
||||
palabra: count
|
||||
```
|
||||
Example:
|
||||
```
|
||||
comer: 12
|
||||
hablar: 9
|
||||
The importer accepts plain text sentence files and TSV/CSV files with a
|
||||
`sentence` column. `text-to-speech` is always added as a tag. If `--tags` is not
|
||||
provided, `AI-generated` is added.
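
The tag rule and the accepted input formats can be sketched as follows. This is not the importer's real code: `build_tags` and `read_sentences` are hypothetical helpers, and choosing the parser by file extension is an assumption made for the example.

```python
from __future__ import annotations

import csv
from pathlib import Path


def build_tags(custom: str | None) -> list[str]:
    """`text-to-speech` is always present; `AI-generated` only without --tags."""
    tags = ["text-to-speech"]
    if custom:
        tags += [t.strip() for t in custom.split(",") if t.strip()]
    else:
        tags.append("AI-generated")
    return tags


def read_sentences(path: Path) -> list[str]:
    """Read sentences from a plain text file or a TSV/CSV `sentence` column."""
    suffix = path.suffix.lower()
    if suffix in {".tsv", ".csv"}:
        delimiter = "\t" if suffix == ".tsv" else ","
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh, delimiter=delimiter)
            return [
                row["sentence"].strip()
                for row in reader
                if (row.get("sentence") or "").strip()
            ]
    # Plain text: one sentence per line, blank lines ignored.
    with path.open(encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


print(build_tags(None))
print(build_tags("youtube,manual"))
```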
|
||||
|
||||
### Known/New Words
|
||||
|
||||
Compare any generated word list against an existing known list:
|
||||
|
||||
```shell
|
||||
./saiki.py compare-words transcript_words.txt ~/Languages/Anki/anki-words/spanish/words_es.txt
|
||||
```
|
||||
|
||||
#### Sentence mode
|
||||
```
|
||||
[12.34s] sentence text here
|
||||
```
|
||||
Example:
|
||||
```
|
||||
[45.67s] 今日はいい天気ですね
|
||||
This prints entries from the first file whose word key does not appear in the
|
||||
second file.
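
The comparison amounts to a set-membership test on a per-line key. Here is a minimal sketch, assuming the key is the first whitespace-separated token of each entry with any trailing colon dropped; `word_key` and `new_entries` are hypothetical names and the real key rule may differ.

```python
def word_key(line: str) -> str:
    """Assumed key: first whitespace-separated token, without a trailing colon."""
    parts = line.split()
    return parts[0].rstrip(":") if parts else ""


def new_entries(candidate_lines, known_lines):
    """Return candidate entries whose key is absent from the known list."""
    known = {word_key(line) for line in known_lines if line.strip()}
    return [
        line
        for line in candidate_lines
        if line.strip() and word_key(line) not in known
    ]


transcript_words = ["comer 12", "viajar 3", "行く (行き) 8"]
known_words = ["comer 40", "行く 2"]
print(new_entries(transcript_words, known_words))
```

Because only the key is compared, frequencies and any surface-form annotation on the line do not affect the match.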
|
||||
|
||||
## Card Assumptions
|
||||
|
||||
The default configuration assumes Basic notes with audio on `Front` and the
|
||||
target-language sentence on `Back`. Word mining reads only the first visible
|
||||
line by default; use `--full-field` to process the whole field.
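
"First visible line" can be made concrete with a short sketch. Anki stores field contents as HTML, so one plausible approach is to drop `[sound:...]` tags, turn line-breaking tags into newlines, strip the remaining markup, and take the first non-empty line. The function name and the exact tag handling are assumptions, not the project's code.

```python
import html
import re


def first_visible_line(field_html: str) -> str:
    """Return the first non-empty line of text in an Anki field (sketch)."""
    text = re.sub(r"\[sound:.+?\]", "", field_html)          # drop audio references
    text = re.sub(r"(?i)<br\s*/?>|</div>|</p>", "\n", text)  # line-breaking tags
    text = re.sub(r"<[^>]+>", "", text)                      # remaining markup
    text = html.unescape(text)                               # &amp; -> &, etc.
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


back = "<div>今日はいい天気ですね</div><div>It's nice weather today</div>"
print(first_visible_line(back))
```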
|
||||
|
||||

|
||||
|
||||
## To Do
|
||||
|
||||
- Add support for different Anki note/card types, including configurable field
|
||||
mappings per language and per import workflow.
|
||||
- Support multiple import profiles, such as sentence cards, vocab cards, audio
|
||||
cards, and cloze cards.
|
||||
- Let YouTube exports map directly into configurable note fields, not just a
|
||||
fixed `sentence` column.
|
||||
- Add richer transcript filtering, such as minimum/maximum sentence length,
|
||||
duplicate removal, and punctuation cleanup.
|
||||
- Add optional audio slicing from videos when timestamp data is available.
|
||||
- Improve known/new word matching with better lemmatization for transcript
|
||||
vocabulary.
|
||||
- Add more language profiles beyond Japanese and Spanish.
|
||||
- Add a dry-run mode for imports that previews notes before sending anything to
|
||||
AnkiConnect.
|
||||
- Build a GUI for common workflows like transcript review, sentence selection,
|
||||
import previews, and configuration editing.
|
||||
- Add integration tests with mocked AnkiConnect responses.
|
||||
- Add shell completion or a small installed command once packaging becomes
|
||||
useful.
|
||||
|
||||
## Tests
|
||||
|
||||
Pure logic tests use the standard library test runner:
|
||||
|
||||
```shell
|
||||
python -m unittest discover -s tests
|
||||
```
|
||||
|
||||
### Language Notes
|
||||
#### Spanish
|
||||
- Simple regex-based tokenizer
|
||||
- Accented characters supported
|
||||
- Lowercased by default
|
||||
## License
|
||||
|
||||
#### Japanese
|
||||
- Uses fugashi (MeCab)
|
||||
- Outputs surface forms
|
||||
- Filters via stopword list only (no POS filtering)
|
||||
|
||||
# License
|
||||
|
||||
This project is licensed under the MIT License.
|
||||
See the [`LICENSE`](./LICENSE) file for details.
|
||||
This project is licensed under the MIT License. See [`LICENSE`](./LICENSE).
|
||||
|
||||
@@ -1,47 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Shared configuration and AnkiConnect helpers for the toolkit scripts."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from typing import Dict
|
||||
|
||||
import requests
|
||||
|
||||
ANKI_CONNECT_URL = "http://localhost:8765"
|
||||
|
||||
LANG_MAP: Dict[str, str] = {
|
||||
"jp": "japanese",
|
||||
"es": "spanish",
|
||||
}
|
||||
|
||||
TRANSCRIPT_LANG_MAP: Dict[str, str] = {
|
||||
"jp": "ja",
|
||||
"es": "es",
|
||||
}
|
||||
|
||||
DECK_TO_LANGUAGE: Dict[str, str] = {
|
||||
"日本語": "japanese",
|
||||
"Español": "spanish",
|
||||
}
|
||||
|
||||
DEFAULT_ANKI_MEDIA_DIR = os.path.expanduser(
|
||||
"~/.var/app/net.ankiweb.Anki/data/Anki2/User 1/collection.media"
|
||||
)
|
||||
|
||||
DEFAULT_AUDIO_OUTPUT_ROOT = os.path.expanduser("~/Languages/Anki/anki-audio")
|
||||
DEFAULT_WORD_OUTPUT_ROOT = os.path.expanduser("~/Languages/Anki/anki-words")
|
||||
|
||||
|
||||
def anki_request(action: str, **params):
|
||||
"""Make an AnkiConnect request and return the result payload."""
|
||||
resp = requests.post(
|
||||
ANKI_CONNECT_URL,
|
||||
json={"action": action, "version": 6, "params": params},
|
||||
timeout=30,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
if data.get("error") is not None:
|
||||
raise RuntimeError(f"AnkiConnect error for {action}: {data['error']}")
|
||||
return data["result"]
|
||||
@@ -1,250 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
audio_extractor.py
|
||||
|
||||
Extract all Anki media referenced by [sound:...] tags from one or more decks (grouped by language),
|
||||
copy them into a language-specific output folder, write an .m3u playlist, and optionally concatenate
|
||||
all audio into a single MP3 file.
|
||||
|
||||
Howto:
|
||||
./audio_extractor.py jp [--concat] [--outdir DIR] [--copy-only-new]
|
||||
./audio_extractor.py es [--concat] [--outdir DIR] [--copy-only-new]
|
||||
|
||||
Requirements:
|
||||
- Anki running + AnkiConnect enabled at http://localhost:8765
|
||||
- Python package: requests
|
||||
- OPTIONAL (for --concat): ffmpeg
|
||||
|
||||
Notes:
|
||||
- This scans all fields of each note and extracts filenames inside [sound:...]
|
||||
- It copies referenced media files out of Anki's collection.media folder
|
||||
- It preserves filenames (and subfolders if they exist)
|
||||
"""
|
||||
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import argparse
|
||||
import shutil
|
||||
import subprocess
|
||||
import tempfile
|
||||
from typing import List
|
||||
|
||||
from anki_common import (
|
||||
DEFAULT_ANKI_MEDIA_DIR,
|
||||
DEFAULT_AUDIO_OUTPUT_ROOT,
|
||||
DECK_TO_LANGUAGE,
|
||||
LANG_MAP,
|
||||
anki_request,
|
||||
)
|
||||
|
||||
AUDIO_EXTS = (".mp3", ".wav", ".ogg", ".m4a", ".flac")
|
||||
|
||||
|
||||
def ensure_ffmpeg_available() -> None:
|
||||
"""Raise a helpful error if ffmpeg isn't installed."""
|
||||
if shutil.which("ffmpeg") is None:
|
||||
raise RuntimeError("ffmpeg not found in PATH. Install ffmpeg to use --concat.")
|
||||
|
||||
|
||||
def resolve_media_paths(media_dir: str, out_dir: str, media_name: str) -> tuple[str, str] | None:
|
||||
"""Return safe source/destination paths for an Anki media filename."""
|
||||
normalized = os.path.normpath(media_name)
|
||||
if os.path.isabs(normalized) or normalized.startswith(".."):
|
||||
return None
|
||||
return os.path.join(media_dir, normalized), os.path.join(out_dir, normalized)
|
||||
|
||||
|
||||
def build_playlist(out_dir: str, language: str) -> str:
|
||||
"""
|
||||
Create an .m3u playlist listing audio files under out_dir (sorted by filename).
|
||||
Returns the playlist path.
|
||||
"""
|
||||
m3u_path = os.path.join(out_dir, f"{language}.m3u")
|
||||
concat_name = f"{language}_concat.mp3"
|
||||
files: List[str] = []
|
||||
for root, _, filenames in os.walk(out_dir):
|
||||
for fname in filenames:
|
||||
abs_path = os.path.join(root, fname)
|
||||
rel_path = os.path.relpath(abs_path, out_dir)
|
||||
if rel_path == os.path.basename(m3u_path):
|
||||
continue
|
||||
if rel_path == concat_name:
|
||||
continue
|
||||
if fname.lower().endswith(AUDIO_EXTS) and os.path.isfile(abs_path):
|
||||
files.append(rel_path)
|
||||
|
||||
with open(m3u_path, "w", encoding="utf-8") as fh:
|
||||
for fname in sorted(files):
|
||||
fh.write(f"{fname}\n")
|
||||
|
||||
return m3u_path
|
||||
|
||||
|
||||
def concat_audio_from_m3u(out_dir: str, m3u_path: str, out_path: str) -> None:
|
||||
"""
|
||||
Concatenate audio files in the order listed in the .m3u.
|
||||
Uses ffmpeg concat demuxer and re-encodes to MP3 for reliability.
|
||||
|
||||
Keeps original files untouched.
|
||||
"""
|
||||
ensure_ffmpeg_available()
|
||||
|
||||
# Read playlist entries (filenames, one per line)
|
||||
with open(m3u_path, "r", encoding="utf-8") as fh:
|
||||
rel_files = [line.strip() for line in fh if line.strip()]
|
||||
|
||||
# Filter to existing audio files
|
||||
abs_files: List[str] = []
|
||||
for rel in rel_files:
|
||||
p = os.path.join(out_dir, rel)
|
||||
if os.path.isfile(p) and rel.lower().endswith(AUDIO_EXTS):
|
||||
abs_files.append(os.path.abspath(p))
|
||||
|
||||
if not abs_files:
|
||||
raise RuntimeError("No audio files found to concatenate (playlist is empty?).")
|
||||
|
||||
# ffmpeg concat demuxer expects a file with lines like: file '/abs/path/to/file'
|
||||
# Use a temp file so we don't leave junk behind if ffmpeg fails.
|
||||
with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
|
||||
concat_list_path = tmp.name
|
||||
for p in abs_files:
|
||||
# Escape single quotes for ffmpeg concat list
|
||||
safe = p.replace("'", "'\\''")
|
||||
tmp.write(f"file '{safe}'\n")
|
||||
|
||||
# Re-encode to MP3 to avoid header/codec mismatches across files
|
||||
cmd = [
|
||||
"ffmpeg",
|
||||
"-hide_banner",
|
||||
"-loglevel", "error",
|
||||
"-f", "concat",
|
||||
"-safe", "0",
|
||||
"-i", concat_list_path,
|
||||
"-c:a", "libmp3lame",
|
||||
"-q:a", "4",
|
||||
"-y",
|
||||
out_path,
|
||||
]
|
||||
|
||||
try:
|
||||
subprocess.run(cmd, check=True)
|
||||
finally:
|
||||
try:
|
||||
os.remove(concat_list_path)
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Extract Anki audio by language."
|
||||
)
|
||||
|
||||
# REQUIRED positional language code: jp / es
|
||||
parser.add_argument(
|
||||
"lang",
|
||||
choices=sorted(LANG_MAP.keys()),
|
||||
help="Language code (jp or es).",
|
||||
)
|
||||
|
||||
# Match bash-style flags
|
||||
parser.add_argument(
|
||||
"--concat",
|
||||
action="store_true",
|
||||
help="Also output a single concatenated MP3 file (in playlist order).",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--outdir",
|
||||
help="Output directory. Default: ~/Languages/Anki/anki-audio/<language>",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--media-dir",
|
||||
default=DEFAULT_ANKI_MEDIA_DIR,
|
||||
help="Anki collection.media directory. Defaults to the common Flatpak profile path.",
|
||||
)
|
||||
|
||||
# Keep your existing useful behavior
|
||||
parser.add_argument(
|
||||
"--copy-only-new",
|
||||
action="store_true",
|
||||
help="Skip overwriting existing files.",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
language = LANG_MAP[args.lang]
|
||||
media_dir = os.path.expanduser(args.media_dir)
|
||||
|
||||
# Find all decks whose mapped language matches
|
||||
selected_decks = [deck for deck, lang in DECK_TO_LANGUAGE.items() if lang == language]
|
||||
if not selected_decks:
|
||||
print(f"No decks found for language: {language}", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
# Output folder: either user-specified --outdir or default output root/<language>
|
||||
out_dir = os.path.expanduser(args.outdir) if args.outdir else os.path.join(DEFAULT_AUDIO_OUTPUT_ROOT, language)
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
# Collect note IDs across selected decks
|
||||
all_ids: List[int] = []
|
||||
for deck in selected_decks:
|
||||
ids = anki_request("findNotes", query=f'deck:"{deck}"')
|
||||
all_ids.extend(ids)
|
||||
|
||||
if not all_ids:
|
||||
print(f"No notes found in decks for language: {language}")
|
||||
return 0
|
||||
|
||||
# Fetch notes info (fields contain [sound:...] references)
|
||||
notes = anki_request("notesInfo", notes=all_ids)
|
||||
|
||||
# Copy referenced audio files into out_dir
|
||||
copied: List[str] = []
|
||||
for note in notes:
|
||||
fields = note.get("fields", {})
|
||||
for field in fields.values():
|
||||
val = field.get("value", "") or ""
|
||||
for match in re.findall(r"\[sound:(.+?)\]", val):
|
||||
paths = resolve_media_paths(media_dir, out_dir, match)
|
||||
if paths is None:
|
||||
print(f"Skipping unsafe media reference: {match}", file=sys.stderr)
|
||||
continue
|
||||
src, dst = paths
|
||||
|
||||
if not os.path.exists(src):
|
||||
continue
|
||||
|
||||
# If Anki stored media in subfolders, ensure the subfolder exists in out_dir
|
||||
dst_parent = os.path.dirname(dst)
|
||||
if dst_parent:
|
||||
os.makedirs(dst_parent, exist_ok=True)
|
||||
|
||||
if args.copy_only_new and os.path.exists(dst):
|
||||
continue
|
||||
|
||||
shutil.copy2(src, dst)
|
||||
copied.append(match)
|
||||
|
||||
# Create playlist, including audio in subfolders.
|
||||
m3u_path = build_playlist(out_dir, language)
|
||||
|
||||
print(f"\n✅ Copied {len(copied)} files for {language}")
|
||||
print(f"🎵 Playlist created at: {m3u_path}")
|
||||
print(f"📁 Output directory: {out_dir}")
|
||||
|
||||
# Optional: concatenate all audio into one MP3 (order = playlist order)
|
||||
if args.concat:
|
||||
concat_out = os.path.join(out_dir, f"{language}_concat.mp3")
|
||||
try:
|
||||
concat_audio_from_m3u(out_dir, m3u_path, concat_out)
|
||||
print(f"🎧 Concatenated file created at: {concat_out}")
|
||||
except Exception as e:
|
||||
print(f"❌ Concatenation failed: {e}", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -1,188 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
prog="$(basename "$0")"
|
||||
|
||||
print_help() {
|
||||
cat <<EOF
|
||||
usage: $prog [-h] [--tags TAG1,TAG2,...] {es,jp}
|
||||
|
||||
positional arguments:
|
||||
{es,jp}
|
||||
|
||||
options:
|
||||
-h, --help show this help message and exit
|
||||
--tags TAG1,TAG2,... comma-separated list of additional tags (default: AI-generated)
|
||||
text-to-speech is always included
|
||||
EOF
|
||||
}
|
||||
|
||||
arg_error_missing_lang() {
|
||||
echo "usage: $prog [-h] [--tags TAG1,TAG2,...] {es,jp}" >&2
|
||||
echo "$prog: error: the following arguments are required: lang" >&2
|
||||
exit 2
|
||||
}
|
||||
|
||||
arg_error_unknown() {
|
||||
echo "usage: $prog [-h] [--tags TAG1,TAG2,...] {es,jp}" >&2
|
||||
  echo "$prog: error: unrecognized arguments: $*" >&2
  exit 2
}

lang=""
custom_tags=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    -h|--help)
      print_help
      exit 0
      ;;
    --tags)
      if [[ -z "$2" || "$2" == -* ]]; then
        echo "$prog: error: --tags requires an argument" >&2
        exit 2
      fi
      custom_tags="$2"
      shift 2
      ;;
    jp|es)
      if [[ -n "$lang" ]]; then
        arg_error_unknown "$1"
      fi
      lang="$1"
      shift
      ;;
    *)
      arg_error_unknown "$1"
      ;;
  esac
done

[[ -z "$lang" ]] && arg_error_missing_lang

require_command() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$prog: error: required command not found: $1" >&2
    exit 1
  fi
}

require_command gtts-cli
require_command ffmpeg
require_command curl
require_command jq

# Build tags JSON array - text-to-speech is always included
tags=("text-to-speech")
if [[ -n "$custom_tags" ]]; then
  IFS=',' read -ra tag_array <<< "$custom_tags"
  for tag in "${tag_array[@]}"; do
    # Trim whitespace
    tag="$(printf '%s' "$tag" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')"
    [[ -n "$tag" ]] && tags+=("$tag")
  done
else
  tags+=("AI-generated")
fi

TAGS="$(printf '%s\n' "${tags[@]}" | jq -R . | jq -s .)"

case "$lang" in
  jp)
    DECK_NAME="日本語"
    LANG_CODE="ja"
    TLD="com"
    TEMPO="1.35"
    SENTENCE_FILE="$HOME/Languages/Anki/sentences_jp.txt"
    ;;
  es)
    DECK_NAME="Español"
    LANG_CODE="es"
    TLD="es"
    TEMPO="1.25"
    SENTENCE_FILE="$HOME/Languages/Anki/sentences_es.txt"
    ;;
esac

count=0

if [[ ! -f "$SENTENCE_FILE" ]]; then
  echo "$prog: error: sentence file not found: $SENTENCE_FILE" >&2
  exit 1
fi

# Use a temporary directory to handle processing
TEMP_DIR=$(mktemp -d)
trap 'rm -rf "$TEMP_DIR"' EXIT

while IFS= read -r sentence || [[ -n "$sentence" ]]; do
  [[ -z "$sentence" ]] && continue

  # Generate unique filenames
  BASENAME="tts_$(date +%Y%m%d_%H%M%S)_${lang}_$RANDOM"
  # Path for the raw output from gtts
  RAW_OUTPUT="$TEMP_DIR/${BASENAME}_original.mp3"
  # Path for the sped-up output that goes to Anki
  OUTPUT_PATH="$TEMP_DIR/${BASENAME}.mp3"

  echo "🔊 Processing: $sentence"

  # 1. Generate TTS with specific TLD
  if gtts-cli "$sentence" --lang "$LANG_CODE" --tld "$TLD" --output "$RAW_OUTPUT"; then

    # 2. Speed up audio using ffmpeg without changing pitch
    if ffmpeg -loglevel error -i "$RAW_OUTPUT" -filter:a "atempo=$TEMPO" -y "$OUTPUT_PATH" < /dev/null; then

      # 3. Add to Anki using the sped-up file
      payload="$(jq -n \
        --arg deck "$DECK_NAME" \
        --arg sentence "$sentence" \
        --arg path "$OUTPUT_PATH" \
        --arg filename "${BASENAME}.mp3" \
        --argjson tags "$TAGS" \
        '{
          action: "addNote",
          version: 6,
          params: {
            note: {
              deckName: $deck,
              modelName: "Basic",
              fields: {
                Front: "",
                Back: $sentence
              },
              options: {
                allowDuplicate: false
              },
              tags: $tags,
              audio: [{
                path: $path,
                filename: $filename,
                fields: ["Front"]
              }]
            }
          }
        }')"

      result=$(curl -s localhost:8765 -X POST -H "Content-Type: application/json" -d "$payload")

      if jq -e '.error == null' >/dev/null 2>&1 <<< "$result"; then
        echo "✅ Added card: $sentence"
        ((count++))
      else
        echo "❌ Failed to add card: $sentence"
        echo "$result"
      fi
    else
      echo "❌ Failed to speed up audio for: $sentence"
    fi

    # 4. Cleanup
    rm -f "$OUTPUT_PATH" "$RAW_OUTPUT"
  else
    echo "❌ Failed to generate TTS for: $sentence"
  fi

done <"$SENTENCE_FILE"

echo "🎉 Done! Added $count cards to deck \"$DECK_NAME\"."
43
examples/config.yaml
Normal file
@@ -0,0 +1,43 @@
# Example Saiki configuration.
#
# Copy this to ~/.config/saiki/config.yaml and adjust paths, decks, fields, and
# language profiles for your Anki setup.

anki_connect_url: http://localhost:8765

# Flatpak Anki commonly stores media here. A typical non-Flatpak Linux install
# may use: ~/.local/share/Anki2/User 1/collection.media
media_dir: ~/.var/app/net.ankiweb.Anki/data/Anki2/User 1/collection.media
audio_output_root: ~/Languages/Anki/anki-audio
word_output_root: ~/Languages/Anki/anki-words
sentence_dir: ~/Languages/Anki
note_model: Basic

fields:
  front: Front
  back: Back

languages:
  jp:
    name: japanese
    transcript_code: ja
    tts_code: ja
    tts_tld: com
    tts_tempo: 1.35
    decks:
      - 日本語
    field: Back
    word_model: ja_core_news_lg
    sentence_file: sentences_jp.txt

  es:
    name: spanish
    transcript_code: es
    tts_code: es
    tts_tld: es
    tts_tempo: 1.25
    decks:
      - Español
    field: Back
    word_model: es_core_news_sm
    sentence_file: sentences_es.txt
8
saiki.py
Executable file
@@ -0,0 +1,8 @@
#!/usr/bin/env python3
"""Saiki CLI entrypoint."""

from saiki.cli import main


if __name__ == "__main__":
    raise SystemExit(main())
2
saiki/__init__.py
Normal file
@@ -0,0 +1,2 @@
"""Utilities for Anki-based language learning workflows."""

19
saiki/ankiconnect.py
Normal file
@@ -0,0 +1,19 @@
"""Small AnkiConnect client."""

from __future__ import annotations

import requests


def anki_request(action: str, url: str = "http://localhost:8765", **params):
    resp = requests.post(
        url,
        json={"action": action, "version": 6, "params": params},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    if data.get("error") is not None:
        raise RuntimeError(f"AnkiConnect error for {action}: {data['error']}")
    return data["result"]

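The request envelope that `anki_request` posts can be explored without a running Anki instance. The following is a minimal sketch: `build_payload` and `unwrap` are hypothetical helper names introduced only for illustration. The `action`/`version`/`params` envelope and the `error`/`result` response keys are taken from the client above.

```python
import json


def build_payload(action: str, **params) -> dict:
    """Build the JSON body that the client posts (hypothetical helper)."""
    return {"action": action, "version": 6, "params": params}


def unwrap(action: str, data: dict):
    """Mirror the client's error handling on an already-decoded response."""
    if data.get("error") is not None:
        raise RuntimeError(f"AnkiConnect error for {action}: {data['error']}")
    return data["result"]


payload = build_payload("findNotes", query='deck:"Español"')
print(json.dumps(payload, ensure_ascii=False))
print(unwrap("findNotes", {"result": [1, 2], "error": None}))
```

Keeping the envelope separate from the HTTP call is one way to unit-test callers. The modules in this commit take a `request` callable parameter for the same purpose.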
126
saiki/audio.py
Normal file
@@ -0,0 +1,126 @@
"""Extract Anki audio media into playlists."""

from __future__ import annotations

import os
import re
import shutil
import subprocess
import tempfile
from typing import Callable

from .ankiconnect import anki_request
from .config import Config

AUDIO_EXTS = (".mp3", ".wav", ".ogg", ".m4a", ".flac")


def resolve_media_paths(media_dir: str, out_dir: str, media_name: str) -> tuple[str, str] | None:
    normalized = os.path.normpath(media_name)
    if os.path.isabs(normalized) or normalized.startswith(".."):
        return None
    return os.path.join(media_dir, normalized), os.path.join(out_dir, normalized)


def build_playlist(out_dir: str, language: str) -> str:
    m3u_path = os.path.join(out_dir, f"{language}.m3u")
    concat_name = f"{language}_concat.mp3"
    files: list[str] = []
    for root, _, filenames in os.walk(out_dir):
        for fname in filenames:
            abs_path = os.path.join(root, fname)
            rel_path = os.path.relpath(abs_path, out_dir)
            if rel_path in {os.path.basename(m3u_path), concat_name}:
                continue
            if fname.lower().endswith(AUDIO_EXTS) and os.path.isfile(abs_path):
                files.append(rel_path)

    with open(m3u_path, "w", encoding="utf-8") as fh:
        for fname in sorted(files):
            fh.write(f"{fname}\n")
    return m3u_path


def concat_audio_from_m3u(out_dir: str, m3u_path: str, out_path: str) -> None:
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found in PATH. Install ffmpeg to use --concat.")

    with open(m3u_path, "r", encoding="utf-8") as fh:
        rel_files = [line.strip() for line in fh if line.strip()]

    abs_files = [
        os.path.abspath(os.path.join(out_dir, rel))
        for rel in rel_files
        if os.path.isfile(os.path.join(out_dir, rel)) and rel.lower().endswith(AUDIO_EXTS)
    ]
    if not abs_files:
        raise RuntimeError("No audio files found to concatenate.")

    with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as tmp:
        concat_list_path = tmp.name
        for path in abs_files:
            tmp.write(f"file '{path.replace(chr(39), chr(39) + chr(92) + chr(39) + chr(39))}'\n")

    cmd = [
        "ffmpeg", "-hide_banner", "-loglevel", "error", "-f", "concat", "-safe", "0",
        "-i", concat_list_path, "-c:a", "libmp3lame", "-q:a", "4", "-y", out_path,
    ]
    try:
        subprocess.run(cmd, check=True)
    finally:
        try:
            os.remove(concat_list_path)
        except OSError:
            pass


def extract_audio(
    config: Config,
    lang: str,
    outdir: str | None = None,
    media_dir: str | None = None,
    copy_only_new: bool = False,
    concat: bool = False,
    request: Callable = anki_request,
) -> dict[str, object]:
    language = config.language_name(lang)
    selected_decks = config.decks_for(lang)
    if not selected_decks:
        raise RuntimeError(f"No decks configured for language: {lang}")

    media_root = media_dir or config.media_dir
    out_dir = os.path.expanduser(outdir) if outdir else os.path.join(config.audio_output_root, language)
    os.makedirs(out_dir, exist_ok=True)

    all_ids: list[int] = []
    for deck in selected_decks:
        all_ids.extend(request("findNotes", url=config.anki_connect_url, query=f'deck:"{deck}"') or [])

    if not all_ids:
        return {"copied": 0, "playlist": build_playlist(out_dir, language), "outdir": out_dir, "concat": None}

    notes = request("notesInfo", url=config.anki_connect_url, notes=all_ids) or []
    copied: list[str] = []
    for note in notes:
        for field in (note.get("fields", {}) or {}).values():
            val = field.get("value", "") or ""
            for match in re.findall(r"\[sound:(.+?)\]", val):
                paths = resolve_media_paths(media_root, out_dir, match)
                if paths is None:
                    continue
                src, dst = paths
                if not os.path.exists(src):
                    continue
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                if copy_only_new and os.path.exists(dst):
                    continue
                shutil.copy2(src, dst)
                copied.append(match)

    m3u_path = build_playlist(out_dir, language)
    concat_path = None
    if concat:
        concat_path = os.path.join(out_dir, f"{language}_concat.mp3")
        concat_audio_from_m3u(out_dir, m3u_path, concat_path)
    return {"copied": len(copied), "playlist": m3u_path, "outdir": out_dir, "concat": concat_path}

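The path check in `resolve_media_paths` guards against `[sound:...]` names that would escape the media directory. As a small sketch, the function is copied here from `saiki/audio.py` and run against a few sample names. The `/media` and `/out` directories and the file names are made up for illustration.

```python
from __future__ import annotations

import os


def resolve_media_paths(media_dir: str, out_dir: str, media_name: str) -> tuple[str, str] | None:
    # Collapse "a/../b" style segments before checking where the path points.
    normalized = os.path.normpath(media_name)
    # Reject absolute paths and anything that climbs out of the media directory.
    if os.path.isabs(normalized) or normalized.startswith(".."):
        return None
    return os.path.join(media_dir, normalized), os.path.join(out_dir, normalized)


samples = ["clip.mp3", "sub/clip.mp3", "../secret.mp3", "a/../../b.mp3", os.path.abspath("x.mp3")]
for name in samples:
    print(name, "->", resolve_media_paths("/media", "/out", name))
```

Note that the `startswith("..")` test is conservative: a file literally named `..clip.mp3` would also be skipped.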
126
saiki/cli.py
Normal file
@@ -0,0 +1,126 @@
"""Unified command-line interface for Saiki."""

from __future__ import annotations

import argparse
import sys

from .audio import extract_audio
from .config import Config, language_choices, load_config
from .importer import import_sentences
from .words import compare_word_files, extract_words
from .youtube import run_youtube


def add_config_arg(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--config", help="Path to YAML config file.")


def build_parser(config: Config | None = None) -> argparse.ArgumentParser:
    choices = language_choices(config or load_config())
    parser = argparse.ArgumentParser(description="Saiki: sentence mining and listening tools for Anki.")
    add_config_arg(parser)
    sub = parser.add_subparsers(dest="command", required=True)

    audio = sub.add_parser("audio", help="Extract Anki audio into playlists.")
    audio.add_argument("lang", choices=choices)
    audio.add_argument("--concat", action="store_true")
    audio.add_argument("--outdir")
    audio.add_argument("--media-dir")
    audio.add_argument("--copy-only-new", action="store_true")

    words = sub.add_parser("words", help="Extract frequent words from Anki.")
    words.add_argument("lang", choices=choices)
    group = words.add_mutually_exclusive_group()
    group.add_argument("--query")
    group.add_argument("--deck", action="append")
    words.add_argument("--field")
    words.add_argument("--min-freq", type=int, default=2)
    words.add_argument("--outdir")
    words.add_argument("--out")
    words.add_argument("--full-field", action="store_true")
    words.add_argument("--spacy-model")

    compare = sub.add_parser("compare-words", help="Print words in source that are not in known.")
    compare.add_argument("source")
    compare.add_argument("known")

    youtube = sub.add_parser("youtube", help="Mine a YouTube transcript.")
    youtube.add_argument("lang", choices=choices)
    youtube.add_argument("video")
    youtube.add_argument("--mode", choices=["vocab", "sentences"], default="vocab")
    youtube.add_argument("--top", type=int)
    youtube.add_argument("--no-stopwords", action="store_true")
    youtube.add_argument("--raw", action="store_true")
    youtube.add_argument("--out")
    youtube.add_argument("--format", choices=["tsv", "csv"], default="tsv")
    youtube.add_argument("--known-words", help="Word list to filter vocab_guess against.")
    youtube.add_argument("--only-new", action="store_true", help="Only export sentences with unknown vocab.")

    importer = sub.add_parser("import", help="Generate TTS and import sentence cards.")
    importer.add_argument("lang", choices=choices)
    importer.add_argument("sentence_file", nargs="?")
    importer.add_argument("--tags", help="Comma-separated tags. text-to-speech is always included.")

    return parser


def main(argv: list[str] | None = None) -> int:
    pre = argparse.ArgumentParser(add_help=False)
    add_config_arg(pre)
    known, _ = pre.parse_known_args(argv)
    config = load_config(known.config)
    parser = build_parser(config)
    args = parser.parse_args(argv)

    if args.command == "audio":
        result = extract_audio(config, args.lang, args.outdir, args.media_dir, args.copy_only_new, args.concat)
        print(f"Copied {result['copied']} files")
        print(f"Playlist: {result['playlist']}")
        print(f"Output directory: {result['outdir']}")
        if result["concat"]:
            print(f"Concatenated file: {result['concat']}")
        return 0

    if args.command == "words":
        result = extract_words(
            config, args.lang, args.query, args.deck, args.field, args.min_freq,
            args.outdir, args.out, args.full_field, args.spacy_model,
        )
        print(f"Query: {result['query']}")
        print(f"Found {result['notes']} notes")
        print(f"Extracted {result['unique']} unique entries")
        print(f"Wrote {result['written']} entries to: {result['out']}")
        return 0

    if args.command == "compare-words":
        for line in compare_word_files(args.source, args.known):
            print(line)
        return 0

    if args.command == "youtube":
        result = run_youtube(
            config, args.lang, args.video, args.mode, args.top, args.no_stopwords,
            args.raw, args.out, args.format, args.known_words, args.only_new,
        )
        if args.mode == "sentences" and not args.out:
            for line in result["lines"]:
                print(f"[{line.start:.2f}s] {line.text}")
        elif args.mode == "sentences":
            print(f"Wrote {result['written']} rows to: {result['out']}")
        else:
            for word, count in result["items"]:
                print(f"{word}: {count}")
        return 0

    if args.command == "import":
        result = import_sentences(config, args.lang, args.sentence_file, args.tags)
        print(f"Done. Added {result.added}/{result.processed} cards. Failed: {result.failed}")
        return 0 if result.failed == 0 else 1

    parser.print_help()
    return 2


if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1:]))
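`main` parses the command line twice: a first pass reads only `--config`, so that the language keys from that file can become the `choices` of the full parser. A stripped-down sketch of the pattern follows. The `languages_by_config` mapping is a hypothetical stand-in for `load_config`, and only one subcommand is reproduced.

```python
from __future__ import annotations

import argparse


def parse(argv: list[str], languages_by_config: dict) -> argparse.Namespace:
    # Pass 1: pick up --config and ignore every other argument.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config")
    known, _ = pre.parse_known_args(argv)

    # Stand-in for load_config(): language keys depend on the chosen file.
    choices = languages_by_config.get(known.config, ["es", "jp"])

    # Pass 2: the full parser, built with the now-known language keys.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config")
    sub = parser.add_subparsers(dest="command", required=True)
    audio = sub.add_parser("audio")
    audio.add_argument("lang", choices=choices)
    return parser.parse_args(argv)


args = parse(["--config", "custom.yaml", "audio", "de"], {"custom.yaml": ["de"]})
print(args.command, args.lang)
```

The pre-parser is created with `add_help=False` so that `-h` is handled by the full parser, which knows about all subcommands.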
148
saiki/config.py
Normal file
@@ -0,0 +1,148 @@
"""Configuration loading for Saiki.

Defaults mirror the original scripts. Users can override them with YAML at
~/.config/saiki/config.yaml or by passing --config to the CLI.
"""

from __future__ import annotations

import copy
import os
from dataclasses import dataclass
from typing import Any

try:
    import yaml
except Exception:  # pragma: no cover - handled when config files are loaded
    yaml = None


DEFAULT_CONFIG: dict[str, Any] = {
    "anki_connect_url": "http://localhost:8765",
    "media_dir": "~/.var/app/net.ankiweb.Anki/data/Anki2/User 1/collection.media",
    "audio_output_root": "~/Languages/Anki/anki-audio",
    "word_output_root": "~/Languages/Anki/anki-words",
    "sentence_dir": "~/Languages/Anki",
    "note_model": "Basic",
    "fields": {"front": "Front", "back": "Back"},
    "languages": {
        "jp": {
            "name": "japanese",
            "transcript_code": "ja",
            "tts_code": "ja",
            "tts_tld": "com",
            "tts_tempo": 1.35,
            "decks": ["日本語"],
            "word_model": "ja_core_news_lg",
            "field": "Back",
            "sentence_file": "sentences_jp.txt",
        },
        "es": {
            "name": "spanish",
            "transcript_code": "es",
            "tts_code": "es",
            "tts_tld": "es",
            "tts_tempo": 1.25,
            "decks": ["Español"],
            "word_model": "es_core_news_sm",
            "field": "Back",
            "sentence_file": "sentences_es.txt",
        },
    },
}


@dataclass(frozen=True)
class Config:
    data: dict[str, Any]

    @property
    def anki_connect_url(self) -> str:
        return str(self.data["anki_connect_url"])

    @property
    def media_dir(self) -> str:
        return expand_path(str(self.data["media_dir"]))

    @property
    def audio_output_root(self) -> str:
        return expand_path(str(self.data["audio_output_root"]))

    @property
    def word_output_root(self) -> str:
        return expand_path(str(self.data["word_output_root"]))

    @property
    def sentence_dir(self) -> str:
        return expand_path(str(self.data["sentence_dir"]))

    @property
    def note_model(self) -> str:
        return str(self.data.get("note_model", "Basic"))

    @property
    def fields(self) -> dict[str, str]:
        return dict(self.data.get("fields", {}))

    @property
    def languages(self) -> dict[str, dict[str, Any]]:
        return dict(self.data.get("languages", {}))

    def language(self, lang: str) -> dict[str, Any]:
        try:
            return dict(self.languages[lang])
        except KeyError as e:
            available = ", ".join(sorted(self.languages))
            raise ValueError(f"Unsupported language '{lang}'. Available: {available}") from e

    def language_name(self, lang: str) -> str:
        return str(self.language(lang)["name"])

    def transcript_code(self, lang: str) -> str:
        return str(self.language(lang)["transcript_code"])

    def decks_for(self, lang: str) -> list[str]:
        return list(self.language(lang).get("decks", []))

    def field_for(self, lang: str) -> str:
        return str(self.language(lang).get("field", self.fields.get("back", "Back")))

    def sentence_file_for(self, lang: str) -> str:
        value = str(self.language(lang).get("sentence_file", f"sentences_{lang}.txt"))
        return expand_path(value if os.path.isabs(value) or value.startswith("~") else os.path.join(self.sentence_dir, value))


def expand_path(path: str) -> str:
    return os.path.expanduser(os.path.expandvars(path))


def default_config_path() -> str:
    return expand_path("~/.config/saiki/config.yaml")


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


def load_config(path: str | None = None) -> Config:
    config = copy.deepcopy(DEFAULT_CONFIG)
    config_path = expand_path(path) if path else default_config_path()
    if os.path.exists(config_path):
        if yaml is None:
            raise RuntimeError("Loading config files requires PyYAML. Install pyyaml.")
        with open(config_path, "r", encoding="utf-8") as fh:
            loaded = yaml.safe_load(fh) or {}
        if not isinstance(loaded, dict):
            raise RuntimeError(f"Config must be a YAML mapping: {config_path}")
        config = deep_merge(config, loaded)
    return Config(config)


def language_choices(config: Config) -> list[str]:
    return sorted(config.languages.keys())
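Because `load_config` layers the user's YAML over `DEFAULT_CONFIG` with `deep_merge`, a user file only needs the keys it changes. The sketch below copies `deep_merge` from `saiki/config.py` and applies it to a trimmed-down defaults dict. The `de` profile and the override values are invented for illustration.

```python
from __future__ import annotations

import copy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    result = copy.deepcopy(base)
    for key, value in override.items():
        # Nested mappings are merged key by key; everything else is replaced.
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


defaults = {
    "note_model": "Basic",
    "languages": {"jp": {"tts_tempo": 1.35, "decks": ["日本語"]}},
}
user = {
    "languages": {
        "jp": {"tts_tempo": 1.0},       # override a single key
        "de": {"name": "german"},       # add a new (hypothetical) profile
    }
}

merged = deep_merge(defaults, user)
print(merged)
```

Lists are not merged: a user file that sets `decks` replaces the default deck list as a whole.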
112
saiki/importer.py
Normal file
@@ -0,0 +1,112 @@
"""Generate TTS audio and add sentence notes to Anki."""

from __future__ import annotations

import os
import csv
import shutil
import subprocess
import tempfile
import time
from dataclasses import dataclass
from typing import Callable

from .ankiconnect import anki_request
from .config import Config


@dataclass(frozen=True)
class ImportResult:
    processed: int
    added: int
    failed: int


def parse_tags(value: str | None) -> list[str]:
    tags = ["text-to-speech"]
    if value:
        tags.extend(tag.strip() for tag in value.split(",") if tag.strip())
    else:
        tags.append("AI-generated")
    return tags


def require_command(name: str) -> None:
    if shutil.which(name) is None:
        raise RuntimeError(f"Required command not found: {name}")


def generate_tts(sentence: str, raw_output: str, lang_code: str, tld: str) -> None:
    subprocess.run(["gtts-cli", sentence, "--lang", lang_code, "--tld", tld, "--output", raw_output], check=True)


def speed_audio(raw_output: str, output_path: str, tempo: float) -> None:
    subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i", raw_output, "-filter:a", f"atempo={tempo}", "-y", output_path],
        stdin=subprocess.DEVNULL,
        check=True,
    )


def read_sentences(path: str) -> list[str]:
    expanded = os.path.expanduser(path)
    if expanded.lower().endswith((".tsv", ".csv")):
        delimiter = "\t" if expanded.lower().endswith(".tsv") else ","
        with open(expanded, "r", encoding="utf-8", newline="") as fh:
            reader = csv.DictReader(fh, delimiter=delimiter)
            if reader.fieldnames and "sentence" in reader.fieldnames:
                # DictReader fills missing columns in short rows with None,
                # so guard before calling strip().
                return [row["sentence"].strip() for row in reader if (row.get("sentence") or "").strip()]
        raise RuntimeError("TSV/CSV sentence imports must include a 'sentence' header.")

    with open(expanded, "r", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def import_sentences(
    config: Config,
    lang: str,
    sentence_file: str | None = None,
    tags_value: str | None = None,
    request: Callable = anki_request,
) -> ImportResult:
    require_command("gtts-cli")
    require_command("ffmpeg")

    language = config.language(lang)
    decks = list(language.get("decks", []))
    if not decks:
        raise RuntimeError(f"No deck configured for language: {lang}")
    deck = decks[0]

    source = os.path.expanduser(sentence_file) if sentence_file else config.sentence_file_for(lang)
    sentences = read_sentences(source)
    tags = parse_tags(tags_value)
    front_field = config.fields.get("front", "Front")
    back_field = config.fields.get("back", "Back")
    added = 0
    failed = 0

    with tempfile.TemporaryDirectory() as temp_dir:
        for sentence in sentences:
            basename = f"tts_{time.strftime('%Y%m%d_%H%M%S')}_{lang}_{os.getpid()}_{added + failed}"
            raw_output = os.path.join(temp_dir, f"{basename}_original.mp3")
            output_path = os.path.join(temp_dir, f"{basename}.mp3")
            try:
                generate_tts(sentence, raw_output, str(language["tts_code"]), str(language["tts_tld"]))
                speed_audio(raw_output, output_path, float(language["tts_tempo"]))
                request(
                    "addNote",
                    url=config.anki_connect_url,
                    note={
                        "deckName": deck,
                        "modelName": config.note_model,
                        "fields": {front_field: "", back_field: sentence},
                        "options": {"allowDuplicate": False},
                        "tags": tags,
                        "audio": [{"path": output_path, "filename": f"{basename}.mp3", "fields": [front_field]}],
                    },
                )
                added += 1
            except Exception:
                failed += 1
    return ImportResult(processed=len(sentences), added=added, failed=failed)
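The importer accepts TSV and CSV sources as long as they carry a `sentence` header. The following sketch shows that parsing step in isolation, reading from an in-memory string instead of a file. `sentences_from_table` is a hypothetical name, and the sample rows are invented. It guards against short rows, where `csv.DictReader` fills missing columns with `None`.

```python
from __future__ import annotations

import csv
import io


def sentences_from_table(text: str, delimiter: str) -> list[str]:
    """Return the non-empty values of the 'sentence' column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    if not reader.fieldnames or "sentence" not in reader.fieldnames:
        raise RuntimeError("TSV/CSV sentence imports must include a 'sentence' header.")
    # Short rows yield None for missing columns, hence the "or" fallback.
    return [row["sentence"].strip() for row in reader if (row.get("sentence") or "").strip()]


tsv = "start\tsentence\n0.0\tHola mundo\n1.5\t \n3.0\n4.2\t¿Qué tal?\n"
print(sentences_from_table(tsv, "\t"))
```

Extra columns such as `start` are ignored, so an export that carries timestamps next to each sentence can be fed to the importer unchanged as long as the header is present.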
29
saiki/text.py
Normal file
@@ -0,0 +1,29 @@
"""Text cleanup helpers shared by tools."""

from __future__ import annotations

from html import unescape

import regex as re


def extract_first_visible_line(text: str) -> str:
    text = unescape(text or "")
    text = re.sub(r"</?(br|div|p)[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = text.strip()
    return text.splitlines()[0] if text else ""


def extract_visible_text(text: str) -> str:
    text = unescape(text or "")
    text = re.sub(r"</?(br|div|p)[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


def normalize_word_key(value: str) -> str:
    return re.sub(r"\s+", " ", value.strip().lower())

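Anki stores field values as HTML, so the helpers above reduce a field to plain text before tokenizing. Here is a small sketch of two of them on a made-up card field. It uses the standard-library `re` instead of the third-party `regex` package, on the assumption that these particular patterns behave the same in both.

```python
import re
from html import unescape


def extract_first_visible_line(text: str) -> str:
    text = unescape(text or "")
    # Block-level tags become line breaks; all other tags are dropped.
    text = re.sub(r"</?(br|div|p)[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = text.strip()
    return text.splitlines()[0] if text else ""


def normalize_word_key(value: str) -> str:
    # Lowercase and collapse runs of whitespace to a single space.
    return re.sub(r"\s+", " ", value.strip().lower())


field = "<div><b>Hola</b> mundo &amp; amigos</div><div>Hello world and friends</div>"
print(extract_first_visible_line(field))
print(normalize_word_key("  Hola   Mundo "))
```

Taking only the first visible line keeps translations or notes on later lines out of the word counts, which is the behavior `--full-field` switches off.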
183
saiki/words.py
Normal file
@@ -0,0 +1,183 @@
|
||||
"""Extract and compare language-learning vocabulary."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
from collections import Counter
|
||||
from typing import Callable
|
||||
|
||||
import regex as re
|
||||
|
||||
from .ankiconnect import anki_request
|
||||
from .config import Config
|
||||
from .text import extract_first_visible_line, extract_visible_text, normalize_word_key
|
||||
|
||||
JAPANESE_CHAR_RE = re.compile(r"[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}ー]+")
|
||||
JAPANESE_PARTICLES = {
|
||||
"は", "が", "を", "に", "へ", "で", "と", "や", "も", "から", "まで", "より", "ば", "なら",
|
||||
"の", "ね", "よ", "ぞ", "ぜ", "さ", "わ", "か", "な", "って", "とき", "ってば", "けど", "けれど",
|
||||
"しかし", "でも", "ながら", "ほど", "し", "もの", "こと", "ところ", "よう", "らしい", "られる",
|
||||
}
|
||||
JAPANESE_GRAMMAR_EXCLUDE = {
|
||||
"て", "た", "ます", "れる", "てる", "ぬ", "ん", "しまう", "いる", "ない", "なる", "ある", "だ", "です",
|
||||
}
|
||||
JAPANESE_ALLOWED_POS = {"NOUN", "PROPN", "VERB", "ADJ"}
|
||||
|
||||
|
||||
def setup_logging(logfile: str) -> None:
|
||||
os.makedirs(os.path.dirname(os.path.abspath(logfile)), exist_ok=True)
|
||||
logging.basicConfig(filename=logfile, level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
|
||||
|
||||
def build_query_from_decks(decks: list[str]) -> str:
|
||||
return " OR ".join(f'deck:"{d}"' for d in decks)
|
||||
|
||||
|
||||
def japanese_filter(token) -> bool:
|
||||
text = (token.text or "").strip()
|
||||
lemma = (token.lemma_ or "").strip()
|
||||
if not text or not JAPANESE_CHAR_RE.fullmatch(text):
|
||||
return False
|
||||
if lemma in JAPANESE_GRAMMAR_EXCLUDE or text in JAPANESE_PARTICLES:
|
||||
return False
|
||||
if getattr(token, "pos_", None) not in JAPANESE_ALLOWED_POS:
|
||||
return False
|
||||
if getattr(token, "is_stop", False) or getattr(token, "like_url", False) or getattr(token, "like_email", False):
|
||||
return False
|
||||
if any(c in text for c in "<>=/\\:&%"):
|
||||
return False
|
||||
return text not in {"ruby", "rt", "div", "br", "nbsp", "href", "strong", "a"}
|
||||
|
||||
|
||||
def spanish_filter(token) -> bool:
|
||||
return bool(getattr(token, "is_alpha", False)) and not bool(getattr(token, "is_stop", False))
|
||||
|
||||
|
||||
def spanish_format(token) -> str:
|
||||
return (token.lemma_ or token.text or "").lower().strip()
|
||||
|
||||
|
||||
def japanese_format(token) -> str:
|
||||
lemma = (token.lemma_ or "").strip()
|
||||
surface = (token.text or "").strip()
|
||||
if lemma and surface and lemma != surface:
|
||||
return f"{lemma} ({surface})"
|
||||
return lemma or surface
|
||||
|
||||
|
||||
LANGUAGE_PROFILES = {
|
||||
"spanish": {"token_filter": spanish_filter, "output_format": spanish_format},
|
||||
"japanese": {"token_filter": japanese_filter, "output_format": japanese_format},
|
||||
}
|
||||
|
||||
|
||||
def load_spacy_model(model_name: str):
|
||||
try:
|
||||
import spacy # type: ignore
|
||||
except Exception as e:
|
||||
raise RuntimeError("Failed to import spaCy. Use a Python version supported by spaCy.") from e
|
||||
try:
|
||||
return spacy.load(model_name)
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Failed to load spaCy model '{model_name}'. Try: python -m spacy download {model_name}") from e
|
||||
|
||||
|
||||
def get_notes(query: str, config: Config, request: Callable = anki_request) -> list[dict]:
|
||||
note_ids = request("findNotes", url=config.anki_connect_url, query=query) or []
|
||||
if not note_ids:
|
||||
return []
|
||||
return request("notesInfo", url=config.anki_connect_url, notes=note_ids) or []
|
||||
|
||||
|
||||
def extract_counts(
|
||||
notes: list[dict],
|
||||
field_name: str,
|
||||
nlp,
|
||||
token_filter: Callable,
|
||||
output_format: Callable,
|
||||
use_full_field: bool,
|
||||
) -> Counter:
|
||||
counter: Counter = Counter()
|
||||
for note in notes:
|
||||
fields = note.get("fields", {}) or {}
|
||||
raw_val = (fields.get(field_name, {}) or {}).get("value", "") or ""
|
||||
text = extract_visible_text(raw_val) if use_full_field else extract_first_visible_line(raw_val)
|
||||
if not text:
|
||||
continue
|
||||
for token in nlp(text):
|
||||
if token_filter(token):
|
||||
key = output_format(token)
|
||||
if key:
|
||||
counter[key] += 1
|
||||
return counter
|
||||
|
||||
|
||||
def write_counts(counter: Counter, out_path: str, min_freq: int) -> int:
|
||||
items = [(w, c) for (w, c) in counter.items() if c >= min_freq]
|
||||
items.sort(key=lambda x: (-x[1], x[0]))
|
||||
os.makedirs(os.path.dirname(os.path.abspath(out_path)), exist_ok=True)
|
||||
with open(out_path, "w", encoding="utf-8") as f:
|
||||
for word, freq in items:
|
||||
f.write(f"{word} {freq}\n")
|
||||
return len(items)
|
||||
|
||||
|
||||
def read_word_file(path: str) -> set[str]:
|
||||
words: set[str] = set()
|
||||
with open(os.path.expanduser(path), "r", encoding="utf-8") as fh:
|
||||
for line in fh:
|
||||
stripped = line.strip()
|
||||
if not stripped:
|
||||
continue
|
||||
word = stripped.rsplit(" ", 1)[0]
|
||||
words.add(normalize_word_key(word))
|
||||
return words
|
||||
|
||||
|
||||
def compare_word_files(source_path: str, known_path: str) -> list[str]:
|
||||
known = read_word_file(known_path)
|
||||
new_words: list[str] = []
|
||||
with open(os.path.expanduser(source_path), "r", encoding="utf-8") as fh:
|
||||
for line in fh:
|
||||
stripped = line.strip()
|
||||
if not stripped:
|
||||
continue
|
||||
word = stripped.rsplit(" ", 1)[0]
|
||||
if normalize_word_key(word) not in known:
|
||||
new_words.append(stripped)
|
||||
return new_words


def extract_words(
    config: Config,
    lang: str,
    query: str | None = None,
    decks: list[str] | None = None,
    field: str | None = None,
    min_freq: int = 2,
    outdir: str | None = None,
    out: str | None = None,
    full_field: bool = False,
    spacy_model: str | None = None,
    request: Callable = anki_request,
) -> dict[str, object]:
    language_bucket = config.language_name(lang)
    profile = LANGUAGE_PROFILES[language_bucket]
    search_query = query or build_query_from_decks(decks or config.decks_for(lang))
    out_dir = os.path.expanduser(outdir) if outdir else os.path.join(config.word_output_root, language_bucket)
    out_path = os.path.expanduser(out) if out else os.path.join(out_dir, f"words_{lang}.txt")
    model_name = spacy_model or str(config.language(lang).get("word_model"))
    nlp = load_spacy_model(model_name)
    notes = get_notes(search_query, config, request=request)
    field_name = field or config.field_for(lang)
    if notes:
        fields0 = notes[0].get("fields", {}) or {}
        if field_name not in fields0:
            raise RuntimeError(f"Field '{field_name}' not found. Available fields: {list(fields0.keys())}")
    counter = extract_counts(notes, field_name, nlp, profile["token_filter"], profile["output_format"], full_field)
    written = write_counts(counter, out_path, min_freq)
    return {"query": search_query, "notes": len(notes), "unique": len(counter), "written": written, "out": out_path}

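The word files handled by `read_word_file` and `compare_word_files` hold one `word count` pair per line, and `rsplit(" ", 1)` removes only the trailing count. The sketch below replays that comparison on in-memory lines. `normalize_key` is a hypothetical stand-in for `normalize_word_key`, whose body is not part of this section; case-folding is an assumption inferred from the test that maps `Comer` to `comer`. `word_of` and `new_entries` are illustrative names as well.

```python
def normalize_key(word: str) -> str:
    # Assumption: the real normalize_word_key at least case-folds its input.
    return word.strip().casefold()


def word_of(line: str) -> str:
    # Drop only the last space-separated column: "comer 3" -> "comer".
    return line.strip().rsplit(" ", 1)[0]


def new_entries(source_lines: list[str], known_lines: list[str]) -> list[str]:
    # Build the known-word set once, then keep source lines whose key is absent.
    known = {normalize_key(word_of(line)) for line in known_lines if line.strip()}
    return [
        line.strip()
        for line in source_lines
        if line.strip() and normalize_key(word_of(line)) not in known
    ]


source = ["comer 3", "hablar 2", "", "食べる (食べ) 4"]
known = ["Comer 10"]
result = new_entries(source, known)
print(result)
```

Because only the last column is removed, an entry written as `lemma (surface) count` keeps its parenthesised surface form in the key, so a known-words file would need the same layout for such entries to match.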
179
saiki/youtube.py
Normal file
@@ -0,0 +1,179 @@
"""YouTube transcript mining and Anki-ready exports."""

from __future__ import annotations

import csv
import os
import re
from collections import Counter
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse

from youtube_transcript_api import YouTubeTranscriptApi

from .config import Config
from .text import normalize_word_key
from .words import read_word_file

STOPWORDS = {
    "es": {
        "de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por",
        "un", "para", "con", "no", "una", "su", "al", "lo", "como",
    },
    "en": {"the", "is", "and", "of", "to", "in", "it", "that", "on", "you", "this", "for", "with"},
    "ja": {"の", "に", "は", "を", "た", "が", "で", "て", "です", "ます", "する", "ある", "いる"},
}


@dataclass(frozen=True)
class TranscriptLine:
    start: float
    text: str


def extract_video_id(url_or_id: str) -> str:
    if "youtube" in url_or_id or "youtu.be" in url_or_id:
        query = urlparse(url_or_id)
        if query.hostname == "youtu.be":
            return query.path.lstrip("/")
        if query.hostname in ("www.youtube.com", "youtube.com", "m.youtube.com"):
            values = parse_qs(query.query).get("v", [])
            if values:
                return values[0]
    return url_or_id


def video_url(video_or_id: str) -> str:
    video_id = extract_video_id(video_or_id)
    return f"https://www.youtube.com/watch?v={video_id}"


def fetch_transcript(video_id: str, lang_code: str):
    if hasattr(YouTubeTranscriptApi, "fetch"):
        api = YouTubeTranscriptApi()
        return api.fetch(video_id, languages=[lang_code])
    if hasattr(YouTubeTranscriptApi, "get_transcript"):
        return YouTubeTranscriptApi.get_transcript(video_id, languages=[lang_code])
    raise RuntimeError("Unsupported youtube-transcript-api version.")


def snippet_text(entry) -> str:
    if isinstance(entry, dict):
        return entry.get("text", "") or ""
    return getattr(entry, "text", "") or ""


def snippet_start(entry) -> float:
    if isinstance(entry, dict):
        return float(entry.get("start", 0.0) or 0.0)
    return float(getattr(entry, "start", 0.0) or 0.0)


def transcript_lines(entries) -> list[TranscriptLine]:
    lines: list[TranscriptLine] = []
    for entry in entries:
        text = snippet_text(entry).replace("\n", " ").strip()
        if text:
            lines.append(TranscriptLine(snippet_start(entry), text))
    return lines


def tokenize_japanese(text: str) -> list[str]:
    try:
        from fugashi import Tagger
    except ImportError as e:
        raise RuntimeError('Japanese requires fugashi. Install: pip install "fugashi[unidic-lite]"') from e
    tagger = Tagger()
    return [w.surface for w in tagger(text)]


def tokenize_spanish(text: str, raw: bool = False) -> list[str]:
    tokens = re.findall(r"\b[\wáéíóúñü]+\b", text)
    return tokens if raw else [t.lower() for t in tokens]


def tokenize_text(text: str, lang_code: str, raw: bool = False) -> list[str]:
    return tokenize_japanese(text) if lang_code == "ja" else tokenize_spanish(text, raw=raw)


def count_words(tokens: list[str], lang_code: str, remove_stopwords: bool = True) -> Counter:
    if remove_stopwords:
        stopwords = STOPWORDS.get(lang_code, set())
        tokens = [t for t in tokens if t not in stopwords]
    return Counter(tokens)


def sentence_vocab(sentence: str, lang_code: str, known_words: set[str] | None = None) -> list[str]:
    words: list[str] = []
    seen: set[str] = set()
    for token in tokenize_text(sentence, lang_code):
        key = normalize_word_key(token)
        if key in seen or key in STOPWORDS.get(lang_code, set()):
            continue
        if known_words is not None and key in known_words:
            continue
        seen.add(key)
        words.append(token)
    return words


def write_sentence_export(
    lines: list[TranscriptLine],
    out_path: str,
    video: str,
    lang_code: str,
    delimiter: str = "\t",
    known_words_path: str | None = None,
    only_new: bool = False,
) -> int:
    known = read_word_file(known_words_path) if known_words_path else None
    os.makedirs(os.path.dirname(os.path.abspath(out_path)), exist_ok=True)
    written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter=delimiter)
        writer.writerow(["sentence", "timestamp", "video_url", "vocab_guess"])
        for line in lines:
            vocab = sentence_vocab(line.text, lang_code, known)
            if only_new and not vocab:
                continue
            writer.writerow([line.text, f"{line.start:.2f}", video_url(video), ", ".join(vocab)])
            written += 1
    return written


def run_youtube(
    config: Config,
    lang: str,
    video: str,
    mode: str = "vocab",
    top: int | None = None,
    no_stopwords: bool = False,
    raw: bool = False,
    out: str | None = None,
    fmt: str = "tsv",
    known_words: str | None = None,
    only_new: bool = False,
) -> dict[str, object]:
    lang_code = config.transcript_code(lang)
    video_id = extract_video_id(video)
    entries = fetch_transcript(video_id, lang_code)
    lines = transcript_lines(entries)

    if mode == "sentences":
        if out:
            delimiter = "," if fmt == "csv" else "\t"
            written = write_sentence_export(lines, out, video_id, lang_code, delimiter, known_words, only_new)
            return {"mode": mode, "lines": len(lines), "written": written, "out": out}
        return {"mode": mode, "lines": lines}

    text = " ".join(line.text for line in lines)
    tokens = tokenize_text(text, lang_code, raw=raw)
    counts = count_words(tokens, lang_code, remove_stopwords=not no_stopwords)
    items = counts.most_common(top) if top else counts.most_common()
    if out:
        os.makedirs(os.path.dirname(os.path.abspath(out)), exist_ok=True)
        with open(out, "w", encoding="utf-8") as fh:
            for word, count in items:
                fh.write(f"{word} {count}\n")
    return {"mode": mode, "items": items, "out": out}

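`extract_video_id` recognises `youtu.be/<id>` and `watch?v=<id>` URLs and otherwise returns its input unchanged, so a `/shorts/<id>` or `/embed/<id>` URL would be passed on as if it were an ID. A minimal sketch of a stricter variant follows, assuming such path-style URLs should be accepted and that an unrecognised YouTube URL should raise instead of falling through. `extract_video_id_strict`, `YOUTUBE_HOSTS`, and `PATH_PREFIXES` are hypothetical names, not part of this commit.

```python
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}
# Assumption: these path prefixes carry the video ID as the next segment.
PATH_PREFIXES = {"shorts", "embed", "live"}


def extract_video_id_strict(url_or_id: str) -> str:
    parsed = urlparse(url_or_id)
    host = parsed.hostname
    if host == "youtu.be":
        # youtu.be/<id>
        return parsed.path.lstrip("/").split("/", 1)[0]
    if host in YOUTUBE_HOSTS:
        # youtube.com/watch?v=<id>
        values = parse_qs(parsed.query).get("v", [])
        if values:
            return values[0]
        # youtube.com/shorts/<id> and similar path-style URLs
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 2 and parts[0] in PATH_PREFIXES:
            return parts[1]
        raise ValueError(f"No video ID found in URL: {url_or_id}")
    # Anything without a recognised host is treated as a raw ID.
    return url_or_id


print(extract_video_id_strict("https://www.youtube.com/shorts/xyz789"))
```

Raising on a YouTube URL without an ID is a design choice of the sketch; it surfaces the problem before a transcript request is made with a malformed ID.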
106
tests/test_core.py
Normal file
@@ -0,0 +1,106 @@
from __future__ import annotations

import os
import tempfile
import unittest

from saiki.audio import build_playlist, resolve_media_paths
from saiki.config import DEFAULT_CONFIG, deep_merge
from saiki.importer import parse_tags, read_sentences
from saiki.text import extract_first_visible_line, extract_visible_text
from saiki.words import build_query_from_decks, compare_word_files, read_word_file
from saiki.youtube import TranscriptLine, extract_video_id, sentence_vocab, write_sentence_export


class ConfigTests(unittest.TestCase):
    def test_deep_merge_preserves_nested_defaults(self):
        merged = deep_merge(DEFAULT_CONFIG, {"languages": {"es": {"decks": ["Spanish"]}}})
        self.assertEqual(merged["languages"]["es"]["decks"], ["Spanish"])
        self.assertEqual(merged["languages"]["es"]["transcript_code"], "es")
        self.assertIn("jp", merged["languages"])


class TextTests(unittest.TestCase):
    def test_visible_text_helpers_strip_html(self):
        raw = "<div>Hola&nbsp;mundo</div><br><p>segunda linea</p>"
        self.assertEqual(extract_first_visible_line(raw), "Hola\xa0mundo")
        self.assertEqual(extract_visible_text(raw), "Hola\xa0mundo\nsegunda linea")


class AudioTests(unittest.TestCase):
    def test_resolve_media_paths_rejects_unsafe_names(self):
        self.assertIsNone(resolve_media_paths("/media", "/out", "../secret.mp3"))
        self.assertIsNone(resolve_media_paths("/media", "/out", "/tmp/secret.mp3"))
        self.assertEqual(
            resolve_media_paths("/media", "/out", "nested/audio.mp3"),
            ("/media/nested/audio.mp3", "/out/nested/audio.mp3"),
        )

    def test_build_playlist_includes_audio_files(self):
        with tempfile.TemporaryDirectory() as tmp:
            os.makedirs(os.path.join(tmp, "nested"))
            for rel in ["b.mp3", "nested/a.ogg", "note.txt", "spanish_concat.mp3"]:
                with open(os.path.join(tmp, rel), "w", encoding="utf-8") as fh:
                    fh.write("x")
            playlist = build_playlist(tmp, "spanish")
            with open(playlist, "r", encoding="utf-8") as fh:
                self.assertEqual(fh.read().splitlines(), ["b.mp3", "nested/a.ogg"])


class WordsTests(unittest.TestCase):
    def test_build_query_from_decks(self):
        self.assertEqual(build_query_from_decks(["A", "B"]), 'deck:"A" OR deck:"B"')

    def test_compare_word_files(self):
        with tempfile.TemporaryDirectory() as tmp:
            source = os.path.join(tmp, "source.txt")
            known = os.path.join(tmp, "known.txt")
            with open(source, "w", encoding="utf-8") as fh:
                fh.write("comer 3\nhablar 2\n")
            with open(known, "w", encoding="utf-8") as fh:
                fh.write("Comer 10\n")
            self.assertEqual(read_word_file(known), {"comer"})
            self.assertEqual(compare_word_files(source, known), ["hablar 2"])


class YoutubeTests(unittest.TestCase):
    def test_extract_video_id(self):
        self.assertEqual(extract_video_id("https://youtu.be/abc123"), "abc123")
        self.assertEqual(extract_video_id("https://www.youtube.com/watch?v=abc123&t=5"), "abc123")
        self.assertEqual(extract_video_id("abc123"), "abc123")

    def test_sentence_vocab_filters_known_words(self):
        self.assertEqual(sentence_vocab("Hola hola mundo", "es", {"hola"}), ["mundo"])

    def test_write_sentence_export(self):
        with tempfile.TemporaryDirectory() as tmp:
            out = os.path.join(tmp, "sentences.tsv")
            written = write_sentence_export(
                [TranscriptLine(12.3, "Hola mundo")],
                out,
                "abc123",
                "es",
            )
            self.assertEqual(written, 1)
            with open(out, "r", encoding="utf-8") as fh:
                rows = fh.read().splitlines()
            self.assertEqual(rows[0], "sentence\ttimestamp\tvideo_url\tvocab_guess")
            self.assertEqual(rows[1], "Hola mundo\t12.30\thttps://www.youtube.com/watch?v=abc123\thola, mundo")


class ImporterTests(unittest.TestCase):
    def test_parse_tags(self):
        self.assertEqual(parse_tags(None), ["text-to-speech", "AI-generated"])
        self.assertEqual(parse_tags("youtube, manual "), ["text-to-speech", "youtube", "manual"])

    def test_read_sentences_from_tsv_export(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "youtube.tsv")
            with open(path, "w", encoding="utf-8") as fh:
                fh.write("sentence\ttimestamp\tvideo_url\tvocab_guess\n")
                fh.write("Hola mundo\t1.00\thttps://example.test\tmundo\n")
            self.assertEqual(read_sentences(path), ["Hola mundo"])


if __name__ == "__main__":
    unittest.main()

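The sentence export asserted in the tests is a plain delimited file with a fixed four-column header. The sketch below builds the same columns in memory with the standard `csv` module and reads them back with `csv.DictReader`, which is one way a downstream importer could pick out the `sentence` column. `export_rows` and the sample row are illustrative assumptions, not code from this commit.

```python
import csv
import io

HEADER = ["sentence", "timestamp", "video_url", "vocab_guess"]


def export_rows(rows, delimiter="\t"):
    # newline="" leaves line endings to the csv module, as in the real writer.
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter)
    writer.writerow(HEADER)
    for sentence, start, video_id, vocab in rows:
        writer.writerow([
            sentence,
            f"{start:.2f}",
            f"https://www.youtube.com/watch?v={video_id}",
            ", ".join(vocab),
        ])
    return buf.getvalue()


def read_rows(text, delimiter="\t"):
    # DictReader keys each row by the header line.
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter))


sample = [("Hola mundo", 12.3, "abc123", ["hola", "mundo"])]
tsv_records = read_rows(export_rows(sample))
csv_records = read_rows(export_rows(sample, ","), ",")
print(tsv_records[0]["sentence"], tsv_records[0]["timestamp"])
```

With `delimiter=","` the `vocab_guess` field contains the delimiter, so the writer quotes it and the reader recovers the same value; the TSV and CSV variants therefore round-trip identically.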
382
word_scraper.py
@@ -1,382 +0,0 @@
#!/usr/bin/env python3
"""
word_scraper.py

Extract frequent words/lemmas from Anki notes via AnkiConnect.

Howto:
  ./word_scraper.py jp [--deck "日本語"] [--field Back] [--min-freq 2] [--outdir DIR] [--out FILE]
  ./word_scraper.py es [--deck "Español"] [--field Back] [--min-freq 2] [--outdir DIR] [--out FILE]

By default, this:
  - chooses decks based on the lang code (jp/es) using shared deck mappings
  - pulls notes from Anki via AnkiConnect (http://localhost:8765)
  - reads a single field (default: Back)
  - extracts the first visible line (HTML stripped) from that field
  - tokenizes with spaCy and counts words
  - writes "token count" lines sorted by descending count

Notes:
  - spaCy currently may not work on Python 3.14 in your environment.
    If spaCy import/load fails, create a Python 3.12 venv for this script.
"""

from __future__ import annotations

import argparse
import logging
import os
import sys
from collections import Counter
from html import unescape
from typing import Callable, List

import regex as re

from anki_common import DEFAULT_WORD_OUTPUT_ROOT, DECK_TO_LANGUAGE, LANG_MAP, anki_request


# -------------------------
# Logging
# -------------------------
def setup_logging(logfile: str) -> None:
    os.makedirs(os.path.dirname(os.path.abspath(logfile)), exist_ok=True)
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(message)s",
    )


# -------------------------
# HTML cleanup helpers
# -------------------------
def extract_first_visible_line(text: str) -> str:
    """Remove common HTML and return only the first visible line."""
    text = unescape(text or "")
    text = re.sub(r"</?(br|div|p)[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = text.strip()
    return text.splitlines()[0] if text else ""


def extract_visible_text(text: str) -> str:
    """Remove common HTML and return all visible text as a single string."""
    text = unescape(text or "")
    text = re.sub(r"</?(br|div|p)[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    # Normalize whitespace a bit
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


def get_notes(query: str) -> List[dict]:
    """
    Query Anki for notes and return notesInfo payload.
    """
    note_ids = anki_request("findNotes", query=query) or []
    if not note_ids:
        return []
    return anki_request("notesInfo", notes=note_ids) or []


# -------------------------
# Language-specific token rules (spaCy-based)
# -------------------------
JAPANESE_CHAR_RE = re.compile(r"[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}ー]+")

JAPANESE_PARTICLES = {
    "は", "が", "を", "に", "へ", "で", "と", "や", "も", "から", "まで", "より", "ば", "なら",
    "の", "ね", "よ", "ぞ", "ぜ", "さ", "わ", "か", "な", "って", "とき", "ってば", "けど", "けれど",
    "しかし", "でも", "ながら", "ほど", "し", "もの", "こと", "ところ", "よう", "らしい", "られる",
}

JAPANESE_GRAMMAR_EXCLUDE = {
    "て", "た", "ます", "れる", "てる", "ぬ", "ん", "しまう", "いる", "ない", "なる", "ある", "だ", "です",
}

JAPANESE_ALLOWED_POS = {"NOUN", "PROPN", "VERB", "ADJ"}


def japanese_filter(token) -> bool:
    """
    Filter Japanese tokens to keep “content-ish” words and avoid particles/grammar glue.
    Assumes a Japanese spaCy model that provides lemma_ and pos_ reasonably.
    """
    text = (token.text or "").strip()
    lemma = (token.lemma_ or "").strip()

    if not text:
        return False

    # Must look like Japanese script (hiragana/katakana/kanji/ー)
    if not JAPANESE_CHAR_RE.fullmatch(text):
        return False

    # Drop obvious grammar / particles
    if lemma in JAPANESE_GRAMMAR_EXCLUDE or text in JAPANESE_PARTICLES:
        return False

    # Keep only selected parts of speech
    if getattr(token, "pos_", None) not in JAPANESE_ALLOWED_POS:
        return False

    # Drop URLs/emails/stopwords when model flags them
    if getattr(token, "is_stop", False) or getattr(token, "like_url", False) or getattr(token, "like_email", False):
        return False

    # Defensive: drop tokens that look like HTML fragments or garbage
    if any(c in text for c in "<>=/\\:&%"):
        return False
    if text in {"ruby", "rt", "div", "br", "nbsp", "href", "strong", "a"}:
        return False

    return True


def spanish_filter(token) -> bool:
    """
    Keep alpha tokens that are not stopwords. (spaCy handles accent marks fine here.)
    """
    return bool(getattr(token, "is_alpha", False)) and not bool(getattr(token, "is_stop", False))


def spanish_format(token) -> str:
    return (token.lemma_ or token.text or "").lower().strip()


def japanese_format(token) -> str:
    # Keep both lemma and surface form (useful when lemma normalization is aggressive)
    lemma = (token.lemma_ or "").strip()
    surface = (token.text or "").strip()
    if not lemma and not surface:
        return ""
    if lemma and surface and lemma != surface:
        return f"{lemma} ({surface})"
    return lemma or surface


LANGUAGE_PROFILES = {
    "spanish": {
        "spacy_model": "es_core_news_sm",
        "token_filter": spanish_filter,
        "output_format": spanish_format,
    },
    "japanese": {
        "spacy_model": "ja_core_news_lg",
        "token_filter": japanese_filter,
        "output_format": japanese_format,
    },
}


def load_spacy_model(model_name: str):
    """
    Import spaCy lazily and load a model.
    This lets us show clearer errors when spaCy is missing/broken in the environment.
    """
    try:
        import spacy  # type: ignore
    except Exception as e:
        raise RuntimeError(
            "Failed to import spaCy. If you're on Python 3.14, spaCy may not be compatible yet.\n"
            "Use a Python 3.12 venv for this script."
        ) from e

    try:
        return spacy.load(model_name)
    except Exception as e:
        raise RuntimeError(
            f"Failed to load spaCy model '{model_name}'.\n"
            f"Try: python -m spacy download {model_name}"
        ) from e


# -------------------------
# Core extraction
# -------------------------
def extract_counts(
    notes: List[dict],
    field_name: str,
    nlp,
    token_filter: Callable,
    output_format: Callable,
    use_full_field: bool,
) -> Counter:
    """
    For each note, take the specified field, strip HTML, tokenize, and count.
    """
    counter: Counter = Counter()

    for note in notes:
        fields = note.get("fields", {}) or {}
        raw_val = (fields.get(field_name, {}) or {}).get("value", "") or ""

        text = extract_visible_text(raw_val) if use_full_field else extract_first_visible_line(raw_val)
        if not text:
            continue

        doc = nlp(text)
        for token in doc:
            if token_filter(token):
                key = output_format(token)
                if key:
                    counter[key] += 1

    return counter


def write_counts(counter: Counter, out_path: str, min_freq: int) -> int:
    """
    Write "token count" lines sorted by descending count.
    Returns the number of written entries.
    """
    items = [(w, c) for (w, c) in counter.items() if c >= min_freq]
    items.sort(key=lambda x: (-x[1], x[0]))

    os.makedirs(os.path.dirname(os.path.abspath(out_path)), exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        for word, freq in items:
            f.write(f"{word} {freq}\n")

    return len(items)


def build_query_from_decks(decks: List[str]) -> str:
    """
    Build an Anki query that OR's multiple deck:"..." clauses.
    """
    # deck:"日本語" OR deck:"日本語::subdeck" is possible but we keep it simple.
    parts = [f'deck:"{d}"' for d in decks]
    return " OR ".join(parts)


# -------------------------
# Main CLI
# -------------------------
def main() -> int:
    parser = argparse.ArgumentParser(
        description="Extract frequent words from Anki notes (CLI resembles other toolkit scripts)."
    )

    # Match the positional "lang" style (jp/es)
    parser.add_argument("lang", choices=sorted(LANG_MAP.keys()), help="Language code (jp or es).")

    # Let you override deck selection, but keep sane defaults:
    # - if --query is provided, we use that exactly
    # - else if --deck is provided (repeatable), we use those decks
    # - else we infer decks from DECK_TO_LANGUAGE mapping
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--query",
        help='Full Anki search query (e.g. \'deck:"Español" tag:foo\'). Overrides --deck.',
    )
    group.add_argument(
        "--deck",
        action="append",
        help='Deck name (repeatable). Example: --deck "日本語" --deck "日本語::Subdeck"',
    )

    # Similar “bashy” knobs
    parser.add_argument("--field", default="Back", help="Which note field to read (default: Back).")
    parser.add_argument("--min-freq", type=int, default=2, help="Minimum frequency to include (default: 2).")
    parser.add_argument("--outdir", help="Output directory (default: ~/Languages/Anki/anki-words/<language>).")
    parser.add_argument("--out", help="Output file path (default: <outdir>/words_<lang>.txt).")
    parser.add_argument(
        "--full-field",
        action="store_true",
        help="Use the full field text (HTML stripped) instead of only the first visible line.",
    )
    parser.add_argument(
        "--spacy-model",
        help="Override the spaCy model name (advanced).",
    )
    parser.add_argument(
        "--logfile",
        default=os.path.join(DEFAULT_WORD_OUTPUT_ROOT, "extract_words.log"),
        help="Log file path.",
    )

    args = parser.parse_args()

    setup_logging(args.logfile)

    language_bucket = LANG_MAP[args.lang]
    profile = LANGUAGE_PROFILES.get(language_bucket)
    if not profile:
        print(f"❌ Unsupported language bucket: {language_bucket}", file=sys.stderr)
        return 1

    # Resolve query / decks
    if args.query:
        query = args.query
    else:
        if args.deck:
            decks = args.deck
        else:
            decks = [d for d, lang in DECK_TO_LANGUAGE.items() if lang == language_bucket]
        if not decks:
            print(f"❌ No decks mapped for language: {language_bucket}", file=sys.stderr)
            return 1
        query = build_query_from_decks(decks)

    # Output paths
    out_dir = os.path.expanduser(args.outdir) if args.outdir else os.path.join(DEFAULT_WORD_OUTPUT_ROOT, language_bucket)
    default_outfile = os.path.join(out_dir, f"words_{args.lang}.txt")
    out_path = os.path.expanduser(args.out) if args.out else default_outfile

    logging.info("lang=%s bucket=%s query=%s field=%s", args.lang, language_bucket, query, args.field)
    print(f"🔎 Query: {query}")
    print(f"🧾 Field: {args.field}")

    # Load spaCy model
    model_name = args.spacy_model or profile["spacy_model"]
    try:
        nlp = load_spacy_model(model_name)
    except Exception as e:
        print(f"❌ {e}", file=sys.stderr)
        logging.exception("spaCy load failed")
        return 1

    # Fetch notes
    try:
        notes = get_notes(query)
    except Exception as e:
        print(f"❌ Failed to query AnkiConnect: {e}", file=sys.stderr)
        logging.exception("AnkiConnect query failed")
        return 1

    print(f"✅ Found {len(notes)} notes.")
    if not notes:
        print("⚠️ No notes found. Check your query/deck names.")
        return 0

    # Validate the field exists on at least one note
    fields0 = (notes[0].get("fields", {}) or {})
    if args.field not in fields0:
        available = list(fields0.keys())
        print(f"❌ Field '{args.field}' not found on sample note.", file=sys.stderr)
        print(f"   Available fields: {available}", file=sys.stderr)
        return 1

    # Extract + write
    counter = extract_counts(
        notes=notes,
        field_name=args.field,
        nlp=nlp,
        token_filter=profile["token_filter"],
        output_format=profile["output_format"],
        use_full_field=args.full_field,
    )

    print(f"🧠 Extracted {len(counter)} unique entries (before min-freq filter).")
    written = write_counts(counter, out_path, args.min_freq)

    print(f"📄 Wrote {written} entries to: {out_path}")
    logging.info("wrote=%s out=%s", written, out_path)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

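The HTML cleanup helpers in this removed script turn `br`, `div`, and `p` tags into line breaks, drop every other tag, and then keep either the first line or the whole text. A stdlib-only sketch of the first-line variant follows. It uses `re` in place of the third-party `regex` module the script imports, which is an assumption that holds only because these two patterns use no `regex`-specific syntax. `first_visible_line` is an illustrative name.

```python
import re
from html import unescape


def first_visible_line(text):
    # Decode entities first, so "&nbsp;" becomes the character U+00A0.
    text = unescape(text or "")
    # Block-level tags become line breaks so that each block is its own line.
    text = re.sub(r"</?(br|div|p)[^>]*>", "\n", text, flags=re.IGNORECASE)
    # Every remaining tag is removed without leaving a separator.
    text = re.sub(r"<[^>]+>", "", text)
    text = text.strip()
    return text.splitlines()[0] if text else ""


line = first_visible_line("<div>Hola&nbsp;mundo</div><br><p>segunda linea</p>")
print(repr(line))
```

The non-breaking space survives because nothing in the helper rewrites it and `splitlines()` does not split on it, which is consistent with the `"Hola\xa0mundo"` expectation in `tests/test_core.py`.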
191
yt-transcript.py
@@ -1,191 +0,0 @@
#!/usr/bin/env python3
"""
yt-transcript.py

Extract vocab or timestamped lines from a YouTube transcript.

Howto:
  ./yt-transcript.py {jp,es} <video_url_or_id> [options]

Examples:
  ./yt-transcript.py es https://youtu.be/SLgVwNulYhc --mode vocab --top 50
  ./yt-transcript.py jp SLgVwNulYhc --mode sentences

Requirements:
  pip install youtube-transcript-api

Japanese tokenization (recommended "Option 1"):
  pip install "fugashi[unidic-lite]"
"""

from __future__ import annotations

import re
import sys
import argparse
from collections import Counter
from urllib.parse import urlparse, parse_qs

from youtube_transcript_api import YouTubeTranscriptApi

from anki_common import TRANSCRIPT_LANG_MAP

# Small starter stopword lists (you can grow these over time)
STOPWORDS = {
    "es": {
        "de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por",
        "un", "para", "con", "no", "una", "su", "al", "lo", "como",
    },
    "en": {"the", "is", "and", "of", "to", "in", "it", "that", "on", "you", "this", "for", "with"},
    "ja": {"の", "に", "は", "を", "た", "が", "で", "て", "です", "ます", "する", "ある", "いる"},
}


# -------------------------
# URL / transcript helpers
# -------------------------
def extract_video_id(url_or_id: str) -> str:
    """Accept full YouTube URLs (including youtu.be) or raw video IDs."""
    if "youtube" in url_or_id or "youtu.be" in url_or_id:
        query = urlparse(url_or_id)

        # youtu.be/<id>
        if query.hostname == "youtu.be":
            return query.path.lstrip("/")

        # youtube.com/watch?v=<id>
        if query.hostname in ("www.youtube.com", "youtube.com", "m.youtube.com"):
            qs = parse_qs(query.query)
            v = qs.get("v", [])
            if v:
                return v[0]

    return url_or_id


def fetch_transcript(video_id: str, lang_code: str):
    """
    Support both youtube-transcript-api v1.x and older v0.x.

    - v1.x: instance method .fetch(video_id, languages=[...]) -> list of snippet objects
    - v0.x: class method .get_transcript(video_id, languages=[...]) -> list of dicts
    """
    # Newer API (v1.x)
    if hasattr(YouTubeTranscriptApi, "fetch"):
        api = YouTubeTranscriptApi()
        return api.fetch(video_id, languages=[lang_code])

    # Older API (v0.x)
    if hasattr(YouTubeTranscriptApi, "get_transcript"):
        return YouTubeTranscriptApi.get_transcript(video_id, languages=[lang_code])

    raise RuntimeError("Unsupported youtube-transcript-api version (missing fetch/get_transcript).")


def snippet_text(entry) -> str:
    """Entry can be a dict (old API) or a snippet object (new API)."""
    if isinstance(entry, dict):
        return (entry.get("text", "") or "")
    return (getattr(entry, "text", "") or "")


def snippet_start(entry) -> float:
    """Entry can be a dict (old API) or a snippet object (new API)."""
    if isinstance(entry, dict):
        return float(entry.get("start", 0.0) or 0.0)
    return float(getattr(entry, "start", 0.0) or 0.0)


# -------------------------
# Tokenization
# -------------------------
def tokenize_japanese(text: str) -> list[str]:
    """
    Japanese tokenization using fugashi (MeCab wrapper).
    Recommended install: pip install "fugashi[unidic-lite]"
    """
    try:
        from fugashi import Tagger
    except ImportError as e:
        raise RuntimeError('Japanese requires fugashi. Install: pip install "fugashi[unidic-lite]"') from e

    tagger = Tagger()
    return [w.surface for w in tagger(text)]


def tokenize_spanish(text: str, raw: bool = False) -> list[str]:
    """
    Lightweight Spanish tokenization (keeps accented letters).
    If raw=False, lowercases everything.
    """
    tokens = re.findall(r"\b[\wáéíóúñü]+\b", text)
    return tokens if raw else [t.lower() for t in tokens]


def count_words(tokens: list[str], lang_code: str, remove_stopwords: bool = True) -> Counter:
    if remove_stopwords:
        sw = STOPWORDS.get(lang_code, set())
        tokens = [t for t in tokens if t not in sw]
    return Counter(tokens)


# -------------------------
# Main
# -------------------------
def main() -> int:
    parser = argparse.ArgumentParser(
        description="Extract vocab or timestamped lines from a YouTube transcript."
    )
    parser.add_argument("lang", choices=["jp", "es"], help="Language code (jp or es).")
    parser.add_argument("video", help="YouTube video URL or ID")
    parser.add_argument(
        "--mode",
        choices=["vocab", "sentences"],
        default="vocab",
        help="Mode: vocab (word counts) or sentences (timestamped lines)",
    )
    parser.add_argument("--top", type=int, default=None, help="Top N words (vocab mode only)")
    parser.add_argument("--no-stopwords", action="store_true", help="Don't remove common words")
    parser.add_argument(
        "--raw",
        action="store_true",
        help="(Spanish only) Do not lowercase tokens",
    )

    args = parser.parse_args()
    lang_code = TRANSCRIPT_LANG_MAP[args.lang]
    video_id = extract_video_id(args.video)

    try:
        transcript = fetch_transcript(video_id, lang_code)
    except Exception as e:
        print(f"Error fetching transcript: {e}", file=sys.stderr)
        return 1

    if args.mode == "sentences":
        for entry in transcript:
            start = snippet_start(entry)
            text = snippet_text(entry).replace("\n", " ").strip()
            if text:
                print(f"[{start:.2f}s] {text}")
        return 0

    # vocab mode
    text = " ".join(snippet_text(entry) for entry in transcript).replace("\n", " ")

    if lang_code == "ja":
        tokens = tokenize_japanese(text)
    else:
        tokens = tokenize_spanish(text, raw=args.raw)

    counts = count_words(tokens, lang_code, remove_stopwords=not args.no_stopwords)
    items = counts.most_common(args.top) if args.top else counts.most_common()

    for word, count in items:
        print(f"{word}: {count}")

    return 0


if __name__ == "__main__":
    raise SystemExit(main())

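According to the `fetch_transcript` docstring, transcript entries arrive either as dicts or as snippet objects depending on the installed `youtube-transcript-api` version, which is why `snippet_text` and `snippet_start` accept both. A self-contained sketch of that duck-typed access follows. `FakeSnippet` is a hypothetical stand-in for the library's snippet object so that the example runs without the library or network access; `entry_field` and `to_lines` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class FakeSnippet:
    # Hypothetical object-style entry with only the two attributes used here.
    text: str
    start: float


def entry_field(entry, name, default):
    # Dict entries use key lookup, object entries use attribute lookup.
    if isinstance(entry, dict):
        value = entry.get(name, default)
    else:
        value = getattr(entry, name, default)
    # Missing, None, or empty values fall back to the default.
    return value or default


def to_lines(entries):
    lines = []
    for entry in entries:
        text = str(entry_field(entry, "text", "")).replace("\n", " ").strip()
        if text:  # skip entries that are empty after cleanup
            lines.append((float(entry_field(entry, "start", 0.0)), text))
    return lines


mixed = [
    {"text": "hola\nmundo", "start": 1.5},
    FakeSnippet(text="  ", start=2.0),
    FakeSnippet(text="adiós", start=3.25),
]
lines = to_lines(mixed)
print(lines)
```

Funnelling both shapes through one accessor keeps the rest of the pipeline independent of the library version, at the cost of silently defaulting any field the entry does not carry.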