ZebraLogicReasoner is a research toolchain for evaluating LLMs on structured logical reasoning tasks. It focuses on Zebra Logic puzzles and validates final answers as well as chain-of-thought transitions against a Z3 symbolic solver. The same pipeline also generates configurable puzzle sets, including factual, counterfactual, and noise-augmented variants, for evaluation and model development.
ZebraLogicReasoner fills a common gap in LLM evaluation by checking not only the final answer, but also whether each reasoning step stays consistent with the symbolic solution. It combines model execution, parsing, symbolic validation, dataset generation, and reporting in one reproducible pipeline.
Key features:
- Benchmark for evaluating LLMs on structured logical puzzles (abstract reasoning)
- Puzzle generation for normal, factual/counterfactual, and noise-augmented puzzle variants, that can be used for both evaluation and model development.
- Support for various frontier LLMs out-of-the-box (via OpenRouter model execution).
- Z3-backed validation of final solutions and reasoning transitions.
The tool operates in the following stages:
- Configure experiment settings (model, decoding, paths, and pipeline stages).
- Build prompts from puzzle datasets.
- Run model inference and store raw outputs.
- Parse responses into structured reasoning payloads.
- Evaluate parsed outputs with symbolic ground truth (solution correctness and transition consistency).
- Build/update analysis cache and generate reports/plots.
- Optionally generate new puzzles from the available puzzle configurations.
Typical stage sequence for full runs: build -> run -> parse -> reason -> analyze -> plots.
This tool is intended for:
- benchmarking LLM reasoning quality on structured logic puzzles
- distinguishing answer correctness from reasoning correctness
- controlled robustness studies under counterfactual and noisy puzzle settings
Within the AIMS5.0 context, the framework is especially relevant where AI outputs must be evaluated with transparent, verifiable logical criteria rather than only surface-level answer matching.
- Requires external model APIs (OpenRouter) and corresponding credentials.
- Symbolic checks are tailored to ZebraLogic-style structured puzzles and are not a generic verifier for arbitrary reasoning tasks.
- Current workflow is primarily CLI-centric; it is not yet packaged as a polished end-user application.
As this research is still in progress, a link will be provided once the paper is published.