Project Four: Synthetic Math/Code Textbook¶
Scenario: Enhancing the logical reasoning ability of small models.
Core Technologies: Evol-Instruct evolution strategy, Python code-execution sandbox verification, and PoT (Program of Thought) data formatting.
Output: A high-quality, verified synthetic reasoning dataset.
1. Project Background (Project Brief)¶
- Task Definition: Build a high-quality "Program of Thought" (PoT) dataset. We use an LLM (DeepSeek-V3) to "evolve" simple math problems into complex word problems, generate corresponding Python code solutions, and verify answer correctness through a code-execution sandbox.
- Input and Output:
  - Input: Raw JSONL files from the base datasets (e.g., GSM8K for math, MBPP for code).
  - Output: A cleaned JSONL dataset containing `question` (the evolved problem), `thought_process` (the code-solution reasoning), and `execution_output` (the execution result).
- Challenge Analysis: The biggest difficulty of this project is "hallucination elimination." LLM-generated code often looks correct but cannot run (syntax errors or logic bugs). We need an automated sandbox that filters out non-executable samples, ensuring the "textbook" stays rigorous.
2. Architecture Design¶
Data Pipeline Diagram¶
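Seed Data (GSM8K / MBPP) → Evol-Instruct Evolution → PoT Code Generation → Sandbox Execution Verification → `verified_textbook.jsonl`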
Technology Stack¶
- Data Source: `HuggingFace Datasets` (GSM8K / MBPP).
- Generation Engine: `DeepSeek-V3` (via the SiliconFlow API), a cost-effective code-generation model.
- Orchestration Logic: Python scripts (Evol-Instruct strategy).
- Verification Environment: Python `subprocess` (local sandbox); Docker or a micro-VM is recommended in production.
3. Step-by-Step Implementation¶
Phase 1: Seed Data Acquisition (Seed Preparation)¶
Everything starts with high-quality seeds. We don't need massive data, only representative logic cores.
Key Actions:
1. Download the GSM8K (math) and MBPP (code) datasets.
2. Randomly sample a subset as the foundation for "evolution".
Glue Code (Data Sampler):
Code from download_data.py and sampler.py
import random

# Core logic: extract seeds from the raw data, keeping only the question field.
# The original answer is discarded because the model will regenerate a code-based
# solution; we keep it purely for reference.
sampled = random.sample(data, SAMPLE_SIZE)
seeds = []
for entry in sampled:
    seed_entry = {
        "id": random.randint(1000, 9999),
        "seed_question": entry['question'],   # keep only the question
        "original_answer": entry['answer'],   # for reference only
    }
    seeds.append(seed_entry)
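For the download step itself, a minimal sketch using the Hugging Face `datasets` library is shown below. The split and field names follow the public GSM8K and MBPP dataset cards, and the output file name `seed_raw.jsonl` is illustrative rather than taken from the project code.

```python
# download_data.py (sketch): pull seed questions with the Hugging Face `datasets` library.
import json
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="train")   # fields: question, answer
mbpp = load_dataset("mbpp", split="train")              # fields: task_id, text, code, ...

with open("seed_raw.jsonl", "w", encoding="utf-8") as f:
    for row in gsm8k:
        f.write(json.dumps({"question": row["question"], "answer": row["answer"]}) + "\n")
    for row in mbpp:
        # MBPP keeps the problem statement in `text`; reuse the GSM8K schema
        f.write(json.dumps({"question": row["text"], "answer": row["code"]}) + "\n")
```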
Phase 2: Evol-Instruct and PoT Generation (Evolution & Generation)¶
This is the core of the project. We can't settle for simple Q&A pairs; we need the model to think like a human expert.
Flow Logic:
1. Evol (Evolution): Rewrite simple problems (e.g., "1+1=?") into complex scenarios (e.g., "Xiaoming has 1 apple, affected by inflation..."), adding constraints.
2. PoT (Code Solution): Force the model to write Python code to solve the problem rather than directly outputting a text answer.
Core Prompts (Prompt Engineering):
Code from evol.py
def get_evol_prompt(seed_question):
    return f"""
You are a professional math competition problem composer. Please rewrite the following basic math problem into a more complex, logically rigorous one.
[Original Problem]: {seed_question}
[Rewrite Requirements]:
1. Add constraints: introduce more variables or limitations.
2. Add reasoning depth: do not give numbers directly; make them depend on each other through logical relationships.
3. Make it concrete: place the abstract numbers into a specific physical or business scenario.
...
"""

def get_pot_prompt(evolved_question):
    return f"""
Please write Python code to solve the following math problem.
...
1. Write a function named `solve()`.
2. Clearly write the reasoning steps in code comments.
3. `solve()` must return the final numerical answer.
...
"""
Phase 3: Sandbox Verification¶
Generated data contains many "dead" samples (syntax errors, timeouts, infinite loops), so every sample must be verified by actually executing it.
Sandbox Logic:
1. Use a regex to extract code blocks from the model's Markdown reply (see the sketch below).
2. Start a subprocess (`subprocess`) to execute the code.
3. Critical: Set a timeout to prevent infinite loops from blocking the pipeline.
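A minimal extraction helper for step 1; the fence pattern assumes the model wraps its solution in a standard Markdown code block.

```python
import re

def extract_code(markdown_text):
    """Pull the first fenced Python code block out of the model's Markdown reply."""
    match = re.search(r"`{3}(?:python)?\s*\n(.*?)`{3}", markdown_text, re.DOTALL)
    return match.group(1).strip() if match else None
```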
Verification Script:
Code from sandbox.py
import os
import subprocess

def execute_code(code, timeout=5):
    """
    Execute Python code in a subprocess and return (success, output).

    WARNING: This function must only be called inside a strongly isolated sandbox
    (minimal-privilege container or micro-VM, no network, restricted filesystem).
    To avoid accidentally executing arbitrary code on the host, it raises an
    exception unless the sandbox is explicitly declared by setting the
    environment variable EXECUTE_CODE_SANDBOXED=1 inside the sandbox container.
    """
    # Basic protection: refuse to execute arbitrary code outside a declared sandbox
    if os.environ.get("EXECUTE_CODE_SANDBOXED") != "1":
        raise RuntimeError(
            "execute_code may only be used in a controlled sandbox environment; "
            "set EXECUTE_CODE_SANDBOXED=1 inside an isolated container/micro-VM before calling."
        )
    try:
        # Run the code in an independent subprocess
        result = subprocess.run(
            ['python3', '-c', code],
            capture_output=True,  # capture stdout/stderr
            text=True,
            timeout=timeout,      # timeout is mandatory to kill infinite loops
            check=False,
        )
        if result.returncode == 0:
            return True, result.stdout.strip()
        return False, f"Error: {result.stderr.strip()}"
    except subprocess.TimeoutExpired:
        return False, f"Error: execution timed out after {timeout}s"
4. Results Showcase¶
After sandbox cleaning, we obtain `verified_textbook.jsonl`. This is textbook-grade synthetic data.
Data Sample Comparison:
| Phase | Content Example |
|---|---|
| Original Seed | Jenny has 5 apples, ate 2, how many left? |
| Evol Evolution | Jenny runs a fruit store with 5 crates of apples (12 per crate). On Monday she sold 40% of the inventory, and 2 individual apples spoiled due to improper storage. Please calculate the exact number of remaining sellable apples. |
| PoT Solution | `def solve(): total = 5 * 12; sold = total * 0.4; ... return remaining` |
| Execution Result | 34 (verified, saved to dataset) |
Verification Statistics: Typically, the one-shot pass rate (Pass@1) of post-Evol code is 60%-80%. The 20%-40% of erroneous samples filtered out by the sandbox are exactly the ones that would pollute model training; removing them significantly improves the logical consistency of the SFT model.
5. Cost and Optimization (Cost & Optimization)¶
- Resource consumption:
  - API cost: Each valid sample consumes ~2 LLM calls (evolution + solution). With a cost-effective model such as DeepSeek-V3, generating 1k high-quality textbook samples can be kept under $5.
  - Time cost: Single-threaded local Python is slow; verifying 1k code samples takes roughly 5-10 minutes.
- Security Warning (Critical):
  - This project uses `subprocess` for local execution. Running unknown or untrusted model-generated code this way is extremely risky (e.g., `os.system('rm -rf /')`).
  - Production transformation plan: Migrate the `sandbox.py` execution environment into a Docker container or an AWS Firecracker micro-VM, with network access disabled (see the sketch after this list).
- Scaling considerations:
  - If the data scales to millions of samples, a single-machine script cannot keep up. Introduce `RabbitMQ` or `Kafka` for task distribution and build a distributed "generate-verify" cluster.
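One way to implement that production hardening is sketched below, replacing the bare subprocess with a throwaway Docker container. The image name and resource limits are illustrative; the flags shown are standard Docker options.

```python
# Hardened variant (sketch): run each snippet in an isolated Docker container.
import subprocess

def execute_code_in_docker(code, timeout=10):
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",      # no network access
            "--read-only",            # read-only root filesystem
            "--tmpfs", "/tmp",        # writable scratch space only in /tmp
            "--memory", "256m",       # cap memory
            "--cpus", "0.5",          # cap CPU
            "python:3.11-slim",
            "python", "-c", code,
        ],
        capture_output=True, text=True, timeout=timeout, check=False,
    )
    if result.returncode == 0:
        return True, result.stdout.strip()
    return False, f"Error: {result.stderr.strip()}"
```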
