Prompt Templates for Data Analysts: SQL, Python, and Visualization
You spend half your day writing SQL, cleaning data, and explaining charts to stakeholders who want answers "by end of day." AI can do the tedious parts — if you prompt it correctly.
Most analysts type vague requests like "write me a SQL query for revenue" and get back generic, broken code. The fix isn't a better model — it's a better prompt. This guide gives you 10 structured templates for the tasks data analysts do every day: SQL generation, data cleaning, EDA, statistical analysis, visualization, and report writing.
Every template works with ChatGPT, Claude, Gemini, or any other LLM. (Not sure which model to use? See our comparison of ChatGPT, Claude, and Gemini.)
Why Analysts Need Structured Prompts
The gap between "AI-generated SQL that sort of works" and "AI-generated SQL you can actually run" is almost entirely a prompting problem. Here's a real example:
Vague prompt:
Write a SQL query to find our best customers.
Result: The model guesses at your table names, invents columns, and defines "best" however it wants. You spend 20 minutes fixing it.
Structured prompt:
Write a PostgreSQL query to find the top 20 customers by total revenue
in the last 12 months.
Tables:
- orders (id, customer_id, total_amount, created_at, status)
- customers (id, name, email, company, created_at)
Requirements:
- Only include orders with status = 'completed'
- Group by customer, show customer name, email, order count, and total revenue
- Sort by total revenue descending
- Format revenue as dollars with 2 decimal places
Result: A query you can run immediately, or with minimal tweaks. The 60 seconds spent structuring the prompt saves 20 minutes of debugging.
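To see why the structured version works, here is roughly the query it should produce, run against an in-memory SQLite database with made-up sample rows (a sketch for illustration: the table contents are invented, the 12-month date filter is omitted so the fixed sample dates always match, and `printf` stands in for PostgreSQL's formatting functions):

```python
import sqlite3

# Hypothetical sample data mirroring the schema given in the prompt above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     total_amount REAL, created_at TEXT, status TEXT);
INSERT INTO customers VALUES (1, 'Acme', 'ops@acme.test'), (2, 'Globex', 'it@globex.test');
INSERT INTO orders VALUES
  (1, 1, 1200.50, '2024-06-01', 'completed'),
  (2, 1,  800.00, '2024-07-15', 'completed'),
  (3, 2,  300.00, '2024-08-02', 'completed'),
  (4, 2, 9999.00, '2024-08-03', 'refunded');   -- excluded by the status filter
""")

# Roughly the query the structured prompt should yield.
query = """
SELECT c.name, c.email,
       COUNT(o.id)                          AS order_count,
       printf('$%.2f', SUM(o.total_amount)) AS total_revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.id, c.name, c.email
ORDER BY SUM(o.total_amount) DESC
LIMIT 20;
"""
for row in conn.execute(query):
    print(row)
```

Every requirement in the prompt maps to one clause: the status filter, the grouping, the sort, the dollar formatting. Nothing is left for the model to guess.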
The Templates
1. SQL Query Generation
The most common analyst task. Give the model your schema and it writes precise queries.
Template:
Write a [PostgreSQL/MySQL/BigQuery] query for the following request:
Request: [what you need in plain English]
Database schema (relevant tables only):
- [table_name] ([column1] [type], [column2] [type], ...)
- [table_name] ([column1] [type], [column2] [type], ...)
Relationships:
- [table1.column] references [table2.column]
Constraints:
- [date range, filters, exclusions]
- [any business rules]
Output columns: [list the columns you want in the result]
Sort order: [how to sort]
Limit: [number of rows, if applicable]
Example filled in:
Write a PostgreSQL query for the following request:
Request: Monthly recurring revenue (MRR) trend for the last 12 months,
broken down by pricing plan.
Database schema:
- subscriptions (id, customer_id, plan_name, monthly_amount, status,
started_at, cancelled_at)
- customers (id, name, email, segment)
Relationships:
- subscriptions.customer_id references customers.id
Constraints:
- Only active subscriptions (status = 'active' OR cancelled_at > month end)
- Last 12 complete months (exclude current partial month)
Output columns: month, plan_name, subscriber_count, mrr
Sort order: month ASC, mrr DESC
Why it works: Specifying the dialect, schema, relationships, and constraints eliminates most of the back-and-forth: the model no longer has to guess table names, join keys, or business definitions.
2. Data Cleaning and Transformation
Messy data is the reality. Let AI write the cleaning logic.
Template:
Write a [Python pandas / SQL / R] script to clean this dataset.
Dataset description:
- Source: [where the data comes from]
- Format: [CSV, JSON, database table]
- Row count: approximately [N] rows
- Columns: [list columns with types]
Known data quality issues:
1. [Issue: e.g., "email column has mixed formats and some entries are 'N/A'"]
2. [Issue: e.g., "date column has formats MM/DD/YYYY and YYYY-MM-DD mixed"]
3. [Issue: e.g., "revenue column has currency symbols and commas in some rows"]
Cleaning requirements:
- [How to handle nulls: drop, fill with default, interpolate]
- [How to handle duplicates: keep first, keep last, flag]
- [Standard format for dates, currencies, text fields]
- [Any derived columns to create]
Output: cleaned dataframe / table ready for analysis
Add comments explaining each cleaning step.
Why it works: Data cleaning is context-heavy. The model can't guess what "clean" means for your specific dataset without knowing the issues and your standards.
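As a concrete sketch, here is what a filled-in version of this template might produce for the three example issues listed above. The column names, fill rules, and data are assumptions chosen purely for illustration:

```python
import pandas as pd

# Hypothetical messy data illustrating the three example issues.
df = pd.DataFrame({
    "email": ["A.User@Example.COM ", "N/A", "b.user@example.com"],
    "signup_date": ["03/14/2024", "2024-03-15", "12/01/2023"],
    "revenue": ["$1,200.50", "300", "$45.00"],
})

# Issue 1: normalize emails; treat 'N/A' as missing.
df["email"] = df["email"].str.strip().str.lower().replace("n/a", pd.NA)

# Issue 2: parse mixed MM/DD/YYYY and YYYY-MM-DD formats.
# Parsing element-wise lets each value be inferred independently.
df["signup_date"] = df["signup_date"].apply(pd.to_datetime)

# Issue 3: strip currency symbols and commas, then convert to float.
df["revenue"] = df["revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

print(df.dtypes)
```

Each step maps back to one "known issue" in the prompt, which is exactly the traceability you want when reviewing AI-written cleaning code.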
3. Exploratory Data Analysis (EDA)
When you get a new dataset and need to understand what's in it.
Template:
I have a dataset with the following structure:
- [N] rows, [M] columns
- Columns: [list with types]
- Source: [where it came from]
- Purpose: [what analysis we're planning]
Write a [Python/R] EDA script that:
1. Summary statistics for all numeric columns (mean, median, std, min, max, quartiles)
2. Value counts for categorical columns (top 10 values per column)
3. Missing value report (count and percentage per column)
4. Distribution analysis for key numeric columns: [list specific columns]
5. Correlation matrix for numeric columns
6. Identify outliers using IQR method for: [list columns]
7. Time-based trends if applicable: [date column] vs [metric columns]
For each finding, add a brief interpretation comment.
Generate matplotlib/seaborn plots for distributions and correlations.
Why it works: EDA is a checklist task — the model excels at running through a structured checklist consistently. Without structure, it picks a random subset of analyses.
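The first few checklist items translate into only a handful of pandas calls. A minimal sketch, with a made-up dataset standing in for whatever you just received:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset for illustration.
df = pd.DataFrame({
    "amount": [10.0, 12.5, np.nan, 200.0, 11.0],
    "region": ["us", "eu", "us", "us", None],
})

# 1. Summary statistics for numeric columns.
summary = df.describe()

# 2. Value counts for categorical columns (top 10).
top_regions = df["region"].value_counts().head(10)

# 3. Missing value report: count and percentage per column.
missing = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(1),
})

# 6. Outliers via the IQR method for the 'amount' column.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]

print(missing)
```

The value of the template is that the model emits all seven items in order, every time, instead of whichever subset it feels like.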
4. Statistical Analysis
When you need more than descriptive stats.
Template:
Perform a [type of analysis] on the following data scenario:
Question: [the business question you're trying to answer]
Hypothesis: [what you expect to find, or null hypothesis]
Data:
- Dataset: [description]
- Key variables:
- Dependent variable: [column name, type]
- Independent variables: [column names, types]
- Control variables: [if any]
- Sample size: [N]
Requirements:
1. State the null and alternative hypotheses
2. Check assumptions for the chosen test (normality, homogeneity, independence)
3. Run the appropriate statistical test
4. Report: test statistic, p-value, effect size, confidence interval
5. Interpret the results in plain English for a non-technical stakeholder
Write this in [Python with scipy/statsmodels / R].
If assumptions are violated, suggest and run an appropriate non-parametric alternative.
Example analysis types: A/B test significance, regression analysis, cohort comparison, time series decomposition, chi-square test for categorical data.
Why it works: The model is excellent at statistical analysis but tends to skip assumption checks and effect sizes unless explicitly asked. This template forces rigor.
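For instance, an A/B-style comparison covering steps 2 through 4 might be sketched like this with scipy. The two samples are fabricated for illustration, and Welch's t-test is used as the example test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Fabricated samples: control vs. variant of some continuous metric.
control = rng.normal(loc=100, scale=15, size=200)
variant = rng.normal(loc=105, scale=15, size=200)

# 2. Assumption checks: normality per group, homogeneity of variance.
p_control = stats.shapiro(control).pvalue
p_variant = stats.shapiro(variant).pvalue
p_levene = stats.levene(control, variant).pvalue

# 3. Welch's t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)

# 4. Effect size: Cohen's d using the pooled standard deviation.
pooled_sd = np.sqrt((control.std(ddof=1) ** 2 + variant.std(ddof=1) ** 2) / 2)
cohens_d = (variant.mean() - control.mean()) / pooled_sd

print(f"t={t_stat:.2f}, p={p_value:.4f}, d={cohens_d:.2f}")
```

Note that the assumption checks and the effect size are exactly the pieces the model skips when you do not ask for them explicitly.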
5. Dashboard Narrative Generation
You've built the dashboard. Now stakeholders want the story.
Template:
Write a data narrative for a [weekly/monthly/quarterly] business review.
Metrics:
[Paste the key numbers — the model needs actual data to interpret]
- Revenue: $[X] (vs $[Y] last period, vs $[Z] same period last year)
- Active users: [N] (trend: [up/down/flat])
- Conversion rate: [X]% (from [Y]%)
- Churn rate: [X]% (from [Y]%)
- [Other relevant metrics]
Context:
- [Any events that explain the numbers: product launch, pricing change,
seasonal effect, marketing campaign, outage]
Audience: [who will read this — executive team, board, department leads]
Tone: [concise and direct / detailed with recommendations / optimistic but honest]
Structure:
1. Executive summary (3 sentences max)
2. Key highlights (what's going well)
3. Areas of concern (what needs attention)
4. Recommended actions (what to do about it)
Do not make up explanations for trends — only interpret based on the context
and data provided. If a trend is unexplained, say so.
Why it works: Stakeholders don't want charts — they want stories. This template produces narratives that connect numbers to actions.
6. Visualization Code Generation
Writing matplotlib/plotly code from scratch is tedious. Let AI handle the boilerplate.
Template:
Generate a [Python matplotlib/seaborn / Plotly / D3.js] visualization:
Chart type: [bar, line, scatter, heatmap, box plot, funnel, etc.]
Data: [describe the data or paste a sample]
X-axis: [column/variable] — label: "[axis label]"
Y-axis: [column/variable] — label: "[axis label]"
Color/grouping: [column, if applicable]
Styling requirements:
- Title: "[chart title]"
- Figure size: [width x height in inches, or responsive]
- Color palette: [specific colors, or "professional/muted/colorblind-safe"]
- [Show/hide legend, position if shown]
- [Grid lines: yes/no]
- [Data labels on bars: yes/no, format]
- [Annotations for specific data points, if any]
Data format: [pandas DataFrame / list of dicts / CSV / direct values]
Make the chart presentation-ready — suitable for [slide deck / report / dashboard].
Why it works: Visualization is 20% logic and 80% styling. The template covers both so you get a chart you can use immediately, not just a technically correct but ugly plot.
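A filled-in version of this template for a simple bar chart might yield code along these lines. The data, title, and color are placeholder assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; we only save to file
import matplotlib.pyplot as plt

# Hypothetical monthly revenue data.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12_000, 15_500, 14_200, 18_900]

fig, ax = plt.subplots(figsize=(8, 4.5))
bars = ax.bar(months, revenue, color="#4C72B0")  # muted, colorblind-safe blue

# Title and axis labels, per the template's styling fields.
ax.set_title("Monthly Revenue, Q1")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (USD)")
ax.grid(axis="y", alpha=0.3)

# Data labels on bars, formatted as dollars.
for bar, value in zip(bars, revenue):
    ax.annotate(f"${value:,.0f}",
                (bar.get_x() + bar.get_width() / 2, value),
                ha="center", va="bottom")

fig.tight_layout()
fig.savefig("revenue.png", dpi=150)
```

Everything below the `ax.bar` call is the styling 80%; that is the part the template saves you from writing by hand.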
7. Report Writing from Raw Data
Turn query results into stakeholder-ready analysis.
Template:
Write an analysis report based on the following query results.
Query purpose: [what question this data answers]
Audience: [who will read the report and their technical level]
Data:
[Paste query results as a table or CSV]
Report structure:
1. Key finding (one sentence)
2. Supporting data (reference specific numbers from the results)
3. Trend analysis (compare to previous periods if data is available)
4. Anomalies (anything unexpected in the data)
5. Recommendations (what action to take based on the findings)
6. Caveats (limitations of this analysis, data quality notes)
Tone: [formal for executives / casual for team / technical for engineers]
Length: [brief paragraph / full page / detailed multi-section]
Important: Only make claims supported by the data provided.
If the data is insufficient to draw a conclusion, say so.
Why it works: Analysts often spend more time writing about data than analyzing it. This template produces first drafts that need minimal editing.
8. Data Pipeline Debugging
When your ETL job fails or produces wrong numbers.
Template:
Help me debug a data pipeline issue.
Pipeline description:
- Source: [where data comes from]
- Destination: [where data goes]
- Technology: [Airflow, dbt, Spark, cron + scripts, etc.]
- Schedule: [how often it runs]
The problem:
- Expected behavior: [what should happen]
- Actual behavior: [what's happening instead]
- When it started: [date/time, or "always been like this"]
- Error message (if any): [paste error]
- Recent changes: [any code, schema, or infrastructure changes]
Data symptoms:
- [e.g., "row counts dropped 40% on March 1"]
- [e.g., "duplicate rows appearing in the destination table"]
- [e.g., "NULL values in a NOT NULL column"]
Think through this step by step:
1. List the most likely root causes (at least 3 hypotheses)
2. For each, explain what evidence would confirm or rule it out
3. Suggest specific diagnostic queries or checks to run
4. Once the cause is identified, suggest a fix and a preventive measure
Why it works: Pipeline debugging is hypothesis-driven. The Chain of Thought approach prevents the model from jumping to conclusions about the cause.
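Step 3's diagnostic queries often start with a row-count-by-day check around the date the symptom appeared. A minimal sketch using SQLite, with a hypothetical table and a 40% drop threshold chosen to match the example symptom above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER, loaded_at TEXT);
INSERT INTO events VALUES
  (1, '2024-02-28'), (2, '2024-02-28'), (3, '2024-02-28'),
  (4, '2024-02-29'), (5, '2024-02-29'), (6, '2024-02-29'),
  (7, '2024-03-01');  -- suspicious drop on March 1
""")

# Diagnostic: daily row counts around the date the symptom appeared.
counts = dict(conn.execute("""
    SELECT loaded_at, COUNT(*) FROM events
    GROUP BY loaded_at ORDER BY loaded_at
"""))

# Flag any day whose volume dropped more than 40% vs. the previous day.
days = sorted(counts)
drops = [d for prev, d in zip(days, days[1:])
         if counts[d] < 0.6 * counts[prev]]
print(drops)
```

A check like this confirms or rules out the "upstream stopped sending data" hypothesis before you start reading pipeline code.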
9. SQL Query Optimization
When your query runs for 20 minutes instead of 20 seconds.
Template:
Optimize this [PostgreSQL/MySQL/BigQuery] query for performance.
Current query:
```sql
[paste your slow query]
```
Context:
- Table sizes: [table_name: ~N rows, ...]
- Existing indexes: [list relevant indexes]
- Current execution time: [how long it takes]
- Target execution time: [how fast you need it]
- Database: [version, hosted/self-managed]
Analyze:
- Identify performance bottlenecks (full table scans, missing indexes, unnecessary joins, subquery issues)
- Explain why each bottleneck is slow
- Rewrite the query with optimizations
- Suggest any new indexes that would help
- Estimate the expected improvement
Show the optimized query and explain what changed and why. If EXPLAIN/EXPLAIN ANALYZE output is available, I'll provide it: [paste EXPLAIN output if you have it]
Why it works: SQL optimization requires understanding both the query and the data distribution. Providing table sizes and indexes gives the model the context it needs to suggest meaningful improvements.
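You can also verify a suggested index yourself before trusting the model's estimate. A sketch using SQLite's EXPLAIN QUERY PLAN (production databases use EXPLAIN or EXPLAIN ANALYZE instead, but the before/after comparison works the same way; the table here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")

def plan(sql):
    # Collapse the query plan rows into one string for inspection.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"

before = plan(query)  # full table scan (plan contains 'SCAN')
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # indexed lookup (plan contains 'USING INDEX')

print(before)
print(after)
```

If the plan does not change after adding the suggested index, the optimization advice was wrong for your data, and you have learned that in seconds rather than in production.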
10. Metric Definition and Documentation
The unglamorous but essential task of defining what your metrics actually mean.
Template:
Help me create a metric definition document for our analytics team.
Metric name: [e.g., "Monthly Active Users (MAU)"]
Write a complete definition that includes:
- Business definition: What this metric measures in plain English
- Technical definition: The exact calculation (formula)
- SQL implementation: A reference query that computes this metric
- Data source: Which tables/columns are used
- Filters and exclusions: What's included and what's not (e.g., "excludes internal employees", "only counts users with at least one completed action")
- Granularity: How it's segmented (daily, weekly, monthly, by region, etc.)
- Known issues: Edge cases, data quality caveats
- Related metrics: Other metrics this is often analyzed alongside
- Owner: Who to contact with questions
Database schema (relevant tables): [list tables and columns]
Business context: [how this metric is used, who looks at it, what decisions it drives]
Why it works: Metric definitions are one of those tasks everyone agrees is important but nobody wants to do. The template produces documentation that's thorough enough to prevent "wait, how do we actually calculate churn?" conversations.
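The "SQL implementation" bullet deserves a runnable reference query. For MAU it might look like this, sketched in SQLite via Python; the table name, columns, and exclusion rule are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_actions (user_id INTEGER, completed_at TEXT, is_internal INTEGER);
INSERT INTO user_actions VALUES
  (1, '2024-03-05', 0), (1, '2024-03-20', 0),  -- same user counted once
  (2, '2024-03-07', 0),
  (3, '2024-03-09', 1),                        -- internal employee: excluded
  (2, '2024-04-02', 0);
""")

# Reference MAU query: distinct external users with at least one
# completed action in each calendar month.
mau = dict(conn.execute("""
    SELECT strftime('%Y-%m', completed_at) AS month,
           COUNT(DISTINCT user_id)         AS mau
    FROM user_actions
    WHERE is_internal = 0
    GROUP BY month
    ORDER BY month
"""))
print(mau)
```

Keeping a tiny fixture like this next to the definition means anyone can check the exclusions actually behave as documented.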
Tips for Better Analyst Prompts
Always Include Your Schema
The single biggest improvement you can make to data prompts is including your actual table and column names. For more on structuring effective prompts, see our guide on [how to write better AI prompts](/blog/how-to-write-better-ai-prompts). Without schema, the model guesses, and its guesses are usually wrong.
Quick schema dump for common databases:
```sql
-- PostgreSQL
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
-- BigQuery
SELECT column_name, data_type
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'your_table';
```
Specify the SQL Dialect
PostgreSQL, MySQL, BigQuery, Snowflake, and Redshift all have syntax differences. A query that works in BigQuery won't work in PostgreSQL. Always specify which database you're using.
Paste Sample Data for Ambiguous Tasks
When the task involves data interpretation or transformation logic, paste 5-10 rows of sample data. The model handles examples better than descriptions.
Ask for Validation Steps
Add to any template:
After writing the query/script, suggest 2-3 validation checks I can run
to verify the output is correct.
This catches off-by-one errors, timezone issues, and filter problems that would otherwise slip through.
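Typical checks the model might suggest, and that you can codify as assertions, look like this. The table, values, and thresholds are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT, revenue REAL);
INSERT INTO daily_revenue VALUES
  ('2024-03-01', 1200.0), ('2024-03-02', 1350.0), ('2024-03-03', 1100.0);
""")

def scalar(sql):
    # Run a query that returns a single value.
    return conn.execute(sql).fetchone()[0]

# Check 1: row count matches the expected date range (no missing days).
assert scalar("SELECT COUNT(*) FROM daily_revenue") == 3

# Check 2: no NULLs in a column that must always be populated.
assert scalar("SELECT COUNT(*) FROM daily_revenue WHERE revenue IS NULL") == 0

# Check 3: values fall in a plausible range (catches unit/currency mistakes).
assert scalar("SELECT MIN(revenue) FROM daily_revenue") >= 0

print("all validation checks passed")
```

Three cheap assertions like these catch most of the silent failures that would otherwise reach a stakeholder's dashboard.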
How Promplify Helps Data Teams
Writing structured prompts for every SQL query or analysis request is effective but time-consuming. Promplify optimizes your analyst prompts automatically:
- Adds missing context clues that help the model produce more precise queries
- Structures requests so the model handles edge cases instead of ignoring them
- Selects the right reasoning technique — Chain of Thought for multi-step analysis, standard for simple queries
- Maintains consistent output format so your team's AI outputs are uniform — learn more about getting structured output from LLMs
Paste your rough query request into the optimizer, and get back a structured prompt that produces better SQL, cleaner analysis, and more actionable reports.
Key Takeaways
- The difference between useful and useless AI-generated analysis is the prompt structure
- Always include your schema, SQL dialect, and constraints
- Template-based prompts are reusable — build a library your team shares
- Ask for validation steps on every generated query
- Use Chain of Thought for debugging and complex analysis
- Include sample data when the task involves interpretation
Tired of fixing AI-generated SQL? Try Promplify free — paste your analysis request and get back a structured prompt that produces accurate, runnable queries on the first try.