BHI layer v1: docs, schema, Phase A ingestion stubs
156
docs/integration_plan.md
Normal file
@@ -0,0 +1,156 @@
# BHI Layer: Integration Plan

Steps to merge the BHI layer into the base Economic Brain after the base build finishes.

**Prereqs** (verify before step 1):
- Base Brain is running: `psql -d brain -c '\dt'` shows core tables including `job_runs`.
- `/home/ubuntu/economic-brain/` contains a working `jobs/` directory structure.
- The `DATABASE_URL` env var is exported and points at the `brain` Postgres.

---

## 1. Apply the BHI schema

```bash
cd /home/ubuntu/economic-brain-bhi
psql "$DATABASE_URL" -f schemas/bhi_tables.sql
psql "$DATABASE_URL" -c "\dt bhi_*"
# Expect 9 tables: bhi_facilities, bhi_facility_quality, bhi_facility_financials,
#   bhi_demand_indicators, bhi_workforce, bhi_shortages, bhi_rtf_licensing,
#   bhi_policy_events, bhi_crisis_calls
```

## 2. Copy ingestion jobs into the Brain's jobs tree

```bash
mkdir -p /home/ubuntu/economic-brain/jobs/bhi
cp /home/ubuntu/economic-brain-bhi/jobs/ingestion/*.py /home/ubuntu/economic-brain/jobs/bhi/
# _common.py is included; it reads DATABASE_URL from env already
```

Install Python deps if the base Brain doesn't already have them:

```bash
pip install requests psycopg2-binary
```

## 3. Smoke test every Phase A job (no DB writes)

```bash
cd /home/ubuntu/economic-brain/jobs/bhi
for f in cms_ipfqr.py cms_hospital_compare.py cms_nursing_home.py \
         samhsa_locator.py hrsa_hpsa.py nppes.py cdc_brfss.py \
         cdc_yrbss.py cdc_wonder_mortality.py bls_oes.py cms_pos.py \
         samhsa_nssats_nmhss.py idea_part_b.py nsch.py; do
  echo "=== $f ==="
  python3 "$f" test || echo "FAIL: $f"
done
```

Every job should print `OK:` and exit 0. If any fail, fix the endpoint/URL in the job file before proceeding.

## 4. Run jobs in dependency order

```bash
cd /home/ubuntu/economic-brain/jobs/bhi

# Facilities first (they populate bhi_facilities.facility_id, the FK target for quality/financials)
python3 cms_ipfqr.py
python3 cms_hospital_compare.py
python3 cms_nursing_home.py
python3 samhsa_locator.py
python3 cms_pos.py
python3 samhsa_nssats_nmhss.py
python3 nppes.py

# Shortages + demand (independent)
python3 hrsa_hpsa.py
python3 cdc_wonder_mortality.py
python3 cdc_brfss.py
python3 cdc_yrbss.py
python3 idea_part_b.py
python3 nsch.py

# Workforce
python3 bls_oes.py
```

Monitor `job_runs`:
```sql
SELECT job_name, status, started_at, finished_at, error
FROM job_runs WHERE job_name LIKE 'bhi_%' ORDER BY started_at DESC;
```

## 5. Import n8n workflows (scheduled refresh)

Create workflows in n8n (or add to existing scheduler):

| Workflow | Cron | Script |
|---|---|---|
| BHI: CMS facilities refresh | `0 3 * * 1` (weekly, Mon 3am) | `cms_ipfqr.py`, `cms_hospital_compare.py`, `cms_nursing_home.py` |
| BHI: SAMHSA locator refresh | `0 4 1 * *` (monthly, 1st at 4am) | `samhsa_locator.py` |
| BHI: HRSA HPSA refresh | `0 5 * * 2` (weekly, Tue 5am) | `hrsa_hpsa.py` |
| BHI: CDC demand refresh | `0 6 1 * *` (monthly, 1st at 6am) | `cdc_brfss.py`, `cdc_yrbss.py`, `cdc_wonder_mortality.py` |
| BHI: Workforce refresh | `0 7 1 */3 *` (quarterly, 1st at 7am) | `bls_oes.py` |
| BHI: CMS POS refresh | `0 8 1 */3 *` (quarterly, 1st at 8am) | `cms_pos.py` |

Workflow template: Cron node -> Execute Command (`python3 /home/ubuntu/economic-brain/jobs/bhi/<script>.py`) -> if the exit code is non-zero, send an alert to Slack / email.
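
The "non-zero exit means alert" branch of the workflow template can be sketched outside n8n as a small wrapper. This is an illustration only: `run_job` and its message format are hypothetical, and the actual Slack / email delivery is left to whatever the scheduler provides.

```python
import subprocess
import sys


def run_job(cmd, timeout_s=3600):
    """Run one ingestion command; return (ok, message) for the alert step.

    Hypothetical helper: the name and message format are illustrative.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"TIMEOUT after {timeout_s}s: {' '.join(cmd)}"
    if proc.returncode != 0:
        # Keep only the tail of stderr so the alert stays readable.
        tail = proc.stderr.strip()[-500:]
        return False, f"FAIL (exit {proc.returncode}): {' '.join(cmd)}\n{tail}"
    return True, f"OK: {' '.join(cmd)}"


# Demo with a stand-in command that exits 3 instead of a real job script.
ok, message = run_job([sys.executable, "-c", "raise SystemExit(3)"])
print(ok, message.splitlines()[0])
```

In a real workflow the command would be `["python3", "/home/ubuntu/economic-brain/jobs/bhi/<script>.py"]`, and the message would be forwarded only when `ok` is false.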

## 6. Add command center page

Create `/home/ubuntu/command-center/pages/brain/behavioral-health.html` (or equivalent in the Brain's command-center framework) with sections:

1. **Facility map**: Leaflet map of `bhi_facilities` colored by `facility_type`, filterable by `adolescent_unit` / `young_adult_unit`.
2. **HPSA heatmap**: county-level choropleth of `bhi_shortages.score`.
3. **Demand indicators panel**: small multiples of suicide rate, overdose rate, BRFSS depression by state, split by age bracket.
4. **Composite ranking table**: top 50 opportunities by `composite_score` (see `scoring.md`).
5. **Recent policy events feed**: last 20 rows from `bhi_policy_events` ordered by `effective_date DESC`.
6. **Job status widget**: last run of each `bhi_*` job from `job_runs`.

Route: `/brain/behavioral-health`.

## 7. Test queries (acceptance smoke tests)

```sql
-- Facility count by type
SELECT facility_type, count(*) FROM bhi_facilities GROUP BY 1 ORDER BY 2 DESC;

-- Top 20 worst MH HPSAs
SELECT state, county_fips, score, population_served
FROM bhi_shortages WHERE withdrawn_date IS NULL
ORDER BY score DESC LIMIT 20;

-- Adolescent suicide rates, top 20 geographies
SELECT geo_code, value FROM bhi_demand_indicators
WHERE measure='suicide_rate' AND age_bracket='13-17'
ORDER BY value DESC LIMIT 20;

-- IPFs per state, fewest adolescent units first (cross-check)
SELECT state, count(*) FILTER (WHERE adolescent_unit) AS adolescent_units,
       count(*) AS total
FROM bhi_facilities WHERE facility_type='IPF' GROUP BY state ORDER BY 2 ASC;

-- Workforce: psychiatrists, top 20 MSAs by median annual wage
SELECT msa_name, annual_wage_median
FROM bhi_workforce WHERE occupation_code='29-1223'
ORDER BY annual_wage_median DESC LIMIT 20;

-- Job run health
SELECT job_name, status, count(*)
FROM job_runs WHERE job_name LIKE 'bhi_%'
GROUP BY 1, 2;
```

If every query returns rows and no `job_runs` row shows `status='error'`, the BHI layer is live.

## 8. Git merge to main Brain repo

```bash
cd /home/ubuntu/economic-brain
git checkout -b bhi-layer-merge
mkdir -p schemas jobs/bhi docs/bhi   # cp fails if a target directory is missing
cp /home/ubuntu/economic-brain-bhi/schemas/bhi_tables.sql schemas/
cp -r /home/ubuntu/economic-brain-bhi/jobs/ingestion/* jobs/bhi/
cp -r /home/ubuntu/economic-brain-bhi/docs/* docs/bhi/
git add schemas/bhi_tables.sql jobs/bhi docs/bhi
git commit -m "Integrate BHI layer"
git push origin bhi-layer-merge
# Open PR for review on Gitea
```
125
docs/scoring.md
Normal file
@@ -0,0 +1,125 @@
# BHI Composite Scoring Function

## Formula

```
composite_score =
    (demand_severity     * 0.25) +
    (supply_shortage     * 0.25) +
    (pain_signal_volume  * 0.20) +
    (capacity_trend      * 0.10) +
    (workforce_shortage  * 0.10) +
    (regulatory_tailwind * 0.05) +
    (govt_demand         * 0.05)
```

All components are normalized to 0-100 before weighting, and the weights sum to 1.00, so the final `composite_score` is also 0-100.
Each component is computed at the **geo x niche x age-bracket** level (state, county, or MSA depending on data).

Thesis this reflects (all-of-the-above): demand is outpacing supply, the delivery model is shifting, and regulation is restructuring the market. We weight demand + supply heaviest (50% combined), then real-time pain signals (20%), then the four remaining tailwind components (30% combined).
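
As a minimal sketch of the formula, the weighted sum can be written as below. The `WEIGHTS` dictionary mirrors the formula; the function name, the input validation, and the example values are illustrative assumptions, not part of the BHI code.

```python
# Weights copied from the formula; everything else here is illustrative.
WEIGHTS = {
    "demand_severity": 0.25,
    "supply_shortage": 0.25,
    "pain_signal_volume": 0.20,
    "capacity_trend": 0.10,
    "workforce_shortage": 0.10,
    "regulatory_tailwind": 0.05,
    "govt_demand": 0.05,
}


def composite_score(components):
    """Weighted sum of the seven components, each expected on a 0-100 scale."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= components[name] <= 100:
            raise ValueError(f"{name} must be in 0-100, got {components[name]}")
    return sum(components[name] * weight for name, weight in WEIGHTS.items())


# Made-up component values for one geo x niche x age-bracket cell.
example = {
    "demand_severity": 80, "supply_shortage": 70, "pain_signal_volume": 50,
    "capacity_trend": 60, "workforce_shortage": 40,
    "regulatory_tailwind": 20, "govt_demand": 30,
}
print(round(composite_score(example), 2))  # 20 + 17.5 + 10 + 6 + 4 + 1 + 1.5 = 60.0
```

Rejecting out-of-range inputs, rather than clamping them silently, is a design choice made here so that a normalization bug upstream surfaces early.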

---

## Component definitions

### 1. demand_severity (25%)
Feeder: `bhi_demand_indicators` (CDC WONDER, BRFSS, YRBSS, NSCH).

For a given geo + age bracket, combine:
- Suicide rate per 100k (CDC WONDER, ICD-10 X60-X84)
- Drug overdose death rate per 100k (CDC WONDER, X40-X44 + Y10-Y14)
- YRBSS "seriously considered suicide" % (adolescent)
- BRFSS "mental health not good 14+ days" % (young adult via 18-24 bracket)
- NSCH unmet mental health treatment need %

Normalize each to 0-100 against the national distribution (percentile rank), then average.
Trend bonus: add 10 points if the 5-yr CAGR exceeds 5%.
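
A minimal sketch of the normalize-then-average step. Two details are assumptions, since the definition does not fix them: percentile rank is taken as the share of national observations at or below the value, and the result is capped at 100 after the +10 trend adjustment. Function names are illustrative.

```python
from bisect import bisect_right


def percentile_rank(value, national_values):
    """Percent of national observations at or below `value` (0-100).

    Assumption: "at or below" is one of several common percentile conventions.
    """
    ordered = sorted(national_values)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def demand_severity(indicator_ranks, cagr_5yr):
    """Average the available percentile ranks, then add the trend adjustment."""
    base = sum(indicator_ranks) / len(indicator_ranks)
    bonus = 10.0 if cagr_5yr > 0.05 else 0.0
    # Assumption: cap at 100 so the component stays on the 0-100 scale.
    return min(100.0, base + bonus)


# Made-up national distribution of one indicator (rates per 100k).
national = [8.0, 10.0, 12.0, 14.0, 20.0]
rank = percentile_rank(14.0, national)     # 4 of 5 values are <= 14.0
print(rank, demand_severity([rank, 60.0], cagr_5yr=0.07))
```

Averaging only the indicators that exist for a geo (rather than treating missing ones as zero) keeps sparsely covered counties from being scored artificially low.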

### 2. supply_shortage (25%)
Feeders: `bhi_shortages` (HRSA HPSA) + `bhi_facilities` (SAMHSA + CMS).

For a geo:
- HPSA mental health score (0-25, already normalized; rescale x4 -> 0-100)
- Inverse of facility density: beds per 100k population (percentile-invert)
- Inverse of adolescent/young-adult-specific bed density (if scoring those brackets)

Weighted average: 50% HPSA score, 30% total bed density, 20% age-targeted bed density.
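
The weighted average above can be sketched as follows. The inputs are assumed to be an HPSA score on its native 0-25 scale and bed densities already converted to percentile ranks. How to handle the case where no age bracket is being scored is not specified, so renormalizing the remaining 50/30 weights is an assumption of this sketch.

```python
def supply_shortage(hpsa_score, bed_density_pct, age_bed_density_pct=None):
    """Combine HPSA score (0-25) with inverted bed-density percentiles (0-100)."""
    hpsa = hpsa_score * 4.0            # rescale 0-25 -> 0-100
    total = 100.0 - bed_density_pct    # percentile-invert: fewer beds -> higher score
    if age_bed_density_pct is None:
        # Assumption: with no age bracket, renormalize the 50% / 30% weights.
        return (0.5 * hpsa + 0.3 * total) / 0.8
    age = 100.0 - age_bed_density_pct
    return 0.5 * hpsa + 0.3 * total + 0.2 * age


# HPSA 20/25, 25th percentile for total beds, 10th percentile for adolescent beds.
print(supply_shortage(20, 25.0, 10.0))   # 0.5*80 + 0.3*75 + 0.2*90
```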

### 3. pain_signal_volume (20%)
Feeders: base Brain's `reddit_posts`, `app_reviews`, and `risk_factors` tables (already being built).

For a niche (e.g., "adolescent inpatient"):
- Count of posts/reviews/risk-factor hits matching niche keywords in last 90 days
- Z-score against the full base Brain niche distribution
- Clamp to 0-100

Depends on the base Brain being live; until then, this component defaults to 50 (neutral).

### 4. capacity_trend (10%)
Feeder: `bhi_facilities` (opened_date, closed_date) + CMS POS termination records.

For the geo x niche:
- Facilities opened in last 24 months minus closed in last 24 months, normalized by baseline facility count
- Negative net = high score (more opportunity), positive net = low score (saturated)
- Formula: `100 * (1 - (net_change + baseline) / (2 * baseline))` clamped 0-100
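
A short worked version of the capacity formula, to make the endpoints concrete: no net change gives 50, losing the whole baseline gives 100, and doubling it gives 0. The function name is illustrative, and raising on a zero baseline is an assumption, since the formula is undefined there.

```python
def capacity_trend(opened, closed, baseline):
    """100 * (1 - (net_change + baseline) / (2 * baseline)), clamped to 0-100."""
    if baseline <= 0:
        # Assumption: the formula divides by baseline, so treat 0 as an error.
        raise ValueError("baseline facility count must be positive")
    net_change = opened - closed
    raw = 100.0 * (1 - (net_change + baseline) / (2 * baseline))
    return max(0.0, min(100.0, raw))


# 2 openings and 6 closures against 20 baseline facilities: net change of -4.
print(capacity_trend(opened=2, closed=6, baseline=20))
```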

### 5. workforce_shortage (10%)
Feeder: `bhi_workforce` (BLS OES).

For the MSA:
- Wage growth YoY for SOC codes 29-1223, 21-1014, 21-1018, 19-3033 (percentile rank)
- Employment per 100k (inverse percentile)
- Average them

High wage growth + low employment density = high shortage score = high opportunity for new supply.

### 6. regulatory_tailwind (5%)
Feeder: `bhi_policy_events`.

Count of favorable policy events in the last 18 months for the geo:
- Medicaid rate increases for BH services
- New state mandates for adolescent crisis services
- Expanded provider types (peer support, mobile crisis)
- Federal rules (e.g., Mental Health Parity enforcement)

`count * 20`, clamped to 0-100.

### 7. govt_demand (5%)
Feeder: base Brain's `sam_gov_opportunities` table (if present) + `bhi_policy_events`.

Active + awarded SAM.gov opportunities in NAICS 621112 (Physician offices - mental), 621420 (Outpatient mental health/SUD), 623220 (Residential mental health), 623210 (Residential intellectual/developmental), 624190 (Other individual/family services). Dollar-value-weighted and geo-filtered.

Log-scale: `min(100, 10 * log10(total_dollar_value + 1))`.
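
The two smallest components reduce to one-line formulas. The sketch below writes out `count * 20` (clamped) for regulatory_tailwind and the log scale for govt_demand; function names are illustrative, and rejecting negative inputs is an assumption.

```python
import math


def regulatory_tailwind(favorable_event_count):
    """count * 20, clamped to 0-100 (five or more events saturate the score)."""
    if favorable_event_count < 0:
        raise ValueError("event count cannot be negative")
    return float(min(100, favorable_event_count * 20))


def govt_demand(total_dollar_value):
    """min(100, 10 * log10(total_dollar_value + 1))."""
    if total_dollar_value < 0:
        raise ValueError("dollar value cannot be negative")
    return min(100.0, 10.0 * math.log10(total_dollar_value + 1))


print(regulatory_tailwind(3))        # three favorable events
print(govt_demand(999_999))          # just under $1M of opportunities
```

On this log scale each tenfold increase in dollar value adds 10 points, so the score only saturates at roughly $10 billion; that makes the component hard to max out for a single geo.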

---

## Age bracket handling

Every row in `bhi_demand_indicators` carries an `age_bracket`. When scoring a niche tagged for adolescents (13-17), the demand_severity and pain_signal components filter to that bracket. Young-adult scores pull 18-25. "All" niches average both brackets 50/50.

Young-adult gap note: for young-adult scoring, supply_shortage should apply an extra +15 penalty on facility density, since very few IPFs have dedicated young-adult units. This is captured via the `young_adult_unit` boolean in `bhi_facilities`.

---

## Output table (to be added)

Scores write to `bhi_scores` (created at runtime; not in `bhi_tables.sql` v1, to be added once inputs are flowing):

```sql
CREATE TABLE bhi_scores (
    id              SERIAL PRIMARY KEY,
    niche           TEXT,
    geo_type        TEXT,
    geo_code        TEXT,
    age_bracket     TEXT,
    composite       NUMERIC,
    demand_severity NUMERIC,
    supply_shortage NUMERIC,
    pain_signal     NUMERIC,
    capacity_trend  NUMERIC,
    workforce_short NUMERIC,
    reg_tailwind    NUMERIC,
    govt_demand     NUMERIC,
    computed_at     TIMESTAMPTZ DEFAULT NOW()
);
```
273
docs/sources.md
Normal file
@@ -0,0 +1,273 @@
# BHI Data Sources

All endpoints tested 2026-04-04 unless noted. "Tested: OK" means a live curl returned valid data.

Scope: behavioral health facilities, demand indicators, workforce, shortages, and policy for all 50 US states, tagged by age bracket (adolescent 13-17, young adult 18-25).

---

## PHASE A: Free, autonomous, ready to ingest

### 1. CMS IPFQR (Inpatient Psychiatric Facility Quality Reporting)
- **Endpoint:** `https://data.cms.gov/provider-data/api/1/datastore/query/{dataset_id}/0`
- **Dataset IDs:**
  - `q9vs-r7wp`: IPFQR by Facility
  - `dc76-gh7x`: IPFQR by State
  - `s5xg-sys6`: IPFQR National
- **Auth:** None
- **Rate limit:** None documented; be polite (<= 5 req/sec)
- **Update frequency:** Quarterly
- **Record count:** ~1,600 IPFs (facility file); dozens of measures each
- **Key fields:** `facility_id`, `facility_name`, `address`, `state`, `zip`, `countyparish`, HBIPS-2/3 restraint+seclusion, SMD, SUB-2/3, TOB-3, transition record, 30-day readmission
- **Test curl (OK):**
  ```
  curl -s "https://data.cms.gov/provider-data/api/1/datastore/query/q9vs-r7wp/0?limit=2"
  ```
- **Python snippet:**
  ```python
  import requests
  r = requests.get("https://data.cms.gov/provider-data/api/1/datastore/query/q9vs-r7wp/0",
                   params={"limit": 500, "offset": 0}, timeout=30)
  r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
  rows = r.json()["results"]
  ```

### 2. CMS Hospital Compare / Care Compare (general hospital info)
- **Endpoint:** `https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0`
- **Auth:** None | **Rate limit:** None documented | **Update:** Monthly
- **Records:** ~5,300 hospitals
- **Key fields:** `facility_id` (CCN), `facility_name`, `hospital_type`, `hospital_ownership`, `hospital_overall_rating`, mortality/safety/readmission group flags
- **Test (OK):**
  ```
  curl -s "https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0?limit=2"
  ```
- Use to classify which acute hospitals have behavioral health units (join to IPFQR on CCN).

### 3. CMS Provider of Services (POS) file
- **Bulk page:** `https://data.cms.gov/provider-characteristics/hospitals-and-other-facilities/provider-of-services-file-quality-improvement-and-evaluation-system`
- **JSON catalog:** `https://data.cms.gov/data.json` (search `dataset[].title` = "Provider of Services File")
- **Auth:** None | **Update:** Quarterly | **Format:** CSV bulk
- **Records:** ~80,000 Medicare-certified facilities (includes PSY, PRTF, hospitals)
- **Key fields:** CCN, provider category, bed count, certification date, termination date, ownership
- **Test (OK):** `curl -s "https://data.cms.gov/data.json"` returns the dataset list
- Required for bed counts and termination (closure) tracking.

### 4. CMS Nursing Home Compare (Provider Information)
- **Endpoint:** `https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0`
- **Auth:** None | **Update:** Monthly
- **Records:** ~15,000 nursing homes
- **Key fields:** CCN, provider_name, ownership, number_of_certified_beds, overall rating, chain info
- **Test (OK):** `curl -s "https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0?limit=2"`
- Used to capture residential behavioral health (SNFs frequently host psych/BH residents).

### 5. SAMHSA Treatment Locator (findtreatment.gov)
- **Endpoint:** `https://findtreatment.gov/locator/exportsAsJson/v2?sType=BH&sAddr={zip}`
- **Auth:** None (browser UA helps but not required for JSON export)
- **Rate limit:** None documented; HEAD returns 403 but GET returns 200, so use GET only
- **Update:** Continuous (SAMHSA-maintained)
- **Records:** ~96,000 BH treatment facilities (all service types)
- **Key fields:** name1/name2, street, city, state, zip, phone, intake, hotline, website, lat, lon, services, typeFacility
- **Test (OK):**
  ```
  curl -s "https://findtreatment.gov/locator/exportsAsJson/v2?sType=BH&sAddr=10001"
  ```
  Response: `{"page":1,"totalPages":3201,"recordCount":96009,"rows":[...]}`
- **Python snippet:**
  ```python
  import requests, time

  def fetch_all(zip_seed="10001"):
      """Yield every locator row, following totalPages from each response."""
      base = "https://findtreatment.gov/locator/exportsAsJson/v2"
      page = 1
      while True:
          r = requests.get(base, params={"sType": "BH", "sAddr": zip_seed,
                                         "pageSize": 30, "page": page}, timeout=30)
          r.raise_for_status()
          d = r.json()
          yield from d["rows"]
          if page >= d["totalPages"]:
              break
          page += 1
          time.sleep(0.3)  # stay polite; no rate limit is documented
  ```

### 6. SAMHSA N-SSATS + N-MHSS
- **Bulk:** `https://www.samhsa.gov/data/data-we-collect/n-ssats/datafiles` and `/n-mhss/datafiles`
- **Auth:** None | **Update:** Annual | **Format:** SAS / SPSS / CSV
- **Records:** N-SSATS ~16,000 SUD facilities/year; N-MHSS ~12,000 MH facilities/year
- **Key fields:** facility id, services, payment accepted, populations served (including adolescent/young adult flags), bed counts, ownership
- **Note:** Bulk ZIPs; no live API. Staged as manual-download job.

### 7. CDC WONDER (mortality: suicide, overdose, by county, age)
- **Endpoint:** `https://wonder.cdc.gov/controller/datarequest/D76` (Underlying Cause of Death), POST XML
- **Auth:** None for non-restricted datasets; county-level counts are suppressed for <10 deaths
- **Update:** Annual
- **Records:** All US mortality; we pull ICD-10 X60-X84 (suicide) + X40-X44/Y10-Y14 (overdose) by county, 13-17 and 18-25
- **Test (OK):** landing page returns 200; POST XML required for data. See job stub `cdc_wonder_mortality.py` for the working XML template.

### 8. CDC BRFSS
- **Endpoint (Socrata):** `https://data.cdc.gov/resource/dttw-5yxu.json`
- **Auth:** None (Socrata app token optional for higher limits) | **Update:** Annual
- **Records:** ~100k rows/year (state x question x breakout)
- **Test (OK):**
  ```
  curl -s 'https://data.cdc.gov/resource/dttw-5yxu.json?$limit=2'
  ```
  Returns depression prevalence, mental health days, etc. by state+demographic. Single quotes keep the shell from expanding `$limit`.

### 9. CDC YRBSS (Youth Risk Behavior Survey)
- **Endpoints (Socrata, verified present via catalog):**
  - High school: `https://data.cdc.gov/resource/3qty-g4aq.json`
  - Middle school: `https://data.cdc.gov/resource/uqmk-4y2w.json`
- **Auth:** None | **Update:** Biennial
- **Records:** State + large urban district level; ~50k rows
- **Key fields:** suicidal ideation, attempt, persistent sadness, substance use. This is the adolescent demand signal we need.

### 10. IDEA Part B data (Emotional Disturbance by district)
- **Landing:** `https://www2.ed.gov/programs/osepidea/618-data/static-tables/index.html`
- **Auth:** None | **Format:** CSV static tables | **Update:** Annual
- **Records:** ~14,000 school districts + state rollups
- **Key fields:** Child count under ED classification, ages 6-21, by state and LEA
- **Note:** Static CSVs; no API. Download script documents exact file URLs.

### 11. NSCH (National Survey of Children's Health) via HRSA
- **Landing:** `https://www.childhealthdata.org/browse/survey` and `https://mchb.hrsa.gov/data-research/national-survey-childrens-health`
- **Bulk (HRSA):** `https://mchb.hrsa.gov/sites/default/files/nsch/datafiles/` (year-specific)
- **Auth:** None | **Update:** Annual | **Format:** SAS / Stata / CSV
- **Records:** ~50k surveyed children, weighted to state-level estimates
- **Key fields:** anxiety, depression, behavioral problems, received treatment, unmet need, by state x age.

### 12. BLS OES (behavioral health workforce by MSA)
- **API:** `https://api.bls.gov/publicAPI/v2/timeseries/data/` (POST JSON)
- **Auth:** A free registration key (`https://data.bls.gov/registrationEngine/`) is needed for more than 25 series/day. Without a key the limits are lower: 25 series/query and 10 years/query.
- **Update:** Annual (May reference period)
- **Series ID pattern:** `OEUM{area}{industry}{occupation}{datatype}`
- **Relevant SOC codes:**
  - 29-1223 Psychiatrists
  - 29-1229 Other Physicians (incl. addiction medicine)
  - 21-1014 Mental Health Counselors
  - 21-1015 Rehabilitation Counselors
  - 21-1018 Substance Abuse/Behavioral Disorder Counselors
  - 21-1022 Mental Health and SUD Social Workers
  - 19-3033 Clinical & Counseling Psychologists
- **Test (OK):** BLS API responds (test hit confirmed structure; real series IDs required)
- **Bulk alternative:** `https://www.bls.gov/oes/special-requests/oesm{YY}ma.zip` (annual bulk by MSA); no auth, ~50MB zip.
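
Since the test hit still needs real series IDs, a small builder for the `OEUM{area}{industry}{occupation}{datatype}` pattern may help. This is a sketch under assumptions that should be checked against the BLS series-ID documentation before use: field widths of 7 / 6 / 6 / 2 digits, `000000` as the cross-industry code, and the example area and datatype codes. The function names are hypothetical. The payload keys (`seriesid`, `startyear`, `endyear`, `registrationkey`) are the ones the v2 API accepts.

```python
def oes_series_id(area_code, occupation_soc, datatype, industry_code="000000"):
    """Assemble an OEUM... series ID from its four fields.

    Assumed widths: area 7, industry 6, occupation 6, datatype 2 digits.
    """
    occupation = occupation_soc.replace("-", "")   # "29-1223" -> "291223"
    fields = {
        "area": (area_code, 7),
        "industry": (industry_code, 6),
        "occupation": (occupation, 6),
        "datatype": (datatype, 2),
    }
    for name, (value, width) in fields.items():
        if len(value) != width or not value.isdigit():
            raise ValueError(f"{name} must be {width} digits, got {value!r}")
    return f"OEUM{area_code}{industry_code}{occupation}{datatype}"


def bls_payload(series_ids, start_year, end_year, registration_key=None):
    """JSON body for a POST to the v2 timeseries endpoint."""
    payload = {
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    if registration_key:
        payload["registrationkey"] = registration_key
    return payload


# Example codes are assumptions: "0035620" as an MSA area code, "04" as a datatype.
sid = oes_series_id("0035620", "29-1223", "04")
print(sid, len(sid))
```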

### 13. HRSA Mental Health HPSAs
- **Bulk CSV (verified):** `https://data.hrsa.gov/DataDownload/DD_Files/BCD_HPSA_FCT_DET_MH.csv`
- **Size:** ~23 MB
- **Auth:** None | **Update:** Continuous (weekly snapshots)
- **Records:** ~6,500 active MH HPSAs + historical
- **Key fields:** HPSA ID, designation type, discipline (MH), score (0-25), state, county FIPS via HPSA Geography ID, population, designation date, withdrawn date, lat/lon
- **Test (OK):** HTTP 200, 23 MB CSV returned.

### 14. CMS NPPES (National Plan & Provider Enumeration System)
- **API:** `https://npiregistry.cms.hhs.gov/api/?version=2.1`
- **Auth:** None | **Rate limit:** ~200 req/sec soft; 200 results max per query, so paginate with `skip`
- **Update:** Daily
- **Records:** ~8 million NPIs; filter by taxonomy for behavioral health (~500k)
- **Relevant taxonomy codes:**
  - 2084P0800X Psychiatry & Neurology - Psychiatry
  - 2084P0802X Addiction Psychiatry
  - 2084P0804X Child & Adolescent Psychiatry
  - 103T00000X Psychologist
  - 101YM0800X Mental Health Counselor
  - 103TC2200X Clinical Child & Adolescent Psychologist
  - 1041C0700X Clinical Social Worker
  - 324500000X Substance Abuse Rehabilitation Facility
  - 283Q00000X Psychiatric Hospital
  - 323P00000X Psychiatric Residential Treatment Facility
- **Test (OK):**
  ```
  curl -s "https://npiregistry.cms.hhs.gov/api/?version=2.1&taxonomy_description=psychiatric&state=NY&limit=2"
  ```

---

## PHASE B: Requires application or registration

### 15. HCUP (AHRQ)
- **Landing:** `https://hcup-us.ahrq.gov/tech_assist/centdist.jsp`
- **Auth:** Data Use Agreement (DUA) required; free for research but application-based (~2-4 weeks)
- **Records:** State inpatient/ED/ASC databases, ~40M discharges/yr nationally
- **Action required:** Submit DUA + Data Use Training certificate. **BLOCKED until user applies.**

### 16. CMS Medicare Cost Reports (MCR)
- **Bulk:** `https://www.cms.gov/data-research/statistics-trends-and-reports/cost-reports` (HOSPITAL2010 format)
- **Auth:** None; just large downloads (~1-3 GB per year)
- **Update:** Quarterly rolling
- **Records:** ~6,000 hospital cost reports/year (CCN-level)
- Staged as a fetch-and-parse job (uses `ccn` to join with `bhi_facilities`).

### 17. NEMSIS state crisis transport data
- **Landing:** `https://nemsis.org/using-ems-data/request-research-data/`
- **Auth:** Research Data Request (application), typically 4-8 weeks
- **BLOCKED until user applies.**

### 18. California HCAI (patient discharge data)
- **Endpoint:** `https://hcai.ca.gov/data-and-reports/cost-transparency/` and `https://data.chhs.ca.gov/dataset?q=pdd`
- **Auth:** Free (some files direct download; Limited Data Set requires DUA)
- **Update:** Annual
- **Records:** ~3.5M CA discharges/yr; psych DRGs extractable

### 19. NY SPARCS
- **Landing:** `https://www.health.ny.gov/statistics/sparcs/`
- **Auth:** Application for identified data; deidentified file free via `health.data.ny.gov`
- **Deidentified endpoint:** `https://health.data.ny.gov/resource/u4ud-w55t.json` (Hospital Inpatient Discharges)
- **Records:** ~2.5M NY discharges/yr

### 20. TX DSHS discharge data
- **Landing:** `https://www.dshs.texas.gov/texas-health-care-information-collection/health-data-researcher-information/texas-inpatient-public-use`
- **Auth:** Free (Public Use File is a direct download after click-through)
- **Records:** ~3M TX discharges/yr

### 21. FL AHCA discharge data
- **Landing:** `https://ahca.myflorida.com/health-care-policy-and-oversight/bureau-of-central-services/florida-center-for-health-information-and-transparency/data-analytics/order-data`
- **Auth:** Application form + fee for identified; aggregate free
- **BLOCKED until user applies for identified.**

---

## PHASE C: State RTF licensing databases

### 22. State-by-state RTF licensing scrapers
Scope: residential treatment facilities serving adolescents. One scraper per state.

Verified public-search portals (no auth, scrape-friendly HTML/JSON):
- **UT**: `https://hslic.utah.gov/` (Human Services License Information Lookup)
- **CA**: `https://www.ccld.dss.ca.gov/transparencyapi/api/facilities` (Community Care Licensing API)
- **TX**: `https://www.hhs.texas.gov/providers/long-term-care-providers/childrens-residential-facility-reimbursement-methodology` + search portal
- **FL**: `https://apps.myflfamilies.com/provider/` (DCF provider search)
- **NY**: `https://omh.ny.gov/omhweb/resources/providers/` (OMH provider directory)
- **MT**: `https://dphhs.mt.gov/qad/licensure/licensedfacilitieslist` (static list)
- **AZ**: `https://azcarecheck.azdhs.gov/` (public search)
- **CO**: `https://apps.colorado.gov/apps/oapa/licensee.aspx` (Office of Early Childhood)
- **OR**: `https://ccld.oregon.gov/ccld/search/` (Care Provider Directory)
- **WA**: `https://fortress.wa.gov/dshs/adsaapps/lookup/` (LTC lookup)
- **IL**: `https://www2.illinois.gov/dcfs/brighterfutures/Pages/default.aspx`
- **MA**: `https://www.mass.gov/lists/licensed-residential-treatment-programs`
- **PA**: `https://www.dhs.pa.gov/Services/Assistance/Pages/Child-Residential-Facility.aspx`

States requiring FOIA / no public portal (documented as BLOCKED for Phase C v1):
- AL, AK, AR, DE, GA, HI, ID, IN, IA, KS, KY, LA, ME, MD, MI, MN, MS, MO, NE, NV, NH, NJ, NM, NC, ND, OH, OK, RI, SC, SD, TN, VT, VA, WV, WI, WY

The scraper job stub lists URL patterns for the 13 verified states and marks the rest "FOIA required."

---

## Test results summary (Phase A)

| # | Source | Status | Notes |
|---|--------|--------|-------|
| 1 | CMS IPFQR | OK | q9vs-r7wp returned facility rows |
| 2 | CMS Hospital Compare | OK | xubh-q36u returned |
| 3 | CMS POS | OK | catalog reachable, bulk CSV |
| 4 | CMS Nursing Home | OK | 4pq5-n9py returned |
| 5 | SAMHSA Locator | OK | 96,009 records confirmed |
| 6 | SAMHSA N-SSATS/N-MHSS | OK (bulk) | ZIP download, no API |
| 7 | CDC WONDER | OK | POST XML required, landing 200 |
| 8 | CDC BRFSS | OK | Socrata JSON returned |
| 9 | CDC YRBSS | OK | 3qty-g4aq + uqmk-4y2w |
| 10 | IDEA Part B | OK (static) | Static CSV; no API |
| 11 | NSCH | OK (bulk) | HRSA year files |
| 12 | BLS OES | OK | API responds; needs real series IDs |
| 13 | HRSA HPSA MH | OK | 23 MB CSV download confirmed |
| 14 | NPPES | OK | 2 results returned for NY psych |

Blocked until auth/application:
- HCUP (DUA), NEMSIS (application), FL AHCA identified, NY SPARCS identified.
64
docs/target_questions.md
Normal file
@@ -0,0 +1,64 @@
# BHI Layer — Target Opportunity Questions

These are the questions the BHI layer must answer. They double as acceptance criteria: the layer ships when every question can be answered with a SQL query or a short Python notebook against `brain` with the BHI tables populated.

Scope assumptions: all 50 states, facility-level data where available, with records tagged by age bracket: adolescent (13-17) and young adult (18-25).

## 1. Supply / capacity

1. Which US counties have the highest HPSA mental health scores AND the lowest bed density (top 50)?
2. Which counties have ZERO licensed adolescent inpatient psychiatric beds within 60 miles?
3. Which counties have ZERO licensed young-adult residential treatment beds within 60 miles?
4. How many IPFs have closed vs opened in the last 24 months, by state?
5. Which IPFs have the worst HBIPS restraint+seclusion rates and are therefore candidates for competitive entry or acquisition?
6. Which nursing homes are disproportionately housing under-65 residents with SMI (SNF-IMD dynamic) and are candidates for conversion/specialty buildout?
7. Where are the biggest drops in psych bed count over the last 5 years (via POS termination data)?
8. Which states have the lowest ratio of PRTF beds per 10k adolescents?
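
Questions 2 and 3 depend on a "within 60 miles" test. A minimal sketch of that test follows, assuming county centroids and facility coordinates are available as latitude/longitude pairs. The function names and the centroid input are hypothetical; the layer's real implementation may use PostGIS or another method instead.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def counties_without_beds(county_centroids, facilities, radius_miles=60.0):
    """Return county codes that have no qualifying facility within radius_miles.

    county_centroids: {county_fips: (lat, lon)}  (hypothetical input)
    facilities: iterable of (lat, lon) for facilities with the bed type of interest
    """
    facilities = list(facilities)
    return sorted(
        fips
        for fips, (clat, clon) in county_centroids.items()
        if not any(
            haversine_miles(clat, clon, flat, flon) <= radius_miles
            for flat, flon in facilities
        )
    )
```

Centroid distance is a simplification: a large county can have residents well outside 60 miles even when its centroid is inside the radius.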

## 2. Demand

9. Which counties have the highest 13-17 suicide rate and fastest-growing trend (CDC WONDER)?
10. Which counties have the highest 18-25 overdose death rate trend?
11. Which states have the highest YRBSS "considered suicide" % and highest unmet-treatment need on NSCH, simultaneously?
12. How does adolescent ED visit rate for self-harm compare across states (cross-joining HCUP when available)?
13. Which school districts have the highest IDEA Part B Emotional Disturbance child count per 1,000 students?
14. Which states are seeing the largest YoY increase in 988 + crisis line volume per capita?

## 3. Workforce

15. Which MSAs have the highest YoY wage growth for psychiatrists (SOC 29-1223), a possible indicator of shortage?
16. Which MSAs have psychiatrist employment per 100k in the bottom quartile AND mental health HPSA coverage in the worst quartile?
17. Where are LCSW/LMHC wages spiking (SOC 21-1023 for social workers; 21-1014 and 21-1018 for counselors) while employment is flat?
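
The year-over-year wage questions above can be sketched as a small ranking function. This is an illustration only: it assumes workforce rows carry an integer year, whereas the BLS job stores `period` as a string such as "May2023", so a real query would need to parse that first. The function name and the tuple layout are hypothetical.

```python
def yoy_wage_growth(rows, occupation_code):
    """Rank MSAs by latest year-over-year growth in median annual wage.

    rows: iterable of (msa_code, occupation_code, year, annual_wage_median)
    Returns [(msa_code, growth_fraction), ...] sorted by growth, highest first.
    MSAs without two consecutive years of data are skipped.
    """
    by_msa = {}
    for msa, occ, year, wage in rows:
        if occ == occupation_code and wage:
            by_msa.setdefault(msa, {})[year] = wage

    growth = []
    for msa, wages in by_msa.items():
        latest = max(wages)
        prev = latest - 1
        if prev in wages:
            growth.append((msa, wages[latest] / wages[prev] - 1.0))
    return sorted(growth, key=lambda t: t[1], reverse=True)
```

Wage growth on its own is a weak shortage signal; pairing it with flat or falling employment (question 17) is what makes it informative.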

## 4. Financial / opportunity

18. What is the median psych Medicare margin (revenue - cost) per discharge, by state, from Medicare Cost Report (MCR) data?
19. Which for-profit IPF chains are expanding fastest (opened_date + chain_id from nursing home join)?
20. Which counties have the biggest gap between HPSA score and SAM.gov / state contract dollars flowing in (underinvested vs need)?
21. What are the median acquisition multiples for BH facilities in each state? (Requires later enrichment.)

## 5. Adolescent transport / crisis (specific focus)

22. Which counties dispatch the most EMS runs coded "behavioral/psych" per 10k adolescents (NEMSIS, when access granted)?
23. Where do adolescent psychiatric holds most frequently result in out-of-county or out-of-state transport (indicates no local capacity)?
24. Which states have the longest average ED boarding time for adolescents awaiting inpatient psych admission (via AHRQ + state HAI reports)?
25. Which states have a dedicated secure-transport statute or reimbursement (`bhi_policy_events` filtered on "secure transport")? These are likely untapped markets for BH transport vendors.
26. Which counties combine: high adolescent suicide rate + no in-county adolescent psych beds + high ED boarding = highest-need adolescent transport markets?
27. Which chains/operators already provide adolescent secure transport and where are their service gaps (via scraping state BHO contract registries)?

## 6. Regulatory / tailwind

28. Which states passed Medicaid rate increases for BH residential in the last 24 months?
29. Which states expanded the definition of "mobile crisis response" to include adolescents in the last 24 months?
30. Where are IMD exclusion waivers (Section 1115 SMI/SED waivers) active or pending?

## 7. Composite / prioritization

31. Top 10 states ranked by composite_score for "adolescent inpatient psychiatric"?
32. Top 50 counties ranked by composite_score for "young adult residential SUD"?
33. Top 20 MSAs ranked by composite_score for "outpatient adolescent therapy (IOP/PHP)"?
34. For each of the top 10 composite-score opportunities, list: (a) top 3 operators already there, (b) workforce wage growth, (c) most recent policy event, (d) closest open SAM.gov opportunity.

---

**Acceptance criteria:** When the BHI layer is live and all Phase A sources are ingested, a user should be able to run SQL or ask the Brain's natural-language interface these 34 questions and get a grounded answer with citations to the underlying `bhi_*` tables.
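
As an illustration of what "answerable with a SQL query" could look like, here is a minimal sketch that ranks states by the latest value of one demand measure. It uses an in-memory SQLite table as a stand-in for the `brain` Postgres so that it runs anywhere. The column names match those the ingestion jobs write to `bhi_demand_indicators`; the function name, the measure label, and the sample values are made up.

```python
import sqlite3


def top_states(conn, measure, age_bracket, limit=10):
    """Return (state, period, value) rows for each state's latest period, highest value first."""
    sql = """
        SELECT d.geo_code, d.period, d.value
        FROM bhi_demand_indicators AS d
        WHERE d.geo_type = 'state'
          AND d.measure = ?
          AND d.age_bracket = ?
          AND d.period = (
              SELECT MAX(i.period)
              FROM bhi_demand_indicators AS i
              WHERE i.geo_type = 'state'
                AND i.measure = d.measure
                AND i.age_bracket = d.age_bracket
                AND i.geo_code = d.geo_code
          )
        ORDER BY d.value DESC
        LIMIT ?
    """
    return conn.execute(sql, (measure, age_bracket, limit)).fetchall()


# Stand-in table with made-up sample values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bhi_demand_indicators ("
    "geo_type TEXT, geo_code TEXT, measure TEXT, age_bracket TEXT, "
    "period TEXT, value REAL, source TEXT)"
)
conn.executemany(
    "INSERT INTO bhi_demand_indicators VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("state", "NY", "considered suicide", "13-17", "2021", 18.0, "cdc_yrbss_hs"),
        ("state", "NY", "considered suicide", "13-17", "2023", 20.0, "cdc_yrbss_hs"),
        ("state", "TX", "considered suicide", "13-17", "2023", 22.5, "cdc_yrbss_hs"),
        ("state", "CA", "considered suicide", "13-17", "2023", 19.0, "cdc_yrbss_hs"),
    ],
)
```

`MAX(period)` compares strings, which only works when periods are four-digit years. Against Postgres with psycopg2 the placeholders would be `%s` rather than `?`.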
BIN
jobs/ingestion/__pycache__/_common.cpython-312.pyc
Normal file
Binary file not shown.
146
jobs/ingestion/_common.py
Normal file
@@ -0,0 +1,146 @@
"""
Shared helpers for BHI ingestion jobs.

READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql

Base Brain is expected to expose:
  - env DATABASE_URL pointing at the `brain` Postgres
  - a `job_runs` table (the base Brain maintains this)
  - optional Vault at http://localhost:8200 for API keys

Every BHI job imports from this module to keep behavior consistent.
"""
from __future__ import annotations

import logging
import os
import time
from contextlib import contextmanager
from datetime import datetime
from typing import Any, Callable, Iterable

import requests

try:
    import psycopg2
    import psycopg2.extras
except ImportError:
    psycopg2 = None  # type: ignore

LOG_FMT = "%(asctime)s %(levelname)s %(name)s | %(message)s"
logging.basicConfig(level=os.environ.get("BHI_LOG_LEVEL", "INFO"), format=LOG_FMT)


# --- HTTP session with retries + rate limiting ------------------------------

class RateLimitedSession(requests.Session):
    def __init__(self, min_interval: float = 0.2, max_retries: int = 5):
        super().__init__()
        self.headers.update({"User-Agent": "EconomicBrain-BHI/1.0 (+research)"})
        self.min_interval = min_interval
        self.max_retries = max_retries
        self._last = 0.0

    def request(self, method, url, **kw):  # type: ignore[override]
        kw.setdefault("timeout", 60)
        backoff = 1.0
        for attempt in range(self.max_retries):
            # Enforce the minimum spacing between consecutive requests.
            dt = time.monotonic() - self._last
            if dt < self.min_interval:
                time.sleep(self.min_interval - dt)
            self._last = time.monotonic()
            try:
                resp = super().request(method, url, **kw)
            except requests.RequestException as e:
                # Network-level failure (timeout, connection reset): back off and retry.
                logging.warning("Request error: %s (attempt %d)", e, attempt + 1)
                time.sleep(backoff)
                backoff *= 2
                continue
            if resp.status_code in (429, 500, 502, 503, 504):
                logging.warning("HTTP %s on %s, retrying in %.1fs", resp.status_code, url, backoff)
                time.sleep(backoff)
                backoff *= 2
                continue
            # Other error statuses (e.g. 404) are not retryable: raise immediately
            # instead of burning the retry budget on them.
            resp.raise_for_status()
            return resp
        raise RuntimeError(f"Exceeded retries for {url}")


# --- DB helpers -------------------------------------------------------------

def get_conn():
    if psycopg2 is None:
        raise RuntimeError("psycopg2 not installed. pip install psycopg2-binary")
    dsn = os.environ.get("DATABASE_URL") or os.environ.get("BRAIN_DATABASE_URL")
    if not dsn:
        raise RuntimeError("DATABASE_URL env var not set")
    return psycopg2.connect(dsn)


@contextmanager
def job_run(job_name: str):
    """Context manager that logs a row in the base Brain's job_runs table."""
    conn = get_conn()
    run_id = None
    started = datetime.utcnow()
    try:
        with conn.cursor() as c:
            c.execute(
                """
                INSERT INTO job_runs (job_name, started_at, status)
                VALUES (%s, %s, 'running') RETURNING id
                """,
                (job_name, started),
            )
            run_id = c.fetchone()[0]
        conn.commit()
        yield conn, run_id
        with conn.cursor() as c:
            c.execute(
                "UPDATE job_runs SET status='success', finished_at=%s WHERE id=%s",
                (datetime.utcnow(), run_id),
            )
        conn.commit()
    except Exception as e:
        if run_id is not None:
            try:
                # A failed statement leaves the transaction aborted; roll back
                # first so the status update below can actually run.
                conn.rollback()
                with conn.cursor() as c:
                    c.execute(
                        "UPDATE job_runs SET status='error', finished_at=%s, error=%s WHERE id=%s",
                        (datetime.utcnow(), str(e)[:2000], run_id),
                    )
                conn.commit()
            except Exception:
                pass
        raise
    finally:
        conn.close()


def bulk_insert(conn, table: str, columns: list[str], rows: Iterable[tuple]):
    # `table` and `columns` are interpolated into the SQL text, so they must
    # come from trusted code, never from external input.
    with conn.cursor() as c:
        psycopg2.extras.execute_values(
            c,
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s",
            list(rows),
            page_size=500,
        )
    conn.commit()


# --- Vault (optional) -------------------------------------------------------

def vault_secret(path: str, key: str) -> str | None:
    # Without a Vault token, fall back to an upper-cased environment variable.
    token = os.environ.get("VAULT_TOKEN")
    addr = os.environ.get("VAULT_ADDR", "http://localhost:8200")
    if not token:
        return os.environ.get(key.upper())
    try:
        r = requests.get(
            f"{addr}/v1/{path}",
            headers={"X-Vault-Token": token},
            timeout=5,
        )
        return r.json()["data"]["data"].get(key)
    except Exception as e:
        logging.warning("vault fetch failed: %s", e)
        return os.environ.get(key.upper())
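
To make the retry budget of `RateLimitedSession` concrete, the sketch below computes the sleep durations implied by its backoff loop: a 1.0 s initial delay that doubles after every failed attempt. The helper name `backoff_delays` is hypothetical and is not part of `_common.py`.

```python
def backoff_delays(max_retries=5, initial=1.0, factor=2.0):
    """Sleep durations in seconds applied after each failed attempt."""
    delays, delay = [], initial
    for _ in range(max_retries):
        delays.append(delay)
        delay *= factor
    return delays


if __name__ == "__main__":
    schedule = backoff_delays()
    print(schedule, "total:", sum(schedule))
```

With the defaults, a request that keeps failing spends 31 seconds sleeping (plus request timeouts and `min_interval` pacing) before `RuntimeError` is raised, which is worth keeping in mind when a job loops over hundreds of pages.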
93
jobs/ingestion/bls_oes.py
Normal file
@@ -0,0 +1,93 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
BLS OES (Occupational Employment and Wage Statistics) — behavioral health
workforce by MSA.

Primary approach: annual bulk download (no auth, simplest):
  https://www.bls.gov/oes/special-requests/oesmYYma.zip

Fallback / enrichment: BLS public API (optional free key via vault).
"""
import csv
import io
import logging
import sys
import zipfile
from _common import RateLimitedSession, bulk_insert, job_run, vault_secret

LOG = logging.getLogger("bhi.bls_oes")

BULK_URL = "https://www.bls.gov/oes/special-requests/oesm23ma.zip"  # update year annually
BH_SOC_CODES = {
    "29-1223": "Psychiatrists",
    "29-1229": "Physicians, All Other",
    "21-1014": "Mental Health Counselors",
    "21-1015": "Rehabilitation Counselors",
    "21-1018": "SUD / Behavioral Disorder Counselors",
    "21-1023": "Mental Health & Substance Abuse Social Workers",
    "19-3033": "Clinical & Counseling Psychologists",
}


def test_endpoint():
    s = RateLimitedSession()
    r = s.head(BULK_URL, allow_redirects=True)
    print(f"OK: status={r.status_code}, content-length={r.headers.get('content-length')}")
    return r.status_code == 200


def fetch_rows():
    s = RateLimitedSession(min_interval=1.0)
    r = s.get(BULK_URL)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    # Only a CSV member is parsed. If the archive ships XLSX only, this logs
    # an error and returns no rows.
    csv_name = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
    if not csv_name:
        LOG.error("no CSV in BLS zip")
        return []
    with z.open(csv_name) as f:
        reader = csv.DictReader(io.TextIOWrapper(f, encoding="latin-1"))
        rows = [r for r in reader if (r.get("OCC_CODE") or r.get("occ_code")) in BH_SOC_CODES]
    LOG.info("BLS OES BH rows: %d", len(rows))
    return rows


def _num(v):
    # BLS marks suppressed or unavailable cells with "*" or "#"; map those to None.
    try:
        return float(str(v).replace(",", "")) if v not in (None, "", "*", "#") else None
    except (TypeError, ValueError):
        return None


def write_rows(conn, raw):
    cols = ["msa_code","msa_name","occupation_code","occupation_title",
            "employment","annual_wage_median","annual_wage_mean","period","source"]
    rows = []
    for r in raw:
        code = r.get("OCC_CODE") or r.get("occ_code")
        rows.append((
            r.get("AREA") or r.get("area"),
            r.get("AREA_TITLE") or r.get("area_title"),
            code,
            BH_SOC_CODES.get(code, r.get("OCC_TITLE") or r.get("occ_title")),
            int(_num(r.get("TOT_EMP") or r.get("tot_emp")) or 0) or None,
            _num(r.get("A_MEDIAN") or r.get("a_median")),
            _num(r.get("A_MEAN") or r.get("a_mean")),
            "May2023",  # keep in sync with the year in BULK_URL
            "bls_oes",
        ))
    bulk_insert(conn, "bhi_workforce", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_bls_oes") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
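
The zip-then-filter step in `fetch_rows` can be exercised without touching the network. The sketch below builds a tiny archive in memory and applies the same "first CSV member, filter on `OCC_CODE`" logic. The helper names and the sample rows are hypothetical and are not real BLS data.

```python
import csv
import io
import zipfile


def filter_zip_csv(zip_bytes, wanted_codes, code_field="OCC_CODE"):
    """Return rows from the first CSV member whose code_field is in wanted_codes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        name = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
        if name is None:
            return []  # archive has no CSV member
        with z.open(name) as f:
            reader = csv.DictReader(io.TextIOWrapper(f, encoding="latin-1", newline=""))
            return [row for row in reader if row.get(code_field) in wanted_codes]


def make_sample_zip():
    """Build an in-memory zip holding one small CSV with made-up values."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(
            "sample.csv",
            'AREA,OCC_CODE,A_MEDIAN\n'
            '10180,29-1223,*\n'
            '10180,11-1011,"210,000"\n'
            '10420,21-1014,"52,300"\n',
        )
    return buf.getvalue()
```

A fixture like this makes it possible to unit-test the column handling (including suppressed `*` cells and comma-formatted numbers) before pointing the job at the real download.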
92
jobs/ingestion/cdc_brfss.py
Normal file
@@ -0,0 +1,92 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CDC BRFSS Prevalence Data (Socrata).

Source: https://data.cdc.gov/resource/dttw-5yxu.json
Pulls depression + mental-health-not-good items by state, with
young-adult (18-24) breakouts where available.
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cdc_brfss")
BASE = "https://data.cdc.gov/resource/dttw-5yxu.json"

# BRFSS topics of interest for BHI
TOPICS = [
    "Depression",
    "Mental Health Status",
    "Poor Mental Health",
]


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={"$limit": 2}).json()
    print(f"OK: returned {len(r)} rows")
    if r:
        print("sample topic:", r[0].get("topic"))
    return bool(r)


def fetch_rows():
    s = RateLimitedSession(min_interval=0.2)
    out = []
    for topic in TOPICS:
        offset = 0
        while True:
            batch = s.get(BASE, params={
                "$where": f"topic='{topic}'",
                "$limit": 5000,
                "$offset": offset,
            }).json()
            if not batch:
                break
            out.extend(batch)
            if len(batch) < 5000:
                break
            offset += 5000
        LOG.info("topic=%s cumulative_total=%d", topic, len(out))
    return out


def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        try:
            # A missing or blank data_value raises and the row is skipped,
            # rather than being stored as a real 0.0 prevalence.
            val = float(r.get("data_value"))
        except (TypeError, ValueError):
            continue
        breakout = (r.get("break_out") or "Overall").lower()
        if "18" in breakout and "24" in breakout:
            # BRFSS reports 18-24; it is stored under the layer's 18-25 tag.
            bracket = "18-25"
        elif "overall" in breakout:
            bracket = "all"
        else:
            bracket = breakout
        rows.append((
            "state",
            r.get("locationabbr"),
            (r.get("question") or r.get("topic") or "").strip()[:120],
            bracket,
            str(r.get("year") or ""),
            val,
            "cdc_brfss",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_cdc_brfss") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
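
The `$limit`/`$offset` loop in `fetch_rows` follows a generic offset-pagination pattern. The sketch below isolates that pattern behind a hypothetical `fetch_page` callable so the stop condition can be checked against a fake data source; neither helper is part of the ingestion jobs.

```python
def paginate(fetch_page, page_size=5000):
    """Collect all records from an offset-paginated source.

    fetch_page(limit, offset) must return a list; a short (or empty) page
    signals the end of the data.
    """
    offset, out = 0, []
    while True:
        batch = fetch_page(limit=page_size, offset=offset)
        out.extend(batch)
        if len(batch) < page_size:
            break
        offset += page_size
    return out


def make_fake_source(total):
    """Return (fetch_page, calls) backed by `total` integers; calls records the offsets requested."""
    data = list(range(total))
    calls = []

    def fetch_page(limit, offset):
        calls.append(offset)
        return data[offset:offset + limit]

    return fetch_page, calls
```

When the record count is an exact multiple of the page size, one extra request is made and returns an empty page; that is the cost of not knowing the total up front.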
119
jobs/ingestion/cdc_wonder_mortality.py
Normal file
@@ -0,0 +1,119 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CDC WONDER — Underlying Cause of Death by county, age bracket, ICD-10.

Posts XML request body to https://wonder.cdc.gov/controller/datarequest/D76
(Underlying Cause of Death 1999-2020) or D77 (2018+). The public non-restricted
datasets return XML tables; county-level cells with <10 deaths are suppressed.

We request two slices:
  1. Suicide (X60-X84) for ages 13-17 and 18-25, by county
  2. Drug poisoning (X40-X44, Y10-Y14) for 13-17 and 18-25, by county
"""
import logging
import sys
import xml.etree.ElementTree as ET
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cdc_wonder")
ENDPOINT = "https://wonder.cdc.gov/controller/datarequest/D76"


def _build_xml(icd_codes: list[str], age_bracket: str) -> str:
    """Assemble WONDER POST XML. Structure is value-order dependent."""
    # Age groups in WONDER: 15-19, 20-24, 25-29 etc. Adolescent and young-adult
    # brackets don't align perfectly with 5-year WONDER bins, so use the closest fit:
    ages = {
        "13-17": ["15-19"],  # approximate
        "18-25": ["20-24", "25-29"],
    }[age_bracket]
    # Multi-valued parameters repeat the <value> element, matching the
    # single-valued parameters below.
    icd_vals = "".join(f"<value>{c}</value>" for c in icd_codes)
    age_vals = "".join(f"<value>{a}</value>" for a in ages)
    return f"""<?xml version="1.0" encoding="utf-8"?>
<request-parameters>
  <parameter><name>accept_datause_restrictions</name><value>true</value></parameter>
  <parameter><name>B_1</name><value>D76.V2-level1</value></parameter>
  <parameter><name>B_2</name><value>D76.V51</value></parameter>
  <parameter><name>F_D76.V1</name>{age_vals}</parameter>
  <parameter><name>F_D76.V2</name><value>*All*</value></parameter>
  <parameter><name>F_D76.V22</name>{icd_vals}</parameter>
  <parameter><name>O_age</name><value>D76.V51</value></parameter>
  <parameter><name>O_location</name><value>D76.V9</value></parameter>
  <parameter><name>VM_D76.M6_D76.V10</name><value/></parameter>
</request-parameters>"""


def test_endpoint():
    s = RateLimitedSession(min_interval=1.0)
    body = _build_xml(["X60-X84"], "13-17")
    r = s.post(ENDPOINT, data={"request_xml": body, "accept_datause_restrictions": "true"})
    ok = r.status_code == 200 and b"<response" in r.content
    print(f"OK={ok}, status={r.status_code}, len={len(r.content)}")
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=1.0)
    out = []
    for measure, icd in [("suicide_rate", ["X60-X84"]),
                         ("overdose_rate", ["X40-X44", "Y10-Y14"])]:
        for bracket in ("13-17", "18-25"):
            body = _build_xml(icd, bracket)
            r = s.post(ENDPOINT, data={
                "request_xml": body,
                "accept_datause_restrictions": "true",
            })
            rows = _parse_wonder_xml(r.text, measure, bracket)
            out.extend(rows)
            LOG.info("%s %s -> %d rows", measure, bracket, len(rows))
    return out


def _parse_wonder_xml(xml_text: str, measure: str, bracket: str):
    out = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        LOG.error("WONDER XML parse failed")
        return out
    # WONDER returns <data-table> with <r> rows; label cells carry an `l`
    # attribute and value cells a `v` attribute.
    for r in root.iter("r"):
        cells = [c.get("l") or c.get("v") or c.text for c in r.findall("c")]
        if len(cells) < 3:
            continue
        county = cells[0]
        try:
            # Strip thousands separators; suppressed cells fail to parse and are skipped.
            rate = float(str(cells[-1]).replace(",", ""))
        except (TypeError, ValueError):
            continue
        out.append({
            "geo_type": "county",
            "geo_code": county,
            "measure": measure,
            "age_bracket": bracket,
            "period": "1999-2020",  # D76 coverage; the request sets no year filter
            "value": rate,
            "source": "cdc_wonder",
        })
    return out


def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = [(r["geo_type"], r["geo_code"], r["measure"], r["age_bracket"],
             r["period"], r["value"], r["source"]) for r in raw]
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_cdc_wonder") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
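
Because the parser depends on the shape of the WONDER response, it helps to pin that shape down in a fixture. The sketch below parses a hand-written sample that assumes label cells use an `l` attribute and value cells use a `v` attribute; the sample document, county names, and numbers are invented, and the real response should be checked against it before relying on this.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the structure is an assumption about the response shape.
SAMPLE = """<page><response><data-table>
<r><c l="Example County A"/><c v="12"/><c v="10,500"/><c v="114.3"/></r>
<r><c l="Example County B"/><c v="Suppressed"/><c v="8,200"/><c v="Suppressed"/></r>
</data-table></response></page>"""


def parse_rows(xml_text):
    """Return (label, rate) pairs, skipping rows whose last cell is not numeric."""
    root = ET.fromstring(xml_text)
    out = []
    for r in root.iter("r"):
        cells = [c.get("l") or c.get("v") or c.text for c in r.findall("c")]
        if len(cells) < 3:
            continue
        try:
            rate = float(str(cells[-1]).replace(",", ""))
        except ValueError:
            continue  # e.g. "Suppressed"
        out.append((cells[0], rate))
    return out
```

Skipping suppressed rows silently means a county with few deaths is indistinguishable from a county that was never queried; storing an explicit suppressed flag would be a design choice worth considering.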
95
jobs/ingestion/cdc_yrbss.py
Normal file
@@ -0,0 +1,95 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CDC YRBSS — Youth Risk Behavior Survey (high and middle school).

Sources (Socrata):
  - High school: https://data.cdc.gov/resource/3qty-g4aq.json
  - Middle school: https://data.cdc.gov/resource/uqmk-4y2w.json

Key items: "considered suicide", "attempted suicide", "persistent sadness",
substance use — all adolescent (13-17) bracket.
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cdc_yrbss")
DATASETS = {
    "hs": "https://data.cdc.gov/resource/3qty-g4aq.json",
    "ms": "https://data.cdc.gov/resource/uqmk-4y2w.json",
}

KEYWORDS = ["suicide", "sad", "hopeless", "mental health", "electronic"]


def test_endpoint():
    s = RateLimitedSession()
    ok = True
    for k, url in DATASETS.items():
        r = s.get(url, params={"$limit": 1})
        print(f"{k}: status={r.status_code}, rows={len(r.json())}")
        ok = ok and r.status_code == 200
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=0.2)
    out = []
    for key, url in DATASETS.items():
        offset = 0
        while True:
            batch = s.get(url, params={"$limit": 5000, "$offset": offset}).json()
            if not batch:
                break
            for row in batch:
                row["_dataset"] = key
            out.extend(batch)
            if len(batch) < 5000:
                break
            offset += 5000
        LOG.info("yrbss %s -> cumulative %d", key, len(out))
    return out


def _question_is_relevant(q: str) -> bool:
    ql = (q or "").lower()
    return any(k in ql for k in KEYWORDS)


def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        # Prefer the question text over the short code: the keyword filter
        # matches words such as "suicide", which a bare code would never contain.
        question = r.get("shortquestiontext") or r.get("question") or r.get("questioncode") or ""
        if not _question_is_relevant(question):
            continue
        try:
            val = float(r.get("data_value") or r.get("greater_risk_data_value") or 0)
        except (TypeError, ValueError):
            continue
        if val == 0:
            continue
        rows.append((
            "state" if r.get("locationdesc") else "district",
            r.get("locationabbr") or r.get("sitecode"),
            question[:120],
            "13-17",
            str(r.get("year") or ""),
            val,
            f"cdc_yrbss_{r.get('_dataset','hs')}",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_cdc_yrbss") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
77
jobs/ingestion/cms_hospital_compare.py
Normal file
@@ -0,0 +1,77 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Hospital General Information (Care Compare) — used to cross-reference
which acute hospitals host behavioral health units and to capture CCN-level
facility metadata.

Source: https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cms_hospital_compare")
BASE = "https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0"
PAGE = 500


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={"limit": 2}).json()
    rows = r.get("results", [])
    print(f"OK: {len(rows)} rows, sample:", rows[0].get("facility_name") if rows else None)
    return bool(rows)


def fetch_rows():
    s = RateLimitedSession(min_interval=0.25)
    offset, out = 0, []
    while True:
        b = s.get(BASE, params={"limit": PAGE, "offset": offset}).json().get("results", [])
        if not b:
            break
        out.extend(b)
        if len(b) < PAGE:
            break
        offset += PAGE
    LOG.info("fetched %d hospitals", len(out))
    return out


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        rows.append((
            r.get("facility_id"), None,
            r.get("facility_name"), r.get("address"),
            r.get("citytown"), r.get("state"), r.get("zip_code"), None,
            None, None,
            (r.get("hospital_type") or "hospital"),
            r.get("hospital_ownership"),
            None, None, None, None, None,
            [], [], [], None, None, None, None, None,
            "cms_hospital_compare", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_cms_hospital_compare") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
137
jobs/ingestion/cms_ipfqr.py
Normal file
@@ -0,0 +1,137 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Inpatient Psychiatric Facility Quality Reporting (IPFQR) ingestion.

Source: https://data.cms.gov/provider-data/api/1/datastore/query/q9vs-r7wp/0
Writes facilities to bhi_facilities and measures to bhi_facility_quality.
"""
import logging
import sys
from typing import Any

from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cms_ipfqr")

DATASET_ID = "q9vs-r7wp"  # IPFQR by Facility
BASE = f"https://data.cms.gov/provider-data/api/1/datastore/query/{DATASET_ID}/0"
PAGE_SIZE = 500

MEASURE_FIELDS = [
    ("hbips2", "HBIPS-2", "Hours of physical-restraint use"),
    ("hbips3", "HBIPS-3", "Hours of seclusion use"),
    ("smd", "SMD", "Screening for metabolic disorders"),
    ("sub2", "SUB-2", "Alcohol use brief intervention"),
    ("sub3", "SUB-3", "Alcohol/other drug use treatment at discharge"),
    ("tob3", "TOB-3", "Tobacco use treatment at discharge"),
]


# --- TEST function (no DB) --------------------------------------------------

def test_endpoint():
    """Run standalone to verify the endpoint works."""
    s = RateLimitedSession()
    r = s.get(BASE, params={"limit": 3})
    data = r.json()
    rows = data.get("results", [])
    print(f"OK: fetched {len(rows)} rows from {BASE}")
    if rows:
        print("Sample keys:", list(rows[0].keys())[:12])
        print("Sample facility:", rows[0].get("facility_name"), rows[0].get("state"))
    return len(rows) > 0


# --- Fetch ------------------------------------------------------------------

def fetch_rows() -> list[dict[str, Any]]:
    s = RateLimitedSession(min_interval=0.25)
    offset = 0
    out: list[dict[str, Any]] = []
    while True:
        r = s.get(BASE, params={"limit": PAGE_SIZE, "offset": offset})
        batch = r.json().get("results", [])
        if not batch:
            break
        out.extend(batch)
        LOG.info("fetched %d (total %d)", len(batch), len(out))
        if len(batch) < PAGE_SIZE:
            break
        offset += PAGE_SIZE
    return out


# --- Write ------------------------------------------------------------------

def write_rows(conn, raw_rows: list[dict[str, Any]]) -> tuple[int, int]:
    facility_rows = []
    for r in raw_rows:
        facility_rows.append((
            r.get("facility_id"),       # ccn
            None,                       # npi
            r.get("facility_name"),
            r.get("address"),
            r.get("citytown"),
            r.get("state"),
            r.get("zip_code"),
            None,                       # county_fips (join later via zip->fips)
            None, None,                 # lat, lon
            "IPF",                      # facility_type
            None, None, None, None,     # ownership, bed counts
            None, None,                 # adolescent_unit, young_adult_unit
            [], [], [], None,           # arrays, medicaid_accepted
            None, None, None,           # accreditation, opened, closed
            None,                       # last_verified
            "cms_ipfqr",                # source
            None,                       # source_raw_id
        ))

    facility_cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    bulk_insert(conn, "bhi_facilities", facility_cols, facility_rows)

    # Map ccn -> facility_id for measures
    with conn.cursor() as c:
        c.execute(
            "SELECT ccn, facility_id FROM bhi_facilities WHERE source='cms_ipfqr'"
        )
        ccn_map = dict(c.fetchall())

    measure_rows = []
    for r in raw_rows:
        fid = ccn_map.get(r.get("facility_id"))
|
||||||
|
if not fid:
|
||||||
|
continue
|
||||||
|
for field, mid, mname in MEASURE_FIELDS:
|
||||||
|
val = r.get(field) or r.get(f"{field}_overall_rate_per_1000")
|
||||||
|
try:
|
||||||
|
v = float(val) if val not in (None, "", "Not Available") else None
|
||||||
|
except (TypeError, ValueError):
|
||||||
|
v = None
|
||||||
|
if v is None:
|
||||||
|
continue
|
||||||
|
measure_rows.append((fid, mid, mname, v, None, None, None, "cms_ipfqr"))
|
||||||
|
|
||||||
|
cols = ["facility_id","measure_id","measure_name","value","benchmark","period","reported_at","source"]
|
||||||
|
bulk_insert(conn, "bhi_facility_quality", cols, measure_rows)
|
||||||
|
return len(facility_rows), len(measure_rows)
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
with job_run("bhi_cms_ipfqr") as (conn, run_id):
|
||||||
|
rows = fetch_rows()
|
||||||
|
f, m = write_rows(conn, rows)
|
||||||
|
LOG.info("inserted %d facilities, %d measures (run %s)", f, m, run_id)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
if len(sys.argv) > 1 and sys.argv[1] == "test":
|
||||||
|
sys.exit(0 if test_endpoint() else 1)
|
||||||
|
main()
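The limit/offset loop in `fetch_rows` can be exercised without calling the CMS API. The following is a minimal sketch, assuming a hypothetical `paginate` helper and an in-memory `fake_source` page provider (neither is part of this commit), that mirrors the loop's two stop conditions: an empty page and a short page.

```python
from typing import Any, Callable

Row = dict[str, Any]


def paginate(get_page: Callable[[int, int], list[Row]], page_size: int) -> list[Row]:
    """Collect rows page by page; get_page(limit, offset) is a hypothetical fetcher."""
    out: list[Row] = []
    offset = 0
    while True:
        batch = get_page(page_size, offset)
        if not batch:
            break  # empty page: nothing left
        out.extend(batch)
        if len(batch) < page_size:
            break  # short page: this was the last one
        offset += page_size
    return out


def fake_source(total: int) -> Callable[[int, int], list[Row]]:
    """Build an in-memory stand-in for the API holding `total` rows."""
    data = [{"facility_id": str(i)} for i in range(total)]

    def get_page(limit: int, offset: int) -> list[Row]:
        return data[offset:offset + limit]

    return get_page
```

When the row count is an exact multiple of the page size, the loop needs one extra request that returns an empty page before it stops; the empty-page check is what ends the loop in that case.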
82
jobs/ingestion/cms_nursing_home.py
Normal file
@@ -0,0 +1,82 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Nursing Home Provider Information: captures SNFs that house behavioral
health residents (SNF-IMD dynamic) for later filtering on chain + ownership.

Source: https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cms_nursing_home")
BASE = "https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0"
PAGE = 1000


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={"limit": 2}).json()
    rows = r.get("results", [])
    print(f"OK: {len(rows)} rows, sample:", rows[0].get("provider_name") if rows else None)
    return bool(rows)


def fetch_rows():
    s = RateLimitedSession(min_interval=0.25)
    offset, out = 0, []
    while True:
        b = s.get(BASE, params={"limit": PAGE, "offset": offset}).json().get("results", [])
        if not b:
            break
        out.extend(b)
        if len(b) < PAGE:
            break
        offset += PAGE
    LOG.info("fetched %d nursing homes", len(out))
    return out


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        try:
            # A missing or zero bed count is stored as NULL.
            beds = int(r.get("number_of_certified_beds") or 0) or None
        except (TypeError, ValueError):
            beds = None
        opened = r.get("date_first_approved_to_provide_medicare_and_medicaid_services")
        rows.append((
            r.get("cms_certification_number_ccn"), None,
            r.get("provider_name"), r.get("provider_address"),
            r.get("citytown"), r.get("state"), r.get("zip_code"), None,
            None, None,
            "nursing_home",
            r.get("ownership_type"),
            beds, None, None, None, None,
            [], [], [], None, None,
            opened or None, None, None,
            "cms_nursing_home", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_cms_nursing_home") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
143
jobs/ingestion/cms_pos.py
Normal file
@@ -0,0 +1,143 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Provider of Services (POS) file: quarterly bulk CSV with every
Medicare-certified facility including provider category (IPFs, PRTFs, etc.),
bed counts, certification date, and termination date. Critical for
closure/opening tracking used in composite_score.capacity_trend.
"""
import csv
import io
import logging
import sys
import zipfile
from datetime import datetime
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.cms_pos")
CATALOG_URL = "https://data.cms.gov/data.json"


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(CATALOG_URL).json()
    pos = [d for d in r.get("dataset", []) if "provider of services" in d.get("title", "").lower()]
    print(f"OK: {len(pos)} POS datasets in catalog")
    for d in pos[:3]:
        print("  -", d.get("title"))
    return len(pos) > 0


def _latest_pos_distribution():
    """Return the download URL of the most recently modified hospital POS file, or None."""
    s = RateLimitedSession(min_interval=0.3)
    r = s.get(CATALOG_URL).json()
    pos = [d for d in r.get("dataset", [])
           if "provider of services" in d.get("title", "").lower()
           and "hospital" in d.get("title", "").lower()]
    if not pos:
        return None
    latest = max(pos, key=lambda d: d.get("modified", ""))
    for dist in latest.get("distribution", []):
        url = dist.get("downloadURL") or dist.get("accessURL", "")
        if url.endswith((".zip", ".csv")):
            return url
    return None


def fetch_rows():
    url = _latest_pos_distribution()
    if not url:
        LOG.error("Could not resolve POS download URL")
        return []
    LOG.info("fetching POS: %s", url)
    s = RateLimitedSession(min_interval=0.5)
    r = s.get(url)
    # Fail fast on HTTP errors instead of parsing an error page as CSV.
    r.raise_for_status()
    content = r.content
    if url.endswith(".zip"):
        z = zipfile.ZipFile(io.BytesIO(content))
        csvname = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
        if csvname is None:
            LOG.error("no CSV member found in POS archive %s", url)
            return []
        with z.open(csvname) as f:
            text = io.TextIOWrapper(f, encoding="latin-1").read()
    else:
        text = content.decode("latin-1", errors="replace")
    reader = csv.DictReader(io.StringIO(text))
    # Filter to psychiatric + BH provider categories.
    # CMS PRVDR_CTGRY_CD: 04 = psych hospital, sub-category variations.
    # TODO: confirm this code against the current POS data dictionary; the
    # psychiatric flag may live in PRVDR_CTGRY_SBTYP_CD rather than the category.
    keep = []
    for row in reader:
        cat = row.get("PRVDR_CTGRY_CD") or row.get("prvdr_ctgry_cd") or ""
        name = row.get("FAC_NAME") or row.get("fac_name") or ""
        if cat == "04" or "psych" in name.lower():
            keep.append(row)
    LOG.info("filtered POS to %d BH-relevant rows", len(keep))
    return keep


def _parse_date(s):
    if not s:
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None


def _num(v):
    try:
        return int(float(v)) if v not in (None, "") else None
    except (TypeError, ValueError):
        return None


def _first(row, *keys):
    """Return the first non-empty value, trying each key as given and lower-cased."""
    for k in keys:
        v = row.get(k) or row.get(k.lower())
        if v:
            return v
    return None


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        rows.append((
            _first(r, "PRVDR_NUM"), None,
            _first(r, "FAC_NAME"),
            _first(r, "ST_ADR"),
            _first(r, "CITY_NAME"),
            _first(r, "STATE_CD"),
            _first(r, "ZIP_CD"),
            None, None, None,
            "IPF",
            _first(r, "GNRL_CNTL_TYPE_CD"),
            _num(_first(r, "BED_CNT")),
            _num(_first(r, "CRTFD_BED_CNT")),
            None, None, None,
            [], [], [], None, None,
            _parse_date(_first(r, "ORGNL_PRTCPTN_DT")),
            _parse_date(_first(r, "TRMNTN_EXPRTN_DT")),
            None,
            "cms_pos", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_cms_pos") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
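The ZIP branch of `fetch_rows` can be tried offline. The sketch below assumes a hypothetical `read_first_csv` helper and a made-up two-column sample built by `make_sample_zip` (the facility row is invented for illustration). It shows the same steps: open the archive from bytes, pick the first `.csv` member regardless of case, and decode it as latin-1.

```python
import csv
import io
import zipfile


def read_first_csv(zip_bytes: bytes, encoding: str = "latin-1") -> list[dict]:
    """Return the rows of the first CSV member in an in-memory ZIP, or [] if none."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        name = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
        if name is None:
            return []  # archive has no CSV member
        with z.open(name) as f:
            # Materialise the rows before the member is closed.
            return list(csv.DictReader(io.TextIOWrapper(f, encoding=encoding)))


def make_sample_zip() -> bytes:
    """Build a tiny archive with one non-CSV member and one upper-case CSV member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("readme.txt", "not a csv")
        z.writestr("POS_SAMPLE.CSV", "PRVDR_NUM,FAC_NAME\n014000,EXAMPLE PSYCHIATRIC HOSPITAL\n")
    return buf.getvalue()
```

Returning an empty list for an archive without a CSV keeps the caller's "no rows" path and the "bad archive" path identical, which may or may not be what a production job wants; logging the case separately is a reasonable alternative.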
85
jobs/ingestion/hrsa_hpsa.py
Normal file
@@ -0,0 +1,85 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
HRSA Mental Health HPSA (Health Professional Shortage Areas) bulk CSV.

Source: https://data.hrsa.gov/DataDownload/DD_Files/BCD_HPSA_FCT_DET_MH.csv
Confirmed: ~23 MB CSV, all active + historical MH HPSAs.
"""
import csv
import io
import logging
import sys
from datetime import datetime
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.hrsa_hpsa")
URL = "https://data.hrsa.gov/DataDownload/DD_Files/BCD_HPSA_FCT_DET_MH.csv"


def test_endpoint():
    """Read only the header line of the CSV to confirm the file is reachable."""
    s = RateLimitedSession()
    r = s.get(URL, stream=True)
    first = next(r.iter_lines(), b"")
    print(f"OK: content-length={r.headers.get('content-length')}")
    print("header:", first.decode("utf-8", errors="replace")[:200])
    r.close()  # release the streamed connection without downloading the whole file
    return bool(first)


def fetch_rows():
    s = RateLimitedSession(min_interval=0.5)
    r = s.get(URL)
    # Fail fast on HTTP errors instead of parsing an error page as CSV.
    r.raise_for_status()
    r.encoding = "utf-8"
    reader = csv.DictReader(io.StringIO(r.text))
    rows = list(reader)
    LOG.info("fetched %d HPSA rows", len(rows))
    return rows


def _parse_date(s):
    if not s:
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None


def _parse_int(s):
    try:
        return int(float(s)) if s not in (None, "") else None
    except (TypeError, ValueError):
        return None


def write_rows(conn, raw):
    cols = ["hpsa_id","state","county_fips","score","population_served",
            "designated_date","withdrawn_date","source"]
    rows = []
    for r in raw:
        rows.append((
            r.get("HPSA ID"),
            r.get("Primary State Abbreviation"),
            r.get("Common County FIPS Code") or r.get("HPSA Geography Identification Number"),
            _parse_int(r.get("HPSA Score")),
            _parse_int(r.get("HPSA Designation Population")),
            _parse_date(r.get("HPSA Designation Date")),
            _parse_date(r.get("Withdrawn Date")),
            "hrsa_hpsa_mh",
        ))
    bulk_insert(conn, "bhi_shortages", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_hrsa_hpsa") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
93
jobs/ingestion/idea_part_b.py
Normal file
@@ -0,0 +1,93 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
IDEA Part B child count, specifically the "Emotional Disturbance" (ED)
classification by state and local education agency (LEA).

Static CSVs hosted by the US Department of Education / OSEP. No API. This job
pulls the most recent static tables. Update MANIFEST when a new year is published.
"""
import csv
import io
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.idea_part_b")

# Static CSV links. The URL below follows a placeholder pattern; the landing
# page was confirmed by the user at
# https://www2.ed.gov/programs/osepidea/618-data/static-tables/index.html
MANIFEST = [
    # (year, scope, url)
    ("2022-23", "state", "https://www2.ed.gov/programs/osepidea/618-data/static-tables/part-b/child-count-and-educational-environment/bchildcountandedenvironments2022-23.csv"),
]


def test_endpoint():
    s = RateLimitedSession()
    ok = True
    for year, scope, url in MANIFEST:
        r = s.head(url, allow_redirects=True)
        print(f"{year} {scope}: {r.status_code}")
        ok = ok and r.status_code in (200, 302)
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=0.5)
    out = []
    for year, scope, url in MANIFEST:
        try:
            r = s.get(url)
            # Treat HTTP errors as a failed download rather than parsing the error page.
            r.raise_for_status()
            r.encoding = "utf-8"
            reader = csv.DictReader(io.StringIO(r.text))
            for row in reader:
                row["_year"] = year
                row["_scope"] = scope
                out.append(row)
        except Exception as e:
            LOG.warning("failed %s: %s", url, e)
    LOG.info("IDEA rows: %d", len(out))
    return out


def _int(v):
    """Parse a count such as "1,234"; return None for blanks, dashes, or non-numbers."""
    try:
        return int(str(v).replace(",", "")) if v not in (None, "", "-") else None
    except (TypeError, ValueError):
        return None


def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        disability = (r.get("Disability Category") or r.get("SEA Disability Category") or "").lower()
        if "emotional" not in disability:
            continue
        val = _int(r.get("Students Served") or r.get("Total") or r.get("ED"))
        if val is None:
            continue
        rows.append((
            "state",
            r.get("State") or r.get("SEA State"),
            "idea_emotional_disturbance_count",
            "13-17",  # ED classification predominantly school-age; approximate
            r["_year"],
            float(val),
            "idea_part_b",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_idea_part_b") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
114
jobs/ingestion/nppes.py
Normal file
@@ -0,0 +1,114 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS NPPES (National Plan & Provider Enumeration System): behavioral health
providers by taxonomy + state.

API: https://npiregistry.cms.hhs.gov/api/?version=2.1
Filter: taxonomy codes for psychiatry, psychology, counseling, SUD.
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.nppes")
BASE = "https://npiregistry.cms.hhs.gov/api/"

BH_TAXONOMY_CODES = [
    "2084P0800X",  # Psychiatry
    "2084P0802X",  # Addiction Psychiatry
    "2084P0804X",  # Child & Adolescent Psychiatry
    "103T00000X",  # Psychologist
    "103TC2200X",  # Clinical Child & Adolescent Psychologist
    "101YM0800X",  # Mental Health Counselor
    "1041C0700X",  # Clinical Social Worker
    "324500000X",  # Substance Abuse Rehabilitation Facility
    "283Q00000X",  # Psychiatric Hospital
    "323P00000X",  # Psychiatric Residential Treatment Facility
]
STATES = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
          "IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV",
          "NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN",
          "TX","UT","VT","VA","WA","WV","WI","WY","DC"]
PAGE_LIMIT = 200


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={
        "version": "2.1", "taxonomy_description": "psychiatric",
        "state": "NY", "limit": 2,
    }).json()
    print(f"OK: result_count={r.get('result_count')}")
    return r.get("result_count", 0) > 0


def fetch_rows():
    s = RateLimitedSession(min_interval=0.1)
    all_rows = []
    for state in STATES:
        for taxonomy in BH_TAXONOMY_CODES:
            skip = 0
            while True:
                # NOTE: codes are sent via taxonomy_description; confirm that the
                # API matches codes as well as description text.
                r = s.get(BASE, params={
                    "version": "2.1",
                    "taxonomy_description": taxonomy,
                    "state": state,
                    "limit": PAGE_LIMIT,
                    "skip": skip,
                }).json()
                results = r.get("results", [])
                if not results:
                    break
                for row in results:
                    row["_state"] = state
                    row["_taxonomy"] = taxonomy
                all_rows.extend(results)
                if len(results) < PAGE_LIMIT:
                    break
                skip += PAGE_LIMIT
                if skip > 1200:  # NPPES caps paging
                    break
            LOG.info("state=%s tax=%s total=%d", state, taxonomy, len(all_rows))
    return all_rows


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        addresses = r.get("addresses") or []
        # Prefer the practice location; fall back to the first address listed.
        location = next((a for a in addresses if a.get("address_purpose") == "LOCATION"), addresses[0] if addresses else {})
        basic = r.get("basic") or {}
        org_name = basic.get("organization_name")
        name = org_name or " ".join(filter(None, [basic.get("first_name"), basic.get("last_name")]))
        rows.append((
            None, str(r.get("number", "")),
            name,
            location.get("address_1"), location.get("city"),
            location.get("state"), location.get("postal_code"), None,
            None, None,
            # A record is an organization when it carries an organization name;
            # a missing name_prefix does not distinguish people from organizations.
            "org" if org_name else "provider",
            None, None, None, None, None, None,
            [r.get("_taxonomy", "")], [], [], None, None, None, None, None,
            "nppes", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_nppes") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
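A minimal sketch of how one NPPES result record maps to a name, a practice address, and a facility type. The two sample records are hypothetical and contain only the fields `write_rows` reads; `summarize` is an illustrative helper, not part of the job. Deriving the type from `organization_name` is an assumption of this sketch.

```python
def summarize(record: dict) -> dict:
    """Reduce an NPPES-style result record to the fields the facility row needs."""
    addresses = record.get("addresses") or []
    # Prefer the practice location; otherwise use the first address, if any.
    location = next(
        (a for a in addresses if a.get("address_purpose") == "LOCATION"),
        addresses[0] if addresses else {},
    )
    basic = record.get("basic") or {}
    org_name = basic.get("organization_name")
    person = " ".join(filter(None, [basic.get("first_name"), basic.get("last_name")]))
    return {
        "npi": str(record.get("number", "")),
        "name": org_name or person,
        "city": location.get("city"),
        "facility_type": "org" if org_name else "provider",
    }


# Hypothetical records for illustration only.
SAMPLE_ORG = {
    "number": 1234567890,
    "basic": {"organization_name": "EXAMPLE BEHAVIORAL HEALTH CENTER"},
    "addresses": [
        {"address_purpose": "MAILING", "city": "NEW YORK"},
        {"address_purpose": "LOCATION", "city": "ALBANY"},
    ],
}
SAMPLE_PERSON = {
    "number": 1098765432,
    "basic": {"first_name": "JANE", "last_name": "DOE"},
    "addresses": [{"address_purpose": "MAILING", "city": "BUFFALO"}],
}
```

For `SAMPLE_ORG` the LOCATION address wins over the MAILING one even though it is listed second; for `SAMPLE_PERSON` the only address available is used as the fallback.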
96
jobs/ingestion/nsch.py
Normal file
@@ -0,0 +1,96 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
NSCH: National Survey of Children's Health (HRSA/MCHB).

Source: https://mchb.hrsa.gov/data-research/national-survey-childrens-health
Bulk files by year; we parse state-level indicator tables. Manifest below.
"""
import csv
import io
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.nsch")

MANIFEST = [
    # (year, url_to_indicator_csv)
    ("2022", "https://mchb.hrsa.gov/sites/default/files/mchb/data-research/nsch/2022/nsch-2022-state-level-indicators.csv"),
]

# Substring of the indicator label -> measure name; the first match wins.
INDICATORS_OF_INTEREST = {
    "anxiety": "anxiety_pct",
    "depression": "depression_pct",
    "behavioral": "behavioral_pct",
    "mental health treatment": "unmet_mh_treatment_pct",
    "unmet": "unmet_mh_treatment_pct",
}


def test_endpoint():
    s = RateLimitedSession()
    ok = True
    for year, url in MANIFEST:
        r = s.head(url, allow_redirects=True)
        print(f"{year}: {r.status_code}")
        ok = ok and r.status_code in (200, 302)
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=0.5)
    out = []
    for year, url in MANIFEST:
        try:
            r = s.get(url)
            # Treat HTTP errors as a failed download rather than parsing the error page.
            r.raise_for_status()
            r.encoding = "utf-8"
            reader = csv.DictReader(io.StringIO(r.text))
            for row in reader:
                row["_year"] = year
                out.append(row)
        except Exception as e:
            LOG.warning("failed %s: %s", url, e)
    LOG.info("NSCH rows: %d", len(out))
    return out


def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        indicator = (r.get("Indicator") or "").lower()
        measure = None
        for k, v in INDICATORS_OF_INTEREST.items():
            if k in indicator:
                measure = v
                break
        if not measure:
            continue
        raw_val = r.get("Estimate") or r.get("Value")
        if not raw_val:
            continue  # skip rows with no estimate rather than recording a false 0
        try:
            val = float(str(raw_val).replace("%", ""))
        except ValueError:
            continue
        rows.append((
            "state",
            r.get("State"),
            measure,
            "13-17",
            r["_year"],
            val,
            "nsch",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_nsch") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
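The indicator matching in `write_rows` is a substring scan over a dict, so dict order decides which measure a label receives. A minimal sketch, assuming hypothetical helpers `match_measure` and `parse_pct` and made-up indicator labels (the real NSCH column text may differ):

```python
from typing import Optional

# Same mapping as the job: substring of the indicator label -> measure name.
INDICATORS = {
    "anxiety": "anxiety_pct",
    "depression": "depression_pct",
    "behavioral": "behavioral_pct",
    "mental health treatment": "unmet_mh_treatment_pct",
    "unmet": "unmet_mh_treatment_pct",
}


def match_measure(indicator: str) -> Optional[str]:
    """Return the measure for the first key found in the label, else None."""
    label = (indicator or "").lower()
    for key, measure in INDICATORS.items():
        if key in label:
            return measure
    return None


def parse_pct(value) -> Optional[float]:
    """Parse '12.5%' or '12.5' into a float; None for blanks or non-numbers."""
    if value in (None, ""):
        return None
    try:
        return float(str(value).replace("%", ""))
    except ValueError:
        return None
```

A label that mentions both anxiety and depression is filed under `anxiety_pct` because that key comes first, which is worth keeping in mind if combined indicators appear in the source table.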
95
jobs/ingestion/samhsa_locator.py
Normal file
@@ -0,0 +1,95 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
SAMHSA findtreatment.gov behavioral health facility locator.

Source: https://findtreatment.gov/locator/exportsAsJson/v2
Confirmed: 96,009 facilities across 3,201 pages (sType=BH).
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.samhsa_locator")
BASE = "https://findtreatment.gov/locator/exportsAsJson/v2"
ZIP_SEED = "10001"   # any valid zip works; results are national in the 'BH' sType
PAGE_SIZE = 30       # server default; respected


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={"sType": "BH", "sAddr": ZIP_SEED, "page": 1}).json()
    print(f"OK: recordCount={r.get('recordCount')}, totalPages={r.get('totalPages')}")
    rows = r.get("rows", [])
    if rows:
        print("sample:", rows[0].get("name1"), rows[0].get("state"))
    return bool(rows)


def fetch_rows(max_pages: int | None = None):
    """Fetch every page, or at most max_pages pages when it is set."""
    s = RateLimitedSession(min_interval=0.3)
    out = []
    page = 1
    total = None
    while True:
        r = s.get(BASE, params={"sType": "BH", "sAddr": ZIP_SEED, "pageSize": PAGE_SIZE, "page": page}).json()
        total = total or r.get("totalPages", 1)
        out.extend(r.get("rows", []))
        if page % 50 == 0:
            LOG.info("page %d/%d (total rows %d)", page, total, len(out))
        if page >= total or (max_pages and page >= max_pages):
            break
        page += 1
    LOG.info("fetched %d facilities", len(out))
    return out


def _parse_float(v):
    try:
        return float(v) if v not in (None, "") else None
    except (TypeError, ValueError):
        return None


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        name = " ".join(filter(None, [r.get("name1"), (r.get("name2") or "").strip()])).strip()
        # Split the comma-separated services string, dropping blanks and padding.
        services = [x.strip() for x in (r.get("services") or "").split(",") if x.strip()]
        # SAMHSA flags adolescent/young-adult services in the services string
        services_lc = [x.lower() for x in services]
        # Store NULL rather than False when no matching service is listed.
        adolescent = any("adolescent" in x or "youth" in x or "teen" in x for x in services_lc) or None
        young_adult = any("young adult" in x or "transitional age" in x for x in services_lc) or None
        rows.append((
            None, None,  # ccn/npi unknown from this source
            name, r.get("street1"),
            r.get("city"), r.get("state"), r.get("zip"), None,
            _parse_float(r.get("latitude")), _parse_float(r.get("longitude")),
            r.get("typeFacility") or "bh_facility",
            None, None, None, None,
            adolescent, young_adult,
            services, [], [], None, None, None, None, None,
            "samhsa_locator", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_samhsa_locator") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
main()
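The figures in the module docstring are consistent with the page size: at 30 rows per page, 96,009 records need 3,201 pages, the last one holding 9 rows. A small sketch of that arithmetic, using a hypothetical `expected_pages` helper that could serve as a sanity check against the `totalPages` value the server reports:

```python
def expected_pages(record_count: int, page_size: int) -> int:
    """Ceiling division: number of pages needed to hold record_count rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # -(-a // b) rounds up using integer arithmetic only.
    return -(-record_count // page_size)


def rows_on_last_page(record_count: int, page_size: int) -> int:
    """Rows on the final page; a full page when the count divides evenly."""
    if record_count == 0:
        return 0
    return record_count % page_size or page_size
```

If the server's `totalPages` disagrees with this calculation, the likely cause is a different effective page size than the one requested, which would be worth logging before a full crawl.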
|
||||||
102
jobs/ingestion/samhsa_nssats_nmhss.py
Normal file
102
jobs/ingestion/samhsa_nssats_nmhss.py
Normal file
@@ -0,0 +1,102 @@
#!/usr/bin/env python3
# READY TO DEPLOY: requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
SAMHSA N-SSATS + N-MHSS bulk downloads.

SAMHSA Data Archive hosts annual CSV/SAS files. The landing pages do not
expose a machine-listing API, so we maintain a manifest of known direct URLs
and parse whichever are present. Update the MANIFEST when new years drop.
"""
import csv
import io
import logging
import sys
import zipfile
from _common import RateLimitedSession, bulk_insert, job_run

LOG = logging.getLogger("bhi.samhsa_surveys")

# Known bulk files. Confirmed on samhsa.gov/data as of 2026. Update as needed.
MANIFEST = [
    # (year, survey, url)
    ("2022", "N-MHSS", "https://www.samhsa.gov/data/sites/default/files/reports/rpt42936/2022-nmhss-datafile-csv.zip"),
    ("2022", "N-SSATS", "https://www.samhsa.gov/data/sites/default/files/reports/rpt42725/2022-nssats-datafile-csv.zip"),
]


def test_endpoint():
    s = RateLimitedSession()
    ok = True
    for year, survey, url in MANIFEST:
        r = s.head(url, allow_redirects=True)
        print(f"{survey} {year}: {r.status_code}")
        ok = ok and r.status_code == 200
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=0.5)
    out = []
    for year, survey, url in MANIFEST:
        LOG.info("fetching %s %s", survey, year)
        try:
            r = s.get(url)
            # An error page is not a zip archive; skip it with an explicit message
            if r.status_code != 200:
                LOG.warning("skipping %s %s: HTTP %s", survey, year, r.status_code)
                continue
            with zipfile.ZipFile(io.BytesIO(r.content)) as z:
                csvname = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
                if not csvname:
                    LOG.warning("no CSV in archive for %s %s", survey, year)
                    continue
                with z.open(csvname) as f:
                    # newline="" lets the csv module handle embedded line breaks
                    reader = csv.DictReader(io.TextIOWrapper(f, encoding="latin-1", newline=""))
                    for row in reader:
                        row["_survey"] = survey
                        row["_year"] = year
                        out.append(row)
        except Exception as e:
            LOG.warning("failed %s %s: %s", survey, year, e)
    LOG.info("total rows: %d", len(out))
    return out


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        def y(field):
            v = r.get(field) or r.get(field.upper()) or r.get(field.lower())
            return v == "1" or str(v).lower() == "yes"
        name = r.get("NAME") or r.get("name") or r.get("FACNAME") or ""
        rows.append((
            None, None, name,
            r.get("STREET1") or r.get("street1"),
            r.get("CITY") or r.get("city"),
            r.get("STATE") or r.get("state"),
            r.get("ZIP") or r.get("zip"),
            None, None, None,
            "sud" if r["_survey"] == "N-SSATS" else "mh",
            None, None, None, None,
            y("YOUTH") or y("ADOLESCENT"),
            y("YAD") or y("YOUNGADULT"),
            [], [], [], None, None, None, None, None,
            f"samhsa_{r['_survey'].lower()}_{r['_year']}", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_samhsa_surveys") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
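The `y()` helper in `write_rows` returns `False` both for an explicit "no" and for a column that is missing from the CSV. If that distinction matters downstream, a three-valued variant could look like the sketch below. `parse_flag` is a hypothetical name, and the `"0"`/`"no"` encodings are an assumption about the survey files rather than something confirmed in this commit:

```python
def parse_flag(row, field):
    """Return True/False for an explicit yes/no value, None if absent or unrecognised."""
    # Try the field name as given, then upper- and lower-cased, like y() does.
    for key in (field, field.upper(), field.lower()):
        value = row.get(key)
        if value not in (None, ""):
            break
    else:
        # No variant of the column carried a value.
        return None
    text = str(value).strip().lower()
    if text in ("1", "yes"):
        return True
    if text in ("0", "no"):  # assumed encodings for an explicit "no"
        return False
    return None


if __name__ == "__main__":
    print(parse_flag({"YOUTH": "1"}, "YOUTH"))
    print(parse_flag({"youth": "No"}, "YOUTH"))
    print(parse_flag({}, "YOUTH"))
```

Because `adolescent_unit` and `young_adult_unit` are nullable booleans, `None` can be written as NULL without any schema change.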
212
schemas/bhi_tables.sql
Normal file
@@ -0,0 +1,212 @@
-- =============================================================================
-- Behavioral Health Intelligence (BHI) Layer - Postgres schema extension
-- =============================================================================
-- This file adds BHI tables to the existing `brain` database that the base
-- Economic Brain agent is creating. DO NOT run until the base Brain schema
-- is finalized. Then run: psql -d brain -f schemas/bhi_tables.sql
--
-- All tables are prefixed `bhi_` to avoid any collision with the base Brain.
-- Foreign keys are intentionally soft (no REFERENCES) where the target table
-- belongs to the base Brain, so this file can be applied independently.
-- =============================================================================

BEGIN;

-- -----------------------------------------------------------------------------
-- 1. Facilities master table
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_facilities (
    facility_id               SERIAL PRIMARY KEY,
    ccn                       VARCHAR(20),        -- CMS Certification Number
    npi                       VARCHAR(20),        -- National Provider Identifier
    name                      TEXT NOT NULL,
    address                   TEXT,
    city                      TEXT,
    state                     TEXT,
    zip                       TEXT,
    county_fips               TEXT,
    lat                       DOUBLE PRECISION,
    lon                       DOUBLE PRECISION,
    facility_type             TEXT,               -- IPF, PRTF, CMHC, SUD, acute, nursing_home, etc.
    ownership                 TEXT,               -- for-profit, non-profit, gov
    bed_count                 INT,
    psych_bed_count           INT,
    pediatric_psych_bed_count INT,
    adolescent_unit           BOOLEAN,
    young_adult_unit          BOOLEAN,
    services_offered          TEXT[],
    populations_served        TEXT[],             -- ['adolescent','young_adult','adult','geriatric']
    payment_accepted          TEXT[],
    medicaid_accepted         BOOLEAN,
    accreditation             TEXT,
    opened_date               DATE,
    closed_date               DATE,
    last_verified             DATE,
    source                    TEXT,               -- 'cms_ipfqr','samhsa_locator','nppes', etc.
    source_raw_id             INT,
    created_at                TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_state ON bhi_facilities (state);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_county ON bhi_facilities (county_fips);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_ccn ON bhi_facilities (ccn);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_npi ON bhi_facilities (npi);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_type ON bhi_facilities (facility_type);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_pops ON bhi_facilities USING GIN (populations_served);

-- -----------------------------------------------------------------------------
-- 2. Facility quality measures (IPFQR, Care Compare)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_facility_quality (
    id           SERIAL PRIMARY KEY,
    facility_id  INT REFERENCES bhi_facilities(facility_id) ON DELETE CASCADE,
    measure_id   TEXT,                   -- e.g. HBIPS-2, SUB-3, SMD, TOB-3
    measure_name TEXT,
    value        NUMERIC,
    benchmark    NUMERIC,
    period       TEXT,                   -- '2024Q1', 'FY2024'
    reported_at  DATE,
    source       TEXT,
    created_at   TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_quality_facility ON bhi_facility_quality (facility_id);
CREATE INDEX IF NOT EXISTS idx_bhi_quality_measure ON bhi_facility_quality (measure_id);

-- -----------------------------------------------------------------------------
-- 3. Facility financials from Medicare Cost Reports
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_facility_financials (
    id                  SERIAL PRIMARY KEY,
    facility_id         INT REFERENCES bhi_facilities(facility_id) ON DELETE CASCADE,
    year                INT,
    medicare_discharges INT,
    medicaid_discharges INT,
    psych_discharges    INT,
    psych_los_avg       NUMERIC,
    psych_revenue       BIGINT,
    psych_costs         BIGINT,
    source              TEXT,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_financials_facility ON bhi_facility_financials (facility_id);
CREATE INDEX IF NOT EXISTS idx_bhi_financials_year ON bhi_facility_financials (year);

-- -----------------------------------------------------------------------------
-- 4. Demand indicators (CDC WONDER, BRFSS, YRBSS, IDEA, NSCH)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_demand_indicators (
    id          SERIAL PRIMARY KEY,
    geo_type    TEXT,                    -- 'state','county','msa','district'
    geo_code    TEXT,                    -- FIPS or code
    measure     TEXT,                    -- 'suicide_rate','overdose_rate','depression_pct', etc.
    age_bracket TEXT,                    -- '13-17','18-25','all'
    period      TEXT,
    value       NUMERIC,
    source      TEXT,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_demand_geo ON bhi_demand_indicators (geo_type, geo_code);
CREATE INDEX IF NOT EXISTS idx_bhi_demand_measure ON bhi_demand_indicators (measure);
CREATE INDEX IF NOT EXISTS idx_bhi_demand_age ON bhi_demand_indicators (age_bracket);

-- -----------------------------------------------------------------------------
-- 5. Workforce (BLS OES)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_workforce (
    id                 SERIAL PRIMARY KEY,
    msa_code           TEXT,
    msa_name           TEXT,
    occupation_code    TEXT,             -- SOC code, e.g. 29-1223 (psychiatrists)
    occupation_title   TEXT,
    employment         INT,
    annual_wage_median NUMERIC,
    annual_wage_mean   NUMERIC,
    period             TEXT,             -- 'May2024'
    source             TEXT,
    created_at         TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_workforce_msa ON bhi_workforce (msa_code);
CREATE INDEX IF NOT EXISTS idx_bhi_workforce_occ ON bhi_workforce (occupation_code);

-- -----------------------------------------------------------------------------
-- 6. HRSA HPSA mental health shortage areas
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_shortages (
    id                SERIAL PRIMARY KEY,
    hpsa_id           TEXT,
    state             TEXT,
    county_fips       TEXT,
    score             INT,               -- HPSA score 0-25 (higher = worse shortage)
    population_served INT,
    designated_date   DATE,
    withdrawn_date    DATE,
    source            TEXT,
    created_at        TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_shortages_state ON bhi_shortages (state);
CREATE INDEX IF NOT EXISTS idx_bhi_shortages_county ON bhi_shortages (county_fips);
CREATE INDEX IF NOT EXISTS idx_bhi_shortages_score ON bhi_shortages (score);

-- -----------------------------------------------------------------------------
-- 7. State RTF (Residential Treatment Facility) licensing data
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_rtf_licensing (
    id              SERIAL PRIMARY KEY,
    state           TEXT,
    license_number  TEXT,
    facility_name   TEXT,
    facility_type   TEXT,
    capacity        INT,
    populations     TEXT[],
    services        TEXT[],
    inspection_date DATE,
    violations      JSONB,
    status          TEXT,
    opened_date     DATE,
    closed_date     DATE,
    source          TEXT,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_rtf_state ON bhi_rtf_licensing (state);
CREATE INDEX IF NOT EXISTS idx_bhi_rtf_name ON bhi_rtf_licensing (facility_name);

-- -----------------------------------------------------------------------------
-- 8. Policy events (Medicaid rules, state legislation)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_policy_events (
    id             SERIAL PRIMARY KEY,
    event_type     TEXT,                 -- 'medicaid_rule','state_law','federal_rule'
    state          TEXT,
    title          TEXT,
    summary        TEXT,
    effective_date DATE,
    url            TEXT,
    source         TEXT,
    created_at     TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_policy_state ON bhi_policy_events (state);
CREATE INDEX IF NOT EXISTS idx_bhi_policy_eff_date ON bhi_policy_events (effective_date);

-- -----------------------------------------------------------------------------
-- 9. Crisis calls / EMS transports (NEMSIS aggregates)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_crisis_calls (
    id                  SERIAL PRIMARY KEY,
    state               TEXT,
    county_fips         TEXT,
    period              TEXT,
    call_count          INT,
    mental_health_calls INT,
    transport_outcomes  JSONB,
    source              TEXT,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_crisis_state ON bhi_crisis_calls (state);
CREATE INDEX IF NOT EXISTS idx_bhi_crisis_county ON bhi_crisis_calls (county_fips);

COMMIT;

-- =============================================================================
-- Verify
-- =============================================================================
-- \dt bhi_*
-- SELECT tablename FROM pg_tables WHERE tablename LIKE 'bhi_%' ORDER BY tablename;
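The verification query in the schema file can also be checked programmatically. A small sketch, assuming the table names arrive as plain strings (for example, the first column of the `pg_tables` query); `missing_bhi_tables` is a hypothetical helper, and the expected names are the nine tables created by `schemas/bhi_tables.sql`:

```python
# The nine tables defined in schemas/bhi_tables.sql.
EXPECTED_BHI_TABLES = {
    "bhi_facilities",
    "bhi_facility_quality",
    "bhi_facility_financials",
    "bhi_demand_indicators",
    "bhi_workforce",
    "bhi_shortages",
    "bhi_rtf_licensing",
    "bhi_policy_events",
    "bhi_crisis_calls",
}


def missing_bhi_tables(found):
    """Return the expected bhi_ tables that are absent from `found`, sorted by name."""
    return sorted(EXPECTED_BHI_TABLES - set(found))


if __name__ == "__main__":
    # Example: only two tables were created before the transaction was interrupted.
    print(missing_bhi_tables(["bhi_facilities", "bhi_workforce"]))
```

Since the schema is applied inside a single `BEGIN`/`COMMIT`, a partial result would normally point to a different database or search path rather than a half-applied file.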