BHI layer v1: docs, schema, Phase A ingestion stubs

BHI Staging Agent
2026-04-05 20:15:36 +00:00
commit 3dfd9ea3c6
21 changed files with 2399 additions and 0 deletions

156
docs/integration_plan.md Normal file

@@ -0,0 +1,156 @@
# BHI Layer — Integration Plan
Steps to merge the BHI layer into the base Economic Brain after the base build finishes.
**Prereqs** (verified before step 1):
- Base Brain is running: `psql -d brain -c '\dt'` shows core tables including `job_runs`.
- `/home/ubuntu/economic-brain/` contains a working `jobs/` directory structure.
- DATABASE_URL env var exported and pointing at the `brain` Postgres.
---
## 1. Apply the BHI schema
```bash
cd /home/ubuntu/economic-brain-bhi
psql "$DATABASE_URL" -f schemas/bhi_tables.sql
psql "$DATABASE_URL" -c "\dt bhi_*"
# Expect 9 tables: bhi_facilities, bhi_facility_quality, bhi_facility_financials,
# bhi_demand_indicators, bhi_workforce, bhi_shortages, bhi_rtf_licensing,
# bhi_policy_events, bhi_crisis_calls
```
## 2. Copy ingestion jobs into the Brain's jobs tree
```bash
mkdir -p /home/ubuntu/economic-brain/jobs/bhi
cp /home/ubuntu/economic-brain-bhi/jobs/ingestion/*.py /home/ubuntu/economic-brain/jobs/bhi/
# _common.py is included; it reads DATABASE_URL from env already
```
Install Python deps if the base Brain doesn't already have them:
```bash
pip install requests psycopg2-binary
```
## 3. Smoke test every Phase A job (no DB writes)
```bash
cd /home/ubuntu/economic-brain/jobs/bhi
for f in cms_ipfqr.py cms_hospital_compare.py cms_nursing_home.py \
samhsa_locator.py hrsa_hpsa.py nppes.py cdc_brfss.py \
cdc_yrbss.py cdc_wonder_mortality.py bls_oes.py cms_pos.py \
samhsa_nssats_nmhss.py idea_part_b.py nsch.py; do
echo "=== $f ==="
python3 "$f" test || echo "FAIL: $f"
done
```
Every job should print `OK:` and exit 0. If any fail, fix the endpoint/URL in the job file before proceeding.
## 4. Run jobs in dependency order
```bash
# Facilities first (feed bhi_facilities.facility_id FK for quality/financials)
python3 cms_ipfqr.py
python3 cms_hospital_compare.py
python3 cms_nursing_home.py
python3 samhsa_locator.py
python3 cms_pos.py
python3 samhsa_nssats_nmhss.py
python3 nppes.py
# Shortages + demand (independent)
python3 hrsa_hpsa.py
python3 cdc_wonder_mortality.py
python3 cdc_brfss.py
python3 cdc_yrbss.py
python3 idea_part_b.py
python3 nsch.py
# Workforce
python3 bls_oes.py
```
Monitor `job_runs`:
```sql
SELECT job_name, status, started_at, finished_at, error
FROM job_runs WHERE job_name LIKE 'bhi_%' ORDER BY started_at DESC;
```
## 5. Import n8n workflows (scheduled refresh)
Create workflows in n8n (or add to existing scheduler):
| Workflow | Cron | Script |
|---|---|---|
| BHI: CMS facilities refresh | `0 3 * * 1` (weekly Mon 3am) | `cms_ipfqr.py`, `cms_hospital_compare.py`, `cms_nursing_home.py` |
| BHI: SAMHSA locator refresh | `0 4 1 * *` (monthly) | `samhsa_locator.py` |
| BHI: HRSA HPSA refresh | `0 5 * * 2` (weekly Tue 5am) | `hrsa_hpsa.py` |
| BHI: CDC demand refresh | `0 6 1 * *` (monthly) | `cdc_brfss.py`, `cdc_yrbss.py`, `cdc_wonder_mortality.py` |
| BHI: Workforce refresh | `0 7 1 */3 *` (quarterly) | `bls_oes.py` |
| BHI: CMS POS refresh | `0 8 1 */3 *` (quarterly) | `cms_pos.py` |
Workflow template: Cron node -> Execute Command (`python3 /home/ubuntu/economic-brain/jobs/bhi/<script>.py`) -> if non-zero, send alert to Slack / email.
## 6. Add command center page
Create `/home/ubuntu/command-center/pages/brain/behavioral-health.html` (or equivalent in the Brain's command-center framework) with sections:
1. **Facility map** — Leaflet map of `bhi_facilities` colored by `facility_type`, filterable by `adolescent_unit` / `young_adult_unit`.
2. **HPSA heatmap** — county-level choropleth of `bhi_shortages.score`.
3. **Demand indicators panel** — small multiples of suicide rate, overdose rate, BRFSS depression by state, split by age bracket.
4. **Composite ranking table** — top 50 opportunities by `composite_score` (see scoring.md).
5. **Recent policy events feed** — last 20 rows from `bhi_policy_events` ordered by `effective_date DESC`.
6. **Job status widget** — last run of each `bhi_*` job from `job_runs`.
Route: `/brain/behavioral-health`.
## 7. Test queries (acceptance smoke tests)
```sql
-- Facility count by type
SELECT facility_type, count(*) FROM bhi_facilities GROUP BY 1 ORDER BY 2 DESC;
-- Top 20 worst MH HPSAs
SELECT state, county_fips, score, population_served
FROM bhi_shortages WHERE withdrawn_date IS NULL
ORDER BY score DESC LIMIT 20;
-- Adolescent suicide rates, top states
SELECT geo_code, value FROM bhi_demand_indicators
WHERE measure='suicide_rate' AND age_bracket='13-17'
ORDER BY value DESC LIMIT 20;
-- IPF adolescent-unit coverage by state, fewest adolescent units first (cross-check)
SELECT state, count(*) FILTER (WHERE adolescent_unit) AS adolescent_units,
count(*) AS total
FROM bhi_facilities WHERE facility_type='IPF' GROUP BY state ORDER BY 2 ASC;
-- Workforce: psychiatrists, MSAs with the highest median annual wage
SELECT msa_name, annual_wage_median
FROM bhi_workforce WHERE occupation_code='29-1223'
ORDER BY annual_wage_median DESC LIMIT 20;
-- job run health
SELECT job_name, status, count(*)
FROM job_runs WHERE job_name LIKE 'bhi_%'
GROUP BY 1, 2;
```
If every query returns rows and no `job_runs` row shows `status='error'`, the BHI layer is live.
## 8. Git merge to main Brain repo
```bash
cd /home/ubuntu/economic-brain
git checkout -b bhi-layer-merge
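# Make sure the target directories exist before copying (docs/bhi is new)
mkdir -p schemas jobs/bhi docs/bhi
cp /home/ubuntu/economic-brain-bhi/schemas/bhi_tables.sql schemas/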
cp -r /home/ubuntu/economic-brain-bhi/jobs/ingestion/* jobs/bhi/
cp -r /home/ubuntu/economic-brain-bhi/docs/* docs/bhi/
git add schemas/bhi_tables.sql jobs/bhi docs/bhi
git commit -m "Integrate BHI layer"
git push origin bhi-layer-merge
# Open PR for review on Gitea
```

125
docs/scoring.md Normal file

@@ -0,0 +1,125 @@
# BHI Composite Scoring Function
## Formula
```
composite_score =
(demand_severity * 0.25) +
(supply_shortage * 0.25) +
(pain_signal_volume * 0.20) +
(capacity_trend * 0.10) +
(workforce_shortage * 0.10) +
(regulatory_tailwind * 0.05) +
(govt_demand * 0.05)
```
All components are normalized to 0-100 before weighting. Final `composite_score` is 0-100.
Each component is computed at the **geo x niche x age-bracket** level (state, county, or MSA depending on data).
Thesis (all of the above): demand is outpacing supply, the delivery model is shifting, and regulation is restructuring the market. Demand and supply carry the most weight (50% combined), followed by real-time pain signals (20%); the remaining 30% is split across capacity trend, workforce shortage, regulatory tailwind, and government demand.
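The weighted sum above can be sketched in Python as follows. This is a minimal illustration, not the layer's actual scoring code: `WEIGHTS` and `composite_score` are hypothetical names, and the sketch assumes every component has already been normalized to 0-100.

```python
# Hypothetical sketch: weights are copied from the formula above,
# the names WEIGHTS and composite_score are illustrative only.
WEIGHTS = {
    "demand_severity": 0.25,
    "supply_shortage": 0.25,
    "pain_signal_volume": 0.20,
    "capacity_trend": 0.10,
    "workforce_shortage": 0.10,
    "regulatory_tailwind": 0.05,
    "govt_demand": 0.05,
}


def composite_score(components: dict[str, float]) -> float:
    """Weighted sum of components that are each already on a 0-100 scale."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components[name]  # KeyError if a component is missing
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in 0-100, got {value}")
        total += weight * value
    return total


# Example: strong demand and supply signals, everything else neutral (50).
example = {name: 50.0 for name in WEIGHTS}
example["demand_severity"] = 90.0
example["supply_shortage"] = 80.0
print(round(composite_score(example), 2))
```

Because the weights sum to 1, the result stays on the same 0-100 scale as the inputs.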
---
## Component definitions
### 1. demand_severity (25%)
Feeder: `bhi_demand_indicators` (CDC WONDER, BRFSS, YRBSS, NSCH).
For a given geo + age bracket, combine:
- Suicide rate per 100k (CDC WONDER, ICD-10 X60-X84)
- Drug overdose death rate per 100k (CDC WONDER, X40-X44 + Y10-Y14)
- YRBSS "seriously considered suicide" % (adolescent)
- BRFSS "mental health not good 14+ days" % (young adult via 18-24 bracket)
- NSCH unmet mental health treatment need %
Normalize each to 0-100 against the national distribution (percentile rank), then average.
Trend bonus: add 10 points if the 5-yr CAGR exceeds 5%, then clamp the result to 100.
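A rough sketch of the percentile-rank-then-average step, under stated assumptions: the percentile definition used here (share of national values at or below the local value) and the cap at 100 after the trend bonus are choices made for illustration, and `percentile_rank` and `demand_severity` are hypothetical names.

```python
from bisect import bisect_right


def percentile_rank(value: float, national: list[float]) -> float:
    """Share of national values <= value, scaled to 0-100 (assumed definition)."""
    ordered = sorted(national)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def demand_severity(indicators: dict[str, float],
                    national: dict[str, list[float]],
                    cagr_5yr: float) -> float:
    """Average the per-indicator percentile ranks, then apply the trend bonus."""
    ranks = [percentile_rank(v, national[k]) for k, v in indicators.items()]
    score = sum(ranks) / len(ranks)
    if cagr_5yr > 0.05:      # 5-yr CAGR above 5%
        score += 10.0
    return min(100.0, score)  # assumption: keep the component within 0-100
```

For instance, a geo whose single indicator sits at the 75th percentile nationally scores 75, or 85 when the 5-year growth rate exceeds 5%.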
### 2. supply_shortage (25%)
Feeders: `bhi_shortages` (HRSA HPSA) + `bhi_facilities` (SAMHSA + CMS).
For a geo:
- HPSA mental health score (0-25, already normalized; rescale x4 -> 0-100)
- Inverse of facility density: beds per 100k population (percentile-invert)
- Inverse of adolescent/young-adult-specific bed density (if scoring those brackets)
Weighted average: 50% HPSA score, 30% total bed density, 20% age-targeted bed density.
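The 50/30/20 blend can be written out as a short sketch. It assumes the two bed-density inputs arrive as percentile ranks (0-100, higher meaning more beds), so "percentile-invert" becomes `100 - percentile`; the function name is hypothetical.

```python
def supply_shortage(hpsa_score: float,
                    bed_density_pctile: float,
                    age_bed_density_pctile: float) -> float:
    """Blend HPSA score and inverted bed-density percentiles into 0-100."""
    hpsa_component = hpsa_score * 4.0                 # rescale 0-25 -> 0-100
    density_component = 100.0 - bed_density_pctile    # fewer beds -> higher score
    age_component = 100.0 - age_bed_density_pctile    # same, age-targeted beds
    return (0.5 * hpsa_component
            + 0.3 * density_component
            + 0.2 * age_component)


# HPSA score 20, 10th percentile total beds, 5th percentile adolescent beds
print(supply_shortage(20, 10, 5))
```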
### 3. pain_signal_volume (20%)
Feeders: base Brain's `reddit_posts`, `app_reviews`, and `risk_factors` tables (already being built).
For a niche (e.g., "adolescent inpatient"):
- Count of posts/reviews/risk-factor hits matching niche keywords in last 90 days
- Z-score against the full base Brain niche distribution
- Clamp to 0-100
Depends on base Brain being live — until then, this component defaults to 50 (neutral).
### 4. capacity_trend (10%)
Feeder: `bhi_facilities` (opened_date, closed_date) + CMS POS termination records.
For the geo x niche:
- Facilities opened in last 24 months minus closed in last 24 months, normalized by baseline facility count
- Negative net = high score (more opportunity), positive net = low score (saturated)
- Formula: `100 * (1 - (net_change + baseline) / (2 * baseline))` clamped 0-100
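To make the formula concrete, here is a small sketch with worked values. The handling of a zero baseline (returning a neutral 50) is an assumption added for the example, since the formula is undefined there; `capacity_trend` is a hypothetical name.

```python
def capacity_trend(opened: int, closed: int, baseline: int) -> float:
    """100 * (1 - (net_change + baseline) / (2 * baseline)), clamped to 0-100."""
    if baseline <= 0:
        return 50.0  # assumption: neutral score when there is no baseline
    net_change = opened - closed
    raw = 100.0 * (1 - (net_change + baseline) / (2 * baseline))
    return max(0.0, min(100.0, raw))


print(capacity_trend(opened=0, closed=0, baseline=10))    # no net change
print(capacity_trend(opened=0, closed=10, baseline=10))   # every facility closed
print(capacity_trend(opened=10, closed=0, baseline=10))   # facility count doubled
```

A flat market lands at 50, a market that lost its whole baseline lands at 100, and one that doubled lands at 0; anything beyond those extremes is clamped.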
### 5. workforce_shortage (10%)
Feeder: `bhi_workforce` (BLS OES).
For the MSA:
- Wage growth YoY for SOC codes 29-1223, 21-1014, 21-1018, 19-3033 (percentile rank)
- Employment per 100k (inverse percentile)
- Average them
High wage growth + low employment density = high shortage score = high opportunity for new supply.
### 6. regulatory_tailwind (5%)
Feeder: `bhi_policy_events`.
Count of favorable policy events in the last 18 months for the geo:
- Medicaid rate increases for BH services
- New state mandates for adolescent crisis services
- Expanded provider types (peer support, mobile crisis)
- Federal rules (e.g., Mental Health Parity enforcement)
`count * 20`, clamped to 0-100.
### 7. govt_demand (5%)
Feeder: base Brain's `sam_gov_opportunities` table (if present) + `bhi_policy_events`.
Active + awarded SAM.gov opportunities in NAICS 621112 (Physician offices - mental), 621420 (Outpatient mental health/SUD), 623220 (Residential mental health), 623210 (Residential intellectual/developmental), 624190 (Other individual/family services). Dollar-value-weighted and geo-filtered.
Log-scale: `min(100, 10 * log10(total_dollar_value + 1))`.
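A quick sketch of how the log scale behaves, assuming `total_dollar_value` is a non-negative dollar amount; `govt_demand` is an illustrative function name.

```python
import math


def govt_demand(total_dollar_value: float) -> float:
    """min(100, 10 * log10(total + 1)); assumes a non-negative dollar total."""
    if total_dollar_value < 0:
        raise ValueError("total_dollar_value must be non-negative")
    return min(100.0, 10.0 * math.log10(total_dollar_value + 1))


for dollars in (0, 100_000, 1_000_000, 50_000_000, 10_000_000_000):
    print(f"${dollars:>14,} -> {govt_demand(dollars):.1f}")
```

Each tenfold increase in contract dollars adds roughly 10 points, so the component saturates at 100 once the total reaches about $10 billion.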
---
## Age bracket handling
Every row in `bhi_demand_indicators` carries an `age_bracket`. When scoring a niche tagged for adolescents (13-17), the demand_severity and pain_signal components filter to that bracket. Young-adult scores pull 18-25. "All" niches average both brackets 50/50.
Young-adult gap note: for young-adult scoring, supply_shortage should apply an extra +15 penalty on facility density since very few IPFs have dedicated young-adult units — this is captured via the `young_adult_unit` boolean in `bhi_facilities`.
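The bracket selection rule can be sketched as below. The bracket labels `"13-17"` and `"18-25"` follow the ranges used in this document, while the `"all"` label and the function name are assumptions made for the example.

```python
def bracket_score(scores_by_bracket: dict[str, float], niche_bracket: str) -> float:
    """Pick the score for a niche's age bracket; 'all' averages both 50/50."""
    if niche_bracket == "all":  # hypothetical label for niches with no age tag
        return (0.5 * scores_by_bracket["13-17"]
                + 0.5 * scores_by_bracket["18-25"])
    return scores_by_bracket[niche_bracket]


scores = {"13-17": 80.0, "18-25": 60.0}
print(bracket_score(scores, "13-17"))
print(bracket_score(scores, "all"))
```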
---
## Output table (to be added)
Scores write to `bhi_scores` (created at runtime, not in bhi_tables.sql v1 — add once inputs are flowing):
```sql
CREATE TABLE bhi_scores (
id SERIAL PRIMARY KEY,
niche TEXT,
geo_type TEXT,
geo_code TEXT,
age_bracket TEXT,
composite NUMERIC,
demand_severity NUMERIC,
supply_shortage NUMERIC,
pain_signal NUMERIC,
capacity_trend NUMERIC,
workforce_short NUMERIC,
reg_tailwind NUMERIC,
govt_demand NUMERIC,
computed_at TIMESTAMPTZ DEFAULT NOW()
);
```

273
docs/sources.md Normal file

@@ -0,0 +1,273 @@
# BHI Data Sources
All endpoints tested 2026-04-04 unless noted. "Tested: OK" means a live curl returned valid data.
Scope: behavioral health facilities, demand indicators, workforce, shortages, and policy for all 50 US states, tagged by age bracket (adolescent 13-17, young adult 18-25).
---
## PHASE A — Free, autonomous, ready to ingest
### 1. CMS IPFQR (Inpatient Psychiatric Facility Quality Reporting)
- **Endpoint:** `https://data.cms.gov/provider-data/api/1/datastore/query/{dataset_id}/0`
- **Dataset IDs:**
- `q9vs-r7wp` — IPFQR by Facility
- `dc76-gh7x` — IPFQR by State
- `s5xg-sys6` — IPFQR National
- **Auth:** None
- **Rate limit:** None documented; be polite (<= 5 req/sec)
- **Update frequency:** Quarterly
- **Record count:** ~1,600 IPFs (facility file); dozens of measures each
- **Key fields:** `facility_id`, `facility_name`, `address`, `state`, `zip`, `countyparish`, HBIPS-2/3 restraint+seclusion, SMD, SUB-2/3, TOB-3, transition record, 30-day readmission
- **Test curl (OK):**
```
curl -s "https://data.cms.gov/provider-data/api/1/datastore/query/q9vs-r7wp/0?limit=2"
```
- **Python snippet:**
```python
import requests
r = requests.get(
    "https://data.cms.gov/provider-data/api/1/datastore/query/q9vs-r7wp/0",
    params={"limit": 500, "offset": 0},
    timeout=60,
)
r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
rows = r.json()["results"]
```
### 2. CMS Hospital Compare / Care Compare (general hospital info)
- **Endpoint:** `https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0`
- **Auth:** None | **Rate limit:** none | **Update:** Monthly
- **Records:** ~5,300 hospitals
- **Key fields:** `facility_id` (CCN), `facility_name`, `hospital_type`, `hospital_ownership`, `hospital_overall_rating`, mortality/safety/readmission group flags
- **Test (OK):**
```
curl -s "https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0?limit=2"
```
- Use to classify which acute hospitals have behavioral health units (join with IPFQR on CCN).
### 3. CMS Provider of Services (POS) file
- **Bulk page:** `https://data.cms.gov/provider-characteristics/hospitals-and-other-facilities/provider-of-services-file-quality-improvement-and-evaluation-system`
- **JSON catalog:** `https://data.cms.gov/data.json` (search `dataset[].title` = "Provider of Services File")
- **Auth:** None | **Update:** Quarterly | **Format:** CSV bulk
- **Records:** ~80,000 Medicare-certified facilities (includes PSY, PRTF, hospitals)
- **Key fields:** CCN, provider category, bed count, certification date, termination date, ownership
- **Test (OK):** `curl -s "https://data.cms.gov/data.json"` — dataset list
- Required for bed counts and termination (closure) tracking.
### 4. CMS Nursing Home Compare (Provider Information)
- **Endpoint:** `https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0`
- **Auth:** None | **Update:** Monthly
- **Records:** ~15,000 nursing homes
- **Key fields:** CCN, provider_name, ownership, number_of_certified_beds, overall rating, chain info
- **Test (OK):** `curl -s "https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0?limit=2"`
- Used to capture residential behavioral health (SNFs frequently host psych/BH residents).
### 5. SAMHSA Treatment Locator (findtreatment.gov)
- **Endpoint:** `https://findtreatment.gov/locator/exportsAsJson/v2?sType=BH&sAddr={zip}`
- **Auth:** None (browser UA helps but not required for JSON export)
- **Rate limit:** None documented; HEAD returns 403 but GET returns 200 — use GET only
- **Update:** Continuous (SAMHSA-maintained)
- **Records:** ~96,000 BH treatment facilities (all service types)
- **Key fields:** name1/name2, street, city, state, zip, phone, intake, hotline, website, lat, lon, services, typeFacility
- **Test (OK):**
```
curl -s "https://findtreatment.gov/locator/exportsAsJson/v2?sType=BH&sAddr=10001"
```
Response: `{"page":1,"totalPages":3201,"recordCount":96009,"rows":[...]}`
- **Python snippet:**
```python
import requests, time
def fetch_all(zip_seed="10001"):
    base = "https://findtreatment.gov/locator/exportsAsJson/v2"
    page = 1
    while True:
        r = requests.get(base, params={"sType":"BH","sAddr":zip_seed,"pageSize":30,"page":page})
        d = r.json()
        yield from d["rows"]
        if page >= d["totalPages"]: break
        page += 1
        time.sleep(0.3)
```
### 6. SAMHSA N-SSATS + N-MHSS
- **Bulk:** `https://www.samhsa.gov/data/data-we-collect/n-ssats/datafiles` and `/n-mhss/datafiles`
- **Auth:** None | **Update:** Annual | **Format:** SAS / SPSS / CSV
- **Records:** N-SSATS ~16,000 SUD facilities/year; N-MHSS ~12,000 MH facilities/year
- **Key fields:** facility id, services, payment accepted, populations served (including adolescent/young adult flags), bed counts, ownership
- **Note:** Bulk ZIPs; no live API. Staged as manual-download job.
### 7. CDC WONDER (mortality — suicide, overdose, by county, age)
- **Endpoint:** `https://wonder.cdc.gov/controller/datarequest/D76` (Underlying Cause of Death) — POST XML
- **Auth:** None for non-restricted datasets; county-level suppressed for <10 deaths
- **Update:** Annual
- **Records:** All US mortality; we pull ICD-10 X60-X84 (suicide) + X40-X44/Y10-Y14 (overdose) by county, 13-17 and 18-25
- **Test (OK):** landing page returns 200; POST XML required for data. See job stub `cdc_wonder_mortality.py` for the working XML template.
### 8. CDC BRFSS
- **Endpoint (Socrata):** `https://data.cdc.gov/resource/dttw-5yxu.json`
- **Auth:** None (Socrata app token optional for higher limits) | **Update:** Annual
- **Records:** ~100k rows/year (state x question x breakout)
- **Test (OK):**
```
# Single quotes keep the shell from expanding $limit to an empty string
curl -s 'https://data.cdc.gov/resource/dttw-5yxu.json?$limit=2'
```
Returns depression prevalence, mental health days, etc. by state+demographic.
### 9. CDC YRBSS (Youth Risk Behavior Survey)
- **Endpoints (Socrata, verified present via catalog):**
- High school: `https://data.cdc.gov/resource/3qty-g4aq.json`
- Middle school: `https://data.cdc.gov/resource/uqmk-4y2w.json`
- **Auth:** None | **Update:** Biennial
- **Records:** State + large urban district level; ~50k rows
- **Key fields:** suicidal ideation, attempt, persistent sadness, substance use — exactly the adolescent demand signal we need.
### 10. IDEA Part B data (Emotional Disturbance by district)
- **Landing:** `https://www2.ed.gov/programs/osepidea/618-data/static-tables/index.html`
- **Auth:** None | **Format:** CSV static tables | **Update:** Annual
- **Records:** ~14,000 school districts + state rollups
- **Key fields:** Child count under ED classification, ages 6-21, by state and LEA
- **Note:** Static CSVs; no API. Download script documents exact file URLs.
### 11. NSCH (National Survey of Children's Health) via HRSA
- **Landing:** `https://www.childhealthdata.org/browse/survey` and `https://mchb.hrsa.gov/data-research/national-survey-childrens-health`
- **Bulk (HRSA):** `https://mchb.hrsa.gov/sites/default/files/nsch/datafiles/` (year-specific)
- **Auth:** None | **Update:** Annual | **Format:** SAS / Stata / CSV
- **Records:** ~50k surveyed children, weighted to state-level estimates
- **Key fields:** anxiety, depression, behavioral problems, received treatment, unmet need — by state x age.
### 12. BLS OES (behavioral health workforce by MSA)
- **API:** `https://api.bls.gov/publicAPI/v2/timeseries/data/` (POST JSON)
- **Auth:** Optional. A free registration key (`https://data.bls.gov/registrationEngine/`) raises limits to 50 series/query, 20 years/query, and 500 queries/day. Without a key: 25 series/query, 10 years/query, 25 queries/day.
- **Update:** Annual (May reference period)
- **Series ID pattern:** `OEUM{area}{industry}{occupation}{datatype}`
- **Relevant SOC codes:**
- 29-1223 Psychiatrists
- 29-1229 Other Physicians (incl. addiction medicine)
- 21-1014 Mental Health Counselors
- 21-1015 Rehabilitation Counselors
- 21-1018 Substance Abuse/Behavioral Disorder Counselors
- 21-1022 Mental Health and SUD Social Workers
- 19-3033 Clinical & Counseling Psychologists
- **Test (OK):** BLS API responds (test hit confirmed structure; real series IDs required)
- **Bulk alternative:** `https://www.bls.gov/oes/special-requests/oesm{YY}ma.zip` (annual bulk by MSA) — no auth, ~50MB zip.
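As an illustration of the series ID pattern, the sketch below assembles an ID from its parts. The field widths (7-digit area, 6-digit industry, 6-digit occupation, 2-digit datatype) and the example datatype code `04` reflect the published OE series layout as best understood here and should be checked against the BLS `oe` series documentation before use; `oes_series_id` is a hypothetical helper, and `0035620` is used only as a sample MSA code.

```python
def oes_series_id(area_code: str, occupation_soc: str, datatype: str,
                  industry_code: str = "000000") -> str:
    """Build an OEUM{area}{industry}{occupation}{datatype} series ID.

    Field widths are assumptions; verify them against the BLS OE layout.
    """
    occupation = occupation_soc.replace("-", "")  # "29-1223" -> "291223"
    fields = {
        "area_code": (area_code, 7),
        "industry_code": (industry_code, 6),
        "occupation": (occupation, 6),
        "datatype": (datatype, 2),
    }
    for name, (value, width) in fields.items():
        if len(value) != width or not value.isdigit():
            raise ValueError(f"{name} must be {width} digits, got {value!r}")
    return f"OEUM{area_code}{industry_code}{occupation}{datatype}"


# Psychiatrists (SOC 29-1223), all industries, sample MSA code, datatype "04"
print(oes_series_id("0035620", "29-1223", "04"))
```

Validating the widths up front is a deliberate choice: a malformed ID would otherwise come back from the API as an empty series rather than an error.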
### 13. HRSA Mental Health HPSAs
- **Bulk CSV (verified):** `https://data.hrsa.gov/DataDownload/DD_Files/BCD_HPSA_FCT_DET_MH.csv`
- **Size:** ~23 MB
- **Auth:** None | **Update:** Continuous (weekly snapshots)
- **Records:** ~6,500 active MH HPSAs + historical
- **Key fields:** HPSA ID, designation type, discipline (MH), score (0-25), state, county FIPS via HPSA Geography ID, population, designation date, withdrawn date, lat/lon
- **Test (OK):** HTTP 200, 23 MB CSV returned.
### 14. CMS NPPES (National Plan & Provider Enumeration System)
- **API:** `https://npiregistry.cms.hhs.gov/api/?version=2.1`
- **Auth:** None | **Rate limit:** ~200 req/sec soft; 200 results max per query — paginate with `skip`
- **Update:** Daily
- **Records:** ~8 million NPIs; filter by taxonomy for behavioral health (~500k)
- **Relevant taxonomy codes:**
- 2084P0800X Psychiatry & Neurology - Psychiatry
- 2084P0802X Addiction Psychiatry
- 2084P0804X Child & Adolescent Psychiatry
- 103T00000X Psychologist
- 101YM0800X Mental Health Counselor
- 103TC2200X Clinical Child & Adolescent Psychologist
- 1041C0700X Clinical Social Worker
- 324500000X Substance Abuse Rehabilitation Facility
- 283Q00000X Psychiatric Hospital
- 323P00000X Psychiatric Residential Treatment Facility
- **Test (OK):**
```
curl -s "https://npiregistry.cms.hhs.gov/api/?version=2.1&taxonomy_description=psychiatric&state=NY&limit=2"
```
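The `skip`-based pagination mentioned above might look like the following sketch. The page fetcher is injected so the loop can be exercised without network access; `paginate_npi` and `fetch_ny_psych` are hypothetical names, and the `max_skip=1000` ceiling is an assumption about the API's cap that should be confirmed against the NPPES documentation.

```python
from typing import Callable, Iterator


def paginate_npi(fetch_page: Callable[[int, int], list[dict]],
                 limit: int = 200, max_skip: int = 1000) -> Iterator[dict]:
    """Yield results page by page until a short page or the skip ceiling."""
    skip = 0
    while skip <= max_skip:
        results = fetch_page(limit, skip)
        yield from results
        if len(results) < limit:  # short page means no more data
            break
        skip += limit


def fetch_ny_psych(limit: int, skip: int) -> list[dict]:
    """One page of NY records matching 'psychiatric' (mirrors the curl test)."""
    import requests  # imported lazily so the paginator itself has no dependency

    r = requests.get(
        "https://npiregistry.cms.hhs.gov/api/",
        params={"version": "2.1", "taxonomy_description": "psychiatric",
                "state": "NY", "limit": limit, "skip": skip},
        timeout=60,
    )
    r.raise_for_status()
    return r.json().get("results", [])
```

Usage would be `for provider in paginate_npi(fetch_ny_psych): ...`. Because of the skip ceiling, a single query can return only a bounded number of records, so a full pull would likely need to be partitioned (for example by state and taxonomy).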
---
## PHASE B — Requires application or registration
### 15. HCUP (AHRQ)
- **Landing:** `https://hcup-us.ahrq.gov/tech_assist/centdist.jsp`
- **Auth:** Data Use Agreement (DUA) required; free for research but application-based (~2-4 weeks)
- **Records:** State inpatient/ED/ASC databases, ~40M discharges/yr nationally
- **Action required:** Submit DUA + Data Use Training certificate. **BLOCKED until user applies.**
### 16. CMS Medicare Cost Reports (MCR)
- **Bulk:** `https://www.cms.gov/data-research/statistics-trends-and-reports/cost-reports` (HOSPITAL2010 format)
- **Auth:** None; just large downloads (~1-3 GB per year)
- **Update:** Quarterly rolling
- **Records:** ~6,000 hospital cost reports/year (CCN-level)
- Staged as a fetch-and-parse job (uses `ccn` to join with `bhi_facilities`).
### 17. NEMSIS state crisis transport data
- **Landing:** `https://nemsis.org/using-ems-data/request-research-data/`
- **Auth:** Research Data Request (application) — typically 4-8 weeks
- **BLOCKED until user applies.**
### 18. California HCAI (patient discharge data)
- **Endpoint:** `https://hcai.ca.gov/data-and-reports/cost-transparency/` and `https://data.chhs.ca.gov/dataset?q=pdd`
- **Auth:** Free (some files direct download; Limited Data Set requires DUA)
- **Update:** Annual
- **Records:** ~3.5M CA discharges/yr; psych DRGs extractable
### 19. NY SPARCS
- **Landing:** `https://www.health.ny.gov/statistics/sparcs/`
- **Auth:** Application for identified data; deidentified file free via `health.data.ny.gov`
- **Deidentified endpoint:** `https://health.data.ny.gov/resource/u4ud-w55t.json` (Hospital Inpatient Discharges)
- **Records:** ~2.5M NY discharges/yr
### 20. TX DSHS discharge data
- **Landing:** `https://www.dshs.texas.gov/texas-health-care-information-collection/health-data-researcher-information/texas-inpatient-public-use`
- **Auth:** Free (Public Use File is a direct download after click-through)
- **Records:** ~3M TX discharges/yr
### 21. FL AHCA discharge data
- **Landing:** `https://ahca.myflorida.com/health-care-policy-and-oversight/bureau-of-central-services/florida-center-for-health-information-and-transparency/data-analytics/order-data`
- **Auth:** Application form + fee for identified; aggregate free
- **BLOCKED until user applies for identified.**
---
## PHASE C — State RTF licensing databases
### 22. State-by-state RTF licensing scrapers
Scope: residential treatment facilities serving adolescents. One scraper per state.
Verified public-search portals (no auth, scrape-friendly HTML/JSON):
- **UT** — `https://hslic.utah.gov/` (Human Services License Information Lookup)
- **CA** — `https://www.ccld.dss.ca.gov/transparencyapi/api/facilities` (Community Care Licensing API)
- **TX** — `https://www.hhs.texas.gov/providers/long-term-care-providers/childrens-residential-facility-reimbursement-methodology` + search portal
- **FL** — `https://apps.myflfamilies.com/provider/` (DCF provider search)
- **NY** — `https://omh.ny.gov/omhweb/resources/providers/` (OMH provider directory)
- **MT** — `https://dphhs.mt.gov/qad/licensure/licensedfacilitieslist` (static list)
- **AZ** — `https://azcarecheck.azdhs.gov/` (public search)
- **CO** — `https://apps.colorado.gov/apps/oapa/licensee.aspx` (Office of Early Childhood)
- **OR** — `https://ccld.oregon.gov/ccld/search/` (Care Provider Directory)
- **WA** — `https://fortress.wa.gov/dshs/adsaapps/lookup/` (LTC lookup)
- **IL** — `https://www2.illinois.gov/dcfs/brighterfutures/Pages/default.aspx`
- **MA** — `https://www.mass.gov/lists/licensed-residential-treatment-programs`
- **PA** — `https://www.dhs.pa.gov/Services/Assistance/Pages/Child-Residential-Facility.aspx`
States requiring FOIA / no public portal (documented as BLOCKED for Phase C v1):
- AL, AK, AR, DE, GA, HI, ID, IN, IA, KS, KY, LA, ME, MD, MI, MN, MS, MO, NE, NV, NH, NJ, NM, NC, ND, OH, OK, RI, SC, SD, TN, VT, VA, WV, WI, WY
The scraper job stub lists URL patterns for the 13 verified states and marks the rest "FOIA required."
---
## Test results summary (Phase A)
| # | Source | Status | Notes |
|---|--------|--------|-------|
| 1 | CMS IPFQR | OK | q9vs-r7wp returned facility rows |
| 2 | CMS Hospital Compare | OK | xubh-q36u returned |
| 3 | CMS POS | OK | catalog reachable, bulk CSV |
| 4 | CMS Nursing Home | OK | 4pq5-n9py returned |
| 5 | SAMHSA Locator | OK | 96,009 records confirmed |
| 6 | SAMHSA N-SSATS/N-MHSS | OK (bulk) | ZIP download, no API |
| 7 | CDC WONDER | OK | POST XML required, landing 200 |
| 8 | CDC BRFSS | OK | Socrata JSON returned |
| 9 | CDC YRBSS | OK | 3qty-g4aq + uqmk-4y2w |
| 10 | IDEA Part B | OK (static) | Static CSV; no API |
| 11 | NSCH | OK (bulk) | HRSA year files |
| 12 | BLS OES | OK | API responds; needs real series IDs |
| 13 | HRSA HPSA MH | OK | 23 MB CSV download confirmed |
| 14 | NPPES | OK | 2 results returned for NY psych |
Blocked until auth/application:
- HCUP (DUA), NEMSIS (application), FL AHCA identified, NY SPARCS identified.

64
docs/target_questions.md Normal file

@@ -0,0 +1,64 @@
# BHI Layer — Target Opportunity Questions
These are the questions the BHI layer must answer. They double as acceptance criteria: the layer ships when every question can be answered with a SQL query or a short Python notebook against `brain` with BHI tables populated.
Scope assumptions: all 50 states, facility-level where available, tagged adolescent (13-17) and young adult (18-25).
## 1. Supply / capacity
1. Which US counties have the highest HPSA mental health scores AND the lowest bed density (top 50)?
2. Which counties have ZERO licensed adolescent inpatient psychiatric beds within 60 miles?
3. Which counties have ZERO licensed young-adult residential treatment beds within 60 miles?
4. How many IPFs have closed vs opened in the last 24 months, by state?
5. Which IPFs have the worst HBIPS restraint+seclusion rates and are therefore vulnerability candidates for competitive entry or acquisition?
6. Which nursing homes are disproportionately housing under-65 residents with SMI (SNF-IMD dynamic) and are candidates for conversion/specialty buildout?
7. Where are the biggest drops in psych bed count over the last 5 years (via POS termination data)?
8. Which states have the lowest ratio of PRTF beds per 10k adolescents?
## 2. Demand
9. Which counties have the highest 13-17 suicide rate and fastest-growing trend (CDC WONDER)?
10. Which counties have the highest 18-25 overdose death rate trend?
11. Which states have the highest YRBSS "considered suicide" % and highest unmet-treatment need on NSCH, simultaneously?
12. How does adolescent ED visit rate for self-harm compare across states (cross-joining HCUP when available)?
13. Which school districts have the highest IDEA Part B Emotional Disturbance child count per 1,000 students?
14. Which states are seeing the largest YoY increase in 988 + crisis line volume per capita?
## 3. Workforce
15. Which MSAs have the highest YoY wage growth for psychiatrists (SOC 29-1223) — indicates a shortage?
16. Which MSAs have psychiatrist employment per 100k in the bottom quartile AND mental health HPSA coverage in the worst quartile?
17. Where are LCSW/LMHC wages spiking (21-1014, 21-1018) while employment is flat?
## 4. Financial / opportunity
18. What is the median psych Medicare margin (revenue - cost) per discharge, by state, from MCR data?
19. Which for-profit IPF chains are expanding fastest (opened_date + chain_id from nursing home join)?
20. Which counties have the biggest gap between HPSA score and SAM.gov / state contract dollars flowing in (underinvested vs need)?
21. What are the median acquisition multiples for BH facilities in each state? (Requires later enrichment.)
## 5. Adolescent transport / crisis (specific focus)
22. Which counties dispatch the most EMS runs coded "behavioral/psych" per 10k adolescents (NEMSIS, when access granted)?
23. Where do adolescent psychiatric holds most frequently result in out-of-county or out-of-state transport (indicates no local capacity)?
24. Which states have the longest average ED boarding time for adolescents awaiting inpatient psych admission (via AHRQ + state HAI reports)?
25. Which states have a dedicated secure-transport statute or reimbursement (`bhi_policy_events` filter on "secure transport")? These are greenfield markets for BH transport vendors.
26. Which counties combine: high adolescent suicide rate + no in-county adolescent psych beds + high ED boarding = highest-need adolescent transport markets?
27. Which chains/operators already provide adolescent secure transport and where are their service gaps (via scraping state BHO contract registries)?
## 6. Regulatory / tailwind
28. Which states passed Medicaid rate increases for BH residential in the last 24 months?
29. Which states expanded the definition of "mobile crisis response" to include adolescents in the last 24 months?
30. Where are IMD exclusion waivers (Section 1115 SMI/SED waivers) active or pending?
## 7. Composite / prioritization
31. Top 10 states ranked by composite_score for "adolescent inpatient psychiatric"?
32. Top 50 counties ranked by composite_score for "young adult residential SUD"?
33. Top 20 MSAs ranked by composite_score for "outpatient adolescent therapy (IOP/PHP)"?
34. For each of the top 10 composite-score opportunities, list: (a) top 3 operators already there, (b) workforce wage growth, (c) most recent policy event, (d) closest open SAM.gov opportunity.
---
**Acceptance criteria:** When the BHI layer is live and all Phase A sources are ingested, a user should be able to run SQL or ask the Brain's natural-language interface these 34 questions and get a grounded answer with citations to the underlying `bhi_*` tables.

Binary file not shown.

146
jobs/ingestion/_common.py Normal file

@@ -0,0 +1,146 @@
"""
Shared helpers for BHI ingestion jobs.
READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
Base Brain is expected to expose:
- env DATABASE_URL pointing at the `brain` Postgres
- a `job_runs` table (the base Brain maintains this)
- optional Vault at http://localhost:8200 for API keys
Every BHI job imports from this module to keep behavior consistent.
"""
from __future__ import annotations
import logging
import os
import time
from contextlib import contextmanager
from datetime import datetime
from typing import Any, Callable, Iterable
import requests
try:
    import psycopg2
    import psycopg2.extras
except ImportError:
    psycopg2 = None  # type: ignore
LOG_FMT = "%(asctime)s %(levelname)s %(name)s | %(message)s"
logging.basicConfig(level=os.environ.get("BHI_LOG_LEVEL", "INFO"), format=LOG_FMT)
# --- HTTP session with retries + rate limiting ------------------------------
class RateLimitedSession(requests.Session):
    def __init__(self, min_interval: float = 0.2, max_retries: int = 5):
        super().__init__()
        self.headers.update({"User-Agent": "EconomicBrain-BHI/1.0 (+research)"})
        self.min_interval = min_interval
        self.max_retries = max_retries
        self._last = 0.0
    def request(self, method, url, **kw):  # type: ignore[override]
        kw.setdefault("timeout", 60)
        backoff = 1.0
        for attempt in range(self.max_retries):
            # Enforce a minimum gap between consecutive requests.
            dt = time.monotonic() - self._last
            if dt < self.min_interval:
                time.sleep(self.min_interval - dt)
            self._last = time.monotonic()
            try:
                resp = super().request(method, url, **kw)
            except (requests.ConnectionError, requests.Timeout) as e:
                # Transient network failure: back off and retry.
                logging.warning("Request error: %s (attempt %d)", e, attempt + 1)
                time.sleep(backoff)
                backoff *= 2
                continue
            if resp.status_code in (429, 500, 502, 503, 504):
                logging.warning("HTTP %s on %s, retrying in %.1fs", resp.status_code, url, backoff)
                time.sleep(backoff)
                backoff *= 2
                continue
            # Other 4xx responses (e.g. 404) are not retryable; raise immediately.
            resp.raise_for_status()
            return resp
        raise RuntimeError(f"Exceeded retries for {url}")
# --- DB helpers -------------------------------------------------------------
def get_conn():
    if psycopg2 is None:
        raise RuntimeError("psycopg2 not installed. pip install psycopg2-binary")
    dsn = os.environ.get("DATABASE_URL") or os.environ.get("BRAIN_DATABASE_URL")
    if not dsn:
        raise RuntimeError("DATABASE_URL env var not set")
    return psycopg2.connect(dsn)
@contextmanager
def job_run(job_name: str):
    """Context manager that logs a row in the base Brain's job_runs table."""
    conn = get_conn()
    run_id = None
    started = datetime.utcnow()
    try:
        with conn.cursor() as c:
            c.execute(
                """
                INSERT INTO job_runs (job_name, started_at, status)
                VALUES (%s, %s, 'running') RETURNING id
                """,
                (job_name, started),
            )
            run_id = c.fetchone()[0]
        conn.commit()
        yield conn, run_id
        with conn.cursor() as c:
            c.execute(
                "UPDATE job_runs SET status='success', finished_at=%s WHERE id=%s",
                (datetime.utcnow(), run_id),
            )
        conn.commit()
    except Exception as e:
        if run_id is not None:
            try:
                # Discard the failed transaction first; otherwise Postgres
                # rejects the UPDATE and the row stays at 'running'.
                conn.rollback()
                with conn.cursor() as c:
                    c.execute(
                        "UPDATE job_runs SET status='error', finished_at=%s, error=%s WHERE id=%s",
                        (datetime.utcnow(), str(e)[:2000], run_id),
                    )
                conn.commit()
            except Exception:
                logging.exception("could not record failure for run %s", run_id)
        raise
    finally:
        conn.close()
def bulk_insert(conn, table: str, columns: list[str], rows: Iterable[tuple]):
with conn.cursor() as c:
psycopg2.extras.execute_values(
c,
f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s",
list(rows),
page_size=500,
)
conn.commit()
# --- Vault (optional) -------------------------------------------------------
def vault_secret(path: str, key: str) -> str | None:
token = os.environ.get("VAULT_TOKEN")
addr = os.environ.get("VAULT_ADDR", "http://localhost:8200")
if not token:
return os.environ.get(key.upper())
try:
r = requests.get(
f"{addr}/v1/{path}",
headers={"X-Vault-Token": token},
timeout=5,
)
return r.json()["data"]["data"].get(key)
except Exception as e:
logging.warning("vault fetch failed: %s", e)
return os.environ.get(key.upper())
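
The retry loop in `RateLimitedSession.request` starts with a one-second wait and doubles it after every failed attempt. A minimal sketch of the resulting worst-case schedule follows; `backoff_schedule` is a hypothetical helper written for illustration and is not part of `_common.py`.

```python
def backoff_schedule(max_retries: int = 5, first: float = 1.0) -> list[float]:
    """Seconds slept after each failed attempt, doubling every time."""
    waits = []
    delay = first
    for _ in range(max_retries):
        waits.append(delay)
        delay *= 2
    return waits


if __name__ == "__main__":
    waits = backoff_schedule()
    print(waits, "total sleep:", sum(waits))
```

With the defaults, a URL that never recovers costs about 31 seconds of sleep (plus request time) before the session gives up, which is worth keeping in mind for jobs that loop over many URLs.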

jobs/ingestion/bls_oes.py Normal file

@@ -0,0 +1,93 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
BLS OES (Occupational Employment and Wage Statistics) — behavioral health
workforce by MSA.
Primary approach: annual bulk download (no auth, simplest):
https://www.bls.gov/oes/special-requests/oesmYYma.zip
Fallback / enrichment: BLS public API (optional free key via vault).
"""
import csv
import io
import logging
import sys
import zipfile
from _common import RateLimitedSession, bulk_insert, job_run, vault_secret
LOG = logging.getLogger("bhi.bls_oes")
BULK_URL = "https://www.bls.gov/oes/special-requests/oesm23ma.zip" # update year annually
BH_SOC_CODES = {
"29-1223": "Psychiatrists",
"29-1229": "Physicians, All Other",
"21-1014": "Mental Health Counselors",
"21-1015": "Rehabilitation Counselors",
"21-1018": "SUD / Behavioral Disorder Counselors",
"21-1023": "Mental Health & Substance Abuse Social Workers",
"19-3033": "Clinical & Counseling Psychologists",
}
def test_endpoint():
s = RateLimitedSession()
r = s.head(BULK_URL, allow_redirects=True)
print(f"OK: status={r.status_code}, content-length={r.headers.get('content-length')}")
return r.status_code == 200
def fetch_rows():
s = RateLimitedSession(min_interval=1.0)
r = s.get(BULK_URL)
z = zipfile.ZipFile(io.BytesIO(r.content))
# Bulk zip contains one CSV/XLSX with MSA rows
csv_name = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
if not csv_name:
LOG.error("no CSV in BLS zip")
return []
with z.open(csv_name) as f:
reader = csv.DictReader(io.TextIOWrapper(f, encoding="latin-1"))
rows = [r for r in reader if (r.get("OCC_CODE") or r.get("occ_code")) in BH_SOC_CODES]
LOG.info("BLS OES BH rows: %d", len(rows))
return rows
def _num(v):
try:
return float(str(v).replace(",", "")) if v not in (None, "", "*", "#") else None
except (TypeError, ValueError):
return None
def write_rows(conn, raw):
cols = ["msa_code","msa_name","occupation_code","occupation_title",
"employment","annual_wage_median","annual_wage_mean","period","source"]
rows = []
for r in raw:
code = r.get("OCC_CODE") or r.get("occ_code")
rows.append((
r.get("AREA") or r.get("area"),
r.get("AREA_TITLE") or r.get("area_title"),
code,
BH_SOC_CODES.get(code, r.get("OCC_TITLE") or r.get("occ_title")),
int(_num(r.get("TOT_EMP") or r.get("tot_emp")) or 0) or None,
_num(r.get("A_MEDIAN") or r.get("a_median")),
_num(r.get("A_MEAN") or r.get("a_mean")),
"May2023",
"bls_oes",
))
bulk_insert(conn, "bhi_workforce", cols, rows)
return len(rows)
def main():
with job_run("bhi_bls_oes") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()
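
The `_num` helper above exists because, as far as I understand the OES file conventions, BLS marks unavailable or top-coded estimates with symbols such as `*`, `**` and `#` rather than leaving the cell empty. A small standalone sketch of the same idea, using a hypothetical name `parse_oes_number`:

```python
def parse_oes_number(v):
    """Return a float, or None for blanks and BLS suppression markers."""
    if v in (None, "", "*", "#"):
        return None
    try:
        # Employment and wage figures may carry thousands separators.
        return float(str(v).replace(",", ""))
    except ValueError:
        # Any other non-numeric marker (for example "**") is treated as missing.
        return None


if __name__ == "__main__":
    for raw in ("1,230", "52.5", "*", "**", "#", ""):
        print(repr(raw), "->", parse_oes_number(raw))
```

Returning `None` rather than `0` keeps suppressed cells distinguishable from true zeros once they reach `bhi_workforce`.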

jobs/ingestion/cdc_brfss.py Normal file

@@ -0,0 +1,92 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CDC BRFSS Prevalence Data (Socrata).
Source: https://data.cdc.gov/resource/dttw-5yxu.json
Pulls depression + mental-health-not-good items by state, with
young-adult (18-24) breakouts where available.
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cdc_brfss")
BASE = "https://data.cdc.gov/resource/dttw-5yxu.json"
# BRFSS topics of interest for BHI
TOPICS = [
"Depression",
"Mental Health Status",
"Poor Mental Health",
]
def test_endpoint():
s = RateLimitedSession()
r = s.get(BASE, params={"$limit": 2}).json()
print(f"OK: returned {len(r)} rows")
if r:
print("sample topic:", r[0].get("topic"))
return bool(r)
def fetch_rows():
    s = RateLimitedSession(min_interval=0.2)
    out = []
    for topic in TOPICS:
        offset = 0
        while True:
            batch = s.get(BASE, params={
                "$where": f"topic='{topic}'",
                # Stable ordering so offset paging neither skips nor repeats rows.
                "$order": ":id",
                "$limit": 5000,
                "$offset": offset,
            }).json()
            if not batch:
                break
            out.extend(batch)
            if len(batch) < 5000:
                break
            offset += 5000
        # `out` accumulates across topics, so this is a running total.
        LOG.info("topic=%s running total=%d", topic, len(out))
    return out
def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        raw_val = r.get("data_value")
        if raw_val in (None, ""):
            # Suppressed or missing estimate: skip it rather than store 0.
            continue
        try:
            val = float(raw_val)
        except (TypeError, ValueError):
            continue
        breakout = (r.get("break_out") or "Overall").lower()
        if "18" in breakout and "24" in breakout:
            # BRFSS reports 18-24; it is filed under the BHI 18-25 bracket (approximate).
            bracket = "18-25"
        elif "overall" in breakout:
            bracket = "all"
        else:
            bracket = breakout
        rows.append((
            "state",
            r.get("locationabbr"),
            (r.get("question") or r.get("topic") or "").strip()[:120],
            bracket,
            str(r.get("year") or ""),
            val,
            "cdc_brfss",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)
def main():
with job_run("bhi_cdc_brfss") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()
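
The `fetch_rows` loop above follows a common limit/offset pattern: request a page, stop when a page comes back short or empty, otherwise advance the offset. A minimal sketch of that pattern in isolation, assuming a caller-supplied `fetch_page(offset, limit)` function (a hypothetical stand-in for the HTTP call) so it can be exercised without network access:

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int, int], list], page_size: int = 5000) -> Iterator[dict]:
    """Yield rows page by page until a short or empty page is returned."""
    offset = 0
    while True:
        batch = fetch_page(offset, page_size)
        yield from batch
        if len(batch) < page_size:
            return
        offset += page_size


if __name__ == "__main__":
    data = [{"i": i} for i in range(12)]

    def fake_page(offset: int, limit: int) -> list:
        # Stand-in for a Socrata request with $offset / $limit.
        return data[offset:offset + limit]

    print(len(list(paginate(fake_page, page_size=5))))
```

The pattern only works if the server returns rows in the same order on every request, which is why paged queries usually carry an explicit ordering clause.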

jobs/ingestion/cdc_wonder_mortality.py Normal file

@@ -0,0 +1,119 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
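CDC WONDER: Underlying Cause of Death by county, age bracket, ICD-10.
Posts an XML request body to https://wonder.cdc.gov/controller/datarequest/D76
(Underlying Cause of Death, 1999-2020). D77 is the Multiple Cause of Death
dataset for the same years, not a newer Underlying Cause release, so confirm
the dataset ID before switching to post-2020 data. The public datasets return
XML tables; cells with fewer than 10 deaths are suppressed.
CAVEAT (verify before relying on this job): CDC's WONDER API documentation
states that vital-statistics queries sent through the API cannot filter or
group by location (region, state, county), so the county slices below may
only be obtainable through the web interface export.
We request two slices:
1. Suicide (X60-X84) for ages 13-17 and 18-25, by county
2. Drug poisoning (X40-X44, Y10-Y14) for 13-17 and 18-25, by county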
"""
import logging
import sys
import xml.etree.ElementTree as ET
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cdc_wonder")
ENDPOINT = "https://wonder.cdc.gov/controller/datarequest/D76"
def _build_xml(icd_codes: list[str], age_bracket: str) -> str:
    """Assemble WONDER POST XML. Structure is value-order dependent."""
    # Age groups in WONDER: 15-19, 20-24, 25-29 etc. Adolescent and young-adult
    # brackets don't align perfectly with 5-year WONDER bins; closest fit:
    ages = {
        "13-17": ["15-19"],  # approximate
        "18-25": ["20-24", "25-29"],
    }[age_bracket]
    # WONDER expects one <value> element per selected item inside a <parameter>.
    icd_vals = "".join(f"<value>{c}</value>" for c in icd_codes)
    age_vals = "".join(f"<value>{a}</value>" for a in ages)
    # NOTE: the parameter codes below (V1, V2, V22, V51, V9) are carried over
    # unchanged and should be checked against the D76 parameter reference.
    return f"""<?xml version="1.0" encoding="utf-8"?>
<request-parameters>
<parameter><name>accept_datause_restrictions</name><value>true</value></parameter>
<parameter><name>B_1</name><value>D76.V2-level1</value></parameter>
<parameter><name>B_2</name><value>D76.V51</value></parameter>
<parameter><name>F_D76.V1</name>{age_vals}</parameter>
<parameter><name>F_D76.V2</name><value>*All*</value></parameter>
<parameter><name>F_D76.V22</name>{icd_vals}</parameter>
<parameter><name>O_age</name><value>D76.V51</value></parameter>
<parameter><name>O_location</name><value>D76.V9</value></parameter>
<parameter><name>VM_D76.M6_D76.V10</name><value/></parameter>
</request-parameters>"""
def test_endpoint():
s = RateLimitedSession(min_interval=1.0)
body = _build_xml(["X60-X84"], "13-17")
r = s.post(ENDPOINT, data={"request_xml": body, "accept_datause_restrictions": "true"})
ok = r.status_code == 200 and b"<response" in r.content
print(f"OK={ok}, status={r.status_code}, len={len(r.content)}")
return ok
def fetch_rows():
s = RateLimitedSession(min_interval=1.0)
out = []
for measure, icd in [("suicide_rate", ["X60-X84"]),
("overdose_rate", ["X40-X44", "Y10-Y14"])]:
for bracket in ("13-17", "18-25"):
body = _build_xml(icd, bracket)
r = s.post(ENDPOINT, data={
"request_xml": body,
"accept_datause_restrictions": "true",
})
rows = _parse_wonder_xml(r.text, measure, bracket)
out.extend(rows)
LOG.info("%s %s -> %d rows", measure, bracket, len(rows))
return out
def _parse_wonder_xml(xml_text: str, measure: str, bracket: str):
    out = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        LOG.error("WONDER XML parse failed")
        return out
    # WONDER returns <data-table> with <r> rows; label cells look like
    # <c l="label"/> and numeric cells like <c v="value"/>.
    for r in root.iter("r"):
        cells = [c.get("l") or c.get("v") or c.text for c in r.findall("c")]
        if len(cells) < 3:
            continue
        county = cells[0]
        try:
            # Values may carry thousands separators; suppressed cells hold text
            # such as "Suppressed" and are skipped by the ValueError branch.
            rate = float(str(cells[-1]).replace(",", ""))
        except (TypeError, ValueError):
            continue
        out.append({
            "geo_type": "county",
            "geo_code": county,
            "measure": measure,
            "age_bracket": bracket,
            "period": "2018-2022",  # WONDER typical 5-year window
            "value": rate,
            "source": "cdc_wonder",
        })
    return out
def write_rows(conn, raw):
cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
rows = [(r["geo_type"], r["geo_code"], r["measure"], r["age_bracket"],
r["period"], r["value"], r["source"]) for r in raw]
bulk_insert(conn, "bhi_demand_indicators", cols, rows)
return len(rows)
def main():
with job_run("bhi_cdc_wonder") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()
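
To make the row parsing above concrete, here is a small offline sketch. The `SAMPLE` document is hand-written to mimic the `<r>`/`<c>` layout described in the job's comments (labels in an `l` attribute, values in a `v` attribute); it is an assumption about the response shape, not captured CDC output, and `parse_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a WONDER response; not real CDC data.
SAMPLE = """<page><response><data-table>
<r><c l="15-19 years"/><c v="2,345"/><c v="11.2"/></r>
<r><c l="20-24 years"/><c v="Suppressed"/><c v="Unreliable"/></r>
</data-table></response></page>"""


def parse_rows(xml_text: str) -> list[tuple[str, float]]:
    """Return (label, rate) pairs, skipping rows whose last cell is not numeric."""
    rows = []
    for r in ET.fromstring(xml_text).iter("r"):
        cells = [c.get("l") or c.get("v") for c in r.findall("c")]
        try:
            rate = float(cells[-1].replace(",", ""))
        except (AttributeError, IndexError, ValueError):
            continue
        rows.append((cells[0], rate))
    return rows


if __name__ == "__main__":
    print(parse_rows(SAMPLE))
```

Rows whose rate cell contains text such as "Suppressed" or "Unreliable" are dropped here, which mirrors how the job skips cells it cannot convert to a number.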

jobs/ingestion/cdc_yrbss.py Normal file

@@ -0,0 +1,95 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CDC YRBSS — Youth Risk Behavior Survey (high and middle school).
Sources (Socrata):
- High school: https://data.cdc.gov/resource/3qty-g4aq.json
- Middle school: https://data.cdc.gov/resource/uqmk-4y2w.json
Key items: "considered suicide", "attempted suicide", "persistent sadness",
substance use — all adolescent (13-17) bracket.
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cdc_yrbss")
DATASETS = {
"hs": "https://data.cdc.gov/resource/3qty-g4aq.json",
"ms": "https://data.cdc.gov/resource/uqmk-4y2w.json",
}
KEYWORDS = ["suicide", "sad", "hopeless", "mental health", "electronic"]
def test_endpoint():
s = RateLimitedSession()
ok = True
for k, url in DATASETS.items():
r = s.get(url, params={"$limit": 1})
print(f"{k}: status={r.status_code}, rows={len(r.json())}")
ok = ok and r.status_code == 200
return ok
def fetch_rows():
s = RateLimitedSession(min_interval=0.2)
out = []
for key, url in DATASETS.items():
offset = 0
while True:
batch = s.get(url, params={"$limit": 5000, "$offset": offset}).json()
if not batch:
break
for row in batch:
row["_dataset"] = key
out.extend(batch)
if len(batch) < 5000:
break
offset += 5000
LOG.info("yrbss %s -> %d", key, len(out))
return out
def _question_is_relevant(q: str) -> bool:
ql = (q or "").lower()
return any(k in ql for k in KEYWORDS)
def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        # Prefer the question text: a short code such as "H26" would never
        # match the keyword filter and every row would be dropped.
        question = r.get("shortquestiontext") or r.get("question") or r.get("questioncode") or ""
        if not _question_is_relevant(question):
            continue
        try:
            val = float(r.get("data_value") or r.get("greater_risk_data_value") or 0)
        except (TypeError, ValueError):
            continue
        if val == 0:
            continue
        rows.append((
            "state" if r.get("locationdesc") else "district",
            r.get("locationabbr") or r.get("sitecode"),
            question[:120],
            "13-17",
            str(r.get("year") or ""),
            val,
            f"cdc_yrbss_{r.get('_dataset','hs')}",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)
def main():
with job_run("bhi_cdc_yrbss") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()

jobs/ingestion/cms_hospital_compare.py Normal file

@@ -0,0 +1,77 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Hospital General Information (Care Compare) — used to cross-reference
which acute hospitals host behavioral health units and to capture CCN-level
facility metadata.
Source: https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cms_hospital_compare")
BASE = "https://data.cms.gov/provider-data/api/1/datastore/query/xubh-q36u/0"
PAGE = 500
def test_endpoint():
s = RateLimitedSession()
r = s.get(BASE, params={"limit": 2}).json()
rows = r.get("results", [])
print(f"OK: {len(rows)} rows, sample:", rows[0].get("facility_name") if rows else None)
return bool(rows)
def fetch_rows():
s = RateLimitedSession(min_interval=0.25)
offset, out = 0, []
while True:
b = s.get(BASE, params={"limit": PAGE, "offset": offset}).json().get("results", [])
if not b:
break
out.extend(b)
if len(b) < PAGE:
break
offset += PAGE
LOG.info("fetched %d hospitals", len(out))
return out
def write_rows(conn, raw):
cols = [
"ccn","npi","name","address","city","state","zip","county_fips",
"lat","lon","facility_type","ownership","bed_count","psych_bed_count",
"pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
"services_offered","populations_served","payment_accepted",
"medicaid_accepted","accreditation","opened_date","closed_date",
"last_verified","source","source_raw_id",
]
rows = []
for r in raw:
rows.append((
r.get("facility_id"), None,
r.get("facility_name"), r.get("address"),
r.get("citytown"), r.get("state"), r.get("zip_code"), None,
None, None,
(r.get("hospital_type") or "hospital"),
r.get("hospital_ownership"),
None, None, None, None, None,
[], [], [], None, None, None, None, None,
"cms_hospital_compare", None,
))
bulk_insert(conn, "bhi_facilities", cols, rows)
return len(rows)
def main():
with job_run("bhi_cms_hospital_compare") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()

jobs/ingestion/cms_ipfqr.py Normal file

@@ -0,0 +1,137 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Inpatient Psychiatric Facility Quality Reporting (IPFQR) ingestion.
Source: https://data.cms.gov/provider-data/api/1/datastore/query/q9vs-r7wp/0
Writes facilities to bhi_facilities and measures to bhi_facility_quality.
"""
import logging
import sys
from typing import Any
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cms_ipfqr")
DATASET_ID = "q9vs-r7wp" # IPFQR by Facility
BASE = f"https://data.cms.gov/provider-data/api/1/datastore/query/{DATASET_ID}/0"
PAGE_SIZE = 500
MEASURE_FIELDS = [
("hbips2", "HBIPS-2", "Hours of physical-restraint use"),
("hbips3", "HBIPS-3", "Hours of seclusion use"),
("smd", "SMD", "Screening for metabolic disorders"),
("sub2", "SUB-2", "Alcohol use brief intervention"),
("sub3", "SUB-3", "Alcohol/other drug use treatment at discharge"),
("tob3", "TOB-3", "Tobacco use treatment at discharge"),
]
# --- TEST function (no DB) --------------------------------------------------
def test_endpoint():
"""Run standalone to verify the endpoint works."""
s = RateLimitedSession()
r = s.get(BASE, params={"limit": 3})
data = r.json()
rows = data.get("results", [])
print(f"OK: fetched {len(rows)} rows from {BASE}")
if rows:
print("Sample keys:", list(rows[0].keys())[:12])
print("Sample facility:", rows[0].get("facility_name"), rows[0].get("state"))
return len(rows) > 0
# --- Fetch ------------------------------------------------------------------
def fetch_rows() -> list[dict[str, Any]]:
s = RateLimitedSession(min_interval=0.25)
offset = 0
out: list[dict[str, Any]] = []
while True:
r = s.get(BASE, params={"limit": PAGE_SIZE, "offset": offset})
batch = r.json().get("results", [])
if not batch:
break
out.extend(batch)
LOG.info("fetched %d (total %d)", len(batch), len(out))
if len(batch) < PAGE_SIZE:
break
offset += PAGE_SIZE
return out
# --- Write ------------------------------------------------------------------
def write_rows(conn, raw_rows: list[dict[str, Any]]) -> tuple[int, int]:
facility_rows = []
for r in raw_rows:
facility_rows.append((
r.get("facility_id"), # ccn
None, # npi
r.get("facility_name"),
r.get("address"),
r.get("citytown"),
r.get("state"),
r.get("zip_code"),
None, # county_fips (join later via zip->fips)
None, None, # lat, lon
"IPF", # facility_type
None, None, None, None, # ownership, bed counts
None, None, # adolescent_unit, young_adult_unit
[], [], [], None, # arrays, medicaid_accepted
None, None, None, # accreditation, opened, closed
None, # last_verified
"cms_ipfqr", # source
None, # source_raw_id
))
facility_cols = [
"ccn","npi","name","address","city","state","zip","county_fips",
"lat","lon","facility_type","ownership","bed_count","psych_bed_count",
"pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
"services_offered","populations_served","payment_accepted",
"medicaid_accepted","accreditation","opened_date","closed_date",
"last_verified","source","source_raw_id",
]
bulk_insert(conn, "bhi_facilities", facility_cols, facility_rows)
# Map ccn -> facility_id for measures
with conn.cursor() as c:
c.execute(
"SELECT ccn, facility_id FROM bhi_facilities WHERE source='cms_ipfqr'"
)
ccn_map = dict(c.fetchall())
measure_rows = []
for r in raw_rows:
fid = ccn_map.get(r.get("facility_id"))
if not fid:
continue
for field, mid, mname in MEASURE_FIELDS:
val = r.get(field) or r.get(f"{field}_overall_rate_per_1000")
try:
v = float(val) if val not in (None, "", "Not Available") else None
except (TypeError, ValueError):
v = None
if v is None:
continue
measure_rows.append((fid, mid, mname, v, None, None, None, "cms_ipfqr"))
cols = ["facility_id","measure_id","measure_name","value","benchmark","period","reported_at","source"]
bulk_insert(conn, "bhi_facility_quality", cols, measure_rows)
return len(facility_rows), len(measure_rows)
def main():
with job_run("bhi_cms_ipfqr") as (conn, run_id):
rows = fetch_rows()
f, m = write_rows(conn, rows)
LOG.info("inserted %d facilities, %d measures (run %s)", f, m, run_id)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()
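
The measure loop in `write_rows` turns one wide facility row into several long rows, one per measure with a usable value. A minimal sketch of that reshaping step on its own; the two-entry `MEASURES` list and the `melt_measures` name are illustrative assumptions, and the sample row is made up.

```python
# Illustrative subset of (source field, measure id); not the job's full list.
MEASURES = [("hbips2", "HBIPS-2"), ("smd", "SMD")]


def melt_measures(row: dict) -> list[tuple[str, float]]:
    """Return (measure_id, value) pairs for fields that hold a numeric value."""
    out = []
    for field, measure_id in MEASURES:
        val = row.get(field)
        if val in (None, "", "Not Available"):
            continue
        try:
            out.append((measure_id, float(val)))
        except (TypeError, ValueError):
            continue
    return out


if __name__ == "__main__":
    sample = {"facility_id": "000001", "hbips2": "0.45", "smd": "Not Available"}
    print(melt_measures(sample))
```

Skipping unavailable measures, instead of inserting `NULL` values, keeps `bhi_facility_quality` limited to measures a facility actually reported.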

jobs/ingestion/cms_nursing_home.py Normal file

@@ -0,0 +1,82 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Nursing Home Provider Information — captures SNFs that house behavioral
health residents (SNF-IMD dynamic) for later filtering on chain + ownership.
Source: https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cms_nursing_home")
BASE = "https://data.cms.gov/provider-data/api/1/datastore/query/4pq5-n9py/0"
PAGE = 1000
def test_endpoint():
s = RateLimitedSession()
r = s.get(BASE, params={"limit": 2}).json()
rows = r.get("results", [])
print(f"OK: {len(rows)} rows, sample:", rows[0].get("provider_name") if rows else None)
return bool(rows)
def fetch_rows():
s = RateLimitedSession(min_interval=0.25)
offset, out = 0, []
while True:
b = s.get(BASE, params={"limit": PAGE, "offset": offset}).json().get("results", [])
if not b:
break
out.extend(b)
if len(b) < PAGE:
break
offset += PAGE
LOG.info("fetched %d nursing homes", len(out))
return out
def write_rows(conn, raw):
cols = [
"ccn","npi","name","address","city","state","zip","county_fips",
"lat","lon","facility_type","ownership","bed_count","psych_bed_count",
"pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
"services_offered","populations_served","payment_accepted",
"medicaid_accepted","accreditation","opened_date","closed_date",
"last_verified","source","source_raw_id",
]
rows = []
for r in raw:
try:
beds = int(r.get("number_of_certified_beds") or 0) or None
except (TypeError, ValueError):
beds = None
opened = r.get("date_first_approved_to_provide_medicare_and_medicaid_services")
rows.append((
r.get("cms_certification_number_ccn"), None,
r.get("provider_name"), r.get("provider_address"),
r.get("citytown"), r.get("state"), r.get("zip_code"), None,
None, None,
"nursing_home",
r.get("ownership_type"),
beds, None, None, None, None,
[], [], [], None, None,
opened if opened else None, None, None,
"cms_nursing_home", None,
))
bulk_insert(conn, "bhi_facilities", cols, rows)
return len(rows)
def main():
with job_run("bhi_cms_nursing_home") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()

jobs/ingestion/cms_pos.py Normal file

@@ -0,0 +1,143 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS Provider of Services (POS) file — quarterly bulk CSV with every
Medicare-certified facility including provider category (IPFs, PRTFs, etc.),
bed counts, certification date, and termination date. Critical for
closure/opening tracking used in composite_score.capacity_trend.
"""
import csv
import io
import logging
import sys
import zipfile
from datetime import datetime
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.cms_pos")
CATALOG_URL = "https://data.cms.gov/data.json"
def test_endpoint():
s = RateLimitedSession()
r = s.get(CATALOG_URL).json()
pos = [d for d in r.get("dataset", []) if "provider of services" in d.get("title", "").lower()]
print(f"OK: {len(pos)} POS datasets in catalog")
for d in pos[:3]:
print(" -", d.get("title"))
return len(pos) > 0
def _latest_pos_distribution():
s = RateLimitedSession(min_interval=0.3)
r = s.get(CATALOG_URL).json()
pos = [d for d in r.get("dataset", [])
if "provider of services" in d.get("title", "").lower()
and "hospital" in d.get("title", "").lower()]
if not pos:
return None
latest = max(pos, key=lambda d: d.get("modified", ""))
for dist in latest.get("distribution", []):
url = dist.get("downloadURL") or dist.get("accessURL", "")
if url.endswith((".zip", ".csv")):
return url
return None
def fetch_rows():
    url = _latest_pos_distribution()
    if not url:
        LOG.error("Could not resolve POS download URL")
        return []
    LOG.info("fetching POS: %s", url)
    s = RateLimitedSession(min_interval=0.5)
    r = s.get(url)
    content = r.content
    if url.endswith(".zip"):
        z = zipfile.ZipFile(io.BytesIO(content))
        csvname = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
        if not csvname:
            LOG.error("no CSV in POS zip")
            return []
        with z.open(csvname) as f:
            text = io.TextIOWrapper(f, encoding="latin-1").read()
    else:
        text = content.decode("latin-1", errors="replace")
    reader = csv.DictReader(io.StringIO(text))
    # Filter to psychiatric + BH provider categories.
    # Assumption to verify against the POS data dictionary: psychiatric
    # hospitals are PRVDR_CTGRY_CD 01 (hospital) with PRVDR_CTGRY_SBTYP_CD 04;
    # category 04 on its own denotes skilled nursing facilities.
    keep = []
    for row in reader:
        cat = row.get("PRVDR_CTGRY_CD") or row.get("prvdr_ctgry_cd") or ""
        subcat = row.get("PRVDR_CTGRY_SBTYP_CD") or row.get("prvdr_ctgry_sbtyp_cd") or ""
        name = (row.get("FAC_NAME") or row.get("fac_name") or "").lower()
        if (cat == "01" and subcat == "04") or "psych" in name:
            keep.append(row)
    LOG.info("filtered POS to %d BH-relevant rows", len(keep))
    return keep
def _parse_date(s):
if not s:
return None
for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d"):
try:
return datetime.strptime(s, fmt).date()
except ValueError:
continue
return None
def _num(v):
try:
return int(float(v)) if v not in (None, "") else None
except (TypeError, ValueError):
return None
def write_rows(conn, raw):
cols = [
"ccn","npi","name","address","city","state","zip","county_fips",
"lat","lon","facility_type","ownership","bed_count","psych_bed_count",
"pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
"services_offered","populations_served","payment_accepted",
"medicaid_accepted","accreditation","opened_date","closed_date",
"last_verified","source","source_raw_id",
]
rows = []
for r in raw:
def g(*keys):
for k in keys:
v = r.get(k) or r.get(k.lower())
if v:
return v
return None
rows.append((
g("PRVDR_NUM", "prvdr_num"), None,
g("FAC_NAME", "fac_name"),
g("ST_ADR", "st_adr"),
g("CITY_NAME", "city_name"),
g("STATE_CD", "state_cd"),
g("ZIP_CD", "zip_cd"),
None, None, None,
"IPF",
g("GNRL_CNTL_TYPE_CD", "gnrl_cntl_type_cd"),
_num(g("BED_CNT", "bed_cnt")),
_num(g("CRTFD_BED_CNT", "crtfd_bed_cnt")),
None, None, None,
[], [], [], None, None,
_parse_date(g("ORGNL_PRTCPTN_DT", "orgnl_prtcptn_dt")),
_parse_date(g("TRMNTN_EXPRTN_DT", "trmntn_exprtn_dt")),
None,
"cms_pos", None,
))
bulk_insert(conn, "bhi_facilities", cols, rows)
return len(rows)
def main():
with job_run("bhi_cms_pos") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()
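
`_parse_date` above copes with inconsistent date formatting by trying a short list of formats in order and returning `None` when none fits. A standalone sketch of that approach, using a hypothetical `parse_date` with the same three formats:

```python
from datetime import date, datetime
from typing import Optional


def parse_date(s: Optional[str],
               formats: tuple = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")) -> Optional[date]:
    """Try each format in turn; return None when nothing matches."""
    if not s:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for raw in ("2023-01-15", "01/15/2023", "20230115", "n/a", ""):
        print(repr(raw), "->", parse_date(raw))
```

Returning `None` for unparseable input lets `opened_date` and `closed_date` stay `NULL` instead of aborting the whole insert on one malformed row.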

jobs/ingestion/hrsa_hpsa.py Normal file

@@ -0,0 +1,85 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
HRSA Mental Health HPSA (Health Professional Shortage Areas) bulk CSV.
Source: https://data.hrsa.gov/DataDownload/DD_Files/BCD_HPSA_FCT_DET_MH.csv
Confirmed: ~23 MB CSV, all active + historical MH HPSAs.
"""
import csv
import io
import logging
import sys
from datetime import datetime
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.hrsa_hpsa")
URL = "https://data.hrsa.gov/DataDownload/DD_Files/BCD_HPSA_FCT_DET_MH.csv"
def test_endpoint():
s = RateLimitedSession()
r = s.get(URL, stream=True)
first = next(r.iter_lines())
print(f"OK: content-length={r.headers.get('content-length')}")
print("header:", first.decode("utf-8", errors="replace")[:200])
return True
def fetch_rows():
s = RateLimitedSession(min_interval=0.5)
r = s.get(URL)
r.encoding = "utf-8"
reader = csv.DictReader(io.StringIO(r.text))
rows = list(reader)
LOG.info("fetched %d HPSA rows", len(rows))
return rows
def _parse_date(s):
if not s:
return None
for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
try:
return datetime.strptime(s, fmt).date()
except ValueError:
continue
return None
def _parse_int(s):
try:
return int(float(s)) if s not in (None, "") else None
except (TypeError, ValueError):
return None
def write_rows(conn, raw):
cols = ["hpsa_id","state","county_fips","score","population_served",
"designated_date","withdrawn_date","source"]
rows = []
for r in raw:
rows.append((
r.get("HPSA ID"),
r.get("Primary State Abbreviation"),
r.get("Common County FIPS Code") or r.get("HPSA Geography Identification Number"),
_parse_int(r.get("HPSA Score")),
_parse_int(r.get("HPSA Designation Population")),
_parse_date(r.get("HPSA Designation Date")),
_parse_date(r.get("Withdrawn Date")),
"hrsa_hpsa_mh",
))
bulk_insert(conn, "bhi_shortages", cols, rows)
return len(rows)
def main():
with job_run("bhi_hrsa_hpsa") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()

jobs/ingestion/idea_part_b.py Normal file

@@ -0,0 +1,93 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
IDEA Part B child count — specifically "Emotional Disturbance" (ED)
classification by state and local education agency (LEA).
Static CSVs hosted by US Department of Education / OSEP. No API. This job
pulls the most recent static tables. Update MANIFEST when new year drops.
"""
import csv
import io
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.idea_part_b")
# Static CSV links — placeholder pattern. The user confirmed landing at
# https://www2.ed.gov/programs/osepidea/618-data/static-tables/index.html
MANIFEST = [
# (year, scope, url)
("2022-23", "state", "https://www2.ed.gov/programs/osepidea/618-data/static-tables/part-b/child-count-and-educational-environment/bchildcountandedenvironments2022-23.csv"),
]
def test_endpoint():
s = RateLimitedSession()
ok = True
for year, scope, url in MANIFEST:
r = s.head(url, allow_redirects=True)
print(f"{year} {scope}: {r.status_code}")
ok = ok and r.status_code in (200, 302)
return ok
def fetch_rows():
s = RateLimitedSession(min_interval=0.5)
out = []
for year, scope, url in MANIFEST:
try:
r = s.get(url)
r.encoding = "utf-8"
reader = csv.DictReader(io.StringIO(r.text))
for row in reader:
row["_year"] = year
row["_scope"] = scope
out.append(row)
except Exception as e:
LOG.warning("failed %s: %s", url, e)
LOG.info("IDEA rows: %d", len(out))
return out
def _int(v):
try:
return int(str(v).replace(",", "")) if v not in (None, "", "-") else None
except (TypeError, ValueError):
return None
def write_rows(conn, raw):
cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
rows = []
for r in raw:
disability = (r.get("Disability Category") or r.get("SEA Disability Category") or "").lower()
if "emotional" not in disability:
continue
val = _int(r.get("Students Served") or r.get("Total") or r.get("ED"))
if val is None:
continue
rows.append((
"state",
r.get("State") or r.get("SEA State"),
"idea_emotional_disturbance_count",
"13-17", # ED classification predominantly school-age; approximate
r["_year"],
float(val),
"idea_part_b",
))
bulk_insert(conn, "bhi_demand_indicators", cols, rows)
return len(rows)
def main():
with job_run("bhi_idea_part_b") as (conn, _):
n = write_rows(conn, fetch_rows())
LOG.info("inserted %d", n)
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "test":
sys.exit(0 if test_endpoint() else 1)
main()

jobs/ingestion/nppes.py Normal file

@@ -0,0 +1,114 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
CMS NPPES (National Plan & Provider Enumeration System) — behavioral health
providers by taxonomy + state.
API: https://npiregistry.cms.hhs.gov/api/?version=2.1
Filter: taxonomy codes for psychiatry, psychology, counseling, SUD.
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.nppes")
BASE = "https://npiregistry.cms.hhs.gov/api/"
BH_TAXONOMY_CODES = [
"2084P0800X", # Psychiatry
"2084P0802X", # Addiction Psychiatry
"2084P0804X", # Child & Adolescent Psychiatry
"103T00000X", # Psychologist
"103TC2200X", # Clinical Child & Adolescent Psychologist
"101YM0800X", # Mental Health Counselor
"1041C0700X", # Clinical Social Worker
"324500000X", # Substance Abuse Rehabilitation Facility
"283Q00000X", # Psychiatric Hospital
"323P00000X", # Psychiatric Residential Treatment Facility
]
STATES = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV",
"NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN",
"TX","UT","VT","VA","WA","WV","WI","WY","DC"]


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={
        "version": "2.1", "taxonomy_description": "psychiatric",
        "state": "NY", "limit": 2,
    }).json()
    print(f"OK: result_count={r.get('result_count')}")
    return r.get("result_count", 0) > 0


def fetch_rows():
    s = RateLimitedSession(min_interval=0.1)
    all_rows = []
    for state in STATES:
        for taxonomy in BH_TAXONOMY_CODES:
            skip = 0
            while True:
                # NOTE: NPPES documents taxonomy_description as a text search on
                # the taxonomy description; verify that passing a code matches.
                r = s.get(BASE, params={
                    "version": "2.1",
                    "taxonomy_description": taxonomy,
                    "state": state,
                    "limit": 200,
                    "skip": skip,
                }).json()
                results = r.get("results", [])
                if not results:
                    break
                for row in results:
                    row["_state"] = state
                    row["_taxonomy"] = taxonomy
                all_rows.extend(results)
                if len(results) < 200:
                    break
                skip += 200
                if skip > 1000:  # NPPES documents a maximum skip of 1000
                    break
            LOG.info("state=%s tax=%s total=%d", state, taxonomy, len(all_rows))
    return all_rows


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        addresses = r.get("addresses") or []
        # Prefer the practice location; fall back to the first address listed.
        location = next((a for a in addresses if a.get("address_purpose") == "LOCATION"), addresses[0] if addresses else {})
        basic = r.get("basic") or {}
        name = basic.get("organization_name") or " ".join(filter(None, [basic.get("first_name"), basic.get("last_name")]))
        rows.append((
            None, str(r.get("number", "")),
            name,
            location.get("address_1"), location.get("city"),
            location.get("state"), location.get("postal_code"), None,
            None, None,
            # Organizations carry organization_name; individuals carry first/last name.
            "org" if basic.get("organization_name") else "provider",
            None, None, None, None, None, None,
            [r.get("_taxonomy", "")], [], [], None, None, None, None, None,
            "nppes", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_nppes") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
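
The paging loop in `fetch_rows` can be exercised without touching the network. The following is a minimal sketch, assuming a hypothetical `fetch_page(skip, limit)` callable that stands in for the NPPES request. The names `paginate`, `MAX_SKIP`, and `fake_fetch` are illustrative and not part of this commit, and the skip ceiling of 1000 is an assumption to check against the NPPES API help page.

```python
from typing import Callable, Iterator

LIMIT = 200      # page size used by fetch_rows
MAX_SKIP = 1000  # assumed ceiling on the skip parameter


def paginate(fetch_page: Callable[[int, int], list], limit: int = LIMIT,
             max_skip: int = MAX_SKIP) -> Iterator[dict]:
    """Yield records page by page until a short page or the skip ceiling."""
    skip = 0
    while skip <= max_skip:
        page = fetch_page(skip, limit)
        yield from page
        if len(page) < limit:
            break
        skip += limit


# Stub standing in for the HTTP call: 450 made-up records.
FAKE = [{"number": i} for i in range(450)]


def fake_fetch(skip: int, limit: int) -> list:
    return FAKE[skip:skip + limit]


records = list(paginate(fake_fetch))
print(len(records))
```

Separating the stop conditions from the HTTP call keeps the loop testable; with a stub that always returns full pages, the generator stops once `skip` would exceed the ceiling.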

jobs/ingestion/nsch.py Normal file

@@ -0,0 +1,96 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
NSCH — National Survey of Children's Health (HRSA/MCHB).
Source: https://mchb.hrsa.gov/data-research/national-survey-childrens-health
Bulk files by year; we parse state-level indicator tables. Manifest below.
"""
import csv
import io
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.nsch")
MANIFEST = [
# (year, url_to_indicator_csv)
("2022", "https://mchb.hrsa.gov/sites/default/files/mchb/data-research/nsch/2022/nsch-2022-state-level-indicators.csv"),
]
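INDICATORS_OF_INTEREST = {
    "anxiety": "anxiety_pct",
    "depression": "depression_pct",
    "behavioral": "behavioral_pct",
    # NOTE: this keyword also matches "received mental health treatment"
    # indicators, which are not unmet need; confirm against the source wording.
    "mental health treatment": "unmet_mh_treatment_pct",
    "unmet": "unmet_mh_treatment_pct",
}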


def test_endpoint():
    s = RateLimitedSession()
    ok = True
    for year, url in MANIFEST:
        r = s.head(url, allow_redirects=True)
        print(f"{year}: {r.status_code}")
        ok = ok and r.status_code in (200, 302)
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=0.5)
    out = []
    for year, url in MANIFEST:
        try:
            r = s.get(url)
            # Fail loudly on 4xx/5xx instead of parsing an error page as CSV.
            r.raise_for_status()
            r.encoding = "utf-8"
            reader = csv.DictReader(io.StringIO(r.text))
            for row in reader:
                row["_year"] = year
                out.append(row)
        except Exception as e:
            LOG.warning("failed %s: %s", url, e)
    LOG.info("NSCH rows: %d", len(out))
    return out


def write_rows(conn, raw):
    cols = ["geo_type","geo_code","measure","age_bracket","period","value","source"]
    rows = []
    for r in raw:
        indicator = (r.get("Indicator") or "").lower()
        measure = None
        for k, v in INDICATORS_OF_INTEREST.items():
            if k in indicator:
                measure = v
                break
        if not measure:
            continue
        # Skip rows with no estimate rather than recording a spurious 0.
        raw_val = r.get("Estimate") or r.get("Value")
        if not raw_val:
            continue
        try:
            val = float(raw_val.replace("%", ""))
        except (AttributeError, TypeError, ValueError):
            continue
        rows.append((
            "state",
            r.get("State"),
            measure,
            "13-17",  # NSCH covers ages 0-17; bracket is approximate
            r["_year"],
            val,
            "nsch",
        ))
    bulk_insert(conn, "bhi_demand_indicators", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_nsch") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
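
The keyword matching and estimate parsing in `write_rows` can be checked in isolation. This is a minimal sketch under stated assumptions: `match_measure`, `parse_pct`, and `KEYWORDS` are hypothetical helper names, the sample rows are made up, and the column names `Indicator` and `Estimate` are taken from the code above rather than verified against the published file.

```python
from typing import Optional

# Keyword -> measure mapping, mirroring part of INDICATORS_OF_INTEREST.
KEYWORDS = {
    "anxiety": "anxiety_pct",
    "depression": "depression_pct",
    "behavioral": "behavioral_pct",
    "unmet": "unmet_mh_treatment_pct",
}


def match_measure(indicator: Optional[str]) -> Optional[str]:
    """Return the first measure whose keyword appears in the indicator text."""
    text = (indicator or "").lower()
    for keyword, measure in KEYWORDS.items():
        if keyword in text:
            return measure
    return None


def parse_pct(raw: Optional[str]) -> Optional[float]:
    """Convert '12.3%' or '12.3' to a float; None when missing or malformed."""
    if raw is None:
        return None
    cleaned = raw.replace("%", "").strip()
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Made-up rows for illustration only.
sample = [
    {"Indicator": "Current anxiety problems", "Estimate": "9.2%"},
    {"Indicator": "Current depression", "Estimate": ""},
    {"Indicator": "Physical activity", "Estimate": "40.1"},
]
parsed = [(match_measure(r["Indicator"]), parse_pct(r["Estimate"])) for r in sample]
print(parsed)
```

Returning `None` for a missing estimate lets the caller skip the row, so an empty cell is not stored as a 0 percent prevalence.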


@@ -0,0 +1,95 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
SAMHSA findtreatment.gov behavioral health facility locator.
Source: https://findtreatment.gov/locator/exportsAsJson/v2
Confirmed: 96,009 facilities across 3,201 pages (sType=BH).
"""
import logging
import sys
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.samhsa_locator")
BASE = "https://findtreatment.gov/locator/exportsAsJson/v2"
ZIP_SEED = "10001" # any valid zip works; results are national in the 'BH' sType
PAGE_SIZE = 30 # server default; respected


def test_endpoint():
    s = RateLimitedSession()
    r = s.get(BASE, params={"sType": "BH", "sAddr": ZIP_SEED, "page": 1}).json()
    print(f"OK: recordCount={r.get('recordCount')}, totalPages={r.get('totalPages')}")
    rows = r.get("rows", [])
    if rows:
        print("sample:", rows[0].get("name1"), rows[0].get("state"))
    return bool(rows)


def fetch_rows(max_pages: int | None = None):
    s = RateLimitedSession(min_interval=0.3)
    out = []
    page = 1
    total = None
    while True:
        r = s.get(BASE, params={"sType": "BH", "sAddr": ZIP_SEED, "pageSize": PAGE_SIZE, "page": page}).json()
        # Fall back to a single page if totalPages is missing, null, or 0.
        total = total or r.get("totalPages") or 1
        out.extend(r.get("rows", []))
        if page % 50 == 0:
            LOG.info("page %d/%d (total rows %d)", page, total, len(out))
        if page >= total or (max_pages and page >= max_pages):
            break
        page += 1
    LOG.info("fetched %d facilities", len(out))
    return out


def _parse_float(v):
    try:
        return float(v) if v not in (None, "") else None
    except (TypeError, ValueError):
        return None


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        name = " ".join(filter(None, [r.get("name1"), (r.get("name2") or "").strip()])).strip()
        # Split the comma-separated services string and drop stray whitespace.
        services = [s.strip() for s in (r.get("services") or "").split(",") if s.strip()]
        # SAMHSA flags adolescent/young-adult services in the services string.
        # A missing keyword means "unknown" (None), not False.
        services_lc = [s.lower() for s in services]
        adolescent = any("adolescent" in s or "youth" in s or "teen" in s for s in services_lc) or None
        young_adult = any("young adult" in s or "transitional age" in s for s in services_lc) or None
        rows.append((
            None, None,  # ccn/npi unknown from this source
            name, r.get("street1"),
            r.get("city"), r.get("state"), r.get("zip"), None,
            _parse_float(r.get("latitude")), _parse_float(r.get("longitude")),
            r.get("typeFacility") or "bh_facility",
            None, None, None, None,
            adolescent, young_adult,
            services, [], [], None, None, None, None, None,
            "samhsa_locator", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_samhsa_locator") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
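
The adolescent and young-adult flags in `write_rows` use a three-valued convention: `True` when a keyword is found and `None` (unknown) otherwise, never `False`. A minimal sketch of that rule follows, assuming the services field arrives as a comma-separated string as the code above expects. `split_services`, `flag`, and the sample string are hypothetical and for illustration only.

```python
from typing import Optional

ADOLESCENT_TERMS = ("adolescent", "youth", "teen")
YOUNG_ADULT_TERMS = ("young adult", "transitional age")


def split_services(raw: Optional[str]) -> list[str]:
    """Split a comma-separated services string, dropping empty entries."""
    return [part.strip() for part in (raw or "").split(",") if part.strip()]


def flag(services: list[str], terms: tuple[str, ...]) -> Optional[bool]:
    """True when any term appears in any service; None (unknown) otherwise."""
    lowered = [s.lower() for s in services]
    return True if any(t in s for s in lowered for t in terms) else None


# Made-up services string for illustration.
services = split_services("Outpatient, Adolescents, Telemedicine")
adolescent = flag(services, ADOLESCENT_TERMS)
young_adult = flag(services, YOUNG_ADULT_TERMS)
print(services, adolescent, young_adult)
```

Storing `None` instead of `False` keeps "not mentioned by this source" distinct from "confirmed absent" in the nullable `adolescent_unit` and `young_adult_unit` columns.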


@@ -0,0 +1,102 @@
#!/usr/bin/env python3
# READY TO DEPLOY — requires base Brain Postgres schema + run schemas/bhi_tables.sql
"""
SAMHSA N-SSATS + N-MHSS bulk downloads.
SAMHSA Data Archive hosts annual CSV/SAS files. The landing pages do not
expose a machine-listing API, so we maintain a manifest of known direct URLs
and parse whichever are present. Update the MANIFEST when new years drop.
"""
import csv
import io
import logging
import sys
import zipfile
from _common import RateLimitedSession, bulk_insert, job_run
LOG = logging.getLogger("bhi.samhsa_surveys")
# Known bulk files. Confirmed on samhsa.gov/data as of 2026. Update as needed.
MANIFEST = [
# (year, survey, url)
("2022", "N-MHSS", "https://www.samhsa.gov/data/sites/default/files/reports/rpt42936/2022-nmhss-datafile-csv.zip"),
("2022", "N-SSATS", "https://www.samhsa.gov/data/sites/default/files/reports/rpt42725/2022-nssats-datafile-csv.zip"),
]


def test_endpoint():
    s = RateLimitedSession()
    ok = True
    for year, survey, url in MANIFEST:
        r = s.head(url, allow_redirects=True)
        print(f"{survey} {year}: {r.status_code}")
        ok = ok and r.status_code == 200
    return ok


def fetch_rows():
    s = RateLimitedSession(min_interval=0.5)
    out = []
    for year, survey, url in MANIFEST:
        LOG.info("fetching %s %s", survey, year)
        try:
            r = s.get(url)
            z = zipfile.ZipFile(io.BytesIO(r.content))
            csvname = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
            if not csvname:
                continue
            with z.open(csvname) as f:
                reader = csv.DictReader(io.TextIOWrapper(f, encoding="latin-1"))
                for row in reader:
                    row["_survey"] = survey
                    row["_year"] = year
                    out.append(row)
        except Exception as e:
            LOG.warning("failed %s %s: %s", survey, year, e)
    LOG.info("total rows: %d", len(out))
    return out


def write_rows(conn, raw):
    cols = [
        "ccn","npi","name","address","city","state","zip","county_fips",
        "lat","lon","facility_type","ownership","bed_count","psych_bed_count",
        "pediatric_psych_bed_count","adolescent_unit","young_adult_unit",
        "services_offered","populations_served","payment_accepted",
        "medicaid_accepted","accreditation","opened_date","closed_date",
        "last_verified","source","source_raw_id",
    ]
    rows = []
    for r in raw:
        def y(field):
            # True for "1" or "yes" under any casing of the column name.
            v = r.get(field) or r.get(field.upper()) or r.get(field.lower())
            return v == "1" or str(v).lower() == "yes"
        name = r.get("NAME") or r.get("name") or r.get("FACNAME") or ""
        rows.append((
            None, None, name,
            r.get("STREET1") or r.get("street1"),
            r.get("CITY") or r.get("city"),
            r.get("STATE") or r.get("state"),
            r.get("ZIP") or r.get("zip"),
            None, None, None,
            "sud" if r["_survey"] == "N-SSATS" else "mh",
            None, None, None, None,
            y("YOUTH") or y("ADOLESCENT"),
            y("YAD") or y("YOUNGADULT"),
            [], [], [], None, None, None, None, None,
            f"samhsa_{r['_survey'].lower()}_{r['_year']}", None,
        ))
    bulk_insert(conn, "bhi_facilities", cols, rows)
    return len(rows)


def main():
    with job_run("bhi_samhsa_surveys") as (conn, _):
        n = write_rows(conn, fetch_rows())
        LOG.info("inserted %d", n)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        sys.exit(0 if test_endpoint() else 1)
    main()
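
The ZIP-to-rows step in `fetch_rows` can be tried against an archive built in memory. This is a minimal sketch using only the standard library. `rows_from_zip` is a hypothetical helper name, and the member names, columns, and values are made up rather than taken from a real SAMHSA file.

```python
import csv
import io
import zipfile


def rows_from_zip(payload: bytes, encoding: str = "latin-1") -> list[dict]:
    """Parse the first .csv member of a ZIP archive into a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(payload)) as z:
        name = next((n for n in z.namelist() if n.lower().endswith(".csv")), None)
        if name is None:
            return []
        with z.open(name) as f:
            text = io.TextIOWrapper(f, encoding=encoding, newline="")
            return list(csv.DictReader(text))


# Build a tiny archive in memory; contents are illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("readme.txt", "not a csv")
    z.writestr("DATA.CSV", "STATE,YOUTH\nNY,1\nCA,0\n")

rows = rows_from_zip(buf.getvalue())
print(rows)
```

The case-insensitive suffix match mirrors the job's behavior of taking whichever CSV member appears first, so an archive with several CSV files would need an explicit member name instead.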

schemas/bhi_tables.sql Normal file

@@ -0,0 +1,212 @@
-- =============================================================================
-- Behavioral Health Intelligence (BHI) Layer - Postgres schema extension
-- =============================================================================
-- This file adds BHI tables to the existing `brain` database that the base
-- Economic Brain agent is creating. DO NOT run until the base Brain schema
-- is finalized. Then run: psql -d brain -f schemas/bhi_tables.sql
--
-- All tables are prefixed `bhi_` to avoid any collision with the base Brain.
-- Foreign keys are intentionally soft (no REFERENCES) where the target table
-- belongs to the base Brain, so this file can be applied independently.
-- =============================================================================
BEGIN;
-- -----------------------------------------------------------------------------
-- 1. Facilities master table
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_facilities (
facility_id SERIAL PRIMARY KEY,
ccn VARCHAR(20), -- CMS Certification Number
npi VARCHAR(20), -- National Provider Identifier
name TEXT NOT NULL,
address TEXT,
city TEXT,
state TEXT,
zip TEXT,
county_fips TEXT,
lat DOUBLE PRECISION,
lon DOUBLE PRECISION,
facility_type TEXT, -- IPF, PRTF, CMHC, SUD, acute, nursing_home, etc.
ownership TEXT, -- for-profit, non-profit, gov
bed_count INT,
psych_bed_count INT,
pediatric_psych_bed_count INT,
adolescent_unit BOOLEAN,
young_adult_unit BOOLEAN,
services_offered TEXT[],
populations_served TEXT[], -- ['adolescent','young_adult','adult','geriatric']
payment_accepted TEXT[],
medicaid_accepted BOOLEAN,
accreditation TEXT,
opened_date DATE,
closed_date DATE,
last_verified DATE,
source TEXT, -- 'cms_ipfqr','samhsa_locator','nppes', etc.
source_raw_id INT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_state ON bhi_facilities (state);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_county ON bhi_facilities (county_fips);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_ccn ON bhi_facilities (ccn);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_npi ON bhi_facilities (npi);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_type ON bhi_facilities (facility_type);
CREATE INDEX IF NOT EXISTS idx_bhi_facilities_pops ON bhi_facilities USING GIN (populations_served);
-- -----------------------------------------------------------------------------
-- 2. Facility quality measures (IPFQR, Care Compare)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_facility_quality (
id SERIAL PRIMARY KEY,
facility_id INT REFERENCES bhi_facilities(facility_id) ON DELETE CASCADE,
measure_id TEXT, -- e.g. HBIPS-2, SUB-3, SMD, TOB-3
measure_name TEXT,
value NUMERIC,
benchmark NUMERIC,
period TEXT, -- '2024Q1', 'FY2024'
reported_at DATE,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_quality_facility ON bhi_facility_quality (facility_id);
CREATE INDEX IF NOT EXISTS idx_bhi_quality_measure ON bhi_facility_quality (measure_id);
-- -----------------------------------------------------------------------------
-- 3. Facility financials from Medicare Cost Reports
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_facility_financials (
id SERIAL PRIMARY KEY,
facility_id INT REFERENCES bhi_facilities(facility_id) ON DELETE CASCADE,
year INT,
medicare_discharges INT,
medicaid_discharges INT,
psych_discharges INT,
psych_los_avg NUMERIC,
psych_revenue BIGINT,
psych_costs BIGINT,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_financials_facility ON bhi_facility_financials (facility_id);
CREATE INDEX IF NOT EXISTS idx_bhi_financials_year ON bhi_facility_financials (year);
-- -----------------------------------------------------------------------------
-- 4. Demand indicators (CDC WONDER, BRFSS, YRBSS, IDEA, NSCH)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_demand_indicators (
id SERIAL PRIMARY KEY,
geo_type TEXT, -- 'state','county','msa','district'
geo_code TEXT, -- FIPS or code
measure TEXT, -- 'suicide_rate','overdose_rate','depression_pct', etc.
age_bracket TEXT, -- '13-17','18-25','all'
period TEXT,
value NUMERIC,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_demand_geo ON bhi_demand_indicators (geo_type, geo_code);
CREATE INDEX IF NOT EXISTS idx_bhi_demand_measure ON bhi_demand_indicators (measure);
CREATE INDEX IF NOT EXISTS idx_bhi_demand_age ON bhi_demand_indicators (age_bracket);
-- -----------------------------------------------------------------------------
-- 5. Workforce (BLS OES)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_workforce (
id SERIAL PRIMARY KEY,
msa_code TEXT,
msa_name TEXT,
occupation_code TEXT, -- SOC code, e.g. 29-1223 (psychiatrists)
occupation_title TEXT,
employment INT,
annual_wage_median NUMERIC,
annual_wage_mean NUMERIC,
period TEXT, -- 'May2024'
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_workforce_msa ON bhi_workforce (msa_code);
CREATE INDEX IF NOT EXISTS idx_bhi_workforce_occ ON bhi_workforce (occupation_code);
-- -----------------------------------------------------------------------------
-- 6. HRSA HPSA mental health shortage areas
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_shortages (
id SERIAL PRIMARY KEY,
hpsa_id TEXT,
state TEXT,
county_fips TEXT,
score INT, -- HPSA score 0-25 (higher = worse shortage)
population_served INT,
designated_date DATE,
withdrawn_date DATE,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_shortages_state ON bhi_shortages (state);
CREATE INDEX IF NOT EXISTS idx_bhi_shortages_county ON bhi_shortages (county_fips);
CREATE INDEX IF NOT EXISTS idx_bhi_shortages_score ON bhi_shortages (score);
-- -----------------------------------------------------------------------------
-- 7. State RTF (Residential Treatment Facility) licensing data
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_rtf_licensing (
id SERIAL PRIMARY KEY,
state TEXT,
license_number TEXT,
facility_name TEXT,
facility_type TEXT,
capacity INT,
populations TEXT[],
services TEXT[],
inspection_date DATE,
violations JSONB,
status TEXT,
opened_date DATE,
closed_date DATE,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_rtf_state ON bhi_rtf_licensing (state);
CREATE INDEX IF NOT EXISTS idx_bhi_rtf_name ON bhi_rtf_licensing (facility_name);
-- -----------------------------------------------------------------------------
-- 8. Policy events (Medicaid rules, state legislation)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_policy_events (
id SERIAL PRIMARY KEY,
event_type TEXT, -- 'medicaid_rule','state_law','federal_rule'
state TEXT,
title TEXT,
summary TEXT,
effective_date DATE,
url TEXT,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_policy_state ON bhi_policy_events (state);
CREATE INDEX IF NOT EXISTS idx_bhi_policy_eff_date ON bhi_policy_events (effective_date);
-- -----------------------------------------------------------------------------
-- 9. Crisis calls / EMS transports (NEMSIS aggregates)
-- -----------------------------------------------------------------------------
CREATE TABLE IF NOT EXISTS bhi_crisis_calls (
id SERIAL PRIMARY KEY,
state TEXT,
county_fips TEXT,
period TEXT,
call_count INT,
mental_health_calls INT,
transport_outcomes JSONB,
source TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_bhi_crisis_state ON bhi_crisis_calls (state);
CREATE INDEX IF NOT EXISTS idx_bhi_crisis_county ON bhi_crisis_calls (county_fips);
COMMIT;
-- =============================================================================
-- Verify
-- =============================================================================
-- \dt bhi_*
-- SELECT tablename FROM pg_tables WHERE tablename LIKE 'bhi_%' ORDER BY tablename;