| name | bulk-import-companies |
| description | Bulk import the unclaimed company directory (table-split JSON) into Supabase. Use when importing batch data scraped/exported from an external source (business directory, MoF data) — a folder containing users/users_table/company_profiles/products_services/certificates/contacts/customers/cooperations JSON files. |
# Skill: bulk-import-companies

Operational helper for the bulk import flow. Full reference: [[docs/Workflows/Bulk Import Companies.md]] · [[docs/Architecture/Unclaimed Company Data Plan.md]].
## Read first

- [[docs/Workflows/Bulk Import Companies.md]] — full workflow + troubleshooting
- [[docs/Architecture/Unclaimed Company Data Plan.md]] — schema + claim-flow design
- `scripts/import-companies.mjs` — entry-point script
- `scripts/make-import-subset.mjs` — subset extractor for testing
## When to use

- The user says "import company data from JSON files"
- The user drops a folder into `data/companies/` containing multiple `.json` files
- The user reports the error `ON CONFLICT … no unique or exclusion constraint matching` when running `npm run import:companies`
- You need to debug a low match rate / many skipped children
## MANDATORY pre-checks before importing

1. Migrations applied?

   ```sql
   select conname, pg_get_constraintdef(oid)
   from pg_constraint
   where conrelid = 'public.company_profiles'::regclass
     and conname like '%import_external%';

   select column_name from information_schema.columns
   where table_schema = 'public' and table_name = 'company_profiles'
     and column_name = 'import_external_id';
   ```

   If either query returns nothing → apply the 2 migrations:

   - `supabase/migrations/20260505020000_company_profiles_import_external_id.sql`
   - `supabase/migrations/20260505020100_company_profiles_import_external_id_unique_constraint.sql`

   ```sh
   npm run sb:push
   ```
2. Does `.env.local` contain `SUPABASE_SERVICE_ROLE_KEY`?

   Do NOT use the anon key — the service role key is required to insert unclaimed rows (`user_id` NULL bypasses RLS).

3. Is the file encoding UTF-8 without BOM?

   ```sh
   node -e "const u = require('./data/companies/users.json'); console.log(u[0]);"
   ```

   If Vietnamese text prints as mojibake (`Tâ°y`, `dá»±ng`) → the source file has the wrong encoding and must be re-exported.
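The eyeball check above can be automated. A minimal sketch (a hypothetical helper, not part of `import-companies.mjs`) that flags a UTF-8 BOM and byte pairs typical of UTF-8 text that was once decoded as Latin-1:

```javascript
// Hypothetical helper: detect a UTF-8 BOM and common mojibake markers.
function checkEncoding(buf) {
  const hasBom =
    buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf;
  const text = buf.toString('utf8');
  // 'Ã' + a Latin-1 continuation char, 'á»', or 'â€' almost never occur in
  // correctly encoded Vietnamese text, but are typical of double-encoding.
  const mojibake = /Ã[\u0080-\u00bf]|á»|â€/.test(text);
  return { hasBom, mojibake };
}

console.log(checkEncoding(Buffer.from('Công ty xây dựng', 'utf8')));
// → { hasBom: false, mojibake: false }
console.log(checkEncoding(Buffer.from('CÃ´ng ty xÃ¢y dá»±ng', 'utf8')));
// → { hasBom: false, mojibake: true }
```

Run it over `fs.readFileSync(...)` of each JSON file before importing; it is a heuristic, not a guarantee.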
4. Inspect each file's schema

   ```sh
   node -e "const fs = require('fs'); const path = require('path'); const dir = './data/companies/'; fs.readdirSync(dir).filter(f => f.endsWith('.json')).forEach(f => { try { const d = require(path.resolve(dir, f)); const arr = Array.isArray(d) ? d : [d]; console.log('\n=== ' + f + ' (' + arr.length + ' rows) ==='); console.log('FIELDS:', Object.keys(arr[0] || {}).join(', ')); } catch(e) { console.log('FAILED ' + f + ':', e.message); } });"
   ```

   Verify that the field names match the expected schema:

   - `users.json`: `businessName`, `taxId`, `businessEmail`, `phoneNumber`, `source_file` (camelCase)
   - `users_table.json`: `id`, `business_name`, `tax_id`, `business_email`, `phone_number` (snake_case)
   - `company_profiles.json`: `legalName`, `mainIndustry`, `source_file`, …
   - `company_profiles_table.json`: `id`, `userId`, `legalName`, …
   - Children: `businessName`, … + table-specific fields

   If the schema deviates (e.g. `source_file` missing from `users.json`), the script cannot match rows → low match rate. Clean the data before importing.
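To make the dependence on `source_file` concrete, here is a minimal sketch of the Set A bridge idea. This is an assumption about how `import-companies.mjs` matches rows, heavily simplified; `norm`, `buildBridge`, and `matchProfiles` are hypothetical names, not the script's actual API:

```javascript
// Sketch: users.json rows are keyed by normalized source_file; the
// company_profiles.json rows join on the same key. A row with a missing or
// unmatched key ends up in the "skipped" bucket (→ low match rate).
const norm = (s) => (s ?? '').trim().toLowerCase();

function buildBridge(users) {
  const bridge = new Map();
  for (const u of users) {
    const key = norm(u.source_file);
    if (key) bridge.set(key, u);
  }
  return bridge;
}

function matchProfiles(profiles, bridge) {
  const matched = [];
  const skipped = [];
  for (const p of profiles) {
    const key = norm(p.source_file);
    (key && bridge.has(key) ? matched : skipped).push(p);
  }
  return { matched, skipped };
}

const bridge = buildBridge([{ businessName: 'ACME', source_file: 'acme_001.html' }]);
const { matched, skipped } = matchProfiles(
  [{ legalName: 'ACME Ltd', source_file: 'ACME_001.html ' }, { legalName: 'NoKey' }],
  bridge
);
console.log(matched.length, skipped.length); // → 1 1
```

Note that normalization makes `'ACME_001.html '` still match; without trimming/lowercasing, whitespace or case drift in the export would silently drop rows.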
## Standard procedure

### Step 1: Test with a 10+10 subset

```sh
npm run import:subset -- ./data/companies/ ./data/companies-test/ 20
npm run import:companies -- ./data/companies-test/
```
Expected output:

```text
🔗 Bridge Set A: 10 entries
🔗 Bridge Set B: 10 entries
📦 Set A — company_profiles.json (10 rows)
   matched=10, skip(no source_file)=0, skip(no bridge)=0
✓ Set A done: 10/10 upserted
📦 Set B — company_profiles_table.json (10 rows)
   matched=10, …
✓ Set B done: 10/10 upserted
📦 products_services.json → products_services (N rows)
   matched companies=10, rows matched=N, rows skipped=0
✓ products_services done: N inserted
✅ All done.
```
Verify visually on the Network page → all 20 companies appear in the directory list.
### Step 2: Run the full import

```sh
npm run import:companies -- ./data/companies/
```

ETA for 125k companies + 587k child rows: ~30-50 minutes.
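That ETA implies a rough throughput, useful for sanity-checking whether a run has stalled (back-of-envelope only; real speed depends on network latency and batch size):

```javascript
// ~712k total rows spread over 30-50 minutes of wall time:
const rows = 125_000 + 587_000;
const perSecSlow = Math.round(rows / (50 * 60));
const perSecFast = Math.round(rows / (30 * 60));
console.log(`${perSecSlow}-${perSecFast} rows/s`); // → 237-396 rows/s
```

If the observed rate is an order of magnitude below this, suspect throttling or per-row round-trips instead of batched upserts.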
### Step 3: Verify counts

```sql
select
  case when import_external_id like 'set_a:%' then 'Set A'
       when import_external_id like 'set_b:%' then 'Set B'
       else 'other' end as set_,
  count(*) as cnt,
  count(tax_id) as with_tax_id
from company_profiles
where source = 'import' and claim_status = 'unclaimed'
group by 1;

select 'products_services' as t, count(*) from products_services where user_id is null
union all select 'certificates', count(*) from certificates where user_id is null
union all select 'contacts', count(*) from contacts where user_id is null
union all select 'customers', count(*) from customers where user_id is null
union all select 'cooperations', count(*) from cooperations where user_id is null;
```
### Step 4: Pre-geocode addresses

See the pre-geocode-companies skill.
## Common errors & fixes

`there is no unique or exclusion constraint matching the ON CONFLICT specification`
→ Migration 20260505020100 is not applied (see Pre-checks #1), or the PostgREST schema cache is stale → `notify pgrst, 'reload schema';`

`ON CONFLICT DO UPDATE command cannot affect row a second time`
→ The same `import_external_id` (or `tax_id`) appears twice in one batch. The script already dedups via a Map, but manual edits to the input can reintroduce duplicates. Check the `dedup-collapsed=N` log line — a large N means the input contains duplicates.
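The Map-based dedup mentioned above can be sketched like this (an assumption about the script's behavior — that later rows overwrite earlier ones and the collapsed count is what gets logged; `dedupByKey` is a hypothetical name):

```javascript
// Sketch: collapse rows sharing a key; Map.set keeps only the last occurrence.
function dedupByKey(rows, keyFn) {
  const byKey = new Map();
  for (const r of rows) byKey.set(keyFn(r), r); // later rows win
  return { rows: [...byKey.values()], collapsed: rows.length - byKey.size };
}

const { rows, collapsed } = dedupByKey(
  [{ tax_id: '0101', n: 1 }, { tax_id: '0102', n: 2 }, { tax_id: '0101', n: 3 }],
  (r) => r.tax_id
);
console.log(collapsed, rows.length); // → 1 2
```

A batch deduped this way can never trip "cannot affect row a second time" on that key, which is why seeing the error after a manual edit points at a second conflict key (e.g. `tax_id` colliding while `import_external_id` differs).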
### Low match rate (< 50%)

The script logs "Sample legalName SKIPPED" + "Sample bridge keys" for debugging. Common causes:

- Set A: `source_file` differs between `users.json` ↔ `company_profiles.json` (whitespace/case)
- Set B: orphan `userId` (no matching row in `users_table.id`)

→ Stark must clean the data or re-export with consistent keys.
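To quantify the Set B problem before deciding to re-export, orphans can be counted directly from the two files (a hypothetical helper, not part of the repo's scripts):

```javascript
// Hypothetical helper: company_profiles_table rows whose userId has no
// matching users_table.id — these become skip(no bridge) in the import log.
function findOrphans(profiles, usersTable) {
  const ids = new Set(usersTable.map((u) => u.id));
  return profiles.filter((p) => !ids.has(p.userId));
}

const orphans = findOrphans(
  [{ id: 'c1', userId: 'u1' }, { id: 'c2', userId: 'u9' }],
  [{ id: 'u1' }]
);
console.log(orphans.length); // → 1
```

Feed it the parsed `company_profiles_table.json` and `users_table.json`; if the orphan count roughly equals the skip count in the log, the keys themselves are the problem, not the matching logic.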
### High children skip rate (> 30%)

`businessName` values in the children files match none of the imported companies — the children may have been exported from a different source than the users files.

```sh
node -e "const c = require('./data/companies/products_services.json'); console.log(new Set(c.map(x => x.businessName?.trim().toLowerCase())).size);"
```

Compare this against the number of imported companies. A large gap → re-export the children with consistent names.
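The same comparison can be done in one pass to estimate the skip rate up front. A hypothetical helper, assuming the script matches children on trimmed, lowercased `businessName`:

```javascript
// Hypothetical helper: fraction of child rows whose businessName matches no
// imported company (normalized the same way as the one-liner above).
const norm = (s) => (s ?? '').trim().toLowerCase();

function childSkipRate(children, companyNames) {
  const imported = new Set(companyNames.map(norm));
  const skipped = children.filter((c) => !imported.has(norm(c.businessName))).length;
  return children.length ? skipped / children.length : 0;
}

console.log(childSkipRate(
  [{ businessName: 'ACME ' }, { businessName: 'Ghost Co' }],
  ['acme']
)); // → 0.5
```

Anything above ~0.3 on the real files means importing will waste time; fix the export first.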
## Full rollback

Delete the children first, then the parent profiles:

```sql
delete from products_services where user_id is null;
delete from certificates where user_id is null;
delete from contacts where user_id is null;
delete from customers where user_id is null;
delete from cooperations where user_id is null;
delete from company_profiles where source = 'import' and claim_status = 'unclaimed';
```
## Output checklist

After the full import completes: