Keep raw README text separate
The full README text should not go directly into the main CSV: READMEs can be large, which makes the CSV heavy and messy.
Better approach:
data/
├─ raw/
│  └─ readmes/
│     ├─ owner_repo_README.md
│     └─ another_owner_repo_README.md
│
└─ processed/
   └─ github_readme_readiness_scores.csv
Then your CSV stores metadata and scores, while the raw README files stay in a folder.
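A minimal sketch of that split, assuming the folder layout above. The function names, the CSV columns, and the `score` argument are illustrative placeholders, not part of any existing code:

```python
import csv
from pathlib import Path

# Hypothetical locations, mirroring the folder layout above.
RAW_DIR = Path("data/raw/readmes")
SCORES_CSV = Path("data/processed/github_readme_readiness_scores.csv")
# Assumed column set; the real CSV may carry more metadata.
FIELDNAMES = ["owner", "repo", "readme_path", "score"]


def save_readme(owner: str, repo: str, readme_text: str, raw_dir: Path = RAW_DIR) -> Path:
    """Write the raw README to its own file and return the path."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{owner}_{repo}_README.md"
    path.write_text(readme_text, encoding="utf-8")
    return path


def append_score_row(owner: str, repo: str, readme_path: Path, score: int,
                     csv_path: Path = SCORES_CSV) -> None:
    """Append one metadata + score row; the README text itself is not stored."""
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not csv_path.exists()  # checked before the file is created
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "owner": owner,
            "repo": repo,
            "readme_path": str(readme_path),
            "score": score,
        })
```

The CSV row only points at the README file, so the CSV stays small no matter how long the READMEs get.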
Later Supabase structure
Once the notebook results are strong, import them into research-specific tables, not your production submissions table yet:
- github_repo_samples
- github_readme_scores
- github_readme_rubric_versions
Then later you can compare against:
- builder_project_submissions
- builder_project_reviews
- builder_readme_scores
That separation is important because public GitHub research data and Builder Showcase user data are related, but they are not the same kind of record.
The business value gets stronger over time
Eventually, you’ll be able to say:
Before Builder Showcase:
- User projects averaged 52/100 README readiness.
After Builder Showcase review:
- User projects averaged 81/100 README readiness.
Compared to public GitHub sample:
- Public baseline averaged 43/100.
That is not just analytics; it is product impact evidence.
Best next move
Yes, share your current scoring structure/code when ready. The best thing to do next is turn it into a reusable function like:
score_readme(readme_text, rubric_version="builder_showcase_readme_v1")
Then your notebook can score both:
- public_github_readmes
- builder_showcase_submissions
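A reusable function along these lines might look like the sketch below. The actual rubric is not shown in these notes, so the signal names, regex patterns, and point weights are placeholders chosen only to make the example run; only the `score_readme` signature and the rubric version string come from the text above.

```python
import re

# Hypothetical rubric: signal name -> (regex pattern, points).
# Placeholder signals and weights that sum to 100; swap in the real rubric.
RUBRICS = {
    "builder_showcase_readme_v1": {
        "has_title": (r"^#[ \t]+\S", 20),
        "has_install_section": (r"^#{1,6}[ \t]+.*install", 30),
        "has_usage_section": (r"^#{1,6}[ \t]+.*usage", 30),
        "mentions_license": (r"\blicen[sc]e\b", 20),
    }
}


def score_readme(readme_text: str,
                 rubric_version: str = "builder_showcase_readme_v1") -> dict:
    """Parse signals from the README text, then score from those signals."""
    rubric = RUBRICS[rubric_version]
    flags = re.IGNORECASE | re.MULTILINE
    signals = {
        name: bool(re.search(pattern, readme_text, flags))
        for name, (pattern, _points) in rubric.items()
    }
    score = sum(points for name, (_pattern, points) in rubric.items() if signals[name])
    return {"rubric_version": rubric_version, "score": score, "signals": signals}
```

Because the function takes plain text and a rubric version, the same call works for both sources, and the returned `rubric_version` makes it possible to tell later which rubric produced a given score.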
Current Setup and Approach
- No individual .md files
- No .yml sidecar files
- One checkpoint CSV
- One final readiness CSV
- Parsing happens in memory
- Scoring happens from parsed signals
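The checkpoint idea in the list above could be sketched as follows. This is an illustration under assumptions, not the notebook's actual code: the two-column checkpoint layout, the function names, and the `score_fn` callback are all hypothetical.

```python
import csv
from pathlib import Path

# Assumed checkpoint columns; the real checkpoint CSV may differ.
CHECKPOINT_FIELDS = ["repo", "score"]


def load_done(checkpoint_path: Path) -> set:
    """Return the repo names already recorded in the checkpoint CSV."""
    if not checkpoint_path.exists():
        return set()
    with checkpoint_path.open(newline="", encoding="utf-8") as f:
        return {row["repo"] for row in csv.DictReader(f)}


def run_with_checkpoint(readmes: dict, score_fn, checkpoint_path: Path) -> int:
    """Score READMEs held in memory, appending one checkpoint row per repo.

    `readmes` maps repo name -> README text. Repos already present in the
    checkpoint are skipped. Returns the number of newly scored repos.
    """
    done = load_done(checkpoint_path)
    write_header = not checkpoint_path.exists()  # checked before the file is created
    newly_scored = 0
    with checkpoint_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CHECKPOINT_FIELDS)
        if write_header:
            writer.writeheader()
        for repo, text in readmes.items():
            if repo in done:
                continue
            writer.writerow({"repo": repo, "score": score_fn(text)})
            f.flush()  # keep the checkpoint current if the run is interrupted
            newly_scored += 1
    return newly_scored
```

Rerunning after an interruption only scores the repos that are missing from the checkpoint, and no README text is ever written to disk.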
Later, if the dataset gets large, the upgrade path is:
- CSV for scores
- Parquet or JSONL.GZ for raw README text archive
- Supabase only for clean metadata + scores
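For the JSONL.GZ option, a minimal sketch using only the Python standard library is shown below. The record fields (`repo`, `readme_text`) and the function names are assumptions made for illustration.

```python
import gzip
import json
from pathlib import Path


def write_readme_archive(records, archive_path: Path) -> None:
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(archive_path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_readme_archive(archive_path: Path):
    """Yield the archived records back, one at a time."""
    with gzip.open(archive_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Reading is streamed line by line, so the whole archive does not need to fit in memory, while the scores CSV stays free of raw text.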
For now, this is the best Builder Showcase MVP pipeline: accurate enough, easy to debug, and not over-engineered.
Separate staging GitHub Pages site setup
This section documents the staging / testing site for datainsidedata.com.
Create _site_staging_full folder
bundle exec jekyll build --config _config.yml,_config_staging.yml --destination _site_staging_full
gh workflow run deploy-staging.yml --ref feat/community-projects