Data Inside Data™

Keep raw README text separate

Full README text is not placed directly in the main CSV, because README files can get large as the sample grows, which would make the CSV heavy and messy.

Better approach:

data/
├─ raw/
│  └─ readmes/
│     ├─ owner_repo_README.md
│     └─ another_owner_repo_README.md
│
└─ processed/
   └─ github_readme_readiness_scores.csv

Then your CSV stores metadata and scores, while the raw README files stay in a folder.
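The split described above could be sketched in Python as follows. This is a minimal sketch, not the actual pipeline: the helper names save_readme and append_score_row and the CSV column names are assumptions made for illustration, and only the folder layout and the owner_repo_README.md naming pattern come from the notes.

```python
import csv
from pathlib import Path


def save_readme(raw_dir: Path, owner: str, repo: str, text: str) -> Path:
    """Write one README into raw_dir using the owner_repo_README.md pattern."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{owner}_{repo}_README.md"
    path.write_text(text, encoding="utf-8")
    return path


def append_score_row(csv_path: Path, row: dict) -> None:
    """Append one metadata/score row, writing the header if the file is new.

    The column names are whatever keys `row` carries (hypothetical here);
    every row is expected to use the same keys.
    """
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

The CSV row would then carry only a path back to the raw file (for example a readme_path column) next to the score, so the text itself never enters the CSV.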

Later Supabase structure

Once the notebook results are strong, import them into research-specific tables, not your production submissions table yet:

github_repo_samples
github_readme_scores
github_readme_rubric_versions

Then later you can compare against:

builder_project_submissions
builder_project_reviews
builder_readme_scores

That separation is important because public GitHub research data and Builder Showcase user data are related, but they are not the same kind of record.

The business value gets stronger over time

Eventually, you’ll be able to say:

Before Builder Showcase:

User projects averaged 52/100 README readiness.

After Builder Showcase review:

User projects averaged 81/100 README readiness.

Compared to public GitHub sample:

Public baseline averaged 43/100.

That is not just analytics — that is product impact evidence.

Best next move

Yes, share your current scoring structure/code when ready. The best thing to do next is turn it into a reusable function like:

score_readme(readme_text, rubric_version="builder_showcase_readme_v1")

Then your notebook can score both:

public_github_readmes
builder_showcase_submissions
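A reusable score_readme along these lines might look like the sketch below. The signature is the one given in these notes, but the rubric contents are entirely hypothetical: the four signals, their regexes, and their point values are placeholders, since the real builder_showcase_readme_v1 criteria are not written down here.

```python
import re

# Hypothetical rubric table: signal name -> (regex, points).
# These criteria and weights are placeholders, not the real rubric.
RUBRICS = {
    "builder_showcase_readme_v1": {
        "title": (r"^#[ \t]+\S", 20),
        "install": (r"^#{1,6}[ \t]+.*install", 30),
        "usage": (r"^#{1,6}[ \t]+.*usage", 30),
        "license": (r"^#{1,6}[ \t]+.*license", 20),
    }
}


def score_readme(readme_text, rubric_version="builder_showcase_readme_v1"):
    """Parse signals from the README text in memory, then score from them."""
    rubric = RUBRICS[rubric_version]
    flags = re.IGNORECASE | re.MULTILINE
    signals = {
        name: bool(re.search(pattern, readme_text, flags))
        for name, (pattern, _points) in rubric.items()
    }
    score = sum(points for name, (_pattern, points) in rubric.items() if signals[name])
    return {"rubric_version": rubric_version, "score": score, "signals": signals}
```

Because the function takes only text and a rubric version, the same call can be applied to both sources, and each result records which rubric version produced it.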

Current Setup and Approach

No individual .md files
No .yml sidecar files
One checkpoint CSV
One final readiness CSV
Parsing happens in memory
Scoring happens from parsed signals

Later, if the dataset gets large, the upgrade path is:

CSV for scores
Parquet or JSONL.GZ for raw README text archive
Supabase only for clean metadata + scores
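For the JSONL.GZ option in that upgrade path, a raw README archive can be written and read with the Python standard library alone. This is a sketch under assumptions: the function names and the record fields are hypothetical, and the only requirement is one JSON object per line.

```python
import gzip
import json
from pathlib import Path


def write_readme_archive(path, records):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside the README text,
            # so each record stays on a single line.
            f.write(json.dumps(record) + "\n")


def read_readme_archive(path):
    """Yield the archived records back one at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Reading is streamed line by line, so the archive never has to be loaded into memory in full, which is the main reason to prefer it over a CSV once README text is included.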

For now, this is the best Builder Showcase MVP pipeline: accurate enough, easy to debug, and not over-engineered.

Separate staging GitHub Pages site setup

This section documents the staging / testing site for datainsidedata.com

Create _site_staging_full folder

bundle exec jekyll build --config _config.yml,_config_staging.yml --destination _site_staging_full

gh workflow run deploy-staging.yml --ref feat/community-projects