We're building high-quality evaluation and training datasets to improve how Large Language Models (LLMs) perform on realistic software engineering tasks. You will have the opportunity to work on a diverse range of projects, from helping models traverse complex codebases to building agents that improve model performance.
Work across multiple projects to improve LLM performance on code. Sample responsibilities:
Lead and deliver end-to-end agent use cases such as home automation agents, coding copilots, and creative design assistants.
Collaborate with the team to identify edge cases and ambiguities in model behavior.
Review and compare 3-4 model-generated code responses per task using a structured ranking system.
Evaluate code diffs for correctness, code quality, style, and efficiency, and provide clear, detailed rationales for each ranking decision.
Several years of software engineering experience, including 2+ continuous years at a top-tier product company (e.g., Google, Stripe, Amazon, Apple, Meta, Netflix, Microsoft, Datadog, Dropbox, Shopify, PayPal, IBM Research).
Strong expertise in building full-stack applications and deploying scalable, production-grade software using modern languages and tools.
Deep understanding of software architecture, design, development, debugging, and code review/quality assessment.
Proven ability to review code diffs and evaluate correctness, maintainability, and efficiency.
Excellent oral and written communication skills for clear, structured evaluation rationales.
Commitment: flexible engagement, 10–40 hrs/week (partial PST overlap required).
Type: Contractor (no medical/paid leave).
Duration: 1 month with potential extensions based on performance and fit.