Factorio Learning Environment: agents that build factories

(jackhopkins.github.io)

1 point by GN⁺ 2025-03-12 | 1 comment

FLE 0.3.0 has been released. It uses the factory automation game Factorio to test agents' long-term planning and spatial reasoning, and it includes a demo that connects Claude Code to Factorio
The new version makes research experiments easier with headless execution, a pixel observation renderer, an OpenAI Gym-compatible interface, evaluation runs from a CLI, Weights and Biases logging, and analysis tools
The example agent reaches its target of 16 iron gear wheels per minute by repeatedly debugging power generation, iron ore mining, smelting, assembling machine placement, and belt connections
The lab-play benchmark evaluates Pass@8 on strong models as of September 2025, with production targets of 16 per minute for solid items and 250 per minute for fluids, and a limit of 64 steps
Frontier models have improved since v0.2.0, but manual transport, chest buffers, API misuse, and misjudging the dynamic game state persist, so Factorio remains a challenging environment for demonstrating long-term planning and dynamic recovery

What changed in FLE 0.3.0

FLE 0.3.0 is a major update to a learning environment that tests long-term planning, reasoning, and world modeling through Factorio factory construction tasks
The earlier FLE paper showed frontier models struggling to adapt to a changing environment, set long-term goals, and recover dynamically; 0.2.0 then introduced multi-agency, backtracking agents, and vision
Main changes in 0.3.0:
- Claude Code was connected to Factorio through FLE and demoed on Twitch
- The dependency on the Factorio game client was removed, enabling headless scaling for large-scale experiments
- A new headless game renderer provides realistic pixel observations for multimodal agent research
- The evaluation environment now conforms to the OpenAI Gym interface, which eases integration with existing research codebases
- The FLE CLI supports running experiments with a one-line shell command, and the evaluation code, Weights and Biases logging, sweep resume, and analysis tools have been open-sourced

Quick start

# 1. Install FLE with uv
uv add factorio-learning-environment



# 2. Start a Factorio server cluster
fle cluster start



# 3. Run an evaluation (with API keys in .env)
fle eval --config configs/gym_run_config.json

FLE is installed with uv. After starting a Factorio server cluster with fle cluster start, an evaluation is run with API keys in .env and a config file

Example: an automated iron gear wheel factory

The example agent starts in a lab-play world with an item inventory and the goal of building an iron gear wheel factory
It interacts with the game environment by calling the FLE API from Python, and observes the standard output and error messages of each run
Power setup
- Finds a water location with nearest(Resource.Water) and places an offshore pump
- Places a boiler and a steam engine, connects the pipes with connect_entities, and puts coal in the boiler
- Waits 5 seconds, then checks the steam engine's energy value to verify power generation
Iron mining and smelting
- After locating iron ore, places 2 electric mining drills and an electric furnace
- Calculates that 16 iron gear wheels per minute require 32 iron plates per minute, and since one electric mining drill mines 30 ore in 60 seconds, 2 drills are needed
- The drills and the electric furnace are connected to the steam engine's power network with medium electric poles
Assembling machine placement
- Places an AssemblingMachine2 at least 20 tiles away from the mining area
- Sets the assembling machine's recipe to Prototype.IronGearWheel, places input and output inserters, and connects them to the power network
- An assembling machine 2 can craft 90 iron gear wheels in 60 seconds, so 1 machine is enough for the target throughput
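The sizing arithmetic in the mining and assembling steps above can be reproduced in a few lines. This is only a sketch of the calculation, not code from FLE: the constant and function names are made up for illustration, and it assumes one iron ore smelts into one iron plate and that each gear wheel consumes 2 plates.

```python
import math

# Figures quoted in the walkthrough above; the names are illustrative only.
PLATES_PER_GEAR = 2                # assumed recipe: 2 iron plates -> 1 iron gear wheel
TARGET_GEARS_PER_MIN = 16          # lab-play target for solid items
ORE_PER_DRILL_PER_MIN = 30         # one electric mining drill, per 60 seconds
GEARS_PER_ASSEMBLER_PER_MIN = 90   # one assembling machine 2, per 60 seconds


def machines_needed(demand_per_min: float, rate_per_machine: float) -> int:
    """Smallest whole number of machines whose combined rate covers the demand."""
    return math.ceil(demand_per_min / rate_per_machine)


# Assumes 1 ore smelts into 1 plate, so ore demand equals plate demand.
plates_per_min = TARGET_GEARS_PER_MIN * PLATES_PER_GEAR
drills = machines_needed(plates_per_min, ORE_PER_DRILL_PER_MIN)
assemblers = machines_needed(TARGET_GEARS_PER_MIN, GEARS_PER_ASSEMBLER_PER_MIN)

print(plates_per_min, drills, assemblers)  # 32 2 1
```

Rounding up matters here: one drill at 30 ore per minute falls just short of the 32 plates per minute required, which is why the agent places two.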
Belt connection and error recovery
- Tried to connect the furnace output inserter directly to the assembler input inserter with a belt, but found that storage chests were blocking the path
- Took the iron plates out of the 2 wooden chests in the way and removed the chests, but the input buffer chest on the assembler side was still there, so the error occurred again
- Finally removed the assembler input chest as well and connected a transport belt-based logistics network, and the automated iron gear wheel system reached the target throughput

Observation space and agent harness

At every step the agent receives a structured Observation object containing the game state
Main fields:
- raw_text: standard output and error messages from the previous action program, with source code line numbers
- entities: all entities in the game world and their properties, such as location, type, direction, inventory, and warnings
- inventory: item types and quantities in the agent's personal inventory
- research: researched technologies, current research progress, and available technologies with prerequisites and cost
- game_info: tick count, elapsed time, game speed
- flows: input/output rates, crafted items, gathered resources, and an optional price list for economic evaluation
- messages: messages between agents for multi-agent coordination
- task_info: goal description, instructions, task identifier, maximum trajectory length
- task_verification: success/failure and goal progress metadata
- serialized_functions: previously defined helper functions and abstractions
- map_image: base64-encoded PNG of the factory layout for visual agents
This observation space supports spatial awareness, production metrics tracking, error debugging, and multi-step automation planning
The evaluation agent harness concatenates these fields into a formatted Markdown string
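A rough sketch of how such an observation might be turned into Markdown is shown below. It is not the FLE implementation: the `Observation` dataclass here carries only a subset of the fields listed above, the field types are guesses, and `format_observation` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    """Illustrative subset of the fields listed above; types are assumptions."""
    raw_text: str = ""
    entities: list[str] = field(default_factory=list)
    inventory: dict[str, int] = field(default_factory=dict)
    game_info: dict[str, Any] = field(default_factory=dict)
    task_info: dict[str, Any] = field(default_factory=dict)


def format_observation(obs: Observation) -> str:
    """Concatenate non-empty fields into one Markdown string, one section per field."""
    sections = []
    for name in ("raw_text", "entities", "inventory", "game_info", "task_info"):
        value = getattr(obs, name)
        if not value:
            continue  # skip empty fields to keep the prompt short
        if isinstance(value, dict):
            body = "\n".join(f"- {k}: {v}" for k, v in value.items())
        elif isinstance(value, list):
            body = "\n".join(f"- {item}" for item in value)
        else:
            body = str(value)
        sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


obs = Observation(
    raw_text="0: placed offshore pump",
    inventory={"iron-plate": 12, "coal": 5},
)
markdown = format_observation(obs)
print(markdown)
```

Skipping empty fields is a design choice of this sketch, made to save context; the real harness may format or order the fields differently.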

Lab-play benchmark settings

Lab-play is a constrained environment in which the agent is given fixed resources and a single target entity and has to maximize production throughput
Open-play is far more complex, because the agent has to handle a procedurally generated map with no starting inventory, sparser resources, and complex goals
The methodology of the original FLE paper was replicated for the lab-play setting on strong models as of September 2025
A standardized agent harness keeps appending environment interactions to a single conversation history, and when the token budget runs low it summarizes older history so that reasoning can continue
The backtracking and reflection logic used in FLE 0.2.0 was not evaluated
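The history handling described above can be sketched roughly as follows. Everything here is an assumption for illustration: `count_tokens` is a crude word count standing in for a real tokenizer, `trim_history` and `naive_summary` are hypothetical names, and the actual harness would call a model to write the summary.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words. A real harness would use
    # the model's own tokenizer.
    return len(text.split())


def trim_history(
    history: list[str],
    budget: int,
    summarize: Callable[[list[str]], str],
    keep_recent: int = 2,
) -> list[str]:
    """If the history exceeds the budget, replace all but the most recent
    messages with a single summary message (one pass, no re-check)."""
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent


def naive_summary(messages: list[str]) -> str:
    # Placeholder for a model-written summary.
    return f"[summary of {len(messages)} earlier steps]"


history = [f"step {i}: " + "log " * 10 for i in range(6)]
trimmed = trim_history(history, budget=40, summarize=naive_summary)
print(trimmed[0])  # [summary of 4 earlier steps]
```

Keeping the most recent steps verbatim is a guess at what matters most for debugging; the source only says that older history is summarized.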
Evaluation conditions
- Goal: reach a production throughput of 16 per minute for solid items and 250 per minute for fluids
- Prompt: FLE API docs, Factorio recipes, a general pattern guide
- Inventory: a set of items useful for building a functional factory
- Maximum steps: 64 steps, with early stop on completion
- Reasoning: for models that support reasoning, the default setting {"enabled": true} was applied
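The step limit and early stop in these conditions amount to a simple loop. The sketch below is a toy under stated assumptions: `run_episode` is a hypothetical name, and `step` stands in for one agent action followed by a throughput measurement; it does not use the real FLE or Gym interface.

```python
from typing import Callable

THRESHOLDS = {"solid": 16, "fluid": 250}  # units per minute, from the conditions above
MAX_STEPS = 64


def run_episode(step: Callable[[int], float], kind: str) -> tuple[bool, int]:
    """Call step(i) until it reports throughput at or above the target
    (early stop) or the step limit is reached. Returns (solved, steps_used)."""
    target = THRESHOLDS[kind]
    for i in range(1, MAX_STEPS + 1):
        if step(i) >= target:
            return True, i
    return False, MAX_STEPS


# Toy stand-in for an agent: throughput grows by 2 units/min with every step.
solved, steps_used = run_episode(lambda i: 2.0 * i, "solid")
print(solved, steps_used)  # True 8

fluid_solved, fluid_steps = run_episode(lambda i: 2.0 * i, "fluid")
print(fluid_solved, fluid_steps)  # False 64
```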

Model performance and remaining limitations

Open source models have caught up with the state-of-the-art performance observed in v0.2.0 in May 2025, with successes in automating electronic circuits, steel plate, sulfur, and plastic
The latest frontier models improved substantially over FLE v0.2.0, and for the first time succeeded on tasks in the harder half, which can involve more than 12 ingredient dependencies
In FLE lab-play, the ranking and performance gaps of advanced models were roughly Claude > GPT > Gemini > Grok, and most closely resembled OpenAI's GDPVal
On static exam-style benchmarks such as Humanity's Last Exam, AIME 25, GPQA, and MMMU, models that are weaker on FLE sometimes score higher, so the results contrast
Even successful agents often rely on semi-manual strategies rather than robust automation in complex tasks
- They carry resources themselves
- They use storage chests as resource buffers
- They bypass building a fully automated logistics chain
Intermediate buffers can satisfy throughput checks for a while, which makes measurement difficult
To reduce this problem, the evaluation checks whether the quota is met after a holdout period in which the agent leaves the factory untouched for 60 seconds
With higher throughput targets, passing through manual logistics becomes harder, so proper automation can be demanded
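One way to picture the holdout check is to measure production only inside the untouched 60-second window. The sketch below assumes the evaluator has cumulative production samples over time; the function name, the sample format, and the example numbers are all invented for illustration and are not taken from FLE.

```python
def throughput_after_holdout(
    samples: list[tuple[float, int]],
    holdout_start: float,
    holdout_seconds: float = 60.0,
) -> float:
    """Items per minute produced during the holdout window.

    samples: (time_in_seconds, cumulative_items_produced) pairs.
    Cumulative counts never decrease, so the largest count seen up to a
    given time is the count at that time.
    """
    end = holdout_start + holdout_seconds
    at_start = max((c for t, c in samples if t <= holdout_start), default=0)
    at_end = max((c for t, c in samples if t <= end), default=0)
    return (at_end - at_start) * 60.0 / holdout_seconds


# Invented example: a hand-fed factory looks fine before the holdout but
# stalls once the agent stops topping up its chests.
buffered = [(0, 0), (100, 40), (160, 44)]
# Invented example: a belt-fed factory keeps producing while untouched.
automated = [(0, 0), (100, 10), (160, 27)]

print(throughput_after_holdout(buffered, holdout_start=100))   # 4.0
print(throughput_after_holdout(automated, holdout_start=100))  # 17.0
```

Against a target of 16 per minute, only the second factory would pass, even though the first produced more items before the holdout began.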

Error types and model-wise differences

Frontier models consistently have difficulty recovering once errors accumulate
Average error rates:
- Claude Opus 4.1: 22.99%
- GPT-5: 25.05%
- Gemini 2.5 Pro: 27.29%
- Grok 4: 40.89%
Grok 4 often gets stuck in regressive debug loops, while GPT-5 shows a more graceful recovery pattern
For most models the error rate rises in the middle section of the trajectory, where factory complexity grows
Failure types
- Syntax errors: invalid Python code, grammar mistakes, errors that prevent execution altogether
- Semantic errors: misuse of FLE commands or tool arguments, failure to understand the docs, TypeError, AttributeError, NameError, and so on
- Practical errors: wrong reasoning about the current game state, such as trying to insert an item that is not in the inventory
- Planning/control errors: knowing the primitives but failing to chain actions consistently, which produces an inefficient or incomplete trajectory
- This category is about higher-level strategic consistency rather than individual error types, so it is hard to quantify reliably with automated trajectory analysis
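The first three failure types can be separated mechanically from how a program fails, roughly as sketched below. This is an illustrative classifier, not the analysis code used for the benchmark: `GameStateError`, `insert_item`, `make_env`, and `classify_program` are hypothetical names, and planning/control errors are deliberately left out because, as noted above, they do not show up as individual exceptions.

```python
SEMANTIC = (TypeError, AttributeError, NameError)


class GameStateError(Exception):
    """Hypothetical error raised when an action contradicts the game state."""


def insert_item(inventory: dict, item: str) -> None:
    # Hypothetical tool: fails if the item is not actually in the inventory.
    if inventory.get(item, 0) <= 0:
        raise GameStateError(f"no {item} in inventory")
    inventory[item] -= 1


def make_env() -> dict:
    # Fresh namespace per program so runs do not affect each other.
    return {"insert_item": insert_item, "inventory": {"coal": 1}}


def classify_program(source: str) -> str:
    """Return 'syntax', 'semantic', 'practical', or 'ok' for an action program."""
    try:
        compiled = compile(source, "<agent-program>", "exec")
    except SyntaxError:
        return "syntax"      # code never runs
    try:
        exec(compiled, make_env())
    except SEMANTIC:
        return "semantic"    # misuse of names, attributes, or argument types
    except Exception:
        return "practical"   # valid call that the game state rejects
    return "ok"


results = [
    classify_program("insert_item(inventory, 'coal'"),         # missing parenthesis
    classify_program("insert_itme(inventory, 'coal')"),        # misspelled tool name
    classify_program("insert_item(inventory, 'iron-plate')"),  # item not in inventory
    classify_program("insert_item(inventory, 'coal')"),
]
print(results)  # ['syntax', 'semantic', 'practical', 'ok']
```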
Model-wise error distribution
- Claude Opus 4.1 has zero syntax errors, and about 97.7% of its errors are practical errors, which suggests strong code generation but difficulty maintaining an accurate mental model of the game state
- Gemini 2.5 Pro, Grok 4, and GPT-5 show API understanding errors at the 12–17% level, which points to difficulty using the FLE API docs accurately
- GPT-5 and Grok 4 show 21% and 17% syntax errors respectively, so even models that top current coding benchmarks often fail to generate valid Python
- Only Gemini 2.5 Pro currently takes the approach of defining helper functions and abstractions and then reusing them

Claude Code and MCP

In v0.2.0 an MCP server was released to let external agents interact with FLE
v0.3.0 extends it with a Claude Code adapter
A stream of Claude Code playing Factorio can be watched on Twitch

Next research directions

By human standards, current frontier models are not very good at Factorio, and they struggle to represent and model a dynamic environment and to develop formal abstractions that can serve as future tools
Even so, frontier model capability in lab-play improved steadily during 2025
Factorio can continue to serve as an environment that exposes general model capabilities such as long-term planning, domain adaptation, world modeling, and spatial reasoning
FLE v0.3.0 establishes lab-play as the first formal benchmark, but this is only the start of the research agenda
Near-term tasks
- Human baseline: systematically measure human performance by task difficulty to calibrate agent capability
- Reward hacking response: handle the problem of agents using manual crafting for complex items instead of proper automation
- METR-style task scaling: develop a scaling chart that systematically links task difficulty to required capabilities
Long-term tasks
- Open-play and megabase expansion: raise difficulty from constrained lab-play to procedurally generated maps, multi-stage goals, and megabases with thousands of connected machines
- Real-time performance under latency constraints: thinking time between actions is currently unlimited, but a benchmark in which Factorio keeps running would evaluate the trade-off between response latency and solution quality
- Multi-agent coordination: handle cooperation, competition, emergent market dynamics, division of labor, resource allocation negotiation, and the formation of comparative advantage
- Mod-based out-of-distribution environments: evaluate whether agents can relearn causal structure under a new tech tree and new game mechanics
- Native computer-use interface: evaluate agents through a human-like keyboard/mouse/vision interface instead of the Python API
- Adversarial dynamics and robustness: introduce hostile aliens and nondeterministic environment challenges to evaluate adaptive control and resilience

How to participate

Both the code and the missions in FLE are open source
Contributors wanted:
- Researchers exploring new architectures for long-term planning and spatial reasoning
- Engineers optimizing large-scale evaluation and training infrastructure
- Modders designing new challenge domains
If you are interested in joining the team, you can meet them on Discord

1 comment

GN⁺ 2025-03-12

Hacker News comments

Now I'm completely hooked, and I feel like applying to the Anthropic Factorio lab right away
From the paper and the comments I can't tell whether they send multimodal data back; many models aren't multimodal, so probably not. Some are, though, and the recently released Qwen 2.5 VLM looks quite strong for its size
They put a lot of emphasis on the lack of spatial ability, and also discussed the difficulties of both planning and spatial planning, so I'd like to know whether they also send images such as screenshots. If not, I'd like to hear their thinking on that too
Also, enabling Python libraries through MCP and having every tool-use-capable LLM play Factorio seems like an obvious thing to do
- For now this is a text-only environment, but we plan to support visual input later
  In some tests, including game state screenshots did not improve the performance of off-the-shelf models. As the game state grew more complex and the screenshots contained more entities, the models got more confused; they hallucinated directions or entities, or failed to fix clearly visible mistakes such as missing transport belts and wrongly rotated inserters
  We believe this is because current VLMs are not good at spatial reasoning in highly detailed images, and fine-tuning could improve this considerably. MCP is also taking off quickly these days, so we will look at that too
- If a text description of the factory state is easier to understand and also causes less confusion, I don't see why screenshots are needed
  The game runs on a grid, so converting the game state to an ASCII representation should be easy
A while ago there was a post on HN from a team that trained an agent to complete Pokémon Red with reinforcement learning. They said they had to tune the cost function so that exploration earned small rewards and essential tasks such as defeating a gym earned large ones
I'd like to know whether the same approach could be used in Factorio. In the Pokémon Red analogy, the main essential tasks in Factorio are building automation for new items and new science packs
A small reward for each item's per-second production, a medium reward for automating a new item, and a large reward for automating a new science pack could make a good reward function
Just telling a Factorio agent to "build a big factory" is like telling a Pokémon Red agent to "finish the game"; it has to be broken into small steps with a very carefully tuned reward function
Thinking about this makes me want to jump into the project
- Speaking from 2–3 thousand hours in Factorio, the goal of building "as big a factory as possible" is very vague and is not the right metric
  When Factorio players build large megabases, they don't target size, they target science per minute (SPM). The metric given to the agent should be SPM, not the "biggest" base
- FLE has access to milestones that record when a new entity was first built, but splitting rewards into a hierarchy by automation level would also be really interesting. It would be good to try that together
- This is the interesting part. In lab-play, Claude could do essential tasks and simple automation such as an iron gear wheel factory, but in the "build the biggest factory" game episode it did not even try
  Models can do such essential tasks, but when they get a general goal like "finish the game", they lack the level of long-term planning needed to attempt it. They often don't try to expand the existing factory, and instead build small, uncoordinated structures
  Learning how models behave when given a vague, general goal was also one of our aims
- The same approach can be applied to life
- I'm not sure you read the page. A reward was in fact given for every item produced, with higher rewards for more complex items
It is interesting that six frontier language models were evaluated in two settings, but there are many far simpler dynamic benchmarks that can saturate the planning ability of non-reasoning models
Giving a list of flight connections between cities and asking for an itinerary between them is enough; once the shortest path between two nodes gets long enough, all of these models get confused
These were the longest shortest paths between cities that each model could find reliably, 8 times out of 10 at each length
| Model            | Path Length |
|------------------+-------------|
| Claude Sonnet3.5 | 10          |
| GPT-4o           | 7           |
| GPT-4o-mini      | 4           |
| Deepseek-v3      | 6           |
| Gemini-2-Flash   | Not tested  |
| Llama3.3-70B-Ins | 4           |
- That's right. Simpler benchmarks that saturate the planning ability of such models do exist
  However, we wanted to build a broader-spectrum evaluation environment that tests several abilities at once and stays relevant going forward
It makes sense that every model hit the limits of spatial planning when building a factory with several zones. The failures were typically placing entities too close together, leaving no space for connections, or putting inserters in the wrong place
I can see why LLMs are weak at spatial reasoning: there is not much training data of that kind. I'm curious which additional reasoning abilities would emerge once spatial reasoning is solved
- I don't quite follow the claim that there isn't much spatial data
  Even the simplest simulator can generate practically infinite data, can't it?
  For example, implement tic-tac-toe on an infinite grid in about 10 lines of code and you can generate an unlimited training set
I'd like to see balancer design as another category of "Lab Play" tasks
Even small balancers can be quite complex (https://factorioprints.com/view/-NopheiSZZ7d8VitIQv9), and it would be interesting to see whether models can solve problems by designing them
- Someone approached that problem with a more traditional SAT solver
  https://github.com/R-O-C-K-E-T/Factorio-SAT
Great idea
There are many interesting experiments that could be done here. Adding time-related elements to the lab-play scenario seems like a good idea. Most Factorio users who play with biters enabled would see this as a combination of time and space constraints, and putting a time limit on the agent allows a kind of proxy comparison with real game conditions.
A nice aspect of this framework's design is that it tests something different from the micro-management ability seen in the DOTA 2 or StarCraft 2 experiments. In StarCraft 2 especially, with infinite APM you see behaviour like micro-managing workers extremely finely to extract slightly more minerals.
Such behaviour is an interesting learning result in a narrow context, but it is heavy to execute in practice and even pro players can get it wrong. It also doesn't seem to give any extra insight into an agent's long-term planning, execution, and analysis performance.
In that respect FLE is far more interesting as a higher-level thinking evaluation framework. I'd also like to know whether a layout optimization benchmark is planned, such as optimizing performance for a given factory cell with X inputs and Y outputs.
- We are talking about making a task that is a bit more like tower defense, for example releasing biters in X stages or every X seconds.
  The goal is to test how capable an agent is at building a military-industrial complex. A funny problem while developing this idea was that frontier models were reluctant to build entities with names like 'GunTurret'. Perhaps they feel it goes against their constitution. We may have to rename the turret to something like 'SuperSoaker'.
  We actually discussed a layout optimization benchmark just yesterday. In my view two kinds of layout tasks are needed: 1) fixing a subtly broken factory, 2) improving the throughput of a given factory. Implementation should be relatively easy, so it would be good to look into.
I don't quite understand. Were these models post-trained to play Factorio?
A) If so, how is that possible for models without public weights, such as Claude? B) If not, how does the agent know what the API does? Suppose it guesses from the English meaning of an API command, for example that place_entity_next_to means placing an entity next to something; how does it know the recipes? If it learns by trying, we are back at A.
From reading the PDF it seems no post-training was done, but then I don't see how the questions in B are explained.
If there really is no post-training and recipe exploration is expected to happen within the context window, that seems far too small for reinforcement-learning-style improvement.
In short, I don't know whether these models could be tested with post-training, and if it was done without post-training, they all did unbelievably well.
If the authors see this, I'd like to know how many API query and API response pairs fit in the context window on average. Beyond that, if the API call names were abbreviated so that more response pairs fit in one context window, would the results improve?
- As for tools, agents had access to function signatures, that is, the tool docstring and input/output types, and there was also a short "manual" for each tool.
  The manual explained what the tool does and how it affects the game state, and gave a few usage examples, such as using place_entity_next_to to place an inserter next to an existing chest.
  As Jack said, there was no post-training at all, but every agent had the full API description in context, including tools, entities, and research. So these results show to some extent how well modern agents can use a fully out-of-distribution API that has proper documentation.
- These models were not post-trained; they were all off-the-shelf models.
  Up to about 128 pairs could be put in the context, but performance was about the same as with 32 pairs, so we ultimately chose 32 pairs for cost and latency reasons.
  Encoding inputs and outputs more compactly reduced performance. Descriptive names seem to help because they give pretrained models an intuition for what the calls do.
- Reading the footnote in the author intro, it seems one of them works at Anthropic. They probably had internal access.
It is interesting that there are only a few complex scenarios. I always thought that for an ML game agent to learn the game mechanics properly, it would need hundreds of sets of very small puzzles, each with hundreds of variants.
For example: the factory has no power, so place the missing electric pole; the factory is short of items, so lay the missing belt; craft and place 200 assembling machines; an assembling machine has stopped for some reason, fix it; the factory's output is too low, double it; get to another point inside the factory as fast as possible; fix a power shortage; and split all of these tasks into cases with and without robots.
Generating a few thousand such example scenarios programmatically should be relatively easy. Then use them like an IQ test question bank: pick about 12 from the bank and evaluate performance on each based on time and materials used.
I believe an ML agent should be evaluated on samples from a large scenario bank with smoothly increasing complexity, and that it learns faster when it is given more complex scenarios only after scoring high enough at lower complexity.
- As you suggest, generating a scenario as text is easy, but building a correct factory game state as the starting point is much harder.
  As far as I know, it eventually turns into the same work of manually designing the initial state and the task to be completed.
- We are considering a curriculum approach like this for additional training.
  However, the current work focused on evaluation, so we did not do it. The "difficulty" of different tasks is quite subjective, so we would have had to make arbitrary decisions that could affect the evaluation. For example, which task should follow which scenario, or whether all difficulty levels are covered sufficiently.
I'd like to know whether there is a human play benchmark for this kind of interface. I don't mean that it is necessary or relevant; I'm just curious what programming-style Factorio feels like.
Doing spatial reasoning around a text prompt seems quite hard even for human players.
- The human benchmark for Factorio is the speedrunners who race to the first rocket launch.
  The current record is a little over 4 hours solo, and 90 minutes for a team. That alone shows that multitasking LLMs have room to overtake humans.
I wonder whether in a few years every in-game opponent will be an LLM with access to this kind of game-control API.
I'd also like to know whether there were particular kinds of tasks that the models found especially hard, or whether difficulty mainly grows with the number of items to be deployed
- LLMs are very unlikely to be used as opponents on a large scale. In most games, enemy AI does not need the complexity of machine learning. That holds even if we set compute cost aside for now.
  The main goal of enemy AI is not to be the hardest thing in the world, but to give the player an interesting challenge that can be overcome. In most games, building a very high-performance AI is not necessarily hard, but that does not make it fun to play against.
  Most games have finite logical states; they are just so large that humans cannot find all the solutions. Of course, humans are very good at pushing at the edges of such states to find workarounds.
  Even in games whose state space is far larger than usual, cases where anyone wants a super AI are rare. For example, nobody likes playing against an aimbot in an FPS.
  Factorio is an exception among games in that the real "win" condition depends almost entirely on the player. In Factorio without the DLC, the game's win condition, the rocket, can be built with almost no factory beyond the most basic structures for things that cannot be crafted by hand. It would be extremely slow, but it is a viable option. So in benchmarks like this, efficiency matters more than "does it work".
- I think it is possible, because no separate training compute is needed to run it. Given an API, plugging different models into a new game is very easy.
  Models struggle mainly in two areas. The first is spatial reasoning. Models often make off-by-one errors, and a factory, like a program, is very sensitive to such mistakes, so recovery is hard.
  The second is long-term planning: the ability to understand what to do strategically before forming tactical sub-goals.
  In lab-play, difficulty is generally proportional to the depth of the production chain. If making an item first requires several factory sections, it becomes much harder. This seems related to planning, because models tend to dive into fixing small problems rather than making the big plan first.
- Watch "Claude plays Pokémon": it struggles in Mount Moon, and so did I when I was four.
- Why does it have to be an LLM? Isn't AlphaZero good at this kind of thing? There are plenty of useful machine learning models besides LLMs!
