<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Isomorphic]]></title><description><![CDATA[Ruminations on mathematics, computer science, philosophy, art, and life, as well as the unlikely threads connecting them together.]]></description><link>https://www.isomorphic.group</link><image><url>https://substackcdn.com/image/fetch/$s_!-drq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7c1b0c9-be1c-4fa2-b0e6-4281b80b8da7_1024x1024.png</url><title>Isomorphic</title><link>https://www.isomorphic.group</link></image><generator>Substack</generator><lastBuildDate>Fri, 01 May 2026 09:23:10 GMT</lastBuildDate><atom:link href="https://www.isomorphic.group/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Dan DiPietro]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dandipietro@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dandipietro@substack.com]]></itunes:email><itunes:name><![CDATA[Dan DiPietro]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dan DiPietro]]></itunes:author><googleplay:owner><![CDATA[dandipietro@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dandipietro@substack.com]]></googleplay:email><googleplay:author><![CDATA[Dan DiPietro]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[On the Ineffability of Sight and Sound]]></title><description><![CDATA[To what extent are the auditory applications of LLMs limited by natural language itself?]]></description><link>https://www.isomorphic.group/p/on-the-ineffability-of-sight-and</link><guid 
isPermaLink="false">https://www.isomorphic.group/p/on-the-ineffability-of-sight-and</guid><dc:creator><![CDATA[Dan DiPietro]]></dc:creator><pubDate>Thu, 25 Jan 2024 03:11:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sTIf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sTIf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sTIf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 424w, https://substackcdn.com/image/fetch/$s_!sTIf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 848w, https://substackcdn.com/image/fetch/$s_!sTIf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!sTIf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!sTIf!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6753828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sTIf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 424w, https://substackcdn.com/image/fetch/$s_!sTIf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 848w, https://substackcdn.com/image/fetch/$s_!sTIf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!sTIf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73ec6050-c961-41cb-b0f1-2eb0802726f9_3340x1046.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Nice idea, but is it fundamentally limited by reality?</h3><p>I&#8217;ve recently come across a few interesting LLM apps that claim to produce high-fidelity music from natural language descriptions (I particularly like <a href="https://www.suno.ai/">Suno</a> and <a href="https://google-research.github.io/seanet/musiclm/examples/">Google&#8217;s MusicLM</a>). This domain isn&#8217;t <em>that</em> new, especially given how fast things move in LLM-land, but the quality has gotten quite good in the past year.</p><p>Still, the idea of mapping a natural language description of music to the music itself feels flawed. If you were asked to describe your favorite song, how would you do it? 
How would you describe the accompaniment to <em>Mary Had a Little Lamb?</em> Of course, you could hum or sing it; you could recite the notes. But, I&#8217;d posit that describing the song with any level of specificity using <em>natural</em> language alone is intractable. There are many ways to describe a given song, and you could generate many (very different sounding) songs from a description.</p><p>To be clear, I&#8217;m not making the claim that you can&#8217;t use LLMs to generate music at all. Rather, you can&#8217;t use them to generate music with any real level of intentionality or specificity. I don&#8217;t think they&#8217;ll democratize music creation in the way that they might democratize visual/digital art creation. It&#8217;s hard to imagine a universe where somebody who is not already a musician&#8212;but can perhaps imagine a piece of music in their head that they&#8217;d like to create&#8212;can communicate that idea in natural language and have their song come to life via an LLM.</p><p>Ultimately, this ends up having little to do with LLMs and more to do with natural language itself and its ability to accurately describe or relate various kinds of perception.</p><h3>Let&#8217;s try to be rigorous</h3><p>Few people will contest the idea that you can more or less perfectly capture specific sights and sounds with an infinitely long description. At a certain point, you can just state the precise arrangement of the pixels, atoms, wavelengths, etc. Of course, you would never do this, but it serves to illustrate that the descriptive power of language at its limits just isn&#8217;t that surprising or interesting. But what does the behavior look like before the asymptote?</p><p>Foremost, what we&#8217;re really interested in here is a formal measure of &#8220;precision&#8221; or &#8220;intentionality.&#8221; How do we know how &#8220;good&#8221; a description is? Here&#8217;s one proposal for such a measure: suppose you have some bijective function between an input space (language) and an output space (sound or sight). Select a point in language space (your description) and use your function to map it to a point in the output space (a sound). Then, select all points in the output space within some small epsilon of the originally obtained point, and use your function to map them back into language space.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> What does the distribution or variance of these obtained points in the input space look like? 
Or, more concisely, how large is something like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Var}(\\{f^{-1}(f(a) + b), b \\le \\epsilon\\})&quot;,&quot;id&quot;:&quot;TQIDBEVTQJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>You&#8217;d probably want to include some normalizing terms to make this measure a bit more meaningful for comparison, but I think this gets the idea across. As indicated by <em>Figure 1</em>, if this measure is high, it might suggest that similar sounds can be captured with very different descriptions. Similarly, if this measure is high when computed over the inverse of <em>f</em>, it might suggest that very different sounds can be captured by very similar descriptions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ea4a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ea4a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 424w, https://substackcdn.com/image/fetch/$s_!Ea4a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 848w, https://substackcdn.com/image/fetch/$s_!Ea4a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ea4a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ea4a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png" width="521" height="380.7004538577912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:661,&quot;resizeWidth&quot;:521,&quot;bytes&quot;:49427,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ea4a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 424w, https://substackcdn.com/image/fetch/$s_!Ea4a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 848w, https://substackcdn.com/image/fetch/$s_!Ea4a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ea4a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfcddaa7-b983-47d8-ae3a-861eb5d47200_661x483.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 1:</em> Visual aid of the proposed bootleg metric for how ineffable your output space is</figcaption></figure></div><p>This measure being high in either direction is pretty sad in the context of generative AI. However, very different sounds being produced by similar descriptions would bode especially poorly. 
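</p><p>To make this concrete, here&#8217;s a toy numerical sketch of the proposed measure. Everything in it is hypothetical: a small invertible linear map stands in for a real description-to-sound model, and the epsilon-ball is sampled rather than enumerated.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a description -> sound model: a small invertible
# linear map f between a 2-D "language" space and a 2-D "sound" space.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
A_inv = np.linalg.inv(A)

def f(x):
    """Map a point in language space to a point in sound space."""
    return A @ x

def f_inv(y):
    """Map a point in sound space back to language space."""
    return A_inv @ y

def preimage_variance(a, eps=0.1, n=2000):
    """Estimate Var({f_inv(f(a) + b) : ||b|| <= eps}) by sampling the ball."""
    y = f(a)
    # Sample n perturbations b inside the eps-ball around f(a).
    d = rng.normal(size=(n, 2))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    b = d * (eps * rng.uniform(size=(n, 1)))
    back = (y + b) @ A_inv.T              # f_inv applied to each perturbed point
    return float(back.var(axis=0).sum())  # total variance over both coordinates

description = np.array([1.0, -0.5])
print(preimage_variance(description))
```

<p>Running the same routine with <em>f</em> and its inverse swapped estimates the other direction, where similar descriptions land on very different sounds. 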
In such an instance, the maximum likelihood estimation outcome would be to learn whatever mapping outputs the blandest, averaged-over, measure-of-center-y sounds imaginable. And, in doing so, such a mapping jettisons any creativity or intentionality (although it might be great at capturing &#8220;vibes&#8221;&#8230; and I don&#8217;t doubt that music-generating LLMs can do that). The average of the &#8220;all-human-content-ever&#8221; distribution is probably pretty uninteresting; all the great stuff is in the tails!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Computing such a metric in a meaningful way, e.g. using real models, would probably be pretty hard and expensive, but interesting. This is just a blog post, so, like any rigorous scientist, I&#8217;ll just go ahead and provide a photo of what I think the results might look like. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f6EX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f6EX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 424w, https://substackcdn.com/image/fetch/$s_!f6EX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 848w, https://substackcdn.com/image/fetch/$s_!f6EX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 1272w, https://substackcdn.com/image/fetch/$s_!f6EX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f6EX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png" width="524" height="373.05349794238685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:346,&quot;width&quot;:486,&quot;resizeWidth&quot;:524,&quot;bytes&quot;:25091,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f6EX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 424w, https://substackcdn.com/image/fetch/$s_!f6EX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 848w, https://substackcdn.com/image/fetch/$s_!f6EX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 1272w, https://substackcdn.com/image/fetch/$s_!f6EX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52259b08-2b5e-47c0-a772-7fc4bb272de2_486x346.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 2:</em> If you were to quantitatively measure the average &#8220;descriptive&#8221;-ness of a given length description of a sight/sound, I suspect you&#8217;d get something like this.</figcaption></figure></div><h3>Perception: Global and Local</h3><p><em>How</em> we perceive is core to all of this. It&#8217;s not like we&#8217;re just talking about unstructured pixels and sounds&#8212;we&#8217;re talking about music and art (I can only write so much about the variance of audio pre-images without feeling like a geeky philistine).</p><p>What does it mean for a neighborhood of songs or images to have high variance? What does it even mean for two songs to differ? 
Let&#8217;s ignore the math and only think about good old qualitative perception.</p><p>Fundamentally, auditory and visual art are pleasant because of their <em>structure.</em> There&#8217;s micro-structure: specific patterns of notes, or the way that short brushstrokes might blend together on a canvas. There&#8217;s also macro-structure: the &#8220;feeling&#8221; or &#8220;mood&#8221; of a song, or the composition or message of a painting.</p><p>I&#8217;d posit that capturing the micro-structure of <em>any</em> art form in natural language is pretty challenging at baseline. You&#8217;re probably not going to have DALL-E generate a painting brushstroke by brushstroke. And, if you wanted Suno to generate a song note-by-note, well, then you could just write the sheet music (which doesn&#8217;t require physical training in the way that executing desired brushstrokes might).</p><p>For either sight or sound, capturing macro-structure in natural language is considerably easier. When speaking with a generative model, you can say something like &#8220;generate a surrealist painting of a man wearing a blue suit with black pants, sitting on a partially burnt log in a redwood forest.&#8221; Similarly, in the case of generative music, you might say something like &#8220;generate a piano piece that starts with a gentle, happy rhythm and crescendos into a gloomy, muddled melody.&#8221; Either of these would work just fine. The micro-structure isn&#8217;t going away&#8212;it&#8217;s just left as an exercise to the model.</p><p>Unfortunately, I suspect that specific control of micro-structure is quite a bit more important for intentionality in sound than it is for sight. That perfect sequence of three notes, followed by a key change, can have an intense emotional effect on you. Can you say the same for three perfectly executed brushstrokes (not to undermine their impressiveness)? 
Perhaps things were different when painting was more a matter of showing off technical mastery, but people don&#8217;t quite receive visual art that way anymore.</p><p>Regardless, this is a hard thing to prove. However, it feels natural when you think about the process of listening to a song versus observing a painting. We listen to the song as a time series, sound by sound. We anchor each sound to its neighbors (the micro-structure)&#8212;there&#8217;s no concept of listening to a song &#8220;all at once.&#8221; Visual art, however, need not be enjoyed in series. Your eyes can dart from corner to corner, alternating between subtle details and broad themes. Music forces you to dwell on micro-structure; visual art does not.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>To illustrate this effect, I&#8217;ve created two rough &#8220;neighbor swapping&#8221; experiments. The procedure is simple enough: we decompose art into smaller units and probabilistically shuffle the units around. What happens as the units get bigger?</p><p>As <em>Figure 3</em> demonstrates, we can shuffle neighboring pixels without anything being lost (top right corner). Even shuffling slightly larger rectangles maintains the integrity of the image, despite the micro-structure being substantially altered or even erased. 
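</p><p>For the visual half, the shuffling procedure can be sketched in a few lines. This is a minimal reconstruction, not the exact code behind <em>Figure 3</em> (the function name and parameters are mine): the image is treated as a grayscale array, and each square block is swapped with its right-hand neighbor with some probability.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_neighbors(img, block=8, p=0.5):
    """Swap each square block with its right-hand neighbor with probability p.

    img is a 2-D grayscale array; block is the side length of the unit being
    shuffled. Small blocks scramble micro-structure, large blocks scramble
    macro-structure.
    """
    out = img.copy()
    h, w = img.shape
    for i in range(0, h - block + 1, block):
        for j in range(0, w - 2 * block + 1, 2 * block):
            if rng.random() < p:
                left = out[i:i + block, j:j + block].copy()
                out[i:i + block, j:j + block] = out[i:i + block, j + block:j + 2 * block]
                out[i:i + block, j + block:j + 2 * block] = left
    return out

# Toy demo on a synthetic gradient "image": pixel content is preserved,
# only its arrangement changes.
img = np.arange(64 * 64, dtype=float).reshape(64, 64)
fine = shuffle_neighbors(img, block=2)     # barely visible change
coarse = shuffle_neighbors(img, block=32)  # obvious structural change
```

<p>Shrinking the block size approximates the pixel-level shuffle; growing it approximates the macroscopic one. 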
It isn&#8217;t until we shuffle large rectangles containing macroscopic features like hands and ears that things start to actually look wonky.</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e0789b-c2c1-4cdd-9ff2-1a1a1a351f1f_640x954.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7be97b9e-76a5-4c5a-a9c4-3ba1c0439966_640x954.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de9df52d-7daf-4066-a596-6501d7796a60_640x954.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2de08b34-4d86-4105-8ea1-9e1c0bbc3f93_640x954.png&quot;}],&quot;caption&quot;:&quot;Figure 3: The Mona Lisa with increasing amounts of neighbor-shuffling. The intended image is pretty obvious for all shufflings except the last (and it probably still is, just due to this particular painting's notoriety).&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0568c93c-e2ba-4d28-8d7d-254fd898da09_1456x1456.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p>In the recording below, we perform the same exercise with Beethoven&#8217;s <em>F&#252;r Elise.</em> The first sample is the unaltered piece. The second sample probabilistically swaps neighboring notes, the third sample probabilistically swaps neighboring measures, and the fourth sample probabilistically swaps the first and second halves of the piece. 
Unlike the procedure applied to the Mona Lisa, the small micro-structure alterations have the largest effect on the listening experience and intentionality of the piece. Flipping the halves of the piece doesn&#8217;t really change the listening experience much, except for the discontinuity in the middle.</p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;3798ceb6-9e82-466a-a144-175bc531566b&quot;,&quot;duration&quot;:45.087345,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p>Maybe you disagree (especially if you&#8217;re a visual artist) and maintain that macro-structure and micro-structure are equally important for intentional creation of sight and sound. Even so, it&#8217;s much easier to iteratively refine visual micro-structure in natural language. You can very concisely drill into specific parts of the image and request changes, e.g. &#8220;in the upper left quadrant, rather than her hair flowing to the right, please make it flow to the left and be slightly more curly.&#8221; I&#8217;m not sure how to do the equivalent for music. Sure, you could request something like &#8220;change the triplet at the beginning of measure 32 to a C# quarter note,&#8221; but, if you know what those things mean, why aren&#8217;t you just using some score-editing software like MuseScore?</p><h3>Less about language, more about us</h3><p>You&#8217;ll notice that I always take care to clarify the limitations of <em>natural language</em>. I do this because there <em>is</em> a language that&#8217;s <em>great</em> for concisely describing sounds&#8212;the language of music theory.</p><p>But why might natural languages like English fall short when it comes to describing sounds versus visuals? I doubt that this is an innate feature of natural languages, but rather a consequence of human preferences. 
Humans are probably just more prone to describe a thing they&#8217;ve seen rather than a thing they&#8217;ve heard, and so language has evolved to work incredibly well for the former and so-so for the latter. Maybe we emotionally relate to sights more&#8212;or maybe we just encounter them more often (especially in the pre-industrial world, where hearing structured sound was a revered, reserved occasion). </p><p>Word frequencies in <em>Figure 4</em> seem to support this, with all cases of &#8220;see&#8221; or &#8220;look&#8221; being used more frequently than the corresponding cases of &#8220;hear&#8221; or &#8220;listen&#8221; in recent times (note that the less common &#8220;watch&#8221; is used less frequently than &#8220;hear,&#8221; although still more commonly than the analogous &#8220;listen&#8221;). Interestingly, you can observe that usage of &#8220;saw&#8221; and &#8220;heard&#8221; didn&#8217;t really diverge until the 1980s, perhaps indicating that the strong present preference for visual-ness is a recent and possibly technology-driven phenomenon (but this is very speculative).</p><p>The frequencies of various adjectives (<em>Figure 5</em>) are less clearly convincing, although one might still argue that &#8220;dark&#8221; and &#8220;bright&#8221; are used more frequently than the corresponding &#8220;quiet&#8221; and &#8220;loud&#8221; respectively (as well as &#8220;visual&#8221; being used far more often than &#8220;auditory&#8221;).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QtjV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!QtjV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 424w, https://substackcdn.com/image/fetch/$s_!QtjV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 848w, https://substackcdn.com/image/fetch/$s_!QtjV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!QtjV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QtjV!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png" width="1200" height="396.42857142857144" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:481,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:454348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!QtjV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 424w, https://substackcdn.com/image/fetch/$s_!QtjV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 848w, https://substackcdn.com/image/fetch/$s_!QtjV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!QtjV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e6834b9-533d-429a-955f-67cb6e66cd52_4260x1406.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 4:</em> The usage of various sight and sound-related verbs over time. <a href="https://books.google.com/ngrams/graph?content=Hear%2CSee%2CHeard%2CSaw%2CListen%2CListened%2CLook%2CLooked%2CWatch%2CWatched&amp;year_start=1820&amp;year_end=2019&amp;corpus=en-2019&amp;smoothing=1&amp;case_insensitive=true">Data here</a>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WTT1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WTT1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 424w, https://substackcdn.com/image/fetch/$s_!WTT1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 848w, https://substackcdn.com/image/fetch/$s_!WTT1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WTT1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WTT1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png" width="1200" height="375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:455,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:413866,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WTT1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 424w, https://substackcdn.com/image/fetch/$s_!WTT1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 848w, https://substackcdn.com/image/fetch/$s_!WTT1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WTT1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec9b1cc-a81c-45b5-9038-1a08a583bf06_4260x1332.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 5:</em> The usage of various sight and sound-related adjectives over time. 
<a href="https://books.google.com/ngrams/graph?content=Dark%2CQuiet%2CLoud%2CVisual%2CAuditory%2CBright&amp;year_start=1820&amp;year_end=2019&amp;case_insensitive=on&amp;corpus=en-2019&amp;smoothing=1">Data here</a>.</figcaption></figure></div><h3>Parting Thoughts</h3><p>Although pretty speculative, I hope this offered an interesting, if concise, glimpse into some of the questions raised by audio-generating AI. It&#8217;d be wonderful to see more rigorous research conducted in this direction; I suspect there are pretty interesting applications, especially in the context of low- and high-resource languages and/or translation. (How intentional can somebody be in language X if translating from language Y? What if we translate from Y to A to X?)</p><p>And finally, for the people making these music-generating apps: they&#8217;re fascinating and impressive, but maybe natural language isn&#8217;t the right medium here. What if, instead, we mapped sound to sound, creating music by humming melodies and accompanying them with supplementary natural language descriptions when necessary (&#8220;this should be played by a violin, make it gloomy&#8221;)? Now, that&#8217;d be pretty cool.</p><p><em>Thanks to Lucy for chatting with me about these topics for a few hours!</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>If you&#8217;re getting flashbacks to your undergraduate analysis class&#8230; yeah, me too.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Of course, model builders are very clever, and it&#8217;s not like they&#8217;re running a naive MLE and calling it a day. 
You can probably massage a lot of these problems out with clever objective function design and RLHF.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This reminds me of the alignment versus creativity debate. The most &#8220;aligned&#8221; model is probably one that does the measure-of-center-y output, but the most &#8220;creative&#8221; model is the one that does the opposite. How might one balance this? Who knows!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Except for, you know, movies.</p></div></div>]]></content:encoded></item><item><title><![CDATA[The Least Human Humans]]></title><description><![CDATA[A mundane yet worrying variant of AI doomerism]]></description><link>https://www.isomorphic.group/p/the-least-human-humans</link><guid isPermaLink="false">https://www.isomorphic.group/p/the-least-human-humans</guid><dc:creator><![CDATA[Dan DiPietro]]></dc:creator><pubDate>Fri, 21 Jul 2023 02:28:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-drq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7c1b0c9-be1c-4fa2-b0e6-4281b80b8da7_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>A Worrying Proposition</h2><p>AI doomerism is <a href="https://www.nytimes.com/2023/05/01/technology/ai-google-chatbot-engineer-quits-hinton.html">all</a> <a href="https://www.vox.com/the-highlight/23621198/artificial-intelligence-chatgpt-openai-existential-risk-china-ai-safety-technology">the</a> <a 
href="https://www.theguardian.com/technology/2023/may/24/openai-leaders-call-regulation-prevent-ai-destroying-humanity">rage</a> these days.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Automated authoritarian systems of oppression, self-replicating killer intelligences, and widespread economic collapse seem to be top of mind in particular. Beloved researchers like Geoffrey Hinton now argue that artificial neural networks&#8212;their life&#8217;s work&#8212;could mean the end of humanity unless their risks are carefully and properly managed.</p><p>These conversations are worth having. But I worry that rampant sensationalism might be detracting from a more mundane, more likely, and, in many ways, more worrying reality. <em>What it means to be human</em> is in a state of flux.</p><p>With high confidence, <em>every</em> person born from <em>today</em> <em>onwards</em> will spend their <em>entire life</em> dumber<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> than a large language model (LLM). At no point in their life will they outperform a state-of-the-art LLM in any cognitive task. As they grow up and learn, so too will the models.</p><p>Generously, humans have been around for no more than 300,000 years. We&#8217;re a young species. Assuming we stick around for as long as species usually do, most humans haven&#8217;t actually been born yet. By extension, most humans that will ever live will spend their entire lives dumber than an LLM.</p><p>What effects does this have on society? On a real-world, individual level, what effects does it have on the human experience?</p><h2>Featherless Bipeds that Play Chess or Something</h2><p>I&#8217;ll put on my optimist hat for a moment.</p><p>What it means to be human has never been static. 
Many millennia ago, perhaps the most human humans were the best subsistence hunters. Some time in the past, chess playing was a defining characteristic&#8212;machines got good at it, and we decided to move to tool use. We realized chimpanzees and corvids were good at that, and now we proudly proclaim that sophisticated language and &#8220;cognition&#8221; are our human bastion.</p><p>It&#8217;s always been a moving goalpost, affected by new areas where we&#8217;ve begun to uniquely excel as a species and old abilities that we&#8217;ve realized aren&#8217;t so special anymore. Although &#8220;cognition&#8221; <em>feels</em> uniquely human, I suspect we&#8217;re suffering from recency bias. Maybe we&#8217;ll run out of special things to cling to eventually, but, for the time being, I propose the following infinitely viable goalpost:</p><blockquote><p>The defining human trait is constructing cognitive agents. Should those cognitive agents construct cognitive agents of their own, the defining human trait is constructing cognitive agents that construct cognitive agents of their own. Repeat ad infinitum.</p></blockquote><p>Hooray! We&#8217;re special, forever. Now I can sleep at night. Back to my pessimist hat.</p><h2>Caution! Local Minima.</h2><p>I&#8217;m not intrinsically worried about the shifting goalpost of what it means to be human. Most members of our species aren&#8217;t armchair philosophers, and I suspect such &#8220;human-ness&#8221; revelations will not have material implications at a societal level.</p><p>That said, I <em>am</em> worried. Humans generally don&#8217;t like doing things that they&#8217;re bad at, or that machines do better than them. 
Just like many young students wonder &#8220;why do I need to learn math if the calculator can do it for me,&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> I worry that new generations of students will now wonder &#8220;why do I need to learn [insert literally any cognitive task] if the LLM can do it for me?&#8221;</p><p>Of course, these are ridiculous statements to make. Calculators are better at rote arithmetic than humans, but you need to understand rote arithmetic before you can understand all the beautiful higher-level math that humans <em>are</em> better than calculators at. Nonetheless, it&#8217;s a local minimum that catches many students early on and makes them discontinue their math education, unable to see the point.</p><p>LLMs spawn a massive minefield of local minima, threatening to demotivate the next generation of students in a drastic, incomprehensible way. I find the LLM local minima uniquely large in both breadth and depth.</p><p>How many thinkers, writers, programmers, and mathematicians will we lose?</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Granted, the conversation hasn&#8217;t been totally one-sided. Plenty have rushed to silence the naysayers, arguing that important technological developments often engender complicated, worrying feelings but uniformly end up being <a href="https://a16z.com/2023/06/06/ai-will-save-the-world/">hugely beneficial</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>By any standard measure of intelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I don&#8217;t want to neglect the fact that many people also discontinue their math education due to the extremely unpalatable ways in which the material is generally communicated.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Prime Idea Factorization]]></title><description><![CDATA[I&#8217;ve never had a truly original idea.]]></description><link>https://www.isomorphic.group/p/prime-idea-factorization</link><guid isPermaLink="false">https://www.isomorphic.group/p/prime-idea-factorization</guid><dc:creator><![CDATA[Dan DiPietro]]></dc:creator><pubDate>Sun, 23 Apr 2023 
23:23:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-drq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7c1b0c9-be1c-4fa2-b0e6-4281b80b8da7_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve never had a truly original idea.</p><p>I had that realization around a month ago. It irked me at first. After some thought, it&#8217;s brought me a lot of peace, focus, and optimism. I&#8217;d like to share why.</p><p>Most people have never had original, novel ideas. This applies to people that pride themselves on having great ideas. It also applies to people that <em>history</em> prides on having great ideas. And let me be clear: I&#8217;m not making a semantic or hand-wavey epistemic argument. Take any reasonable interpretation of the word idea, and I&#8217;ll tell you that people generally just don&#8217;t have novel ones.</p><p>Perhaps ironically, I&#8217;m not the first person to say this. 
In a 1903 letter to Helen Keller, Mark Twain remarked:</p><blockquote><p>The kernel, the soul &#8212; let us go further and say the substance, the bulk, the actual and valuable material of all human utterances &#8212; is plagiarism.</p></blockquote><p>Twain&#8217;s thoughts are well-explored in the literary and artistic domain. Take any well-known novel or movie from the 21st century, and you can probably reduce it to something done five hundred (or more) years ago; repeat ad nauseam until you&#8217;ve exhausted the human historical record. As it turns out, there are fundamental facets of the human experience that have remained unchanged over time and permeate our artistic and literary expressions. We keep writing the same stories&#8211;painting the same feats of heroism and depths of despair&#8211;over and over again. I suspect that&#8217;s what Twain was getting at. It&#8217;s beautiful, if not a bit cynical. But it&#8217;s not <em>really</em> what I&#8217;m getting at.</p><p>I don&#8217;t mean books, movies, or paintings. I mean &#8220;invention.&#8221; I mean cutting-edge research. The latest articles published in <em>Nature</em>, or the startup that just raised $100m.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Rather than trying to pick apart specific instances, I think it&#8217;d be more productive to explain the framework I&#8217;ve begun using to think about ideas.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Are you familiar with the story of <em>Frankenstein</em>? Here&#8217;s a one-sentence summary:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><blockquote><p>Frankenstein is a novel about a scientist [Dr. 
Frankenstein] who creates a monster out of body parts and brings it to life, but the creature becomes violent and wreaks havoc, leading the scientist to regret his actions.</p><p>People overwhelmingly think a bit like Dr. Frankenstein (if you ignore the grave robbing and wreaking havoc parts). They chop different parts off of existing ideas&#8211;maybe articles they&#8217;ve read or companies they&#8217;ve heard of&#8211;and combine them together. Maybe the resulting combination is novel, but none of the parts are. You can often easily decompose these ideas into their parts.</p><p>I&#8217;m not knocking Frankenstein ideas. All of my ideas are Frankenstein ideas. Cars are a Frankenstein idea. So were telephones. Pretty much <em>every</em> notable machine learning paper of the past 20 years has been a Frankenstein idea or remarkably close to it.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a><sup> </sup>Sometimes Frankenstein ideas consist of parts taken from the same technical domain. Often, great Frankenstein ideas take parts from other, seemingly unrelated areas of invention and research literature.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>All this talk about parts and decompositions might give the mathematically astute a bit of d&#233;j&#224; vu. It feels like we&#8217;re talking about numbers. Take any integer greater than one, and you can <em>uniquely</em> represent it as a product of prime numbers, up to the order of the factors (often called its prime factorization or prime decomposition). This is the Fundamental Theorem of Arithmetic, and it has surprisingly applicable intuition for thinking about ideas.</p><p>Let&#8217;s throw any mathematical formalisms out the window. Ideas are numbers now. 
Instead of &#8220;1, 2, 3,&#8221; we count &#8220;wheel, penicillin, indoor plumbing.&#8221; Convinced?</p><p>There are infinitely many prime (novel) ideas, but they get less common as you stray from zero. The overall proportion of composite (Frankenstein) ideas increases. Composite ideas are all unique in their construction and can be decomposed into a product of prime ideas. Try it&#8211;take some of your best ideas and see if you can perform a prime idea factorization. Maybe <code>car = wheel * engine * horse buggy</code>. You could probably represent some forms of neural networks as <code>(logistic regression)^n</code> for large <code>n</code> (and you can decompose logistic regression even further). I think performing these decompositions (in a serious, thoughtful manner&#8211;not like the above) can actually be a useful exercise for getting at the crux of what something really is.</p><p>Now, we&#8217;re back in the real world. Numbers are back to being numbers&#8211;ah yes, 1, 2, 3. However, I hope the prime idea factorization intuition remains.</p><p>If anything, this should all serve to make highly motivated, perpetually stressed out individuals breathe a sigh of relief. You don&#8217;t have to be good at coming up with novel things&#8211;pretty much nobody is! If that&#8217;s what you&#8217;re optimizing for, you&#8217;re probably wasting your time. Instead, focus on learning and deeply understanding what you learn. Be curious&#8211;build hypothetical Frankenstein ideas in your head whenever you can. Look to successful, innovative ideas and understand where they drew inspiration from and how all of the parts came together. 
To me, that sounds a whole lot more achievable than hoping you&#8217;ll magically have some utterly novel idea that nobody else has had.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Maybe these things are solving the same <em>fundamental human problems</em> (getting from point A to B, feeling connected to others, etc.), but that falls back into the semantic argument of what an idea even is. I don&#8217;t want to do that. No, the ideas themselves aren&#8217;t novel.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I&#8217;m probably too much of a theorist (and too lazy) to pick apart specific examples anyway.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Courtesy of ChatGPT :)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In fact, running into so many machine learning research papers that just seemed to be lopping pieces of existing techniques together (and I&#8217;m guilty of this too) is what initially inspired these thoughts.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is part of the reason why anybody that wants to create something meaningful should have a variety of interests outside of their work or primary area of 
thought. They should read voraciously. Complete specialization isn&#8217;t always a good thing, and I&#8217;ll likely explore this in future content.</p></div></div>]]></content:encoded></item></channel></rss>