{
  "version": "https://jsonfeed.org/version/1",
  "title": "Ian's Digital Garden",
  "home_page_url": "https://ianwwagner.com/",
  "feed_url": "https://ianwwagner.com//archive-2024.json",
  "description": "",
  "items": [
    {
      "id": "https://ianwwagner.com//conserving-memory-while-streaming-from-duckdb.html",
      "url": "https://ianwwagner.com//conserving-memory-while-streaming-from-duckdb.html",
      "title": "Conserving Memory while Streaming from DuckDB",
      "content_html": "<p>In the weeks since my previous post on <a href=\"working-with-arrow-and-duckdb-in-rust.html\">Working with Arrow and DuckDB in Rust</a>,\nI've found a few gripes that I'd like to address.</p>\n<h1><a href=\"#memory-usage-of-query_arrow-and-stream_arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"memory-usage-of-query_arrow-and-stream_arrow\"></a>Memory usage of <code>query_arrow</code> and <code>stream_arrow</code></h1>\n<p>In the previous post, I used the <code>query_arrow</code> API.\nIt's pretty straightforward and gives you iterator-compatible access to the query results.\nHowever, there's one small problem: its memory consumption scales roughly linearly with your result set.</p>\n<p>This isn't a problem for many uses of DuckDB, but if your datasets are in the tens or hundreds of gigabytes\nand you're wanting to process a large number of rows, the RAM requirements can be excessive.\nThe memory profile of <code>query_arrow</code> seems to be &quot;create all of the <code>RecordBatch</code>es upfront\nand keep them around for as long as you hold the <code>Arrow</code> handle.</p>\n<div class=\"markdown-alert markdown-alert-note\">\n<p class=\"markdown-alert-title\">Disclaimer</p>\n<p>I have <strong>not</strong> done extensive allocation-level memory profiling as of this writing.\nIt's quite possible that I've missed something, but this seems to be what's happening\nfrom watching Activity Monitor.\nPlease let me know if I've misrepresented anything!</p>\n</div>\n<p>Fortunately, DuckDB also has another API: <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.stream_arrow\"><code>stream_arrow</code></a>.\nThis appears to allocate <code>RecordBatch</code>es on demand rather than all at once.\nThere is also some overhead, which I'll revisit later that varies with result size.\nBut overall, profiling indicates that <code>stream_arrow</code> requires significantly less RAM over the life of a large <code>Arrow</code> 
iterator.</p>\n<p>Unfortunately, none of the above information about memory consumption appears to be documented,\nand there are no (serious) code samples demonstrating the use of <code>stream_arrow</code>!</p>\n<blockquote>\n<p>[!question] Down the rabbit hole...\nDigging into the code in duckdb-rs raises even more questions,\nsince several underlying C functions, like <a href=\"https://duckdb.org/docs/api/c/api.html\"><code>duckdb_execute_prepared_streaming</code></a>\nare marked as deprecated.\nPresumably, alternatives are being developed or the methods are just not stable yet.</p>\n</blockquote>\n<h1><a href=\"#getting-a-schemaref\" aria-hidden=\"true\" class=\"anchor\" id=\"getting-a-schemaref\"></a>Getting a <code>SchemaRef</code></h1>\n<p>The signature of <code>stream_arrow</code> is a bit different from that of <code>query_arrow</code>.\nHere's what it looks like as of crate version 1.1.1:</p>\n<pre><code class=\"language-rust\">pub fn stream_arrow&lt;P: Params&gt;(\n    &amp;mut self,\n    params: P,\n    schema: SchemaRef,\n) -&gt; Result&lt;ArrowStream&lt;'_&gt;&gt;\n</code></pre>\n<p>This looks pretty familiar at first if you've used <code>query_arrow</code>,\nbut there's a new third parameter: <code>schema</code>.\n<code>SchemaRef</code> is just a type alias for <code>Arc&lt;Schema&gt;</code>.\nArrow objects have a schema associated with them,\nso this is a reasonable detail for a low-level API.\nBut DuckDB is fine at inferring this when needed!\nSurely there is a way of getting it from a query, right?\n(After all, <code>query_arrow</code> has to do something similar, but doesn't burden the caller.)</p>\n<p>My first attempt at getting a <code>Schema</code> object was to call the <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.schema\"><code>schema()</code></a> method on <code>Statement</code>.\nThe <code>Statement</code> type in duckdb-rs is actually a high-level wrapper around <code>RawStatement</code>,\nand at the time of 
this writing, the schema getter <a href=\"https://github.com/duckdb/duckdb-rs/blob/2bd811e7b1b7398c4f461de4de263e629572dc90/crates/duckdb/src/raw_statement.rs#L212\">hides an <code>unwrap</code></a>.\nThe docs do tell you this (using a somewhat nonstandard heading?),\nbut basically you can't get a schema without executing a query.\nI wish they used the <a href=\"https://cliffle.com/blog/rust-typestate/\">Typestate pattern</a>\nor at least made the result an <code>Option</code>, but alas...</p>\n<p>This leaves developers with three options.</p>\n<ol>\n<li>Construct the schema manually.</li>\n<li>Construct a different <code>Statement</code> with the same SQL, but with a <code>LIMIT 0</code> clause appended.</li>\n<li>Execute the statement, but don't load all the results into RAM.</li>\n</ol>\n<h2><a href=\"#manually-construct-a-schema\" aria-hidden=\"true\" class=\"anchor\" id=\"manually-construct-a-schema\"></a>Manually construct a Schema?</h2>\n<p>Manually constructing the schema is a non-starter for me.\nA program which has a hand-written code dependency on a SQL string is a terrible idea\non several levels.\nBesides, DuckDB clearly <em>can</em> infer the schema in <code>query_arrow</code>, so why not here?</p>\n<h2><a href=\"#query-another-nearly-identical-statement\" aria-hidden=\"true\" class=\"anchor\" id=\"query-another-nearly-identical-statement\"></a>Query another, nearly identical statement</h2>\n<p>The second idea is, amusingly, what ChatGPT o1 suggested (after half a dozen prompts;\nit seems like it will just confidently refuse to fetch documentation now,\nand hallucinates new APIs based on its outdated training data).\nThe basic idea is to add <code>LIMIT 0</code> to the end of the original query\nso it's able to get the schema, but doesn't actually return any results.</p>\n<pre><code class=\"language-rust\">fn fetch_schema_for_query(db: &amp;Connection, sql: &amp;str) -&gt; duckdb::Result&lt;SchemaRef&gt; {\n    // Append &quot;LIMIT 0&quot; to 
the original query, so we don't actually fetch anything\n    // NB: This does NOT handle cases such as the original query ending in a semicolon!\n    let schema_sql = format!(&quot;{} LIMIT 0&quot;, sql);\n\n    let mut statement = db.prepare(&amp;schema_sql)?;\n    let arrow_result = statement.query_arrow([])?;\n\n    Ok(arrow_result.get_schema())\n}\n</code></pre>\n<p>There is nothing fundamentally unsound about this approach.\nBut it requires string manipulation, which is less than ideal.\nThere is also at least one obvious edge case.</p>\n<h2><a href=\"#execute-the-stamement-without-loading-all-results-first\" aria-hidden=\"true\" class=\"anchor\" id=\"execute-the-stamement-without-loading-all-results-first\"></a>Execute the statement without loading all results first</h2>\n<p>The third option is not as straightforward as I expected it to be.\nAt first, I tried the <code>row_count</code> method,\nbut internally this <a href=\"https://github.com/duckdb/duckdb-rs/blob/2bd811e7b1b7398c4f461de4de263e629572dc90/crates/duckdb/src/raw_statement.rs#L79\">just calls a single FFI function</a>.\nThis doesn't actually update the internal <code>schema</code> field.\nYou really <em>do</em> need to run through a more &quot;normal&quot; execution path.</p>\n<p>A solution that <em>seems</em> reasonably clean is to do what the docs say and call <code>stmt.execute()</code>.\nIt's a bit strange to do this on a <code>SELECT</code> query to be honest,\nbut the API does indeed internally mutate the <code>Schema</code> property,\n<em>and</em> returns a row count.\nSo it seems semantically equivalent to a <code>SELECT COUNT(*) FROM (...)</code>\n(and in my case, getting the row count was helpful too).</p>\n<p>In my testing, it <em>appears</em> that this may actually allocate a non-trivial amount of memory,\nwhich may be mildly surprising.\nHowever, the peak memory required during execution is definitely lower overall.\nAny ideas why this is?</p>\n<h1><a 
href=\"#full-example-using-stream_arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"full-example-using-stream_arrow\"></a>Full example using <code>stream_arrow</code></h1>\n<p>Let's bring what we've learned into a &quot;real&quot; example.</p>\n<pre><code class=\"language-rust\">// let sql = &quot;SELECT * FROM table;&quot;;\nlet mut stmt = conn.prepare(sql)?;\n// Execute the query (so we have a usable schema)\nlet size = stmt.execute([])?;\n// Now we run the &quot;real&quot; query using `stream_arrow`.\n// This returned in a few hundred milliseconds for my dataset.\nlet mut arrow = stmt.stream_arrow([], stmt.schema())?;\n// Iterate over arrow...\n</code></pre>\n<p>When you structure your code like this rather than using the easier <code>query_arrow</code>,\nyou can significantly reduce your memory footprint for large datasets.\nIn my testing, there was no appreciable impact on performance.</p>\n<h1><a href=\"#open-questions\" aria-hidden=\"true\" class=\"anchor\" id=\"open-questions\"></a>Open Questions</h1>\n<p>The above leaves me with a few open questions.\nFirst, with my use case (a dataset of around 12GB of Parquet files), <code>execute</code> took several <em>seconds</em>.\nThe &quot;real&quot; <code>stream_arrow</code> query took a few hundred milliseconds.\nWhat's going on here?\nPerhaps it's doing a scan and/or caching some data initially the way to make subsequent queries faster?</p>\n<p>Additionally, the memory profile does have a &quot;spike&quot; which makes me wonder what exactly each step loads into RAM,\nand thus, the memory requirements for working with extremely large datasets.\nIn my testing, adding a <code>WHERE</code> clause that significantly reduces the result set\nDOES reduce the memory footprint.\nThat's somewhat worrying to me, since it implies there is still measurable overhead\nproportional to the dataset size.\nWhat practical limits does this impose on dataset size?</p>\n<div class=\"markdown-alert markdown-alert-note\">\n<p 
class=\"markdown-alert-title\">Note</p>\n<p>An astute reader may be asking whether the memory profile of the <code>LIMIT 0</code> and <code>execute</code> approaches are equivalent.\nThe answer appears to be yes.</p>\n</div>\n<p>I've <a href=\"https://github.com/duckdb/duckdb-rs/issues/418\">opened issue #418</a>\nasking for clarification.\nIf any readers have any insights, post them in the issue thread!</p>\n",
      "summary": "",
      "date_published": "2024-12-31T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "rust",
        "apache arrow",
        "parquet",
        "duckdb",
        "big data",
        "data engineering"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//how-and-why-to-work-with-arrow-and-duckdb-in-rust.html",
      "url": "https://ianwwagner.com//how-and-why-to-work-with-arrow-and-duckdb-in-rust.html",
      "title": "How (and why) to work with Arrow and DuckDB in Rust",
      "content_html": "<p>My day job involves wrangling a lot of data very fast.\nI've heard a lot of people raving about several technologies like DuckDB,\n(Geo)Parquet, and Apache Arrow recently.\nBut despite being an &quot;early adopter,&quot;\nit took me quite a while to figure out how and why to leverage these practiclaly.</p>\n<p>Last week, a few things &quot;clicked&quot; for me, so I'd like to share what I learned in case it helps you.</p>\n<h1><a href=\"#geoparquet\" aria-hidden=\"true\" class=\"anchor\" id=\"geoparquet\"></a>(Geo)Parquet</h1>\n<p>(Geo)Parquet is quite possibly the best understood tech in the mix.\nIt is not exactly new.\nParquet has been around for quite a while in the big data ecosystem.\nIf you need a refresher, the <a href=\"https://guide.cloudnativegeo.org/geoparquet/\">Cloud-optimized Geospatial Formats Guide</a>\ngives a great high-level overview.</p>\n<p>Here are the stand-out features:</p>\n<ul>\n<li>It has a schema and some data types, unlike CSV (you can even have maps and lists!).</li>\n<li>On disk, values are written in groups per <em>column</em>, rather than writing one row at a time.\nThis makes the data much easier to compress, and lets readers easily skip over data they don't need.</li>\n<li>Statistics at several levels which enable &quot;predicate pushdown.&quot; Even though the files are columnar in nature,\nyou can narrow which files and &quot;row groups&quot; within each file have the data you need!</li>\n</ul>\n<p>Practically speaking, parquet lets you can distribute large datasets in <em>one or more</em> files\nwhich will be significantly <em>smaller and faster to query</em> than other familiar formats.</p>\n<h2><a href=\"#why-you-should-care\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care\"></a>Why you should care</h2>\n<p>The value proposition is clear for big data processing.\nIf you're trying to get a record of all traffic accidents in California,\nor find the hottest restaurants in Paris based 
on a multi-terabyte dataset,\nparquet provides clear advantages.\nYou can skip row groups within the parquet file or even whole files\nto narrow your search!\nAnd since datasets can be split across files,\nyou can keep adding to the dataset over time, parallelize queries,\nand do other nice things.</p>\n<p>But what if you're not doing these high-level analytical things?\nWhy not just use a more straightforward format like CSV\nthat avoids the need to &quot;rotate&quot; back into rows\nfor non-aggregation use cases?\nHere are a few reasons to like Parquet:</p>\n<ul>\n<li>You actually have a schema! This means less format shifting and validation in your code.</li>\n<li>Operating on row groups turns out to be pretty efficient, even when you're reading the whole dataset.\nBy combining batch reads with compression, your processing code will usually get faster.</li>\n<li>It's designed to be readable from object storage.\nThis means you can often process massive datasets from your laptop.\nParquet readers are smart and can skip over data you don't need.\nYou can't do this with CSV.</li>\n</ul>\n<p>The upshot of all this is that it generally gets both <em>easier</em> and <em>faster</em>\nto work with your data...\nprovided that you have the right tools to leverage it.</p>\n<h1><a href=\"#duckdb\" aria-hidden=\"true\" class=\"anchor\" id=\"duckdb\"></a>DuckDB</h1>\n<p>DuckDB describes itself as an in-process, portable, feature-rich, and fast database\nfor analytical workloads.\nDuckDB was the tool that triggered my &quot;lightbulb moment&quot; last week.\nFoursquare, an app which I've used for a decade or more,\nrecently released an <a href=\"https://location.foursquare.com/resources/blog/products/foursquare-open-source-places-a-new-foundational-dataset-for-the-geospatial-community/\">open data set</a>,\nwhich was pretty cool!\nIt was also in Parquet format (just like <a href=\"https://overturemaps.org/\">Overture</a>'s data sets).</p>\n<p>You can't just open up a Parquet file 
in a text editor or spreadsheet software like you can a CSV.\nMy friend Oliver released a <a href=\"https://wipfli.github.io/foursquare-os-places-pmtiles/\">web-based demo</a>\na few weeks ago which lets you inspect the data on a map at the point level.\nBut to do more than spot checking, you'll probably want a database that can work with Parquet.\nAnd that's where DuckDB comes in.</p>\n<h2><a href=\"#why-you-should-care-1\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care-1\"></a>Why you should care</h2>\n<h3><a href=\"#its-embedded\" aria-hidden=\"true\" class=\"anchor\" id=\"its-embedded\"></a>It's embedded</h3>\n<p>I understood the in-process part of DuckDB's value proposition right away.\nIt's similar to SQLite, where you don't have to go through a server\nor over an HTTP connection.\nThis is both simpler to reason about and <a href=\"quadrupling-the-performance-of-a-data-pipeline.html\">usually quite a bit faster</a>\nthan having to call out to a separate service!</p>\n<p>DuckDB is pretty quick to compile from source.\nYou probably don't need to muck around with this if you're just using the CLI,\nbut I wanted to eventually use it embedded in some Rust code.\nCompiling from source turned out to be the easiest way to get their crate working.\nIt looks for a shared library by default, but I couldn't get this working after a <code>brew</code> install.\nThis was mildly annoying, but on the other hand,\nvendoring the library does make consistent Docker builds easier 🤷🏻‍♂️</p>\n<h3><a href=\"#features-galore\" aria-hidden=\"true\" class=\"anchor\" id=\"features-galore\"></a>Features galore!</h3>\n<p>DuckDB includes a mind-boggling number of features.\nNot in a confusing way; more in a Python stdlib way where just about everything you'd want is already there.\nYou can query a whole directory (or bucket) of CSV files,\na Postgres database, SQLite, or even an OpenStreetMap PBF file 🤯\nYou can even write a SQL query against a glob expression of Parquet 
files in S3\nas your &quot;table.&quot;\n<strong>That's really cool!</strong>\n(If you've been around the space, you may recognize this concept from\nAWS Athena and others.)</p>\n<h3><a href=\"#speed\" aria-hidden=\"true\" class=\"anchor\" id=\"speed\"></a>Speed</h3>\n<p>Writing a query against a local directory of files is actually really fast!\nIt does a bit of munging upfront, and yes,\nit's not quite as fast as if you'd prepped the data into a clean table,\nbut you actually can run quite efficient queries this way locally!</p>\n<p>When running a query against local data,\nDuckDB will make liberal use of your system memory\n(the default is 80% of system RAM)\nand as many CPUs as you can throw at it.\nBut it will reward you with excellent response times,\ncourtesy of the &quot;vectorized&quot; query engine.\nWhat I've heard of the design reminds me of how array-oriented programming languages like APL\n(or less esoteric libraries like numpy) are often implemented.</p>\n<p>I was able to do some spatial aggregation operations\n(bucketing a filtered list of locations by H3 index)\nin about <strong>10 seconds on a dataset of more than 40 million rows</strong>!\n(The full dataset is over 100 million rows, so I also got to see the selective reading in action.)\nThat piqued my interest, to say the least.\n(Here's the result of that query, visualized).</p>\n<p><figure><img src=\"media/foursquare-os-places-density-2024.png\" alt=\"A map of the world showing heavy density in the US, southern Canada, central Mexico, parts of coastal South America, Europe, Korea, Japan, parts of SE Asia, and Australia\" /></figure></p>\n<h3><a href=\"#that-analytical-thing\" aria-hidden=\"true\" class=\"anchor\" id=\"that-analytical-thing\"></a>That analytical thing...</h3>\n<p>And now for the final buzzword in DuckDB's marketing: analytical.\nDuckDB frequently describes itself as optimized for OLAP (OnLine Analytical Processing) workloads.\nThis is contrasted with OLTP (OnLine Transaction 
Processing).\n<a href=\"https://en.wikipedia.org/wiki/Online_analytical_processing\">Wikipedia</a> will tell you some differences\nin a lot of sweepingly broad terms, like being used for &quot;business reporting&quot; and read operations\nrather than &quot;transactions.&quot;</p>\n<p>When reaching for a definition, many sources focus on things like <em>aggregation</em> queries\nas a differentiator.\nThis didn't help, since most of my use cases involve slurping most or all of the data set.\nThe DuckDB marketing and docs didn't help clarify things either.</p>\n<p>Let me know on Mastodon if you have a better explanatation of what an &quot;analytical&quot; database is 🤣</p>\n<p>I think a better explanation is probably 1) you do mostly <em>read</em> queries,\nand 2) it can execute highly parallel queries.\nSo far, DuckDB has been excellent for both the &quot;aggregate&quot; and the &quot;iterative&quot; use case.\nI assume it's just not the best choice per se if your workload is a lot of single-record writes?</p>\n<h2><a href=\"#how-im-using-duckdb\" aria-hidden=\"true\" class=\"anchor\" id=\"how-im-using-duckdb\"></a>How I'm using DuckDB</h2>\n<p>Embedding DuckDB in a Rust project allowed me to deliver something with a better end-user experience,\nis easier to maintain,\nand saved writing hundreds of lines of code in the process.</p>\n<p>Most general-purpose languages like Python and Rust\ndon't have primitives for expressing things like joins across datasets.\nDuckDB, like most database systems, does!\nYes, I <em>could</em> write some code using the <code>parquet</code> crate\nthat would filter across a nested directory tree of 5,000 files.\nBut DuckDB does that out of the box!</p>\n<p>It feels like this is a &quot;regex moment&quot; for data processing.\nJust like you don't (usually) need to hand-roll string processing,\nthere's now little reason to hand-roll data aggregation.</p>\n<p>For the above visualization, I used the Rust DuckDB crate for the data 
processing,\nconverted the results to JSON,\nand served it up from an Axum web server.\nAll in a <em>single binary</em>!\nThat's a lot nicer than a bash script that executes SQL,\ndumps to a file, and then starts up a Python or Node web server!\nA setup like that breaks when you don't have Python or Node installed,\nyour OS changes its default shell,\nyou forget that some awk flag doesn't work on the GNU version,\nand so on.</p>\n<h1><a href=\"#apache-arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"apache-arrow\"></a>Apache Arrow</h1>\n<p>The final thing I want to touch on is <a href=\"https://arrow.apache.org/\">Apache Arrow</a>.\nThis is yet another incredibly useful technology which I've been following for a while,\nbut never quite figured out how to properly use until last week.</p>\n<p>Arrow is a <em>language-independent memory format</em>\nthat's <em>optimized for efficient analytic operations</em> on modern CPUs and GPUs.\nThe core idea is that, rather than having to convert data from one format to another (this implies copying!),\nArrow defines a shared memory format which many systems understand.\nIn practice, this ends up being a bunch of standards which define common representations for different types,\nand libraries for working with them.\nFor example, the <a href=\"https://geoarrow.org/\">GeoArrow</a> spec\nbuilds on the Arrow ecosystem to enable operations on spatial data in a common memory format.\nPretty cool!</p>\n<h2><a href=\"#why-you-should-care-2\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care-2\"></a>Why you should care</h2>\n<p>It turns out that copying and format shifting data can really eat into your processing times.\nArrow helps you sidestep that by reducing the amount of both you'll need to do,\nand by working on data in groups.</p>\n<h2><a href=\"#how-the-heck-to-use-it\" aria-hidden=\"true\" class=\"anchor\" id=\"how-the-heck-to-use-it\"></a>How the heck to use it?</h2>\n<p>Arrow is mostly hidden from view beneath other 
libraries.\nSo most of the time, especially if you're writing in a very high level language like Python,\nyou won't even see it.</p>\n<p>But if you're writing something at a slightly lower level,\nit's something you may have to touch for critical sections.\nThe <a href=\"https://docs.rs/duckdb/latest/duckdb/\">DuckDB crate</a>\nincludes an <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.query_arrow\">Arrow API</a>\nwhich will give you an iterator over <code>RecordBatch</code>es.\nThis is pretty convenient, since you can use DuckDB to gather all your data\nand just consume the stream of batches!</p>\n<p>So, how do we work with <code>RecordBatch</code>es?\nThe Arrow ecosystem, like Parquet, takes a lot of work to understand,\nand using the low-level libraries directly is difficult.\nEven as a seasoned Rustacean, I found the docs rather obtuse.</p>\n<p>After some searching, I finally found <a href=\"https://docs.rs/serde_arrow/\"><code>serde_arrow</code></a>.\nIt builds on the <code>serde</code> ecosystem with easy-to-use methods that operate on <code>RecordBatch</code>es.\nFinally, something I can use!</p>\n<p>I was initially worried about how performant the shift from columns to rows + any (minimal) <code>serde</code> overhead would be,\nbut this turned out not to be an issue.</p>\n<p>Here's how the code looks:</p>\n<pre><code class=\"language-rust\">serde_arrow::from_record_batch::&lt;Vec&lt;FoursquarePlaceRecord&gt;&gt;(&amp;batch)\n</code></pre>\n<p>A few combinators later and you've got a proper data pipeline!</p>\n<h1><a href=\"#review-what-this-enables\" aria-hidden=\"true\" class=\"anchor\" id=\"review-what-this-enables\"></a>Review: what this enables</h1>\n<p>What this ultimately enabled for me was being able to get a lot closer to &quot;scripting&quot;\na pipeline in Rust.\nMost people turn to Python or JavaScript for tasks like this,\nbut Rust has something to add: strong typing and all the related guarantees <em>which can only 
come with some level of formalism</em>.\nBut that doesn't necessarily have to get in the way of productivity!</p>\n<p>Hopefully this sparks some ideas for making your next data pipeline both fast and correct.</p>\n",
      "summary": "",
      "date_published": "2024-12-08T00:00:00-00:00",
      "image": "media/foursquare-os-places-density-2024.png",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "rust",
        "apache arrow",
        "parquet",
        "duckdb",
        "big data",
        "data engineering",
        "gis"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//quadrupling-the-performance-of-a-data-pipeline.html",
      "url": "https://ianwwagner.com//quadrupling-the-performance-of-a-data-pipeline.html",
      "title": "Quadrupling the Performance of a Data Pipeline",
      "content_html": "<p>Over the past two weeks, I've been focused on optimizing some data pipelines.\nI inherited some old ones which seemed especially slow,\nand I finally hit a limit where an overhaul made sense.\nThe pipelines process and generate data on the order of hundreds of gigabytes,\nrequiring correlation and conflated across several datasets.</p>\n<p>The pipelines in question happened to be written in Node.js,\nwhich I will do my absolute best not to pick on too much throughout.\nNode is actually a perfectly fine solution for certain problems,\nbut was being used especially badly in this case.\nThe rewritten pipeline, using Rust, clocked in at 4x faster than the original.\nBut as we'll soon see, the choice of language wasn't even the main factor in the sluggishness.</p>\n<p>So, let's get into it...</p>\n<h1><a href=\"#problem-1-doing-cpu-bound-work-on-a-single-thread\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-1-doing-cpu-bound-work-on-a-single-thread\"></a>Problem 1: Doing CPU-bound work on a single thread</h1>\n<p>Node.js made a splash in the early 2010s,\nand I can remember a few years where it was the hot new thing to write everything in.\nOne of the selling points was its ability to handle thousands (or tens of thousands)\nof connections with ease; all from JavaScript!\nThe key to this performance is <strong>async I/O</strong>.\nModern operating systems are insanely good at this, and Node made it <em>really</em> easy to tap into it.\nThis was novel to a lot of developers at the time, but it's pretty standard now\nfor building I/O heavy apps.</p>\n<p><strong>Node performs well as long as you were dealing with I/O-bound workloads</strong>,\nbut the magic fades if your workload requires a lot of CPU work.\nBy default, Node is single-threaded.\nYou need to bring in <code>libuv</code>, worker threads (Node 10 or so), or something similar\nto access <em>parallel</em> processing from JavaScript.\nI've only seen a handful of Node programs use 
these,\nand the pipelines in question were not among them.</p>\n<h2><a href=\"#going-through-the-skeleton\" aria-hidden=\"true\" class=\"anchor\" id=\"going-through-the-skeleton\"></a>Going through the skeleton</h2>\n<p>If you ingest data files (CSV and the like) record-by-record in a naïve way,\nyou'll just read one record at a time, process, insert to the database, and so on in a loop.\nThe original pipeline code was fortunately not quite this bad (it did have batching at least),\nbut had some room for improvement.</p>\n<p>The ingestion phase, where you're just reading data from CSV, Parquet, etc.,\nmaps naturally to Rust's <a href=\"https://rust-lang.github.io/async-book/05_streams/01_chapter.html\">streams</a>\n(the cousin of futures).\nThe original node code was actually fine at this stage,\nif a bit less elegant.\nBut the Rust structure we settled on is worth a closer look.</p>\n<pre><code class=\"language-rust\">fn csv_record_stream&lt;'a, S: AsyncRead + Unpin + Send + 'a, T: TryFrom&lt;StringRecord&gt;&gt;(\n    stream: S,\n    delimiter: u8,\n) -&gt; impl Stream&lt;Item = T&gt; + 'a\nwhere\n    &lt;T as TryFrom&lt;StringRecord&gt;&gt;::Error: Debug,\n{\n    let reader = AsyncReaderBuilder::new()\n        .delimiter(delimiter)\n        // Other config elided...\n        .create_reader(stream);\n    reader.into_records().filter_map(|res| async move {\n        // NB: A let-else can't use the moved `res` in its else block,\n        // so we match explicitly to log the error.\n        let record = match res {\n            Ok(record) =&gt; record,\n            Err(e) =&gt; {\n                log::error!(&quot;Error reading from the record stream: {:?}&quot;, e);\n                return None;\n            }\n        };\n\n        match T::try_from(record) {\n            Ok(parsed) =&gt; Some(parsed),\n            Err(e) =&gt; {\n                log::error!(&quot;Error parsing record: {:?}.&quot;, e);\n                None\n            }\n        }\n    })\n}\n</code></pre>\n<p>It starts off dense, but the concept is simple.\nWe'll take an async reader,\nconfigure a CSV reader to pull records from it,\nand map them to another data type using <code>TryFrom</code>.\nIf 
there are any errors, we just drop them from the stream and log an error.\nThis usually isn't a reason to stop processing for our use case.</p>\n<p>You should <em>not</em> be putting expensive code in your <code>TryFrom</code> implementation.\nBut really quick things like verifying that you have the right number of fields,\nor that a field contains an integer or is non-blank are usually fair game.</p>\n<p>Rust's trait system really shines here.\nOur code can turn <em>any</em> CSV(-like) file\ninto an arbitrary record type.\nAnd the same techniques can apply to just about any other data format too.</p>\n<h2><a href=\"#how-to-use-tokio-for-cpu-bound-operations\" aria-hidden=\"true\" class=\"anchor\" id=\"how-to-use-tokio-for-cpu-bound-operations\"></a>How to use Tokio for CPU-bound operations?</h2>\n<p>Now that we've done the light format shifting and discarded some obviously invalid records,\nlet's turn to the heavier processing.</p>\n<pre><code class=\"language-rust\">let available_parallelism = std::thread::available_parallelism()?.get();\n// let record_pipeline = csv_record_stream(...);\nrecord_pipeline\n    .chunks(500)  // Batch the work (your optimal size may vary)\n    .for_each_concurrent(available_parallelism, |chunk| {\n        // Clone your database connection pool or whatnot before `move`\n        // Every app is different, but this is a pretty common pattern\n        // for sqlx, Elasticsearch, hyper, and more which use Arcs and cheap clones for pools.\n        let db_pool = db_pool.clone();\n        async move {\n            // Process your records using a blocking threadpool\n            let documents = tokio::task::spawn_blocking(move || {\n                // Do the heavy work here!\n                chunk\n                    .into_iter()\n                    .map(do_heavy_work)\n                    .collect()\n            })\n            .await\n            .expect(&quot;Problem spawning a blocking task&quot;);\n\n            // Insert processed 
data to your database\n            db_pool.bulk_insert(documents).await.expect(&quot;You probably need an error handling strategy here...&quot;);\n        }\n    })\n    .await;\n</code></pre>\n<p>We used the <a href=\"https://docs.rs/futures/latest/futures/stream/trait.StreamExt.html#method.chunks\"><code>chunks</code></a>\nadapter to pull hundreds of items at a time for more efficient processing in batches.\nThen, we used <a href=\"https://docs.rs/futures/latest/futures/stream/trait.StreamExt.html#method.for_each_concurrent\"><code>for_each_concurrent</code></a>\nin conjunction with <a href=\"https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html\"><code>spawn_blocking</code></a>\nto introduce parallel processing.</p>\n<p>Note that neither <code>chunks</code> nor even <code>for_each_concurrent</code> implies any amount of <em>parallelism</em>\non its own.\n<code>spawn_blocking</code> is the only thing that can actually create a new thread of execution!\nChunking simply splits the work into batches (most workloads like this tend to benefit from batching).\nAnd <code>for_each_concurrent</code> allows for <em>concurrent</em> operations over multiple batches.\nBut <code>spawn_blocking</code> is what enables computation in a background thread.\nIf you don't use <code>spawn_blocking</code>,\nyou'll end up blocking Tokio's async workers,\nand your performance will tank.\nJust like the old Node.js code.</p>\n<p>The astute reader may point out that using <code>spawn_blocking</code> like this\nis not universally accepted as a solution.\nTokio is (relatively) optimized for non-blocking workloads, so some claim that you should avoid this pattern.\nBut my experience, having done this for 5+ years in production code serving over 2 billion requests/month,\nis that Tokio can be a great scheduler for heavier tasks too!</p>\n<p>One thing that's often overlooked in these discussions\nis that not all &quot;long-running operations&quot; are the same.\nOne category consists of 
graphics event loops,\nlong-running continuous computations,\nor other things that may not have an obvious &quot;end.&quot;\nBut some tasks <em>can</em> be expected to complete within some period of time\nthat's longer than a blink.</p>\n<p>In the case of the former (&quot;long-lived&quot; tasks), spawning a dedicated thread often makes sense.\nIn the latter scenario though, Tokio tasks with <code>spawn_blocking</code> can be a great choice.</p>\n<p>For our workload, we were doing a lot of the latter sort of operation.\nOne helpful rule of thumb I've seen is that if your task takes longer than tens of microseconds,\nyou should move it off the Tokio worker threads.\nUsing <code>chunks</code> and <code>spawn_blocking</code> avoids this death by a thousand cuts.\nIn our case, the parallelism resulted in a VERY clear speedup.</p>\n<h1><a href=\"#problem-2-premature-optimization-rather-than-backpressure\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-2-premature-optimization-rather-than-backpressure\"></a>Problem 2: Premature optimization rather than backpressure</h1>\n<p>The original data pipeline was very careful to not overload the data store.\nPerhaps a bit too careful!\nThis may have been necessary at some point in the distant past,\nbut most data storage, from vanilla databases to multi-node clustered storage,\nhas some level of natural backpressure built-in.\nThe Node implementation was essentially limiting the amount of work in-flight that hadn't been flushed.</p>\n<p>This premature optimization and the numerous micro-pauses it introduced\nwere another death by a thousand cuts problem.\nDropping the artificial limits approximately doubled throughput.\nIt turned out that our database was able to process 2-4x more records than under the previous implementation.</p>\n<p><strong>TL;DR</strong> — set a reasonable concurrency, let the server tell you when it's chugging (usually via slower response times),\nand let your async runtime handle the 
rest!</p>\n<h1><a href=\"#problem-3-serde-round-trips\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-3-serde-round-trips\"></a>Problem 3: Serde round-trips</h1>\n<p>Serde, or serialization + deserialization, can be a silent killer.\nAnd unless you're tracking things carefully, you often won't notice!</p>\n<p>I recently listened to <a href=\"https://www.recodingamerica.us/\">Recoding America</a> at the recommendation of a friend.\nOne of the anecdotes made me want to laugh and cry at the same time.\nEngineers had designed a major improvement to GPS, but the rollout was delayed\ndue to a performance problem that rendered it unusable.</p>\n<p>The project was overseen by Raytheon, a US government contractor.\nAnd they couldn't deliver because some arcane federal guidance (not even a regulation proper)\n&quot;recommended&quot; an &quot;Enterprise Service Bus&quot; in the architecture.\nThe startupper in me dies when I hear such things.\nThe &quot;recommendation&quot; boils down to a data exchange medium where one &quot;service&quot; writes data and another consumes it.\nThink message queues like you may have used before.</p>\n<p>This is fine (even necessary) for some applications,\nbut positively crippling for others.\nIn the case of the new positioning system,\nwhich was heavily dependent on timing,\nthis was a wildly inefficient architecture.\nEven worse, the guidelines stated that it should be encrypted.</p>\n<p>This wasn't even &quot;bad&quot; guidance, but in the context of the problem,\nwhich depended on rapid exchange of time-sensitive messages,\nit was a horrendously bad fit.</p>\n<p>In our data pipeline, I discovered a situation that, in retrospect, bears a humorous resemblance to this story.\nThe pipeline was set up using a microservice architecture,\nwhich I'm sure sounded like a good idea at the time,\nbut it introduced some truly obscene overhead.\nAll services involved were capable of working with data in the same format,\nbut the Node.js implementation was split into multiple 
services with HTTP and JSON round trips in the middle!\nDouble whammy!</p>\n<p>The new data pipeline simply imports the &quot;service&quot; as a crate,\nand gets rid of all the overhead by keeping everything in-process.\nIf you really do need a microservice architecture (ex: to scale another service up independently),\nthen other communication + data exchange formats may improve your performance.\nBut if it's possible to keep everything in-process, your overhead is roughly zero.\nThat's hard to beat!</p>\n<h1><a href=\"#conclusion\" aria-hidden=\"true\" class=\"anchor\" id=\"conclusion\"></a>Conclusion</h1>\n<p>In the end, the new pipeline was 4x the speed of the old.\nI happened to rewrite it in Rust, but Rust itself wasn't the source of all the speedups:\nunderstanding the architecture was.\nYou could achieve similar results in Node.js or Python,\nbut Rust makes it significantly easier to reason about the architecture and correctness of your code.\nThis is especially important when it comes to parallelizing sections of a pipeline,\nwhere Rust's type system will save you from the most common mistakes.</p>\n<p>These and other non-performance-related reasons to use Rust will be the subject of a future blog post (or two).</p>\n",
      "summary": "",
      "date_published": "2024-11-29T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "algorithms",
        "rust",
        "elasticsearch",
        "nodejs",
        "data engineering",
        "gis"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//searching-for-tiger-features.html",
      "url": "https://ianwwagner.com//searching-for-tiger-features.html",
      "title": "Searching for TIGER Features",
      "content_html": "<p>Today I had a rather peculiar need to search through features from TIGER\nmatching specific attributes.\nThese files are not CSV or JSON, but rather ESRI Shapefiles.\nShapefiles are a binary format that has long outlived its welcome\naccording to many in the industry, but they still persist today.</p>\n<h1><a href=\"#context\" aria-hidden=\"true\" class=\"anchor\" id=\"context\"></a>Context</h1>\n<p>Yeah, so this post probably isn't interesting to very many people,\nbut here's a bit of context in case you don't know what's going on and you're still reading.\nTIGER is a geospatial dataset published by the US government.\nThere's far more to this dataset than fits in this TIL post,\nbut my interest in it lies in finding addresses.\nSpecifically, <em>guessing</em> at where an address might be.</p>\n<p>When you type an address into your maps app,\nthey might not actually have the exact address in their database.\nThis happens more than you might imagine,\nbut you can usually get a pretty good guess of where the address is\nvia a process called interpolation.\nThe basic idea is that you take address data from multiple sources and use that to make a better guess.</p>\n<p>Some of the input to this is existing address points.\nBut there's one really interesting form of data that brings us to today's TIL:\naddress ranges.\nOne of the TIGER datasets is a set of lines (for the roads).\nEach segment is annotated with info letting us know the range of house numbers on each side of the road.</p>\n<p>I happen to use this data for my day job at Stadia Maps,\nwhere I was investigating a data issue today related to our geocoder and TIGER data.</p>\n<h1><a href=\"#getting-the-data\" aria-hidden=\"true\" class=\"anchor\" id=\"getting-the-data\"></a>Getting the data</h1>\n<p>In case you find yourself in a similar situation,\nyou may notice that the data from the government is sitting in an FTP directory,\nwhich contains a bunch of confusingly named ZIP 
files.\nThe data that I'm interested in (address features)\nhas names like <code>tl_2024_48485_addrfeat.zip</code>.</p>\n<p>The year might be familiar, but what's that other number?\nThat's a FIPS code for the county whose data is contained in the archive.\nYou can find a <a href=\"https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt\">list here</a>.\nThis is somewhat interesting in itself, since the first 2 characters are a state code:\nTexas, in this case.\nThe full number identifies a county: Wichita County.\nYou can suck down the entire dataset, just one file, or anything in-between\nfrom the <a href=\"https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html\">Census website</a>.</p>\n<h1><a href=\"#searching-for-features\" aria-hidden=\"true\" class=\"anchor\" id=\"searching-for-features\"></a>Searching for features</h1>\n<p>So, now you have a directory full of ZIP files,\neach of which has a bunch of files necessary to interpret the shapefile.\nIsn't GIS lovely?</p>\n<p>The following script will let you write a simple &quot;WHERE&quot; clause,\nfiltering the data exactly as it comes from the Census Bureau!</p>\n<pre><code class=\"language-bash\">#!/bin/bash\nset -e;\n\nfind &quot;$1&quot; -type f -iname &quot;*.zip&quot; -print0 |\\\n  while IFS= read -r -d $'\\0' filename; do\n\n    filtered_json=$(ogr2ogr -f GeoJSON -t_srs crs:84 -where &quot;$2&quot; /vsistdout/ &quot;/vsizip/$filename&quot;);\n    # Check if the filtered GeoJSON has any features\n    feature_count=$(echo &quot;$filtered_json&quot; | jq '.features | length')\n\n    if [ &quot;$feature_count&quot; -gt 0 ]; then\n      # echo filename to stderr\n      &gt;&amp;2 echo $(date -u) &quot;Match(es) found in $filename&quot;;\n      echo &quot;$filtered_json&quot;;\n    fi\n\n  done;\n</code></pre>\n<p>You can run it like so:</p>\n<pre><code class=\"language-shell\">./find-tiger-features.sh $HOME/Downloads/tiger-2021/ &quot;TFIDL = 213297979 OR TFIDR = 
213297979&quot;\n</code></pre>\n<p>This ends up being a LOT easier and faster than QGIS in my experience\nif you want to search for specific known attributes.\nEspecially if you don't know the specific area that you're looking for.\nI was surprised that no such tool for things like ID lookups existed already!</p>\n<p>Note that this isn't exactly &quot;fast&quot; by typical data processing workload standards.\nIt takes around 10 minutes to run on my laptop.\nBut it's a lot faster than the alternatives in many circumstances,\nespecially if you don't know exactly which file the data is in!</p>\n<p>For details on the fields available,\nrefer to the technical documentation on the <a href=\"https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html\">Census Bureau website</a>.</p>\n",
      "summary": "",
      "date_published": "2024-11-09T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "gis",
        "shell",
        "ogr2ogr"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//learning-a-language-the-lazy-way.html",
      "url": "https://ianwwagner.com//learning-a-language-the-lazy-way.html",
      "title": "Learning a Language the Lazy Way",
      "content_html": "<p>I've lived in South Korea for quite some time,\nand during my stay here I've become reasonably fluent in the language.\nPeople often ask how long it took to become fluent\nand if I have any tips for their language learning aspirations.\nThis post is about what I've learned from a decade-plus fascination with language learning.</p>\n<h1><a href=\"#anki-supermemo-memrise-etc\" aria-hidden=\"true\" class=\"anchor\" id=\"anki-supermemo-memrise-etc\"></a>Anki, SuperMemo, Memrise, etc.</h1>\n<p>Starting off, way too many years ago,\nI heard there were all these great tools that would help you remember ANYTHING.\nThe promise isn't exactly wrong, but there's something you should know about me:\nI'm pretty lazy 😉</p>\n<p>Memorizing things in particular takes a lot of effort, at least for me.\nI very much dislike things that are boring or tedious,\nand this was definitely that.\nEven so, I want to be careful to point out that this strategy MIGHT\nwork for you if you follow a few rules.</p>\n<ol>\n<li>Only put stuff in there that you really do want to remember.\nDon't put EVERYTHING (easy words etc.) 
in there.</li>\n<li>Make your own decks, unless you're studying for something like the JLPT, which has a very clear list.\nI mostly used Anki with decks built by others.\nThat was a mistake because there were often low quality cards,\nthings I didn't really care about, and so on,\nwhich made it feel even more like a chore.</li>\n<li>Use it as a supplement to other tools to remember what you learned.\nYou will NOT learn a language by just doing Anki reviews.\nYou need to have a reliable way of learning what you want to remember in the first place.</li>\n</ol>\n<h1><a href=\"#classes\" aria-hidden=\"true\" class=\"anchor\" id=\"classes\"></a>Classes</h1>\n<p>Classes are hit or miss.\nWhile I was in the US, I took classes over a span of 3 years.</p>\n<p>I had a few 1:1 and 1:2 tutor sessions over a summer on a weekly basis.\nThis taught me how to read and write,\nand <em>most</em> of how to pronounce things.\nBut it wasn't very effective beyond that (neither tutor was a professional, and it was a bit rough).</p>\n<p>After that, I took classes at a local Korean culture center.\nIt was great fun and I did learn some things.\nMostly I had my pronunciation fine-tuned by native speakers.\nBut to be honest, I didn't really learn a lot of useful phrases,\nand the vocabulary was just words.</p>\n<p>This was mostly a function of the class, not the instructor.\nThe classes were cheap, highly social, and motivation was generally low.\nThe cultural background, history, and etiquette were the most important things\nI learned from these classes.</p>\n<p>Formal classes and I don't go well together, so I didn't pursue those at all.\nAlso remember, this is about how to learn a language the <em>lazy</em> way!</p>\n<h1><a href=\"#aside-graded-readers\" aria-hidden=\"true\" class=\"anchor\" id=\"aside-graded-readers\"></a>Aside: graded readers</h1>\n<p>Quick break from being lazy for a moment; graded readers are awesome.\nIf your language of interest has them,\nparticularly if you can get 
audio to go along with a bilingual book,\nDO IT.\nI learned a very respectable amount of German in a very short time\nthanks to <a href=\"https://www.briansmith.de/books.php\">Brian Smith's German readers</a>.\nUnfortunately, I have never been able to find such good material for Korean.</p>\n<h1><a href=\"#immersion\" aria-hidden=\"true\" class=\"anchor\" id=\"immersion\"></a>Immersion</h1>\n<p>When I arrived in Korea, I learned very quickly that I didn't know anything\nbeyond how to read and say a very few basic phrases.\nI didn't even learn how to order a cappuccino or lunch to go\nuntil my first week here!\nImmersion is hard to beat.</p>\n<p>But just being somewhere is not very effective unless you are very much out and about\nand able to engage with native speakers.\nImmersion isn't just going to a country and chilling for 20 years with a bunch of expats.\n(I know people who did this and they still only speak English.)</p>\n<p>Choose your local friends carefully if you are serious about using them to learn a language.\nIf they can speak English (or some other language you prefer) comfortably,\nyou will both quickly revert to that.\nWhich means you both need to be pretty patient and committed to figuring out how to communicate.\nEating, drinking, and taking trips together with locals that can barely speak your language\nis the fastest way to learn the local culture, slang,\nand a <em>reasonably</em> effective way to learn your target language.</p>\n<h1><a href=\"#media\" aria-hidden=\"true\" class=\"anchor\" id=\"media\"></a>Media</h1>\n<p>I used to not put too much faith in the whole &quot;watch movies to learn a language&quot; thing.\nI still mostly don't, but my views have evolved a bit over time.\nPart of my belief came from realizing that most movies (that an adult learner) wants to watch\nuse vocabulary and phrases that are at the wrong level.\nIt didn't really help that in my target language,\nI really had ZERO interest in the popular shows 
(&quot;dramas&quot;).\nYou NEED to enjoy what you're consuming!</p>\n<p>However... kids shows are gold.\nIf you want to learn Korean, go binge watch Pororo.\nSeriously.\nIt's silly, super basic, everyday vocabulary\nat a pre-school level.\nYou'll learn idioms, nuance (the characters over-exaggerate their expressions, since it's a kids show),\nand a whole lot more.</p>\n<p>The show is also EXTREMELY dialog-heavy.\nSo find shows like that, watch all 200 or whatever episodes,\nand you'll be able to move on to something slightly more advanced.\nFor Korean learners, I'd recommend Titipo next.\nI can only describe this show as a modern Korean version of Thomas the Tank Engine.</p>\n<p>Subtitles?\nI personally don't like them, because they quickly grab my attention,\nand I end up watching the subs rather than listening intently.\nThis is a bit of a hot take, but I would highly recommend giving no subtitles a try.\nException: native-language subtitles aren't as problematic.\nI don't use them often,\nbut they CAN be useful for improving your reading speed or stopping to look up a word.</p>\n<p>Speaking of stopping, try with all your might to not stop the show.\nThis is the lazy method, remember?\nIt's also how literally every native speaker learns.\nYes, it's slower, but your learning will be much stronger.\nAs you work through binge watching kids shows,\nyou'll learn which shows you can understand.\nAfter a few hundred episodes of at least 2 or 3 kids shows,\nyou'll have a solid base understanding to work with,\nand it won't feel like you had to work very hard for it.\nZero memorization ideally.</p>\n<p>From here, try to stick with shows where you understand like 80% of the <em>plot</em> of every episode.\nI'm talking about an intuitive feeling by the way,\nnot necessarily even recognizing that percent of the words.\nYou certainly don't need to know every word to understand the general meaning of a phrase.\nSee the &quot;optimal learning failure rate&quot;;\nsome scientists put this at 
85%,\nbut I don't think it needs to be that high for word recognition.</p>\n<p>We gloss over words that we can't quite define in our native languages all the time,\nwithout impairing our ability to understand the meaning of a sentence.\nLook things up after you hear them a few times and want to clarify your understanding.</p>\n<p>Books I have personally had limited success with.\nWriting is a bit more formal, especially in Korean.\nBut comics... oh boy!\nFortunately, I LOVE Japanese manga.\nAnd all the popular series are translated into Korean.\nSo I'd read the manga on my e-reader\nand then watch the show.\nThis was an absolutely fantastic pairing which took me way too many years to discover,\nand regrettably probably only works for a few languages.</p>\n<p>If you are looking for Korean media,\nyou can find comics at <a href=\"https://ridibooks.com/comics/ebook\">Ridi Books</a>.\nThey have pretty good apps,\nand I use it on my Boox e-Reader tablet (highly recommend).\nFor streaming animated series,\nI use <a href=\"https://laftel.net/\">Laftel</a>,\nbut as far as I know it's quite difficult to use outside Korea (payments in particular are a problem).\nI think Netflix is also an option,\nbut Laftel certainly has a larger content library.</p>\n<p>Music is, in my opinion, interesting purely for enjoyment.\nI don't think you learn a language from listening to music,\nsince &quot;musical language&quot; is so different from how anyone talks.\nOften even the pronunciation is different.\nI can sing <em>Dragostea din Tei</em> perfectly from memory,\nbut could hardly tell you the meaning of a dozen words in Romanian\n(fun aside: O-Zone is from Moldova, but their popular songs are in Romanian!).\nSongs lack context to enable learning.</p>\n<h1><a href=\"#grammar\" aria-hidden=\"true\" class=\"anchor\" id=\"grammar\"></a>Grammar</h1>\n<p>Grammar isn't on the curriculum in the school of lazy language learning.\nLearn how to spell.\nLearn the <em>basic</em> sentence 
structures,\nbut treat it like pattern recognition.</p>\n<p>You'll be surprised how much of conversational language use\nfollows a very limited subset of a language.\nJust focus on that, like the kids shows do.\nYou'll gradually pick up other rules over time.</p>\n<p>Fortunately, while Korean grammar isn't always easy per se,\nthe forms used in conversation are generally very simple and formulaic.\nUnfortunately, this advice can't apply equally to all languages.\nFor example, German is complicated by gendered nouns,\nand Estonian is complicated by over a dozen cases.\nBut even there, just start with simple pattern recognition of what's most common.\nThe kids shows and friends will guide you.</p>\n<h1><a href=\"#math\" aria-hidden=\"true\" class=\"anchor\" id=\"math\"></a>Math</h1>\n<p>Huh, what's this heading doing here?\nWell, part of communicating is learning to count.\nYou should do that.\nThis is some of the only &quot;proper&quot; study you'll have to do.\nBut you really just need to learn how to count.</p>\n<p>On a more philosophical note,\nI'm an engineer, and I love thinking in formulas, rules, and patterns.\nA lot of the way that language is taught formally\nattempts to put it in a box like this.\nYou study grammatical rules, forms, suffixes, and so on.\nBut the number of people who ACTUALLY learn languages this way is a rounding error.</p>\n<p>You will find more inconsistencies in any language than you can shake a stick at.\nThat's part of what makes learning human languages hard.\nComputer languages are, by contrast, easy.\nThey have to adhere to rigid rules,\nand after you've learned a few (much easier than it may sound),\nyou can often pick up a new one in a weekend.\nThis rule-based approach will only hurt you with human languages, though.</p>\n<p>As Captain Barbossa infamously said,\nit's more like &quot;guidelines&quot; than actual rules.\nLoosen up and enjoy!</p>\n<h1><a href=\"#other-resources\" aria-hidden=\"true\" class=\"anchor\" 
id=\"other-resources\"></a>Other resources</h1>\n<p>My friend Andrew put together a list of useful resources he's found for Korean self-study.\nCheck it out <a href=\"https://andrewzah.com/blog/korean-learning-useful-apps/\">on his blog</a>.</p>\n",
      "summary": "",
      "date_published": "2024-11-06T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "learning",
        "languages",
        "self-improvement"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//sequence-locks.html",
      "url": "https://ianwwagner.com//sequence-locks.html",
      "title": "Sequence Locks",
      "content_html": "<p>I just listened to a fantastic Two's Complement <a href=\"https://www.twoscomplement.org/podcast/sequence_locks.mp3\">podcast episode</a>\n(<a href=\"https://www.twoscomplement.org/podcast/sequence_locks.txt\">transcript</a>)\nin which Matt and Ben discussed a data structure I'd never heard of before:\nthe <a href=\"https://en.wikipedia.org/wiki/Seqlock\">sequence lock</a>.\nIt is not very well known,\nbut it's useful for cases where you want to avoid writer starvation\nand it's acceptable for readers to do a bit more work\n(and occasionally fail).</p>\n<p>Some use cases they discussed:</p>\n<ul>\n<li>Getting the time information from the kernel without a syscall</li>\n<li>Getting the latest stock ticker price when latency isn't a concern</li>\n</ul>\n<p>The basic idea is to increment an atomic counter twice:\nonce when a writer enters the critical section, and again when it is finished.\nThis dance lets readers quickly check that they aren't reading in the middle of an update:\na simple modulo 2 check tells you whether a write is in progress,\nand re-reading the counter after the read tells you whether a write happened in the meantime.\nClever!</p>\n<p>It's a lot lighter than a mutex,\nallows a single writer in the basic case,\nbut can support more (per Matt).\nAnd of course there is no mutex or similar mechanism creating a bottleneck.</p>\n<p>The tradeoff is that reads can fail,\nand it is the responsibility of the reader to retry in this case.\nReaders will always be able to get the latest value <em>eventually</em>,\nand writers are never blocked.</p>\n<p>During the podcast, Matt mentioned formal verification in passing,\nsince it's difficult to gain confidence in these sorts of things.\nThis area is something I've long been fascinated with,\nbut I agree with him that it's not practical for most applications.\nIf formal verification is interesting to you,\nOxide &amp; Friends did a whole <a href=\"https://oxide-and-friends.transistor.fm/episodes/software-verificationpalooza\">episode on it</a>.</p>\n",
      "summary": "",
      "date_published": "2024-10-30T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "concurrency",
        "data-structures",
        "podcasts"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//copying-and-unarchiving-from-a-server-without-a-temp-file.html",
      "url": "https://ianwwagner.com//copying-and-unarchiving-from-a-server-without-a-temp-file.html",
      "title": "Copying and Unarchiving From a Server Without a Temp File",
      "content_html": "<p>Sometimes I want to copy files from a remote machine--usually a server I control.\nEasy; just use <code>scp</code>, right?</p>\n<p>Well, today I had a subtly different twist to the usual problem.\nI needed to transfer a ~100GB tarball to my local machine,\nand I really wanted to unarchive it so that I could get at the internal data directly.\nAnd I wanted to do it in one step, since I didn't have 200GB of free space.</p>\n<p>I happened to remember that this <em>should</em> be possible with pipes or something.\nThe tarball format is helpfully designed to allow streaming.\nBut it took me a bit to come up with the right set of commands to do this.</p>\n<p><code>scp</code> is really designed for dumping to files first.\nI found a few suggestions on StackOverflow that looked like they might work,\nbut didn't for me (might have been my shell? I use <code>fish</code> rather than <code>bash</code>).\nBut I noticed that almost all of the answers recommended using <code>ssh</code> instead,\nsince it's a bit more suited to the purpose.</p>\n<p>The basic idea is to dump the file to standard out on the remote host,\nthen pipe the ssh output into <code>tar</code> locally.\nThe <code>tar</code> flags are probably familiar or easily understandable: <code>-xvf -</code> in my case.\nThis puts <code>tar</code> into extract mode,\nenables verbose logging (so you see its progress),\nand tells it to read from stdin (<code>-</code>).\nMy <em>tarball</em> was not compressed.\nIf yours is, add the appropriate decompression flags.</p>\n<p>The SSH flags were a bit trickier.\nI discovered the <code>-C</code> flag, which enables gzip compression.\nI happen to know this dataset compresses well with gzip,\nand further that the network link between me and the remote is not the best,\nso I enabled it.\nDon't use this if your data does not compress well,\nor if it is already compressed.</p>\n<p>Another flag, <code>-e none</code>,\nI found via <a 
href=\"https://www.unix.com/unix-for-dummies-questions-and-answers/253941-scp-uncompress-file.html\">this unix.com forum post</a>.\nThis seemed like a good thing to enable after some research,\nsince sequences like <code>~.</code> will not be interpreted as &quot;kill the session.&quot;\nIt also prevents more subtle bugs which would look like data corruption.</p>\n<p><code>-T</code> was suggested after I pressed ChatGPT o1-preview for other flags that might be helpful.\nIt just doesn't allocate a pseudo-terminal.\nWhich we didn't need anyways.\n(Aside: ChatGPT 4o will give you some hot garbage suggestions; o1-preview was only helpful in suggesting refinements.)</p>\n<p>Finally, the command executes <code>cat</code> on the remote host to dump the tarball to <code>stdout</code>.\nI saw suggestions as well to use <code>dd</code> since you can set the block size explicitly.\nThat might improve perf in some situations if you know your hardware well.\nOr it might just be a useless attempt at premature optimization ;)</p>\n<p>Here's the final command:</p>\n<pre><code class=\"language-shell\">ssh -C -e none -T host.example.com 'cat /path/to/archive.tar' | tar -xvf -\n</code></pre>\n",
      "summary": "",
      "date_published": "2024-10-28T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "ssh",
        "terminal",
        "shell",
        "tar"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//kwdc-24.html",
      "url": "https://ianwwagner.com//kwdc-24.html",
      "title": "KWDC 24",
      "content_html": "<p>Yesterday I had the pleasure of attending KWDC 24,\nan Apple developer conference modeled after WWDC,\nbut for the Korean market.\nRegrettably, I only heard about it a few days prior\nthrough a friend at the <a href=\"https://www.meetup.com/seoul-ios-meetup/\">Seoul iOS Meetup</a>,\nso I wasn’t able to give a talk.</p>\n<h1><a href=\"#overall-impressions\" aria-hidden=\"true\" class=\"anchor\" id=\"overall-impressions\"></a>Overall impressions</h1>\n<p>The iOS meetup typically has 20-30 attendees.\nBut wow... the turnout at KWDC far exceeded my expectations.\nNearly 600 attendees showed up,\nand this is only its second year (I also didn’t hear about it last year;\nclearly I live under a rock)!\nThe staff were well organized and friendly,\nand the international participation was significantly better than I had expected.</p>\n<h1><a href=\"#the-challenge-of-multi-lingual-events\" aria-hidden=\"true\" class=\"anchor\" id=\"the-challenge-of-multi-lingual-events\"></a>The challenge of multi-lingual events</h1>\n<p>Surprisingly, most of the (30+) event staff also spoke English very well (that’s a first for a Korean conference that I’ve been to).\nI think the organizers did an excellent job in not only attracting an international audience\n(<a href=\"https://nerdyak.tech/\">one speaker</a> flew in from Czechia!),\nbut also making them feel welcome.\nHats off to the organizing team for that!\nThis makes me really happy, and I hope this is a big step in raising the level of conferences here.</p>\n<p>Not only did they have a mix of English and Korean talks,\nthey handled live translation much better than any other event I’ve seen.\nThey used a service called <a href=\"https://flitto.com\">Flitto</a>.\nApparently it’s a Korean company,\nand they were using some AI models to do the heavy lifting.\nIt had some hiccups to be sure,\nbut it did a surprisingly good job!\nThe main complaint I heard was that it would wait for half or 2/3 of a screenful of 
content, causing jumps that were hard to read.</p>\n<p>The speakers I talked with said they had to provide a script in advance,\nwhich we speculate was used to improve the quality of the translations\n(which were still live, even when the presenter went “off script”).\nThe model still failed to recognize technical terms on occasion,\nbut the hiccups were to be expected.\nOverall, the quality of the translation was excellent,\nand I think this will be the future of such events!\nI’ve been at half a dozen events where they give you a radio receiver\nand an earpiece, and it is never a great experience.\nEveryone I talked with said the\ntext-on-screen approach was better too\n(especially since it was alongside the slides,\nwhich were honestly Apple quality at every single talk I saw!).</p>\n<p><figure><img src=\"media/IMG_8818.jpeg\" alt=\"View of the stage during a talk about Swift 6, showing the screen with live translation next to the slides\" /></figure></p>\n<h1><a href=\"#favorite-talks\" aria-hidden=\"true\" class=\"anchor\" id=\"favorite-talks\"></a>Favorite talks</h1>\n<p>My favorite talk was Pavel’s on “The Magic of SwiftUI Animations.”\nHe even walked up to the podium in a wizard robe 🧙\nI was blown away by the amount of effort that he put into the slides,\nand got a bunch of things to follow up on (like this <a href=\"https://m.youtube.com/watch?v=f4s1h2YETNY\">video on shaders</a>).\nTalking with him after, he said it was the culmination of around 4 years of effort.</p>\n<p><figure><img src=\"media/IMG_8820.jpeg\" alt=\"Pavel Zak on stage\" /></figure></p>\n<p><a href=\"https://x.com/riana_soumi\">Riana’s</a> talk on Swift Testing\ngot me fired up to switch.\nI wanted to shout with excitement when I heard that Swift <em>finally</em>\nsupports parameterized tests in a native testing framework!\nAnd it’s <a href=\"https://github.com/swiftlang/swift-testing\">open source</a>,\nso I hope it will improve faster than XCTest.\nWho knows; maybe 
it’ll even get property testing (like QuickCheck, Hypothesis, etc.)!</p>\n<p>The third talk that stuck out to me was <a href=\"https://www.rudrank.com/\">Rudrank’s</a>\ntalk on widgets.\nHe was a GREAT presenter, with a number of Korean expressions woven in,\nwhich the audience loved.\nI also liked how he cleverly wove the Rime of the Ancient Mariner throughout\nthe talk (the title was “Widgets, Widgets Everywhere, and not a Pixel to Spare”).\nMy biggest takeaway was how different the mental model for updates is:\nit’s all about the timeline!</p>\n<h1><a href=\"#networking\" aria-hidden=\"true\" class=\"anchor\" id=\"networking\"></a>Networking</h1>\n<p>Networking at Korean conferences is typically a bit slow to be honest,\nas it is not normal in Korean culture to walk up to someone\nand start a conversation without much context.\nThis event, however, was exceptional in a good way!</p>\n<p>The first big networking opportunity was lunch.\nBut there wasn't a very clear announcement of how lunch would work.\nEveryone was on their own, and the info was rather buried in some PDFs (which I didn’t get somehow)\nand Discord (which I had a hard time navigating).\nTogether with a Danish friend I met at the iOS meetup a few days prior,\nI suggested we wing it and just follow the crowd outside to see where we ended up,\nsince they were clearly better informed than us 🤣</p>\n<p>We took a few turns following a group in front of us,\nand eventually I asked if they would be cool with us crashing their party.\nA few minutes later, we ended up at a crowded Donkatsu buffet.\nThe two at our table were iOS engineers, one working at Hyundai AutoEver,\nand another at <a href=\"https://www.bucketplace.com/en/\">오늘의집</a>,\nand we had a great conversation over lunch!\n(They even bought us coffee after; so friendly!)</p>\n<p>One of them mentioned that we should check out the networking area between sessions, which I did later.\nIt was a bit hard to find, 
since it was in a narrow hall,\nafter you passed through a cafe on another floor.\nI think this could have been announced a bit better,\nsince not many people used it, but the conversations I had there were great!</p>\n<p>Aside: another cool thing that I haven’t seen done elsewhere is round-table Q&amp;A.\nIn addition to their session slots, each speaker was available for Q&amp;A at (literally)\nround tables near the networking zone.\nVery cool idea!</p>\n<p>The networking area had one room dedicated to local communities as well,\nincluding the Seoul iOS Meetup,\nthe Korean <a href=\"https://github.com/Swift-Coding-Club\">Swift Coding Club</a>,\nand the AWS Korea User Group.\nThe Swift Coding Club in particular was a super cool group.\nSeveral of its members were students,\nand one was working on some apps related to EV charging.\nThis naturally led to a conversation about the geocoding,\nmaps, and navigation SDKs I’ve been working on at <a href=\"https://docs.stadiamaps.com/sdks/overview/\">Stadia Maps</a>.\nIt was a good time!</p>\n<p>Finally, there wasn’t a big after-party or anything,\nbut there was a small event at a bar organized for the speakers and sponsors.\nI didn’t speak, but they were fine letting me tag along.\nI ended up talking for well over an hour with Riana about everything from Swift\nto world cultures to the under-representation of women in tech.\nAnd she 100% sold me on attending <a href=\"https://tryswift.jp/_en\"><code>try! Swift</code></a>\nin Tokyo next year.\nAnd Mark from <a href=\"https://www.revenuecat.com/\">RevenueCat</a>,\nwhom I also met at the iOS meetup a few days prior,\ntaught me a bunch of things I didn’t know about the history of MacRuby\n(turns out he built the first Basecamp app using RubyMotion back in the day!).</p>\n<p>I got home at 1:30am for the second time this week.\nBut it was worth it!</p>\n",
      "summary": "",
      "date_published": "2024-10-26T00:00:00-00:00",
      "image": "media/IMG_8818.jpeg",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "conferences",
        "AI",
        "translation",
        "apple",
        "swift"
      ],
      "language": "en"
    }
  ]
}