So you just finished your brand-new Sitecore implementation. All your components are shiny, beautiful, and bug-free! Your client is thrilled, BUT they have thousands upon thousands of pieces of content in their legacy system that need to be migrated into Sitecore! You could do it manually over the next decade, but we’re lazy, so let’s automate where we can!
Inventory and export the content
When I begin a migration, the first thing I do is try to understand the content. I usually start with the sitemap.xml to get the general structure of the site. That structure usually helps you classify content by type: news, blogs, products, etc. Once you have a sense of the types of content the site has, you can dig deeper into each type and see if there are commonalities. If there are, those are great candidates for migrating programmatically. Once I have identified which content to migrate programmatically, I like to export it into CSV files. CSV files are easy to create, manipulate, and work with, but use whatever format you are comfortable with.
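As a rough sketch of that first pass, the snippet below groups sitemap URLs by their first path segment to hint at the content types on a site. The inline sample XML, the example.com URLs, and the helper name Get-UrlSection are all made up for illustration; a real sitemap would be loaded from a file or downloaded first.

```powershell
# Hypothetical sample data standing in for a real sitemap.xml.
$sitemapXml = [xml]@'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/launch</loc></url>
  <url><loc>https://example.com/news/update</loc></url>
  <url><loc>https://example.com/blog/hello</loc></url>
  <url><loc>https://example.com/</loc></url>
</urlset>
'@

# Hypothetical helper: returns the first path segment of a URL,
# e.g. "news" for https://example.com/news/launch.
function Get-UrlSection {
    param([string]$Url)
    $segments = ([uri]$Url).AbsolutePath.Trim('/').Split('/')
    if ($segments[0]) { $segments[0] } else { '(root)' }
}

# Count how many URLs fall under each top-level section.
$sectionCounts = $sitemapXml.urlset.url |
    Group-Object { Get-UrlSection $_.loc } |
    Sort-Object Count -Descending |
    Select-Object Name, Count

$sectionCounts
```

Sections with many URLs and a consistent layout are usually the first ones worth exporting to CSV.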
Importing with Sitecore PowerShell Extensions
If you haven’t used SPE, start. Right now. Here is the link: https://doc.sitecorepowershell.com/
With SPE it is super easy to read CSV files:
Import-Csv "C:\exportFIles\site_export.csv" | ForEach-Object {
    $field1Val = $_.field1Val     # field1Val is the column heading name
    $field2Val = $_."field2 Val"  # sometimes headings have spaces
}
Once you are reading the file, you can start parsing its content. A lot of the time the exported content is in HTML, and you need to parse it and break it out into fields. Parsing HTML is not the most fun, but Sitecore comes packaged with Html Agility Pack, which is a great tool for reading and parsing HTML.
# Load an HTML document from a string
$htmlDocument = New-Object -TypeName HtmlAgilityPack.HtmlDocument
$htmlDocument.LoadHtml($htmlString)
Once the document is loaded you can do a bunch of neat stuff with it. For example, you can get all the links in the document, so you can find old links in the content and map them to the new ones:
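# Get all links and extract their href values.
# SelectNodes returns $null when nothing matches, so check before piping.
$links = $htmlDocument.DocumentNode.SelectNodes("//a[@href]")
if ($null -ne $links) {
    $links | ForEach-Object {
        $hrefString = $_.GetAttributeValue("href", "")
    }
}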
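To illustrate the mapping step, here is a minimal sketch that looks up each legacy href in a hashtable and returns the new URL, leaving unknown links untouched. The map entries and the function name Convert-LegacyLink are hypothetical; in a real migration the map might come from another CSV file.

```powershell
# Hypothetical old-to-new URL map; in practice this could be loaded from a CSV.
$linkMap = @{
    '/old-news/launch.aspx' = '/news/launch'
    '/aboutus.html'         = '/about'
}

# Hypothetical helper: returns the mapped URL, or the original href if no mapping exists.
function Convert-LegacyLink {
    param(
        [string]$Href,
        [hashtable]$Map
    )
    # Split off any query string or fragment so "/aboutus.html#team" still matches.
    $index = $Href.IndexOfAny([char[]]'?#')
    if ($index -ge 0) {
        $path   = $Href.Substring(0, $index)
        $suffix = $Href.Substring($index)
    } else {
        $path   = $Href
        $suffix = ''
    }
    if ($Map.ContainsKey($path)) { $Map[$path] + $suffix } else { $Href }
}

Convert-LegacyLink -Href '/aboutus.html#team' -Map $linkMap
```

Inside the link loop, the result could be written back with Html Agility Pack's SetAttributeValue("href", $newHref) on the node, and the updated markup read from $htmlDocument.DocumentNode.OuterHtml.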
Most of the exported content is walls of HTML, and it’s important to break it up into sections so that you can put it into components. You can traverse the HTML document and parse out the pieces that you want:
$title = ""
$body = ""
$htmlDocument.DocumentNode.ChildNodes | ForEach-Object {
    if ($_.Name -eq "h2") {
        # Inner HTML of the h2, to be put into a title field. Doesn't include the h2 tags themselves.
        $title = $_.InnerHtml
    }
    else {
        # Everything else goes into the body variable, including the surrounding HTML tags.
        $body += $_.OuterHtml
    }
}
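When a page has several h2 headings, the same traversal can produce one section per heading instead of a single title and body. The sketch below only assumes that each node exposes Name, InnerHtml, and OuterHtml, as Html Agility Pack nodes do. The function name Split-HtmlSection and the stand-in node objects are made up so the example runs without Sitecore.

```powershell
# Hypothetical helper: starts a new section at every h2 and appends
# the nodes that follow it to that section's body.
function Split-HtmlSection {
    param([object[]]$Nodes)
    $sections = New-Object System.Collections.Generic.List[object]
    $current = $null
    foreach ($node in $Nodes) {
        if ($node.Name -eq 'h2') {
            $current = [pscustomobject]@{ Title = $node.InnerHtml; Body = '' }
            $sections.Add($current)
        } elseif ($current) {
            $current.Body += $node.OuterHtml
        }
        # Content before the first h2 is skipped here; handle it separately if needed.
    }
    return $sections
}

# Stand-in objects that mimic the three node properties used above.
$nodes = @(
    [pscustomobject]@{ Name = 'h2'; InnerHtml = 'Intro';   OuterHtml = '<h2>Intro</h2>' }
    [pscustomobject]@{ Name = 'p';  InnerHtml = 'Hello';   OuterHtml = '<p>Hello</p>' }
    [pscustomobject]@{ Name = 'h2'; InnerHtml = 'Details'; OuterHtml = '<h2>Details</h2>' }
    [pscustomobject]@{ Name = 'p';  InnerHtml = 'More';    OuterHtml = '<p>More</p>' }
)
$sections = @(Split-HtmlSection -Nodes $nodes)
$sections | Format-Table Title, Body
```

With a real document, the call would presumably be Split-HtmlSection -Nodes $htmlDocument.DocumentNode.ChildNodes, and each section could then become its own datasource item.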
Creating Sitecore Items and Components
Once you have the content parsed out, you want to put those values into actual Sitecore items. Creating items and setting presentation details is a breeze with SPE, and you can easily create pages, renderings, and datasources:
# Parent path, written with the SPE "master:" drive
$path = "master:\content\home"
# Create the Sitecore page item (the item name is a placeholder)
$sitecorePage = New-Item -Path $path -Name "name of new page" -ItemType "guid of templateID"
# Set some fields
$sitecorePage.Editing.BeginEdit()
$sitecorePage["somefield"] = "some value"
$sitecorePage.Editing.EndEdit()
# Create the datasource item. In this case I'm creating it as a child of the previous item
$datasourceItem = New-Item -Parent $sitecorePage -Name "name of datasource" -ItemType "guid of templateID"
$datasourceItem.Editing.BeginEdit()
$datasourceItem["somefield"] = "some value"
$datasourceItem.Editing.EndEdit()
# Find the rendering item and convert it to a rendering definition
$renderingPath = "/sitecore/layout/Renderings/my cool rendering"
$renderingItem = Get-Item -Database "master" -Path $renderingPath | New-Rendering -Placeholder "main"
# Add the rendering to the page and point its datasource at the new item
Add-Rendering -Item $sitecorePage -PlaceHolder "main" -Instance $renderingItem -Datasource $datasourceItem.Paths.FullPath -FinalLayout
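Each new item also needs a name, and legacy titles often contain characters that Sitecore rejects. The sketch below derives a safe name from a title under a deliberately conservative assumption: only letters, digits, spaces, and hyphens are kept, and names are capped at 100 characters. The function name ConvertTo-ItemName and both rules are assumptions, not Sitecore's exact validation; inside Sitecore, the built-in ItemUtil.ProposeValidItemName may be a better fit.

```powershell
# Hypothetical helper: turns a legacy page title into a conservative item name.
function ConvertTo-ItemName {
    param(
        [string]$Title,
        [int]$MaxLength = 100   # assumed limit; check your instance's settings
    )
    # Replace anything other than letters, digits, spaces, and hyphens with a space.
    $name = $Title -replace '[^A-Za-z0-9 \-]', ' '
    # Collapse repeated whitespace and trim the ends.
    $name = ($name -replace '\s+', ' ').Trim()
    if ($name.Length -gt $MaxLength) {
        $name = $name.Substring(0, $MaxLength).Trim()
    }
    # Fall back to a fixed name if nothing usable is left.
    if (-not $name) { $name = 'Untitled' }
    $name
}

ConvertTo-ItemName -Title 'Q&A: What is "new" in 2024?'
```

Keeping the original title in a field and using the cleaned value only as the item name means nothing is lost in the conversion.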
Final Thoughts
The above is a very general overview of the approach and tools I use to migrate content into Sitecore. One thing I want to end with is that not all content should be migrated programmatically. If you find yourself coding edge cases until your fingers bleed, then maybe this is a job for a human!