PR
This commit is contained in:
commit
0f562fbdea
2
.gitignore
vendored
Normal file
@ -0,0 +1,2 @@
|
||||
.DS_Store
|
||||
|
880
n1-sl.ipynb
Normal file
File diff suppressed because one or more lines are too long
831
n2-sl.ipynb
Normal file
File diff suppressed because one or more lines are too long
696
n3-sl.ipynb
Normal file
@ -0,0 +1,696 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"## Value prediction\n",
"\n",
"Data mining, assignment, `April 27, 2025`\n",
"**`Gašper Dobrovoljc`**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"We will look at the practical use of simple methods for supervised modelling, i.e.\n",
"prediction. What all of these methods have in common is that they use\n",
"random variables (attributes) to model the values of a special variable,\n",
"called the *class* (in the context of classification)\n",
"or the *response* (in the context of regression). The basic differences between the two contexts\n",
"were covered in the lectures and lab sessions.\n",
"\n",
"We pursue two practical goals:\n",
"* modelling the ratings of an individual user (the response) using all other users,\n",
"* comparing supervised modelling methods."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"### Data\n",
"\n",
"The description of the MovieLens dataset is the same as in the first assignment."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"### Data preprocessing\n",
"\n",
"For this assignment, the data are prepared as follows:\n",
"1. Select $m$ movies with at least 100 views.\n",
"2. Select $n$ users who have watched at least 100 movies.\n",
"3. Build a matrix $X$ of size $m \\times n$, where rows represent movies and columns represent users. Replace unknown values with $0$.\n",
"\n",
"For each of the selected $n$ users, a regression model is built\n",
"whose goal is to predict that user's movie ratings."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<table>\n",
|
||||
" <tr style=\"background-color: white;\">\n",
|
||||
" <td style=\"border-right: 1px solid #000;\"></td>\n",
|
||||
" <td></td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">$y^{(0)}$</td>\n",
|
||||
" <td colspan=3 style=\"text-align:center;\">$X^{(0)}$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr style=\"border-bottom: 1px solid #000;\">\n",
|
||||
" <td style=\"border-right: 1px solid #000;\"></td>\n",
" <td>Movie/user</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">$u_0$</td>\n",
|
||||
" <td>$u_1$</td>\n",
|
||||
" <td>$u_2$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_1}$</td>\n",
|
||||
" <td>Twelve Monkeys (a.k.a. 12 Monkeys) (1995)</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2.5</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_2}$</td>\n",
|
||||
" <td>Dances with Wolves (1990) </td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">4</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_3}$</td>\n",
|
||||
" <td>Apollo 13 (1995)</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_4}$</td>\n",
|
||||
" <td>Sixth Sense, The (1999)</td><td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">3</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
"</table>\n",
|
||||
"\n",
|
||||
"<table>\n",
|
||||
" <tr style=\"background-color: white;\">\n",
|
||||
" <td style=\"border-right: 1px solid #000;\"></td>\n",
|
||||
" <td></td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">$y^{(1)}$</td>\n",
|
||||
" <td colspan=3 style=\"text-align:center;\">$X^{(1)}$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr style=\"border-bottom: 1px solid #000;\">\n",
|
||||
" <td style=\"border-right: 1px solid #000;\"></td>\n",
" <td>Movie/user</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">$u_1$</td>\n",
|
||||
" <td>$u_0$</td>\n",
|
||||
" <td>$u_2$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_1}$</td>\n",
|
||||
" <td>Twelve Monkeys (a.k.a. 12 Monkeys) (1995)</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2.5</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_2}$</td>\n",
|
||||
" <td>Dances with Wolves (1990) </td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">0</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_3}$</td>\n",
|
||||
" <td>Apollo 13 (1995)</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">2</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">${f_4}$</td>\n",
|
||||
" <td>Sixth Sense, The (1999)</td><td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <td style=\"border-right: 1px solid #000;\">$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" <td style=\"border-right: 1px solid #000; border-left: 1px solid #000;\">$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" <td>$\\cdots$</td>\n",
|
||||
" </tr>\n",
|
||||
"</table>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
"source": "Data split for the model of user $u_0$ (upper table) and of user $u_1$ (lower table).\n"
|
||||
},
|
||||
{
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-04-27T16:16:12.001788Z",
|
||||
"start_time": "2025-04-27T16:16:11.631245Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"ratings = pd.read_csv('./podatki/ml-latest-small/ratings.csv')\n",
|
||||
"movies = pd.read_csv('./podatki/ml-latest-small/movies.csv')"
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": 1
|
||||
},
|
||||
{
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-04-27T16:16:12.098867Z",
|
||||
"start_time": "2025-04-27T16:16:12.082840Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"movie_rating_count = ratings.groupby('movieId')['rating'].count().reset_index()\n",
|
||||
"movie_rating_count = movie_rating_count[movie_rating_count['rating'] >= 100]\n",
|
||||
"\n",
|
||||
"user_rating_count = ratings.groupby('userId')['rating'].count().reset_index()\n",
|
||||
"user_rating_count = user_rating_count[user_rating_count['rating'] >= 100]\n",
|
||||
"\n",
|
||||
"filtered_ratings = ratings[ratings['movieId'].isin(movie_rating_count['movieId'])]\n",
|
||||
"filtered_ratings = filtered_ratings[filtered_ratings['userId'].isin(user_rating_count['userId'])]\n",
|
||||
"\n",
|
||||
"matrix = filtered_ratings.pivot_table(index='movieId', columns='userId', values='rating', fill_value=0)"
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": 2
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"### Questions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"#### 1. Regression (100%)\n",
"Build a regression model for each user. Use one or more methods for learning regression models (linear regression, Ridge, Lasso, etc.).\n",
"\n",
"For each of the $n$ users, select the corresponding column of the data matrix. For user $i$ we therefore have:\n",
"\n",
"* the response vector $y^{(i)}$,\n",
"* the data matrix $X^{(i)}$, which contains all columns *except* column $i$.\n",
"\n",
"The two tables above illustrate this. Repeat the following train/test validation procedure several times (e.g., three times):\n",
"\n",
"* *Randomly* split the set of movies the user has watched into 75% (training set) and 25% (test set).\n",
"* Fit the regression model on the training set (select the corresponding rows of $X$ and $y$).\n",
"* Evaluate the model on the test set (again select the corresponding rows of $X$ and $y$).\n",
"\n",
"Sum the evaluation scores and divide by the number of trials to obtain the final score.\n",
"\n",
"Report on the performance of your model, focusing on the following questions:\n",
"* Justify an appropriate evaluation measure. Does the model predict ratings well?\n",
"* Use the chosen measure to evaluate the models for all $n$ users.\n",
"\n",
"You may split the code for your answers across several cells."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-04-27T16:16:12.948299Z",
|
||||
"start_time": "2025-04-27T16:16:12.110729Z"
|
||||
}
|
||||
},
|
||||
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression, Ridge, Lasso\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"def eval_user(user_id, model_cls, model_kwargs=None, num_trials=3):\n",
"    # Returns the mean test RMSE of one user's model over num_trials random splits.\n",
"    if model_kwargs is None:\n",
"        model_kwargs = {}\n",
"\n",
"    # Response: the user's own column; features: every other user's column.\n",
"    y = matrix[user_id].values\n",
"    x_others = matrix.drop(columns=user_id).values\n",
"\n",
"    # Keep only movies the user actually rated (0 marks a missing rating).\n",
"    rated_indices = np.where(y > 0)[0]\n",
"\n",
"    rmse_list = []\n",
"\n",
"    for _ in range(num_trials):\n",
"        # Fresh random 75/25 split on every trial.\n",
"        train_idx, test_idx = train_test_split(rated_indices, test_size=0.25, random_state=None)\n",
"\n",
"        x_train = x_others[train_idx]\n",
"        y_train = y[train_idx]\n",
"\n",
"        x_test = x_others[test_idx]\n",
"        y_test = y[test_idx]\n",
"\n",
"        model = model_cls(**model_kwargs)\n",
"        model.fit(x_train, y_train)\n",
"\n",
"        y_pred = model.predict(x_test)\n",
"        # mean_squared_error returns the MSE; take the square root to get the RMSE.\n",
"        rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"        rmse_list.append(rmse)\n",
"\n",
"    return np.mean(rmse_list)"
|
||||
],
|
||||
"outputs": [],
|
||||
"execution_count": 3
|
||||
},
|
||||
{
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-04-27T16:16:14.998843Z",
|
||||
"start_time": "2025-04-27T16:16:12.962010Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"source": [
"from tqdm import tqdm\n",
"\n",
"user_rmse_results = {}\n",
"\n",
"# Model name -> (estimator class, constructor keyword arguments)\n",
"models = {\n",
"    \"Linear\": (LinearRegression, {}),\n",
"    \"Ridge\": (Ridge, {\"alpha\": 1.0}),\n",
"    \"Lasso\": (Lasso, {\"alpha\": 0.1, \"max_iter\": 10000}),\n",
"}\n",
"\n",
"for model_name, (model_cls, model_kwargs) in models.items():\n",
"    print(f\"Evaluating model: {model_name}\")\n",
"    model_rmse = {}\n",
"\n",
"    # One model per user: the user's column is the response, all others are features.\n",
"    for user_id in tqdm(matrix.columns):\n",
"        avg_rmse = eval_user(user_id, model_cls, model_kwargs)\n",
"        if avg_rmse is not None:\n",
"            model_rmse[user_id] = avg_rmse\n",
"\n",
"    user_rmse_results[model_name] = model_rmse"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Evaluating model: Linear\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████| 263/263 [00:00<00:00, 369.67it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Evaluating model: Ridge\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████| 263/263 [00:00<00:00, 526.27it/s]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Evaluating model: Lasso\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"100%|██████████| 263/263 [00:00<00:00, 328.20it/s]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"execution_count": 4
|
||||
},
|
||||
{
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-04-27T16:16:15.018908Z",
|
||||
"start_time": "2025-04-27T16:16:15.015890Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"source": [
"for model_name, result in user_rmse_results.items():\n",
"    rmse_values = list(result.values())\n",
"    avg_rmse = np.mean(rmse_values)\n",
"    print(f\"{model_name} – Average RMSE over all users: {avg_rmse:.4f}\")"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear – Average RMSE over all users: 0.7776\n",
"Ridge – Average RMSE over all users: 0.7747\n",
"Lasso – Average RMSE over all users: 0.8388\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"execution_count": 5
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"cell_type": "markdown",
"source": "An appropriate evaluation measure is the root mean squared error (RMSE), because it penalizes larger deviations more heavily."
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"#### Bonus question (15%)\n",
"Create a new user that represents your own movie\n",
"ratings. Rate a few movies according to your own taste and check how the models rate the movies you did not select.\n",
"Do the predictions seem appropriate to you?\n",
"\n",
"You may split the code for your answers across several cells."
|
||||
]
|
||||
},
|
||||
{
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2025-04-27T16:16:15.055360Z",
|
||||
"start_time": "2025-04-27T16:16:15.041567Z"
|
||||
}
|
||||
},
|
||||
"cell_type": "code",
|
||||
"source": [
"# Own ratings, keyed by movieId. Named my_ratings so the ratings DataFrame is not overwritten.\n",
"my_ratings = {\n",
"    75: 4.0,\n",
"    1: 4.8,\n",
"    316: 4.5,\n",
"    364: 4.9,\n",
"    541: 4.7,\n",
"    124: 3.7,\n",
"    3114: 4.7,\n",
"    4306: 5.0,\n",
"    5349: 4.7\n",
"}\n",
"\n",
"# New user column over the filtered movies; 0 marks an unrated movie.\n",
"ratings_series = pd.Series(0, index=matrix.index, dtype=float)\n",
"for movie_id, rating in my_ratings.items():\n",
"    if movie_id in ratings_series.index:\n",
"        ratings_series.loc[movie_id] = rating\n",
"\n",
"matrix2 = matrix.copy()\n",
"matrix2[\"me\"] = ratings_series\n",
"\n",
"y_me = matrix2[\"me\"].values\n",
"x_other = matrix2.drop(columns=\"me\").values\n",
"\n",
"unrated_idx = np.where(y_me == 0)[0]\n",
"\n",
"# Train on the movies I rated, then predict the ones I did not rate.\n",
"rated_idx = np.where(y_me > 0)[0]\n",
"x_train = x_other[rated_idx]\n",
"y_train = y_me[rated_idx]\n",
"\n",
"x_test = x_other[unrated_idx]\n",
"\n",
"model = Ridge(alpha=1.0)\n",
"model.fit(x_train, y_train)\n",
"\n",
"y_pred = model.predict(x_test)\n",
"\n",
"# Show the ten unrated movies with the highest predicted rating.\n",
"(pd.DataFrame({\n",
"    \"movieId\": matrix.index[unrated_idx],\n",
"    \"predictedRating\": y_pred,\n",
"})\n",
" .merge(movies, on=\"movieId\")\n",
" .sort_values(by=\"predictedRating\", ascending=False)\n",
" .head(10))"
|
||||
],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
" movieId predictedRating \\\n",
|
||||
"106 2571 4.935168 \n",
|
||||
"30 356 4.903377 \n",
|
||||
"129 4993 4.884689 \n",
|
||||
"24 296 4.876314 \n",
|
||||
"117 2959 4.859066 \n",
|
||||
"133 5952 4.856036 \n",
|
||||
"25 318 4.847719 \n",
|
||||
"111 2762 4.834292 \n",
|
||||
"40 527 4.832231 \n",
|
||||
"114 2858 4.825916 \n",
|
||||
"\n",
|
||||
" title \\\n",
|
||||
"106 Matrix, The (1999) \n",
|
||||
"30 Forrest Gump (1994) \n",
|
||||
"129 Lord of the Rings: The Fellowship of the Ring,... \n",
|
||||
"24 Pulp Fiction (1994) \n",
|
||||
"117 Fight Club (1999) \n",
|
||||
"133 Lord of the Rings: The Two Towers, The (2002) \n",
|
||||
"25 Shawshank Redemption, The (1994) \n",
|
||||
"111 Sixth Sense, The (1999) \n",
|
||||
"40 Schindler's List (1993) \n",
|
||||
"114 American Beauty (1999) \n",
|
||||
"\n",
|
||||
" genres \n",
|
||||
"106 Action|Sci-Fi|Thriller \n",
|
||||
"30 Comedy|Drama|Romance|War \n",
|
||||
"129 Adventure|Fantasy \n",
|
||||
"24 Comedy|Crime|Drama|Thriller \n",
|
||||
"117 Action|Crime|Drama|Thriller \n",
|
||||
"133 Adventure|Fantasy \n",
|
||||
"25 Crime|Drama \n",
|
||||
"111 Drama|Horror|Mystery \n",
|
||||
"40 Drama|War \n",
|
||||
"114 Drama|Romance "
|
||||
],
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>movieId</th>\n",
|
||||
" <th>predictedRating</th>\n",
|
||||
" <th>title</th>\n",
|
||||
" <th>genres</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>106</th>\n",
|
||||
" <td>2571</td>\n",
|
||||
" <td>4.935168</td>\n",
|
||||
" <td>Matrix, The (1999)</td>\n",
|
||||
" <td>Action|Sci-Fi|Thriller</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>30</th>\n",
|
||||
" <td>356</td>\n",
|
||||
" <td>4.903377</td>\n",
|
||||
" <td>Forrest Gump (1994)</td>\n",
|
||||
" <td>Comedy|Drama|Romance|War</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>129</th>\n",
|
||||
" <td>4993</td>\n",
|
||||
" <td>4.884689</td>\n",
|
||||
" <td>Lord of the Rings: The Fellowship of the Ring,...</td>\n",
|
||||
" <td>Adventure|Fantasy</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>24</th>\n",
|
||||
" <td>296</td>\n",
|
||||
" <td>4.876314</td>\n",
|
||||
" <td>Pulp Fiction (1994)</td>\n",
|
||||
" <td>Comedy|Crime|Drama|Thriller</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>117</th>\n",
|
||||
" <td>2959</td>\n",
|
||||
" <td>4.859066</td>\n",
|
||||
" <td>Fight Club (1999)</td>\n",
|
||||
" <td>Action|Crime|Drama|Thriller</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>133</th>\n",
|
||||
" <td>5952</td>\n",
|
||||
" <td>4.856036</td>\n",
|
||||
" <td>Lord of the Rings: The Two Towers, The (2002)</td>\n",
|
||||
" <td>Adventure|Fantasy</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>25</th>\n",
|
||||
" <td>318</td>\n",
|
||||
" <td>4.847719</td>\n",
|
||||
" <td>Shawshank Redemption, The (1994)</td>\n",
|
||||
" <td>Crime|Drama</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>111</th>\n",
|
||||
" <td>2762</td>\n",
|
||||
" <td>4.834292</td>\n",
|
||||
" <td>Sixth Sense, The (1999)</td>\n",
|
||||
" <td>Drama|Horror|Mystery</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>40</th>\n",
|
||||
" <td>527</td>\n",
|
||||
" <td>4.832231</td>\n",
|
||||
" <td>Schindler's List (1993)</td>\n",
|
||||
" <td>Drama|War</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>114</th>\n",
|
||||
" <td>2858</td>\n",
|
||||
" <td>4.825916</td>\n",
|
||||
" <td>American Beauty (1999)</td>\n",
|
||||
" <td>Drama|Romance</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"execution_count": 6
|
||||
},
|
||||
{
|
||||
"metadata": {},
|
||||
"cell_type": "markdown",
"source": "The predictions seem sensible to me, since I would rate the suggested movies similarly to the predicted ratings."
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"lang": "sl"
|
||||
},
|
||||
"source": [
"### Notes\n",
"\n",
"The `sklearn` and `Orange` libraries provide implementations, descriptions and evaluation of supervised learning methods."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
},
|
||||
"nbTranslate": {
|
||||
"displayLangs": [
|
||||
"sl"
|
||||
],
|
||||
"hotkey": "alt-t",
|
||||
"langInMainMenu": true,
|
||||
"sourceLang": "sl",
|
||||
"targetLang": "en",
|
||||
"useGoogleTranslate": true
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
147
podatki/ml-latest-small/README.txt
Normal file
@ -0,0 +1,147 @@
|
||||
Summary
|
||||
=======
|
||||
|
||||
This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100004 ratings and 1296 tag applications across 9125 movies. These data were created by 671 users between January 09, 1995 and October 16, 2016. This dataset was generated on October 17, 2016.
|
||||
|
||||
Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
|
||||
|
||||
The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follow.
|
||||
|
||||
This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.
|
||||
|
||||
This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.
|
||||
|
||||
|
||||
Usage License
|
||||
=============
|
||||
|
||||
Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:
|
||||
|
||||
* The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
|
||||
* The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
|
||||
* The user may redistribute the data set, including transformations, so long as it is distributed under these same license conditions.
|
||||
* The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
|
||||
* The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.
|
||||
|
||||
In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).
|
||||
|
||||
If you have any further questions or comments, please email <grouplens-info@umn.edu>
|
||||
|
||||
|
||||
Citation
|
||||
========
|
||||
|
||||
To acknowledge use of the dataset in publications, please cite the following paper:
|
||||
|
||||
> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=<http://dx.doi.org/10.1145/2827872>
|
||||
|
||||
|
||||
Further Information About GroupLens
|
||||
===================================
|
||||
|
||||
GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:
|
||||
|
||||
* recommender systems
|
||||
* online communities
|
||||
* mobile and ubiquitous technologies
|
||||
* digital libraries
|
||||
* local geographic information systems
|
||||
|
||||
GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <grouplens-info@cs.umn.edu> - we are always interested in working with external collaborators.
|
||||
|
||||
|
||||
Content and Use of Files
|
||||
========================
|
||||
|
||||
Formatting and Encoding
|
||||
-----------------------
|
||||
|
||||
The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.
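
As a minimal sketch of the quoting rule described above, the snippet below parses a small in-memory sample with Python's standard `csv` module. The two sample rows are made up for illustration in the `movies.csv` layout and are not copied from the dataset; the helper name `read_rows` is likewise hypothetical.

```python
import csv
import io

# Illustrative sample in the movies.csv layout (assumed rows, not real file contents).
# The second title contains a comma, so the field is wrapped in double quotes.
sample = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
)

def read_rows(text_stream):
    """Parse CSV text with a single header row into a list of dicts."""
    return list(csv.DictReader(text_stream))

rows = read_rows(io.StringIO(sample))
titles = [row["title"] for row in rows]
print(titles)
```

When reading a real file instead of the in-memory sample, opening it with `open(path, newline="", encoding="utf-8")` should keep accented titles intact.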
|
||||
|
||||
User Ids
|
||||
--------
|
||||
|
||||
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).
|
||||
|
||||
Movie Ids
|
||||
---------
|
||||
|
||||
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).
|
||||
|
||||
|
||||
Ratings Data File Structure (ratings.csv)
|
||||
-----------------------------------------
|
||||
|
||||
All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
|
||||
|
||||
    userId,movieId,rating,timestamp
|
||||
|
||||
The lines within this file are ordered first by userId, then, within user, by movieId.
|
||||
|
||||
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
|
||||
|
||||
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
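
The rating format above can be checked with a short sketch like the following. It assumes one data line in the `userId,movieId,rating,timestamp` layout; the function name `parse_rating_line` and the sample line are illustrative and not part of the dataset tooling.

```python
from datetime import datetime, timezone

def parse_rating_line(line):
    """Split one ratings.csv data line and validate it against the documented format."""
    user_id, movie_id, rating, timestamp = line.strip().split(",")
    rating = float(rating)
    # Ratings run from 0.5 to 5.0 stars in half-star increments.
    if not (0.5 <= rating <= 5.0 and (rating * 2).is_integer()):
        raise ValueError(f"rating {rating} is not on the half-star scale")
    # Timestamps are seconds since 1970-01-01 00:00 UTC.
    rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return int(user_id), int(movie_id), rating, rated_at

# Illustrative line in the documented layout.
record = parse_rating_line("1,31,2.5,1260759144")
print(record)
```

Passing `tz=timezone.utc` keeps the result independent of the local time zone of the machine running the script.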
|
||||
|
||||
Tags Data File Structure (tags.csv)
|
||||
-----------------------------------
|
||||
|
||||
All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
|
||||
|
||||
    userId,movieId,tag,timestamp
|
||||
|
||||
The lines within this file are ordered first by userId, then, within user, by movieId.
|
||||
|
||||
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
|
||||
|
||||
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
|
||||
|
||||
Movies Data File Structure (movies.csv)
|
||||
---------------------------------------
|
||||
|
||||
Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
|
||||
|
||||
    movieId,title,genres
|
||||
|
||||
Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
|
||||
|
||||
Genres are a pipe-separated list, and are selected from the following:
|
||||
|
||||
* Action
|
||||
* Adventure
|
||||
* Animation
|
||||
* Children's
|
||||
* Comedy
|
||||
* Crime
|
||||
* Documentary
|
||||
* Drama
|
||||
* Fantasy
|
||||
* Film-Noir
|
||||
* Horror
|
||||
* Musical
|
||||
* Mystery
|
||||
* Romance
|
||||
* Sci-Fi
|
||||
* Thriller
|
||||
* War
|
||||
* Western
|
||||
* (no genres listed)
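
A small sketch of how the `title` and `genres` fields described above might be unpacked. The helper names `split_genres` and `release_year` are hypothetical, and the regular expression assumes the release year is the last parenthesised four-digit number in the title, which may not hold for every entry given the noted inconsistencies.

```python
import re

NO_GENRES = "(no genres listed)"

def split_genres(genres_field):
    """Turn the pipe-separated genres field into a list; empty if none are listed."""
    if genres_field == NO_GENRES:
        return []
    return genres_field.split("|")

def release_year(title):
    """Return the trailing '(YYYY)' year from a title, or None if it is absent."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    return int(match.group(1)) if match else None

print(split_genres("Drama|Horror|Mystery"))
print(release_year("Sixth Sense, The (1999)"))
```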
|
||||
|
||||
Links Data File Structure (links.csv)
|
||||
-------------------------------------
|
||||
|
||||
Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:
|
||||
|
||||
    movieId,imdbId,tmdbId
|
||||
|
||||
movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.
|
||||
|
||||
imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.
|
||||
|
||||
tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.
|
||||
|
||||
Use of the resources listed above is subject to the terms of each provider.
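
The identifiers above map to URLs following the Toy Story examples. The sketch below builds them; the function names are hypothetical, and zero-padding `imdbId` to seven digits after the `tt` prefix is an assumption inferred from the `tt0114709` example rather than a documented rule.

```python
def movielens_url(movie_id):
    """URL of a movie on movielens.org."""
    return f"https://movielens.org/movies/{int(movie_id)}"

def imdb_url(imdb_id):
    """URL of a movie on imdb.com; assumes a 'tt' prefix and seven zero-padded digits."""
    return f"http://www.imdb.com/title/tt{int(imdb_id):07d}/"

def tmdb_url(tmdb_id):
    """URL of a movie on themoviedb.org."""
    return f"https://www.themoviedb.org/movie/{int(tmdb_id)}"

# Toy Story, using the identifiers quoted in the text above.
print(movielens_url(1))
print(imdb_url(114709))
print(tmdb_url(862))
```

Converting through `int` means the helpers accept ids whether they were read as strings (possibly with leading zeros) or as numbers.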
|
||||
|
||||
Cross-Validation
|
||||
----------------
|
||||
|
||||
Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see [LensKit](http://lenskit.org) for tools, documentation, and open-source code examples.
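
For readers who want to see what a cross-fold split amounts to, here is a minimal sketch using only the Python standard library. It is not the LensKit procedure and not the scripts shipped with earlier dataset versions; the function name `k_fold_indices` and its parameters are illustrative assumptions.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists for k folds over n_items shuffled items."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_items=10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each item lands in exactly one test fold, so every rating is used for evaluation once and for training in the remaining folds.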
|
9126
podatki/ml-latest-small/cast.csv
Normal file
File diff suppressed because it is too large
9126
podatki/ml-latest-small/links.csv
Normal file
File diff suppressed because it is too large
9126
podatki/ml-latest-small/movies.csv
Normal file
File diff suppressed because it is too large
100005
podatki/ml-latest-small/ratings.csv
Normal file
File diff suppressed because it is too large
1297
podatki/ml-latest-small/tags.csv
Normal file
File diff suppressed because it is too large